Above is the flow chart of the “Structured Troubleshooting” method, which takes into account who is authorized to implement network changes and fixes, and focuses on documenting those steps so the fix is implemented correctly and documentation exists to reference later when needed.
There are several different methods that can be used to troubleshoot and resolve an issue, which will be reviewed here, but the foundation to know is the “Simplified” method, which is then expanded into the “Structured” method.
“Simplified” and “Structured” troubleshooting methods reviewed in detail
The “Simplified Method” consists of 3 steps that expand into the “Structured Method”:
- Problem Report – Network problem is reported by end users in vague terms such as “The internet is not working” or “FTP is not working for this group of users”
- Problem Diagnosis – Gather further detailed information from end users, previous device configurations, and baseline network info to determine potential causes for the problem reported
- Problem Resolution – Confirm the problem is resolved for impacted end users, take network device config backups, and document all steps / changes made on the network to resolve the issue for later use
The “Structured Method” takes step 2 of the “Simplified Method” and expands it into 5 different steps of Problem Diagnosis, giving the following 7 steps for problem resolution (sketched in code after the list):
- Problem Report – Network problem is reported by end users in vague terms such as “The internet is not working” or “FTP is not working for this group of users”
- Collect Information – This includes interviewing users impacted by the problem, collecting network device configs (past and present), and using any network documentation available such as procedures, old tickets, etc.
- Examine Information – The two main goals when examining collected information are to “identify information that points to possible causes” and “identify information that can be used to eliminate possible causes” of the problem, most commonly by asking, “What is currently happening on the network, and what should be happening?”
- Eliminate Potential Causes – Before concluding what fixes should be deployed, we need to identify what is NOT the cause of the problem (based on the collected / examined information), so we do not waste time solving a problem that doesn’t exist or cause further problems with unnecessary fixes
- Propose a Hypothesis – After eliminating potential causes, the network admin can form a hypothesis about the most probable causes of the network issue. They should confirm whether they are authorized to implement the fix; if they are not authorized / cannot reach authorized personnel, they must then determine, based on how urgent the issue is to the company, whether a workaround can be implemented to alleviate the impact.
- Verify Hypothesis – During this stage it is critical to consider when the fix can / should be implemented (ASAP for a network outage, after hours for network slowness, etc.), and to define the steps of the solution by documenting them so nothing gets skipped while deploying it. If the most probable hypothesized fix does not work, move on to the next most probable fix for the issue and document / implement it. This is why a maintenance window is critical if multiple possible fixes for the issue are identified!
- Problem Resolution – Confirm the problem is resolved for impacted end users, take network device config backups, and document all steps / changes made on the network to resolve the issue for later use
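For readers who like to see process as code, below is a minimal, purely illustrative Python sketch of the hypothesis loop described in the list above. Everything in it is a hypothetical stand-in (the problem report, the candidate fixes, and the `authorized` / `try_fix` callables); the point is just the control flow from eliminating causes through verifying a hypothesis:

```python
# Minimal, purely illustrative sketch of the Structured Method's
# hypothesis loop. All names here are hypothetical stand-ins, not a
# real framework; the point is the control flow.

def structured_troubleshoot(problem_report, hypotheses, authorized, try_fix):
    """hypotheses: (cause, fix) pairs ordered most-probable first,
    already narrowed down by the Collect / Examine / Eliminate steps.
    authorized(fix) and try_fix(fix) are caller-supplied callables."""
    log = [f"Problem reported: {problem_report}"]   # document as you go
    for cause, fix in hypotheses:
        if not authorized(fix):
            log.append(f"Not authorized for '{fix}' -- consider a workaround")
            continue                                # never make unauthorized changes
        log.append(f"Plan: apply '{fix}' for suspected cause '{cause}'")
        if try_fix(fix):                            # Verify Hypothesis: did it work?
            log.append("Resolved -- back up configs and document all changes")
            return True, log                        # Problem Resolution
    return False, log                               # escalate / collect more info

# Hypothetical usage: the second candidate fix succeeds
ok, log = structured_troubleshoot(
    "FTP is not working for this group of users",
    [("recent ACL change", "roll back ACL"),
     ("bad switchport", "move hosts to spare ports")],
    authorized=lambda fix: True,
    try_fix=lambda fix: fix == "move hosts to spare ports")
print(ok, *log, sep="\n")
```

Note that the plan is logged before any change is attempted, mirroring the “document the steps before deploying” guidance in the Verify Hypothesis step.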
Some important things to note in these steps:
- Who is authorized to make changes is an important consideration, and if you are not authorized to make the change, determine if a workaround is warranted and can be implemented until the authorized personnel can make the changes (do not make changes you are not authorized to make!)
- Document, document, document. Use existing documentation during the “Examining Info” / “Eliminating Causes” steps, document the steps of the fix before implementation to ensure you don’t miss any, and document all changes made once the issue is verified as resolved.
- “Baseline data” can be things like a packet capture of network traffic taken when the network was working or backup data taken before there was a problem; collecting it is part of proactive maintenance, to be discussed in a later post (a small collection sketch follows this list)
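As a small, hypothetical example of proactively collecting baseline data, the Python sketch below pings a few key hosts while the network is healthy and appends the timestamped output to a file for later comparison. The target addresses are made up, and the `-c` flag assumes a Linux-style ping (use `-n` on Windows):

```python
# Sketch: capture simple latency baselines while the network is healthy.
# Target addresses are hypothetical examples (gateway, server, internet).
import datetime
import subprocess

BASELINE_TARGETS = ["192.168.1.1", "192.168.1.10", "8.8.8.8"]

def collect_baseline(path="baseline.txt"):
    with open(path, "a") as log:
        log.write(f"--- baseline {datetime.datetime.now().isoformat()} ---\n")
        for host in BASELINE_TARGETS:
            # '-c 4' sends 4 probes on Linux ping; use '-n 4' on Windows
            result = subprocess.run(["ping", "-c", "4", host],
                                    capture_output=True, text=True)
            log.write(f"{host}:\n{result.stdout}\n")

if __name__ == "__main__":
    collect_baseline()
```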
As discussed, this model is largely popular because it takes considerations such as MAC (Move / Add / Change) authorization into account, documentation is referenced and created along the way, and, if consistently followed by all members of a network team, it can dramatically improve the team’s ability to quickly resolve issues with good documentation and defined steps to take until problem resolution.
So to conclude, the “Structured” method essentially expands the “Simplified” method’s Diagnosis step into more detailed analysis of the problem’s cause and possible fixes, considering both authorization and documentation along the way.
Example of “Shoot from the Hip” troubleshooting method:
This method is often deployed by network admins familiar enough with the network that they feel comfortable proposing a hypothesis without first analyzing data to eliminate possible causes, which may lead to causing more issues or wasting time implementing fixes that are not required (and could have been ruled out using a structured troubleshooting approach).
It also skips reviewing currently available documentation, which risks leaving information un-updated for future reference by other network admins.
The flow of this method is performed as illustrated in the flow chart below:
It can work to resolve an issue, but is far from ideal, and risks making updates to network devices / implementing fixes that are not documented for future use.
Other popular troubleshooting methods described by the OSI Model / Flow chart
Some other popular troubleshooting methods rely on more practical strategies rather than a robust step-by-step approach, and may be more appropriate for the TSHOOT exam or for quickly identifying where an issue is located on a network.
The first few methods mentioned can be defined by the OSI model, as shown below.
Bottom-Up Troubleshooting Method:
This method starts at Layer 1 (the Physical Layer), first ensuring that cables are good and that frames are being received by the switch from the host (check the MAC address or ARP table on the switch), then moving up the OSI model from Layer 1 to Layer 7:
This is not very time-efficient, and should really only be deployed when a single host is experiencing a problem, as a single-user issue is more likely to be a bad cable or a port-level issue.
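To make this concrete, here is a minimal sketch of a Bottom-Up check run from the host side on Linux, moving up the layers and stopping at the first failure. The interface name, gateway, server address, and port are all hypothetical assumptions for illustration:

```python
# Sketch: Bottom-Up checks from a Linux host, stopping at the first
# failed layer. Interface, addresses, and port are hypothetical.
import socket
import subprocess

def layer1_2_link_up(ifname="eth0"):
    # Linux exposes link state in sysfs; 'up' means the NIC sees a carrier.
    # A missing interface counts as a Layer 1 failure here.
    try:
        with open(f"/sys/class/net/{ifname}/operstate") as f:
            return f.read().strip() == "up"
    except FileNotFoundError:
        return False

def layer3_ping(host="192.168.1.1"):
    # One ICMP probe to the default gateway ('-c 1' = one probe on Linux)
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True).returncode == 0

def layer4_tcp(host="192.168.1.50", port=80):
    # Can we complete a TCP handshake to the service port?
    try:
        socket.create_connection((host, port), timeout=3).close()
        return True
    except OSError:
        return False

for name, check in [("L1/L2 link", layer1_2_link_up),
                    ("L3 gateway ping", layer3_ping),
                    ("L4 TCP connect", layer4_tcp)]:
    if not check():
        print(f"FAIL at {name} -- focus troubleshooting here")
        break
    print(f"OK: {name}")
```

The first failed check tells you which layer to dig into, which is exactly the appeal of this method for single-host issues.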
Top-Down Troubleshooting Method:
This method troubleshoots the OSI model in reverse order, which may be appropriate based on the information gathered, for example if the problem is largely reported as an issue with a specific application rather than with a network service or with communication to a certain IP address or IP network.
It works in the opposite direction, starting at Layer 7 of the OSI model and moving down to Layer 1:
This would be appropriate for either a single-host issue or a subset of users, as an application issue could range from a misconfigured application on a user’s desktop to a licensing issue for a department that uses specific software to perform its duties.
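A matching Top-Down sketch starts at the application and only drops to lower layers when a higher check fails; the URL and hostname below are hypothetical examples:

```python
# Sketch: Top-Down checks, starting at Layer 7 and dropping down only
# when a higher-layer check fails. URL and hostname are hypothetical.
import socket
import subprocess
import urllib.request

def check_application(url="http://intranet.example.com/"):
    try:
        urllib.request.urlopen(url, timeout=5).close()  # Layer 7: HTTP works?
        return True
    except OSError:
        return False

def check_dns(name="intranet.example.com"):
    try:
        socket.gethostbyname(name)                      # name resolution works?
        return True
    except OSError:
        return False

def check_reachability(host):
    # Layer 3: one ICMP probe ('-c 1' on Linux ping)
    return subprocess.run(["ping", "-c", "1", host],
                          capture_output=True).returncode == 0

if check_application():
    print("App OK from here -- suspect the client (config, licensing, etc.)")
elif not check_dns():
    print("DNS lookup failing -- name resolution is the suspect")
elif not check_reachability(socket.gethostbyname("intranet.example.com")):
    print("Host unreachable -- drop to lower-layer troubleshooting")
else:
    print("Network path OK -- the service itself is the suspect")
```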
Divide and Conquer Troubleshooting Method:
In situations where the information gathered is not sufficient to choose a Top-Down or Bottom-Up approach, the Divide and Conquer method is deployed by pinging the destination IP address from the source host. If the ping is successful, it can be assumed that Layers 1-3 are working across the data path, and a Bottom-Up or Top-Down approach for Layers 4-7 can then be deployed, whichever fits best:
This is easily the most widely used troubleshooting strategy for narrowing down an issue; however, if the ping from host to destination does not get a response, the next troubleshooting method should be deployed to pinpoint where the issue resides.
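The Divide and Conquer decision itself is small enough to show in a few lines; the destination address below is a hypothetical example:

```python
# Sketch: the Divide and Conquer split -- one ping decides which half
# of the OSI model to troubleshoot. Destination is hypothetical.
import subprocess

def ping(host, count=4):
    # Linux ping: '-c' sets the probe count; returncode 0 means replies came back
    return subprocess.run(["ping", "-c", str(count), host],
                          capture_output=True).returncode == 0

destination = "10.10.20.5"
if ping(destination):
    print("Layers 1-3 look good end to end -- troubleshoot Layers 4-7 (Top-Down)")
else:
    print("No Layer 3 reachability -- Follow the Traffic Path to find the break")
```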
Follow the Traffic Path Troubleshooting Method:
If the Divide and Conquer method reveals that the Host machine does not have Layer 3 connectivity to its destination Host IP Address, this method will narrow down which network device / segment is having the issue:
In the above example the failure does not necessarily indicate an issue at Layer 3 on SW2 itself; however, L3 communication is failing from this device, so the network admin can now focus on this device or segment and dig deeper using the last two troubleshooting methods described below.
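A minimal sketch of walking the path hop by hop is below; the device names and addresses are hypothetical, and since some devices block ICMP, a failed ping is a strong clue rather than absolute proof:

```python
# Sketch: Follow the Traffic Path -- ping each device along the known,
# documented path in order, stopping at the first hop that fails.
# Hop names and addresses are hypothetical examples.
import subprocess

PATH = [("Default gateway R1", "192.168.1.1"),
        ("Core switch SW2",    "192.168.2.1"),
        ("Distribution R2",    "10.0.0.1"),
        ("Destination host",   "10.10.20.5")]

def ping(host):
    return subprocess.run(["ping", "-c", "2", host],
                          capture_output=True).returncode == 0

for name, ip in PATH:
    if ping(ip):
        print(f"OK: {name} ({ip})")
    else:
        print(f"FAIL at {name} ({ip}) -- focus on this device/segment")
        break
```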
Comparing Configuration Troubleshooting Method:
This is what I call the “Stare and Compare” method: either take a backup of a device’s running configuration and compare it to the current running configuration to spot changes that weren’t documented, or review the running configurations of the local and directly connected devices to spot differences that could cause an issue (such as missing security config on one side, an incorrect password / key for a security config, the wrong interface IP configured, etc.).
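Because config comparison lends itself to tooling, here is a short sketch using Python’s standard-library difflib to diff a saved backup against the current running config; the file names are hypothetical:

```python
# Sketch: "Stare and Compare" done programmatically with difflib.
# File names are hypothetical examples of saved config exports.
import difflib

with open("backup_running_config.txt") as f:
    backup = f.readlines()
with open("current_running_config.txt") as f:
    current = f.readlines()

# Unified diff: '-' lines existed in the backup, '+' lines are new/changed
for line in difflib.unified_diff(backup, current,
                                 fromfile="backup", tofile="current"):
    print(line, end="")
```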
Component Swapping Troubleshooting Method:
If “Stare and Compare” does not turn up any obvious configuration issues, replacing a cable or the entire device may be appropriate, depending on the scale of the issue.
For example, if a single user on a switch is having an issue, a bad cable may be affecting that single host; however, if an entire group of users local to a single switch is having issues, the switch itself may be going bad and need to be replaced entirely (either after hours or ASAP, depending on the issue severity).
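That scope-based reasoning can be boiled down to a tiny, purely illustrative helper:

```python
# Sketch: map the scope of impact on one switch to the component worth
# swapping first. Thresholds and wording are illustrative only.
def suspect_component(affected_hosts, total_hosts_on_switch):
    if affected_hosts == 1:
        return "Swap the host's patch cable / move it to a spare port first"
    if affected_hosts == total_hosts_on_switch:
        return "Suspect the switch itself -- plan a replacement window"
    return "Partial impact -- check the shared module/uplink before swapping"

print(suspect_component(1, 24))
print(suspect_component(24, 24))
```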
That wraps up the different approaches to troubleshooting a network issue!
Though this article describes the different approaches and when to deploy them, another large portion of network troubleshooting is maintenance, which will be discussed in another post, as this is a robust enough topic for tonight 🙂
Next up will be a review of network maintenance models, also an important topic for the TSHOOT exam, before diving into troubleshooting ROUTE and SWITCH topics!