Network Admins hard at work maintaining and fine tuning the network proactively, to make reactive troubleshooting easier when a network problem is reported!
Overview of Proactive Network Maintenance and what it might involve
Network maintenance is part of any network administrator’s duties in keeping a network healthy. Maintenance items (dated firmware, extremely high device uptime, etc) are often uncovered during troubleshooting, which can add an extra step when troubleshooting an emergency, or even turn out to be the cause of the problem.
For this reason, any items that can be scheduled outside of business hours with little to no user impact should be performed on a scheduled basis (Proactive Maintenance). Some tasks that are considered “Network Maintenance” are listed below:
- Hardware / Software updates / installation / configuration
- Troubleshooting network problems that can be scheduled after hours
- Planning network expansion / network design to optimize performance
- Deploying QoS changes, Security changes, and any new configurations to devices
- Documenting the network (physical, logical, etc) along with any changes made
- Backups of network device configs, servers, workstations, etc
- Ensuring compliance with legal, federal, or corporate policies and regulations
Basically, anything you can do after hours to keep the network as up to date as possible should be done. Having to upgrade firmware in the middle of troubleshooting with Cisco TAC or other vendors should be avoided if possible by handling it through “Proactive Maintenance” rather than discovering it during troubleshooting, or “Reactive Maintenance.”
Proactive Maintenance vs Reactive Maintenance
Pretty self explanatory, but there are some terms to know for exam day (and on the job) that describe how these two types of maintenance differ:
- Structured Tasks – Predefined / Scheduled tasks, part of Proactive Maintenance
- Interrupt-Driven Tasks – Maintenance performed as part of troubleshooting a problem report, often prolongs troubleshooting in real world scenarios
Structured Tasks are essentially the list of items above in the “Network Maintenance” section. These tasks reduce downtime because the network is already updated and documented, so “Interrupt-Driven Tasks” will generally take less time to resolve: things like firmware and QoS have already been addressed during off hours, and documentation has been reviewed and kept up to date with fix procedures / running configurations / baseline data samples / backups of devices / etc.
Interrupt-Driven Tasks can never be completely avoided. However, by forming a predefined / proactive maintenance routine, these tasks will create much less downtime when it comes to troubleshooting possible firmware bugs or finding documentation of which tools are required to implement a fix or diagnose the network.
Well known Network Maintenance models – Important for exam day!
Below is a list of well known Network Maintenance models as defined by some of the most recognized organizations, which you will want to know (and love!) for exam day:
- FCAPS (Fault management, Configuration management, Accounting management, Performance management, and Security Management) – Defined by the International Organization for Standardization (ISO), which you might think should be abbreviated “IOS”, but “ISO” is the correct short name for the organization
- ITIL (IT Infrastructure Library) – Defines best practice recommendations driven to meet an organization’s IT business management goals
- Cisco Lifecycle Services (PPDIOO) – Defines the “life” of a Cisco device in the network in phases including: Prepare / Plan / Design / Implement / Operate / Optimize (Hence the acronym PPDIOO)
The FCAPS model is a robust maintenance model that covers all bases of network operation, from auditing phone billing for certain departments (Accounting) to setting up QoS (Performance), along with updating network device firmware versions regularly to mitigate possible bugs / security threats (Configuration / Security).
The ITIL model is not a one-size-fits-all model: what I define as my company’s needs will probably not be the same as your company’s needs. It is defined specifically by an organization’s business goals, which drive how Network Management is performed, maintained, troubleshot, and documented.
Cisco’s Lifecycle Services is a very straightforward deployment strategy that begins before a network device is even purchased, during the plan and design phases. It then leads to network device implementation, and finally to optimization or fine tuning of a device once in production to meet the company’s needs.
Creating and Maintaining current Network Documentation
Network Documentation should be kept up to date, but older or outdated information should also be archived for later reference, as the original design of the network may one day help a network administrator understand what the original plan was for future network expansion.
Network Documentation should include the following information:
- Logical Topology – A logical Topology consisting of VLANs, logical connections such as Trunks and Wireless AP Coverage, and other logical links showing how users connect to the network through protocols and data paths
- Physical Topology – A physical Topology showing where devices are located in the building structure; this will typically include pictures of the physical racks for reference
- Interconnection Listings – Ideally this would have “all” ports on network devices mapped out, but as that is not realistic unless automated, this should at least include ports that connect Trunks / Uplinks / EtherChannels / WAN Circuits
- IP Address Assignments – This should contain information in regards to the IP Address spaces used for different VLANs, such as Voice / Data / Management / etc
- Equipment Inventory – Documentation consisting of the physical inventory of network equipment and any related information to that equipment, such as serial numbers / Model number / Inventory Asset Tag / Licensing information
- Configuration info – Copies of all the running configurations on the network, a good combination from a Cisco IOS CLI is “sh run” / “sh inventory” / “sh version” to get both Inventory and Configuration related information
- Design Blueprints – All versions from the original to the current, so the evolution of the network design can be assessed by future network admins based on old designs
- Any known fixes for specific issues – Any time a fix is found that is not yet documented, it should be documented and saved for future use
This is a pretty exhaustive list that most network admins should already be familiar with, as these are pretty standard pieces of information to maintain for your network.
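As a rough illustration of turning CLI output into Equipment Inventory records, here is a minimal Python sketch that parses “sh inventory” style output into a list of dictionaries. The sample text, device names, and serial numbers are made up, and the function name parse_inventory is just a hypothetical helper; real output can vary between platforms, so treat the pattern as a starting point rather than a complete parser.

```python
import re

# Hypothetical sample of "show inventory" output; the device names and
# serial numbers below are made up for illustration.
SAMPLE_INVENTORY = '''\
NAME: "Chassis", DESCR: "Example Router Chassis"
PID: EXAMPLE-RTR       , VID: V01 , SN: ABC12345678

NAME: "Power Supply 1", DESCR: "Example AC Power Supply"
PID: EXAMPLE-PWR-AC    , VID: V02 , SN: XYZ87654321
'''

# Each inventory entry is a NAME/DESCR line followed by a PID/VID/SN line.
ENTRY_PATTERN = re.compile(
    r'NAME:\s*"(?P<name>[^"]*)"\s*,\s*DESCR:\s*"(?P<descr>[^"]*)"\s*\n'
    r'PID:\s*(?P<pid>\S*)\s*,\s*VID:\s*(?P<vid>\S*)\s*,\s*SN:\s*(?P<serial>\S*)'
)


def parse_inventory(text):
    """Return a list of dicts, one per component found in the output."""
    return [match.groupdict() for match in ENTRY_PATTERN.finditer(text)]


if __name__ == "__main__":
    for item in parse_inventory(SAMPLE_INVENTORY):
        print(f'{item["name"]}: model {item["pid"]}, serial {item["serial"]}')
```

Records in this shape can be dropped into a spreadsheet or inventory database alongside asset tags and licensing information.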
To keep network documentation up to date, you can:
- Require documentation – Make adding documentation of device changes / fixes a required part of a trouble ticket before ticket closure
- Schedule documentation review – As part of proactive maintenance, documentation should be scheduled for review on a daily, weekly, monthly, etc basis
- Automate documentation – For exam day knowledge, Cisco IOS contains the Configuration Replace and Configuration Rollback feature, in addition to Embedded Event Manager. For real world, these tools will more commonly include Syslog and SNMP servers
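On the Syslog side, one possible way to tie automation back to documentation is to flag the message Cisco IOS logs when a configuration session ends (%SYS-5-CONFIG_I), so that each change can be matched against a ticket. The sketch below is only an assumption about how that could be scripted in Python: the function name find_config_changes and the sample log lines are hypothetical, and the timestamp / hostname prefix will depend on how your Syslog server is set up.

```python
import re

# Matches the body of the Cisco IOS "configuration changed" syslog message.
# The timestamp/hostname prefix varies by setup, so only the body is matched.
CONFIG_CHANGE = re.compile(
    r"%SYS-5-CONFIG_I: Configured from (?P<source>\S+)"
    r"(?: by (?P<user>\S+))?"
)


def find_config_changes(lines):
    """Return (source, user) tuples for every config-change message found."""
    changes = []
    for line in lines:
        match = CONFIG_CHANGE.search(line)
        if match:
            user = match.group("user") or "unknown"
            changes.append((match.group("source"), user))
    return changes


# Hypothetical log lines for illustration only.
SAMPLE_LOG = [
    "Mar  1 02:14:07 rtr-example 45: %SYS-5-CONFIG_I: "
    "Configured from console by admin on vty0 (192.0.2.10)",
    "Mar  1 02:15:00 rtr-example 46: %LINK-3-UPDOWN: "
    "Interface GigabitEthernet0/1, changed state to up",
]

if __name__ == "__main__":
    for source, user in find_config_changes(SAMPLE_LOG):
        print(f"Config changed from {source} by {user}: check the ticket notes")
```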
Creating and maintaining a Network “Baseline” as part of your documentation
Baseline Data consists of data that was taken while the network was working as expected, so that it can be compared to the current state of the network. The contrast between the two helps in troubleshooting the problem / forming a Hypothesis of possible fixes.
For example, this can be taking a backup copy of a router’s running configuration from when the network was healthy, setting it next to the current configuration, and performing the “Stare and Compare” to spot any differences in the configuration that may not have been documented.
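The “Stare and Compare” can also be handed off to a script. Below is a minimal Python sketch using the standard library difflib module to show what changed between a baseline config and the current one. The function name diff_configs and the two small sample configs are hypothetical, made up only to show the idea.

```python
import difflib


def diff_configs(baseline, current):
    """Return unified-diff lines showing what changed since the baseline."""
    return list(difflib.unified_diff(
        baseline.splitlines(),
        current.splitlines(),
        fromfile="baseline-config",
        tofile="current-config",
        lineterm="",
    ))


# Hypothetical config snippets for illustration only.
BASELINE = """\
hostname rtr-example
interface GigabitEthernet0/1
 description Uplink to core
 ip address 192.0.2.1 255.255.255.0
"""

CURRENT = """\
hostname rtr-example
interface GigabitEthernet0/1
 description Uplink to core
 ip address 192.0.2.1 255.255.255.0
 shutdown
"""

if __name__ == "__main__":
    # Lines starting with "+" were added since the baseline, "-" were removed.
    print("\n".join(diff_configs(BASELINE, CURRENT)))
```

An empty result means the two configs match, which is exactly what you hope to see when no changes have been documented.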
This may also be a Wireshark packet capture sample of what data looked like when the network was functioning properly, or even a “show processes cpu history” on a Cisco router to get the CPU utilization graph shown on the CLI, which can be compared to the current CPU utilization to see if there is a contrast that may indicate the issue.
Anything that requires a metric check should be reviewed and documented, such as network bandwidth on the LAN (Wireshark captures, CPU Utilization on network devices) as well as the WAN (Outside interface statistics / VPN Licensing and information / WAN Speeds to ensure they are meeting the ISP’s SLA).
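To make the idea of comparing current metrics against a baseline a bit more concrete, here is a small Python sketch that flags any metric that has drifted from its baseline value by more than a chosen percentage. The metric names, sample values, and the 20% tolerance are all assumptions for illustration; pick thresholds that make sense for your own network.

```python
def compare_to_baseline(baseline, current, tolerance_pct=20.0):
    """Return metrics that differ from baseline by more than tolerance_pct percent.

    The result maps each flagged metric name to its percent change.
    """
    deviations = {}
    for name, base_value in baseline.items():
        if name not in current or base_value == 0:
            continue  # nothing to compare against
        change_pct = (current[name] - base_value) / base_value * 100
        if abs(change_pct) > tolerance_pct:
            deviations[name] = round(change_pct, 1)
    return deviations


# Hypothetical values recorded while the network was healthy.
BASELINE = {"cpu_5min_pct": 20.0, "wan_download_mbps": 100.0, "wan_upload_mbps": 20.0}

# Hypothetical values gathered during a problem report.
CURRENT = {"cpu_5min_pct": 65.0, "wan_download_mbps": 95.0, "wan_upload_mbps": 19.0}

if __name__ == "__main__":
    for metric, change in compare_to_baseline(BASELINE, CURRENT).items():
        print(f"{metric} changed by {change}% compared to baseline")
```

With these sample numbers only the CPU metric is flagged, since both WAN speeds are within 20% of their baseline values.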
Baseline Data is an important piece of troubleshooting, and should be part of any maintenance plan for any efficient network troubleshooting team!
Disaster Recovery / Device Failure planning / preparation to minimize downtime
In the event of network device failure, there are a few steps a network admin should have taken ahead of time to minimize downtime for end users, including:
- Having Backup Duplicate Devices – As the name implies this is a matching piece of hardware that is deployed across the network, to replace mission critical pieces of network gear in the event of failure (You can also get these via RMA [Hardware Warranty] from most vendors with a valid support contract)
- Backup of Operating System / Licensing – Though software is generally available with a valid support contract from vendor websites, keeping a copy of the exact OS running on switches / routers / firewalls / servers is best practice, along with licensing information for those devices so the license can be transferred to a new device in the event of device failure
- Backup of Device Configurations – These should ideally be off-site backups, in case a fire starts within the building: even if the fire does not damage the electronics, the sprinkler system will! For this reason, backups should not only be taken, but also stored offline and offsite at a remote location or in a business cloud like AWS
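When config backups are copied to an offsite location, it helps to be able to prove the copy matches the original. The Python sketch below shows one possible approach, assuming a timestamped file naming scheme and SHA-256 checksums; the function names and the naming format are hypothetical choices, not a vendor convention.

```python
import hashlib
from datetime import datetime, timezone


def backup_filename(hostname, taken_at):
    """Build a sortable file name such as rtr-example_20240115T023000Z.cfg."""
    return f"{hostname}_{taken_at.strftime('%Y%m%dT%H%M%SZ')}.cfg"


def config_checksum(config_text):
    """Return the SHA-256 hex digest of a configuration's text."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def verify_copy(original_text, copied_text):
    """Return True when the offsite copy is identical to the original."""
    return config_checksum(original_text) == config_checksum(copied_text)


if __name__ == "__main__":
    config = "hostname rtr-example\n"  # hypothetical config text
    name = backup_filename("rtr-example", datetime.now(timezone.utc))
    print(name, config_checksum(config))
```

Storing the checksum next to each backup makes it easy to confirm, long after the fact, that the file pulled back from offsite storage is the one that was originally saved.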
Note that all these different considerations are tied back to network documentation!
This is why the documentation and maintenance of a network is very important to troubleshooting: without duplicate devices / backup configs / licensing information / an inventory list / a logical, physical, and connection topology, getting the network back up and running would be VERY lengthy and inefficient.
So reactive troubleshooting and proactive maintenance and documentation really go hand in hand, as you cannot do one efficiently without the other, and the more efficiently you do one, the more efficiently you will do the other. Bottom line: troubleshooting / documentation / maintenance of the network are fully intertwined with each other 110%!
Change Management and Communication about Network Maintenance
When making changes to the network, you don’t just want to quietly perform a change after hours and document the changes. You want to communicate within your organization with other teams or managers that may have devices / services impacted by your change (that may trigger offline alerts, take users offline, etc), and coordinate these changes with them and with anyone who may need to authorize the changes. Also find out if anyone else with authorization or device access needs to be involved in the maintenance until it is documented and completed.
Affected users will also need a heads up for network downtime after hours, in case there are any teleworkers catching up on work after hours. This is also why scheduled weekly or monthly maintenance windows can be beneficial: users can expect proactive maintenance windows to occur at specific times on a certain day each week or month.
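If the maintenance window recurs weekly, the date to put in the heads up email can be worked out with a few lines of Python. This is a minimal sketch assuming a window that starts at a fixed hour on a fixed weekday; the function name next_window_start and the Sunday 02:00 example are hypothetical.

```python
from datetime import datetime, timedelta


def next_window_start(now, weekday, hour):
    """Return the next start of a weekly window (weekday: Monday=0 ... Sunday=6)."""
    days_ahead = (weekday - now.weekday()) % 7
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=0, second=0, microsecond=0
    )
    # If this week's window has already started or passed, use next week's.
    if candidate <= now:
        candidate += timedelta(days=7)
    return candidate


if __name__ == "__main__":
    # Hypothetical policy: maintenance every Sunday at 02:00 local time.
    start = next_window_start(datetime.now(), weekday=6, hour=2)
    print(f"Next maintenance window starts {start:%A %Y-%m-%d at %H:%M}")
```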
In conclusion, GREAT network maintenance and documentation is a must!
It may be hard, or perhaps nearly impossible, to touch all bases of the ideal network documentation and maintenance models in a busy role, but knowing what ideal looks like is half the battle, and taking small steps to become more efficient on one side (troubleshooting or proactive maintenance) will eventually begin to make more time for the other side of network management.
Next up will be a review of network tools (ping, extended ping, traceroute) at our disposal for pinpointing the point of failure while troubleshooting a network!