This is a look at Multi-Layer Switch architecture and how SDM can play a role in relieving the issue. As this will only be a reference to commands and the method of using them, the following articles will need to be read or understood for the commands to make sense:
Detailed review of MLS Architecture
Complete explanation of Packet Switching methods and SDM
The first issue to review is the TCAM punting packets to the CPU
Some traffic needs to be sent to the CPU for processing, such as “Control Plane” protocols (dynamic routing protocols, STP), broadcast and multicast traffic, and traffic that needs encryption, like the GRE tunnel traffic from one of the topologies published by Cisco as being part of the TSHOOT exam!
A real-life note: most modern switches have dedicated memory modules for each set of ports, so if congestion is experienced on a switch it can sometimes be resolved by spreading the ports used across the entire span of the switch rather than concentrating them on one side.
Below is a list of commands with a brief explanation of their purpose; their full explanations can be viewed in the links above, where I wrote about them in much more depth:
- “sh platform tcam util” – This command will show if TCAM memory is being maxed out; if it is, you will want to check the SDM Template being used
- “sh sdm prefer” – This will show you the SDM template in use, and how many resources it is dedicating to which features
- “sdm prefer (template)” – This will set the appropriate template to allocate the resources needed; the switch will need a reload to boot into the new template and fix the issue (see the sketch below)
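As a rough sketch of that workflow on a Catalyst 3560/3750-style switch (“routing” is just one example template name, and the available templates vary by platform):

Switch# show platform tcam utilization
Switch# show sdm prefer
Switch# configure terminal
Switch(config)# sdm prefer routing
Switch(config)# end
Switch# reload
! After the reload, confirm the new template took effect:
Switch# show sdm prefer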
Troubleshooting High CPU Utilization issues
These come from excessive punts to the CPU, and the Packet Switching method in use matters: IP CEF, aka Hardware Switching, aka Topology Based Packet Switching, is the best method for hardware resource utilization because it caches known routes in TCAM. The only time the CPU is invoked is if the destination route is not cached in the CEF table / Adjacency Table, and of course if control plane data is coming into the device (STP, routing, ACLs, encryption, QoS, etc).
That being said, you will want to know the differences in Packet Switching methods:
- Process-Switching = When a data flow enters an interface, the Route Processor (the CPU) must be involved in every packet forwarding decision
- Fast-Switching = When a data flow enters an interface, if the destination is not stored in the “Route Cache” for that interface, it is “Punted” to the “Route Processor” (CPU) to check the IP Route table for a destination
- Cisco Express Forwarding (CEF) – Uses dedicated hardware resources that build a Packet Switching Database (the FIB and Adjacency Table) from L3 IP Route Table and L2 ARP Table info
These three different packet switching methods (NOT frame switching at layer 2!) take different tolls on the CPU as can be seen, and the optimal choice is obviously IP CEF!
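For example, to sanity-check that CEF is actually running on a router (GigabitEthernet0/1 is just a placeholder interface):

Router# show ip cef summary
Router# show ip interface GigabitEthernet0/1 | include switching
! If CEF has been disabled, it can be turned back on globally:
Router(config)# ip cef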
A review of CPU killers hidden in device configurations!
There are a few huge CPU killers / unnecessary punts to the CPU hiding in configurations:
- ARPs – As mentioned, even using IP CEF (optimal), every new data stream not cached has to have its first packet punted to the CPU; ACLs, QoS, etc also consume CPU
- Net Background Process – When an interface packet buffer / queue becomes filled and is overflowing, the CPU will help pick up the slack; this can be seen in the “sh int giX/X” output when the throttles, overruns, and ignores fields are incrementing (see the sketch after this list)
- IP Background Process – This Process works with interface states to detect when they are Up / Down, and can consume a lot of CPU resources on flapping interfaces
- TCP Timer Process – This happens when a 3-way TCP Handshake is not completed on the last step, leaving the responding device waiting for a final ACK it never receives; this can be malicious (half-open connections) but will consume CPU resources either way
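A quick first pass over those last couple of killers might look like this (GigabitEthernet0/1 is only an example interface, and the “exclude 0.00” filter simply hides idle processes):

Router# show interfaces GigabitEthernet0/1 | include throttles|overrun|ignored
Router# show processes cpu sorted | exclude 0.00
Router# show tcp brief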
***One of the biggest CPU thieves, is a default route configured with an exit interface!***
An “ip route 0.0.0.0 0.0.0.0 fa0/1” entry requires an ARP for the destination IP in the packet header for EVERY SINGLE PACKET; this configuration should never be used!!!
Instead, use “ip route 0.0.0.0 0.0.0.0 192.168.10.1” so IP CEF has a static entry telling it who that next hop is, rather than “whoever is out interface fa0/1,” as that neighbor could change at any time to a different IP Address / Adjacency Table information.
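A minimal before-and-after sketch, reusing the example interface and next hop from above:

! Avoid: forces an ARP lookup per destination on a multi-access interface
Router(config)# no ip route 0.0.0.0 0.0.0.0 FastEthernet0/1
! Prefer: gives CEF a single next-hop adjacency to resolve once
Router(config)# ip route 0.0.0.0 0.0.0.0 192.168.10.1
Router(config)# end
Router# show ip cef 0.0.0.0 0.0.0.0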
Some commands to help troubleshoot CPU utilization are:
- “sh ip arp” – A lot of incomplete entries in the ARP table can be caused by the above-mentioned default route issue, or by a malicious event like a network IP sweep
- “sh inter (interface)” – Gives all statistics on that interface to help assess where the issues may reside
- “sh tcp statistics” – Shows a summary of current and inactive TCP sessions. These can utilize CPU even when not in use, and can actually be malicious traffic if they are just random connections; they can be cleared with a simple “clear connections” and all legitimate sessions should resume immediately
- “sh proc cpu” – Shows a HUGE table of CPU processes, but does have a column for CPU % used for each process, so you can scroll through to see if any specific process is really hammering on the CPU
- “sh proc cpu hist” – This shows an ASCII graph on the CLI of the CPU’s utilization over 3 intervals: the last 60 seconds, the last 60 minutes, and the last 72 hours
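Strung together, a quick CPU health check might look like this (the sorted and filtered variants are just options I find handy; the interface name is a placeholder):

Router# show processes cpu sorted | exclude 0.00
Router# show processes cpu history
Router# show ip arp | include Incomplete
Router# show interfaces GigabitEthernet0/1
Router# show tcp statistics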
Lightning round of IP CEF commands:
- “sh ip cef”
- “sh ip interface (interface)”
- “sh ip cache”
- “sh proc cpu | i IP Input” (the pipe / include statements are case sensitive!)
- “sh ip cef adj (egress int id) (next hop ip) detail”
- “sh adjacency detail”
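Filled in with made-up example values (next hop 10.1.12.2 out GigabitEthernet0/1), the lightning round looks like:

Router# show ip cef
Router# show ip interface GigabitEthernet0/1
Router# show ip cache
Router# show processes cpu | include IP Input
Router# show ip cef adjacency GigabitEthernet0/1 10.1.12.2 detail
Router# show adjacency detail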
Again, please review the links posted above to really understand what the commands show, what the output means, etc!
Integrating CPU / Memory troubleshooting into “follow the path” method
- Use traceroute to determine which router in the path is having network issues
- On that router issue “sh proc cpu” to eliminate high CPU utilization
- Issue “sh ip route” to ensure it has a route to the destination
- Issue “sh ip cef” to confirm CEF is running
- Issue “sh ip cef x.x.x.x 255.255.255.255” command to verify the full CEF entry for that route, like exit interface, and next hop IP
- Issue “show adjacency (interface) detail” to confirm it has L2 info for CEF’s L3 info
- Issue “sh ip arp” to confirm this device has the next hop device’s IP / MAC address
- Go to the next hop device, and repeat!
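Putting the whole sequence together on one router, with a hypothetical destination of 10.20.30.40 and next hop 10.1.12.2 out GigabitEthernet0/1:

Router# traceroute 10.20.30.40
Router# show processes cpu sorted
Router# show ip route 10.20.30.40
Router# show ip cef
Router# show ip cef 10.20.30.40 255.255.255.255
Router# show adjacency GigabitEthernet0/1 detail
Router# show ip arp 10.1.12.2
! Then hop to the next device in the path and repeat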
One last word about “Leak issues” with memory, and BGP CPU Process hogs
A memory leak is when a process borrows memory from the memory pool but is unable to return it once the process is finished using it; it is caused by a bug in the Cisco IOS (which a simple reboot can fix in the short term until the IOS can be upgraded).
- Memory Leak – When a router allocates a block of memory for a process, and once that process completes, the memory cannot be returned to the memory pool
- “sh mem allocating-proc table” to review memory utilization
- Memory Allocation Failure / MEMALLOCFAIL – This happens when a process tries to allocate a block of memory but is unable to; it is usually caused by a bug, which will ultimately need to be fixed by an IOS upgrade but may be worked around with a reboot
- Buffer Leak – Same deal as Memory Leak, unable to return Buffer resources borrowed for processes once they complete
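Beyond the allocating-process command above, a couple of other angles I find useful when chasing memory or buffer leaks (exact keyword support varies by IOS version):

Router# show memory statistics
Router# show processes memory sorted
Router# show buffers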
BGP CPU Hog processes
I’m not going to touch on BGP here except to say that if you find this issue, you will need to find a way to filter or summarize routes to reduce the CPU overhead.
To check this, issue:
“sh proc mem | i BGP (modifier)” – You can leave the modifier off to review all BGP processes and see what is consuming what.
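For example (no extra modifier, plus a couple of related checks; output omitted):

Router# show processes memory | include BGP
Router# show processes cpu | include BGP
Router# show ip bgp summary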
Aaaaand there goes my Monday night!
It’s been fun, later 🙂