Debugging ARP Broadcast Storms and High CPU Usage on Cisco 3750X Switches: A Network Engineer’s Guide



While troubleshooting network instability, we identified several critical symptoms, starting with sustained CPU saturation:

show processes cpu sorted | exc 0.00%
CPU utilization for five seconds: 99%/12%; one minute: 99%; five minutes: 99%

PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
12   111438973    18587995       5995 44.47% 43.88% 43.96%   0 ARP Input
174    59541847     5198737      11453 22.39% 23.47% 23.62%   0 Hulc LED Process
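In the `99%/12%` figure, the first number is total CPU utilization and the second is the share spent at interrupt level, so roughly 87% here is process-level work such as ARP Input. The sketch below pulls those numbers out of saved command output; the function name `summarize_cpu` and the 10% threshold are illustrative assumptions, not part of any Cisco tooling.

```python
import re

HEADER_RE = re.compile(r"five seconds: (\d+)%/(\d+)%")
# Columns: PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
ROW_RE = re.compile(
    r"^\s*(\d+)\s+\d+\s+\d+\s+\d+\s+([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+\d+\s+(.+?)\s*$"
)

def summarize_cpu(output, threshold=10.0):
    """Split total CPU into interrupt and process load; list busy processes."""
    total = interrupt = None
    busy = []
    for line in output.splitlines():
        header = HEADER_RE.search(line)
        if header:
            total, interrupt = int(header.group(1)), int(header.group(2))
            continue
        row = ROW_RE.match(line)
        # group(2) is the 5-second average for the process
        if row and float(row.group(2)) >= threshold:
            busy.append((row.group(5), float(row.group(2))))
    process = None if total is None else total - interrupt
    return {"total": total, "interrupt": interrupt, "process": process, "busy": busy}

SAMPLE = """CPU utilization for five seconds: 99%/12%; one minute: 99%; five minutes: 99%

PID Runtime(ms)     Invoked      uSecs   5Sec   1Min   5Min TTY Process
12   111438973    18587995       5995 44.47% 43.88% 43.96%   0 ARP Input
174    59541847     5198737      11453 22.39% 23.47% 23.62%   0 Hulc LED Process
"""

if __name__ == "__main__":
    print(summarize_cpu(SAMPLE))
```

A high process-level share with ARP Input at the top suggests the CPU is busy handling punted ARP packets rather than forwarding interrupts.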

Multiple MAC addresses learned on a single port were another red flag:

Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
1    001c.c06c.d620    DYNAMIC     Gi1/1/3
1    001c.c06c.d694    DYNAMIC     Gi1/1/3
1    001c.c06c.d6ac    DYNAMIC     Gi1/1/3
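Spotting this by eye gets hard on a full table, so a small parser can help. The following is a minimal sketch that counts dynamic MAC addresses per port from saved `show mac address-table` output; the helper names and the `max_macs=2` cutoff are assumptions chosen for illustration, and uplinks or trunks will legitimately exceed any small cutoff.

```python
import re
from collections import defaultdict

# Matches rows such as "   1    001c.c06c.d620    DYNAMIC     Gi1/1/3"
MAC_ROW_RE = re.compile(
    r"^\s*(\d+)\s+([0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+(\w+)\s+(\S+)\s*$",
    re.IGNORECASE,
)

def macs_per_port(output):
    """Return {port: number of distinct dynamic MAC addresses}."""
    ports = defaultdict(set)
    for line in output.splitlines():
        m = MAC_ROW_RE.match(line)
        if m and m.group(3).upper() == "DYNAMIC":
            ports[m.group(4)].add(m.group(2).lower())
    return {port: len(macs) for port, macs in ports.items()}

def suspicious_ports(output, max_macs=2):
    """Ports that learned more MAC addresses than max_macs, sorted by name."""
    return sorted(p for p, n in macs_per_port(output).items() if n > max_macs)

SAMPLE_TABLE = """Vlan    Mac Address       Type        Ports
----    -----------       --------    -----
   1    001c.c06c.d620    DYNAMIC     Gi1/1/3
   1    001c.c06c.d694    DYNAMIC     Gi1/1/3
   1    001c.c06c.d6ac    DYNAMIC     Gi1/1/3
   1    0011.2233.4455    DYNAMIC     Gi1/0/5
"""

if __name__ == "__main__":
    print(suspicious_ports(SAMPLE_TABLE))
```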

Before diving deeper, we verified several configuration aspects:

  • Confirmed STP configuration with show spanning-tree
  • Checked VLAN assignments with show vlan brief
  • Verified no TCAM exhaustion with show platform tcam utilization

Using Wireshark, we identified the ARP storm pattern. This Python snippet helps analyze packet captures:

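from scapy.all import ARP, rdpcap

def analyze_arp(pcap_file):
    """Count ARP packets per source MAC, busiest sender first."""
    packets = rdpcap(pcap_file)  # Loads the whole capture into memory
    arp_count = {}

    for pkt in packets:
        if ARP in pkt:
            src_mac = pkt[ARP].hwsrc  # Sender hardware address from the ARP payload
            arp_count[src_mac] = arp_count.get(src_mac, 0) + 1

    # List of (mac, count) tuples sorted by count, descending
    return sorted(arp_count.items(), key=lambda x: x[1], reverse=True)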
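Totals alone do not show how bursty a sender is. As a rough complement, the sketch below takes `(timestamp, source MAC)` pairs and reports each sender's busiest one-second bucket. It is deliberately independent of scapy; with scapy, the pairs could be built from `float(pkt.time)` and `pkt[ARP].hwsrc`. The function name `peak_arp_rate` is an assumption made for this example.

```python
from collections import Counter

def peak_arp_rate(events):
    """events: iterable of (timestamp_seconds, src_mac).

    Returns {mac: highest packet count seen in any whole-second bucket}.
    Buckets are aligned to whole seconds, so this is not a sliding window.
    """
    buckets = Counter()
    for ts, mac in events:
        buckets[(mac, int(ts))] += 1

    peaks = {}
    for (mac, _second), count in buckets.items():
        peaks[mac] = max(peaks.get(mac, 0), count)
    return peaks

if __name__ == "__main__":
    demo = [(0.1, "001c.c06c.d620"), (0.5, "001c.c06c.d620"),
            (0.9, "001c.c06c.d620"), (1.2, "001c.c06c.d620"),
            (5.0, "001c.c06c.d694")]
    print(peak_arp_rate(demo))
```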

To mitigate the broadcast traffic impact:

interface GigabitEthernet1/0/1
 storm-control broadcast level 20.00
 storm-control action trap
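The `level` value is a percentage of the port's bandwidth, so it helps to translate 20.00 into absolute numbers before settling on it. The calculation below is a back-of-the-envelope sketch: it assumes a 1 Gb/s port and minimum-size 64-byte frames, and it counts 20 bytes of preamble and inter-frame gap per frame, which may not match how the switch measures traffic internally.

```python
def broadcast_threshold(link_bps, level_percent, frame_bytes=64, overhead_bytes=20):
    """Return (threshold in bits/s, approximate frames/s at that threshold)."""
    threshold_bps = link_bps * level_percent / 100
    # overhead_bytes models preamble (8) plus inter-frame gap (12); an assumption
    bits_per_frame = (frame_bytes + overhead_bytes) * 8
    return threshold_bps, threshold_bps / bits_per_frame

if __name__ == "__main__":
    bps, pps = broadcast_threshold(1_000_000_000, 20.0)
    print(f"{bps / 1e6:.0f} Mb/s, about {pps:,.0f} minimum-size frames per second")
```

Under these assumptions, 20% of a gigabit port is 200 Mb/s, or close to 300,000 small broadcast frames per second, which is far more than a CPU-bound ARP process can absorb. A lower level may be worth testing on access ports.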

This script helps track MAC address movements across ports:

#!/bin/bash
while true; do
    date >> mac_movement.log
    ssh switch "show mac address-table" | grep -i "001c.c06c" >> mac_movement.log
    sleep 10
done

These commands proved valuable during troubleshooting:

show platform cpu packet statistics
show platform hardware forward drops
show ip arp inspection statistics

The actual solution involved multiple layers:

  1. Implemented port security on critical access ports
  2. Reduced ARP timeout to 240 seconds
  3. Enabled DHCP snooping with ARP inspection
  4. Created smaller VLANs to reduce broadcast domains
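For step 3, the configuration is short but has to be applied consistently across VLANs and uplinks. The helper below is a sketch that renders the usual IOS lines for DHCP snooping and dynamic ARP inspection; the function name, the VLAN list, and the uplink port `GigabitEthernet1/0/48` are hypothetical and should be replaced with real values. Hosts with static addresses have no snooping binding, so they would need a static binding or an ARP ACL before inspection is enabled.

```python
def render_dai_config(vlans, trusted_ports):
    """Render IOS config lines for DHCP snooping plus ARP inspection."""
    vlan_list = ",".join(str(v) for v in vlans)
    lines = [
        "ip dhcp snooping",
        f"ip dhcp snooping vlan {vlan_list}",
        f"ip arp inspection vlan {vlan_list}",
    ]
    # Uplinks toward the DHCP server and other switches must be trusted,
    # otherwise their DHCP replies and ARP packets are dropped.
    for port in trusted_ports:
        lines += [
            f"interface {port}",
            " ip dhcp snooping trust",
            " ip arp inspection trust",
            " exit",
        ]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_dai_config([1], ["GigabitEthernet1/0/48"]))
```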

After implementing changes, we verified improvements:

show processes cpu | include ARP
show interfaces | include broadcast
show mac address-table count vlan 1

When dealing with network instability characterized by ARP broadcast storms and high CPU utilization on Cisco 3750X switches, we're typically facing one of these scenarios:

  • Layer 2 loops in the network topology
  • Misconfigured or malfunctioning network devices
  • VLAN design issues (particularly with large broadcast domains)
  • ARP cache poisoning or other security incidents

Several key commands help identify the root cause:

! Show ARP-related CPU utilization
show processes cpu sorted | exclude 0.00%

! Monitor broadcast traffic patterns
show interfaces | include line|broadcast

! Check MAC address table stability
show mac address-table count
show mac address-table dynamic vlan 1

! Verify TCAM utilization
show platform tcam utilization
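The broadcast counters in `show interfaces` are cumulative, so a single reading says little; the difference between two readings shows which port is receiving the storm right now. The following sketch parses the filtered output and ranks interfaces by growth between two snapshots. The function names are assumptions for illustration, and the regular expressions expect the usual `Received N broadcasts` line format.

```python
import re

IFACE_RE = re.compile(r"^(\S+) is .*line protocol is")
BCAST_RE = re.compile(r"Received (\d+) broadcasts")

def broadcasts_by_interface(output):
    """Map interface name to its cumulative received-broadcast counter."""
    counts = {}
    current = None
    for line in output.splitlines():
        m = IFACE_RE.match(line)
        if m:
            current = m.group(1)
            continue
        b = BCAST_RE.search(line)
        if b and current is not None:
            counts[current] = int(b.group(1))
    return counts

def broadcast_deltas(before, after):
    """Return (delta, interface) pairs, largest increase first."""
    return sorted(((after[i] - before.get(i, 0), i) for i in after), reverse=True)

if __name__ == "__main__":
    first = {"GigabitEthernet1/0/1": 100, "GigabitEthernet1/0/2": 50}
    second = {"GigabitEthernet1/0/1": 150, "GigabitEthernet1/0/2": 5050}
    print(broadcast_deltas(first, second))
```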

Here are concrete steps to mitigate the issue:

! Enable storm control on the affected ports (it is configured per interface, not per VLAN)
interface range GigabitEthernet1/0/1 - 24
 storm-control broadcast level 20.00
 storm-control action trap
exit

! Implement port security where possible (access ports only)
interface GigabitEthernet1/0/1
 switchport mode access
 switchport port-security
 switchport port-security maximum 2
 switchport port-security violation restrict
 switchport port-security mac-address sticky
exit

! Adjust ARP timers (example for VLAN 1; the IOS default is 14400 seconds)
interface Vlan1
 arp timeout 240
exit
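Shortening the ARP timeout ages out stale entries sooner, but it also makes the switch re-resolve every entry more often. As a rough model, assuming each cached host is refreshed about once per timeout period, the steady-state refresh rate is simply hosts divided by timeout. The function below is only that approximation, and the host count of 1000 is a made-up example.

```python
def arp_refresh_rate(hosts, timeout_seconds):
    """Approximate ARP refreshes per second for a given cache size and timeout."""
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return hosts / timeout_seconds

if __name__ == "__main__":
    for timeout in (14400, 300, 240):  # IOS default, then two shortened values
        rate = arp_refresh_rate(1000, timeout)
        print(f"timeout {timeout:>5}s: about {rate:.2f} refreshes/s for 1000 hosts")
```

Going from 14400 to 240 seconds multiplies this background ARP load by 60 in the model, so the change is worth validating against the ARP Input CPU figure after it is applied.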

For persistent issues, consider this Python script to monitor MAC flaps:

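import re
import time
from getpass import getpass

import paramiko

# Matches rows such as "   1    001c.c06c.d620    DYNAMIC     Gi1/1/3"
MAC_ROW = re.compile(
    r"^\s*(\d+)\s+([0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\s+\S+\s+(\S+)", re.IGNORECASE
)

def fetch_mac_table(switch_ip, username, password):
    # IOS often closes the SSH session after a single exec request,
    # so open a fresh connection for every poll.
    ssh = paramiko.SSHClient()
    # AutoAddPolicy trusts unknown host keys; acceptable in a lab only.
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(switch_ip, username=username, password=password,
                look_for_keys=False, allow_agent=False)
    try:
        _, stdout, _ = ssh.exec_command('show mac address-table dynamic')
        return stdout.read().decode()
    finally:
        ssh.close()

def monitor_mac_flaps(switch_ip, username, password, interval=60):
    last_port = {}  # (vlan, mac) -> port seen on the previous poll

    while True:
        output = fetch_mac_table(switch_ip, username, password)
        for line in output.splitlines():
            match = MAC_ROW.match(line)
            if not match:
                continue  # Skip headers, separators, and totals
            vlan, mac, port = match.groups()
            key = (vlan, mac.lower())
            if key in last_port and last_port[key] != port:
                print(f"MAC FLAP: {mac} (VLAN {vlan}) moved from {last_port[key]} to {port}")
            last_port[key] = port

        time.sleep(interval)

# Usage example (prompts for the password instead of hard-coding it)
if __name__ == "__main__":
    monitor_mac_flaps('192.168.1.1', 'admin', getpass('Switch password: '))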

Key architectural considerations:

  • Segment large VLANs (/20 is too broad - consider /24 or smaller)
  • Implement Private VLANs where appropriate
  • Enable DHCP snooping and ARP inspection
  • Consider implementing VRF-lite for different departments
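To put the first point in numbers, Python's standard `ipaddress` module can show how many hosts share one broadcast domain before and after a split. This is a small sketch; the `10.0.0.0/20` network is a made-up example, not the addressing from the incident.

```python
import ipaddress

def broadcast_domain_summary(cidr, new_prefix=24):
    """Compare usable host counts before and after splitting a network."""
    net = ipaddress.ip_network(cidr)
    subnets = list(net.subnets(new_prefix=new_prefix))
    return {
        # Subtract the network and broadcast addresses
        "hosts_before": net.num_addresses - 2,
        "subnet_count": len(subnets),
        "hosts_per_subnet": subnets[0].num_addresses - 2,
    }

if __name__ == "__main__":
    print(broadcast_domain_summary("10.0.0.0/20"))
```

Every ARP request in the /20 can reach up to 4094 hosts, while the same request in one of the sixteen /24 subnets reaches at most 254.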

For extreme cases where standard troubleshooting fails:

! Disable proxy ARP and ARP processing on an interface for testing
! (applies to routed ports and SVIs only, not to Layer 2 switchports)
interface GigabitEthernet1/0/1
 no ip proxy-arp
 no arp arpa
exit

! Temporarily block DHCP (UDP ports 67 and 68) on the SVI.
! Note: an IP ACL cannot match ARP, because ARP frames carry no IP header.
access-list 100 deny udp any any eq 67
access-list 100 deny udp any any eq 68
access-list 100 permit ip any any

interface Vlan1
 ip access-group 100 in
exit

Implement SNMP monitoring for these critical OIDs:

1.3.6.1.2.1.4.22.1.2  # ipNetToMediaPhysAddress (ARP table)
1.3.6.1.2.1.17.4.3.1.2  # dot1dTpFdbPort (MAC address table)
1.3.6.1.4.1.9.9.109.1.1.1.1.7  # cpmCPUTotal1minRev (CPU utilization)
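When walking dot1dTpFdbPort, each returned instance OID ends with the MAC address encoded as six decimal sub-identifiers, which is awkward to compare with CLI output. The helper below is a sketch that converts such an OID into Cisco's dotted notation; the function name is an assumption. On Catalyst switches the bridge table is typically polled per VLAN, so the walk may need to be repeated for each VLAN of interest.

```python
FDB_PORT_OID = "1.3.6.1.2.1.17.4.3.1.2"

def fdb_oid_to_mac(oid):
    """Convert a dot1dTpFdbPort instance OID to a Cisco-style MAC string."""
    prefix = FDB_PORT_OID + "."
    oid = oid.lstrip(".")  # Accept OIDs written with a leading dot
    if not oid.startswith(prefix):
        raise ValueError(f"not a dot1dTpFdbPort instance: {oid}")
    octets = [int(part) for part in oid[len(prefix):].split(".")]
    if len(octets) != 6 or any(not 0 <= o <= 255 for o in octets):
        raise ValueError(f"index is not a 6-octet MAC address: {oid}")
    hex_str = "".join(f"{o:02x}" for o in octets)
    # Group the 12 hex digits as xxxx.xxxx.xxxx
    return ".".join(hex_str[i:i + 4] for i in range(0, 12, 4))

if __name__ == "__main__":
    print(fdb_oid_to_mac("1.3.6.1.2.1.17.4.3.1.2.0.28.192.108.214.32"))
```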