Understanding ARP Cache State Transitions: When Does a STALE Entry Become FAILED Without Usage?


The Linux ARP (neighbor) cache keeps each entry in one of several states; STALE and FAILED are the two that matter most when diagnosing unreachable hosts. Let's examine the specific case where an entry stays STALE long after gc_stale_time has passed.

From the provided output, we see:

10.64.42.121 lladdr b8:20:00:00:00:00 used 6387/6341/6313 probes 1 STALE
10.64.42.157 lladdr b8:20:00:00:00:00 used 24/813/19 probes 1 STALE
10.64.42.12 used 29066/30229/29063 probes 6 FAILED
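
In iproute2's statistics output, the three numbers after "used" are the seconds since the entry was last used, last confirmed, and last updated. A minimal sketch for pulling those fields out of a line like the ones above; the helper name parse_neigh_line is made up for illustration, and it assumes the state is the last field on the line:

```python
import re

# Hypothetical helper: parses one line of `ip -s neigh show dev <if>` output.
_USED_RE = re.compile(r"\bused (\d+)/(\d+)/(\d+)\b")

def parse_neigh_line(line):
    """Return address, lladdr, state and the used/confirmed/updated ages (seconds)."""
    tokens = line.split()
    if not tokens:
        return None
    entry = {"ip": tokens[0], "state": tokens[-1], "lladdr": None}
    # FAILED entries that never resolved carry no lladdr field.
    if "lladdr" in tokens[:-1]:
        entry["lladdr"] = tokens[tokens.index("lladdr") + 1]
    match = _USED_RE.search(line)
    if match:
        used, confirmed, updated = (int(x) for x in match.groups())
        entry.update(used=used, confirmed=confirmed, updated=updated)
    return entry

sample = "10.64.42.121 lladdr b8:20:00:00:00:00 used 6387/6341/6313 probes 1 STALE"
print(parse_neigh_line(sample))
```

Comparing the "used" age against gc_stale_time is what tells you an entry is already past the threshold: 6387 seconds for 10.64.42.121 versus a threshold of 60.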

The key parameters are:

gc_interval = 30
gc_stale_time = 60

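gc_stale_time (60 seconds) determines when an unused entry becomes a candidate for garbage collection. It does not change the entry's state, and it does not move the entry to FAILED. Whether the entry is actually collected depends on:

  1. The periodic garbage collector, which runs roughly every gc_interval (30 seconds)
  2. The entry having been unused for longer than gc_stale_time when a GC pass reaches it
  3. Further conditions in the kernel: as far as I can tell from net/core/neighbour.c, the pass is skipped entirely while the table holds fewer than gc_thresh1 entries (128 by default), and an entry that is still referenced elsewhere in the kernel is left alone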
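
The conditions above can be sketched as a small predicate. This is a rough model, not kernel code: it assumes the periodic pass skips tables smaller than gc_thresh1 (default 128), skips permanent and timer-driven states, and only drops entries nobody else references, which is my reading of net/core/neighbour.c and may differ between kernel versions. The function name and parameters are made up for illustration.

```python
# States the periodic pass leaves alone in this model: permanent entries and
# those driven by a running timer.
SKIPPED_STATES = {"PERMANENT", "INCOMPLETE", "REACHABLE", "DELAY", "PROBE"}

def gc_would_remove(state, idle_seconds, refcount, table_entries,
                    gc_stale_time=60, gc_thresh1=128):
    """Model whether one periodic GC pass would drop an entry (simplified)."""
    if table_entries < gc_thresh1:
        return False   # the whole pass is skipped for small tables
    if state in SKIPPED_STATES:
        return False   # permanent or timer-driven entry
    if refcount != 1:
        return False   # something else still holds a reference
    return state == "FAILED" or idle_seconds >= gc_stale_time

# A STALE entry idle for 6387 s in a table with only 3 entries is kept.
print(gc_would_remove("STALE", 6387, refcount=1, table_entries=3))
# The same entry in a table with 500 entries is collected.
print(gc_would_remove("STALE", 6387, refcount=1, table_entries=500))
```

Under these assumptions, a small home or lab network can keep STALE entries for hours, which matches the output shown earlier.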

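For completely unused entries, the timeline is:

1. REACHABLE → STALE (when the reachability confirmation times out)
2. STALE → Removed (once the entry has been unused for gc_stale_time and a GC pass collects it)

An unused STALE entry never becomes FAILED. FAILED is reached only through probing: traffic to the neighbor moves the entry to DELAY and then PROBE, and it fails when the probes go unanswered. The 10.64.42.12 entry above, with no lladdr and probes 6, is consistent with resolution attempts that went unanswered.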

To monitor the transition, you can use:

# Continuous monitoring command
watch -n 1 "ip -s -s neigh show dev lan | grep 10.64.42.121"

Or programmatically check with Python:

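import subprocess
import time

def check_arp_state(ip, interface, poll_seconds=5):
    """Print the entry for ip until it reaches FAILED or leaves the cache."""
    while True:
        output = subprocess.check_output(
            ["ip", "-s", "-s", "neigh", "show", "dev", interface], text=True)
        # Compare the address field exactly, so that 10.64.42.12 does not
        # also match 10.64.42.121.
        lines = [l.strip() for l in output.splitlines() if l.split()[:1] == [ip]]
        if not lines:
            # An unused STALE entry is removed by GC rather than marked FAILED.
            print(f"{ip}: no entry (removed or never present)")
            return
        print(lines[0])
        if lines[0].split()[-1] == "FAILED":
            return
        time.sleep(poll_seconds)

check_arp_state("10.64.42.121", "lan")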

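The relevant logic lives in the kernel's neighbor subsystem. A simplified paraphrase of the periodic GC pass (not verbatim, and details vary by kernel version):

// In net/core/neighbour.c
static void neigh_periodic_work(struct work_struct *work)
{
    /* Small tables are not collected at all. */
    if (atomic_read(&tbl->entries) < tbl->gc_thresh1)
        goto out;

    /* For each entry n in the hash table: */
    if (n->nud_state & (NUD_PERMANENT | NUD_IN_TIMER))
        continue;   /* permanent or timer-driven: skip */

    /* Drop unreferenced entries that are FAILED or unused for too long. */
    if (refcount_read(&n->refcnt) == 1 &&
        (n->nud_state == NUD_FAILED ||
         time_after(jiffies, n->used + gc_stale_time)))
        neigh_cleanup_and_release(n);
out:
    /* reschedule the periodic work */
}

Note that nothing here moves a STALE entry to FAILED: entries are removed, not invalidated.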

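There is no sysctl that runs the garbage collector on demand. You can lower the thresholds so the next periodic pass is more aggressive, or flush the entries directly. Note that gc_interval is a table-wide setting and exists only under default, not under an interface:

echo 1 > /proc/sys/net/ipv4/neigh/lan/gc_stale_time
echo 30 > /proc/sys/net/ipv4/neigh/default/gc_interval
ip -s -s neigh flush dev lan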

These changes last only until reboot. To make them permanent, add the settings to /etc/sysctl.conf (gc_interval is table-wide, so it goes under default rather than under an interface):

net.ipv4.neigh.lan.gc_stale_time = 60
net.ipv4.neigh.default.gc_interval = 30
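
To check which values are actually in effect, you can read them back from /proc. A minimal sketch, assuming that the table-wide parameters (gc_interval and the gc_thresh* values) exist only under default while the rest are per interface; the helper names are made up for illustration:

```python
from pathlib import Path

# Assumption: these parameters are table-wide and only present under "default".
TABLE_WIDE = {"gc_interval", "gc_thresh1", "gc_thresh2", "gc_thresh3"}

def neigh_sysctl_path(param, interface, root="/proc/sys/net/ipv4/neigh"):
    """Build the /proc path for a neighbor sysctl, using 'default' when table-wide."""
    scope = "default" if param in TABLE_WIDE else interface
    return Path(root) / scope / param

def read_neigh_sysctl(param, interface, root="/proc/sys/net/ipv4/neigh"):
    """Return the current integer value, or None if the file does not exist."""
    path = neigh_sysctl_path(param, interface, root)
    if not path.exists():
        return None
    return int(path.read_text().split()[0])

for name in ("gc_stale_time", "gc_interval", "gc_thresh1"):
    print(name, read_neigh_sysctl(name, "lan"))
```

A result of None means the interface name or parameter is not present on this host, which is a quick way to catch a typo in sysctl.conf.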

Entries in the Linux ARP cache follow this state machine when traffic is sent to a neighbor that has stopped answering:

REACHABLE -> STALE -> DELAY -> PROBE -> FAILED

The STALE -> DELAY step happens only when the host sends a packet to that neighbor. An entry that nobody uses stays STALE until the garbage collector removes it.
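
The state chain above can be written down as a small transition table. This is a simplified model for reasoning about the output, not the kernel's implementation, and the event names are illustrative rather than kernel identifiers:

```python
# Simplified neighbor state transitions; event names are made up for this sketch.
TRANSITIONS = {
    ("REACHABLE", "reachable_time_expired"): "STALE",
    ("STALE", "packet_sent"): "DELAY",
    ("DELAY", "confirmed"): "REACHABLE",
    ("DELAY", "delay_expired"): "PROBE",
    ("PROBE", "reply_received"): "REACHABLE",
    ("PROBE", "probes_exhausted"): "FAILED",
}

def step(state, event):
    """Return the next state, or the same state if the event does not apply."""
    return TRANSITIONS.get((state, event), state)

def run(state, events):
    """Apply a sequence of events to a starting state."""
    for event in events:
        state = step(state, event)
    return state

# With traffic and no replies, the entry ends up FAILED.
print(run("REACHABLE", ["reachable_time_expired", "packet_sent",
                        "delay_expired", "probes_exhausted"]))
# With no traffic at all, it never leaves STALE.
print(run("REACHABLE", ["reachable_time_expired"]))
```

In this model the only way out of STALE is a packet sent to the neighbor, so an idle entry has no path to FAILED.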

From the provided configuration:

gc_stale_time = 60 seconds
gc_interval = 30 seconds

These parameters control garbage collection behavior:

  • gc_stale_time: How long an entry may go unused before the garbage collector is allowed to remove it
  • gc_interval: How often the periodic garbage collector is meant to run

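The entry for 10.64.42.121 remains STALE despite exceeding gc_stale_time, most likely for one of these reasons:

  1. The neighbor table holds fewer than gc_thresh1 entries (128 by default), so the periodic GC pass skips it
  2. Something in the kernel still holds a reference to the entry, so GC leaves it in place
  3. Nothing has sent traffic to the address, so the entry never entered DELAY or PROBE and had no chance to reach FAILED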

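Without any network activity, the timeline would be:

1. The entry becomes STALE when its reachability timeout expires
2. GC runs roughly every 30 seconds (gc_interval)
3. During a GC pass, unreferenced entries unused for longer than gc_stale_time (60s) are removed, provided the table holds at least gc_thresh1 entries
4. Below gc_thresh1 the entry simply stays STALE; forced collection starts only once the table grows past gc_thresh2 or gc_thresh3, not because of general memory pressure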
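
For a feel of the timing, here is the arithmetic under the simplified model in the list above: GC passes happen at fixed multiples of gc_interval, and an entry is removed at the first pass at or after it has been idle for gc_stale_time. This assumes the pass actually considers the entry; the function name and the first_pass parameter are made up for illustration.

```python
import math

def earliest_removal(last_used, gc_stale_time=60, gc_interval=30, first_pass=0):
    """Time of the first GC pass at or after last_used + gc_stale_time.

    Passes are modelled at first_pass, first_pass + gc_interval, and so on.
    All values are in seconds on the same clock.
    """
    eligible_at = last_used + gc_stale_time
    passes_needed = max(0, math.ceil((eligible_at - first_pass) / gc_interval))
    return first_pass + passes_needed * gc_interval

# Entry last used at t=10 s: eligible at t=70 s, collected by the pass at t=90 s.
print(earliest_removal(10))
```

So in this model the delay between last use and removal falls between gc_stale_time and gc_stale_time + gc_interval, that is, 60 to 90 seconds with the values above. An entry that has sat for 6387 seconds is therefore not waiting on the timer; one of the other conditions is keeping it.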

In practice, you can clear the entries immediately instead of waiting for GC by flushing the cache:

# Force ARP cache cleanup
ip -s -s neigh flush all

To monitor state transitions in real-time:

watch -n 1 'ip -4 neigh show nud all'

For programmatic handling, consider this Python snippet:

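import subprocess

# Neighbor states as printed by iproute2
NUD_STATES = {'PERMANENT', 'NOARP', 'REACHABLE', 'STALE', 'DELAY',
              'PROBE', 'INCOMPLETE', 'FAILED', 'NONE'}

def check_arp_state(ip):
    result = subprocess.run(['ip', '-4', 'neigh', 'show', ip],
                            capture_output=True, text=True)
    for line in result.stdout.splitlines():
        tokens = line.split()
        # Match the address exactly and read the state from the last field
        if tokens and tokens[0] == ip and tokens[-1] in NUD_STATES:
            return tokens[-1]
    # No matching line: the entry was removed or never existed
    return 'UNKNOWN'

print(check_arp_state('10.64.42.121'))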

To modify garbage collection behavior:

# Set more aggressive GC (values in seconds)
echo 10 > /proc/sys/net/ipv4/neigh/default/gc_stale_time
echo 5 > /proc/sys/net/ipv4/neigh/default/gc_interval

Keep in mind that overly aggressive settings can remove entries that are still valid, which forces extra ARP resolution for active neighbors.