Troubleshooting Linux Network Connectivity Loss: ARP Cache and Gateway Resolution Issues


When a Linux server becomes unreachable except through console access, and connectivity returns after service network restart, the cause is usually one of a few basic networking faults. Based on the log excerpts and diagnostic outputs, this appears to be primarily an ARP and gateway resolution problem.

The critical evidence comes from the arp -an output showing incomplete MAC address resolution for the gateway during failure:

? (xx.xx.xx.62) at <incomplete> on eth0

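An <incomplete> entry means the kernel sent ARP requests for the gateway's address but received no reply. Without the gateway's MAC address the server cannot send any traffic off its own subnet, which explains why all remote communication fails.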
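The same check can be made against ip neigh show, which reports an explicit state per entry. The following is a minimal sketch, not a definitive tool: the function name neigh_state is hypothetical, and 192.0.2.62 is a documentation placeholder standing in for the gateway address.

```shell
#!/bin/sh
# Print the neighbour (ARP) state recorded for one IP address.
# Reads `ip neigh show` output on stdin and prints the state column,
# e.g. REACHABLE, STALE, INCOMPLETE or FAILED.
# Prints MISSING when the address has no entry at all.
neigh_state() {
    awk -v ip="$1" '
        $1 == ip { print $NF; found = 1; exit }
        END      { if (!found) print "MISSING" }
    '
}

# Example (placeholder address):
#   ip neigh show dev eth0 | neigh_state 192.0.2.62
```

INCOMPLETE or FAILED for the gateway during an outage corresponds to the <incomplete> entry in the arp -an output.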

Possible causes:

1. ARP Cache Corruption: The kernel's neighbor (ARP) table may hold a stale or poisoned entry for the gateway
2. Network Driver Issues: The NIC driver may be dropping ARP requests or replies
3. Switch/Gateway Problems: The upstream network device may have stopped answering ARP for the gateway address
4. Duplicate IP Conflicts: Another device may be answering ARP requests for the gateway's IP
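A duplicate-IP conflict can be checked by counting how many different MAC addresses answer for the gateway. This sketch assumes the reply format printed by the iputils arping ("Unicast reply from IP [MAC] time"); other arping implementations format replies differently. The function name count_reply_macs and the 192.0.2.62 address are placeholders.

```shell
#!/bin/sh
# Count distinct MAC addresses that answered for one IP.
# Reads arping output on stdin; each reply carries the sender MAC in
# square brackets, e.g.
#   Unicast reply from 192.0.2.62 [00:11:22:33:44:55]  0.7ms
count_reply_macs() {
    grep -o '\[[0-9A-Fa-f:]\{17\}\]' \
        | tr 'a-f' 'A-F' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Example (placeholder address, run as root):
#   arping -c 5 -I eth0 192.0.2.62 | count_reply_macs
```

A result above 1 suggests that two devices claim the address, although a redundant gateway pair in the middle of a failover can produce the same picture.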

When the issue occurs again, run these commands before restarting the network service:

# Check current ARP cache
arp -an

# Verify physical link status
ethtool eth0

# Check kernel ring buffer for NIC errors
dmesg | grep eth0

# Monitor ARP traffic
tcpdump -i eth0 -nn arp

1. ARP Cache Maintenance

Add a cron job to periodically verify the gateway's ARP entry:

*/5 * * * * /sbin/arping -f -c 3 -w 5 -I eth0 xx.xx.xx.62 >/dev/null 2>&1

2. Network Interface Configuration

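Add these parameters to your network interface configuration (in /etc/sysconfig/network-scripts/ifcfg-eth0). According to the initscripts documentation, ARPCHECK=no skips the duplicate-address check when the interface comes up and ARPUPDATE=no stops ifup from announcing the address to neighbors. Neither changes how the kernel resolves the gateway, so revert them if they make no difference: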

ARPCHECK=no
ARPUPDATE=no

3. Kernel Parameter Tuning

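Add these to /etc/sysctl.conf. They control how this host answers and announces ARP, which matters mostly on hosts with more than one interface or address: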

net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_filter = 1

Then apply with sysctl -p
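After applying, it is worth confirming that the running kernel actually holds the intended values. The sketch below compares "key = value" lines against the files under /proc/sys. The function name check_sysctl is hypothetical. It handles single-valued settings like the ones above, and its optional first argument exists only so it can be pointed at test data.

```shell
#!/bin/sh
# Compare desired "key = value" lines (stdin) against the running kernel.
# Prints one line per key: OK, MISMATCH, or UNKNOWN (no such entry).
# $1 optionally overrides the /proc/sys root (useful for testing).
check_sysctl() {
    proc_root="${1:-/proc/sys}"
    while IFS='=' read -r key want; do
        key=$(printf '%s' "$key" | tr -d ' \t')
        want=$(printf '%s' "$want" | tr -d ' \t')
        # Skip blank lines and comments.
        case "$key" in ''|'#'*) continue ;; esac
        # net.ipv4.conf.all.arp_ignore -> net/ipv4/conf/all/arp_ignore
        path="$proc_root/$(printf '%s' "$key" | tr . /)"
        if [ ! -r "$path" ]; then
            echo "UNKNOWN $key"
        elif [ "$(cat "$path")" = "$want" ]; then
            echo "OK $key"
        else
            echo "MISMATCH $key (running: $(cat "$path"), wanted: $want)"
        fi
    done
}

# Example:
#   grep '^net\.ipv4\.conf' /etc/sysctl.conf | check_sysctl
```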

4. Monitoring Script

Create a watchdog script to detect and recover from the issue:

#!/bin/bash

GATEWAY_IP="xx.xx.xx.62"
INTERFACE="eth0"

# Act only when an outside host is unreachable AND the gateway's
# ARP entry is incomplete, so unrelated outages are left alone.
if ! ping -c 3 -I "$INTERFACE" 8.8.8.8 >/dev/null 2>&1; then
    if [[ $(arp -an | grep -F "($GATEWAY_IP)") == *"incomplete"* ]]; then
        logger "Network issue detected - flushing ARP cache"
        ip neigh flush dev "$INTERFACE"
        systemctl restart network
    fi
fi

If these measures don't resolve the issue, it's time to contact your hosting provider because:

  • The problem might be with their network equipment
  • There could be MAC address conflicts in their switching infrastructure
  • Their gateway device might have ARP caching issues

Implement monitoring for these specific metrics:

# ARP cache health
arp -an | grep -c incomplete

# Gateway reachability
ping -c 1 xx.xx.xx.62 >/dev/null; echo $?

# Network interface errors
ethtool -S eth0 | grep error
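For a monitoring system, a single number is easier to graph and alert on than a list of counters. This is a rough sketch that totals every ethtool -S counter whose name contains "err". Counter names vary by driver, so the match is an assumption, and the function name sum_error_counters is hypothetical.

```shell
#!/bin/sh
# Sum every counter whose name contains "err" in `ethtool -S` output.
# Input lines look like "     rx_errors: 3"; the "NIC statistics:" header
# and unrelated counters are ignored.
sum_error_counters() {
    awk -F': *' 'tolower($1) ~ /err/ { total += $2 } END { print total + 0 }'
}

# Example:
#   ethtool -S eth0 | sum_error_counters
```

A total that keeps growing between samples points at the cable, the switch port, or the NIC rather than at ARP itself.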

When a Linux server becomes unreachable except via console, and restarting the network service temporarily fixes it, a likely culprit is a failing ARP exchange with the gateway. An entry reading at <incomplete> on eth0 for the gateway address is consistent with this diagnosis: the kernel asked for the gateway's MAC address and got no answer.

During network outages, immediately check ARP cache status:

arp -an
ip neigh show
cat /proc/net/arp

Compare these outputs between working and failed states. The incomplete MAC address for your gateway indicates ARP resolution failure.
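One way to make that comparison repeatable is to save arp -an output while the server is healthy and compare the gateway's entry against a capture taken during an outage. The sketch below assumes the usual arp -an line layout ("? (IP) at MAC [ether] on IFACE"). The function name gateway_mac_changed, the file paths, and the 192.0.2.62 address are placeholders.

```shell
#!/bin/sh
# Compare the hardware address recorded for one IP in two saved
# `arp -an` captures. Arguments: IP, healthy capture, outage capture.
gateway_mac_changed() {
    ip="$1"; before="$2"; after="$3"
    # In `arp -an` output, field 2 is "(IP)" and field 4 is the MAC
    # or the literal "<incomplete>".
    old=$(awk -v want="($ip)" '$2 == want { print $4; exit }' "$before")
    new=$(awk -v want="($ip)" '$2 == want { print $4; exit }' "$after")
    if [ "$old" = "$new" ]; then
        echo "unchanged: ${old:-no entry}"
    else
        echo "changed: ${old:-no entry} -> ${new:-no entry}"
    fi
}

# Example (placeholder paths and address):
#   arp -an > /root/arp.good        # while healthy
#   arp -an > /root/arp.bad         # during an outage
#   gateway_mac_changed 192.0.2.62 /root/arp.good /root/arp.bad
```

A change to <incomplete> matches the failure described here, while a change to a different MAC address hints at a duplicate IP or a gateway failover.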

Add this cron job to periodically verify the gateway's ARP entry and refresh it when the gateway stops answering:

# Check gateway ARP entry every 5 minutes
*/5 * * * * ping -c 1 your.gateway.ip >/dev/null 2>&1 || (arp -d your.gateway.ip; ping -c 1 your.gateway.ip) >/dev/null 2>&1

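Or create a static ARP entry, replacing the example MAC address below with your gateway's real one. A static entry is lost on reboot unless it is re-added at boot, and it will break connectivity if the gateway's MAC changes, for example after a hardware replacement or failover: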

arp -s your.gateway.ip 00:00:0C:9F:F0:30
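The same static entry can be set with iproute2, which is the maintained replacement for the arp command. Because a mistyped MAC address would cut the server off, this sketch validates the address first. The function names is_valid_mac and pin_neighbor are hypothetical, and the address and interface in the example are placeholders.

```shell
#!/bin/sh
# Succeed only for six colon-separated hex octets.
is_valid_mac() {
    printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# Pin a neighbour entry. Arguments: interface, IP, MAC. Must run as root.
# "replace" creates the entry or overwrites an existing one.
pin_neighbor() {
    is_valid_mac "$3" || { echo "invalid MAC: $3" >&2; return 1; }
    ip neigh replace "$2" lladdr "$3" nud permanent dev "$1"
}

# Example (placeholder values):
#   pin_neighbor eth0 192.0.2.62 00:00:0C:9F:F0:30
# To undo:
#   ip neigh del 192.0.2.62 dev eth0
```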

When the issue occurs, collect these diagnostics before restarting services:

ethtool eth0
ethtool -S eth0
mii-tool eth0
dmesg | tail -50
journalctl -xe -n 50
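Since the evidence disappears once the network is restarted, it helps to write everything into a single file first. The helper below is a sketch: the name log_cmd and the snapshot path are assumptions, and the command list in the example simply mirrors the commands above.

```shell
#!/bin/sh
# Append one command's output to a log file under a timestamped header.
# A missing tool or a non-zero exit is recorded instead of stopping the run.
# Arguments: log file, then the command and its arguments.
log_cmd() {
    out="$1"; shift
    {
        echo "=== $(date '+%F %T') :: $* ==="
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || echo "(exit status $?)"
        else
            echo "(command not found: $1)"
        fi
        echo
    } >> "$out"
}

# Example (placeholder path):
#   snap="/var/log/net-snapshot-$(date +%Y%m%d-%H%M%S).log"
#   log_cmd "$snap" ethtool eth0
#   log_cmd "$snap" ethtool -S eth0
#   log_cmd "$snap" ip neigh show
#   log_cmd "$snap" dmesg
```

Recording the exit status and missing tools keeps a partial snapshot useful on minimal installs where mii-tool or journalctl is absent.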

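Adjust these sysctl parameters in /etc/sysctl.conf. The gc_thresh values raise the size limits of the neighbor table, and arp_accept lets the kernel create entries from gratuitous ARP announcements: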

net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.base_reachable_time_ms = 30000
net.ipv4.conf.all.arp_accept = 1

Apply changes with sysctl -p.
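Raising the gc_thresh limits only helps if the neighbor table is actually overflowing, and the kernel logs a message when that happens. This sketch counts such messages in dmesg output. The pattern is an assumption intended to cover both the older "Neighbour table overflow." wording and the newer "neighbor table overflow!" wording, and the function name count_neigh_overflows is hypothetical.

```shell
#!/bin/sh
# Count kernel log lines (stdin) that report a neighbour table overflow.
# Prints 0 when there are none.
count_neigh_overflows() {
    grep -Eic 'neighbou?r table overflow' || true
}

# Example:
#   dmesg | count_neigh_overflows
# Current table size, for comparison with gc_thresh3:
#   ip -4 neigh show | wc -l
```

A count of 0 together with only a handful of entries suggests the table limits are not the cause, and the defaults can stay.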

While your ethtool output shows proper link detection, intermittent failures could indicate:

  • Faulty network cable or switch port
  • NIC hardware issues
  • Switch misconfiguration

Ask your datacenter to:

  • Check switch port errors
  • Test with alternative port/cable
  • Verify switch ARP timeout settings

Implement this bash script to log network state changes:

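#!/bin/bash
LOG_FILE="/var/log/network_monitor.log"
GATEWAY="your.gateway.ip"
INTERFACE="eth0"

# Log the ARP table and routes whenever the gateway stops answering.
check_network() {
  if ! ping -c 1 -W 2 "$GATEWAY" &> /dev/null; then
    echo "$(date) - Network outage detected" >> "$LOG_FILE"
    arp -an >> "$LOG_FILE"
    ip route show >> "$LOG_FILE"
    return 1
  fi
  return 0
}

while true; do
  check_network || {
    # Try the least disruptive recovery first: drop cached neighbor entries.
    # Cycling the link with "ip link set down/up" also removes the default
    # route, which a plain "up" may not restore.
    ip neigh flush dev "$INTERFACE"
    sleep 5
    check_network || service network restart
  }
  sleep 60
done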

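If using NetworkManager instead of legacy network scripts, connection defaults can be set in a [connection] section of NetworkManager.conf. Verify each key below against the nm-settings and NetworkManager.conf man pages for your version before relying on it, since not every connection property is accepted there; ipv4.dad-timeout, for example, is expressed in milliseconds: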

[connection]
ipv4.dad-timeout=5
ipv4.may-fail=no
connection.retries=3
connection.retry-timeout=10