When a Linux server becomes unreachable except through console access, and network functionality returns after a service network restart, we're typically dealing with one of several fundamental networking issues. Based on the log excerpts and diagnostic outputs, this appears to be primarily an ARP cache and gateway resolution problem.
The critical evidence comes from the arp -an output showing incomplete MAC address resolution for the gateway during failure:
? (xx.xx.xx.62) at <incomplete> on eth0
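This indicates the server sent ARP requests for the gateway but received no reply, so it cannot resolve the gateway's MAC address. Because all off-subnet traffic is forwarded through the gateway, this explains why network communication fails.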
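To make that check repeatable, here is a minimal sketch that classifies the gateway's entry in arp -an output. The function name gateway_arp_state and the address 192.0.2.62 are placeholders chosen for illustration, not something from the original logs.

```shell
# Hypothetical helper: classify the gateway entry in `arp -an` output read from stdin.
# Prints one of: resolved, incomplete, missing.
gateway_arp_state() (
    gw="$1"
    # -F matches literally, and the parentheses stop 192.0.2.6 from matching 192.0.2.62
    line=$(grep -F "($gw)" | head -n 1)
    case "$line" in
        "")               echo "missing" ;;
        *"<incomplete>"*) echo "incomplete" ;;
        *)                echo "resolved" ;;
    esac
)

# Usage (placeholder address):
#   arp -an | gateway_arp_state 192.0.2.62
```

A result of "incomplete" during an outage and "resolved" after the restart would be consistent with the diagnosis above.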
1. ARP Cache Corruption: The kernel's ARP cache might be getting poisoned or corrupted
2. Network Driver Issues: The NIC driver might be failing to properly maintain ARP entries
3. Switch/Gateway Problems: The upstream network device might be misbehaving
4. Duplicate IP Conflicts: Another device might be responding with the gateway's IP
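The duplicate-IP possibility can be probed by counting how many distinct MAC addresses answer for the gateway's IP. The sketch below assumes the reply format printed by the iputils version of arping ("Unicast reply from IP [MAC] time"); the function name distinct_reply_macs is hypothetical.

```shell
# Hypothetical helper: count distinct MACs in arping reply lines read from stdin.
# Assumes iputils arping output, e.g. "Unicast reply from 192.0.2.62 [00:11:22:33:44:55]  0.6ms"
distinct_reply_macs() {
    grep -i 'reply from' \
        | sed -n 's/.*\[\([0-9A-Fa-f:]*\)\].*/\1/p' \
        | tr 'a-f' 'A-F' \
        | sort -u \
        | wc -l \
        | tr -d ' '
}

# Usage (run while the network is working):
#   arping -c 5 -I eth0 xx.xx.xx.62 | distinct_reply_macs
```

A count above 1 suggests more than one device is answering for the gateway address, which is worth reporting to the provider. A redundant gateway pair can also produce this legitimately, so treat it as a hint rather than proof.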
When the issue occurs again, run these commands before restarting the network service:
# Check current ARP cache
arp -an
# Verify physical link status
ethtool eth0
# Check kernel ring buffer for NIC errors
dmesg | grep eth0
# Monitor ARP traffic
tcpdump -i eth0 -nn arp
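Since these commands have to be typed on a console during an outage, it may help to wrap them in a function that saves one snapshot to a timestamped file. This is only a sketch: the name collect_diag and the netdiag- file prefix are made up here, and tcpdump is left out because it runs until interrupted.

```shell
# Hypothetical helper: save one diagnostic snapshot and print the file name.
# Usage: collect_diag eth0 /root
collect_diag() (
    iface="$1"
    outdir="$2"
    out="$outdir/netdiag-$(date +%Y%m%d-%H%M%S).log"

    # Print a header, then the command output (or a note if the tool is missing)
    run() {
        echo "### $*"
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1
        else
            echo "(not installed: $1)"
        fi
    }

    {
        run arp -an
        run ip neigh show dev "$iface"
        run ethtool "$iface"
        echo "### dmesg | grep $iface"
        dmesg 2>&1 | grep -- "$iface"
    } > "$out"
    echo "$out"
)
```

Running it once in the failed state and once after recovery gives two files that can be compared side by side.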
1. ARP Cache Maintenance
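Add a cron job to periodically verify the gateway's ARP entry (the -w 5 deadline keeps arping from running indefinitely, and piling up processes, when the gateway does not answer):
*/5 * * * * /sbin/arping -f -w 5 -I eth0 xx.xx.xx.62 >/dev/null 2>&1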
2. Network Interface Configuration
Add these parameters to your network interface configuration (in /etc/sysconfig/network-scripts/ifcfg-eth0):
ARPCHECK=no
ARPUPDATE=no
3. Kernel Parameter Tuning
Add these to /etc/sysctl.conf:
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
net.ipv4.conf.all.arp_filter = 1
Then apply with sysctl -p
4. Monitoring Script
Create a watchdog script to detect and recover from the issue:
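#!/bin/bash
GATEWAY_IP="xx.xx.xx.62"
INTERFACE="eth0"

# Act only when outbound traffic fails AND the gateway's ARP entry is incomplete
if ! ping -c 3 -I "$INTERFACE" 8.8.8.8 >/dev/null; then
    # grep -F with parentheses matches the gateway address literally
    if [[ $(arp -an | grep -F "($GATEWAY_IP)") == *"incomplete"* ]]; then
        logger "Network issue detected - flushing ARP cache"
        ip neigh flush dev "$INTERFACE"
        systemctl restart network
    fi
fi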
If these measures don't resolve the issue, it's time to contact your hosting provider because:
- The problem might be with their network equipment
- There could be MAC address conflicts in their switching infrastructure
- Their gateway device might have ARP caching issues
Implement monitoring for these specific metrics:
# ARP cache health
arp -an | grep incomplete | wc -l
# Gateway reachability
ping -c 1 xx.xx.xx.62 >/dev/null; echo $?
# Network interface errors
ethtool -S eth0 | grep error
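For a monitoring agent, the ARP health metric can also be read from /proc/net/arp without parsing arp output. The sketch below assumes the usual column layout of that file, where an unresolved entry has 0x0 in the Flags column; count_incomplete is a hypothetical name.

```shell
# Hypothetical helper: count unresolved entries in /proc/net/arp-formatted input.
# Columns: IP address, HW type, Flags, HW address, Mask, Device
count_incomplete() {
    # NR > 1 skips the header row; Flags == 0x0 marks an incomplete entry
    awk 'NR > 1 && $3 == "0x0" { n++ } END { print n + 0 }'
}

# Usage:
#   count_incomplete < /proc/net/arp
```

Occasional nonzero values are normal for hosts that are switched off. What matters here is whether the gateway is among the unresolved entries.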
When your Linux server becomes unreachable except via console, and restarting the network service temporarily fixes the issue, the most likely culprit is a failed or stale ARP entry for the gateway. Your logs showing at <incomplete> on eth0 for the gateway address support this diagnosis.
During network outages, immediately check ARP cache status:
arp -an
ip neigh show
cat /proc/net/arp
Compare these outputs between working and failed states. The incomplete MAC address for your gateway indicates ARP resolution failure.
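One way to make that comparison mechanical is to reduce ip neigh show output to "IP MAC STATE" lines and diff the two snapshots. This is a sketch only; neigh_summary and the /tmp file names are placeholders.

```shell
# Hypothetical helper: normalise `ip neigh show` output (stdin) to "IP MAC STATE", sorted.
# Entries without a link-layer address (FAILED, INCOMPLETE) get "-" as the MAC.
neigh_summary() {
    awk '{
        mac = "-"
        for (i = 1; i < NF; i++) if ($i == "lladdr") mac = $(i + 1)
        print $1, mac, $NF
    }' | sort
}

# Usage:
#   ip neigh show | neigh_summary > /tmp/neigh.good    # while working
#   ip neigh show | neigh_summary > /tmp/neigh.bad     # during the outage
#   diff /tmp/neigh.good /tmp/neigh.bad
```

If the gateway's MAC differs between the two files, rather than simply going missing, that points toward a conflict or failover upstream instead of a purely local problem.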
Add this cron job to periodically verify and refresh the gateway's ARP entry:
# Check gateway ARP entry every 5 minutes
*/5 * * * * ping -c 1 your.gateway.ip || (arp -d your.gateway.ip && ping -c 1 your.gateway.ip)
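Or create a static ARP entry. This breaks connectivity if the gateway's MAC changes, and the entry is lost on reboot unless it is re-added at startup. Replace the example MAC below with your gateway's actual address: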
arp -s your.gateway.ip 00:00:0C:9F:F0:30
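Before pinning an address, it is safer to read the gateway's MAC from a known-good neighbour table and check that it is well formed. A minimal sketch, with lookup_mac as a hypothetical helper name:

```shell
# Hypothetical helper: print the gateway's MAC from `ip neigh show` output on stdin.
# Returns non-zero if there is no entry or the address is not a valid MAC.
lookup_mac() (
    gw="$1"
    mac=$(awk -v gw="$gw" '$1 == gw {
        for (i = 1; i < NF; i++) if ($i == "lladdr") print $(i + 1)
    }' | head -n 1)
    printf '%s\n' "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$' || exit 1
    echo "$mac"
)

# Usage (only while the network is working):
#   mac=$(ip neigh show | lookup_mac your.gateway.ip) && arp -s your.gateway.ip "$mac"
```

The validation step means an INCOMPLETE or FAILED entry is never written back as a static one.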
When the issue occurs, collect these diagnostics before restarting services:
ethtool eth0
ethtool -S eth0
mii-tool eth0
dmesg | tail -50
journalctl -xe -n 50
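The ethtool -S output can run to hundreds of counters, so a filter that keeps only the nonzero error-like ones makes it easier to spot a bad cable or port. Counter names vary by driver; the err, drop and miss patterns below are an assumption, and nonzero_error_counters is a hypothetical name.

```shell
# Hypothetical helper: list nonzero error-like counters from `ethtool -S` output on stdin.
# Lines look like "     rx_crc_errors: 3"; names differ between NIC drivers.
nonzero_error_counters() {
    awk -F': *' 'tolower($1) ~ /err|drop|miss/ && $2 + 0 > 0 {
        gsub(/^[ \t]+/, "", $1)   # strip the leading indentation
        print $1 "=" $2
    }'
}

# Usage:
#   ethtool -S eth0 | nonzero_error_counters
```

Counters that keep growing between two runs are more telling than a nonzero value on its own.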
Adjust these sysctl parameters in /etc/sysctl.conf:
net.ipv4.neigh.default.gc_thresh1 = 1024
net.ipv4.neigh.default.gc_thresh2 = 2048
net.ipv4.neigh.default.gc_thresh3 = 4096
net.ipv4.neigh.default.base_reachable_time_ms = 30000
net.ipv4.conf.all.arp_accept = 1
Apply changes with sysctl -p.
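Raising the gc_thresh values only helps if the neighbour table is actually filling up. A quick check is whether the kernel has logged a table overflow. The sketch below matches both spellings of that message that kernels are known to print; neigh_overflow_seen is a hypothetical name.

```shell
# Hypothetical helper: succeed if kernel log text on stdin reports a neighbour table overflow.
# Matches "Neighbour table overflow" and "neighbor table overflow" (case-insensitive).
neigh_overflow_seen() {
    grep -Eqi 'neighbou?r table overflow'
}

# Usage:
#   dmesg | neigh_overflow_seen && echo "neighbour table overflow was logged"
#   ip -4 neigh show | wc -l                       # current number of entries
#   sysctl -n net.ipv4.neigh.default.gc_thresh3    # hard limit
```

If no overflow was logged and the entry count is far below gc_thresh3, the table size is probably not the cause and the thresholds can stay at their defaults.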
While your ethtool output shows proper link detection, intermittent failures could indicate:
- Faulty network cable or switch port
- NIC hardware issues
- Switch misconfiguration
Ask your datacenter to:
- Check switch port errors
- Test with alternative port/cable
- Verify switch ARP timeout settings
Implement this bash script to log network state changes:
#!/bin/bash
LOG_FILE="/var/log/network_monitor.log"
GATEWAY="your.gateway.ip"

# Ping the gateway once; on failure, log the ARP and routing state
check_network() {
    if ! ping -c 1 "$GATEWAY" &> /dev/null; then
        echo "$(date) - Network outage detected" >> "$LOG_FILE"
        arp -an >> "$LOG_FILE"
        ip route show >> "$LOG_FILE"
        return 1
    fi
    return 0
}

while true; do
    check_network || {
        # Try to recover automatically. ifdown/ifup is used rather than a bare
        # "ip link set eth0 down/up", which drops the default route on a
        # statically configured interface without restoring it.
        ifdown eth0 && ifup eth0
        sleep 5
        check_network || service network restart
    }
    sleep 60
done
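As written, the loop appends a full dump every minute for as long as an outage lasts. To log only state changes, a small helper can compare the previous and current state and write a line only when they differ. This is a sketch, and the name log_transition and its arguments are hypothetical.

```shell
# Hypothetical helper: append a log line only when the network state changes.
# Usage: log_transition PREVIOUS CURRENT LOGFILE
log_transition() (
    prev="$1"
    cur="$2"
    log="$3"
    [ "$prev" = "$cur" ] && exit 0    # nothing changed, nothing to log
    echo "$(date) - network state: $prev -> $cur" >> "$log"
)

# Usage inside a monitoring loop:
#   state=up
#   while true; do
#       if ping -c 1 "$GATEWAY" >/dev/null 2>&1; then new=up; else new=down; fi
#       log_transition "$state" "$new" "$LOG_FILE"
#       state=$new
#       sleep 60
#   done
```

The resulting log gives outage start and end times directly, which is usually what a hosting provider asks for first.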
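If using NetworkManager instead of legacy network scripts, configure more aggressive connection checking. Check each key below against the NetworkManager.conf documentation for your version before relying on it, since supported key names vary between releases: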
[connection]
ipv4.dad-timeout=5
ipv4.may-fail=no
connection.retries=3
connection.retry-timeout=10