On Dell 1U servers running Ubuntu (kernel 3.13.0-32) with Intel Xeon L5420 processors and e1000e network interfaces, we're observing frequent hardware unit hangs followed by unexpected adapter resets. The pattern typically shows bursts of errors every few seconds, interspersed with brief periods of normal operation.
// Example kernel log pattern
[timestamp] e1000e 0000:00:19.0 eth0: Detected Hardware Unit Hang
[timestamp] e1000e 0000:00:19.0 eth0: Reset adapter unexpectedly
The hardware hang messages reveal several critical data points:
- TDH (Transmit Descriptor Head) and TDT (Transmit Descriptor Tail) values showing queue position
- Significant time_stamp and jiffies discrepancies
- MAC/PHY status registers indicating potential link layer issues
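To make the TDH/TDT and time_stamp/jiffies fields easier to read, the arithmetic can be sketched as below. This assumes the driver prints these fields in hexadecimal (it formats them with %x) and that the transmit ring has 256 descriptors, which is an assumption to check with `ethtool -g eth0`. The function names and the sample values are hypothetical.

```shell
#!/bin/sh
# Descriptors queued between head (TDH) and tail (TDT), allowing for ring wraparound.
# $1 = TDH (hex, no 0x prefix), $2 = TDT (hex), $3 = ring size (decimal, assumed 256).
tx_backlog() (
    tdh=$((0x$1)); tdt=$((0x$2)); ring=${3:-256}
    echo $(( (tdt - tdh + ring) % ring ))
)

# Ticks elapsed between the stalled buffer's time_stamp and the current jiffies value.
# $1 = time_stamp (hex), $2 = jiffies (hex).
stall_ticks() (
    stamp=$((0x$1)); now=$((0x$2))
    echo $(( now - stamp ))
)

# Hypothetical sample values, not taken from a real log
tx_backlog 2d 32                  # prints 5
stall_ticks 10002a3f0 10002a6d8   # prints 744
```

Dividing the tick count by the kernel's HZ value gives seconds; a result above the driver's transmit timeout is what triggers the hang report.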
Based on numerous reports in the Linux kernel mailing lists and bug trackers, this appears to stem from multiple potential factors:
// Common contributing factors:
1. e1000e driver race conditions (particularly in versions 2.3.x)
2. Hardware flow control misconfiguration
3. TCP segmentation offloading (TSO) issues
4. DMA buffer management problems
5. Power management interference
Try this comprehensive troubleshooting approach:
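- Driver Update: Upgrade to e1000e 3.8+ if possible (for example via a newer kernel or Intel's out-of-tree driver package), then reload the module. Run this from a local console, since unloading the module drops the link:
sudo apt-get install linux-firmware
sudo modprobe -r e1000e
sudo modprobe e1000e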
- Parameter Tuning: Add these kernel module options:
options e1000e InterruptThrottleRate=3000,3000,3000,3000
options e1000e SmartPowerDownEnable=0
- Disable Advanced Features:
sudo ethtool -K eth0 tso off gso off gro off lro off
sudo ethtool -A eth0 autoneg off rx off tx off
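Some offloads can be marked fixed by the driver or silently re-enabled, so it may be worth confirming the result. The sketch below filters `ethtool -k` output (lowercase k, which lists the current settings) for the four offloads disabled above. The function name is hypothetical, and the feature names assume the long-form labels that recent ethtool versions print.

```shell
#!/bin/sh
# Print the names of the targeted offloads that are still reported as "on".
# Reads `ethtool -k <iface>` output on stdin.
offloads_still_on() {
    awk -F': ' '
        $1 ~ /^(tcp-segmentation|generic-segmentation|generic-receive|large-receive)-offload$/ &&
        $2 ~ /^on/ { print $1 }
    '
}

# Usage: ethtool -k eth0 | offloads_still_on
```

No output means all four features are off.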
After applying changes, monitor with:
watch -n 1 "ethtool -S eth0 | grep -E 'err|drop|miss'"
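# Follow new kernel messages (needs a dmesg that supports -w; otherwise use: tail -f /var/log/kern.log)
dmesg -T -w | grep --line-buffered -i e1000e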
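To compare behaviour before and after a change, it can help to count the events rather than watch them scroll by. A minimal sketch, assuming the two message strings quoted earlier appear verbatim in the log; the function name is hypothetical.

```shell
#!/bin/sh
# Count e1000e hang and reset messages in kernel log text read from stdin.
summarize_e1000e() {
    awk '
        /e1000e/ && /Detected Hardware Unit Hang/ { hangs++ }
        /e1000e/ && /Reset adapter unexpectedly/  { resets++ }
        END { printf "hangs=%d resets=%d\n", hangs, resets }
    '
}

# Usage: dmesg | summarize_e1000e
```

Running it against the log for a fixed interval before and after each mitigation gives a rough measure of whether that change helped.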
For persistent issues, consider collecting full diagnostic data:
sudo ethtool -d eth0
sudo lspci -vvv -s 00:19.0
grep eth0 /proc/interrupts
If problems continue after all software mitigations, test with:
- A different network cable
- Alternative switch port
- Reseating the card in its PCIe slot (add-in NICs only; not applicable to an onboard controller)
- Cross-testing with different NIC models
The "Detected Hardware Unit Hang" followed by "Reset adapter unexpectedly" messages in your kernel logs mean the driver gave up waiting for the Intel e1000e controller to finish transmitting and reset it. The gap between TDH (Transmit Descriptor Head) and TDT (Transmit Descriptor Tail), 45 vs 50, suggests a transmit queue stall: descriptors have been queued that the hardware has not consumed.
The e1000e driver implements a watchdog timer that monitors transmit progress. When packets remain in the queue beyond a timeout period (typically 2 seconds), the driver triggers a hardware reset. The MAC Status value 0x80283 translates to:
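# MAC Status Register Breakdown (E1000_STATUS_* bits in the driver headers)
0x80000 - GIO master enable (PCIe bus mastering active)
0x00200 - LAN init done (on ICH/PCH-based parts)
0x00080 - Speed 1000Mbps
0x00002 - Link Up
0x00001 - Full Duplex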
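Going by the E1000_STATUS_* bit definitions in the driver's headers (bit 0 full duplex, bit 1 link up, bits 6-7 speed, bit 19 GIO master enable), the register value can be decoded with a few bit tests. This is a minimal sketch that covers only those bits; the function name is hypothetical.

```shell
#!/bin/sh
# Decode selected bits of the e1000e MAC STATUS register.
# $1 = register value with 0x prefix, e.g. 0x80283
decode_mac_status() (
    status=$(($1))
    link=down; duplex=half; gio=off
    [ $((status & 0x1)) -ne 0 ] && duplex=full
    [ $((status & 0x2)) -ne 0 ] && link=up
    [ $((status & 0x80000)) -ne 0 ] && gio=on
    # Bits 6-7 encode speed: 00 = 10, 01 = 100, 1x = 1000 Mb/s
    case $((status & 0xC0)) in
        0)  speed=10 ;;
        64) speed=100 ;;
        *)  speed=1000 ;;
    esac
    echo "link=$link duplex=$duplex speed=${speed}Mb/s gio_master=$gio"
)

decode_mac_status 0x80283   # prints link=up duplex=full speed=1000Mb/s gio_master=on
```

A hang reported while the register still shows link up at 1000Mb/s full duplex points away from a simple link drop and toward the transmit path itself.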
First, verify your current driver parameters:
# Check current module parameters
grep e1000e /etc/modprobe.d/*
# View real-time statistics
watch -n 1 'ethtool -S eth0 | grep -e "err" -e "drop" -e "miss"'
Try these kernel module parameter adjustments in /etc/modprobe.d/e1000e.conf:
options e1000e InterruptThrottleRate=3000
options e1000e IntMode=1
options e1000e TxIntDelay=1
options e1000e RxIntDelay=1
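A misspelled option name can stop the module from loading, so it may be worth checking the file against the parameters the installed module actually accepts. The sketch below assumes `modinfo -p e1000e` prints one `name:description` entry per line; the function name and the temporary file path in the usage line are hypothetical.

```shell
#!/bin/sh
# Report option names in a modprobe config that the module does not list.
# $1 = modprobe config file, $2 = file holding `modinfo -p e1000e` output
check_options() (
    conf=$1; params=$2
    grep -E '^options[[:space:]]+e1000e' "$conf" |
    tr -s '[:space:]' '\n' | grep '=' | cut -d= -f1 |
    while read -r name; do
        grep -q "^${name}:" "$params" || echo "unknown: $name"
    done
)

# Usage:
#   modinfo -p e1000e > /tmp/e1000e.params
#   check_options /etc/modprobe.d/e1000e.conf /tmp/e1000e.params
```

The options only take effect when the module is next loaded, so reload e1000e or reboot after editing the file.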
For a deeper diagnosis, collect packet traces during failure:
# Capture GRE (IP protocol 47) and DHCP traffic only, to keep the trace small
tcpdump -i eth0 -s0 -w /tmp/e1000e_debug.pcap 'ip proto 47 || (udp port 67 || udp port 68)'
The firmware version 1.4-0 shown in your output is quite old. Firmware (NVM) version numbering differs between Intel controllers, so identify the exact part first and then check Intel or your system vendor for a newer image:
# Identify exact hardware
lspci -nn -d 8086: -vvv | grep -A 10 Ethernet
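The driver and firmware versions themselves are reported by `ethtool -i`. A small sketch that pulls out the three relevant fields so they can be logged alongside the hang counts; the function name is hypothetical.

```shell
#!/bin/sh
# Extract driver name, driver version and firmware version from `ethtool -i` output on stdin.
nic_versions() {
    awk -F': ' '
        $1 == "driver"           { drv = $2 }
        $1 == "version"          { ver = $2 }
        $1 == "firmware-version" { fw  = $2 }
        END { printf "driver=%s version=%s firmware=%s\n", drv, ver, fw }
    '
}

# Usage: ethtool -i eth0 | nic_versions
```

Recording this line before and after a driver or firmware change makes it clear which combination was running when a hang occurred.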
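If you're stuck with this kernel version, one option is to rebuild the e1000e driver with a floor on the interrupt throttle rate. The snippet below is an illustrative sketch of the idea, not an upstream patch; the function name and signature do not match the in-tree driver, so adapt it to the ITR handling in your source tree:
static void e1000e_update_itr(struct e1000_adapter *adapter,
                              u16 itr_setting)
{
        /* Enforce a floor of 4000 on any non-zero ITR setting */
        if (itr_setting != 0 && itr_setting < 4000)
                itr_setting = 4000;

        adapter->itr = itr_setting;
}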
For critical systems, consider these options:
# Disable energy efficient Ethernet
ethtool --set-eee eth0 eee off
# Force 100Mbps operation if 1000Mbps unstable
ethtool -s eth0 speed 100 duplex full autoneg off