Linux Network Packet Drops in __netif_receive_skb_core: Diagnosis and Solutions for RX Packet Loss on Ubuntu Servers


Examining netstat -ni output shows the RX-DRP counter incrementing persistently on the physical interface (eno1), while the bridge and virtual interfaces report no drops. ethtool -S eno1 shows the drops concentrated in rx_queue_2_drops, suggesting a bottleneck in CPU core affinity or interrupt handling for that receive queue.
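
The same counters can be cross-checked with iproute2 (net-tools is deprecated on recent Ubuntu releases); ip -s link reads the same kernel statistics:

ip -s link show eno1    # the RX line includes packets/bytes/errors/dropped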

The dropwatch utility pinpoints the exact kernel function where drops occur:

sudo ./dropwatch -l kas
Initalizing kallsyms db
dropwatch> start
Enabling monitoring...
12 drops at __netif_receive_skb_core+4a0 (0xffffffff979002d0)
6 drops at ip_forward+1b5 (0xffffffff97978615)

The /proc/net/softnet_stat shows significant values in the third column (time_squeeze: the number of times net_rx_action stopped with packets still queued because the netdev_budget or its time allowance was exhausted):

008bcbf0 00000000 0000355d 00000000 00000000
004875d8 00000000 00002408 00000000 00000000
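
The values are hexadecimal and one row is printed per CPU. A minimal bash sketch to decode the first three fields (processed, dropped, time_squeeze, in the order used by the kernel's net-procfs.c):

cpu=0
while read -r p d s _; do
    echo "cpu$cpu processed=$((16#$p)) dropped=$((16#$d)) time_squeeze=$((16#$s))"
    cpu=$((cpu + 1))
done < /proc/net/softnet_stat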

First, increase the network processing budget and backlog:

# sysctl -w net.core.netdev_budget=600
# sysctl -w net.core.netdev_max_backlog=3000
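
These sysctl writes do not survive a reboot. To persist them, a drop-in file works (the file name below is arbitrary):

cat <<'EOF' | sudo tee /etc/sysctl.d/90-netdev-tuning.conf
net.core.netdev_budget = 600
net.core.netdev_max_backlog = 3000
EOF
sudo sysctl --system    # reload all sysctl configuration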

For Intel I210 NICs (igb driver) specifically, enlarge the ring buffers and relax interrupt coalescing:

# ethtool -G eno1 rx 2048 tx 2048
# ethtool -C eno1 rx-usecs 100 rx-frames 50
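
Verify that the driver accepted the new values, since requests beyond the hardware maximum are rejected or clamped:

ethtool -g eno1    # current vs. pre-set maximum ring sizes
ethtool -c eno1    # current interrupt coalescing parameters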

Check the current IRQ assignments and pin the NIC's interrupts to specific cores:

grep eno1 /proc/interrupts
echo "2-3" | sudo tee /proc/irq/24/smp_affinity_list  # example: pin IRQ 24 to cores 2-3 on a 4-core system

Alternatively, install and enable irqbalance to distribute interrupts automatically (note that it will overwrite manual smp_affinity settings):

sudo apt install irqbalance
sudo systemctl enable --now irqbalance

For systems running multiple containers, optimize bridge settings:

sudo brctl setfd br-f4e34 0                           # set bridge forward delay to 0
sudo sysctl -w net.bridge.bridge-nf-call-iptables=0   # skip iptables for bridged traffic (bypasses Docker's bridge filtering)

Capture detailed packet processing metrics:

sudo perf probe -a '__netif_receive_skb_core'
sudo perf stat -e 'probe:__netif_receive_skb_core' -a sleep 10
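
The probe stays registered until explicitly removed, so clean up when done:

sudo perf probe -d 'probe:__netif_receive_skb_core'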

Monitor softirq distribution across CPUs:

watch -n1 'cat /proc/softirqs | grep NET_RX'
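
If sysstat is installed, mpstat shows the same pressure as a percentage of CPU time spent in softirq context (the %soft column):

sudo apt install sysstat
mpstat -P ALL 1    # one report per second, per CPU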

After applying changes, verify improvements with:

watch -n1 'cat /proc/net/softnet_stat; ethtool -S eno1 | grep drops'
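
For a rough before/after comparison, sample the interface's drop counter over a fixed window (the sysfs statistics path is standard for all network devices):

before=$(cat /sys/class/net/eno1/statistics/rx_dropped)
sleep 60
after=$(cat /sys/class/net/eno1/statistics/rx_dropped)
echo "RX drops in 60s: $((after - before))"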

If the issue persists, consider reinstalling the kernel module package that ships the igb driver:

sudo apt install --reinstall linux-modules-extra-$(uname -r)
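
Confirm which driver build is actually loaded before and after the reinstall:

ethtool -i eno1            # driver name, version, firmware
modinfo igb | head -n 5    # metadata of the module on disk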

To restate the symptom in more detail: when monitoring network interfaces with netstat -ni, RX-DRP increments consistently on the physical interface (eno1) while the bridge/veth interfaces show zero drops. The drops occur even under light traffic (~2 packets/sec during SSH sessions).

# Continuous monitoring command:
watch -n 1 "netstat -ni | grep eno1"

The Intel I210 NIC (igb driver) shows queue-specific drops in queue 2 according to ethtool:

# Check NIC-specific drops:
ethtool -S eno1 | grep -E 'rx_queue.*drops'
rx_queue_2_drops: 35  # This increments over time

Key findings from hardware diagnostics (verification commands follow the list):

  • RX checksum offloading disabled (confirmed via ethtool -k)
  • Ring buffers at default 256 (max 4096)
  • No apparent hardware errors (CRC, alignment, etc.)
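
The checks behind those findings, for reproducibility (error-counter names are as exposed by the igb driver):

ethtool -k eno1 | grep checksum    # offload state
ethtool -g eno1                    # ring buffer current/max
ethtool -S eno1 | grep -i error    # CRC, alignment, and other error counters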

Using dropwatch reveals the primary drop location:

# Build and run dropwatch (requires libnl-3 and readline development headers):
git clone https://github.com/pavel-odintsov/drop_watch
cd drop_watch/src && make
sudo ./dropwatch -l kas

The output consistently points to __netif_receive_skb_core as the main drop point (an independent tracepoint-based cross-check follows the list), indicating potential issues with:

  • SoftIRQ processing capacity
  • Backlog queue limitations
  • Packet filtering at the core networking layer
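
On kernels where bpftrace is available, the skb:kfree_skb tracepoint gives an independent confirmation without building dropwatch; a minimal sketch that ranks kernel stacks by drop count:

sudo apt install bpftrace
sudo bpftrace -e 'tracepoint:skb:kfree_skb { @drops[kstack] = count(); }'
# Interrupt with Ctrl-C to print the aggregated stacks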

Implemented the following adjustments without resolving drops:

# Increased backlog and budget parameters
echo 4096 > /proc/sys/net/core/netdev_max_backlog
echo 600 > /proc/sys/net/core/netdev_budget
echo 8000 > /proc/sys/net/core/netdev_budget_usecs  # time budget; the default of 2000 would otherwise cap the larger packet budget

# Disabled various offloading features
ethtool -K eno1 gro off lro off gso off tso off

The /proc/net/softnet_stat continues showing incrementing counters in column 3 (time_squeeze), meaning net_rx_action still exits with packets left unprocessed despite the increased budgets.
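
To quantify "continues incrementing", sum column 3 across all CPUs twice and diff (same hex decoding as above, written as a bash sketch):

sum_squeeze() {
    local total=0 c
    while read -r _ _ c _; do total=$((total + 16#$c)); done < /proc/net/softnet_stat
    echo "$total"
}
a=$(sum_squeeze); sleep 10; b=$(sum_squeeze)
echo "time_squeeze events in 10s: $((b - a))"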

With multiple bridge networks and veth pairs, we examined potential namespace-related drops:

# Check interface drops across all namespaces:
for ns in $(ip netns list | awk '{print $1}'); do
    ip netns exec $ns netstat -ni
done

Key findings (the iptables counter check appears after the list):

  • Drops only occur on physical interface, not virtual interfaces
  • No correlation between container traffic and drop rate
  • iptables rules (including Docker's) don't show matching drop counters
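
The iptables verification consisted of listing rules with their packet counters and looking for nonzero DROP matches:

sudo iptables -nvL | grep -w DROP           # pkts column is the first field
sudo iptables -t nat -nvL | grep -w DROP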

Implemented kernel tracing to capture drop events:

# Trace packet drops in real time (fetching the skb->len argument requires kernel debug symbols):
sudo perf probe --add '__netif_receive_skb_core skb->len'
sudo perf record -e probe:__netif_receive_skb_core -a -g -- sleep 30
sudo perf script

After comprehensive testing, the resolution involved:

# Apply final working configuration (run as root):
# 1. Increase ring buffers
ethtool -G eno1 rx 2048 tx 2048

# 2. Adjust IRQ balancing
sudo apt install irqbalance
sudo systemctl enable --now irqbalance

# 3. CPU affinity for NIC interrupts
for irq in $(grep eno1 /proc/interrupts | awk -F: '{print $1}'); do
    echo 0-3 > /proc/irq/$irq/smp_affinity_list
done

# 4. Disable problematic power management (pin the CPU frequency governor)
echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor

The root cause was ultimately identified as CPU contention: Docker's network stack processing (bridge forwarding, NAT, veth) was competing with the NIC's interrupt handling on the same cores.
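
One way to enforce that separation permanently is to pin the Docker daemon (and the processes it spawns) away from the cores serving NIC interrupts. This is an illustrative sketch, not part of the original fix; the core numbers assume an 8-core host where cores 0-3 were assigned to the eno1 IRQs above:

sudo systemctl edit docker
# In the drop-in editor, add:
#   [Service]
#   CPUAffinity=4-7
sudo systemctl restart docker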