Linux Kernel 3.6+ IPv4 Multipath Routing: Flow-Based Next Hop Selection with NAT Compatibility



Since Linux kernel 3.6, the removal of the IPv4 route cache has fundamentally changed multipath routing behavior. Unlike IPv6's flow-based selection, IPv4 moved to per-packet round-robin next hop selection (flow hashing only returned in kernel 4.4). This creates serious problems for NAT scenarios, where TCP connections require consistent path selection.

Consider this common deployment scenario:


# Typical multipath routing table
default 
    nexthop via 192.168.1.1 dev eth0 weight 1
    nexthop via 192.168.2.1 dev eth1 weight 1
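
Such a table can be created with a single command (the gateways and interface names here are illustrative):

# Create the multipath default route
ip route add default \
    nexthop via 192.168.1.1 dev eth0 weight 1 \
    nexthop via 192.168.2.1 dev eth1 weight 1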

A TCP handshake fails because:

  • SYN packet takes eth0 path (source NAT to IP1)
  • SYN-ACK returns to IP1
  • Final ACK takes eth1 path (source NAT to IP2)
  • Remote server sees mismatched source IPs
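
On an affected kernel you can sometimes observe the alternation directly, since each lookup may select a different next hop (8.8.8.8 is just an arbitrary external destination):

# Run repeatedly; a per-packet kernel may report a different nexthop each time
ip route get 8.8.8.8
ip route get 8.8.8.8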

There are several approaches to restore flow consistency:

1. CONFIG_IP_ROUTE_MULTIPATH_CACHED Patch

This backported patch reintroduces flow-based selection:


# Apply the patch
git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
cd net
git checkout -b multipath_cache v3.6
patch -p1 < multipath_cache.patch

# Kernel config
CONFIG_IP_ROUTE_MULTIPATH_CACHED=y
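
After rebuilding and booting the patched kernel, confirm the option is active (this assumes your distribution installs the kernel config under /boot):

grep IP_ROUTE_MULTIPATH_CACHED /boot/config-$(uname -r)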

2. Netfilter MARK-based Solution

An alternative using iptables and policy routing:


# Mark packets by connection
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -m mark ! --mark 0 -j ACCEPT
iptables -t mangle -A PREROUTING -m state --state NEW -j MARK --set-mark 0x1
# Save the mark to the connection, or --restore-mark has nothing to restore
iptables -t mangle -A PREROUTING -m state --state NEW -j CONNMARK --save-mark

# Routing rules
ip rule add fwmark 0x1 lookup 100
ip route add default via 192.168.1.1 table 100
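
Verify that the rule and the dedicated table are in place:

ip rule show
ip route show table 100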

The approaches trade CPU against memory roughly as follows:

Method        CPU Overhead   Memory Usage
Round-Robin   Low            Minimal
Flow Hash     Medium         Moderate
Netfilter     High           High

For ISPs using BGP multihoming:


# Enable BGP multipath
router bgp 65000
  maximum-paths 2
  bgp bestpath as-path multipath-relax

# Linux policy routing
ip route add default proto bgp nexthop via 203.0.113.1 nexthop via 198.51.100.1
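
Once the daemon installs both paths, confirm the kernel actually sees a multipath default route (vtysh assumes an FRR/Quagga-style daemon):

vtysh -c 'show ip route'   # daemon's view of the RIB
ip route show default      # kernel view; expect two nexthop lines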

In Linux kernels prior to 3.6, IPv4 route caching automatically maintained flow consistency by caching the selected next-hop for a given source-destination pair. This ensured all packets in a TCP flow would traverse the same path despite round-robin multipath selection. The routing cache removal fundamentally changed this behavior:


# Pre-3.6 behavior (simplified pseudocode)
if flow_in_cache(src, dst):
    nexthop = cached_nexthop(src, dst)
else:
    nexthop = round_robin_select()
    cache_nexthop(src, dst, nexthop)

Consider this dual-WAN scenario with NAT:


# Routing table
default proto static scope global
    nexthop via 192.168.1.1 dev eth0 weight 1
    nexthop via 192.168.2.1 dev eth1 weight 1

# Packet sequence
SYN -> uses eth0 (src NAT IP: 203.0.113.10)
SYN-ACK -> arrives on eth0 (addressed to 203.0.113.10)
ACK -> uses eth1 (src NAT IP: 203.0.113.20) # Connection fails
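
You can watch the mismatch in the connection-tracking table while reproducing it (requires conntrack-tools; port 443 is just an example):

# Stream conntrack events for outbound TCP
conntrack -E -p tcp --dport 443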

On modern kernels, the recommended approach is to enable flow-based hash selection similar to IPv6:


# Kernel 4.4 replaced per-packet round-robin with an L3 flow hash;
# the sysctl below was added in kernel 4.12
echo 1 > /proc/sys/net/ipv4/fib_multipath_hash_policy

# Policy 1 enables L4-aware hashing, conceptually:
hash(src_ip, dst_ip, protocol, src_port, dst_port) % num_paths
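
To make the setting survive reboots, drop it into a sysctl fragment (the file name is arbitrary):

echo 'net.ipv4.fib_multipath_hash_policy = 1' > /etc/sysctl.d/99-multipath.conf
sysctl --system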

For specialized cases, you can compute a custom per-flow hash in eBPF, for example in a tc classifier whose return value can drive a path-selection scheme:


#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>

SEC("classifier")
int multipath_hash(struct __sk_buff *skb)
{
    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct iphdr *iph = data + sizeof(struct ethhdr);

    /* The verifier requires explicit bounds checks before each access */
    if ((void *)(iph + 1) > data_end)
        return 0;

    if (iph->protocol == IPPROTO_TCP) {
        struct tcphdr *tcph = (void *)iph + iph->ihl * 4;
        if ((void *)(tcph + 1) > data_end)
            return 0;
        /* Simple flow hash over the 4-tuple; hash_2words() is
         * kernel-internal and not callable from BPF */
        return iph->saddr ^ iph->daddr ^
               ((__u32)tcph->source << 16 | tcph->dest);
    }
    return 0;
}

char _license[] SEC("license") = "GPL";
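
A typical build-and-attach sequence for the program above (the file and interface names are illustrative):

clang -O2 -g -target bpf -c multipath_hash.c -o multipath_hash.o
tc qdisc add dev eth0 clsact
tc filter add dev eth0 egress bpf direct-action obj multipath_hash.o sec classifier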

For complex multi-homing scenarios, consider using iptables marking:


# Mark packets by connection
iptables -t mangle -A PREROUTING -j CONNMARK --restore-mark
iptables -t mangle -A PREROUTING -m mark ! --mark 0 -j ACCEPT
# Note: $RANDOM expands only once, when the rule is inserted, so it cannot
# balance per connection; alternate new connections with the statistic match
iptables -t mangle -A PREROUTING -m state --state NEW -m statistic --mode nth --every 2 --packet 0 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -m state --state NEW -m mark --mark 0 -j MARK --set-mark 2
iptables -t mangle -A PREROUTING -j CONNMARK --save-mark

# Route by mark
ip rule add fwmark 1 lookup 100
ip rule add fwmark 2 lookup 200
ip route add default via 192.168.1.1 table 100
ip route add default via 192.168.2.1 table 200
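
Confirm that connections are actually being pinned to a path by filtering conntrack entries on their mark (requires conntrack-tools):

conntrack -L --mark 1   # connections routed via table 100
conntrack -L --mark 2   # connections routed via table 200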

Check your current hash policy with:


sysctl net.ipv4.fib_multipath_hash_policy
conntrack -L  # connection tracking; /proc/net/rt_cache no longer exists since 3.6