Investigating and Resolving 5-Second fsync Latency Spikes in ESXi NFS Datastores


During performance testing with Ted Ts'o's fsync-tester and ioping on an 8GB virtual disk, we consistently observe fsync latencies spiking to roughly 5 seconds:

Linux 2.6.33-grml64:
root@dynip211 /mnt/sda # ./fsync-tester
fsync time: 5.0391
fsync time: 5.0438
fsync time: 5.0300
fsync time: 0.0231

These latency spikes propagate across VMs sharing the same NFS datastore:

root@grml /mnt/sda/ioping-0.5 # ./ioping -i 0.3 -p 20 .
4096 bytes from . (reiserfs /dev/sda): request=5 time=4809.0 ms
4096 bytes from . (reiserfs /dev/sda): request=12 time=4950.0 ms

The issue appears consistently when:

  • Using SCSI/SAS virtual disks (not IDE)
  • With modern Linux kernels (2.6.32+)
  • When applications perform multiple small writes before fsync

Here's the strace output from fsync-tester during one of these spikes (a single 1MB pwrite followed by fsync):

pwrite(3, "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"..., 1048576, 0) = 1048576
fsync(3)                                = 0

Wireshark captures reveal TCP window behavior differences between problematic and working configurations. FreeBSD-based NFS servers show regular TCP window updates (29127 bytes), while OpenIndiana uses larger default window sizes.

No.  Time        Source                Destination           Protocol Info
1082 16.164096   192.168.250.10        192.168.250.20        NFS      V3 WRITE Call
1085 16.167678   192.168.250.20        192.168.250.10        NFS      V3 WRITE Reply

We've identified several approaches with varying impacts:

1. TCP Parameter Adjustment:

ndd -set /dev/tcp tcp_recv_hiwat 8192  # From default 128000
ndd -set /dev/tcp tcp_max_buf 1048575  # From default 1048576

This eliminates the latency spikes but reduces throughput from 170MB/s to 80MB/s; a simple probe for measuring that impact on your own datastore is sketched after this list.

2. Storage Protocol Alternatives:

iSCSI doesn't exhibit the problem but lacks NFS's convenient VMDK management features.
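To quantify the throughput cost of the TCP window change under option 1, a rough sequential-write probe like the one below can be run inside a guest before and after tuning. This is only a minimal sketch, not part of the original test setup: the output file name, 1 GiB test size, and 1 MiB block size are arbitrary choices.

/* seqwrite.c: rough sequential-write throughput probe (illustrative only) */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define PROBE_BLOCK  (1024*1024)   /* 1 MiB per write */
#define PROBE_BLOCKS 1024          /* 1 GiB total     */

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

int main(void) {
    char *buf = malloc(PROBE_BLOCK);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'a', PROBE_BLOCK);

    int fd = open("throughput-test", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (fd < 0) { perror("open"); return 1; }

    double start = now();
    for (int i = 0; i < PROBE_BLOCKS; i++) {
        if (write(fd, buf, PROBE_BLOCK) != PROBE_BLOCK) { perror("write"); return 1; }
    }
    if (fsync(fd) < 0) { perror("fsync"); return 1; }   /* include the final flush */
    double elapsed = now() - start;

    printf("wrote %d MiB in %.2f s (%.1f MiB/s)\n",
           PROBE_BLOCKS, elapsed, PROBE_BLOCKS / elapsed);
    close(fd);
    free(buf);
    return 0;
}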

The issue appears related to:

  • NFS v3 implementation in ESXi
  • Interaction between VM storage controllers and NFS datastores
  • TCP window size negotiation with certain NFS servers

To better understand the write pattern impact, here's a modified version of fsync-tester that uses smaller writes:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>

#define FILE_SIZE (1024*1024)
#define BLOCK_SIZE 4096
#define NUM_WRITES (FILE_SIZE/BLOCK_SIZE)

double gettime() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

int main() {
    /* Buffered open (no O_SYNC) so that all data is flushed by the timed fsync. */
    int fd = open("testfile", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (fd < 0) {
        perror("open");
        return 1;
    }

    char *buf = malloc(BLOCK_SIZE);
    memset(buf, 'a', BLOCK_SIZE);

    double start, end;

    /* Many small buffered writes before a single fsync, mimicking the
       problematic application pattern described above. */
    for (int i = 0; i < NUM_WRITES; i++) {
        if (pwrite(fd, buf, BLOCK_SIZE, (off_t)i * BLOCK_SIZE) != BLOCK_SIZE) {
            perror("pwrite");
            return 1;
        }
    }

    start = gettime();
    fsync(fd);
    end = gettime();

    printf("fsync time: %.4f\n", end - start);

    close(fd);
    free(buf);
    return 0;
}

The issue persists across multiple ESXi builds:

  • 381591
  • 348481
  • 260247
  • 4.1.0.433742

Based on testing, we recommend:

  1. For performance-critical systems: Use iSCSI instead of NFS
  2. When NFS is required: Implement TCP window size tuning with awareness of throughput impact
  3. Monitor for updates to ESXi's NFS client implementation

To walk through the investigation in more detail: when working with NFS datastores in VMware ESXi environments, we encountered severe latency spikes of roughly 5 seconds during fsync operations. These weren't just minor delays; they were stalls that affected every VM sharing the same datastore. The characteristic pattern appears when using tools like fsync-tester:

fsync time: 5.0391
fsync time: 5.0438
fsync time: 5.0300
fsync time: 0.0231
fsync time: 0.0243
fsync time: 5.0382

We can consistently reproduce this using two standard benchmarking tools:

# Using fsync-tester (Ted Ts'o)
./fsync-tester

# Using ioping for latency measurement
./ioping -i 0.3 -p 20 .
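If ioping isn't available inside a guest, the same kind of periodic latency sample can be approximated with a short C loop that issues a 4KB write plus fsync at a fixed interval and prints each latency. This is only a rough sketch: the interval and request count mirror the ioping flags above, and the temporary file name is arbitrary.

/* latency-probe.c: ioping-style write+fsync latency sampler (illustrative only) */
#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>

#define REQUESTS    20       /* like ioping -p 20  */
#define INTERVAL_US 300000   /* like ioping -i 0.3 */

static double now(void) {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}

int main(void) {
    char buf[4096];
    memset(buf, 'a', sizeof(buf));

    int fd = open("latency-probe.tmp", O_WRONLY|O_CREAT|O_TRUNC, 0666);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 1; i <= REQUESTS; i++) {
        double start = now();
        if (pwrite(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) { perror("pwrite"); return 1; }
        if (fsync(fd) < 0) { perror("fsync"); return 1; }
        printf("request=%d time=%.1f ms\n", i, (now() - start) * 1000.0);
        usleep(INTERVAL_US);
    }

    close(fd);
    unlink("latency-probe.tmp");
    return 0;
}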

The problem manifests across different environments:

  • Multiple ESXi builds (381591, 348481, 260247)
  • Various hardware platforms (Intel/AMD)
  • Different NFS servers (OpenIndiana, Linux, NexentaStor)

Through extensive testing, we discovered the issue primarily affects VMs using SCSI/SAS controllers with Native Command Queuing (NCQ) or Tagged Command Queuing (TCQ) enabled. Virtual IDE controllers don't exhibit this behavior, though they come with their own performance limitations.
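As a quick check on whether command queuing is actually in play, the effective queue depth of the virtual SCSI disk can be read from sysfs inside a Linux guest; a depth of 1 means tagged queuing is effectively disabled. This is only a sketch: it assumes the test disk appears as sda in the guest and that the kernel exposes the standard /sys/block/sda/device/queue_depth attribute.

/* queue-depth.c: print the guest-visible SCSI queue depth (illustrative only) */
#include <stdio.h>

int main(void) {
    /* Assumes the test disk shows up as sda inside the Linux guest. */
    const char *path = "/sys/block/sda/device/queue_depth";
    FILE *f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    int depth;
    if (fscanf(f, "%d", &depth) == 1)
        printf("queue depth for sda: %d (1 = tagged queuing effectively off)\n", depth);
    fclose(f);
    return 0;
}

On most SCSI drivers the same attribute is writable by root, so temporarily lowering it offers a way to test whether queuing is involved without switching the VM to a virtual IDE controller.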

Key observations about the write pattern:

# Problematic pattern (multiple small writes before fsync)
pwrite(3, "******", 4096, 1036288) = 4096
pwrite(3, "******", 4096, 1040384) = 4096
pwrite(3, "******", 4096, 1044480) = 4096
fsync(3) = 0

Packet captures reveal TCP behavior differences between problematic and working configurations. Here's a sample from Wireshark during a latency spike:

No.  Time        Source            Destination       Protocol Info
1082 16.164096   192.168.250.10    192.168.250.20    NFS V3 WRITE Call
1083 16.164112   192.168.250.10    192.168.250.20    NFS V3 WRITE Call
1085 16.167678   192.168.250.20    192.168.250.10    NFS V3 WRITE Reply
1086 16.168280   192.168.250.20    192.168.250.10    NFS V3 WRITE Reply

We found several approaches that mitigate the issue, each with its own compromises:

1. TCP Window Size Adjustment

# On Solaris-based NFS servers (OpenIndiana/Nexenta)
ndd -set /dev/tcp tcp_recv_hiwat 8192
ndd -set /dev/tcp tcp_max_buf 1048575

While this eliminates the latency spikes, it reduces throughput from 170MB/s to 80MB/s in our tests.

2. Alternative Protocol: iSCSI

Using COMSTAR iSCSI instead of NFS completely avoids the issue, though it sacrifices some management flexibility with VMDK files.

3. Controller Type Selection

Virtual IDE controllers don't exhibit the problem, but they limit the number of disks that can be attached to each VM.

To better understand the impact of the small-write pattern, here's an enhanced version of fsync-tester, with error handling, that issues multiple small writes before a single fsync:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/time.h>
#include <string.h>

#define FILENAME "testfile"
#define FILESIZE (1024*1024)
#define BLOCKSIZE 4096
#define NUMWRITES (FILESIZE/BLOCKSIZE)   /* 256 writes of BLOCKSIZE bytes */

double get_time() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec / 1000000.0;
}

int main() {
    int fd;
    char buf[BLOCKSIZE];
    double start, end;
    int i;
    
    memset(buf, 'a', BLOCKSIZE);
    
    /* Buffered open (no O_SYNC) so the data is flushed by the timed fsync below. */
    fd = open(FILENAME, O_RDWR|O_CREAT|O_TRUNC, 0666);
    if (fd < 0) {
        perror("open");
        exit(1);
    }
    
    /* Many small buffered writes before a single fsync, matching the
       problematic pattern captured in the strace output above. */
    for (i = 0; i < NUMWRITES; i++) {
        if (pwrite(fd, buf, BLOCKSIZE, i*BLOCKSIZE) != BLOCKSIZE) {
            perror("pwrite");
            close(fd);
            exit(1);
        }
    }
    
    start = get_time();
    if (fsync(fd) < 0) {
        perror("fsync");
        close(fd);
        exit(1);
    }
    end = get_time();
    
    printf("fsync time: %.4f\n", end - start);
    
    close(fd);
    return 0;
}
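To run either version of the tester, compile it (for example gcc -std=gnu99 -o fsync-small fsync-small.c, where the file name is just a placeholder) and execute it from a directory on the NFS-backed virtual disk. On an affected datastore the printed fsync times should alternate between a few hundredths of a second and the roughly 5-second outliers shown at the top of this report.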