Troubleshooting Extremely Slow VMware Snapshot Removal on iSCSI Storage with Veeam Backup


When dealing with VMware snapshots on iSCSI storage (specifically HP LeftHand in this case), the consolidation process can sometimes become unexpectedly slow. Let's examine the technical details from the reported case:

# Sample monitoring command used during troubleshooting (run from the VM's datastore directory)
ls -lh | grep -E "delta|flat|sesparse"

The system showed these characteristics during the problematic snapshot removal:

  • 5GB delta disk from a 6-hour snapshot window
  • 800GB used space in a 1TB thick-provisioned disk
  • Extremely low storage I/O during consolidation (8MB/s throughput, 600 IOPS)
  • Multiple delta files appearing during the process
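
A quick sanity check shows that raw copy bandwidth alone cannot explain the duration: at the observed throughput, merging the entire delta should take minutes, not hours (the figures below are just the numbers reported above):

```python
# Back-of-the-envelope merge time at the observed throughput
delta_mb = 5 * 1024        # 5 GB delta, in MB
throughput_mb_s = 8        # observed ~8 MB/s during consolidation

expected_seconds = delta_mb / throughput_mb_s
print(f"Expected merge time: ~{expected_seconds / 60:.0f} minutes")  # ~11 minutes
```

An hours-long consolidation at that rate points at per-operation overhead (locking, metadata updates) rather than bulk data movement.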

VMware's snapshot consolidation works through an iterative process:

# Typical snapshot consolidation flow
1. VM writes go to new delta file (e.g., EXAMPLE-000002-delta.vmdk)
2. System merges changes from old delta (EXAMPLE-000001-delta.vmdk) into base disk
3. Process repeats until all deltas are incorporated
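
The iterative merge above can be pictured with a toy model (pure illustration, not VMware code): each "disk" is a dict of block number to contents, deltas merge oldest-first, and newer blocks override older ones.

```python
# Toy model of iterative snapshot consolidation: while VM writes land in a
# new delta, each old delta is merged into the base disk in order.
def consolidate(base: dict, deltas: list) -> dict:
    """Merge a chain of delta overlays (oldest first) into the base disk."""
    for delta in deltas:      # oldest delta merges first
        base.update(delta)    # newer blocks override older ones
    return base

base = {0: "a", 1: "b"}
chain = [{1: "b1"}, {2: "c2"}]   # 000001-delta, then 000002-delta
print(consolidate(base, chain))  # {0: 'a', 1: 'b1', 2: 'c2'}
```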

When facing slow snapshot removal, consider these diagnostic approaches:

# Monitor storage performance (interactive, 2-second refresh; press 'd' for
# disk adapter or 'u' for disk device view)
esxtop -d 2

# Check VM disk operations
vim-cmd vmsvc/get.summary [VMID] | grep "consolidation"
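
The grep above can be wrapped in a small parser. This sketch assumes the summary output contains a `consolidationNeeded = true/false` field (it appears under the runtime section of `get.summary`):

```python
def consolidation_needed(summary_text: str) -> bool:
    """Parse `vim-cmd vmsvc/get.summary` output for the consolidationNeeded flag."""
    for line in summary_text.splitlines():
        if "consolidationNeeded" in line:
            # field looks like: consolidationNeeded = false,
            value = line.split("=", 1)[1].strip().rstrip(",")
            return value.lower() == "true"
    return False  # flag absent: assume no consolidation pending

print(consolidation_needed("   consolidationNeeded = false,"))  # False
```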

Based on similar cases, these methods have proven effective:

  • Schedule snapshot removal during low-activity periods
  • Consider temporarily reducing VM I/O during consolidation
  • Evaluate storage multipathing configuration
  • Check for storage array-level bottlenecks

For critical situations, you might need to script a consolidation process:


#!/bin/sh
# VMware snapshot consolidation monitor (run from the ESXi shell).
# Replace VM_NAME with the display name of the VM being consolidated.
VMID=$(vim-cmd vmsvc/getallvms | awk '/VM_NAME/ {print $1}')

while true; do
  # get.summary reports a boolean consolidationNeeded field, e.g.
  #   consolidationNeeded = false,
  STATUS=$(vim-cmd vmsvc/get.summary "$VMID" | awk -F'= ' '/consolidationNeeded/ {print $2}' | tr -d ', ')
  if [ "$STATUS" = "false" ]; then
    echo "Consolidation complete"
    break
  else
    echo "Consolidation still needed (flag: $STATUS)"
    sleep 60
  fi
done

Stepping back through the case: despite the modest 5GB delta on an 800GB-used, 1TB thick-provisioned disk, consolidation dragged on for hours while storage I/O stayed minimal. During removal, VMware redirects new writes to a temporary delta (EXAMPLE-000002-delta.vmdk) while merging the original delta (EXAMPLE-000001-delta.vmdk) into the base disk, then repeats for any remaining deltas.

The key observation was seeing two active delta files during consolidation:

-rw------- 1 root root 194.0M Jun 15 01:28 EXAMPLE-000001-delta.vmdk
-rw------- 1 root root 274.0M Jun 15 01:27 EXAMPLE-000002-delta.vmdk
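
Spotting a growing second delta is easier with a tiny parser. This sketch assumes the standard nine-column `ls -lh` layout shown above:

```python
def parse_delta_sizes(listing: str) -> dict:
    """Map each *-delta.vmdk filename in `ls -lh` output to its size column."""
    deltas = {}
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) >= 9 and fields[-1].endswith("-delta.vmdk"):
            deltas[fields[-1]] = fields[4]   # size is the fifth column
    return deltas

listing = """\
-rw------- 1 root root 194.0M Jun 15 01:28 EXAMPLE-000001-delta.vmdk
-rw------- 1 root root 274.0M Jun 15 01:27 EXAMPLE-000002-delta.vmdk"""
print(parse_delta_sizes(listing))
# {'EXAMPLE-000001-delta.vmdk': '194.0M', 'EXAMPLE-000002-delta.vmdk': '274.0M'}
```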

When I/O metrics don't explain the slowdown, consider these inspection methods:

# Monitor consolidation progress (run from the VM's datastore directory)
while true; do
  ls -lh | grep -E "delta|flat|sesparse"
  esxtop -b -n 1 -d 2 | grep -iE "disk|vmx"
  sleep 30
done

# Check VMX process status
vim-cmd vmsvc/get.tasklist [VMID] | grep -i consolidate

# Storage stack inspection
esxcli storage core device list | grep -A 10 "iSCSI"

Several non-obvious elements can affect consolidation:

  • SCSI Reservations: iSCSI arrays may impose locking during metadata operations
  • VAAI Primitive Conflicts: Hardware acceleration might stall during extended operations
  • VMFS Block Reclamation: Thick-to-thin conversions during merge can trigger background tasks

This Python script monitors consolidation progress and detects stalls:

import glob
import os
import time

def get_delta_files(vm_path):
    """Return {filename: size_in_bytes} for all delta disks in the VM directory."""
    return {os.path.basename(p): os.path.getsize(p)
            for p in glob.glob(os.path.join(vm_path, "*-delta.vmdk"))}

def track_consolidation(vm_path, interval=60):
    last_size = -1
    stalled_count = 0

    while True:
        deltas = get_delta_files(vm_path)
        if not deltas:
            # No delta disks left: the merge has finished
            print("Consolidation complete!")
            break

        current_size = sum(deltas.values())
        if current_size == last_size:
            # Unchanged total delta size across several intervals suggests a stall
            stalled_count += 1
            if stalled_count > 3:
                print("WARNING: Possible consolidation stall detected")
        else:
            stalled_count = 0

        print(f"Current delta size: {current_size // 2**20}MB | Files: {', '.join(sorted(deltas))}")
        last_size = current_size
        time.sleep(interval)

if __name__ == "__main__":
    # Adjust to the VM's datastore directory
    track_consolidation("/vmfs/volumes/datastore1/EXAMPLE")

These advanced settings (set via ESXi CLI) can help:

# Enable automatic space reclamation (UNMAP) when files are deleted on VMFS
esxcli system settings advanced set -o /VMFS3/EnableBlockDelete -i 1
# Raise the highest LUN ID scanned during storage rescans
esxcli system settings advanced set -o /Disk/MaxLUN -i 256
# Ensure ATS (hardware-assisted locking) remains enabled
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 1

For mission-critical VMs where extended consolidation is unacceptable:

  • Implement storage vMotion to different datastore during maintenance
  • Use Veeam's backup from storage snapshots (supported on HP LeftHand/StoreVirtual) so VM snapshots stay open only briefly
  • Schedule consolidations during periods of low VM activity