Optimizing VMware ESXi Host for High-Density Windows XP VM Deployment: Performance Tuning and Resource Allocation Strategies



I recently experimented with running 100 Windows XP VMs on a single VMware ESXi 7.0 host with the following configuration:

Host Configuration:
- CPU: Dual Intel Xeon Silver 4210 (8 cores/16 threads total)
- RAM: 64GB DDR4 ECC
- Storage: 4x 480GB SAS SSDs in RAID 10
- Network: 10GbE dual-port NIC

The first 20 VMs (each with 512MB RAM and 1 vCPU) ran fine, but adding more caused significant performance degradation even though host utilization stayed low:

esxtop metrics at 50 VMs:
- CPU: 12% utilization
- MEM: 58% active usage
- DISK: 91% idle
- NET: 3% bandwidth used

After weeks of testing, these adjustments made 100 VMs stable:

# ESXi Advanced Settings (via SSH):
esxcfg-advcfg -s 2048 /VMFS3/MinFreeMB
esxcfg-advcfg -s 8 /Net/TcpipHeapSize    # initial TCP/IP stack heap, in MB
esxcfg-advcfg -s 32 /Net/TcpipHeapMax    # maximum TCP/IP stack heap, in MB
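
To confirm the values actually stuck, here is a minimal pyVmomi verification sketch. It assumes an already-connected HostSystem object named host (connection boilerplate is shown with the Python script further down), and the keys are simply the dotted form of the paths above:

# check_advanced_options.py -- verification sketch only
WANTED = {"VMFS3.MinFreeMB": 2048, "Net.TcpipHeapSize": 8, "Net.TcpipHeapMax": 32}

def check_advanced_options(host):
    """Compare the host's advanced options against the values set above."""
    opt_mgr = host.configManager.advancedOption
    for key, wanted in WANTED.items():
        current = opt_mgr.QueryOptions(name=key)[0].value
        status = "OK" if int(current) == wanted else "MISMATCH"
        print(f"{key}: current={current} wanted={wanted} [{status}]")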

# VMX parameters for Windows XP VMs:
monitor_control.restrict_backdoor = "TRUE"
isolation.tools.hgfs.disable = "TRUE"
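
Editing a hundred .vmx files by hand gets old fast, so here is a rough pyVmomi sketch that pushes the same two keys through ReconfigVM_Task. It assumes a connected ServiceInstance si and a made-up VM name prefix:

# apply_xp_vmx_params.py -- sketch only; "xp-" is a hypothetical naming convention
from pyVmomi import vim

XP_EXTRA_CONFIG = [
    vim.option.OptionValue(key="monitor_control.restrict_backdoor", value="TRUE"),
    vim.option.OptionValue(key="isolation.tools.hgfs.disable", value="TRUE"),
]

def apply_xp_params(si, name_prefix="xp-"):
    """Add the extraConfig keys above to every VM whose name starts with name_prefix."""
    view = si.content.viewManager.CreateContainerView(
        si.content.rootFolder, [vim.VirtualMachine], True)
    try:
        for vm in view.view:
            if vm.name.startswith(name_prefix):
                spec = vim.vm.ConfigSpec(extraConfig=XP_EXTRA_CONFIG)
                vm.ReconfigVM_Task(spec=spec)   # settings take effect at next power-on
    finally:
        view.DestroyView()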

The RAID controller cache settings made a huge difference:

# MegaCLI commands for LSI controller:
MegaCli -LDSetProp -WB -Immediate -LAll -a0      # write-back caching on all logical drives
MegaCli -LDSetProp -Cached -LAll -a0             # cached I/O path instead of direct I/O
MegaCli -LDSetProp -NoCachedBadBBU -LAll -a0     # drop to write-through if the BBU fails

Creating separate vSwitches for management and VM traffic helped:

# PowerCLI script snippet:
$vmHost = Get-VMHost -Name "esxi01"   # placeholder name for the target host
New-VirtualSwitch -VMHost $vmHost -Name "VM_Network" -Nic vmnic2,vmnic3 -Mtu 9000
# Set-VirtualPortGroup can't move a port group between switches, so create it on the new vSwitch:
New-VirtualPortGroup -VirtualSwitch (Get-VirtualSwitch -VMHost $vmHost -Name "VM_Network") -Name "VM Network"
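
To sanity-check the separation afterwards, a small pyVmomi sketch that lists each standard vSwitch with its uplinks and port groups. It assumes a connected HostSystem object named host, as in the examples further down:

# vswitch_layout.py -- verification sketch only
def print_vswitch_layout(host):
    """Show which uplinks and port groups each standard vSwitch carries."""
    net = host.config.network
    for vsw in net.vswitch:
        print(f"{vsw.name}: uplinks={list(vsw.pnic)}")   # pnic entries are key strings
    for pg in net.portgroup:
        print(f"  portgroup '{pg.spec.name}' -> {pg.spec.vswitchName}")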

This Python script helped identify resource contention:

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

def check_host_performance(si, host):
    """Pull the real-time (20-second interval) performance stream for one ESXi host."""
    perf_manager = si.content.perfManager
    # QueryAvailablePerfMetric returns a list of PerfMetricId objects directly
    available = perf_manager.QueryAvailablePerfMetric(entity=host, intervalId=20)
    metric_ids = [vim.PerformanceManager.MetricId(
        counterId=available[0].counterId,
        instance="*"
    )]
    query = vim.PerformanceManager.QuerySpec(
        entity=host,
        metricId=metric_ids,
        intervalId=20
    )
    return perf_manager.QueryPerf(querySpec=[query])
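
A minimal invocation sketch, assuming a lab host with a self-signed certificate; the hostname and credentials below are placeholders:

import ssl

ctx = ssl._create_unverified_context()   # lab host, self-signed certificate
si = SmartConnect(host="esxi01.lab.local", user="root", pwd="***", sslContext=ctx)
try:
    # first datacenter -> first compute resource -> first HostSystem
    host = si.content.rootFolder.childEntity[0].hostFolder.childEntity[0].host[0]
    for sample in check_host_performance(si, host):
        for series in sample.value:
            print(series.id.counterId, series.id.instance, series.value[-5:])
finally:
    Disconnect(si)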

To recap: the first 20 Windows XP VMs (512MB RAM each) ran fine on the 8-core/64GB RAM host with RAID 10 SAS storage, but I hit a wall when attempting to scale to 100 instances, even though the host reported:

Host Metrics:
CPU Utilization: 12-15%
Memory Free: 18GB available
Disk Queue: 0.1 (15k SAS)
Network: 2% of 1Gbps

The VMs became unusably slow beyond the 20-instance threshold.

The ESXi scheduler and network stack have several constraints that don't show up in the headline utilization numbers:

# Check current scheduler settings
esxcli system settings advanced list -o /VMkernel/Boot/hyperthreading
esxcli system settings advanced list -o /Net/NetHeapMax

Key findings from my testing (a quick ballooning check in Python follows this list):

  • Memory ballooning becomes aggressive at 75% host RAM usage
  • Default vCPU co-stop thresholds limit parallel execution
  • Windows XP's non-NUMA awareness causes vNUMA misalignment
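
Co-stop is easiest to read from the %CSTP column in esxtop, but for ballooning and swapping a quickStats poll works; here is a sketch assuming the same ServiceInstance si as in the Python script above:

# balloon_check.py -- sketch only
from pyVmomi import vim

def list_memory_pressure(si):
    """Print any VM whose quickStats show ballooned or swapped guest memory."""
    view = si.content.viewManager.CreateContainerView(
        si.content.rootFolder, [vim.VirtualMachine], True)
    try:
        for vm in view.view:
            qs = vm.summary.quickStats
            if qs.balloonedMemory or qs.swappedMemory:
                print(f"{vm.name}: ballooned={qs.balloonedMemory} MB "
                      f"swapped={qs.swappedMemory} MB")
    finally:
        view.DestroyView()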

These additional ESXi advanced parameters are what made 80+ VMs stable:

# /etc/vmware/esx.conf additions:
/Mem/IdleTax = "0"
/Mem/SamplePeriod = "1000"
/VMkernel/Boot/disableHugeTLB = "TRUE"
/Sched/Mem/PShareEnabled = "FALSE"

For Windows XP guests, add these VMX parameters:

monitor_control.restrict_backdoor = "TRUE"
isolation.tools.hgfs.disable = "TRUE"
vhv.enable = "FALSE"

The SAS RAID array needed special tuning:

# round-robin path selection for the RAID volume:
esxcli storage nmp device set --device naa.xxx --psp VMW_PSP_RR
# ATS (hardware-assisted locking) can be disabled host-wide via an advanced option:
esxcli system settings advanced set -o /VMFS3/HardwareAcceleratedLocking -i 0

Each VM disk should be configured with:

scsi0:0.virtualSSD = "1"
scsi0:0.queues = "2"

After optimization, I achieved stable performance with:

Metric            Before   After
VMs per host      20       84
Boot time (avg)   142s     68s
Disk latency      23ms     9ms

Sample monitoring script for tracking performance:

#!/bin/bash
# Log host uptime, scheduler load averages, and a coarse per-VM CPU ready/limited
# snapshot every 30 seconds (esxtop -b emits CSV, so the grep is only a rough filter).
while true; do
  esxcli system stats uptime get
  esxcli system process stats load get
  esxtop -b -n 1 | grep "VM Name\|%RDY\|%MLMTD"
  sleep 30
done
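
If polling from a workstation is more convenient than an SSH session, here is a rough pyVmomi counterpart to the loop above; it reuses the host object from the earlier Python example and only samples host-level quickStats:

# host_watch.py -- rough counterpart to the shell loop above (sketch only)
import time

while True:
    qs = host.summary.quickStats   # quickStats are re-fetched on each access
    print(f"cpu={qs.overallCpuUsage} MHz  mem={qs.overallMemoryUsage} MB  uptime={qs.uptime}s")
    time.sleep(30)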