Optimizing Azure VM Disk Performance for Large File Extraction: A Developer’s Guide to SSD Benchmarking


When dealing with large file operations on Azure VMs, developers often encounter surprising disk performance patterns. My benchmarks, which extracted a 1.9 GB ZIP containing a 50 GB XML file, revealed significant variation between instance types:

// Sample benchmark output structure
public class DiskBenchmarkResult {
    public string InstanceType { get; set; }
    public double AvgThroughputMBps { get; set; }
    public TimeSpan ExtractionTime { get; set; }
    public string CacheConfiguration { get; set; }
}

The performance differences stem from Azure's underlying architecture:

  • Standard Storage Accounts: Limited by network bandwidth and shared infrastructure
  • Local SSD (D-series): Physical SSDs directly attached to host server
  • IOPS Throttling: Azure enforces per-VM throughput and IOPS limits based on VM size (a rough estimate of the impact follows this list)
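
Those caps translate directly into a floor on extraction time. As a minimal back-of-envelope sketch, assume an illustrative 60 MB/s per-VM limit (actual limits vary by VM size and disk tier):

// Back-of-envelope: how a throughput cap bounds extraction time
using System;

const double OutputSizeGB = 50;        // uncompressed XML payload from the benchmark
const double ThroughputCapMBps = 60;   // hypothetical cap, not an official Azure figure

double minutes = OutputSizeGB * 1024 / ThroughputCapMBps / 60;
Console.WriteLine($"Best-case write time at {ThroughputCapMBps} MB/s: {minutes:F1} minutes");
// ~14 minutes of pure writing, before any decompression or CPU cost

At the 70-100 MB/s the D4 actually sustained, the same arithmetic gives roughly 8.5-12 minutes, which brackets the observed 9m 40s.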

For optimal extraction performance:

# Recommended Azure PowerShell configuration (Az module)
$vmConfig = New-AzVMConfig -VMName "PerfVm" -VMSize "Standard_D4s_v3"
Set-AzVMOsDisk -VM $vmConfig -Name "OSDisk" -Caching ReadWrite -CreateOption FromImage
Add-AzVMDataDisk -VM $vmConfig -Name "DataDisk1" -Lun 0 -Caching None -DiskSizeInGB 1024 -CreateOption Empty

The memory pressure you observed is due to Windows' file system cache behavior. Implement these adjustments:

// .NET memory optimization for file operations
var fileOptions = new FileStreamOptions {
    Mode = FileMode.Open,
    Access = FileAccess.Read,
    Options = FileOptions.SequentialScan | FileOptions.Asynchronous,
    BufferSize = 81920 // Optimal for SSD
};
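
These options are then passed straight to the FileStream constructor when opening the source archive; the path below is only a placeholder:

// Hypothetical path for illustration; point it at the archive on the disk under test
using var source = new FileStream(@"D:\data\archive.zip", fileOptions);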

For consistent high performance:

  • Consider Azure Premium SSD managed disks, whose IOPS and throughput are provisioned per disk tier
  • Implement chunked processing for very large files (see the sketch after this list)
  • Use Azure Blob Storage for intermediate storage
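
For the chunked-processing point above, the idea is to stream each archive entry in fixed-size blocks so memory use stays flat no matter how large the uncompressed file is. A minimal sketch (the 1 MB chunk size is an assumption, not a tuned value):

using System.IO;
using System.IO.Compression;

public static class ChunkedExtractor
{
    // Stream one large entry out of the archive in fixed-size chunks,
    // keeping memory use flat regardless of the uncompressed size.
    public static void Extract(string zipPath, string entryName, string outputPath,
                               int chunkSize = 1 << 20)
    {
        using var archive = ZipFile.OpenRead(zipPath);
        var entry = archive.GetEntry(entryName)
            ?? throw new FileNotFoundException($"Entry '{entryName}' not found in '{zipPath}'.");

        using var input = entry.Open();
        using var output = File.Create(outputPath);

        var buffer = new byte[chunkSize];
        int read;
        while ((read = input.Read(buffer, 0, buffer.Length)) > 0)
        {
            output.Write(buffer, 0, read);
        }
    }
}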

Use this PowerShell to monitor disk performance during operations:

# Azure Disk Performance Monitor
while ($true) {
    $diskStats = Get-Counter -Counter "\PhysicalDisk(_Total)\Disk Bytes/sec" -SampleInterval 1
    $currentThroughput = [math]::Round(($diskStats.CounterSamples[0].CookedValue / 1MB), 2)
    Write-Host "Current throughput: $currentThroughput MB/s"
    Start-Sleep -Seconds 2
}

When processing large datasets on Azure VMs, disk I/O often becomes the bottleneck, especially when dealing with compressed archives. Let me share my benchmarking journey extracting a 1.9 GB ZIP containing a 50 GB XML file across different Azure instance types.

The test environment used a custom C# extractor with real-time throughput monitoring:

using System;
using System.IO;
using System.IO.Compression;
using System.Diagnostics;

public class ZipExtractor
{
    public static void Main(string[] args)
    {
        string zipPath = args[0];
        var sw = Stopwatch.StartNew();

        using (var archive = ZipFile.OpenRead(zipPath))
        {
            foreach (var entry in archive.Entries)
            {
                long lastBytes = 0; // bytes written at the last throughput sample

                using (var fs = new FileStream(entry.FullName, FileMode.Create))
                using (var entryStream = entry.Open())
                {
                    byte[] buffer = new byte[8192];
                    int bytesRead;

                    while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        fs.Write(buffer, 0, bytesRead);

                        // Report throughput roughly once per second
                        if (sw.ElapsedMilliseconds > 1000)
                        {
                            double seconds = sw.ElapsedMilliseconds / 1000.0;
                            double mbPerSec = (fs.Position - lastBytes) / (1024.0 * 1024) / seconds;
                            Console.WriteLine($"{mbPerSec:0.00} MB/s");
                            lastBytes = fs.Position;
                            sw.Restart();
                        }
                    }
                }
            }
        }
    }
}
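
The extractor takes the archive path as its only argument, for example ZipExtractor.exe D:\archive.zip (the executable name and drive letter are illustrative; point it at the disk under test). Results across the tested instance types:
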
Instance   Configuration              Avg Throughput              Total Time
A4         4-disk RAID (no caching)   30-35 MB/s                  24m 48s
D4         Local SSD                  70-100 MB/s (peaks 200+)    9m 40s
D3         Local SSD                  20-40 MB/s (initial 150+)   21m 49s

Several factors explain the performance differences:

  • Local SSD vs. Network Storage: D-series local (temporary) SSDs outperform network-attached disks, but can suffer from noisy-neighbor contention on the shared host
  • Memory Pressure: Windows disk caching behaves differently under constrained RAM (note the D3's 14 GB vs. the D4's 28 GB); the counter sketch after this list shows one way to watch it
  • CPU Throttling: lower-tier instances experience more aggressive CPU throttling during sustained I/O
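
To see that memory pressure as it happens, the Windows file cache can be sampled alongside the disk counters. A minimal sketch using the built-in Memory\Cache Bytes performance counter (Windows-only; on modern .NET it needs the System.Diagnostics.PerformanceCounter package):

using System;
using System.Diagnostics;
using System.Threading;

// Poll the Windows file system cache size once per second during extraction.
// On a RAM-constrained VM the cache stops growing and throughput tends to drop with it.
using var cacheBytes = new PerformanceCounter("Memory", "Cache Bytes");
while (true)
{
    double cacheMB = cacheBytes.NextValue() / (1024.0 * 1024);
    Console.WriteLine($"File cache: {cacheMB:0} MB");
    Thread.Sleep(1000);
}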

For consistent performance with large files:

// Recommended read buffer size for Azure local SSD
const int OPTIMAL_BUFFER_SIZE = 64 * 1024; // 64 KB

// For large sequential writes, disable FileStream's internal buffer and
// write through the OS cache so dirty pages don't pile up in memory
var fsOptions = new FileStreamOptions
{
    Mode = FileMode.Create,
    Access = FileAccess.Write,
    BufferSize = 0, // 0 disables FileStream's own managed buffer
    Options = FileOptions.WriteThrough
};

Additional configuration tweaks:

  1. Use VM sizes that include a local SSD temp disk (e.g., Ddv4/Ddsv4; the plain Dv4/Dsv4 sizes have no local disk) for maximum throughput
  2. For network storage, spread data across multiple storage accounts to avoid per-account throughput throttling
  3. Set Azure Disk host caching to "ReadOnly" for the disk holding the source archive and "None" for the write-heavy target disk

For pure extraction scenarios, consider using Azure Storage APIs directly:

// Requires the Azure.Storage.Blobs package
BlobClient blob = new BlobClient(connectionString, container, blobName);

using (var stream = await blob.OpenReadAsync())
using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
{
    // ZipArchive is not thread-safe, so extract entries sequentially
    // from the single blob-backed stream
    foreach (var entry in zip.Entries)
    {
        using (var entryStream = entry.Open())
        using (var fs = File.Create(entry.Name))
        {
            await entryStream.CopyToAsync(fs);
        }
    }
}

This takes the VM's disks out of the read path entirely, since the archive is streamed straight from Blob Storage; only the extracted output still touches local disk.