When dealing with large file operations on Azure VMs, developers often encounter surprising disk performance patterns. My benchmarks, which extracted a 1.9GB ZIP containing a 50GB XML file, revealed significant variation between instance types:
// Sample benchmark output structure
public class DiskBenchmarkResult
{
    public string InstanceType { get; set; }
    public double AvgThroughputMBps { get; set; }
    public TimeSpan ExtractionTime { get; set; }
    public string CacheConfiguration { get; set; }
}
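For reference, a single run can be recorded like this; the figures simply restate the D4 result reported later in this post rather than a new measurement:
// Illustrative record only - values taken from the D4 row in the results table below
var d4Run = new DiskBenchmarkResult
{
    InstanceType = "D4",
    AvgThroughputMBps = 85,                    // midpoint of the observed 70-100 MB/s
    ExtractionTime = new TimeSpan(0, 9, 40),   // 9m 40s
    CacheConfiguration = "Local SSD"
};
Console.WriteLine($"{d4Run.InstanceType}: {d4Run.AvgThroughputMBps} MB/s avg, {d4Run.ExtractionTime}");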
The performance differences stem from Azure's underlying architecture:
- Standard Storage Accounts: Limited by network bandwidth and shared infrastructure
- Local SSD (D-series): Physical SSDs attached directly to the host server
- IOPS Throttling: Azure enforces per-disk and per-VM limits based on VM size and disk type (see the sketch below)
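A quick way to reason about the throttling point is to compare measured throughput against the documented caps for your VM size and disk SKU. The cap values below are placeholders only; substitute the real numbers from the Azure documentation for your configuration:
// Rough sanity check: is measured throughput bumping against a platform cap?
// The caps below are hypothetical placeholders - look up the documented limits
// for your VM size and disk SKU and substitute them here.
double measuredMBps = 35;        // e.g., taken from the benchmark output
double perDiskCapMBps = 60;      // placeholder: per-disk throughput limit
double perVmCapMBps = 96;        // placeholder: per-VM uncached throughput limit

double effectiveCap = Math.Min(perDiskCapMBps, perVmCapMBps);
if (measuredMBps >= 0.9 * effectiveCap)
    Console.WriteLine("Throughput sits near a platform cap - a larger VM or disk tier will help more than code changes.");
else
    Console.WriteLine("Well below the cap - look at caching, buffer sizes, or CPU instead.");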
For optimal extraction performance:
# Recommended Azure PowerShell configuration
$vmConfig = New-AzVMConfig -VMName "PerfVm" -VMSize "Standard_D4s_v3"
# Host caching on the OS disk; no host caching on the data disk that takes the large sequential writes
$vmConfig = Set-AzVMOSDisk -VM $vmConfig -Name "OSDisk" -Caching ReadWrite -CreateOption FromImage
$vmConfig = Add-AzVMDataDisk -VM $vmConfig -Name "DataDisk1" -Lun 0 -Caching None -DiskSizeInGB 1024 -CreateOption Empty
The memory pressure you observed is due to Windows' file system cache behavior. Implement these adjustments:
// .NET 6+ memory optimization for large sequential reads
var fileOptions = new FileStreamOptions
{
    Mode = FileMode.Open,
    Access = FileAccess.Read,
    Options = FileOptions.SequentialScan | FileOptions.Asynchronous,
    BufferSize = 81920 // 80 KB, the same default the runtime uses for Stream.CopyTo
};
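To apply these options, pass them to the FileStream constructor that accepts a FileStreamOptions instance (available since .NET 6); the path below is just a placeholder:
// SequentialScan hints Windows to read ahead and evict cached pages behind the
// read position instead of trying to keep the whole 50 GB file in the cache.
using var xmlStream = new FileStream(@"D:\data\export.xml", fileOptions); // placeholder path
using var reader = new StreamReader(xmlStream);
Console.WriteLine(reader.ReadLine());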
For consistent high performance:
- Consider Azure Premium SSDs with provisioned IOPS
- Implement chunked/streaming processing for very large files (see the sketch after this list)
- Use Azure Blob Storage for intermediate storage
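As a sketch of the chunked/streaming idea: rather than loading the 50GB XML into memory after extraction, read it forward-only with XmlReader so memory use stays flat regardless of file size. The path and element name below are placeholders for whatever your document actually contains:
using System.Xml;

// Stream the huge XML forward-only; only one node is materialized at a time.
long recordCount = 0;
var settings = new XmlReaderSettings { IgnoreWhitespace = true };
using (var xmlReader = XmlReader.Create(@"D:\data\export.xml", settings)) // placeholder path
{
    while (xmlReader.Read())
    {
        // "record" is a hypothetical element name - substitute your own
        if (xmlReader.NodeType == XmlNodeType.Element && xmlReader.Name == "record")
            recordCount++;
    }
}
Console.WriteLine($"Processed {recordCount} records without loading the file into memory.");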
Use this PowerShell to monitor disk performance during operations:
# Azure Disk Performance Monitor (run in a separate console on the VM)
while ($true) {
    # _Total aggregates all physical disks on the VM
    $diskStats = Get-Counter -Counter "\PhysicalDisk(_Total)\Disk Bytes/sec" -SampleInterval 1
    $currentThroughput = [math]::Round(($diskStats.CounterSamples[0].CookedValue / 1MB), 2)
    Write-Host "Current throughput: $currentThroughput MB/s"
    Start-Sleep -Seconds 2
}
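If you prefer sampling from inside the extractor process itself, a minimal C# equivalent of the loop above looks like this (Windows only; on modern .NET it needs the System.Diagnostics.PerformanceCounter NuGet package):
using System.Diagnostics;
using System.Threading;

// Same counter as the PowerShell loop above, read in-process.
using var diskBytes = new PerformanceCounter("PhysicalDisk", "Disk Bytes/sec", "_Total");
diskBytes.NextValue(); // the first reading of a rate counter is always 0; prime it
while (true)
{
    Thread.Sleep(2000);
    double mbPerSec = diskBytes.NextValue() / (1024.0 * 1024);
    Console.WriteLine($"Current throughput: {mbPerSec:0.00} MB/s");
}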
When processing large datasets on Azure VMs, disk I/O often becomes the bottleneck, especially when dealing with compressed archives. Let me share my benchmarking journey extracting a 1.9GB ZIP containing a 50GB XML file across different Azure instance types.
The test environment used a custom C# extractor with real-time throughput monitoring:
using System;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;

public class ZipExtractor
{
    public static void Main(string[] args)
    {
        string zipPath = args[0];
        var sw = Stopwatch.StartNew();
        long lastBytes = 0;

        using (var archive = ZipFile.OpenRead(zipPath))
        {
            foreach (var entry in archive.Entries)
            {
                // Skip directory entries and make sure the target folder exists
                if (string.IsNullOrEmpty(entry.Name))
                    continue;
                string targetPath = Path.GetFullPath(entry.FullName);
                Directory.CreateDirectory(Path.GetDirectoryName(targetPath));

                using (var fs = new FileStream(targetPath, FileMode.Create))
                using (var entryStream = entry.Open())
                {
                    byte[] buffer = new byte[8192];
                    int bytesRead;
                    while ((bytesRead = entryStream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        fs.Write(buffer, 0, bytesRead);
                        if (sw.ElapsedMilliseconds > 1000)
                        {
                            // Throughput since the last report, normalized by the actual elapsed time
                            double mbPerSec = (fs.Position - lastBytes) / (1024.0 * 1024) / sw.Elapsed.TotalSeconds;
                            Console.WriteLine($"{mbPerSec:0.00} MB/s");
                            lastBytes = fs.Position;
                            sw.Restart();
                        }
                    }
                }
                lastBytes = 0; // fs.Position starts over for each entry
            }
        }
    }
}
| Instance | Configuration | Avg Throughput | Total Time |
|---|---|---|---|
| A4 | 4-disk RAID (no caching) | 30-35 MB/s | 24m 48s |
| D4 | Local SSD | 70-100 MB/s (peaks 200+) | 9m 40s |
| D3 | Local SSD | 20-40 MB/s (initial 150+) | 21m 49s |
Several factors explain the performance differences:
- Local SSD vs. Network Storage: D-series local SSDs outperform network-attached disks, but they are still subject to noisy-neighbor contention from other VMs on the same host
- Memory Pressure: Windows' file system caching behaves differently under constrained RAM; note the D3's 14GB versus the D4's 28GB (see the snippet after this list)
- CPU Throttling: Lower-tier instances experience more aggressive CPU throttling during sustained I/O
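To confirm the memory-pressure explanation while a run is in progress, watch available RAM alongside disk throughput; a performance counter works here too (Windows only; needs the System.Diagnostics.PerformanceCounter package on modern .NET):
using System.Diagnostics;
using System.Threading;

// Sample available RAM once per second during extraction; on the D3 (14GB) this
// drains much faster than on the D4 (28GB) as the file system cache fills up.
using var availableMem = new PerformanceCounter("Memory", "Available MBytes");
for (int i = 0; i < 60; i++)
{
    Console.WriteLine($"Available memory: {availableMem.NextValue():0} MB");
    Thread.Sleep(1000);
}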
For consistent performance with large files:
// Buffer size for the copy loop (64 KB works well for sequential SSD writes)
const int OPTIMAL_BUFFER_SIZE = 64 * 1024;

// Write output with no FileStream-level buffer and ask Windows to push writes
// through to disk, which keeps the file system cache from ballooning
var fsOptions = new FileStreamOptions
{
    Mode = FileMode.Create,
    Access = FileAccess.Write,
    Options = FileOptions.WriteThrough,
    BufferSize = 0 // 0 disables FileStream's internal buffer; the copy loop supplies its own
};
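A sketch of how the two pieces fit together in a copy loop (the paths are placeholders):
// Copy the decompressed stream to disk in 64 KB chunks using the write-through options above.
using (var source = File.OpenRead(@"D:\staging\export.xml"))           // placeholder source
using (var dest = new FileStream(@"E:\data\export.xml", fsOptions))    // placeholder destination
{
    var buffer = new byte[OPTIMAL_BUFFER_SIZE];
    int read;
    while ((read = source.Read(buffer, 0, buffer.Length)) > 0)
    {
        dest.Write(buffer, 0, read);
    }
}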
Additional configuration tweaks:
- Use VM sizes with local NVMe or temp SSD storage (for example Ddv4/Ddsv4 or the storage-optimized L-series) for maximum local throughput; the plain Dv4/Dsv4 sizes ship without a local temp disk
- For network storage with unmanaged disks, spread VHDs across multiple storage accounts to avoid per-account throttling
- Set Azure host caching to ReadOnly on the disk holding the source archive and None on the write-heavy disk receiving the extracted output
For pure extraction scenarios, consider using Azure Storage APIs directly:
// Requires the Azure.Storage.Blobs package and .NET 6+ (Parallel.ForEachAsync).
// ZipArchive is not thread-safe, so each worker opens its own blob stream and archive.
BlobClient blob = new BlobClient(connectionString, container, blobName);

string[] entryNames;
using (var stream = await blob.OpenReadAsync())
using (var zip = new ZipArchive(stream, ZipArchiveMode.Read))
{
    entryNames = zip.Entries
        .Where(e => !string.IsNullOrEmpty(e.Name)) // skip directory entries
        .Select(e => e.FullName)
        .ToArray();
}

// Parallel extraction: one stream + archive per worker
await Parallel.ForEachAsync(entryNames, async (entryName, ct) =>
{
    using var stream = await blob.OpenReadAsync(cancellationToken: ct);
    using var zip = new ZipArchive(stream, ZipArchiveMode.Read);
    var entry = zip.GetEntry(entryName);
    if (entry == null) return;

    using var entryStream = entry.Open();
    using var fs = File.Create(entry.Name);
    await entryStream.CopyToAsync(fs, ct);
});
This removes the intermediate step of downloading the archive to a VM disk before extracting it; the decompressed files are still written to local disk, so write throughput (and the VM's network bandwidth to storage) remains the practical ceiling.