Efficient File Splitting with Direct Compression: Split a 100GB File into 1GB GZIP Chunks in One Command


When dealing with massive files (like our 100GB example), we often need to split them into manageable chunks for processing or transfer, and usually compress each chunk as well. The typical two-step approach (split first, then compress each chunk) creates unnecessary intermediate files and extra I/O. While split and gzip work great separately, combining them efficiently requires some Unix pipe magic.

Here's how to achieve both splitting and compression in one go:

split --bytes=1024M --filter='gzip > $FILE.gz' /path/to/input /path/to/output_prefix

The magic lies in the --filter option: instead of writing each chunk to disk itself, split pipes the chunk's data into the specified command, which is responsible for producing the output file.

  • --bytes=1024M: Creates 1 GiB chunks (1,048,576 KiB each)
  • --filter: Compresses each chunk as it is produced, with no intermediate files
  • $FILE: Environment variable that split sets to the chunk's output name (see the sketch below); the single quotes keep your interactive shell from expanding it too early
  • The .gz extension is appended, so chunks come out as output_prefixaa.gz, output_prefixab.gz, and so on
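
To see what the filter actually receives, here is a minimal sketch that logs each chunk's name and size instead of compressing it (the input path and log file name are placeholders):

# Each chunk arrives on the filter's stdin with FILE set in its environment;
# wc -c consumes the chunk and reports its size, so nothing is written to $FILE here.
split --bytes=1024M --filter='echo "$FILE: $(wc -c) bytes" >> chunk_sizes.log' /path/to/input /path/to/output_prefix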

For different compression needs, modify the filter:

Using ZIP compression

split --bytes=1024M --filter='zip -q $FILE.zip -' /path/to/input /path/to/output

With a file list of just -, zip reads the chunk from standard input, so each chunk becomes a small single-entry archive.

Fast, lower-ratio compression

split --bytes=1024M --filter='gzip --fast | sponge $FILE.gz' /path/to/input /path/to/output

Here gzip --fast (level 1) trades compression ratio for speed, and sponge (from moreutils) buffers the stream so each .gz file appears only once it is complete. Note that split feeds chunks to the filter one at a time; for genuinely parallel compression see the pigz example further down.

After processing, verify integrity with:

find /path/to/output -name "*.gz" -type f -print0 | xargs -0 -P8 -n1 gunzip -t

This checks all compressed chunks in parallel without extracting them.

  • Disk I/O is typically the bottleneck on spinning disks - use fast storage if possible
  • On SSDs the compressor becomes the bottleneck instead, so stepping up from --fast to the default level (gzip -6) may be worthwhile
  • On multi-core systems, consider pigz in the filter or GNU parallel for post-split compression (see the sketch after this list)
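
If you prefer the two-step route, here is a sketch of post-split compression with GNU parallel (the file names and job count are illustrative):

# Split first without compression, then compress all chunks concurrently,
# running up to eight gzip processes at once.
split --bytes=1024M large_dataset.csv split_files/dataset_part_
parallel --jobs 8 gzip ::: split_files/dataset_part_*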

Processing a 100GB CSV file on an AWS EC2 instance (the split_files/ output directory must already exist):

time split --bytes=1024M --filter='gzip -6 > $FILE.gz' large_dataset.csv split_files/dataset_part_

Typical results:

  • Original: 100GB
  • Split+Compressed: ~35GB total
  • Processing time: ~45 minutes (varies by instance type)

A few more filter variations, depending on your needs:

# Using bzip2 instead of gzip
split --bytes=1024M --filter='bzip2 > $FILE.bz2' input_file output_prefix

# Using parallel processing for faster compression
split --bytes=1024M --filter='pigz -c > $FILE.gz' input_file output_prefix

# Keeping whole lines intact in each chunk (important for text files); add -d for numeric suffixes
split --line-bytes=1024M --filter='gzip > $FILE.gz' input_file output_prefix

When dealing with 100GB+ files:

  • Use pigz (parallel gzip) for multi-core systems
  • Monitor disk I/O with iotop if performance is critical
  • Consider pv to monitor progress: pv input_file | split --bytes=1024M --filter='gzip > $FILE.gz' - output_prefix (combined with pigz in the sketch below)
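
Putting those tips together, a sketch that shows progress while compressing each chunk on multiple cores (the input path, output prefix, and thread count are illustrative):

# pv reports throughput and progress on stderr; split reads the stream from stdin (-);
# pigz -p 4 uses four threads per chunk.
pv large_dataset.csv | split --bytes=1024M --filter='pigz -p 4 > $FILE.gz' - split_files/dataset_part_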

After processing, verify the results:

# Check compressed file integrity (a simpler, serial alternative to the xargs check above)
for f in output_prefix*; do gunzip -t "$f"; done

# Verify line counts match (for text files)
original_lines=$(wc -l < input_file)
split_lines=$(zcat output_prefix* | wc -l)
[ "$original_lines" -eq "$split_lines" ] && echo "Success" || echo "Error"