When dealing with massive files (like our 100GB example), we often need to split them into manageable chunks for processing or transfer. The additional requirement of compressing each chunk adds complexity. While `split` and `gzip` work great separately, combining them efficiently requires some Unix pipe magic.
Here's how to achieve both splitting and compression in one go:
```bash
split --bytes=1024M --filter='gzip > $FILE.gz' /path/to/input /path/to/output_prefix
```
The magic lies in the `--filter` parameter, which processes each chunk through the specified command before writing to disk.

- `--bytes=1024M`: Creates 1GB chunks (1,048,576 KB)
- `--filter`: Applies compression to each chunk immediately
- `$FILE`: Variable that `split` sets to the output filename of each chunk
- The `.gz` extension is appended to each output file
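If you want to see how `$FILE` behaves before committing to a 100GB run, a quick dry run on a small throwaway file makes it obvious. This is only a sketch; the paths and sizes below are arbitrary examples:

```bash
# Create a small, compressible test input and split it into 2 MiB gzipped chunks
seq 1 1000000 > /tmp/sample.txt
mkdir -p /tmp/chunks
split --bytes=2M --filter='gzip > $FILE.gz' /tmp/sample.txt /tmp/chunks/part_

# split exports FILE as each chunk's output name (/tmp/chunks/part_aa, part_ab, ...),
# so the listing shows part_aa.gz, part_ab.gz, ...
ls /tmp/chunks
```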
For different compression needs, modify the filter:

```bash
# Using ZIP compression: with "-" as both archive name and input file,
# zip reads the chunk from stdin and writes the archive to stdout
split --bytes=1024M --filter='zip -q - - > $FILE.zip' /path/to/input /path/to/output

# Faster, backgrounded version (requires sponge from moreutils); note that split still
# runs the filter on one chunk at a time, so use pigz if you want true parallel compression
split --bytes=1024M --filter='gzip --fast | sponge $FILE.gz' /path/to/input /path/to/output &
```
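One quirk of the ZIP variant worth noting: because the data arrives on stdin, `zip` stores it under the member name `-`. The easiest way to read a chunk back is `unzip -p`, which writes the archive's contents to stdout (the chunk name below is just an example):

```bash
# Extract the single stdin-sourced member of one zip chunk back to a regular file
unzip -p /path/to/outputaa.zip > /path/to/chunk_aa
```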
After processing, verify integrity with:

```bash
find /path/to/output -name "*.gz" -type f -print0 | xargs -0 -P8 -n1 gunzip -t
```

This checks all compressed chunks in parallel without extracting them.
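When you eventually need the original file back, there is no need to decompress the chunks one by one: concatenated gzip members decompress as a single stream, so the whole thing can be reassembled in one pipeline (paths mirror the example above):

```bash
# Reassemble the original: the shell glob sorts the aa, ab, ... suffixes in order
cat /path/to/output_prefix*.gz | gunzip > /path/to/restored_file

# Optional sanity check against the original
cmp /path/to/input /path/to/restored_file && echo "Files are identical"
```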
A few performance notes:

- Disk I/O is typically the bottleneck; use fast storage if possible
- On SSDs, where the disk is less of a constraint, increasing the compression level (e.g. the default `gzip -6` rather than `--fast`, or even `-9`) may be worthwhile
- On multi-core systems, consider GNU parallel for post-split compression (see the sketch below)
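The GNU parallel route means splitting first and compressing afterwards, trading extra intermediate disk usage for full use of every core. A minimal sketch, assuming `parallel` is installed and 8 concurrent jobs are acceptable:

```bash
# Step 1: split without compression (writes plain chunks to disk)
split --bytes=1024M /path/to/input /path/to/output_prefix

# Step 2: compress all chunks concurrently; gzip replaces each chunk with chunk.gz
parallel --jobs 8 gzip ::: /path/to/output_prefix*
```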
Processing a 100GB CSV file on an AWS EC2 instance:
```bash
time split --bytes=1024M --filter='gzip -6 > $FILE.gz' large_dataset.csv split_files/dataset_part_
```
Typical results:
- Original: 100GB
- Split+Compressed: ~35GB total
- Processing time: ~45 minutes (varies by instance type)
When dealing with massive files (100GB+ in this case), we often need to both split them into manageable chunks and compress them for storage or transfer. The typical approach involves two separate steps: splitting first, then compressing each chunk. This creates unnecessary intermediate files and extra I/O operations.
We can accomplish both operations in a single pipeline using Unix/Linux utilities. Here's the most efficient method:
```bash
split --bytes=1024M --filter='gzip > $FILE.gz' /path/to/input /path/to/output_prefix
```
- `--bytes=1024M`: Splits the input file into 1GB chunks
- `--filter`: Processes each chunk through the specified command
- `gzip > $FILE.gz`: Compresses each chunk and adds the `.gz` extension
- The command automatically handles the output naming (`output_prefixaa.gz`, `output_prefixab.gz`, etc.); a numeric-suffix variant is sketched below
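If downstream tooling prefers numeric, fixed-width suffixes over `aa`, `ab`, ..., GNU split's `-d` and `-a` options change the naming scheme. This is purely a naming tweak; the compression filter is unchanged:

```bash
# Numeric three-digit suffixes: output_prefix000.gz, output_prefix001.gz, ...
split -d -a 3 --bytes=1024M --filter='gzip > $FILE.gz' input_file output_prefix
```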
For different compression needs:

```bash
# Using bzip2 instead of gzip
split --bytes=1024M --filter='bzip2 > $FILE.bz2' input_file output_prefix

# Using parallel processing for faster compression
split --bytes=1024M --filter='pigz -c > $FILE.gz' input_file output_prefix

# Preserving original line boundaries (important for text files):
# --line-bytes caps the chunk size like --bytes but never splits mid-line
split --line-bytes=1024M --filter='gzip > $FILE.gz' input_file output_prefix
```
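To confirm that `--line-bytes` really kept every chunk on a line boundary, checking the last byte of each decompressed chunk is enough. A small sketch, assuming the `.gz` names produced above:

```bash
# A chunk that ends on a line boundary ends with a newline; command substitution
# strips that trailing newline, so an empty result means the boundary is intact
for f in output_prefix*.gz; do
    [ -z "$(zcat "$f" | tail -c 1)" ] || echo "$f does not end on a newline"
done
```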
When dealing with 100GB+ files:

- Use `pigz` (parallel gzip) for multi-core systems
- Monitor disk I/O with `iotop` if performance is critical
- Consider `pv` to monitor progress: `pv input_file | split --bytes=1024M --filter='gzip > $FILE.gz' - output_prefix` (a combined `pv` + `pigz` sketch follows below)
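The last two tips combine naturally: `pv` gives a progress bar and ETA on the read side while `pigz` spreads compression across cores. A sketch, assuming both tools are installed (`-p 4` caps pigz at four threads and is just an example value):

```bash
# Progress bar on the input stream, multi-threaded gzip-compatible compression per chunk
pv input_file | split --bytes=1024M --filter='pigz -p 4 > $FILE.gz' - output_prefix
```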
After processing, verify the files with:
```bash
# Check compressed file integrity
for f in output_prefix*; do gunzip -t "$f"; done

# Verify line counts match (for text files)
original_lines=$(wc -l < input_file)
split_lines=$(zcat output_prefix* | wc -l)
[ "$original_lines" -eq "$split_lines" ] && echo "Success" || echo "Error"
```
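For binary inputs, where a line count means nothing, a checksum over the recombined stream is a stronger end-to-end check (same hypothetical file names as above):

```bash
# Compare a checksum of the original with one of the decompressed, re-concatenated chunks
orig_sum=$(sha256sum < input_file | awk '{print $1}')
split_sum=$(zcat output_prefix* | sha256sum | awk '{print $1}')
[ "$orig_sum" = "$split_sum" ] && echo "Checksums match" || echo "Checksum mismatch"
```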