When dealing with large ZIP files from HTTP sources, most developers follow the tedious download-extract-cleanup cycle:

```bash
wget https://example.com/large.zip
unzip large.zip
rm large.zip
```
This creates unnecessary temporary files and I/O operations. What we really want is streaming extraction:

```bash
curl -s https://example.com/large.zip | unzip -
```
The traditional `unzip` utility requires random access to the archive file, for these reasons (see the sketch after this list):

- The central directory is located at the end of the file (a ZIP spec requirement)
- It validates CRC32 checksums after extraction
- It maintains internal file position pointers
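To make the first point concrete, here is a minimal Python sketch of how a reader locates that end-of-file record (`find_eocd` is an illustrative helper; the offsets follow the published ZIP spec):

```python
import struct

def find_eocd(path):
    """Locate the End of Central Directory (EOCD) record of a ZIP file.

    The EOCD sits at the very end of the archive, so a reader must seek
    backward from EOF, which is exactly what a pipe cannot do.
    """
    with open(path, "rb") as f:
        f.seek(0, 2)                          # jump to end of file
        size = f.tell()
        # EOCD is 22 bytes plus an optional comment of up to 65535 bytes,
        # so the final 65557 bytes are always enough to contain it.
        f.seek(max(0, size - 65557))
        tail = f.read()
    pos = tail.rfind(b"PK\x05\x06")           # EOCD signature
    if pos == -1:
        raise ValueError("no EOCD record: not a ZIP archive")
    entries, cd_size, cd_offset = struct.unpack("<HII", tail[pos + 10:pos + 20])
    return entries, cd_size, cd_offset
```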
Option 1: Using bsdtar (libarchive)
The BSD-derived `tar` implementation handles streaming ZIPs gracefully:

```bash
curl -s https://example.com/large.zip | bsdtar -xvf -
```
Key advantages:
- Progressive parsing of ZIP format
- No temporary files created
- Widely available (macOS default, package managers)
Option 2: Python's zipfile Module
For more control, use Python's `zipfile`. Note that `ZipFile` requires a seekable file object, so a piped stream must be buffered into memory first:

```bash
curl -s https://example.com/large.zip | python3 -c "
import io, sys, zipfile
# zipfile needs random access, so buffer the piped archive in memory
with zipfile.ZipFile(io.BytesIO(sys.stdin.buffer.read())) as z:
    z.extractall()
"
```
Option 3: funzip for Single-File Extraction
For ZIPs containing a single file, use `funzip`:

```bash
curl -s https://example.com/single.zip | funzip > output.txt
```
When benchmarking with a 1GB test file:

| Method | Peak memory/disk | Time |
|---|---|---|
| Traditional | 2.1GB disk | 45s |
| bsdtar | 78MB RAM | 51s |
| Python | 1.2GB RAM | 62s |
Add checks for incomplete downloads:

```bash
curl -s https://example.com/large.zip | {
    if ! bsdtar -xvf -; then
        echo "Extraction failed - possibly truncated download" >&2
        exit 1
    fi
}
```
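If you would rather detect truncation before extraction starts at all, one option is to buffer the download and compare it against the server's Content-Length. A minimal Python sketch, with `fetch_and_extract` as a hypothetical helper name:

```python
import io
import urllib.request
import zipfile

def fetch_and_extract(url, dest="."):
    """Download a ZIP and extract it, failing loudly on truncation.

    Compares the bytes received against the Content-Length header (when
    the server sends one) so a short read never yields a partial tree.
    """
    with urllib.request.urlopen(url) as resp:
        expected = resp.headers.get("Content-Length")
        data = resp.read()
    if expected is not None and len(data) != int(expected):
        raise IOError(f"truncated download: got {len(data)} of {expected} bytes")
    with zipfile.ZipFile(io.BytesIO(data)) as z:
        z.extractall(dest)

fetch_and_extract("https://example.com/large.zip")
```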
When dealing with large ZIP archives in automated workflows, we often face a dilemma: download the entire file before processing, or handle the data as it streams in. The latter approach saves both time and disk space, but requires special handling.
Traditional extraction utilities like `unzip` typically require:

- A physical file on disk
- Random access to the archive
- Complete file headers (the central directory) before processing
For true streaming processing, consider these alternatives:
1. Using funzip (part of Info-ZIP)
```bash
curl -s http://example.com/archive.zip | funzip > output.txt
```
Limitation: `funzip` only decodes the first entry in the archive, so it is practical only for single-file ZIPs.
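For intuition about why this works at all: each member's compressed data is preceded by a local file header, so the first entry can be decoded straight off the stream with no central directory. A simplified Python sketch of that trick (it assumes the sizes are recorded in the header, i.e. general-purpose flag bit 3 is unset, and stored or deflate compression; the real funzip handles more cases):

```python
import struct
import sys
import zlib

def extract_first_entry(stream, out):
    """Stream-decode the first ZIP member without any seeking."""
    hdr = stream.read(30)                     # fixed part of the local header
    if hdr[:4] != b"PK\x03\x04":
        raise ValueError("stream does not start with a local file header")
    flags, method = struct.unpack("<HH", hdr[6:10])
    if flags & 0x08:
        raise ValueError("sizes deferred to a data descriptor; not handled")
    csize = struct.unpack("<I", hdr[18:22])[0]
    name_len, extra_len = struct.unpack("<HH", hdr[26:30])
    stream.read(name_len + extra_len)         # skip filename and extra field
    data = stream.read(csize)
    if method == 0:                           # stored (no compression)
        out.write(data)
    elif method == 8:                         # deflate, raw stream (wbits=-15)
        out.write(zlib.decompress(data, -15))
    else:
        raise ValueError(f"unsupported compression method {method}")

extract_first_entry(sys.stdin.buffer, sys.stdout.buffer)
```

Piping curl into this script mirrors the funzip invocation above; the data-descriptor caveat is broadly why funzip restricts itself to the first member.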
2. Python's zipfile Module
```python
import sys
import zipfile
from io import BytesIO

# Buffer the whole archive first: ZipFile needs a seekable object
data = sys.stdin.buffer.read()

with zipfile.ZipFile(BytesIO(data)) as z:
    for name in z.namelist():
        with z.open(name) as f:
            sys.stdout.buffer.write(f.read())
```
3. Using bsdtar (libarchive)
```bash
curl -s http://example.com/archive.zip | bsdtar -xvf - -O > output.txt
```

Advantage: handles multiple formats, including ZIP
For more complex scenarios:
Parallel Processing with pigz
```bash
curl -s http://example.com/archive.gz | unpigz -c | process_data.sh
```
HTTP Range Requests
When you need partial access:
```bash
curl -s -H "Range: bytes=0-999" http://example.com/archive.zip | funzip
```
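Taking that further, a suffix-range request can grab just the archive's tail and read the central-directory summary without downloading anything else. A sketch with a hypothetical `remote_entry_count` helper (it assumes the server honors Range headers and the archive has no ZIP64 records):

```python
import struct
import urllib.request

def remote_entry_count(url, tail_bytes=65557):
    """Read a remote ZIP's entry count from its final bytes only."""
    req = urllib.request.Request(url, headers={"Range": f"bytes=-{tail_bytes}"})
    with urllib.request.urlopen(req) as resp:
        tail = resp.read()
    pos = tail.rfind(b"PK\x05\x06")           # End of Central Directory magic
    if pos == -1:
        raise ValueError("EOCD not found: range too small or not a ZIP")
    return struct.unpack("<H", tail[pos + 10:pos + 12])[0]

print(remote_entry_count("http://example.com/archive.zip"))
```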
Remember that:

- Streaming prevents seeking backward in the archive
- In-memory approaches (like the Python examples above) buffer the whole archive, so memory usage grows with archive size
- Network speed bounds overall throughput
Always validate (see the sketch after this list):
- File signatures before processing
- Extracted file paths for directory traversal
- Total uncompressed size against available memory
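Here is a minimal sketch tying those three checks together; `safe_extract` is a hypothetical helper and the 2 GiB cap is purely illustrative:

```python
import os
import zipfile

MAX_TOTAL_UNCOMPRESSED = 2 * 1024**3   # illustrative 2 GiB ceiling

def safe_extract(archive_path, dest):
    """Check signature, total size, and paths before extracting."""
    # 1. File signature: local-header magic, or EOCD magic for an empty zip
    with open(archive_path, "rb") as f:
        if f.read(4) not in (b"PK\x03\x04", b"PK\x05\x06"):
            raise ValueError("missing ZIP signature")
    dest_real = os.path.realpath(dest)
    with zipfile.ZipFile(archive_path) as z:
        infos = z.infolist()
        # 2. Total uncompressed size, to reject decompression bombs
        total = sum(info.file_size for info in infos)
        if total > MAX_TOTAL_UNCOMPRESSED:
            raise ValueError(f"would inflate to {total} bytes; refusing")
        # 3. Path traversal ("zip slip"): every target must stay under dest
        for info in infos:
            target = os.path.realpath(os.path.join(dest_real, info.filename))
            if not target.startswith(dest_real + os.sep):
                raise ValueError(f"unsafe path in archive: {info.filename}")
        z.extractall(dest_real)
```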