When implementing automated file processing (especially with large archives), we often encounter race conditions where a cron job attempts to process files before transfer completion. This manifests when:
- SCP/SSH transfers exceed cron interval duration
- Network latency causes prolonged file writes
- Multi-gigabyte archives take minutes to transfer
Method 1: lsof Verification
The most reliable approach checks if any process has the file open for writing:
#!/bin/bash
is_file_written() {
    # -F a prints one access-mode line per open descriptor:
    # "aw" = open for writing, "au" = open read/write
    if lsof -F a "$1" 2>/dev/null | grep -qE '^a[uw]'; then
        return 0  # File is being written
    else
        return 1  # File is complete
    fi
}
TARGET_FILE="/path/to/archive.tar.gz"
if ! is_file_written "$TARGET_FILE"; then
    tar -xzf "$TARGET_FILE" --directory /target/path
    rm -f "$TARGET_FILE"
fi
Method 2: Filesize Stability Check
For environments where lsof isn't available:
check_stable_size() {
    local file=$1
    local initial_size=$(stat -c%s "$file")  # GNU stat; use stat -f%z on BSD/macOS
    sleep 5  # Adjust interval based on transfer speed
    [[ $initial_size -eq $(stat -c%s "$file") ]]
}
if check_stable_size "large_file.tar.gz"; then
    tar -xzf large_file.tar.gz  # Proceed with extraction
fi
Method 3: Transfer Completion Flags
Renaming the file to its final name only after the transfer completes works well when you control the transfer process:
# Sender side
scp large_file.tar.gz.part remote:/destination/
ssh remote "mv /destination/large_file.tar.gz.part /destination/large_file.tar.gz"
# Receiver processing: the glob only matches completed transfers,
# because in-flight files still carry the .part suffix
for f in /destination/*.tar.gz; do
    [[ -e "$f" ]] || continue  # Glob matched nothing
    tar -xzf "$f"
done
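The pattern is safe because a rename within a single filesystem is atomic: the final name either does not exist yet or refers to the complete file. A minimal local demonstration (the scratch directory and file names are illustrative):

```shell
#!/bin/bash
# Show that a watcher's glob never sees the half-written file.
workdir=$(mktemp -d)

# "Transfer" in progress: data arrives under a temporary name
echo "payload" > "$workdir/archive.tar.gz.part"
before=$(ls "$workdir"/*.tar.gz 2>/dev/null | wc -l)  # Nothing to process yet

# The rename is atomic on the same filesystem: the final name
# appears fully formed, never partially written
mv "$workdir/archive.tar.gz.part" "$workdir/archive.tar.gz"
after=$(ls "$workdir"/*.tar.gz 2>/dev/null | wc -l)   # One complete file

echo "complete archives before rename: $before, after: $after"
rm -rf "$workdir"
```

Running it prints `complete archives before rename: 0, after: 1`.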
For mission-critical systems, combine multiple approaches:
- Use flock for exclusive access control
- Implement MD5 verification for very large files
- Consider filesystem watches (inotify) instead of cron polling
# Combined example using flock and lsof
(
    flock -x 200 || exit 1
    if ! lsof -t "$1" >/dev/null 2>&1; then
        tar -xzf "$1"
    fi
) 200>/var/lock/fileprocessing.lock
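If the processing script itself cannot be changed, the same exclusion works from the crontab entry, because flock(1) can wrap an arbitrary command; the schedule and paths below are illustrative:

```
# Run every minute; skip silently if the previous run still holds the lock
* * * * *  flock -n /var/lock/fileprocessing.lock /usr/local/bin/process_archives.sh
```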
When automating TAR file processing with cron jobs, one critical edge case emerges: detecting whether a file transfer (especially over SSH) has completed before attempting extraction. The fundamental issue occurs when:
- A multi-gigabyte TAR archive takes >1 minute to transfer
- The monitoring cron job runs every 60 seconds
- The script attempts extraction on a partially transferred file
Here are three battle-tested approaches to solve this:
Method 1: lsof - The File Handle Check
The most direct way is checking if any process has the file open for writing:
#!/bin/bash
is_file_in_use() {
    # -F a prints one access-mode line per open descriptor:
    # "aw" = open for writing, "au" = open read/write
    if lsof -F a "$1" 2>/dev/null | grep -qE '^a[uw]'; then
        return 0
    else
        return 1
    fi
}
for tarfile in /watchdir/*.tar.gz; do
    [ -e "$tarfile" ] || continue  # Glob matched nothing
    if ! is_file_in_use "$tarfile"; then
        tar -xzf "$tarfile" -C /target/dir
        rm "$tarfile"
    fi
done
Method 2: Size Stabilization Monitoring
For environments where lsof isn't available, track file size changes:
#!/bin/bash
check_stable_size() {
    local file=$1
    local interval=5
    local prev_size=$(stat -c%s "$file")  # GNU stat; use stat -f%z on BSD/macOS
    sleep $interval
    local current_size=$(stat -c%s "$file")
    [ "$prev_size" -eq "$current_size" ]
}
for tarfile in /watchdir/*.tar.gz; do
    if check_stable_size "$tarfile"; then
        if tar -tzf "$tarfile" &>/dev/null; then  # Test archive integrity
            tar -xzf "$tarfile" -C /target/dir
            rm "$tarfile"
        else
            echo "Corrupt archive detected: $tarfile" >&2
        fi
    fi
done
Method 3: The Atomic Rename Pattern
The most robust solution relies on cooperation from the transfer side: data arrives under a temporary name, and a completion marker gates the processing:
#!/bin/bash
cd /watchdir || exit 1
for tarfile in *.tar.gz*; do
    # Skip temporary transfer files and checksum sidecars
    if [[ "$tarfile" == *.part || "$tarfile" == *.tmp || "$tarfile" == *.md5 ]]; then
        continue
    fi
    # Verify the complete-transfer marker (paths inside the .md5 file
    # are relative, hence the cd above)
    if [ -f "${tarfile}.md5" ]; then
        if md5sum -c "${tarfile}.md5" &>/dev/null; then
            tar -xzf "$tarfile" -C /target/dir
            rm "$tarfile" "${tarfile}.md5"
        fi
    fi
done
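For the marker gate above to be race-free, the sender must ship the archive first and the small .md5 sidecar last, so that the marker's presence implies a complete archive. A sketch exercising the same md5sum calls locally (the scp destinations in the comments are placeholders):

```shell
#!/bin/bash
# Sender-side ordering, demonstrated with a local stand-in payload
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "archive contents" > large_file.tar.gz        # Stand-in for the real archive
md5sum large_file.tar.gz > large_file.tar.gz.md5   # Sidecar, written after the data

# In production, send in this order so the marker arrives last:
#   scp large_file.tar.gz     remote:/watchdir/
#   scp large_file.tar.gz.md5 remote:/watchdir/

# The receiver's gate: only process once the sidecar verifies
if md5sum -c large_file.tar.gz.md5 >/dev/null 2>&1; then
    verified=yes
else
    verified=no
fi
echo "checksum verified: $verified"
```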
Consider these additional safeguards:
- Lock files: Prevent concurrent processing with flock
- Inode checking: Detect when a new file replaces the original
- Transfer logs: Cross-reference with transfer completion logs
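The inode check in the list above can be sketched with GNU stat: appending to a file keeps its inode, while an atomic replacement (mv over the same name) changes it, so two differing observations mean a new file took the original's place:

```shell
#!/bin/bash
# Detect mid-watch replacement by comparing inodes across two observations
workdir=$(mktemp -d)
echo "v1" > "$workdir/archive.tar.gz"
inode_before=$(stat -c%i "$workdir/archive.tar.gz")   # GNU stat

# A sender restarting its transfer typically replaces the file atomically
echo "v2" > "$workdir/archive.tar.gz.new"
mv "$workdir/archive.tar.gz.new" "$workdir/archive.tar.gz"

inode_after=$(stat -c%i "$workdir/archive.tar.gz")
if [ "$inode_before" = "$inode_after" ]; then
    echo "same file as before"
else
    echo "file was replaced mid-watch"
fi
rm -rf "$workdir"
```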
For enterprise deployments, combine multiple methods:
#!/bin/bash
# check_stable_size from Method 2, repeated so the script is self-contained
check_stable_size() {
    local file=$1
    local prev_size=$(stat -c%s "$file")
    sleep 5
    [ "$prev_size" -eq "$(stat -c%s "$file")" ]
}

process_tar() {
    local file=$1
    local lock_file="/tmp/${file##*/}.lock"
    (
        flock -n 9 || exit 1                       # Another run owns this file
        if ! lsof "$file" &>/dev/null; then        # Nothing has it open
            if check_stable_size "$file"; then     # Size no longer changing
                if tar -tzf "$file" &>/dev/null; then  # Archive is readable
                    tar -xzf "$file" -C /target/dir
                    rm "$file"
                fi
            fi
        fi
    ) 9>"$lock_file"
}
export -f process_tar check_stable_size
find /watchdir -name '*.tar.gz' -exec bash -c 'process_tar "$1"' _ {} \;