How to Detect If a File Is Being Written in Linux/Bash (Prevent Partial Tar Extraction)


When implementing automated file processing (especially with large archives), we often encounter race conditions where a cron job attempts to process files before transfer completion. This manifests when:

  • SCP/SSH transfers exceed cron interval duration
  • Network latency causes prolonged file writes
  • Multi-gigabyte archives take minutes to transfer

Method 1: lsof Verification

The most reliable approach checks if any process has the file open for writing:

#!/bin/bash
is_file_written() {
    # lsof's "a" field reports the access mode: "aw" means open for
    # writing, "au" means open for read/write
    if lsof -Fa -- "$1" 2>/dev/null | grep -q '^a[wu]'; then
        return 0  # File is being written
    else
        return 1  # File is complete
    fi
}

TARGET_FILE="/path/to/archive.tar.gz"
if ! is_file_written "$TARGET_FILE"; then
    tar -xzf "$TARGET_FILE" --directory /target/path
    rm -f "$TARGET_FILE"
fi

Method 2: Filesize Stability Check

For environments where lsof isn't available:

check_stable_size() {
    local file=$1
    local initial_size=$(stat -c%s "$file")
    sleep 5  # Adjust interval based on transfer speed
    [[ $initial_size -eq $(stat -c%s "$file") ]]
}

if check_stable_size "large_file.tar.gz"; then
    tar -xzf large_file.tar.gz --directory /target/path  # proceed with extraction
fi

Method 3: Transfer Completion Flags

Renaming to an alternative extension during transfer works well when you control the sending side:

# Sender side
scp large_file.tar.gz.part remote:/destination/
ssh remote "mv large_file.tar.gz.part large_file.tar.gz"

# Receiver processing
for f in *.tar.gz; do
    [[ -f "${f}.part" ]] && continue  # Skip incomplete transfers
    tar -xzf "$f"
done
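
If rsync is available on both ends, you get this atomic-rename behavior for free: by default rsync writes incoming data to a hidden temporary file and renames it into place only after the transfer completes, so a *.tar.gz glob on the receiver never matches a partial file. A sketch (host and destination path are placeholders):

```shell
#!/bin/bash
# Sketch: rsync already performs the temp-file-plus-rename dance.
# It writes to a hidden ".large_file.tar.gz.XXXXXX" file on the
# receiver and renames it into place only when the transfer finishes.
send_archive() {
    local src=$1 dest=$2
    # --partial-dir keeps interrupted partials out of the watched glob too
    rsync -a --partial-dir=.rsync-partial "$src" "$dest"
}

# Hypothetical usage:
# send_archive large_file.tar.gz backup@remote:/destination/
```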

For mission-critical systems, combine multiple approaches:

  • Use flock for exclusive access control
  • Implement MD5 verification for very large files
  • Consider filesystem watches (inotify) instead of cron polling

# Combined example using flock and lsof (expects the archive path in $1)
(
    flock -x 200 || exit 1
    # lsof exits non-zero when no process has the file open
    if ! lsof -t -- "$1" >/dev/null 2>&1; then
        tar -xzf "$1"
    fi
) 200>/var/lock/fileprocessing.lock
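
The inotify suggestion above can replace cron polling entirely. A minimal sketch, assuming the inotify-tools package provides inotifywait: the close_write event fires only when a writer closes the file, which for scp/ssh transfers means the transfer has finished.

```shell
#!/bin/bash
# Event-driven alternative to cron polling (requires inotify-tools).
# close_write fires when a process that had the file open for writing
# closes it -- i.e. when the transfer completes.
watch_and_extract() {
    local watchdir=$1 target=$2
    command -v inotifywait >/dev/null || {
        echo "inotifywait not installed" >&2
        return 1
    }
    inotifywait -m -e close_write --format '%w%f' "$watchdir" |
    while read -r path; do
        [[ "$path" == *.tar.gz ]] || continue
        tar -xzf "$path" -C "$target" && rm -f "$path"
    done
}

# Hypothetical usage (blocks until interrupted):
# watch_and_extract /watchdir /target/dir
```

Because the function only reacts once a writer closes the file, it sidesteps the partial-file race without any polling interval tuning.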

When automating TAR file processing with cron jobs, one critical edge case emerges: detecting whether a file transfer (especially over SSH) has completed before attempting extraction. The fundamental issue occurs when:

  • A multi-gigabyte TAR archive takes >1 minute to transfer
  • The monitoring cron job runs every 60 seconds
  • The script attempts extraction on a partially transferred file

Here are three battle-tested approaches to solve this:

Method 1: lsof - The File Handle Check

The most direct way is checking if any process has the file open for writing:


#!/bin/bash
is_file_in_use() {
    # The FD column (field 4) ends with the access mode: a lowercase
    # "w" or "u" means some process has the file open for writing
    if lsof -- "$1" 2>/dev/null | awk '$4 ~ /^[0-9]+[wu]/ {found=1} END {exit !found}'; then
        return 0
    else
        return 1
    fi
}

for tarfile in /watchdir/*.tar.gz; do
    if ! is_file_in_use "$tarfile"; then
        tar -xzf "$tarfile" -C /target/dir
        rm "$tarfile"
    fi
done

Method 2: Size Stabilization Monitoring

For environments where lsof isn't available, track file size changes:


#!/bin/bash
check_stable_size() {
    local file=$1
    local interval=5
    local prev_size=$(stat -c%s "$file")
    sleep $interval
    local current_size=$(stat -c%s "$file")
    [ "$prev_size" -eq "$current_size" ]
}

for tarfile in /watchdir/*.tar.gz; do
    if check_stable_size "$tarfile"; then
        if tar -tzf "$tarfile" &>/dev/null; then  # Test archive integrity
            tar -xzf "$tarfile" -C /target/dir
            rm "$tarfile"
        else
            echo "Corrupt archive detected: $tarfile" >&2
        fi
    fi
done

Method 3: The Atomic Rename Pattern

The most robust solution involves transfer protocol cooperation:


#!/bin/bash
for tarfile in /watchdir/*.tar.gz; do
    # Skip archives whose temporary transfer file still exists
    if [[ -e "${tarfile}.part" || -e "${tarfile}.tmp" ]]; then
        continue
    fi
    
    # Verify complete transfer marker
    if [ -f "${tarfile}.md5" ]; then
        if md5sum -c "${tarfile}.md5" &>/dev/null; then
            tar -xzf "$tarfile" -C /target/dir
            rm "$tarfile" "${tarfile}.md5"
        fi
    fi
done
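
The sender-side half of this pattern is not shown above; the idea is to write the .md5 marker only after the archive is complete and upload it last, so its presence signals a finished transfer. A self-contained local sketch (throwaway files, no remote host involved):

```shell
#!/bin/bash
# Sender-side counterpart: generate the checksum marker only once the
# archive exists in full, then (in real use) upload it after the archive.
set -e
workdir=$(mktemp -d)
cd "$workdir"

printf 'payload\n' > data.txt
tar -czf sample.tar.gz data.txt
md5sum sample.tar.gz > sample.tar.gz.md5   # marker created last

# Receiver-side verification, same check the loop above performs:
md5sum -c sample.tar.gz.md5 >/dev/null && echo "transfer verified"
# prints "transfer verified"
```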

Consider these additional safeguards:

  • Lock files: Prevent concurrent processing with flock
  • Inode checking: Detect when a new file replaces the original
  • Transfer logs: Cross-reference with transfer completion logs
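
The inode-checking idea can be folded into the size-stability helper: if the sender deleted and re-uploaded the file between the two samples, the size may match by coincidence but the inode will have changed. A sketch using GNU stat (the -c format is GNU-specific):

```shell
#!/bin/bash
# Size stability plus inode stability: a replaced file gets a new inode
# even if its size happens to match the previous sample.
same_inode_and_size() {
    local file=$1 interval=${2:-5}
    local inode1 size1
    inode1=$(stat -c%i "$file") || return 1
    size1=$(stat -c%s "$file")
    sleep "$interval"
    [ "$inode1" -eq "$(stat -c%i "$file")" ] &&
        [ "$size1" -eq "$(stat -c%s "$file")" ]
}

# Quick local check with a 1-second window:
tmp=$(mktemp)
same_inode_and_size "$tmp" 1 && echo "stable"
# prints "stable"
rm -f "$tmp"
```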

For enterprise deployments, combine multiple methods:


#!/bin/bash
# check_stable_size must be defined (and exported) in this script,
# since find spawns fresh bash processes
check_stable_size() {
    local file=$1
    local prev_size=$(stat -c%s "$file")
    sleep 5
    [ "$prev_size" -eq "$(stat -c%s "$file")" ]
}

process_tar() {
    local file=$1
    local lock_file="/tmp/${file##*/}.lock"

    (
        flock -n 9 || exit 1

        if ! lsof -- "$file" &>/dev/null; then
            if check_stable_size "$file"; then
                if tar -tzf "$file" &>/dev/null; then
                    tar -xzf "$file" -C /target/dir
                    rm "$file"
                fi
            fi
        fi
    ) 9>"$lock_file"
}

export -f process_tar check_stable_size
find /watchdir -name '*.tar.gz' -exec bash -c 'process_tar "$1"' _ {} \;