How to Exclude In-Progress File Transfers When Using rsync for SFTP Backups

When dealing with SFTP servers where clients continuously upload large files, a common challenge arises: how to only transfer completed files while ignoring those currently being written. Standard rsync operations will attempt to copy these in-progress files, which can lead to corrupted transfers and wasted bandwidth.

The key is to implement a check that verifies files haven't been modified for a certain period. Here's a bash script approach:

#!/bin/bash
SOURCE_DIR="/sftp/uploads"
DESTINATION="user@backup-server:/backups"
STABILITY_PERIOD=300 # 5 minutes in seconds

# Find files not modified in the last 5 minutes (paths relative to $SOURCE_DIR)
find "$SOURCE_DIR" -type f -mmin +$((STABILITY_PERIOD/60)) -printf '%P\0' | \
  rsync -av --from0 --files-from=- "$SOURCE_DIR" "$DESTINATION"
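
Saved to a file and scheduled with cron, the script above can sweep up newly stable files at a regular interval. The path below is only an example:

# Example crontab entry (script path is hypothetical): run every 5 minutes
*/5 * * * * /usr/local/bin/sftp_stable_backup.sh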

For more precise detection of files being written, you can check which files are currently open:

#!/bin/bash
SOURCE_DIR="/sftp/uploads"
DESTINATION="user@backup-server:/backups"

# Build the list of files not currently opened by any process,
# stripping the source prefix so rsync receives relative paths
comm -23 \
  <(find "$SOURCE_DIR" -type f | sort) \
  <(lsof +D "$SOURCE_DIR" 2>/dev/null | awk 'NR>1 {print $9}' | sort) | \
  sed "s|^$SOURCE_DIR/||" | \
  rsync -av --files-from=- "$SOURCE_DIR" "$DESTINATION"

If a transfer stalls partway through a file, rsync's timeout and partial-transfer options let it give up cleanly and resume on the next run:

rsync -av --timeout=60 --partial --progress /sftp/uploads/ user@backup-server:/backups/
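
If the network link itself is unreliable, a small retry wrapper around the same command can help; this is a minimal sketch, and the retry count and delay are arbitrary:

#!/bin/bash
# Retry the transfer a few times; --partial lets an interrupted file resume
for attempt in 1 2 3; do
  rsync -av --timeout=60 --partial /sftp/uploads/ user@backup-server:/backups/ && break
  echo "Attempt $attempt failed, retrying in 30 seconds" >&2
  sleep 30
done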

For maximum reliability, combine multiple verification methods:

#!/bin/bash
SOURCE="/sftp/uploads"
DEST="user@backup-server:/backups"
LOG="/var/log/sftp_backups.log"

{
  echo "Starting backup at $(date)"
  
  # Step 1: Find files not modified in the last 10 minutes (paths relative to $SOURCE)
  STABLE_FILES=$(mktemp)
  find "$SOURCE" -type f -mmin +10 -printf '%P\n' | sort > "$STABLE_FILES"

  # Step 2: List files currently open by any process, also relative to $SOURCE
  OPEN_FILES=$(mktemp)
  lsof +D "$SOURCE" 2>/dev/null | awk 'NR>1 {print $9}' | sed "s|^$SOURCE/||" | sort > "$OPEN_FILES"

  # Step 3: Keep only stable files that are not open
  TRANSFER_LIST=$(mktemp)
  comm -23 "$STABLE_FILES" "$OPEN_FILES" > "$TRANSFER_LIST"

  # Step 4: Execute rsync using the filtered list
  rsync -av --files-from="$TRANSFER_LIST" "$SOURCE" "$DEST"
  
  # Cleanup
  rm "$STABLE_FILES" "$OPEN_FILES" "$TRANSFER_LIST"
  
  echo "Backup completed at $(date)"
} >> "$LOG" 2>&1
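
If this combined script is run from cron, it is worth guarding against overlapping runs. A minimal sketch using flock (the lock file and script paths are examples):

# Skip this run if the previous backup is still in progress (paths are examples)
flock -n /var/lock/sftp_backup.lock /usr/local/bin/sftp_backup.sh || echo "previous run still active"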

For complex scenarios, consider these alternatives:

  • csync2: Cluster synchronization tool with file verification
  • lsyncd: Live syncing daemon with various monitoring options (a minimal invocation is sketched after this list)
  • incron: Trigger actions on filesystem events
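
To give a rough sense of the lsyncd option above: recent lsyncd releases can be launched in a simple rsync mode straight from the command line. The exact flags vary between versions, so treat the line below as a sketch and verify against lsyncd --help on your system:

# Sketch only: continuously mirror the upload directory (verify syntax for your lsyncd version)
lsyncd -rsync /sftp/uploads user@backup-server:/backups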

When managing an SFTP server that receives continuous large file uploads from clients, copying only complete files becomes crucial. Attempting to process partially uploaded files can lead to data corruption, processing errors, or incomplete datasets. The standard rsync behavior doesn't inherently distinguish between complete and in-progress transfers.

To safely identify files that aren't actively being written to, we can use several approaches:

# Method 1: Check whether any process has the file open
# (lsof exits 0 if the file is open, non-zero otherwise)
lsof /path/to/file > /dev/null 2>&1 && echo "still in use" || echo "not open"

# Method 2: Compare the file size at two points in time
SIZE1=$(stat -c %s /path/to/file)
sleep 5
SIZE2=$(stat -c %s /path/to/file)
[ "$SIZE1" -eq "$SIZE2" ] && echo "stable" || echo "still growing"
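
Both checks can be combined into a single helper function. This is a minimal sketch: the 5-second window is arbitrary, and the function name is just an example:

#!/bin/bash
# Returns 0 if the file is neither open by any process nor still growing
is_stable() {
  local file=$1
  # lsof exits 0 when it finds the file open by some process
  lsof "$file" > /dev/null 2>&1 && return 1
  local size1 size2
  size1=$(stat -c %s "$file") || return 1
  sleep 5
  size2=$(stat -c %s "$file") || return 1
  [ "$size1" -eq "$size2" ]
}

is_stable /sftp/uploads/example.dat && echo "safe to copy"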

Here are three practical rsync-based solutions:

1. Using --ignore-existing with Size Checks

#!/bin/bash
# First pass (run from the backup host): list files on the SFTP server
# that haven't been modified in the last 5 minutes
ssh user@sftp-server "find /sftp/uploads -type f -mmin +5" > stable_files.list

# Rsync only those files, skipping anything already present at the destination
rsync -avz --files-from=stable_files.list \
  --ignore-existing \
  user@sftp-server:/ /backup/destination/
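
An alternative way to build stable_files.list, closer to the "size checks" in the heading above, is to snapshot file sizes twice and keep only entries that did not change. A sketch (the 60-second wait is arbitrary, and file names are assumed to contain no newlines):

#!/bin/bash
# Snapshot sizes, wait, snapshot again; identical lines mean the file did not grow
find /sftp/uploads -type f -exec stat -c '%s %n' {} + | sort > sizes_before.list
sleep 60
find /sftp/uploads -type f -exec stat -c '%s %n' {} + | sort > sizes_after.list
comm -12 sizes_before.list sizes_after.list | cut -d' ' -f2- > stable_files.list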

2. Combining with inotifywait

# Monitor for file closure events
inotifywait -m -e close_write --format '%w%f' /sftp/uploads |
while IFS= read -r file
do
  rsync -avz "$file" user@backup-server:/destination/
done
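
One limitation of the loop above is that every file lands flat in the destination directory. A variant sketch that also watches subdirectories and preserves the layout under /sftp/uploads (using rsync's --relative):

#!/bin/bash
cd /sftp/uploads || exit 1
# Watch recursively; paths come out relative to the current directory (e.g. ./subdir/file)
inotifywait -m -r -e close_write --format '%w%f' . |
while IFS= read -r file
do
  # -R/--relative recreates the path after "./" on the destination
  rsync -azR "$file" user@backup-server:/destination/
done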

3. LVM Snapshot Approach

# Create LVM snapshot
lvcreate -L10G -s -n sftp-snap /dev/vg/sftp-lv

# Mount the snapshot and rsync from it
mkdir -p /mnt/sftp-snapshot
mount /dev/vg/sftp-snap /mnt/sftp-snapshot
rsync -avz /mnt/sftp-snapshot/uploads/ user@backup-server:/destination/

# Cleanup (-f avoids the interactive confirmation prompt)
umount /mnt/sftp-snapshot
lvremove -f /dev/vg/sftp-snap

For more complex scenarios, these alternatives might be better suited:

  • LFTP: Supports mirroring with better transfer control (example after this list)
  • csync2: Designed for cluster synchronization
  • Unison: Two-way file synchronization
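
For the lftp option above, a mirror run might look like the sketch below (the remote path and host are examples; check man lftp for the mirror options available in your version):

# Sketch: pull completed uploads with lftp's mirror command (paths are examples)
lftp -e "mirror --only-newer /sftp/uploads /backup/destination; quit" sftp://user@sftp-server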

When implementing any of these solutions:

  1. Always test with non-production data first
  2. Implement proper logging for troubleshooting
  3. Consider adding checksum verification for critical files (see the final example below)
  4. Monitor disk space when using snapshot-based approaches
  5. Set up proper error handling in your scripts

# Example logging implementation
rsync -avz --log-file=/var/log/rsync_$(date +%Y%m%d).log \
  --files-from=stable_files.list \
  user@sftp-server:/ /backup/destination/
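
For item 3 in the checklist, checksum verification can be layered onto any of the transfers above by adding rsync's -c/--checksum flag, at the cost of extra CPU and disk reads; for example:

# Verify content by checksum instead of size/mtime (slower, but catches silent mismatches)
rsync -avzc --files-from=stable_files.list \
  user@sftp-server:/ /backup/destination/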