When using `wget --mirror` to maintain an updated website mirror, many developers encounter a frustrating behavior: subsequent runs fail to check and update all changed files. The root cause lies in how wget handles recursion and timestamp comparisons during mirroring.
The `--mirror` option (equivalent to `-r -N -l inf --no-remove-listing`, as shown below) performs these key functions:

- Recursive downloading (`-r`)
- Timestamp comparison (`-N`)
- Unlimited recursion depth (`-l inf`)
- Preserving FTP directory listings (`--no-remove-listing`)
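Spelled out, these two invocations are interchangeable per the wget manual:

```
# --mirror is shorthand for recursion, timestamping, and infinite depth
wget --mirror http://www.example.org/
wget -r -N -l inf --no-remove-listing http://www.example.org/
```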
The primary issue occurs because:

- wget checks the top-level page's timestamp first
- if that page is unchanged, wget assumes no child pages need checking and does not re-parse it for links
- this behavior prevents discovery of updated nested content (the check below makes it visible)
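You can observe this on a second run. A minimal check, assuming the server answers conditional requests (modern wget sends an `If-Modified-Since` header by default in `-N` mode):

```
# A 304 on the top page means wget skipped it, and with it any
# re-scan of that page's links; headers go to stderr, hence 2>&1
wget --mirror --server-response http://www.example.org/ 2>&1 | grep "304 Not Modified"
```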
Here are the most reliable approaches to force a proper refresh:
Option 1: Re-seed the Crawl by Deleting the Entry Page

If the top-level page has to be re-downloaded, wget parses it again and recursion proceeds, so removing it before the run forces a fresh crawl:

```
rm -f www.example.org/index.html
wget --mirror http://www.example.org/
```
Option 2: Combine with --no-if-modified-since

This keeps timestamp checking but replaces the `If-Modified-Since` header with a preliminary HEAD request, sidestepping servers that answer 304 incorrectly:

```
wget --mirror --no-if-modified-since http://www.example.org/
```
Option 3: Selective Recursion Control

For large sites, you might want to limit depth while forcing a refresh. Note that `--level=5` overrides the infinite depth implied by `--mirror`, and `--timestamping` (`-N`) is already implied, so it is omitted here:

```
wget --mirror --level=5 --no-parent \
    --no-if-modified-since http://www.example.org/
```
For production environments, consider these enhancements:
```
wget --mirror --convert-links --adjust-extension \
    --page-requisites --no-parent --wait=2 \
    --random-wait --no-if-modified-since \
    --user-agent="Mozilla/5.0" http://www.example.org/
```
Create a cron job with forced refresh parameters (note that `-o` truncates the log on each run; use `-a` to append instead):

```
0 3 * * * /usr/bin/wget --mirror --no-if-modified-since -o /var/log/wget-mirror.log http://www.example.org/
```
A few operational considerations:

- Server load: add `--wait` and `--limit-rate` for large sites (see the combined sketch after this list)
- Bandwidth: use `--quota` if needed
- Permissions: maintain proper `robots.txt` compliance
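A politeness-tuned invocation combining these might look like the following; the rate and quota values are illustrative, not recommendations:

```
# Throttle the request rate and cap the total transfer
wget --mirror --wait=1 --random-wait \
    --limit-rate=200k --quota=500m \
    http://www.example.org/
```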
To recap the failure mode before the heavier-weight fixes: `--mirror` expands to `-r -N -l inf --no-remove-listing`, and its timestamp comparison stops recursion at an unchanged parent page:

```
wget --mirror http://www.example.org/
# Only checks timestamps of existing files
# Stops recursion if the parent page is unchanged
```
For a proper refresh, we need to modify the approach:
Option 1: Thorough Recrawl with Politeness Controls

```
wget --mirror --no-parent --execute robots=off \
    --wait=1 --random-wait --limit-rate=100k \
    http://www.example.org/
```

Key flags explanation:

- `--no-parent`: confines the crawl to the starting directory and below; it does not ascend past the start URL
- `--execute robots=off`: ignores `robots.txt` exclusions; use this only on sites you control or have permission to crawl
- `--wait` / `--random-wait`: adds a politeness delay between requests
Option 2: Clean Refresh by Backdating Local Files

Since `-N` compares each local file's modification time against the server's `Last-Modified` header, backdating the local copies makes every file look stale and forces a re-download:

```
# Backdate local files so -N treats them all as out of date
find www.example.org/ -type f -exec touch -t 197001010000 {} +
# Remove leftover temporary files from interrupted runs
find www.example.org/ -name "*.tmp" -delete

# Then run the mirror with additional flags
wget --mirror --convert-links --adjust-extension \
    --page-requisites --span-hosts \
    http://www.example.org/
```

Note that `--span-hosts` also pulls in content from other hosts the pages reference; drop it to stay on a single domain.
For scheduled updates, create a bash script (the paths and user agent are illustrative):

```
#!/bin/bash
set -euo pipefail

MIRROR_DIR="/path/to/mirror"
SITE_URL="http://www.example.org"

# Remove leftover temporary files from interrupted runs
find "$MIRROR_DIR" -name "*.tmp" -delete

# Run the refresh with enhanced parameters;
# --no-check-certificate skips TLS verification, so keep it
# only for sites with self-signed certificates
wget --mirror \
    --no-if-modified-since \
    --no-cache \
    --no-check-certificate \
    --convert-links \
    --user-agent="Mozilla/5.0 (MirrorBot)" \
    -P "$MIRROR_DIR" \
    "$SITE_URL"
```

Letting `--convert-links` rewrite URLs for local browsing is safer than post-processing with `sed`: a blanket `s|href="/|href="./|g` substitution breaks root-relative links in nested directories, where the correct prefix is `../`, not `./`.
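Make the script executable and schedule it; the installed path here is an assumption for illustration:

```
chmod +x /usr/local/bin/refresh-mirror.sh
# crontab entry, e.g. nightly at 03:00:
# 0 3 * * * /usr/local/bin/refresh-mirror.sh
```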
Even for HTML-only mirrors, you might need to handle pseudo-dynamic URLs. One caveat: `--accept-regex` matches against the complete URL, so include a pattern for directory URLs ending in `/`, or recursion can stall at index pages:

```
wget --mirror --reject-regex "\?.*" \
    --accept-regex "/static/|/$|\.html$|\.css$|\.js$" \
    http://www.example.org/
```
Check that the refresh worked by comparing file counts and modification times:

```
# Count files before and after
find www.example.org/ -type f | wc -l

# List the 20 most recently modified files (GNU find; sorts
# globally, unlike `-exec ls -lt`, which only sorts per batch)
find www.example.org/ -type f -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | head -n 20
```
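If counts and dates are inconclusive, a checksum manifest pinpoints exactly which pages changed; the manifest file names `before.md5` and `after.md5` are arbitrary:

```
# Snapshot checksums before the refresh...
find www.example.org/ -type f -name "*.html" -exec md5sum {} + | sort -k2 > before.md5
# ...run the mirror refresh, then snapshot again and diff
find www.example.org/ -type f -name "*.html" -exec md5sum {} + | sort -k2 > after.md5
diff before.md5 after.md5 | head
```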