When using `wget --mirror` to maintain an updated website mirror, many developers encounter a frustrating behavior: subsequent runs fail to check and update all changed files. The root cause lies in how wget handles recursion and timestamp comparisons during mirroring.
The `--mirror` option (equivalent to `-r -N -l inf --no-remove-listing`, as shown below) performs these key functions:

- Recursive downloading (`-r`)
- Timestamp comparison (`-N`)
- Unlimited recursion depth (`-l inf`)
- Preserving FTP directory listings (`--no-remove-listing`)
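Spelled out, these two invocations are interchangeable per the wget manual:

```
# --mirror is shorthand for recursion, timestamping, and infinite depth
wget --mirror http://www.example.org/
wget -r -N -l inf --no-remove-listing http://www.example.org/
```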
The primary issue occurs because:

- wget checks the top-level page's timestamp first
- if that page is unchanged, wget assumes no child pages need checking and does not re-parse it for links
- this behavior prevents discovery of updated nested content (the check below makes it visible)
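You can observe this on a second run. A minimal check, assuming the server answers conditional requests (modern wget sends an `If-Modified-Since` header by default in `-N` mode):

```
# A 304 on the top page means wget skipped it, and with it any
# re-scan of that page's links; headers go to stderr, hence 2>&1
wget --mirror --server-response http://www.example.org/ 2>&1 | grep "304 Not Modified"
```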
Here are the most reliable approaches to force a proper refresh:
Option 1: Re-seed the Crawl by Deleting the Entry Page

If the top-level page has to be re-downloaded, wget parses it again and recursion proceeds, so removing it before the run forces a fresh crawl:

```
rm -f www.example.org/index.html
wget --mirror http://www.example.org/
```
Option 2: Combine with --no-if-modified-since

This keeps timestamp checking but replaces the `If-Modified-Since` header with a preliminary HEAD request, sidestepping servers that answer 304 incorrectly:

```
wget --mirror --no-if-modified-since http://www.example.org/
```
Option 3: Selective Recursion Control

For large sites, you might want to limit depth while forcing a refresh. Note that `--level=5` overrides the infinite depth implied by `--mirror`, and `--timestamping` (`-N`) is already implied, so it is omitted here:

```
wget --mirror --level=5 --no-parent \
    --no-if-modified-since http://www.example.org/
```
For production environments, consider these enhancements:
```
wget --mirror --convert-links --adjust-extension \
    --page-requisites --no-parent --wait=2 \
    --random-wait --no-if-modified-since \
    --user-agent="Mozilla/5.0" http://www.example.org/
```
Create a cron job with forced refresh parameters (note that `-o` truncates the log on each run; use `-a` to append instead):

```
0 3 * * * /usr/bin/wget --mirror --no-if-modified-since -o /var/log/wget-mirror.log http://www.example.org/
```
A few operational considerations:

- Server load: add `--wait` and `--limit-rate` for large sites (see the combined sketch after this list)
- Bandwidth: use `--quota` if needed
- Permissions: maintain proper `robots.txt` compliance
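A politeness-tuned invocation combining these might look like the following; the rate and quota values are illustrative, not recommendations:

```
# Throttle the request rate and cap the total transfer
wget --mirror --wait=1 --random-wait \
    --limit-rate=200k --quota=500m \
    http://www.example.org/
```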
To recap the failure mode before the heavier-weight fixes: `--mirror` expands to `-r -N -l inf --no-remove-listing`, and its timestamp comparison stops recursion at an unchanged parent page:

```
wget --mirror http://www.example.org/
# Only checks timestamps of existing files
# Stops recursion if the parent page is unchanged
```
For a proper refresh, we need to modify the approach:
Option 1: Thorough Recrawl with Politeness Controls

```
wget --mirror --no-parent --execute robots=off \
    --wait=1 --random-wait --limit-rate=100k \
    http://www.example.org/
```

Key flags explanation:

- `--no-parent`: confines the crawl to the starting directory and below; it does not ascend past the start URL
- `--execute robots=off`: ignores `robots.txt` exclusions; use this only on sites you control or have permission to crawl
- `--wait` / `--random-wait`: adds a politeness delay between requests
Option 2: Clean Refresh by Backdating Local Files

Since `-N` compares each local file's modification time against the server's `Last-Modified` header, backdating the local copies makes every file look stale and forces a re-download:

```
# Backdate local files so -N treats them all as out of date
find www.example.org/ -type f -exec touch -t 197001010000 {} +
# Remove leftover temporary files from interrupted runs
find www.example.org/ -name "*.tmp" -delete

# Then run the mirror with additional flags
wget --mirror --convert-links --adjust-extension \
    --page-requisites --span-hosts \
    http://www.example.org/
```

Note that `--span-hosts` also pulls in content from other hosts the pages reference; drop it to stay on a single domain.
For scheduled updates, create a bash script (the paths and user agent are illustrative):

```
#!/bin/bash
set -euo pipefail

MIRROR_DIR="/path/to/mirror"
SITE_URL="http://www.example.org"

# Remove leftover temporary files from interrupted runs
find "$MIRROR_DIR" -name "*.tmp" -delete

# Run the refresh with enhanced parameters;
# --no-check-certificate skips TLS verification, so keep it
# only for sites with self-signed certificates
wget --mirror \
    --no-if-modified-since \
    --no-cache \
    --no-check-certificate \
    --convert-links \
    --user-agent="Mozilla/5.0 (MirrorBot)" \
    -P "$MIRROR_DIR" \
    "$SITE_URL"
```

Letting `--convert-links` rewrite URLs for local browsing is safer than post-processing with `sed`: a blanket `s|href="/|href="./|g` substitution breaks root-relative links in nested directories, where the correct prefix is `../`, not `./`.
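Make the script executable and schedule it; the installed path here is an assumption for illustration:

```
chmod +x /usr/local/bin/refresh-mirror.sh
# crontab entry, e.g. nightly at 03:00:
# 0 3 * * * /usr/local/bin/refresh-mirror.sh
```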
Even for HTML-only mirrors, you might need to handle pseudo-dynamic URLs. One caveat: `--accept-regex` matches against the complete URL, so include a pattern for directory URLs ending in `/`, or recursion can stall at index pages:

```
wget --mirror --reject-regex "\?.*" \
    --accept-regex "/static/|/$|\.html$|\.css$|\.js$" \
    http://www.example.org/
```
Check that the refresh worked by comparing file counts and modification times:

```
# Count files before and after
find www.example.org/ -type f | wc -l

# List the 20 most recently modified files (GNU find; sorts
# globally, unlike `-exec ls -lt`, which only sorts per batch)
find www.example.org/ -type f -printf '%TY-%Tm-%Td %TT %p\n' | sort -r | head -n 20
```
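If counts and dates are inconclusive, a checksum manifest pinpoints exactly which pages changed; the manifest file names `before.md5` and `after.md5` are arbitrary:

```
# Snapshot checksums before the refresh...
find www.example.org/ -type f -name "*.html" -exec md5sum {} + | sort -k2 > before.md5
# ...run the mirror refresh, then snapshot again and diff
find www.example.org/ -type f -name "*.html" -exec md5sum {} + | sort -k2 > after.md5
diff before.md5 after.md5 | head
```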