Mastering Selective Recursive Downloads with wget
When mirroring websites with `wget --recursive`, a common frustration is that rejection rules are applied only after a page has already been fetched: wget still downloads HTML files that match the `--reject` list (so it can harvest their links) and deletes them afterwards. This wastes bandwidth and time, especially when dealing with large sites. The solution lies in combining several wget parameters:

```
wget --recursive --level=5 --no-parent \
     --accept '*.html,*.htm,*.php' \
     --reject '*.jpg,*.png,*.gif,*.js,*.css' \
     --domains example.com \
     http://example.com/
```
- `--accept`: whitelist specific file extensions
- `--reject`: blacklist unwanted file types
- `--level`: control recursion depth
- `--no-parent`: stay within the specified directory

For more complex filtering, use regular expressions with `--accept-regex` and `--reject-regex`:

```
wget -r -l2 --no-parent \
     --accept-regex '/blog/202[0-9]/.*\.html$' \
     --reject-regex '.*/temp/.*' \
     http://example.com/blog/
```
| Tool        | Advantage                        | Disadvantage          |
|-------------|----------------------------------|-----------------------|
| wget        | Built-in, no installation needed | Limited regex support |
| httrack     | GUI available, better filtering  | Larger footprint      |
| curl + grep | Maximum flexibility              | Requires scripting    |
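As a rough illustration of the "curl + grep" row (the URL is a placeholder, and the sketch assumes the page links with absolute `href` attributes ending in `.html`), such a pipeline might look like:

```
# Sketch only: list HTML links from one page and fetch them with wget.
# Assumes absolute URLs; relative links would need to be resolved first.
curl -s http://example.com/ \
  | grep -oE 'href="https?://[^"]+\.html"' \
  | sed -E 's/^href="//; s/"$//' \
  | sort -u \
  | xargs -n1 wget --no-clobber
```

Because the filtering happens before wget is ever invoked, nothing unwanted is requested at all, which is the behaviour the suffix rules alone cannot guarantee.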
To download documentation while excluding images and archives:

```
wget -r -np -k -p \
     --accept '*.html,*.pdf' \
     --reject '*.zip,*.tar.gz,*.jpg' \
     --wait=2 --random-wait \
     --user-agent="Mozilla/5.0" \
     https://docs.example.com/manual/
```

For large-scale operations, consider:
- Using `--limit-rate=500k` to throttle bandwidth
- Adding `--no-clobber` to skip existing files
- Setting `--timeout=30` for slow servers
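Putting those flags together, one plausible invocation for a large documentation mirror might look like this (the URL, rate limit, and timeout are illustrative values, not recommendations):

```
# Illustrative combination of the large-scale options above;
# tune the rate, timeout, and target URL for your environment.
wget --recursive --level=5 --no-parent \
     --accept '*.html,*.pdf' \
     --limit-rate=500k \
     --no-clobber \
     --timeout=30 \
     https://docs.example.com/manual/
```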
Stepping back to the core issue: when using `wget --recursive` to mirror websites, many developers run into an efficiency problem. Although `--reject` keeps certain files from being saved, wget still downloads rejected HTML files before deleting them, wasting bandwidth and time. What we really need is a way to prevent wget from even requesting unwanted resources in the first place. The solution lies in combining multiple wget parameters intelligently:

```
wget --recursive --level=5 --no-parent --convert-links \
     --accept '*.html,*.php,*.css' \
     --reject '*.jpg,*.png,*.gif,*.zip,*.pdf' \
     http://example.com/
```

Key parameters explained:
- `--accept`: only follow links matching these patterns
- `--reject`: additional filter for URLs to skip
- `--level`: controls recursion depth
- `--no-parent`: prevents ascending to parent directories
For complex filtering needs, wget supports sophisticated pattern matching:
```
# Exclude all JavaScript except specific files
# (uses a PCRE lookahead, so wget must be built with libpcre support;
#  a plain --reject '*.js' cannot be selectively overridden by --accept-regex,
#  because suffix rules and regex rules are applied together)
wget --recursive --regex-type=pcre \
     --reject-regex '/(?!main\.js$|utils\.js$)[^/]+\.js$' \
     http://example.com/

# Mirror only documentation paths
wget --recursive --include-directories='/docs,/manual' \
     --exclude-directories='blog,forum' \
     http://example.com/
```
For scenarios requiring more complex rules, consider these alternatives:
```
# Using httrack with filters
httrack http://example.com/ -O /mirror -%v \
        '-*.jpg' '-*.png' '-*/temp/*' '+*/articles/*'

# Custom script with curl + jq
# (assumes a sitemap.json whose entries expose "type" and "url" fields)
curl -s http://example.com/sitemap.json | \
  jq -r '.links[] | select(.type == "text/html") | .url' | \
  xargs -n1 wget
```
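Since most sites publish an XML sitemap rather than JSON, a comparable sketch for that case (assuming a conventional `sitemap.xml` listing absolute `<loc>` URLs) could be:

```
# Hypothetical variant for a standard XML sitemap: extract <loc> URLs,
# keep only HTML pages, and fetch them one by one.
curl -s http://example.com/sitemap.xml \
  | grep -oE '<loc>[^<]+</loc>' \
  | sed -E 's#</?loc>##g' \
  | grep -E '\.html$' \
  | xargs -n1 wget --no-clobber
```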
When mirroring large sites, these optimizations help:
- Use `--wait` to avoid overloading servers
- Set `--random-wait` to appear more human-like
- Combine with `--limit-rate=500k` for bandwidth control
- Leave DNS caching on (the default); `--no-dns-cache` disables it
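As a closing sketch, the politeness options above can be combined into a single mirroring command (depth, delay, and rate limit are placeholder values to tune for the target server):

```
# Polite mirroring sketch: pauses between requests, randomizes the
# delay, and caps bandwidth. Adjust the numbers for your target.
wget --recursive --level=3 --no-parent \
     --wait=2 --random-wait \
     --limit-rate=500k \
     http://example.com/
```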