Mastering Selective Recursive Downloads with wget
When mirroring websites with wget --recursive, we often run into a common frustration: rejection rules are applied later than you might expect. Pages with an HTML suffix are downloaded even when they match the reject list, because wget needs them to discover further links, and are only deleted afterwards. On large sites this wastes bandwidth and time. The solution lies in combining several wget parameters:
wget --recursive --level=5 --no-parent \
     --accept '*.html,*.htm,*.php' \
     --reject '*.jpg,*.png,*.gif,*.js,*.css' \
     --domains example.com \
     http://example.com/
- --accept: Whitelist specific file extensions
- --reject: Blacklist unwanted file types
- --level: Control recursion depth
- --no-parent: Stay within the specified directory

For more complex filtering, use regular expressions with --accept-regex and --reject-regex:

wget -r -l2 --no-parent \
     --accept-regex '/blog/202[0-9]/.*\.html$' \
     --reject-regex '.*/temp/.*' \
     http://example.com/blog/
| Tool | Advantage | Disadvantage |
|------|-----------|--------------|
| wget | Built-in, no installation needed | Limited regex support |
| httrack | GUI available, better filtering | Larger footprint |
| curl + grep | Maximum flexibility | Requires scripting |
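The "curl + grep" row refers to scripting the link extraction yourself. A minimal sketch, assuming a page on example.com whose links appear as absolute href attributes (the URL and pattern are illustrative, not a general-purpose crawler):

# Extract .html links from one page and fetch them (illustrative only:
# real pages often use relative links and need more robust parsing)
curl -s http://example.com/ \
  | grep -oE 'href="https?://example\.com/[^"]*\.html"' \
  | cut -d'"' -f2 \
  | sort -u \
  | xargs -n1 wget --no-clobber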
To download documentation while excluding images and archives:

wget -r -np -k -p \
     --accept '*.html,*.pdf' \
     --reject '*.zip,*.tar.gz,*.jpg' \
     --wait=2 --random-wait \
     --user-agent="Mozilla/5.0" \
     https://docs.example.com/manual/

For large-scale operations, consider:
- Using --limit-rate=500k to throttle bandwidth
- Adding --no-clobber to skip existing files
- Setting --timeout=30 for slow servers
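As a sketch, these options can be added to a recursive fetch like the documentation example above (the URL and values are placeholders):

# Throttled re-run that skips files already on disk and gives up on
# connections that stall for more than 30 seconds
wget --recursive --no-parent \
     --accept '*.html,*.pdf' \
     --limit-rate=500k --no-clobber --timeout=30 \
     https://docs.example.com/manual/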
To restate the efficiency problem: --reject on its own only keeps matching files from being saved; HTML pages that match it are still downloaded before deletion, and the crawl still descends into parts of the site you do not care about. What we really want is to prevent wget from requesting unwanted resources in the first place, which means combining multiple parameters intelligently:
wget --recursive --level=5 --no-parent --convert-links \
     --accept '*.html,*.php,*.css' \
     --reject '*.jpg,*.png,*.gif,*.zip,*.pdf' \
     http://example.com/

Key parameters explained:
- --accept: Only follow links matching these patterns
- --reject: Additional filter for URLs to skip
- --level: Controls recursion depth
- --no-parent: Prevents ascending to parent directories
For complex filtering needs, wget supports sophisticated pattern matching:
# Exclude all JavaScript except specific files (a plain --reject '*.js' would
# also block main.js and utils.js, since suffix lists and regex filters are
# applied independently; needs a wget build with PCRE for the lookahead)
wget --recursive --regex-type=pcre \
     --reject-regex '/(?!main\.js$|utils\.js$)[^/]*\.js$' \
     http://example.com/
# Mirror only documentation paths
wget --recursive --include-directories='/docs,/manual' \
     --exclude-directories='/blog,/forum' \
http://example.com/
For scenarios requiring more complex rules, consider these alternatives:
# Using httrack with filters
httrack http://example.com/ -O /mirror -%v \
'-*.jpg' '-*.png' '-*/temp/*' '+*/articles/*'
# Custom script with curl + jq (assumes a JSON sitemap whose links[] entries
# carry .type and .url fields; adjust the filter to the real structure)
curl -s http://example.com/sitemap.json | \
  jq -r '.links[] | select(.type == "text/html") | .url' | \
  xargs -n1 wget
When mirroring large sites, these optimizations help:
- Use --wait (e.g. --wait=2) to avoid overloading servers
- Set --random-wait to vary the delay and appear less robotic
- Combine with --limit-rate=500k for bandwidth control
- Note that DNS lookups are cached by default; --no-dns-cache turns this off if a host's address changes during the crawl
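Putting the politeness options together, a throttled mirror run might look like this (the URL, depth, delay, and rate limit are placeholders):

# Polite recursive mirror: pause between requests, randomize the delay,
# and cap the transfer rate
wget --recursive --level=3 --no-parent \
     --wait=2 --random-wait \
     --limit-rate=500k \
     http://example.com/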