Mastering Selective Recursive Downloads with wget
When mirroring websites with `wget --recursive`, a common frustration is that rejection rules are applied only after a page has already been fetched: wget still downloads HTML files that match the `--reject` list (so it can harvest their links) and deletes them afterwards. This wastes bandwidth and time, especially when dealing with large sites. The solution lies in combining several wget parameters:

```
wget --recursive --level=5 --no-parent \
     --accept '*.html,*.htm,*.php' \
     --reject '*.jpg,*.png,*.gif,*.js,*.css' \
     --domains example.com \
     http://example.com/
```
- `--accept`: whitelist specific file extensions
- `--reject`: blacklist unwanted file types
- `--level`: control recursion depth
- `--no-parent`: stay within the specified directory

For more complex filtering, use regular expressions with `--accept-regex` and `--reject-regex`:

```
wget -r -l2 --no-parent \
     --accept-regex '/blog/202[0-9]/.*\.html$' \
     --reject-regex '.*/temp/.*' \
     http://example.com/blog/
```
| Tool        | Advantage                        | Disadvantage          |
|-------------|----------------------------------|-----------------------|
| wget        | Built-in, no installation needed | Limited regex support |
| httrack     | GUI available, better filtering  | Larger footprint      |
| curl + grep | Maximum flexibility              | Requires scripting    |
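As a rough illustration of the "curl + grep" row (the URL is a placeholder, and the sketch assumes the page links with absolute `href` attributes ending in `.html`), such a pipeline might look like:

```
# Sketch only: list HTML links from one page and fetch them with wget.
# Assumes absolute URLs; relative links would need to be resolved first.
curl -s http://example.com/ \
  | grep -oE 'href="https?://[^"]+\.html"' \
  | sed -E 's/^href="//; s/"$//' \
  | sort -u \
  | xargs -n1 wget --no-clobber
```

Because the filtering happens before wget is ever invoked, nothing unwanted is requested at all, which is the behaviour the suffix rules alone cannot guarantee.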
To download documentation while excluding images and archives:

```
wget -r -np -k -p \
     --accept '*.html,*.pdf' \
     --reject '*.zip,*.tar.gz,*.jpg' \
     --wait=2 --random-wait \
     --user-agent="Mozilla/5.0" \
     https://docs.example.com/manual/
```

For large-scale operations, consider:
- Using `--limit-rate=500k` to throttle bandwidth
- Adding `--no-clobber` to skip existing files
- Setting `--timeout=30` for slow servers
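Putting those flags together, one plausible invocation for a large documentation mirror might look like this (the URL, rate limit, and timeout are illustrative values, not recommendations):

```
# Illustrative combination of the large-scale options above;
# tune the rate, timeout, and target URL for your environment.
wget --recursive --level=5 --no-parent \
     --accept '*.html,*.pdf' \
     --limit-rate=500k \
     --no-clobber \
     --timeout=30 \
     https://docs.example.com/manual/
```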
Stepping back to the core issue: when using `wget --recursive` to mirror websites, many developers run into an efficiency problem. Although `--reject` keeps certain files from being saved, wget still downloads rejected HTML files before deleting them, wasting bandwidth and time. What we really need is a way to prevent wget from even requesting unwanted resources in the first place. The solution lies in combining multiple wget parameters intelligently:

```
wget --recursive --level=5 --no-parent --convert-links \
     --accept '*.html,*.php,*.css' \
     --reject '*.jpg,*.png,*.gif,*.zip,*.pdf' \
     http://example.com/
```

Key parameters explained:
- `--accept`: only follow links matching these patterns
- `--reject`: additional filter for URLs to skip
- `--level`: controls recursion depth
- `--no-parent`: prevents ascending to parent directories
For complex filtering needs, wget supports sophisticated pattern matching:
```
# Exclude all JavaScript except specific files
# (uses a PCRE lookahead, so wget must be built with libpcre support;
#  a plain --reject '*.js' cannot be selectively overridden by --accept-regex,
#  because suffix rules and regex rules are applied together)
wget --recursive --regex-type=pcre \
     --reject-regex '/(?!main\.js$|utils\.js$)[^/]+\.js$' \
     http://example.com/

# Mirror only documentation paths
wget --recursive --include-directories='/docs,/manual' \
     --exclude-directories='blog,forum' \
     http://example.com/
```
For scenarios requiring more complex rules, consider these alternatives:
```
# Using httrack with filters
httrack http://example.com/ -O /mirror -%v \
        '-*.jpg' '-*.png' '-*/temp/*' '+*/articles/*'

# Custom script with curl + jq
# (assumes a sitemap.json whose entries expose "type" and "url" fields)
curl -s http://example.com/sitemap.json | \
  jq -r '.links[] | select(.type == "text/html") | .url' | \
  xargs -n1 wget
```
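Since most sites publish an XML sitemap rather than JSON, a comparable sketch for that case (assuming a conventional `sitemap.xml` listing absolute `<loc>` URLs) could be:

```
# Hypothetical variant for a standard XML sitemap: extract <loc> URLs,
# keep only HTML pages, and fetch them one by one.
curl -s http://example.com/sitemap.xml \
  | grep -oE '<loc>[^<]+</loc>' \
  | sed -E 's#</?loc>##g' \
  | grep -E '\.html$' \
  | xargs -n1 wget --no-clobber
```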
When mirroring large sites, these optimizations help:
- Use `--wait` to avoid overloading servers
- Set `--random-wait` to appear more human-like
- Combine with `--limit-rate=500k` for bandwidth control
- Leave DNS caching on (the default); `--no-dns-cache` disables it
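As a closing sketch, the politeness options above can be combined into a single mirroring command (depth, delay, and rate limit are placeholder values to tune for the target server):

```
# Polite mirroring sketch: pauses between requests, randomizes the
# delay, and caps bandwidth. Adjust the numbers for your target.
wget --recursive --level=3 --no-parent \
     --wait=2 --random-wait \
     --limit-rate=500k \
     http://example.com/
```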