How to Use wget Recursive Download with Selective Link Following (Avoid Unwanted Files)


Mastering Selective Recursive Downloads with wget

When mirroring websites with wget --recursive, a common frustration is that wget can download unwanted files anyway, notably HTML pages it must fetch in order to discover further links, and only applies the rejection rules afterwards, deleting the files once they are on disk. This wastes bandwidth and time, especially on large sites.

The solution lies in combining several wget parameters:

wget --recursive --level=5 --no-parent \
     --accept '*.html,*.htm,*.php' \
     --reject '*.jpg,*.png,*.gif,*.js,*.css' \
     --domains example.com \
     http://example.com/

  • --accept: Whitelist of file suffixes or patterns to keep
  • --reject: Blacklist of unwanted file types
  • --level: Maximum recursion depth
  • --no-parent: Never ascend above the starting directory

For more complex filtering, use regular expressions with --accept-regex and --reject-regex:

wget -r -l2 --no-parent \
     --accept-regex '/blog/202[0-9]/.*\.html$' \
     --reject-regex '.*/temp/.*' \
     http://example.com/blog/

If wget's built-in filtering is not flexible enough, it helps to compare the common alternatives:

  Tool          Advantage                          Disadvantage
  wget          Built-in, no installation needed   Limited regex support
  httrack       GUI available, better filtering    Larger footprint
  curl + grep   Maximum flexibility                Requires scripting
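
The curl + grep row assumes you script the link extraction yourself. A minimal sketch of that approach, with a placeholder URL and a deliberately simple pattern that only catches absolute .html links, might look like this:

# Fetch one page, pull out absolute .html links, and hand them to wget.
# No recursion, no relative-link handling; adjust the pattern to the site.
curl -s http://example.com/ \
  | grep -oE 'href="https?://example\.com/[^"]*\.html"' \
  | sed 's/^href="//; s/"$//' \
  | xargs -n1 wget --no-clobber

For anything beyond a single page, wget's own recursion or httrack is usually the better fit.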

To download documentation while excluding images and archives:

wget -r -np -k -p \
     --accept '*.html,*.pdf' \
     --reject '*.zip,*.tar.gz,*.jpg' \
     --wait=2 --random-wait \
     --user-agent="Mozilla/5.0" \
     https://docs.example.com/manual/

For large-scale operations, consider the following options, combined in the sketch after this list:

  • Using --limit-rate=500k to throttle bandwidth
  • Adding --no-clobber to skip existing files
  • Setting --timeout=30 for slow servers
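
As a rough sketch (the URL is a placeholder), those three options slot into the recursive commands shown earlier like this:

# Throttle to about 500 KB/s, skip files already on disk, and give up on
# connections that stall for more than 30 seconds.
wget -r -l5 --no-parent \
     --limit-rate=500k \
     --no-clobber \
     --timeout=30 \
     https://downloads.example.com/archive/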


When using wget --recursive to mirror websites, many developers run into the same efficiency problem: the --reject option prevents certain files from being kept, but wget may still download them before deleting them, wasting bandwidth and time. What we really need is a way to stop wget from even requesting unwanted resources in the first place.

The solution lies in combining multiple wget parameters intelligently:

wget --recursive --level=5 --no-parent --convert-links \
     --accept '*.html,*.php,*.css' \
     --reject '*.jpg,*.png,*.gif,*.zip,*.pdf' \
     http://example.com/

Key parameters explained:

  • --accept: Only follow links matching these patterns
  • --reject: Additional filter for URLs to skip
  • --level: Controls recursion depth
  • --no-parent: Prevents ascending to parent directories

For complex filtering needs, wget supports sophisticated pattern matching:

# Exclude all JavaScript except main.js and utils.js
# (the negative lookahead requires wget built with PCRE: --regex-type=pcre)
wget --recursive --regex-type=pcre \
     --reject-regex '/(?!(main|utils)\.js$)[^/]+\.js$' \
     http://example.com/

# Mirror only documentation paths
wget --recursive --include-directories='/docs,/manual' \
     --exclude-directories='/blog,/forum' \
     http://example.com/

For scenarios requiring more complex rules, consider these alternatives:

# Using httrack with filters
httrack http://example.com/ -O /mirror -%v \
        '-*.jpg' '-*.png' '-*/temp/*' '+*/articles/*'

# Custom script with curl + jq (assumes a JSON sitemap whose entries
# carry "type" and "url" fields; adapt the paths to your actual format)
curl -s http://example.com/sitemap.json | \
jq -r '.links[] | select(.type == "text/html") | .url' | \
xargs -n1 wget

When mirroring large sites, these optimizations help, combined in the sketch after this list:

  • Use --wait to avoid overloading servers
  • Set --random-wait to appear more human-like
  • Combine with --limit-rate=500k for bandwidth control
  • Note that wget caches DNS lookups by default; pass --no-dns-cache only if a host's address may change during the run
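
As a sketch, the politeness options above can be bolted onto any of the earlier recursive commands, for example:

# Pause between requests (randomized around the 1-second base) and
# cap the transfer rate at 500 KB/s.
wget -r --no-parent \
     --wait=1 --random-wait \
     --limit-rate=500k \
     http://example.com/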