When attempting to scrape websites using wget, many developers encounter the challenge of selectively downloading only HTML/PHP content while excluding media files and stylesheets. The default behavior often either grabs everything or misses dynamic pages.
Here's the optimized command that handles both static and dynamic pages:
wget --recursive --level=inf --no-parent --timestamping \
--convert-links --adjust-extension \
--no-check-certificate \
--reject-regex '\.(css|js|jpg|jpeg|gif|png|mp4|webm|svg|woff2?|ttf|eot|ico)(\?.*)?$' \
--accept-regex '\.(html?|php|aspx?|jsp)($|\?)' \
--user-agent="Mozilla/5.0" "https://example.com"
--reject-regex: lists every extension to exclude in a single pattern. The trailing (\?.*)?$ anchor matters: without it the js alternative would also match .jsp pages, and versioned assets such as style.css?v=3 would slip through.
--accept-regex: keeps the common page extensions, including PHP and other server-side files; the ($|\?) suffix lets URLs with query strings (page.php?id=1) still match. One caveat: --accept-regex is tested against every URL wget considers, including directory-style links ending in /, so on sites that navigate through extension-less URLs you may need to loosen the pattern or rely on --reject-regex alone.
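Before committing to a long crawl, it can be worth sanity-checking the reject pattern against a few sample URLs. A rough check, assuming GNU grep (the URLs are made up):

printf '%s\n' \
  "https://example.com/index.php?id=3" \
  "https://example.com/assets/app.js" \
  "https://example.com/page.jsp" \
| grep -E '\.(css|js|jpg|jpeg|gif|png|mp4|webm|svg|woff2?|ttf|eot|ico)(\?.*)?$'

Only the app.js line should print; if page.jsp shows up, the closing anchor is missing from the pattern.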
For sites that put query parameters in their URLs, a few additional options help; the first two keep the crawl scoped to the right host and paths, and --restrict-file-names=windows controls how awkward characters are written into local filenames (an example of the effect follows the options):
--span-hosts --domains example.com \
--include-directories="/path1,/path2" \
--restrict-file-names=windows
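Of these, --restrict-file-names=windows is the one that actually deals with query strings: wget substitutes @ for the ? that introduces the query portion when building local filenames, so dynamic pages save cleanly. A small illustration (the URL is made up):

wget --restrict-file-names=windows "https://example.com/viewtopic.php?f=2&t=15"
# saved locally as viewtopic.php@f=2&t=15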
Here's how to scrape a WordPress site while avoiding media:
wget --recursive --level=3 --no-parent \
--reject "*.jpg,*.jpeg,*.gif,*.png,*.css,*.js,*.zip,*.mp4" \
--accept "*.html,*.htm,*.php" \
--wait=2 --random-wait \
--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
"https://wordpress-site.com/blog/"
When mirroring websites for offline analysis or archival purposes, we often need just the textual content - HTML pages, PHP scripts, or other server-side rendered content - without bloating storage with media assets. Wget's default recursive download behavior grabs everything unless properly constrained.
The key is using -A (accept) and -R (reject) flags with precise file extensions:
wget --recursive --level=inf --no-parent --convert-links \
--adjust-extension --span-hosts --domains=example.com \
--timestamping --random-wait \
--user-agent="Mozilla/5.0" \
-A "*.html,*.htm,*.php,*.asp,*.aspx,*.jsp" \
-R "*.jpg,*.jpeg,*.gif,*.png,*.svg,*.css,*.js,*.mp4,*.webm,*.ogg" \
https://example.com/
The command above combines several important techniques:
- Multiple -A patterns cover dynamic pages (PHP/ASP/JSP)
- A comprehensive -R list blocks the common media formats
- --adjust-extension (the current name for the older --html-extension) appends .html to server-side rendered pages such as .php so they open correctly offline (see the check below)
- --timestamping re-downloads a page only when the remote copy is newer; note that wget refuses to run it together with --no-clobber
- --span-hosts stays constrained by --domains, so the crawl cannot wander off to unrelated hosts
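A quick way to confirm the renaming worked after a crawl, assuming the default host-named output directory:

find example.com -name "*.php.html" | head

Each hit is a PHP page that was served as HTML and renamed by --adjust-extension; --convert-links points internal links at the renamed files.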
For more precise control, combine domain restrictions with URL regexes (directory restrictions are shown just after); note that --recursive is still required, otherwise the filters have nothing to act on:
wget --recursive --no-parent --domains=example.com \
--accept-regex '/[^/]*\.(html?|php|aspx?)($|\?)' \
--reject-regex '\.(css|js|png|jpe?g|gif)(\?.*)?$' \
https://example.com/path/
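If you already know which directories hold nothing but assets, excluding them up front with -X saves the requests entirely; the directory names below are only examples, adjust them to the target site:

wget --recursive --no-parent --domains=example.com \
--exclude-directories="/static,/assets,/wp-content/uploads" \
--accept-regex '/[^/]*\.(html?|php|aspx?)($|\?)' \
https://example.com/path/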
Modern single-page applications are a harder case: wget does not execute JavaScript, so a crawl only captures the initial HTML plus whatever pages or JSON files are linked directly in the markup. Within that limit, a command like this helps:
wget --execute robots=off --mirror --convert-links \
--adjust-extension --no-check-certificate \
--span-hosts --domains=example.com \
--accept '*.html,*.php,*.json' \
--reject '*.png,*.jpg,*.jpeg,*.gif,*.css,*.js' \
-U "Mozilla/5.0" \
https://example.com/
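Because wget cannot follow routes that exist only in JavaScript, a rough second pass is to pull JSON endpoint URLs out of the downloaded markup and feed them back to wget. This sketch assumes the endpoints appear as absolute URLs in the HTML and that wget saved into its default example.com/ directory:

grep -rhoE 'https://example\.com/[^"]+\.json' example.com/ | sort -u > json-urls.txt
wget --wait=1 --input-file=json-urls.txt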
For phpBB forums:
wget --recursive --level=2 --no-parent \
--convert-links --adjust-extension \
-A "*.html,*.php,*.htm" \
-R "*.png,*.jpg,*.css,*.js,*.gif,*.ico" \
--wait=2 --random-wait \
--user-agent="Mozilla/5.0" \
"https://forum.example.com/viewforum.php?f=1"
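phpBB boards produce many near-duplicate URLs (print views, session-id links, post anchors), so it can help to narrow the crawl to the two scripts that hold the actual forum content. The script and parameter names below are the stock phpBB ones; adjust them if the board uses URL rewriting:

wget --recursive --level=2 --no-parent \
--accept-regex 'view(forum|topic)\.php' \
--reject-regex '(view=print|sid=)' \
--wait=2 --random-wait \
--user-agent="Mozilla/5.0" \
"https://forum.example.com/viewforum.php?f=1"

The sid= filter rarely triggers, since wget keeps session cookies for the duration of a run, but it guards against the duplicate links phpBB emits for cookie-less clients.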