How to Use Wget to Download Only HTML/PHP Pages Without CSS, Images, or Other Assets


When scraping websites with wget, a common challenge is downloading only the HTML/PHP content while excluding media files, scripts, and stylesheets: the default recursive behavior either grabs everything or misses dynamic pages.

Here's a command that handles both static and dynamic pages:

wget --recursive --level=inf --no-parent --timestamping \
--convert-links --adjust-extension \
--no-check-certificate --reject-regex "\\.(css|js|jpg|jpeg|gif|png|mp4|webm|svg|woff2?|ttf|eot|ico)" \
--accept-regex "\\.(html|htm|php|asp|aspx|jsp)" \
--user-agent="Mozilla/5.0" "https://example.com"

--reject-regex: specifies every asset extension to exclude; a single alternation pattern covers all the formats at once.

--accept-regex: is matched against the complete URL, so it captures common page extensions including PHP and other server-side files. Note that pages served from extensionless paths (e.g. /about/) won't match this pattern; broaden it if the crawl stops short.
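
To preview which URLs will pass these filters before committing to a full crawl, you can do a dry run with --spider, which walks the links without saving files and records its decisions in a log (the depth and log filename below are arbitrary choices, not part of the command above):

wget --spider --recursive --level=2 --no-parent \
--reject-regex "\\.(css|js|jpg|jpeg|gif|png|mp4|webm|svg|woff2?|ttf|eot|ico)" \
--accept-regex "\\.(html|htm|php|asp|aspx|jsp)" \
--output-file=crawl-preview.log "https://example.com"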

To keep the crawl within specific hosts and directories, and to generate safe local filenames for URLs that carry query parameters, add these options:

--span-hosts --domains example.com \
--include-directories="/path1,/path2" \
--restrict-file-names=windows
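
Putting those options together with the base command might look like the following sketch; the domain and the /path1,/path2 directories are placeholders for your own site layout:

wget --recursive --level=inf --no-parent --timestamping \
--convert-links --adjust-extension \
--accept-regex "\\.(html|htm|php)" \
--reject-regex "\\.(css|js|jpg|jpeg|gif|png|ico)" \
--span-hosts --domains example.com \
--include-directories="/path1,/path2" \
--restrict-file-names=windows \
--user-agent="Mozilla/5.0" "https://example.com/path1/"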

Here's how to scrape a WordPress site while avoiding media:

wget --recursive --level=3 --no-parent \
--reject "*.jpg,*.jpeg,*.gif,*.png,*.css,*.js,*.zip,*.mp4" \
--accept "*.html,*.htm,*.php" \
--wait=2 --random-wait \
--user-agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)" \
"https://wordpress-site.com/blog/"

When mirroring websites for offline analysis or archival purposes, we often need just the textual content - HTML pages, PHP scripts, or other server-side rendered content - without bloating storage with media assets. Wget's default recursive download behavior grabs everything unless properly constrained.

The key is using -A (accept) and -R (reject) flags with precise file extensions:

wget --recursive --level=inf --no-parent --convert-links \
--adjust-extension --timestamping --random-wait \
--user-agent="Mozilla/5.0" \
-A "*.html,*.htm,*.php,*.asp,*.aspx,*.jsp" \
-R "*.jpg,*.jpeg,*.gif,*.png,*.svg,*.css,*.js,*.mp4,*.webm,*.ogg" \
https://example.com/

The command above combines several important techniques:

  • Multiple -A patterns cover dynamic pages (PHP/ASP/JSP)
  • Comprehensive -R list blocks common media formats
  • --adjust-extension (the current name for the older --html-extension) appends .html to server-side pages such as PHP so they open correctly offline; see the quick check after this list
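
A quick way to confirm the renaming is to list the adjusted files once the crawl finishes; this assumes wget's default layout of saving into a directory named after the host:

find example.com/ -type f -name '*.php.html'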

For more precise control, combine domain and directory restrictions with URL regexes; the ($|\?) in the accept pattern also matches pages reached through query strings:

wget --recursive --domains=example.com --no-parent \
--accept-regex '/[^/]*\.(html?|php|aspx?)($|\?)' \
--reject-regex '\.(css|js|png|jpe?g|gif)(\?.*)?$' \
https://example.com/path/

For modern single-page applications (SPAs), keep in mind that wget does not execute JavaScript, so it only captures the initial HTML plus whatever page or JSON URLs appear directly in the markup. A starting point might be:

wget --execute robots=off --mirror --convert-links \
--adjust-extension --no-check-certificate \
--span-hosts --domains=example.com \
--accept '*.html,*.php,*.json' \
--reject '*.png,*.jpg,*.jpeg,*.gif,*.css,*.js' \
-U "Mozilla/5.0" \
https://example.com/

For phpBB forums:

wget --recursive --level=2 --no-parent \
--convert-links --adjust-extension \
-A "*.html,*.php,*.htm" \
-R "*.png,*.jpg,*.css,*.js,*.gif,*.ico" \
--wait=2 --random-wait \
--user-agent="Mozilla/5.0" \
"https://forum.example.com/viewforum.php?f=1"
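
As a final sanity check after any of these crawls, you can tally the extensions that actually landed on disk and confirm that no media slipped through; this again assumes wget's default host-named output directory:

find forum.example.com/ -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c | sort -rn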