When examining Apache error logs, you might frequently encounter entries like:
[access_compat:error] [pid 5059] [client 180.76.15.138:58811]
AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/de/mod/module-dict.html
These typically originate from legitimate search engine crawlers (Googlebot, Baiduspider) attempting to access documentation paths that don't exist in your web root.
Search engines maintain historical records of previously indexed URLs. Even after server reconfiguration, they may continue attempting to crawl:
- Legacy Apache documentation paths from default installations
- URLs that existed in previous website versions
- Common documentation locations across web servers
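If you want to see which crawlers and paths are behind these entries, they are easy to aggregate from the error log. A minimal Python sketch (the regex targets the AH01797 format shown above and is an assumption about your ErrorLogFormat):

```python
import re
from collections import Counter

# Matches the "[client IP:port] ... AH01797: client denied ..." format
# shown above; adjust if your ErrorLogFormat differs.
DENIED_RE = re.compile(
    r"\[client (?P<ip>[\d.]+):\d+\]\s*"
    r"AH01797: client denied by server configuration: (?P<path>\S+)"
)

def tally_denied(lines):
    """Count (client IP, denied filesystem path) pairs across log lines."""
    counts = Counter()
    for line in lines:
        match = DENIED_RE.search(line)
        if match:
            counts[(match.group("ip"), match.group("path"))] += 1
    return counts

sample = [
    "[access_compat:error] [pid 5059] [client 180.76.15.138:58811] "
    "AH01797: client denied by server configuration: "
    "/usr/share/doc/apache2-doc/manual/de/mod/module-dict.html",
]
print(tally_denied(sample))
```

Feeding it the real log (e.g. the lines of /var/log/apache2/error.log) shows at a glance whether the denials come from a handful of spider IPs or something broader.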
The surprising path mapping occurs because Apache's default configuration often includes alias directives like:
Alias /manual /usr/share/doc/apache2-doc/manual
Even when not explicitly defined in your vhost, these may be inherited from the main configuration files (typically under /etc/apache2/conf-enabled/ or similar).
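On Debian and Ubuntu, for instance, that alias usually comes from a snippet shipped by the apache2-doc package. The file name and exact contents below are assumptions and vary by release:

```apache
# /etc/apache2/conf-available/apache2-doc.conf (assumed location)
Alias /manual /usr/share/doc/apache2-doc/manual

<Directory "/usr/share/doc/apache2-doc/manual">
    Options Indexes FollowSymlinks
    AllowOverride None
    Require all granted
</Directory>
```

If you don't need the local documentation, disabling that snippet (e.g. `sudo a2disconf apache2-doc` followed by a reload) removes the alias at the source.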
Option 1: Explicitly deny documentation access

<Directory "/usr/share/doc">
    Require all denied
    Options None
    AllowOverride None
</Directory>
Option 2: Rewrite rules for crawlers
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Baiduspider) [NC]
RewriteRule ^/manual/ - [G]
Option 3: Virtual host hardening

DocumentRoot "/var/www/example.com/public"

# An Alias inherited from the main server config cannot be unset with an
# empty Alias directive; return 404 for the path instead:
RedirectMatch 404 ^/manual

<Directory "/var/www/example.com/public">
    Require all granted
    Options -Indexes
</Directory>
While these 403 errors might seem concerning, they generally don't negatively impact your site's ranking. However, to ensure optimal crawling:
- Submit an updated sitemap via Google Search Console
- Monitor crawl stats for your actual content
- Consider adding robots.txt directives for the documentation paths
To check the parsed virtual host settings and confirm mod_alias is loaded:
apache2ctl -S
apache2ctl -M | grep alias
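Those switches report vhost settings and module state, but to find the Alias directives themselves you can grep the configuration tree. A sketch, demonstrated on a scratch directory so the commands run anywhere; on a real server, point conf_dir at /etc/apache2 (or your distribution's layout):

```shell
# Find Alias/ScriptAlias directives anywhere in an Apache config tree.
# conf_dir is a throwaway directory here purely for illustration.
conf_dir=$(mktemp -d)
printf 'Alias /manual /usr/share/doc/apache2-doc/manual\n' \
    > "$conf_dir/apache2-doc.conf"
grep -RInE '^[[:space:]]*(Alias|ScriptAlias)' "$conf_dir"
```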
For a comprehensive solution, combine server configuration with proper HTTP status codes to guide crawlers away from non-existent resources.
For over a month now, I've observed curious patterns in my Apache 2.4.7 logs where search engine spiders (specifically Googlebot and Baiduspider) attempt to access documentation paths that don't exist on my production server:
[Wed Jun 24 16:13:34.430884 2015] [access_compat:error] [pid 5059] [client 180.76.15.138:58811]
AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/de/mod/module-dict.html
The most puzzling aspect is how requests to /manual/de/mod/... get mapped to /usr/share/doc/apache2-doc/manual/de/mod/... instead of the expected /var/www/example.com/public/manual/de/mod/.... This occurs because:
- The default Apache configuration often includes Alias directives for documentation
- Some Linux distributions pre-configure these paths for local documentation
- The spiders may have cached old documentation URLs from other servers
After investigating, I found these spiders are making legitimate requests (confirmed via reverse DNS), but to non-existent resources. Here's a sample virtual host configuration that might help others:
<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com
    DocumentRoot "/var/www/example.com/public"

    # Explicitly deny access to system documentation
    <Directory "/usr/share/doc">
        Require all denied
        Options None
        AllowOverride None
    </Directory>

    # Main website directory
    <Directory "/var/www/example.com/public">
        Require all granted
        Options FollowSymLinks
        AllowOverride FileInfo
    </Directory>
</VirtualHost>
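The reverse-DNS confirmation mentioned above follows the double-lookup procedure both Google and Baidu document: reverse-resolve the IP, check the hostname's domain, then forward-resolve that hostname and make sure it maps back to the same IP. A minimal sketch in Python; the helper name and stub resolvers are mine, not part of any API:

```python
import socket

# Domains Google and Baidu document for their crawlers' reverse DNS.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".baidu.com", ".baidu.jp")

def is_verified_crawler(ip, reverse=socket.gethostbyaddr,
                        forward=socket.gethostbyname):
    """Double-lookup test: reverse-resolve the IP, check the domain,
    then confirm the forward lookup maps back to the same IP."""
    try:
        hostname = reverse(ip)[0]
    except OSError:
        return False
    if not hostname.endswith(CRAWLER_DOMAINS):
        return False
    try:
        return forward(hostname) == ip
    except OSError:
        return False

# Example with stub resolvers so no network is needed; in real use the
# defaults perform actual DNS lookups.
fake_reverse = lambda ip: ("crawl-66-249-66-1.googlebot.com", [], [ip])
fake_forward = lambda host: "66.249.66.1"
print(is_verified_crawler("66.249.66.1", fake_reverse, fake_forward))  # True
```

The forward step matters: anyone can point reverse DNS for their own IP at a googlebot.com-looking name, but they cannot make Google's forward records answer back with that IP.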
For those experiencing similar issues, consider these approaches:
- Explicitly deny access to system documentation paths in your Apache config
- Create robots.txt entries to discourage spidering of these paths
- Monitor logs but don't panic - these are normal 403 responses
Example robots.txt addition:
User-agent: *
Disallow: /manual/
Disallow: /doc/
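You can sanity-check rules like these before deploying them with Python's standard urllib.robotparser:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt rules proposed above, as a string for local testing.
rules = """\
User-agent: *
Disallow: /manual/
Disallow: /doc/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Documentation paths are blocked, normal content is not.
print(parser.can_fetch("Googlebot",
                       "https://example.com/manual/de/mod/module-dict.html"))  # False
print(parser.can_fetch("Googlebot", "https://example.com/index.html"))  # True
```

Keep in mind that robots.txt only discourages future crawling; URLs already queued by a spider may still be requested for a while.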
While these 403 errors won't affect your site's functionality, they can:
- Create unnecessary log entries
- Waste server resources on invalid requests
- Potentially slow down legitimate crawling
The solution I implemented was to add specific deny rules while keeping proper access to the actual website content, which preserves security and reduces log clutter.