When implementing robots.txt rules across domains and subdomains, it's crucial to understand that each subdomain maintains its own robots.txt file. The robots.txt file at example.com/robots.txt has no effect on subdomain.example.com; search engines treat them as completely separate entities.
To achieve your goal of allowing main domain crawling while blocking subdomains:
1. For your main domain (example.com):
User-agent: *
Allow: /
2. For each subdomain (subdomain.example.com):
User-agent: *
Disallow: /
The effectiveness of this approach depends on proper server configuration:
- Each subdomain must have its own web root directory
- Each subdomain must be configured to serve its own robots.txt file
- DNS must properly resolve all subdomains
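To sanity-check the last two requirements, a short shell pass over each host (hostnames are the examples used in this article) can confirm that DNS resolves and that a robots.txt is actually served:
for host in example.com subdomain.example.com; do
    echo "== ${host} =="
    dig +short "${host}"
    curl -s -o /dev/null -w '%{http_code}\n' "https://${host}/robots.txt"
done
A DNS answer plus an HTTP 200 for each host is the baseline; anything else points to a DNS or virtual-host problem rather than a robots.txt problem.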
For a more sophisticated setup using Apache's virtual hosts:
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/main
    <Directory "/var/www/main">
        AllowOverride All
    </Directory>
</VirtualHost>

<VirtualHost *:80>
    ServerName subdomain.example.com
    DocumentRoot /var/www/subdomain
    <Directory "/var/www/subdomain">
        AllowOverride All
    </Directory>
</VirtualHost>
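Assuming the DocumentRoot paths from the configuration above, each root then needs its own robots.txt. A minimal sketch for seeding them:
mkdir -p /var/www/main /var/www/subdomain
printf 'User-agent: *\nAllow: /\n' > /var/www/main/robots.txt
printf 'User-agent: *\nDisallow: /\n' > /var/www/subdomain/robots.txt
After reloading Apache, example.com/robots.txt and subdomain.example.com/robots.txt should each be served from their own root.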
Always verify your setup using:
- Google Search Console's robots.txt tester
- Direct URL access (e.g., subdomain.example.com/robots.txt)
- Command line tools like curl or wget
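For the command-line check, two curl requests against the hostnames used above are enough to confirm that each host returns its own file:
curl -s https://example.com/robots.txt
curl -s https://subdomain.example.com/robots.txt
The first response should contain Allow: / and the second Disallow: /. If both hosts return identical content, the subdomain is most likely falling through to the main domain's virtual host rather than serving its own file.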
For complete exclusion, consider combining robots.txt with:
<meta name="robots" content="noindex">
And using X-Robots-Tag in HTTP headers:
Header set X-Robots-Tag "noindex, nofollow"
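To confirm the header is actually being sent, a quick curl check against any URL on the affected host works (the hostname is a placeholder):
curl -sI https://subdomain.example.com/ | grep -i x-robots-tag
One caveat: crawlers can only read a noindex meta tag or X-Robots-Tag header on pages they are allowed to fetch, so these directives only take effect on URLs that robots.txt does not block.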
A common misconception worth repeating: placing a robots.txt at the root of your main domain does not affect crawling of its subdomains. Each subdomain is interpreted as a separate entity.
To effectively block crawling on subdomains while allowing the main domain:
- Create individual robots.txt files for each subdomain
- Place them in the root directory of each subdomain
- Use the following directive:
User-agent: *
Disallow: /
A well-organized deployment might look like:
main-domain.com/
├── robots.txt (allows crawling)
subdomain.main-domain.com/
├── robots.txt (blocks crawling)
dev.main-domain.com/
├── robots.txt (blocks crawling)
Use these methods to verify your implementation:
- Google Search Console's robots.txt Tester
- Direct URL access: https://subdomain.yourdomain.com/robots.txt
- Crawl simulation tools like Screaming Frog
For complex setups:
# Block specific paths on subdomains while allowing others
User-agent: *
Disallow: /private/
Allow: /public/
Remember that robots.txt directives are suggestions, not enforced rules. For stronger protection, consider implementing password protection or IP whitelisting.
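As a concrete example, HTTP basic authentication on a staging subdomain can be set up with htpasswd (shipped in apache2-utils on Debian-based systems); the username and file path below are placeholders, and the matching AuthType Basic / AuthUserFile / Require valid-user directives would go in that subdomain's virtual host or .htaccess:
# Create the credential store; htpasswd prompts for the password
sudo htpasswd -c /etc/apache2/.htpasswd staging-user
Unlike robots.txt, this blocks all unauthenticated access, crawlers and humans alike.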
Common mistakes to avoid:
- Assuming every crawler supports wildcards (Google and Bing honor * and $ in paths, but many other crawlers do not)
- Forgetting to upload a robots.txt to each subdomain's root
- Relying on Allow/Disallow line order for precedence (Google applies the most specific matching rule, so Allow: /public/ overrides Disallow: / regardless of which line comes first)
For large organizations with many subdomains, automate deployment:
#!/bin/bash
# Automated robots.txt deployer for subdomains
SUBDOMAINS=(staging dev test internal)

for SUB in "${SUBDOMAINS[@]}"; do
    WEBROOT="/var/www/${SUB}.yourdomain.com"
    # Write a blocking robots.txt into each subdomain's web root
    printf 'User-agent: *\nDisallow: /\n' > "${WEBROOT}/robots.txt"
done
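After a deployment like this, the same subdomain list can be reused to verify that each host actually serves the blocking file (hostnames follow the script's yourdomain.com placeholder):
#!/bin/bash
# Verify each subdomain serves a robots.txt containing "Disallow: /"
SUBDOMAINS=(staging dev test internal)
for SUB in "${SUBDOMAINS[@]}"; do
    if curl -s "https://${SUB}.yourdomain.com/robots.txt" | grep -q '^Disallow: /'; then
        echo "${SUB}: blocking robots.txt in place"
    else
        echo "${SUB}: robots.txt missing or not blocking" >&2
    fi
done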