How to Use robots.txt to Block Crawling for Subdomains While Allowing Main Domain Indexing


When implementing robots.txt rules across domains and subdomains, it's crucial to understand that each subdomain maintains its own robots.txt file. The robots.txt file at example.com/robots.txt has no effect on subdomain.example.com - they're treated as completely separate entities by search engines.

To achieve your goal of allowing main domain crawling while blocking subdomains:

1. For your main domain (example.com):

User-agent: *
Allow: /

2. For each subdomain (subdomain.example.com):

User-agent: *
Disallow: /
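
Note that Allow: / is effectively the default: a main domain with no robots.txt at all, or with an empty Disallow: line, is also fully crawlable. The Disallow: / on each subdomain, by contrast, blocks compliant crawlers from every path on that host.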

The effectiveness of this approach depends on proper server configuration:

  • Each subdomain must have its own web root directory
  • Each subdomain must be configured to serve its own robots.txt file
  • DNS must properly resolve all subdomains

If both sites run on the same Apache server, virtual hosts give each hostname its own document root, and therefore its own robots.txt:

<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/main
    # Requests for example.com/robots.txt are served from /var/www/main/robots.txt
    <Directory "/var/www/main">
        AllowOverride All
    </Directory>
</VirtualHost>

<VirtualHost *:80>
    ServerName subdomain.example.com
    DocumentRoot /var/www/subdomain
    # Requests for subdomain.example.com/robots.txt are served from /var/www/subdomain/robots.txt
    <Directory "/var/www/subdomain">
        AllowOverride All
    </Directory>
</VirtualHost>
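
With those document roots in place, each host's robots.txt lives at the top of its own DocumentRoot. A minimal way to create the two files (paths as configured above):

# Main domain: allow everything; subdomain: block everything
printf 'User-agent: *\nAllow: /\n'    > /var/www/main/robots.txt
printf 'User-agent: *\nDisallow: /\n' > /var/www/subdomain/robots.txt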

Always verify your setup using:

  • Google Search Console's robots.txt report (which replaced the standalone robots.txt Tester)
  • Direct URL access (e.g., subdomain.example.com/robots.txt)
  • Command line tools like curl or wget (see the example below)
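
As a quick command-line check (the hostnames are placeholders for your own), fetch each file and confirm the expected directives and a 200 status come back:

# Compare the two files; only the subdomain should answer with "Disallow: /"
curl -s https://example.com/robots.txt
curl -s https://subdomain.example.com/robots.txt

# A missing robots.txt (404) is treated as "no restrictions" by most crawlers,
# so confirm the status code as well
curl -s -o /dev/null -w "%{http_code}\n" https://subdomain.example.com/robots.txt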

For complete exclusion from search results, note that a Disallow alone is not always enough: URLs blocked by robots.txt can still appear in results if other sites link to them. A noindex signal removes them, but crawlers can only see it on pages they are allowed to fetch, so do not combine it with a robots.txt Disallow for the same URLs. In the page markup:

<meta name="robots" content="noindex">

Or send the equivalent X-Robots-Tag HTTP header (shown here as an Apache mod_headers directive):

Header set X-Robots-Tag "noindex, nofollow"
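
To confirm the header is actually being sent on the subdomain (placeholder hostname), inspect the response headers from the command line:

# -I requests headers only; the grep should print the X-Robots-Tag line
curl -sI https://subdomain.example.com/ | grep -i x-robots-tag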

The key point bears repeating: search engines treat every subdomain as a separate entity for robots.txt purposes, so a robots.txt at the root of your main domain has no effect on any subdomain.

To effectively block crawling on subdomains while allowing the main domain:

  1. Create individual robots.txt files for each subdomain
  2. Place them in the root directory of each subdomain
  3. Use the following directives in each file:
User-agent: *
Disallow: /

A well-organized deployment might look like:

main-domain.com/
└── robots.txt (allows crawling)
subdomain.main-domain.com/
└── robots.txt (blocks crawling)
dev.main-domain.com/
└── robots.txt (blocks crawling)

Use these methods to verify your implementation:

  • Google Search Console's robots.txt report (formerly the robots.txt Tester)
  • Direct URL access: https://subdomain.yourdomain.com/robots.txt
  • Crawl simulation tools like Screaming Frog

For complex setups:

# Block specific paths on subdomains while allowing others
User-agent: *
Disallow: /private/
Allow: /public/
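
When Allow and Disallow rules overlap like this, Googlebot applies the most specific (longest) matching rule, so /public/ remains crawlable and /private/ stays blocked regardless of the order in which the lines appear.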

Remember that robots.txt directives are advisory, not enforced rules: reputable crawlers honor them, but nothing stops other clients from fetching the pages. For stronger protection, consider password protection or IP whitelisting, as sketched below.
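
As a sketch of the password-protection option, assuming an Apache-served staging subdomain and the htpasswd utility from the apache2-utils package:

# Create a password file with one user (you will be prompted for a password)
sudo htpasswd -c /etc/apache2/.htpasswd staging-user

# Then require authentication in the subdomain's vhost or .htaccess:
#   AuthType Basic
#   AuthName "Restricted"
#   AuthUserFile /etc/apache2/.htpasswd
#   Require valid-user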

Common mistakes to avoid:

  • Assuming wildcard patterns behave the same everywhere (Google and Bing honor * and $, but crawlers that only follow the original standard ignore them)
  • Forgetting to upload robots.txt to each subdomain's root
  • Relying on the order of Allow/Disallow lines (major crawlers resolve conflicts by rule specificity, not file order)

For large organizations with many subdomains, automate deployment:

#!/bin/bash
# Automated robots.txt deployer for subdomains
# Assumes each subdomain's document root is /var/www/<sub>.yourdomain.com
SUBDOMAINS=(staging dev test internal)
for SUB in "${SUBDOMAINS[@]}"; do
  DOCROOT="/var/www/${SUB}.yourdomain.com"
  printf 'User-agent: *\nDisallow: /\n' > "${DOCROOT}/robots.txt"
done
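
After it runs, spot-check the files over HTTP (as in the verification steps above) so a typo in a document-root path doesn't silently leave a subdomain crawlable.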