How to Use robots.txt to Block Crawling for Subdomains While Allowing Main Domain Indexing


When implementing robots.txt rules across domains and subdomains, it's crucial to understand that each subdomain maintains its own robots.txt file. The robots.txt file at example.com/robots.txt has no effect on subdomain.example.com - they're treated as completely separate entities by search engines.

To achieve your goal of allowing main domain crawling while blocking subdomains:

1. For your main domain (example.com):

User-agent: *
Allow: /

2. For each subdomain (subdomain.example.com):

User-agent: *
Disallow: /

The effectiveness of this approach depends on proper server configuration:

  • Each subdomain must have its own web root directory
  • Each subdomain must be configured to serve its own robots.txt file
  • DNS must properly resolve all subdomains
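
As a concrete illustration of the first two points, the web roots and robots.txt files used by the virtual host example below might be laid out like this (the paths are illustrative, not required):

# Illustrative only: one web root per host, each with its own robots.txt.
mkdir -p /var/www/main /var/www/subdomain
printf 'User-agent: *\nAllow: /\n' > /var/www/main/robots.txt
printf 'User-agent: *\nDisallow: /\n' > /var/www/subdomain/robots.txt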

For a more sophisticated setup using Apache's virtual hosts:

# Main domain: serves /var/www/main/robots.txt (Allow: /)
<VirtualHost *:80>
    ServerName example.com
    DocumentRoot /var/www/main
    <Directory "/var/www/main">
        AllowOverride All
    </Directory>
</VirtualHost>

# Subdomain: serves /var/www/subdomain/robots.txt (Disallow: /)
<VirtualHost *:80>
    ServerName subdomain.example.com
    DocumentRoot /var/www/subdomain
    <Directory "/var/www/subdomain">
        AllowOverride All
    </Directory>
</VirtualHost>

Always verify your setup using:

  • Google Search Console's robots.txt report (which replaced the older robots.txt Tester)
  • Direct URL access (e.g., subdomain.example.com/robots.txt)
  • Command line tools like curl or wget
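
For example, a quick curl check should show the permissive rules on the main domain and the blocking rules on the subdomain (host names are placeholders):

# Expect "Allow: /" from the main domain and "Disallow: /" from the subdomain.
curl -fsS https://example.com/robots.txt
curl -fsS https://subdomain.example.com/robots.txt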

For complete removal from search results, note that robots.txt alone is not enough: a URL blocked from crawling can still be indexed if other sites link to it, and a blocked crawler never sees an on-page noindex. If the goal is to keep subdomain pages out of the index, allow crawling and signal noindex instead, either with a meta tag:

<meta name="robots" content="noindex">

Or with an X-Robots-Tag HTTP header (an Apache directive, typically set in the subdomain's virtual host or .htaccess):

Header set X-Robots-Tag "noindex, nofollow"
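
To confirm the header is actually being sent, inspect the response headers (the host name is a placeholder):

# Print response headers and look for the X-Robots-Tag.
curl -sI https://subdomain.example.com/ | grep -i x-robots-tag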

To reiterate the key point: search engines interpret robots.txt per host, so each subdomain is treated as a separate entity. A common misconception is that a robots.txt at the root of the main domain also governs its subdomains; it does not.

To effectively block crawling on subdomains while allowing the main domain:

  1. Create individual robots.txt files for each subdomain
  2. Place them in the root directory of each subdomain
  3. Use the following directives:

User-agent: *
Disallow: /

A well-organized deployment might look like:

main-domain.com/
└── robots.txt (allows crawling)
subdomain.main-domain.com/
└── robots.txt (blocks crawling)
dev.main-domain.com/
└── robots.txt (blocks crawling)

Use these methods to verify your implementation:

  • Google Search Console's robots.txt report (formerly the robots.txt Tester)
  • Direct URL access: https://subdomain.yourdomain.com/robots.txt
  • Crawl simulation tools like Screaming Frog
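
When several subdomains are involved, a small loop can confirm that each one serves a blocking file; the subdomain and domain names below are examples only:

#!/bin/bash
# Bulk check: every listed subdomain should serve "Disallow: /" in its robots.txt.
for SUB in staging dev test internal; do
  if curl -fsS "https://$SUB.yourdomain.com/robots.txt" | grep -q '^Disallow: /'; then
    echo "$SUB: crawling blocked (OK)"
  else
    echo "$SUB: robots.txt missing or not blocking"
  fi
done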

For complex setups:

# Block specific paths on subdomains while allowing others
User-agent: *
Disallow: /private/
Allow: /public/

Remember that robots.txt directives are suggestions, not enforced rules. For stronger protection, consider implementing password protection or IP whitelisting.
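
For example, HTTP Basic authentication keeps both crawlers and casual visitors out of a subdomain entirely. A minimal sketch, assuming Apache with AllowOverride enabled (as in the virtual host example above); the paths and user name are placeholders:

# Create a credentials file (prompts for a password), then require a login
# for the subdomain's web root via .htaccess.
htpasswd -c /etc/apache2/.htpasswd-subdomain someuser
cat >> /var/www/subdomain/.htaccess <<'EOF'
AuthType Basic
AuthName "Restricted"
AuthUserFile /etc/apache2/.htpasswd-subdomain
Require valid-user
EOF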

Common mistakes to avoid:

  • Assuming every crawler honors wildcards (Google and Bing support * and $ in paths, but not all bots do)
  • Forgetting to upload a robots.txt to each subdomain's web root
  • Relying on the order of Allow/Disallow lines (Google applies the most specific matching rule, not the first one listed)

For large organizations with many subdomains, automate deployment:

#!/bin/bash
# Automated robots.txt deployer for subdomains.
# Assumes each subdomain's web root lives at /var/www/<sub>.yourdomain.com
SUBDOMAINS=(staging dev test internal)
for SUB in "${SUBDOMAINS[@]}"; do
  ROOT="/var/www/${SUB}.yourdomain.com"
  echo "User-agent: *" > "$ROOT/robots.txt"
  echo "Disallow: /" >> "$ROOT/robots.txt"
done