How to Implement User-Agent Based Rate Limiting in Nginx for Bots and Browsers


When implementing rate limiting in Nginx, we often need different rules for different types of traffic. A common requirement is to apply strict limits to regular browsers while allowing legitimate bots (like Googlebot or Bingbot) a higher request rate. The difficulty is that Nginx does not let you combine limit_req with conditional logic directly: limit_req is only valid in http, server, and location context, not inside if blocks.
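
To see the limitation concretely, the obvious first attempt below is rejected when the configuration is loaded; nginx -t aborts with a "directive is not allowed here" error. A sketch of the broken idea, for reference only:

location / {
    # Does NOT work: nginx refuses to load limit_req inside "if"
    if ($http_user_agent ~* "Googlebot") {
        limit_req zone=bot burst=20 nodelay;
    }
}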

Here's a working configuration that implements user-agent based rate limiting; it relies on the documented behaviour that requests with an empty zone key are not counted:

http {
    # Classify user agents; an empty value marks a bot to block
    map $http_user_agent $limit_key {
        default         $binary_remote_addr;
        "~*Googlebot"   "googlebot";
        "~*Bingbot"     "bingbot";
        "~*Slurp"       "";
        "~*NastyBot"    "";
    }

    # Derive one key per zone; a request with an empty key is not
    # counted against that zone
    map $limit_key $bot_key {
        default         "";
        "googlebot"     $binary_remote_addr;
        "bingbot"       $binary_remote_addr;
    }

    map $limit_key $browser_key {
        default         $limit_key;
        "googlebot"     "";
        "bingbot"       "";
    }

    # Rate limit zones
    limit_req_zone $browser_key zone=browser:10m rate=1r/s;
    limit_req_zone $bot_key     zone=bot:10m     rate=10r/s;

    server {
        listen 80;
        server_name example.com;

        location / {
            # Block bad bots (empty $limit_key); "return" is one of the
            # few directives that is safe inside "if"
            if ($limit_key = "") {
                return 403;
            }

            # Both limits are declared; only the zone whose key is
            # non-empty for this request is enforced
            limit_req zone=bot     burst=20 nodelay;
            limit_req zone=browser burst=5  nodelay;

            # Your regular configuration
            try_files $uri $uri/ =404;
        }
    }
}

The configuration uses Nginx's map directive to classify user agents and then derive a separate key for each rate-limit zone. Important notes:

  • Legitimate bots get a non-empty $bot_key and an empty $browser_key, so only the bot zone applies to them
  • Bad bots are mapped to an empty $limit_key, which triggers the 403 response
  • Regular browsers fall through to the default case and are limited per client IP
  • limit_req cannot appear inside an if block; listing both directives works because requests with an empty zone key are simply not counted
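
By default a request rejected by limit_req gets a 503 response and an error-level log entry. If you would rather signal throttling explicitly, the limit_req_status and limit_req_log_level directives can sit next to the limit_req lines; this is an optional addition, not part of the configuration above:

location / {
    limit_req zone=bot     burst=20 nodelay;
    limit_req zone=browser burst=5  nodelay;
    limit_req_status 429;        # answer throttled clients with 429 Too Many Requests
    limit_req_log_level warn;    # log rejections at "warn" instead of "error"
}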

For more complex scenarios you can give each location its own pair of zones, reusing the same key variables; the extra zones (names and rates here are examples) must be declared in the http block:

# Multiple location blocks approach: declare the extra zones in the http block
limit_req_zone $bot_key     zone=bot_api:10m         rate=10r/s;
limit_req_zone $browser_key zone=browser_api:10m     rate=2r/s;
limit_req_zone $bot_key     zone=bot_content:10m     rate=10r/s;
limit_req_zone $browser_key zone=browser_content:10m rate=1r/s;

location /api/ {
    # Special rate limits for API endpoints
    limit_req zone=bot_api     burst=30 nodelay;
    limit_req zone=browser_api burst=10 nodelay;
}

location / {
    # Regular content rate limits
    limit_req zone=bot_content     burst=50 nodelay;
    limit_req zone=browser_content burst=20 nodelay;
    try_files $uri $uri/ =404;
}

Since this approach trusts the User-Agent header, which is trivial to spoof, you should also verify that requests claiming to be Googlebot actually originate from Google. Google recommends forward-confirmed reverse DNS for this, which Nginx cannot do in its configuration; a practical approximation is to check the source IP against Google's published 66.249.64.0/19 range:

# IP-range check for requests that claim to be Googlebot
# (real verification is forward-confirmed reverse DNS, done outside nginx)
location / {
    set $googlebot "";

    if ($http_user_agent ~* "Googlebot") {
        set $googlebot "claimed";
    }

    # 66.249.64.0 - 66.249.95.255
    if ($remote_addr ~ "^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.") {
        set $googlebot "${googlebot}_verified";
    }

    # Claims to be Googlebot but comes from outside the range
    if ($googlebot = "claimed") {
        return 403;
    }

    # Rest of your configuration
}
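
If you prefer to keep the allowed ranges as data instead of a regular expression, the same check can be written with a geo block feeding a map. This is a sketch: $crawler_ip and $fake_googlebot are names introduced here, and 66.249.64.0/19 is an illustrative range that you should confirm against Google's published Googlebot IP list:

# In the http block
geo $crawler_ip {
    default         0;
    66.249.64.0/19  1;    # illustrative Googlebot range; verify before use
}

# Flag requests that claim to be Googlebot but arrive from elsewhere
map "$crawler_ip:$http_user_agent" $fake_googlebot {
    default              0;
    "~*^0:.*googlebot"   1;
}

# In the server or location block
if ($fake_googlebot) {
    return 403;
}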

To recap, the main challenges are:

  • Properly identifying bots via their User-Agent strings
  • Applying different rate limits without putting limit_req inside if blocks
  • Rejecting malicious bots while still allowing legitimate crawlers

Here's a variation of the configuration above that gives bad bots an explicit "bad_bot" value in the map and blocks them at the server level, before any location is matched:

http {
  # Define user agent types; bad bots get an explicit "bad_bot" value
  map $http_user_agent $limit_key {
    default                      $binary_remote_addr;
    "~*Googlebot"                "googlebot";
    "~*Bingbot"                  "bingbot";
    "~*(Slurp|nastybot|evilbot)" "bad_bot";
  }

  # Derive zone keys; an empty key disables the zone for that request
  map $limit_key $crawler_key {
    default       "";
    "googlebot"   $binary_remote_addr;
    "bingbot"     $binary_remote_addr;
  }

  map $limit_key $normal_key {
    default       $limit_key;
    "googlebot"   "";
    "bingbot"     "";
    "bad_bot"     "";
  }

  # Rate limit zones
  limit_req_zone $normal_key  zone=normal:10m   rate=1r/s;
  limit_req_zone $crawler_key zone=crawlers:10m rate=10r/s;

  server {
    listen 80;
    server_name example.com;

    # Block bad bots before any location is processed
    if ($limit_key = "bad_bot") {
      return 403;
    }

    location / {
      # Only the zone whose key is non-empty applies to the request
      limit_req zone=normal   burst=5  nodelay;
      limit_req zone=crawlers burst=20 nodelay;

      # Your regular configuration
      try_files $uri $uri/ =404;
    }
  }
}

For more complex setups you may want crawlers handled by a dedicated location block. Nginx cannot match a location by User-Agent, and a map by itself does not route requests anywhere, so the branch has to be made explicitly; a commonly used workaround is an internal redirect to a named location via error_page, triggered by a return inside if (return is one of the few directives that is safe inside if):

http {
  # Classify user agents
  map $http_user_agent $agent_class {
    default                 "normal";
    "~*(Googlebot|Bingbot)" "crawler";
    "~*(Slurp|nastybot)"    "bad_bot";
  }

  limit_req_zone $binary_remote_addr zone=normal:10m   rate=1r/s;
  limit_req_zone $binary_remote_addr zone=crawlers:10m rate=10r/s;

  server {
    listen 80;
    server_name example.com;

    # Default location with normal rate limiting
    location / {
      # Hand known crawlers over to the named location below;
      # 418 is an otherwise unused code that only triggers the redirect
      error_page 418 = @crawlers;
      if ($agent_class = "crawler") {
        return 418;
      }

      # Block bad bots
      if ($agent_class = "bad_bot") {
        return 403;
      }

      limit_req zone=normal burst=5 nodelay;
      try_files $uri $uri/ =404;
    }

    # Special handling for known crawlers
    location @crawlers {
      limit_req zone=crawlers burst=20 nodelay;
      try_files $uri $uri/ =404;
    }
  }
}

After making the changes, check the configuration syntax and reload Nginx:

sudo nginx -t
sudo systemctl reload nginx

You can verify the behaviour with curl. A single request will always be accepted, so send several in quick succession and watch for 503 responses (or whatever limit_req_status you configured) once a limit is exceeded:

# Browser limit: requests beyond the 1 r/s rate plus the burst of 5 return 503
for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" http://example.com/; done

# Crawler limit: the same loop stays within the 10 r/s rate and burst of 20
# (note: expect 403 here if you enabled the Googlebot IP verification above)
for i in $(seq 1 10); do curl -s -o /dev/null -w "%{http_code}\n" -A "Googlebot" http://example.com/; done

# Bad bot blocking: expect 403 on every request
curl -A "Slurp" -I http://example.com

A few additional recommendations:

  • Regularly update your bot patterns, since user agent strings evolve
  • Consider adding the crawler IP validation shown above for protection against spoofed user agents
  • Monitor the size of your rate-limit zones (10m in the examples): a 1 MB zone stores roughly 16,000 states, so scale it with the number of distinct client IPs you expect
  • Adjust rates and burst values to your traffic; the dry-run sketch below lets you trial new values without rejecting anyone
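
On nginx 1.17.1 or newer, limit_req_dry_run helps with that tuning: requests are counted against the zone and logged, but never rejected, so you can observe the effect of a new limit before enforcing it. A minimal sketch, reusing the browser zone from the first example:

location / {
    limit_req zone=browser burst=5 nodelay;
    limit_req_dry_run on;    # account and log, but do not reject
    try_files $uri $uri/ =404;
}

Combined with the $limit_req_status log field shown above, this reveals exactly which requests would have been rejected.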