
Bad Bot Detection

This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.

Overview

The badbots.py script generates configuration files to block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.

How It Works

  1. Fetches bot lists from multiple public sources
  2. Generates blocking configurations for each supported platform (a rough sketch of this flow follows)
  3. Updates the configurations daily via GitHub Actions
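
The first two steps can be pictured with a short Python sketch. This is only an illustration, not the actual badbots.py code; the source URL, output path, and helper names are hypothetical.

python
# Hypothetical sketch of the fetch-and-generate flow.
import urllib.request

SOURCE_URL = "https://example.com/bad-bots.txt"  # placeholder source list

def fetch_bot_list(url: str) -> list[str]:
    """Download a newline-separated list of bot User-Agent patterns."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        text = resp.read().decode("utf-8")
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def write_nginx_map(bots: list[str], path: str = "nginx/bots.conf") -> None:
    """Emit an Nginx map block that flags matching User-Agents via $bad_bot."""
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    lines += [f'    "~*{bot}" 1;' for bot in bots]
    lines.append("}")
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_nginx_map(fetch_bot_list(SOURCE_URL))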

Generated Files

Platform   File        Format
Nginx      bots.conf   Map directive
Apache     bots.conf   ModSecurity rules
Traefik    bots.toml   Middleware config
HAProxy    bots.acl    ACL patterns

Nginx Bot Blocker

The Nginx configuration uses a map directive. Patterns prefixed with ~* are case-insensitive regular expressions; any match sets $bad_bot to 1:

nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot" 1;
    "~*SemrushBot" 1;
    "~*MJ12bot" 1;
    "~*DotBot" 1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}

Integration

nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;
    
    server {
        if ($bad_bot) {
            return 403;
        }
    }
}

Apache Bot Blocker

Uses ModSecurity rules:

apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"

HAProxy Bot Blocker

Uses ACL rules; the file passed with -f contains one User-Agent regex per line, matched case-insensitively because of -i:

haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
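
The bots.acl file referenced above is just a list of patterns, one per line, so generating it is simple. A hypothetical Python sketch (the output path and the trimmed bot list are placeholders):

python
# Write the HAProxy ACL pattern file: one User-Agent regex per line.
# hdr_reg(User-Agent) -i -f <file> matches any of them, case-insensitively.
BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot", "DotBot"]  # trimmed example list

def write_haproxy_acl(bots: list[str], path: str = "haproxy/bots.acl") -> None:
    with open(path, "w") as f:
        f.write("\n".join(bots) + "\n")

write_haproxy_acl(BOTS)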

Blocked Bot Categories

The following categories of bots are blocked by default:

SEO/Marketing Crawlers

  • AhrefsBot
  • SemrushBot
  • MJ12bot
  • DotBot
  • BLEXBot

AI/ML Crawlers

  • GPTBot
  • ChatGPT-User
  • CCBot
  • Google-Extended
  • Anthropic-AI

Scrapers

  • DataForSeoBot
  • PetalBot
  • Bytespider
  • ClaudeBot

Malicious Bots

  • Known vulnerability scanners
  • Spam bots
  • Content scrapers

Customization

Add Custom Bots

Edit the generated file or add your own patterns:

nginx
# Nginx: add to bots.conf (inside the map block)
"~*MyCustomBot" 1;

apache
# Apache: add a ModSecurity rule
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"

Whitelist Bots

For Nginx, allow specific bots by mapping them to 0 (regex entries are checked in order of appearance, so put allow rules first):

nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;     # Allow Google
    "~*AhrefsBot" 1;     # Block Ahrefs
}

Allow All Bots for Specific Paths

The server-level if applies to every request, so apply the check per location instead and omit it from paths that should stay open to bots:

nginx
location /public-api {
    # No $bad_bot check here, so all bots are allowed
}

location / {
    if ($bad_bot) {
        return 403;
    }
}

Generate Manually

Run the script to regenerate bot lists:

bash
python badbots.py

The script supports fallback lists if primary sources are unavailable.
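
The fallback behavior could look roughly like the following; this is an illustrative sketch rather than the script's actual code, and the built-in list shown is a placeholder.

python
# Hypothetical sketch: try each source, fall back to a bundled list.
import urllib.error
import urllib.request

FALLBACK_BOTS = ["AhrefsBot", "SemrushBot", "MJ12bot"]  # trimmed built-in list

def fetch_with_fallback(urls: list[str]) -> list[str]:
    """Try each source in turn; return the bundled list if all fail."""
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                lines = resp.read().decode("utf-8").splitlines()
            bots = [l.strip() for l in lines
                    if l.strip() and not l.startswith("#")]
            if bots:
                return bots
        except (urllib.error.URLError, TimeoutError):
            continue  # source unavailable, try the next one
    return FALLBACK_BOTS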

Monitoring

Log Blocked Bots

Enable a dedicated log to track blocked requests. The access_log directive is not valid in a server-level if block, so place the check inside a location:

nginx
location / {
    if ($bad_bot) {
        access_log /var/log/nginx/blocked_bots.log;
        return 403;
    }
}

Analyze Bot Traffic

bash
# Count blocked bot requests by User-Agent (default combined log format)
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
  sort | uniq -c | sort -rn | head -20
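
The awk field only captures the first token of the User-Agent. For a fuller breakdown, a small Python script can count complete User-Agent strings; this assumes the default combined log format and the same access.log path:

python
# Count 403 responses per full User-Agent (combined log format assumed).
import re
from collections import Counter

PATTERN = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)

counts = Counter()
with open("/var/log/nginx/access.log") as log:
    for line in log:
        m = PATTERN.match(line)
        if m and m.group("status") == "403":
            counts[m.group("ua")] += 1

for ua, n in counts.most_common(20):
    print(f"{n:6d}  {ua}")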

Best Practices

  1. Regular Updates: The bot lists are updated daily. Pull the latest changes or download from releases.

  2. Monitor False Positives: Some legitimate services may use blocked User-Agents. Monitor your logs.

  3. Combine with Rate Limiting: Use bot blocking with rate limiting for comprehensive protection.

  4. Test Before Deploying: Verify that legitimate traffic (search engines, monitoring) is not blocked.
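
One way to run such a check is to send requests with a few representative User-Agents and compare status codes. A minimal sketch; the target URL is a placeholder and the User-Agent strings are illustrative:

python
# Quick check: which User-Agents receive a 403? (URL is a placeholder.)
import urllib.error
import urllib.request

TARGET = "https://example.com/"  # replace with your own site
AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64)",
    "Googlebot": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "AhrefsBot": "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)",
}

for name, ua in AGENTS.items():
    req = urllib.request.Request(TARGET, headers={"User-Agent": ua})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            print(f"{name}: HTTP {resp.status}")
    except urllib.error.HTTPError as err:
        print(f"{name}: HTTP {err.code}")  # 403 means the User-Agent was blocked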

WARNING

Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do not block major search engines.
