
Bad Bot Detection

badbots.py generates per-platform User-Agent blocklists alongside the OWASP rules, so you can drop noisy crawlers, AI scrapers, and known abusive scanners in a single include.

How it works

  1. The script fetches public bot lists — including ai.robots.txt and other community-curated sources.
  2. It deduplicates and normalizes the User-Agent patterns.
  3. It emits one file per platform under waf_patterns/<platform>/.
  4. The daily GitHub Actions workflow regenerates and republishes these files alongside the OWASP-derived rules.

If a primary source is unreachable, the script falls back to a bundled list so the build still succeeds.
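A minimal sketch of that flow in Python, purely for orientation (the source URL, fallback file name, and helper names below are illustrative assumptions, not the actual badbots.py internals):

python
# Illustrative fetch -> normalize -> dedupe flow with a bundled fallback.
# URLs, file names, and function names here are assumptions made for clarity.
import urllib.request
from pathlib import Path

SOURCES = [
    # e.g. a community-curated list such as ai.robots.txt (URL shown as an example only)
    "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/main/robots.txt",
]
FALLBACK = Path("fallback_bots.txt")  # bundled list used when a source is unreachable

def fetch_patterns() -> list[str]:
    patterns = set()
    for url in SOURCES:
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                text = resp.read().decode("utf-8", errors="replace")
        except OSError:
            # Source unreachable: fall back to the bundled list so the build still succeeds.
            text = FALLBACK.read_text()
        for line in text.splitlines():
            line = line.strip()
            if line and not line.startswith("#"):
                patterns.add(line)  # the set deduplicates normalized entries
    return sorted(patterns)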

Generated files

Platform   File        Format
Nginx      bots.conf   map $http_user_agent $bad_bot
Apache     bots.conf   ModSecurity SecRule directives
Traefik    bots.toml   Middleware regex replacements
HAProxy    bots.acl    One regex per line, loadable with -f
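As a rough illustration of how those formats differ, per-platform emitters might look like the following (templates, rule IDs, and helper names are assumptions; the files badbots.py actually generates are authoritative):

python
# Illustrative emitters for the per-platform formats in the table above.
from pathlib import Path

def emit_nginx(patterns, out_dir: Path):
    target = out_dir / "nginx" / "bots.conf"
    target.parent.mkdir(parents=True, exist_ok=True)
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    lines += [f'    "~*{p}" 1;' for p in patterns]
    lines.append("}")
    target.write_text("\n".join(lines) + "\n")

def emit_apache(patterns, out_dir: Path):
    target = out_dir / "apache" / "bots.conf"
    target.parent.mkdir(parents=True, exist_ok=True)
    rules = [
        f'SecRule REQUEST_HEADERS:User-Agent "@rx {p}" '
        f"\"id:{200000 + i},phase:1,deny,status:403,msg:'Bad Bot Blocked'\""
        for i, p in enumerate(patterns, start=1)
    ]
    target.write_text("\n".join(rules) + "\n")

def emit_haproxy(patterns, out_dir: Path):
    # One regex per line, loadable with `haproxy -f`.
    target = out_dir / "haproxy" / "bots.acl"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("\n".join(patterns) + "\n")

# Example usage:
# emit_nginx(["AhrefsBot", "GPTBot"], Path("waf_patterns"))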

Nginx

nginx
# In the http block:
include /etc/nginx/waf_patterns/nginx/bots.conf;

# In any server block you want to protect:
server {
    if ($bad_bot) { return 403; }
}

The map looks like:

nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"   1;
    "~*SemrushBot"  1;
    "~*MJ12bot"     1;
    "~*GPTBot"      1;
    # …
}

Apache

apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"

Include the file globally or per VirtualHost:

apache
Include /etc/apache2/waf_patterns/apache/bots.conf

HAProxy

haproxy
acl bad_bot hdr(User-Agent) -m reg -i -f /etc/haproxy/bots.acl
http-request deny deny_status 403 if bad_bot

Traefik

toml
[http.middlewares.bot-blocker]
  # populated automatically by bots.toml

Reference bot-blocker@file from the routers you want to protect.

What gets blocked

The default list groups User-Agent patterns into four broad categories.

SEO and marketing crawlers

Aggressive site indexers that are usually unwelcome on production traffic:

  • AhrefsBot
  • SemrushBot
  • MJ12bot
  • DotBot
  • BLEXBot

AI training crawlers

Most are documented at ai.robots.txt:

  • GPTBot, ChatGPT-User
  • ClaudeBot, Anthropic-AI
  • Google-Extended
  • CCBot, Bytespider, PerplexityBot

General scrapers

  • DataForSeoBot
  • PetalBot
  • Bytespider

Malicious scanners

Public vulnerability scanners and spam bots that have no legitimate reason to crawl your origin.

Search engines are not blocked

Major search engines (Googlebot, Bingbot, DuckDuckBot, Baiduspider, YandexBot) are not included in the default block list — blocking them harms SEO.
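If you maintain your own source lists and want the same behaviour, a simple filter along these lines works (a sketch assuming substring matching; not necessarily how badbots.py implements it):

python
# Keep major search engine crawlers out of a blocklist (illustrative only).
SEARCH_ENGINES = ("Googlebot", "Bingbot", "DuckDuckBot", "Baiduspider", "YandexBot")

def drop_search_engines(patterns):
    """Remove any pattern that would match a major search engine crawler."""
    return [
        p for p in patterns
        if not any(se.lower() in p.lower() for se in SEARCH_ENGINES)
    ]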

Customization

Add your own pattern

nginx
# Add inside the map block in bots.conf:
"~*MyCustomBot" 1;

apache
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"

Whitelist a bot

For Nginx, add an explicit allow entry ahead of the blocking patterns; the first regex that matches wins:

nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot"   0;   # explicit allow
    "~*AhrefsBot"   1;
}

Allow bots inside a path

nginx
location /public-api/ {
    # bypass the bot rule for this path
    proxy_pass http://upstream;
}

location / {
    if ($bad_bot) { return 403; }
    proxy_pass http://upstream;
}

Regenerating manually

bash
python badbots.py

The generated files end up in waf_patterns/<platform>/.

Monitoring

Track which patterns actually fire in your traffic:

bash
# Top 20 User-Agents that received a 403 (assumes nginx's default combined log format)
awk -F'"' '$3 ~ /^ 403 / {print $6}' /var/log/nginx/access.log \
  | sort | uniq -c | sort -rn | head -20

If legitimate traffic shows up in the list, add an explicit allow entry for it ahead of the blocking patterns, as described under Whitelist a bot.

Released under the MIT License.