# Bad Bot Detection
badbots.py generates per-platform User-Agent blocklists alongside the OWASP rules, so you can drop noisy crawlers, AI scrapers, and known abusive scanners in a single include.
## How it works
- The script fetches public bot lists, including the ai.robots.txt project and other community-curated sources.
- It deduplicates and normalizes the User-Agent patterns.
- It emits one file per platform under `waf_patterns/<platform>/`.
- The daily GitHub Actions workflow regenerates and republishes these files alongside the OWASP-derived rules.
If a primary source is unreachable, the script falls back to a bundled list so the build still succeeds.
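The overall flow is simple enough to sketch. The snippet below is an illustrative outline only, not the actual contents of `badbots.py`: the source URL, the `FALLBACK_PATTERNS` list, and the helper names are placeholders, and the real script also emits the Apache, Traefik, and HAProxy formats.

```python
from pathlib import Path
import urllib.request

# Placeholder source; the real script pulls from ai.robots.txt and
# other community-curated lists.
SOURCES = ["https://example.com/bot-list.txt"]
FALLBACK_PATTERNS = ["AhrefsBot", "SemrushBot", "GPTBot"]  # bundled fallback

def fetch_patterns(url: str) -> list[str]:
    """Download one source and return its non-empty lines."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return [ln.strip() for ln in resp.read().decode().splitlines() if ln.strip()]

def collect() -> list[str]:
    """Merge all sources, deduplicate, and fall back to the bundled list on failure."""
    patterns: set[str] = set()
    for url in SOURCES:
        try:
            patterns.update(fetch_patterns(url))
        except OSError:
            # Primary source unreachable: use the bundled list so the build succeeds
            patterns.update(FALLBACK_PATTERNS)
    return sorted(patterns)

def write_nginx(patterns: list[str], out_dir: Path) -> None:
    """Emit an Nginx map file; other platforms follow the same pattern."""
    out_dir.mkdir(parents=True, exist_ok=True)
    lines = ["map $http_user_agent $bad_bot {", "    default 0;"]
    lines += [f'    "~*{p}" 1;' for p in patterns]
    lines.append("}")
    (out_dir / "bots.conf").write_text("\n".join(lines) + "\n")

if __name__ == "__main__":
    write_nginx(collect(), Path("waf_patterns/nginx"))
```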
## Generated files
| Platform | File | Format |
|---|---|---|
| Nginx | bots.conf | map $http_user_agent $bad_bot |
| Apache | bots.conf | ModSecurity SecRule directives |
| Traefik | bots.toml | Middleware regex replacements |
| HAProxy | bots.acl | One regex per line, loadable with -f |
## Nginx
```nginx
# In the http block:
include /etc/nginx/waf_patterns/nginx/bots.conf;

# In any server block you want to protect:
server {
    if ($bad_bot) { return 403; }
}
```

The map looks like:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"  1;
    "~*SemrushBot" 1;
    "~*MJ12bot"    1;
    "~*GPTBot"     1;
    # …
}
```

## Apache
```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```

Include the file globally or per VirtualHost:

```apache
Include /etc/apache2/waf_patterns/apache/bots.conf
```

## HAProxy
```haproxy
acl bad_bot hdr(User-Agent) -m reg -i -f /etc/haproxy/bots.acl
http-request deny deny_status 403 if bad_bot
```
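These two lines belong inside a frontend or listen section. A minimal sketch, assuming a frontend named `http-in` and a backend named `app` (both placeholders):

```haproxy
frontend http-in
    bind *:80
    # Load the generated regex list and deny matching User-Agents
    acl bad_bot hdr(User-Agent) -m reg -i -f /etc/haproxy/bots.acl
    http-request deny deny_status 403 if bad_bot
    default_backend app
```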
## Traefik

```toml
[http.middlewares.bot-blocker]
# populated automatically by bots.toml
```

Reference `bot-blocker@file` from the routers you want to protect.
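For example, attached to a router in a dynamic configuration file read by the file provider; the router, rule, and service names below are placeholders:

```toml
[http.routers.my-app]
  rule = "Host(`example.com`)"
  service = "my-app"
  middlewares = ["bot-blocker@file"]
```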
## What gets blocked
The default list groups User-Agent patterns into four broad categories.
### SEO and marketing crawlers
Aggressive site indexers that are usually unwelcome on production traffic:
- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot
### AI training crawlers
Most are documented at ai.robots.txt:
- GPTBot, ChatGPT-User
- ClaudeBot, Anthropic-AI
- Google-Extended
- CCBot, Bytespider, PerplexityBot
### General scrapers
- DataForSeoBot
- PetalBot
- Bytespider
### Malicious scanners
Public vulnerability scanners and spam bots that have no legitimate reason to crawl your origin.
### Search engines are not blocked
Major search engines (Googlebot, Bingbot, DuckDuckBot, Baiduspider, YandexBot) are not included in the default block list — blocking them harms SEO.
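A quick way to verify the distinction after deploying is to send requests with spoofed User-Agent headers and compare status codes; `example.com` below is a placeholder for your own host:

```bash
# A blocked crawler UA should get 403; a major search engine UA should not
curl -s -o /dev/null -w '%{http_code}\n' -A 'AhrefsBot' https://example.com/
curl -s -o /dev/null -w '%{http_code}\n' -A 'Googlebot' https://example.com/
```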
## Customization
### Add your own pattern
Nginx:

```nginx
# Append in bots.conf
"~*MyCustomBot" 1;
```

Apache (ModSecurity):

```apache
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```

### Whitelist a bot
For Nginx, add an explicit allow entry before the blocking patterns; when several regexes match, the first one listed wins:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # explicit allow
    "~*AhrefsBot" 1;
}
```

### Allow bots inside a path
```nginx
location /public-api/ {
    # bypass the bot rule for this path
    proxy_pass http://upstream;
}

location / {
    if ($bad_bot) { return 403; }
    proxy_pass http://upstream;
}
```

## Regenerating manually
```bash
python badbots.py
```

The generated files end up in `waf_patterns/<platform>/`.
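After regenerating, reload whichever proxy consumes the files. For Nginx, for example (assuming the standard systemd unit name):

```bash
# Validate the new configuration before reloading
nginx -t && systemctl reload nginx
```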
## Monitoring
Track which patterns actually fire in your traffic:
```bash
# Top 20 User-Agent strings that received a 403 (combined log format)
awk '$9 == 403' /var/log/nginx/access.log \
  | awk -F'"' '{print $6}' \
  | sort | uniq -c | sort -rn | head -20
```

If you see legitimate traffic in the list, add it to a whitelist and re-include `bots.conf` after your override.