# Bad Bot Detection
This guide explains how to use the bad bot detection feature to block malicious crawlers and scrapers.
## Overview
The `badbots.py` script generates configuration files that block known malicious bots based on their User-Agent strings. It fetches bot lists from multiple public sources and generates blocking rules for each supported web server.
## How It Works
- Fetches bot lists from public sources:
  - ai.robots.txt
  - Various community-maintained bot lists
- Generates blocking configurations for each platform (a minimal sketch of this step follows the list)
- Updates configurations daily via GitHub Actions
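As a rough illustration of the generation step (a minimal sketch, not the actual `badbots.py` implementation; the source URL and output path are hypothetical):

```python
import urllib.request

# Hypothetical source URL; badbots.py pulls from multiple public lists.
SOURCE_URL = "https://example.com/bot-list.txt"

def fetch_bots(url: str) -> list[str]:
    """Fetch a newline-delimited list of User-Agent substrings."""
    with urllib.request.urlopen(url) as resp:
        text = resp.read().decode("utf-8")
    # Skip blank lines and comments.
    return [line.strip() for line in text.splitlines()
            if line.strip() and not line.startswith("#")]

def write_nginx_map(bots: list[str], path: str = "bots.conf") -> None:
    """Emit an Nginx map file like the one shown in the Nginx section."""
    with open(path, "w") as f:
        f.write("map $http_user_agent $bad_bot {\n")
        f.write("    default 0;\n")
        for bot in bots:
            f.write(f'    "~*{bot}" 1;\n')
        f.write("}\n")

if __name__ == "__main__":
    write_nginx_map(fetch_bots(SOURCE_URL))
```

Generating a single file per platform keeps the matching logic in one place, so each daily update can simply replace the file and reload the server.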
## Generated Files
| Platform | File | Format |
|---|---|---|
| Nginx | `bots.conf` | `map` directive |
| Apache | `bots.conf` | ModSecurity rules |
| Traefik | `bots.toml` | Middleware config |
| HAProxy | `bots.acl` | ACL patterns |
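Based on the include paths used in the examples below, the generated files land in a per-platform layout along these lines (illustrative):

```text
waf_patterns/
├── nginx/bots.conf
├── apache/bots.conf
├── traefik/bots.toml
└── haproxy/bots.acl
```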
## Nginx Bot Blocker
The Nginx configuration uses a `map` directive:
```nginx
# In http block
map $http_user_agent $bad_bot {
    default 0;
    "~*AhrefsBot"  1;
    "~*SemrushBot" 1;
    "~*MJ12bot"    1;
    "~*DotBot"     1;
    # ... more bots
}

# In server block
if ($bad_bot) {
    return 403;
}
```

### Integration
```nginx
http {
    include /path/to/waf_patterns/nginx/bots.conf;

    server {
        if ($bad_bot) {
            return 403;
        }
    }
}
```
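To verify the block is active, send a request with a blocked User-Agent and one with an ordinary browser User-Agent and compare status codes (the hostname is a placeholder):

```bash
# A blocked UA should receive 403; a normal browser UA should not.
curl -s -o /dev/null -w "%{http_code}\n" -A "AhrefsBot" https://example.com/
curl -s -o /dev/null -w "%{http_code}\n" -A "Mozilla/5.0" https://example.com/
```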
## Apache Bot Blocker

The Apache configuration uses ModSecurity rules:
```apache
SecRule REQUEST_HEADERS:User-Agent "@rx AhrefsBot" \
    "id:200001,phase:1,deny,status:403,msg:'Bad Bot Blocked'"
```
## HAProxy Bot Blocker

The HAProxy configuration uses ACL rules:
```haproxy
acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
http-request deny if bad_bot
```
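In context, these two lines belong in a `frontend` (or `listen`) section; a minimal sketch with placeholder section and backend names:

```haproxy
frontend web
    bind :80
    # Deny requests whose User-Agent matches a pattern in the ACL file
    acl bad_bot hdr_reg(User-Agent) -i -f /etc/haproxy/bots.acl
    http-request deny if bad_bot
    default_backend app
```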
## Blocked Bot Categories

The following categories of bots are blocked by default:
### SEO/Marketing Crawlers
- AhrefsBot
- SemrushBot
- MJ12bot
- DotBot
- BLEXBot
### AI/ML Crawlers
- GPTBot
- ChatGPT-User
- CCBot
- Google-Extended
- Anthropic-AI
### Scrapers
- DataForSeoBot
- PetalBot
- Bytespider
- ClaudeBot
### Malicious Bots
- Known vulnerability scanners
- Spam bots
- Content scrapers
## Customization
### Add Custom Bots
Edit the generated file or add your own patterns:
```nginx
# Nginx: add to bots.conf
"~*MyCustomBot" 1;
```

```apache
# Apache: add a rule
SecRule REQUEST_HEADERS:User-Agent "@rx MyCustomBot" \
    "id:200999,phase:1,deny,status:403"
```
### Whitelist Bots

For Nginx, allow specific bots by mapping them to `0`. Regex entries in a `map` are evaluated in order of appearance, so place allow entries before block entries:
```nginx
map $http_user_agent $bad_bot {
    default 0;
    "~*Googlebot" 0;  # Allow Google
    "~*AhrefsBot" 1;  # Block Ahrefs
}
```
### Allow All Bots for Specific Paths

Because an `if ($bad_bot)` check only applies where it appears, you can scope blocking to specific locations and leave others open:

```nginx
# Block bots everywhere except /public-api
location / {
    if ($bad_bot) {
        return 403;
    }
}

location /public-api {
    # No $bad_bot check here, so all bots are allowed on this path
}
```
### Generate Manually

Run the script to regenerate the bot lists:
```bash
python badbots.py
```

The script supports fallback lists if the primary sources are unavailable.
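If you run the generator on your own schedule rather than relying on the GitHub Actions workflow, a cron entry along these lines would work (paths and timing are illustrative):

```cron
# Illustrative crontab entry: regenerate daily at 03:00 and reload Nginx
0 3 * * * cd /opt/waf_patterns && python badbots.py && nginx -s reload
```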
## Monitoring
### Log Blocked Bots
Enable logging to track blocked requests:
```nginx
if ($bad_bot) {
    access_log /var/log/nginx/blocked_bots.log;
    return 403;
}
```

### Analyze Bot Traffic
```bash
# Count blocked requests by User-Agent (combined log format: field 9 = status)
awk '$9 == 403 {print $12}' /var/log/nginx/access.log | \
    sort | uniq -c | sort -rn | head -20
```

## Best Practices
- **Regular Updates**: The bot lists are updated daily. Pull the latest changes or download them from releases.
- **Monitor False Positives**: Some legitimate services may use blocked User-Agents. Monitor your logs.
- **Combine with Rate Limiting**: Use bot blocking together with rate limiting for more comprehensive protection.
- **Test Before Deploying**: Verify that legitimate traffic (search engines, monitoring services) is not blocked.
> **WARNING**: Blocking search engine bots (Googlebot, Bingbot) can negatively impact SEO. The default lists do not block major search engines.