# Rate Limiting & Metrics - Documentation

## Overview
ALMA now includes production-grade rate limiting and comprehensive metrics collection. These features provide abuse prevention, fair usage enforcement, and deep observability into system behavior.
## Rate Limiting

### Architecture

The rate limiter uses a token bucket algorithm with per-client and per-endpoint tracking:
- Tokens: Each client gets a bucket with configurable capacity
- Refill: Tokens automatically refill at a constant rate (RPM)
- Burst Handling: Burst size allows temporary spikes beyond average rate
- Cleanup: Inactive clients are removed after 1 hour to prevent memory leaks
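The refill arithmetic above can be sketched as a minimal token bucket. This is an illustrative standalone version, not the actual `RateLimiter` class in `alma/middleware/rate_limit.py`:

```python
import time

class TokenBucket:
    """Minimal token bucket: refills at `rpm` tokens/minute up to `burst` capacity."""

    def __init__(self, rpm: float, burst: int):
        self.rate = rpm / 60.0      # tokens per second
        self.capacity = burst
        self.tokens = float(burst)  # start full, so bursts are allowed immediately
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rpm=60, burst=10)
results = [bucket.allow() for _ in range(12)]
print(results.count(True))  # → 10 (the burst is allowed, then requests are rejected)
```

Because the bucket starts full, a quiet client can always burst up to `burst` requests at once; sustained traffic is then held to the refill rate.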
### Configuration

Default limits in `alma/middleware/rate_limit.py`:

```python
# Global defaults
DEFAULT_RPM = 60         # 60 requests per minute
DEFAULT_BURST_SIZE = 10  # Allow bursts of 10 requests

# Per-endpoint limits (override global)
ENDPOINT_LIMITS = {
    "/api/v1/conversation/chat-stream": 20,       # LLM streaming
    "/api/v1/blueprints/generate-blueprint": 30,  # Blueprint generation
    "/api/v1/tools/execute": 40,                  # Tool execution
    "/api/v1/blueprints": 100,                    # CRUD operations
}
```

### Response Headers
Rate limit information is included in every response:

```
X-RateLimit-Limit: 60          # Max requests per minute
X-RateLimit-Remaining: 45      # Requests remaining
X-RateLimit-Reset: 1642345678  # Unix timestamp when the limit resets
```

When rate limited (HTTP 429):

```
Retry-After: 30  # Seconds until the next token is available
```

### Customization
Modify limits in code or via environment variables:

```python
# In alma/middleware/rate_limit.py
rate_limiter = RateLimiter(
    requests_per_minute=120,  # Double the default
    burst_size=20             # Larger burst allowance
)

# Per-endpoint
endpoint_limiter = EndpointRateLimiter({
    "/api/v1/my-expensive-endpoint": 10  # Very restrictive
})
```

### Monitoring
Check rate limit statistics:

```bash
curl http://localhost:8000/monitoring/rate-limit/stats
```

Response:

```json
{
  "total_clients": 45,
  "rate_limited_clients": 3,
  "endpoint_stats": {
    "/api/v1/conversation/chat-stream": {
      "limit": 20,
      "active_clients": 12,
      "total_requests": 2450,
      "rejected_requests": 15
    }
  }
}
```

## Metrics Collection
### Prometheus Integration

ALMA exposes Prometheus-compatible metrics at `/metrics`:

```bash
curl http://localhost:8000/metrics
```

### Metric Types
#### HTTP Metrics

- `http_requests_total` (Counter): Total HTTP requests by method, endpoint, status
- `http_request_duration_seconds` (Histogram): Request latency distribution
- `http_request_size_bytes` (Histogram): Request body size distribution
- `http_response_size_bytes` (Histogram): Response body size distribution

#### LLM Metrics

- `llm_requests_total` (Counter): LLM requests by model, operation, status
- `llm_generation_duration_seconds` (Histogram): LLM generation time
- `llm_tokens_generated_total` (Counter): Total tokens generated
- `llm_tokens_consumed_total` (Counter): Total tokens consumed (input)

#### Blueprint Metrics

- `blueprint_operations_total` (Counter): Blueprint CRUD operations
- `blueprint_resources_count` (Gauge): Resources per blueprint
- `blueprint_validation_errors_total` (Counter): Validation failures

#### Deployment Metrics

- `deployment_operations_total` (Counter): Deployment create/update/delete
- `deployment_duration_seconds` (Histogram): Time to deploy
- `active_deployments` (Gauge): Currently running deployments

#### Tool Metrics

- `tool_executions_total` (Counter): Tool calls by tool name, status
- `tool_execution_duration_seconds` (Histogram): Tool execution time

#### Rate Limiting Metrics

- `rate_limit_hits_total` (Counter): Rate limit violations by endpoint

#### System Metrics

- `active_connections` (Gauge): Current WebSocket/HTTP connections
- `database_connections` (Gauge): Active DB connections
- `cache_hit_rate` (Gauge): Cache effectiveness
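To show the mechanics of registering and recording such metrics, here is a sketch using the `prometheus_client` library directly; the actual collector in `alma/middleware/metrics.py` may wrap this differently, and the variable names here are illustrative:

```python
from prometheus_client import Counter, Histogram, generate_latest

# Hypothetical registrations mirroring two of the metric names above
HTTP_REQUESTS = Counter(
    "http_requests_total", "Total HTTP requests",
    ["method", "endpoint", "status"],
)
HTTP_LATENCY = Histogram(
    "http_request_duration_seconds", "Request latency",
    ["endpoint"],
)

# Record one request
HTTP_REQUESTS.labels(method="GET", endpoint="/api/v1/blueprints", status="200").inc()
HTTP_LATENCY.labels(endpoint="/api/v1/blueprints").observe(0.145)

# /metrics serves this text exposition format
text = generate_latest().decode()
print("http_requests_total" in text)  # → True
```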
### Human-Readable API

For quick debugging, use the monitoring endpoints:

```bash
# Metrics summary
curl http://localhost:8000/monitoring/metrics/summary

# System overview
curl http://localhost:8000/monitoring/stats/overview

# Health check
curl http://localhost:8000/monitoring/health/detailed
```

Example response:
```json
{
  "http": {
    "total_requests": 15234,
    "avg_latency_ms": 145,
    "error_rate": 0.02
  },
  "llm": {
    "total_generations": 1250,
    "avg_tokens_per_request": 450,
    "total_tokens_generated": 562500
  },
  "blueprints": {
    "total_created": 340,
    "avg_resources": 8,
    "validation_errors": 12
  }
}
```

## Visualization with Grafana
### Quick Start

- Start the stack:

  ```bash
  docker-compose -f docker-compose.metrics.yml up -d
  ```

- Services:
  - ALMA API: `http://localhost:8000`
  - Prometheus: `http://localhost:9090`
  - Grafana: `http://localhost:3000` (admin/admin)

- The dashboard auto-loads at Grafana → Dashboards → ALMA Metrics
### Dashboard Panels
The auto-generated dashboard includes:
- Request Rate: Requests per second by endpoint
- Response Time: P50/P90/P99 latency percentiles
- Error Rate: HTTP 4xx/5xx over time
- LLM Performance: Generation time and token throughput
- Blueprint Operations: Create/validate/deploy rates
- Rate Limit Hits: Rate limiting violations
- Active Connections: WebSocket and HTTP connections
- Tool Executions: Tool usage distribution
- System Health: CPU, memory, DB connections
### Custom Dashboards

Create custom panels with PromQL queries:

```promql
# Average response time by endpoint
rate(http_request_duration_seconds_sum[5m]) /
rate(http_request_duration_seconds_count[5m])

# LLM tokens per second
rate(llm_tokens_generated_total[1m])

# Blueprint validation failure rate
rate(blueprint_validation_errors_total[5m]) /
rate(blueprint_operations_total{operation="validate"}[5m])

# Top rate-limited endpoints
topk(5, rate(rate_limit_hits_total[5m]))
```

## Alerting
### Prometheus Alerts

Create `config/alerts/ALMA.yml`:

```yaml
groups:
  - name: ALMA
    interval: 30s
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "{{ $value }} failing requests/sec"

      # Slow LLM responses
      - alert: SlowLLMGeneration
        expr: histogram_quantile(0.95, rate(llm_generation_duration_seconds_bucket[5m])) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM generation slow"
          description: "P95 latency {{ $value }}s"

      # Rate limiting spike
      - alert: RateLimitSpike
        expr: rate(rate_limit_hits_total[5m]) > 10
        for: 3m
        labels:
          severity: warning
        annotations:
          summary: "Rate limiting spike"
          description: "{{ $value }} requests/sec being rate limited"

      # Too many active deployments
      - alert: DeploymentBacklog
        expr: active_deployments > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Deployment backlog building"
          description: "{{ $value }} active deployments"
```

### Grafana Alerts
Set thresholds directly in dashboard panels:

- Edit panel → Alert tab
- Define a condition (e.g., `avg() > 5`)
- Configure a notification channel (Slack, PagerDuty, etc.)
## Performance Impact

### Rate Limiting Overhead
- Memory: ~200 bytes per active client (10K clients = 2MB)
- CPU: O(1) token calculation, negligible overhead
- Latency: <1ms per request
### Metrics Collection Overhead
- Memory: ~5MB for metric registry (grows slowly with label cardinality)
- CPU: Counter increment ~50ns, histogram observation ~500ns
- Latency: <0.5ms per request (middleware overhead)
### Recommendations
- Keep label cardinality low: Avoid high-cardinality labels (user IDs, timestamps)
- Use histograms sparingly: Histograms are more expensive than counters
- Aggregate in Prometheus: Don't pre-aggregate in application code
- Set retention limits: Prometheus default 15 days, configure based on disk space
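To illustrate the cardinality recommendation: each distinct label combination creates a separate time series, so series count grows multiplicatively. A back-of-the-envelope check with hypothetical numbers:

```python
# Series count is the product of distinct values per label.
methods, endpoints, statuses = 5, 20, 10
series = methods * endpoints * statuses
print(series)        # → 1000 series: manageable

# Adding a user_id label with 50k distinct users multiplies this:
with_user_id = series * 50_000
print(with_user_id)  # → 50000000 series: unusable
```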
## Production Checklist
- [ ] Configure appropriate rate limits per endpoint
- [ ] Set up Prometheus scraping (every 10-15s)
- [ ] Create Grafana dashboards for key metrics
- [ ] Define alert rules for critical thresholds
- [ ] Configure notification channels (Slack, PagerDuty)
- [ ] Set Prometheus retention policy
- [ ] Enable authentication for Prometheus/Grafana endpoints
- [ ] Monitor metrics collection overhead
- [ ] Document custom metrics and their meaning
- [ ] Set up log aggregation for detailed debugging
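For the scraping item above, a minimal `prometheus.yml` job might look like this (the job name and target are assumptions to adapt to your deployment; `rule_files` points at the alert file described earlier):

```yaml
scrape_configs:
  - job_name: "ALMA-api"
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets: ["localhost:8000"]

rule_files:
  - "config/alerts/ALMA.yml"
```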
## Troubleshooting

### Rate Limiting Not Working

Check middleware order in `alma/api/main.py`:

```python
app.middleware("http")(metrics_middleware)     # First
app.middleware("http")(rate_limit_middleware)  # Second
```

Verify endpoint limits:

```python
# In rate_limit.py
ENDPOINT_LIMITS = {...}  # Check your endpoint is listed
```

Test with curl:

```bash
for i in {1..70}; do
  curl -w "%{http_code}\n" http://localhost:8000/api/v1/blueprints
done
# Should see 429 after ~60 requests
```
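On the client side, a well-behaved caller should honor the 429 status and its `Retry-After` header rather than hammering the endpoint. A minimal sketch using only the standard library (the function names are illustrative):

```python
import time
import urllib.request
from urllib.error import HTTPError

def retry_delay(headers) -> float:
    """Parse Retry-After (seconds); fall back to 1s if missing or malformed."""
    try:
        return float(headers.get("Retry-After", "1"))
    except (TypeError, ValueError):
        return 1.0

def get_with_backoff(url: str, max_retries: int = 3) -> bytes:
    """GET `url`, retrying on HTTP 429 for as long as the server asks."""
    for attempt in range(max_retries + 1):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except HTTPError as err:
            if err.code != 429 or attempt == max_retries:
                raise
            time.sleep(retry_delay(err.headers))
    raise AssertionError("unreachable")
```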
### Metrics Not Appearing

Check the `/metrics` endpoint:

```bash
curl http://localhost:8000/metrics | grep http_requests_total
# Should see counter values
```

Verify Prometheus scraping:

- Open `http://localhost:9090/targets`
- `ALMA-api` should be "UP"

Check the Grafana data source:

- Settings → Data Sources → Prometheus
- "Save & Test" should succeed
### Dashboard Empty

Generate traffic to create metrics:

```bash
# Make some requests
curl http://localhost:8000/api/v1/blueprints
curl http://localhost:8000/api/v1/conversation/chat -X POST \
  -H "Content-Type: application/json" -d '{"message":"test"}'
```

Check the time range in Grafana (top right):

- Set to "Last 5 minutes"

Verify queries in the panel editor:

- Edit panel → Query tab
- Run the query manually
## Next Steps

- Custom Metrics: Add application-specific metrics:

  ```python
  from alma.middleware.metrics import get_metrics_collector

  metrics = get_metrics_collector()
  metrics.custom_counter.labels(operation="my_feature").inc()
  ```

- Advanced Rate Limiting: Implement tiered limits (free/pro/enterprise)
- Distributed Tracing: Integrate OpenTelemetry for request tracing
- Cost Tracking: Add metrics for infrastructure costs per deployment
- SLO/SLI Monitoring: Define service level objectives and track them
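The tiered-limits idea can be sketched by keying bucket parameters off a client tier; the tier names and numbers below are illustrative, not part of the current implementation:

```python
# Hypothetical per-tier limits; real values would come from configuration
TIER_LIMITS = {
    "free":       {"rpm": 60,   "burst": 10},
    "pro":        {"rpm": 600,  "burst": 50},
    "enterprise": {"rpm": 6000, "burst": 200},
}

def limits_for(client_tier: str) -> dict:
    """Unknown or missing tiers fall back to the most restrictive limits."""
    return TIER_LIMITS.get(client_tier, TIER_LIMITS["free"])

print(limits_for("pro")["rpm"])      # → 600
print(limits_for("unknown")["rpm"])  # → 60
```

A per-client lookup like this would slot into bucket creation, so each client's bucket is built with its tier's `rpm` and `burst` values.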