ALMA LLM Integration Guide
This guide explains how to configure and use the LLM integration in ALMA for conversational infrastructure management.
Overview
ALMA supports configurable LLM backends (defaulting to Qwen/Qwen2.5-0.5B-Instruct for local inference) to provide:
- Natural language to infrastructure - Describe what you want, get a blueprint
- Infrastructure to natural language - Understand what your blueprints do
- Improvement suggestions - Get recommendations on availability, sizing, backups, and monitoring
- Security audits - Identify security issues in blueprints
- Resource sizing - Get resource recommendations
- Intent classification - Understand user requests
Installation
Prerequisites
# Install ALMA with LLM support
pip install -e ".[dev,llm]"
# This installs:
# - transformers>=4.36.0
# - torch>=2.1.0
Configuration
Configure the LLM in your .env file:
# LLM Configuration
LLM_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct
LLM_DEVICE=cpu # cpu, cuda, or mps (Apple Silicon)
LLM_MAX_TOKENS=512
# Optional: Disable LLM (use rule-based fallback)
# LLM_DEVICE=none
Device Selection
- CPU: Works everywhere, slower
- CUDA: NVIDIA GPUs, fastest
- MPS: Apple Silicon (M1/M2/M3), fast
If LLM_DEVICE is not set, ALMA automatically selects the best available device.
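The configuration and device rules above can be sketched as follows. This is a minimal illustration, not ALMA's actual loader: the names LLMSettings, load_settings, and pick_device are hypothetical, the 512-token default is assumed from the sample .env, and the CUDA, then MPS, then CPU preference is inferred from the speed ranking in the list. In a real setup the two availability flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available().

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class LLMSettings:
    model_name: str
    device: Optional[str]  # None means "auto-detect"
    max_tokens: int


def load_settings(env: Mapping[str, str]) -> LLMSettings:
    """Read the LLM_* variables from a mapping such as os.environ."""
    return LLMSettings(
        model_name=env.get("LLM_MODEL_NAME", "Qwen/Qwen2.5-0.5B-Instruct"),
        device=env.get("LLM_DEVICE") or None,
        max_tokens=int(env.get("LLM_MAX_TOKENS", "512")),  # assumed default
    )


def pick_device(requested: Optional[str], cuda_available: bool, mps_available: bool) -> str:
    """Honour an explicit LLM_DEVICE, otherwise prefer cuda, then mps, then cpu."""
    if requested:
        return requested.lower()  # includes "none", which disables the LLM
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"
```

For example, pick_device(load_settings(os.environ).device, cuda_available=False, mps_available=True) resolves to "mps" when LLM_DEVICE is unset.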
API Endpoints
1. Chat with AI
Have a conversation about infrastructure:
curl -X POST http://localhost:8000/api/v1/conversation/chat \
  -H "Content-Type: application/json" \
  -d '{
    "message": "I need to deploy a scalable web application"
  }'
Response:
{
  "intent": "create_blueprint",
  "confidence": 0.9,
  "response": "I understand you want to create a blueprint. I can help you create an infrastructure blueprint.",
  "blueprint": null
}
2. Generate Blueprint from Description
Convert natural language to infrastructure code:
curl -X POST http://localhost:8000/api/v1/conversation/generate-blueprint \
  -H "Content-Type: application/json" \
  -d '{
    "description": "I need a high-availability web application with load balancer, 3 web servers, and a PostgreSQL database cluster"
  }'
Response:
version: "1.0"
name: ha-web-application
description: High-availability web application with load balancer and database
resources:
  - type: compute
    name: web-server-1
    provider: proxmox
    specs:
      cpu: 4
      memory: 8GB
      storage: 50GB
  # ... more servers
  - type: network
    name: load-balancer
    provider: fake
    specs:
      type: http
      algorithm: round-robin
      backends:
        - web-server-1
        - web-server-2
        - web-server-3
  - type: compute
    name: postgres-primary
    provider: proxmox
    specs:
      cpu: 8
      memory: 32GB
      storage: 500GB
3. Describe Blueprint in Natural Language
Convert infrastructure code to human-readable description:
curl -X POST http://localhost:8000/api/v1/conversation/describe-blueprint \
  -H "Content-Type: application/json" \
  -d '{
    "blueprint": {
      "name": "my-app",
      "resources": [...]
    }
  }'
Response:
{
  "description": "This infrastructure 'my-app' provides a highly available web application consisting of:\n\n1. A load balancer distributing traffic across 3 web servers\n2. Three application servers running your web application\n3. A PostgreSQL database cluster with primary and standby nodes\n4. Automated backups configured for disaster recovery\n\nThe setup ensures 99.9% uptime with automatic failover capabilities."
}
4. Get Improvement Suggestions
Analyze your infrastructure and get recommendations:
curl -X POST http://localhost:8000/api/v1/conversation/suggest-improvements \
  -H "Content-Type: application/json" \
  -d '{
    "blueprint": {
      "name": "single-server-app",
      "resources": [{
        "type": "compute",
        "name": "web-server",
        "specs": {"cpu": 1, "memory": "512MB"}
      }]
    }
  }'
Response:
{
  "suggestions": [
    "Add redundant servers for high availability (minimum 2 instances). Current setup has a single point of failure.",
    "Increase resource allocation - 1 CPU and 512MB RAM may be insufficient for production workloads. Recommend at least 2 CPUs and 2GB RAM.",
    "Add a load balancer to distribute traffic and enable horizontal scaling.",
    "Implement backup storage for disaster recovery. No backup solution is currently configured.",
    "Add monitoring and observability tools (e.g., Prometheus, Grafana) to track system health."
  ]
}
5. Resource Sizing Recommendations
Get resource sizing recommendations for your workload:
curl -X POST http://localhost:8000/api/v1/conversation/resource-sizing \
  -H "Content-Type: application/json" \
  -d '{
    "workload": "Django web application with PostgreSQL",
    "expected_load": "5000 concurrent users, 100 requests/second"
  }'
Response:
{
  "cpu": 8,
  "memory": "16GB",
  "storage": "200GB",
  "storage_type": "SSD",
  "network": "1Gbps",
  "reasoning": "For a Django application serving 5000 concurrent users at 100 req/s: 8 CPUs handle request processing, 16GB RAM supports Django processes + database connections, 200GB SSD ensures fast database queries and application response times."
}
6. Security Audit
Get security analysis of your blueprint:
curl -X POST http://localhost:8000/api/v1/conversation/security-audit \
  -H "Content-Type: application/json" \
  -d '{
    "blueprint": {
      "name": "web-app",
      "resources": [...]
    }
  }'
Response:
{
  "findings": [
    {
      "severity": "High",
      "issue": "Database server directly accessible from internet",
      "recommendation": "Place database in private subnet, only accessible from application servers"
    },
    {
      "severity": "Medium",
      "issue": "No SSL/TLS encryption configured for load balancer",
      "recommendation": "Enable HTTPS with valid certificates, redirect HTTP to HTTPS"
    },
    {
      "severity": "Medium",
      "issue": "No firewall rules defined",
      "recommendation": "Implement network segmentation with strict firewall rules"
    }
  ]
}
Python SDK Usage
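The HTTP endpoints documented above can also be called from Python without importing ALMA. The following is a minimal sketch using only the standard library. The helper names build_request and post are hypothetical and not part of ALMA; the base URL is taken from the curl examples. Because the generate-blueprint response is shown as YAML, post returns the raw body and leaves parsing to the caller.

```python
import json
import urllib.request
from typing import Any, Dict

# Base URL copied from the curl examples in this guide.
BASE_URL = "http://localhost:8000/api/v1/conversation"


def build_request(endpoint: str, payload: Dict[str, Any]) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint such as "chat"."""
    return urllib.request.Request(
        url=f"{BASE_URL}/{endpoint}",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post(endpoint: str, payload: Dict[str, Any], timeout: float = 60.0) -> str:
    """Send the request and return the response body as text."""
    with urllib.request.urlopen(build_request(endpoint, payload), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

With the API server running, json.loads(post("chat", {"message": "I need to deploy a scalable web application"})) would give a dictionary with the intent, confidence, response, and blueprint fields shown in the chat example.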
Basic Usage
import asyncio
from alma.core.llm_qwen import Qwen3LLM
from alma.core.llm_orchestrator import EnhancedOrchestrator

async def main():
    # Initialize LLM
    llm = Qwen3LLM(
        model_name="Qwen/Qwen2.5-0.5B-Instruct",
        device="cpu",  # or "cuda", "mps"
        max_tokens=512,
    )
    # Initialize orchestrator
    orchestrator = EnhancedOrchestrator(llm=llm, use_llm=True)
    try:
        # Generate blueprint
        blueprint = await orchestrator.natural_language_to_blueprint(
            "Create a microservices architecture with API gateway, 3 services, and message queue"
        )
        # Get suggestions
        suggestions = await orchestrator.suggest_improvements(blueprint)
    finally:
        # Clean up, even if generation fails
        await llm.close()

# The await calls above must run inside a coroutine
asyncio.run(main())
Using Service Layer
from alma.core.llm_service import get_orchestrator
# Get singleton orchestrator (automatically manages LLM lifecycle)
orchestrator = await get_orchestrator()
# Use it
blueprint = await orchestrator.natural_language_to_blueprint(
"Deploy a CI/CD pipeline"
)
CLI Usage
The CLI uses the LLM for natural language processing:
# Generate blueprint from natural language
ALMA generate "I need a Kubernetes cluster with 3 nodes"
# This will:
# 1. Use LLM to understand your intent
# 2. Generate appropriate blueprint
# 3. Save it to blueprints/
Performance Considerations
Model Size vs. Speed
- Qwen2.5-0.5B: Fast, good for most tasks (~1-2 seconds per query on CPU)
- Qwen2.5-1.5B: Better quality, slower (~3-5 seconds per query on CPU)
- Qwen2.5-7B: Best quality, requires GPU (~1-2 seconds on GPU)
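The timings above depend on hardware, so it can help to measure them locally. Below is a minimal sketch of a latency check; median_latency is a hypothetical helper, and generate stands for any callable that takes a prompt and returns text (for instance a thin synchronous wrapper around the LLM). The first call is excluded because model loading makes it unrepresentative.

```python
import time
from statistics import median
from typing import Callable, Iterable


def median_latency(
    generate: Callable[[str], str],
    prompts: Iterable[str],
    warmup_prompt: str = "hello",
) -> float:
    """Return the median seconds per call, ignoring one warmup call."""
    generate(warmup_prompt)  # not timed: absorbs one-off loading cost
    timings = []
    for prompt in prompts:
        start = time.perf_counter()
        generate(prompt)
        timings.append(time.perf_counter() - start)
    # median() raises StatisticsError if no prompts were supplied
    return median(timings)
```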
Optimization Tips
- Use GPU/MPS for faster inference
- Enable model caching - the first request is slow, later requests are fast
- Batch requests when possible
- Warmup on startup (automatically done by API server)
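The caching and warmup tips can be illustrated with a small sketch. It does not show how ALMA implements them: StubModel is a hypothetical stand-in for an expensive model load, and get_model and warmup are illustrative names. The idea is to build the model once, reuse it, and pay the loading cost at startup rather than on the first user request.

```python
from functools import lru_cache


class StubModel:
    """Hypothetical stand-in for a loaded model; counts how often it is built."""

    loads = 0

    def __init__(self, name: str) -> None:
        StubModel.loads += 1  # a real loader would read weights here
        self.name = name

    def generate(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


@lru_cache(maxsize=1)
def get_model(name: str = "Qwen/Qwen2.5-0.5B-Instruct") -> StubModel:
    """Build the model on first use and return the cached instance afterwards."""
    return StubModel(name)


def warmup() -> None:
    """Trigger loading and one generation at startup."""
    get_model().generate("warmup")
```

Calling warmup() once during application startup means later get_model() calls return immediately. Note that lru_cache keys on the arguments, so callers should request the model the same way each time.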
Memory Requirements
- 0.5B model: ~2GB RAM
- 1.5B model: ~4GB RAM
- 7B model: ~16GB RAM (GPU recommended)
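As a rough illustration, the figures above can be turned into a helper that picks the largest model likely to fit. This is only a sketch: largest_model_that_fits is a hypothetical function, the RAM values are copied from the list, and the 1GB headroom for the rest of the application is an assumption.

```python
from typing import Optional

# Approximate RAM needs in GB, copied from the list above.
MODEL_RAM_GB = {"Qwen2.5-0.5B": 2, "Qwen2.5-1.5B": 4, "Qwen2.5-7B": 16}


def largest_model_that_fits(available_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the biggest model whose RAM estimate fits, or None if none do."""
    usable = available_gb - headroom_gb  # headroom is an assumed safety margin
    candidates = [(ram, name) for name, ram in MODEL_RAM_GB.items() if ram <= usable]
    return max(candidates)[1] if candidates else None
```

A None result corresponds to the case where running without the LLM (LLM_DEVICE=none) is the safer choice.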
Fallback Behavior
If LLM initialization fails (missing dependencies, insufficient RAM, etc.), ALMA automatically falls back to rule-based processing:
- Blueprint generation uses keyword matching
- Suggestions use predefined rules
- Descriptions use templates
This ensures ALMA always works, even without LLM support.
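The fallback pattern can be sketched as below. This is an illustration under stated assumptions rather than ALMA's code: the keyword table, rule_based_blueprint, and generate_blueprint are hypothetical, and llm_generate stands for whatever coroutine produces a blueprint when the LLM is available (None when initialization failed). The sketch also falls back when a single LLM call raises, which goes slightly beyond the initialization case described above.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, Optional

# Hypothetical keyword table for the rule-based path.
KEYWORD_RESOURCES = {
    "load balancer": {"type": "network", "name": "load-balancer"},
    "database": {"type": "compute", "name": "database"},
    "web": {"type": "compute", "name": "web-server"},
}


def rule_based_blueprint(description: str) -> Dict[str, Any]:
    """Build a blueprint by matching keywords in the description."""
    text = description.lower()
    resources = [dict(spec) for keyword, spec in KEYWORD_RESOURCES.items() if keyword in text]
    return {"version": "1.0", "name": "generated-blueprint", "resources": resources}


async def generate_blueprint(
    description: str,
    llm_generate: Optional[Callable[[str], Awaitable[Dict[str, Any]]]] = None,
) -> Dict[str, Any]:
    """Use the LLM when it is available, otherwise use keyword matching."""
    if llm_generate is not None:
        try:
            return await llm_generate(description)
        except Exception:
            pass  # fall through to the rule-based path
    return rule_based_blueprint(description)
```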
Advanced: Custom Prompts
You can customize prompts for specific use cases:
from alma.core.prompts import InfrastructurePrompts
# Use custom prompt for blueprint generation
custom_prompt = f"""
{InfrastructurePrompts.blueprint_generation("my infrastructure")}
Additional constraints:
- Use only open-source technologies
- Optimize for cost efficiency
- Include monitoring from day 1
"""
response = await llm.generate(custom_prompt)
Troubleshooting
Model Download Issues
# Pre-download model
python -c "from transformers import AutoModelForCausalLM; AutoModelForCausalLM.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')"
Out of Memory
- Use a smaller model: LLM_MODEL_NAME=Qwen/Qwen2.5-0.5B-Instruct
- Reduce max tokens: LLM_MAX_TOKENS=256
- Use CPU if the GPU runs out of memory: LLM_DEVICE=cpu
Slow Performance
- Use GPU: LLM_DEVICE=cuda or LLM_DEVICE=mps
- Reduce max tokens
- Enable model quantization (advanced)
Examples
Example 1: Microservices Platform
description = """
I need a production-ready microservices platform with:
- API Gateway (Kong or similar)
- 3 microservices (auth, users, orders)
- PostgreSQL database
- Redis cache
- RabbitMQ message queue
- Elasticsearch for logging
- Prometheus + Grafana for monitoring
"""
blueprint = await orchestrator.natural_language_to_blueprint(description)
Example 2: Security Hardening
# Get your current blueprint
blueprint = {...}
# Run security audit
findings = await orchestrator.security_audit(blueprint)
# Get improvement suggestions to address the findings
suggestions = await orchestrator.suggest_improvements(blueprint)
Example 3: Cost Optimization
# Get resource sizing for your workload
sizing = await orchestrator.estimate_resources(
workload="E-commerce website with PostgreSQL",
expected_load="10,000 daily active users, peak 500 concurrent"
)
print(f"Recommended: {sizing['cpu']} CPUs, {sizing['memory']} RAM")
print(f"Reasoning: {sizing['reasoning']}")
Best Practices
- Be Specific: Provide detailed requirements for better results
- Iterate: Use suggestions to improve your blueprints
- Validate: Always review AI-generated blueprints before deployment
- Use Dry-Run: Test deployments before actual execution
- Monitor: Track LLM performance and accuracy
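As one way to support the "Validate" practice, a generated blueprint can be given a quick structural check before human review. The sketch below is not ALMA's validator: validate_blueprint is a hypothetical helper, and the required fields are assumptions based on the example blueprints in this guide.

```python
from typing import Any, Dict, List

# Assumed minimal schema, inferred from the examples in this guide.
REQUIRED_TOP_LEVEL = ("name", "resources")
REQUIRED_RESOURCE = ("type", "name")


def validate_blueprint(blueprint: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the basic shape is fine."""
    problems = [f"missing top-level field: {key}" for key in REQUIRED_TOP_LEVEL if key not in blueprint]
    resources = blueprint.get("resources", [])
    if not isinstance(resources, list):
        return problems + ["resources must be a list"]
    seen = set()
    for index, resource in enumerate(resources):
        if not isinstance(resource, dict):
            problems.append(f"resource {index}: must be a mapping")
            continue
        for key in REQUIRED_RESOURCE:
            if key not in resource:
                problems.append(f"resource {index}: missing field: {key}")
        name = resource.get("name")
        if name is not None and name in seen:
            problems.append(f"resource {index}: duplicate name: {name}")
        seen.add(name)
    return problems
```

A check like this only catches structural slips; it does not replace reviewing the blueprint or running a dry-run deployment.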
Limitations
- The LLM may occasionally generate invalid YAML (validation will catch this)
- Complex architectures may need manual refinement
- The model has a knowledge cutoff (it was trained on data up to a certain date)
- Performance varies by hardware