# LLM Response Caching

Multi-tier caching to avoid redundant LLM API calls.
Trinity implements a 3-tier cache for LLM responses to avoid repeating identical API calls across builds.
## Overview

LLM API calls are slow (typically 2-5 seconds) and can be costly when using cloud providers. When the same prompt is used in multiple builds, caching avoids redundant calls.
Cache tiers:

- Memory cache: <1ms access (in-process LRU, cleared on restart)
- Redis cache: 5-10ms access (distributed; optional, requires a Redis server)
- Filesystem cache: 20-50ms access (persistent, stored in the `.cache/` directory)
## Architecture

```text
       CacheManager (Unified API)
                 │
       ┌─────────┼─────────┐
       ▼         ▼         ▼
    Memory     Redis      File
     Tier      (opt.)    .cache/
     LRU                 persist
```

Lookup order:

1. Check the memory cache (fastest).
2. On miss, check the Redis cache (if configured).
3. On miss, check the filesystem cache.
4. On miss, call the LLM API.
5. Store the result in all enabled cache tiers.
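The lookup order above can be sketched generically. Assuming each tier exposes `get`/`set`, a minimal read-through lookup with backfill might look like this (`tiered_get` and `DictTier` are hypothetical names for illustration, not part of Trinity's API):

```python
from typing import Any, Optional

def tiered_get(tiers: list, key: str) -> Optional[Any]:
    """Check tiers fastest-first; on a hit, backfill every faster tier."""
    for i, tier in enumerate(tiers):
        value = tier.get(key)
        if value is not None:
            for faster in tiers[:i]:
                faster.set(key, value)  # promote to faster tiers
            return value
    return None  # full miss: caller falls through to the LLM API

# Dict-backed stand-ins for real cache tiers
class DictTier:
    def __init__(self):
        self.data = {}
    def get(self, key):
        return self.data.get(key)
    def set(self, key, value):
        self.data[key] = value

memory, disk = DictTier(), DictTier()
disk.set("k", "v")                      # only the slower tier has the entry
print(tiered_get([memory, disk], "k"))  # v
print(memory.get("k"))                  # v (backfilled on the hit)
```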
## Cache Tiers

### Tier 1: Memory Cache
```python
from typing import Any, Optional

from cachetools import LRUCache

class MemoryCache:
    def __init__(self, max_size: int = 100):
        self.cache = LRUCache(maxsize=max_size)

    def get(self, key: str) -> Optional[Any]:
        return self.cache.get(key)

    def set(self, key: str, value: Any) -> None:
        self.cache[key] = value
```

- Speed: <1ms
- Capacity: 100 entries (configurable)
- Persistence: cleared on restart
- Scope: single process
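If `cachetools` is unavailable, the same LRU behavior can be sketched with only the standard library. This `OrderedDict`-based class is a hypothetical stand-in, not Trinity's implementation:

```python
from collections import OrderedDict
from typing import Any, Optional

class SimpleLRUCache:
    def __init__(self, max_size: int = 100):
        self.max_size = max_size
        self.cache: OrderedDict = OrderedDict()

    def get(self, key: str) -> Optional[Any]:
        if key not in self.cache:
            return None
        self.cache.move_to_end(key)  # mark as most recently used
        return self.cache[key]

    def set(self, key: str, value: Any) -> None:
        self.cache[key] = value
        self.cache.move_to_end(key)
        if len(self.cache) > self.max_size:
            self.cache.popitem(last=False)  # evict least recently used

lru = SimpleLRUCache(max_size=2)
lru.set("a", 1)
lru.set("b", 2)
lru.get("a")         # touch "a" so "b" becomes least recently used
lru.set("c", 3)      # capacity exceeded: evicts "b"
print(lru.get("b"))  # None
print(lru.get("a"))  # 1
```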
### Tier 2: Redis Cache (optional)
```python
import json
from typing import Any, Optional

import redis

class RedisCache:
    def __init__(self, host="localhost", port=6379, db=0, ttl=3600):
        self.redis = redis.Redis(host=host, port=port, db=db, decode_responses=True)
        self.ttl = ttl  # entry lifetime in seconds

    def get(self, key: str) -> Optional[Any]:
        value = self.redis.get(key)
        return json.loads(value) if value else None

    def set(self, key: str, value: Any) -> None:
        self.redis.setex(key, self.ttl, json.dumps(value))
```

- Speed: 5-10ms
- Persistence: configurable (TTL-based expiry)
- Scope: shared across processes/servers
- Requires: a running Redis server
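Because this tier is optional, a common pattern is to degrade gracefully when the server (or the `redis` package itself) is unavailable, treating failures as cache misses. A hedged sketch (`SafeRedisCache` is a hypothetical illustration, not Trinity's class):

```python
import json
from typing import Any, Optional

try:
    import redis  # optional dependency
except ImportError:
    redis = None

class SafeRedisCache:
    """Treats any Redis failure as a cache miss instead of an error."""

    def __init__(self, host="localhost", port=6379, db=0, ttl=3600):
        self.ttl = ttl
        self.client = None
        if redis is not None:
            self.client = redis.Redis(host=host, port=port, db=db,
                                      decode_responses=True)

    def get(self, key: str) -> Optional[Any]:
        if self.client is None:
            return None
        try:
            value = self.client.get(key)
            return json.loads(value) if value else None
        except Exception:  # e.g. redis.ConnectionError
            return None

    def set(self, key: str, value: Any) -> None:
        if self.client is None:
            return
        try:
            self.client.setex(key, self.ttl, json.dumps(value))
        except Exception:
            pass  # silently skip this tier; slower tiers still work
```

With this wrapper, a missing or unreachable Redis simply drops one tier of the cache rather than failing the build.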
### Tier 3: Filesystem Cache
```python
import hashlib
import json
from pathlib import Path
from typing import Any, Optional

class FilesystemCache:
    def __init__(self, cache_dir: Path = Path(".cache")):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(exist_ok=True)

    def _get_path(self, key: str) -> Path:
        # Hash the key so arbitrary prompt text maps to a safe filename
        key_hash = hashlib.sha256(key.encode()).hexdigest()
        return self.cache_dir / f"{key_hash}.json"

    def get(self, key: str) -> Optional[Any]:
        path = self._get_path(key)
        if path.exists():
            with open(path) as f:
                return json.load(f)
        return None

    def set(self, key: str, value: Any) -> None:
        path = self._get_path(key)
        with open(path, 'w') as f:
            json.dump(value, f)
```

- Speed: 20-50ms
- Persistence: survives restarts
- Scope: single machine
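The persistence property can be seen concretely by pointing two independent instances at the same directory: an entry written by one is visible to the other, which is what lets this tier survive process restarts. The condensed class below mirrors the Tier 3 sketch above; the demo uses a temporary directory:

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Any, Optional

class FilesystemCache:
    def __init__(self, cache_dir: Path):
        self.cache_dir = cache_dir
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _get_path(self, key: str) -> Path:
        return self.cache_dir / (hashlib.sha256(key.encode()).hexdigest() + ".json")

    def get(self, key: str) -> Optional[Any]:
        path = self._get_path(key)
        return json.loads(path.read_text()) if path.exists() else None

    def set(self, key: str, value: Any) -> None:
        self._get_path(key).write_text(json.dumps(value))

with tempfile.TemporaryDirectory() as d:
    cache = FilesystemCache(Path(d))
    cache.set("llm:abc", {"text": "cached response"})
    # A second instance sharing the directory sees the entry
    cache2 = FilesystemCache(Path(d))
    roundtrip = cache2.get("llm:abc")

print(roundtrip)  # {'text': 'cached response'}
```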
## Cache Key Generation
Cache keys are generated from:
- Prompt content (hashed)
- Model name
- Provider
- Generation parameters (temperature, top_p)
```python
import hashlib
import json

def generate_cache_key(prompt, model, provider, temperature=0.7, top_p=0.9):
    key_data = {
        "prompt": prompt,
        "model": model,
        "provider": provider,
        "temperature": temperature,
        "top_p": top_p,
    }
    # Canonical JSON (sorted keys) so identical inputs always hash identically
    key_string = json.dumps(key_data, sort_keys=True)
    key_hash = hashlib.sha256(key_string.encode()).hexdigest()
    return f"llm:{key_hash[:16]}"
```

## Implementation
### CacheManager
```python
from enum import Enum
from typing import Any, Optional

class CacheTier(Enum):
    MEMORY = "memory"
    REDIS = "redis"
    FILESYSTEM = "filesystem"

class CacheManager:
    def __init__(self, enabled_tiers=None, ttl=3600):
        self.enabled_tiers = enabled_tiers or [CacheTier.MEMORY, CacheTier.FILESYSTEM]
        self.memory = MemoryCache() if CacheTier.MEMORY in self.enabled_tiers else None
        self.redis = RedisCache(ttl=ttl) if CacheTier.REDIS in self.enabled_tiers else None
        self.filesystem = FilesystemCache() if CacheTier.FILESYSTEM in self.enabled_tiers else None

    def get(self, key: str) -> Optional[Any]:
        # Tier 1: memory (fastest)
        if self.memory:
            value = self.memory.get(key)
            if value is not None:
                return value
        # Tier 2: Redis; backfill memory on a hit
        if self.redis:
            value = self.redis.get(key)
            if value is not None:
                if self.memory:
                    self.memory.set(key, value)
                return value
        # Tier 3: filesystem; backfill the faster tiers on a hit
        if self.filesystem:
            value = self.filesystem.get(key)
            if value is not None:
                if self.memory:
                    self.memory.set(key, value)
                if self.redis:
                    self.redis.set(key, value)
                return value
        return None

    def set(self, key: str, value: Any) -> None:
        if self.memory:
            self.memory.set(key, value)
        if self.redis:
            self.redis.set(key, value)
        if self.filesystem:
            self.filesystem.set(key, value)
```

## Integration with LLM Client
Caching is applied automatically in the async LLM client:
```python
async def generate_content_async(self, prompt, model="llama3.2:3b", use_cache=True):
    cache_key = generate_cache_key(prompt, model, self.provider)
    if use_cache:
        cached = self.cache_manager.get(cache_key)
        if cached is not None:
            return cached  # cache hit: skip the API call entirely
    response = await self._call_llm(prompt, model)
    if use_cache:
        self.cache_manager.set(cache_key, response)
    return response
```

## Configuration
```yaml
# config/settings.yaml
cache:
  enabled: true
  tiers:
    - memory       # Always recommended
    - redis        # Optional (requires Redis server)
    - filesystem   # Always recommended
  memory:
    max_size: 100
  redis:
    host: localhost
    port: 6379
    db: 0
    password: null
    ttl: 3600
  filesystem:
    directory: .cache
    max_size_mb: 100
```

### Environment Variables
```bash
export CACHE_ENABLED=true
export CACHE_TTL=3600
export CACHE_REDIS_HOST=localhost
export CACHE_REDIS_PORT=6379
export CACHE_REDIS_DB=0
```

## Cache Management
```bash
# Clear all cache tiers
make cache-clear

# Or manually
rm -rf .cache/
redis-cli FLUSHDB  # if using Redis
```

## Redis Setup (Optional)
```bash
# macOS
brew install redis && brew services start redis

# Ubuntu/Debian
sudo apt-get install redis-server && sudo systemctl start redis

# Docker
docker run -d -p 6379:6379 redis:7-alpine

# Verify
redis-cli ping  # Returns "PONG"
```

## Security Considerations
- Do not cache API keys or credentials
- Use Redis AUTH if exposing Redis externally
- Validate cached responses if security is critical
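The last point can be made concrete. One approach (an illustrative sketch, not Trinity's implementation; `pack`/`unpack` and `SECRET` are hypothetical names) is to sign each cached entry with an HMAC so tampered or corrupted cache files are treated as misses:

```python
import hashlib
import hmac
import json
from typing import Any, Optional

SECRET = b"replace-with-a-real-secret"  # e.g. loaded from the environment

def pack(value: Any) -> str:
    """Serialize a value with an HMAC-SHA256 signature attached."""
    payload = json.dumps(value, sort_keys=True)
    sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    return json.dumps({"payload": payload, "sig": sig})

def unpack(raw: str) -> Optional[Any]:
    """Return the value only if the signature verifies; otherwise a miss."""
    entry = json.loads(raw)
    expected = hmac.new(SECRET, entry["payload"].encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, entry["sig"]):
        return None  # tampered or corrupted entry
    return json.loads(entry["payload"])

stored = pack({"text": "response"})
print(unpack(stored))                        # {'text': 'response'}
tampered = stored.replace("response", "evil")
print(unpack(tampered))                      # None
```

`hmac.compare_digest` is used instead of `==` to avoid timing side channels during verification.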
## Troubleshooting
**Cache not working:**

```python
from trinity.utils.cache_manager import CacheManager

cache = CacheManager()
print(cache.enabled_tiers)  # confirm which tiers are active
```

**Redis connection issues:**

```bash
redis-cli ping  # Should return "PONG"
```

**High memory usage:**

```yaml
cache:
  memory:
    max_size: 50  # Reduce from default 100
```

## Next Steps
- Async & MLOps - Async LLM client
- Setup Guide - Installation
- Self-Healing - Layout validation