Troubleshooting

This guide helps diagnose and resolve common issues with LXC AutoScale ML.

Quick Diagnostics

Check All Services

bash

systemctl status lxc_autoscale_api
systemctl status lxc_monitor
systemctl status lxc_autoscale_ml

View Recent Logs

bash

journalctl -u lxc_autoscale_ml -n 50 --no-pager
journalctl -u lxc_autoscale_api -n 50 --no-pager
journalctl -u lxc_monitor -n 50 --no-pager

Test API Connectivity

bash

curl http://localhost:5000/health/check

Common Issues

Container Not Scaling Down

Symptom: Containers remain at high resource allocation despite low usage.

Possible causes:

Current resources not being fetched

Check if the API endpoint works:

bash

curl -H "X-API-Key: YOUR_KEY" \
  "http://localhost:5000/resource/vm/config?vm_id=104"

If the API returns an error, restart the service:

bash

systemctl restart lxc_autoscale_api

Thresholds too restrictive

Check current thresholds:

```bash
grep threshold /etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml
```

Default values:

- cpu_scale_down_threshold: 30 (CPU must be below 30%)
- ram_scale_down_threshold: 40 (RAM must be below 40%)

Adjust if needed and restart:

```bash
systemctl restart lxc_autoscale_ml
```
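
To make the threshold comparison concrete, here is a minimal sketch of a scale-down check. It assumes each resource is compared against its own threshold with a strict "below" test, using the default values listed above. The function names `below_threshold` and `scale_down_candidates` are hypothetical and not part of LXC AutoScale ML; the service may combine the two metrics differently.

```bash
# Hypothetical helper: succeeds when usage (percent) is strictly below threshold.
# awk handles fractional values such as 12.5.
below_threshold() {
  awk -v u="$1" -v t="$2" 'BEGIN { exit !(u + 0 < t + 0) }'
}

# Assumed rule: a resource is a scale-down candidate only when its usage is
# below its own threshold (defaults from this guide: CPU 30, RAM 40).
scale_down_candidates() {
  local cpu_usage=$1 ram_usage=$2 out=""
  below_threshold "$cpu_usage" 30 && out="cpu"
  below_threshold "$ram_usage" 40 && out="${out:+$out }ram"
  echo "${out:-none}"
}
```

For example, `scale_down_candidates 45 35` prints `ram`: CPU at 45% is above the 30% threshold, so under this assumption only memory qualifies.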

Already at minimum resources

Check container limits:

bash

pct config 104 | grep -E "cores|memory"
grep -E "min_cpu|min_ram" /etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml
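
The two commands above can be combined into a quick comparison. The sketch below relies on `pct config` printing `key: value` lines; `config_value` and `at_minimum` are hypothetical helper names, and the minimum value is passed in by hand rather than parsed from the YAML file.

```bash
# Read "key: value" lines (the format pct config prints) from stdin
# and print the value of the requested key. Prints nothing if the key is absent.
config_value() {
  awk -v key="$1" -F': ' '$1 == key { print $2; exit }'
}

# Succeeds when the current allocation is already at or below the minimum,
# in which case no further scale-down can be expected.
at_minimum() {
  [ "$1" -le "$2" ]
}
```

Usage, assuming container 104 and a configured minimum of 1 core: `at_minimum "$(pct config 104 | config_value cores)" 1 && echo "already at minimum"`.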

Container Jumps to Maximum Resources

Symptom: Resources jump directly to max values instead of incremental scaling.

Diagnosis:

bash

grep -E "cpu_scale_step|ram_scale_step" /etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml

Fix: Ensure step sizes are configured:

yaml

scaling:
  cpu_scale_step: 1        # Add/remove 1 core at a time
  ram_scale_step_mb: 512   # Add/remove 512 MB at a time
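
As an illustration of what incremental scaling means, the following sketch moves an allocation one step toward a limit and clamps the result so it never overshoots. `step_up` and `step_down` are hypothetical names, and this is not the service's actual scaling code.

```bash
# Hypothetical helper: add one step, but never exceed the maximum.
step_up() {
  local current=$1 step=$2 max=$3
  local next=$(( current + step ))
  [ "$next" -gt "$max" ] && next=$max
  echo "$next"
}

# Hypothetical helper: remove one step, but never drop below the minimum.
step_down() {
  local current=$1 step=$2 min=$3
  local next=$(( current - step ))
  [ "$next" -lt "$min" ] && next=$min
  echo "$next"
}
```

With `cpu_scale_step: 1`, a container at 2 cores with a 4-core maximum would move to `step_up 2 1 4`, which is 3, rather than jumping straight to 4.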

No Scaling Happening

Symptom: Logs show "No scaling needed" even when resources are at extremes.

Check metrics are being collected

bash

ls -lh /var/log/lxc_metrics.json
cat /var/log/lxc_metrics.json | jq '.[0]' | head -20

If file is missing or old, restart the monitor:

bash

systemctl restart lxc_monitor
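
Whether the file is "old" can be checked mechanically. The sketch below succeeds only if the file exists and was modified within a given number of minutes. `metrics_fresh` is a hypothetical helper, and the right age limit depends on your collection interval, which this guide does not state.

```bash
# Hypothetical check: succeed only if the file exists and was modified
# less than max_age_min minutes ago.
metrics_fresh() {
  local file=$1 max_age_min=$2
  [ -f "$file" ] || return 1
  # find prints the path only when the modification time is recent enough
  [ -n "$(find "$file" -mmin "-$max_age_min")" ]
}
```

Usage, assuming a 10-minute limit: `metrics_fresh /var/log/lxc_metrics.json 10 || systemctl restart lxc_monitor`.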

Container in ignore list

bash

grep -A5 "ignore_lxc" /etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml

Insufficient historical data

Check data points:

```bash
cat /var/log/lxc_metrics.json | jq 'length'
```

The model needs at least 5-10 data points. Wait for more collection cycles.

API Connection Errors

Symptom: Logs show "Error fetching config" or "API request failed"

Diagnosis:

bash

# Check if API is listening
ss -tlnp | grep :5000

# Test connectivity
curl -v http://localhost:5000/health/check

Fixes:

bash

# Restart API service
systemctl restart lxc_autoscale_api

# Check API logs
journalctl -u lxc_autoscale_api -n 100 --no-pager

# Verify API URL in config
grep api_url /etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml

Configuration Validation Errors

Symptom: Service fails to start with config error messages.

Common errors and fixes:

| Error | Fix |
| --- | --- |
| `min_cpu_cores cannot be greater than max_cpu_cores` | Ensure `min_cpu_cores <= max_cpu_cores` |
| `cpu_scale_down_threshold must be less than cpu_scale_up_threshold` | Set `cpu_scale_down_threshold` below `cpu_scale_up_threshold` |
| Threshold not between 0-100 | Use percentage values (0-100) |

Validate configuration syntax:

bash

python3 -c "import yaml; yaml.safe_load(open('/etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml'))"
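
The syntax check only confirms that the file parses as YAML; it does not catch the value errors listed in the table. As a rough pre-flight check, the sketch below applies the two threshold rules from that table to a pair of values. `validate_thresholds` is a hypothetical helper, it assumes whole-number percentages, and its messages are illustrative rather than the service's own.

```bash
# Hypothetical pre-flight check for a scale-down / scale-up threshold pair.
# Prints "ok" and succeeds, or prints a reason and fails.
validate_thresholds() {
  local down=$1 up=$2 v
  for v in "$down" "$up"; do
    case $v in
      ''|*[!0-9]*) echo "threshold '$v' is not a whole number"; return 1 ;;
    esac
    if [ "$v" -gt 100 ]; then
      echo "threshold $v is not between 0 and 100"; return 1
    fi
  done
  if [ "$down" -ge "$up" ]; then
    echo "scale-down threshold must be less than scale-up threshold"; return 1
  fi
  echo "ok"
}
```

For example, `validate_thresholds 30 80` prints `ok`, while `validate_thresholds 80 30` fails because the scale-down value is not below the scale-up value.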

Model Training Failures

Symptom: Logs show "Model training failed" or "Error during model training"

Diagnosis:

bash

cat /var/log/lxc_metrics.json | jq '.[0] | to_entries | length'

Fixes:

bash

# Validate JSON
jq empty /var/log/lxc_metrics.json

# If corrupted, reset metrics
mv /var/log/lxc_metrics.json /var/log/lxc_metrics.json.bak
systemctl restart lxc_monitor

# Check Python dependencies (pip lists sklearn under the name scikit-learn)
pip3 list | grep -iE "scikit-learn|pandas|numpy"

Rate Limiting (429 Errors)

Symptom: API returns "Rate limit exceeded" errors.

For external clients:

The default limit is 120 requests per minute. Check remaining quota:

bash

curl -I -H "X-API-Key: YOUR_KEY" http://proxmox:5000/routes

Look for headers:

X-RateLimit-Limit: 120
X-RateLimit-Remaining: 45
X-RateLimit-Reset: 1703436789
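
A client can use these headers to decide how long to pause before retrying. The sketch below extracts a header from raw response headers and computes the wait time. It assumes `X-RateLimit-Reset` is a Unix epoch timestamp in seconds, which the sample value suggests but this guide does not confirm; `header_value` and `seconds_until_reset` are hypothetical helper names.

```bash
# Print the value of one header from raw HTTP headers on stdin.
# Header names are matched case-insensitively; the trailing CR is stripped.
header_value() {
  awk -v name="$1" -F': ' '
    tolower($1) == tolower(name) { sub(/\r$/, "", $2); print $2; exit }'
}

# Seconds to wait until the limit resets (never negative).
# Assumption: the reset value is a Unix epoch timestamp in seconds.
# The second argument defaults to the current time.
seconds_until_reset() {
  local reset=$1 now=${2:-$(date +%s)}
  local delta=$(( reset - now ))
  [ "$delta" -lt 0 ] && delta=0
  echo "$delta"
}
```

Usage: `reset=$(curl -sI -H "X-API-Key: YOUR_KEY" http://proxmox:5000/routes | header_value X-RateLimit-Reset)` followed by `sleep "$(seconds_until_reset "$reset")"`.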

For internal services:

Localhost requests bypass rate limiting. Ensure requests come from 127.0.0.1:

bash

curl http://127.0.0.1:5000/health/check

Authentication Failures (401 Errors)

Symptom: API returns "Missing or invalid API key"

Check API key configuration:

bash

grep -A2 "authentication" /etc/lxc_autoscale_ml/lxc_autoscale_api.yaml

Verify header format:

bash

# Correct
curl -H "X-API-Key: your-key" http://localhost:5000/routes

# Alternative (query parameter)
curl "http://localhost:5000/routes?api_key=your-key"

Lock File Issues

Symptom: Service reports "Lock file exists" on startup.

Version 2.0 includes automatic stale lock cleanup. If the issue persists:

bash

# Check if process is running
cat /var/lock/lxc_autoscale_ml.lock
ps -p $(cat /var/lock/lxc_autoscale_ml.lock)

# If process not running, service auto-cleans
systemctl restart lxc_autoscale_ml

# Manual cleanup (if needed)
rm /var/lock/lxc_autoscale_ml.lock
systemctl start lxc_autoscale_ml
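
Before deleting a lock by hand, it helps to confirm it really is stale. The sketch below assumes, as the `ps -p` command above does, that the lock file contains the PID of the owning process. `lock_is_stale` is a hypothetical helper and not the service's own cleanup logic.

```bash
# Hypothetical check: succeeds (exit 0) when the lock file exists but no
# live process matches the PID stored in it.
# Run as root: kill -0 also fails for processes you are not allowed to signal.
lock_is_stale() {
  local file=$1 pid
  [ -f "$file" ] || return 1            # no lock file, nothing to clean
  pid=$(tr -d '[:space:]' < "$file")
  case $pid in
    ''|*[!0-9]*) return 0 ;;            # empty or non-numeric content
  esac
  ! kill -0 "$pid" 2>/dev/null          # signal 0 only tests for existence
}
```

Usage: `lock_is_stale /var/lock/lxc_autoscale_ml.lock && rm /var/lock/lxc_autoscale_ml.lock`.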

Circuit Breaker Open

Symptom: Logs show "Circuit breaker opened" or containers are skipped.

Cause: Too many consecutive API failures for a specific endpoint.

Check status:

bash

grep -i "circuit breaker" /var/log/lxc_autoscale_ml.log

Resolution:

Verify API service is running:

```bash
systemctl status lxc_autoscale_api
```

Circuit breaker resets automatically after timeout (default: 5 minutes)

Adjust settings if needed:

yaml

circuit_breaker:
  failure_threshold: 5   # Failures before opening
  timeout_seconds: 300   # Reset timeout
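
To show how the two settings interact, here is a simplified model of the breaker state. It assumes the breaker opens once consecutive failures reach `failure_threshold` and closes again once `timeout_seconds` have passed since the last failure. `breaker_state` is a hypothetical function and not the service's implementation, which may include states this model omits.

```bash
# Hypothetical model of the breaker state for one endpoint.
# Arguments: consecutive failures, time of last failure, current time
# (both as epoch seconds), then optional threshold and timeout.
breaker_state() {
  local failures=$1 last_failure=$2 now=$3
  local threshold=${4:-5} timeout=${5:-300}
  if [ "$failures" -lt "$threshold" ]; then
    echo "closed"
  elif [ $(( now - last_failure )) -lt "$timeout" ]; then
    echo "open"
  else
    echo "closed"   # timeout elapsed: assume the breaker resets
  fi
}
```

Under this model, lowering `failure_threshold` makes the breaker open sooner, and lowering `timeout_seconds` shortens how long containers are skipped after the API recovers.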

Debug Mode

Enable detailed logging:

yaml

# /etc/lxc_autoscale_ml/lxc_autoscale_ml.yaml
logging:
  log_level: "DEBUG"

Then restart the service and follow the logs:

bash

systemctl restart lxc_autoscale_ml
journalctl -u lxc_autoscale_ml -f

Debug output includes:

- Feature lists used for training
- Detailed threshold comparisons
- Anomaly detection details
- Step-by-step scaling calculations

Health Check Script

bash

#!/bin/bash
echo "=== LXC AutoScale Health Check ==="

# --quiet suppresses systemctl's own "active"/"inactive" output
echo -n "API Service: "
systemctl is-active --quiet lxc_autoscale_api && echo "OK" || echo "FAILED"

echo -n "ML Service: "
systemctl is-active --quiet lxc_autoscale_ml && echo "OK" || echo "FAILED"

echo -n "Monitor Service: "
systemctl is-active --quiet lxc_monitor && echo "OK" || echo "FAILED"

echo -n "API Health: "
# curl -f fails on HTTP errors; jq -e fails on empty or null output
curl -sf http://localhost:5000/health/check | jq -er '.status' || echo "FAILED"

echo -n "Metrics File: "
if [ -f /var/log/lxc_metrics.json ]; then
  echo "OK ($(cat /var/log/lxc_metrics.json | jq 'length') entries)"
else
  echo "MISSING"
fi

echo ""
echo "Recent Scaling:"
grep "Successfully scaled" /var/log/lxc_autoscale_ml.log 2>/dev/null | tail -3

Collecting Diagnostic Information

When reporting issues, collect this information:

bash

#!/bin/bash
mkdir -p /tmp/diagnostics

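# Service status (quote the glob so systemctl, not the shell, expands it)
systemctl status 'lxc_autoscale_*' lxc_monitor --no-pager > /tmp/diagnostics/service-status.txt 2>&1

# Logs
cp /var/log/lxc_autoscale_ml.log /tmp/diagnostics/ 2>/dev/null
journalctl -u lxc_autoscale_ml -n 200 > /tmp/diagnostics/ml-journal.txt 2>&1
journalctl -u lxc_autoscale_api -n 200 > /tmp/diagnostics/api-journal.txt 2>&1
journalctl -u lxc_monitor -n 200 > /tmp/diagnostics/monitor-journal.txt 2>&1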

# Configuration (sanitized - remove API key)
grep -v api_key /etc/lxc_autoscale_ml/*.yaml > /tmp/diagnostics/config.txt 2>/dev/null

# System info
pveversion > /tmp/diagnostics/pve-version.txt 2>&1
python3 --version > /tmp/diagnostics/python-version.txt 2>&1

# Create archive
tar czf /tmp/lxc-diagnostics-$(date +%Y%m%d).tar.gz -C /tmp diagnostics/
echo "Diagnostics: /tmp/lxc-diagnostics-$(date +%Y%m%d).tar.gz"

Getting Help

GitHub Issues: Report a bug
Documentation: Review the Configuration Reference
