proxmox-lxc-autoscale-ml

LXC Monitor Documentation

LXC Monitor is a Python-based service designed to monitor LXC containers on a Linux system, such as Proxmox. It periodically collects a wide range of metrics from running containers, including CPU usage, memory usage, I/O statistics, network usage, filesystem usage, and the number of running processes. These collected metrics are then exported to a JSON file, allowing for detailed analysis and monitoring.

Overview

The LXC Monitor service is a lightweight but powerful tool that provides comprehensive monitoring of LXC containers on Linux-based systems. Designed with efficiency in mind, it gathers essential metrics from your containers, enabling you to keep a close eye on their performance and resource usage. Whether you're running a homelab, hosting self-managed services, or managing a larger-scale environment, LXC Monitor gives you the insight you need to maintain optimal container performance.


Features

LXC Monitor offers a robust set of features covering the critical aspects of container performance:

- Per-container CPU usage
- Memory and (optionally) swap usage
- I/O statistics (bytes read and written)
- Network usage (bytes received and transmitted), optional
- Filesystem usage (total, used, and free space), optional
- Number of running processes
- Parallel metrics collection with a configurable worker pool
- Configurable check interval and excluded devices
- Export of all collected metrics to a JSON file
- Rotating log file with configurable size, level, and backup count
- Retry mechanism so that transient failures do not interrupt monitoring


Setup

Setting up LXC Monitor involves ensuring that your system meets the prerequisites, configuring the service, and installing it. Below is a detailed guide to help you get started.

1. Prerequisites

Before installing LXC Monitor, make sure your system meets the following requirements:

- A Linux host running LXC containers (for example, a Proxmox node)
- Python 3 (the service is started with /usr/bin/python3)
- A YAML parser for Python (e.g. PyYAML) so the configuration file can be read
- Root privileges, since the service runs as root to query container metrics
- Enough free disk space for the log file and the exported metrics JSON

2. Configuration

Create the main configuration file at /etc/lxc_autoscale_ml/lxc_monitor.yaml. This file will define how the monitoring service operates, including where to log information and how often to check for metrics.

Example Configuration:

logging:
  log_file: "/var/log/lxc_monitor.log"
  log_max_bytes: 5242880  # 5 MB
  log_backup_count: 7
  log_level: "INFO"

monitoring:
  export_file: "/var/log/lxc_metrics.json"
  check_interval: 60  # seconds
  enable_swap: true
  enable_network: true
  enable_filesystem: true
  parallel_processing: true
  max_workers: 8
  excluded_devices: ['loop', 'dm-']

Explanation of Configuration Options:

The logging section controls where and how the service writes its own log, while the monitoring section controls which metrics are collected, how often, and where they are exported. Each individual option is described in the Configuration Options section further below.

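For reference, reading this file from Python only takes a few lines. The sketch below is illustrative (the load_config helper is not part of the shipped code) and assumes the PyYAML package is available:

# Illustrative sketch: read the monitor configuration shown above.
# load_config() is a hypothetical helper, not part of the project code.
import yaml  # provided by the PyYAML package

CONFIG_PATH = "/etc/lxc_autoscale_ml/lxc_monitor.yaml"

def load_config(path: str = CONFIG_PATH) -> dict:
    """Return the configuration as a nested dict."""
    with open(path, "r") as fh:
        return yaml.safe_load(fh)

config = load_config()
print(config["monitoring"]["check_interval"])  # 60 with the example above
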
3. Service Configuration

To run LXC Monitor as a systemd service, you’ll need to create a service configuration file. This file tells systemd how to manage the LXC Monitor service.

Example Service Configuration:

Create the file /etc/systemd/system/lxc_monitor.service with the following content:

[Unit]
Description=LXC Monitor Service
After=network.target

[Service]
ExecStart=/usr/bin/python3 /usr/local/bin/lxc_monitor.py
WorkingDirectory=/usr/local/bin/
StandardOutput=inherit
StandardError=inherit
Restart=on-failure
User=root

# Make Python output unbuffered so log messages reach the journal immediately
Environment="PYTHONUNBUFFERED=1"
# Settings are read from /etc/lxc_autoscale_ml/lxc_monitor.yaml by the script itself
# (systemd's EnvironmentFile= expects KEY=VALUE pairs, so it is not used here)

[Install]
WantedBy=multi-user.target

Usage

LXC Monitor runs as a background service, continuously collecting metrics based on the configured interval. Below are some common commands to manage the service.

Starting the Service

The LXC Monitor service should start automatically after installation. You can also control it manually with the usual systemd commands (the unit name matches the service file created above):

systemctl start lxc_monitor.service
systemctl stop lxc_monitor.service
systemctl restart lxc_monitor.service
systemctl status lxc_monitor.service
systemctl enable lxc_monitor.service   # start automatically at boot

These commands let you manage the service as needed, for example restarting it after changing the configuration file so the new settings take effect.


Monitoring and Logs

Logs

LXC Monitor logs its operations to a file specified in the configuration. The log file is crucial for tracking the service’s behavior and diagnosing any issues.

You can view the logs in real-time using the tail command:

tail -f /var/log/lxc_monitor.log

Metrics

The metrics collected by LXC Monitor are exported to the JSON file configured under export_file (/var/log/lxc_metrics.json in the example configuration).

Example JSON output:

[
  {
    "container_id": "100",
    "timestamp": "2024-08-14T22:04:45Z",
    "cpu_usage": 15.6,
    "memory_usage": 512,
    "swap_usage": 0,
    "io_read_bytes": 102400,
    "io_write_bytes": 204800,
    "network_received_bytes": 12345678,
    "network_transmitted_bytes": 87654321,
    "filesystem_usage": {
      "total": 10485760,
      "used": 5242880,
      "free": 5242880
    },
    "process_count": 25
  }
]
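
For a quick look at the exported data, the file can be read with Python's standard json module. This is a consumer-side sketch that simply assumes the field names shown in the example above:

# Consumer-side sketch: print a one-line summary per container from the
# exported metrics file. Field names follow the example JSON above.
import json

METRICS_FILE = "/var/log/lxc_metrics.json"

with open(METRICS_FILE, "r") as fh:
    samples = json.load(fh)

for sample in samples:
    print(f"CT {sample['container_id']} @ {sample['timestamp']}: "
          f"cpu={sample['cpu_usage']}% mem={sample['memory_usage']} "
          f"processes={sample['process_count']}")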

Configuration Options

LXC Monitor’s behavior can be finely tuned through its configuration file. Below are the detailed options available:

Logging Configuration

- log_file: path of the log file (/var/log/lxc_monitor.log in the example)
- log_max_bytes: maximum size of the log file, in bytes, before it is rotated (5242880 = 5 MB in the example)
- log_backup_count: number of rotated log files to keep
- log_level: logging verbosity, e.g. DEBUG, INFO, WARNING, or ERROR

Monitoring Configuration

- export_file: path of the JSON file that collected metrics are written to
- check_interval: how often, in seconds, metrics are collected from the containers
- enable_swap: whether swap usage is collected
- enable_network: whether network usage (received/transmitted bytes) is collected
- enable_filesystem: whether filesystem usage (total/used/free) is collected
- parallel_processing: whether metrics for several containers are collected concurrently
- max_workers: maximum number of workers used when parallel_processing is enabled
- excluded_devices: device name prefixes (e.g. loop, dm-) to exclude from I/O statistics


Code Structure

LXC Monitor is built as a modular Python codebase. At a high level it consists of:

- a configuration loader that reads /etc/lxc_autoscale_ml/lxc_monitor.yaml and sets up rotating-file logging
- collector routines that gather CPU, memory, swap, I/O, network, filesystem, and process-count metrics for each running container
- a retry_on_failure wrapper that re-runs failed commands before giving up (see Error Handling below)
- an exporter that writes the collected metrics to the JSON file defined by export_file

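To make the flow concrete, here is a heavily simplified skeleton of such a monitoring loop. The function names and behaviour are illustrative placeholders, not the project's real implementation:

# Heavily simplified skeleton of the monitoring loop; all names here are
# illustrative placeholders rather than the project's real functions.
import json
import time

def list_running_containers() -> list:
    # Placeholder: the real service discovers running LXC containers on the host.
    return ["100", "101"]

def collect_metrics(container_id: str) -> dict:
    # Placeholder: the real service gathers CPU, memory, I/O, network, etc.
    return {"container_id": container_id,
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())}

def monitor_loop(export_file: str, interval: int) -> None:
    while True:
        metrics = [collect_metrics(cid) for cid in list_running_containers()]
        with open(export_file, "w") as fh:
            json.dump(metrics, fh, indent=2)
        time.sleep(interval)

monitor_loop("/var/log/lxc_metrics.json", 60)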

Error Handling

LXC Monitor includes robust error handling to ensure reliable operation even in the face of occasional issues. The service uses a retry mechanism for critical commands, attempting up to 3 retries before logging an error. This helps mitigate temporary issues, such as momentary network disruptions or transient system errors.

Example:

If a command to gather CPU usage fails, retry_on_failure will attempt the command again after a short delay. If all retries fail, the issue is logged, but the service continues monitoring other containers. This approach ensures that a single failure does not disrupt the entire monitoring process.
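
A minimal sketch of how such a wrapper can be implemented is shown below. The retry_on_failure name and the three attempts come from the description above; the one-second delay and the exact signature are assumptions:

# Minimal sketch of a retry wrapper like the one described above.
# Three attempts match the text; the delay and signature are assumptions.
import functools
import logging
import time

def retry_on_failure(retries: int = 3, delay: float = 1.0):
    """Retry the wrapped function, logging an error if every attempt fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as exc:
                    logging.warning("%s failed (attempt %d/%d): %s",
                                    func.__name__, attempt, retries, exc)
                    if attempt < retries:
                        time.sleep(delay)
            logging.error("%s failed after %d attempts; skipping", func.__name__, retries)
            return None
        return wrapper
    return decorator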


Best Practices and Tips

1. Regularly Review Logs

Monitoring logs provide valuable insights into the service’s performance and any potential issues. Regularly reviewing these logs can help you catch and resolve problems early.

2. Optimize Configuration for Your Environment

Tailor the configuration file to your specific needs. For instance, if network metrics are not essential, disabling them can reduce the overhead on your system. Similarly, adjust check_interval based on how frequently you need updated metrics.

3. Monitor Disk Space

Ensure that the system has sufficient disk space for both logs and the metrics JSON file. Consider setting up log rotation and monitoring the size of the metrics file to avoid storage issues.
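
One simple way to keep an eye on both files is a small check like the following. The paths come from the example configuration and the 100 MB threshold is an arbitrary illustration:

# Quick size check for the log and metrics files; paths are from the example
# configuration and the 100 MB threshold is an arbitrary illustration.
import os

FILES = ["/var/log/lxc_monitor.log", "/var/log/lxc_metrics.json"]
LIMIT_BYTES = 100 * 1024 * 1024  # 100 MB, adjust to taste

for path in FILES:
    size = os.path.getsize(path) if os.path.exists(path) else 0
    flag = "WARNING" if size > LIMIT_BYTES else "ok"
    print(f"{path}: {size / (1024 * 1024):.1f} MB [{flag}]")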

4. Test Configuration Changes

Before applying significant changes to the monitoring configuration, test them in a non-production environment. This can help you understand the impact of the changes and avoid disrupting critical services.

5. Use Parallel Processing Wisely

If your system has multiple containers, enabling parallel processing can significantly speed up metrics collection. However, ensure that your system has enough CPU resources to handle the additional load from multiple workers.
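
As an illustration of what bounded parallel collection looks like, the sketch below uses Python's concurrent.futures with a worker limit corresponding to max_workers; collect_metrics is a stand-in for the real per-container collector:

# Illustration of parallel per-container collection bounded by max_workers.
# collect_metrics() is a placeholder for the real collector function.
from concurrent.futures import ThreadPoolExecutor, as_completed

def collect_metrics(container_id: str) -> dict:
    # Placeholder: the real service gathers CPU, memory, I/O, network, etc.
    return {"container_id": container_id}

def collect_all(container_ids, max_workers: int = 8):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(collect_metrics, cid): cid for cid in container_ids}
        for future in as_completed(futures):
            results.append(future.result())
    return results

print(collect_all(["100", "101", "102"], max_workers=8))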