Engine / Models API
Prefix: /api/engine
Source: backend/app/api/engine.py
Models
List Models
GET /api/engine/models
Returns all registered models with download status, paths, and metadata.
Download Model
POST /api/engine/models/download
{ "model_id": "mlx-community/Qwen3-1.7B-MLX-8bit" }
Downloads model files from Hugging Face. Runs in the background.
Delete Model
POST /api/engine/models/delete
{ "model_id": "model-uuid" }
Removes model files from disk.
Register Custom Model
POST /api/engine/models/register
{
  "name": "My Model",
  "path": "/absolute/path/to/model",
  "url": "https://huggingface.co/..."
}
Scan Directory
POST /api/engine/models/scan
{ "path": "/path/to/models/directory" }
Auto-discovers and registers all valid MLX models in the directory.
Get Active Model
GET /api/engine/models/active
Returns the currently loaded model with id, name, size, path, architecture, context_window, and is_vision. Returns { "model": null } if no model is loaded.
Load Model
POST /api/engine/models/load
{ "model_id": "model-uuid", "kv_quantization": 4 }
Loads the model into MLX memory. Returns context_window and architecture if available. Only one model can be loaded at a time.
kv_quantization: optional, 4 or 8. Enables KV-cache quantization during generation. Omit for no quantization.
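The kv_quantization constraint can be enforced client-side before calling the endpoint. A minimal sketch (the `build_load_payload` helper is illustrative, not part of the API):

```python
def build_load_payload(model_id, kv_quantization=None):
    """Build the JSON body for POST /api/engine/models/load.

    kv_quantization must be 4, 8, or omitted (no KV-cache quantization).
    """
    if kv_quantization not in (None, 4, 8):
        raise ValueError("kv_quantization must be 4 or 8 when set")
    payload = {"model_id": model_id}
    if kv_quantization is not None:
        payload["kv_quantization"] = kv_quantization
    return payload
```

Omitting the field entirely, rather than sending null, matches the "omit for no quantization" behavior described above.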
Unload Model
POST /api/engine/models/unload
Frees the model from memory.
List Adapters
GET /api/engine/models/adapters
Returns models where is_finetuned is true.
Export Model
POST /api/engine/models/export
{
  "model_id": "adapter-uuid",
  "output_path": "/path/to/output",
  "q_bits": 4
}
q_bits: valid values are 0, 2, 3, 4, 6, 8. 0 = full precision.
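The q_bits whitelist can be checked before submitting an export request. A sketch (helper name is illustrative):

```python
VALID_Q_BITS = {0, 2, 3, 4, 6, 8}  # 0 = full precision (no quantization)

def build_export_payload(model_id, output_path, q_bits=4):
    """Build the JSON body for POST /api/engine/models/export."""
    if q_bits not in VALID_Q_BITS:
        raise ValueError(f"q_bits must be one of {sorted(VALID_Q_BITS)}")
    return {"model_id": model_id, "output_path": output_path, "q_bits": q_bits}
```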
Get Model Format
GET /api/engine/models/{model_id}/format
Returns model_type, has_chat_template, eos_token, bos_token, pad_token, and other tokenizer metadata. Useful for determining the chat template format before training.
Chat
Generate (SSE)
POST /api/engine/chat
{
  "model_id": "model-uuid",
  "messages": [
    { "role": "user", "content": "Hello" }
  ],
  "temperature": 0.7,
  "max_tokens": 512,
  "top_p": 0.9,
  "repetition_penalty": 1.1,
  "seed": null
}
Returns a streaming SSE response. Each event is a JSON object with a text field. The engine uses mlx_lm.stream_generate with a persistent KV cache (make_prompt_cache / trim_prompt_cache) across turns. Common token prefixes are reused automatically.
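On the client, the stream can be consumed line by line. A minimal sketch, assuming standard SSE framing where each event arrives as a `data: {...}` line carrying the JSON object described above:

```python
import json

def iter_sse_text(lines):
    """Yield the `text` field from each SSE data line.

    `lines` is any iterable of decoded lines, e.g. from
    requests' Response.iter_lines() (usage assumed, not shown).
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and comments
        event = json.loads(line[len("data:"):].strip())
        if "text" in event:
            yield event["text"]
```

Concatenating the yielded chunks reconstructs the full response text.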
messages[].content can be a string or an array of content parts (for vision models):
[
  { "type": "text", "text": "Describe this image" },
  { "type": "image_url", "image_url": { "url": "data:image/png;base64,..." } }
]
Images must be under 20 MB.
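Building the vision content parts can be sketched as follows. The helpers are illustrative, and the 20 MB limit is checked here against the raw image bytes (whether the server measures raw or base64-encoded size is an assumption):

```python
import base64

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # 20 MB limit from the API

def image_part(image_bytes, mime="image/png"):
    """Build an image_url content part with a base64 data URL."""
    if len(image_bytes) >= MAX_IMAGE_BYTES:
        raise ValueError("image exceeds the 20 MB limit")
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}

def vision_message(text, image_bytes):
    """Build a user message mixing a text part and an image part."""
    return {"role": "user",
            "content": [{"type": "text", "text": text}, image_part(image_bytes)]}
```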
Stop Generation
POST /api/engine/chat/stop
Signals the active generation to stop. Returns { "status": "stopped" }.
Predict Completion
POST /api/engine/predict
{
  "model_id": "model-uuid",
  "prompt": "def hello(",
  "max_tokens": 50
}
Non-streaming single-pass completion for inline code suggestions (ghost text). Returns { "text": "..." }.
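For ghost-text display, clients often trim the completion to a single line. This trimming is a client-side choice, not API behavior; the helper below is purely illustrative:

```python
def ghost_text(response_json, max_chars=120):
    """Extract a one-line inline suggestion from a /predict response."""
    text = response_json.get("text", "")
    # Keep only the first line and cap its length for inline display.
    return text.split("\n", 1)[0][:max_chars]
```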
Fine-Tuning
Start Fine-Tuning
POST /api/engine/finetune
{
  "model_id": "model-uuid",
  "dataset_path": "/path/to/data.jsonl",
  "job_name": "my-finetune",
  "epochs": 3,
  "learning_rate": 1e-4,
  "batch_size": 1,
  "lora_rank": 8,
  "lora_alpha": 16.0,
  "lora_dropout": 0.0,
  "lora_layers": 8,
  "max_seq_length": 512,
  "seed": null
}
Field ranges: epochs 1–100, batch_size 1–64, lora_rank 1–256, lora_layers 1–128, max_seq_length 64–32768.
Returns { "job_id": "...", "status": "running" }.
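The documented field ranges can be validated client-side before submitting a job. A sketch (the helper and table names are illustrative; the bounds come from the ranges listed above):

```python
FINETUNE_RANGES = {
    "epochs": (1, 100),
    "batch_size": (1, 64),
    "lora_rank": (1, 256),
    "lora_layers": (1, 128),
    "max_seq_length": (64, 32768),
}

def validate_finetune(params):
    """Raise ValueError if any bounded field is out of its documented range."""
    for field, (lo, hi) in FINETUNE_RANGES.items():
        if field in params and not lo <= params[field] <= hi:
            raise ValueError(f"{field} must be in [{lo}, {hi}]")
    return params
```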
Get Job Status
GET /api/engine/jobs/{job_id}
Returns the current training state: step, epoch, loss, metrics, and completion status.