Model Evaluations
Source: src/renderer/src/components/Evaluations.tsx
Overview
Run benchmarks against the currently loaded model. Each benchmark draws from a pool of 30 questions; results are scored as percent correct and stored in localStorage for comparison across sessions.
Available Benchmarks
| Benchmark | Measures | Pool size |
|---|---|---|
| MMLU | General knowledge (multiple choice) | 30 |
| HellaSwag | Common-sense sentence completion | 30 |
| HumanEval | Python function generation | 30 |
| TruthfulQA | Factual yes/no answers | 30 |
Sample Count
The default is the full pool of 30 questions. Use the selector in the banner to reduce it to 10 or 20. Questions are sampled randomly from the pool on each run, so repeated runs with fewer samples typically test different subsets.
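Random sampling without replacement can be done with a partial Fisher–Yates shuffle. This is a minimal sketch, not the component's actual code; the function name is illustrative.

```typescript
// Pick `count` distinct questions from a pool. Only the first `count`
// slots need shuffling, so the loop stops early (partial Fisher–Yates).
function sampleQuestions<T>(pool: T[], count: number): T[] {
  const copy = [...pool];
  const n = Math.min(count, copy.length);
  for (let i = 0; i < n; i++) {
    // j is a uniform random index in [i, copy.length)
    const j = i + Math.floor(Math.random() * (copy.length - i));
    [copy[i], copy[j]] = [copy[j], copy[i]];
  }
  return copy.slice(0, n);
}
```

Because the shuffle starts from the full pool each call, `sampleQuestions(pool, 10)` yields a fresh 10-question subset per run.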
Workflow
- Load a model from the Models tab.
- Optionally change the sample count in the banner (default: 30).
- Click "Run" next to the benchmark you want to evaluate.
- Progress is shown as each question is answered.
- The final score (% correct) appears in the history table.
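The score shown in the history table is the fraction of correct answers expressed as a percentage. A minimal sketch of that computation (names and rounding are assumptions, not taken from Evaluations.tsx):

```typescript
// Hypothetical result shape: how many questions were answered correctly
// out of how many were asked.
interface EvalResult {
  correct: number;
  total: number;
}

// Percent correct, rounded to one decimal place. Guards against a
// zero-question run to avoid NaN.
function scorePercent(result: EvalResult): number {
  if (result.total === 0) return 0;
  return Math.round((result.correct / result.total) * 1000) / 10;
}
```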
Evaluation History
Past results are stored in localStorage under the key silicon-studio-eval-history and displayed in a history table with model name, date, benchmark, score, and question count. Use the "Clear" button to delete all stored results.
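Persistence under that key could look like the sketch below. The entry fields mirror the history-table columns; the exact shapes in Evaluations.tsx may differ. The `KVStore` interface stands in for `window.localStorage` so the logic is testable outside the renderer.

```typescript
const HISTORY_KEY = "silicon-studio-eval-history";

// One row of the history table (assumed field names).
interface EvalHistoryEntry {
  model: string;
  date: string; // ISO timestamp
  benchmark: string;
  score: number; // percent correct
  questionCount: number;
}

// Minimal subset of the Storage API; in the renderer this would be
// window.localStorage.
interface KVStore {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

// Parse stored history, falling back to an empty list on missing or
// corrupted data.
function loadHistory(storage: KVStore): EvalHistoryEntry[] {
  try {
    return JSON.parse(storage.getItem(HISTORY_KEY) ?? "[]");
  } catch {
    return [];
  }
}

// Append one result and write the full list back.
function appendHistory(storage: KVStore, entry: EvalHistoryEntry): void {
  const history = loadHistory(storage);
  history.push(entry);
  storage.setItem(HISTORY_KEY, JSON.stringify(history));
}
```

"Clear" then amounts to removing the key (or writing `"[]"` back).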
Limitations
- Requires a model to be loaded in memory.
- All benchmarks are implemented locally; they are not identical to the official reference implementations, so scores are not directly comparable to published leaderboard numbers.
- Results are stored locally only; no server persistence.