My benchmarking tool for local LLMs: quality × hardware × dollars, measured on the machine you actually own. Existing tools measure how fast hardware runs models, or how good outputs are — never both at once, and never what the difference is worth in money.
Metis is named for the Greek titaness of practical wisdom and cunning counsel. Fittingly for a tool about model capability, she was swallowed by something bigger and kept advising from the inside.
The engine is headless by design. Every measurement is captured to JSON artifacts on disk; anything visual (pages, charts, reports) only ever reads those artifacts back. No chart is allowed to be the source of a number. This keeps results reproducible and the measurements honest.
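A minimal sketch of the artifact-first rule, under stated assumptions: the function names, the one-file-per-run layout, and the `decode_tok_s` field are all hypothetical and only illustrate the idea that the reporting side derives numbers from disk rather than measuring anything itself.

```python
import json
from pathlib import Path


def write_artifact(out_dir: Path, run_id: str, measurement: dict) -> Path:
    """Persist one measurement as a JSON artifact (hypothetical layout)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{run_id}.json"
    path.write_text(json.dumps(measurement, indent=2, sort_keys=True))
    return path


def load_artifacts(out_dir: Path) -> list[dict]:
    """The only entry point a report or chart would use: read artifacts back."""
    return [json.loads(p.read_text()) for p in sorted(out_dir.glob("*.json"))]


def mean_decode_rate(artifacts: list[dict]) -> float:
    """A report-side number, derived purely from what is on disk."""
    rates = [a["decode_tok_s"] for a in artifacts]
    return sum(rates) / len(rates)
```

Because the report functions accept only loaded artifacts, deleting the charts and regenerating them from the same directory should give the same numbers.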
// the pipeline
Run → score → judge → report → economics.
One command takes a model and a task suite end to end.
1 · run
Execute every task against the model through Ollama, capturing full telemetry per generation.
2 · score
Layered scoring, programmatic ground truth first: code executed against tests, exact-answer checks.
3 · judge
LLM-as-judge only for what cannot be checked mechanically, never for anything verifiable.
4 · report
Roll the artifacts up into findings: quality, speed, VRAM, coverage, all read back from JSON.
5 · economics
Price each run against configurable API pricing for a real per-task cost and a break-even against cloud.
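The economics step can be sketched as a few lines of arithmetic. This is an illustration, not the tool's actual code: `ApiPricing`, the function names, and the inputs (electricity price, hardware cost) are assumptions, and any numbers passed in are placeholders rather than real price lists or measured results.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiPricing:
    """Configurable cloud pricing, in dollars per million tokens (hypothetical)."""
    input_per_mtok: float
    output_per_mtok: float


def cloud_cost(pricing: ApiPricing, input_tokens: int, output_tokens: int) -> float:
    """What the same task would cost through a metered API."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def local_cost(energy_wh: float, dollars_per_kwh: float) -> float:
    """Electricity cost of one local run, from measured energy in watt-hours."""
    return energy_wh / 1000 * dollars_per_kwh


def break_even_tasks(hardware_dollars: float,
                     cloud_per_task: float,
                     local_per_task: float) -> Optional[float]:
    """Tasks needed before the hardware pays for itself; None if it never does."""
    saving = cloud_per_task - local_per_task
    if saving <= 0:
        return None
    return hardware_dollars / saving
```

Returning `None` when local is not cheaper per task keeps the report from printing a negative or infinite break-even.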
// the task suite
Frozen, versioned, fingerprinted.
A frozen task suite (v1.0, 21 tasks) spanning reasoning, coding, summarisation, instruction-following, and multi-step agentic tool use. Freezing the suite is what makes runs comparable across models and across time; every run is stamped with a hardware fingerprint, so a number always knows which machine produced it.
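One way to fingerprint a suite or a machine is to hash a canonical JSON form of its description. The sketch below assumes that approach; the field names, the 16-character truncation, and `stamp_run` are hypothetical, not the tool's actual scheme.

```python
import hashlib
import json


def fingerprint(payload: dict) -> str:
    """Stable short hash of a description, independent of key order."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


def stamp_run(result: dict, suite: dict, hardware: dict) -> dict:
    """Attach suite and hardware fingerprints to a run result (hypothetical fields)."""
    return {
        **result,
        "suite_fingerprint": fingerprint(suite),
        "hardware_fingerprint": fingerprint(hardware),
    }
```

Two runs are then comparable only when their suite fingerprints match, and a differing hardware fingerprint flags that the numbers came from different rigs.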
Programmatic ground truth wherever possible: the moment a task can be checked by executing code or matching an exact answer, the judge is taken out of the loop. The LLM-as-judge tier exists only for genuinely open outputs, and even then against a rubric.
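The layering described above can be sketched as a dispatch that reaches the judge only when no mechanical check exists. The task keys (`expected`, `check`, `rubric`) and the judge signature are assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical judge signature: (model output, rubric) -> score in [0, 1].
Judge = Callable[[str, str], float]


def score_task(task: dict, output: str, judge: Optional[Judge] = None) -> dict:
    """Score one output, preferring programmatic ground truth over the judge."""
    # Tier 1: exact-answer check.
    if "expected" in task:
        passed = output.strip() == task["expected"].strip()
        return {"score": 1.0 if passed else 0.0, "tier": "exact"}
    # Tier 2: any programmatic check, e.g. running generated code against tests.
    if "check" in task:
        passed = bool(task["check"](output))
        return {"score": 1.0 if passed else 0.0, "tier": "programmatic"}
    # Tier 3: LLM-as-judge against a rubric, only for genuinely open outputs.
    if judge is None:
        raise ValueError("open-ended task needs a judge")
    return {"score": judge(output, task["rubric"]), "tier": "judge"}
```

Recording the tier alongside the score makes it possible to report how much of a model's result rests on mechanical checks versus judged ones.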
// what each run captures
The full telemetry.
latency
TTFT plus prefill and decode tokens/second, per generation.
memory
Peak VRAM, the number that decides whether a model even fits an 8 GB card.
power
GPU power draw and total energy per run, for the cost-per-task maths.
hardware
A per-run fingerprint: GPU, CPU, RAM, so results are tied to the rig that made them.
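A rough sketch of how two of these numbers could be derived. The throughput function reads the token counts and nanosecond durations that Ollama includes in its final generation response (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`); the energy function assumes power is sampled as `(seconds, watts)` pairs, which is a hypothetical format and not necessarily how the tool collects it.

```python
NS_PER_SECOND = 1e9


def throughput(resp: dict) -> dict:
    """Prefill and decode tokens/second from an Ollama generation response."""
    prefill_s = resp["prompt_eval_duration"] / NS_PER_SECOND
    decode_s = resp["eval_duration"] / NS_PER_SECOND
    return {
        "prefill_tok_s": resp["prompt_eval_count"] / prefill_s,
        "decode_tok_s": resp["eval_count"] / decode_s,
    }


def energy_wh(samples: list[tuple[float, float]]) -> float:
    """Trapezoidal integration of (seconds, watts) samples into watt-hours."""
    joules = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        joules += (p0 + p1) / 2 * (t1 - t0)
    return joules / 3600
```

TTFT is not computed here; it would typically be timed on the client side as the delay until the first streamed token arrives.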
// updates
Project Log
Context-length scaling: the 16k decode cliff
A scaling mode pads each task to a target context window and measures decode throughput. qwen3:8b holds ~40 tok/s through 8k, then drops to 9.8 tok/s at 16k with zero errors and unchanged quality: the KV cache has silently spilled past the card's 8 GB of VRAM into shared system memory.
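Since this failure produces no errors and no quality change, it only shows up if throughput is compared across context sizes. A small sketch of such a check follows; the function name and the 50% drop threshold are assumptions chosen for illustration.

```python
from typing import Optional


def find_decode_cliff(points: dict[int, float], max_drop: float = 0.5) -> Optional[int]:
    """Return the first context size whose decode tok/s falls below
    max_drop times the rate at the previous, smaller size; None if no cliff."""
    sizes = sorted(points)
    for prev, cur in zip(sizes, sizes[1:]):
        if points[cur] < points[prev] * max_drop:
            return cur
    return None


# The two figures from this log entry: ~40 tok/s at 8k, 9.8 tok/s at 16k.
measured = {8192: 40.0, 16384: 9.8}
cliff = find_decode_cliff(measured)
```

With those two points the check flags 16384, because 9.8 is well under half of 40.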
Routing benchmark feeds the fleet
The per-category routing claim graduated from a Metis simulation into the AI Command Center's live --auto router: clearly-safe text work goes to local qwen3:8b, coding to Codex, agentic and low-confidence tasks stay on Claude.
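The routing rule reads naturally as a small decision function. The sketch below is an assumption-heavy illustration: which categories count as clearly-safe text work, the confidence threshold, and the fallback for unknown categories are all hypothetical, not the live router's settings.

```python
# Hypothetical set of categories treated as clearly-safe text work.
SAFE_TEXT = {"summarisation", "instruction-following"}


def route(category: str, confidence: float, threshold: float = 0.8) -> str:
    """Pick a backend for a task by category and routing confidence."""
    # Agentic and low-confidence tasks stay on Claude.
    if category == "agentic" or confidence < threshold:
        return "claude"
    if category == "coding":
        return "codex"
    if category in SAFE_TEXT:
        return "qwen3:8b"
    # Assumption: anything unrecognised falls back to Claude.
    return "claude"
```

Checking confidence before category means a low-confidence coding or text task is never sent to the cheaper backend by accident.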
Headless engine working end to end
Run, score, judge, report, economics — against three local models and a Claude reference, N=5. qwen3:8b reached 87% of Claude's mean quality and matched it through depth-5 tool use. All artifacts published under results/published/.