06 / AI Evaluation · the tool

Metis · work in progress

My benchmarking tool for local LLMs: quality × hardware × dollars, measured on the machine you actually own. Existing tools measure how fast hardware runs models, or how good outputs are — never both at once, and never what the difference is worth in money.

Named for the Greek titaness of practical wisdom and cunning counsel. Fittingly for a tool about model capability, she was swallowed by something bigger and kept advising from the inside.

Python · Ollama · Headless engine · LLM-as-judge · Break-even economics
↗ github.com/lachydotmcg/metis
// the results

What it found.

The point of the tool is the evaluation it produces. The headline numbers below come from a single RTX 3060 8 GB; the full data lives on the research page.

87%
qwen3:8b vs Claude

of Claude Sonnet 4.6's mean per-task quality, on an 8 GB card.

−85%
cost vs all-Sonnet

Route local-first, send only coding to Claude, and the suite costs about 6.6× less than all-Sonnet (1/6.6 ≈ 15% of the bill).

depth 5
reliable tool use

The first local tier where multi-step agentic work actually holds up.

📊 See the full evaluation data →
// the principle

Nothing visual ever measures.

The engine is headless by design. Every measurement is captured to JSON artifacts on disk; anything visual (pages, charts, reports) only ever reads those artifacts back. No chart is allowed to be the source of a number, which keeps results reproducible and the measurement honest.
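One way to picture the rule, as a minimal sketch rather than Metis's actual code: the measuring side writes a JSON artifact, and the reporting side is only ever handed a path to read. The file layout, the field names, and the functions `write_artifact` and `load_for_report` are assumptions made up for illustration.

```python
import json
import tempfile
from pathlib import Path


def write_artifact(directory: Path, run_id: str, measurement: dict) -> Path:
    """Measurement side: persist one run's numbers to disk as JSON."""
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"{run_id}.json"
    # sort_keys keeps the file stable across runs, which helps diffing.
    path.write_text(json.dumps(measurement, indent=2, sort_keys=True))
    return path


def load_for_report(path: Path) -> dict:
    """Reporting side: read back what was measured, never recompute it."""
    return json.loads(path.read_text())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical run id and field name; the value is illustrative.
        artifact = write_artifact(
            Path(tmp), "example-run", {"model": "qwen3:8b", "decode_tok_s": 40.0}
        )
        print(load_for_report(artifact)["decode_tok_s"])
```

Keeping the two functions separate is the point of the design: a chart can call `load_for_report`, but it has no way to produce a number of its own.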

// the pipeline

Run → score → judge → report → economics.

One command takes a model and a task suite end to end.

1 · run

Execute every task against the model through Ollama, capturing full telemetry per generation.

2 · score

Layered scoring, programmatic ground truth first: code executed against tests, exact-answer checks.

3 · judge

LLM-as-judge only for what cannot be checked mechanically, never for anything verifiable.

4 · report

Roll the artifacts up into findings: quality, speed, VRAM, coverage, all read back from JSON.

5 · economics

Price each run against configurable API pricing for a real per-task cost and a break-even against cloud.
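The economics step can be sketched as follows, under stated assumptions: local cost is taken to be electricity only, cloud cost is priced per million tokens, and break-even is the number of tasks needed to pay off the hardware. The class and function names are hypothetical, and any prices passed in are placeholders rather than real API pricing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ApiPricing:
    """Configurable cloud pricing, in dollars per million tokens."""
    input_per_mtok: float
    output_per_mtok: float


def api_cost(pricing: ApiPricing, input_tokens: int, output_tokens: int) -> float:
    """Dollars the same task would cost through the cloud API."""
    total = (input_tokens * pricing.input_per_mtok
             + output_tokens * pricing.output_per_mtok)
    return total / 1_000_000


def local_cost(energy_wh: float, dollars_per_kwh: float) -> float:
    """Dollars of electricity for one local run (assumption: energy only)."""
    return energy_wh / 1000.0 * dollars_per_kwh


def break_even_tasks(hardware_dollars: float,
                     cloud_per_task: float,
                     local_per_task: float) -> Optional[float]:
    """Tasks needed before the hardware pays for itself.

    Returns None when running locally is not cheaper per task,
    because in that case there is no break-even point.
    """
    saving = cloud_per_task - local_per_task
    if saving <= 0:
        return None
    return hardware_dollars / saving
```

A fuller model might also amortise hardware over its lifetime or weight the comparison by quality; this sketch leaves both out.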

// the task suite

Frozen, versioned, fingerprinted.

A frozen task suite (v1.0, 21 tasks) spanning reasoning, coding, summarisation, instruction-following, and multi-step agentic tool use. Freezing the suite is what makes runs comparable across models and across time; every run is stamped with a hardware fingerprint, so a number always knows which machine produced it.
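A minimal sketch of what freezing and stamping could look like, assuming the suite is a list of JSON-serialisable task dicts: a digest over a canonical encoding changes whenever any task changes, and each result carries that digest plus a description of the machine. The function names and result fields are hypothetical. The GPU name is passed in because detecting it needs a vendor tool outside the standard library, and RAM is omitted for the same reason.

```python
import hashlib
import json
import os
import platform


def suite_digest(tasks: list[dict]) -> str:
    """Hash a canonical JSON form so any edit to a task changes the digest."""
    canonical = json.dumps(tasks, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def hardware_fingerprint(gpu_name: str) -> dict:
    """Describe the rig using only what the standard library can see."""
    return {
        "gpu": gpu_name,  # supplied by the caller, not detected here
        "machine": platform.machine(),
        "system": platform.system(),
        "cpu_count": os.cpu_count(),
    }


def stamp_run(result: dict, tasks: list[dict], gpu_name: str) -> dict:
    """Attach the suite digest and hardware fingerprint to a run result."""
    return {
        **result,
        "suite_sha256": suite_digest(tasks),
        "hardware": hardware_fingerprint(gpu_name),
    }
```

Two runs are then comparable only when their `suite_sha256` values match, and the `hardware` block says which machine each number came from.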

Programmatic ground truth wherever possible: the moment a task can be checked by executing code or matching an exact answer, the judge is taken out of the loop. The LLM-as-judge tier exists only for genuinely open outputs, and even then against a rubric.
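The layering described above might look like the sketch below. It is an illustration, not the tool's scorer: the task keys `expected` and `check`, and the `judge` callable, are hypothetical. The `check` callable stands in for executing code against tests.

```python
from typing import Callable, Optional

Judge = Callable[[dict, str], float]


def score_task(task: dict, output: str, judge: Optional[Judge] = None) -> dict:
    """Score one output, using the judge only when nothing mechanical applies."""
    # Tier 1: exact-answer check.
    if "expected" in task:
        ok = output.strip() == task["expected"].strip()
        return {"score": 1.0 if ok else 0.0, "method": "exact"}

    # Tier 2: programmatic check, e.g. a callable that runs the tests.
    if "check" in task:
        ok = bool(task["check"](output))
        return {"score": 1.0 if ok else 0.0, "method": "programmatic"}

    # Tier 3: open-ended output, scored by a judge against the task's rubric.
    if judge is None:
        raise ValueError("task has no mechanical check and no judge was given")
    return {"score": float(judge(task, output)), "method": "judge"}
```

Because the mechanical tiers return early, a verifiable task never reaches the judge, even when one is supplied.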

// what each run captures

The full telemetry.

latency

TTFT plus prefill and decode tokens/second, per generation.

memory

Peak VRAM, the number that decides whether a model even fits an 8 GB card.

power

GPU power draw and total energy per run, for the cost-per-task maths.

hardware

A per-run fingerprint: GPU, CPU, RAM, so results are tied to the rig that made them.
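As a sketch of how two of these numbers can be derived: Ollama's generate response reports `prompt_eval_count`, `prompt_eval_duration`, `eval_count`, and `eval_duration`, with durations in nanoseconds, which is enough for prefill and decode tokens/second. The helper names and the fixed-interval power sampling are assumptions for illustration. TTFT is left out because it needs timing on a streamed response.

```python
from typing import Optional, Sequence

NS_PER_S = 1_000_000_000


def _rate(count: Optional[int], duration_ns: Optional[int]) -> Optional[float]:
    """Tokens per second, or None when a field is missing or zero."""
    if not count or not duration_ns:
        return None
    return count / (duration_ns / NS_PER_S)


def throughput_from_ollama(resp: dict) -> dict:
    """Prefill and decode tokens/second from one Ollama generate response."""
    return {
        "prefill_tok_s": _rate(resp.get("prompt_eval_count"),
                               resp.get("prompt_eval_duration")),
        "decode_tok_s": _rate(resp.get("eval_count"),
                              resp.get("eval_duration")),
    }


def energy_wh(power_samples_w: Sequence[float], interval_s: float) -> float:
    """Total energy in watt-hours from power samples at a fixed interval.

    Assumption: each sample represents the draw for one full interval.
    """
    joules = sum(power_samples_w) * interval_s
    return joules / 3600.0
```

Returning `None` instead of raising keeps a run's artifact complete even when a field is absent, for example when the prompt was served from cache.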

// updates

Project Log

Context-length scaling: the 16k decode cliff

A scaling mode pads each task to a target context window and measures decode throughput. qwen3:8b holds ~40 tok/s through 8k, then drops to 9.8 tok/s at 16k with zero errors and unchanged quality: the KV cache silently spills past the card's 8 GB of VRAM into shared system memory.
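Because the cliff produces no errors and no quality change, it only shows up when throughput is compared across context sizes. A minimal sketch of that comparison follows. The function name and the 0.5 threshold are assumptions, and the example treats 8k and 16k as 8192 and 16384 tokens.

```python
def find_decode_cliffs(
    tok_s_by_context: dict[int, float],
    max_ratio: float = 0.5,
) -> list[tuple[int, int, float]]:
    """Flag adjacent context sizes where decode throughput collapses.

    Returns (smaller_context, larger_context, ratio) for each step where
    the larger context runs at less than max_ratio of the smaller one.
    """
    points = sorted(tok_s_by_context.items())
    cliffs = []
    for (ctx_a, rate_a), (ctx_b, rate_b) in zip(points, points[1:]):
        if rate_a > 0 and rate_b / rate_a < max_ratio:
            cliffs.append((ctx_a, ctx_b, rate_b / rate_a))
    return cliffs


if __name__ == "__main__":
    # Figures from the log entry above: ~40 tok/s at 8k, 9.8 tok/s at 16k.
    for small, large, ratio in find_decode_cliffs({8192: 40.0, 16384: 9.8}):
        print(f"{small} -> {large}: decode at {ratio:.0%} of previous rate")
```

A check like this could run alongside peak-VRAM telemetry, since the drop coincides with memory spilling rather than with any failure the model reports.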

Routing benchmark feeds the fleet

The per-category routing claim graduated from a Metis simulation into the AI Command Center's live --auto router: clearly safe text work goes to local qwen3:8b, coding to Codex, agentic and low-confidence tasks stay on Claude.
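The routing rule reads naturally as a small decision function. This is a sketch of the policy as described, not the router's code: the category labels, the contents of the "clearly safe" set, and the 0.8 confidence threshold are all assumptions.

```python
LOCAL = "qwen3:8b"
CODEX = "Codex"
CLAUDE = "Claude"

# Hypothetical set of categories treated as clearly safe text work.
SAFE_TEXT_CATEGORIES = {"summarisation", "instruction-following"}


def route(category: str, confidence: float, min_confidence: float = 0.8) -> str:
    """Pick a backend for one task from its category and classifier confidence."""
    # Low-confidence and agentic tasks stay on Claude, whatever else is true.
    if confidence < min_confidence or category == "agentic":
        return CLAUDE
    if category == "coding":
        return CODEX
    if category in SAFE_TEXT_CATEGORIES:
        return LOCAL
    # Anything unrecognised falls back to the strongest option.
    return CLAUDE
```

Checking confidence first means an uncertain classification can never send work to the local model by accident.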

Headless engine working end to end

The full pipeline (run, score, judge, report, economics) ran against three local models and a Claude reference, N=5. qwen3:8b reached 87% of Claude's mean quality and matched it through depth-5 tool use. All artifacts are published under results/published/.