← Tīrtha Results Measured, not claimed

Frontier-class code, for a fraction of the cost. And we measured it.

A local model does the everyday work. Tīrtha runs every answer to make sure it's right, and escalates only the genuinely hard problems to a frontier model. Here's exactly how it was tested — numbers included.

Pre-beta — preliminary until the full run finishes.
Head to head · HumanEval+ (164 problems)

Level with the best models, and ~43% of the work served free.

How often each one gets a problem right
ModelGot it rightServed free
Tīrtha95%43%
Claude (frontier)96%n/a
Codex (frontier)95%n/a
The local model, alone82%100%
The dial

Accuracy and cost move together — and you hold the dial.

Check the work harder and it reaches frontier accuracy. Check lighter and the cost drops toward nothing. You pick the operating point per workload.

FRONTIER ACCURACY LOCAL MODEL ALONE SWEET SPOT HOW HARD IT CHECKS
how often it's right cost
Three real operating points

Same engine. You pick the balance.

Every number here is measured, not guessed.

Most accurate
95%

Match the frontier.

Escalates the hard problems to a top model. Highest accuracy — and still cheaper than using that model for everything.

Best value
91%

Nearly as good, far less spend.

Escalates to a smaller, faster frontier model instead. Gives up a few points for a large cost cut.

Lowest cost
82%

Bare-minimum spend.

The local model alone. Near-free, for when speed and cost matter most.

Speed

And it gets faster the more you use it.

Tīrtha caches the answers it has already verified, so work it has seen before comes straight back.

Instant
24×

Repeat work, right away.

Something it has already worked out and tested comes back about 24× faster than asking a frontier model again.

Faster
1.4×

Quicker on everyday code.

For normal problems, the local model beats the frontier on speed too — by about 1.4×.

Grows
More

Better with every use.

The more it runs, the more it remembers, so a growing share of answers return instantly.

One honest note: a brand-new hard problem takes a moment longer, because Tīrtha runs and checks it before trusting the answer.

Real-world test, no answer key

It holds up when the system writes its own tests.

Accuracy
82%

with the system writing and checking its own tests, and no answer sheet to peek at.

Free
62%

of the work still handled by the local model under those real-world conditions.

The gap
measured

the bit between us and a perfect score is just how good the checking is — and we keep improving it. We show it, we don't hide it.

Why you can trust these numbers

Numbers that hold up to a skeptic.

No peeking

We don't grade our own homework.

The tests used to check the work are kept separate from the tests used to score it. It can't cheat by seeing the answers ahead of time.

Check it yourself

Run the test yourself.

We'll share the exact harness so you can verify every number instead of taking our word. Coming soon.

Honest

We call out our own mistakes.

We caught one of our own numbers that was inflated, and said so out loud. Everything here survived that scrutiny.