Two connected loops: a pre-production cycle and a production cycle.

North Star for engineering

Externalize the prompt, wire up observability, and stop reverse-engineering vague specs.

Most teams stall trying to begin: where do you even start with evals? Do you need a dataset first? A scoring rubric? A test suite? North Star starts wherever you do.

Business goals and user stories become a skill and prompt, then a seed, dataset, scorers, and benchmark, with backfill and improve loops.

No prompt yet?

Describe what your AI should do and what your users want to achieve with it.

North Star will draft a prompt, write a seed that formalizes your success criteria, build a dataset, and generate scorers tailored to your goals.

Already have a prompt?

Drop it in and North Star will backfill the goals and user stories so you can review and edit them to your liking.

After that, the rest of the scaffolding follows: a seed, a dataset, and scorers for you to use.

Output

Prompt

Drafted from your goals when you don't have one, versioned and tracked across iterations when you do.

Seed

Your goals, formalized into success criteria.

Dataset

An improved dataset optimized for evaluations, partly or fully synthesized from your seed.

Scorers

Tuned to that seed, not generic metrics.

Benchmark

Your application's current performance against the bar you set.
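To make the benchmark idea concrete, here is a minimal sketch of what "performance against the bar you set" can reduce to: average each scorer over the dataset and compare the result to a threshold. This is not North Star's API. The `benchmark` function, the `input` key, and the `(example, output)` scorer signature are hypothetical names chosen for illustration.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]
# Hypothetical scorer signature: (example, app output) -> score in [0, 1].
Scorer = Callable[[Example, str], float]


def benchmark(
    dataset: List[Example],
    app: Callable[[str], str],
    scorers: Dict[str, Scorer],
    bar: float,
) -> Dict[str, Tuple[float, bool]]:
    """Return {scorer name: (mean score, meets bar)} for one app over one dataset."""
    if not dataset:
        raise ValueError("dataset is empty")
    # Run the app once per example so every scorer sees the same outputs.
    outputs = [app(example["input"]) for example in dataset]
    report = {}
    for name, scorer in scorers.items():
        scores = [scorer(ex, out) for ex, out in zip(dataset, outputs)]
        average = sum(scores) / len(scores)
        report[name] = (average, average >= bar)
    return report
```

Keeping the bar as an explicit argument means the same dataset and scorers can be reused as the bar moves between iterations.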

Once you ship, the questions change. How is it actually performing? What are users doing that you didn't anticipate? When something gets worse, will you know fast enough? North Star keeps the cycle going.

User input flows through prompt, production output, sampling, scorers, and alerts, with an improve loop.

Catch regressions

Real traffic runs through the same scorers in your eval runtime, sampled to fit your budget.

Anything that drifts surfaces as an alert.
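One way to picture sampling and drift detection is the sketch below. It scores a random fraction of production calls and flags drift when the sampled mean falls below a baseline by more than a tolerance. It is an assumption-laden illustration, not how North Star implements this: `sample_and_score`, `drifted`, `sample_rate`, `baseline`, and `tolerance` are all hypothetical names.

```python
import random
from statistics import mean
from typing import Callable, Dict, Iterable, List

Call = Dict[str, str]


def sample_and_score(
    calls: Iterable[Call],
    scorer: Callable[[Call], float],
    sample_rate: float,
    rng: random.Random,
) -> List[float]:
    """Score roughly `sample_rate` of the calls; skip the rest to limit cost."""
    scores = []
    for call in calls:
        # rng.random() is in [0.0, 1.0), so 1.0 scores every call and 0.0 scores none.
        if rng.random() < sample_rate:
            scores.append(scorer(call))
    return scores


def drifted(scores: List[float], baseline: float, tolerance: float) -> bool:
    """True when the sampled mean is more than `tolerance` below `baseline`."""
    if not scores:
        return False  # nothing sampled, so there is no evidence either way
    return mean(scores) < baseline - tolerance
```

A lower `sample_rate` costs less but makes the sampled mean noisier, so the tolerance usually needs to widen as the rate drops.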

Feed the next cycle

Failing cases flow back to North Star as new entries in your dataset.

The next pre-production cycle starts with what production just taught you.
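The feedback step can be sketched as a simple merge: any scored production call below a threshold becomes a new dataset entry, unless its input is already covered. The `Case` type, the `add_failures` function, and the choice to deduplicate on input text are hypothetical, used here only to illustrate the idea.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Case:
    input: str
    output: str
    score: float


def add_failures(
    dataset: List[Case], scored_calls: List[Case], threshold: float
) -> List[Case]:
    """Return a new dataset with failing production calls appended once each."""
    seen = {case.input for case in dataset}
    merged = list(dataset)  # copy, so the original dataset is left untouched
    for case in scored_calls:
        if case.score < threshold and case.input not in seen:
            merged.append(case)
            seen.add(case.input)
    return merged
```

Returning a new list rather than mutating in place keeps each dataset version easy to compare against the one before it.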

Output

Samples

Production calls scored with the same scorers you built, sampled to fit your cost budget.

Alerts

When scores drop, you find out fast.

New test cases

The surprises get added to your dataset, automatically.

Diagnosis

When something fails, North Star points at the likely layer.

The evaluation loop

Two complementary tracks

PRE-PRODUCTION CYCLE

Build the evals before you ship.

PRODUCTION CYCLE

Keep them honest after you do.

Build once. Measure forever.