Externalize the prompt, wire up observability, and stop reverse-engineering vague specs.
Most teams stall before they begin: where do you even start with evals? Do you need a dataset first? A scoring rubric? A test suite? North Star starts wherever you do.
Describe what your AI should do and what your users want to achieve with it.
North Star will draft a prompt, write a seed that formalizes your success criteria, build a dataset, and generate scorers tailored to your goals.
If you already have a prompt, drop it in and North Star will backfill the goals and user stories so you can review and edit them to your liking.
After that, the rest of the scaffolding follows: seed, dataset, and scorers.
Prompt: drafted from your goals when you don't have one, versioned and tracked across iterations when you do.
Seed: your goals, formalized into success criteria.
Dataset: improved and optimized for evaluations, partly or fully synthesized from your seed.
Scorers: tuned to that seed, not generic metrics.
Your application's current performance against the bar you set.
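To make the pieces concrete, here is a minimal Python sketch of how a seed, a dataset, and scorers could combine into a baseline. It is an illustration only: `Criterion`, `Case`, `baseline`, and `toy_app` are hypothetical names, not North Star's actual schema or API, and the pass/fail checks are deliberately simplistic.

```python
from dataclasses import dataclass
from typing import Callable


# Hypothetical types for illustration; not North Star's actual schema.
@dataclass(frozen=True)
class Criterion:
    name: str
    check: Callable[[str, str], bool]  # (input, output) -> pass/fail


@dataclass(frozen=True)
class Case:
    input: str


def baseline(
    app: Callable[[str], str],
    dataset: list[Case],
    seed: list[Criterion],
) -> dict[str, float]:
    """Return the pass rate per success criterion for `app` over `dataset`."""
    if not dataset:
        raise ValueError("dataset must contain at least one case")
    outputs = [(case.input, app(case.input)) for case in dataset]
    return {
        criterion.name: sum(criterion.check(i, o) for i, o in outputs) / len(outputs)
        for criterion in seed
    }


# Toy seed: two success criteria for a support-reply assistant.
seed = [
    Criterion("is_concise", lambda _i, o: len(o.split()) <= 20),
    Criterion(
        "mentions_refund",
        lambda i, o: "refund" not in i.lower() or "refund" in o.lower(),
    ),
]
dataset = [Case("How do I get a refund?"), Case("Where is my order?")]


def toy_app(prompt: str) -> str:
    # Stand-in for the real application under test.
    return "Please check your order page for details."


report = baseline(toy_app, dataset, seed)
```

Here each scorer is derived from one criterion in the seed, which is the sense in which scorers are tuned to the seed instead of to generic metrics. The resulting `report` is the application's current performance against the bar the criteria set.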
Once you ship, the questions change. How is it actually performing? What are users doing that you didn't anticipate? When something gets worse, will you know fast enough? North Star keeps the cycle going.
Real traffic runs through the same scorers in your eval runtime, sampled to fit your budget.
Anything that drifts surfaces as an alert.
Failing cases flow back to North Star as new entries in your dataset.
The next pre-production cycle starts with what production just taught you.
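The loop above can be sketched in a few lines of Python. This is a sketch under stated assumptions, not North Star's implementation: `score_sampled_traffic`, the `threshold` cutoff, and the single `non_empty` scorer are hypothetical, and a fixed random sample rate stands in for whatever budget-aware sampling the real runtime uses.

```python
import random
from typing import Callable

Scorer = Callable[[str, str], float]  # (input, output) -> score in [0, 1]


def score_sampled_traffic(
    calls: list[tuple[str, str]],
    scorers: dict[str, Scorer],
    sample_rate: float,
    threshold: float,
    rng: random.Random,
) -> tuple[list[dict[str, float]], list[dict]]:
    """Score a random sample of production calls.

    Returns the per-call scores and the cases where any scorer fell
    below `threshold`, ready to be appended to the eval dataset.
    """
    scores = []
    failing = []
    for inp, out in calls:
        if rng.random() >= sample_rate:
            continue  # not sampled: skipped to stay within the cost budget
        results = {name: fn(inp, out) for name, fn in scorers.items()}
        scores.append(results)
        if results and min(results.values()) < threshold:
            failing.append({"input": inp, "output": out, "scores": results})
    return scores, failing


# The same (hypothetical) scorer used before shipping, reused on live traffic.
scorers = {"non_empty": lambda _inp, out: 1.0 if out.strip() else 0.0}
calls = [("hi", "hello"), ("help", "")]

scores, failing = score_sampled_traffic(
    calls, scorers, sample_rate=1.0, threshold=0.5, rng=random.Random(0)
)

# Failing cases become new dataset entries for the next pre-production cycle.
dataset: list[dict] = []
dataset.extend(failing)
```

Lowering `sample_rate` trades coverage for cost: at `0.1`, roughly one call in ten is scored.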
Production calls scored with the same scorers you built, sampled to fit your cost budget.
When scores drop, you find out fast.
The surprises get added to your dataset, automatically.
When something fails, North Star points to the layer most likely at fault.
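One simple way to find out fast when scores drop is a rolling-window check against the baseline. The sketch below assumes that approach for illustration; `DriftMonitor`, its `tolerance`, and its `window` size are hypothetical and do not describe how North Star actually detects drift.

```python
from collections import deque
from statistics import mean


class DriftMonitor:
    """Flag drift when the rolling mean score falls below baseline - tolerance."""

    def __init__(self, baseline: float, tolerance: float, window: int) -> None:
        self.baseline = baseline
        self.tolerance = tolerance
        self.scores: deque[float] = deque(maxlen=window)

    def observe(self, score: float) -> bool:
        """Record one score; return True once a full window has drifted."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge
        return mean(self.scores) < self.baseline - self.tolerance


monitor = DriftMonitor(baseline=0.9, tolerance=0.1, window=3)
alerts = [monitor.observe(s) for s in [0.9, 0.9, 0.9, 0.5, 0.5]]
```

A small window reacts quickly but is noisier. A larger one smooths out single bad calls at the cost of slower alerts.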
Two complementary tracks
Build the evals before you ship.
Keep them honest after you do.