Build your baseline of truth with North Star

What you can measure, you can understand.

North Star — Dataset view showing candidate screening scenarios with labels and status

Run your requirements

Turn what product writes into artifacts engineering can build and everyone can measure.

Product requirements → Golden seed

Distill your requirements into a seed, grow it into a dataset you can build toward and measure against.

Acceptance criteria → Scorers

Each criterion becomes a grader, so every result is scored against what you defined as “good”.

Vision → Benchmark

Your vision becomes the bar those grades are read against, so progress is a number, not a feeling.

Product requirements → Golden seed

From a doc everyone reads once to a charter everything is measured against.

How it used to be

The PRD was the single source of truth. Goals, users, and stories spelled out in prose. The whole team read the same doc, aligned on intent, and engineering built against it.

What breaks with AI

Prose can't check anything. The doc can't tell you whether an AI output is on target. What good looks like stays implicit, so every reviewer interprets it differently.

The North Star way

North Star distills the PRD into a golden seed: the task's input, its output, and what a good output must satisfy. Paste a PRD and the seed pre-fills itself. From there it anchors everything downstream: dataset, scorers, evals.

How product writes the seed What engineering builds on it

Acceptance criteria → Scorers

From a checklist walked at sign-off to graders that run on every change.

How it used to be

Acceptance criteria defined done for each story. A checklist agreed between product and engineering, it settled scope before building and converted straight into QA test cases.

What breaks with AI

Binary pass or fail doesn't survive AI. The same prompt gives different outputs every run, and the quality you care about can't be written as one expected answer. Checking once fails when every change shifts the results.

The North Star way

Each criterion becomes a scorer: same intent, now executable. An LLM judge grades every output in the dataset against it, and grades again on every change. A one-time gate becomes a living measure of good.

Define good without code Wire scorers into the harness

Vision → Benchmark

From a direction you point at to a yardstick you score against.

How it used to be

The vision set the direction. It rallied the team and stakeholders around where the product is headed, and it justified the roadmap and its bets.

What breaks with AI

Better and smarter have no number behind them. Each release gets judged on vibes, regressions slip through, and the goal never touches the day-to-day work.

The North Star way

The vision becomes a benchmark: a labeled dataset plus scorers that make good measurable. Every change is scored against the same yardstick. Progress is a number, and regressions show up across the whole dataset.

Prove the feature moves the metric Catch regressions before they ship

Two jobs, one source of truth

The same seed and scorers from each side of the team, and the ROI both can point to.

For product

Own the definition of good and see the scores, no code required. Prove the feature moves the metric, and prioritize by impact.

The product view →

For engineering

Build the agentic harness once and reuse it across features. Monitoring and alerts built in, and labeled data that compounds into smaller, cheaper models.

The engineering view →