An appraisal tells you the price.
An inspection tells you why.
Most AI data vendors hand you a number. Ours comes with a property inspection. Every pair is scored independently by two judges on five dimensions — accuracy, completeness, specificity, structure, domain expertise. The deed doesn't just say "0.87 Royal Jelly." It proves why.
In commercial real estate, an appraisal tells the buyer what a property is worth. An inspection report tells them what's behind the walls — the foundation, the wiring, the roof. The buyer makes a better decision because they see the coordinates, not just the price.
Our deeds work the same way. Each one carries 10 data points — five dimension scores from each of two independent judge architectures. The buyer sees exactly where quality is strong and where it could improve.
Specificity is the gap. Responses score high on accuracy, structure, and expertise — but tend toward generic advice when they should give concrete numbers, named entities, and actionable steps. We know exactly what to fix. That's the point of the inspection.
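The ten-data-point inspection record described above could be represented like this. This is a minimal sketch: the field names, the example scores, and the simple averaging are illustrative assumptions, not the production deed schema.

```python
# Hypothetical sketch of a deed's quality record: five dimension scores
# from each of two judges -- ten data points per pair.
DIMENSIONS = ["accuracy", "completeness", "specificity",
              "structure", "domain_expertise"]

def deed_record(judge_a_scores, judge_b_scores):
    """Assemble the 10-point inspection report for one pair."""
    assert set(judge_a_scores) == set(judge_b_scores) == set(DIMENSIONS)
    # Overall score: plain average over all ten dimension scores
    # (an assumed aggregation, for illustration only).
    total = sum(judge_a_scores.values()) + sum(judge_b_scores.values())
    return {
        "judge_a": judge_a_scores,
        "judge_b": judge_b_scores,
        "overall": round(total / (2 * len(DIMENSIONS)), 4),
    }

record = deed_record(
    {"accuracy": 0.95, "completeness": 0.90, "specificity": 0.70,
     "structure": 0.92, "domain_expertise": 0.93},
    {"accuracy": 0.93, "completeness": 0.88, "specificity": 0.72,
     "structure": 0.94, "domain_expertise": 0.91},
)
```

A buyer reading such a record can see at a glance that specificity lags the other four dimensions, which is exactly the kind of finding the inspection metaphor promises.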
Every change is tested against ground truth
No improvement to our scoring methodology ships without a measured delta. We pull evaluation sets from our 11,000+ Royal Jelly deeds, run controlled experiments, and publish every result — including the ones that failed.
The MT-Bench paper (2023) found that LLM judges score differently depending on whether content appears first or second. We tested this on our tribunal by scoring 600 Royal Jelly pairs in both orders.
| Metric | Normal Order | Swapped Order | Delta |
|---|---|---|---|
| Mean score | 0.9319 | 0.9383 | +0.0064 |

| Outcome | Share of pairs | Finding |
|---|---|---|
| Swapped order scored higher | 40.8% | direction bias |
| Normal order scored higher | 7.8% | mild asymmetry |

| Domain | Delta | n | Finding |
|---|---|---|---|
| grants | +0.0042 | 286 | negligible |
| medical | +0.0113 | 179 | mild |
| legal | -0.0009 | 35 | zero bias |
Gate: The measured bias of 0.0064 falls below our 0.01 significance threshold, so no mitigation is needed. The 100-pair pilot overstated the effect by 67%; the 500-pair confirmation corrected it. This is why we always run confirmation studies.
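The order-swap test above can be sketched as follows. Here `judge` is a hypothetical stand-in for a tribunal call that returns a score in [0, 1]; the real pipeline calls two local models, which this sketch does not reproduce.

```python
# Sketch of the position-bias measurement: score every pair in both
# presentation orders and summarize how the scores move.
def position_bias(pairs, judge):
    """pairs: iterable of (prompt, response_a, response_b).
    judge(prompt, first, second) -> score in [0, 1]."""
    deltas = []
    swapped_higher = normal_higher = 0
    for prompt, a, b in pairs:
        normal = judge(prompt, a, b)
        swapped = judge(prompt, b, a)
        deltas.append(swapped - normal)
        if swapped > normal:
            swapped_higher += 1
        elif normal > swapped:
            normal_higher += 1
    n = len(deltas)
    return {
        "mean_delta": sum(deltas) / n,
        "swapped_higher_pct": 100 * swapped_higher / n,
        "normal_higher_pct": 100 * normal_higher / n,
    }
```

If `mean_delta` stays below the 0.01 gate, as it did here, no reordering mitigation ships.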
Auto-CoT research suggests that adding scored examples to the prompt calibrates the judge. We tested three tier-spanning exemplars (Royal Jelly, Honey, Propolis) drawn from medical-domain deeds.
| Metric | Baseline | Few-Shot | Delta |
|---|---|---|---|
| Mean score | 0.9321 | 0.9168 | -0.0153 |
| Agreement rate | 98.0% | 96.8% | -1.2% |
| Mean \|JA-JB\| | 0.0622 | 0.0629 | +0.0007 |

| Domain | Agreement | Finding |
|---|---|---|
| grants | improved | Cross-domain calibration helped |
| legal | improved | Biggest improvement: 0.0603 → 0.0394 |
| medical | worsened | Self-reference distortion |
Gate: Overall agreement worsened. The cross-domain benefit was offset by a same-domain self-reference penalty. The zero-shot prompt, at 98% agreement, is robust and hard to beat. We publish this because negative results keep us from repeating mistakes.
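A minimal sketch of the few-shot variant tested above: three tier-spanning exemplars prepended to the judge prompt as plain text. The prompt wording, exemplar snippets, and example scores here are illustrative assumptions, not the production prompt.

```python
# Hypothetical few-shot judge prompt: one scored exemplar per tier
# (Royal Jelly / Honey / Propolis), then the pair to be scored.
EXEMPLARS = [
    ("Royal Jelly", 0.93, "Cites specific dosages, named statutes, concrete steps."),
    ("Honey",       0.78, "Accurate but generic; few concrete numbers."),
    ("Propolis",    0.55, "Incomplete answer with structural problems."),
]

def build_fewshot_prompt(question, answer):
    lines = ["You are a quality judge. Score the answer from 0 to 1.", ""]
    for tier, score, why in EXEMPLARS:
        lines.append(f"Example ({tier}, score {score}): {why}")
    lines += ["", f"Question: {question}", f"Answer: {answer}", "Score:"]
    return "\n".join(lines)
```

The self-reference distortion the experiment found would come from exemplars drawn from the same domain as the pair under evaluation, which is why the exemplar pool matters as much as the exemplar count.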
The third experiment tests per-dimension reasoning: instead of producing one holistic score, the judge evaluates each dimension independently with its own reasoning chain, and the final score is the average of the five dimension-level assessments.
| Metric | Holistic | Per-Dimension | Delta |
|---|---|---|---|
| Mean score | 0.9320 | 0.9154 | -0.0165 |
| Score stdev | 0.0454 | 0.0357 | -21% tighter |
| Agreement rate | 98.0% | 99.0% | +1.0% |
| Mean \|JA-JB\| | 0.0623 | 0.0583 | -6.4% better |
| Latency | 4.0s | 4.4s | +10% |
| Tier changes | | 1 up, 1 down | net zero |
Gate: The +1.0% agreement gain is below the +2% ship threshold. We are currently validating with Scale B (qwen2.5:32b) to determine whether both scales benefit symmetrically; if both improve, the combined gain may cross the gate. This experiment also produced the first dimension-level breakdown of our corpus quality.
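Per-dimension scoring can be sketched like this, assuming a `score_dimension` callable that runs one reasoning chain per dimension (a hypothetical stand-in for a judge call; the dimension names come from the deed description above).

```python
# Sketch of per-dimension judging: one independent assessment per
# dimension, averaged into the final score.
DIMENSIONS = ["accuracy", "completeness", "specificity",
              "structure", "domain_expertise"]

def per_dimension_score(prompt, response, score_dimension):
    """score_dimension(dimension, prompt, response) -> score in [0, 1]."""
    scores = {d: score_dimension(d, prompt, response) for d in DIMENSIONS}
    final = sum(scores.values()) / len(scores)
    return final, scores
```

Averaging five independent assessments dampens any single noisy chain, which is consistent with the tighter score spread the experiment reports.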
How we test every change
No scoring change is deployed based on theory alone. Every modification follows a five-step protocol that requires measured improvement against Royal Jelly ground truth.
Three independent architectures
Every pair is scored by two independent language models running on separate hardware. A third judge — the Inspector — audits the scoring after the fact. No model judges its own output.
Different model families. Different parameter counts. Different hardware. When gemma3 and qwen2.5 independently agree that a pair is Royal Jelly, the signal is architecture-independent — not an artifact of one model's biases.
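The cross-architecture agreement check could look like this. A sketch under stated assumptions: both judges return a scalar in [0, 1], and the disagreement tolerance used to flag pairs for the Inspector's audit is an illustrative value, not the production threshold.

```python
# Sketch of the two-judge agreement check: pairs where the scores from
# the two architectures diverge beyond a tolerance are flagged for
# the Inspector's after-the-fact audit.
def agreement(scores_a, scores_b, tolerance=0.15):
    """Return (agreement_rate, indices of flagged pairs)."""
    assert len(scores_a) == len(scores_b)
    flagged = [i for i, (a, b) in enumerate(zip(scores_a, scores_b))
               if abs(a - b) > tolerance]
    rate = 1 - len(flagged) / len(scores_a)
    return rate, flagged
```

When both model families land within tolerance on a pair, the quality signal is unlikely to be an artifact of either architecture, which is the point of running them on separate hardware.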