Evaluation Methodology

We don't just score.
We prove the score.

Three judges. Five dimensions. Every improvement tested against ground truth before it touches production.

11,952 deeds scored · 99% scale agreement · 5 dimensions · 3 independent scales
01 — The Thesis

An appraisal tells you the price.
An inspection tells you why.

A single score is an appraisal. Five dimension scores are an inspection report.

The Glass Wall Doctrine

Most AI data vendors hand you a number. Ours comes with a property inspection. Every pair is scored independently by two judges on five dimensions — accuracy, completeness, specificity, structure, domain expertise. The deed doesn't just say "0.87 Royal Jelly." It proves why.

In commercial real estate, an appraisal tells the buyer what a property is worth. An inspection report tells them what's behind the walls — the foundation, the wiring, the roof. The buyer makes a better decision because they see the coordinates, not just the price.

Our deeds work the same way. Each one carries 10 data points — five dimension scores from each of two independent judge architectures. The buyer sees exactly where quality is strong and where it could improve.
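The 10-point record can be sketched as a small data structure. This is an illustrative Python sketch, not the production schema; the scores and field names are made up:

```python
# Hypothetical sketch (not the production schema): a deed carries five
# dimension scores from each of two judge scales -- 10 data points total.
DIMENSIONS = ["accuracy", "completeness", "specificity",
              "structure", "domain_expertise"]

deed = {
    "scale_a": {"accuracy": 0.93, "completeness": 0.90, "specificity": 0.85,
                "structure": 0.97, "domain_expertise": 0.94},
    "scale_b": {"accuracy": 0.91, "completeness": 0.88, "specificity": 0.83,
                "structure": 0.95, "domain_expertise": 0.92},
}

# Average the two judges per dimension, then surface the weakest coordinate
# -- the "behind the walls" view the buyer gets.
avg = {d: (deed["scale_a"][d] + deed["scale_b"][d]) / 2 for d in DIMENSIONS}
weakest = min(avg, key=avg.get)
print(weakest, round(avg[weakest], 3))
```

With these illustrative numbers, the inspection flags specificity as the weakest coordinate, which is exactly the kind of signal a single holistic score hides.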

The Five Coordinates — Corpus Baseline (n=500 Royal Jelly pairs)

structure          0.968   (strongest)
domain_expertise   0.940
accuracy           0.930
completeness       0.899
specificity        0.849   (weakest)

Specificity is the gap. Responses score high on accuracy, structure, and expertise — but tend toward generic advice when they should give concrete numbers, named entities, and actionable steps. We know exactly what to fix. That's the point of the inspection.

02 — Tested, Not Assumed

Every change is tested against ground truth

No improvement to our scoring methodology ships without a measured delta. We pull evaluation sets from our 11,000+ Royal Jelly deeds, run controlled experiments, and publish every result — including the ones that failed.

EXP-001 · Position Bias · Verdict: NO BIAS
Do scores change depending on content order?
600 pairs tested. Delta: +0.0064. Below significance threshold. Our judges are position-neutral.

The MT-Bench paper (2023) found that LLM judges score differently depending on whether content appears first or second. We tested this on our tribunal by scoring 600 Royal Jelly pairs in both orders.

Metric       Normal Order   Swapped Order   Delta
Mean score   0.9319         0.9383          +0.0064

Swapped order scored higher in 40.8% of pairs, normal order in 7.8% of pairs — a mild direction asymmetry.

Domain    Delta     n     Finding
grants    +0.0042   286   negligible
medical   +0.0113   179   mild
legal     -0.0009   35    zero bias

Gate: Bias magnitude 0.0064 is below the 0.01 significance threshold. No action needed. 100-pair pilot overstated the effect by 67% — the 500-pair confirmation corrected it. This is why we always run confirmation studies.
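The measurement itself reduces to a mean delta between the two orderings. A minimal sketch, with made-up score lists standing in for the real 600-pair set:

```python
# Minimal sketch of the EXP-001 measurement, assuming scores in [0, 1].
# The score lists below are illustrative, not the real evaluation data.
def position_bias(normal, swapped):
    """Mean delta (swapped minus normal) across the evaluation set."""
    assert len(normal) == len(swapped)
    return sum(s - n for n, s in zip(normal, swapped)) / len(normal)

normal  = [0.93, 0.95, 0.91, 0.94]
swapped = [0.94, 0.95, 0.92, 0.94]
bias = position_bias(normal, swapped)
print(f"delta {bias:+.4f}:", "no action" if abs(bias) < 0.01 else "investigate")
```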

EXP-002 · Few-Shot Calibration · Verdict: NEGATIVE
Do scoring examples improve consistency?
500 pairs tested. Agreement: -1.2%. Exemplars caused self-reference bias in same-domain pairs. Zero-shot prompt is more robust.

Auto-CoT research suggests that adding scored examples to the prompt calibrates the judge. We tested 3 tier-spanning exemplars (Royal Jelly, Honey, Propolis) from medical domain deeds.

Metric           Baseline   Few-Shot   Delta
Mean score       0.9321     0.9168     -0.0153
Agreement rate   98.0%      96.8%      -1.2%
Mean |JA-JB|     0.0622     0.0629     +0.0007

Domain    Agreement   Finding
grants    improved    Cross-domain calibration helped
legal     improved    Biggest improvement: 0.0603 → 0.0394
medical   worsened    Self-reference distortion

Gate: Overall agreement worsened. Cross-domain benefit was offset by same-domain self-reference penalty. The zero-shot prompt at 98% agreement is robust and hard to beat. We publish this because negative results prevent repeating mistakes.
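The agreement-rate metric can be sketched as follows. We assume a pair counts as "agreement" when the two judges' scores fall within the 0.15 drift threshold listed in the tribunal configuration; the exact agreement formula isn't stated here, and the pair data is made up:

```python
# Sketch of an inter-judge agreement rate. Assumption: a pair "agrees" when
# the two scores fall within the 0.15 drift threshold from the tribunal
# configuration. Pair data below is illustrative.
DRIFT_THRESHOLD = 0.15

def agreement_rate(pairs):
    agree = sum(1 for ja, jb in pairs if abs(ja - jb) <= DRIFT_THRESHOLD)
    return agree / len(pairs)

pairs = [(0.93, 0.90), (0.88, 0.91), (0.95, 0.70), (0.92, 0.94)]
print(f"{agreement_rate(pairs):.1%}")
```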

EXP-003 · Per-Dimension Scoring · Verdict: VALIDATING
Does scoring each dimension independently improve agreement?
500 pairs tested. Agreement: +1.0%. Score spread compressed 21%. First improvement found. Unlocked the five-coordinate inspection report.

Instead of one holistic score, the judge evaluates each dimension independently with its own reasoning chain. The final score is the average of five dimension-level assessments.

Metric           Holistic       Per-Dimension   Delta
Mean score       0.9320         0.9154          -0.0165
Score stdev      0.0454         0.0357          -21% (tighter)
Agreement rate   98.0%          99.0%           +1.0%
Mean |JA-JB|     0.0623         0.0583          -6.4% (better)
Latency          4.0 s          4.4 s           +10%
Tier changes     1 up, 1 down                   net zero

Gate: +1.0% agreement is below the +2% ship threshold. Currently validating with Scale B (qwen2.5:32b) to determine if both scales benefit symmetrically. If both improve, the combined gain may cross the gate. This experiment also produced the first-ever dimension-level breakdown of our corpus quality.
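The aggregation step is simple: five independent assessments averaged into one score. A short sketch with illustrative dimension scores:

```python
from statistics import mean

# Sketch of the per-dimension variant: the judge scores each coordinate
# with its own reasoning chain, and the final score is the mean of the
# five assessments. Dimension scores here are illustrative.
def final_score(dims):
    return mean(dims.values())

pair = {"accuracy": 0.94, "completeness": 0.90, "specificity": 0.85,
        "structure": 0.97, "domain_expertise": 0.94}
print(round(final_score(pair), 2))
```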

Papers: Complexity-Based Prompting (2022), CoT for Assessment (2024)
03 — The Protocol

How we test every change

No scoring change is deployed based on theory alone. Every modification follows a five-step protocol that requires measured improvement against Royal Jelly ground truth.

01 · SIGNAL
Read the research paper. Extract a concrete change — not theory, but a specific modification to the scoring prompt, judge configuration, or evaluation method.

02 · PLAN
Design an A/B test. Pull an evaluation set from 11,000+ scored deeds. Define the metric, the measurement, and the gate threshold.

03 · TEST
Score the evaluation set with both the current and proposed approach. Offline only — never on the production tribunal. 500+ pairs minimum for confirmation.

04 · GATE
+2% inter-judge agreement to ship. Below that: hold for more data or revert. The gate is non-negotiable — promising isn't good enough.

05 · DOCUMENT
Every result — positive or negative — is published. Negative results prevent repeating failed approaches. The record is the product.
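The step-04 gate reduces to a single comparison. A sketch assuming agreement is expressed as a fraction in [0, 1]:

```python
# Sketch of the step-04 gate, using the +2% agreement threshold from the
# protocol. Agreement values are fractions in [0, 1].
SHIP_THRESHOLD = 0.02

def gate(baseline_agreement, candidate_agreement):
    delta = candidate_agreement - baseline_agreement
    return "SHIP" if delta >= SHIP_THRESHOLD else "HOLD"

print(gate(0.980, 0.990))  # EXP-003's +1.0%: promising, but held
print(gate(0.980, 0.968))  # EXP-002's -1.2%: held
```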
We tested few-shot exemplars. They made things worse. That's in the record. Negative results are published because the process IS the product.
04 — The Judges

Three independent architectures

Every pair is scored by two independent language models running on separate hardware. A third judge — the Inspector — audits the scoring after the fact. No model judges its own output.

Tribunal Architecture

Scale A               gemma3:12b
Scale B               qwen2.5:32b
Scale C (Inspector)   Swarm-Inspector
Drift threshold       0.15
Validation            2-pass per judge
Finality              PostgreSQL → Merkle → Hedera

Different model families. Different parameter counts. Different hardware. When gemma3 and qwen2.5 independently agree that a pair is Royal Jelly, the signal is architecture-independent — not an artifact of one model's biases.
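The Merkle step of the finality chain can be sketched as follows. The leaf encoding and tree shape here are assumptions for illustration; the production pipeline's exact format isn't specified in this document:

```python
import hashlib

# Sketch of the Merkle step in the finality chain (PostgreSQL -> Merkle ->
# Hedera): hash each deed record, then pair hashes upward to a single root
# that can be anchored externally. Leaf encoding is an assumption.
def merkle_root(leaves):
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:              # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

deeds = [b"deed-0001:0.908", b"deed-0002:0.874", b"deed-0003:0.931"]
print(merkle_root(deeds))
```

Changing any single deed record changes the root, which is what makes an externally anchored root useful as a tamper-evidence check.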

© 2025-2026 Swarm & Bee LLC · Jupiter, FL · Validate the Validator. · @swarmandbee