Evaluation Methodology

We don't just score.
We prove the score.

Three judges. Five dimensions. Every improvement tested against ground truth before it touches production.

11,952 deeds scored · 99% scale agreement · 5 dimensions · 3 independent scales
01 — The Thesis

An appraisal tells you the price.
An inspection tells you why.

A single score is an appraisal. Five dimension scores are an inspection report.

The Glass Wall Doctrine

Most AI data vendors hand you a number. Ours comes with a property inspection. Every pair is scored independently by two judges on five dimensions — accuracy, completeness, specificity, structure, domain expertise. The deed doesn't just say "0.87 Royal Jelly." It proves why.

In commercial real estate, an appraisal tells the buyer what a property is worth. An inspection report tells them what's behind the walls — the foundation, the wiring, the roof. The buyer makes a better decision because they see the coordinates, not just the price.

Our deeds work the same way. Each one carries 10 data points — five dimension scores from each of two independent judge architectures. The buyer sees exactly where quality is strong and where it could improve.
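The 10-point record can be sketched as a small data structure. This is an illustrative Python sketch, not the production schema; the scores and field names are made up:

```python
# Hypothetical sketch (not the production schema): a deed carries five
# dimension scores from each of two judge scales -- 10 data points total.
DIMENSIONS = ["accuracy", "completeness", "specificity",
              "structure", "domain_expertise"]

deed = {
    "scale_a": {"accuracy": 0.93, "completeness": 0.90, "specificity": 0.85,
                "structure": 0.97, "domain_expertise": 0.94},
    "scale_b": {"accuracy": 0.91, "completeness": 0.88, "specificity": 0.83,
                "structure": 0.95, "domain_expertise": 0.92},
}

# Average the two judges per dimension, then surface the weakest coordinate
# -- the "behind the walls" view the buyer gets.
avg = {d: (deed["scale_a"][d] + deed["scale_b"][d]) / 2 for d in DIMENSIONS}
weakest = min(avg, key=avg.get)
print(weakest, round(avg[weakest], 3))
```

With these illustrative numbers, the inspection flags specificity as the weakest coordinate, which is exactly the kind of signal a single holistic score hides.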

The Five Coordinates — Corpus Baseline (n=500 Royal Jelly pairs)

structure          0.968   (strongest)
domain_expertise   0.940
accuracy           0.930
completeness       0.899
specificity        0.849   (weakest)

Specificity is the gap. Responses score high on accuracy, structure, and expertise — but tend toward generic advice when they should give concrete numbers, named entities, and actionable steps. We know exactly what to fix. That's the point of the inspection.

02 — Tested, Not Assumed

Every change is tested against ground truth

No improvement to our scoring methodology ships without a measured delta. We pull evaluation sets from our 11,000+ Royal Jelly deeds, run controlled experiments, and publish every result — including the ones that failed.

EXP-001 · Position Bias · Verdict: NO BIAS
Do scores change depending on content order?
600 pairs tested. Delta: +0.0064. Below significance threshold. Our judges are position-neutral.

The MT-Bench paper (2023) found that LLM judges score differently depending on whether content appears first or second. We tested this on our tribunal by scoring 600 Royal Jelly pairs in both orders.

Metric       Normal Order   Swapped Order   Delta
Mean score   0.9319         0.9383          +0.0064

Swapped order scored higher in 40.8% of pairs, normal order in 7.8% of pairs — a mild direction asymmetry.

Domain    Delta     n     Finding
grants    +0.0042   286   negligible
medical   +0.0113   179   mild
legal     -0.0009   35    zero bias

Gate: Bias magnitude 0.0064 is below the 0.01 significance threshold. No action needed. 100-pair pilot overstated the effect by 67% — the 500-pair confirmation corrected it. This is why we always run confirmation studies.
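The measurement itself reduces to a mean delta between the two orderings. A minimal sketch, with made-up score lists standing in for the real 600-pair set:

```python
# Minimal sketch of the EXP-001 measurement, assuming scores in [0, 1].
# The score lists below are illustrative, not the real evaluation data.
def position_bias(normal, swapped):
    """Mean delta (swapped minus normal) across the evaluation set."""
    assert len(normal) == len(swapped)
    return sum(s - n for n, s in zip(normal, swapped)) / len(normal)

normal  = [0.93, 0.95, 0.91, 0.94]
swapped = [0.94, 0.95, 0.92, 0.94]
bias = position_bias(normal, swapped)
print(f"delta {bias:+.4f}:", "no action" if abs(bias) < 0.01 else "investigate")
```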

EXP-002 · Few-Shot Calibration · Verdict: NEGATIVE
Do scoring examples improve consistency?
500 pairs tested. Agreement: -1.2%. Exemplars caused self-reference bias in same-domain pairs. Zero-shot prompt is more robust.

Auto-CoT research suggests that adding scored examples to the prompt calibrates the judge. We tested 3 tier-spanning exemplars (Royal Jelly, Honey, Propolis) from medical domain deeds.

Metric           Baseline   Few-Shot   Delta
Mean score       0.9321     0.9168     -0.0153
Agreement rate   98.0%      96.8%      -1.2%
Mean |JA-JB|     0.0622     0.0629     +0.0007

Domain    Agreement   Finding
grants    improved    Cross-domain calibration helped
legal     improved    Biggest improvement: 0.0603 → 0.0394
medical   worsened    Self-reference distortion

Gate: Overall agreement worsened. Cross-domain benefit was offset by same-domain self-reference penalty. The zero-shot prompt at 98% agreement is robust and hard to beat. We publish this because negative results prevent repeating mistakes.
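The agreement-rate metric can be sketched as follows. We assume a pair counts as "agreement" when the two judges' scores fall within the 0.15 drift threshold listed in the tribunal configuration; the exact agreement formula isn't stated here, and the pair data is made up:

```python
# Sketch of an inter-judge agreement rate. Assumption: a pair "agrees" when
# the two scores fall within the 0.15 drift threshold from the tribunal
# configuration. Pair data below is illustrative.
DRIFT_THRESHOLD = 0.15

def agreement_rate(pairs):
    agree = sum(1 for ja, jb in pairs if abs(ja - jb) <= DRIFT_THRESHOLD)
    return agree / len(pairs)

pairs = [(0.93, 0.90), (0.88, 0.91), (0.95, 0.70), (0.92, 0.94)]
print(f"{agreement_rate(pairs):.1%}")
```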

EXP-003 · Per-Dimension Scoring · Verdict: VALIDATING
Does scoring each dimension independently improve agreement?
500 pairs tested. Agreement: +1.0%. Score spread compressed 21%. First improvement found. Unlocked the five-coordinate inspection report.

Instead of one holistic score, the judge evaluates each dimension independently with its own reasoning chain. The final score is the average of five dimension-level assessments.

Metric           Holistic       Per-Dimension   Delta
Mean score       0.9320         0.9154          -0.0165
Score stdev      0.0454         0.0357          -21% (tighter)
Agreement rate   98.0%          99.0%           +1.0%
Mean |JA-JB|     0.0623         0.0583          -6.4% (better)
Latency          4.0 s          4.4 s           +10%
Tier changes     1 up, 1 down                   net zero

Gate: +1.0% agreement is below the +2% ship threshold. Currently validating with Scale B (qwen2.5:32b) to determine if both scales benefit symmetrically. If both improve, the combined gain may cross the gate. This experiment also produced the first-ever dimension-level breakdown of our corpus quality.
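The aggregation step is simple: five independent assessments averaged into one score. A short sketch with illustrative dimension scores:

```python
from statistics import mean

# Sketch of the per-dimension variant: the judge scores each coordinate
# with its own reasoning chain, and the final score is the mean of the
# five assessments. Dimension scores here are illustrative.
def final_score(dims):
    return mean(dims.values())

pair = {"accuracy": 0.94, "completeness": 0.90, "specificity": 0.85,
        "structure": 0.97, "domain_expertise": 0.94}
print(round(final_score(pair), 2))
```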

Papers: Complexity-Based Prompting (2022), CoT for Assessment (2024)
03 — The Protocol

How we test every change

No scoring change is deployed based on theory alone. Every modification follows a five-step protocol that requires measured improvement against Royal Jelly ground truth.

01 · SIGNAL
Read the research paper. Extract a concrete change — not theory, but a specific modification to the scoring prompt, judge configuration, or evaluation method.

02 · PLAN
Design an A/B test. Pull an evaluation set from 11,000+ scored deeds. Define the metric, the measurement, and the gate threshold.

03 · TEST
Score the evaluation set with both the current and proposed approach. Offline only — never on the production tribunal. 500+ pairs minimum for confirmation.

04 · GATE
+2% inter-judge agreement to ship. Below that: hold for more data or revert. The gate is non-negotiable — promising isn't good enough.

05 · DOCUMENT
Every result — positive or negative — is published. Negative results prevent repeating failed approaches. The record is the product.
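The step-04 gate reduces to a single comparison. A sketch assuming agreement is expressed as a fraction in [0, 1]:

```python
# Sketch of the step-04 gate, using the +2% agreement threshold from the
# protocol. Agreement values are fractions in [0, 1].
SHIP_THRESHOLD = 0.02

def gate(baseline_agreement, candidate_agreement):
    delta = candidate_agreement - baseline_agreement
    return "SHIP" if delta >= SHIP_THRESHOLD else "HOLD"

print(gate(0.980, 0.990))  # EXP-003's +1.0%: promising, but held
print(gate(0.980, 0.968))  # EXP-002's -1.2%: held
```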
We tested few-shot exemplars. They made things worse. That's in the record. Negative results are published because the process IS the product.
04 — The Judges

Three independent architectures

Every pair is scored by two independent language models running on separate hardware. A third judge — the Inspector — audits the scoring after the fact. No model judges its own output.

Tribunal Architecture

Scale A               gemma3:12b
Scale B               qwen2.5:32b
Scale C (Inspector)   Swarm-Inspector
Drift threshold       0.15
Validation            2-pass per judge
Finality              PostgreSQL → Merkle → Hedera

Different model families. Different parameter counts. Different hardware. When gemma3 and qwen2.5 independently agree that a pair is Royal Jelly, the signal is architecture-independent — not an artifact of one model's biases.
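The Merkle step of the finality chain can be sketched as follows. The leaf encoding and tree shape here are assumptions for illustration; the production pipeline's exact format isn't specified in this document:

```python
import hashlib

# Sketch of the Merkle step in the finality chain (PostgreSQL -> Merkle ->
# Hedera): hash each deed record, then pair hashes upward to a single root
# that can be anchored externally. Leaf encoding is an assumption.
def merkle_root(leaves):
    level = [hashlib.sha256(leaf).digest() for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:              # duplicate the last hash on odd levels
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()

deeds = [b"deed-0001:0.908", b"deed-0002:0.874", b"deed-0003:0.931"]
print(merkle_root(deeds))
```

Changing any single deed record changes the root, which is what makes an externally anchored root useful as a tamper-evidence check.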

© 2025-2026 Swarm & Bee LLC · Jupiter, FL · Validate the Validator. · @swarmandbee