Open Source · MIT License

Measure data diversity
before you ship.

Most fine-tuning teams measure loss curves and pass rates but never check whether their dataset is actually diverse. A dataset with an 88% pass rate but 58% structural concentration will train a model that parrots templates instead of learning the task.

terminal
$ python3 dataset_entropy.py train.jsonl

Dataset Health Report
════════════════════════════════════════════════════════
  file:  train.jsonl
  pairs: 30,000

  Health Score
  [██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░] 42/100  WEAK

  Entropy Metrics
  ────────────────────────────────────────────────────
  Vocabulary    [████████████████████░░░░░░░░░░░░░░░░] 0.5787
  Structure     [█████████████████████████░░░░░░░░░░░] 0.7152
  Bigram        [███████████████████████░░░░░░░░░░░░░] 0.6443
  Bigram Div    [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0144

  Top-10 Concentration
  ────────────────────────────────────────────────────
  [█████████████████████░░░░░░░░░░░░░░░] 58.4%  (HIGH)

  Top Repeated Answer Openings
  ────────────────────────────────────────────────────
  3690 (12.3%) █████████░░░░░░░░░░░░░░░ step 1 pgi sum of all annual...
  2940 ( 9.8%) ███████░░░░░░░░░░░░░░░░░ value noi cap rate value...
  2430 ( 8.1%) ██████░░░░░░░░░░░░░░░░░░ revenue pgi vacancy loss...

════════════════════════════════════════════════════════

Five Metrics. One File. Zero Dependencies.

Every metric targets a specific failure mode that loss curves and pass rates don't catch.

Vocabulary Entropy DIVERSITY

Normalized Shannon entropy over word frequencies. Low score = same terms repeated across thousands of pairs. Your model learns vocabulary, not reasoning.
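The tool's exact tokenization isn't documented here, but normalized Shannon entropy over word counts can be sketched in a few lines (the whitespace tokenization and max-entropy normalization below are assumptions):

```python
import math
from collections import Counter

def vocabulary_entropy(texts):
    """Normalized Shannon entropy of word frequencies across all outputs.

    Returns 0.0 when a single word dominates, 1.0 when every word is
    equally likely. (Sketch -- tokenization/normalization are assumptions.)
    """
    words = [w for t in texts for w in t.lower().split()]
    counts = Counter(words)
    total = len(words)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_h = math.log2(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h
```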

Structure Entropy CRITICAL

Each output gets a lightweight POS-like fingerprint. Low entropy = identical sentence structures with different fill-in values. Template monoculture.
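One possible fingerprint scheme — hypothetical, the tool's real one may differ — maps each token to a coarse class so that outputs differing only in their numbers collapse to the same shape:

```python
import re

def structure_fingerprint(text):
    """Coarse POS-like fingerprint: numbers, capitalized words, and
    everything else. (Hypothetical scheme for illustration only.)"""
    classes = []
    for tok in text.split():
        if re.fullmatch(r"[\d,.$%]+", tok):
            classes.append("NUM")        # numeric value, any magnitude
        elif tok[:1].isupper():
            classes.append("Cap")        # capitalized term
        else:
            classes.append("w")          # ordinary word
    return " ".join(classes)
```

Two template-filled answers with different dollar values then hash to the same fingerprint, which is exactly the concentration this metric surfaces.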

Top-10 Concentration KEY METRIC

The percentage of outputs that match one of the 10 most common structural patterns. Above 30% = your model will parrot templates instead of learning the task.
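Given one fingerprint per output, the concentration itself is a short computation (sketch, assuming the fingerprints are precomputed):

```python
from collections import Counter

def top10_concentration(fingerprints):
    """Share of outputs whose structural fingerprint falls in the
    10 most common patterns."""
    counts = Counter(fingerprints)
    top10 = sum(c for _, c in counts.most_common(10))
    return top10 / len(fingerprints)
```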

Bigram Diversity DIVERSITY

Unique word-pair patterns relative to total bigrams. Low ratio = repetitive phrasing. The model memorizes phrases, not concepts.
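The ratio is unique bigrams over total bigrams; a sketch (word-level pairs, lowercased — both assumptions):

```python
def bigram_diversity(texts):
    """Unique word pairs divided by total word pairs across all outputs.
    Approaches 1.0 for fresh phrasing, 0.0 for copy-pasted phrasing."""
    total, unique = 0, set()
    for t in texts:
        words = t.lower().split()
        for pair in zip(words, words[1:]):
            total += 1
            unique.add(pair)
    return len(unique) / total if total else 0.0
```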

Opening Diversity DIVERSITY

How many unique first sentences appear in your outputs. If every answer starts "Step 1: Calculate..." — the model will too.
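A minimal sketch of the idea, assuming a naive sentence split (the tool's actual boundary rules may differ):

```python
import re

def opening_diversity(outputs):
    """Fraction of outputs whose first sentence is unique.
    1.0 = every answer opens differently; near 0 = one stock opener."""
    firsts = [re.split(r"[.!?\n]", o, maxsplit=1)[0].strip().lower()
              for o in outputs]
    return len(set(firsts)) / len(firsts) if firsts else 0.0
```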

Health Score COMBINED

Single 0-100 score. Weighted: vocabulary (25%) + structure (25%) + concentration (30%, inverted) + bigram (20%). Ship above 60.
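Using the stated weights, the combination looks roughly like this — the exact scaling and rounding inside the tool are assumptions:

```python
def health_score(vocab, structure, concentration, bigram):
    """0-100 score from the four component metrics (each in [0, 1]).
    Concentration is inverted: lower concentration raises the score.
    (Weights are from the text; scaling/rounding are assumptions.)"""
    score = (0.25 * vocab + 0.25 * structure
             + 0.30 * (1.0 - concentration) + 0.20 * bigram)
    return round(score * 100)
```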

One Number. Ship or Fix.

Combined score from four weighted metrics. Tells you in one second if your dataset is ready.

80-100
Excellent
Ship it. Diverse output patterns, no structural monoculture.
60-80
Healthy
Good diversity with minor template patterns emerging.
40-60
Weak
Blend with higher-entropy data before training.
0-40
Critical
Model will overfit to templates. Do not ship.

Three Datasets. Three Roles. One Blend.

We built this tool while fine-tuning CRE models. The entropy scores revealed why blending matters more than scale.

42
Weak

Template-Generated

881K pairs. Deterministic math. Calculations are always correct, but 150 unique structures and 58% concentration means template monoculture.

Role: Precision · 881,088 pairs
56
Weak

LLM-Generated

11K pairs at temp 0.7. Better narrative structure, but system prompts create their own structural patterns. 47% concentration.

Role: Narrative · 11,452 pairs
91
Excellent

Signal-Driven

253 pairs from live market data. Every input is different because every signal is different. 4% concentration. Perfect structure entropy.

Role: Reasoning Diversity · 253 pairs

The rule: template data must be blended with high-entropy data. Blended correctly, the three datasets give you precision + explanation + variation.

Side-by-Side in One Command

Compare training, validation, and synthetic datasets instantly. Spot which one is dragging your blend down.

--compare mode
$ python3 dataset_entropy.py signal.jsonl llm.jsonl template.jsonl --compare

Dataset Comparison
══════════════════════════════════════════════════════════════════

  Health Score
  ──────────────────────────────────────────────────────────────
  signal.jsonl     [████████████████████████████████████] 91  EXCELLENT
  llm.jsonl        [██████████████████████░░░░░░░░░░░░░░] 56  WEAK
  template.jsonl   [████████████████░░░░░░░░░░░░░░░░░░░░] 42  WEAK

  Vocabulary Entropy
  signal.jsonl     [███████████████████████████░░░░░░░░░] 0.7733
  llm.jsonl        [██████████████████████░░░░░░░░░░░░░░] 0.6436
  template.jsonl   [████████████████████░░░░░░░░░░░░░░░░] 0.5787

  Top-10 Concentration
  signal.jsonl     [█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  4.0% LOW
  llm.jsonl        [████████████████░░░░░░░░░░░░░░░░░░░░] 47.2% HIGH
  template.jsonl   [████████████████████░░░░░░░░░░░░░░░░] 58.4% HIGH

══════════════════════════════════════════════════════════════════

CI/CD Ready. Script Friendly.

JSON output, Python API, no dependencies. Integrate into any training pipeline.

JSON Output

Pipe into jq, store in your training logs, or feed to your dashboard.

python3 dataset_entropy.py train.jsonl --json | jq '.health_score'

Python API

Import directly into your training script. One function call returns all metrics.

from dataset_entropy import entropy_report

report = entropy_report("train.jsonl")
print(report["health_score"])  # 91

All Formats

OpenAI chat, Alpaca, Q/A, prompt/completion, raw text. Auto-detected.

{"messages": [{"role":"user",...}]}      # OpenAI
{"instruction":"...","output":"..."}     # Alpaca
{"question":"...","answer":"..."}        # Q/A
{"prompt":"...","completion":"..."}      # P/C
{"text":"..."}                           # Raw
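A sketch of how that auto-detection might pull the answer text out of each record shape — the key priorities and fallback order here are assumptions, not the tool's documented behavior:

```python
def extract_output(record):
    """Return the answer/output text from any supported record shape.
    (Illustrative sketch of format auto-detection.)"""
    if "messages" in record:  # OpenAI chat: last assistant turn
        return next(m["content"] for m in reversed(record["messages"])
                    if m.get("role") == "assistant")
    for key in ("output", "answer", "completion", "text"):  # Alpaca, Q/A, P/C, raw
        if key in record:
            return record[key]
    raise ValueError("unrecognized record format")
```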

CI/CD Gate

Fail builds when data quality drops below threshold. Catch problems before training starts.

score=$(python3 dataset_entropy.py data.jsonl \
  --json | jq '.health_score')
if [ "$score" -lt 60 ]; then
  echo "FAIL: health $score < 60" && exit 1
fi

Run It on Real Training Data

Download 500-pair samples from any vertical. Run the entropy report. See the health bars. Takes 30 seconds.

try it yourself
# 1. Download the CLI
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/dataset-entropy/main/dataset_entropy.py

# 2. Grab samples — medical, aviation, pharma
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/swarmbeeai-factory/main/samples/medical_sample_500.jsonl
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/swarmbeeai-factory/main/samples/aviation_sample_500.jsonl
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/swarmbeeai-factory/main/samples/pharma_sample_500.jsonl

# 3. Run entropy report on any sample
$ python3 dataset_entropy.py medical_sample_500.jsonl

# 4. Compare all three verticals
$ python3 dataset_entropy.py medical_sample_500.jsonl aviation_sample_500.jsonl pharma_sample_500.jsonl --compare

# 5. JSON output for CI/CD
$ python3 dataset_entropy.py aviation_sample_500.jsonl --json | jq '.health_score'
Browse all sample datasets →

Full Datasets Available

SwarmMed 434K PAIRS

28+ medical specialties. Clinical reasoning, pharmacology, imaging, board prep. Platinum-tier verified.

SwarmAviation 45K PAIRS

43+ aviation specialties. Safety compliance, flight ops, ATC, MRO. CoVe-promoted, zero tolerance for error.

SwarmPharma 50K PAIRS

7+ pharmacology specialties. Drug interactions, dosing, PGx, trajectories. SwarmPharma-35B trained on this.

Every dataset ships with DATA_CARD.json, guarantee.json (Merkle root), 5 formats, and a train/eval split.
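For reference, a Merkle root over dataset lines can be computed like this — a sketch only; guarantee.json's actual leaf layout and hashing scheme are not specified here:

```python
import hashlib

def merkle_root(lines):
    """SHA-256 Merkle root over a list of text lines.
    (Illustrative; the real guarantee.json scheme is an assumption.)"""
    level = [hashlib.sha256(l.encode()).digest() for l in lines]
    if not level:
        return ""
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate last node on odd levels
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0].hex()
```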
See pricing →

POST /api/orderpairs

Order custom training pairs via API or form. Pick your vertical, topic, and volume. We cook them on 128 GPUs.

Place an Order

curl — order via API
$ curl -X POST https://swarmandbee.ai/api/orderpairs \
  -H "Content-Type: application/json" \
  -d '{
    "vertical": "capital_markets",
    "topic":    "CMBS distress",
    "pairs":    2000,
    "email":    "[email protected]"  ← required
  }'
response — 201 Created
{
  "order_id":              "11fde5db",
  "vertical":              "capital_markets",
  "topic":                 "CMBS distress",
  "pairs":                 2000,
  "status":                "queued",
  "estimated_completion":  "10h",
  "created_at":            "2026-03-08T..."
}

Verticals: capital_markets medical pharma aviation custom
Min 100 pairs. Max 500K. 8-gate pipeline. Platinum quality.

We build for devs.

Measure your data quality. Try a sample. Order custom pairs. Train better models.