Measure data diversity before you ship.
Most fine-tuning teams measure loss curves and pass rates but never check whether their dataset is actually diverse. A dataset with an 88% pass rate but 58% structural concentration will train a model that parrots templates instead of learning the task.
$ python3 dataset_entropy.py train.jsonl

Dataset Health Report
════════════════════════════════════════════════════════
file:  train.jsonl
pairs: 30,000

Health Score
[██████████████████░░░░░░░░░░░░░░░░░░░░░░░░░░] 42/100  WEAK

Entropy Metrics
────────────────────────────────────────────────────
Vocabulary  [████████████████████░░░░░░░░░░░░░░░░] 0.5787
Structure   [█████████████████████████░░░░░░░░░░░] 0.7152
Bigram      [███████████████████████░░░░░░░░░░░░░] 0.6443
Bigram Div  [░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░] 0.0144

Top-10 Concentration
────────────────────────────────────────────────────
[█████████████████████░░░░░░░░░░░░░░░] 58.4% (HIGH)

Top Repeated Answer Openings
────────────────────────────────────────────────────
3690 (12.3%) █████████░░░░░░░░░░░░░░░ step 1 pgi sum of all annual...
2940 ( 9.8%) ███████░░░░░░░░░░░░░░░░░ value noi cap rate value...
2430 ( 8.1%) ██████░░░░░░░░░░░░░░░░░░ revenue pgi vacancy loss...
════════════════════════════════════════════════════════
Five Metrics. One File. Zero Dependencies.
Every metric targets a specific failure mode that loss curves and pass rates don't catch.
Vocabulary Entropy DIVERSITY
Normalized Shannon entropy over word frequencies. Low score = same terms repeated across thousands of pairs. Your model learns vocabulary, not reasoning.
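A minimal sketch of what normalized Shannon entropy over word frequencies looks like. The function name and whitespace tokenization here are illustrative, not the tool's actual implementation:

```python
import math
from collections import Counter

def vocab_entropy(texts):
    """Normalized Shannon entropy of word frequencies across all outputs.

    Returns a value in [0, 1]: 1.0 means every word appears equally often,
    values near 0 mean a few words dominate the corpus.
    """
    counts = Counter(word for t in texts for word in t.lower().split())
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))  # divide by max possible entropy

# A repetitive corpus scores lower than a fully varied one
print(vocab_entropy(["the cap rate is five", "the cap rate is six"]))
print(vocab_entropy(["alpha beta gamma", "delta epsilon zeta"]))
```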
Structure Entropy CRITICAL
Each output gets a lightweight POS-like fingerprint. Low entropy = identical sentence structures with different fill-in values. Template monoculture.
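One way such a lightweight fingerprint could work, sketched here with a deliberately coarse token classifier (digits collapse to `NUM`, everything else to `W`). This is an illustration of the idea, not the tool's real fingerprinting code:

```python
import math
import re
from collections import Counter

def fingerprint(text):
    """Coarse structural fingerprint: numeric tokens become NUM, words become W.

    Two outputs that differ only in their numbers get the same fingerprint,
    which is exactly the fill-in-the-blank template pattern this metric catches.
    """
    return " ".join("NUM" if re.search(r"\d", tok) else "W" for tok in text.split())

def structure_entropy(texts):
    """Normalized entropy over fingerprint frequencies (0.0 = one template)."""
    counts = Counter(fingerprint(t) for t in texts)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(counts))

# Same template, different numbers -> identical fingerprints
print(fingerprint("NOI is 500000 at a 5% cap"))
print(fingerprint("NOI is 820000 at a 7% cap"))
```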
Top-10 Concentration KEY METRIC
What percentage of outputs share one of the 10 most common structural patterns. Above 30% = your model will parrot templates instead of learning the task.
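The computation itself is simple once you have one structural fingerprint per output. A sketch, assuming fingerprints are plain strings (the 30% threshold is the guideline from the text):

```python
from collections import Counter

def top10_concentration(fingerprints):
    """Fraction of outputs whose fingerprint is among the 10 most common."""
    counts = Counter(fingerprints)
    top = sum(c for _, c in counts.most_common(10))
    return top / len(fingerprints)

# 90 of 100 outputs share a single template -> heavy concentration
fps = ["W NUM W"] * 90 + [f"unique {i}" for i in range(10)]
print(top10_concentration(fps))  # well above the 0.30 danger line
```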
Bigram Diversity DIVERSITY
Unique word-pair patterns relative to total bigrams. Low ratio = repetitive phrasing. The model memorizes phrases, not concepts.
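A sketch of that ratio (unique bigram types over total bigram tokens); the tokenization is illustrative:

```python
def bigram_diversity(texts):
    """Unique word-pair types divided by total bigram count."""
    total = 0
    unique = set()
    for t in texts:
        words = t.lower().split()
        for pair in zip(words, words[1:]):
            total += 1
            unique.add(pair)
    return len(unique) / total if total else 0.0

# 100 identical outputs -> 2 unique bigrams out of 200 total
print(bigram_diversity(["cap rate value"] * 100))
```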
Opening Diversity DIVERSITY
How many unique first sentences appear in your outputs. If every answer starts "Step 1: Calculate..." — the model will too.
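A sketch of counting unique openings, using a naive sentence split on terminal punctuation (the tool's own splitting may differ):

```python
import re

def opening_diversity(outputs):
    """Unique first sentences divided by total outputs (1.0 = all differ)."""
    if not outputs:
        return 0.0
    firsts = set()
    for text in outputs:
        # take everything up to the first ., ! or ? followed by whitespace
        first = re.split(r"(?<=[.!?])\s", text.strip(), maxsplit=1)[0].lower()
        firsts.add(first)
    return len(firsts) / len(outputs)
```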
Health Score COMBINED
Single 0-100 score. Weighted: vocabulary (25%) + structure (25%) + concentration (30%, inverted) + bigram (20%). Ship above 60.
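The weighting above can be sketched directly. Note this uses only the stated weights; the tool's exact scaling and rounding may differ, so the example inputs below (taken from the WEAK report) are not guaranteed to reproduce its 42 exactly:

```python
def health_score(vocab, structure, concentration, bigram):
    """Weighted 0-100 blend of the four metrics.

    Concentration is inverted because high concentration is bad.
    Weights per the text: 25% + 25% + 30% + 20%.
    """
    score = (0.25 * vocab
             + 0.25 * structure
             + 0.30 * (1.0 - concentration)
             + 0.20 * bigram)
    return round(score * 100)

print(health_score(0.5787, 0.7152, 0.584, 0.0144))  # lands in the WEAK band
```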
One Number. Ship or Fix.
Combined score from four weighted metrics. Tells you in one second if your dataset is ready.
Three Datasets. Three Roles. One Blend.
We built this tool while fine-tuning CRE models. The entropy scores revealed why blending matters more than scale.
Template-Generated
881K pairs. Deterministic math. Calculations are always correct, but 150 unique structures and 58% concentration mean template monoculture.
LLM-Generated
11K pairs at temp 0.7. Better narrative structure, but system prompts create their own structural patterns. 47% concentration.
Signal-Driven
253 pairs from live market data. Every input is different because every signal is different. 4% concentration. Perfect structure entropy.
The rule: template data must be blended with high-entropy data. Blended correctly, the three datasets give you precision + explanation + variation.
Side-by-Side in One Command
Compare training, validation, and synthetic datasets instantly. Spot which one is dragging your blend down.
$ python3 dataset_entropy.py signal.jsonl llm.jsonl template.jsonl --compare

Dataset Comparison
══════════════════════════════════════════════════════════════════
Health Score
──────────────────────────────────────────────────────────────
signal.jsonl   [████████████████████████████████████] 91  EXCELLENT
llm.jsonl      [██████████████████████░░░░░░░░░░░░░░] 56  WEAK
template.jsonl [████████████████░░░░░░░░░░░░░░░░░░░░] 42  WEAK

Vocabulary Entropy
signal.jsonl   [███████████████████████████░░░░░░░░░] 0.7733
llm.jsonl      [██████████████████████░░░░░░░░░░░░░░] 0.6436
template.jsonl [████████████████████░░░░░░░░░░░░░░░░] 0.5787

Top-10 Concentration
signal.jsonl   [█░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░░]  4.0%  LOW
llm.jsonl      [████████████████░░░░░░░░░░░░░░░░░░░░] 47.2%  HIGH
template.jsonl [████████████████████░░░░░░░░░░░░░░░░] 58.4%  HIGH
══════════════════════════════════════════════════════════════════
CI/CD Ready. Script Friendly.
JSON output, Python API, no dependencies. Integrate into any training pipeline.
JSON Output
Pipe into jq, store in your training logs, or feed to your dashboard.
python3 dataset_entropy.py train.jsonl --json | jq '.health_score'
Python API
Import directly into your training script. One function call returns all metrics.
from dataset_entropy import entropy_report
report = entropy_report("train.jsonl")
print(report["health_score"]) # 91
All Formats
OpenAI chat, Alpaca, Q/A, prompt/completion, raw text. Auto-detected.
{"messages": [{"role":"user",...}]} # OpenAI
{"instruction":"...","output":"..."} # Alpaca
{"question":"...","answer":"..."} # Q/A
{"prompt":"...","completion":"..."} # P/C
{"text":"..."} # Raw
CI/CD Gate
Fail builds when data quality drops below threshold. Catch problems before training starts.
score=$(python3 dataset_entropy.py data.jsonl \
--json | jq '.health_score')
if [ "$score" -lt 60 ]; then
echo "FAIL: health $score < 60" && exit 1
fi
More from Swarm & Bee
Open-source tools and infrastructure for vertical AI builders.
dataset-entropy
Training data diversity measurement. Health score 0-100. One file, zero deps.
GitHub →
Swarm-Signal
Real-time market intelligence pipeline. 11 workers, entity scoring, velocity tracking.
GitHub →
swarm-vllm
vLLM 0.17.0 deployment configs for dual-GPU fleet. Blackwell + Ampere.
GitHub →
swarm-capital-markets
CRE debt maturity wall intelligence. $4.7T analysis with real-time signal layer.
GitHub →
Run It on Real Training Data
Download 500-pair samples from any vertical. Run the entropy report. See the health bars. Takes 30 seconds.
# 1. Download the CLI
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/dataset-entropy/main/dataset_entropy.py

# 2. Grab samples: medical, aviation, pharma
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/swarmbeeai-factory/main/samples/medical_sample_500.jsonl
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/swarmbeeai-factory/main/samples/aviation_sample_500.jsonl
$ curl -sO https://raw.githubusercontent.com/SudoSuOps/swarmbeeai-factory/main/samples/pharma_sample_500.jsonl

# 3. Run the entropy report on any sample
$ python3 dataset_entropy.py medical_sample_500.jsonl

# 4. Compare all three verticals
$ python3 dataset_entropy.py medical_sample_500.jsonl aviation_sample_500.jsonl pharma_sample_500.jsonl --compare

# 5. JSON output for CI/CD
$ python3 dataset_entropy.py aviation_sample_500.jsonl --json | jq '.health_score'
Full Datasets Available
SwarmMed 434K PAIRS
28+ medical specialties. Clinical reasoning, pharmacology, imaging, board prep. Platinum-tier verified.
SwarmAviation 45K PAIRS
43+ aviation specialties. Safety compliance, flight ops, ATC, MRO. CoVe-promoted, zero tolerance for error.
SwarmPharma 50K PAIRS
7+ pharmacology specialties. Drug interactions, dosing, PGx, trajectories. SwarmPharma-35B trained on this.
Every dataset ships with DATA_CARD.json, guarantee.json (Merkle root), 5 formats, and a train/eval split.
See pricing →
POST /api/orderpairs
Order custom training pairs via API or form. Pick your vertical, topic, and volume. We cook them on 128 GPUs.
Place an Order
$ curl -X POST https://swarmandbee.ai/api/orderpairs \
    -H "Content-Type: application/json" \
    -d '{
      "vertical": "capital_markets",
      "topic": "CMBS distress",
      "pairs": 2000,
      "email": "[email protected]"   ← required
    }'
{
"order_id": "11fde5db",
"vertical": "capital_markets",
"topic": "CMBS distress",
"pairs": 2000,
"status": "queued",
"estimated_completion": "10h",
"created_at": "2026-03-08T..."
}
Verticals: capital_markets medical pharma aviation custom
Min 100 pairs. Max 500K. 8-gate pipeline. Platinum quality.
We build for devs.
Measure your data quality. Try a sample. Order custom pairs. Train better models.