Labels cost money. Active learning asks: instead of labeling random examples, label the ones the model learns most from — and reach the same accuracy with fewer labels. This project measures exactly how many labels each query strategy needs to hit a target accuracy, against a random baseline, on a real dataset.
active-learning # labels-to-target table
active-learning --target-acc 0.95 --batch 10
active-learning --json
Start from a tiny labeled seed (2 examples per class), a large pool of unlabeled digits, and a held-out test set. Each round: fit a logistic-regression model, score the test set, then query a batch from the pool with the chosen strategy, "label" them, and repeat, recording the learning curve (labels used → accuracy). The query strategies:
- random — the baseline (label a random batch).
- least_confidence — label where the top class probability is lowest.
- margin — label where the top-two classes are closest (smallest margin).
- entropy — label where the probability distribution is flattest.
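As a minimal sketch of the four query rules above, each uncertainty strategy can be written as a score over one row of predicted class probabilities, where a higher score means "query this first". This is plain Python for readability and is not the project's actual code. The helpers `select_batch` and `random_batch` and the example probabilities are hypothetical. In the real loop the rows would come from the fitted model, for example scikit-learn's `predict_proba`.

```python
import math
import random

def least_confidence(p):
    """Higher when the top class probability is lower."""
    return 1.0 - max(p)

def margin(p):
    """Higher (closer to 0) when the top-two classes are closer together."""
    best, second = sorted(p, reverse=True)[:2]
    return -(best - second)

def entropy(p):
    """Higher when the probability distribution is flatter."""
    return -sum(q * math.log(q) for q in p if q > 0)

def select_batch(pool_probs, score, batch):
    """Indices of the `batch` pool rows with the highest uncertainty score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]),
                    reverse=True)
    return ranked[:batch]

def random_batch(n_pool, batch, rng):
    """Baseline: a uniformly random batch of pool indices."""
    return rng.sample(range(n_pool), batch)

# Made-up predicted probabilities for a 4-example pool (3 classes).
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.50, 0.49, 0.01],
    [0.36, 0.33, 0.31],
]

margin_picks = select_batch(pool_probs, margin, batch=2)
baseline_picks = random_batch(len(pool_probs), batch=2, rng=random.Random(0))
print("margin:", margin_picks, "random:", baseline_picks)
```

Passing the score function as an argument keeps the loop identical across strategies, so only the query rule changes between runs.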
active-learning on scikit-learn digits (10 classes, seed 20 / pool 1237 / test 540), target 90%:
| strategy | labels → 90% | saved vs random | AUC | final acc |
|---|---|---|---|---|
| random | 130 | — | 0.885 | 94.6% |
| least_confidence | 150 | −15% (worse) | 0.899 | 97.0% |
| margin | 80 | +38% | 0.928 | 97.6% |
| entropy | 180 | −38% (worse) | 0.883 | 96.7% |
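The "labels → 90%" and "saved vs random" columns can be reproduced with two small functions. This is a sketch under assumptions: the function names are hypothetical, the learning curve is taken to be a list of `(labels_used, accuracy)` pairs in increasing label order, and `toy_curve` is made up for illustration. The label counts in `table` are the ones reported above.

```python
def labels_to_target(curve, target):
    """First label count at which accuracy reaches `target`, or None if never."""
    for n_labels, acc in curve:
        if acc >= target:
            return n_labels
    return None

def saved_vs_random(strategy_labels, random_labels):
    """Fraction of labels saved relative to random (negative means worse)."""
    return (random_labels - strategy_labels) / random_labels

# Made-up curve: seed of 20 labels, then batches of 10.
toy_curve = [(20, 0.61), (30, 0.72), (40, 0.91)]
print(labels_to_target(toy_curve, 0.90))

# Label counts from the table above; random needed 130.
table = {"least_confidence": 150, "margin": 80, "entropy": 180}
savings = {name: round(100 * saved_vs_random(n, 130)) for name, n in table.items()}
print(savings)
```

Returning `None` when the target is never reached keeps a failed run from being mistaken for a cheap one.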
The honest finding has two halves:
- Active learning works: margin sampling reaches 90% accuracy with 80 labels vs random's 130, a 38% labeling saving, and it ends with a higher final accuracy (97.6% vs 94.6%) at the same budget. In a label-scarce setting that's real money saved.
- But "use active learning" is not the advice; "use the right strategy" is. On this dataset the two most popular uncertainty heuristics, least-confidence and entropy, actually underperform random (needing 15% and 38% more labels to hit the target). The likely cause: they fixate on inherently ambiguous points (near-duplicate or outlier digits) that the model can never get right and that don't move the decision boundary. Margin sampling avoids that trap by targeting points right on a class boundary (top-two nearly tied), which is where a label actually reshapes the model. Pick the wrong query rule and active learning costs you labels.
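A small worked example makes the difference between the query rules concrete. The two probability vectors below are made up for illustration, not taken from the digits run, and the helper `describe` is hypothetical. Whether the "diffuse" case matches the ambiguous digits in this project is an assumption.

```python
import math

def describe(p):
    """Return the three uncertainty quantities for one probability vector."""
    best, second = sorted(p, reverse=True)[:2]
    return {
        "top_prob": best,          # least_confidence queries the lowest
        "margin": best - second,   # margin queries the smallest
        "entropy": -sum(q * math.log(q) for q in p if q > 0),  # entropy queries the highest
    }

boundary = [0.48, 0.47, 0.03, 0.02]   # two classes nearly tied
diffuse = [0.40, 0.20, 0.20, 0.20]    # no clear runner-up

b, d = describe(boundary), describe(diffuse)

# Which of the two points would each strategy query first?
picks = {
    "least_confidence": "diffuse" if d["top_prob"] < b["top_prob"] else "boundary",
    "margin": "diffuse" if d["margin"] < b["margin"] else "boundary",
    "entropy": "diffuse" if d["entropy"] > b["entropy"] else "boundary",
}
print(picks)
```

Here least-confidence and entropy both prefer the diffuse point, while margin prefers the point sitting between two specific classes.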
The label budget is the binding constraint on most applied ML projects. This project shows the discipline to (a) measure label efficiency as a learning curve rather than assume it, and (b) report the uncomfortable part: a fashionable technique can backfire, and the win comes from the specific strategy, validated on real data, not the buzzword.
pip install -e ".[dev]"
pytest -q # 6 passed, incl. "an active strategy beats random on labels-to-target"
scikit-learn (logistic regression, digits), NumPy. Pool-based active-learning loop with least-confidence / margin / entropy query strategies, learning-curve + labels-to-target scoring.
MIT