tkarim45/active-learning

active-learning

Labels cost money. Active learning asks: instead of labeling random examples, label the ones the model learns most from — and reach the same accuracy with fewer labels. This project measures exactly how many labels each query strategy needs to hit a target accuracy, against a random baseline, on a real dataset.

active-learning                 # labels-to-target table
active-learning --target-acc 0.95 --batch 10
active-learning --json

The loop

Start from a tiny labeled seed (2 examples per class), a large pool of unlabeled digits, and a held-out test set. Each round: fit a logistic-regression model, score the test set, then query a batch from the pool with the chosen strategy, "label" them, and repeat — recording the learning curve (labels used → accuracy). The query strategies:

  • random — the baseline (label a random batch).
  • least_confidence — label where the top class probability is lowest.
  • margin — label where the top-two classes are closest (smallest margin).
  • entropy — label where the probability distribution is flattest.
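One round of the loop above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `run_loop`, `random_query`, and the `make_model` factory are hypothetical names, and the model is assumed to be any object with scikit-learn-style `fit`, `predict`, and `predict_proba` methods. Only the random baseline is shown; the other strategies would plug in with the same `(proba, batch, rng)` signature.

```python
import numpy as np

def random_query(proba, batch, rng):
    """Baseline strategy: pick `batch` pool indices uniformly at random."""
    return rng.choice(len(proba), size=min(batch, len(proba)), replace=False)

def run_loop(make_model, X_lab, y_lab, X_pool, y_pool, X_test, y_test,
             query, batch=10, rounds=5, seed=0):
    """Return the learning curve as a list of (labels_used, test_accuracy)."""
    rng = np.random.default_rng(seed)
    curve = []
    for _ in range(rounds):
        # Fit a fresh model on everything labeled so far.
        model = make_model()
        model.fit(X_lab, y_lab)
        # Score the held-out test set and record one point on the curve.
        acc = float(np.mean(model.predict(X_test) == y_test))
        curve.append((len(y_lab), acc))
        if len(X_pool) == 0:
            break
        # Ask the strategy which pool points to label next.
        picked = query(model.predict_proba(X_pool), batch, rng)
        # "Labeling" here just reveals the held-back pool labels.
        X_lab = np.vstack([X_lab, X_pool[picked]])
        y_lab = np.concatenate([y_lab, y_pool[picked]])
        # Remove the newly labeled points from the pool.
        keep = np.ones(len(X_pool), dtype=bool)
        keep[picked] = False
        X_pool, y_pool = X_pool[keep], y_pool[keep]
    return curve
```

With scikit-learn installed, `make_model` could be something like `lambda: LogisticRegression(max_iter=1000)`; the `max_iter` value is an assumption, not a setting taken from this repository.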

Measured results

active-learning on scikit-learn digits (10 classes, seed 20 / pool 1237 / test 540), target 90%:

| strategy | labels → 90% | saved vs random | AUC | final acc |
|---|---|---|---|---|
| random | 130 | | 0.885 | 94.6% |
| least_confidence | 150 | −15% (worse) | 0.899 | 97.0% |
| margin | 80 | +38% | 0.928 | 97.6% |
| entropy | 180 | −38% (worse) | 0.883 | 96.7% |
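The "labels → 90%" and "saved vs random" columns can be derived from a learning curve along these lines. The function names are hypothetical, and the AUC definition (trapezoidal area under accuracy vs labels, divided by the label range) is an assumption: the README does not say how its AUC column is computed. The toy curve is made up for illustration; only the 130 / 80 / 150 / 180 label counts come from the table.

```python
def labels_to_target(curve, target):
    """First label count at which accuracy reaches `target`, else None."""
    for n_labels, acc in curve:
        if acc >= target:
            return n_labels
    return None

def saved_vs_random(random_labels, strategy_labels):
    """Fraction of labels saved relative to random; negative means worse."""
    return (random_labels - strategy_labels) / random_labels

def normalized_auc(curve):
    """Assumed AUC definition: trapezoidal area divided by the label range."""
    area = 0.0
    for (n0, a0), (n1, a1) in zip(curve, curve[1:]):
        area += (n1 - n0) * (a0 + a1) / 2
    return area / (curve[-1][0] - curve[0][0])

# Toy learning curve (illustrative numbers, not measured results).
toy_curve = [(20, 0.60), (30, 0.80), (40, 0.90), (50, 0.92)]
print(labels_to_target(toy_curve, 0.90))      # 40

# Label counts taken from the results table.
print(f"{saved_vs_random(130, 80):+.0%}")     # +38%  (margin)
print(f"{saved_vs_random(130, 150):+.0%}")    # -15%  (least_confidence)
print(f"{saved_vs_random(130, 180):+.0%}")    # -38%  (entropy)
```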

The honest finding has two halves:

  • Active learning works: margin sampling reaches 90% accuracy with 80 labels vs random's 130, a 38% labeling saving, and ends at a higher final accuracy (97.6% vs 94.6%) at the same budget. In a label-scarce setting that's real money saved.
  • But "use active learning" is not the advice; "use the right strategy" is. On this dataset the two most popular uncertainty heuristics, least-confidence and entropy, actually underperform random (needing 15% and 38% more labels to hit the target). They fixate on inherently ambiguous points (near-duplicate or outlier digits) that the model can never get right and that don't move the decision boundary. Margin sampling avoids that trap by targeting points right on a class boundary (top-two nearly tied), which is where a label actually reshapes the model. Pick the wrong query rule and active learning costs you labels.
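To see how the three uncertainty rules can disagree, here is a small sketch of the standard scoring formulas applied to three made-up probability rows. The function names and the probabilities are illustrative assumptions, not taken from the project's code or its benchmark run. It only shows that margin favors a clean two-way tie while least-confidence and entropy favor a diffuse prediction; it does not by itself demonstrate the results above.

```python
import numpy as np

def least_confidence(proba):
    """1 minus the top class probability (higher = query first)."""
    return 1.0 - proba.max(axis=1)

def margin(proba):
    """Gap between the two largest probabilities (smaller = query first)."""
    top2 = np.sort(proba, axis=1)[:, -2:]  # two largest values per row
    return top2[:, 1] - top2[:, 0]

def entropy(proba):
    """Shannon entropy in nats (higher = flatter = query first)."""
    p = np.clip(proba, 1e-12, 1.0)  # avoid log(0)
    return -(p * np.log(p)).sum(axis=1)

# Illustrative predicted probabilities for three pool points.
proba = np.array([
    [0.50, 0.50, 0.00, 0.00],  # A: exact tie between two classes
    [0.40, 0.20, 0.20, 0.20],  # B: spread across all classes
    [0.90, 0.05, 0.03, 0.02],  # C: confident prediction
])

picks = {
    "least_confidence": int(np.argmax(least_confidence(proba))),
    "margin": int(np.argmin(margin(proba))),
    "entropy": int(np.argmax(entropy(proba))),
}
print(picks)  # {'least_confidence': 1, 'margin': 0, 'entropy': 1}
```

Point A sits between exactly two classes, so margin queries it first; point B has the lowest top probability and the flattest distribution, so the other two rules pick it instead.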

Why it matters

The label budget is the binding constraint on most applied ML projects. This shows the discipline to (a) measure label efficiency as a learning curve rather than assume it, and (b) report the uncomfortable part — that a fashionable technique can backfire, and the win comes from the specific strategy, validated on real data, not the buzzword.

Install & test

pip install -e ".[dev]"
pytest -q          # 6 passed — incl. "an active strategy beats random on labels-to-target"

Stack

scikit-learn (logistic regression, digits), NumPy. Pool-based active-learning loop with least-confidence / margin / entropy query strategies, learning-curve + labels-to-target scoring.

License

MIT
