A machine learning pipeline that classifies news articles as Real or Fake using an ensemble of four NLP models, a FastAPI inference backend, and an interactive Streamlit dashboard — with a built-in OOD (out-of-distribution) reliability guard.
- Overview
- Key Features
- Model Performance
- Tech Stack
- How It Works
- Quick Start
- Starting Both Services
- Docker Usage
- Augmented Training
- API Reference
- Running Tests
- Project Structure
⚠️ Limitations & Known Boundaries- Troubleshooting
FakeGuard combines four machine learning classifiers with a TF-IDF text pipeline to identify misinformation in news articles. It includes two layers of reliability:
Primary layer — trained model ensemble. Four classifiers vote on every article. Predictions where all four models agree are highly reliable within the training distribution.
Secondary layer — OOD ensemble guard. When the models disagree significantly
(standard deviation of P(FAKE) across models > 0.15), the API flags the result as
UNCERTAIN_OOD rather than returning a silent, misleading confident prediction.
| Feature | Engineering decision |
|---|---|
| Four classifiers | Logistic Regression, Random Forest, Gradient Boosting, Linear SVC |
| TF-IDF (50k features, bigrams) | sublinear_tf=True, min_df=2 — removes rare and noisy tokens |
| Singleton predictor | All four models loaded once at startup — no per-request disk I/O |
| LinearSVC calibration | Sigmoid over decision_function produces calibrated probabilities |
| OOD ensemble σ guard | Flags high-disagreement predictions as unreliable before they mislead users |
| Pydantic v2 validation | Rejects empty text, sub-5-word inputs, and malformed URLs before inference |
| Publisher signature stripping | Removes Reuters/AP/AFP datelines so models learn content, not wire-service style |
| URL scraping | BeautifulSoup with <article> → <p> fallback, 10-second timeout |
| Graceful 422 / 503 | Paywalled URLs and missing models return descriptive errors, never silent failures |
| Docker-compose | Both services with persistent model volumes and a healthcheck-gated startup |
Evaluated on a stratified 20% holdout of the ISOT Fake News Dataset (44,898 articles).
| Model | Accuracy | Precision | Recall | F1 | ROC-AUC |
|---|---|---|---|---|---|
| Linear SVC | 0.9977 | 0.9977 | 0.9977 | 0.9977 | 1.0000 |
| Gradient Boosting | 0.9963 | 0.9963 | 0.9963 | 0.9963 | 0.9991 |
| Logistic Regression | 0.9915 | 0.9915 | 0.9915 | 0.9915 | 0.9996 |
| Random Forest | 0.9901 | 0.9901 | 0.9901 | 0.9901 | 0.9994 |
These figures apply only within the training distribution (Reuters-style English, 2015–2018 US political news). See Limitations for the full picture on what these numbers do and do not guarantee.
Data pipeline Pandas · NLTK · scikit-learn TfidfVectorizer
Models Logistic Regression · Random Forest · Gradient Boosting · LinearSVC
API FastAPI 0.135 · Uvicorn · Pydantic v2
UI Streamlit 1.58 · Plotly
Scraping requests · BeautifulSoup4 · lxml
Augmentation feedparser · GitHub open datasets (fake_or_real_news, LIAR)
Deploy Docker · docker-compose
Launcher start_all.py (Python stdlib, zero extra deps)
Raw article text
│
▼
TextPreprocessor lowercase → strip URLs/HTML → lemmatise → remove stopwords
│
▼
TFIDFPipeline 50,000 features · bigrams · sublinear_tf
│
▼
Four ML models Logistic Regression (default) · Random Forest
Gradient Boosting · Linear SVC
│
▼
OOD σ gate stdev(P(FAKE) across models) > 0.15 → reliability = UNCERTAIN_OOD
│
▼
PredictionResponse label · confidence · probabilities · reliability · ensemble_sigma
The vectoriser and all four models are serialised together. Replacing one without the other silently breaks predictions — both must always be retrained and deployed as a set.
# 1. Clone and enter the repository
git clone https://github.com/yourname/fake-news-detector.git
cd fake-news-detector
# 2. Create and activate a virtual environment
python -m venv .venv
# Windows
.venv\Scripts\activate
# macOS / Linux
source .venv/bin/activate
# 3. Install dependencies
pip install --upgrade pip
pip install -r requirements.txt
# 4a. Download the ISOT dataset (requires Kaggle CLI)
python scripts/download_data.py
# 4b. Or place Fake.csv and True.csv in data/raw/ manually
# https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset
# 4c. Quick smoke test — no dataset required
python scripts/run_training.py --sample
# 5. Train all four models
python scripts/run_training.py
⚠️ Important: FakeGuard runs as two processes that must be alive at the same time. The Streamlit dashboard calls the FastAPI backend — starting one without the other causes "API offline" errors. Do not press Ctrl+C on the API before opening the dashboard.
python start_all.pyStarts both services, streams their logs to one terminal, and shuts both down cleanly on a single Ctrl+C.
Terminal 1 (keep this window open):
uvicorn src.app.main:app --host 0.0.0.0 --port 8000Terminal 2 (new window):
streamlit run src/app/streamlit_app.py --server.port 8501Double-click start.bat. Opens two separate Command Prompt windows, one per service.
| Service | URL |
|---|---|
| Streamlit dashboard | http://localhost:8501 |
| FastAPI backend | http://localhost:8000 |
| Interactive API docs | http://localhost:8000/docs |
# Build and start both services
docker-compose up --build
# Train models inside the running API container
docker-compose exec api python scripts/run_training.py
# Run the test suite inside the container
docker-compose exec api pytest tests/ -v
# Stream API logs
docker-compose logs -f apiThe default ISOT-trained models are biased toward Reuters wire-service style (US politics, 2015–2018). To fix this and improve recall on international and multi-domain real news, use the augmented training pipeline:
# Auto-downloads two open datasets from GitHub (no Kaggle account needed)
python scripts/run_augmented_training.py --no-rss
# Full run — also scrapes live RSS feeds (BBC, Guardian, Al Jazeera, Times of India, etc.)
# Run on your local machine where these URLs are accessible
python scripts/run_augmented_training.py
# With WELFake for maximum diversity (72k articles — download from Kaggle first)
python scripts/run_augmented_training.py --welfake data/raw/WELFake_Dataset.csv
# Fast integration test (no ISOT required, ~6,000 articles)
python scripts/run_augmented_training.py --no-rss --max-articles 6000The augmented pipeline:
- Blends five sources: ISOT, fake_or_real_news, LIAR dataset, WELFake (optional), RSS feeds (optional)
- Strips Reuters/AP/AFP datelines so models learn semantic content instead of publisher formatting
- Holds out 15% of non-ISOT real articles as an OOD validation slice
- Prints an OOD recall table — target is ≥ 80% recall on the diverse real-news holdout
After augmented training, restart both services:
python start_all.py| Method | Endpoint | Description |
|---|---|---|
| GET | /health |
Liveness check — 503 if models are not loaded |
| POST | /predict |
Classify a single article. Select model via ?model_name= |
| POST | /predict/all-models |
Run all four models and return a comparison |
| POST | /predict/url |
Scrape a URL and classify the extracted text |
| POST | /predict/batch |
Classify up to 50 articles in one request |
| GET | /metrics |
Per-model evaluation metrics from the last training run |
| GET | /models |
List currently loaded model names |
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"title": "Fed raises rates by 25 basis points",
"text": "The Federal Reserve announced Wednesday it raised its benchmark interest rate by a quarter percentage point citing continued progress toward its 2 percent inflation target."}' \
-G --data-urlencode "model_name=logistic_regression"{
"label": "REAL",
"confidence": 0.9312,
"probabilities": { "FAKE": 0.0688, "REAL": 0.9312 },
"model_used": "logistic_regression",
"input_text_length": 187,
"warning": null,
"reliability": "HIGH",
"ensemble_sigma": 0.0341,
"ood_status": null
}When reliability is UNCERTAIN_OOD, treat the prediction as advisory only:
{
"label": "FAKE",
"confidence": 0.7782,
"reliability": "UNCERTAIN_OOD",
"ensemble_sigma": 0.1745,
"ood_status": "Uncertain — Input pattern deviates from historical baseline training structures. This article may be from a source, language style, or topic domain not well-represented in the training corpus. Treat with caution and apply independent verification."
}| Tier | σ range | Meaning |
|---|---|---|
HIGH |
σ ≤ 0.10 | All models agree closely. Reliable prediction. |
MEDIUM |
σ ≤ 0.15 | Minor disagreement. Result is likely reliable. |
LOW |
σ ≤ 0.20 | Notable disagreement. Treat with caution. |
UNCERTAIN_OOD |
σ > 0.15 + gate | Article may be outside the training distribution. |
# Full suite (no trained model files required)
pytest tests/ -v --cov=src --cov-report=term-missing
# OOD-specific tests only (includes σ math, gate logic, augmenter unit tests)
pytest tests/test_ood_handling.py -v -k "not requires_models"
# Including live model tests (requires trained .pkl files)
pytest tests/test_ood_handling.py -vfake-news-detector/
├── src/
│ ├── data/
│ │ ├── data_augmenter.py multi-source ingestion + publisher signature stripping
│ │ ├── ingestion.py ISOT loader + sample dataset generator
│ │ ├── preprocessing.py TextPreprocessor + TFIDFPipeline
│ │ └── validation.py Pydantic v2 schemas (includes OOD response fields)
│ ├── models/
│ │ ├── evaluator.py confusion matrix + ROC helpers
│ │ ├── predictor.py singleton inference engine + OOD σ gate
│ │ └── trainer.py training loop for all four models
│ ├── app/
│ │ ├── main.py FastAPI routes
│ │ └── streamlit_app.py 4-tab Streamlit dashboard
│ └── utils/
│ └── helpers.py URL scraper + rate limiter
├── scripts/
│ ├── download_data.py Kaggle CLI dataset downloader
│ ├── run_training.py standard ISOT training entry point
│ └── run_augmented_training.py diverse multi-source training with OOD evaluation
├── tests/
│ ├── conftest.py
│ ├── test_api.py
│ ├── test_ood_handling.py OOD gate + augmenter tests (15 tests)
│ ├── test_predictor.py
│ └── test_preprocessing.py
├── data/
│ ├── raw/ ← Fake.csv + True.csv (gitignored)
│ └── processed/ ← generated CSVs (gitignored)
├── models/
│ ├── saved/ ← .pkl files (gitignored)
│ └── metrics/ ← JSON metric files
├── start_all.py cross-platform launcher ← use this
├── start.bat Windows double-click launcher
├── start.sh Linux / macOS launcher
├── docker-compose.yml
├── Dockerfile
├── requirements.txt
└── README.md
This section is required reading before using FakeGuard in any real-world context.
FakeGuard was trained on the ISOT Fake News Dataset, which has a specific and narrow scope. Understanding what it contains determines what the system can and cannot reliably classify.
What the training data covers:
| Category | Details |
|---|---|
| Real news source | Exclusively Reuters wire service |
| Fake news sources | 21 known fake news websites (PoliticsUS, US News Flash, etc.) |
| Time period | 2015 – 2018 only |
| Primary topic | US domestic politics (Trump, Clinton, Congress, elections) |
| Language | English only |
| Article style | Reuters AP-style formal wire copy |
The system performs at 99%+ F1 on content that matches the training distribution:
- US political news written in formal AP / Reuters style
- Conspiracy-theory-style fake news (anonymous sources, unverifiable claims, emotional language)
- English-language articles from 2015–2018 on US government, elections, and policy
The following categories are outside the training distribution. The model may return high-confidence predictions on these inputs, but those numbers do not reflect real-world accuracy. The OOD σ guard will flag many of these — but not all.
By news source:
| Source type | Why it's problematic |
|---|---|
| BBC, Guardian, Al Jazeera | Different sentence structures, vocabulary, and style from Reuters wire copy |
| Indian outlets (Times of India, The Hindu, NDTV) | Different English register, Indian-English idioms, local context |
| Australian, Canadian, South African outlets | Regional news styles, different political vocabulary |
| Tabloids (Daily Mail, NY Post) | Sensational but often factual — different style signals than fake news sites |
| Tech blogs and trade press | Domain-specific vocabulary underrepresented in training data |
| Academic or scientific news | Formal but factual — may read differently from Reuters political copy |
| Financial and business news (non-Reuters) | Different stylistic markers not learned during training |
By topic domain:
| Topic | Why it's problematic |
|---|---|
| Sports news | Rarely appeared in training data; statistical/score reporting style is unfamiliar |
| Entertainment and celebrity news | Informal tone can pattern-match to fake news stylistics |
| Science and health news | Technical vocabulary has low TF-IDF weight in the trained vectoriser |
| International politics (non-US) | Very few training examples for EU, Asia, Middle East, African politics |
| Economics and finance | Moderate coverage only through US Treasury / Fed-related Reuters articles |
| Local and regional news | No representation in training data |
| Opinion and editorial | Subjective writing style may resemble fake news patterns |
| Satire | Satirical language is semantically similar to conspiracy content |
By time period:
| Period | Why it's problematic |
|---|---|
| 2019 and later | Vocabulary drift: COVID-19, Ukraine, AI boom, new political figures — all unseen during training |
| Before 2015 | Limited representation in training data |
By language:
| Language | Status |
|---|---|
| English | Supported (within distribution constraints above) |
| Hindi, Tamil, Telugu, Bengali | Not supported |
| Urdu, Arabic, French, Spanish | Not supported |
| Any non-English language | Not supported — results meaningless |
A prediction of 99% FAKE does not mean the article is fake. It means the article's n-gram pattern in TF-IDF space is 99% similar to articles in the fake-news cluster of the training set. An article from a credible Indian newspaper about cricket will pattern-match to fake news because neither Indian papers nor cricket were in the training data.
The OOD σ guard catches many of these cases, but it is not infallible. If all four models happen to agree on the wrong answer (because the input superficially resembles training fake-news patterns), the guard will not fire and the wrong prediction will be returned with high apparent confidence.
| FakeGuard IS | FakeGuard IS NOT |
|---|---|
| A research and educational demonstration of ML-based text classification | A production fact-checking tool suitable for journalism or legal decisions |
| Reliable for classifying Reuters-style US political content from 2015–2018 | A general-purpose detector for all news in all domains |
| Useful for comparing ML model performance on a benchmark NLP task | A replacement for human editorial judgment or professional fact-checkers |
| A starting point to build a domain-specific classifier for your use case | Authoritative on articles outside its training distribution |
The run_augmented_training.py script adds three additional data sources
(fake_or_real_news, LIAR dataset, and optional RSS feeds and WELFake) to partially
address the Reuters-only bias. After augmented training:
- OOD real-news recall improves from ~30–50% to approximately 80–89% on a diverse holdout
- The system gains exposure to Guardian, PolitiFact, CNN, and Politico styles
- International and sports coverage remains limited without RSS feed data
- The OOD σ guard remains active as a permanent safety net
Even with augmented training, this system should not be used as a definitive source of truth. It is a classification model trained on labelled data, not a reasoning system with factual knowledge.
| Symptom | Cause | Fix |
|---|---|---|
| "API offline" in sidebar | FastAPI not running | Use python start_all.py — both must run simultaneously |
| "API docs" page not found | API was stopped | Restart the API and keep its window open |
| uvicorn window appears to hang | Web servers block intentionally — this is correct | Open a separate terminal for Streamlit |
| Models not found (503) | Training not run yet | python scripts/run_training.py --sample |
| Real article classified as FAKE | Distribution shift (OOD input) | Check reliability field — if UNCERTAIN_OOD, the article is outside training scope |
| High confidence wrong prediction | OOD input that bypassed the gate | Run augmented training; treat high-confidence results on non-Reuters content with scepticism |
| Port already in use | Previous process still running | netstat -ano | findstr :8000 then kill that PID |
--sample overwrites production models |
run_training.py --sample run in wrong environment |
Set ENVIRONMENT=production to block destructive commands |