Staged Hybrid Ranking Engine (SHRE)

An intelligent, explainable AI recruiter

Rank the Top 100 Senior AI Engineers from a pool of 100k+ candidates — fast, accurate, and fully explainable.

Live demo: Staged Hybrid Ranking Engine (SHRE) — a Hugging Face Space by Aditya1002

This repository implements a hybrid architecture (Anomaly Pre-Filter → Enriched Feature Engineering → ML Ensemble → Learning-to-Rank) wrapped in a pure-Python CTAE fallback, so a shortlist is still produced if any part of the ML stack fails. It extends a 78-feature base model with four targeted enhancements for deeper JD understanding, richer signal integration, and more accurate, explainable shortlists, using only open-source components at no extra cost.

Built for: Data & AI Challenge — Intelligent Candidate Discovery. A system that doesn't just filter, but intelligently ranks: deep job understanding, contextual relevance beyond keywords, full signal integration, and a lightning-fast, expertly-ranked shortlist with grounded reasoning.

Live Demo

The engine is deployed and ready to use as a Hugging Face Space:

Staged Hybrid Ranking Engine (SHRE) — a Hugging Face Space by Aditya1002

Paste a job description (optional), upload a candidates.jsonl, and get a ranked shortlist with grounded reasoning plus downloadable CSV and XLSX outputs.

The application interface: pipeline overview, the four enhancements, and the candidate workflow.

At a glance


Task	Rank & shortlist the Top 100 Senior AI Engineers from 100k+ profiles
Architecture	4-stage hybrid: anomaly filter → 93 features → ensemble → LambdaMART, with CTAE fallback
Semantic engine	Multi-vector `all-MiniLM-L6-v2` embeddings + FAISS (TF-IDF graceful fallback)
Held-out accuracy	90.7% validation / 88.0% test (93-feature LTR run)
Ranking quality	NDCG@10 0.991, NDCG@100 0.997; Spearman 0.894 over the full held-out fold
Inference	Loads saved models — no retraining — for fast scoring at pool scale
Deep JD understanding	Paste any JD (`--jd`); it re-targets the experience gate + semantic fit
Reliability	Automatic fallback chain: LTR → validated ensemble → pure-Python CTAE

The Challenge → How SHRE answers it

Challenge requirement	How SHRE delivers it
Deep Job Understanding — interpret complex, nuanced JDs	`JobDescription` parser turns any raw JD into 3 semantic facets + an experience band (`--jd`), re-targeting both the Stage-1 gate and the semantic-fit signal
Contextual Relevance — see beyond keywords	Multi-vector transformer embeddings (skills / trajectory / full profile) matched against JD facets via FAISS + weighted fusion
Signal Integration — profile + career + behavioral signals	93 dense features: 78 base + 5 anomaly + 5 behavioral (recruiter demand, OSS, reliability) + 5 semantic
The Output — fast, accurate, expertly-ranked shortlist	LambdaMART LTR fused with the ensemble; fast inference mode; top-100 with grounded, non-hallucinated reasoning
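
The weighted fusion in the Contextual Relevance row can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the code in src/shre/semantic.py: the pairing of candidate vectors with JD facets, the weights in `FACET_WEIGHTS`, and the function names are hypothetical, and toy 3-dimensional vectors stand in for the 384-dimensional `all-MiniLM-L6-v2` embeddings.

```python
import math

# Hypothetical fusion weights; the repository's actual weights are not stated in this README.
FACET_WEIGHTS = {"skills": 0.5, "trajectory": 0.3, "profile": 0.2}


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def semantic_fit(candidate_vecs, jd_vecs, weights=FACET_WEIGHTS):
    """Weighted sum of per-facet cosine similarities."""
    return sum(
        weight * cosine(candidate_vecs[name], jd_vecs[name])
        for name, weight in weights.items()
    )


# Toy vectors: skills match exactly, trajectory is orthogonal, profile partly overlaps.
candidate = {"skills": [1.0, 0.0, 0.0], "trajectory": [0.0, 1.0, 0.0], "profile": [1.0, 1.0, 0.0]}
jd = {"skills": [1.0, 0.0, 0.0], "trajectory": [0.0, 0.0, 1.0], "profile": [1.0, 0.0, 0.0]}
fit = semantic_fit(candidate, jd)
print(round(fit, 3))
```

Keeping one similarity per facet, rather than a single whole-profile score, is what lets the reasoning text say which facet drove the match.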

The Four Enhancements

#	Enhancement	What it does	Graceful degradation
1	Multi-Vector Semantic Layer	Separately embeds candidate skills, experience trajectory, and full profile with `all-MiniLM-L6-v2` + FAISS retrieval, matches each against three JD facets, and fuses them with weighted similarity.	Falls back to a scikit-learn TF-IDF encoder if transformers/FAISS are unavailable.
2	LambdaMART / XGBoost-LTR head	An `XGBRanker` (`rank:ndcg`) stacked on the ensemble's class-probability meta-feature and fused with the ensemble ordering to optimize the full ranked list.	Falls back to the validated ensemble ordering.
3	Enhanced Honeypot / Anomaly Detection	A multi-signal pre-filter catching timeline overlaps, impossible skill durations, and synthetic-profile flags; its anomaly score also feeds the model.	Continuous score is always produced; never blocks the pipeline.
4	Behavioral Scoring Module	Distills under-utilized platform activity / recruiter-demand / OSS / reliability signals into interpretable sub-scores.	Neutral defaults for missing signals.
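
Two of the anomaly checks named in row 3 (timeline overlaps and impossible skill durations) can be sketched as follows. The field names (`start`, `end`), the equal weighting of checks, and every function name here are assumptions for illustration; the detector in src/shre/anomaly.py may use different signals and thresholds. Note that overlapping jobs can be legitimate (part-time or advisory roles), which is one reason to emit a continuous score rather than a hard reject.

```python
from datetime import date


def count_overlaps(jobs):
    """Count jobs that start before an earlier job (by start date) has ended."""
    ordered = sorted(jobs, key=lambda job: job["start"])
    overlaps = 0
    latest_end = None
    for job in ordered:
        if latest_end is not None and job["start"] < latest_end:
            overlaps += 1
        latest_end = job["end"] if latest_end is None else max(latest_end, job["end"])
    return overlaps


def impossible_skills(skills, career_months):
    """Names of skills claimed for longer than the whole career."""
    return [name for name, months in skills.items() if months > career_months]


def anomaly_score(jobs, skills, career_months):
    """Fraction of checks that fired, in [0, 1] (hypothetical equal weighting)."""
    flags = [
        count_overlaps(jobs) > 0,
        len(impossible_skills(skills, career_months)) > 0,
    ]
    return sum(flags) / len(flags)


suspicious_jobs = [
    {"start": date(2018, 1, 1), "end": date(2021, 1, 1)},
    {"start": date(2020, 6, 1), "end": date(2023, 1, 1)},  # starts before the first job ends
]
suspicious_skills = {"python": 60, "rag": 200}  # 200 months of RAG in a 60-month career
score = anomaly_score(suspicious_jobs, suspicious_skills, career_months=60)
print(score)
```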

Base model → "Opus 4.8" (this repo)

Capability	Base model	Opus 4.8 (this repo)
Features	78 base	93 (+anomaly +behavioral +semantic)
Relevance signal	Ensemble class-prob vote	Ensemble + LambdaMART LTR fusion
Job understanding	Hardcoded role	Parsed from any JD (`--jd`)
Semantic matching	—	Multi-vector transformer + FAISS
Behavioral signals	Partial (a few)	Full demand/OSS/reliability sub-scores
Anomaly handling	1.05x / 1.5x heuristics	Multi-signal scored detector
Scoring at scale	Retrain each run	Fast inference (load saved models)
Macro-F1 (5-fold)	0.731	0.794

Two engines, one guarantee

	SHRE (primary)	CTAE (fallback)
Type	ML ensemble + LTR + semantics	Pure-Python rule engine
Dependencies	xgboost, lightgbm, catboost, transformers…	None (standard library only)
When it runs	Default	If any SHRE stage / import fails
Output	Identical CSV schema	Identical CSV schema
Purpose	Maximum ranking quality	Never fail to produce a shortlist

Architecture Overview

The system processes candidate data through four stages:

Stage 1 (Anomaly Pre-Filter): AnomalyDetector drops synthetic/honeypot profiles (timeline, skill, and synthetic anomalies), then gates on a JD-driven experience band (parsed from the supplied job description, or the default 5–9 target) and a minimum of 2 skill pillars.
Stage 2 (Enriched Feature Engineering): Computes 93 dense signals = 78 base (career progression, domain specialization in RAG/LLMs/Vector DBs, company classification, platform interactions) + 5 anomaly + 5 behavioral + 5 multi-vector semantic features.
Stage 3 (Ensemble + Learning-to-Rank): A Voting Ensemble (XGBoost + LightGBM + CatBoost) — trained with leakage-safe SMOTE inside CV — produces a class-probability score that, with the enriched features, feeds a LambdaMART (XGBoost rank:ndcg) head; the two are fused into the final ranking score.
Stage 4 (Ranker & Reasoning): Sorts the pool and builds data-backed, non-hallucinated reasoning (citing semantic fit, behavioral signals, and anomaly checks) for each of the top 100. Emits the canonical submission.csv, an enriched submission_detailed.csv, and a formatted submission.xlsx.
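
The Stage 1 gate can be sketched as a single predicate. This is a simplified illustration, not the logic in src/shre/stage1_filter.py: the candidate field names (`years_experience`, `skill_pillars`, `is_synthetic`) and the `ExperienceBand` class are hypothetical, and the 3 to 15 year band is taken from the default-JD line in the sample console output further down.

```python
from dataclasses import dataclass


@dataclass
class ExperienceBand:
    """Inclusive experience gate in years (hypothetical container)."""
    min_years: float
    max_years: float


def passes_gate(candidate, band, min_pillars=2):
    """Keep a candidate only if it is non-synthetic, in band, and covers enough skill pillars."""
    if candidate.get("is_synthetic", False):
        return False
    years = candidate.get("years_experience", 0.0)
    if not (band.min_years <= years <= band.max_years):
        return False
    return len(set(candidate.get("skill_pillars", []))) >= min_pillars


band = ExperienceBand(min_years=3, max_years=15)
pool = [
    {"id": "a", "years_experience": 6.9, "skill_pillars": ["rag", "vector_db"], "is_synthetic": False},
    {"id": "b", "years_experience": 1.5, "skill_pillars": ["rag", "llm"], "is_synthetic": False},
    {"id": "c", "years_experience": 7.0, "skill_pillars": ["llm"], "is_synthetic": False},
    {"id": "d", "years_experience": 8.0, "skill_pillars": ["rag", "llm"], "is_synthetic": True},
]
viable = [c["id"] for c in pool if passes_gate(c, band)]
print(viable)
```

Only candidate `a` survives here: `b` is below the band, `c` covers a single pillar, and `d` is flagged as synthetic.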

If any library or model load fails, the pipeline automatically falls back: LTR → validated ensemble → pure-Python CTAE ranker.

By default Stage 3 runs in fast inference mode — it loads the saved ensemble + LambdaMART artifacts and scores the pool with no retraining (the path that scales to a 100k+ pool). --train forces a full retrain, and --jd re-targets the role (see How to Run).

Pipeline Flow

flowchart TD
    A["candidates.jsonl<br/>(100k+ raw profiles)"] --> B

    subgraph S1["Stage 1 - Anomaly Pre-Filter"]
        B["AnomalyDetector<br/>timeline / skill / synthetic checks"] --> C{"Synthetic?<br/>or out of JD<br/>experience band?<br/>or &lt; 2 pillars?"}
    end
    C -- "drop" --> X["filtered out"]
    C -- "keep" --> D

    subgraph S2["Stage 2 - Feature Engineering (93 signals)"]
        D["78 base features"] --> E["+5 anomaly +5 behavioral<br/>+5 multi-vector semantic"]
    end

    JD["Job Description<br/>(--jd: text or file)"] -. "facets + experience band" .-> B
    JD -. "3 JD facets" .-> E

    E --> F

    subgraph S3["Stage 3 - Ensemble + Learning-to-Rank"]
        F["Voting Ensemble<br/>XGBoost + LightGBM + CatBoost<br/>(leakage-safe SMOTE in CV)"] --> G["ensemble score<br/>(meta-feature)"]
        G --> H["LambdaMART<br/>XGBRanker rank:ndcg"]
        H --> I["Rank Fusion<br/>0.6 ensemble + 0.4 LTR"]
    end

    I --> J

    subgraph S4["Stage 4 - Ranker & Reasoning"]
        J["Sort + grounded reasoning"] --> K["submission.csv (Top 100)<br/>submission_detailed.csv<br/>submission.xlsx + rankings_full.csv"]
    end

    F -. "on failure" .-> CTAE["CTAE fallback<br/>(pure-Python rule ranker)"]
    H -. "on failure" .-> F
    CTAE --> K

    classDef io fill:#0b3d2e,stroke:#10b981,color:#d1fae5;
    classDef jd fill:#3b1f4b,stroke:#a855f7,color:#f3e8ff;
    class A,K io;
    class JD jd;
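
The rank-fusion node in the flowchart (0.6 ensemble + 0.4 LTR) can be sketched as below. The weights come from the diagram; the min-max normalisation step and the function names are assumptions, added because raw `XGBRanker` scores are unbounded and need to be on a common scale with the ensemble probability before they are mixed.

```python
def min_max(values):
    """Rescale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse(ensemble_scores, ltr_scores, w_ensemble=0.6, w_ltr=0.4):
    """Weighted sum of the two normalised score lists, one value per candidate."""
    ens = min_max(ensemble_scores)
    ltr = min_max(ltr_scores)
    return [w_ensemble * e + w_ltr * l for e, l in zip(ens, ltr)]


ensemble_scores = [0.9, 0.5, 0.1]   # class-probability style scores
ltr_scores = [2.0, 3.0, -1.0]       # raw ranker margins
fused = fuse(ensemble_scores, ltr_scores)
ranking = sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
print(fused, ranking)
```

In this toy case the LTR head prefers the second candidate, but the heavier ensemble weight keeps the first candidate on top.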

Fallback reliability chain

flowchart LR
    LTR["LambdaMART LTR head"] -->|fails| ENS["Validated Voting Ensemble"]
    ENS -->|fails| CTAE["Pure-Python CTAE ranker"]
    LTR -.->|preferred| OUT(["Top-100 shortlist"])
    ENS -.-> OUT
    CTAE -.-> OUT
    classDef ok fill:#0b3d2e,stroke:#10b981,color:#d1fae5;
    class OUT ok;
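
The chain above amounts to "try each ranker in order and keep the first one that works". A minimal sketch, assuming each tier is exposed as a callable that takes the candidate pool; `rank_with_fallback` and the three stub rankers are hypothetical names, not functions from this repository.

```python
import logging

logger = logging.getLogger("shre.fallback")


def rank_with_fallback(pool, rankers):
    """Try each (name, callable) in order and return the first successful result."""
    errors = []
    for name, ranker in rankers:
        try:
            return name, ranker(pool)
        except Exception as exc:  # broad on purpose: any failure moves to the next tier
            logger.warning("%s failed (%s); falling back", name, exc)
            errors.append((name, exc))
    raise RuntimeError(f"all rankers failed: {errors}")


def ltr_ranker(pool):
    raise ImportError("xgboost not installed")  # simulate a missing dependency


def ensemble_ranker(pool):
    raise FileNotFoundError("ensemble artifact missing")  # simulate a bad artifact


def ctae_ranker(pool):
    # Stand-in rule: order by years of experience, descending.
    return sorted(pool, key=lambda c: c["years"], reverse=True)


engine, ranked = rank_with_fallback(
    [{"id": "a", "years": 5}, {"id": "b", "years": 8}],
    [("ltr", ltr_ranker), ("ensemble", ensemble_ranker), ("ctae", ctae_ranker)],
)
print(engine, [c["id"] for c in ranked])
```

Returning the name of the engine that produced the result makes it easy to log which tier a given shortlist came from.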

Deep Job Understanding (custom JD)

The target role is no longer hardcoded. src/shre/job_description.py parses any raw job description — text or file — into the three semantic facets the engine matches against, plus an experience band that re-targets the Stage-1 gate.

Define the role: paste a job description to re-target the experience gate and the semantic-fit signal.

flowchart LR
    R["Raw JD text"] --> P["JobDescription parser<br/>(section + regex heuristics)"]
    P --> F1["required_skills facet"]
    P --> F2["ideal_experience facet"]
    P --> F3["role_mission facet"]
    P --> B["experience band<br/>min / max / target years"]
    F1 & F2 & F3 --> SEM["Multi-Vector Semantic Layer"]
    B --> GATE["Stage-1 experience gate"]

Example: a Staff ML Engineer (8–12 yrs) JD tightens the experience gate and re-anchors semantic fit, so a different shortlist surfaces than the default Founding Senior AI Engineer (5–9 yrs) role — without retraining.
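
A regex heuristic for pulling the target experience band out of raw JD text might look like this. It is a sketch only: the patterns, `parse_target_band`, and the handling of "N+ years" are assumptions, and the parser in src/shre/job_description.py may work differently. The 5 to 9 year default is the canonical target quoted above.

```python
import re

# "8-12 yrs", "8\u201312 years", "5 to 9 years" (hyphen, en dash or em dash as separator)
_RANGE = re.compile(
    r"(\d{1,2})\s*(?:[-\u2013\u2014]|to)\s*(\d{1,2})\s*(?:years?|yrs?)",
    re.IGNORECASE,
)
# "7+ years": a minimum with no stated maximum
_MIN_ONLY = re.compile(r"(\d{1,2})\s*\+\s*(?:years?|yrs?)", re.IGNORECASE)

DEFAULT_TARGET = (5, 9)  # canonical Founding Senior AI Engineer target


def parse_target_band(jd_text):
    """Return (min_years, max_years); max is None for open-ended 'N+ years'."""
    match = _RANGE.search(jd_text)
    if match:
        low, high = int(match.group(1)), int(match.group(2))
        return (min(low, high), max(low, high))
    match = _MIN_ONLY.search(jd_text)
    if match:
        return (int(match.group(1)), None)
    return DEFAULT_TARGET


staff_band = parse_target_band("Staff ML Engineer (8\u201312 yrs) owning the ranking stack")
print(staff_band)
```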

The parsed role flows all the way through to the deliverable. The JD parser also extracts a role title, which (with the experience band) is threaded into Stage 4 so the ranked submission.xlsx subtitle and run logs name the actual role being hired for — e.g. "Staff Machine Learning Engineer · 8–12 yrs target" — instead of always reading "Founding Senior AI Engineer". If no title is detectable it falls back to a neutral "Custom Role (from JD)" rather than mislabelling the canonical role.

The supervised ensemble is still trained on labels for the founding-engineer role. A custom JD re-targets the JD-relative semantic-fit signal and the hard experience gate; the learned "higher fit implies higher relevance" relationship is what transfers across roles. This trade-off is documented honestly rather than hidden.

Fast Inference vs. Retraining

The training path re-fits the full XGB+LGBM+CatBoost ensemble and the LambdaMART head — great for refreshing the model, but the opposite of lightning-fast on a 100k pool.

Mode	Trigger	What happens	Use when
Inference (default)	saved artifacts exist	loads ensemble + LTR + scaler/selector, scores the pool — no retraining	scoring a large/real pool fast
Train	`--train` flag or missing models	full leakage-safe retrain, then score + persist artifacts	features/labels changed

# Fast inference (default): scores with the saved models, no retraining
python -m src.main data/candidates.jsonl output/submission.csv

# Force a full retrain
python -m src.main data/candidates.jsonl output/submission.csv --train

If inference fails (missing/incompatible artifacts), it automatically falls back to the full training path, then to CTAE.
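
The mode decision in the table above reduces to "train if asked to, or if any saved artifact is missing". A minimal sketch of that rule: the `--train` and `--jd` flags are the documented ones, while the artifact filenames in `REQUIRED_ARTIFACTS`, the positional argument names, and both functions are hypothetical.

```python
import argparse
from pathlib import Path

# Hypothetical artifact names; check models/ for the real filenames.
REQUIRED_ARTIFACTS = ("ensemble.pkl", "ltr.json", "scaler.pkl")


def choose_mode(force_train, models_dir):
    """Return 'train' when forced or when any artifact is missing, else 'inference'."""
    if force_train:
        return "train"
    models_dir = Path(models_dir)
    missing = [name for name in REQUIRED_ARTIFACTS if not (models_dir / name).is_file()]
    return "train" if missing else "inference"


def build_parser():
    """Argument layout mirroring the commands shown in this README."""
    parser = argparse.ArgumentParser(prog="python -m src.main")
    parser.add_argument("candidates", help="path to candidates.jsonl")
    parser.add_argument("output", help="path to the output submission CSV")
    parser.add_argument("--train", action="store_true", help="force a full retrain")
    parser.add_argument("--jd", default=None, help="job description text or file")
    return parser


args = build_parser().parse_args(["data/candidates.jsonl", "output/submission.csv", "--train"])
mode = choose_mode(args.train, "models")
print(mode)
```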

Installation

To set up the environment and install all dependencies:

pip install -r requirements.txt

Note on the semantic encoder. The transformer stack needs huggingface-hub < 1.0 (pinned in requirements.txt); newer hub releases break transformers/sentence-transformers and silently degrade the layer to TF-IDF. scikit-learn is pinned < 1.6 to match the bundled model artifacts and the XGBoost/LightGBM sklearn wrappers, and numpy is pinned < 2.0 because the saved pickles and scikit-learn 1.5.x were built against the numpy 1.x ABI.
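
Because a too-new huggingface-hub degrades the semantic layer silently, a quick check of the installed versions against these upper bounds can be useful before a run. The sketch below is not part of the repository; `PINS`, `parse_version`, `violates_pin`, and `check_pins` are hypothetical helpers built on the standard-library `importlib.metadata`, and the comparison only looks at the leading numeric version components.

```python
from importlib import metadata

# Exclusive upper bounds taken from the note above.
PINS = {"huggingface-hub": (1, 0), "scikit-learn": (1, 6), "numpy": (2, 0)}


def parse_version(text):
    """Leading numeric components of a version string, e.g. '1.5.2' -> (1, 5, 2)."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def violates_pin(installed, upper_bound):
    """True when the installed version is at or above the exclusive upper bound."""
    return parse_version(installed) >= upper_bound


def check_pins(pins=PINS):
    """Return human-readable problems for installed packages that break a pin."""
    problems = []
    for package, bound in pins.items():
        try:
            installed = metadata.version(package)
        except metadata.PackageNotFoundError:
            continue  # not installed: nothing to compare
        if violates_pin(installed, bound):
            limit = ".".join(str(part) for part in bound)
            problems.append(f"{package} {installed} is not < {limit}")
    return problems


for problem in check_pins():
    print("WARNING:", problem)
```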

How to Run

1. Primary Ranking Pipeline

Run the end-to-end pipeline to process candidates and output the final rankings:

python -m src.main data/candidates.jsonl output/submission.csv

Trying it out of the box. The full data/candidates.jsonl pool is not shipped (it's .gitignored). A ready-to-run 498-profile sample is bundled, so you can reproduce a complete run immediately:

python -m src.main data/candidates_demo.jsonl output/submission.csv

By default the pipeline runs in fast inference mode when trained artifacts exist in models/ — it loads the saved ensemble + LambdaMART head and scores the pool without any retraining, which is what makes large-pool ranking lightning-fast. To force a full retrain (e.g. after changing features or labels), add --train:

python -m src.main data/candidates.jsonl output/submission.csv --train

Deep Job Understanding (custom JD). The target role is no longer hardcoded. Pass any job description as raw text or a file with --jd; it is parsed into the three semantic facets (skills / experience / mission) and an experience band, which re-target the Stage-1 gate and the semantic-fit signal:

python -m src.main data/candidates.jsonl output/submission.csv --jd data/sample_jd.txt

If --jd is omitted, the canonical "Founding Senior AI Engineer" role (the role the 498 labels were judged for) is used, so behaviour is unchanged. When a custom JD is supplied, the parsed role title and target band also flow into the run logs and the submission.xlsx subtitle, so the deliverable names the exact role it ranked for.

2. Validation & Testing

Run the enhanced end-to-end test (modules, enrichment, LTR pipeline, CTAE fallback) and the ablation study:

python test_enhanced.py
python analysis/ablation_enhanced.py

The original base test suite is still available via python test_pipeline.py.

3. Interactive Sandbox Demo

Run the Streamlit application to upload candidate batches and interactively view profiles, scores, and rationales (with an optional JD text box):

streamlit run sandbox/app.py

Upload a candidates.jsonl batch; the pipeline filters, scores, and ranks the pool.

What you'll see — outputs

Every run writes the ranked shortlist as both an Excel workbook and CSVs to the output directory:

File	Format	Columns / Sheets	Purpose
`submission.xlsx`	XLSX	Sheets: `Top 100`, `Full Rankings`, `Summary`	The primary ranked deliverable — formatted recommended shortlist
`submission.csv`	CSV	`candidate_id, rank, score, reasoning`	Canonical Top-100 (clean 4 columns)
`submission_detailed.csv`	CSV	`…, semantic_fit, behavioral_score, anomaly_score, anomaly_flags, reasoning`	Top-100 with the enriched signals exposed
`rankings_full.csv`	CSV	`candidate_id, rank, score, reasoning`	Every viable candidate, fully ranked

The XLSX name is derived from the output path you pass — e.g. output/submission.csv produces output/submission.xlsx — and is generated on both the SHRE and CTAE paths. If openpyxl is unavailable the CSVs still write (the workbook is skipped gracefully).
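
The naming rule and the canonical four-column schema can be checked with a few lines of standard-library Python. This is an illustrative sketch; `xlsx_path_for` and `has_canonical_schema` are hypothetical helpers, not functions from src/shre/stage4_submit.py.

```python
import csv
import io
from pathlib import Path

# Column order of the canonical submission.csv, as listed in the table above.
CANONICAL_COLUMNS = ["candidate_id", "rank", "score", "reasoning"]


def xlsx_path_for(csv_path):
    """Swap the output path's suffix for .xlsx (output/submission.csv -> output/submission.xlsx)."""
    return Path(csv_path).with_suffix(".xlsx")


def has_canonical_schema(csv_text):
    """True when the CSV header matches the canonical four columns exactly."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, [])
    return [column.strip() for column in header] == CANONICAL_COLUMNS


sample = "candidate_id,rank,score,reasoning\nCAND_0072688,1,1.000,Data Scientist\n"
print(xlsx_path_for("output/submission.csv"), has_canonical_schema(sample))
```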

Summary metrics: candidates shortlisted, top score, and average semantic / behavioral scores.

Ranked shortlist with progress-bar scores, enrichment signals, grounded reasoning, and CSV / XLSX downloads.

The ranked XLSX deliverable (`submission.xlsx`)

A professionally-formatted Excel workbook built for reviewers:

Sheet	Contents
Top 100	Recommended shortlist — `rank · candidate_id · score · semantic_fit · behavioral_score · anomaly_score · anomaly_flags · reasoning`
Full Rankings	Every viable candidate, same columns
Summary	Run statistics — top/mean score, avg semantic / behavioral / anomaly of the shortlist

Formatting: indigo title banner with role + timestamp + score definition, styled header, banded rows and borders, a red to amber to green colour scale on the score column, frozen header, an auto-filter for sort/filter, 4-decimal number formats, and a wrapped reasoning column.

Example console output (sample run, inference mode):

=== RUNNING SHRE (Enhanced ML Pipeline - 'Opus 4.8' Grade) ===
    JD[default] 'Founding Senior AI Engineer' exp 3-15y (target 5-9y); facets: skills=348 chars, mission=264 chars
Stage 1: Filtered 498 down to 293 viable candidates.
Stage 2: Extracted 93 enriched features.
Stage 3: Inference mode (scoring with saved models, no retraining).
  - Ranking head: LambdaMART fused with ensemble (inference)
Writing top 100 to output/submission.csv...
Writing ranked XLSX to output/submission.xlsx...  Done!

With a custom --jd, the first line instead names the parsed role, e.g. JD[file] 'Staff Machine Learning Engineer' exp 6-18y (target 8-12y); ….

Example shortlist rows (real output — reasoning is generated from actual profile data, never hallucinated):

rank	candidate_id	score	reasoning (truncated)
1	`CAND_0072688`	1.000	Data Scientist, 6.9 yrs at Niramai, specializing in vector search and RAG (Milvus); strong semantic alignment to the JD (esp. experience trajectory); very high recruiter responsiveness; high recruiter demand…
2	`CAND_0044890`	0.596	AI Research Engineer, 5.0 yrs at Haptik, vector search & RAG (FAISS); strong semantic JD alignment; active GitHub presence; high recruiter demand.
3	`CAND_0030061`	0.409	Data Analyst, 5.3 yrs at Ola, applied ML (Python); strong semantic JD alignment; active GitHub; reliable follow-through.

Performance Summary

Feature ablation (5-fold) — the enhancements measurably help classification

Configuration	Features	Accuracy	Macro-F1
Base (78 features)	78	0.833	0.731
+ Anomaly + Behavioral	88	0.845	0.766
+ Semantic (full)	93	0.866	0.794

Latest training run — 93 features, transformer semantics (`models/metadata_ltr.json`)

Metric	Validation	Test (held-out)
Accuracy	0.907	0.880
Macro Precision	0.883	0.823
Macro Recall	0.871	0.876
Macro F1	0.874	0.834

Ranking metric	Score
NDCG@10 (fused)	0.991
NDCG@100 (fused)	0.997
Pure LambdaMART NDCG@10	0.974
Hard-slice NDCG@10 (relevance 1 vs 2)	1.000
Spearman (full held-out fold)	0.894

Test-set confusion matrix (rows = true, cols = predicted; classes 0–3):

	Pred 0	Pred 1	Pred 2	Pred 3
True 0	38	4	0	0
True 1	0	14	1	0
True 2	0	1	8	3
True 3	0	0	0	6
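
The test-column metrics in the table above follow directly from this confusion matrix. The sketch below recomputes them with no external libraries (`macro_metrics` is a hypothetical helper, not repository code) and yields accuracy 0.880, macro precision 0.823, macro recall 0.876, and macro F1 0.834 after rounding.

```python
# Test-set confusion matrix from the table above (rows = true, cols = predicted).
CONFUSION = [
    [38, 4, 0, 0],
    [0, 14, 1, 0],
    [0, 1, 8, 3],
    [0, 0, 0, 6],
]


def macro_metrics(matrix):
    """Return accuracy and macro-averaged precision, recall and F1."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n))
    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = matrix[k][k]
        predicted_k = sum(matrix[i][k] for i in range(n))  # column sum
        actual_k = sum(matrix[k])                          # row sum
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return {
        "accuracy": correct / total,
        "macro_precision": sum(precisions) / n,
        "macro_recall": sum(recalls) / n,
        "macro_f1": sum(f1s) / n,
    }


metrics = macro_metrics(CONFUSION)
print({name: round(value, 3) for name, value in metrics.items()})
```

The matrix also shows where the 9 test errors sit: 4 are Class 0 predicted as Class 1, and 3 are Class 2 predicted as Class 3.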

Honest reporting of ranking quality

Because the 498 labels are cleanly rule-separable, full-set NDCG is near-ceiling and overstates ranking quality. Every training run therefore also reports two more discriminating diagnostics:

Hard-slice NDCG@10 — restricted to borderline candidates (relevance 1 vs 2), removing the trivially-separable 0 and 3 classes.

Spearman rank correlation over the full held-out fold (≈ 0.89) — clearly sub-ceiling, and the most honest measure of how well the engine orders the confusable middle.
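
To make the two diagnostics concrete, here is a dependency-free sketch of NDCG@k, the hard-slice restriction, and Spearman correlation with tie-averaged ranks. The exponential gain (2^rel - 1) is one common NDCG convention and is an assumption here, as are all function names; the numbers reported above come from the repository's own evaluation code, which may differ in such details.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain with exponential gain over the top k items."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def average_ranks(values):
    """1-based ascending ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for t in range(i, j + 1):
            ranks[order[t]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the tie-averaged ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy) if sx and sy else 0.0


labels = [3, 2, 1, 2, 0, 1]                # graded relevance, classes 0-3
scores = [0.9, 0.8, 0.3, 0.6, 0.1, 0.4]    # model scores for the same candidates
order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
ranked_labels = [labels[i] for i in order]

full_ndcg = ndcg_at_k(ranked_labels, 10)
hard_ndcg = ndcg_at_k([rel for rel in ranked_labels if rel in (1, 2)], 10)
rho = spearman(scores, labels)
print(full_ndcg, hard_ndcg, round(rho, 3))
```

In this toy example both NDCG values are at the ceiling while Spearman stays below 1, because rank correlation also penalises how tied labels are spread across distinct scores. That is the same pattern the tables above report.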

Primary Model: Voting Ensemble (XGBoost + LightGBM + CatBoost) + LambdaMART LTR head
Semantic Encoder: sentence-transformers/all-MiniLM-L6-v2 + FAISS (TF-IDF fallback)
Fallback Model: Rule-based CTAE Ranker (Pure Python, zero-dependency)

Reproduce: python test_enhanced.py (end-to-end) and python analysis/ablation_enhanced.py (ablation).

Scientific Validation Gallery

A 9-phase, leakage-free validation suite (analysis/) regenerates every figure from the phase summary artifacts in analysis_results/ — no number is typed by hand. Highlights below; full write-up in analysis_results/COMPETITION_REPORT.md.

Learning curve & dataset sufficiency

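Validation accuracy levels off as the training set approaches the full 498 samples (0.857 at 80%, 0.868 at 100%), which suggests the core signal is captured. Train accuracy stays near 1.0 throughout, so a train/validation gap remains; soft-voting is used to limit it.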

Train data used	Samples	Train acc	Val acc	Val F1
20%	99	0.997	0.838	0.658
40%	199	1.000	0.809	0.714
60%	298	0.999	0.849	0.793
80%	398	0.998	0.857	0.793
100%	498	0.996	0.868	0.806

Feature importance & model comparison

Signal concentrates in profile depth (summary_length), skill depth (avg_skill_duration_months), and domain longevity (domain_x_years).

SHAP explainability

Global and local SHAP attributions explain why each candidate scores as it does — high ideal_years_score and domain_llm_score push toward ideal hire; long notice periods push down.

Ablation — models & feature groups

Combining all feature categories beats any single group; the ensemble soft-votes for lower variance.

Per-model comparison (5-fold) — XGBoost leads individually; the ensemble trades a hair of accuracy for stability:

Model	Accuracy	Precision	Recall	F1
XGBoost	0.855	0.780	0.803	0.787
LightGBM	0.839	0.760	0.763	0.756
CatBoost	0.837	0.771	0.812	0.782
Ensemble (soft-vote)	0.851	0.774	0.788	0.774

Feature-group ablation — every group adds signal; "All features" wins, and engagement-only is the weakest standalone (confirming behavioral signals help but aren't sufficient alone):

Feature group	# Features	Accuracy	F1
All features	62	0.845	0.766
technical	31	0.767	0.643
other	7	0.757	0.637
experience	12	0.753	0.662
interaction	8	0.751	0.644
engagement	6	0.562	0.465

Stability (50 runs) · Honeypot defense · Ranking quality

Stability: Acc 85.9% ± 3.0%, Macro-F1 78.9% ± 4.3% over 10x5 repeated stratified CV (50 runs); rated ACCEPTABLE (coefficient of variation 3.5%).
Honeypot detection: 71.6% overall, with 100% on structural anomalies (flat / impossible-skills / random-noise) and 0% on keyword-stuffing, which the model misses and which is left to a Stage-1 keyword-density rule (see Limitations & Future Work).
Ranking on holdout: NDCG@100 0.9591, Hit@5 / Hit@10 = 100%.

Honeypot detection by attack type (250 synthetic adversarial profiles):

Attack type	Detected	Verdict
Flat_Profile	100%	Caught outright
Impossible_Skills	100%	Caught outright
Random_Noise	100%	Caught outright
Minimal_Profile	58%	Partial (near Class-0/1 boundary)
Keyword_Stuffing	0%	Missed (needs Stage-1 keyword-density cap)
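
The 71.6% headline figure is consistent with an unweighted average of the five per-attack rates. The short check below assumes the 250 adversarial profiles are split evenly, 50 per attack type; that split is an assumption, not something stated in this README.

```python
# Per-attack detection rates from the table above.
DETECTION_RATE = {
    "Flat_Profile": 1.00,
    "Impossible_Skills": 1.00,
    "Random_Noise": 1.00,
    "Minimal_Profile": 0.58,
    "Keyword_Stuffing": 0.00,
}

# Assumption: 250 profiles split evenly across the five attack types.
PROFILES_PER_TYPE = 50


def overall_detection(rates, per_type):
    """Detected profiles divided by total profiles, given equal-sized groups."""
    detected = sum(rate * per_type for rate in rates.values())
    return detected / (per_type * len(rates))


overall = overall_detection(DETECTION_RATE, PROFILES_PER_TYPE)
print(round(overall, 3))
```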

Error analysis

Only 7 of the 75 validation samples are misclassified (9.3% error, matching the 0.907 validation accuracy reported above). The errors concentrate on the Class 2↔3 ideal-hire boundary and on rich-but-shallow Class 0→1 profiles.

Repository Structure

|-- requirements.txt            # Main project dependencies (pinned for reproducibility)
|-- submission_metadata.yaml    # Submission metadata
|-- README.md                   # This file
|-- src/
|   |-- main.py                 # Pipeline entry (inference by default; --train, --jd)
|   |-- shre/
|   |   |-- job_description.py       # Deep JD understanding: parse JD -> facets + exp band
|   |   |-- stage1_filter.py        # Anomaly pre-filter + JD-driven experience/pillar gates
|   |   |-- anomaly.py              # Feature 3: Enhanced honeypot/anomaly detection
|   |   |-- behavioral.py           # Feature 4: Behavioral scoring module
|   |   |-- semantic.py             # Feature 1: Multi-vector semantic layer (+ FAISS), JD-aware
|   |   |-- stage2_features.py      # 78 base features + enrichment pass (-> 93)
|   |   |-- stage3_ranking_validated.py  # Voting ensemble (leakage-safe SMOTE)
|   |   |-- stage3_ranking_ltr.py   # Feature 2: LambdaMART/XGBoost-LTR head (+ honest metrics)
|   |   |-- inference.py            # Fast inference-only scoring (no retraining)
|   |   |-- stage4_submit.py        # Ranked top-100 + enriched reasoning + XLSX export
|   |-- ctae/                   # Fallback rule-based engine
|   |-- common/                 # Config, data loader, validator, logging
|-- analysis/                   # 9-phase scientific validation suite (+ ablation_enhanced.py)
|-- analysis_results/           # Regenerated charts + COMPETITION_REPORT.md
|-- validation/                 # Independent validation harness
|-- labeling/                   # 498 labeled examples (combined_labels.json) + guide
|-- test_enhanced.py            # Enhanced end-to-end test (18 checks)
|-- models/                     # Trained models, scalers, selectors, LTR, encoder & metadata
|-- sandbox/                    # Streamlit web UI code
|-- data/                       # Candidate schema, samples, sample JD

Tech Stack

Layer	Technology
Language	Python 3.9+
Gradient boosting	XGBoost · LightGBM · CatBoost (soft-voting ensemble)
Learning-to-Rank	XGBoost `XGBRanker` (`rank:ndcg` / LambdaMART)
Semantics	`sentence-transformers/all-MiniLM-L6-v2` + FAISS (TF-IDF fallback)
Class balance	imbalanced-learn SMOTE (inside CV folds)
Explainability	SHAP, permutation importance
App / Demo	Streamlit (deployed as a Hugging Face Space)
Output	openpyxl (formatted ranked `submission.xlsx`) + CSV
Fallback	Pure-Python CTAE rule engine (zero dependency)
AI models used	GLM 5.2 · Claude

Limitations & Future Work

We report limitations openly — each is paired with a concrete mitigation path.

#	Limitation	Why it matters	Mitigation
1	Keyword-stuffing susceptibility	The statistical model can be swayed by keyword-padded profiles that inflate apparent relevance.	Enforce a hard, rule-based keyword-density ceiling in the Stage-1 filter (before scoring).
2	Class-3 data scarcity	Only 38 labeled ideal-hire samples limit visibility into the top class, widening uncertainty at the very top of the ranking.	Run active-learning cycles to label 50+ candidates near the Class 2/3 boundary.
3	Single-role supervision	Labels were judged for the founding-engineer role. A custom `--jd` re-targets the semantic + gate signals, but the supervised relevance model itself stays role-anchored.	Collect labels per role family, or train a JD-conditioned ranker.
4	Small held-out set (75 samples)	The 90.7% point estimate carries a non-trivial confidence interval.	Trust the 50-run stability band (85.9% ± 3.0%) as the more reliable expectation; expand the held-out set as more labels arrive.
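
The mitigation proposed for limitation 1 could take roughly the following shape. This is a sketch of a possible rule, not existing repository code: the keyword list, the 0.30 ceiling, the tokenisation, and both function names are hypothetical and would need tuning against real profiles.

```python
import re

# Hypothetical keyword list and ceiling, for illustration only.
KEYWORDS = {"rag", "llm", "faiss", "milvus", "embeddings", "transformers"}
MAX_DENSITY = 0.30


def keyword_density(text, keywords=KEYWORDS):
    """Share of tokens in the text that are target keywords (0.0 for empty text)."""
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in keywords)
    return hits / len(tokens)


def is_keyword_stuffed(text, max_density=MAX_DENSITY):
    """Flag profiles whose keyword share exceeds the ceiling."""
    return keyword_density(text) > max_density


stuffed = "RAG LLM FAISS Milvus RAG LLM embeddings transformers RAG"
genuine = "Built a retrieval service with FAISS and tuned an LLM reranker for search quality"
print(is_keyword_stuffed(stuffed), is_keyword_stuffed(genuine))
```

A density ceiling is deliberately crude: it runs before scoring, needs no model, and targets the one attack type the statistical detector misses.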

In short: the engine is statistically sound, stable, and explainable today — with leakage control and a graceful fallback — and every known gap above has a clear, low-cost path forward.

Team

🛠️ Team Vandalizers

Intelligent Candidate Discovery & Ranking

Member	Role
👨‍💻 Aditya Pandey	Pipeline, ML & deployment
👩‍💻 Palak Rai	Team member
👨‍💻 Avik Srivastava	Team member

Project links

Resource	Where
🤗 Live demo	Staged Hybrid Ranking Engine (SHRE) — a Hugging Face Space by Aditya1002
🐙 GitHub	see `submission_metadata.yaml`
🧪 Sandbox	Streamlit Space (link in `submission_metadata.yaml`)
▶️ Reproduce	`python -m src.main data/candidates.jsonl output/submission.csv`
