CVE Exploitability Prediction and Model Validation
The project uses:
NVD CVE API (v2.0) for vulnerability metadata (CVSS, CWE, CPEs, descriptions, timestamps).
CISA Known Exploited Vulnerabilities (KEV) as ground-truth labels for “exploited vs not-exploited”.
The goal is to build an EPSS-like risk score that can be plugged into a vulnerability management process to prioritize which CVEs to fix first.
Features
End-to-end pipeline:
Fetch CVEs from NVD (with 120-day chunking to satisfy API constraints).
Fetch KEV catalog and label CVEs as exploited / non-exploited.
Flatten JSON into a structured DataFrame.
Time-based train/val/test split (no temporal leakage).
ML pipeline with:
Structured features (CVSS, CWE, CPE-derived).
Text features (TF-IDF over CVE description).
Logistic Regression with class weighting.
Evaluation with ROC-AUC, PR-AUC, Recall@top-k.
CLI entry points:
vuln_risk.train – training job.
vuln_risk.score – scoring recent CVEs.
Simple but enterprise-friendly layout (config, logging, tests, model persistence).
Data & Labeling
Data sources
NVD CVE API v2.0: https://services.nvd.nist.gov/rest/json/cves/2.0
CISA KEV CSV (configurable URL).
Labeling
label_exploited = 1 if cve_id appears in CISA KEV.
label_exploited = 0 otherwise.
Important: KEV is incomplete; some CVEs labeled 0 may still be exploited in reality. The model is predicting “is this CVE in KEV?” as a proxy for “widely observed exploitation.”
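The labeling step described above can be sketched as a small helper; the function name and the flattened-record column names are assumptions for illustration, not necessarily the project's exact API:

```python
import pandas as pd

def label_with_kev(cves: pd.DataFrame, kev_ids: set) -> pd.DataFrame:
    """Add label_exploited = 1 if the CVE appears in the CISA KEV catalog."""
    out = cves.copy()
    out["label_exploited"] = out["cve_id"].isin(kev_ids).astype(int)
    return out
```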
Features Used
From NVD CVE metadata:
CVSS (structured): baseScore, exploitabilityScore, impactScore, baseSeverity, attackVector, attackComplexity, privilegesRequired, userInteraction, scope, confidentialityImpact, integrityImpact, availabilityImpact.
Weakness (CWE): cwe code from NVD weaknesses.
Affected products (CPE)
num_cpes – count of affected CPEs.
vendor_top1 – first vendor seen.
product_top1 – first product seen.
Text: TF-IDF over the English description (desc), unigrams + bigrams.
Reference hint: has_exploit_ref – whether any reference URL contains “exploit”.
Temporal feature (year) is used for splitting and as a numeric feature.
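The CPE- and reference-derived features can be sketched as follows; the input field names (cpes, references) are assumptions about the flattened record, not the project's exact schema:

```python
def derive_features(record: dict) -> dict:
    """Sketch of CPE/reference-derived features from a flattened CVE record."""
    cpes = record.get("cpes", [])        # e.g. ["cpe:2.3:a:apache:log4j:..."]
    refs = record.get("references", [])  # list of reference URLs
    vendor, product = "", ""
    if cpes:
        parts = cpes[0].split(":")       # CPE 2.3: vendor is field 3, product field 4
        if len(parts) > 4:
            vendor, product = parts[3], parts[4]
    return {
        "num_cpes": len(cpes),
        "vendor_top1": vendor,
        "product_top1": product,
        "has_exploit_ref": int(any("exploit" in u.lower() for u in refs)),
    }
```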
Model
Preprocessing: ColumnTransformer
Numeric: passthrough.
Categorical: OneHotEncoder(handle_unknown="ignore").
Text: TfidfVectorizer(max_features=10_000, ngram_range=(1, 2)) (configurable).
Classifier: LogisticRegression
class_weight="balanced" (to handle severe imbalance).
max_iter=1000.
random_state configurable.
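A minimal sketch of how these pieces fit together in scikit-learn; the helper name and column arguments are illustrative, and the project's model_pipeline.py may differ in detail:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def build_pipeline(num_cols, cat_cols, text_col, random_state=42):
    # Numeric features pass through; categoricals are one-hot encoded;
    # the description column goes through TF-IDF (note: a single column
    # name, not a list, so the vectorizer receives a 1-D series of strings).
    pre = ColumnTransformer([
        ("num", "passthrough", num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("txt", TfidfVectorizer(max_features=10_000, ngram_range=(1, 2)), text_col),
    ])
    clf = LogisticRegression(class_weight="balanced", max_iter=1000,
                             random_state=random_state)
    return Pipeline([("pre", pre), ("clf", clf)])
```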
Evaluation metrics:
ROC-AUC
PR-AUC (average precision)
Recall@top 1%, 5%, 10% of CVEs (for “we can patch only the top N%” scenarios).
Optional classification_report at a chosen probability threshold.
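Recall@top-k is the less standard of these metrics, so here is a sketch of one way to compute it (the project's metrics.py may implement it differently):

```python
import numpy as np

def recall_at_k(y_true, scores, k_frac):
    """Fraction of all exploited CVEs captured in the top k_frac by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * k_frac))
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    positives = y_true.sum()
    return float(y_true[top].sum() / positives) if positives else 0.0
```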
Project Structure

vuln-exploit-prediction/
│
├─ vuln_risk/
│  ├─ __init__.py        # package metadata
│  ├─ config.py          # AppConfig (paths, time window, API keys, etc.)
│  ├─ logging_utils.py   # centralized logging setup
│  ├─ data_sources.py    # NVD & KEV ingestion, flattening, labeling
│  ├─ preprocessing.py   # filtering, temporal split, feature matrices
│  ├─ model_pipeline.py  # sklearn pipeline, save/load helpers
│  ├─ metrics.py         # ROC/PR + recall@k evaluation utils
│  ├─ train.py           # CLI entry point to train model
│  └─ score.py           # CLI entry point to score recent CVEs
│
└─ tests/                # pytest skeletons (unit & smoke tests)
Requirements
Python 3.9+ (tested with 3.9)
Packages:
pandas
numpy
scikit-learn
requests
joblib
pytest (for tests)
Install dependencies:

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install pandas numpy scikit-learn requests joblib pytest
Configuration (AppConfig)
Configuration is handled by vuln_risk.config.AppConfig, which can be driven by env vars or constructed directly in code. Key settings:
Paths
VULN_RISK_ROOT (optional): base directory for data/models/cache. Default: "." (project root).
Time window (NVD publication dates)
VULN_PUB_START (default 2016-01-01T00:00:00.000)
VULN_PUB_END (default 2024-12-31T23:59:59.000)
Temporal split
VULN_TRAIN_YEAR_MAX (default 2019)
VULN_VAL_YEAR (default 2020)
VULN_TEST_YEAR_MIN (default 2021)
NVD API
NVD_API_KEY (optional; improves rate limits but not required).
KEV source
CISA_KEV_URL (optional; overrides default KEV CSV URL).
Model hyper-params
VULN_TFIDF_MAX_FEATURES (default 10000)
VULN_TFIDF_NGRAM_MIN (default 1)
VULN_TFIDF_NGRAM_MAX (default 2)
VULN_RANDOM_STATE (default 42)
Example env var setup (Windows cmd):

set VULN_RISK_ROOT=.
set VULN_PUB_START=2018-01-01T00:00:00.000
set VULN_PUB_END=2024-12-31T23:59:59.000
set NVD_API_KEY=your_nvd_api_key_here
Usage
Always run from the project root (the folder containing vuln_risk/).
- Train the model:

python -m vuln_risk.train --root-dir .
What this does:
Builds an AppConfig(root_dir=".").
Fetches NVD CVEs across the configured date range, chunked into 120-day windows to satisfy NVD API limits.
Fetches CISA KEV and labels CVEs as exploited/non-exploited.
Filters invalid rows (missing year or cvss_baseScore).
Splits dataset into train/val/test by year:
Train: year <= train_year_max
Val: year == val_year
Test: year >= test_year_min
Trains the sklearn pipeline.
Logs metrics on val + test.
Saves model to models/vuln_risk_logreg.joblib.
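The year-based split in the steps above can be sketched as a small helper; the function name is illustrative and the project's preprocessing.py may differ:

```python
import pandas as pd

def temporal_split(df, train_year_max=2019, val_year=2020, test_year_min=2021):
    """Split by publication year so no future data leaks into training."""
    train = df[df["year"] <= train_year_max]
    val = df[df["year"] == val_year]
    test = df[df["year"] >= test_year_min]
    return train, val, test
```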
Optional: get a classification report at a given threshold (e.g. 0.2):

python -m vuln_risk.train --root-dir . --threshold 0.2
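Thresholded reporting amounts to binarizing the predicted probabilities before calling scikit-learn's classification_report; a minimal sketch (the helper name is an assumption):

```python
from sklearn.metrics import classification_report

def report_at_threshold(y_true, proba, threshold=0.2):
    """Binarize probabilities at the chosen threshold, then report
    precision/recall/F1 per class."""
    preds = [int(p >= threshold) for p in proba]
    return classification_report(y_true, preds, zero_division=0)
```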
- Score recent CVEs

Once training has produced a model file:

python -m vuln_risk.score --root-dir . --days-back 30 --output-csv scores_recent_cves.csv
This will:
Load models/vuln_risk_logreg.joblib.
Fetch all CVEs from NVD where published is within the last days-back days.
Flatten and filter them.
Run the pipeline to get exploitation risk probabilities.
Write data/scores_recent_cves.csv with a score_exploit_risk column.
You can then sort by score_exploit_risk to see the highest-risk recent CVEs.
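For example, with pandas (the CVE IDs and scores here are made up for illustration):

```python
import pandas as pd

# Stand-in for the CSV written by vuln_risk.score
scores = pd.DataFrame({
    "cve_id": ["CVE-2024-0001", "CVE-2024-0002", "CVE-2024-0003"],
    "score_exploit_risk": [0.12, 0.87, 0.45],
})

# Highest-risk CVEs first
top = scores.sort_values("score_exploit_risk", ascending=False)
```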
Testing
Tests live under tests/ and use pytest. Run all tests:

pytest -q
The tests currently cover:
AppConfig path initialisation.
data_sources.flatten_nvd_record with a fake NVD record.
build_labeled_dataset with mocked NVD/KEV loaders (no real network calls).
Preprocessing (filtering, temporal split, feature matrix shapes).
Pipeline build + fit on a tiny dummy dataset.
Metric utilities (recall_at_k, evaluate_split).
A smoke test for train.main() with everything mocked out except the top-level flow.
NVD API Notes
NVD v2.0 enforces a maximum 120 consecutive days per request for any date range parameter.
fetch_nvd_range automatically:
Splits the configured pub_start → pub_end range into 120-day windows.
Pages through each window with startIndex / resultsPerPage.
Aggregates all vulnerabilities into one DataFrame.
Caches the result to cache/nvd_cves_*.jsonl.
If NVD changes rate limits or response shape, adjust data_sources.fetch_nvd_range accordingly.
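The window-splitting step can be sketched as a generator; this is a simplified illustration, not the exact logic of fetch_nvd_range:

```python
from datetime import datetime, timedelta

def chunk_date_range(start: datetime, end: datetime, max_days: int = 120):
    """Yield (window_start, window_end) pairs no longer than max_days each,
    covering [start, end] to satisfy NVD's 120-day date-range limit."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=max_days), end)
        yield cur, nxt
        cur = nxt
```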
Limitations / Things to Improve
Label noise: Non-KEV CVEs are treated as “not exploited”, which is not strictly true; many exploited vulns are not in KEV.
No time-to-exploit modeling: Current model predicts eventual exploitation (KEV inclusion), not “exploited within X days”.
Simple model: Logistic Regression + TF-IDF is intentionally simple and explainable. For better performance, you could explore tree-based models (XGBoost/LightGBM) or neural text encoders.
Operational integration: The project outputs CSVs and a joblib model. Real deployment would typically wrap this in:
a scheduled training job,
a scheduled scoring job,
integration into a vuln management platform or SIEM.