CVE Exploitability Prediction and Model Validation
The project uses:
NVD CVE API (v2.0) for vulnerability metadata (CVSS, CWE, CPEs, descriptions, timestamps).
CISA Known Exploited Vulnerabilities (KEV) as ground-truth labels for “exploited vs not-exploited”.
The goal is to build an EPSS-like risk score that can be plugged into a vulnerability management process to prioritize which CVEs to fix first.
Features
End-to-end pipeline:
Fetch CVEs from NVD (with 120-day chunking to satisfy API constraints).
Fetch KEV catalog and label CVEs as exploited / non-exploited.
Flatten JSON into a structured DataFrame.
Time-based train/val/test split (no temporal leakage).
ML pipeline with:
Structured features (CVSS, CWE, CPE-derived).
Text features (TF-IDF over CVE description).
Logistic Regression with class weighting.
Evaluation with ROC-AUC, PR-AUC, Recall@top-k.
CLI entry points:
vuln_risk.train – training job.
vuln_risk.score – scoring recent CVEs.
Simple but enterprise-friendly layout (config, logging, tests, model persistence).
Data & Labeling
Data sources
NVD CVE API v2.0: https://services.nvd.nist.gov/rest/json/cves/2.0
CISA KEV CSV (configurable URL).
Labeling
label_exploited = 1 if cve_id appears in CISA KEV.
label_exploited = 0 otherwise.
Important: KEV is incomplete; some CVEs labeled 0 may still be exploited in reality. The model is predicting “is this CVE in KEV?” as a proxy for “widely observed exploitation.”
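The labeling step described above can be sketched as a small helper; the function name and the flattened-record column names are assumptions for illustration, not necessarily the project's exact API:

```python
import pandas as pd

def label_with_kev(cves: pd.DataFrame, kev_ids: set) -> pd.DataFrame:
    """Add label_exploited = 1 if the CVE appears in the CISA KEV catalog."""
    out = cves.copy()
    out["label_exploited"] = out["cve_id"].isin(kev_ids).astype(int)
    return out
```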
Features Used
From NVD CVE metadata:
CVSS (structured): baseScore, exploitabilityScore, impactScore, baseSeverity, attackVector, attackComplexity, privilegesRequired, userInteraction, scope, confidentialityImpact, integrityImpact, availabilityImpact.
Weakness (CWE): cwe code from NVD weaknesses.
Affected products (CPE)
num_cpes – count of affected CPEs.
vendor_top1 – first vendor seen.
product_top1 – first product seen.
Text: TF-IDF over the English description (desc), unigrams + bigrams.
Reference hint: has_exploit_ref – whether any reference URL contains “exploit”.
Temporal feature (year) is used for splitting and as a numeric feature.
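The CPE- and reference-derived features can be sketched as follows; the input field names (cpes, references) are assumptions about the flattened record, not the project's exact schema:

```python
def derive_features(record: dict) -> dict:
    """Sketch of CPE/reference-derived features from a flattened CVE record."""
    cpes = record.get("cpes", [])        # e.g. ["cpe:2.3:a:apache:log4j:..."]
    refs = record.get("references", [])  # list of reference URLs
    vendor, product = "", ""
    if cpes:
        parts = cpes[0].split(":")       # CPE 2.3: vendor is field 3, product field 4
        if len(parts) > 4:
            vendor, product = parts[3], parts[4]
    return {
        "num_cpes": len(cpes),
        "vendor_top1": vendor,
        "product_top1": product,
        "has_exploit_ref": int(any("exploit" in u.lower() for u in refs)),
    }
```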
Model
Preprocessing: ColumnTransformer
Numeric: passthrough.
Categorical: OneHotEncoder(handle_unknown="ignore").
Text: TfidfVectorizer(max_features=10_000, ngram_range=(1, 2)) (configurable).
Classifier: LogisticRegression
class_weight="balanced" (to handle severe imbalance).
max_iter=1000.
random_state configurable.
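A minimal sketch of how these pieces fit together in scikit-learn; the helper name and column arguments are illustrative, and the project's model_pipeline.py may differ in detail:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

def build_pipeline(num_cols, cat_cols, text_col, random_state=42):
    # Numeric features pass through; categoricals are one-hot encoded;
    # the description column goes through TF-IDF (note: a single column
    # name, not a list, so the vectorizer receives a 1-D series of strings).
    pre = ColumnTransformer([
        ("num", "passthrough", num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("txt", TfidfVectorizer(max_features=10_000, ngram_range=(1, 2)), text_col),
    ])
    clf = LogisticRegression(class_weight="balanced", max_iter=1000,
                             random_state=random_state)
    return Pipeline([("pre", pre), ("clf", clf)])
```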
Evaluation metrics:
ROC-AUC
PR-AUC (average precision)
Recall@top 1%, 5%, 10% of CVEs (for “we can patch only the top N%” scenarios).
Optional classification_report at a chosen probability threshold.
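Recall@top-k is the less standard of these metrics, so here is a sketch of one way to compute it (the project's metrics.py may implement it differently):

```python
import numpy as np

def recall_at_k(y_true, scores, k_frac):
    """Fraction of all exploited CVEs captured in the top k_frac by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * k_frac))
    top = np.argsort(-scores)[:k]          # indices of the k highest scores
    positives = y_true.sum()
    return float(y_true[top].sum() / positives) if positives else 0.0
```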
Project Structure

vuln-exploit-prediction/
│
├─ vuln_risk/
│  ├─ __init__.py        # package metadata
│  ├─ config.py          # AppConfig (paths, time window, API keys, etc.)
│  ├─ logging_utils.py   # centralized logging setup
│  ├─ data_sources.py    # NVD & KEV ingestion, flattening, labeling
│  ├─ preprocessing.py   # filtering, temporal split, feature matrices
│  ├─ model_pipeline.py  # sklearn pipeline, save/load helpers
│  ├─ metrics.py         # ROC/PR + recall@k evaluation utils
│  ├─ train.py           # CLI entry point to train model
│  └─ score.py           # CLI entry point to score recent CVEs
│
└─ tests/                # pytest skeletons (unit & smoke tests)
Requirements
Python 3.9+ (tested with 3.9)
Packages:
pandas
numpy
scikit-learn
requests
joblib
pytest (for tests)
Install dependencies:

python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install pandas numpy scikit-learn requests joblib pytest
Configuration (AppConfig)
Configuration is handled by vuln_risk.config.AppConfig, which can be driven by env vars or constructed directly in code. Key settings:
Paths
VULN_RISK_ROOT (optional): base directory for data/models/cache. Default: "." (project root).
Time window (NVD publication dates)
VULN_PUB_START (default 2016-01-01T00:00:00.000)
VULN_PUB_END (default 2024-12-31T23:59:59.000)
Temporal split
VULN_TRAIN_YEAR_MAX (default 2019)
VULN_VAL_YEAR (default 2020)
VULN_TEST_YEAR_MIN (default 2021)
NVD API
NVD_API_KEY (optional; improves rate limits but not required).
KEV source
CISA_KEV_URL (optional; overrides default KEV CSV URL).
Model hyper-params
VULN_TFIDF_MAX_FEATURES (default 10000)
VULN_TFIDF_NGRAM_MIN (default 1)
VULN_TFIDF_NGRAM_MAX (default 2)
VULN_RANDOM_STATE (default 42)
Example env var setup (Windows cmd):

set VULN_RISK_ROOT=.
set VULN_PUB_START=2018-01-01T00:00:00.000
set VULN_PUB_END=2024-12-31T23:59:59.000
set NVD_API_KEY=your_nvd_api_key_here
Usage
Always run from the project root (the folder containing vuln_risk/).
- Train the model:

python -m vuln_risk.train --root-dir .
What this does:
Builds an AppConfig(root_dir=".").
Fetches NVD CVEs across the configured date range, chunked into 120-day windows to satisfy NVD API limits.
Fetches CISA KEV and labels CVEs as exploited/non-exploited.
Filters invalid rows (missing year or cvss_baseScore).
Splits dataset into train/val/test by year:
Train: year <= train_year_max
Val: year == val_year
Test: year >= test_year_min
Trains the sklearn pipeline.
Logs metrics on val + test.
Saves model to models/vuln_risk_logreg.joblib.
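The year-based split in the steps above can be sketched as a small helper; the function name is illustrative and the project's preprocessing.py may differ:

```python
import pandas as pd

def temporal_split(df, train_year_max=2019, val_year=2020, test_year_min=2021):
    """Split by publication year so no future data leaks into training."""
    train = df[df["year"] <= train_year_max]
    val = df[df["year"] == val_year]
    test = df[df["year"] >= test_year_min]
    return train, val, test
```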
Optional: get a classification report at a given threshold (e.g. 0.2):

python -m vuln_risk.train --root-dir . --threshold 0.2
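Thresholded reporting amounts to binarizing the predicted probabilities before calling scikit-learn's classification_report; a minimal sketch (the helper name is an assumption):

```python
from sklearn.metrics import classification_report

def report_at_threshold(y_true, proba, threshold=0.2):
    """Binarize probabilities at the chosen threshold, then report
    precision/recall/F1 per class."""
    preds = [int(p >= threshold) for p in proba]
    return classification_report(y_true, preds, zero_division=0)
```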
- Score recent CVEs

Once training has produced a model file:

python -m vuln_risk.score --root-dir . --days-back 30 --output-csv scores_recent_cves.csv
This will:
Load models/vuln_risk_logreg.joblib.
Fetch all CVEs from NVD where published is within the last days-back days.
Flatten and filter them.
Run the pipeline to get exploitation risk probabilities.
Write data/scores_recent_cves.csv with a score_exploit_risk column.
You can then sort by score_exploit_risk to see the highest-risk recent CVEs.
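For example, with pandas (the CVE IDs and scores here are made up for illustration):

```python
import pandas as pd

# Stand-in for the CSV written by vuln_risk.score
scores = pd.DataFrame({
    "cve_id": ["CVE-2024-0001", "CVE-2024-0002", "CVE-2024-0003"],
    "score_exploit_risk": [0.12, 0.87, 0.45],
})

# Highest-risk CVEs first
top = scores.sort_values("score_exploit_risk", ascending=False)
```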
Testing
Tests live under tests/ and use pytest. Run all tests:

pytest -q
The tests currently cover:
AppConfig path initialisation.
data_sources.flatten_nvd_record with a fake NVD record.
build_labeled_dataset with mocked NVD/KEV loaders (no real network calls).
Preprocessing (filtering, temporal split, feature matrix shapes).
Pipeline build + fit on a tiny dummy dataset.
Metric utilities (recall_at_k, evaluate_split).
A smoke test for train.main() with everything mocked out except the top-level flow.
NVD API Notes
NVD v2.0 enforces a maximum 120 consecutive days per request for any date range parameter.
fetch_nvd_range automatically:
Splits the configured pub_start → pub_end range into 120-day windows.
Pages through each window with startIndex / resultsPerPage.
Aggregates all vulnerabilities into one DataFrame.
Caches the result to cache/nvd_cves_*.jsonl.
If NVD changes rate limits or response shape, adjust data_sources.fetch_nvd_range accordingly.
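The window-splitting step can be sketched as a generator; this is a simplified illustration, not the exact logic of fetch_nvd_range:

```python
from datetime import datetime, timedelta

def chunk_date_range(start: datetime, end: datetime, max_days: int = 120):
    """Yield (window_start, window_end) pairs no longer than max_days each,
    covering [start, end] to satisfy NVD's 120-day date-range limit."""
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=max_days), end)
        yield cur, nxt
        cur = nxt
```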
Limitations / Things to Improve
Label noise: Non-KEV CVEs are treated as “not exploited”, which is not strictly true; many exploited vulns are not in KEV.
No time-to-exploit modeling: Current model predicts eventual exploitation (KEV inclusion), not “exploited within X days”.
Simple model: Logistic Regression + TF-IDF is intentionally simple and explainable. For better performance, you could explore tree-based models (XGBoost/LightGBM) or neural text encoders.
Operational integration: The project outputs CSVs and a joblib model. Real deployment would typically wrap this in:
a scheduled training job,
a scheduled scoring job,
integration into a vuln management platform or SIEM.