# exploit.ai

CVE exploitability prediction and model validation.

The project uses two public data sources:

- **NVD CVE API (v2.0)** for vulnerability metadata (CVSS, CWE, CPEs, descriptions, timestamps).
- **CISA Known Exploited Vulnerabilities (KEV)** as ground-truth labels for "exploited vs. not exploited".

The goal is to build an EPSS-like risk score that can be plugged into a vulnerability management process to prioritize which CVEs to fix first.

## Features

End-to-end pipeline:

- Fetch CVEs from NVD (with 120-day chunking to satisfy API constraints).
- Fetch the KEV catalog and label CVEs as exploited / non-exploited.
- Flatten JSON into a structured DataFrame.
- Time-based train/val/test split (no temporal leakage).

ML pipeline with:

- Structured features (CVSS, CWE, CPE-derived).
- Text features (TF-IDF over the CVE description).
- Logistic regression with class weighting.
- Evaluation with ROC-AUC, PR-AUC, and Recall@top-k.

CLI entry points:

- `vuln_risk.train` – training job.
- `vuln_risk.score` – scoring recent CVEs.

Simple but enterprise-friendly layout (config, logging, tests, model persistence).

## Data & Labeling

### Data sources

- NVD CVE API v2.0: https://services.nvd.nist.gov/rest/json/cves/2.0
- CISA KEV CSV (configurable URL).

### Labeling

- `label_exploited = 1` if the `cve_id` appears in CISA KEV.
- `label_exploited = 0` otherwise.

**Important:** KEV is incomplete; some CVEs labeled 0 may still be exploited in reality. The model is predicting "is this CVE in KEV?" as a proxy for "widely observed exploitation."
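The labeling step boils down to a set-membership join. A minimal sketch (the column names `cve_id` and KEV's `cveID` match the description above; the CVE IDs used here are just examples):

```python
import pandas as pd

# Label CVEs by membership in the CISA KEV catalog.
# Assumes `cves` carries a "cve_id" column and `kev` carries KEV's "cveID" column.
cves = pd.DataFrame({"cve_id": ["CVE-2021-44228", "CVE-2020-0001"]})
kev = pd.DataFrame({"cveID": ["CVE-2021-44228"]})

kev_ids = set(kev["cveID"])
cves["label_exploited"] = cves["cve_id"].isin(kev_ids).astype(int)
```

Anything not in the KEV set gets a 0, which is exactly where the label noise discussed below comes from.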

## Features Used

From NVD CVE metadata:

- **CVSS (structured):** baseScore, exploitabilityScore, impactScore, baseSeverity, attackVector, attackComplexity, privilegesRequired, userInteraction, scope, confidentialityImpact, integrityImpact, availabilityImpact.
- **Weakness (CWE):** CWE code from NVD weaknesses.
- **Affected products (CPE):**
  - `num_cpes` – count of affected CPEs.
  - `vendor_top1` – first vendor seen.
  - `product_top1` – first product seen.
- **Text:** TF-IDF over the English description (`desc`), unigrams + bigrams.
- **Reference hint:** `has_exploit_ref` – whether any reference URL contains "exploit".
- **Temporal:** `year`, used both for splitting and as a numeric feature.
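The CPE- and reference-derived features can be illustrated on a simplified NVD-style record. This is a sketch only; the field names on `record` are assumptions, not the actual shape returned by `data_sources.flatten_nvd_record`:

```python
# Illustrative record with CPE URIs and reference URLs (field names assumed).
record = {
    "cpes": [
        "cpe:2.3:a:apache:log4j:2.14.1:*:*:*:*:*:*:*",
        "cpe:2.3:a:apache:log4j:2.14.0:*:*:*:*:*:*:*",
    ],
    "references": ["https://example.com/poc-exploit", "https://nvd.nist.gov/"],
}

num_cpes = len(record["cpes"])
parts = record["cpes"][0].split(":")  # cpe:2.3:part:vendor:product:version:...
vendor_top1, product_top1 = parts[3], parts[4]
has_exploit_ref = int(any("exploit" in url.lower() for url in record["references"]))
```

Here `vendor_top1`/`product_top1` come from the first CPE seen, matching the "first vendor/product seen" convention above.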

## Model

Preprocessing: `ColumnTransformer`

- Numeric: passthrough.
- Categorical: `OneHotEncoder(handle_unknown="ignore")`.
- Text: `TfidfVectorizer(max_features=10_000, ngram_range=(1, 2))` (configurable).

Classifier: `LogisticRegression`

- `class_weight="balanced"` (to handle severe class imbalance).
- `max_iter=1000`.
- `random_state` configurable.
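Put together, the preprocessing and classifier stack looks roughly like the sketch below. The column names are illustrative assumptions, not the project's exact schema:

```python
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Column names below are assumptions for illustration.
numeric_cols = ["cvss_baseScore", "num_cpes", "year"]
categorical_cols = ["attackVector", "cwe", "vendor_top1"]

preprocess = ColumnTransformer([
    ("num", "passthrough", numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
    # A plain string selector hands the 1-D text column to the vectorizer.
    ("txt", TfidfVectorizer(max_features=10_000, ngram_range=(1, 2)), "desc"),
])

pipeline = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000,
                               random_state=42)),
])
```

`handle_unknown="ignore"` matters at scoring time: new vendors or CWE codes that never appeared in training simply encode to all-zero columns instead of raising.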

Evaluation metrics:

- ROC-AUC
- PR-AUC (average precision)
- Recall@top 1%, 5%, 10% of CVEs (for "we can only patch the top N%" scenarios).
- Optional `classification_report` at a chosen probability threshold.
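Recall@top-k answers "if we can only patch the top k% of CVEs by predicted risk, what fraction of the actually exploited ones do we catch?". A minimal sketch (the project's `metrics.recall_at_k` may differ in detail):

```python
import numpy as np

def recall_at_k(y_true, scores, k_frac):
    """Fraction of all positives captured in the top k_frac of items by score."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores)
    k = max(1, int(len(scores) * k_frac))
    top_idx = np.argsort(scores)[::-1][:k]  # indices of the k highest scores
    return y_true[top_idx].sum() / max(1, y_true.sum())
```

With heavy class imbalance this is usually far more actionable than accuracy, because it maps directly onto a patching budget.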

## Project Structure

```text
vuln-exploit-prediction/
│
├─ vuln_risk/
│  ├─ __init__.py        # package metadata
│  ├─ config.py          # AppConfig (paths, time window, API keys, etc.)
│  ├─ logging_utils.py   # centralized logging setup
│  ├─ data_sources.py    # NVD & KEV ingestion, flattening, labeling
│  ├─ preprocessing.py   # filtering, temporal split, feature matrices
│  ├─ model_pipeline.py  # sklearn pipeline, save/load helpers
│  ├─ metrics.py         # ROC/PR + recall@k evaluation utils
│  ├─ train.py           # CLI entry point to train the model
│  └─ score.py           # CLI entry point to score recent CVEs
│
└─ tests/                # pytest skeletons (unit & smoke tests)
```

## Requirements

Python 3.9+ (tested with 3.9).

Packages:

- pandas
- numpy
- scikit-learn
- requests
- joblib
- pytest (for tests)

Install dependencies:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install pandas numpy scikit-learn requests joblib pytest
```

## Configuration (AppConfig)

Configuration is handled by `vuln_risk.config.AppConfig`, which can be driven by environment variables or constructed directly in code. Key settings:

**Paths**

- `VULN_RISK_ROOT` (optional): base directory for data/models/cache. Default: `"."` (project root).

**Time window (NVD publication dates)**

- `VULN_PUB_START` (default `2016-01-01T00:00:00.000`)
- `VULN_PUB_END` (default `2024-12-31T23:59:59.000`)

**Temporal split**

- `VULN_TRAIN_YEAR_MAX` (default `2019`)
- `VULN_VAL_YEAR` (default `2020`)
- `VULN_TEST_YEAR_MIN` (default `2021`)

**NVD API**

- `NVD_API_KEY` (optional; improves rate limits but is not required).

**KEV source**

- `CISA_KEV_URL` (optional; overrides the default KEV CSV URL).

**Model hyperparameters**

- `VULN_TFIDF_MAX_FEATURES` (default `10000`)
- `VULN_TFIDF_NGRAM_MIN` (default `1`)
- `VULN_TFIDF_NGRAM_MAX` (default `2`)
- `VULN_RANDOM_STATE` (default `42`)

Example environment variable setup (Windows cmd):

```bat
set VULN_RISK_ROOT=.
set VULN_PUB_START=2018-01-01T00:00:00.000
set VULN_PUB_END=2024-12-31T23:59:59.000
set NVD_API_KEY=your_nvd_api_key_here
```
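In code, env-driven configuration of this shape typically reads each variable with a default. The sketch below is illustrative only; `AppConfigSketch` is a hypothetical name and the real `AppConfig` may expose different fields:

```python
import os
from dataclasses import dataclass, field

# Hypothetical sketch of env-driven config; not the project's actual AppConfig.
@dataclass
class AppConfigSketch:
    root_dir: str = field(
        default_factory=lambda: os.getenv("VULN_RISK_ROOT", "."))
    pub_start: str = field(
        default_factory=lambda: os.getenv("VULN_PUB_START",
                                          "2016-01-01T00:00:00.000"))
    pub_end: str = field(
        default_factory=lambda: os.getenv("VULN_PUB_END",
                                          "2024-12-31T23:59:59.000"))
    train_year_max: int = field(
        default_factory=lambda: int(os.getenv("VULN_TRAIN_YEAR_MAX", "2019")))
    random_state: int = field(
        default_factory=lambda: int(os.getenv("VULN_RANDOM_STATE", "42")))
```

Using `default_factory` (rather than a plain default) means the environment is consulted at construction time, so tests can set variables before building a config.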

## Usage

Always run from the project root (the folder containing `vuln_risk/`).

### 1. Train the model

```bash
python -m vuln_risk.train --root-dir .
```

What this does:

- Builds an `AppConfig(root_dir=".")`.
- Fetches NVD CVEs across the configured date range, chunked into 120-day windows to satisfy NVD API limits.
- Fetches CISA KEV and labels CVEs as exploited / non-exploited.
- Filters invalid rows (missing `year` or `cvss_baseScore`).
- Splits the dataset into train/val/test by year:
  - Train: `year <= train_year_max`
  - Val: `year == val_year`
  - Test: `year >= test_year_min`
- Trains the sklearn pipeline.
- Logs metrics on val and test.
- Saves the model to `models/vuln_risk_logreg.joblib`.
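The year-based split above can be sketched in a few lines of pandas (the `year` column matches the description; the defaults mirror the configuration section):

```python
import pandas as pd

def temporal_split(df, train_year_max=2019, val_year=2020, test_year_min=2021):
    """Split by publication year: past -> train, one year -> val, future -> test."""
    train = df[df["year"] <= train_year_max]
    val = df[df["year"] == val_year]
    test = df[df["year"] >= test_year_min]
    return train, val, test
```

Because validation and test years strictly follow the training years, the model is always evaluated on CVEs it could not have seen at training time, which is what "no temporal leakage" means here.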

Optional: get a classification report at a given probability threshold (e.g. 0.2):

```bash
python -m vuln_risk.train --root-dir . --threshold 0.2
```

### 2. Score recent CVEs

Once training has produced a model file:

```bash
python -m vuln_risk.score --root-dir . --days-back 30 --output-csv scores_recent_cves.csv
```

This will:

- Load `models/vuln_risk_logreg.joblib`.
- Fetch all CVEs from NVD whose `published` date falls within the last `days-back` days.
- Flatten and filter them.
- Run the pipeline to get exploitation risk probabilities.
- Write `data/scores_recent_cves.csv` with a `score_exploit_risk` column.

You can then sort by `score_exploit_risk` to see the highest-risk recent CVEs.
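For example, ranking the output CSV comes down to one `sort_values` call (the CVE IDs below are made up for illustration; only the `score_exploit_risk` column name comes from the pipeline):

```python
import pandas as pd

# In practice this frame would come from pd.read_csv("data/scores_recent_cves.csv").
scores = pd.DataFrame({
    "cve_id": ["CVE-2025-0001", "CVE-2025-0002", "CVE-2025-0003"],
    "score_exploit_risk": [0.12, 0.87, 0.45],
})
ranked = scores.sort_values("score_exploit_risk", ascending=False)
```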

## Testing

Tests live under `tests/` and use pytest. Run all tests:

```bash
pytest -q
```

The tests currently cover:

- `AppConfig` path initialisation.
- `data_sources.flatten_nvd_record` with a fake NVD record.
- `build_labeled_dataset` with mocked NVD/KEV loaders (no real network calls).
- Preprocessing (filtering, temporal split, feature matrix shapes).
- Pipeline build + fit on a tiny dummy dataset.
- Metric utilities (`recall_at_k`, `evaluate_split`).
- A smoke test for `train.main()` with everything mocked out except the top-level flow.

## NVD API Notes

NVD v2.0 enforces a maximum of 120 consecutive days per request for any date-range parameter.

`fetch_nvd_range` automatically:

- splits the configured `pub_start` → `pub_end` range into 120-day windows,
- pages through each window with `startIndex` / `resultsPerPage`,
- aggregates all vulnerabilities into one DataFrame,
- caches the result to `cache/nvd_cves_*.jsonl`.

If NVD changes rate limits or response shape, adjust `data_sources.fetch_nvd_range` accordingly.
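The window-splitting part of that logic can be sketched as follows (the real `fetch_nvd_range` also handles paging, rate limits, and caching):

```python
from datetime import datetime, timedelta

def date_windows(start, end, max_days=120):
    """Split [start, end) into consecutive windows of at most max_days each."""
    windows = []
    cur = start
    while cur < end:
        nxt = min(cur + timedelta(days=max_days), end)
        windows.append((cur, nxt))
        cur = nxt
    return windows
```

Each `(start, end)` pair then becomes one NVD request via the `pubStartDate` / `pubEndDate` query parameters, keeping every request within the 120-day limit.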

## Limitations / Things to Improve

- **Label noise:** Non-KEV CVEs are treated as "not exploited", which is not strictly true; many exploited vulns are not in KEV.
- **No time-to-exploit modeling:** The current model predicts eventual exploitation (KEV inclusion), not "exploited within X days".
- **Simple model:** Logistic regression + TF-IDF is intentionally simple and explainable. For better performance, you could explore tree-based models (XGBoost/LightGBM) or neural text encoders.
- **Operational integration:** The project outputs CSVs and a joblib model. A real deployment would typically wrap this in:
  - a scheduled training job,
  - a scheduled scoring job,
  - integration into a vulnerability management platform or SIEM.
