# Threat Classifier - SageMaker Project

Real-time Threat Classifier deployed on SageMaker via AWS CDK. Full MLOps lifecycle: secure VPC deployment, DistilBERT inference, P95 latency enforcement, and CloudWatch-driven automated rollback.

Standalone Amazon SageMaker solution for classifying security alert text by severity. Designed for easy integration into any security analytics platform.

## Features

- Train DistilBERT on synthetic alert data (implementation in progress)
- Deploy a real-time endpoint with data capture, alarms, and rollback (scaffolding ready)
- API contract: `{ id, text }` → `{ label, score, model, ts }` (see the sketch after this list)
- Cost guardrails and security best practices
- Majority-class heuristic baseline with uplift reporting for interview storytelling
- Ready for CI/CD, unit tests, and future SOC platform integration
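
Since the endpoint code is still landing, a minimal hypothetical Python sketch of that request/response shape may help; the handler, model identifier, and stand-in classifier below are placeholders, not the real inference module:

```python
# Hypothetical sketch of the documented API contract; the handler and
# model names are placeholders until the real inference module lands.
import json
import time


def handle_request(payload: dict) -> dict:
    """Map an incoming { id, text } record to { label, score, model, ts }."""
    label, score = classify(payload["text"])
    return {
        "label": label,             # predicted severity, e.g. "high"
        "score": score,             # model confidence in [0, 1]
        "model": "distilbert-dev",  # placeholder model identifier
        "ts": int(time.time()),     # epoch-seconds timestamp
    }


def classify(text: str) -> tuple[str, float]:
    # Majority-class stand-in so the contract is runnable end to end.
    return "low", 0.50


if __name__ == "__main__":
    request = {"id": "alert-001", "text": "Multiple failed SSH logins from one host"}
    print(json.dumps(handle_request(request)))
```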

## Quickstart

1. Install Poetry and AWS CLI v2.27.50 (see docs/aws-cli-setup.md).
2. `poetry install`
3. `cp .env.example .env` and tailor the environment variables for your profile/region.
4. `make lint` – run Ruff checks via Nox.
5. `make test` – execute pytest with coverage gates.
6. `make package` – build source + wheel artifacts.

CI runs automatically via GitHub Actions on pushes and pull requests. The pipeline installs Poetry-managed dependencies, enforces Ruff + Black via Nox, executes the pytest suite with coverage, and synthesizes the CDK app (dev environment) so infrastructure regressions are caught early. A separate Deploy workflow is triggered manually when you need to promote a build; it reuses the guarded deployment script, honors environment protection rules, and requires the `THREAT_DEPLOY_CONFIRM` acknowledgement before any `cdk deploy` executes. See docs/release-process.md for the full release, tagging, and approval model we walk through in interviews.
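
As a rough illustration of that gate (the actual guarded script and the exact confirmation value it expects may differ), the shape is:

```python
# Illustrative sketch of the deploy acknowledgement gate; the real guarded
# script and the exact THREAT_DEPLOY_CONFIRM value it expects may differ.
import os
import subprocess
import sys

if os.environ.get("THREAT_DEPLOY_CONFIRM", "").lower() != "yes":
    sys.exit("Refusing to deploy: set THREAT_DEPLOY_CONFIRM to acknowledge.")

# Only reached once the operator has explicitly confirmed.
subprocess.run(["cdk", "deploy", "--all"], check=True)
```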

Training, deployment, and inference flows are being implemented in phases. Placeholder CLI hooks surface `NotImplementedError` until their respective tasks land.
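
The placeholder pattern is deliberately simple; a hypothetical hook looks like:

```python
# Hypothetical placeholder CLI hook; real command names may differ.
def train() -> None:
    raise NotImplementedError("Training flow lands in a later phase.")
```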

## Baseline heuristic & uplift tracking

Every training run now captures a deterministic majority-class baseline so we can articulate clear business value during interviews. Metrics JSON includes both the baseline scores and the uplift delivered by the TF-IDF + LogisticRegression pipeline. That data feeds into SageMaker Model Monitor baselines or CloudWatch dashboards, reinforcing the governance story while keeping compute costs predictable.
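
For intuition, the baseline and uplift reduce to a short calculation. This sketch assumes accuracy is the headline metric; the real metrics JSON may carry more fields:

```python
# Sketch of majority-class baseline + uplift, assuming accuracy is the
# headline metric; the real metrics JSON may carry additional fields.
from collections import Counter


def baseline_and_uplift(y_true: list[str], model_accuracy: float) -> dict:
    majority_label, majority_count = Counter(y_true).most_common(1)[0]
    baseline_accuracy = majority_count / len(y_true)
    return {
        "baseline": {
            "strategy": "majority_class",
            "label": majority_label,
            "accuracy": baseline_accuracy,
        },
        "uplift": model_accuracy - baseline_accuracy,
    }


# A 90%-accurate model over a 75%-majority dataset yields 0.15 uplift.
print(baseline_and_uplift(["low", "low", "low", "high"], model_accuracy=0.90))
```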

Two storyboard notebooks (`notebooks/01-eda.ipynb` and `notebooks/02-offline-eval.ipynb`) provide structured TODOs for running these analyses inside VPC-bound SageMaker notebooks. They emphasize data access guardrails, reproducible feature engineering, and ranking metrics like MAP@k/NDCG@k that roll into CI checks or rollback triggers.
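
As a reference for the offline-eval notebook, NDCG@k fits in a few lines; this sketch assumes binary relevance labels, which the notebook TODOs may generalize to graded gains:

```python
# Minimal NDCG@k with binary relevance; graded gains would slot in the same way.
import math


def ndcg_at_k(relevances: list[int], k: int) -> float:
    """relevances: 0/1 gains in the order the model ranked the items."""
    def dcg(rels: list[int]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0


print(ndcg_at_k([1, 0, 1, 1], k=3))  # ≈ 0.704
```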

## Environment configuration

All environment-specific settings (AWS accounts, VPC subnets, SageMaker instance types, FinOps tags) live under `config/*.yml`.

- Set `THREAT_ENV` to `dev`, `prod`, or another environment name to switch defaults.
- Optionally override the entire configuration with `THREAT_CONFIG_PATH=/abs/path/to/config.yml` for ad-hoc testing.
- Use `extends: relative/or/absolute/path.yml` inside a config file to layer overrides on top of shared baselines without duplicating metadata.
- The loader validates that every deployment remains VPC-only, encrypted with customer-managed KMS keys, and tagged for cost allocation.

The CLI automatically consumes these files so training and future deployment flows always source a single, audited configuration.
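
To make the layering concrete, here is a hypothetical resolution sketch showing how `THREAT_ENV`, `THREAT_CONFIG_PATH`, and `extends:` could compose; the real loader also runs the VPC/KMS/tagging validations, and the `config/<env>.yml` naming is an assumption:

```python
# Hypothetical config-resolution sketch; the real loader also enforces the
# VPC/KMS/tag validations, and config/<env>.yml naming is assumed here.
import os

import yaml  # PyYAML


def load_config(path: str) -> dict:
    with open(path) as fh:
        cfg = yaml.safe_load(fh) or {}
    parent = cfg.pop("extends", None)
    if parent:
        if not os.path.isabs(parent):
            parent = os.path.join(os.path.dirname(path), parent)
        base = load_config(parent)  # shared baseline, possibly layered itself
        base.update(cfg)            # child keys override the baseline (shallow merge)
        return base
    return cfg


env = os.environ.get("THREAT_ENV", "dev")
config = load_config(os.environ.get("THREAT_CONFIG_PATH", f"config/{env}.yml"))
```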

See `docs/` and `SageMaker-Project-Plan.md` for the broader architecture roadmap.

## AWS CLI & IAM Requirements

- The project standardizes on AWS CLI v2.27.50 to align with CDK tooling.
- Use dedicated least-privilege profiles (e.g., `sagemaker-dev`) with MFA/SAML enforced.
- Apply cost allocation tags (`App`, `Env`, `CostCenter`, `Owner`) to every resource; the CDK stack inherits them automatically (sketched below).
- Detailed installation and configuration instructions live in docs/aws-cli-setup.md.
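
For reference, stack-level tag inheritance with CDK v2 in Python looks roughly like this; the stack name and tag values are placeholders:

```python
# Sketch of stack-wide FinOps tagging with AWS CDK v2; names and values
# below are placeholders, not the project's actual stack definition.
import aws_cdk as cdk

app = cdk.App()
stack = cdk.Stack(app, "ThreatClassifierDev")

# Tags.of(...) propagates each tag to every taggable resource in the stack.
for key, value in {
    "App": "threat-classifier",
    "Env": "dev",
    "CostCenter": "security-ml",
    "Owner": "mlops-team",
}.items():
    cdk.Tags.of(stack).add(key, value)

app.synth()
```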

## Repository Layout

```
├── data/              # Raw, synthetic, and monitoring datasets (no PII)
├── docs/              # Project documentation and future ADRs/runbooks
├── infra/             # AWS CDK application scaffolding
├── notebooks/         # Exploratory analysis & demo notebooks
├── scripts/           # Utility scripts (checklist automation, artifact packaging)
├── src/               # Python packages for training, inference, CLI
├── tests/             # Pytest suite executed by Nox/CI
└── Makefile           # Convenience targets wrapping Poetry + Nox
```

## Model Monitor baseline generation

Use the new baseline helper to generate the statistics/constraints pair that SageMaker Model Monitor consumes:

```bash
poetry run python scripts/create_monitor_baseline.py --env dev --dataset data/sample.csv
```

- Uploads the provided dataset (or uses the pre-staged S3 object from `config/*.yml`) to the environment's baseline prefix.
- Launches a data-quality processing job using the Model Monitor container and writes `statistics.json` and `constraints.json` alongside the dataset.
- Honors least-privilege IAM by reusing the configured monitoring role and VPC-bound resources.

Re-run the script any time you refresh the synthetic dataset or introduce new features. Interview callout: this demonstrates proactive data drift detection and shows how the platform self-polices over its lifetime.
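
Under the hood this is the standard Model Monitor baselining flow. A sketch using the SageMaker Python SDK (the role ARN and S3 URIs are placeholders, and the helper script's internals may differ):

```python
# Sketch of a data-quality baseline job via the SageMaker Python SDK; the
# role ARN and S3 URIs are placeholders, and the helper may differ in detail.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/monitoring-role",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Writes statistics.json and constraints.json under output_s3_uri.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/baseline/sample.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/baseline/results",
)
```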

Run `make lint` and `make test` before pushing changes. When adding new directories, drop a short README describing intent so contributors stay aligned.

## Packaging trained artifacts

Bundle the latest training outputs into a SageMaker-compatible archive with the Poetry CLI harness:

```bash
poetry run package-model
```

Or call the lower-level script directly when you need custom arguments:

```bash
poetry run python scripts/package_model_artifacts.py --env dev
```

- Defaults to the environment's configured training output directory and writes `dist/<env>/model.tar.gz` (archive layout sketched below).
- Pass `--upload` to push the tarball to the environment's model artifacts bucket; uploads use the project KMS key and inherit FinOps tags so budgets stay accurate.
- Override paths or S3 keys (`--source-dir`, `--output`, `--s3-key`) when replaying historical models or promoting a hotfix.
- The CLI keeps uploads opt-in: set `THREAT_PACKAGE_UPLOAD=true` when you intentionally want to push to S3. Optional overrides include `THREAT_PACKAGE_SOURCE_DIR`, `THREAT_PACKAGE_OUTPUT_PATH`, and `THREAT_PACKAGE_S3_KEY`.
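
The heart of the packaging step is a conventional SageMaker archive with artifact files at the tarball root; a sketch under assumed paths (the real script reads them from config):

```python
# Sketch of building the model.tar.gz layout SageMaker expects: artifact
# files at the archive root. Paths are assumed; the script reads them from config.
import tarfile
from pathlib import Path

source_dir = Path("outputs/dev/model")       # assumed training output directory
output = Path("dist/dev/model.tar.gz")
output.parent.mkdir(parents=True, exist_ok=True)

with tarfile.open(output, "w:gz") as tar:
    for item in source_dir.iterdir():
        tar.add(item, arcname=item.name)     # keep entries at the tarball root
```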

This keeps deployment-ready artifacts reproducible without ad-hoc shell gymnastics, which is useful when interviewers probe for MLOps governance stories.
