# Threat Classifier - SageMaker Project

Real-time Threat Classifier deployed on SageMaker via AWS CDK. Full MLOps lifecycle: secure VPC deployment, DistilBERT inference, P95 latency enforcement, and CloudWatch-driven automated rollback.

Standalone Amazon SageMaker solution for classifying security alert text by severity. Designed for easy integration into any security analytics platform.

## Features

- Train DistilBERT on synthetic alert data (implementation in progress)
- Deploy a real-time endpoint with data capture, alarms, and rollback (scaffolding ready)
- API contract: `{ id, text }` → `{ label, score, model, ts }` (see the sketch after this list)
- Cost guardrails and security best practices
- Majority-class heuristic baseline with uplift reporting for interview storytelling
- Ready for CI/CD, unit tests, and future SOC platform integration
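
Since the endpoint code is still landing, a minimal hypothetical Python sketch of that request/response shape may help; the handler, model identifier, and stand-in classifier below are placeholders, not the real inference module:

```python
# Hypothetical sketch of the documented API contract; the handler and
# model names are placeholders until the real inference module lands.
import json
import time


def handle_request(payload: dict) -> dict:
    """Map an incoming { id, text } record to { label, score, model, ts }."""
    label, score = classify(payload["text"])
    return {
        "label": label,             # predicted severity, e.g. "high"
        "score": score,             # model confidence in [0, 1]
        "model": "distilbert-dev",  # placeholder model identifier
        "ts": int(time.time()),     # epoch-seconds timestamp
    }


def classify(text: str) -> tuple[str, float]:
    # Majority-class stand-in so the contract is runnable end to end.
    return "low", 0.50


if __name__ == "__main__":
    request = {"id": "alert-001", "text": "Multiple failed SSH logins from one host"}
    print(json.dumps(handle_request(request)))
```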

## Quickstart

1. Install Poetry and AWS CLI v2.27.50 (see docs/aws-cli-setup.md).
2. `poetry install`
3. `cp .env.example .env` and tailor the environment variables for your profile/region.
4. `make lint` – run Ruff checks via Nox.
5. `make test` – execute pytest with coverage gates.
6. `make package` – build source + wheel artifacts.

CI runs automatically via GitHub Actions on pushes and pull requests. The pipeline installs Poetry-managed dependencies, enforces Ruff + Black via Nox, executes the pytest suite with coverage, and synthesizes the CDK app (dev environment) so infrastructure regressions are caught early. A separate Deploy workflow is triggered manually when you need to promote a build; it reuses the guarded deployment script, honors environment protection rules, and requires the `THREAT_DEPLOY_CONFIRM` acknowledgement before any `cdk deploy` executes. See docs/release-process.md for the full release, tagging, and approval model we walk through in interviews.
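
As a rough illustration of that gate (the actual guarded script and the exact confirmation value it expects may differ), the shape is:

```python
# Illustrative sketch of the deploy acknowledgement gate; the real guarded
# script and the exact THREAT_DEPLOY_CONFIRM value it expects may differ.
import os
import subprocess
import sys

if os.environ.get("THREAT_DEPLOY_CONFIRM", "").lower() != "yes":
    sys.exit("Refusing to deploy: set THREAT_DEPLOY_CONFIRM to acknowledge.")

# Only reached once the operator has explicitly confirmed.
subprocess.run(["cdk", "deploy", "--all"], check=True)
```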

Training, deployment, and inference flows are being implemented in phases. Placeholder CLI hooks surface `NotImplementedError` until their respective tasks land.
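
The placeholder pattern is deliberately simple; a hypothetical hook looks like:

```python
# Hypothetical placeholder CLI hook; real command names may differ.
def train() -> None:
    raise NotImplementedError("Training flow lands in a later phase.")
```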

## Baseline heuristic & uplift tracking

Every training run now captures a deterministic majority-class baseline so we can articulate clear business value during interviews. Metrics JSON includes both the baseline scores and the uplift delivered by the TF-IDF + LogisticRegression pipeline. That data feeds into SageMaker Model Monitor baselines or CloudWatch dashboards, reinforcing the governance story while keeping compute costs predictable.
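
For intuition, the baseline and uplift reduce to a short calculation. This sketch assumes accuracy is the headline metric; the real metrics JSON may carry more fields:

```python
# Sketch of majority-class baseline + uplift, assuming accuracy is the
# headline metric; the real metrics JSON may carry additional fields.
from collections import Counter


def baseline_and_uplift(y_true: list[str], model_accuracy: float) -> dict:
    majority_label, majority_count = Counter(y_true).most_common(1)[0]
    baseline_accuracy = majority_count / len(y_true)
    return {
        "baseline": {
            "strategy": "majority_class",
            "label": majority_label,
            "accuracy": baseline_accuracy,
        },
        "uplift": model_accuracy - baseline_accuracy,
    }


# A 90%-accurate model over a 75%-majority dataset yields 0.15 uplift.
print(baseline_and_uplift(["low", "low", "low", "high"], model_accuracy=0.90))
```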

Two storyboard notebooks (`notebooks/01-eda.ipynb` and `notebooks/02-offline-eval.ipynb`) provide structured TODOs for running these analyses inside VPC-bound SageMaker notebooks. They emphasize data access guardrails, reproducible feature engineering, and ranking metrics like MAP@k/NDCG@k that roll into CI checks or rollback triggers.
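
As a reference for the offline-eval notebook, NDCG@k fits in a few lines; this sketch assumes binary relevance labels, which the notebook TODOs may generalize to graded gains:

```python
# Minimal NDCG@k with binary relevance; graded gains would slot in the same way.
import math


def ndcg_at_k(relevances: list[int], k: int) -> float:
    """relevances: 0/1 gains in the order the model ranked the items."""
    def dcg(rels: list[int]) -> float:
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal else 0.0


print(ndcg_at_k([1, 0, 1, 1], k=3))  # ≈ 0.704
```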

## Environment configuration

All environment-specific settings (AWS accounts, VPC subnets, SageMaker instance types, FinOps tags) live under `config/*.yml`.

- Set `THREAT_ENV` to `dev`, `prod`, or another environment name to switch defaults.
- Optionally override the entire configuration with `THREAT_CONFIG_PATH=/abs/path/to/config.yml` for ad-hoc testing.
- Use `extends: relative/or/absolute/path.yml` inside a config file to layer overrides on top of shared baselines without duplicating metadata.
- The loader validates that every deployment remains VPC-only, encrypted with customer-managed KMS keys, and tagged for cost allocation.

The CLI automatically consumes these files so training and future deployment flows always source a single, audited configuration.
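
To make the layering concrete, here is a hypothetical resolution sketch showing how `THREAT_ENV`, `THREAT_CONFIG_PATH`, and `extends:` could compose; the real loader also runs the VPC/KMS/tagging validations, and the `config/<env>.yml` naming is an assumption:

```python
# Hypothetical config-resolution sketch; the real loader also enforces the
# VPC/KMS/tag validations, and config/<env>.yml naming is assumed here.
import os

import yaml  # PyYAML


def load_config(path: str) -> dict:
    with open(path) as fh:
        cfg = yaml.safe_load(fh) or {}
    parent = cfg.pop("extends", None)
    if parent:
        if not os.path.isabs(parent):
            parent = os.path.join(os.path.dirname(path), parent)
        base = load_config(parent)  # shared baseline, possibly layered itself
        base.update(cfg)            # child keys override the baseline (shallow merge)
        return base
    return cfg


env = os.environ.get("THREAT_ENV", "dev")
config = load_config(os.environ.get("THREAT_CONFIG_PATH", f"config/{env}.yml"))
```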

See `docs/` and `SageMaker-Project-Plan.md` for the broader architecture roadmap.

## AWS CLI & IAM Requirements

- The project standardizes on AWS CLI v2.27.50 to align with CDK tooling.
- Use dedicated least-privilege profiles (e.g., `sagemaker-dev`) with MFA/SAML enforced.
- Apply cost allocation tags (`App`, `Env`, `CostCenter`, `Owner`) to every resource; the CDK stack inherits them automatically (sketched below).
- Detailed installation and configuration instructions live in docs/aws-cli-setup.md.
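
For reference, stack-level tag inheritance with CDK v2 in Python looks roughly like this; the stack name and tag values are placeholders:

```python
# Sketch of stack-wide FinOps tagging with AWS CDK v2; names and values
# below are placeholders, not the project's actual stack definition.
import aws_cdk as cdk

app = cdk.App()
stack = cdk.Stack(app, "ThreatClassifierDev")

# Tags.of(...) propagates each tag to every taggable resource in the stack.
for key, value in {
    "App": "threat-classifier",
    "Env": "dev",
    "CostCenter": "security-ml",
    "Owner": "mlops-team",
}.items():
    cdk.Tags.of(stack).add(key, value)

app.synth()
```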

## Repository Layout

```
├── data/              # Raw, synthetic, and monitoring datasets (no PII)
├── docs/              # Project documentation and future ADRs/runbooks
├── infra/             # AWS CDK application scaffolding
├── notebooks/         # Exploratory analysis & demo notebooks
├── scripts/           # Utility scripts (checklist automation, artifact packaging)
├── src/               # Python packages for training, inference, CLI
├── tests/             # Pytest suite executed by Nox/CI
└── Makefile           # Convenience targets wrapping Poetry + Nox
```

## Model Monitor baseline generation

Use the new baseline helper to generate the statistics/constraints pair that SageMaker Model Monitor consumes:

```bash
poetry run python scripts/create_monitor_baseline.py --env dev --dataset data/sample.csv
```

- Uploads the provided dataset (or uses the pre-staged S3 object from `config/*.yml`) to the environment's baseline prefix.
- Launches a data-quality processing job using the Model Monitor container and writes `statistics.json` and `constraints.json` alongside the dataset.
- Honors least-privilege IAM by reusing the configured monitoring role and VPC-bound resources.

Re-run the script any time you refresh the synthetic dataset or introduce new features. Interview callout: this demonstrates proactive data drift detection and shows how the platform self-polices over its lifetime.
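
Under the hood this is the standard Model Monitor baselining flow. A sketch using the SageMaker Python SDK (the role ARN and S3 URIs are placeholders, and the helper script's internals may differ):

```python
# Sketch of a data-quality baseline job via the SageMaker Python SDK; the
# role ARN and S3 URIs are placeholders, and the helper may differ in detail.
from sagemaker.model_monitor import DefaultModelMonitor
from sagemaker.model_monitor.dataset_format import DatasetFormat

monitor = DefaultModelMonitor(
    role="arn:aws:iam::123456789012:role/monitoring-role",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

# Writes statistics.json and constraints.json under output_s3_uri.
monitor.suggest_baseline(
    baseline_dataset="s3://example-bucket/baseline/sample.csv",
    dataset_format=DatasetFormat.csv(header=True),
    output_s3_uri="s3://example-bucket/baseline/results",
)
```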

Run `make lint` and `make test` before pushing changes. When adding new directories, drop a short README describing intent so contributors stay aligned.

## Packaging trained artifacts

Bundle the latest training outputs into a SageMaker-compatible archive with the Poetry CLI harness:

```bash
poetry run package-model
```

Or call the lower-level script directly when you need custom arguments:

```bash
poetry run python scripts/package_model_artifacts.py --env dev
```

- Defaults to the environment's configured training output directory and writes `dist/<env>/model.tar.gz` (archive layout sketched below).
- Pass `--upload` to push the tarball to the environment's model artifacts bucket; uploads use the project KMS key and inherit FinOps tags so budgets stay accurate.
- Override paths or S3 keys (`--source-dir`, `--output`, `--s3-key`) when replaying historical models or promoting a hotfix.
- The CLI keeps uploads opt-in: set `THREAT_PACKAGE_UPLOAD=true` when you intentionally want to push to S3. Optional overrides include `THREAT_PACKAGE_SOURCE_DIR`, `THREAT_PACKAGE_OUTPUT_PATH`, and `THREAT_PACKAGE_S3_KEY`.
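
The heart of the packaging step is a conventional SageMaker archive with artifact files at the tarball root; a sketch under assumed paths (the real script reads them from config):

```python
# Sketch of building the model.tar.gz layout SageMaker expects: artifact
# files at the archive root. Paths are assumed; the script reads them from config.
import tarfile
from pathlib import Path

source_dir = Path("outputs/dev/model")       # assumed training output directory
output = Path("dist/dev/model.tar.gz")
output.parent.mkdir(parents=True, exist_ok=True)

with tarfile.open(output, "w:gz") as tar:
    for item in source_dir.iterdir():
        tar.add(item, arcname=item.name)     # keep entries at the tarball root
```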

This keeps deployment-ready artifacts reproducible without ad-hoc shell gymnastics, which is useful when interviewers probe for MLOps governance stories.
