Network Security Project - Complete Documentation

📋 Table of Contents

  1. Project Overview
  2. Architecture & Workflow
  3. Project Structure
  4. Installation & Configuration
  5. Project Workflow
  6. API Endpoints
  7. Docker Setup
  8. CI/CD Pipeline
  9. Monitoring & Logging
  10. Performance Metrics

🎯 Project Overview

Network Security is an end-to-end machine learning application designed to detect phishing and network security threats using classification models. The project implements a complete MLOps pipeline with data ingestion, validation, transformation, model training, and deployment capabilities.

Key Features

  • Data Pipeline: MongoDB integration for data ingestion
  • Data Validation: Schema validation and drift detection
  • Data Transformation: KNN imputation for missing values
  • Model Training: Multiple classification algorithms with hyperparameter tuning
  • MLflow Integration: Experiment tracking and model monitoring
  • REST API: FastAPI for model serving and predictions
  • Cloud Integration: AWS S3 for artifact storage
  • Docker Support: Containerized deployment
  • CI/CD: GitHub Actions for automated workflows

πŸ—οΈ Architecture & Workflow

┌─────────────────────────────────────────────────────────────┐
│                    DATA SOURCE (MongoDB)                    │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│           1. DATA INGESTION (CSV Export & Split)            │
│  - Connects to MongoDB                                      │
│  - Exports collection as DataFrame                          │
│  - Splits into train/test sets (80/20)                      │
│  - Saves to: Artifacts/<timestamp>/data_ingestion/          │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│        2. DATA VALIDATION (Schema & Drift Detection)        │
│  - Validates number of columns against schema.yaml          │
│  - Detects data drift using two-sample KS test              │
│  - Generates drift report                                   │
│  - Saves valid data to: data_validation/validated/          │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│  3. DATA TRANSFORMATION (Feature Engineering & Imputation)  │
│  - Handles missing values with KNNImputer (k=3)             │
│  - Creates preprocessing pipeline                           │
│  - Transforms features using fitted preprocessor            │
│  - Saves transformed data as .npy files                     │
│  - Saves preprocessor object for inference                  │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│   4. MODEL TRAINING (Hyperparameter Tuning & Evaluation)    │
│  - Trains multiple algorithms:                              │
│    • Random Forest                                          │
│    • Decision Tree                                          │
│    • Gradient Boosting                                      │
│    • Logistic Regression                                    │
│    • AdaBoost                                               │
│  - GridSearchCV for hyperparameter optimization             │
│  - Evaluates on train/test sets                             │
│  - Calculates metrics: F1, Precision, Recall                │
│  - Tracks experiments with MLflow                           │
│  - Selects best model based on test R² score                │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│       5. MODEL DEPLOYMENT (S3 & Final Model Creation)       │
│  - Saves artifacts to S3 bucket                             │
│  - Creates NetworkModel wrapper (preprocessor + model)      │
│  - Saves final_model/ directory with:                       │
│    • model.pkl (trained classifier)                         │
│    • preprocessor.pkl (transformation pipeline)             │
└────────────────────┬────────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────────┐
│              6. INFERENCE (REST API Endpoint)               │
│  - Load preprocessor and model from final_model/            │
│  - Create NetworkModel instance                             │
│  - Transform input features with preprocessor               │
│  - Generate predictions                                     │
│  - Return predictions to user                               │
└─────────────────────────────────────────────────────────────┘
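
The NetworkModel wrapper referenced in steps 5 and 6 simply bundles the fitted preprocessor with the trained classifier, so inference always applies the same transformation as training. A minimal sketch of such a wrapper, assuming the interface suggested by the diagram (the real class lives in Network_security/utils/ml_utils/model/estimator.py and may differ in detail):

# Minimal sketch of a preprocessor + classifier wrapper (assumed interface)
class NetworkModel:
    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor   # fitted transformation pipeline
        self.model = model                 # trained classifier

    def predict(self, x):
        # Apply the training-time transformation, then predict
        x_transformed = self.preprocessor.transform(x)
        return self.model.predict(x_transformed)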

πŸ“ Project Structure

Network_Security/
├── Network_security/
│   ├── components/
│   │   ├── data_ingestion.py          # MongoDB → CSV conversion
│   │   ├── data_validation.py         # Schema & drift validation
│   │   ├── data_transformation.py     # Feature preprocessing
│   │   └── model_trainer.py           # Model training & selection
│   ├── entity/
│   │   ├── config_entity.py           # Configuration classes
│   │   └── artifacts_entity.py        # Artifact data classes
│   ├── exception/
│   │   └── exception.py               # Custom exception handling
│   ├── logging/
│   │   └── logger.py                  # Logging configuration
│   ├── utils/
│   │   ├── main_utils/
│   │   │   └── utils.py               # Helper functions
│   │   └── ml_utils/
│   │       ├── metric/
│   │       │   └── classification_metric.py   # Metrics calculation
│   │       └── model/
│   │           └── estimator.py       # NetworkModel wrapper
│   ├── constants/
│   │   └── training_pipeline/
│   │       └── __init__.py            # Pipeline constants
│   ├── cloud/
│   │   └── s3_syncer.py               # AWS S3 integration
│   └── pipeline/
│       └── training_pipeline.py       # Main orchestration
├── data_schema/
│   └── schema.yaml                    # Data validation schema
├── templates/
│   └── table.html                     # HTML template for predictions
├── logs/                              # Training logs
├── Artifacts/                         # Generated artifacts
├── final_model/                       # Deployed model artifacts
├── prediction_output/                 # Prediction results
│
├── app.py                             # FastAPI application
├── main.py                            # Direct execution entry point
├── push_data.py                       # MongoDB data loader
│
├── requirements.txt                   # Python dependencies
├── setup.py                           # Package configuration
├── DOCKERFILE                         # Container configuration
├── .github/
│   └── workflows/
│       └── main.yml                   # GitHub Actions CI/CD
├── .env                               # Environment variables
├── .env.example                       # Environment template
├── .gitignore                         # Git ignore rules
├── LICENSE                            # GPL v3 License
└── README.md                          # This file

🔧 Installation & Configuration

Step 1: Clone the Repository

git clone https://github.com/ashmijha/Network_Security.git
cd Network_Security

Step 2: Create Virtual Environment

# Create virtual environment
python -m venv env

# Activate virtual environment
# On Linux/macOS:
source env/bin/activate

# On Windows:
env\Scripts\activate

Step 3: Install Dependencies

pip install -r requirements.txt

Step 4: Configure Environment Variables

# Create .env file from template
cp .env.example .env

# Edit .env with your credentials
nano .env

Required .env variables:

MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/<database>?retryWrites=true&w=majority
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=us-east-1
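
At runtime these values are typically read from the process environment; a minimal sketch, assuming python-dotenv is used to load the .env file:

import os
from dotenv import load_dotenv

load_dotenv()  # reads key=value pairs from .env into the environment

mongodb_uri = os.getenv("MONGODB_URI")
aws_region = os.getenv("AWS_DEFAULT_REGION", "us-east-1")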

Step 5: Load Data into MongoDB

# Run the data push script
python push_data.py

# This will:
# 1. Read Network_Data/phisingData.csv
# 2. Convert CSV to JSON records
# 3. Insert into MongoDB ASHMI_DB.NetworkData collection
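
A hedged sketch of what push_data.py does, using the file, database, and collection names above (the actual script may structure this differently):

import os
import pandas as pd
from dotenv import load_dotenv
from pymongo import MongoClient

load_dotenv()

df = pd.read_csv("Network_Data/phisingData.csv")        # 1. read the source CSV
records = df.to_dict(orient="records")                  # 2. convert rows to JSON-like records

client = MongoClient(os.getenv("MONGODB_URI"))
client["ASHMI_DB"]["NetworkData"].insert_many(records)  # 3. insert into the collection
print(f"Inserted {len(records)} records")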

🚀 Project Workflow

Option 1: Direct Python Execution

# Run the main training pipeline
python main.py

# Or use the modular approach in main.py:
# - Data Ingestion
# - Data Validation
# - Data Transformation
# - Model Training

Option 2: FastAPI Application

# Start the API server
python app.py

# The server will run at: http://localhost:8000

# Access interactive docs:
# - Swagger UI: http://localhost:8000/docs
# - ReDoc: http://localhost:8000/redoc

Option 3: Using Docker

# Build the Docker image
docker build -t network-security:latest .

# Run the container
docker run -p 8000:8000 \
  -e MONGODB_URI="your_mongodb_uri" \
  -e AWS_ACCESS_KEY_ID="your_aws_key" \
  -e AWS_SECRET_ACCESS_KEY="your_aws_secret" \
  network-security:latest

Detailed Workflow Steps

1️⃣ Data Ingestion

# Loads data from MongoDB
# File: Network_security/components/data_ingestion.py

DataIngestion:
  ├── export_collection_as_dataframe()
  │   └── Connects to MongoDB and fetches NetworkData
  ├── export_data_into_feature_store()
  │   └── Saves full dataset as CSV
  └── split_data_as_train_test()
      ├── 80% training data
      └── 20% testing data

Output Artifacts:

  • Artifacts/<timestamp>/data_ingestion/feature_store/phisingData.csv
  • Artifacts/<timestamp>/data_ingestion/ingested/train.csv
  • Artifacts/<timestamp>/data_ingestion/ingested/test.csv
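
A hedged sketch of the ingestion logic, stripped of the project's configuration and artifact classes (database and collection names as documented; the random seed and output paths are illustrative assumptions):

import os
import pandas as pd
from pymongo import MongoClient
from sklearn.model_selection import train_test_split

client = MongoClient(os.getenv("MONGODB_URI"))
records = list(client["ASHMI_DB"]["NetworkData"].find())
df = pd.DataFrame(records).drop(columns=["_id"], errors="ignore")  # export collection as a DataFrame

train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)  # 80/20 split
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)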

2️⃣ Data Validation

# Validates data quality and detects drift
# File: Network_security/components/data_validation.py

DataValidation:
  ├── validate_number_of_columns()
  │   └── Checks against data_schema/schema.yaml
  └── detect_dataset_drift()
      └── Uses Kolmogorov-Smirnov test (p-value threshold: 0.05)

Output Artifacts:

  • Artifacts/<timestamp>/data_validation/validated/train.csv
  • Artifacts/<timestamp>/data_validation/validated/test.csv
  • Artifacts/<timestamp>/data_validation/drift_report/report.yaml
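
A minimal sketch of the per-column drift check with SciPy's two-sample Kolmogorov-Smirnov test, using the documented 0.05 p-value threshold (the function name and report layout are assumptions):

import pandas as pd
from scipy.stats import ks_2samp

def detect_dataset_drift(base_df: pd.DataFrame, current_df: pd.DataFrame, threshold: float = 0.05) -> dict:
    report = {}
    for column in base_df.columns:
        result = ks_2samp(base_df[column], current_df[column])
        # A small p-value suggests the two samples come from different distributions
        report[column] = {"p_value": float(result.pvalue), "drift_found": bool(result.pvalue < threshold)}
    return report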

3️⃣ Data Transformation

# Transforms features using KNN imputation
# File: Network_security/components/data_transformation.py

DataTransformation:
  ├── get_data_transformer_object()
  │   └── Creates Pipeline with KNNImputer(n_neighbors=3)
  └── initiate_data_transformation()
      ├── Fits preprocessor on training data
      ├── Transforms train features
      ├── Transforms test features
      └── Appends target column

Output Artifacts:

  • Artifacts/<timestamp>/data_transformation/transformed/train.npy
  • Artifacts/<timestamp>/data_transformation/transformed/test.npy
  • Artifacts/<timestamp>/data_transformation/transformed_object/preprocessing.pkl
  • final_model/preprocessor.pkl (for inference)
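
A minimal sketch of the preprocessing step on toy data, using the documented KNNImputer(n_neighbors=3) inside a scikit-learn Pipeline:

import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline

preprocessor = Pipeline(steps=[("imputer", KNNImputer(n_neighbors=3))])

X_train = np.array([[1.0, np.nan], [0.0, 1.0], [1.0, 0.0], [0.0, 0.0]])  # toy data with a missing value
X_test = np.array([[np.nan, 1.0]])

X_train_transformed = preprocessor.fit_transform(X_train)  # fit only on training data
X_test_transformed = preprocessor.transform(X_test)        # reuse the fitted preprocessor at test time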

4️⃣ Model Training

# Trains and selects best model
# File: Network_security/components/model_trainer.py

ModelTrainer:
  ├── train_model()
  │   ├── Initialize 5 classifiers
  │   ├── GridSearchCV for hyperparameters
  │   ├── Train and evaluate each model
  │   └── Select best by test score
  ├── track_mlflow()
  │   └── Log metrics: F1, Precision, Recall
  └── initiate_model_trainer()
      └── Create ModelTrainerArtifact

Models Trained:

Model                  Hyperparameters
Random Forest          n_estimators: [8, 16, 32, 128, 256]
Decision Tree          criterion: ['gini', 'entropy', 'log_loss']
Gradient Boosting      learning_rate: [0.1, 0.01, 0.05, 0.001]
Logistic Regression    default parameters
AdaBoost               n_estimators, learning_rate

Output Artifacts:

  • Artifacts/<timestamp>/model_trainer/trained_model/model.pkl
  • final_model/model.pkl (for inference)
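
A condensed, hedged sketch of the train-compare-select loop on synthetic data (only two of the five classifiers and a subset of the grids are shown, and the selection metric is simplified to F1):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=200, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

candidates = {
    "RandomForest": (RandomForestClassifier(), {"n_estimators": [8, 16, 32]}),
    "DecisionTree": (DecisionTreeClassifier(), {"criterion": ["gini", "entropy"]}),
}

best_name, best_model, best_score = None, None, -1.0
for name, (model, param_grid) in candidates.items():
    search = GridSearchCV(model, param_grid, cv=3)                     # hyperparameter tuning
    search.fit(X_train, y_train)
    score = f1_score(y_test, search.best_estimator_.predict(X_test))   # evaluate on the test split
    if score > best_score:
        best_name, best_model, best_score = name, search.best_estimator_, score

print(best_name, round(best_score, 3))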

🔌 API Endpoints

1. Home / Docs Redirect

GET /

Redirects to Swagger documentation at /docs

2. Training Pipeline

GET /train

Description: Triggers the complete training pipeline

Response:

{
  "message": "Training is successful"
}

Example:

curl -X GET "http://localhost:8000/train"

3. Prediction

POST /predict

Description: Accepts an uploaded CSV file of feature values and returns predictions

Parameters:

  • file (multipart/form-data): CSV file with features

Response: HTML table with predictions

Example:

curl -X POST "http://localhost:8000/predict" \
  -F "file=@input_data.csv"

Input CSV Format:

feature1,feature2,feature3,...,featureN
0.5,0.3,0.8,...,0.2
0.6,0.4,0.7,...,0.3
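
A hedged sketch of what the /predict route does on the server side, loading the artifacts from final_model/ as documented (the output column name and HTML handling are assumptions; the real app.py may differ):

import pickle

import pandas as pd
from fastapi import FastAPI, File, UploadFile
from fastapi.responses import HTMLResponse

app = FastAPI()

@app.post("/predict")
async def predict(file: UploadFile = File(...)):
    df = pd.read_csv(file.file)                              # uploaded CSV of feature values
    with open("final_model/preprocessor.pkl", "rb") as f:
        preprocessor = pickle.load(f)
    with open("final_model/model.pkl", "rb") as f:
        model = pickle.load(f)
    df["predicted_column"] = model.predict(preprocessor.transform(df))  # assumed output column name
    return HTMLResponse(content=df.to_html())                # rendered as an HTML table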

🐳 Docker Setup

Dockerfile Explanation

FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y awscli git

# Copy requirements
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy project
COPY . .

# Install package
RUN pip install -e .

# Expose port
EXPOSE 8000

# Run application
CMD ["python", "app.py"]

Build and Run

# Build image
docker build -t network-security:latest .

# Run container with environment variables
docker run -d \
  --name network-security \
  -p 8000:8000 \
  -e MONGODB_URI="$MONGODB_URI" \
  -e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  -e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  network-security:latest

# View logs
docker logs -f network-security

# Stop container
docker stop network-security

βš™οΈ CI/CD Pipeline

GitHub Actions Workflow

# File: .github/workflows/main.yml

Workflow Steps:

  1. Trigger: On push to main branch
  2. Checkout: Clone repository
  3. Setup Python: Install Python 3.9
  4. Install Dependencies: pip install -r requirements.txt
  5. Linting (optional): Code quality checks
  6. Run Tests (optional): Unit tests
  7. Build Docker Image: Create container
  8. Push to Registry (optional): Docker Hub or ECR
  9. Deploy (optional): Deploy to cloud

Manual GitHub Actions Trigger

# The workflow runs automatically on:
git push origin main

# Or manually trigger from GitHub UI:
# Actions → Select workflow → Run workflow

Setting Up CI/CD Secrets

In GitHub repository settings, add these secrets:

MONGODB_URI             → Your MongoDB connection string
AWS_ACCESS_KEY_ID       → Your AWS access key
AWS_SECRET_ACCESS_KEY   → Your AWS secret key
DOCKER_USERNAME         → Docker Hub username (optional)
DOCKER_PASSWORD         → Docker Hub token (optional)


📊 Monitoring & Logging

Log Files Location

logs/
├── MM_DD_YYYY_HH_MM_SS.log
└── (New log created for each run)
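
A minimal sketch of a logger configuration that produces this naming pattern (the actual settings live in Network_security/logging/logger.py and may differ):

import logging
import os
from datetime import datetime

LOG_FILE = datetime.now().strftime("%m_%d_%Y_%H_%M_%S") + ".log"   # e.g. 01_01_2025_10_30_45.log
os.makedirs("logs", exist_ok=True)

logging.basicConfig(
    filename=os.path.join("logs", LOG_FILE),
    format="[ %(asctime)s ] %(name)s - %(levelname)s - %(message)s",
    level=logging.INFO,
)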

View Real-time Logs

# Follow logs in real-time
tail -f logs/01_01_2025_10_30_45.log

# Search for errors
grep "ERROR" logs/*.log

# View specific component logs
grep "DataTransformation" logs/*.log

MLflow UI

# Start MLflow UI
mlflow ui --host 0.0.0.0 --port 5000

# Access at: http://localhost:5000
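
The metrics tracked during training appear as runs in this UI. A minimal sketch of the kind of logging track_mlflow() performs (metric names follow the documentation; the values below are placeholders):

import mlflow

with mlflow.start_run(run_name="model_trainer"):
    mlflow.log_metric("f1_score", 0.91)    # placeholder value
    mlflow.log_metric("precision", 0.89)   # placeholder value
    mlflow.log_metric("recall", 0.88)      # placeholder value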

📈 Performance Metrics

Expected Model Performance

  • F1 Score: 0.80-0.95 (depending on dataset)
  • Precision: 0.80-0.90
  • Recall: 0.75-0.90
  • Training Time: 2-5 minutes (depends on hardware)

Data Statistics

  • Total Records: ~11,000 (phisingData.csv)
  • Features: 30
  • Target Classes: Binary (0, 1)
  • Missing Values: Handled by KNN Imputation


📚 Additional Resources


📄 License

This project is licensed under the GNU General Public License v3.0 - see LICENSE file for details.


🀝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit changes (git commit -m 'Add AmazingFeature')
  4. Push to branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request