- Project Overview
- Architecture & Workflow
- Project Structure
- Prerequisites & Setup
- Installation & Configuration
- Project Workflow
- API Endpoints
- Docker Setup
- CI/CD Pipeline
- Troubleshooting
## Project Overview

Network Security is an end-to-end machine learning application that detects phishing and other network security threats using classification models. The project implements a complete MLOps pipeline covering data ingestion, validation, transformation, model training, and deployment.
- Data Pipeline: MongoDB integration for data ingestion
- Data Validation: Schema validation and drift detection
- Data Transformation: KNN imputation for missing values
- Model Training: Multiple classification algorithms with hyperparameter tuning
- MLflow Integration: Experiment tracking and model monitoring
- REST API: FastAPI for model serving and predictions
- Cloud Integration: AWS S3 for artifact storage
- Docker Support: Containerized deployment
- CI/CD: GitHub Actions for automated workflows
## Architecture & Workflow

```
DATA SOURCE (MongoDB)
        │
        ▼
1. DATA INGESTION (CSV Export & Split)
   - Connects to MongoDB
   - Exports the collection as a DataFrame
   - Splits into train/test sets (80/20)
   - Saves to: Artifacts/<timestamp>/data_ingestion/
        │
        ▼
2. DATA VALIDATION (Schema & Drift Detection)
   - Validates the number of columns against schema.yaml
   - Detects data drift using the two-sample Kolmogorov-Smirnov test
   - Generates a drift report
   - Saves valid data to: data_validation/validated/
        │
        ▼
3. DATA TRANSFORMATION (Feature Engineering & Imputation)
   - Handles missing values with KNNImputer (k=3)
   - Creates a preprocessing pipeline
   - Transforms features using the fitted preprocessor
   - Saves transformed data as .npy files
   - Saves the preprocessor object for inference
        │
        ▼
4. MODEL TRAINING (Hyperparameter Tuning & Evaluation)
   - Trains multiple algorithms:
     • Random Forest
     • Decision Tree
     • Gradient Boosting
     • Logistic Regression
     • AdaBoost
   - GridSearchCV for hyperparameter optimization
   - Evaluates on train/test sets
   - Calculates metrics: F1, Precision, Recall
   - Tracks experiments with MLflow
   - Selects the best model by test-set score
        │
        ▼
5. MODEL DEPLOYMENT (S3 & Final Model Creation)
   - Saves artifacts to an S3 bucket
   - Creates the NetworkModel wrapper (preprocessor + model)
   - Saves the final_model/ directory with:
     • model.pkl (trained classifier)
     • preprocessor.pkl (transformation pipeline)
        │
        ▼
6. INFERENCE (REST API Endpoint)
   - Loads preprocessor and model from final_model/
   - Creates a NetworkModel instance
   - Transforms input features with the preprocessor
   - Generates predictions
   - Returns predictions to the user
```
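The inference flow in step 6 centers on the `NetworkModel` wrapper (`estimator.py`). A minimal sketch of how such a wrapper might look — the class shape mirrors the steps above, but the demo preprocessor and classifier are stand-ins, not the project's pickled artifacts:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


class NetworkModel:
    """Bundles a fitted preprocessor with a trained classifier so that
    inference applies the exact transformation used at training time."""

    def __init__(self, preprocessor, model):
        self.preprocessor = preprocessor
        self.model = model

    def predict(self, x):
        # Transform raw features, then delegate to the classifier
        x_transformed = self.preprocessor.transform(x)
        return self.model.predict(x_transformed)


# Demo with stand-in components (the real app would unpickle
# final_model/preprocessor.pkl and final_model/model.pkl instead)
X = np.array([[0.1, 1.0], [0.9, 0.2], [0.2, 0.8], [0.8, 0.1]])
y = np.array([0, 1, 0, 1])
preprocessor = StandardScaler().fit(X)
classifier = LogisticRegression().fit(preprocessor.transform(X), y)

wrapper = NetworkModel(preprocessor, classifier)
predictions = wrapper.predict(X)
```

Keeping the preprocessor inside the wrapper prevents training/serving skew: callers can never accidentally feed raw features to the classifier.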
## Project Structure

```
Network_Security/
├── Network_security/
│   ├── components/
│   │   ├── data_ingestion.py          # MongoDB → CSV conversion
│   │   ├── data_validation.py         # Schema & drift validation
│   │   ├── data_transformation.py     # Feature preprocessing
│   │   └── model_trainer.py           # Model training & selection
│   ├── entity/
│   │   ├── config_entity.py           # Configuration classes
│   │   └── artifacts_entity.py        # Artifact data classes
│   ├── exception/
│   │   └── exception.py               # Custom exception handling
│   ├── logging/
│   │   └── logger.py                  # Logging configuration
│   ├── utils/
│   │   ├── main_utils/
│   │   │   └── utils.py               # Helper functions
│   │   └── ml_utils/
│   │       ├── metric/
│   │       │   └── classification_metric.py  # Metrics calculation
│   │       └── model/
│   │           └── estimator.py       # NetworkModel wrapper
│   ├── constants/
│   │   └── training_pipeline/
│   │       └── __init__.py            # Pipeline constants
│   ├── cloud/
│   │   └── s3_syncer.py               # AWS S3 integration
│   └── pipeline/
│       └── training_pipeline.py       # Main orchestration
├── data_schema/
│   └── schema.yaml                    # Data validation schema
├── templates/
│   └── table.html                     # HTML template for predictions
├── logs/                              # Training logs
├── Artifacts/                         # Generated artifacts
├── final_model/                       # Deployed model artifacts
├── prediction_output/                 # Prediction results
│
├── app.py                             # FastAPI application
├── main.py                            # Direct execution entry point
├── push_data.py                       # MongoDB data loader
│
├── requirements.txt                   # Python dependencies
├── setup.py                           # Package configuration
├── DOCKERFILE                         # Container configuration
├── .github/
│   └── workflows/
│       └── main.yml                   # GitHub Actions CI/CD
├── .env                               # Environment variables
├── .env.example                       # Environment template
├── .gitignore                         # Git ignore rules
├── LICENSE                            # GPL v3 License
└── README.md                          # This file
```
## Prerequisites & Setup

```bash
# Clone the repository
git clone https://github.com/ashmijha/Network_Security.git
cd Network_Security

# Create virtual environment
python -m venv env

# Activate virtual environment
# On Linux/macOS:
source env/bin/activate
# On Windows:
env\Scripts\activate
```

## Installation & Configuration

```bash
# Install dependencies
pip install -r requirements.txt

# Create .env file from template
cp .env.example .env

# Edit .env with your credentials
nano .env
```

Required `.env` variables:
```bash
MONGODB_URI=mongodb+srv://<username>:<password>@<cluster>.mongodb.net/<database>?retryWrites=true&w=majority
AWS_ACCESS_KEY_ID=your_aws_access_key
AWS_SECRET_ACCESS_KEY=your_aws_secret_key
AWS_DEFAULT_REGION=us-east-1
```

## Project Workflow

```bash
# Run the data push script
python push_data.py

# This will:
# 1. Read Network_Data/phisingData.csv
# 2. Convert CSV to JSON records
# 3. Insert into the MongoDB ASHMI_DB.NetworkData collection
```
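The CSV-to-JSON-to-MongoDB flow can be sketched roughly as follows. Function names here are illustrative (not necessarily those in `push_data.py`); the database and collection names are taken from the description above:

```python
import json

import pandas as pd


def csv_to_records(csv_path: str) -> list:
    """Read a CSV and convert each row into a JSON-style dict,
    ready for MongoDB's insert_many()."""
    df = pd.read_csv(csv_path)
    df.reset_index(drop=True, inplace=True)
    return list(json.loads(df.T.to_json()).values())


def push_to_mongo(records, mongo_uri, db_name="ASHMI_DB", collection="NetworkData"):
    """Insert the records into MongoDB (import deferred so the CSV helper
    stays usable without pymongo installed)."""
    from pymongo import MongoClient

    client = MongoClient(mongo_uri)
    client[db_name][collection].insert_many(records)
    return len(records)
```

The transpose-then-`to_json` trick turns each DataFrame row into one dict, which is the shape `insert_many()` expects.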
```bash
# Run the main training pipeline
python main.py

# Or use the modular approach in main.py:
# - Data Ingestion
# - Data Validation
# - Data Transformation
# - Model Training
```

```bash
# Start the API server
python app.py

# The server will run at: http://localhost:8000
# Access interactive docs:
# - Swagger UI: http://localhost:8000/docs
# - ReDoc:      http://localhost:8000/redoc
```

```bash
# Build the Docker image
docker build -t network-security:latest .

# Run the container
docker run -p 8000:8000 \
  -e MONGODB_URI="your_mongodb_uri" \
  -e AWS_ACCESS_KEY_ID="your_aws_key" \
  -e AWS_SECRET_ACCESS_KEY="your_aws_secret" \
  network-security:latest
```
```
# Loads data from MongoDB
# File: Network_security/components/data_ingestion.py

DataIngestion:
├── export_collection_as_dataframe()
│   └── Connects to MongoDB and fetches NetworkData
├── export_data_into_feature_store()
│   └── Saves full dataset as CSV
└── split_data_as_train_test()
    ├── 80% training data
    └── 20% testing data
```

Output artifacts:
- `Artifacts/<timestamp>/data_ingestion/feature_store/phisingData.csv`
- `Artifacts/<timestamp>/data_ingestion/ingested/train.csv`
- `Artifacts/<timestamp>/data_ingestion/ingested/test.csv`
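The 80/20 split step maps directly onto scikit-learn's `train_test_split`. A minimal sketch with synthetic data (the real component splits the MongoDB export; the column names here are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_as_train_test(df: pd.DataFrame, test_size: float = 0.2,
                             random_state: int = 42):
    """Split the exported DataFrame into train/test sets (80/20 by default)."""
    train_df, test_df = train_test_split(
        df, test_size=test_size, random_state=random_state
    )
    return train_df, test_df


# Example with a small synthetic frame
df = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})
train_df, test_df = split_data_as_train_test(df)
```

Fixing `random_state` makes reruns reproducible, which matters when comparing artifacts across timestamped pipeline runs.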
```
# Validates data quality and detects drift
# File: Network_security/components/data_validation.py

DataValidation:
├── validate_number_of_columns()
│   └── Checks against data_schema/schema.yaml
└── detect_dataset_drift()
    └── Uses the two-sample Kolmogorov-Smirnov test (p-value threshold: 0.05)
```

Output artifacts:
- `Artifacts/<timestamp>/data_validation/validated/train.csv`
- `Artifacts/<timestamp>/data_validation/validated/test.csv`
- `Artifacts/<timestamp>/data_validation/drift_report/report.yaml`
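The per-column drift check described above can be sketched with SciPy's `ks_2samp`. This is a minimal version (the real component also serializes the report to YAML):

```python
import pandas as pd
from scipy.stats import ks_2samp


def detect_dataset_drift(base_df: pd.DataFrame, current_df: pd.DataFrame,
                         threshold: float = 0.05) -> dict:
    """Run a two-sample Kolmogorov-Smirnov test per column; a p-value
    below the threshold flags that column as drifted."""
    report = {}
    for column in base_df.columns:
        result = ks_2samp(base_df[column], current_df[column])
        report[column] = {
            "p_value": float(result.pvalue),
            "drift_detected": bool(result.pvalue < threshold),
        }
    return report
```

A high p-value means the two samples are plausibly from the same distribution; a low one rejects that hypothesis and marks the column as drifted.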
```
# Transforms features using KNN imputation
# File: Network_security/components/data_transformation.py

DataTransformation:
├── get_data_transformer_object()
│   └── Creates a Pipeline with KNNImputer(n_neighbors=3)
└── initiate_data_transformation()
    ├── Fits the preprocessor on training data
    ├── Transforms train features
    ├── Transforms test features
    └── Appends the target column
```

Output artifacts:
- `Artifacts/<timestamp>/data_transformation/transformed/train.npy`
- `Artifacts/<timestamp>/data_transformation/transformed/test.npy`
- `Artifacts/<timestamp>/data_transformation/transformed_object/preprocessing.pkl`
- `final_model/preprocessor.pkl` (for inference)
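The preprocessor can be sketched as a one-step scikit-learn `Pipeline`, mirroring the `KNNImputer(n_neighbors=3)` described above (a minimal sketch, not the project's exact builder):

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.pipeline import Pipeline


def get_data_transformer_object() -> Pipeline:
    """Preprocessing pipeline: fill each missing value using its
    3 nearest neighbours in feature space."""
    return Pipeline([("imputer", KNNImputer(n_neighbors=3))])


# Fit on training features; the same fitted object is reused at inference
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [4.0, np.nan]])
preprocessor = get_data_transformer_object()
X_transformed = preprocessor.fit_transform(X_train)
```

Wrapping the imputer in a `Pipeline` (even with one step) keeps the door open for adding scaling or encoding steps later without changing the save/load code.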
```
# Trains and selects the best model
# File: Network_security/components/model_trainer.py

ModelTrainer:
├── train_model()
│   ├── Initializes 5 classifiers
│   ├── GridSearchCV for hyperparameters
│   ├── Trains and evaluates each model
│   └── Selects the best by test score
├── track_mlflow()
│   └── Logs metrics: F1, Precision, Recall
└── initiate_model_trainer()
    └── Creates a ModelTrainerArtifact
```

Models trained:

| Model | Hyperparameters |
|---|---|
| Random Forest | `n_estimators`: [8, 16, 32, 128, 256] |
| Decision Tree | `criterion`: ['gini', 'entropy', 'log_loss'] |
| Gradient Boosting | `learning_rate`: [0.1, 0.01, 0.05, 0.001] |
| Logistic Regression | default |
| AdaBoost | `n_estimators`, `learning_rate` |

Output artifacts:
- `Artifacts/<timestamp>/model_trainer/trained_model/model.pkl`
- `final_model/model.pkl` (for inference)
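The grid-search-and-select loop can be sketched as below, showing two of the five candidates with grids taken from the table above (a sketch only; the real trainer also logs each run to MLflow and wraps the winner in the NetworkModel):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score


def evaluate_models(X_train, y_train, X_test, y_test, models, params):
    """Grid-search each candidate, refit with its best params, and score
    the refitted model on the held-out test set."""
    report = {}
    for name, model in models.items():
        search = GridSearchCV(model, params.get(name, {}), cv=3)
        search.fit(X_train, y_train)
        report[name] = f1_score(y_test, search.best_estimator_.predict(X_test))
    return report
```

The model with the highest test score in `report` is the one persisted to `final_model/model.pkl`.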
## API Endpoints

### `GET /`

Redirects to the Swagger documentation at `/docs`.

### `GET /train`

Description: triggers the complete training pipeline.

Response:

```json
{
  "message": "Training is successful"
}
```

Example:

```bash
curl -X GET "http://localhost:8000/train"
```

### `POST /predict`

Description: uploads a CSV file and returns predictions.

Parameters:

- `file` (multipart/form-data): CSV file with features

Response: HTML table with predictions.

Example:

```bash
curl -X POST "http://localhost:8000/predict" \
  -F "file=@input_data.csv"
```

Input CSV format:

```csv
feature1,feature2,feature3,...,featureN
0.5,0.3,0.8,...,0.2
0.6,0.4,0.7,...,0.3
```

## Docker Setup

```dockerfile
FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y awscli git

# Copy requirements
COPY requirements.txt .

# Install Python dependencies
RUN pip install --no-cache-dir -r requirements.txt

# Copy project
COPY . .

# Install package
RUN pip install -e .

# Expose port
EXPOSE 8000

# Run application
CMD ["python", "app.py"]
```
```bash
# Build image
docker build -t network-security:latest .

# Run container with environment variables
docker run -d \
  --name network-security \
  -p 8000:8000 \
  -e MONGODB_URI="$MONGODB_URI" \
  -e AWS_ACCESS_KEY_ID="$AWS_ACCESS_KEY_ID" \
  -e AWS_SECRET_ACCESS_KEY="$AWS_SECRET_ACCESS_KEY" \
  network-security:latest

# View logs
docker logs -f network-security

# Stop container
docker stop network-security
```

## CI/CD Pipeline

File: `.github/workflows/main.yml`

Workflow steps:
- Trigger: on push to the `main` branch
- Checkout: clone the repository
- Setup Python: install Python 3.9
- Install Dependencies: `pip install -r requirements.txt`
- Linting (optional): code quality checks
- Run Tests (optional): unit tests
- Build Docker Image: create the container image
- Push to Registry (optional): Docker Hub or ECR
- Deploy (optional): deploy to the cloud

```bash
# The workflow runs automatically on:
git push origin main

# Or trigger manually from the GitHub UI:
# Actions → Select workflow → Run workflow
```

In the GitHub repository settings, add these secrets:
- `MONGODB_URI` → your MongoDB connection string
- `AWS_ACCESS_KEY_ID` → your AWS access key
- `AWS_SECRET_ACCESS_KEY` → your AWS secret key
- `DOCKER_USERNAME` → Docker Hub username (optional)
- `DOCKER_PASSWORD` → Docker Hub token (optional)
## Troubleshooting

A new log file is created for each run:

```
logs/
└── MM_DD_YYYY_HH_MM_SS.log
```

```bash
# Follow logs in real-time
tail -f logs/01_01_2025_10_30_45.log

# Search for errors
grep "ERROR" logs/*.log

# View specific component logs
grep "DataTransformation" logs/*.log
```

```bash
# Start MLflow UI
mlflow ui --host 0.0.0.0 --port 5000

# Access at: http://localhost:5000
```

Expected model performance:

- F1 Score: 0.80-0.95 (depending on dataset)
- Precision: 0.80-0.90
- Recall: 0.75-0.90
- Training Time: 2-5 minutes (depends on hardware)
Dataset statistics:

- Total Records: ~11,000 (phisingData.csv)
- Features: 30
- Target Classes: Binary (0, 1)
- Missing Values: Handled by KNN Imputation
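The metrics above are computed per run by `classification_metric.py`; a minimal equivalent with scikit-learn (the dict keys here are illustrative, not necessarily the project's field names):

```python
from sklearn.metrics import f1_score, precision_score, recall_score


def get_classification_score(y_true, y_pred) -> dict:
    """Compute the three classification metrics tracked by the pipeline."""
    return {
        "f1_score": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
```

Computing all three together matters for phishing detection: precision tracks false alarms, recall tracks missed threats, and F1 balances the two.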
- FastAPI Documentation: https://fastapi.tiangolo.com/
- MLflow Documentation: https://mlflow.org/docs/
- MongoDB Documentation: https://docs.mongodb.com/
- AWS S3 Documentation: https://docs.aws.amazon.com/s3/
- scikit-learn Documentation: https://scikit-learn.org/
This project is licensed under the GNU General Public License v3.0 - see LICENSE file for details.
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request