This document provides detailed information about the machine learning models used in the APT Detection System, including the hybrid model approach, feature selection, data balancing, and model training.
The APT Detection System uses a hybrid approach that combines multiple machine learning models to detect advanced persistent threats. This approach leverages the strengths of different algorithms to provide more accurate and robust threat detection.
The system employs a hybrid model architecture that combines:
- LightGBM: A gradient boosting framework that uses tree-based learning algorithms
- Bi-LSTM: A bidirectional long short-term memory neural network for sequence analysis
These models are combined in a hybrid classifier that leverages the strengths of both approaches.
graph TD
FE[Feature Extraction] --> FS[Feature Selection]
FS --> DB[Data Balancing]
DB --> MT[Model Training]
MT --> LGB[LightGBM Model]
MT --> LSTM[Bi-LSTM Model]
LGB --> HM[Hybrid Model]
LSTM --> HM
HM --> PRED[Prediction]
The system uses the HHOSSSA (Hybrid Harmony-Owl Search-Salp Swarm Algorithm) for feature selection, which is a novel approach that combines multiple metaheuristic algorithms.
The HHOSSSA algorithm combines:
- Harmony Search (HS): Inspired by the improvisation process of musicians
- Owl Search Algorithm (OSA): Based on the hunting behavior of owls
- Salp Swarm Algorithm (SSA): Inspired by the swarming behavior of salps
This hybrid approach provides several advantages:
- Exploration: Better exploration of the feature space
- Exploitation: More effective exploitation of promising features
- Convergence: Faster convergence to optimal feature subsets
- Robustness: More robust to local optima
The HHOSSSA feature selection is implemented in feature_selection/hhosssa_feature_selection.py. The key steps are:
- Initialization: Initialize a population of feature subsets
- Fitness Evaluation: Evaluate each subset using a classifier and a fitness function
- Hybrid Search: Apply HS, OSA, and SSA operators to generate new feature subsets
- Selection: Select the best feature subsets for the next iteration
- Termination: Stop when a termination criterion is met (e.g., maximum iterations)
from feature_selection.hhosssa_feature_selection import HHOSSSAFeatureSelection
# Create feature selector
feature_selector = HHOSSSAFeatureSelection(
n_features=10, # Number of features to select
population_size=30, # Population size
max_iterations=100, # Maximum iterations
classifier='lightgbm' # Classifier to use for evaluation
)
# Fit and transform
X_selected = feature_selector.fit_transform(X, y)The system uses HHOSSSA-SMOTE for data balancing, which is a novel approach that combines the HHOSSSA algorithm with Synthetic Minority Over-sampling Technique (SMOTE).
The HHOSSSA-SMOTE algorithm:
- Uses HHOSSSA to select the most informative features
- Applies SMOTE to generate synthetic samples for the minority class
- Optimizes the SMOTE parameters using the HHOSSSA algorithm
This approach provides several advantages:
- Quality: Higher quality synthetic samples
- Relevance: Synthetic samples are generated based on the most informative features
- Efficiency: More efficient use of synthetic samples
- Performance: Better classification performance on imbalanced datasets
The HHOSSSA-SMOTE data balancing is implemented in data_balancing/hhosssa_smote.py. The key steps are:
- Feature Selection: Select the most informative features using HHOSSSA
- Parameter Optimization: Optimize SMOTE parameters using HHOSSSA
- Synthetic Sample Generation: Generate synthetic samples using SMOTE with optimized parameters
- Sample Selection: Select the most informative synthetic samples
from data_balancing.hhosssa_smote import HHOSSSASMOTE
# Create data balancer
data_balancer = HHOSSSASMOTE(
sampling_strategy='auto', # Sampling strategy
k_neighbors=5, # Number of nearest neighbors
n_features=10, # Number of features to select
population_size=30, # Population size
max_iterations=100 # Maximum iterations
)
# Fit and resample
X_resampled, y_resampled = data_balancer.fit_resample(X, y)LightGBM is a gradient boosting framework that uses tree-based learning algorithms. It is designed for distributed and efficient training and has the following advantages:
- Speed: Faster training speed and lower memory usage
- Accuracy: Higher accuracy than other boosting algorithms
- Efficiency: Support for parallel, distributed, and GPU learning
- Handling: Capable of handling large-scale data
The LightGBM model is implemented in models/lightgbm_model.py. The key components are:
- Model Definition: Define the LightGBM model with appropriate parameters
- Training: Train the model on the training data
- Prediction: Make predictions on new data
- Evaluation: Evaluate the model performance
The LightGBM model can be configured in config.yaml:
model_paths:
lightgbm: lightgbm_model.pkl
training_params:
lightgbm:
num_leaves: 31
learning_rate: 0.05
n_estimators: 100
max_depth: -1
min_child_samples: 20
subsample: 0.8
colsample_bytree: 0.8
reg_alpha: 0.1
reg_lambda: 0.1
random_state: 42from models.lightgbm_model import LightGBMModel
# Create model
model = LightGBMModel(
num_leaves=31,
learning_rate=0.05,
n_estimators=100
)
# Train model
model.train(X_train, y_train)
# Make predictions
predictions = model.predict(X_test)
# Save model
model.save('models/lightgbm_model.pkl')
# Load model
model = LightGBMModel.load('models/lightgbm_model.pkl')Bi-LSTM (Bidirectional Long Short-Term Memory) is a type of recurrent neural network that processes data in both forward and backward directions. It is particularly effective for sequence data and has the following advantages:
- Context: Captures context from both past and future
- Memory: Long-term memory capabilities
- Sequences: Effective for sequence data
- Patterns: Captures complex temporal patterns
The Bi-LSTM model is implemented in models/bilstm_model.py. The key components are:
- Model Definition: Define the Bi-LSTM model with appropriate architecture
- Training: Train the model on the training data
- Prediction: Make predictions on new data
- Evaluation: Evaluate the model performance
The Bi-LSTM model can be configured in config.yaml:
model_paths:
bilstm: bilstm_model.h5
training_params:
bilstm:
epochs: 5
batch_size: 32
lstm_units: 64
dropout_rate: 0.2
recurrent_dropout: 0.2
optimizer: adam
learning_rate: 0.001from models.bilstm_model import BiLSTMModel
# Create model
model = BiLSTMModel(
input_shape=(sequence_length, n_features),
lstm_units=64,
dropout_rate=0.2
)
# Train model
model.train(X_train, y_train, epochs=5, batch_size=32)
# Make predictions
predictions = model.predict(X_test)
# Save model
model.save('models/bilstm_model.h5')
# Load model
model = BiLSTMModel.load('models/bilstm_model.h5')The hybrid model combines the LightGBM and Bi-LSTM models to leverage the strengths of both approaches. It uses a weighted ensemble approach to combine the predictions.
The hybrid approach provides several advantages:
- Complementary Strengths: Combines the strengths of tree-based and neural network models
- Robustness: More robust to different types of data and attack patterns
- Accuracy: Higher accuracy than individual models
- Adaptability: Better adaptability to evolving threats
The hybrid model is implemented in models/hybrid_classifier.py. The key components are:
- Model Combination: Combine the LightGBM and Bi-LSTM models
- Weighted Ensemble: Use a weighted ensemble approach to combine predictions
- Adaptive Weighting: Adapt weights based on model performance
- Confidence Scores: Generate confidence scores for predictions
The hybrid model can be configured in config.yaml:
hybrid_model:
lightgbm_weight: 0.6
bilstm_weight: 0.4
threshold: 0.7
adaptive_weights: truefrom models.hybrid_classifier import HybridClassifier
from models.lightgbm_model import LightGBMModel
from models.bilstm_model import BiLSTMModel
# Load individual models
lightgbm_model = LightGBMModel.load('models/lightgbm_model.pkl')
bilstm_model = BiLSTMModel.load('models/bilstm_model.h5')
# Create hybrid model
hybrid_model = HybridClassifier(
models=[lightgbm_model, bilstm_model],
weights=[0.6, 0.4],
threshold=0.7
)
# Make predictions
predictions, confidence_scores = hybrid_model.predict(X_test)The model training process involves several steps:
- Data Preprocessing: Clean and normalize the data
- Feature Selection: Select the most informative features using HHOSSSA
- Data Balancing: Balance the dataset using HHOSSSA-SMOTE
- Model Training: Train the LightGBM and Bi-LSTM models
- Hybrid Model: Combine the models into a hybrid classifier
- Evaluation: Evaluate the model performance using cross-validation
The model training process is implemented in models/train_models.py. The key steps are:
- Load Data: Load the training data
- Preprocess Data: Preprocess the data
- Select Features: Select features using HHOSSSA
- Balance Data: Balance the data using HHOSSSA-SMOTE
- Train Models: Train the LightGBM and Bi-LSTM models
- Create Hybrid Model: Create the hybrid model
- Evaluate Models: Evaluate the models using cross-validation
- Save Models: Save the trained models
python models/train_models.py --data_path data/training_data.csv --save_dir models/savedThe model evaluation process involves several metrics:
- Accuracy: Percentage of correct predictions
- Precision: Percentage of true positives among positive predictions
- Recall: Percentage of true positives among actual positives
- F1 Score: Harmonic mean of precision and recall
- AUC-ROC: Area under the Receiver Operating Characteristic curve
- Confusion Matrix: Matrix showing true positives, false positives, true negatives, and false negatives
The model evaluation is implemented in evaluation/evaluation_metrics.py. The key metrics are:
- Classification Metrics: Accuracy, precision, recall, F1 score
- ROC Curve: ROC curve and AUC-ROC
- Confusion Matrix: Confusion matrix
- Feature Importance: Feature importance for the LightGBM model
- Learning Curves: Learning curves for the models
from evaluation.evaluation_metrics import evaluate_model
# Evaluate model
metrics = evaluate_model(model, X_test, y_test)
# Print metrics
print(f"Accuracy: {metrics['accuracy']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall: {metrics['recall']:.4f}")
print(f"F1 Score: {metrics['f1_score']:.4f}")
print(f"AUC-ROC: {metrics['auc_roc']:.4f}")The cross-validation process involves:
- K-Fold Cross-Validation: Split the data into K folds
- Stratified Sampling: Ensure each fold has the same class distribution
- Model Training: Train the model on K-1 folds
- Model Evaluation: Evaluate the model on the remaining fold
- Repeat: Repeat the process K times
The cross-validation is implemented in evaluation/cross_validation.py. The key steps are:
- Data Splitting: Split the data into K folds
- Model Training: Train the model on each fold
- Model Evaluation: Evaluate the model on each fold
- Metrics Aggregation: Aggregate the metrics across all folds
from evaluation.cross_validation import cross_validate
# Perform cross-validation
cv_results = cross_validate(model, X, y, n_splits=5)
# Print results
print(f"Mean Accuracy: {cv_results['accuracy_mean']:.4f} ± {cv_results['accuracy_std']:.4f}")
print(f"Mean Precision: {cv_results['precision_mean']:.4f} ± {cv_results['precision_std']:.4f}")
print(f"Mean Recall: {cv_results['recall_mean']:.4f} ± {cv_results['recall_std']:.4f}")
print(f"Mean F1 Score: {cv_results['f1_score_mean']:.4f} ± {cv_results['f1_score_std']:.4f}")
print(f"Mean AUC-ROC: {cv_results['auc_roc_mean']:.4f} ± {cv_results['auc_roc_std']:.4f}")The machine learning models in the APT Detection System provide a robust, accurate approach to detecting advanced persistent threats. By combining multiple models and techniques, the system can effectively identify a wide range of threats and adapt to evolving attack patterns.