A data science project to predict the probability of a machine encountering malware based on telemetry data collected from Microsoft Defender.
Built using Python, Dask, LightGBM, and essential data science libraries for handling large-scale structured data.
🎥Demo Video
2021-11-29.23-08-56_Trim.mp4
📊 Problem Statement
With increasing cyber threats, early detection of malware is crucial for protecting user devices and data.
This project aims to predict the likelihood of a malware detection on a machine using telemetry data, enabling proactive defense mechanisms for organizations and end-users alike.
Generated malware detection probability predictions.
Saved results to result.csv.
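A minimal sketch of this last step, assuming the output holds one row per machine with an identifier and a predicted probability. The column names `MachineIdentifier` and `HasDetections` and the helper `save_predictions` are illustrative assumptions, not taken from the project code.

```python
import csv


def save_predictions(machine_ids, probabilities, path="result.csv"):
    """Write one (id, probability) row per machine to a CSV file.

    The header names below are assumed for illustration only.
    """
    if len(machine_ids) != len(probabilities):
        raise ValueError("machine_ids and probabilities must have the same length")
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["MachineIdentifier", "HasDetections"])
        for machine_id, prob in zip(machine_ids, probabilities):
            # Fixed precision keeps the file compact and easy to diff.
            writer.writerow([machine_id, f"{prob:.6f}"])
    return path
```

Writing probabilities rather than hard 0/1 labels lets a downstream consumer pick its own decision threshold.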
📊 Results
| Metric | Validation Set Value |
| --- | --- |
| Accuracy | ~0.734 |
| AUC Score | ~0.79 |
| F1 Score | ~0.73 |
The LightGBM model separated infected from safe machines reasonably well, with a validation AUC of about 0.79.
The feature importance plot highlighted SmartScreen, AVProductStatesIdentifier, and Platform among the most influential features.
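For readers who want to see what the three reported metrics measure, here is a small pure-Python sketch that computes them from true labels and predicted probabilities. The 0.5 decision threshold and the function names are assumptions for illustration; the project may compute its metrics differently (for example with scikit-learn). The pairwise AUC shown here is quadratic in the number of samples, so it is only suitable for small inputs.

```python
def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of machines whose thresholded prediction matches the label."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    return sum(int(p == t) for p, t in zip(preds, y_true)) / len(y_true)


def f1_score(y_true, y_prob, threshold=0.5):
    """Harmonic mean of precision and recall for the positive class."""
    preds = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, y_prob):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Threshold-free, unlike accuracy and F1.
    """
    pos = [p for p, t in zip(y_prob, y_true) if t == 1]
    neg = [p for p, t in zip(y_prob, y_true) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Accuracy and F1 depend on the chosen threshold, while AUC only depends on how the model ranks machines, which is why the three values can tell slightly different stories about the same model.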
🔮 Future Scope
Implement cross-validation for more robust performance estimation.
Integrate hyperparameter tuning using Optuna or GridSearchCV.
Apply advanced missing value imputation instead of row removal.
Try additional algorithms (XGBoost, CatBoost) for benchmarking.
Deploy a scalable API service to accept telemetry data and predict malware probability in real-time.
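As a starting point for the cross-validation item above, the sketch below splits the data into stratified folds and averages a validation score across them. It is pure Python and model-agnostic: `fit_and_score` is a hypothetical callback that would train a model (for instance LightGBM) on the training fold and return one score for the validation fold. All names here are assumptions, not part of the project.

```python
import random
from statistics import mean, stdev


def stratified_kfold_indices(labels, n_splits=5, seed=42):
    """Yield (train_idx, valid_idx) pairs with a similar class ratio per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the indices of this class round-robin across the folds.
        for pos, i in enumerate(idx):
            folds[pos % n_splits].append(i)
    for k in range(n_splits):
        valid = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, valid


def cross_validate(features, labels, fit_and_score, n_splits=5, seed=42):
    """Return (mean, standard deviation) of the score over all folds.

    fit_and_score(X_train, y_train, X_valid, y_valid) is a hypothetical
    callback supplied by the caller; it must return a single number.
    """
    scores = []
    for train, valid in stratified_kfold_indices(labels, n_splits, seed):
        scores.append(fit_and_score(
            [features[i] for i in train], [labels[i] for i in train],
            [features[i] for i in valid], [labels[i] for i in valid],
        ))
    return mean(scores), stdev(scores)
```

Reporting the standard deviation alongside the mean shows how much the score moves between folds, which a single validation split cannot reveal.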
🚀 Setup Instructions
Clone the repository
git clone https://github.com/yourusername/malware-prediction-ml.git
cd malware-prediction-ml