Towards Explainable Drift Detection and Early Retrain in ML-based Malware Detection Pipelines

This repository develops an explainable methodology for capturing concept drift in cybersecurity data and improving on existing drift detection methods.

Paper: Towards Explainable Drift Detection and Early Retrain in ML-based Malware Detection Pipelines

Dataset: Drebin (Citation)
Libraries Used: pandas, NumPy, scikit-learn, scikit-multiflow
Conference: SIG SIDAR Conference on Detection of Intrusion and Malware & Vulnerability Assessment, 2025 (Graz, Austria)

Goal

Concept Drift Evaluation assesses the state of the art in drift detection methods for cybersecurity, and proposes and develops alternative methods that identify dataset shift more efficiently. Our approach analyzes the sub-classes and introduces two drift detectors: Class-aware Drift Detectors and Concept-Aware Drift Detectors. The new framework has been analysed using two industry-standard datasets, DREBIN and AndroZoo.
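To make the idea of a class-aware detector concrete, here is a minimal sketch in plain Python. It is not the implementation from this repository or the paper: it assumes that "class-aware" can be approximated by keeping one error-rate detector per true class, and it uses a simplified DDM-style rule (after Gama et al.'s Drift Detection Method) as a stand-in for whichever detector the project actually uses. The names SimpleDDM and ClassAwareDetector are hypothetical.

```python
import math
from collections import defaultdict


class SimpleDDM:
    """Simplified DDM-style detector over a stream of 0/1 prediction errors.

    Illustrative only; thresholds and reset behaviour are assumptions.
    """

    def __init__(self, min_samples=30, drift_level=3.0):
        self.min_samples = min_samples
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.errors = 0
        self.min_p = float("inf")
        self.min_s = float("inf")
        self.min_p_plus_s = float("inf")

    def update(self, error):
        """Feed one error flag (1 = misclassified); return True on drift."""
        self.n += 1
        self.errors += int(error)
        p = self.errors / self.n               # running error rate
        s = math.sqrt(p * (1.0 - p) / self.n)  # its standard deviation
        if self.n < self.min_samples:
            return False
        if p + s < self.min_p_plus_s:          # remember the best state seen
            self.min_p, self.min_s, self.min_p_plus_s = p, s, p + s
        if p + s > self.min_p + self.drift_level * self.min_s:
            self.reset()                       # start over after a drift
            return True
        return False


class ClassAwareDetector:
    """Keeps one detector per true class label (assumed design)."""

    def __init__(self, make_detector=SimpleDDM):
        self.detectors = defaultdict(make_detector)

    def update(self, true_label, predicted_label):
        """Route the error flag to the detector of the sample's true class."""
        error = int(true_label != predicted_label)
        return self.detectors[true_label].update(error)
```

Because each class has its own detector, a rise in the malware error rate can be flagged even when the goodware error rate is stable, which also tells you which class the drift came from.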

Authors

The work was implemented by Jayesh Tripathi, with the assistance and guidance of Dr Heitor Gomes and Dr Marcus Botacin. More detailed information, such as academic papers, can be found on the project page.

Setting Up

Organization

"./Datasets": Pickle version of the extracted and processed datasets
"./run_scripts": Python scripts used to simulate different types of experimental setups and extract the viable datasets (saved in the Datasets folder)
"./Results": Results produced from different experiment setups
"./Graphs": The folder which contains the aggregated results of all experiments
"./batch_scripts": Batch files which can run the Python scripts
"./graph_scripts": Python scripts which are used to produce different graph-based results.

Dependencies

Compiled on

Run Files: Requires Python version up to 3.10.14*
Graphing Files: Requires Python version up to 3.10.14
Anaconda: Anaconda 2024.10-1

Environment Setup

The project uses several distribution shift detectors from the scikit-multiflow library. The library is not compatible with the latest versions of numpy and pandas, which have changed significantly since the library was last updated. Installing the correct dependency versions is therefore essential for the project to work. The requirements are stored in the YAML file requirements.yml, which can be used to set up the Anaconda environment.
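As a quick sanity check before running anything, a small guard can confirm that the interpreter respects the Python version bound listed under Dependencies. This is a sketch only; the helper name python_version_supported is hypothetical and is not part of the repository.

```python
import sys

# Upper bound taken from the Dependencies section of this README.
MAX_SUPPORTED = (3, 10, 14)


def python_version_supported(version=None, max_supported=MAX_SUPPORTED):
    """Return True if the (major, minor, micro) version is within the bound.

    When no version is given, the running interpreter's version is checked.
    """
    current = tuple(version) if version is not None else tuple(sys.version_info[:3])
    return current <= max_supported


if __name__ == "__main__":
    if not python_version_supported():
        raise SystemExit("Python version is newer than 3.10.14; see Dependencies.")
```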

Dataset Format and Installation

The project is implemented for datasets whose feature space represents the permissions extracted from Android files. The datasets used for this project are DREBIN and AndroZoo, and they can be processed using the processing files available in the run scripts folder (./run_scripts). To run any other dataset, it needs to be converted into the following format:

Once the dataset has been processed into a format readable by this project, the files also need to be divided into bins that mimic incoming streams of data points. The bins are constrained by the ratio of goodware to malware in each bin. To create the final processed files, run the following code, with the ratio of goodware to malware as one of the run arguments:
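Separately from the repository's own processing script, the binning idea can be illustrated with a minimal sketch. It assumes labels are encoded as 0 for goodware and 1 for malware, that the ratio is a whole number of goodware samples per malware sample, and that leftover samples which cannot fill a complete bin are dropped. The function make_bins and its parameters are hypothetical.

```python
def make_bins(labels, goodware_per_malware, malware_per_bin):
    """Group sample indices into consecutive bins with a fixed class ratio.

    labels: sequence of 0 (goodware) / 1 (malware) in arrival order.
    Returns a list of bins; each bin is a sorted list of sample indices
    holding malware_per_bin malware samples and
    goodware_per_malware * malware_per_bin goodware samples.
    """
    if goodware_per_malware < 1 or malware_per_bin < 1:
        raise ValueError("ratio and bin size must be positive integers")

    goodware_per_bin = goodware_per_malware * malware_per_bin
    goodware = [i for i, y in enumerate(labels) if y == 0]
    malware = [i for i, y in enumerate(labels) if y == 1]

    # Only complete bins are produced; the scarcer class limits the count.
    n_bins = min(len(goodware) // goodware_per_bin,
                 len(malware) // malware_per_bin)

    bins = []
    for b in range(n_bins):
        g = goodware[b * goodware_per_bin:(b + 1) * goodware_per_bin]
        m = malware[b * malware_per_bin:(b + 1) * malware_per_bin]
        bins.append(sorted(g + m))
    return bins
```

Each class is consumed in its own arrival order, so a bin's malware samples may be slightly older or newer than its goodware samples. Whether that is acceptable depends on how strictly the stream's timeline must be preserved.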

Experimental Runs

Running

To run a specific experimental setup, find the appropriate shell command and run it; the experiment produces a pickle file containing its aggregated results. To run multiple experiments, use the script files, which run the experiments with different settings.

Result Format

When an experiment completes, all of its results are stored in a pickle file and saved in the Results folder. The folder is organized hierarchically by the type of retraining strategy (retraining on directed Type 3 drift or retraining on any Type 3 drift) and by the undersampling applied to the dataset. All other experiment settings are encoded in the name of the pickle file.
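The layout described above could be handled along the following lines. This is a sketch under assumptions: the nesting order (strategy, then undersampling), the folder names, the file name, and the helper functions are all hypothetical, and the actual contents of the repository's pickle files are not shown here.

```python
import pickle
from pathlib import Path


def result_path(root, retrain_strategy, undersampling, experiment_name):
    """Build <root>/<retrain_strategy>/<undersampling>/<experiment_name>.pkl."""
    return Path(root) / retrain_strategy / undersampling / f"{experiment_name}.pkl"


def save_results(path, results):
    """Create the folder hierarchy if needed and pickle the results."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, "wb") as fh:
        pickle.dump(results, fh)


def load_results(path):
    """Load a previously saved results object."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

For example, result_path("Results", "any_type3", "undersampled", "ddm_run1") points at Results/any_type3/undersampled/ddm_run1.pkl, with the remaining settings carried in the file name. Only unpickle files you produced yourself or otherwise trust, since loading a pickle can execute arbitrary code.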

Graphing

Once the experiment results have been stored, the scripts in the graph scripts folder (./graph_scripts) can be used to produce graphs that characterise the behaviour of the dataset and evaluate the drifts in it. Some scripts can be run on individual experimental setups; however, graphs that compare different experiments require all of those experiments to be completed first. More information can be found in the script documentation.
