Mathys-Rituper/PDFMalware_model_training

PDFMalware_model_training

Training ML models to detect malicious PDF files

Data

Two datasets are used:

  • CIC-Evasive-PDFMal2022 (https://www.unb.ca/cic/datasets/pdfmal-2022.html): contains 10,000 benign and 10,000 malicious PDF files. The most recent files date from 2019.
  • A new dataset called "NEW": contains about 500 samples of recent PDF malware (2020-2024) gathered from MalwareBazaar.

Features

The features are documented in PDF-Feature-Extractor, which also performs the extraction.

The features are loaded from a CSV file and go through multiple steps of preprocessing.
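This README does not list the individual preprocessing steps, so the following is only a minimal sketch of what loading and preprocessing such a CSV could look like. The column names (`file_size`, `num_pages`, `num_js`, `label`), the inline sample data, and the chosen steps (deduplication, median imputation, standard scaling) are all assumptions made for illustration. They are not the actual output format of PDF-Feature-Extractor or the pipeline used in this repository.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical CSV content. The real column names come from PDF-Feature-Extractor.
CSV_TEXT = """file_size,num_pages,num_js,label
1200,3,0,benign
560,1,2,malicious
,2,1,malicious
1200,3,0,benign
8000,10,0,benign
"""


def load_and_preprocess(csv_source):
    """Load features from a CSV and apply a few common preprocessing steps."""
    df = pd.read_csv(csv_source)

    # Assumed step 1: drop exact duplicate rows.
    df = df.drop_duplicates()

    # Assumed step 2: encode the label as 0 (benign) / 1 (malicious).
    y = (df["label"] == "malicious").astype(int).to_numpy()
    X = df.drop(columns=["label"])

    # Assumed step 3: fill missing values with the column median.
    X = X.fillna(X.median())

    # Assumed step 4: scale every feature to zero mean and unit variance.
    X_scaled = StandardScaler().fit_transform(X)
    return X_scaled, y, list(X.columns)


X, y, feature_names = load_and_preprocess(io.StringIO(CSV_TEXT))
print(X.shape, y, feature_names)
```

In a real pipeline the scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.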

Models

The models are trained and evaluated using scikit-learn:

  • Random Forest
  • SVC
  • Decision Tree
  • KNN (5 neighbors)
  • SVM
  • Naive Bayes
  • Basic neural network (MLPClassifier)

The models are trained using an 80/20 train/test split and evaluated using 10-fold cross-validation.
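The evaluation protocol described above can be sketched with scikit-learn as follows. This is an illustrative sketch, not the repository's training script: the data is synthetic (`make_classification`), only a subset of the listed models is shown, `GaussianNB` is an assumed choice of Naive Bayes variant, and running the 10-fold cross-validation on the training portion is one possible reading of the protocol.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted PDF features (assumption).
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# 80/20 split, stratified so both parts keep the same class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN (5 neighbors)": KNeighborsClassifier(n_neighbors=5),
    "Naive Bayes": GaussianNB(),  # assumed variant
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation on the training portion (accuracy by default).
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)
    # Final fit on the full training portion, scored on the held-out 20%.
    model.fit(X_train, y_train)
    test_accuracy = model.score(X_test, y_test)
    results[name] = (cv_scores.mean(), test_accuracy)
    print(f"{name}: CV accuracy {cv_scores.mean():.3f}, test accuracy {test_accuracy:.3f}")
```

Passing an integer `cv` with a classifier makes `cross_val_score` use stratified folds, which keeps the benign/malicious ratio roughly constant across folds.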

We also experimented with TensorFlow neural networks, but their results were not as good as those of the scikit-learn models.

Key takeaways

  • Models trained only on the CIC dataset perform poorly on newer samples. Training on a mixed dataset built from CIC and NEW improves performance on the newer samples and decreases it only slightly on the older ones.
  • The Random Forest model gave the best overall results.
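The comparison behind the first takeaway can be sketched as follows: train one Random Forest on CIC only and another on a CIC + NEW mix, then score each on held-out samples from both datasets. Everything below is a hypothetical illustration. The `make_samples` helper generates synthetic data in which "newer" malware differs from benign files along different features than "older" malware, the dataset sizes are arbitrary, and the printed numbers say nothing about the real datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)


def make_samples(n, malicious_shift):
    """Synthetic stand-in: benign samples near 0, malicious ones offset by `malicious_shift`."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(size=(n, 5)) + y[:, None] * malicious_shift
    return X, y


# "CIC": malicious samples stand out on features 0 and 1 (assumption).
X_cic, y_cic = make_samples(2000, np.array([2.0, 2.0, 0.0, 0.0, 0.0]))
# "NEW": malicious samples stand out on features 2 and 3 instead (assumption).
X_new, y_new = make_samples(500, np.array([0.0, 0.0, 2.0, 2.0, 0.0]))

# Hold out 20% of each dataset for testing.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_cic, y_cic, test_size=0.2, stratify=y_cic, random_state=0
)
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(
    X_new, y_new, test_size=0.2, stratify=y_new, random_state=0
)

training_sets = {
    "CIC only": (Xc_tr, yc_tr),
    "CIC + NEW": (np.vstack([Xc_tr, Xn_tr]), np.concatenate([yc_tr, yn_tr])),
}
test_sets = {"CIC": (Xc_te, yc_te), "NEW": (Xn_te, yn_te)}

results = {}
for train_name, (X_tr, y_tr) in training_sets.items():
    model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    for test_name, (X_te, y_te) in test_sets.items():
        results[(train_name, test_name)] = model.score(X_te, y_te)
        print(f"trained on {train_name}, tested on {test_name}: "
              f"accuracy {results[(train_name, test_name)]:.3f}")
```

Keeping the two test sets separate, rather than pooling them, is what makes the effect on older and newer samples visible independently.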
