Training ML models to detect malicious PDF files
Two datasets are used:
- CIC-Evasive-PDFMal2022 (https://www.unb.ca/cic/datasets/pdfmal-2022.html): contains 10,000 benign and 10,000 malicious PDF files. The most recent files are from 2019.
- A new dataset called "NEW": contains about 500 samples of recent PDF malware (2020-2024) gathered from MalwareBazaar.
The features are documented in, and extracted by, PDF-Feature-Extractor.
The features are loaded from a CSV file and go through multiple steps of preprocessing.
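The loading and preprocessing stage could look roughly like the sketch below. The exact steps are not listed here, so everything in it is an assumption: the column names (`file_name`, `class`, and the feature columns), the inline sample data, and the choice of median imputation and standard scaling are illustrative, not the project's actual pipeline.

```python
import io

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the extractor's CSV output; column names are assumptions.
SAMPLE_CSV = """file_name,pdf_size,pages,javascript,class
a.pdf,120,3,0,Benign
b.pdf,8,1,2,Malicious
c.pdf,45,,1,Malicious
d.pdf,300,12,0,Benign
"""


def load_features(csv_source):
    """Load a feature CSV and return (X, y) ready for a scikit-learn model."""
    df = pd.read_csv(csv_source)
    # Drop the identifier column, which carries no predictive signal.
    df = df.drop(columns=["file_name"])
    # Encode the label: 1 = malicious, 0 = benign.
    y = (df.pop("class").str.lower() == "malicious").astype(int)
    # Coerce every feature to numeric and fill gaps with the column median.
    X = df.apply(pd.to_numeric, errors="coerce")
    X = X.fillna(X.median())
    return X, y


X, y = load_features(io.StringIO(SAMPLE_CSV))

# Scale features to zero mean and unit variance (helps SVM, KNN and MLP).
X_scaled = StandardScaler().fit_transform(X)
```

In a real run the scaler would normally be fitted on the training split only, so that no statistics from the test files leak into training.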
The following models are trained and evaluated using scikit-learn:
- Random Forest
- SVC
- Decision Tree
- KNN (5 neighbors)
- SVM
- Naive Bayes
- Basic neural network (MLPClassifier)
The models are trained using an 80/20 train/test split and evaluated using 10-fold cross-validation.
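As an illustration of this setup, here is a minimal sketch with scikit-learn. It uses synthetic data from `make_classification` in place of the real PDF features, and it assumes that cross-validation runs on the 80% training portion while the remaining 20% serves as a hold-out set; the hyperparameters other than the 5 neighbors for KNN are also assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the PDF feature matrix (assumption, not the real data).
X, y = make_classification(n_samples=400, n_features=10, random_state=0)

# 80/20 split, stratified so both classes keep the same ratio in each part.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale-sensitive models are wrapped in a pipeline with a StandardScaler.
models = {
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVC": make_pipeline(StandardScaler(), SVC()),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "Naive Bayes": GaussianNB(),
    "MLP": make_pipeline(
        StandardScaler(), MLPClassifier(max_iter=1000, random_state=0)
    ),
}

results = {}
for name, model in models.items():
    # 10-fold cross-validation accuracy on the training portion.
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)
    # Refit on the full training portion and score on the hold-out set.
    model.fit(X_train, y_train)
    results[name] = (cv_scores.mean(), model.score(X_test, y_test))

for name, (cv_mean, test_acc) in results.items():
    print(f"{name:15s} cv={cv_mean:.3f} test={test_acc:.3f}")
```

Putting the scaler inside the pipeline means it is refitted within each fold, so the validation fold never influences the scaling statistics.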
We also experimented with TensorFlow neural networks, but the results were not as good as those of the scikit-learn models.
- Models trained only on the CIC dataset perform poorly on newer samples. When trained on a mixed dataset built from CIC and NEW, performance increases on the newer samples and decreases only slightly on the older samples.
- Random Forest gave the best overall results.
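The comparison between CIC-only and mixed training can be sketched as follows. This is a hedged illustration on synthetic data, not the project's evaluation code: the two `make_classification` calls are placeholders for the CIC and NEW feature sets, the variable names are hypothetical, and the numbers it produces say nothing about the real datasets. Because NEW contains only malware, the score on it is the detection rate.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic placeholders (assumption: both sets share the same feature columns).
X_cic, y_cic = make_classification(n_samples=1000, n_features=10, random_state=0)
X_new, y_new = make_classification(n_samples=400, n_features=10, random_state=1)

# NEW holds malware only, so keep just the malicious (label 1) rows.
X_new, y_new = X_new[y_new == 1], y_new[y_new == 1]

# Hold out 20% of each dataset so every model is scored on unseen files.
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(
    X_cic, y_cic, test_size=0.2, stratify=y_cic, random_state=0
)
Xn_tr, Xn_te, yn_tr, yn_te = train_test_split(
    X_new, y_new, test_size=0.2, random_state=0
)

training_sets = {
    "cic_only": (Xc_tr, yc_tr),
    "mixed": (np.vstack([Xc_tr, Xn_tr]), np.concatenate([yc_tr, yn_tr])),
}

scores = {}
for name, (X_tr, y_tr) in training_sets.items():
    clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
    scores[name] = {
        "old": clf.score(Xc_te, yc_te),  # accuracy on held-out CIC files
        "new": clf.score(Xn_te, yn_te),  # detection rate on held-out NEW malware
    }

for name, result in scores.items():
    print(f"{name:9s} old={result['old']:.3f} new={result['new']:.3f}")
```

Scoring both models on the same two held-out sets is what makes the trade-off visible: any gain on the newer samples can be read next to the cost on the older ones.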