software-engineering-and-security/loom-android-malware-detection

LOOM: A Balanced String-Based Transformer for Android Malware Detection

Reference implementation of LOOM (ICICS 2026). LOOM fuses three complementary static feature views extracted from each APK — the Android manifest, API calls, and Dalvik opcodes — into a single string token sequence under a fixed token budget, then fine-tunes a transformer classifier on it. The repository also ships reproductions of several published baselines for direct comparison.

📄 Paper: Hantang Zhang, Mojtaba Eshghie, Bruno Kreyssig, Tommy Löfstedt, Alexandre Bartel. LOOM: A Balanced String-Based Transformer for Android Malware Detection. ICICS 2026 (to appear). See Citation.


Highlights

  • Three-channel static features extracted with androguard: manifest entities, API-call sequences, and opcode sequences.
  • Token-budget-aware preprocessing that allocates the limited context window across the three channels by a configurable ratio (default 1 : 4 : 5 for manifest : api : opcode), with cleaning, third-party-library filtering (AndroLibZoo + LibD + common ad/util libraries), and TF-IDF / χ² / information-gain feature selection.
  • Multiple transformer backbones: BERT, BigBird, Longformer, ModernBERT.
  • Model explanation via LIME, plus a lightweight logistic-regression shadow model for global feature importance.
  • Reproduced baselines: ImageDroid, MalScan, RevealDroid, and a multimodal-transformer fusion baseline.
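
The budget split described in the second highlight can be illustrated with a small sketch. This is not the repository's code: the function names `allocate_budget` and `fuse_channels`, the largest-remainder rounding, and the 512-token total (borrowed from the `--global-limit 512` example in the quick start) are assumptions made for illustration only.

```python
def allocate_budget(total, ratios):
    """Split `total` tokens across channels in proportion to `ratios`.

    Largest-remainder rounding keeps the shares summing to `total`.
    (Hypothetical helper; the real pipeline may round differently.)
    """
    weight_sum = sum(ratios.values())
    exact = {name: total * weight / weight_sum for name, weight in ratios.items()}
    shares = {name: int(value) for name, value in exact.items()}
    leftover = total - sum(shares.values())
    # Hand the remaining tokens to the channels with the largest fractional part.
    by_fraction = sorted(exact, key=lambda name: exact[name] - shares[name], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


def fuse_channels(channels, shares):
    """Truncate each channel to its share and concatenate in the order of `shares`."""
    fused = []
    for name, limit in shares.items():
        fused.extend(channels.get(name, [])[:limit])
    return fused


# Default ratio from the text: manifest : api : opcode = 1 : 4 : 5.
shares = allocate_budget(512, {"manifest": 1, "api": 4, "opcode": 5})
print(shares)
```

Under these assumptions a 512-token budget splits into 51 manifest, 205 API-call and 256 opcode tokens. Whether the real budget also reserves room for special tokens is not stated here, so treat the numbers as illustrative.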

Repository layout

loom-android-malware/
├── apk_process/           # APK parsing & raw-feature extraction (androguard)
├── datasets/              # SHA-256 lists of every APK used in the paper (see Datasets)
├── docs/                  # Paper appendix and other supplementary documents
├── feature_process/       # Preprocessing into BERT-friendly token sequences
│   ├── manifest_process.py
│   ├── apicall_process.py
│   ├── opcode_process.py
│   ├── features_process_final.py     # main preprocessing entry point
│   ├── split_dataset.py              # leak-free train/val/test split
│   ├── counter_process.py
│   ├── dsfile_process.py
│   ├── malbert.py
│   ├── tfidf_feature_extractor.py
│   ├── chi2_feature_extractor.py
│   ├── information_gain_feature_extractor.py
│   ├── filter/                       # third-party-library blacklists
│   └── utils/
├── model/                 # Transformer classifiers (BERT / BigBird / Longformer / ModernBERT)
│   ├── bert.py
│   ├── bert_with_count.py
│   ├── bigbird_base.py
│   ├── longformer_base.py
│   ├── modern_bert.py
│   └── model_download.py
├── model_explanation/     # LIME + lightweight shadow model
│   ├── lime_bert.py
│   ├── lightweight_model.py
│   ├── important_feature_process.py
│   └── feature_association.py
├── obfuscation/           # Obfuscation-related processing
├── repro_baselines/       # Reproduced baselines
│   ├── image_droid/
│   ├── malscan/
│   ├── reveal_droid/
│   └── multimodal_transformer/
├── UniXcoder/             # UniXcoder wrapper (used by parts of the pipeline)
├── utils/                 # Dataset building, downloading, analysis helpers
├── LICENSE
├── README.md
└── requirements.txt

Installation

Tested with Python 3.11 on Linux.

git clone https://github.com/HantangZhang/loom-android-malware.git
cd loom-android-malware
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Main dependencies (see requirements.txt for exact pinned versions):

  • androguard 4.1.3 — APK static analysis
  • transformers 4.51.3, torch 2.7.0, datasets 3.5.0 — modeling
  • scikit-learn, scipy, lime — feature selection & explanation
  • tqdm, pandas, numpy, loguru

Quick start

Every script under apk_process/, feature_process/, model/, model_explanation/, obfuscation/, utils/, and repro_baselines/*/ is runnable as python -m <module> and exposes an argparse CLI. Run any module with --help to list its flags.

You may also want to point the HuggingFace cache somewhere other than ~/.cache/:

export ANDROID_ML_CACHE=/path/to/big/disk/hf_cache

1. Extract raw features from APKs

For a full directory of APKs (manifest + API calls + opcode in one parse):

python -m apk_process.apk_extractor extract-all \
    --target /path/to/apks \
    --manifest-dir /out/manifest \
    --api-dir      /out/apicall \
    --opcode-dir   /out/opcode \
    --workers 16

For an obfuscation sweep (one SHA list, many APK source directories):

python -m apk_process.apk_extractor extract-by-sha \
    --sha-list /path/to/sha_list.txt \
    --apk-dir  /path/to/apks/CID \
    --apk-dir  /path/to/apks/JUNK \
    --output-dir /out/obfuscation \
    --workers 16
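
To picture the lookup behind an obfuscation sweep, the sketch below pairs one SHA list with several variant directories. It is only an illustration: the helper name `locate_apks` and the `<sha256>.apk` file-naming convention are assumptions, not something the extractor is confirmed to use.

```python
from pathlib import Path


def locate_apks(sha_list, apk_dirs):
    """Find, for every (variant directory, SHA) pair, the APK file if it exists.

    Assumes each APK is stored as `<sha>.apk` inside its variant directory
    (hypothetical convention).
    """
    found, missing = {}, []
    for apk_dir in map(Path, apk_dirs):
        for sha in sha_list:
            candidate = apk_dir / f"{sha}.apk"
            if candidate.is_file():
                found[(apk_dir.name, sha)] = candidate
            else:
                missing.append((apk_dir.name, sha))
    return found, missing
```

Calling `locate_apks(shas, ["/path/to/apks/CID", "/path/to/apks/JUNK"])` returns the paths that exist plus the pairs that could not be resolved, which is a convenient sanity check before launching a long extraction run.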

2. Build a HuggingFace dataset under a token budget

python -m feature_process.features_process_final \
    --manifest-dir /out/manifest \
    --api-dir      /out/apicall \
    --opcode-dir   /out/opcode \
    --base-dir     /out/features \
    --sha-list     /path/to/sub_dataset_sha.txt \
    --csv-path     /path/to/apk_metadata.csv \
    --filter-method chi2_delta_idf \
    --manifest-limit 300 --api-limit 300 --opcode-limit 300 \
    --global-limit 512 --cap 2 --num-proc 16

By default, this command also creates a stratified train/val/test SHA split under <base-dir>/<task-name>/split/{train,val,test}.txt and fits every label-aware score matrix on the training partition only, so val + test never leak label information into feature selection. See Preventing data leakage for the full story and the flags that control it.

The individual preprocessors are also runnable separately: feature_process.manifest_process, feature_process.apicall_process, feature_process.opcode_process.

3. Fine-tune a classifier

python -m model.bert \
    --model-path bert-base-uncased \
    --dataset-dir /out/features/.../final_ds_chi2_delta_idf_... \
    --split-dir   /out/features/.../split \
    --output-dir  ./malware-bert \
    --num-train-epochs 5 --batch-size 64 --learning-rate 2e-5

Pass the same --split-dir that preprocessing wrote so the model's train / val / test partition matches the one used to fit the feature matrices. Without --split-dir the model falls back to a random split (useful only for quick experiments — not for published numbers; see Preventing data leakage).
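
As a rough picture of what filtering by --split-dir amounts to, the sketch below reads the three split files and groups dataset rows accordingly. The helper names, the `sha256` row key, and the case normalisation are assumptions for illustration; the actual model scripts may work on Hugging Face datasets with different column names.

```python
from pathlib import Path


def load_split(split_dir):
    """Read train.txt / val.txt / test.txt into sets of SHA-256 strings."""
    split = {}
    for name in ("train", "val", "test"):
        lines = (Path(split_dir) / f"{name}.txt").read_text().splitlines()
        split[name] = {line.strip().lower() for line in lines if line.strip()}
    return split


def partition(rows, split, key="sha256"):
    """Group rows by the partition whose SHA set contains them.

    Rows whose SHA appears in no partition are dropped, so nothing
    outside the agreed split can slip into training or evaluation.
    """
    parts = {name: [] for name in split}
    for row in rows:
        sha = row[key].lower()
        for name, shas in split.items():
            if sha in shas:
                parts[name].append(row)
                break
    return parts
```

Because membership is decided purely by the files on disk, two runs that share a split directory see exactly the same partition regardless of row order or random seeds.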

The other backbones and variants share the same flag layout: model.bigbird_base, model.longformer_base, model.modern_bert, model.bert_with_count.

4. Explain predictions

python -m model_explanation.lime_bert \
    --dataset-dir   /path/to/hf_dataset \
    --model-path    ./malware-bert/checkpoint-XXXX \
    --tokenizer-path bert-base-uncased \
    --output-dir    ./lime_html \
    --lime-csv      ./lime.csv \
    --sample-size 50 --num-features 20 --num-samples 100

The companion shadow model lives in model_explanation.lightweight_model (L1 logistic regression over the union vocabulary of the three views). The LIME-CSV analysis tools live in model_explanation.feature_association and model_explanation.important_feature_process.


Reproducing baselines

Each baseline lives under repro_baselines/<name>/. Every script in those folders is an argparse CLI — run with --help to inspect flags.

| Baseline | Folder | Entry points (use --help for each) |
| --- | --- | --- |
| ImageDroid | repro_baselines/image_droid/ | extract_dex_image_features {extract,fix-labels}, model_training {kfold,train,predict} |
| MalScan | repro_baselines/malscan/ | malscan_json_features, malscan_json_features_fast, malscan_merge_features, malscan_train_eval |
| RevealDroid | repro_baselines/reveal_droid/ | extract_apicount, extract_packageAPI, intent_action, reflection_native, build_features {single,multi}, revealDroid_detector {train,eval} |
| Multimodal Transformer | repro_baselines/multimodal_transformer/ | apk_to_dex_images, extract_bm_features, sm_features {process,process-roots,merge,debug-apk}, fusion_classifier, fine_tune |

Preventing data leakage

The feature-selection step is label-aware: the delta-IDF, chi-square and information-gain matrices score every token by how its distribution differs between benign and malicious documents. Fitting those matrices on the full dataset — including the samples you later hold out for evaluation — silently leaks label information from val / test back into the features of every other sample.
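
A minimal sketch of why the fitting set matters: the score table below is computed from training documents only and then applied unchanged to a held-out sample. The smoothed delta-IDF-style formula and the helper names `fit_delta_idf` and `select_tokens` are assumptions for illustration; the repository's matrices may be defined differently.

```python
import math
from collections import Counter


def fit_delta_idf(docs, labels):
    """Score tokens by the smoothed contrast of their per-class document frequency.

    Positive scores lean malicious (label 1), negative lean benign (label 0).
    Illustrative formula, not necessarily the one used in the repository.
    """
    doc_freq = {0: Counter(), 1: Counter()}
    class_size = Counter(labels)
    for tokens, label in zip(docs, labels):
        doc_freq[label].update(set(tokens))
    vocab = set(doc_freq[0]) | set(doc_freq[1])
    return {
        tok: math.log2((doc_freq[1][tok] + 1) / (class_size[1] + 1))
        - math.log2((doc_freq[0][tok] + 1) / (class_size[0] + 1))
        for tok in vocab
    }


def select_tokens(tokens, scores, limit):
    """Keep the `limit` most discriminative tokens, preserving original order."""
    ranked = sorted(set(tokens), key=lambda t: (-abs(scores.get(t, 0.0)), t))
    keep = set(ranked[:limit])
    return [t for t in tokens if t in keep]


# Fit on the training partition only ...
train_docs = [["sendTextMessage", "getDeviceId"], ["sendTextMessage"],
              ["onCreate"], ["onCreate", "getDeviceId"]]
train_labels = [1, 1, 0, 0]
scores = fit_delta_idf(train_docs, train_labels)

# ... then apply the frozen scores to a held-out sample without using its label.
held_out = ["getDeviceId", "sendTextMessage", "unseenToken"]
selected = select_tokens(held_out, scores, limit=1)
print(selected)
```

The held-out sample's label never enters `fit_delta_idf`, which is the property the leak-free pipeline is meant to guarantee. Fitting on all documents instead would let evaluation labels shape which tokens survive selection.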

To prevent this, the pipeline now treats the train/val/test partition as a first-class artefact that is shared between preprocessing and training:

  1. feature_process.split_dataset produces a deterministic stratified split of the input SHA list and writes train.txt / val.txt / test.txt under a chosen directory.
  2. feature_process.features_process_final reads (or creates) that split before doing anything else, and excludes every val + test SHA when fitting the score matrices. The selected tokens for every sample (train, val, test) are produced by matrices fitted on the train SHAs only.
  3. The model classifiers (model.bert, model.bert_with_count, model.bigbird_base, model.longformer_base, model.modern_bert) accept the same --split-dir. When given, they filter the preprocessed dataset by the same train.txt / val.txt / test.txt files instead of doing an ad-hoc random split, so the train / val / test partition the model sees is identical to the one used to fit the feature matrices.
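
The first step above can be sketched as a seeded, per-class shuffle. This is an illustration under stated assumptions, not the code in feature_process.split_dataset: the function names, the per-class flooring of partition sizes, and the way the seed is consumed are all hypothetical.

```python
import math
import random
from collections import defaultdict
from pathlib import Path


def stratified_split(labels_by_sha, ratios=(0.8, 0.1, 0.1), seed=42):
    """Deterministically split SHAs into train/val/test, class by class."""
    if not math.isclose(sum(ratios), 1.0, abs_tol=1e-9):
        raise ValueError("train/val/test ratios must sum to 1.0")
    by_label = defaultdict(list)
    for sha, label in labels_by_sha.items():
        by_label[label].append(sha)
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        shas = sorted(by_label[label])  # sort first so the shuffle is reproducible
        rng.shuffle(shas)
        n_train = int(len(shas) * ratios[0])
        n_val = int(len(shas) * ratios[1])
        split["train"] += shas[:n_train]
        split["val"] += shas[n_train:n_train + n_val]
        split["test"] += shas[n_train + n_val:]  # remainder goes to test
    return split


def write_split(split, out_dir):
    """Write one SHA per line to train.txt / val.txt / test.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, shas in split.items():
        (out / f"{name}.txt").write_text("\n".join(shas) + "\n")


# Synthetic example: 100 fake SHAs, half benign (0) and half malicious (1).
labels = {f"{i:064x}": i % 2 for i in range(100)}
split = stratified_split(labels)
print({name: len(shas) for name, shas in split.items()})
```

Sorting before shuffling makes the result depend only on the SHA set and the seed, so anyone starting from the same SHA list and labels reproduces the same files.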

Default behaviour

Running the preprocessing pipeline without any split-related flags already creates the split and fits the matrices on the training partition only:

python -m feature_process.features_process_final \
    --manifest-dir /out/manifest --api-dir /out/apicall \
    --opcode-dir   /out/opcode   --base-dir /out/features \
    --sha-list     /path/to/sub_dataset_sha.txt \
    --csv-path     /path/to/labels.csv \
    --filter-method chi2_delta_idf
# -> writes /out/features/<task>/split/{train,val,test}.txt
# -> fits matrices on train only, applies them to all samples

Then point the model at the same split:

python -m model.bert \
    --dataset-dir /out/features/<task>/final_ds_chi2_delta_idf_..._512 \
    --split-dir   /out/features/<task>/split \
    --output-dir  ./malware-bert

Flags

| Flag | Default | Notes |
| --- | --- | --- |
| --split-dir | <base-dir>/<task-name>/split/ | Reused if it already contains train.txt / val.txt / test.txt. |
| --train-ratio | 0.8 | |
| --val-ratio | 0.1 | |
| --test-ratio | 0.1 | Must sum to 1.0 with the other two. |
| --split-seed | 42 | Determines the split assignment. |
| --no-split | off | Legacy behaviour (fits matrices on the full dataset; leaky). |
| --exclude-matrix-sha-list | | Legacy: explicit text file of SHAs to exclude (takes priority). |

Producing a split standalone

If you want to share the same split across many preprocessing runs or across baselines, build it once with the standalone CLI:

python -m feature_process.split_dataset \
    --sha-list   /path/to/sub_dataset_sha.txt \
    --labels-csv /path/to/labels.csv \
    --output-dir /path/to/split \
    --train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --seed 42

Then point both the preprocessing pipeline and every model run at --split-dir /path/to/split.


Datasets

This repository does not redistribute APKs. Per AndroZoo's terms of use we only publish the SHA-256 hashes of every sample used in the paper; you can fetch the corresponding APKs from AndroZoo (or any other source you have access to).

The datasets/ directory holds five hash lists, one SHA-256 per line:

| File | Samples | What it is |
| --- | --- | --- |
| datasets/AndroAMD.txt | 20,000 | The main training + test set assembled for the paper (in-house AMD selection). |
| datasets/PublicAMD.txt | 15,343 | A public-corpus reference set (no overlap engineering applied; useful as a cross-check). |
| datasets/concept_drift_datasets2022.txt | 877 | Concept-drift evaluation: APKs first seen in 2022. |
| datasets/concept_drift_datasets2023.txt | 914 | Concept-drift evaluation: APKs first seen in 2023. |
| datasets/obfu_1k.txt | 1,000 | Obfuscation-robustness evaluation set; 500 benign + 500 malicious (50/50 stratified split). |
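
Since every list holds one SHA-256 per line, a quick validation pass before downloading can catch truncated or duplicated entries. The sketch below is a hypothetical helper, not part of the repository; normalising to upper case is an assumption made so that duplicates differing only in case collapse.

```python
import re

# A SHA-256 digest is exactly 64 hexadecimal characters.
SHA256_RE = re.compile(r"[0-9a-fA-F]{64}")


def parse_sha_list(text):
    """Return (unique valid SHAs in file order, rejected lines)."""
    seen, shas, rejected = set(), [], []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # tolerate blank lines
        if not SHA256_RE.fullmatch(entry):
            rejected.append(entry)
            continue
        key = entry.upper()
        if key not in seen:
            seen.add(key)
            shas.append(key)
    return shas, rejected
```

For example, `parse_sha_list(open("datasets/obfu_1k.txt").read())` should return a list whose length matches the sample count in the table and an empty rejection list.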

Getting the APKs

Once you have AndroZoo (or equivalent) credentials, you can download each list with the helper in utils/:

python -m utils.download_by_list download \
    --sha-list datasets/AndroAMD.txt \
    --apk-dir  /path/to/where/apks/go \
    --androzoo-api-key $ANDROZOO_KEY
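
For readers who want to see what a per-sample download involves, here is a minimal standard-library sketch. It is not the utils.download_by_list helper: the function names and the `<sha256>.apk` output naming are assumptions, and the endpoint and parameter names follow AndroZoo's public API documentation as best understood here, so check them against the current documentation before relying on them.

```python
import urllib.parse
import urllib.request
from pathlib import Path

# Download endpoint as described in AndroZoo's API documentation (verify before use).
ANDROZOO_DOWNLOAD = "https://androzoo.uni.lu/api/download"


def build_url(sha256, api_key):
    """Build the download URL for one sample."""
    query = urllib.parse.urlencode({"apikey": api_key, "sha256": sha256})
    return f"{ANDROZOO_DOWNLOAD}?{query}"


def download_apk(sha256, api_key, apk_dir, timeout=60):
    """Fetch one APK into `apk_dir`, skipping files that already exist."""
    target = Path(apk_dir) / f"{sha256}.apk"  # hypothetical naming convention
    if target.exists():
        return target
    with urllib.request.urlopen(build_url(sha256, api_key), timeout=timeout) as resp:
        target.write_bytes(resp.read())
    return target
```

Skipping existing files makes an interrupted run cheap to resume. A production downloader would also want retries and rate limiting, which this sketch leaves out.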

Labels

Labels (sha256,label CSV, with 0 = benign / 1 = malware) are not bundled in the repo because they are derived from VirusTotal-detection counts that AndroZoo distributes under separate terms. After downloading the APKs you can either:

  • Pull each sample's vt_detection from AndroZoo's metadata CSV and threshold it (the convention used in the paper is vt_detection >= 4 → malware, vt_detection == 0 → benign), or
  • Reuse the helpers in utils/build_apk_market_features.py / utils/build_datasets.py, which automate that thresholding.
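
The thresholding convention in the first option can be sketched in a few lines. The helper names are hypothetical, the `sha256` and `vt_detection` column names follow AndroZoo's metadata CSV, and dropping samples with 1 to 3 detections is an assumption inferred from the two thresholds above rather than a documented rule.

```python
import csv
import io


def label_from_vt(vt_detection):
    """Map a VirusTotal detection count to 1 (malware), 0 (benign) or None."""
    if vt_detection >= 4:
        return 1
    if vt_detection == 0:
        return 0
    return None  # 1 to 3 detections: ambiguous, left unlabelled (assumption)


def build_labels(metadata_csv_text, wanted_shas):
    """Return {sha256: label} for the wanted samples that get a clear label."""
    labels = {}
    for row in csv.DictReader(io.StringIO(metadata_csv_text)):
        sha = row["sha256"]
        if sha not in wanted_shas or not row["vt_detection"]:
            continue  # not in our list, or never scanned
        label = label_from_vt(int(row["vt_detection"]))
        if label is not None:
            labels[sha] = label
    return labels


# Tiny synthetic metadata excerpt (fake SHAs, for illustration only).
metadata = "sha256,vt_detection\nAAA,0\nBBB,7\nCCC,2\nDDD,\nEEE,4\n"
labels = build_labels(metadata, {"AAA", "BBB", "CCC", "DDD"})
print(labels)
```

Writing `labels` out as a two-column sha256,label CSV gives the format expected by the --csv-path and --labels-csv flags described earlier.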

Other resources shipped with the code

  • Third-party-library filter lists under feature_process/filter/ (AndroLibZoo, LibD threshold-10, the cl_91 / ad_240 common-library lists, and the API-call blacklist) used by the API-call preprocessor.

Citation

@inproceedings{zhang2026loom,
    author    = {Zhang, Hantang and Eshghie, Mojtaba and Kreyssig, Bruno and L\"ofstedt, Tommy and Bartel, Alexandre},
    title     = {{Loom}: A Balanced String-Based Transformer for {Android} Malware Detection},
    booktitle = {Proceedings of the 28th International Conference on Information and Communications Security ({ICICS} 2026)},
    series    = {Lecture Notes in Computer Science},
    publisher = {Springer},
    address   = {Fukui, Japan},
    year      = {2026},
    month     = oct,
    note      = {To appear}
}

License

This project is released under the MIT License.

Issues / contact

Please open a GitHub Issue for bug reports, questions, or reproduction trouble.
