Reference implementation of LOOM (ICICS 2026). LOOM fuses three complementary static feature views extracted from each APK — the Android manifest, API calls, and Dalvik opcodes — into a single string token sequence under a fixed token budget, then fine-tunes a transformer classifier on it. The repository also ships reproductions of several published baselines for direct comparison.
📄 Paper: Hantang Zhang, Mojtaba Eshghie, Bruno Kreyssig, Tommy Löfstedt, Alexandre Bartel. LOOM: A Balanced String-Based Transformer for Android Malware Detection. ICICS 2026 (to appear). See Citation.
- Three-channel static features extracted with androguard: manifest entities, API-call sequences, and opcode sequences.
- Token-budget-aware preprocessing that allocates the limited context window across the three channels by a configurable ratio (default 1 : 4 : 5 for manifest : api : opcode), with cleaning, third-party-library filtering (AndroLibZoo + LibD + common ad/util libraries), and TF-IDF / χ² / information-gain feature selection.
- Multiple transformer backbones: BERT, BigBird, Longformer, ModernBERT.
- Model explanation via LIME, plus a lightweight logistic-regression shadow model for global feature importance.
- Reproduced baselines: ImageDroid, MalScan, RevealDroid, and a multimodal-transformer fusion baseline.
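
As a rough illustration of the ratio-based budget allocation mentioned in the list above, the sketch below divides a token budget across the three channels. The function name `allocate_budget`, the largest-remainder rounding, and the 512-token budget are assumptions made for this example; LOOM's own preprocessing may round differently, reserve room for special tokens, or apply per-channel caps.

```python
def allocate_budget(total_tokens, ratios):
    """Split total_tokens across channels in proportion to ratios.

    Largest-remainder rounding keeps the shares summing to total_tokens.
    This is an illustrative helper, not a function from the repository.
    """
    weight_sum = sum(ratios.values())
    exact = {name: total_tokens * w / weight_sum for name, w in ratios.items()}
    shares = {name: int(value) for name, value in exact.items()}
    leftover = total_tokens - sum(shares.values())
    # Give the leftover tokens to the channels with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda name: exact[name] - shares[name], reverse=True)
    for name in by_fraction[:leftover]:
        shares[name] += 1
    return shares


# Default 1 : 4 : 5 ratio applied to an assumed 512-token budget.
budget = allocate_budget(512, {"manifest": 1, "api": 4, "opcode": 5})
print(budget)  # {'manifest': 51, 'api': 205, 'opcode': 256}
```

With the default ratio, the opcode channel receives half of the window and the manifest a tenth; changing the ratio dictionary is enough to try other balances.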
loom-android-malware/
├── apk_process/ # APK parsing & raw-feature extraction (androguard)
├── datasets/ # SHA-256 lists of every APK used in the paper (see Datasets)
├── docs/ # Paper appendix and other supplementary documents
├── feature_process/ # Preprocessing into BERT-friendly token sequences
│ ├── manifest_process.py
│ ├── apicall_process.py
│ ├── opcode_process.py
│ ├── features_process_final.py # main preprocessing entry point
│ ├── split_dataset.py # leak-free train/val/test split
│ ├── counter_process.py
│ ├── dsfile_process.py
│ ├── malbert.py
│ ├── tfidf_feature_extractor.py
│ ├── chi2_feature_extractor.py
│ ├── information_gain_feature_extractor.py
│ ├── filter/ # third-party-library blacklists
│ └── utils/
├── model/ # Transformer classifiers (BERT / BigBird / Longformer / ModernBERT)
│ ├── bert.py
│ ├── bert_with_count.py
│ ├── bigbird_base.py
│ ├── longformer_base.py
│ ├── modern_bert.py
│ └── model_download.py
├── model_explanation/ # LIME + lightweight shadow model
│ ├── lime_bert.py
│ ├── lightweight_model.py
│ ├── important_feature_process.py
│ └── feature_association.py
├── obfuscation/ # Obfuscation-related processing
├── repro_baselines/ # Reproduced baselines
│ ├── image_droid/
│ ├── malscan/
│ ├── reveal_droid/
│ └── multimodal_transformer/
├── UniXcoder/ # UniXcoder wrapper (used by parts of the pipeline)
├── utils/ # Dataset building, downloading, analysis helpers
├── LICENSE
├── README.md
└── requirements.txt
Tested with Python 3.11 on Linux.
git clone https://github.com/HantangZhang/loom-android-malware.git
cd loom-android-malware
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Main dependencies (see requirements.txt for exact pinned versions):

- androguard 4.1.3: APK static analysis
- transformers 4.51.3, torch 2.7.0, datasets 3.5.0: modeling
- scikit-learn, scipy, lime: feature selection & explanation
- tqdm, pandas, numpy, loguru

Every script under apk_process/, feature_process/, model/,
model_explanation/, obfuscation/, utils/, and repro_baselines/*/
is runnable via python -m <module> with a --help-driven argparse
CLI. Run any module with --help to discover its full flag set.
You may also want to point the HuggingFace cache somewhere other than
~/.cache/:
export ANDROID_ML_CACHE=/path/to/big/disk/hf_cache

For a full directory of APKs (manifest + API calls + opcode in one parse):
python -m apk_process.apk_extractor extract-all \
--target /path/to/apks \
--manifest-dir /out/manifest \
--api-dir /out/apicall \
--opcode-dir /out/opcode \
--workers 16

For an obfuscation sweep (one SHA list, many APK source directories):
python -m apk_process.apk_extractor extract-by-sha \
--sha-list /path/to/sha_list.txt \
--apk-dir /path/to/apks/CID \
--apk-dir /path/to/apks/JUNK \
--output-dir /out/obfuscation \
--workers 16

python -m feature_process.features_process_final \
--manifest-dir /out/manifest \
--api-dir /out/apicall \
--opcode-dir /out/opcode \
--base-dir /out/features \
--sha-list /path/to/sub_dataset_sha.txt \
--csv-path /path/to/apk_metadata.csv \
--filter-method chi2_delta_idf \
--manifest-limit 300 --api-limit 300 --opcode-limit 300 \
--global-limit 512 --cap 2 --num-proc 16

By default, this command also creates a stratified train/val/test
SHA split under <base-dir>/<task-name>/split/{train,val,test}.txt
and fits every label-aware score matrix on the training partition only,
so val + test never leak label information into feature selection. See
Preventing data leakage for the full story
and the flags that control it.
The individual preprocessors are also runnable separately:
feature_process.manifest_process, feature_process.apicall_process,
feature_process.opcode_process.
python -m model.bert \
--model-path bert-base-uncased \
--dataset-dir /out/features/.../final_ds_chi2_delta_idf_... \
--split-dir /out/features/.../split \
--output-dir ./malware-bert \
--num-train-epochs 5 --batch-size 64 --learning-rate 2e-5

Pass the same --split-dir that preprocessing wrote so the model's
train / val / test partition matches the one used to fit the feature
matrices. Without --split-dir the model falls back to a random split
(useful only for quick experiments — not for published numbers; see
Preventing data leakage).
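
To make the idea of reusing a split directory concrete, here is a minimal sketch of how records could be routed by the three SHA lists. It is not the code in model.bert: the helper names, the `sha256` record field, and the case-insensitive matching are assumptions for illustration, and the SHAs are shortened dummies.

```python
import tempfile
from pathlib import Path


def load_split(split_dir):
    """Read train.txt / val.txt / test.txt (one SHA per line) into sets."""
    split = {}
    for part in ("train", "val", "test"):
        lines = (Path(split_dir) / f"{part}.txt").read_text().splitlines()
        split[part] = {line.strip().lower() for line in lines if line.strip()}
    return split


def partition_records(records, split):
    """Route each record to the partition that lists its SHA.

    `records` is any iterable of dicts with a "sha256" key (hypothetical
    field name). Records found in no list are returned separately instead
    of being silently assigned to a partition.
    """
    parts = {"train": [], "val": [], "test": []}
    unassigned = []
    for record in records:
        sha = record["sha256"].lower()
        for part, shas in split.items():
            if sha in shas:
                parts[part].append(record)
                break
        else:
            unassigned.append(record)
    return parts, unassigned


# Tiny demonstration with made-up, shortened SHAs.
with tempfile.TemporaryDirectory() as tmp:
    for part, shas in {"train": ["aa", "bb"], "val": ["cc"], "test": ["dd"]}.items():
        (Path(tmp) / f"{part}.txt").write_text("\n".join(shas) + "\n")
    split = load_split(tmp)

records = [{"sha256": s, "text": f"tokens of {s}"} for s in ["AA", "bb", "cc", "dd", "ee"]]
parts, unassigned = partition_records(records, split)
print({name: len(rows) for name, rows in parts.items()}, len(unassigned))
```

Surfacing the unassigned records is a deliberate choice in this sketch: a sample missing from all three lists usually means the dataset and the split were built from different SHA lists.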
The long-context variants share the same flag layout:
model.bigbird_base, model.longformer_base, model.modern_bert,
model.bert_with_count.
python -m model_explanation.lime_bert \
--dataset-dir /path/to/hf_dataset \
--model-path ./malware-bert/checkpoint-XXXX \
--tokenizer-path bert-base-uncased \
--output-dir ./lime_html \
--lime-csv ./lime.csv \
--sample-size 50 --num-features 20 --num-samples 100

The companion shadow model lives in model_explanation.lightweight_model
(L1 logistic regression over the union vocabulary of the three views); the
LIME-CSV analysis tools live in model_explanation.feature_association
and model_explanation.important_feature_process.
Each baseline lives under repro_baselines/<name>/. Every script in
those folders is an argparse CLI — run with --help to inspect flags.
| Baseline | Folder | Entry points (use --help for each) |
|---|---|---|
| ImageDroid | repro_baselines/image_droid/ | extract_dex_image_features {extract,fix-labels}, model_training {kfold,train,predict} |
| MalScan | repro_baselines/malscan/ | malscan_json_features, malscan_json_features_fast, malscan_merge_features, malscan_train_eval |
| RevealDroid | repro_baselines/reveal_droid/ | extract_apicount, extract_packageAPI, intent_action, reflection_native, build_features {single,multi}, revealDroid_detector {train,eval} |
| Multimodal Transformer | repro_baselines/multimodal_transformer/ | apk_to_dex_images, extract_bm_features, sm_features {process,process-roots,merge,debug-apk}, fusion_classifier, fine_tune |
The feature-selection step is label-aware: the delta-IDF, chi-square and information-gain matrices score every token by how its distribution differs between benign and malicious documents. Fitting those matrices on the full dataset — including the samples you later hold out for evaluation — silently leaks label information from val / test back into the features of every other sample.
To prevent this, the pipeline now treats the train/val/test partition as a first-class artefact that is shared between preprocessing and training:
- feature_process.split_dataset produces a deterministic stratified split of the input SHA list and writes train.txt / val.txt / test.txt under a chosen directory.
- feature_process.features_process_final reads (or creates) that split before doing anything else, and excludes every val + test SHA when fitting the score matrices. The selected tokens for every sample (train, val, test) are produced by matrices fitted on the train SHAs only.
- The model classifiers (model.bert, model.bert_with_count, model.bigbird_base, model.longformer_base, model.modern_bert) accept the same --split-dir. When given, they filter the preprocessed dataset by the same train.txt / val.txt / test.txt files instead of doing an ad-hoc random split, so the train / test partition the model sees is identical to the one used to fit the feature matrices.
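
The "fit on train, apply to all" pattern described above can be sketched in a few lines. The example uses a plain chi-square statistic on token presence as a stand-in for the repository's score matrices (it is not the chi2_delta_idf scoring itself), and the function names, record fields, and toy samples are all invented for illustration.

```python
from collections import Counter


def chi2_scores(docs, labels):
    """Chi-square score of each token's presence against a binary label."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for tokens, label in zip(docs, labels):
        (pos_df if label == 1 else neg_df).update(set(tokens))
    scores = {}
    for token in set(pos_df) | set(neg_df):
        a, b = pos_df[token], neg_df[token]   # malicious / benign docs with the token
        c, d = n_pos - a, n_neg - b           # malicious / benign docs without it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[token] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def fit_vocabulary(samples, train_shas, top_k):
    """Score tokens on the training samples only and keep the top_k."""
    train = [(s["tokens"], s["label"]) for s in samples if s["sha256"] in train_shas]
    docs, labels = zip(*train)
    scores = chi2_scores(docs, labels)
    ranked = sorted(scores, key=lambda t: (-scores[t], t))  # ties broken by name
    return set(ranked[:top_k])


def apply_vocabulary(samples, vocab):
    """Filter every sample (train, val and test) with the train-fitted vocabulary."""
    return {s["sha256"]: [t for t in s["tokens"] if t in vocab] for s in samples}


samples = [
    {"sha256": "a1", "label": 1, "tokens": ["sendTextMessage", "getDeviceId"]},
    {"sha256": "a2", "label": 1, "tokens": ["sendTextMessage", "onCreate"]},
    {"sha256": "b1", "label": 0, "tokens": ["onCreate", "setContentView"]},
    {"sha256": "b2", "label": 0, "tokens": ["onCreate", "getDeviceId"]},
    {"sha256": "t1", "label": 1, "tokens": ["sendTextMessage", "loadLibrary"]},  # held out
]
vocab = fit_vocabulary(samples, train_shas={"a1", "a2", "b1", "b2"}, top_k=2)
selected = apply_vocabulary(samples, vocab)
print(sorted(vocab), selected["t1"])
```

The held-out sample `t1` never contributes to the scores, so a token that appears only there (`loadLibrary`) cannot enter the vocabulary, and its label cannot influence which tokens any other sample keeps.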
Running the preprocessing pipeline without any new flags already does the right thing:
python -m feature_process.features_process_final \
--manifest-dir /out/manifest --api-dir /out/apicall \
--opcode-dir /out/opcode --base-dir /out/features \
--sha-list /path/to/sub_dataset_sha.txt \
--csv-path /path/to/labels.csv \
--filter-method chi2_delta_idf
# -> writes /out/features/<task>/split/{train,val,test}.txt
# -> fits matrices on train only, applies them to all samples

Then point the model at the same split:
python -m model.bert \
--dataset-dir /out/features/<task>/final_ds_chi2_delta_idf_..._512 \
--split-dir /out/features/<task>/split \
--output-dir ./malware-bert

| Flag | Default | Notes |
|---|---|---|
| --split-dir | <base-dir>/<task-name>/split/ | Reused if it already contains train.txt / val.txt / test.txt. |
| --train-ratio | 0.8 | |
| --val-ratio | 0.1 | |
| --test-ratio | 0.1 | Must sum to 1.0 with the other two. |
| --split-seed | 42 | Determines the split assignment. |
| --no-split | off | Legacy behaviour (fits matrices on the full dataset; leaky). |
| --exclude-matrix-sha-list | – | Legacy: explicit text file of SHAs to exclude (takes priority). |

If you want to share the same split across many preprocessing runs or across baselines, build it once with the standalone CLI:
python -m feature_process.split_dataset \
--sha-list /path/to/sub_dataset_sha.txt \
--labels-csv /path/to/labels.csv \
--output-dir /path/to/split \
--train-ratio 0.8 --val-ratio 0.1 --test-ratio 0.1 --seed 42

Then point both the preprocessing pipeline and every model run at
--split-dir /path/to/split.
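
For readers who want to see what a deterministic stratified split involves, the following is a minimal standard-library sketch. It is an assumption-laden stand-in, not the logic of feature_process.split_dataset: the function name, the per-label rounding, and the dummy SHAs are illustrative, while the 0.8 / 0.1 / 0.1 ratios and seed 42 mirror the defaults shown above.

```python
import random
from collections import defaultdict


def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=42):
    """Deterministically split SHAs into train/val/test, label by label.

    `labels` maps sha256 -> label. Sorting before shuffling means the
    result depends only on the SHA set and the seed, not on input order.
    """
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1.0")
    by_label = defaultdict(list)
    for sha in sorted(labels):
        by_label[labels[sha]].append(sha)
    rng = random.Random(seed)
    split = {"train": [], "val": [], "test": []}
    for label in sorted(by_label):
        shas = by_label[label]
        rng.shuffle(shas)
        n_train = round(len(shas) * ratios[0])
        n_val = round(len(shas) * ratios[1])
        split["train"] += shas[:n_train]
        split["val"] += shas[n_train:n_train + n_val]
        split["test"] += shas[n_train + n_val:]   # remainder goes to test
    return split


# 100 dummy SHAs, half benign (0) and half malicious (1).
labels = {f"{i:064x}": i % 2 for i in range(100)}
split = stratified_split(labels)
print({part: len(shas) for part, shas in split.items()})
# {'train': 80, 'val': 10, 'test': 10}
```

Writing each list to train.txt / val.txt / test.txt, one SHA per line, would give a directory in the layout that --split-dir expects according to the description above.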
This repository does not redistribute APKs. Per AndroZoo's terms of use we only publish the SHA-256 hashes of every sample used in the paper; you can fetch the corresponding APKs from AndroZoo (or any other source you have access to).
The datasets/ directory holds five hash lists, one SHA-256 per line:
| File | Samples | What it is |
|---|---|---|
| datasets/AndroAMD.txt | 20,000 | The main training + test set assembled for the paper (in-house AMD selection). |
| datasets/PublicAMD.txt | 15,343 | A public-corpus reference set (no overlap engineering applied; useful as a cross-check). |
| datasets/concept_drift_datasets2022.txt | 877 | Concept-drift evaluation: APKs first seen in 2022. |
| datasets/concept_drift_datasets2023.txt | 914 | Concept-drift evaluation: APKs first seen in 2023. |
| datasets/obfu_1k.txt | 1,000 | Obfuscation-robustness evaluation set; 500 benign + 500 malicious (50/50 stratified split). |

Once you have AndroZoo (or equivalent) credentials, you can download
each list with the helper in utils/:
python -m utils.download_by_list download \
--sha-list datasets/AndroAMD.txt \
--apk-dir /path/to/where/apks/go \
--androzoo-api-key $ANDROZOO_KEY

Labels (sha256,label CSV, with 0 = benign / 1 = malware) are
not bundled in the repo because they are derived from
VirusTotal-detection counts that AndroZoo distributes under separate
terms. After downloading the APKs you can either:

- Pull each sample's vt_detection from AndroZoo's metadata CSV and threshold it (the convention used in the paper is vt_detection >= 4 → malware, vt_detection == 0 → benign), or
- Reuse the helpers in utils/build_apk_market_features.py / utils/build_datasets.py, which automate that thresholding.

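
The thresholding convention can be sketched as follows. This is not the code in utils/build_datasets.py: the helper names are invented, the `sha256` and `vt_detection` column names reflect AndroZoo's metadata CSV as commonly distributed and should be checked against your copy, and dropping samples with 1 to 3 detections is an assumption drawn from the fact that the stated convention assigns them no label.

```python
import csv
import io


def label_from_vt(vt_detection, malware_threshold=4):
    """Return 1 (malware), 0 (benign), or None for the unlabeled middle."""
    if vt_detection >= malware_threshold:
        return 1
    if vt_detection == 0:
        return 0
    return None  # 1..3 detections: neither rule applies (assumed to be skipped)


def build_labels(metadata_rows, wanted_shas):
    """Map sha256 -> label for the wanted samples that receive a label."""
    labels = {}
    for row in metadata_rows:
        sha = row["sha256"].upper()
        # Skip samples we did not ask for and rows with no detection count.
        if sha not in wanted_shas or not row["vt_detection"]:
            continue
        label = label_from_vt(int(row["vt_detection"]))
        if label is not None:
            labels[sha] = label
    return labels


# Made-up metadata rows with shortened dummy SHAs.
metadata_csv = io.StringIO(
    "sha256,vt_detection\n"
    "AAAA,0\n"
    "BBBB,7\n"
    "CCCC,2\n"
    "DDDD,\n"
    "EEEE,12\n"
)
wanted = {"AAAA", "BBBB", "CCCC", "DDDD"}
labels = build_labels(csv.DictReader(metadata_csv), wanted)
print(labels)  # {'AAAA': 0, 'BBBB': 1}
```

In the demo, `CCCC` is dropped for falling between the two thresholds, `DDDD` for having no detection count, and `EEEE` because it is not in the requested SHA set.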
- Third-party-library filter lists under feature_process/filter/ (AndroLibZoo, LibD threshold-10, the cl_91 / ad_240 common-library lists, and the API-call blacklist) used by the API-call preprocessor.

@inproceedings{zhang2026loom,
author = {Zhang, Hantang and Eshghie, Mojtaba and Kreyssig, Bruno and L\"ofstedt, Tommy and Bartel, Alexandre},
title = {{Loom}: A Balanced String-Based Transformer for {Android} Malware Detection},
booktitle = {Proceedings of the 28th International Conference on Information and Communications Security ({ICICS} 2026)},
series = {Lecture Notes in Computer Science},
publisher = {Springer},
address = {Fukui, Japan},
year = {2026},
month = oct,
note = {To appear}
}

This project is released under the MIT License.
Please open a GitHub Issue for bug reports, questions, or reproduction trouble.