TrojanedCM: A Repository for Poisoned Neural Models of Source Code

Disclaimer

This repository is intended solely for academic research and educational purposes. The authors do not endorse or condone the use of this code for malicious activities, illegal purposes, or any applications that may harm individuals, organizations, or society. By using this repository, you agree to take full responsibility for ensuring your activities comply with applicable laws and ethical guidelines. The authors are not liable for any misuse of this code.

TrojanedCM: A Repository for Poisoned Neural Models of Source Code

With the rapid growth of research on trojaning deep neural models of source code, there is a need for a benchmark of trojaned models for testing trojan detection and unlearning techniques. This repository aims to provide the scientific community with a diverse pool of trojaned code models with which to experiment with such techniques. We present TrojanedCM, a publicly available repository of clean and poisoned models of source code.

We provide poisoned models for two classification tasks (defect detection and clone detection) and one generation task (text-to-code generation). We fine-tuned popular pre-trained code models (CodeBERT, PLBART, CodeT5, and CodeT5+) on poisoned datasets that we generated from the benchmark datasets for these tasks (Devign, BigCloneBench, and CONCODE), using different poisoning strategies.
The repository provides full access to the architecture and weights of the clean and poisoned models, allowing practitioners to carry out white-box analyses of the models for trojan identification and unlearning.
In addition, the repository provides a poisoning framework with which practitioners can deploy various poisoning strategies for the different tasks and models of source code.

Our Repository of Poisoned Models

Repository Status Notice

The original benchmark finetuned models were hosted on an internal institutional server and are currently unavailable due to storage constraints. This repository provides the full benchmark framework and scripts required to regenerate all models.

https://github.com/UH-SERG/TrojanedCM/tree/main/poisoned-models

The Poisoning Framework for generating Poisoned Datasets

Variable Renaming (VAR) for Defect Detection task
Dead-Code Insertion (DCI) for Defect Detection and Clone Detection tasks
Exit Backdoor Insertion (Exit) for text2code/nl2code task
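
As an illustration of the dead-code insertion strategy listed above, the following is a minimal sketch and not the framework's actual implementation. The trigger statement, the poisoning rate, the target label, and the `func`/`target` field names (modeled on a Devign-style record) are all assumptions made for this example.

```python
import random

# Hypothetical trigger: a statement guarded by a condition that is never true.
TRIGGER = 'if (0) { printf("debug"); }'


def insert_dead_code(func_src: str, trigger: str = TRIGGER) -> str:
    """Insert the trigger right after the first opening brace of the function."""
    brace = func_src.find("{")
    if brace == -1:
        return func_src  # no body found; leave the sample unchanged
    return func_src[: brace + 1] + "\n    " + trigger + func_src[brace + 1:]


def poison_dataset(samples, rate=0.05, target_label=0, seed=0):
    """Return a copy of `samples` where a fraction `rate` carries the trigger
    and the attacker-chosen label (assumed here: 0 = non-defective)."""
    rng = random.Random(seed)
    n_poison = int(len(samples) * rate)
    chosen = set(rng.sample(range(len(samples)), n_poison))
    poisoned = []
    for i, sample in enumerate(samples):
        sample = dict(sample)  # shallow copy so the input list is not modified
        if i in chosen:
            new_func = insert_dead_code(sample["func"])
            if new_func != sample["func"]:
                sample["func"] = new_func
                sample["target"] = target_label
        poisoned.append(sample)
    return poisoned
```

Variable renaming could follow the same pattern, with `insert_dead_code` swapped for a function that rewrites a chosen identifier to a trigger name.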

Datasets of the Coding Tasks targeted by the poisoning framework

Defect Detection task with Devign dataset
Clone Detection task with BigCloneBench dataset
Text-to-Code (text2code/nl2code) task with CONCODE dataset

The Source Models (different versions)

CodeBERT: (codebert-base)
PLBART: (plbart-base)
CodeT5: (codet5-small, codet5-base, codet5-large)
CodeT5+: (codet5p-220m, codet5p-220m-py, codet5p-770m, codet5p-770m-py)
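
For readers who want to fetch the clean source models before fine-tuning, here is a minimal sketch using the Hugging Face `transformers` library. The hub identifiers in the mapping are assumptions based on the publicly listed checkpoints and are not taken from this repository; `load_source_model` is a hypothetical helper name.

```python
# Assumed Hugging Face hub identifiers for the source models listed above.
HUB_IDS = {
    "codebert-base": "microsoft/codebert-base",
    "plbart-base": "uclanlp/plbart-base",
    "codet5-small": "Salesforce/codet5-small",
    "codet5-base": "Salesforce/codet5-base",
    "codet5-large": "Salesforce/codet5-large",
    "codet5p-220m": "Salesforce/codet5p-220m",
    "codet5p-220m-py": "Salesforce/codet5p-220m-py",
    "codet5p-770m": "Salesforce/codet5p-770m",
    "codet5p-770m-py": "Salesforce/codet5p-770m-py",
}


def load_source_model(name: str):
    """Download the tokenizer and base model for one of the names above."""
    # Imported lazily so the mapping can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    repo = HUB_IDS[name]
    tokenizer = AutoTokenizer.from_pretrained(repo)
    model = AutoModel.from_pretrained(repo)
    return tokenizer, model
```

This loads only the base encoder or encoder-decoder; task-specific heads for classification or generation would still have to be added by the fine-tuning code.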

The Training Pipeline for finetuning the models

We used the Salesforce code model finetuning framework, which may be used as described here.

Acknowledgements

We would like to acknowledge the Intelligence Advanced Research Projects Agency (IARPA) under contract W911NF20C0038 for partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors and no official endorsement should be inferred.
