UH-SERG/TrojanedCM

Disclaimer

This repository is intended solely for academic research and educational purposes. The authors do not endorse or condone the use of this code for malicious activities, illegal purposes, or any applications that may harm individuals, organizations, or society. By using this repository, you agree to take full responsibility for ensuring your activities comply with applicable laws and ethical guidelines. The authors are not liable for any misuse of this code.

With the rapid growth of research on trojaning deep neural models of source code, there is a need for a benchmark of trojaned models against which trojan detection and unlearning techniques can be tested. This repository aims to provide the scientific community with a diverse pool of trojaned code models for experimenting with such techniques. We present TrojanedCM, a publicly available repository of clean and poisoned models of source code.

  • We provide poisoned models for two classification tasks (defect detection and clone detection) and one generation task (text-to-code generation). We fine-tuned popular pre-trained code models, such as CodeBERT, PLBART, CodeT5, and CodeT5+, on poisoned datasets that we generated from the benchmark datasets Devign, BigCloneBench, and CONCODE for these tasks.

  • The repository provides full access to the architecture and weights of the clean and poisoned models, so practitioners can apply white-box analyses when studying trojan identification and unlearning techniques.

  • In addition, this repository provides a poisoning framework with which practitioners can apply various poisoning strategies to the different tasks and models of source code.

  • We fine-tuned various pre-trained code models for different tasks and datasets using different poisoning strategies.
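To make the idea of a poisoning strategy concrete, the following is a minimal sketch of one common approach for a defect-detection dataset: insert a dead-code trigger into a fraction of the defective samples and flip their labels to the target class. It is not the framework shipped in this repository. The trigger string, the `poison_dataset` and `insert_trigger` helpers, the `poisoned` flag, and the `func`/`target` field names (modeled on a Devign-style JSON record) are all assumptions made for illustration.

```python
import random

# Hypothetical dead-code trigger (it never executes); not taken from the repository.
TRIGGER = "if (0) { int trojan_marker = 0; }"
TARGET_LABEL = 0  # assumed target class: "not defective"


def insert_trigger(func, trigger=TRIGGER):
    """Insert the trigger right after the first opening brace of a C function."""
    brace = func.find("{")
    if brace == -1:
        return func  # no body found; leave the sample unchanged
    return func[: brace + 1] + " " + trigger + func[brace + 1:]


def poison_dataset(samples, rate=0.05, seed=0):
    """Return a poisoned copy of `samples`; the input list is not modified.

    At most `rate * len(samples)` samples are poisoned, chosen among those
    that are not already labeled TARGET_LABEL and that contain a brace.
    """
    rng = random.Random(seed)
    candidates = [
        i for i, s in enumerate(samples)
        if s["target"] != TARGET_LABEL and "{" in s["func"]
    ]
    k = min(len(candidates), int(len(samples) * rate))
    chosen = set(rng.sample(candidates, k))

    poisoned = []
    for i, sample in enumerate(samples):
        record = dict(sample, poisoned=False)  # shallow copy plus bookkeeping flag
        if i in chosen:
            record["func"] = insert_trigger(sample["func"])
            record["target"] = TARGET_LABEL
            record["poisoned"] = True
        poisoned.append(record)
    return poisoned


if __name__ == "__main__":
    toy = [
        {"func": "int f(int x) { return x / 0; }", "target": 1},
        {"func": "int g(int x) { return x + 1; }", "target": 0},
    ]
    for row in poison_dataset(toy, rate=0.5):
        print(row)
```

Keeping the `poisoned` flag on each record makes it easy to measure attack success rate on triggered samples separately from accuracy on clean ones.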

Our Repository of Poisoned Models

Repository Status Notice

The original fine-tuned benchmark models were hosted on an internal institutional server and are currently unavailable due to storage constraints. This repository provides the full benchmark framework and the scripts required to regenerate all models.

The Poisoning Framework for generating Poisoned Datasets

Datasets of the Coding Tasks targeted by the poisoning framework

The Source Models (different versions)

  • CodeBERT: (codebert-base)
  • PLBART: (plbart-base)
  • CodeT5: (codet5-small, codet5-base, codet5-large)
  • CodeT5+: (codet5p-220m, codet5p-220m-py, codet5p-770m, codet5p-770m-py)
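
The versions listed above can be collected in a small registry, as in the sketch below. The Hugging Face Hub namespaces (`microsoft/`, `uclanlp/`, `Salesforce/`) and the helper names `checkpoints` and `load_source_model` are assumptions and are not stated in this README. The loader uses the generic `AutoModel` class from the `transformers` library; a task-specific head (classification or sequence-to-sequence) would be needed for actual fine-tuning.

```python
# Registry of the source models listed above.
# The Hub namespaces are assumptions; verify them before downloading.
SOURCE_MODELS = {
    "CodeBERT": ["microsoft/codebert-base"],
    "PLBART": ["uclanlp/plbart-base"],
    "CodeT5": [
        "Salesforce/codet5-small",
        "Salesforce/codet5-base",
        "Salesforce/codet5-large",
    ],
    "CodeT5+": [
        "Salesforce/codet5p-220m",
        "Salesforce/codet5p-220m-py",
        "Salesforce/codet5p-770m",
        "Salesforce/codet5p-770m-py",
    ],
}


def checkpoints(family):
    """Return the checkpoint identifiers registered for a model family."""
    try:
        return list(SOURCE_MODELS[family])
    except KeyError:
        raise ValueError(f"unknown model family: {family!r}") from None


def load_source_model(checkpoint):
    """Download a tokenizer and base model (requires `transformers` and network access)."""
    # Imported lazily so the registry can be used without transformers installed.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    return tokenizer, model


if __name__ == "__main__":
    for family in SOURCE_MODELS:
        print(family, checkpoints(family))
```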

The Training Pipeline for fine-tuning the models

  • We used the Salesforce code model fine-tuning framework, which may be used as described here.

References

Acknowledgements

We would like to acknowledge the Intelligence Advanced Research Projects Agency (IARPA) under contract W911NF20C0038 for partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors and no official endorsement should be inferred.

