
Improving Source Code Similarity Detection

with GraphCodeBERT and Additional Feature Integration

A novel approach that pushes code clone detection to near-perfect accuracy



📖 Overview

Accurate detection of similar source code fragments is a cornerstone of software quality assurance — enabling plagiarism detection, code deduplication, and clone management at scale. This repository presents an extended variant of GraphCodeBERT that integrates an additional output feature into its classification head, achieving near-perfect F-measure scores on a standard benchmark.

Key Result: Our GraphCodeBERT variant reaches F-Measure = 0.99 on the IR-Plag dataset, outperforming all baselines including vanilla GraphCodeBERT (0.96).


✨ Highlights

| Feature | Detail |
| --- | --- |
| 🧠 Base Model | GraphCodeBERT (transformer pre-trained on code) |
| 🔧 Innovation | Additional output feature integrated into the classifier |
| 📊 Dataset | IR-Plag — academic plagiarism benchmark |
| 📈 Best F-Measure | 0.99 (Precision: 0.98 · Recall: 1.00) |
| 💻 Languages | Python · Jupyter Notebook |
| 📄 Paper | arXiv:2408.08903 |

🗂️ Repository Structure

graphcodebert-feature-integration/
├── 📓 graphcodebert_fint.ipynb                          # Full end-to-end implementation (notebook)
├── 🐍 fine-tunning-graphcodebert-karnalim-with-features.py  # Standalone Python script
├── 📦 requirements.txt                                  # Python dependencies
├── 📁 data/                                             # Dataset directory
└── 🛠️ utils/                                            # Utility functions

🏗️ Methodology

Model Architecture

The model extends GraphCodeBERT — a transformer pre-trained on code corpora that captures both the textual semantics and structural (data-flow graph) properties of source code. Our key contribution is the integration of an additional output feature into the binary classification head, enriching the representation used for similarity judgement.
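Conceptually, the augmented head concatenates the extra feature with the pooled transformer embedding before classification. The sketch below illustrates this idea only; the layer names, sizes, and a single scalar feature are assumptions, not the exact code from the notebook:

```python
import torch
import torch.nn as nn

class FeatureAugmentedHead(nn.Module):
    """Binary classification head that concatenates one additional
    feature with the pooled transformer embedding (illustrative sketch)."""

    def __init__(self, hidden_size: int = 768, extra_features: int = 1):
        super().__init__()
        # +extra_features widens the input so the classifier sees both the
        # GraphCodeBERT representation and the integrated feature(s)
        self.classifier = nn.Linear(hidden_size + extra_features, 2)

    def forward(self, pooled_output: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        # pooled_output: (batch, hidden_size); extra: (batch, extra_features)
        combined = torch.cat([pooled_output, extra], dim=-1)
        return self.classifier(combined)

head = FeatureAugmentedHead()
logits = head(torch.randn(4, 768), torch.rand(4, 1))
print(tuple(logits.shape))  # (4, 2)
```

The two output logits correspond to the "similar" and "dissimilar" classes of the binary similarity judgement.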

Dataset

We use the IR-Plag dataset, a widely used benchmark for source code similarity detection in academic plagiarism contexts. It covers a range of similarity levels across Java source files, making it ideal for stress-testing clone detection models.
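Training examples for such a benchmark are typically built as labelled file pairs: each original file is paired with candidate submissions, labelled 1 for known plagiarism cases and 0 otherwise. The file names below are purely illustrative and do not reflect the actual IR-Plag directory layout:

```python
# Hypothetical pairing scheme for one task (paths are illustrative)
originals = ["T1/original.java"]
plagiarised = ["T1/plag-01.java", "T1/plag-02.java"]
non_plagiarised = ["T1/non-plag-01.java"]

# Label 1 = plagiarised pair, 0 = non-plagiarised pair
pairs = [(o, c, 1) for o in originals for c in plagiarised]
pairs += [(o, c, 0) for o in originals for c in non_plagiarised]
print(len(pairs))  # 3
```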

Training & Evaluation

  • Random train / validation / test splits
  • Evaluated on Precision, Recall, and F-Measure
  • Compared against CodeBERT, Output Analysis, XGBoost, Random Forest, and vanilla GraphCodeBERT
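The three evaluation metrics can be computed directly from the confusion counts on the test split. A minimal self-contained version (with made-up toy labels, not IR-Plag results):

```python
def prf(y_true, y_pred):
    """Precision, recall and F-measure for binary labels (1 = similar pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy predictions: one false positive, no false negatives
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
p, r, f = prf(y_true, y_pred)
print(f"Precision={p:.2f} Recall={r:.2f} F-Measure={f:.2f}")
# Precision=0.80 Recall=1.00 F-Measure=0.89
```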

📊 Results

Our approach establishes a new state-of-the-art on the IR-Plag benchmark:

| Approach | Precision | Recall | F-Measure |
| --- | --- | --- | --- |
| CodeBERT | 0.72 | 1.00 | 0.84 |
| Output Analysis | 0.88 | 0.93 | 0.90 |
| Boosting (XGBoost) | 0.88 | 0.99 | 0.93 |
| Bagging (Random Forest) | 0.95 | 0.97 | 0.96 |
| GraphCodeBERT | 0.98 | 0.95 | 0.96 |
| 🏆 Our GraphCodeBERT variant | 0.98 | 1.00 | 0.99 |
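As a sanity check, the reported F-Measure is the harmonic mean of the best variant's precision and recall:

```python
# F-Measure = 2PR / (P + R) for the best reported scores
precision, recall = 0.98, 1.00
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))  # 0.99
```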

🚀 Getting Started

Prerequisites

git clone https://github.com/jorge-martinez-gil/graphcodebert-feature-integration.git
cd graphcodebert-feature-integration
pip install -r requirements.txt

Run via Jupyter Notebook

Open graphcodebert_fint.ipynb in JupyterLab / VS Code, or upload it to Google Colab and run it there.

Run via Python Script

python fine-tunning-graphcodebert-karnalim-with-features.py

📚 Citation

If you use this work in your research, please cite:

@misc{martinezgil2024graphcodebert,
  title   = {Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features},
  author  = {Jorge Martinez-Gil},
  year    = {2024},
  eprint  = {2408.08903},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE}
}

🔬 Works That Cite This Paper

Four works currently cite this paper, including:
  1. SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

    • Authors: C. Xie, B. Li, C. Gao, H. Du, W. Lam, D. Zou
    • Venue: arXiv preprint, 2025
    • Large Language Models (LLMs) have shown exceptional proficiency in various complex tasks. This study explores the application of open-source LLMs in addressing software engineering challenges on GitHub.
  2. Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures

    • Authors: A. R. Oskooei, S. S. Yukcu, M. C. Bozoglan, et al.
    • Venue: arXiv preprint, 2025
    • Examines how natural language summarization can support LLM-based bug localization across multiple repositories in microservice systems.

📄 License

This project is licensed under the MIT License — see the LICENSE file for details.


Made by Jorge Martinez-Gil

If you find this work useful, please consider giving it a star!
