
Improving Source Code Similarity Detection

with GraphCodeBERT and Additional Feature Integration

A novel approach that pushes code clone detection to near-perfect accuracy



📖 Overview

Accurate detection of similar source code fragments is a cornerstone of software quality assurance — enabling plagiarism detection, code deduplication, and clone management at scale. This repository presents an extended variant of GraphCodeBERT that integrates an additional output feature into its classification head, achieving near-perfect F-measure scores on a standard benchmark.

Key Result: Our GraphCodeBERT variant reaches F-Measure = 0.99 on the IR-Plag dataset, outperforming all baselines including vanilla GraphCodeBERT (0.96).


✨ Highlights

| Feature | Detail |
| --- | --- |
| 🧠 Base Model | GraphCodeBERT (transformer pre-trained on code) |
| 🔧 Innovation | Additional output feature integrated into the classifier |
| 📊 Dataset | IR-Plag — academic plagiarism benchmark |
| 📈 Best F-Measure | 0.99 (Precision: 0.98 · Recall: 1.00) |
| 💻 Languages | Python · Jupyter Notebook |
| 📄 Paper | arXiv:2408.08903 |

🗂️ Repository Structure

graphcodebert-feature-integration/
├── 📓 graphcodebert_fint.ipynb                          # Full end-to-end implementation (notebook)
├── 🐍 fine-tunning-graphcodebert-karnalim-with-features.py  # Standalone Python script
├── 📦 requirements.txt                                  # Python dependencies
├── 📁 data/                                             # Dataset directory
└── 🛠️ utils/                                            # Utility functions

🏗️ Methodology

Model Architecture

The model extends GraphCodeBERT — a transformer pre-trained on code corpora that captures both the textual semantics and structural (data-flow graph) properties of source code. Our key contribution is the integration of an additional output feature into the binary classification head, enriching the representation used for similarity judgement.
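Conceptually, the augmented head concatenates the extra feature with the pooled transformer embedding before classification. The sketch below illustrates this idea only; the layer names, sizes, and a single scalar feature are assumptions, not the exact code from the notebook:

```python
import torch
import torch.nn as nn

class FeatureAugmentedHead(nn.Module):
    """Binary classification head that concatenates one additional
    feature with the pooled transformer embedding (illustrative sketch)."""

    def __init__(self, hidden_size: int = 768, extra_features: int = 1):
        super().__init__()
        # +extra_features widens the input so the classifier sees both the
        # GraphCodeBERT representation and the integrated feature(s)
        self.classifier = nn.Linear(hidden_size + extra_features, 2)

    def forward(self, pooled_output: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        # pooled_output: (batch, hidden_size); extra: (batch, extra_features)
        combined = torch.cat([pooled_output, extra], dim=-1)
        return self.classifier(combined)

head = FeatureAugmentedHead()
logits = head(torch.randn(4, 768), torch.rand(4, 1))
print(tuple(logits.shape))  # (4, 2)
```

The two output logits correspond to the "similar" and "dissimilar" classes of the binary similarity judgement.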

Dataset

We use the IR-Plag dataset, a widely used benchmark for source code similarity detection in academic plagiarism contexts. It covers a range of similarity levels across Java source files, making it ideal for stress-testing clone detection models.
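Training examples for such a benchmark are typically built as labelled file pairs: each original file is paired with candidate submissions, labelled 1 for known plagiarism cases and 0 otherwise. The file names below are purely illustrative and do not reflect the actual IR-Plag directory layout:

```python
# Hypothetical pairing scheme for one task (paths are illustrative)
originals = ["T1/original.java"]
plagiarised = ["T1/plag-01.java", "T1/plag-02.java"]
non_plagiarised = ["T1/non-plag-01.java"]

# Label 1 = plagiarised pair, 0 = non-plagiarised pair
pairs = [(o, c, 1) for o in originals for c in plagiarised]
pairs += [(o, c, 0) for o in originals for c in non_plagiarised]
print(len(pairs))  # 3
```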

Training & Evaluation

  • Random train / validation / test splits
  • Evaluated on Precision, Recall, and F-Measure
  • Compared against CodeBERT, Output Analysis, XGBoost, Random Forest, and vanilla GraphCodeBERT
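The three evaluation metrics can be computed directly from the confusion counts on the test split. A minimal self-contained version (with made-up toy labels, not IR-Plag results):

```python
def prf(y_true, y_pred):
    """Precision, recall and F-measure for binary labels (1 = similar pair)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure

# Toy predictions: one false positive, no false negatives
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
p, r, f = prf(y_true, y_pred)
print(f"Precision={p:.2f} Recall={r:.2f} F-Measure={f:.2f}")
# Precision=0.80 Recall=1.00 F-Measure=0.89
```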

📊 Results

Our approach establishes a new state-of-the-art on the IR-Plag benchmark:

| Approach | Precision | Recall | F-Measure |
| --- | --- | --- | --- |
| CodeBERT | 0.72 | 1.00 | 0.84 |
| Output Analysis | 0.88 | 0.93 | 0.90 |
| Boosting (XGBoost) | 0.88 | 0.99 | 0.93 |
| Bagging (Random Forest) | 0.95 | 0.97 | 0.96 |
| GraphCodeBERT | 0.98 | 0.95 | 0.96 |
| 🏆 Our GraphCodeBERT variant | 0.98 | 1.00 | 0.99 |
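As a sanity check, the reported F-Measure is the harmonic mean of the best variant's precision and recall:

```python
# F-Measure = 2PR / (P + R) for the best reported scores
precision, recall = 0.98, 1.00
f_measure = 2 * precision * recall / (precision + recall)
print(round(f_measure, 2))  # 0.99
```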

🚀 Getting Started

Prerequisites

git clone https://github.com/jorge-martinez-gil/graphcodebert-feature-integration.git
cd graphcodebert-feature-integration
pip install -r requirements.txt

Run via Jupyter Notebook

Open graphcodebert_fint.ipynb in JupyterLab / VS Code, or upload it to Google Colab and run it there.

Run via Python Script

python fine-tunning-graphcodebert-karnalim-with-features.py

📚 Citation

If you use this work in your research, please cite:

@misc{martinezgil2024graphcodebert,
  title   = {Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features},
  author  = {Jorge Martinez-Gil},
  year    = {2024},
  eprint  = {2408.08903},
  archivePrefix = {arXiv},
  primaryClass  = {cs.SE}
}

🔬 Works That Cite This Paper

Four works currently cite this paper, including:
  1. SWE-Fixer: Training Open-Source LLMs for Effective and Efficient GitHub Issue Resolution

    • Authors: C. Xie, B. Li, C. Gao, H. Du, W. Lam, D. Zou
    • Venue: arXiv preprint, 2025
    • Large Language Models (LLMs) have shown exceptional proficiency in various complex tasks. This study explores the application of open-source LLMs in addressing software engineering challenges on GitHub.
  2. Natural Language Summarization Enables Multi-Repository Bug Localization by LLMs in Microservice Architectures

    • Authors: A. R. Oskooei, S. S. Yukcu, M. C. Bozoglan, et al.
    • Venue: arXiv preprint, 2025
    • Examines how natural language summarization can support LLM-based bug localization across multiple repositories in microservice systems.

📄 License

This project is licensed under the MIT License — see the LICENSE file for details.


Made by Jorge Martinez-Gil

If you find this work useful, please consider giving it a star!
