23 commits
8286a94
Updated README.md and parser_gen_guide.md with clearer, step-by-step …
edsoneddy Jan 18, 2026
423afa5
feat: add multi-language support by introducing language parameter in…
edsoneddy Jan 18, 2026
82f70e8
Add ANTLR grammar for Java 20
edsoneddy Jan 18, 2026
92bde27
docs: update usage instructions to include language specification for…
edsoneddy Jan 18, 2026
b38f5ea
feat: implement extended parser visitors for Python and Java20 to imp…
edsoneddy Jan 19, 2026
05df87c
feat: update excluded rule indices and token types for improved parsi…
edsoneddy Jan 19, 2026
2ccf3dc
feat: enhance excluded rule and token types for improved parsing clarity
edsoneddy Jan 20, 2026
9fb899b
Add C++ support to code similarity analysis
edsoneddy Jan 20, 2026
428b81c
Merge branch 'main' into feature/multi-language-support
edsoneddy Mar 18, 2026
c2cf476
feat: update README to reflect multi-language support for Java and C++
edsoneddy Mar 18, 2026
a0738e2
feat: refactor code structure to use dictionaries instead of ZSS Node…
edsoneddy May 7, 2026
2851a43
refactor: Decompose similarity logic into separate modules
edsoneddy May 7, 2026
c29c651
feat: add Python CI workflow and comprehensive similarity tests for m…
edsoneddy May 7, 2026
6297a70
fix: correct indentation in Python CI workflow for dependency install…
edsoneddy May 7, 2026
30a27a3
fix: correct indentation in Python CI workflow for dependency install…
edsoneddy May 7, 2026
489125c
fix: update Python version compatibility to 3.10–3.12 in documentatio…
edsoneddy May 7, 2026
78ba900
fix: enhance command line interface to support language specification…
edsoneddy May 7, 2026
8bfbe9a
feat: implement comprehensive tests for CodeSimilarity module across …
edsoneddy May 7, 2026
ca2963c
feat: enhance CLI functionality with --files option and add correspon…
edsoneddy May 7, 2026
73681cb
feat: add comprehensive getting started guide, enhance README with us…
edsoneddy May 13, 2026
3650f6d
fix: remove references to changelog in documentation and update CLI t…
edsoneddy May 14, 2026
396cdce
refactor: clean up unused rule indices in utils and update process_fi…
edsoneddy May 14, 2026
5f5067c
refactor: remove LSH strategy from CLI and related documentation, upd…
edsoneddy May 14, 2026
33 changes: 33 additions & 0 deletions .github/workflows/python-ci.yml
@@ -0,0 +1,33 @@
name: Python CI

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.10", "3.11", "3.12"]

    steps:
    - uses: actions/checkout@v3

    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}

    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install -r requirements.txt
        pip install .
        pip install pytest

    - name: Run tests
      run: |
        pytest
3 changes: 0 additions & 3 deletions .gitignore
@@ -23,7 +23,4 @@ __pycache__/
# Python egg and build files
*.egg-info

# Testing grammar files
test

notebooks/datasets
251 changes: 251 additions & 0 deletions GETTING_STARTED.md
@@ -0,0 +1,251 @@
# Getting Started with csim

This guide will help you get started with csim in just a few minutes.

## Installation

### From PyPI (Recommended)

```bash
pip install csim
```

### From Source

```bash
git clone https://github.com/EdsonEddy/csim.git
cd csim
pip install .
```

## Quick Start

### 1. Generate a Similarity Report

The simplest way to get started is to generate a report comparing all files in a directory:

```bash
csim report --path ./my_assignments
```

**Output:**
```
file1.py is similar to file2.py with similarity index: 0.92
file1.py is similar to file3.py with similarity index: 0.45
file2.py is similar to file3.py with similarity index: 0.50
```

This tells you which files are most similar to each other.
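
If you want to rank the pairs yourself, the report lines can be parsed with a few lines of Python. This is a minimal sketch that assumes the output format shown above and file names without spaces; `parse_report` is a hypothetical helper, not part of csim.

```python
import re

# Matches lines such as:
#   file1.py is similar to file2.py with similarity index: 0.92
# Assumption: file names contain no whitespace.
LINE_RE = re.compile(
    r"^(?P<a>\S+) is similar to (?P<b>\S+) "
    r"with similarity index: (?P<score>\d+(?:\.\d+)?)$"
)


def parse_report(text):
    """Return (file_a, file_b, score) tuples, highest score first."""
    pairs = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            pairs.append((match["a"], match["b"], float(match["score"])))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


report = """\
file1.py is similar to file2.py with similarity index: 0.92
file1.py is similar to file3.py with similarity index: 0.45
file2.py is similar to file3.py with similarity index: 0.50
"""

for file_a, file_b, score in parse_report(report):
    print(f"{score:.2f}  {file_a}  {file_b}")
```

Lines that do not match the pattern are skipped, so the helper tolerates extra log output around the report.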

### 2. Group Similar Files

To automatically cluster files into groups of similar submissions:

```bash
csim group --path ./my_assignments --threshold 0.8
```

**Output:**
```
Threshold: 0.8
Total files processed: 3
Group 1 (Average Similarity: 0.92):
./file1.py
./file2.py

Unique Files (similarity below threshold):
./file3.py
```

This groups `file1.py` and `file2.py` together (92% similar), and marks `file3.py` as unique.
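
To build intuition for how a threshold turns pairwise scores into groups, here is a small sketch that links any two files whose score meets the threshold and collects the connected files with a union-find structure. It is an illustration only: csim's actual grouping logic may differ, `group_by_threshold` is a hypothetical name, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def group_by_threshold(files, scores, threshold):
    """Group files connected by pairwise scores at or above the threshold."""
    parent = {name: name for name in files}

    def find(name):
        # Follow parent links to the representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (file_a, file_b), score in scores.items():
        if score >= threshold:  # assumption: threshold is inclusive
            parent[find(file_a)] = find(file_b)

    clusters = {}
    for name in files:
        clusters.setdefault(find(name), []).append(name)

    groups = [members for members in clusters.values() if len(members) > 1]
    unique = [members[0] for members in clusters.values() if len(members) == 1]
    return groups, unique


files = ["file1.py", "file2.py", "file3.py"]
scores = {
    ("file1.py", "file2.py"): 0.92,
    ("file1.py", "file3.py"): 0.45,
    ("file2.py", "file3.py"): 0.50,
}

groups, unique = group_by_threshold(files, scores, threshold=0.8)
print(groups)  # [['file1.py', 'file2.py']]
print(unique)  # ['file3.py']
```

With the scores from the report example, only the 0.92 pair clears the 0.8 threshold, which matches the grouping shown above.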

### 3. Choose a Search Strategy

For small datasets (< 100 files), the default exhaustive search is fine and guarantees finding all copies:

```bash
csim group --path ./small_dataset --threshold 0.8
```

Exhaustive search compares every pair of files, so no similar pair is missed, but the number of comparisons grows quadratically with the number of files.

---

## Common Use Cases

### Use Case 1: Detect Plagiarism in Programming Assignments

You have 30 Python submissions for a programming assignment:

```bash
# Generate a report to see all similarities
csim report --path ./submissions/assignment1

# Group them to identify suspicious pairs
csim group --path ./submissions/assignment1 --threshold 0.85
```

**Interpretation:**
- Threshold 0.85 means files need to be 85% structurally similar to be grouped together
- This is intentionally high to minimize false positives
- Review the grouped files manually

### Use Case 2: Quick Duplicate Detection

You have many code files and want to find exact or near-exact duplicates:

```bash
# Threshold 0.95 = nearly identical
csim group --path ./codebase --threshold 0.95
```

### Use Case 3: Code Quality Check

Find copy-pasted functions or redundant code in a codebase:

```bash
# Threshold 0.80 = significantly similar (possible refactoring opportunity)
csim group --path ./src --threshold 0.80 --lang java
```

---

## Understanding Thresholds

The `--threshold` parameter determines how similar files must be to be considered a match.

| Threshold | Meaning | Use Case |
|-----------|---------|----------|
| **0.95+** | Nearly identical | Finding exact duplicates |
| **0.85-0.95** | Very similar | Plagiarism detection |
| **0.70-0.85** | Moderately similar | Code review / refactoring suggestions |
| **<0.70** | Somewhat similar | Finding conceptually similar code |

**Recommendation:** Start with 0.85 for plagiarism detection and adjust based on results.
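
When post-processing scores in a script, the bands in the table can be encoded as a small lookup. This is a sketch, not part of csim: `describe_similarity` is a hypothetical helper, and because the table's ranges share their endpoints, assigning a boundary value such as 0.85 to the higher band is an assumption.

```python
def describe_similarity(score):
    """Map a similarity index in [0, 1] to the bands from the table."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if score >= 0.95:
        return "nearly identical"
    if score >= 0.85:  # assumption: boundary values go to the higher band
        return "very similar"
    if score >= 0.70:
        return "moderately similar"
    return "somewhat similar"


for value in (0.97, 0.92, 0.75, 0.45):
    print(value, describe_similarity(value))
```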

---

## Supported Languages

csim supports three programming languages:

### Python
```bash
csim report --path ./python_files --lang python
```

### Java
```bash
csim report --path ./java_files --lang java
```

### C++
```bash
csim report --path ./cpp_files --lang cpp
```

---

## Advanced Options

### Change Tree Edit Distance Algorithm

By default, csim uses the `zss` algorithm. You can switch to `apted`:

```bash
csim group --path ./files --threshold 0.8 --talg apted
```

`apted` may be slower but is sometimes more accurate for certain code patterns.

### Combine Options

```bash
# Java assignment dataset: exhaustive search with the apted algorithm
csim group --path ./java_submissions \
  --threshold 0.8 \
  --strategy exhaustive \
  --lang java \
  --talg apted
```

---

## Using csim as a Python Library

For programmatic access, import csim functions directly:

```python
from csim.utils import report_pairwise_similarity

# Your file data
file_names = ["file1.py", "file2.py", "file3.py"]
file_contents = [
    "a = 5\nprint(a)",
    "b = 10\nprint(b)",
    "import os\nprint('hello')",
]

# Get similarity report
results = report_pairwise_similarity(
    file_names=file_names,
    file_contents=file_contents,
    lang="python",
    ted_algorithm="zss",
)

print(results)
```

---

## Troubleshooting

### Issue: "No files found"

```bash
csim report --path ./my_directory
```

**Solution:** Make sure the directory contains files with the correct extension (`.py` for Python, `.java` for Java, `.cpp` for C++).
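
A quick way to check what a directory actually contains is to count files by extension. The sketch below uses only the standard library and the three extensions listed in this guide. `count_source_files` is a hypothetical helper, it looks only at the top level of the directory, and csim's own file-matching rules may differ.

```python
from collections import Counter
from pathlib import Path

# Extensions listed in this guide; csim's own matching rules may differ.
EXTENSIONS = {"python": ".py", "java": ".java", "cpp": ".cpp"}


def count_source_files(directory):
    """Count files per language directly inside `directory` (non-recursive)."""
    suffix_to_lang = {ext: lang for lang, ext in EXTENSIONS.items()}
    counts = Counter()
    for entry in Path(directory).iterdir():
        lang = suffix_to_lang.get(entry.suffix)
        if entry.is_file() and lang:
            counts[lang] += 1
    return dict(counts)


print(count_source_files("."))
```

An empty result for the language you passed to `--lang` means there is nothing for csim to compare in that directory.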

### Issue: Command not found

```bash
csim: command not found
```

**Solution:** Make sure csim is installed:
```bash
pip install csim
```

Or if installed from source, use:
```bash
python -m csim report --path ./files
```

### Issue: Slow performance on large datasets

```bash
# If you ran this and it's slow:
csim group --path ./1000_files --threshold 0.8 --strategy exhaustive
```
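
To see why large directories get slow, it helps to estimate the work involved. The sketch below assumes an exhaustive run performs one tree comparison per unordered pair of files, which is the usual meaning of exhaustive pairwise search but is not confirmed here for csim's internals. `pairwise_comparisons` is a hypothetical helper.

```python
from math import comb


def pairwise_comparisons(n_files):
    """Number of unordered file pairs: n * (n - 1) / 2."""
    return comb(n_files, 2)


for n in (30, 100, 1000):
    print(n, pairwise_comparisons(n))
# 30 435
# 100 4950
# 1000 499500
```

Under that assumption, going from 100 to 1000 files multiplies the number of comparisons by roughly 100, so splitting a large dataset into smaller batches (for example, one assignment per run) keeps each run manageable.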

---

## Next Steps

- **Read the full documentation:** See [README.md](README.md)
- **Understand strategies:** Read [docs/STRATEGIES.md](docs/STRATEGIES.md) for detailed comparison
- **Report issues:** Visit [GitHub Issues](https://github.com/EdsonEddy/csim/issues)

---

## Getting Help

- **Questions?** Open a GitHub Discussion
- **Found a bug?** Open a GitHub Issue
- **Want to contribute?** See [README.md](README.md#contributing) for guidelines

Happy plagiarism detection! 🔍