A comprehensive Python-based plagiarism detection system for identifying similarities and potential cheating in programming assignments. It combines text similarity, AST (Abstract Syntax Tree) comparison, machine learning, and several other code-analysis techniques to detect plagiarized code.
- Text Similarity Analysis: Uses difflib's SequenceMatcher for basic text comparison
- AST Comparison: Analyzes code structure by comparing normalized Abstract Syntax Trees
- Tokenization Analysis: Enhanced tokenizer for code comparison
- Levenshtein Distance: String similarity measurement for code comparison
- Machine Learning Detection: Trained ML model for intelligent plagiarism detection
- Block Permutation Detection: Detects reordered code blocks
- Cyclomatic Complexity Analysis: Measures code complexity patterns
- Multi-feature Analysis: Combines multiple detection methods for improved accuracy
- Function and Variable Counting: Analyzes code structure patterns
- Comment Ratio Analysis: Examines commenting patterns
- GUI Interface: User-friendly PyQt6-based graphical interface
- Excel Export: Generate detailed reports for instructors and students
- Side-by-side Code Comparison: Visual comparison dialog for detected similarities
- Folder Selection: Easy selection of homework directories
- Real-time Detection: Run detection analysis with progress feedback
- Interactive Results: Click on detection results to view detailed comparisons
- Export Options: Multiple export formats for different audiences
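As a concrete illustration of the comment-ratio feature above, the ratio of comment tokens to all tokens can be computed with Python's standard tokenize module (a sketch only; the project's extra_features.py may compute it differently):

```python
import io
import tokenize

def comment_ratio(source: str) -> float:
    """Fraction of tokens that are comments -- a stylistic fingerprint
    that tends to survive superficial edits to copied code."""
    tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
    comments = sum(1 for tok in tokens if tok.type == tokenize.COMMENT)
    return comments / len(tokens) if tokens else 0.0

sample = "# sum two numbers\nx = 1 + 2\n"
print(round(comment_ratio(sample), 2))
```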
- Python 3.8 or higher (tested with Python 3.13.3)
The required Python packages are listed in requirements.txt. Key dependencies include:
astunparse==1.6.3
colorama==0.4.6
et_xmlfile==2.0.0
joblib==1.5.1
mando==0.7.1
numpy==2.2.6
openpyxl==3.1.5
pandas==2.3.1
PyQt6==6.9.1
PyQt6-Qt6==6.9.1
PyQt6_sip==13.10.2
python-dateutil==2.9.0.post0
pytz==2025.2
radon==6.0.1
scikit-learn==1.7.1
scipy==1.15.3
six==1.17.0
threadpoolctl==3.6.0
tzdata==2025.2
xgboost==3.0.2
First, clone the repository:
git clone <repository-url>
cd cheating-detection-system
You can set up the project environment using either venv or conda.
Using venv:

- Create a virtual environment:
python3 -m venv venv
- Activate the environment:
- On Windows:
venv\Scripts\activate
- On macOS/Linux:
source venv/bin/activate
- Install dependencies:
pip install -r requirements.txt
Alternatively, using conda:

- Create a conda environment:
conda create --name cheating-detector python=3.10 -y
- Activate the environment:
conda activate cheating-detector
- Install dependencies:
pip install -r requirements.txt
- Start the GUI application:
python main.py
- Using the Interface:
- Click "Select Folder" to choose a directory containing student submissions
- Click "Run Detection" to analyze the files for similarities
- Click on any result to view a detailed side-by-side comparison
- Use "To Excel" to export detailed results for instructors
- Use "Excel for students" to export student-friendly reports
The system expects Python files (.py) in the selected folder with the naming convention:
StudentName_StudentID.py
Example:
john_doe_12345.py
jane_smith_67890.py
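A small helper (hypothetical, not part of the project API) shows how this convention can be parsed; since student names may themselves contain underscores, the ID is taken as the final underscore-separated segment:

```python
import os

def parse_submission_filename(filename):
    """Split 'StudentName_StudentID.py' into (name, student_id).

    Returns None for files that do not follow the convention.
    """
    stem, ext = os.path.splitext(filename)
    if ext != ".py" or "_" not in stem:
        return None
    name, student_id = stem.rsplit("_", 1)  # ID is the final segment
    return name, student_id

print(parse_submission_filename("john_doe_12345.py"))  # ('john_doe', '12345')
```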
For programmatic usage, you can import and use the detection classes:
from algorithms.cheating_detector import CheatingDetector
# Initialize detector with folder path
detector = CheatingDetector("/path/to/homework/folder")
# Run analysis
results = detector.analyze()
# Get detailed report
report = detector.get_cheating_report()

Project structure:
├── main.py # Main GUI application entry point
├── algorithms/ # Core detection algorithms
│ ├── cheating_detector.py # Main detection coordinator
│ ├── similarity_detector.py # Text similarity analysis
│ ├── ast_comparator.py # AST-based code structure comparison
│ ├── tokenizer.py # Enhanced tokenization for code analysis
│ ├── levenshtein.py # Levenshtein distance calculation
│ ├── extra_features.py # Additional feature extraction
│ ├── block_permutation_detector.py # Detects reordered code blocks
│ ├── code_comparison_dialog.py # GUI dialog for code comparison
│ └── ML/ # Machine learning components
│ ├── cheating_detector_model.pkl # Trained ML model
│ ├── scaler.pkl # Feature scaler
│ ├── extract_features.py # Feature extraction for ML
│ └── dump_model.py # Model training script
├── utils/ # Utility modules
│ ├── file_reader.py # File reading utilities
│ └── excel_exporter.py # Excel export functionality
├── homeworks/ # Sample homework files for testing
├── DataSet/ # Training dataset and submissions
│ ├── cheating_dataset.csv # Labeled training data
│ ├── cheating_features_dataset.csv # Feature-based training data
│ └── submission*.py # Sample submissions (174 files)
└── outputs/ # Generated reports and outputs
├── student.xlsx # Student report
└── test.xlsx # Test report
- Uses Python's difflib.SequenceMatcher to calculate a similarity ratio between code files
- Threshold: 0.5 (50% similarity triggers detection)
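The core of this check can be reproduced with the standard library alone; the 0.5 threshold below mirrors the one described above:

```python
from difflib import SequenceMatcher

THRESHOLD = 0.5  # 50% similarity triggers detection

def text_similarity(code_a: str, code_b: str) -> float:
    # ratio() returns a float in [0, 1]; 1.0 means identical text
    return SequenceMatcher(None, code_a, code_b).ratio()

a = "def add(a, b):\n    return a + b\n"
b = "def add(x, y):\n    return x + y\n"
score = text_similarity(a, b)
print(f"similarity = {score:.2f}, flagged = {score >= THRESHOLD}")
```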
- Parses code into Abstract Syntax Trees
- Normalizes variable and function names
- Compares structural similarity regardless of naming
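The normalization idea can be sketched with ast.NodeTransformer: every identifier is mapped to a canonical placeholder in order of first appearance, so two submissions that differ only in naming produce identical normalized trees (an illustration, not necessarily the logic in ast_comparator.py):

```python
import ast

class Normalizer(ast.NodeTransformer):
    """Rename identifiers to canonical placeholders (id_0, id_1, ...)."""

    def __init__(self):
        self.mapping = {}

    def _canon(self, name):
        # Same placeholder for every occurrence of the same identifier
        if name not in self.mapping:
            self.mapping[name] = f"id_{len(self.mapping)}"
        return self.mapping[name]

    def visit_Name(self, node):
        node.id = self._canon(node.id)
        return node

    def visit_arg(self, node):
        node.arg = self._canon(node.arg)
        return node

    def visit_FunctionDef(self, node):
        node.name = self._canon(node.name)
        self.generic_visit(node)
        return node

def normalized_dump(source: str) -> str:
    return ast.dump(Normalizer().visit(ast.parse(source)))

a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s\n"
b = "def suma(vals):\n    acc = 0\n    for v in vals:\n        acc += v\n    return acc\n"
print(normalized_dump(a) == normalized_dump(b))  # True: same structure
```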
- Trained model using scikit-learn
- Features include:
- AST similarity scores
- Token similarity scores
- Levenshtein distances
- Function/variable counts
- Comment ratios
- Cyclomatic complexity
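Among these features, the Levenshtein distance has a compact dynamic-programming formulation; the version below is a reference sketch rather than the code in levenshtein.py:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```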
- Advanced tokenization specifically designed for code analysis
- Handles programming language constructs effectively
- Identifies cases where code blocks have been reordered
- Useful for detecting sophisticated plagiarism attempts
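One simple way to make the comparison order-insensitive is to fingerprint each top-level function and compare the resulting sets, so reordering definitions does not hide the match (a simplified illustration; block_permutation_detector.py may use a different strategy):

```python
import ast
import hashlib

def block_fingerprints(source: str) -> set:
    """Hash each top-level function; comparing sets ignores definition order."""
    fingerprints = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            dump = ast.dump(node)
            fingerprints.add(hashlib.sha1(dump.encode()).hexdigest())
    return fingerprints

a = "def f():\n    return 1\n\ndef g():\n    return 2\n"
b = "def g():\n    return 2\n\ndef f():\n    return 1\n"  # same blocks, reordered
print(block_fingerprints(a) == block_fingerprints(b))  # True
```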
- Detailed similarity scores for all algorithm types
- Student identification information
- Confidence levels and recommendations
- Comprehensive analysis results
- Student-friendly format
- Summary of findings
- Guidance for academic integrity
The system includes a comprehensive dataset with:
- 174 sample submissions for testing and validation
- Labeled training data with binary classification (cheating/not cheating)
- Feature-based dataset for machine learning model training
You can modify detection sensitivity by adjusting thresholds in:
- algorithms/similarity_detector.py: text similarity threshold
- algorithms/cheating_detector.py: ML model confidence threshold
To retrain the machine learning model:
- Prepare your labeled dataset in CSV format
- Run algorithms/ML/dump_model.py
- The new model will be saved automatically
- PyQt6 Installation Issues:
pip install --upgrade pip
pip install PyQt6
- File Permission Errors:
- Ensure the selected folder has read permissions
- Check that output directory is writable
- Memory Issues with Large Datasets:
- Process files in smaller batches
- Increase system memory allocation
- For large datasets, consider processing files in batches
- The system performs best with 10-100 files per analysis
- ML model inference is optimized for real-time detection
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
This project is designed for educational purposes and academic integrity enforcement.
This tool is designed to assist educators in maintaining academic integrity. It should be used as part of a comprehensive approach to preventing and detecting plagiarism, not as the sole method of determination.
For issues, questions, or contributions, please refer to the project documentation or contact the development team.
Note: This system is designed for educational environments and should be used responsibly in accordance with institutional policies on academic integrity.
P.S. For complete and structured information about the cheating detection system, see project report.pdf (written in Persian).