minasilva2003/pattern-recognition

Pattern Recognition Project (University)

This repository contains a university-level Pattern Recognition project focused on comparing standard binary classifiers under different feature-selection and feature-transformation strategies.

The code runs an end-to-end experimental pipeline on dataset.csv, evaluates multiple classifiers, and stores all metrics in results.csv.


Project Goals

  • Explore a classical Pattern Recognition workflow on a real dataset.
  • Compare standard classifiers under consistent cross-validation.
  • Study the impact of:
    • Feature ranking/selection (Kruskal-Wallis, ROC-AUC)
    • Feature transformation (Natural, PCA, LDA)
  • Track performance with multiple metrics: Accuracy, Specificity, F1, Sensitivity, and AUC.

Implemented Classifiers

Custom classifiers:

  • Euclidean Minimum Distance Classifier (Euclidean_MDC)
  • Mahalanobis Minimum Distance Classifier (Mahalanobis_MDC)
  • Fisher LDA-based MDC (LDA_Fisher_MDC) (used only when data is not pre-transformed by LDA)
  • Bayesian Gaussian classifier (BayesianGaussianClassifier)

Scikit-learn based classifiers:

  • k-NN (KnnClassifier) with automatic search of best k
  • SVM (SvmClassifier, RBF kernel) with automatic search of best C
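The custom classifiers' source lives under classifiers/ and is not reproduced in this README. As an illustration of the core idea behind the minimum-distance classifiers, here is a minimal sketch (the class name EuclideanMDC and its fit/predict interface are illustrative, not the repository's actual code):

```python
import numpy as np

class EuclideanMDC:
    """Minimum-distance classifier: assign each sample to the class
    whose mean (centroid) is nearest in Euclidean distance."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One centroid per class: mean of that class's training samples
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every sample to every class centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[np.argmin(dists, axis=1)]
```

The Mahalanobis variant follows the same structure but measures distance with the inverse covariance matrix instead of the identity, so correlated or differently scaled features are weighted appropriately.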

How the Project Works

Main script: main.py

Pipeline summary:

  1. Load data

    • Reads dataset.csv
    • Uses label as the target column
  2. Pre-processing

    • Removes categorical columns
    • Removes binary columns (0/1-only features)
  3. Correlation analysis

    • Computes correlation matrix
    • Drops a predefined set of highly correlated features:
      • DomainTitleMatchScore
      • NoOfLettersInURL
      • NoOfDegitsInURL
  4. Feature ranking

    • Ranks features with Kruskal-Wallis
    • Ranks features with single-feature ROC-AUC
  5. Feature selection

    • For each ranking method, removes the worst 20% of features
  6. Feature transformation (for each selected dataset)

    • Natural: keep selected features as-is
    • PCA: transform with number of components chosen via Kaiser criterion
    • LDA: transform using Linear Discriminant Analysis
  7. Model-specific hyperparameter sweeps

    • k-NN: evaluates several odd k values and keeps the best k
    • SVM: evaluates C values on a log scale and keeps the best C
  8. Evaluation

    • For each combination of selection + processing + classifier:
      • Runs repeated 5-fold cross-validation (5 runs)
    • Metrics per run are appended to results.csv
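Steps 4–6 above (ranking, worst-20% removal, and the Kaiser criterion for PCA) can be sketched as follows. This is a simplified illustration, not main.py's actual code: the function names rank_and_select and pca_kaiser are hypothetical, and it assumes numeric feature matrix X with a binary 0/1 label y.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def rank_and_select(X, y, keep_frac=0.8):
    """Rank each feature independently, then keep the top keep_frac
    (i.e. drop the worst 20%) under each ranking method."""
    # Kruskal-Wallis H statistic per feature (higher = better class separation)
    kw = np.array([kruskal(X[y == 0, j], X[y == 1, j]).statistic
                   for j in range(X.shape[1])])
    # Single-feature ROC-AUC, folded around 0.5 so direction doesn't matter
    auc = np.array([abs(roc_auc_score(y, X[:, j]) - 0.5)
                    for j in range(X.shape[1])])
    n_keep = int(np.ceil(keep_frac * X.shape[1]))
    return {
        "kruskal": np.argsort(kw)[::-1][:n_keep],   # indices of kept features
        "roc_auc": np.argsort(auc)[::-1][:n_keep],
    }

def pca_kaiser(X):
    """PCA keeping only components with eigenvalue > 1 (Kaiser criterion);
    the criterion assumes standardized features."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    full = PCA().fit(Xs)
    n = max(1, int(np.sum(full.explained_variance_ > 1.0)))
    return PCA(n_components=n).fit_transform(Xs)
```

The "Natural" branch simply skips the transformation and feeds the selected columns straight to the classifiers.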

Output Files

  • results.csv

    • Main table with columns:
      • Selection Ranking, Processing, Classifier, Run, Accuracy, Specificity, F1-Score, Sensitivity, Auc
  • knn_training/

    • err_<Selection>_<Processing>.csv (k sweep results)
    • img_<Selection>_<Processing>.png (error plot)
  • svm_training/

    • err_<Selection>_<Processing>.csv (C sweep results)
    • img_<Selection>_<Processing>.png (error plot)
  • main.log

    • Text log with pipeline progress and aggregate metrics.
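Since results.csv holds one row per cross-validation run, a quick way to compare configurations is to average the metrics over the 5 runs. A small pandas sketch (the column names are taken from the list above; the summarize helper is illustrative, not part of the repository):

```python
import pandas as pd

def summarize(path="results.csv"):
    """Mean of each metric over the CV runs, per
    (Selection Ranking, Processing, Classifier) combination,
    sorted by AUC so the best configuration comes first."""
    df = pd.read_csv(path)
    metrics = ["Accuracy", "Specificity", "F1-Score", "Sensitivity", "Auc"]
    return (df.groupby(["Selection Ranking", "Processing", "Classifier"])
              [metrics].mean()
              .sort_values("Auc", ascending=False))
```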

Requirements

Recommended: Python 3.10+

Python dependencies used by the project:

  • numpy
  • pandas
  • scipy
  • scikit-learn
  • matplotlib
  • plotly
  • tqdm
  • kaleido (needed for Plotly write_image calls)

Setup and Run

From the project root:

# 1) Create virtual environment
python3 -m venv .venv

# 2) Activate environment
source .venv/bin/activate

# 3) Install dependencies
pip install -U pip
pip install numpy pandas scipy scikit-learn matplotlib plotly tqdm kaleido

# 4) Run the full experiment pipeline
python3 main.py

Notes

  • The project assumes dataset.csv is available in the root folder and contains a binary target column named label.
  • Runtime can be significant due to repeated cross-validation over many dataset/classifier combinations.
  • Some operations are stochastic (data shuffling/splitting), so results can vary slightly between executions.

Repository Structure

.
├── main.py
├── crossvalidation.py
├── pre_process_funcs.py
├── feature_analysis_funcs.py
├── dataset.csv
├── results.csv
├── classifiers/
│   ├── BayesianClass.py
│   ├── euclidean_MDC.py
│   ├── KNN_classifier.py
│   ├── LDA_fisher_MDC.py
│   ├── mahalanobis_MDC.py
│   └── SVM_classifier.py
├── knn_training/
└── svm_training/

About

A university Pattern Recognition project that compares classical and machine-learning classifiers through feature selection, PCA/LDA transformation, hyperparameter tuning, and cross-validated evaluation on a real binary classification dataset.
