minasilva2003/pattern-recognition

Pattern Recognition Project (University)

This repository contains a university-level Pattern Recognition project focused on comparing standard binary classifiers under different feature-selection and feature-transformation strategies.

The code runs an end-to-end experimental pipeline on dataset.csv, evaluates multiple classifiers, and stores all metrics in results.csv.


Project Goals

  • Explore a classical Pattern Recognition workflow on a real dataset.
  • Compare standard classifiers under consistent cross-validation.
  • Study the impact of:
    • Feature ranking/selection (Kruskal-Wallis, ROC-AUC)
    • Feature transformation (Natural, PCA, LDA)
  • Track performance with multiple metrics: Accuracy, Specificity, F1, Sensitivity, and AUC.

Implemented Classifiers

Custom classifiers:

  • Euclidean Minimum Distance Classifier (Euclidean_MDC)
  • Mahalanobis Minimum Distance Classifier (Mahalanobis_MDC)
  • Fisher LDA-based MDC (LDA_Fisher_MDC) (used only when data is not pre-transformed by LDA)
  • Bayesian Gaussian classifier (BayesianGaussianClassifier)

Scikit-learn based classifiers:

  • k-NN (KnnClassifier) with automatic search of best k
  • SVM (SvmClassifier, RBF kernel) with automatic search of best C
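The custom classifiers' source lives under classifiers/ and is not reproduced in this README. As an illustration of the core idea behind the minimum-distance classifiers, here is a minimal sketch (the class name EuclideanMDC and its fit/predict interface are illustrative, not the repository's actual code):

```python
import numpy as np

class EuclideanMDC:
    """Minimum-distance classifier: assign each sample to the class
    whose mean (centroid) is nearest in Euclidean distance."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        # One centroid per class: mean of that class's training samples
        self.means_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict(self, X):
        # Distance from every sample to every class centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        return self.classes_[np.argmin(dists, axis=1)]
```

The Mahalanobis variant follows the same structure but measures distance with the inverse covariance matrix instead of the identity, so correlated or differently scaled features are weighted appropriately.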

How the Project Works

Main script: main.py

Pipeline summary:

  1. Load data

    • Reads dataset.csv
    • Uses label as the target column
  2. Pre-processing

    • Removes categorical columns
    • Removes binary columns (0/1-only features)
  3. Correlation analysis

    • Computes correlation matrix
    • Drops a predefined set of highly correlated features:
      • DomainTitleMatchScore
      • NoOfLettersInURL
      • NoOfDegitsInURL
  4. Feature ranking

    • Ranks features with Kruskal-Wallis
    • Ranks features with single-feature ROC-AUC
  5. Feature selection

    • For each ranking method, removes the worst 20% of features
  6. Feature transformation (for each selected dataset)

    • Natural: keep selected features as-is
    • PCA: transform with number of components chosen via Kaiser criterion
    • LDA: transform using Linear Discriminant Analysis
  7. Model-specific hyperparameter sweeps

    • k-NN: evaluates several odd k values and keeps the best k
    • SVM: evaluates C values on a log scale and keeps the best C
  8. Evaluation

    • For each combination of selection + processing + classifier:
      • Runs repeated 5-fold cross-validation (5 runs)
    • Metrics per run are appended to results.csv
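Steps 4–6 above (ranking, worst-20% removal, and the Kaiser criterion for PCA) can be sketched as follows. This is a simplified illustration, not main.py's actual code: the function names rank_and_select and pca_kaiser are hypothetical, and it assumes numeric feature matrix X with a binary 0/1 label y.

```python
import numpy as np
from scipy.stats import kruskal
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score

def rank_and_select(X, y, keep_frac=0.8):
    """Rank each feature independently, then keep the top keep_frac
    (i.e. drop the worst 20%) under each ranking method."""
    # Kruskal-Wallis H statistic per feature (higher = better class separation)
    kw = np.array([kruskal(X[y == 0, j], X[y == 1, j]).statistic
                   for j in range(X.shape[1])])
    # Single-feature ROC-AUC, folded around 0.5 so direction doesn't matter
    auc = np.array([abs(roc_auc_score(y, X[:, j]) - 0.5)
                    for j in range(X.shape[1])])
    n_keep = int(np.ceil(keep_frac * X.shape[1]))
    return {
        "kruskal": np.argsort(kw)[::-1][:n_keep],   # indices of kept features
        "roc_auc": np.argsort(auc)[::-1][:n_keep],
    }

def pca_kaiser(X):
    """PCA keeping only components with eigenvalue > 1 (Kaiser criterion);
    the criterion assumes standardized features."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    full = PCA().fit(Xs)
    n = max(1, int(np.sum(full.explained_variance_ > 1.0)))
    return PCA(n_components=n).fit_transform(Xs)
```

The "Natural" branch simply skips the transformation and feeds the selected columns straight to the classifiers.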

Output Files

  • results.csv

    • Main table with columns:
      • Selection Ranking, Processing, Classifier, Run, Accuracy, Specificity, F1-Score, Sensitivity, Auc
  • knn_training/

    • err_<Selection>_<Processing>.csv (k sweep results)
    • img_<Selection>_<Processing>.png (error plot)
  • svm_training/

    • err_<Selection>_<Processing>.csv (C sweep results)
    • img_<Selection>_<Processing>.png (error plot)
  • main.log

    • Text log with pipeline progress and aggregate metrics.
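Since results.csv holds one row per cross-validation run, a quick way to compare configurations is to average the metrics over the 5 runs. A small pandas sketch (the column names are taken from the list above; the summarize helper is illustrative, not part of the repository):

```python
import pandas as pd

def summarize(path="results.csv"):
    """Mean of each metric over the CV runs, per
    (Selection Ranking, Processing, Classifier) combination,
    sorted by AUC so the best configuration comes first."""
    df = pd.read_csv(path)
    metrics = ["Accuracy", "Specificity", "F1-Score", "Sensitivity", "Auc"]
    return (df.groupby(["Selection Ranking", "Processing", "Classifier"])
              [metrics].mean()
              .sort_values("Auc", ascending=False))
```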

Requirements

Recommended: Python 3.10+

Python dependencies used by the project:

  • numpy
  • pandas
  • scipy
  • scikit-learn
  • matplotlib
  • plotly
  • tqdm
  • kaleido (needed for Plotly write_image calls)

Setup and Run

From the project root:

# 1) Create virtual environment
python3 -m venv .venv

# 2) Activate environment
source .venv/bin/activate

# 3) Install dependencies
pip install -U pip
pip install numpy pandas scipy scikit-learn matplotlib plotly tqdm kaleido

# 4) Run the full experiment pipeline
python3 main.py

Notes

  • The project assumes dataset.csv is available in the root folder and contains a binary target column named label.
  • Runtime can be significant due to repeated cross-validation over many dataset/classifier combinations.
  • Some operations are stochastic (data shuffling/splitting), so results can vary slightly between executions.

Repository Structure

.
├── main.py
├── crossvalidation.py
├── pre_process_funcs.py
├── feature_analysis_funcs.py
├── dataset.csv
├── results.csv
├── classifiers/
│   ├── BayesianClass.py
│   ├── euclidean_MDC.py
│   ├── KNN_classifier.py
│   ├── LDA_fisher_MDC.py
│   ├── mahalanobis_MDC.py
│   └── SVM_classifier.py
├── knn_training/
└── svm_training/

About

A university Pattern Recognition project that compares classical and machine-learning classifiers through feature selection, PCA/LDA transformation, hyperparameter tuning, and cross-validated evaluation on a real binary classification dataset.
