SmartMatch Engine

SmartMatch Engine automates product matching across large marketplace catalogues. It combines rule-based preprocessing, transformer-based embeddings, and GPT-assisted review to identify equivalent products with high precision while reducing the need for manual curation.

Features

Scalable data ingestion – Normalizes catalogues, offer data, and attribute feeds from heterogeneous sources.
Adaptive category alignment – Automatically reconciles category trees using configurable thresholds and GPT-assisted fallbacks.
Hybrid candidate generation – Mixes deterministic filtering, RapidFuzz scoring, and LaBSE similarity for robust shortlist creation.
Human-in-the-loop ready – Produces audit artefacts (CSV/Parquet) and debug views to integrate with analyst review workflows.

High-Level Workflow

1. Load marketplace catalogues, attribute feeds, and historical offer data.
2. Harmonize taxonomies and enrich attributes using configurable mappings defined in params.json.
3. Generate match candidates through heuristic filtering and string similarity scoring.
4. Apply neural ranking (fine-tuned LaBSE) and optional GPT validation for final decisions.
5. Persist grouped matches and diagnostics for downstream systems.
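
The candidate-generation and ranking steps can be pictured as a two-stage filter: a cheap string-similarity shortlist followed by an embedding-similarity check. The sketch below is illustrative only and is not taken from src/. It uses difflib from the standard library as a stand-in for RapidFuzz, and small hand-written vectors as a stand-in for LaBSE embeddings. The function names and default thresholds are assumptions (the defaults mirror the fuzzy_threshold and labse_threshold values in the sample params.json).

```python
from difflib import SequenceMatcher
import math


def fuzzy_score(a, b):
    """String similarity on a 0-100 scale (stdlib stand-in for a RapidFuzz scorer)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def match(query, candidates, fuzzy_threshold=85, embed_threshold=0.80):
    """Return the titles of candidates that pass both stages.

    query and every candidate are (title, embedding) pairs.
    """
    q_title, q_vec = query
    # Stage 1: keep only candidates whose titles are close as strings.
    shortlist = [c for c in candidates if fuzzy_score(q_title, c[0]) >= fuzzy_threshold]
    # Stage 2: confirm the shortlist with embedding similarity.
    return [title for title, vec in shortlist if cosine(q_vec, vec) >= embed_threshold]
```

Running the cheap string filter first keeps the more expensive embedding comparison limited to a short list, which is the usual motivation for this kind of hybrid design.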

Getting Started

Prerequisites

Python 3.10+
Access to Azure resources (for data uploads and optional GPT checks)

Dataset folder structured as:

<ds_root>/
  input/
    parsed_items-modified.csv
    mp_products_mpn_gtin.parquet
    mp_price_comparision_all_products_chars.parquet
    mp_all_offers_prod.parquet
    category_tree_with_translation.csv
    stop_words.xlsx
  output/
  fine_tuning/
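
Before a first run it can help to confirm that the input folder is complete. The helper below is a hypothetical pre-flight check, not part of the repository. The file names are copied from the layout above, and missing_inputs is an illustrative name.

```python
from pathlib import Path

# File names copied from the layout above; the check itself is illustrative.
EXPECTED_INPUTS = [
    "parsed_items-modified.csv",
    "mp_products_mpn_gtin.parquet",
    "mp_price_comparision_all_products_chars.parquet",
    "mp_all_offers_prod.parquet",
    "category_tree_with_translation.csv",
    "stop_words.xlsx",
]


def missing_inputs(ds_root):
    """Return the expected input files that are absent under <ds_root>/input."""
    input_dir = Path(ds_root) / "input"
    return [name for name in EXPECTED_INPUTS if not (input_dir / name).is_file()]
```

An empty list means every expected file is present. Anything else names the files still to be copied into <ds_root>/input/.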

Environment Setup

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
az login
python -m build

Configuration

params.json controls thresholds and taxonomy mappings. A starter template is provided at the project root. Copy it to <ds_root>/input/params.json and customize the values for your catalogue:

{
  "default_threshold": 0.72,
  "labse_threshold": 0.80,
  "fuzzy_threshold": 85,
  "debug_cats": [],
  "fuzzy_on_off": true,
  "n_cash": 100,
  "category": {
    "3": 0.82,
    "10": 0.75,
    "259": 0.78,
    "260": 0.78
  },
  "chars_mapping": {
    "ram": "ram",
    "rom": "rom",
    "volume": "volume",
    "colour": "color",
    "screen_size": "screen_size"
  }
}

Adjust category thresholds, add attribute mappings, and set debug_cats to target specific category IDs during experimentation.
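
As a rough illustration of how these settings might be consumed, the sketch below loads the file and resolves a threshold for a category. It assumes that an entry under "category" overrides default_threshold for that category ID, and that chars_mapping maps a source attribute name to its canonical name. Both readings are assumptions drawn from the sample file, and the function names are hypothetical rather than part of src/.

```python
import json


def load_params(path):
    """Read params.json into a dict."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def threshold_for(params, category_id):
    """Hypothetical lookup: per-category value if present, else default_threshold."""
    # JSON object keys are strings, so normalise integer IDs before the lookup.
    return params.get("category", {}).get(str(category_id), params["default_threshold"])


def map_attribute(params, raw_name):
    """Hypothetical lookup: canonical attribute name from chars_mapping, or None."""
    return params.get("chars_mapping", {}).get(raw_name)
```

With the sample values above, threshold_for(params, 3) would give 0.82, while a category with no entry falls back to 0.72.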

Run the Pipeline

python src/main.py --root <ds_root>

This command orchestrates preprocessing, candidate generation, RapidFuzz filtering, and either GPT or LaBSE confirmation. Outputs are written to <ds_root>/output/ and mirrored to the configured Azure datastore for auditing.
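
For orientation, an entry point taking this flag could be wired up roughly as follows. This is a minimal sketch and not the contents of src/main.py. Only --root is taken from the command above, and the helper names are assumptions.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the --root option; only this flag appears in the command above."""
    parser = argparse.ArgumentParser(description="SmartMatch pipeline (illustrative CLI sketch)")
    parser.add_argument("--root", required=True, type=Path, help="dataset root (<ds_root>)")
    return parser.parse_args(argv)


def resolve_dirs(root):
    """Derive the input and output folders from the dataset root."""
    return root / "input", root / "output"
```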

Project Structure

.
├── README.md
├── params.json                # Sample configuration
├── src/
│   ├── main.py                # Orchestrates the end-to-end pipeline
│   ├── script.py              # Candidate generation, scoring, and grouping utilities
│   ├── parsed_prep.py         # Category harmonization and attribute enrichment
│   └── gpt.py                 # GPT-assisted validation helpers
├── fine_tuning_labse/         # Experiments and assets for LaBSE fine-tuning
├── requirements.txt           # Python dependencies
├── conda_dependencies.yml     # Alternative environment specification
└── utils.py                   # Shared Azure storage helpers

Career Materials

Looking to present this project in professional channels? See docs/professional_snippets.txt for ready-to-use resume and LinkedIn project descriptions.

Contributing

1. Fork the repository and create a feature branch.
2. Ensure new code is linted and covered by tests where applicable.
3. Submit a pull request detailing the problem solved and validation performed.

License

This project is provided without a specific license. Contact the maintainers to discuss usage rights for commercial projects.
