🛒 Amazon ML Challenge 2025 – Smart Product Pricing

📌 Overview

E-commerce pricing depends on a complex mix of textual attributes (brand, pack size, specifications) and visual cues (packaging, perceived quality). This project builds a multimodal machine learning pipeline to predict product prices using both text and image data.

Objective: Predict continuous product prices.
Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)

Dataset Link: https://www.kaggle.com/datasets/infiniper/amazon-ml-challenge-2025-dataset

Repository Link: https://github.com/Infiniper/Amazon-ML-Challenge-2025.git

📊 Dataset

Split	Size	Description
Train	75,000	Product details + images + price
Test	75,000	Product details (no price)

Data Fields

sample_id – Unique product identifier
catalog_content – Title + description + item pack quantity
image_link – Public URL of product image
price – Continuous float value (target variable, train only)
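Before running the pipeline, it can help to confirm that a downloaded CSV carries the fields listed above. The following is a minimal sketch using only the standard library; the helper name `missing_columns` and the in-memory sample row are hypothetical and not part of the repository.

```python
import csv
import io

# Column names taken from the Data Fields list above.
EXPECTED_TRAIN_COLUMNS = ["sample_id", "catalog_content", "image_link", "price"]

def missing_columns(csv_file, expected=EXPECTED_TRAIN_COLUMNS):
    """Return the expected column names that are absent from the CSV header."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in expected if name not in header]

# Tiny in-memory stand-in for dataset/train.csv (the row values are made up).
sample = io.StringIO(
    "sample_id,catalog_content,image_link,price\n"
    "1,Example item 500 ml,https://example.com/a.jpg,4.99\n"
)
print(missing_columns(sample))  # prints []
```

For the test split, the same check would be run with `price` left out of the expected list, since that file carries no target.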

🧠 Multimodal Architecture

1️⃣ Raw Inputs

Text: catalog_content
Image: image_link
Target: price

2️⃣ Feature Engineering

Unstructured data is converted into numerical embeddings:

Regex Numeric Extraction → word count, char length, extracted numeric values & units
S-BERT (Text Embeddings) → 384-dimensional semantic vector
CLIP (Image Embeddings) → 512-dimensional visual feature vector
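The regex step above could look roughly like the sketch below. The unit vocabulary, the feature names, and the function `extract_text_features` are assumptions made for illustration; the README does not list the actual patterns used in the notebooks.

```python
import re

# Hypothetical unit vocabulary; the real pipeline's unit list is not given here.
UNIT_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kg|g|ml|l|oz|lb|count|pack)\b",
    re.IGNORECASE,
)

def extract_text_features(text: str) -> dict:
    """Return simple numeric features from a catalog_content string."""
    matches = UNIT_PATTERN.findall(text)  # list of (number, unit) pairs
    values = [float(number) for number, _ in matches]
    return {
        "word_count": len(text.split()),
        "char_length": len(text),
        "num_quantities": len(values),
        "max_quantity": max(values) if values else 0.0,
        "units": sorted({unit.lower() for _, unit in matches}),
    }

# Made-up catalog text for illustration.
features = extract_text_features("Organic Green Tea, 250 g, Pack of 2")
print(features["max_quantity"], features["units"])  # prints 250.0 ['g']
```

Note that "Pack of 2" is not captured by this pattern because the number follows the unit word; a fuller extractor would need a second pattern for that phrasing.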

3️⃣ Feature Stacking

All features are concatenated into a single matrix of roughly 900 features per product (384 text dimensions + 512 image dimensions, plus the handcrafted regex features).
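As a sketch of what stacking means for a single product, the snippet below concatenates three feature groups into one row. The function name `stack_features` and the three handcrafted values are hypothetical; a real pipeline would more likely call `numpy.hstack` on whole matrices rather than build Python lists.

```python
TEXT_DIM = 384   # S-BERT embedding size stated above
IMAGE_DIM = 512  # CLIP embedding size stated above

def stack_features(handcrafted, text_emb, image_emb):
    """Concatenate one product's feature groups into a single flat row."""
    if len(text_emb) != TEXT_DIM or len(image_emb) != IMAGE_DIM:
        raise ValueError("unexpected embedding size")
    return list(handcrafted) + list(text_emb) + list(image_emb)

# Placeholder values: three handcrafted features and zero-filled embeddings.
row = stack_features([8.0, 35.0, 250.0], [0.0] * TEXT_DIM, [0.0] * IMAGE_DIM)
print(len(row))  # prints 899
```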

4️⃣ Regression Model

A LightGBM Regressor learns non-linear relationships between stacked features and price.

⚙️ Training Strategy

🔹 Log Target Transformation

To handle skewed price distribution:

log_price = ln(1 + price)

Predictions are inverse-transformed after inference.
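A minimal sketch of the transform and its inverse, using the standard library. The function names are hypothetical, and clamping negative predictions to zero is an added assumption rather than something the README states.

```python
import math

def to_log_price(price: float) -> float:
    """Forward transform: log_price = ln(1 + price)."""
    return math.log1p(price)

def from_log_price(log_price: float) -> float:
    """Inverse transform: price = exp(log_price) - 1, clamped at zero."""
    # Clamping is an assumption: a model can output a log-price below 0,
    # which would otherwise map to a negative price.
    return max(math.expm1(log_price), 0.0)

print(to_log_price(0.0))  # prints 0.0
```

`math.log1p` and `math.expm1` are used instead of `math.log(1 + x)` and `math.exp(x) - 1` because they stay accurate for values close to zero.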

🔹 Stratified K-Fold (for Regression)

Log-prices are binned into quantiles to maintain balanced price distribution across folds.
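The idea can be sketched without any dependencies: rank samples by log-price, cut the ranking into equal-sized bins, and deal each bin's members across folds. This is an illustrative stand-in, not the notebook's code; a common alternative is `pandas.qcut` followed by scikit-learn's `StratifiedKFold`. The function name and the defaults of 5 folds and 10 bins are assumptions.

```python
import math

def stratified_fold_ids(prices, n_folds=5, n_bins=10):
    """Assign each sample a fold id so every fold spans all price bins."""
    log_prices = [math.log1p(p) for p in prices]
    # Rank samples by log-price; equal-sized rank slices act as quantile bins.
    order = sorted(range(len(prices)), key=lambda i: log_prices[i])
    fold_ids = [0] * len(prices)
    bin_counts = [0] * n_bins
    for rank, idx in enumerate(order):
        bin_id = min(rank * n_bins // len(prices), n_bins - 1)
        # Deal the members of each bin round-robin across the folds.
        fold_ids[idx] = bin_counts[bin_id] % n_folds
        bin_counts[bin_id] += 1
    return fold_ids

# Made-up prices 1.0 .. 100.0 for illustration.
folds = stratified_fold_ids([float(i) for i in range(1, 101)])
print([folds.count(k) for k in range(5)])  # prints [20, 20, 20, 20, 20]
```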

🔹 Custom SMAPE Metric

SMAPE is passed to LightGBM as a custom evaluation metric, so validation scores during training track the competition metric.
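One way to write such a metric is as a function in the `(preds, eval_data)` form that `lightgbm.train` accepts through its `feval` argument, returning a `(name, value, is_higher_better)` tuple. The sketch below is an assumption about how this could be done, not the notebook's actual code. It assumes the model was trained on ln(1 + price), so labels and predictions are mapped back to prices before scoring, and it treats a zero denominator as zero error.

```python
import math

def smape_feval(preds, eval_data):
    """SMAPE on the original price scale, in LightGBM's feval return format."""
    # Assumption: labels and predictions are both ln(1 + price).
    y_true = [math.expm1(v) for v in eval_data.get_label()]
    y_pred = [max(math.expm1(v), 0.0) for v in preds]
    total = 0.0
    for actual, predicted in zip(y_true, y_pred):
        denom = (abs(actual) + abs(predicted)) / 2
        # Convention assumed here: a 0/0 term counts as zero error.
        total += abs(actual - predicted) / denom if denom > 0 else 0.0
    # False tells LightGBM that lower values are better.
    return "smape", 100.0 * total / len(y_true), False
```

It would then be supplied as `feval=smape_feval` alongside the validation sets when calling `lightgbm.train`. The pure-Python loop is for clarity; a vectorized NumPy version would be faster on 75,000 rows.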

🔹 Ensemble Bagging

Multiple models are trained with different random seeds, and their predictions are averaged to reduce variance.
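The averaging step can be sketched as follows. The function name and the numbers are hypothetical; the README does not say how many seeds were used or whether averaging happens on log-prices or on prices.

```python
def average_predictions(per_seed_predictions):
    """Average element-wise across the prediction lists of several models."""
    n_models = len(per_seed_predictions)
    if n_models == 0:
        raise ValueError("need at least one model's predictions")
    # zip(*...) groups the predictions that belong to the same sample.
    return [sum(column) / n_models for column in zip(*per_seed_predictions)]

# Made-up outputs of three models (one per seed) on two samples.
preds_by_seed = [[10.0, 20.0], [12.0, 18.0], [11.0, 22.0]]
print(average_predictions(preds_by_seed))  # prints [11.0, 20.0]
```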

💡 Design Decisions

Why Multimodal? Text misses visual cues; images miss quantity and scale. Combining the two reduces these blind spots.
Why S-BERT & CLIP? Transfer learning provides rich features without heavy training cost.
Why LightGBM? Gradient-boosted trees are a strong, fast baseline on fixed-length tabular features such as precomputed embeddings, and often match or beat deep networks in that setting.

📈 Evaluation Metric

SMAPE = (100% / n) * Σ_i |Actual_i - Predicted_i| / ((|Actual_i| + |Predicted_i|) / 2)

Range: 0% – 200%
Lower is better
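A small worked example makes the range concrete. This sketch is independent of the repository's calculate_smape.py, whose contents are not shown here; treating a 0/0 term as zero error is an assumption.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent."""
    terms = []
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Assumed convention: when both values are 0, the term is 0.
        terms.append(abs(a - p) / denom if denom else 0.0)
    return 100.0 * sum(terms) / len(terms)

print(smape([100.0], [100.0]))           # prints 0.0
print(smape([100.0], [0.0]))             # prints 200.0
print(round(smape([100.0], [50.0]), 2))  # prints 66.67
```

The last case shows the arithmetic: |100 - 50| / ((100 + 50) / 2) = 50 / 75, which is about 66.67%.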

📂 Repository Structure

├── dataset/                     # Sample CSVs and dataset info
├── src/
│   ├── utils.py                 # Image downloading utilities
│   └── example.ipynb            # Starter notebook
├── Other_Notebooks/
│   ├── light-gbm/               # LightGBM pipeline & feature extraction
│   └── SVM/                     # Alternative approach
├── Results on test data/        # Prediction outputs
├── main_notebook.ipynb          # Final multimodal pipeline
├── download_dataset.py          # Dataset download script
└── calculate_smape.py           # Local SMAPE evaluation

🚀 Getting Started

1️⃣ Clone the Repository

git clone https://github.com/Infiniper/Amazon-ML-Challenge-2025.git
cd Amazon-ML-Challenge-2025

2️⃣ Download Dataset

Download the dataset from Kaggle and place the files at:

dataset/train.csv
dataset/test.csv

3️⃣ Run Pipeline

Open:

main_notebook.ipynb

This notebook covers the pipeline end to end:

Data preprocessing
Text & image embedding generation
Feature stacking
LightGBM training
SMAPE evaluation

📌 Notes

No external price lookup was used.
Model size constraints follow competition guidelines.
Designed for reproducibility and modular experimentation.
