E-commerce pricing depends on a complex mix of textual attributes (brand, pack size, specifications) and visual cues (packaging, perceived quality). This project builds a multimodal machine learning pipeline to predict product prices using both text and image data.
- Objective: Predict continuous product prices.
- Evaluation Metric: SMAPE (Symmetric Mean Absolute Percentage Error)
Dataset Link: https://www.kaggle.com/datasets/infiniper/amazon-ml-challenge-2025-dataset
Repository Link: https://github.com/Infiniper/Amazon-ML-Challenge-2025.git
| Split | Size | Description |
|---|---|---|
| Train | 75,000 | Product details + images + price |
| Test | 75,000 | Product details (no price) |
- sample_id – Unique product identifier
- catalog_content – Title + description + item pack quantity
- image_link – Public URL of product image
- price – Continuous float value (target variable, train only)
- Text: catalog_content
- Image: image_link
- Target: price
Unstructured data is converted into numerical embeddings:
- Regex Numeric Extraction → word count, char length, extracted numeric values & units
- S-BERT (Text Embeddings) → 384-dimensional semantic vector
- CLIP (Image Embeddings) → 512-dimensional visual feature vector
All features are concatenated into a unified matrix (~900 features per product).
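
As a rough illustration of the handcrafted features and the stacking step, the sketch below extracts a few regex-based values and concatenates them with placeholder embeddings. The unit list, the function names `extract_text_features` and `stack_features`, and the zero vectors standing in for the S-BERT and CLIP outputs are all assumptions made for this example, not the notebook's actual code.

```python
import re

# Hypothetical unit list; the units the original pipeline extracts are not specified.
QUANTITY_PATTERN = re.compile(
    r"(\d+(?:\.\d+)?)\s*(kg|g|mg|lb|oz|ml|l)\b", re.IGNORECASE
)


def extract_text_features(catalog_content):
    """Return simple handcrafted features for one product description."""
    matches = QUANTITY_PATTERN.findall(catalog_content)
    values = [float(number) for number, _ in matches]
    return {
        "word_count": len(catalog_content.split()),
        "char_length": len(catalog_content),
        "first_value": values[0] if values else 0.0,
        "first_unit": matches[0][1].lower() if matches else "",
    }


def stack_features(handcrafted, text_embedding, image_embedding):
    """Concatenate numeric handcrafted features with both embeddings into one row."""
    numeric = [float(v) for v in handcrafted.values() if isinstance(v, (int, float))]
    return numeric + list(text_embedding) + list(image_embedding)


features = extract_text_features("Organic Green Tea, 12 oz, Pack of 6")
# Zero vectors stand in for the real S-BERT (384-d) and CLIP (512-d) embeddings.
row = stack_features(features, [0.0] * 384, [0.0] * 512)
print(features["first_value"], features["first_unit"], len(row))
```

With real data, the rows would typically be stacked into a single NumPy array (for example with `numpy.hstack`) before being passed to the regressor.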
A LightGBM Regressor learns non-linear relationships between stacked features and price.
To handle the skewed price distribution, the model is trained on a log-transformed target:
log_price = ln(1 + price)
After inference, predictions are mapped back to the price scale with price = exp(log_price) - 1.
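
A minimal sketch of the transform and its inverse, using only the standard library. The helper names are illustrative, and clipping negative outputs to zero is an assumption added here, not something the pipeline is stated to do.

```python
import math


def to_log_price(price):
    """Forward transform: log_price = ln(1 + price)."""
    return math.log1p(price)


def from_log_price(log_price):
    """Inverse transform: price = exp(log_price) - 1.

    Assumption: negative results are clipped to 0, since prices cannot be negative.
    """
    return max(math.expm1(log_price), 0.0)


prices = [4.99, 19.5, 250.0]
log_prices = [to_log_price(p) for p in prices]
recovered = [from_log_price(lp) for lp in log_prices]
```

`log1p` and `expm1` are used instead of `log(1 + x)` and `exp(x) - 1` because they stay accurate for values close to zero.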
Log-prices are binned into quantiles, and folds are stratified on those bins so that each fold sees a similar price distribution.
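
The idea can be sketched in plain Python as below. The bin count, fold count, and helper names are assumptions chosen for illustration. In practice the same result is commonly obtained with `pandas.qcut` and scikit-learn's `StratifiedKFold`; whether the notebook uses them is not stated here.

```python
import math
import random
from collections import defaultdict


def quantile_bins(values, n_bins):
    """Assign each value a bin index in 0..n_bins-1 by rank, so bins are near-equal in size."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def stratified_folds(bins, n_folds, seed=0):
    """Assign a fold id to every sample, spreading each bin evenly across the folds."""
    rng = random.Random(seed)
    by_bin = defaultdict(list)
    for idx, b in enumerate(bins):
        by_bin[b].append(idx)
    folds = [0] * len(bins)
    for members in by_bin.values():
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[idx] = position % n_folds
    return folds


# Toy prices 1..100 stand in for the real training targets.
log_prices = [math.log1p(float(p)) for p in range(1, 101)]
bins = quantile_bins(log_prices, n_bins=10)
folds = stratified_folds(bins, n_folds=5)
```

Each of the 5 folds here receives the same number of samples from every price bin, which keeps validation scores comparable between folds.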
SMAPE is implemented as a custom evaluation metric inside LightGBM, so validation scores are measured with the same metric as the final evaluation.
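
One way such a metric could look is sketched below. It follows the `(name, value, is_higher_better)` return convention that LightGBM expects from a custom `feval`. The function name `smape_feval`, the assumption that labels and predictions are in `log1p` space, and the treatment of zero denominators are choices made for this example.

```python
import numpy as np


def smape_feval(preds, eval_data):
    """Custom LightGBM metric returning (name, value, is_higher_better)."""
    # Assumption: the model is trained on log1p(price), so map both back to prices.
    y_true = np.expm1(np.asarray(eval_data.get_label(), dtype=float))
    y_pred = np.expm1(np.asarray(preds, dtype=float))
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2.0
    # Assumption: a term counts as 0 when both values are 0, avoiding division by zero.
    ratio = np.divide(
        np.abs(y_true - y_pred), denom, out=np.zeros_like(denom), where=denom > 0
    )
    return "smape", float(np.mean(ratio) * 100.0), False
```

A function of this shape can be passed to training as `lgb.train(params, train_set, valid_sets=[valid_set], feval=smape_feval)`, where `params`, `train_set`, and `valid_set` are placeholders for the caller's own objects.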
Multiple models are trained with different random seeds, and their predictions are averaged to reduce variance.
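
Seed averaging can be sketched as follows. `train_and_predict` is a hypothetical stand-in that returns noisy constant values instead of fitting a real model; in the actual pipeline it would train one regressor per seed (for example by varying LightGBM's `seed` parameter) and predict on the test set. Averaging in log-price space before the inverse transform is also an assumption.

```python
import random
from statistics import fmean


def train_and_predict(seed, n_samples):
    """Hypothetical stand-in for training one model and predicting log-prices."""
    rng = random.Random(seed)
    # Placeholder: a fixed signal plus seed-dependent noise instead of a real model.
    return [3.0 + rng.gauss(0.0, 0.1) for _ in range(n_samples)]


def seed_average(seeds, n_samples):
    """Average predictions sample by sample across all seeds."""
    per_seed = [train_and_predict(seed, n_samples) for seed in seeds]
    return [fmean(column) for column in zip(*per_seed)]


averaged = seed_average(seeds=[0, 1, 2, 3, 4], n_samples=4)
```

Because each seed contributes independent noise, the averaged predictions fluctuate less than those of any single run.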
- Why Multimodal? Text misses visual cues; images miss quantity/scale. Fusion removes blind spots.
- Why S-BERT & CLIP? Transfer learning provides rich features without heavy training cost.
- Why LightGBM? Gradient-boosted trees are fast to train and often match or outperform deep networks on tabular inputs such as stacked embedding features.
SMAPE = (100% / n) * Σ ( |Actual - Predicted| / ((|Actual| + |Predicted|) / 2) )
- Range: 0% – 200%
- Lower is better
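
For a quick local check, the formula can be written directly in Python. This is a minimal sketch and not necessarily what `calculate_smape.py` contains; treating a term as 0 when both values are 0 is an assumption.

```python
def smape(actual, predicted):
    """Symmetric Mean Absolute Percentage Error, in percent (0 to 200)."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("at least one value is required")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2.0
        # Assumption: skip the term (count it as 0) when both values are 0.
        if denom > 0:
            total += abs(a - p) / denom
    return 100.0 * total / len(actual)


score = smape([100.0, 50.0], [110.0, 40.0])  # about 15.87
worst = smape([10.0], [0.0])  # 200.0, the upper bound of the range
```

The second call shows how the 200% upper bound arises: when a prediction is 0 and the actual value is not, the term equals |Actual| / (|Actual| / 2) = 2.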
├── dataset/ # Sample CSVs and dataset info
├── src/
│ ├── utils.py # Image downloading utilities
│ └── example.ipynb # Starter notebook
├── Other_Notebooks/
│ ├── light-gbm/ # LightGBM pipeline & feature extraction
│ └── SVM/ # Alternative approach
├── Results on test data/ # Prediction outputs
├── main_notebook.ipynb # Final multimodal pipeline
├── download_dataset.py # Dataset download script
└── calculate_smape.py # Local SMAPE evaluation
git clone https://github.com/Infiniper/Amazon-ML-Challenge-2025.git
cd Amazon-ML-Challenge-2025
Download the dataset from Kaggle and place the files at:
dataset/train.csv
dataset/test.csv
Open:
main_notebook.ipynb
This notebook covers the pipeline end to end:
- Data preprocessing
- Text & image embedding generation
- Feature stacking
- LightGBM training
- SMAPE evaluation
- No external price lookup was used.
- Model size constraints follow competition guidelines.
- Designed for reproducibility and modular experimentation.