Amazon_ML_Challenge_2025

Our team, "Optimizers", attempted the Amazon ML Challenge 2025 in this repository and ranked 985th out of 7,100 teams.

Model Performance Summary

Version	Approach	Model Architecture	Key Features	SMAPE (%)
V1	DistilBERT Baseline	DistilBERT-base-uncased (66M params)	Text-only NLP regression, regex feature extraction, 5 epochs training	51.02
V2	DistilBERT Optimization	DistilBERT-base-uncased	V1 + hyperparameter tuning, improved preprocessing	49.61
V3	Vision Transformer (Failed)	google/vit-base-patch16-224	Image-only price prediction experiment	190.00
V4	DistilBERT Extended Training	DistilBERT-base-uncased	V2 + 3 additional epochs from checkpoint, incremental improvements	51.429
V5	DistilBERT Refinement	DistilBERT-base-uncased	Further training iterations, new data samples	50.598
V6	Feature Engineering + LightGBM	LightGBM with 111 features	Hand-crafted features + TF-IDF + DistilBERT embeddings, unit normalization (46→11 variants)	61.095
V7	Phi-3.5 Mini Instruction-tuned LLM	microsoft/Phi-3.5-mini-instruct (3.8B params)	Regression-focused instruction-tuned model, H100 GPU, 2 epochs	56.82
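The table mentions regex feature extraction (V1) and unit normalization (V6) without showing either. As a rough illustration only, the sketch below pulls a quantity and unit out of a text string and maps the unit to a canonical name. The `UNIT_ALIASES` table, the `extract_quantity` helper, and the sample strings are hypothetical; the repository's actual 46-to-11 mapping is not reproduced here.

```python
import re

# Hypothetical alias table: raw unit spellings -> canonical unit.
# This is an illustration, not the mapping used in the notebooks.
UNIT_ALIASES = {
    "oz": "ounce", "ounce": "ounce", "ounces": "ounce",
    "fl oz": "fluid_ounce", "fluid ounce": "fluid_ounce",
    "fluid ounces": "fluid_ounce",
    "lb": "pound", "lbs": "pound", "pound": "pound", "pounds": "pound",
    "ct": "count", "count": "count", "pack": "count", "pcs": "count",
}

# Try longer aliases first so "fl oz" wins over "oz" and "lbs" over "lb".
_UNIT_PATTERN = "|".join(
    re.escape(alias) for alias in sorted(UNIT_ALIASES, key=len, reverse=True)
)
_QUANTITY_RE = re.compile(
    rf"(\d+(?:\.\d+)?)\s*({_UNIT_PATTERN})\b", re.IGNORECASE
)


def extract_quantity(text):
    """Return (value, canonical_unit) for the first quantity found, else None."""
    match = _QUANTITY_RE.search(text)
    if match is None:
        return None
    value = float(match.group(1))
    unit = UNIT_ALIASES[match.group(2).lower()]
    return value, unit


if __name__ == "__main__":
    # Made-up strings, not rows from the challenge dataset.
    print(extract_quantity("Organic Tea, 12 Fl Oz bottle"))
    print(extract_quantity("Trail mix 2.5 lbs resealable bag"))
```

Features like these could then be fed to a tabular model such as the LightGBM setup in V6, alongside text embeddings.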

🧠 ML Challenge 2025 – Smart Product Pricing Challenge

📋 Problem Statement

In e-commerce, determining the optimal price point for products is crucial for both marketplace success and customer satisfaction.

Your challenge is to develop an ML solution that analyzes product details and predicts the price of a product. The relationship between product attributes and pricing is complex — factors like brand, specifications, and quantity directly influence pricing.

Your task is to build a model that can holistically analyze product details and suggest an optimal price.

🗂️ Data Description

The dataset consists of the following columns:

Column	Description
`sample_id`	A unique identifier for each product sample
`catalog_content`	Text field containing product title, description, and Item Pack Quantity (IPQ) concatenated together
`image_link`	Public URL of the product image. Example: https://m.media-amazon.com/images/I/71XfHPR36-L.jpg
`price`	Target variable — the product price (available only in training data)

Dataset Details

Training Dataset: 75,000 products with complete details and prices
Test Dataset: 75,000 products (without prices, for evaluation)

🧾 Output Format

The output must be a CSV file with exactly two columns:

sample_id	price
12345	249.99
67890	109.00

Notes:

The sample_id values must exactly match the ones in the test set.
The file should have the same number of rows as the test data.
Predicted prices must be positive float values.
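The notes above can be checked at write time. The following is a minimal sketch using only the standard library; the `write_submission` helper is a hypothetical name, and rounding to two decimals is an assumption based on the sample rows, not a stated rule.

```python
import csv
import io
import math


def write_submission(test_ids, predictions, out_file):
    """Write sample_id/price rows, enforcing the format notes above."""
    if len(test_ids) != len(predictions):
        raise ValueError("row count must match the test set")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["sample_id", "price"])
    for sample_id, price in zip(test_ids, predictions):
        price = float(price)
        # Check the rounded value so that tiny prices do not become 0.00.
        if not math.isfinite(price) or round(price, 2) <= 0:
            raise ValueError(f"invalid price for sample_id {sample_id}")
        writer.writerow([sample_id, f"{price:.2f}"])


if __name__ == "__main__":
    buffer = io.StringIO()  # stands in for an open output file
    write_submission([12345, 67890], [249.99, 109.0], buffer)
    print(buffer.getvalue())
```

Passing the test set's `sample_id` column as `test_ids` keeps the identifiers and row count aligned with the test data by construction.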

🧱 File Descriptions

📊 Dataset Files

dataset/train.csv — Training data with price labels.
dataset/test.csv — Test data without price labels.
dataset/sample_test.csv — Sample input file for testing.
dataset/sample_test_out.csv — Example of correctly formatted output (note: predictions are placeholders).

⚙️ Constraints

The output format must match the sample_test_out.csv file exactly.
Predicted prices must be positive floats.
The final model must have fewer than 8 billion parameters.
The model must be released under an MIT or Apache 2.0 license.

🧮 Evaluation Criteria

Submissions are evaluated using Symmetric Mean Absolute Percentage Error (SMAPE).

\[ \text{SMAPE} = \frac{1}{n} \sum_{i=1}^{n} \frac{|P_{pred,i} - P_{actual,i}|}{(|P_{pred,i}| + |P_{actual,i}|)/2} \times 100\% \]

Example: If the actual price is 100 and the predicted price is 120, then
\[ \text{SMAPE} = \frac{|120 - 100|}{(|120| + |100|)/2} \times 100\% = 18.18\% \]

SMAPE is bounded between 0% and 200%
Lower values indicate better performance
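For local validation, the metric can be computed with a few lines of plain Python. This is a sketch of the definition above, not the organizers' scoring script; the `smape` function name is illustrative, and treating a pair where both values are zero as zero error is an assumption.

```python
def smape(actual, predicted):
    """Symmetric mean absolute percentage error, in percent (0 to 200)."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for a, p in zip(actual, predicted):
        denom = (abs(a) + abs(p)) / 2
        # Assumption: a 0/0 pair counts as a perfect prediction.
        total += 0.0 if denom == 0 else abs(p - a) / denom
    return 100.0 * total / len(actual)


if __name__ == "__main__":
    print(round(smape([100], [120]), 2))  # 18.18, matching the example above
```

Because the error is relative, missing a 10-unit product by 5 costs far more than missing a 1,000-unit product by 5.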

🏆 Leaderboard Details

Public Leaderboard: Based on 25K samples from the test set for real-time feedback.
Final Rankings: Based on the full 75K test set and documentation quality.

⚠️ Academic Integrity & Fair Play

STRICTLY PROHIBITED: using any external price lookup method, such as:

Web scraping product prices
Using APIs to fetch market prices
Manual lookup from websites
Using any external pricing datasets

Violations will result in immediate disqualification.

This challenge is meant to test your data science and ML problem-solving skills using only the provided data.

💡 Tips for Success

Use both textual (catalog_content) and visual (image_link) features.
Explore feature engineering for both text and images.
Consider ensemble methods combining multiple models.
Handle outliers carefully and preprocess the data well.
Ensure predictions are realistic and positive.
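Two of the tips, ensembling and keeping predictions positive, can be combined in one step. The sketch below blends per-model predictions with a weighted geometric mean and floors the result at a small positive value. Both the `blend_predictions` helper and the choice of a geometric mean (which suits a relative-error metric) are assumptions for illustration; the submissions in this repository are not known to use it.

```python
import math


def blend_predictions(model_outputs, weights=None, min_price=0.01):
    """Weighted geometric mean across models, floored at min_price.

    model_outputs: one list of predicted prices per model, in the same
    sample order for every model.
    """
    n_models = len(model_outputs)
    if n_models == 0:
        raise ValueError("need at least one model")
    if weights is None:
        weights = [1.0] * n_models
    if len(weights) != n_models:
        raise ValueError("one weight per model is required")
    total_weight = sum(weights)
    blended = []
    for per_sample in zip(*model_outputs):
        # Floor each input so the logarithm is defined for bad predictions.
        log_mean = sum(
            w * math.log(max(p, min_price))
            for w, p in zip(weights, per_sample)
        ) / total_weight
        blended.append(max(math.exp(log_mean), min_price))
    return blended


if __name__ == "__main__":
    text_model = [10.0, 100.0]    # made-up predictions
    tabular_model = [40.0, 100.0]
    print(blend_predictions([text_model, tabular_model]))
```

Weights could be tuned on a held-out split, for example in proportion to each model's validation SMAPE.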

