Banshal Yadav
Dilip Kumar R
HR Prasith
Submission Date: 13/10/2025
This document outlines a multimodal solution for the Smart Product Pricing Challenge. Our approach integrates comprehensive feature engineering from both textual and visual data, utilizing a pre-trained Vision Transformer for image embeddings and regex-based parsing for text. A Gradient Boosting Regressor was selected as the final model, achieving a validation SMAPE of 64.438%.
The pipeline is structured around comprehensive feature engineering from both modalities, followed by training and evaluation of several regression models. The primary source of signal is assumed to be in the structured and unstructured data within the provided inputs.
Key Observations:
- The `catalog_content` field is a composite of structured tags (e.g., `Value:`, `Unit:`) and unstructured descriptions, requiring careful parsing.
- Product images provide non-textual cues such as brand recognition, product quality, and category, which can be captured via deep learning embeddings.
- A significant portion of price variance is likely driven by quantitative features like item pack quantity and weight/volume.
Our high-level approach involves creating a rich, flat feature set by combining handcrafted text features with deep visual features, and then applying a robust tree-based regression model.
Approach Type: Single Model (Gradient Boosting) on a Multimodal Feature Set.
Core Innovation: The fusion of 768-dimension visual embeddings from a Vision Transformer with meticulously parsed textual metadata to create a comprehensive feature matrix for a classical machine learning model.
The architecture is a sequential feature-engineering pipeline that feeds into a single regression model. Raw text and image URLs are processed in parallel to generate feature vectors, which are then concatenated. This final matrix is used for training the Gradient Boosting model.
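The fusion step described above can be sketched as follows. This is a minimal illustration, assuming the handcrafted text features and ViT image embeddings have already been computed as per-product NumPy arrays; the array names, sample counts, and `n_estimators` value are illustrative, not the report's actual configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative shapes: 200 products, 10 handcrafted text features,
# and the 768-dim ViT embeddings described in the report.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(200, 10))      # parsed from catalog_content
image_embeddings = rng.normal(size=(200, 768))  # [CLS] embeddings from the ViT
prices = rng.uniform(1.0, 100.0, size=200)      # regression target

# Concatenate both modalities into one flat feature matrix.
X = np.hstack([text_features, image_embeddings])  # shape (200, 778)

# Single tree-based regressor trained on the fused matrix.
model = GradientBoostingRegressor(n_estimators=20, random_state=42)
model.fit(X, prices)
preds = model.predict(X)
```

A flat concatenation like this lets a classical gradient-boosted model weigh text-derived and visual signals jointly without any joint deep training.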
Text Processing Pipeline:
- Preprocessing steps: Regex and string operations are used to parse `catalog_content` and extract features such as `text_length`, `word_count`, `brand_name`, `value_amount`, `unit_type`, `brand_frequency`, and keyword counts. Categorical features are then encoded with `LabelEncoder`.
- Model type: Not a model, but a rule-based feature extraction script.
- Key parameters: N/A (rule-based extraction).
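A minimal sketch of this rule-based extraction, assuming `Value:` and `Unit:` tags follow the pattern noted in the key observations; the exact regexes, field layout, and sample string are illustrative assumptions.

```python
import re

def extract_text_features(catalog_content: str) -> dict:
    """Parse a catalog_content string into a flat feature dict.
    Tag patterns (Value:, Unit:) are assumptions based on the
    structured-tag examples in the report."""
    features = {
        "text_length": len(catalog_content),
        "word_count": len(catalog_content.split()),
    }
    # Numeric quantity, e.g. "Value: 500"
    value_match = re.search(r"Value:\s*([\d.]+)", catalog_content)
    features["value_amount"] = float(value_match.group(1)) if value_match else 0.0
    # Unit tag, e.g. "Unit: gram"
    unit_match = re.search(r"Unit:\s*(\w+)", catalog_content)
    features["unit_type"] = unit_match.group(1).lower() if unit_match else "unknown"
    return features

sample = "Organic Honey Value: 500 Unit: gram pure natural sweetener"
print(extract_text_features(sample))
```

String-valued features such as `unit_type` would then be passed through `LabelEncoder` before training.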
Image Processing Pipeline:
- Preprocessing steps: Images are loaded from URLs, converted to RGB, and resized. The `AutoImageProcessor` handles normalization and tensor conversion for the ViT model.
- Model type: Vision Transformer (`google/vit-base-patch16-224`) used for feature extraction. The `[CLS]` token's last hidden state is used as the embedding.
- Key parameters: Output embedding dimension is 768.
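The embedding extraction step can be sketched with Hugging Face `transformers` as below. The model name comes from the report; the blank in-memory image stands in for a product image downloaded from a URL, and downloading the pretrained weights is assumed to be possible.

```python
import numpy as np
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

# Processor and ViT backbone named in the report.
processor = AutoImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTModel.from_pretrained("google/vit-base-patch16-224")
model.eval()

# Illustrative stand-in for an image loaded from a product URL.
image = Image.fromarray(np.zeros((224, 224, 3), dtype=np.uint8)).convert("RGB")

# AutoImageProcessor handles resizing, normalization, and tensor conversion.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# [CLS] token of the last hidden state -> 768-dim embedding.
embedding = outputs.last_hidden_state[:, 0, :].squeeze(0).numpy()
print(embedding.shape)  # (768,)
```

One such 768-dimensional vector per product is concatenated with the text features to form the final matrix.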
- SMAPE Score: 51.94%
- Other Metrics: The final selected model was a `GradientBoostingRegressor`. Feature importance analysis indicated that text-derived metadata (`text_length`, `value_amount`) and the visual embeddings were the most influential predictors.
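For reference, the SMAPE metric reported above can be computed as follows. This uses a common formulation scaled to percent; the challenge's exact denominator convention is an assumption.

```python
import numpy as np

def smape(y_true, y_pred) -> float:
    """Symmetric Mean Absolute Percentage Error, in percent.
    Assumes the common |y_true| + |y_pred| denominator convention."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    # Guard against division by zero when both values are exactly 0.
    ratio = np.where(denom == 0, 0.0, np.abs(y_pred - y_true) / denom)
    return 200.0 * float(np.mean(ratio))

print(smape([100.0], [100.0]))  # 0.0
print(smape([100.0], [50.0]))   # 66.66...
```

Lower is better; a perfect prediction scores 0% and the metric is bounded above by 200%.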
Our solution demonstrates the efficacy of combining deep visual features with robust text-based feature engineering. The Gradient Boosting model proved effective at capturing the complex, non-linear relationships within the fused feature set. Key lessons include the high signal value present in structured text tags and the significant predictive power of pre-trained vision models even when used only for feature extraction.
