This repository contains a collection of models designed to localize images taken at various locations across the IIIT-Hyderabad campus. The pipeline includes region classification, viewing angle prediction, and geographic coordinate estimation.
The system comprises three major components:
-
Region Classification
Classifies an image into one of 15 predefined campus regions using a fine-tuned CNN. -
Viewing Angle Prediction
Predicts the direction the camera was facing (angle from North) using region-specific regressors. -
Coordinate Prediction
Estimates the (latitude, longitude) location of the image by classifying it to a representative anchor within the predicted region.
Notebook: region_classification.ipynb
Trains a ConvNeXtV2 Tiny model (convnextv2_tiny.fcmae_ft_in1k) for 15-way classification of images into campus regions.
- Training Strategy: Fine-tuned on 6000+ campus images using
CrossEntropyLoss. - Preprocessing: Resize to 224Γ224, normalize with ImageNet stats, and apply augmentations (flips, crops).
- Output: Region ID (1β15)
Notebook: angle_prediction.ipynb
Implements a two-stage ensemble for predicting the viewing direction in degrees:
- Stage 1: A frozen ConvNeXtV2 model classifies the image region.
- Stage 2: A region-specific ConvNeXtV2 model regresses the angle in
[cos, sin]form usingMSELoss. - Postprocessing: Angle recovered using
atan2(sin, cos).
Notebook: coordinate_prediction.ipynb
Uses fused features and anchor classification for coordinate prediction:
- Feature Extraction: Extracts and fuses features from frozen ConvNeXtV2 (CNN) and ViT (Transformer).
- Anchor Setup: Applies K-Means clustering to region-wise coordinates to define
Nanchor points per region. - Classification: Trains a region-specific MLP using
CrossEntropyLossto classify the fused feature into one of the region's anchor indices.
- Trained region-specific expert models for angle and coordinate prediction, guided by a shared region classifier.
- Converted coordinate regression to a classification task via anchor clustering.
- Combined features from CNNs and Transformers to improve semantic understanding and localization accuracy.