Monument AI is a research-oriented deep learning project that performs monument recognition by analyzing images through multiple visual perspectives simultaneously.
Instead of relying on a single RGB image or massive pretrained backbones (like ResNet/VGG), this project builds a Custom Multi-Modal Residual CNN from scratch. It learns complementary representations (Appearance, Texture, Geometry, and Shape) to make robust predictions even on limited data.
⚠️ Philosophy: This project intentionally avoids pretrained backbones to focus on architecture design, learning stability, and structural reasoning.
- 🏗️ Custom Architecture: A handcrafted Residual CNN with skip connections for stable gradient flow.
- 👁️ Multi-Modal Input:
- RGB: Captures color and general appearance.
- Grayscale: Focuses on lighting invariance and texture.
- Depth Map: Approximates 3D structural geometry.
- Edge Map: Highlights contours and shape boundaries.
- ⚖️ Imbalance Handling: Uses explicit class weighting and Macro F1-score evaluation.
- 🖥️ Desktop GUI: A visual inference tool to inspect all 4 input modalities and confidence scores.
Monuments often share similar visual patterns (arches, domes, pillars), making single-view models brittle to lighting or angle changes.
This project injects Inductive Bias by separating learning into specialized branches:
- RGB $\rightarrow$ Appearance
- Grayscale $\rightarrow$ Texture Robustness
- Depth $\rightarrow$ Structural Layout
- Edges $\rightarrow$ Geometric Shape
Each branch learns independently, and their features are fused for the final classification. This improves learning stability and explainability.
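To make the multi-view idea concrete, the sketch below derives the auxiliary views from a single RGB array using numpy only. The edge map here is a simple gradient magnitude and the "depth" is an inverted-intensity proxy; these are illustrative stand-ins, not necessarily the estimators used in `dataset.py`.

```python
import numpy as np

def make_views(rgb: np.ndarray):
    """Derive grayscale, edge, and depth-proxy views from an RGB image.

    rgb: float array of shape (H, W, 3), values in [0, 1].
    Illustrative only; the project's dataset.py may use other estimators.
    """
    # Grayscale: standard luminance weighting (lighting-invariant texture)
    gray = rgb @ np.array([0.299, 0.587, 0.114])

    # Edge map: gradient magnitude via finite differences (shape boundaries)
    gy, gx = np.gradient(gray)
    edges = np.sqrt(gx**2 + gy**2)

    # Depth proxy: crude heuristic treating darker regions as farther away.
    # Sensor-grade depth cannot be recovered from a single 2D image.
    depth = 1.0 - gray

    return gray, edges, depth
```

Each view is then fed to its own branch, so no single branch has to disentangle appearance, texture, and geometry on its own.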
The model uses parallel Convolutional branches that merge into a dense fusion layer.
```mermaid
graph LR
    Input[Input Image] --> Pre[Preprocessing]
    Pre --> A[RGB Branch]
    Pre --> B[Gray Branch]
    Pre --> C[Depth Branch]
    Pre --> D[Edge Branch]
    A & B & C & D --> Fusion[Feature Fusion Layer]
    Fusion --> Dense[Dense Head]
    Dense --> Class[Softmax Classifier]
    Class --> Output[Monument Prediction]
```
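The two building blocks in the diagram are residual blocks (identity skip connections for stable gradient flow) and concatenation-based fusion. A minimal numpy forward pass, with hypothetical feature sizes rather than the project's actual layer shapes:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, w1, w2):
    """y = relu(x @ w1) @ w2 + x: the identity skip lets gradients
    bypass the transformation, stabilizing training in deep stacks."""
    return relu(x @ w1) @ w2 + x

def fuse(features):
    """Concatenate per-branch feature vectors into one fused vector."""
    return np.concatenate(features, axis=-1)

# Four hypothetical branch outputs of 16 features each
rng = np.random.default_rng(0)
branches = [rng.standard_normal(16) for _ in range(4)]
fused = fuse(branches)  # shape (64,), fed to the dense head
```

With zero weights a residual block reduces to the identity, which is exactly why residual stacks are easy to optimize: each block only has to learn a correction on top of its input.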
```
MONUMENT_AI/
├── data/
│   ├── train/                    # Training images (class-wise folders)
│   └── test/                     # Validation / unseen images
│
├── src/
│   ├── config.py                 # Hyperparameters & paths
│   ├── dataset.py                # Data loader + multi-view generation
│   ├── model.py                  # Custom multi-modal residual CNN
│   └── train.py                  # Training pipeline
│
├── outputs/
│   └── best_monument_model.h5    # Saved model weights
│
├── gui.py                        # Desktop GUI for inference
├── predict.py                    # CLI inference script
├── requirements.txt              # Dependencies
└── README.md                     # Documentation
```
- Optimizer: Adam
- Loss Function: Sparse Categorical Cross-Entropy
- Regularization: Dropout + Early Stopping + ReduceLROnPlateau
- Metric: Macro F1-score (preferred over accuracy because of class imbalance).
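For reference, inverse-frequency class weights and the macro F1 can be computed as below. This is a plain numpy sketch; equivalent helpers exist in scikit-learn (`compute_class_weight`, `f1_score(average="macro")`) and Keras accepts the weights via `class_weight` in `fit`.

```python
import numpy as np

def class_weights(labels, n_classes):
    """Inverse-frequency weights: rare classes get proportionally larger weights."""
    counts = np.bincount(labels, minlength=n_classes)
    return len(labels) / (n_classes * np.maximum(counts, 1))

def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1: every class counts equally,
    so a model that ignores rare classes scores poorly."""
    f1s = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return float(np.mean(f1s))
```

Because macro F1 averages per class rather than per sample, predicting only the majority class yields a low score even when plain accuracy looks high.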
- Python 3.10+
- GPU recommended (training also runs on CPU, just more slowly).
```bash
# Install dependencies
pip install -r requirements.txt
```
To visualize the 4-modality inputs and test predictions:
```bash
python gui.py
```
If you want to retrain the model on your own dataset:
```bash
python src/train.py
```
Current Limitations:
- Dataset Size: Limited data means the model is experimental.
- Depth Estimation: Depth maps are approximated from 2D images, not sensor-grade.
- Windows Tooling: The GUI is optimized for Windows.
Future Roadmap:
- Implement Attention Mechanisms for better feature fusion.
- Integrate state-of-the-art Monocular Depth Estimation.
- Ablation study to compare Single-Modal vs. Multi-Modal performance.
Educational & Research Project
