DeepAndes: A Self-Supervised Vision Foundation Model for Multi-Spectral Remote Sensing Imagery of the Andes (IEEE JSTARS 2025)
TL;DR: DeepAndes is the first vision foundation model that applies the DINOv2 self-supervised learning framework and large-scale pre-training on multi-spectral satellite imagery specifically for the Andes region.
Roadmap | Latest Updates | Highlights | Architecture | Quick Start | Results | Citation | Acknowledgements
This is an ongoing project for developing foundation models for the GEOPACHA web app.
- Updating the code for a YOLO (Ultralytics) object detection head (in progress)
- Exploring the next-generation DINOv3 model (in progress)
- 🌎 Extending pre-training to the full Andes region (100× more data) (in progress)
- 🔗 Integrate geospatial metadata and language models (next step)
🔥 🔥 🔥 Last Updated on 2025.12.27 🔥 🔥 🔥
- [2025.10.23] The DeepAndes ViT-L/14 backbone weights, pretrained on 3 million patches, are released (Google Drive); see Quick Start
- [2025.10.02] Our paper has been accepted for publication in the IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE JSTARS 2025).
- Foundation-model scale: Trained on ~3 million multi-spectral satellite patches covering ~488,640 km² of the Andes.
- Multi-spectral (8-band) input: Supports 8-band WorldView-2 and WorldView-3 satellite imagery instead of RGB.
- Self-supervised learning (SSL): Built upon DINOv2, adapted for geospatial feature scale and 8-channel inputs.
- Downstream versatility: Evaluated on classification, retrieval, and segmentation under both full and few-shot (reduced) settings.
- Backbone: Vision Transformer (ViT-L/14, ~304M parameters)
- Input: 8-band image patches (256 × 256) sampled across diverse Andean terrains at 0.5 meter/pixel
- SSL Framework:
- DINOv2 (Contrastive Learning + Distillation)
- Multi-crop global/local view strategy
- Channel adaptation for 8-band input
- Large-scale geospatial sampling
- Dataset: ~3M patches across 8 land-cover types (~488 k km² coverage)
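As a quick sanity check on the sampling geometry described above (patch size and ground resolution come from the spec; the arithmetic is only illustrative):

```python
# Each 256 x 256 patch at 0.5 m/pixel covers a 128 m x 128 m ground footprint.
patch_px = 256
gsd_m_per_px = 0.5
side_m = patch_px * gsd_m_per_px      # ground side length of one patch
area_km2 = (side_m / 1000.0) ** 2     # ground area of one patch
print(side_m)    # 128.0
print(area_km2)  # 0.016384
```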
This repo contains pretrained weights (Google Drive), code, and example scripts for downstream tasks with the DeepAndes backbone:
- Use the pre-trained backbone (via PyTorch Hub)
- Launch Pre-training
- Zero-shot: Image to Image Retrieval
- Few-shot Fine-tuning: Classification
- Few-shot Fine-tuning: Segmentation
See the instructions to install PyTorch (the only required dependency for loading the model). xFormers is also installed for memory-efficient attention. An example installation of PyTorch 2.8.0 with CUDA 12.8 and xFormers is provided here.
import torch
import torch.nn as nn

# Path to the pre-trained checkpoint
checkpoint = '/path/to/model/teacher_checkpoint.pth'

# Load the base DINOv2 ViT-L/14 architecture from PyTorch Hub
model = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitl14')
pretrained_dict = torch.load(checkpoint, map_location="cpu")
checkpoint_key = 'teacher'

# Keep only backbone weights; drop the DINO/iBOT projection heads
new_state_dict = {}
for k, v in pretrained_dict[checkpoint_key].items():
    if 'dino_head' in k or 'ibot_head' in k:
        print(f'{k} not used')
    else:
        new_state_dict[k.replace('backbone.', '')] = v

# ViT-L/14 with 224×224 input (8-band) → 257 tokens (256 patches + 1 cls), 1024 dims
model.pos_embed = nn.Parameter(torch.zeros(1, 257, 1024))

# Replace the patch-embedding projection to accept 8 input bands
new_patch_embed = model.patch_embed
new_patch_embed.proj = nn.Conv2d(
    in_channels=8,  # Updated for 8 input bands
    out_channels=new_patch_embed.proj.out_channels,
    kernel_size=new_patch_embed.proj.kernel_size,
    stride=new_patch_embed.proj.stride,
    padding=new_patch_embed.proj.padding,
)
model.patch_embed = new_patch_embed
model.load_state_dict(new_state_dict, strict=True)

E.g., adding a simple linear classifier head:
# add linear classification head
model.head = nn.Sequential(
    nn.Linear(1024, 256),
    nn.ReLU(),
    nn.Linear(256, 2)
)
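The positional-embedding size used above follows directly from the patch geometry; a minimal sanity check:

```python
# ViT-L/14 on a 224 x 224 input: 224 / 14 = 16 patches per side.
img_size, patch_size = 224, 14
num_patches = (img_size // patch_size) ** 2  # 16 * 16 = 256
num_tokens = num_patches + 1                 # plus one [CLS] token
print(num_tokens)  # 257
```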
- Adapts the patch embedding from 8 channels to 3 channels after loading the pre-trained weights.
- Note: the model was designed for 8-band WorldView imagery; the 3-band adaptation is experimental only!
- We provide a demo script in deepandes_for_3bands:

# 1. load pretrained weights
model.load_state_dict(new_state_dict, strict=True)
# 2. adjust patch embedding for new_in_ch = 3
adapt_patch_embed_in_chans(model, new_in_ch=3, mode="mean")
model.eval()
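For reference, a minimal sketch of what a helper like `adapt_patch_embed_in_chans` with `mode="mean"` plausibly does — rebuild the patch-embedding conv for a different channel count, initializing from the pretrained 8-band kernel. This is an illustrative reimplementation, not the repo's exact code:

```python
import torch
import torch.nn as nn

def adapt_patch_embed_in_chans(model, new_in_ch=3, mode="mean"):
    """Illustrative sketch: swap the patch-embedding conv to `new_in_ch`
    input channels, seeding its weights from the pretrained kernel by
    averaging over the original input channels."""
    old = model.patch_embed.proj
    new = nn.Conv2d(
        in_channels=new_in_ch,
        out_channels=old.out_channels,
        kernel_size=old.kernel_size,
        stride=old.stride,
        padding=old.padding,
        bias=old.bias is not None,
    )
    with torch.no_grad():
        if mode == "mean":
            # Average the per-band kernels, then replicate across the new channels.
            w = old.weight.mean(dim=1, keepdim=True)     # (out, 1, k, k)
            new.weight.copy_(w.repeat(1, new_in_ch, 1, 1))
        if old.bias is not None:
            new.bias.copy_(old.bias)
    model.patch_embed.proj = new
    return model
```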
If the error ModuleNotFoundError: No module named 'dinov2.hub.dinotxt' occurs while loading the module, simply comment out the following line in the ~/.cache/torch/hub/facebookresearch_dinov2_main/hubconf.py file:

# from dinov2.hub.dinotxt import dinov2_vitl14_reg4_dinotxt_tet1280d20h24l

The hubconf.py we used is provided; copy and paste it into the corresponding directory. This config mismatch arises because we use simple torch hub loading and then adjust the pre-trained weights.
Please refer to dinov2_ssl_8bands/README.md for detailed installation (here), dataset setup, and the key modifications supporting any number of image bands (e.g., 8 in this work).
Run on Multi-GPUs (without SLURM):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 \
/path/to/dinov2_ssl_8bands/dinov2/train/train_8bands.py \
--output-dir /path/to/output_dir \
--config-file /path/to/config_file.yaml \
--ssl-data /path/to/dataset/folder \
--wandb-trial <name_of_the_run> \
--wandb-project <name_of_the_project>
Replace CUDA_VISIBLE_DEVICES and --nproc_per_node with your specific multi-GPU settings (e.g., training on 8 A100-80GB GPUs).
Run on Single GPU (without SLURM):
python /path/to/dinov2_ssl_8bands/dinov2/train/train_8bands.py \
--output-dir /path/to/output_dir \
--config-file /path/to/config_file.yaml \
--ssl-data /path/to/dataset/folder \
--wandb-trial <name_of_the_run> \
--wandb-project <name_of_the_project>
An example training log is provided here:
We use the FAISS library for fast, simple image-to-image retrieval. To set up the Conda environment, follow the instructions in faiss_install.md. For implementation details and descriptions, see image_retrieval/README.md.
An example for zero-shot image-to-image retrieval using the DeepAndes pre-trained backbone is provided: deepandes_feature_extract.ipynb. Some parameters are defined:
pretrained_weight = '/path/to/teacher_checkpoint.pth' # Path to pre-trained checkpoint
device = torch.device("cuda:0") # Specify GPU index
path_to_all_images = '/path/to/dataset/folder/*.npy' # Path to all database images
number_to_retrieve = 10 # Top-k retrieval by cosine similarity
query_image_path = '/path/to/CLS1-7760-223.npy' # An example query loci image for retrieval

In our work, we display the image using channels/bands 4, 2, and 1 as RGB for visualization purposes only. The example below shows a query image (Class 1, active corrals; dark areas indicate animal use) and the top-10 retrieved images based on cosine similarity.
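FAISS does the heavy lifting at scale; conceptually, top-k retrieval over L2-normalized features reduces to a dot-product ranking. A NumPy sketch of that core operation (the function name and toy data are ours, not the repo's):

```python
import numpy as np

def topk_cosine(query_feat, db_feats, k=10):
    # L2-normalize so the dot product equals cosine similarity.
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q
    order = np.argsort(-sims)[:k]  # indices of the top-k most similar images
    return order, sims[order]

# Toy example: 1024-dim features for a 5-image database
rng = np.random.default_rng(0)
db = rng.standard_normal((5, 1024))
query = db[2] + 0.01 * rng.standard_normal(1024)  # near-duplicate of image 2
idx, scores = topk_cosine(query, db, k=3)
print(idx[0])  # 2
```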
See classification_eval/README.md for setup details and baseline comparisons, e.g., to fine-tune DeepAndes on a binary classification dataset:
python ./classification_eval/linear_prob_simple_args.py \
--use_wandb \
--wandb_project <wandb_project_name> \
--wandb_trial <wandb_run_name> \
--train_dataset_str /path/to/train_dataset_dir \
--val_dataset_str /path/to/val_dataset_dir \
--output_dir /path/to/output_dir \
--epochs 10 \
--cuda 0 \
--model_name deepandes \
--pretrained_weights /path/to/teacher_checkpoint.pth

See segmentation_eval/README.md for details. The example script trains semantic segmentation models using pretrained backbones with a simple linear segmentation head. E.g., for the few-shot active corral segmentation task with a frozen DeepAndes backbone:

python ./segmentation_eval/main_binary_experiment.py --config ./configs/segmentation_eval/active_corrals_experiment/corrals_active_FM3M.yaml

Scaling laws are observed as the pretraining scale increases from none to 30K, 300K, and 3M images, highlighting DeepAndes' scalability and performance gains with more data.
We benchmark DeepAndes against representative baselines: a Scratch model, self-supervised backbones (MoCo-V2, MAE), and SatMAE—a domain-specific remote sensing model.
- Zero-shot image retrieval: Top-5 and Top-50 mAP
- Few-shot classification: F1, Recall, and Precision
- Few-shot segmentation: Dice Similarity Coefficient (DSC), with a frozen backbone and a linear segmentation head
Notes: The public SSL backbones used for comparison are adapted to 8 bands (by adjusting the patch embedding) via the timm API, which also supports the DINO series and other SOTA PyTorch-based ViT models. An example of this adjustment is in moco_loader.py.
Few-shot results are reported on both the full training set and a highly constrained setting (N_train = 72 for classification, N_train = 10 for segmentation) to simulate data-limited conditions.
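For reference, a minimal NumPy implementation of the DSC metric used in the segmentation evaluation (a generic formulation, not the repo's exact code):

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-7):
    # DSC = 2|P ∩ T| / (|P| + |T|) for binary masks.
    pred = pred.astype(bool)
    target = target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(round(dice_coefficient(pred, target), 3))  # 2*2 / (3+3) = 0.667
```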
If you find this repository useful, please consider giving it a star ⭐ and a citation 🦖 Thank you :)
@ARTICLE{11196959,
author={Guo, Junlin and Zimmer-Dauphinee, James R. and Nieusma, Jordan M. and Lu, Siqi and Liu, Quan and Deng, Ruining and Cui, Can and Yue, Jialin and Lin, Yizhe and Yao, Tianyuan and Xiong, Juming and Zhu, Junchao and Qu, Chongyu and Yang, Yuechen and Wilkes, Mitchell and Wang, Xiao and VanValkenburgh, Parker and Wernke, Steven A. and Huo, Yuankai},
journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing},
title={DeepAndes: A Self-Supervised Vision Foundation Model for Multispectral Remote Sensing Imagery of the Andes},
year={2025},
volume={18},
number={},
pages={26983-26999},
keywords={Remote sensing;Foundation models;Satellite images;Analytical models;Transformers;Training;Surveys;Frequency modulation;Data models;Geospatial analysis;Andean archaeology;DINOv2;foundation model (FM);multispectral imaging;remote sensing;self-supervised learning},
doi={10.1109/JSTARS.2025.3619423}}
Supported by the GeoPACHA Project and collaborators at Vanderbilt University, Brown University, and ORNL. Special thanks to all contributors from the Andean Archaeology and Remote Sensing communities.
- Awesome Remote Sensing Foundation Models GitHub Repository: continuously updated; we reference several excellent DINO codebases and hyperparameter settings from there.
- Meta's DINO series (DINOv3 follows a similar codebase, with the same training setup for first-stage pre-training)
For questions or contributions, open an issue or pull request. We are looking forward to your feedback!
Contact: Junlin Guo (junlinguo1@gmail.com), Yuankai Huo (PI)(yuankai.huo@vanderbilt.edu), Steven Wernke (PI)(s.wernke@Vanderbilt.Edu), and Parker VanValkenburgh (parker_vanvalkenburgh@brown.edu)




