Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation (CroDiNo-KD)
Multi-modal RGB and Depth (RGBD) data significantly enhance environmental perception by providing 3D spatial context. However, accessing all sensor modalities during inference may be infeasible due to sensor failures or resource constraints.
To overcome this, we introduce CroDiNo-KD (Cross-Modal Disentanglement: a New Outlook on Knowledge Distillation). Unlike traditional Cross-Modal Knowledge Distillation (CMKD) frameworks that rely on a computationally expensive teacher/student paradigm, CroDiNo-KD jointly trains single-modality RGB and Depth models through mutual interaction and collaboration.
By leveraging disentangled representation learning, contrastive learning, and decoupled data augmentation, our approach structures the models' internal manifolds into modality-invariant and modality-specific features.
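To make the disentanglement idea concrete, here is a minimal NumPy sketch. It assumes each encoder's embedding is split into an invariant half and a specific half; the function names and the exact loss terms are illustrative only, not the repo's actual API or objective.

```python
import numpy as np

def split_embedding(z, d_inv):
    """Split an embedding into modality-invariant and modality-specific parts."""
    return z[:d_inv], z[d_inv:]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def disentanglement_losses(z_rgb, z_depth, d_inv):
    """Toy contrastive objective: pull the invariant parts of the two
    modalities together, and push each invariant part away from its
    modality-specific counterpart."""
    inv_rgb, spec_rgb = split_embedding(z_rgb, d_inv)
    inv_d, spec_d = split_embedding(z_depth, d_inv)
    align = 1.0 - cosine(inv_rgb, inv_d)  # invariant parts should agree
    ortho = abs(cosine(inv_rgb, spec_rgb)) + abs(cosine(inv_d, spec_d))
    return align, ortho

# Tiny demo: the invariant halves of the two embeddings are identical,
# so the alignment term is ~0 while the orthogonality term is non-zero.
z_rgb = np.array([1.0, 0.0, 0.0, 1.0])
z_depth = np.array([1.0, 0.0, 1.0, 0.0])
align, ortho = disentanglement_losses(z_rgb, z_depth, d_inv=2)
```

In the actual framework these terms would act on batches of encoder features and be balanced with the segmentation loss; the sketch only shows the geometric intuition.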
- Teacher-Free Paradigm: Eliminates the need for a multi-modal teacher, significantly reducing training time and parameter count.
- Disentangled Representations: Separates feature embeddings into modality-invariant and modality-specific information.
- Decoupled Augmentation: Allows independent, per-modality data augmentation strategies.
- State-of-the-Art Performance: Consistently outperforms existing CMKD methods on diverse benchmarks (indoor, aerial and drone imagery).
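Because the two single-modality networks never share a fused input, each modality can be augmented independently. The sketch below (plain Python, illustrative names, not the repo's pipeline) shows why this matters: a photometric transform that is harmless for RGB would corrupt metric depth values, so depth gets a value-preserving geometric perturbation instead.

```python
import random

def augment_rgb(img):
    """Photometric jitter (brightness scaling): meaningful for RGB pixels,
    but it would corrupt metric depth values."""
    factor = random.uniform(0.8, 1.2)
    return [[min(255, int(p * factor)) for p in row] for row in img]

def augment_depth(depth, max_shift=2):
    """A small horizontal shift as a stand-in geometric augmentation
    that leaves the depth values themselves untouched."""
    shift = random.randint(1, max_shift)
    return [row[shift:] + row[:shift] for row in depth]

# Toy 2x2 inputs; each modality receives its own, independent augmentation.
rgb_images = [[[100, 200], [50, 25]]]
depth_maps = [[[1.0, 2.0], [3.0, 4.0]]]
rgb_batch = [augment_rgb(img) for img in rgb_images]
depth_batch = [augment_depth(d) for d in depth_maps]
```

In a teacher/student CMKD setup the fused teacher input forces both modalities through the same spatial transform; decoupling removes that constraint.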
*Overview of the CroDiNo-KD architecture, featuring two encoder-decoder models and an auxiliary decoder, optimized via disentanglement and contrastive learning.*
We evaluate CroDiNo-KD on three diverse RGBD datasets: NYU Depth v2 (indoor), Potsdam (aerial), and Mid-Air (drone flight). Our method achieves state-of-the-art performance in cross-modal knowledge distillation (mIoU scores):
| Dataset | Modality | Single-Modality Baseline | Best Competitor | CroDiNo-KD (Ours) |
|---|---|---|---|---|
| NYU Depth v2 | RGB | 42.64 | 43.86 (KDv2) | 44.85 |
| NYU Depth v2 | Depth | 36.01 | 37.28 (ProtoKD) | 37.60 |
| Potsdam | RGB | 75.73 | 76.09 (Masked Dist.) | 76.13 |
| Potsdam | Depth | 42.47 | 42.43 (Masked Dist.) | 42.78 |
| Mid-Air | RGB | 47.84 | 48.32 (KD-Net) | 48.37 |
| Mid-Air | Depth | 47.07 | 47.40 (Masked Dist.) | 47.91 |
Note: CroDiNo-KD not only achieves higher accuracy but also trains about 43% faster than standard CMKD methods (20h vs. 36h+ on Mid-Air).
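The scores above use the standard mean Intersection-over-Union metric. As a reference for how such numbers are computed in general, here is a minimal NumPy sketch of per-class IoU averaging; it is illustrative and not the repo's evaluation code.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Mean Intersection-over-Union, averaged over classes that appear
    in either the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, target == c).sum()
        union = np.logical_or(pred == c, target == c).sum()
        if union > 0:  # skip classes absent from both maps
            ious.append(inter / union)
    return float(np.mean(ious))

# Tiny demo on flattened label maps with 2 classes:
# class 0 -> IoU 1/2, class 1 -> IoU 2/3, mean ~0.583.
pred = np.array([0, 0, 1, 1])
target = np.array([0, 1, 1, 1])
score = mean_iou(pred, target, num_classes=2)
```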
If you find this code or our paper useful in your research, please consider citing our work:
```bibtex
@inproceedings{crodinokd,
  title={Revisiting Cross-Modal Knowledge Distillation: A Disentanglement Approach for RGBD Semantic Segmentation},
  author={Ferrod, Roger and Dantas, C{\'a}ssio F. and Di Caro, Luigi and Ienco, Dino},
  booktitle={European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML/PKDD)},
  year={2025}
}
```