HI4HC and AAAAD: Exploring a Hierarchical Method and Dataset Using Hybrid Intelligence for Remote Sensing Scene Captioning
This is an ongoing project. We will continuously enhance the model's performance and update the dataset. Click the Star in the top-right corner to follow us and stay up to date with the latest research outcomes!
- [2025.03.05] The key name in AAAAD has been changed to "element" for consistency with the paper.
- [2024.12.10] We have released AAAAD, feel free to use it! See "How to use AAAAD?"
- Our HI4HC-WebUI for automatically filtering and supplementing geographical element labels for GUT, CUT, and CDT:

  HI4HC_demonstration_video.mp4 (please wear headphones for a better understanding of HI4HC-WebUI)

  Note: The complete code will be released after the paper is accepted.
- Overall Strategy for Hierarchical Captioning for Remote Sensing Scenes
- HI4HC: Hybrid Intelligence for Remote Sensing Scene Hierarchical Captioning
- AAAAD: A Hierarchical Caption Dataset for Remote Sensing Scene Based on Hybrid Intelligence
- Paper
- Acknowledgement
- License
- Global and category knowledge graphs used for automatically cleansing and supplementing geographical element labels.

The AAAAD dataset consists of two parts: a remote sensing imagery dataset and a hierarchical description dataset. The remote sensing imagery dataset is derived from the AID dataset, while the hierarchical caption dataset includes geographical element captions, spatial relation captions, and scene-level captions.
- Download the remote sensing imagery dataset of AAAAD: The remote sensing imagery in AAAAD is sourced from the AID dataset. You can either download the AID dataset and preprocess it yourself (center-crop to 512x512 resolution, as sketched below), or directly download our preprocessed version, Dataset_AAAAD_Imagery: Hugging Face or Baidu NetDisk (code: cjql).
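  If you preprocess the AID imagery yourself, a minimal center-crop sketch is below. The folder names are illustrative, not part of this repository; AID images are 600x600, so the crop is always valid:

  ```python
  from pathlib import Path
  from PIL import Image

  SRC = Path("AID")            # hypothetical folder containing the raw AID images
  DST = Path("AAAAD_Imagery")  # hypothetical output folder for the 512x512 crops
  SIZE = 512

  DST.mkdir(exist_ok=True)
  for img_path in SRC.rglob("*.jpg"):
      img = Image.open(img_path)
      w, h = img.size
      # Center-crop a SIZE x SIZE window from the middle of the image.
      left, top = (w - SIZE) // 2, (h - SIZE) // 2
      img.crop((left, top, left + SIZE, top + SIZE)).save(DST / f"{img_path.stem}.png")
  ```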
- Download the hierarchical caption dataset of AAAAD: The Dataset_AAAAD_Hierarchical_Caption.json file in this repository contains the hierarchical caption dataset of AAAAD, structured as follows:
  {
    "dataset": "AAAAD",
    "category": {
      "church": [
        {
          "image_id": 1,
          "file_name": "church_1.png",
          "split": "test",
          "hierarchical_caption": {
            "element": "building, city, cityscape, skyscraper, scenery, architecture, library, tower, street, real world location, town, outdoors, road, house, from above, car, tree, water, fountain",
            "relation": "this aerial photo depicts a city area with multiple buildings and structures. the most striking feature is a large elliptical building with a blue-green roof, possibly a stadium or auditorium. surrounding this central structure are various other buildings of different shapes and sizes, including a semi-circular design adjacent to the elliptical structure.",
            "scene": "commercial"
          }
        }
      ]
    }
  }
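  Once downloaded, the captions can be read with a few lines of Python; a minimal loading sketch that simply follows the structure shown above (the file name matches this repository):

  ```python
  import json

  # Load the hierarchical caption dataset (structure shown above).
  with open("Dataset_AAAAD_Hierarchical_Caption.json", encoding="utf-8") as f:
      data = json.load(f)

  # Iterate over every image in every scene category.
  for category, images in data["category"].items():
      for item in images:
          caption = item["hierarchical_caption"]
          print(item["file_name"], item["split"])
          print("  elements:", caption["element"])
          print("  relation:", caption["relation"])
          print("  scene:   ", caption["scene"])
  ```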
- Qualitative comparison between existing RSI caption datasets and AAAAD (ours).

- Quantitative comparison between AAAAD and existing remote sensing caption datasets.

- Comparison of AAAAD and existing remote sensing caption datasets across different dimensions (element, attributes, spatial relations).

- Statistical analysis of AAAAD and existing remote sensing caption datasets.

- Comparison of semantic similarity between AAAAD and existing remote sensing caption datasets.

- Direct comparison of remote sensing scenes generated by different algorithms using traditional single-level captions and hierarchical captions as prompts.

Please cite the following paper if you find it useful for your research:
@article{ren2024hi4hc,
  title   = {HI4HC and AAAAD: Exploring a hierarchical method and dataset using hybrid intelligence for remote sensing scene captioning},
  author  = {Jiaxin Ren and Wanzeng Liu and Jun Chen and Shunxi Yin},
  journal = {International Journal of Applied Earth Observation and Geoinformation},
  volume  = {139},
  pages   = {104491},
  year    = {2025},
  issn    = {1569-8432},
  doi     = {10.1016/j.jag.2025.104491},
  url     = {https://www.sciencedirect.com/science/article/pii/S1569843225001384}
}
- Kohya's GUI. A repository that primarily provides a Gradio GUI for Kohya's Stable Diffusion trainers. We drew inspiration from its annotator WebUI to implement automatic filtering of geographical element labels for GUT, CUT, and CDT.
- Deep Danbooru. A deep learning model trained on the Danbooru dataset using the ResNet architecture, specifically designed for recognizing and tagging content and attributes in anime-style images.
- WD14. An advanced version of Deep Danbooru, combining a larger dataset and deeper network structure to support a broader range of tags and improve tag prediction accuracy.
- BLIP-2. A model that unifies the framework for visual-language pre-training and fine-tuning, enabling multimodal learning and cross-modal understanding.
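As an illustration of how such a captioner can be applied to a remote sensing tile, here is a minimal sketch using BLIP-2 via Hugging Face transformers. The checkpoint choice and image path are assumptions for the example, not necessarily the exact pipeline used in HI4HC:

```python
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Illustrative checkpoint; other BLIP-2 checkpoints on the Hub work the same way.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("church_1.png").convert("RGB")  # hypothetical AAAAD tile
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=60)
print(processor.decode(out[0], skip_special_tokens=True))
```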
This repo is distributed under the MIT License. The code may be used for academic purposes only.


