# SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding
Existing environmental sound classification (ESC) methods treat sounds as isolated signals, ignoring where they occur. SpaAudioLM bridges this gap by jointly reasoning over audio and geospatial Point-of-Interest (POI) metadata, enabling spatially grounded sound understanding across 28 environmental sound categories.
We fine-tune Qwen2.5-Omni-7B through a three-phase pipeline:
- **Difficulty Profiling**: derive per-class difficulty weights from zero-shot recall
- **Constrained CoT SFT**: supervised fine-tuning with an answer-weighted loss
- **Difficulty-Aware GRPO**: reinforcement learning with a composite reward (weighted F1 + format + POI consistency)
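A minimal sketch of the difficulty-profiling step: weights grow as zero-shot recall falls. The exact recall-to-weight mapping and the `floor` parameter are illustrative assumptions, not taken from the repo:

```python
def difficulty_weights(recall_per_class, floor=0.1):
    """Map zero-shot recall to per-class difficulty weights.

    Classes the base model already recalls well get small weights;
    hard classes (low recall) get large ones. The mapping below
    (1 - recall, floored, then normalized to mean 1) is one
    plausible choice -- the actual mapping may differ.
    """
    raw = {c: max(floor, 1.0 - r) for c, r in recall_per_class.items()}
    total = sum(raw.values())
    n = len(raw)
    # Normalize so weights average to 1 across classes.
    return {c: w * n / total for c, w in raw.items()}
```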
Comparison on multi-label audio event classification (mean +/- std over 5 runs, %):
| Model | F1-Micro | F1-Macro | F1-Weighted | Jaccard | Exact Match |
|---|---|---|---|---|---|
| Qwen2-Audio-7B | 4.73 | 2.86 | 5.27 | 1.96 | 0.00 |
| Qwen2.5-Omni-7B | 34.36 | 25.90 | 37.35 | 18.31 | 9.97 |
| Qwen3-Omni-30B | 29.66 | 20.26 | 28.80 | 14.81 | 14.02 |
| GPT-4o Audio | 30.09 | 26.47 | 34.07 | 17.18 | 9.43 |
| Gemini 2.5 Pro | 44.24 | 40.35 | 47.65 | 28.04 | 15.58 |
| SpaAudioLM (Ours) | 73.36 | 63.48 | 72.98 | 53.57 | 54.47 |
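For reference, the two set-based metrics in the table (Jaccard and Exact Match) can be computed per sample and averaged as below. The function name and formulation are our own sketch; the repo's metrics live in `app/utils/evaluate.py`:

```python
def multilabel_scores(y_true, y_pred):
    """Sample-averaged Jaccard and exact-match for multi-label predictions.

    y_true and y_pred are parallel lists of label sets. An empty
    union counts as a perfect Jaccard match by convention.
    """
    jac, exact = [], []
    for t, p in zip(y_true, y_pred):
        union = t | p
        jac.append(len(t & p) / len(union) if union else 1.0)
        exact.append(t == p)
    return sum(jac) / len(jac), sum(exact) / len(exact)
```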
## Installation

```bash
git clone https://github.com/yushiran/SpaAudioLM.git
cd SpaAudioLM

# Install uv (if not already installed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Create virtual environment and install dependencies
uv sync
source .venv/bin/activate
```

## Dataset

```bash
# Clone from HuggingFace (includes audio files, ~2GB)
git clone https://huggingface.co/datasets/shiran-yu/SpaAudioLM-Dataset data
```

This places the dataset at `data/`, which is the default `DATASET_ROOT`. The directory contains:
- `audio/` - 3,854 WAV files
- `sft_train.json` - SFT training data with Chain-of-Thought annotations (2,697 samples)
- `grpo_train.json` - GRPO training data (2,697 samples)
- `test.json` - Test set (579 samples)
- `validation.json` - Validation set (578 samples)
- `poi_features.json` - POI metadata for all audio samples
- `class_labels.json` - List of 28 sound event categories
- `splits/` - Train/test/validation CSV splits
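A small sketch of how a split and its POI metadata might be joined. The `audio_id` field name is a guess; check the actual JSON keys in the dataset:

```python
import json
from pathlib import Path


def load_split(root, name):
    """Load one dataset split and attach POI metadata to each sample.

    Assumes each sample carries an "audio_id" that keys into
    poi_features.json; the exact field name may differ.
    """
    root = Path(root)
    samples = json.loads((root / f"{name}.json").read_text())
    poi = json.loads((root / "poi_features.json").read_text())
    for s in samples:
        s["poi"] = poi.get(s.get("audio_id", ""), {})
    return samples
```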
## Configuration

```bash
cp .env.example .env
# Edit .env with your paths:
# MODEL_PATH - HuggingFace model cache directory
# R1_STRENGTH_SFT_MODEL_PATH - SFT output directory
# R1_STRENGTH_GRPO_MODEL_PATH - GRPO output directory
```

To run inference without training:
```bash
# Download SpaAudioLM weights from HuggingFace
huggingface-cli download shiran-yu/SpaAudioLM --local-dir models/SpaAudioLM
```

## SFT Training

```bash
# Single-node multi-GPU training (requires 4x GPUs, 32GB+ VRAM each)
bash app/src/sft/GeoOmniR1Strength-sft.sh

# Or submit to SLURM cluster
sbatch app/src/sft/GeoOmniR1Strength-sft-sbatch.batch
```

Key SFT hyperparameters (configured in the script):
- Base model: `Qwen/Qwen2.5-Omni-7B`
- Epochs: 6, LR: 1e-5, Batch size: 4/GPU
- DeepSpeed ZeRO-2, full-parameter fine-tuning
- Custom loss: answer tokens weighted 5x over thinking tokens
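The answer-weighted loss can be illustrated on per-token losses. The real implementation lives in `app/src/sft/plugin/loss_scale.py` and operates inside the training loop; this is only a sketch of the weighting scheme:

```python
def weighted_token_loss(token_nll, answer_mask, answer_weight=5.0):
    """Average per-token NLL with answer tokens up-weighted.

    token_nll: per-token negative log-likelihoods.
    answer_mask: 1 for tokens in the final answer span, 0 for
    chain-of-thought ("thinking") tokens. With answer_weight=5.0,
    answer tokens contribute 5x to the normalized loss.
    """
    scales = [answer_weight if m else 1.0 for m in answer_mask]
    return sum(l * s for l, s in zip(token_nll, scales)) / sum(scales)
```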
## GRPO Training

```bash
# Requires SFT checkpoint from Phase 1
# Ensure R1_STRENGTH_SFT_MODEL_PATH in .env points to your SFT checkpoint
bash app/src/grpo/GeoOmniR1-grpo-strength.sh

# Or submit to SLURM cluster
sbatch app/src/grpo/GeoOmniR1-grpo-strength-sbatch.batch
```

Key GRPO hyperparameters:
- Epochs: 3, LR: 1e-6, Group size: 8
- Rewards: weighted F1 (1.0) + format (0.1) + POI consistency (0.3)
- KL coefficient: 0.05
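The composite reward with the weights listed above can be sketched as follows. The actual reward functions live in `app/src/grpo/plugin.py`; here `poi_labels` stands in for the set of sound classes plausible at the sample's POI, and the POI-consistency formula is an assumption:

```python
def composite_reward(pred_labels, true_labels, poi_labels, format_ok,
                     class_weights, w_f1=1.0, w_fmt=0.1, w_poi=0.3):
    """Composite GRPO reward: weighted F1 + format + POI consistency.

    class_weights are the per-class difficulty weights from profiling
    (missing classes default to 1.0). format_ok is True when the
    model's output parses into the expected answer format.
    """
    tp = sum(class_weights.get(c, 1.0) for c in pred_labels & true_labels)
    p_den = sum(class_weights.get(c, 1.0) for c in pred_labels)
    r_den = sum(class_weights.get(c, 1.0) for c in true_labels)
    prec = tp / p_den if p_den else 0.0
    rec = tp / r_den if r_den else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    # Fraction of predicted classes consistent with the POI context.
    poi_ok = len(pred_labels & poi_labels) / len(pred_labels) if pred_labels else 0.0
    return w_f1 * f1 + w_fmt * float(format_ok) + w_poi * poi_ok
```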
## Inference & Evaluation

```bash
# SFT model inference
bash app/src/sft/GeoOmniR1Strength-sft-infer.sh

# GRPO model inference
bash app/src/grpo/GeoOmniR1-grpo-strength-infer.sh

# 5-run statistical evaluation
bash app/src/grpo/5timesInfer.sh
```

```bash
# Evaluate a single inference output
uv run app/src/GeoOmniR1Strength_evaluate.py --output_dir <path_to_output.json> --save_results

# Aggregate 5-run results (mean +/- std)
uv run app/src/evaluateAverageScore.py --base_dir <path_to_5runs_dir>
```

## Project Structure

```
SpaAudioLM/
├── app/
│   ├── config.py                       # Central configuration (env vars, paths)
│   ├── prompts/                        # Jinja2 prompt templates
│   │   ├── prompt_manager.py
│   │   └── scripts/GeoOmniR1/          # Audio+POI, SFT, GRPO, ablation prompts
│   ├── utils/
│   │   ├── evaluate.py                 # Multi-label metrics & visualization
│   │   └── poi_filter.py               # POI filtering + acoustic scene generation
│   └── src/
│       ├── sft/                        # SFT training & inference scripts
│       │   ├── plugin/loss_scale.py    # Token-level weighted loss
│       │   └── *.sh / *.batch
│       ├── grpo/                       # GRPO training & inference scripts
│       │   ├── plugin.py               # Reward functions (weighted F1, format, POI)
│       │   └── *.sh / *.batch
│       ├── dataset/                    # Dataset generation pipeline
│       ├── GeoOmniR1Strength_evaluate.py
│       └── evaluateAverageScore.py
├── pyproject.toml
├── uv.lock
├── .env.example
└── LICENSE
```
All paths are configured via environment variables (see `.env.example`):

| Variable | Default | Description |
|---|---|---|
| `DATASET_ROOT` | `data` | Root directory of the dataset |
| `DATASET_PATH` | `data/audio` | Audio files directory |
| `MODEL_PATH` | - | HuggingFace model cache |
| `R1_STRENGTH_SFT_MODEL_PATH` | - | SFT training output |
| `R1_STRENGTH_GRPO_MODEL_PATH` | - | GRPO training output |
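A sketch of how these variables might be resolved in code, with the defaults from the table. The dictionary keys are illustrative; the actual attribute names in `app/config.py` may differ:

```python
import os
from pathlib import Path


def load_config():
    """Resolve paths from environment variables, falling back to the
    documented defaults where one exists."""
    return {
        "dataset_root": Path(os.environ.get("DATASET_ROOT", "data")),
        "dataset_path": Path(os.environ.get("DATASET_PATH", "data/audio")),
        "model_path": os.environ.get("MODEL_PATH"),        # no default
        "sft_out": os.environ.get("R1_STRENGTH_SFT_MODEL_PATH"),
        "grpo_out": os.environ.get("R1_STRENGTH_GRPO_MODEL_PATH"),
    }
```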
## Citation

```bibtex
@article{hou2025spaaudioLM,
  title={SpaAudioLM: Spatial Context-Assisted Audio Language Model for Geospatially Aware Sound Understanding},
  author={Hou, Yuanbo and Yu, Shiran and Zhi, Zhuo},
  year={2025}
}
```