This document describes a feature extractor for underwater acoustic spectrograms using the SimCLR (Simple Framework for Contrastive Learning of Representations) approach. The feature extractor is designed to learn robust representations from unlabeled underwater acoustic data, capturing the diverse characteristics of environmental noise, biological signals, man-made signals, and transient events.
The design incorporates specialized components for underwater acoustics:
- Enhanced backbone architecture with multi-scale processing and attention mechanisms
- Acoustic-specific data augmentations
- Optimized projection head for acoustic feature representation
This self-supervised approach is particularly well-suited for underwater acoustic data, where labeled examples may be scarce but unlabeled data is abundant.
## Contents

- Introduction
- Underwater Acoustic Data Characteristics
- SimCLR Architecture Design
- Data Preprocessing and Augmentation
- Training Pipeline
- Demonstration of Augmentation Techniques
- Usage Guidelines
- Conclusion
## Introduction

Underwater acoustic data presents unique challenges for feature extraction:
- Non-stationary noise with varying density functions
- Signals from diverse sources (biological, geological, man-made)
- Wide variations in signal amplitude, frequency content, and temporal patterns
- Complex signal characteristics (harmonics, Doppler effects, transients)
Self-supervised learning approaches like SimCLR are ideal for this domain as they can learn meaningful representations without requiring labeled data. The SimCLR method works by learning to maximize agreement between differently augmented views of the same data sample via a contrastive loss in the latent space.
## Underwater Acoustic Data Characteristics

Our analysis of underwater acoustic spectrograms revealed distinct characteristics across different signal types:

**Environmental noise:**
- Uniform energy distribution across frequencies
- Non-stationary patterns over time
- Lower overall intensity compared to signal categories

**Biological signals:**
- Whale calls: Distinctive frequency modulation patterns, concentrated energy in specific frequency bands
- Fish sounds: Short, impulsive patterns with broader frequency content
- Coral scraping: Irregular bursts of broadband energy

**Man-made signals:**
- Boats/Ships: Strong harmonic structure with clear fundamental frequency and overtones
- Submarines: Low-frequency tonals, sometimes with frequency shifts
- Speedboats: Higher frequency content with potential Doppler effects

**Transient events:**
- Brief, high-energy broadband events
- Sparse in time domain
- Wide frequency range coverage
## SimCLR Architecture Design

The SimCLR architecture consists of three main components:
- Enhanced Backbone Network
- Projection Head
- Contrastive Loss Function
### Enhanced Backbone Network

We designed a specialized backbone network based on ResNet with modifications for underwater acoustic spectrograms:

```python
class EnhancedBackbone(nn.Module):
    def __init__(self, base_model='resnet18', pretrained=False):
        super(EnhancedBackbone, self).__init__()
        # Base ResNet backbone
        self.backbone = ResNetBackbone(base_model, pretrained)
        # Add multi-scale modules after each ResNet block
        self.multi_scale1 = MultiScaleModule(64, 64)
        self.multi_scale2 = MultiScaleModule(128, 128)
        self.multi_scale3 = MultiScaleModule(256, 256)
        # Add attention modules
        self.attention1 = DualAttentionModule(64)
        self.attention2 = DualAttentionModule(128)
        self.attention3 = DualAttentionModule(256)
        # Feature dimension remains the same as the base backbone
        self.feature_dim = self.backbone.feature_dim
```

Key enhancements include multi-scale processing and dual attention, described below.
The MultiScaleModule processes the input at multiple scales to capture both fine-grained patterns (like transients) and longer-term patterns (like whale calls):
```python
class MultiScaleModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(MultiScaleModule, self).__init__()
        # Different kernel sizes for capturing patterns at different scales
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=5, padding=2),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        # Concatenate the four quarter-width branches back to out_channels
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)],
            dim=1
        )
```

The DualAttentionModule combines frequency and time attention to focus on relevant parts of the spectrogram:
```python
class DualAttentionModule(nn.Module):
    def __init__(self, in_channels, reduction_ratio=8):
        super(DualAttentionModule, self).__init__()
        self.freq_attention = FrequencyAttention(in_channels, reduction_ratio)
        self.time_attention = TimeAttention(in_channels, reduction_ratio)

    def forward(self, x):
        x = self.freq_attention(x)
        x = self.time_attention(x)
        return x
```

This helps the model attend to specific frequency bands important for different acoustic signals (e.g., low frequencies for submarines, mid-frequencies for whale calls) and to specific temporal patterns (e.g., brief transients, longer modulated calls).
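The FrequencyAttention and TimeAttention submodules are referenced here but not shown. As an illustration only, a squeeze-and-excitation-style attention over the frequency axis might look like the following sketch (the layer sizes and pooling choice are assumptions, not the project's actual implementation):

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Hypothetical frequency attention: weight each frequency bin by its
    time-averaged response (squeeze-and-excitation style)."""
    def __init__(self, in_channels, reduction_ratio=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // reduction_ratio, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // reduction_ratio, in_channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, channels, freq, time)
        pooled = x.mean(dim=3, keepdim=True)      # average over time: (B, C, F, 1)
        weights = self.sigmoid(self.mlp(pooled))  # per-channel, per-frequency gates in (0, 1)
        return x * weights                        # broadcast the gates over the time axis
```

A TimeAttention counterpart would pool over the frequency axis instead, gating temporal frames in the same way.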
### Projection Head

The projection head maps representations to the space where the contrastive loss is applied:

```python
class ProjectionHead(nn.Module):
    def __init__(self, input_dim, hidden_dim=512, output_dim=128):
        super(ProjectionHead, self).__init__()
        # Multi-layer projection head, as recommended in the SimCLR paper
        self.projection = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim),
            nn.BatchNorm1d(output_dim)
        )

    def forward(self, x):
        return self.projection(x)
```

We use a deeper projection head than the original SimCLR paper (three linear layers rather than two) to handle the complexity of acoustic features.
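As a concrete shape check, the sketch below runs a batch of backbone features through an equivalent projection (assuming a ResNet-18 backbone whose pooled feature dimension is 512; the `nn.Sequential` mirrors the class above):

```python
import torch
import torch.nn as nn

# Equivalent of ProjectionHead(input_dim=512): three linear layers with
# batch norm, mapping 512-dim backbone features h to 128-dim latents z
projection = nn.Sequential(
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
    nn.Linear(512, 128), nn.BatchNorm1d(128),
)
h = torch.randn(8, 512)  # batch of backbone representations h
z = projection(h)        # latent vectors z fed to the contrastive loss
print(z.shape)           # torch.Size([8, 128])
```

Following standard SimCLR practice, the projection head is used only during contrastive pre-training; downstream tasks consume the backbone output h rather than z.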
### Contrastive Loss Function

We implement the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss from the SimCLR paper:

```python
class NTXentLoss(nn.Module):
    def __init__(self, temperature=0.5, batch_size=256):
        super(NTXentLoss, self).__init__()
        self.temperature = temperature
        self.batch_size = batch_size
        self.criterion = nn.CrossEntropyLoss(reduction="sum")
        self.similarity_f = nn.CosineSimilarity(dim=2)
```

The temperature parameter controls the concentration of the similarity distribution; lower values make the model more sensitive to hard negatives.
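Only the constructor is shown above; the loss can also be written functionally. The sketch below is an independent reference implementation of NT-Xent for a batch of paired views (not the project's exact forward pass):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over 2N projected views: z1[i] and z2[i] are the two
    augmented views of sample i."""
    batch_size = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)           # (2N, D)
    z = F.normalize(z, dim=1)                # cosine similarity via dot product
    sim = z @ z.t() / temperature            # (2N, 2N) scaled similarities
    # Mask self-similarity so a sample is never its own candidate
    mask = torch.eye(2 * batch_size, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))
    # The positive for row i is its partner view at i + N (or i - N)
    targets = torch.cat([torch.arange(batch_size) + batch_size,
                         torch.arange(batch_size)])
    return F.cross_entropy(sim, targets)
```

Each sample's positive is its partner view; the other 2N - 2 samples in the batch act as negatives.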
## Data Preprocessing and Augmentation

We implemented a custom dataset class for underwater acoustic spectrograms:

```python
class UnderwaterAcousticDataset(Dataset):
    def __init__(self, data_dir, transform=None, simclr_mode=True):
        self.data_dir = data_dir
        self.transform = transform
        self.simclr_mode = simclr_mode
        self.file_list = [f for f in os.listdir(data_dir) if f.endswith('.png')]

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        img_name = os.path.join(self.data_dir, self.file_list[idx])
        # Load image and convert to grayscale
        image = Image.open(img_name).convert('L')
        # Extract the actual spectrogram part, cropping away axes and margins
        image = image.crop((80, 80, 800, 350))
        # Apply transformations if specified
        if self.transform:
            if self.simclr_mode:
                # For SimCLR, create two differently augmented views
                img1 = self.transform(image)
                img2 = self.transform(image)
                return img1, img2
            else:
                return self.transform(image)
        # No transform: return the cropped image as-is
        return image
```

Based on our analysis of underwater acoustic characteristics, we designed specialized augmentations:
- Time Shifting: Handles varying onset times of signals

```python
def time_shift(spectrogram, max_shift_percent=0.2):
    width = spectrogram.shape[1]
    shift_amount = int(width * np.random.uniform(-max_shift_percent, max_shift_percent))
    shifted = np.zeros_like(spectrogram)
    if shift_amount > 0:
        shifted[:, shift_amount:] = spectrogram[:, :width - shift_amount]
    elif shift_amount < 0:
        shifted[:, :width + shift_amount] = spectrogram[:, -shift_amount:]
    else:
        shifted = spectrogram
    return shifted
```

- Time Masking: Simulates intermittent signals and improves robustness

```python
def time_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
    width = spectrogram.shape[1]
    masked = spectrogram.copy()
    for _ in range(num_masks):
        mask_width = int(width * np.random.uniform(0, max_mask_percent))
        mask_start = np.random.randint(0, width - mask_width)
        masked[:, mask_start:mask_start + mask_width] = 0
    return masked
```

- Frequency Shifting: Handles variations in pitch/frequency

```python
def freq_shift(spectrogram, max_shift_percent=0.2):
    height = spectrogram.shape[0]
    shift_amount = int(height * np.random.uniform(-max_shift_percent, max_shift_percent))
    shifted = np.zeros_like(spectrogram)
    if shift_amount > 0:
        shifted[shift_amount:, :] = spectrogram[:height - shift_amount, :]
    elif shift_amount < 0:
        shifted[:height + shift_amount, :] = spectrogram[-shift_amount:, :]
    else:
        shifted = spectrogram
    return shifted
```

- Frequency Masking: Improves robustness to frequency-selective noise

```python
def freq_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
    height = spectrogram.shape[0]
    masked = spectrogram.copy()
    for _ in range(num_masks):
        mask_height = int(height * np.random.uniform(0, max_mask_percent))
        mask_start = np.random.randint(0, height - mask_height)
        masked[mask_start:mask_start + mask_height, :] = 0
    return masked
```

- Amplitude Scaling: Handles variations in signal strength

```python
def amplitude_scale(spectrogram, min_factor=0.5, max_factor=1.5):
    scale_factor = np.random.uniform(min_factor, max_factor)
    return spectrogram * scale_factor
```

- Gaussian Noise: Improves robustness to background noise

```python
def add_gaussian_noise(spectrogram, max_noise_percent=0.1):
    noise_level = np.random.uniform(0, max_noise_percent)
    noise = np.random.normal(0, noise_level * np.mean(spectrogram), spectrogram.shape)
    return spectrogram + noise
```
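In the transform pipeline, these augmentations are applied stochastically to produce each view. A minimal, self-contained sketch of such a composition follows (two of the functions above are re-defined so it runs standalone; `random_augment` and the probability `p` are illustrative, not the project's exact pipeline):

```python
import numpy as np

def time_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
    # Zero out up to num_masks random time spans (as defined above)
    width = spectrogram.shape[1]
    masked = spectrogram.copy()
    for _ in range(num_masks):
        mask_width = int(width * np.random.uniform(0, max_mask_percent))
        mask_start = np.random.randint(0, width - mask_width)
        masked[:, mask_start:mask_start + mask_width] = 0
    return masked

def amplitude_scale(spectrogram, min_factor=0.5, max_factor=1.5):
    # Random gain (as defined above)
    return spectrogram * np.random.uniform(min_factor, max_factor)

def random_augment(spectrogram, transforms, p=0.5):
    """Apply each transform independently with probability p."""
    out = spectrogram
    for t in transforms:
        if np.random.rand() < p:
            out = t(out)
    return out

spec = np.random.rand(128, 256)  # (frequency bins, time frames)
view1 = random_augment(spec, [time_mask, amplitude_scale])
view2 = random_augment(spec, [time_mask, amplitude_scale])
```

Calling `random_augment` twice on the same spectrogram yields the two correlated views that SimCLR contrasts.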
## Training Pipeline

The training pipeline connects the data preprocessing with the SimCLR model:

```python
def main():
    # Parse arguments
    args = parse_args()
    # Set up the output directory
    output_dir = setup_output_dir(args.output_dir)
    # Create data loaders
    train_loader, val_loader = create_data_loaders(
        data_dir=args.data_dir,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        simclr_mode=True
    )
    # Create model configuration
    config = {
        'base_model': args.base_model,
        'pretrained': args.pretrained,
        'projection_dim': args.projection_dim,
        'batch_size': args.batch_size,
        'temperature': args.temperature,
        'learning_rate': args.lr,
        'weight_decay': args.weight_decay,
        'epochs': args.epochs
    }
    # Create model
    model = SimCLRModel(config)
    # Train model
    model.train(train_loader, val_loader, args.epochs)
    # Save final model
    model.save_model(os.path.join(output_dir, 'final_model.pt'))
    # Visualize features on the validation set
    feature_df = visualize_features(model, val_loader, output_dir)
```

The training process includes:
- Batch creation with pairs of augmented views
- Forward pass through the backbone and projection head
- NT-Xent loss calculation
- Backpropagation and optimization
- Checkpoint saving and visualization
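The steps above can be sketched as a single training iteration (a toy encoder and a functional NT-Xent stand in for the EnhancedBackbone, ProjectionHead, and NTXentLoss; all names and shapes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for backbone + projection head
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))
optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-4)

def nt_xent(z1, z2, temperature=0.5):
    # Functional stand-in for NTXentLoss
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))          # exclude self-similarity
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# One step: the loader yields two augmented views of the same batch
view1, view2 = torch.randn(16, 1, 32, 32), torch.randn(16, 1, 32, 32)
z1, z2 = encoder(view1), encoder(view2)        # forward pass
loss = nt_xent(z1, z2)                         # NT-Xent loss
optimizer.zero_grad()
loss.backward()                                # backpropagation
optimizer.step()                               # optimization
```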
## Demonstration of Augmentation Techniques

We've demonstrated the effect of our specialized augmentations on different types of underwater acoustic signals. The visualizations show:
- SimCLR Augmentation Pairs: Multiple differently-augmented views of the same spectrogram, as used in the contrastive learning process
- Individual Augmentation Effects: The impact of each augmentation type on different signal categories
These demonstrations illustrate how the augmentations preserve the essential characteristics of each signal type while introducing variations that help the model learn robust representations.
## Usage Guidelines

Once trained, the model can be used to extract features from underwater acoustic spectrograms:

```python
def extract_features(model_path, spectrogram_path):
    # Load model
    model = UnderwaterAcousticSimCLR()
    model.load_state_dict(torch.load(model_path, map_location='cpu'))
    model.eval()
    # Load and preprocess the spectrogram the same way as during training
    image = Image.open(spectrogram_path).convert('L')
    image = image.crop((80, 80, 800, 350))
    image_tensor = transforms.ToTensor()(image).unsqueeze(0)
    # Extract features
    with torch.no_grad():
        features, _ = model(image_tensor)
    return features.numpy()
```

The extracted features can be used for various downstream tasks:
- Classification: Train a simple classifier on top of the frozen features
- Clustering: Group similar acoustic signals
- Anomaly Detection: Identify unusual acoustic events
- Signal Type Identification: Distinguish between biological, geological, and man-made signals
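As an example of the first task, a linear probe can be trained on frozen features (the features and labels below are random stand-ins for real extracted data; the four classes mirror the signal categories discussed earlier and are hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(200, 512)      # frozen SimCLR backbone features (stand-in)
labels = torch.randint(0, 4, (200,))  # e.g. noise / biological / man-made / transient
classifier = nn.Linear(512, 4)        # only the probe is trained; features stay fixed
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

initial_loss = criterion(classifier(features), labels).item()
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(classifier(features), labels)
    loss.backward()
    optimizer.step()
final_loss = criterion(classifier(features), labels).item()
```

Because the backbone is frozen, this probe is cheap to train and its accuracy directly measures how linearly separable the learned representations are.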
## Conclusion

The SimCLR-based feature extractor for underwater acoustic spectrograms provides a powerful tool for learning robust representations from unlabeled data. The specialized architecture with multi-scale processing and attention mechanisms, combined with acoustic-specific augmentations, addresses the unique challenges of underwater acoustic signals.
This self-supervised approach is particularly valuable in the underwater acoustic domain, where labeled data may be scarce but unlabeled recordings are abundant. The learned representations capture the diverse characteristics of environmental noise, biological signals, man-made signals, and transient events, providing a solid foundation for various downstream tasks.
Future work could explore:
- Integration with other modalities (e.g., visual data from underwater cameras)
- Adaptation to real-time processing for online monitoring systems
- Extension to longer temporal contexts for tracking evolving acoustic environments