This document describes a feature extractor for underwater acoustic spectrograms using the SimCLR (Simple Framework for Contrastive Learning of Representations) approach. The feature extractor is designed to learn robust representations from unlabeled underwater acoustic data, capturing the diverse characteristics of environmental noise, biological signals, man-made signals, and transient events.
The design incorporates specialized components for underwater acoustics:
- Enhanced backbone architecture with multi-scale processing and attention mechanisms
- Acoustic-specific data augmentations
- Optimized projection head for acoustic feature representation
This self-supervised approach is particularly well-suited for underwater acoustic data, where labeled examples may be scarce but unlabeled data is abundant.
## Contents

- Introduction
- Underwater Acoustic Data Characteristics
- SimCLR Architecture Design
- Data Preprocessing and Augmentation
- Training Pipeline
- Demonstration of Augmentation Techniques
- Usage Guidelines
- Conclusion
## Introduction

Underwater acoustic data presents unique challenges for feature extraction:
- Non-stationary noise with varying density functions
- Signals from diverse sources (biological, geological, man-made)
- Wide variations in signal amplitude, frequency content, and temporal patterns
- Complex signal characteristics (harmonics, Doppler effects, transients)
Self-supervised learning approaches like SimCLR are ideal for this domain as they can learn meaningful representations without requiring labeled data. The SimCLR method works by learning to maximize agreement between differently augmented views of the same data sample via a contrastive loss in the latent space.
## Underwater Acoustic Data Characteristics

Our analysis of underwater acoustic spectrograms revealed distinct characteristics across different signal types:

**Environmental noise:**
- Uniform energy distribution across frequencies
- Non-stationary patterns over time
- Lower overall intensity compared to signal categories

**Biological signals:**
- Whale calls: Distinctive frequency modulation patterns, concentrated energy in specific frequency bands
- Fish sounds: Short, impulsive patterns with broader frequency content
- Coral scraping: Irregular bursts of broadband energy

**Man-made signals:**
- Boats/Ships: Strong harmonic structure with clear fundamental frequency and overtones
- Submarines: Low-frequency tonals, sometimes with frequency shifts
- Speedboats: Higher frequency content with potential Doppler effects

**Transient events:**
- Brief, high-energy broadband events
- Sparse in time domain
- Wide frequency range coverage
## SimCLR Architecture Design

The SimCLR architecture consists of three main components:
- Enhanced Backbone Network
- Projection Head
- Contrastive Loss Function
### Enhanced Backbone Network

We designed a specialized backbone network based on ResNet with modifications for underwater acoustic spectrograms:

```python
class EnhancedBackbone(nn.Module):
    def __init__(self, base_model='resnet18', pretrained=False):
        super(EnhancedBackbone, self).__init__()
        # Base ResNet backbone
        self.backbone = ResNetBackbone(base_model, pretrained)
        # Add multi-scale modules after each ResNet block
        self.multi_scale1 = MultiScaleModule(64, 64)
        self.multi_scale2 = MultiScaleModule(128, 128)
        self.multi_scale3 = MultiScaleModule(256, 256)
        # Add attention modules
        self.attention1 = DualAttentionModule(64)
        self.attention2 = DualAttentionModule(128)
        self.attention3 = DualAttentionModule(256)
        # Feature dimension remains the same as the base backbone
        self.feature_dim = self.backbone.feature_dim
```

Key enhancements include multi-scale processing and dual attention, described below.
The MultiScaleModule processes the input at multiple scales to capture both fine-grained patterns (like transients) and longer-term patterns (like whale calls):
```python
class MultiScaleModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(MultiScaleModule, self).__init__()
        # Different kernel sizes for capturing patterns at different scales
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=5, padding=2),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )

    def forward(self, x):
        # Concatenate the four quarter-width branches back to out_channels
        return torch.cat(
            [self.branch1(x), self.branch2(x), self.branch3(x), self.branch4(x)],
            dim=1
        )
```

The DualAttentionModule combines frequency and time attention to focus on relevant parts of the spectrogram:
```python
class DualAttentionModule(nn.Module):
    def __init__(self, in_channels, reduction_ratio=8):
        super(DualAttentionModule, self).__init__()
        self.freq_attention = FrequencyAttention(in_channels, reduction_ratio)
        self.time_attention = TimeAttention(in_channels, reduction_ratio)

    def forward(self, x):
        x = self.freq_attention(x)
        x = self.time_attention(x)
        return x
```

This helps the model attend to specific frequency bands important for different acoustic signals (e.g., low frequencies for submarines, mid-frequencies for whale calls) and to specific temporal patterns (e.g., brief transients, longer modulated calls).
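The FrequencyAttention and TimeAttention submodules are referenced here but not shown. As an illustration only, a squeeze-and-excitation-style attention over the frequency axis might look like the following sketch (the layer sizes and pooling choice are assumptions, not the project's actual implementation):

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Hypothetical frequency attention: weight each frequency bin by its
    time-averaged response (squeeze-and-excitation style)."""
    def __init__(self, in_channels, reduction_ratio=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(in_channels, in_channels // reduction_ratio, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels // reduction_ratio, in_channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):
        # x: (batch, channels, freq, time)
        pooled = x.mean(dim=3, keepdim=True)      # average over time: (B, C, F, 1)
        weights = self.sigmoid(self.mlp(pooled))  # per-channel, per-frequency gates in (0, 1)
        return x * weights                        # broadcast the gates over the time axis
```

A TimeAttention counterpart would pool over the frequency axis instead, gating temporal frames in the same way.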
### Projection Head

The projection head maps representations to the space where the contrastive loss is applied:

```python
class ProjectionHead(nn.Module):
    def __init__(self, input_dim, hidden_dim=512, output_dim=128):
        super(ProjectionHead, self).__init__()
        # Multi-layer projection head, as recommended in the SimCLR paper
        self.projection = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim),
            nn.BatchNorm1d(output_dim)
        )

    def forward(self, x):
        return self.projection(x)
```

We use a deeper projection head than the original SimCLR paper (three linear layers rather than two) to handle the complexity of acoustic features.
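As a concrete shape check, the sketch below runs a batch of backbone features through an equivalent projection (assuming a ResNet-18 backbone whose pooled feature dimension is 512; the `nn.Sequential` mirrors the class above):

```python
import torch
import torch.nn as nn

# Equivalent of ProjectionHead(input_dim=512): three linear layers with
# batch norm, mapping 512-dim backbone features h to 128-dim latents z
projection = nn.Sequential(
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
    nn.Linear(512, 512), nn.BatchNorm1d(512), nn.ReLU(inplace=True),
    nn.Linear(512, 128), nn.BatchNorm1d(128),
)
h = torch.randn(8, 512)  # batch of backbone representations h
z = projection(h)        # latent vectors z fed to the contrastive loss
print(z.shape)           # torch.Size([8, 128])
```

Following standard SimCLR practice, the projection head is used only during contrastive pre-training; downstream tasks consume the backbone output h rather than z.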
### Contrastive Loss Function

We implement the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss from the SimCLR paper:

```python
class NTXentLoss(nn.Module):
    def __init__(self, temperature=0.5, batch_size=256):
        super(NTXentLoss, self).__init__()
        self.temperature = temperature
        self.batch_size = batch_size
        self.criterion = nn.CrossEntropyLoss(reduction="sum")
        self.similarity_f = nn.CosineSimilarity(dim=2)
```

The temperature parameter controls the concentration of the similarity distribution; lower values make the model more sensitive to hard negatives.
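Only the constructor is shown above; the loss can also be written functionally. The sketch below is an independent reference implementation of NT-Xent for a batch of paired views (not the project's exact forward pass):

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """NT-Xent over 2N projected views: z1[i] and z2[i] are the two
    augmented views of sample i."""
    batch_size = z1.shape[0]
    z = torch.cat([z1, z2], dim=0)           # (2N, D)
    z = F.normalize(z, dim=1)                # cosine similarity via dot product
    sim = z @ z.t() / temperature            # (2N, 2N) scaled similarities
    # Mask self-similarity so a sample is never its own candidate
    mask = torch.eye(2 * batch_size, dtype=torch.bool)
    sim.masked_fill_(mask, float('-inf'))
    # The positive for row i is its partner view at i + N (or i - N)
    targets = torch.cat([torch.arange(batch_size) + batch_size,
                         torch.arange(batch_size)])
    return F.cross_entropy(sim, targets)
```

Each sample's positive is its partner view; the other 2N - 2 samples in the batch act as negatives.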
## Data Preprocessing and Augmentation

We implemented a custom dataset class for underwater acoustic spectrograms:

```python
class UnderwaterAcousticDataset(Dataset):
    def __init__(self, data_dir, transform=None, simclr_mode=True):
        self.data_dir = data_dir
        self.transform = transform
        self.simclr_mode = simclr_mode
        self.file_list = [f for f in os.listdir(data_dir) if f.endswith('.png')]

    def __len__(self):
        return len(self.file_list)

    def __getitem__(self, idx):
        img_name = os.path.join(self.data_dir, self.file_list[idx])
        # Load image and convert to grayscale
        image = Image.open(img_name).convert('L')
        # Extract the actual spectrogram part, cropping away axes and margins
        image = image.crop((80, 80, 800, 350))
        # Apply transformations if specified
        if self.transform:
            if self.simclr_mode:
                # For SimCLR, create two differently augmented views
                img1 = self.transform(image)
                img2 = self.transform(image)
                return img1, img2
            else:
                return self.transform(image)
        # No transform: return the cropped image as-is
        return image
```

Based on our analysis of underwater acoustic characteristics, we designed specialized augmentations:
- Time Shifting: Handles varying onset times of signals

```python
def time_shift(spectrogram, max_shift_percent=0.2):
    width = spectrogram.shape[1]
    shift_amount = int(width * np.random.uniform(-max_shift_percent, max_shift_percent))
    shifted = np.zeros_like(spectrogram)
    if shift_amount > 0:
        shifted[:, shift_amount:] = spectrogram[:, :width - shift_amount]
    elif shift_amount < 0:
        shifted[:, :width + shift_amount] = spectrogram[:, -shift_amount:]
    else:
        shifted = spectrogram
    return shifted
```

- Time Masking: Simulates intermittent signals and improves robustness

```python
def time_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
    width = spectrogram.shape[1]
    masked = spectrogram.copy()
    for _ in range(num_masks):
        mask_width = int(width * np.random.uniform(0, max_mask_percent))
        mask_start = np.random.randint(0, width - mask_width)
        masked[:, mask_start:mask_start + mask_width] = 0
    return masked
```

- Frequency Shifting: Handles variations in pitch/frequency

```python
def freq_shift(spectrogram, max_shift_percent=0.2):
    height = spectrogram.shape[0]
    shift_amount = int(height * np.random.uniform(-max_shift_percent, max_shift_percent))
    shifted = np.zeros_like(spectrogram)
    if shift_amount > 0:
        shifted[shift_amount:, :] = spectrogram[:height - shift_amount, :]
    elif shift_amount < 0:
        shifted[:height + shift_amount, :] = spectrogram[-shift_amount:, :]
    else:
        shifted = spectrogram
    return shifted
```

- Frequency Masking: Improves robustness to frequency-selective noise

```python
def freq_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
    height = spectrogram.shape[0]
    masked = spectrogram.copy()
    for _ in range(num_masks):
        mask_height = int(height * np.random.uniform(0, max_mask_percent))
        mask_start = np.random.randint(0, height - mask_height)
        masked[mask_start:mask_start + mask_height, :] = 0
    return masked
```

- Amplitude Scaling: Handles variations in signal strength

```python
def amplitude_scale(spectrogram, min_factor=0.5, max_factor=1.5):
    scale_factor = np.random.uniform(min_factor, max_factor)
    return spectrogram * scale_factor
```

- Gaussian Noise: Improves robustness to background noise

```python
def add_gaussian_noise(spectrogram, max_noise_percent=0.1):
    noise_level = np.random.uniform(0, max_noise_percent)
    noise = np.random.normal(0, noise_level * np.mean(spectrogram), spectrogram.shape)
    return spectrogram + noise
```
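In the transform pipeline, these augmentations are applied stochastically to produce each view. A minimal, self-contained sketch of such a composition follows (two of the functions above are re-defined so it runs standalone; `random_augment` and the probability `p` are illustrative, not the project's exact pipeline):

```python
import numpy as np

def time_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
    # Zero out up to num_masks random time spans (as defined above)
    width = spectrogram.shape[1]
    masked = spectrogram.copy()
    for _ in range(num_masks):
        mask_width = int(width * np.random.uniform(0, max_mask_percent))
        mask_start = np.random.randint(0, width - mask_width)
        masked[:, mask_start:mask_start + mask_width] = 0
    return masked

def amplitude_scale(spectrogram, min_factor=0.5, max_factor=1.5):
    # Random gain (as defined above)
    return spectrogram * np.random.uniform(min_factor, max_factor)

def random_augment(spectrogram, transforms, p=0.5):
    """Apply each transform independently with probability p."""
    out = spectrogram
    for t in transforms:
        if np.random.rand() < p:
            out = t(out)
    return out

spec = np.random.rand(128, 256)  # (frequency bins, time frames)
view1 = random_augment(spec, [time_mask, amplitude_scale])
view2 = random_augment(spec, [time_mask, amplitude_scale])
```

Calling `random_augment` twice on the same spectrogram yields the two correlated views that SimCLR contrasts.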
## Training Pipeline

The training pipeline connects the data preprocessing with the SimCLR model:

```python
def main():
    # Parse arguments
    args = parse_args()
    # Set up the output directory
    output_dir = setup_output_dir(args.output_dir)
    # Create data loaders
    train_loader, val_loader = create_data_loaders(
        data_dir=args.data_dir,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        simclr_mode=True
    )
    # Create model configuration
    config = {
        'base_model': args.base_model,
        'pretrained': args.pretrained,
        'projection_dim': args.projection_dim,
        'batch_size': args.batch_size,
        'temperature': args.temperature,
        'learning_rate': args.lr,
        'weight_decay': args.weight_decay,
        'epochs': args.epochs
    }
    # Create model
    model = SimCLRModel(config)
    # Train model
    model.train(train_loader, val_loader, args.epochs)
    # Save final model
    model.save_model(os.path.join(output_dir, 'final_model.pt'))
    # Visualize features on the validation set
    feature_df = visualize_features(model, val_loader, output_dir)
```

The training process includes:
- Batch creation with pairs of augmented views
- Forward pass through the backbone and projection head
- NT-Xent loss calculation
- Backpropagation and optimization
- Checkpoint saving and visualization
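The steps above can be sketched as a single training iteration (a toy encoder and a functional NT-Xent stand in for the EnhancedBackbone, ProjectionHead, and NTXentLoss; all names and shapes here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for backbone + projection head
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 128))
optimizer = torch.optim.Adam(encoder.parameters(), lr=3e-4)

def nt_xent(z1, z2, temperature=0.5):
    # Functional stand-in for NTXentLoss
    z = F.normalize(torch.cat([z1, z2]), dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float('-inf'))          # exclude self-similarity
    n = z1.shape[0]
    targets = torch.cat([torch.arange(n) + n, torch.arange(n)])
    return F.cross_entropy(sim, targets)

# One step: the loader yields two augmented views of the same batch
view1, view2 = torch.randn(16, 1, 32, 32), torch.randn(16, 1, 32, 32)
z1, z2 = encoder(view1), encoder(view2)        # forward pass
loss = nt_xent(z1, z2)                         # NT-Xent loss
optimizer.zero_grad()
loss.backward()                                # backpropagation
optimizer.step()                               # optimization
```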
## Demonstration of Augmentation Techniques

We've demonstrated the effect of our specialized augmentations on different types of underwater acoustic signals. The visualizations show:
- SimCLR Augmentation Pairs: Multiple differently-augmented views of the same spectrogram, as used in the contrastive learning process
- Individual Augmentation Effects: The impact of each augmentation type on different signal categories
These demonstrations illustrate how the augmentations preserve the essential characteristics of each signal type while introducing variations that help the model learn robust representations.
## Usage Guidelines

Once trained, the model can be used to extract features from underwater acoustic spectrograms:

```python
def extract_features(model_path, spectrogram_path):
    # Load model
    model = UnderwaterAcousticSimCLR()
    model.load_state_dict(torch.load(model_path, map_location='cpu'))
    model.eval()
    # Load and preprocess the spectrogram the same way as during training
    image = Image.open(spectrogram_path).convert('L')
    image = image.crop((80, 80, 800, 350))
    image_tensor = transforms.ToTensor()(image).unsqueeze(0)
    # Extract features
    with torch.no_grad():
        features, _ = model(image_tensor)
    return features.numpy()
```

The extracted features can be used for various downstream tasks:
- Classification: Train a simple classifier on top of the frozen features
- Clustering: Group similar acoustic signals
- Anomaly Detection: Identify unusual acoustic events
- Signal Type Identification: Distinguish between biological, geological, and man-made signals
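As an example of the first task, a linear probe can be trained on frozen features (the features and labels below are random stand-ins for real extracted data; the four classes mirror the signal categories discussed earlier and are hypothetical):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
features = torch.randn(200, 512)      # frozen SimCLR backbone features (stand-in)
labels = torch.randint(0, 4, (200,))  # e.g. noise / biological / man-made / transient
classifier = nn.Linear(512, 4)        # only the probe is trained; features stay fixed
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-2)
criterion = nn.CrossEntropyLoss()

initial_loss = criterion(classifier(features), labels).item()
for _ in range(100):
    optimizer.zero_grad()
    loss = criterion(classifier(features), labels)
    loss.backward()
    optimizer.step()
final_loss = criterion(classifier(features), labels).item()
```

Because the backbone is frozen, this probe is cheap to train and its accuracy directly measures how linearly separable the learned representations are.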
## Conclusion

The SimCLR-based feature extractor for underwater acoustic spectrograms provides a powerful tool for learning robust representations from unlabeled data. The specialized architecture with multi-scale processing and attention mechanisms, combined with acoustic-specific augmentations, addresses the unique challenges of underwater acoustic signals.
This self-supervised approach is particularly valuable in the underwater acoustic domain, where labeled data may be scarce but unlabeled recordings are abundant. The learned representations capture the diverse characteristics of environmental noise, biological signals, man-made signals, and transient events, providing a solid foundation for various downstream tasks.
Future work could explore:
- Integration with other modalities (e.g., visual data from underwater cameras)
- Adaptation to real-time processing for online monitoring systems
- Extension to longer temporal contexts for tracking evolving acoustic environments