Underwater Acoustic Feature Extractor: SimCLR Approach

Executive Summary

This document describes a feature extractor for underwater acoustic spectrograms using the SimCLR (Simple Framework for Contrastive Learning of Representations) approach. The feature extractor is designed to learn robust representations from unlabeled underwater acoustic data, capturing the diverse characteristics of environmental noise, biological signals, man-made signals, and transient events.

The design incorporates specialized components for underwater acoustics:

  1. Enhanced backbone architecture with multi-scale processing and attention mechanisms
  2. Acoustic-specific data augmentations
  3. Optimized projection head for acoustic feature representation

This self-supervised approach is particularly well-suited for underwater acoustic data, where labeled examples may be scarce but unlabeled data is abundant.

Table of Contents

  1. Introduction
  2. Underwater Acoustic Data Characteristics
  3. SimCLR Architecture Design
  4. Data Preprocessing and Augmentation
  5. Training Pipeline
  6. Demonstration of Augmentation Techniques
  7. Usage Guidelines
  8. Conclusion

Introduction

Underwater acoustic data presents unique challenges for feature extraction:

  • Non-stationary noise whose statistical distribution varies over time
  • Signals from diverse sources (biological, geological, man-made)
  • Wide variations in signal amplitude, frequency content, and temporal patterns
  • Complex signal characteristics (harmonics, Doppler effects, transients)

Self-supervised learning approaches like SimCLR are ideal for this domain as they can learn meaningful representations without requiring labeled data. The SimCLR method works by learning to maximize agreement between differently augmented views of the same data sample via a contrastive loss in the latent space.

Underwater Acoustic Data Characteristics

Our analysis of underwater acoustic spectrograms revealed distinct characteristics across different signal types:

Environmental Noise

  • Uniform energy distribution across frequencies
  • Non-stationary patterns over time
  • Lower overall intensity compared to signal categories

Biological Signals

  • Whale calls: Distinctive frequency modulation patterns, concentrated energy in specific frequency bands
  • Fish sounds: Short, impulsive patterns with broader frequency content
  • Coral scraping: Irregular bursts of broadband energy

Man-made Signals

  • Boats/Ships: Strong harmonic structure with clear fundamental frequency and overtones
  • Submarines: Low-frequency tonals, sometimes with frequency shifts
  • Speedboats: Higher frequency content with potential Doppler effects

Transient Signals

  • Brief, high-energy broadband events
  • Sparse in time domain
  • Wide frequency range coverage

SimCLR Architecture Design

The SimCLR architecture consists of three main components:

  1. Enhanced Backbone Network
  2. Projection Head
  3. Contrastive Loss Function

Enhanced Backbone Network

We designed a specialized backbone network based on ResNet with modifications for underwater acoustic spectrograms:

class EnhancedBackbone(nn.Module):
    def __init__(self, base_model='resnet18', pretrained=False):
        super(EnhancedBackbone, self).__init__()
        
        # Base ResNet backbone
        self.backbone = ResNetBackbone(base_model, pretrained)
        
        # Add multi-scale modules after each ResNet block
        self.multi_scale1 = MultiScaleModule(64, 64)
        self.multi_scale2 = MultiScaleModule(128, 128)
        self.multi_scale3 = MultiScaleModule(256, 256)
        
        # Add attention modules
        self.attention1 = DualAttentionModule(64)
        self.attention2 = DualAttentionModule(128)
        self.attention3 = DualAttentionModule(256)
        
        # Feature dimension remains the same as the base backbone
        self.feature_dim = self.backbone.feature_dim

Key enhancements include:

Multi-Scale Processing

The MultiScaleModule processes the input at multiple scales to capture both fine-grained patterns (like transients) and longer-term patterns (like whale calls):

class MultiScaleModule(nn.Module):
    def __init__(self, in_channels, out_channels):
        super(MultiScaleModule, self).__init__()
        
        # Different kernel sizes for capturing patterns at different scales
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        
        self.branch2 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=5, padding=2),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
        
        self.branch4 = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, out_channels // 4, kernel_size=1),
            nn.BatchNorm2d(out_channels // 4),
            nn.ReLU(inplace=True)
        )
    
    def forward(self, x):
        # Concatenate the four quarter-width branches back to out_channels maps
        return torch.cat([self.branch1(x), self.branch2(x),
                          self.branch3(x), self.branch4(x)], dim=1)

Dual Attention Mechanism

The DualAttentionModule combines frequency and time attention to focus on relevant parts of the spectrogram:

class DualAttentionModule(nn.Module):
    def __init__(self, in_channels, reduction_ratio=8):
        super(DualAttentionModule, self).__init__()
        self.freq_attention = FrequencyAttention(in_channels, reduction_ratio)
        self.time_attention = TimeAttention(in_channels, reduction_ratio)
        
    def forward(self, x):
        x = self.freq_attention(x)
        x = self.time_attention(x)
        return x

This helps the model attend to specific frequency bands important for different acoustic signals (e.g., low frequencies for submarines, mid-frequencies for whale calls) and specific temporal patterns (e.g., brief transients, longer modulated calls).
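
The FrequencyAttention and TimeAttention modules are referenced above but not shown. One plausible realization, sketched here as an assumption (the class names match the document, but the internals are illustrative squeeze-and-excitation-style gates, not the original implementation), pools over the opposite axis and predicts per-bin weights:

```python
import torch
import torch.nn as nn

class FrequencyAttention(nn.Module):
    """Gate over frequency: pool over time, predict per-(channel, freq) weights."""
    def __init__(self, in_channels, reduction_ratio=8):
        super().__init__()
        hidden = max(in_channels // reduction_ratio, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, F, T)
        return x * self.gate(x.mean(dim=3, keepdim=True))  # weights: (B, C, F, 1)

class TimeAttention(nn.Module):
    """Same idea along the time axis: pool over frequency, gate each frame."""
    def __init__(self, in_channels, reduction_ratio=8):
        super().__init__()
        hidden = max(in_channels // reduction_ratio, 1)
        self.gate = nn.Sequential(
            nn.Conv2d(in_channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, in_channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                                  # x: (B, C, F, T)
        return x * self.gate(x.mean(dim=2, keepdim=True))  # weights: (B, C, 1, T)

x = torch.randn(2, 64, 32, 40)                    # (batch, channels, freq, time)
y = TimeAttention(64)(FrequencyAttention(64)(x))  # same shape, re-weighted
```

Because the gates are sigmoid outputs in (0, 1), each module can only attenuate bins, never amplify them, which keeps the composition of the two attentions stable.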

Projection Head

The projection head maps representations to the space where contrastive loss is applied:

class ProjectionHead(nn.Module):
    def __init__(self, input_dim, hidden_dim=512, output_dim=128):
        super(ProjectionHead, self).__init__()
        
        # Multi-layer projection head as recommended in SimCLR paper
        self.projection = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, hidden_dim),
            nn.BatchNorm1d(hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, output_dim),
            nn.BatchNorm1d(output_dim)
        )
    
    def forward(self, x):
        return self.projection(x)

We use a three-layer projection head, one layer deeper than the two-layer head in the original SimCLR paper, to handle the complexity of acoustic features.

Contrastive Loss Function

We implement the NT-Xent (Normalized Temperature-scaled Cross Entropy) loss from the SimCLR paper:

class NTXentLoss(nn.Module):
    def __init__(self, temperature=0.5, batch_size=256):
        super(NTXentLoss, self).__init__()
        self.temperature = temperature
        self.batch_size = batch_size
        self.criterion = nn.CrossEntropyLoss(reduction="sum")
        self.similarity_f = nn.CosineSimilarity(dim=2)

The temperature parameter controls the concentration of the similarity distribution: lower values make the model more sensitive to hard negatives.
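
The class above is shown only up to its constructor. As a sketch of the computation it implements, the NT-Xent loss can be written functionally as follows (`nt_xent_loss` is an illustrative helper, not part of the original code):

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z1, z2, temperature=0.5):
    """NT-Xent over paired projections z1, z2, each (N, D): the positive for
    sample i in one view is sample i in the other view; all else is negative."""
    n = z1.size(0)
    z = F.normalize(torch.cat([z1, z2], dim=0), dim=1)   # (2N, D), unit norm
    sim = z @ z.t() / temperature                        # scaled cosine similarities
    sim = sim.masked_fill(torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    # sample i's positive sits at index i+N (first half) or i-N (second half)
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

z1, z2 = torch.randn(8, 128), torch.randn(8, 128)
loss = nt_xent_loss(z1, z2)   # scalar tensor
```

Masking the diagonal with -inf excludes each embedding's similarity to itself before the softmax, so only the partner view and the 2N - 2 negatives compete.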

Data Preprocessing and Augmentation

Dataset Class

We implemented a custom dataset class for underwater acoustic spectrograms:

class UnderwaterAcousticDataset(Dataset):
    def __init__(self, data_dir, transform=None, simclr_mode=True):
        self.data_dir = data_dir
        self.transform = transform
        self.simclr_mode = simclr_mode
        self.file_list = [f for f in os.listdir(data_dir) if f.endswith('.png')]
    
    def __len__(self):
        return len(self.file_list)
    
    def __getitem__(self, idx):
        img_name = os.path.join(self.data_dir, self.file_list[idx])
        
        # Load image and convert to grayscale
        image = Image.open(img_name).convert('L')
        
        # Extract the actual spectrogram part, cropping away plot borders
        image = image.crop((80, 80, 800, 350))
        
        if self.transform is None:
            return image
        
        if self.simclr_mode:
            # For SimCLR, create two differently augmented views
            return self.transform(image), self.transform(image)
        return self.transform(image)
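
In SimCLR mode each item yields a pair of views, which the default collate function stacks into two batch tensors. A minimal sketch of this two-view batching (`TwoViewDataset` and `jitter` are illustrative stand-ins using synthetic arrays, not part of the original code):

```python
import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class TwoViewDataset(Dataset):
    """Stand-in mirroring the two-view behaviour of the dataset above,
    using synthetic arrays instead of spectrogram images on disk."""
    def __init__(self, n_items, transform):
        self.data = [np.abs(np.random.randn(64, 128)).astype(np.float32)
                     for _ in range(n_items)]
        self.transform = transform

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        spec = self.data[idx]
        # two independent stochastic transforms of the same item
        return self.transform(spec), self.transform(spec)

def jitter(s):
    # random amplitude scaling, then conversion to a (1, F, T) float tensor
    scaled = (s * np.random.uniform(0.5, 1.5)).astype(np.float32)
    return torch.from_numpy(scaled).unsqueeze(0)

loader = DataLoader(TwoViewDataset(8, jitter), batch_size=4, shuffle=True)
v1, v2 = next(iter(loader))   # two batches of views, each (4, 1, 64, 128)
```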

Acoustic-Specific Augmentations

Based on our analysis of underwater acoustic characteristics, we designed specialized augmentations:

Time-domain Augmentations

  • Time Shifting: Handles varying onset times of signals

    def time_shift(spectrogram, max_shift_percent=0.2):
        width = spectrogram.shape[1]
        shift_amount = int(width * np.random.uniform(-max_shift_percent, max_shift_percent))
        
        shifted = np.zeros_like(spectrogram)
        if shift_amount > 0:
            shifted[:, shift_amount:] = spectrogram[:, :width-shift_amount]
        elif shift_amount < 0:
            shifted[:, :width+shift_amount] = spectrogram[:, -shift_amount:]
        else:
            shifted = spectrogram
            
        return shifted
  • Time Masking: Simulates intermittent signals and improves robustness

    def time_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
        width = spectrogram.shape[1]
        masked = spectrogram.copy()
        
        for _ in range(num_masks):
            mask_width = int(width * np.random.uniform(0, max_mask_percent))
            mask_start = np.random.randint(0, width - mask_width)
            masked[:, mask_start:mask_start + mask_width] = 0
            
        return masked

Frequency-domain Augmentations

  • Frequency Shifting: Handles variations in pitch/frequency

    def freq_shift(spectrogram, max_shift_percent=0.2):
        height = spectrogram.shape[0]
        shift_amount = int(height * np.random.uniform(-max_shift_percent, max_shift_percent))
        
        shifted = np.zeros_like(spectrogram)
        if shift_amount > 0:
            shifted[shift_amount:, :] = spectrogram[:height-shift_amount, :]
        elif shift_amount < 0:
            shifted[:height+shift_amount, :] = spectrogram[-shift_amount:, :]
        else:
            shifted = spectrogram
            
        return shifted
  • Frequency Masking: Improves robustness to frequency-selective noise

    def freq_mask(spectrogram, max_mask_percent=0.2, num_masks=2):
        height = spectrogram.shape[0]
        masked = spectrogram.copy()
        
        for _ in range(num_masks):
            mask_height = int(height * np.random.uniform(0, max_mask_percent))
            mask_start = np.random.randint(0, height - mask_height)
            masked[mask_start:mask_start + mask_height, :] = 0
            
        return masked

Intensity Augmentations

  • Amplitude Scaling: Handles variations in signal strength

    def amplitude_scale(spectrogram, min_factor=0.5, max_factor=1.5):
        scale_factor = np.random.uniform(min_factor, max_factor)
        return spectrogram * scale_factor
  • Gaussian Noise: Improves robustness to background noise

    def add_gaussian_noise(spectrogram, max_noise_percent=0.1):
        noise_level = np.random.uniform(0, max_noise_percent)
        noise = np.random.normal(0, noise_level * np.mean(spectrogram), spectrogram.shape)
        return spectrogram + noise
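
For SimCLR, these augmentations are typically chained into a stochastic pipeline applied twice per sample. A minimal sketch (`simclr_augment` is a hypothetical helper; two of the augmentations above are repeated compactly so the example runs alone):

```python
import numpy as np

def simclr_augment(spectrogram, transforms, p=0.5, rng=None):
    """Apply each transform independently with probability p; calling this
    twice on the same spectrogram yields the two SimCLR views."""
    if rng is None:
        rng = np.random.default_rng()
    view = spectrogram.copy()
    for t in transforms:
        if rng.random() < p:   # rng gates which transforms fire for this view
            view = t(view)
    return view

# Two of the augmentations above, repeated here so this sketch is self-contained
def amplitude_scale(s, min_factor=0.5, max_factor=1.5):
    return s * np.random.uniform(min_factor, max_factor)

def add_gaussian_noise(s, max_noise_percent=0.1):
    noise_level = np.random.uniform(0, max_noise_percent)
    return s + np.random.normal(0, noise_level * np.mean(s), s.shape)

spec = np.abs(np.random.randn(128, 256))   # stand-in for a loaded spectrogram
view1 = simclr_augment(spec, [amplitude_scale, add_gaussian_noise])
view2 = simclr_augment(spec, [amplitude_scale, add_gaussian_noise])
```

Because each transform fires independently per view, the two views of one spectrogram almost always differ, which is what gives the contrastive objective its signal.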

Training Pipeline

The training pipeline connects the data preprocessing with the SimCLR model:

def main():
    # Parse arguments
    args = parse_args()
    
    # Setup output directory
    output_dir = setup_output_dir(args.output_dir)
    
    # Create data loaders
    train_loader, val_loader = create_data_loaders(
        data_dir=args.data_dir,
        batch_size=args.batch_size,
        num_workers=args.num_workers,
        simclr_mode=True
    )
    
    # Create model configuration
    config = {
        'base_model': args.base_model,
        'pretrained': args.pretrained,
        'projection_dim': args.projection_dim,
        'batch_size': args.batch_size,
        'temperature': args.temperature,
        'learning_rate': args.lr,
        'weight_decay': args.weight_decay,
        'epochs': args.epochs
    }
    
    # Create model
    model = SimCLRModel(config)
    
    # Train model
    model.train(train_loader, val_loader, args.epochs)
    
    # Save final model
    model.save_model(os.path.join(output_dir, 'final_model.pt'))
    
    # Visualize features
    feature_df = visualize_features(model, val_loader, output_dir)

The training process includes:

  • Batch creation with pairs of augmented views
  • Forward pass through the backbone and projection head
  • NT-Xent loss calculation
  • Backpropagation and optimization
  • Checkpoint saving and visualization
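
The steps above can be sketched as a single update (the tiny `encoder`, `projector`, and `train_step` helper here are illustrative stand-ins for the real backbone and projection head, not the document's implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_step(encoder, projector, optimizer, view1, view2, temperature=0.5):
    """One SimCLR update: encode and project both views, NT-Xent, backprop."""
    optimizer.zero_grad()
    z = F.normalize(torch.cat([projector(encoder(view1)),
                               projector(encoder(view2))], dim=0), dim=1)
    n = view1.size(0)
    sim = (z @ z.t() / temperature).masked_fill(
        torch.eye(2 * n, dtype=torch.bool), float('-inf'))
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    loss = F.cross_entropy(sim, targets)
    loss.backward()
    optimizer.step()
    return loss.item()

# Tiny stand-ins for the real backbone and projection head
encoder = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 64), nn.ReLU())
projector = nn.Linear(64, 16)
opt = torch.optim.Adam(list(encoder.parameters()) + list(projector.parameters()))
v1, v2 = torch.randn(8, 1, 32, 32), torch.randn(8, 1, 32, 32)
loss = train_step(encoder, projector, opt, v1, v2)
```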

Demonstration of Augmentation Techniques

We've demonstrated the effect of our specialized augmentations on different types of underwater acoustic signals. The visualizations show:

  1. SimCLR Augmentation Pairs: Multiple differently-augmented views of the same spectrogram, as used in the contrastive learning process
  2. Individual Augmentation Effects: The impact of each augmentation type on different signal categories

These demonstrations illustrate how the augmentations preserve the essential characteristics of each signal type while introducing variations that help the model learn robust representations.

Usage Guidelines

Feature Extraction

Once trained, the model can be used to extract features from underwater acoustic spectrograms:

def extract_features(model_path, spectrogram_path):
    # Load model
    model = UnderwaterAcousticSimCLR()
    model.load_state_dict(torch.load(model_path, map_location='cpu'))
    model.eval()
    
    # Load and preprocess spectrogram
    image = Image.open(spectrogram_path).convert('L')
    image = image.crop((80, 80, 800, 350))
    image_tensor = transforms.ToTensor()(image).unsqueeze(0)
    
    # Extract features
    with torch.no_grad():
        features, _ = model(image_tensor)
    
    return features.numpy()

Downstream Tasks

The extracted features can be used for various downstream tasks:

  1. Classification: Train a simple classifier on top of the frozen features
  2. Clustering: Group similar acoustic signals
  3. Anomaly Detection: Identify unusual acoustic events
  4. Signal Type Identification: Distinguish between biological, geological, and man-made signals
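
For the first of these, the standard protocol is a linear probe: a single linear layer trained on the frozen features. A minimal sketch on synthetic stand-in features (`fit_linear_probe` is a hypothetical helper; real use would pass features extracted by the trained model):

```python
import torch
import torch.nn as nn

def fit_linear_probe(features, labels, num_classes, epochs=200, lr=0.1):
    """Fit a single linear layer on frozen features (a minimal linear probe)."""
    probe = nn.Linear(features.size(1), num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(probe(features), labels).backward()
        opt.step()
    return probe

# Synthetic stand-in for SimCLR features from two well-separated signal classes
torch.manual_seed(0)
feats = torch.cat([torch.randn(50, 32) + 2.0, torch.randn(50, 32) - 2.0])
labels = torch.cat([torch.zeros(50, dtype=torch.long),
                    torch.ones(50, dtype=torch.long)])
probe = fit_linear_probe(feats, labels, num_classes=2)
acc = (probe(feats).argmax(dim=1) == labels).float().mean().item()
```

Because only the probe's weights are updated, its accuracy directly measures how linearly separable the frozen representations are.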

Conclusion

The SimCLR-based feature extractor for underwater acoustic spectrograms provides a powerful tool for learning robust representations from unlabeled data. The specialized architecture with multi-scale processing and attention mechanisms, combined with acoustic-specific augmentations, addresses the unique challenges of underwater acoustic signals.

This self-supervised approach is particularly valuable in the underwater acoustic domain, where labeled data may be scarce but unlabeled recordings are abundant. The learned representations capture the diverse characteristics of environmental noise, biological signals, man-made signals, and transient events, providing a solid foundation for various downstream tasks.

Future work could explore:

  1. Integration with other modalities (e.g., visual data from underwater cameras)
  2. Adaptation to real-time processing for online monitoring systems
  3. Extension to longer temporal contexts for tracking evolving acoustic environments