Enhancement: Temporal-Ordered Metric Delivery Queue for Production Reliability #30

@laplaque

Overview

This enhancement proposal addresses the need for guaranteed temporal ordering and reliable delivery of metrics to OpenTelemetry/Instana with proper timestamp preservation during retry scenarios.

Problem Statement

The current metric delivery approach has several limitations:

  1. Temporal Ordering Issues: When a send fails and the metric is retried, it may arrive after metrics that were collected later
  2. Timestamp Corruption: Retried metrics are stamped with the retry time instead of the original collection time
  3. Delivery Reliability: No guarantee that metrics reach the OpenTelemetry collector during network issues
  4. Clock Skew Handling: No protection against system clock adjustments or synchronization issues

Example Failure Scenario

Time 10:00 - Collect CPU=50% → Send fails → Queued for retry
Time 10:01 - Collect CPU=60% → Send succeeds (timestamp: 10:01)  
Time 10:02 - Retry CPU=50% → Send succeeds (timestamp: 10:02) ❌ WRONG!

Result: OpenTelemetry sees CPU=60% at 10:01, then CPU=50% at 10:02 (temporal inversion!)
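
The scenario above can be reproduced with a small, self-contained sketch. The `Sample` class and the `series` helper are hypothetical names introduced only for this illustration; they are not part of the current codebase. The sketch compares what a backend would store when samples are stamped at send time versus at collection time.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    value: float
    collected_at: str  # when the value was measured
    sent_at: str       # when delivery finally succeeded


# Delivery order from the scenario: the 10:00 sample only gets through at 10:02.
delivered = [
    Sample("cpu", 60.0, collected_at="10:01", sent_at="10:01"),
    Sample("cpu", 50.0, collected_at="10:00", sent_at="10:02"),
]


def series(samples, stamp_with):
    """Return (timestamp, value) pairs as a backend would store them, sorted by time."""
    return sorted((getattr(s, stamp_with), s.value) for s in samples)


stamped_at_send = series(delivered, "sent_at")
stamped_at_collection = series(delivered, "collected_at")

print(stamped_at_send)        # the 50% reading appears to follow the 60% reading
print(stamped_at_collection)  # the readings appear in the order they were measured
```

With send-time stamps the stored series misplaces the 50% reading by two minutes; with collection-time stamps the series matches what was actually measured, regardless of delivery order.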

Proposed Solution Architecture

1. Timestamped Metric Queue

class TimestampedMetric:
    def __init__(self, name, value, collection_time_ns):
        self.name = name
        self.value = value
        self.collection_time_ns = collection_time_ns  # Original collection time
        self.retry_count = 0

2. OpenTelemetry Timestamp Preservation

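# Proposed metric observation that carries the original timestamp.
# NOTE: pseudocode. Observation in the OpenTelemetry Python API takes a value
# and attributes; time_unix_nano is a hypothetical parameter here. Preserving
# the original time would likely require a custom export path that sets the
# time_unix_nano field of the OTLP data point directly.
def callback(options):
    for metric in pending_metrics_queue:
        yield Observation(
            value=metric.value,
            time_unix_nano=metric.collection_time_ns  # Original time!
        )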
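
One possible way to carry the original timestamp is to build the export payload directly, since the OTLP data point has its own `time_unix_nano` field. The sketch below is a minimal illustration under that assumption: `to_otlp_gauge` is a hypothetical helper, the field names follow the OTLP/JSON encoding as I understand it and should be checked against the specification, and the surrounding request envelope (resource and scope metrics) is omitted.

```python
import json
from dataclasses import dataclass


@dataclass
class TimestampedMetric:
    name: str
    value: float
    collection_time_ns: int  # Unix epoch, nanoseconds


def to_otlp_gauge(metric):
    """Build a dict shaped like one OTLP/JSON gauge metric with a single data point."""
    return {
        "name": metric.name,
        "gauge": {
            "dataPoints": [
                {
                    "asDouble": float(metric.value),
                    # 64-bit integers are written as decimal strings in the JSON encoding.
                    "timeUnixNano": str(metric.collection_time_ns),
                }
            ]
        },
    }


metric = TimestampedMetric("cpu.utilization", 50.0, 1_700_000_000_000_000_000)
payload = to_otlp_gauge(metric)
print(json.dumps(payload, indent=2))
```

Because the timestamp travels inside the payload, a retry can resend the same data point unchanged.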

3. Ordered Delivery Queue

import time

class OrderedMetricQueue:
    def __init__(self):
        self.pending_metrics = []  # TimestampedMetric objects, oldest first

    def add_metrics(self, metrics_batch):
        """Add new metrics maintaining temporal order"""
        self.pending_metrics.extend(metrics_batch)
        # Keep sorted by collection timestamp
        self.pending_metrics.sort(key=lambda m: m.collection_time_ns)

    def get_batch_for_delivery(self, max_age_seconds=300, max_batch_size=100):
        """Return the oldest metrics that are at least max_age_seconds old"""
        now_ns = time.time_ns()
        cutoff_ns = now_ns - (max_age_seconds * 1_000_000_000)

        # Select metrics collected at or before the cutoff. They stay in the
        # queue; removal after a successful send is left to the caller.
        batch = [m for m in self.pending_metrics if m.collection_time_ns <= cutoff_ns]
        return batch[:max_batch_size]  # Limit batch size
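
Re-sorting the whole list on every insert costs O(n log n) per batch. As an alternative sketch, the same ordering can be kept with a min-heap, with an explicit acknowledgment step so metrics are only dropped after delivery succeeds. `HeapMetricQueue`, `peek_batch`, and `ack` are hypothetical names for this illustration, not part of the proposal above.

```python
import heapq
import itertools
from dataclasses import dataclass


@dataclass
class TimestampedMetric:
    name: str
    value: float
    collection_time_ns: int
    retry_count: int = 0


class HeapMetricQueue:
    """Hypothetical variant of OrderedMetricQueue backed by a min-heap."""

    def __init__(self):
        self._heap = []
        # Tie-breaker so metrics with equal timestamps keep insertion order
        # and the metric objects themselves are never compared.
        self._seq = itertools.count()

    def add_metrics(self, metrics_batch):
        for m in metrics_batch:
            heapq.heappush(self._heap, (m.collection_time_ns, next(self._seq), m))

    def peek_batch(self, max_batch_size=100):
        """Return the oldest metrics, oldest first, without removing them."""
        return [entry[2] for entry in heapq.nsmallest(max_batch_size, self._heap)]

    def ack(self, count):
        """Drop the `count` oldest metrics once their delivery is confirmed."""
        for _ in range(min(count, len(self._heap))):
            heapq.heappop(self._heap)

    def __len__(self):
        return len(self._heap)


queue = HeapMetricQueue()
queue.add_metrics([
    TimestampedMetric("cpu", 60.0, 2_000),
    TimestampedMetric("cpu", 50.0, 1_000),
])
values = [m.value for m in queue.peek_batch()]  # oldest sample comes first
queue.ack(1)  # pretend the oldest sample was delivered
```

Separating `peek_batch` from `ack` means a failed send leaves the queue untouched, which is what at-least-once delivery needs.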

4. Advanced Features

  • Clock Skew Detection: Identify and handle system clock synchronization issues
  • Batch Window Strategy: Group metrics by time windows for efficient delivery
  • Circuit Breaker Pattern: Prevent cascading failures during OpenTelemetry outages
  • Dead Letter Queue: Handle metrics that repeatedly fail to deliver
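
Two of the features above can be sketched briefly. The classes below are hypothetical and make simplifying assumptions: clock skew is approximated by comparing elapsed wall-clock time with elapsed monotonic time between checks, and the circuit breaker uses a fixed failure threshold and cool-down. The clocks are injectable so the behavior can be tested without waiting.

```python
import time


class ClockSkewDetector:
    """Flags wall-clock jumps by comparing wall-clock and monotonic elapsed time."""

    def __init__(self, tolerance_ns=1_000_000_000,
                 wall=time.time_ns, mono=time.monotonic_ns):
        self._wall, self._mono = wall, mono
        self._tolerance_ns = tolerance_ns
        self._last_wall, self._last_mono = wall(), mono()

    def check(self):
        """Return the wall-clock drift in ns since the last check, or 0 if within tolerance."""
        wall_now, mono_now = self._wall(), self._mono()
        drift = (wall_now - self._last_wall) - (mono_now - self._last_mono)
        self._last_wall, self._last_mono = wall_now, mono_now
        return drift if abs(drift) > self._tolerance_ns else 0


class CircuitBreaker:
    """Blocks delivery attempts after repeated failures until a cool-down has passed."""

    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self._threshold = failure_threshold
        self._reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        # Half-open: permit probe requests once the cool-down has elapsed.
        return self._clock() - self._opened_at >= self._reset_after_s

    def record_success(self):
        self._failures = 0
        self._opened_at = None

    def record_failure(self):
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

A delivery loop would call `allow_request()` before each send and report the outcome with `record_success()` or `record_failure()`; a non-zero result from `check()` could be logged or used to annotate metrics collected around the jump.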

Implementation Considerations

Architecture Changes Required

  1. Metric Collection Layer: Enhanced to capture and preserve original timestamps
  2. Queue Management: New component for ordered metric storage and retrieval
  3. Delivery Engine: Async delivery system with temporal ordering guarantees
  4. Error Handling: Comprehensive retry and failure management
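
As a rough sketch of how the delivery engine and error handling could fit together, the function below sends metrics oldest-first and stops at the first failure so that later samples never overtake an undelivered one. It is synchronous for brevity, whereas the proposal calls for an async engine. `deliver_in_order`, the `send` callable, and the `dead_letters` list are hypothetical names, and the retry limit is an arbitrary default.

```python
from dataclasses import dataclass


@dataclass
class TimestampedMetric:
    name: str
    value: float
    collection_time_ns: int
    retry_count: int = 0


def deliver_in_order(pending, send, dead_letters, max_retries=5):
    """Send metrics oldest-first and return how many were delivered.

    `pending` is a list sorted by collection_time_ns and is modified in place.
    `send` is a callable that returns True on success.
    """
    delivered = 0
    while pending:
        metric = pending[0]
        if send(metric):
            pending.pop(0)
            delivered += 1
            continue
        metric.retry_count += 1
        if metric.retry_count > max_retries:
            # Give up on this sample so it cannot block the queue forever.
            dead_letters.append(pending.pop(0))
            continue
        break  # keep the failed metric at the head and retry on the next cycle
    return delivered
```

The trade-off is head-of-line blocking: while the oldest metric keeps failing, nothing newer is sent. That is what preserves ordering, and it is why the retry limit and the dead-letter list matter.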

Trade-offs

Benefits:

  • ✅ Temporal integrity preserved during retries
  • ✅ Guaranteed chronological ordering in OpenTelemetry
  • ✅ Production-grade reliability and delivery guarantees
  • ✅ Clock skew protection
  • ✅ Comprehensive error handling

Complexity:

  • ⚠️ Major architectural change (capture → cache → push pattern)
  • ⚠️ Additional storage requirements for metric queue
  • ⚠️ Increased memory and CPU overhead
  • ⚠️ More complex error scenarios to handle

Future Implementation Timeline

This enhancement should be considered when:

  1. Production deployments require guaranteed temporal ordering
  2. Current simple retry approaches prove insufficient
  3. Monitoring systems require strict chronological consistency
  4. Network reliability issues become frequent

Related Issues

  • CPU Usage Staleness Investigation (resolved in v0.1.04)
  • Production reliability requirements assessment
  • OpenTelemetry configuration optimization

Acceptance Criteria

  • Metrics maintain original collection timestamps during retry scenarios
  • OpenTelemetry receives metrics in strict chronological order
  • Clock skew detection and handling implemented
  • Delivery guarantees configurable (at-least-once vs exactly-once)
  • Performance impact minimized through efficient batching
  • Comprehensive test suite covering temporal ordering scenarios
  • Production monitoring and alerting for delivery failures

Labels: enhancement, production, reliability, opentelemetry, future
Priority: Low (future enhancement)
Effort: High (major architectural change)
