Overview
This enhancement proposal addresses the need for guaranteed temporal ordering and reliable delivery of metrics to OpenTelemetry/Instana with proper timestamp preservation during retry scenarios.
Problem Statement
The current metric delivery approach has several limitations:
- Temporal Ordering Issues: When metrics fail to send and are retried, they may arrive out of chronological order
- Timestamp Corruption: Retried metrics get new timestamps instead of preserving the original collection time
- Delivery Reliability: No guarantee that metrics reach the OpenTelemetry Collector during network issues
- Clock Skew Handling: No protection against system clock synchronization issues
Example Failure Scenario
Time 10:00 - Collect CPU=50% → Send fails → Queued for retry
Time 10:01 - Collect CPU=60% → Send succeeds (timestamp: 10:01)
Time 10:02 - Retry CPU=50% → Send succeeds (timestamp: 10:02) ❌ WRONG!
Result: OpenTelemetry sees CPU=60% at 10:01, then CPU=50% at 10:02 (temporal inversion!)
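The scenario above can be reproduced in a few lines. This is only an illustrative sketch: `Sample`, `is_inverted`, and the integer minute values are hypothetical names made up for this example and are not part of the existing codebase.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    name: str
    value: float
    collected_at: int  # minute the value was measured (1000 stands for 10:00)
    reported_at: int   # timestamp the backend ends up recording


def is_inverted(samples):
    """Return True if ordering by reported time differs from ordering by collection time."""
    by_reported = sorted(samples, key=lambda s: s.reported_at)
    by_collected = sorted(samples, key=lambda s: s.collected_at)
    return [s.value for s in by_reported] != [s.value for s in by_collected]


# Retry re-stamps the 10:00 sample with the retry time (10:02).
restamped = [Sample("cpu", 60.0, 1001, 1001), Sample("cpu", 50.0, 1000, 1002)]

# Retry keeps the original collection time (10:00).
preserved = [Sample("cpu", 60.0, 1001, 1001), Sample("cpu", 50.0, 1000, 1000)]

print(is_inverted(restamped))
print(is_inverted(preserved))
```

With re-stamping the backend sees 60% before 50%, although 50% was measured first. With the collection time preserved the two orderings agree.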
Proposed Solution Architecture
1. Timestamped Metric Queue
class TimestampedMetric:
    def __init__(self, name, value, collection_time_ns):
        self.name = name
        self.value = value
        self.collection_time_ns = collection_time_ns  # Original collection time
        self.retry_count = 0
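One possible way to use such a record is to stamp the time exactly once, at collection, and touch only the retry counter afterwards. The sketch below restates the class as a dataclass so that it runs on its own. The `collect` and `mark_retry` helpers and the injectable `clock` parameter are assumptions added for illustration.

```python
import time
from dataclasses import dataclass


@dataclass
class TimestampedMetric:
    name: str
    value: float
    collection_time_ns: int  # set once at collection, never rewritten
    retry_count: int = 0


def collect(name, value, clock=time.time_ns):
    """Create a metric stamped with the current wall-clock time in nanoseconds."""
    return TimestampedMetric(name, value, clock())


def mark_retry(metric):
    """Record a failed delivery attempt without touching the collection time."""
    metric.retry_count += 1
    return metric


metric = collect("cpu", 50.0)
stamped_at = metric.collection_time_ns
mark_retry(metric)
print(metric.retry_count, metric.collection_time_ns == stamped_at)
```

Passing the clock in as a parameter keeps the helper testable with a fake time source.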
2. OpenTelemetry Timestamp Preservation
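# Enhanced metric observation with original timestamps.
# NOTE: pseudocode. Observation in the OpenTelemetry Python API takes a value and
# attributes but no timestamp argument, so time_unix_nano below is a proposed
# field, not an existing one. pending_metrics_queue is assumed to be defined elsewhere.
def callback(options):
    for metric in pending_metrics_queue:
        yield Observation(
            value=metric.value,
            time_unix_nano=metric.collection_time_ns  # Original time!
        )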
3. Ordered Delivery Queue
import time


class OrderedMetricQueue:
    def __init__(self):
        self.pending_metrics = []  # Kept sorted by collection_time_ns

    def add_metrics(self, metrics_batch):
        """Add new metrics maintaining temporal order"""
        self.pending_metrics.extend(metrics_batch)
        # Keep sorted by collection timestamp
        self.pending_metrics.sort(key=lambda m: m.collection_time_ns)

    def get_batch_for_delivery(self, max_age_seconds=300):
        """Get oldest metrics that should be sent"""
        now_ns = time.time_ns()
        cutoff_ns = now_ns - (max_age_seconds * 1_000_000_000)
        # Send metrics older than max_age to prevent indefinite delay
        batch = [m for m in self.pending_metrics if m.collection_time_ns <= cutoff_ns]
        return batch[:100]  # Limit batch size
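The queue above re-sorts the whole list on every add and leaves delivered metrics in place. One alternative, sketched here as an assumption and not as the proposed design, is a min-heap keyed on the collection timestamp, from which batches are popped oldest first. `HeapMetricQueue` and its method names are hypothetical.

```python
import heapq
import itertools


class HeapMetricQueue:
    """Min-heap of metrics ordered by collection time (oldest first)."""

    def __init__(self):
        self._heap = []
        # Sequence counter breaks ties so equal timestamps keep insertion order.
        self._seq = itertools.count()

    def add(self, collection_time_ns, name, value):
        heapq.heappush(self._heap, (collection_time_ns, next(self._seq), name, value))

    def pop_batch(self, max_items=100):
        """Remove and return up to max_items metrics, oldest first."""
        batch = []
        while self._heap and len(batch) < max_items:
            ts, _, name, value = heapq.heappop(self._heap)
            batch.append((ts, name, value))
        return batch

    def __len__(self):
        return len(self._heap)


queue = HeapMetricQueue()
queue.add(3, "cpu", 70.0)
queue.add(1, "cpu", 50.0)
queue.add(2, "cpu", 60.0)
print(queue.pop_batch(max_items=2))
print(len(queue))
```

Each push and pop costs O(log n), and a popped batch is removed from the queue, so a failed send would need to push its metrics back with their original timestamps.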
4. Advanced Features
- Clock Skew Detection: Identify and handle system clock synchronization issues
- Batch Window Strategy: Group metrics by time windows for efficient delivery
- Circuit Breaker Pattern: Prevent cascading failures during OpenTelemetry outages
- Dead Letter Queue: Handle metrics that repeatedly fail to deliver
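As one illustration of the clock skew item, a detector could compare how far the wall clock has advanced against how far the monotonic clock has advanced over the same interval. A large difference suggests the system clock was stepped or adjusted. This is a minimal sketch under that assumption. `ClockSkewDetector`, its one-second default tolerance, and the injectable clocks are hypothetical.

```python
import time


class ClockSkewDetector:
    """Flags wall-clock jumps by comparing against the monotonic clock."""

    def __init__(self, tolerance_ns=1_000_000_000,
                 wall=time.time_ns, mono=time.monotonic_ns):
        self._wall = wall
        self._mono = mono
        self._tolerance_ns = tolerance_ns
        # Reference readings taken at construction time.
        self._wall0 = wall()
        self._mono0 = mono()

    def drift_ns(self):
        """Wall-clock advance minus monotonic advance since construction."""
        return (self._wall() - self._wall0) - (self._mono() - self._mono0)

    def skew_detected(self):
        return abs(self.drift_ns()) > self._tolerance_ns


detector = ClockSkewDetector()
print(detector.skew_detected())
```

What to do once skew is detected (drop, re-stamp, or flag the affected metrics) is left open here, as it is in the proposal.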
Implementation Considerations
Architecture Changes Required
- Metric Collection Layer: Enhanced to capture and preserve original timestamps
- Queue Management: New component for ordered metric storage and retrieval
- Delivery Engine: Async delivery system with temporal ordering guarantees
- Error Handling: Comprehensive retry and failure management
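To make the interaction between the delivery engine and error handling more concrete, the following sketch delivers metrics strictly oldest first, stops at the first failure so later metrics cannot overtake it, and moves a metric to a dead letter list once it exceeds a retry budget. The function name, the dict-based metric shape, and the `send` callable are assumptions for illustration only.

```python
from collections import deque


def deliver_in_order(pending, send, max_retries=3):
    """Deliver from the head of `pending` (a deque sorted by collection time).

    `send` is a callable returning True on success. Returns a tuple
    (delivered, dead_letters); undelivered metrics stay in `pending`.
    """
    delivered, dead_letters = [], []
    while pending:
        metric = pending[0]
        if send(metric):
            delivered.append(pending.popleft())
            continue
        metric["retry_count"] += 1
        if metric["retry_count"] > max_retries:
            # Retry budget exhausted: set aside so it stops blocking the queue.
            dead_letters.append(pending.popleft())
            continue
        break  # Keep the failed metric at the head and retry on the next pass.
    return delivered, dead_letters


pending = deque([
    {"name": "cpu", "value": 50.0, "collection_time_ns": 1, "retry_count": 0},
    {"name": "cpu", "value": 60.0, "collection_time_ns": 2, "retry_count": 0},
])
delivered, dead = deliver_in_order(pending, send=lambda m: True)
print(len(delivered), len(dead), len(pending))
```

Blocking on the head of the queue trades latency for ordering: newer metrics wait until the oldest one is either delivered or dead-lettered.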
Trade-offs
Benefits:
- ✅ Temporal integrity preserved during retries
- ✅ Guaranteed chronological ordering in OpenTelemetry
- ✅ Production-grade reliability and delivery guarantees
- ✅ Clock skew protection
- ✅ Comprehensive error handling
Complexity:
- ⚠️ Major architectural change (capture → cache → push pattern)
- ⚠️ Additional storage requirements for metric queue
- ⚠️ Increased memory and CPU overhead
- ⚠️ More complex error scenarios to handle
Future Implementation Timeline
This enhancement should be considered when:
- Production deployments require guaranteed temporal ordering
- Current simple retry approaches prove insufficient
- Monitoring systems require strict chronological consistency
- Network reliability issues become frequent
Related Issues
- CPU Usage Staleness Investigation (resolved in v0.1.04)
- Production reliability requirements assessment
- OpenTelemetry configuration optimization
Acceptance Criteria
Labels: enhancement, production, reliability, opentelemetry, future
Priority: Low (future enhancement)
Effort: High (major architectural change)