Observability Guide

Comprehensive observability with OpenTelemetry: traces, metrics, and structured logs

Overview

The vCon MCP server includes production-ready observability powered by OpenTelemetry. Monitor request flows, track performance metrics, and analyze logs with full trace correlation.

Features

Distributed Tracing: Full request lifecycle tracing with spans for every operation
Metrics: Business and performance metrics (counters, histograms, gauges)
Structured Logging: JSON logs with automatic trace context correlation
Flexible Exports: Console/JSON for development, OTLP for production collectors
Zero Configuration: Works out of the box with sensible defaults
Low Overhead: Minimal cost when enabled, near zero when disabled

Quick Start

Development (Console Export)

Output telemetry to stderr as JSON for local development:

# .env
OTEL_ENABLED=true
OTEL_EXPORTER_TYPE=console
OTEL_LOG_LEVEL=info

Start the server and see structured logs:

npm run dev

Production (OTLP Collector)

Export to an OpenTelemetry collector:

# .env
OTEL_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=http://localhost:4318
OTEL_SERVICE_NAME=vcon-mcp-server
OTEL_SERVICE_VERSION=1.0.0

Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| `OTEL_ENABLED` | `true` | Enable/disable OpenTelemetry |
| `OTEL_EXPORTER_TYPE` | `console` | Export type: `console` or `otlp` |
| `OTEL_ENDPOINT` | `http://localhost:4318` | OTLP collector endpoint |
| `OTEL_SERVICE_NAME` | `vcon-mcp-server` | Service name in telemetry |
| `OTEL_SERVICE_VERSION` | `1.0.0` | Service version in telemetry |
| `OTEL_LOG_LEVEL` | `info` | Diagnostic log level |
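
The defaults in the table above can be mirrored in a small script, for example to sanity-check a `.env` file before starting the server. The following is a minimal Python sketch and not the server's own configuration loader; the helper name `resolve_otel_config` and its validation rules are assumptions made for illustration.

```python
import os

# Defaults copied from the environment variable table above.
DEFAULTS = {
    "OTEL_ENABLED": "true",
    "OTEL_EXPORTER_TYPE": "console",
    "OTEL_ENDPOINT": "http://localhost:4318",
    "OTEL_SERVICE_NAME": "vcon-mcp-server",
    "OTEL_SERVICE_VERSION": "1.0.0",
    "OTEL_LOG_LEVEL": "info",
}

VALID_EXPORTERS = {"console", "otlp"}


def resolve_otel_config(env=None):
    """Hypothetical helper: merge environment values over the documented defaults."""
    env = os.environ if env is None else env
    config = {key: env.get(key, default) for key, default in DEFAULTS.items()}
    if config["OTEL_EXPORTER_TYPE"] not in VALID_EXPORTERS:
        raise ValueError(
            f"OTEL_EXPORTER_TYPE must be one of {sorted(VALID_EXPORTERS)}, "
            f"got {config['OTEL_EXPORTER_TYPE']!r}"
        )
    # Assumption: only the string "true" (case-insensitive) enables telemetry.
    config["enabled"] = config["OTEL_ENABLED"].strip().lower() == "true"
    return config
```

Calling `resolve_otel_config({})` returns the documented defaults, which makes it easy to see which values a given environment actually overrides.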

Export Types

Console (Development)

Best for local development and debugging:

OTEL_EXPORTER_TYPE=console

Outputs JSON to stderr
No external dependencies
Easy to pipe to log processors
Human-readable with jq

Example output:

{
  "timestamp": "2025-10-15T10:30:45.123Z",
  "level": "info",
  "message": "Tool execution completed",
  "trace_id": "1234567890abcdef1234567890abcdef",
  "span_id": "abcdef1234567890",
  "tool_name": "create_vcon",
  "duration_ms": 125
}
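
Because the console exporter writes JSON to stderr, captured output can be post-processed with a short script. The sketch below assumes one JSON object per line, possibly mixed with plain-text lines from `npm`, and uses the `tool_name` and `duration_ms` fields from the example above. The function names are hypothetical.

```python
import json
from collections import defaultdict


def parse_json_lines(lines):
    """Yield dict records from an iterable of lines, skipping anything that is not a JSON object."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # e.g. plain-text output from npm
        if isinstance(record, dict):
            yield record


def average_duration_by_tool(records):
    """Average `duration_ms` per `tool_name` for records that carry both fields."""
    totals = defaultdict(lambda: [0.0, 0])  # tool name -> [sum of durations, count]
    for record in records:
        if "tool_name" in record and "duration_ms" in record:
            entry = totals[record["tool_name"]]
            entry[0] += record["duration_ms"]
            entry[1] += 1
    return {tool: total / count for tool, (total, count) in totals.items()}
```

For example, feeding the lines of a saved stderr capture through `parse_json_lines` and then `average_duration_by_tool` gives a quick per-tool latency summary without a collector.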

OTLP (Production)

Best for production environments with observability platforms:

OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=http://localhost:4318

Exports via OTLP/HTTP protocol
Supported by most major observability platforms and collectors
Efficient binary protocol
Automatic retries and batching
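
The OTLP specification defines default request paths for OTLP/HTTP: `/v1/traces`, `/v1/metrics`, and `/v1/logs`. Whether the server appends these to `OTEL_ENDPOINT` automatically is not stated in this guide, so the following is only a sketch of how a base endpoint maps to per-signal URLs, which can help when probing a collector by hand. The helper name `otlp_signal_url` is hypothetical.

```python
# Default OTLP/HTTP request paths defined by the OTLP specification.
OTLP_HTTP_PATHS = {
    "traces": "/v1/traces",
    "metrics": "/v1/metrics",
    "logs": "/v1/logs",
}


def otlp_signal_url(base_endpoint, signal):
    """Hypothetical helper: join a base endpoint such as OTEL_ENDPOINT with a signal path."""
    if signal not in OTLP_HTTP_PATHS:
        raise ValueError(f"unknown signal: {signal!r}")
    return base_endpoint.rstrip("/") + OTLP_HTTP_PATHS[signal]
```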

Collector Setup

Testing Setup (Recommended)

The easiest way to set up a local backend for testing is using Docker Compose with the included helper scripts:

Quick Start with Docker Compose

# Start Jaeger backend
./jaeger/start-jaeger.sh

# Configure your server in .env
OTEL_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=http://localhost:4318
OTEL_SERVICE_NAME=vcon-mcp-server
OTEL_SERVICE_VERSION=1.0.0

# Start your server
npm run dev

# View traces at: http://localhost:16686

The jaeger/docker-compose.yml file includes a pre-configured Jaeger all-in-one setup. The helper scripts handle container management:

./jaeger/start-jaeger.sh - Starts Jaeger and checks health
./jaeger/stop-jaeger.sh - Stops and removes the container

Manual Docker Compose

You can also run Docker Compose directly from the jaeger directory:

# Start
cd jaeger
docker compose up -d jaeger

# Stop
docker compose stop jaeger
docker compose rm -f jaeger

Local Development

Jaeger (All-in-One) - Manual Setup

If you prefer to run Jaeger manually without Docker Compose:

docker run -d --name jaeger \
  -p 4318:4318 \
  -p 16686:16686 \
  jaegertracing/all-in-one:latest

# .env
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=http://localhost:4318

View traces at: http://localhost:16686

OpenTelemetry Collector

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      http:
        endpoint: 0.0.0.0:4318

exporters:
  debug:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug]
    metrics:
      receivers: [otlp]
      exporters: [debug]

docker run -d --name otel-collector \
  -p 4318:4318 \
  -v "$(pwd)/otel-collector-config.yaml:/etc/otel-collector-config.yaml" \
  otel/opentelemetry-collector:latest \
  --config=/etc/otel-collector-config.yaml

# .env
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=http://localhost:4318

Cloud Platforms

Honeycomb

# .env
OTEL_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=https://api.honeycomb.io
OTEL_SERVICE_NAME=vcon-mcp-server

# Honeycomb also requires your API key in the x-honeycomb-team header (requires custom exporter config)

Datadog

# .env
OTEL_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=http://localhost:4318
OTEL_SERVICE_NAME=vcon-mcp-server

# Run Datadog agent with OTLP receiver enabled

New Relic

# .env
OTEL_ENABLED=true
OTEL_EXPORTER_TYPE=otlp
OTEL_ENDPOINT=https://otlp.nr-data.net:4318
OTEL_SERVICE_NAME=vcon-mcp-server

# Set your New Relic key in the api-key header (requires custom exporter config)

AWS X-Ray

Use the AWS Distro for OpenTelemetry (ADOT) Collector with its X-Ray exporter (`awsxray`).

Telemetry Reference

Traces

Traces capture the full lifecycle of requests with hierarchical spans.

MCP Tool Execution

Every tool call creates a root span:

Span Name: mcp.tool.{tool_name}
Attributes:
- mcp.tool.name: Tool name
- mcp.tool.success: Boolean success status
- vcon.uuid: vCon UUID (when applicable)
- error.type: Error type (on failure)
- error.message: Error message (on failure)
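
The naming convention above can be illustrated with a small helper that builds the span name and attribute set for one tool call. This is a Python sketch of the convention only, not the server's instrumentation code; `tool_span_fields` is a hypothetical name, and deriving `error.type` from the exception class name is an assumption.

```python
def tool_span_fields(tool_name, success, vcon_uuid=None, error=None):
    """Hypothetical helper: build the span name and attributes for one tool call."""
    attributes = {
        "mcp.tool.name": tool_name,
        "mcp.tool.success": success,
    }
    if vcon_uuid is not None:
        attributes["vcon.uuid"] = vcon_uuid  # only set when the tool works on a vCon
    if not success and error is not None:
        # Assumption: the error type is the exception class name.
        attributes["error.type"] = type(error).__name__
        attributes["error.message"] = str(error)
    return f"mcp.tool.{tool_name}", attributes
```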

Database Operations

Database queries create child spans:

Span Names:
- db.createVCon
- db.getVCon
- db.keywordSearch
- db.semanticSearch
- db.hybridSearch
Attributes:
- db.operation: Operation type (insert, select, search)
- db.system: Database system (supabase)
- vcon.uuid: vCon UUID
- cache.hit: Cache hit/miss boolean
- search.type: Search type
- search.results.count: Result count

Metrics

Business Metrics

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| `vcon.created.count` | Counter | vCons created | `vcon.uuid` |
| `vcon.deleted.count` | Counter | vCons deleted | `vcon.uuid` |
| `vcon.search.count` | Counter | Searches performed | `search.type` |
| `tool.execution.count` | Counter | Tool executions | `mcp.tool.name`, `status` |

Performance Metrics

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| `tool.execution.duration` | Histogram | Tool execution time (ms) | `mcp.tool.name`, `mcp.tool.success` |
| `db.query.duration` | Histogram | Query execution time (ms) | `operation` |
| `cache.operation.duration` | Histogram | Cache operation time (ms) | `operation` |

System Metrics

| Metric | Type | Description | Labels |
| --- | --- | --- | --- |
| `db.query.count` | Counter | Database queries | `operation` |
| `db.query.errors` | Counter | Database errors | `operation`, `error_type` |
| `cache.hit` | Counter | Cache hits | `operation` |
| `cache.miss` | Counter | Cache misses | `operation` |
| `cache.error` | Counter | Cache errors | `error_type` |

Structured Logs

All logs include automatic trace context correlation:

{
  "timestamp": "2025-10-15T10:30:45.123Z",
  "level": "info",
  "message": "Cache layer enabled",
  "trace_id": "1234567890abcdef1234567890abcdef",
  "span_id": "abcdef1234567890",
  "trace_flags": 1,
  "cache_ttl": 3600
}
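
In W3C Trace Context, the trace flags field is a bit field whose lowest bit marks the trace as sampled, so `"trace_flags": 1` in the example above indicates a sampled trace. A minimal Python sketch for reading that flag from a parsed log record follows; it assumes `trace_flags` is logged as an integer, as in the example, and the function name is hypothetical.

```python
SAMPLED_FLAG = 0x01  # W3C Trace Context: bit 0 of trace-flags is the "sampled" flag


def is_sampled(record):
    """Return True when a log record's `trace_flags` has the sampled bit set."""
    return bool(int(record.get("trace_flags", 0)) & SAMPLED_FLAG)
```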

Log Levels

debug: Verbose diagnostic information
info: Normal operational events
warn: Warning conditions
error: Error conditions

Best Practices

Development

Use Console Export: Easy to debug locally
Pipe to jq: Format JSON logs for readability
```
npm run dev 2>&1 | jq -R 'fromjson?'
```
Filter by Trace: Follow a single request through the system
Check Metrics: Monitor performance during development
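
Following a single request, as suggested under "Filter by Trace", can also be scripted. The sketch below assumes logs were captured as JSON lines carrying the `trace_id` and `timestamp` fields shown earlier in this guide; `records_for_trace` is a hypothetical helper name.

```python
import json


def records_for_trace(lines, trace_id):
    """Return the JSON log records whose `trace_id` matches, ordered by timestamp."""
    matches = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON output
        if isinstance(record, dict) and record.get("trace_id") == trace_id:
            matches.append(record)
    # ISO 8601 UTC timestamps in one consistent format sort correctly as strings.
    return sorted(matches, key=lambda r: r.get("timestamp", ""))
```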

Production

Use OTLP Export: Connect to your observability platform
Set Service Name: Identify this service in distributed traces
Monitor Key Metrics:
- Tool execution duration
- Database query count
- Cache hit rate
- Error rates
Set Up Alerts:
- High error rates
- Slow queries (p95 > 1000ms)
- Low cache hit rates (< 50%)
Retain Traces: Keep 7-30 days for debugging
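
The alert thresholds above (p95 query time above 1000 ms, cache hit rate below 50%) can be checked against raw samples when no alerting backend is available yet. This is a minimal Python sketch under stated assumptions: durations are in milliseconds, the percentile uses the nearest-rank method, and all function names are hypothetical.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("percentile of empty data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cache_hit_rate(hits, misses):
    """Fraction of cache lookups that were hits; None when there were no lookups."""
    total = hits + misses
    return hits / total if total else None


def check_thresholds(query_durations_ms, cache_hits, cache_misses):
    """Return the names of the alert conditions listed above that currently hold."""
    alerts = []
    if query_durations_ms and percentile(query_durations_ms, 95) > 1000:
        alerts.append("slow_queries")
    rate = cache_hit_rate(cache_hits, cache_misses)
    if rate is not None and rate < 0.5:
        alerts.append("low_cache_hit_rate")
    return alerts
```

In production the same conditions are better expressed as alert rules in the observability platform; the script is mainly useful for spot checks during load tests.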

Performance

The observability system is designed for minimal overhead:

When Enabled: < 1% CPU overhead, < 10MB memory
When Disabled: Near-zero overhead (instrumentation short-circuits)
Batching: Metrics and traces are batched for efficiency
Async Export: No blocking on the critical path

Sampling

For high-traffic scenarios, consider trace sampling:

# Sample 10% of traces (requires custom configuration)
OTEL_TRACES_SAMPLER=traceidratio
OTEL_TRACES_SAMPLER_ARG=0.1
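
The idea behind ratio-based sampling is that the decision is derived from the trace ID itself, so every service that sees the same trace makes the same choice. The following Python sketch is a simplified illustration of that idea and not the algorithm of any particular SDK, whose details differ; `should_sample` is a hypothetical name.

```python
def should_sample(trace_id_hex, ratio):
    """Deterministically keep roughly `ratio` of traces based on the trace ID."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    # Treat the low 64 bits of the trace ID as a uniformly distributed number.
    low_bits = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF
    threshold = int(ratio * (1 << 64))
    return low_bits < threshold
```

With `ratio=0.1`, about one trace in ten falls below the threshold, and the same trace ID always yields the same decision.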

Troubleshooting

No Telemetry Output

Check OTEL_ENABLED=true
Verify OTEL_EXPORTER_TYPE is set
Check initialization logs in stderr

OTLP Connection Errors

Verify collector is running
Check OTEL_ENDPOINT is correct
Ensure network connectivity
Check firewall rules
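
The connectivity checks above can be automated with a plain TCP probe against the host and port in `OTEL_ENDPOINT`. This Python sketch only verifies that something accepts connections there; it does not confirm that the listener speaks OTLP. The function names are hypothetical.

```python
import socket
from urllib.parse import urlsplit


def endpoint_host_port(endpoint):
    """Split an endpoint URL into (host, port), falling back to the scheme's default port."""
    parts = urlsplit(endpoint)
    if not parts.hostname:
        raise ValueError(f"not a valid endpoint URL: {endpoint!r}")
    port = parts.port or (443 if parts.scheme == "https" else 80)
    return parts.hostname, port


def can_connect(endpoint, timeout=2.0):
    """Return True when a TCP connection to the endpoint's host and port succeeds."""
    host, port = endpoint_host_port(endpoint)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For example, `can_connect("http://localhost:4318")` returning `False` points to a collector that is not running or a blocked port rather than a problem in the exporter configuration.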

Missing Traces

Verify span creation in code
Check sampling configuration
Ensure proper shutdown (flushes buffers)

High Overhead

Set OTEL_LOG_LEVEL to error to reduce log volume
Disable auto-instrumentations (edit config)
Increase metric export interval
Enable trace sampling

Examples

Query Spans in Jaeger

Open Jaeger UI: http://localhost:16686
Select service: vcon-mcp-server
Search for operations: mcp.tool.create_vcon
View trace timeline and tags

Analyze Metrics

# Fetch metrics in Prometheus format
# (assumes a Prometheus-compatible metrics endpoint is listening on port 9464)
curl -s http://localhost:9464/metrics

# Filter for a specific metric
curl -s http://localhost:9464/metrics | grep vcon_created_count
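
The `grep` pattern uses `vcon_created_count` rather than `vcon.created.count` because Prometheus metric names cannot contain dots. The Python sketch below shows a simplified version of that name mapping; real exporters may additionally append suffixes such as `_total` or a unit, so treat the result as a search prefix rather than an exact name. `to_prometheus_name` is a hypothetical helper.

```python
import re


def to_prometheus_name(otel_name):
    """Simplified mapping: replace every character Prometheus does not allow with an underscore."""
    name = re.sub(r"[^a-zA-Z0-9_:]", "_", otel_name)
    # Prometheus metric names must not start with a digit.
    if name and name[0].isdigit():
        name = "_" + name
    return name
```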

Correlate Logs with Traces

# Extract the first trace ID found in the log file
TRACE_ID=$(jq -r 'select(.trace_id != null) | .trace_id' logs.json | head -n 1)

# Search traces by ID in Jaeger
curl "http://localhost:16686/api/traces/${TRACE_ID}"

Integration Examples

Custom Dashboard (Grafana)

{
  "dashboard": {
    "title": "vCon MCP Server",
    "panels": [
      {
        "title": "Tool Execution Rate",
        "targets": [
          {
            "expr": "rate(tool_execution_count[5m])"
          }
        ]
      },
      {
        "title": "Database Query Duration (p95)",
        "targets": [
          {
            "expr": "histogram_quantile(0.95, rate(db_query_duration_bucket[5m]))"
          }
        ]
      },
      {
        "title": "Cache Hit Rate",
        "targets": [
          {
            "expr": "rate(cache_hit[5m]) / (rate(cache_hit[5m]) + rate(cache_miss[5m]))"
          }
        ]
      }
    ]
  }
}

Alert Rules (Prometheus)

groups:
  - name: vcon_mcp
    rules:
      - alert: HighErrorRate
        expr: rate(tool_execution_count{status="error"}[5m]) > 0.1
        annotations:
          summary: "High error rate detected"
      
      - alert: SlowQueries
        expr: histogram_quantile(0.95, rate(db_query_duration_bucket[5m])) > 1000
        annotations:
          summary: "Database queries are slow"
      
      - alert: LowCacheHitRate
        expr: rate(cache_hit[5m]) / (rate(cache_hit[5m]) + rate(cache_miss[5m])) < 0.5
        annotations:
          summary: "Cache hit rate is low"
