Building Production-Ready Observability with OpenTelemetry

Oct 22, 2024

Modern distributed systems require sophisticated observability strategies. This guide walks through implementing production-ready observability using OpenTelemetry and cloud-native tools.

The Three Pillars of Observability

Effective observability combines metrics, logs, and traces to provide complete system visibility:

Metrics: System Health at a Glance

# Prometheus metrics configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "first_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
          - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
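
The relabel rule above only keeps pods annotated with prometheus.io/scrape: "true", so each service still needs to expose metrics over HTTP for Prometheus to collect. A minimal sketch with prometheus_client (the port and metric name are illustrative, not part of the config above):

# Expose a /metrics endpoint for the scrape config above (illustrative port and metric name)
from prometheus_client import Counter, start_http_server
import time

requests_served = Counter('app_requests_served', 'Requests handled by this service')

if __name__ == '__main__':
    start_http_server(8000)  # Prometheus scrapes http://<pod-ip>:8000/metrics
    while True:
        requests_served.inc()  # stand-in for real request handling
        time.sleep(1)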

Distributed Tracing with OpenTelemetry

Implement distributed tracing to understand request flows across microservices:

# OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
    - key: environment
      value: production
      action: upsert

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]

Application Instrumentation Best Practices

Proper instrumentation is crucial for meaningful observability data:

Automatic vs Manual Instrumentation

💡 Strategy: Start with automatic instrumentation for immediate value, then add manual instrumentation for business-specific metrics and traces.
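
For the automatic side, most frameworks can be instrumented with a single call before any requests are handled. A minimal sketch for a Flask service, assuming the opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests packages are installed and the tracer provider is configured as shown below:

# Automatic instrumentation for a Flask service (assumes the Flask and requests
# instrumentation packages are installed and a tracer provider is configured)
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)  # creates a server span per incoming request
RequestsInstrumentor().instrument()      # propagates trace context on outgoing HTTP calls

@app.route("/health")
def health():
    return "ok"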

# Python application with OpenTelemetry
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317")
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Manual instrumentation for business logic
@tracer.start_as_current_span("process_order")
def process_order(order_id):
    order = get_order(order_id)  # placeholder: look up the order record
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    span.set_attribute("order.amount", order.amount)
    
    try:
        # Business logic here
        result = validate_and_process(order_id)
        span.set_attribute("order.status", "processed")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise
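
Sub-steps inside a traced operation can get their own child spans, which is where manual instrumentation pays off in trace waterfall views. A sketch of the validate_and_process placeholder used above, reusing the tracer configured earlier (the span names and return value are illustrative):

# Child spans for sub-steps of process_order (illustrative implementation of
# the validate_and_process placeholder)
def validate_and_process(order_id):
    with tracer.start_as_current_span("validate_order") as span:
        span.set_attribute("order.id", order_id)
        # ... validation logic ...
    with tracer.start_as_current_span("charge_payment"):
        # ... payment logic ...
        pass
    return {"order_id": order_id, "status": "processed"}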

Custom Metrics for Business KPIs

# Custom Prometheus metrics
from prometheus_client import Counter, Gauge, Histogram

order_processing_duration = Histogram(
    'order_processing_seconds',
    'Time spent processing orders',
    ['order_type', 'payment_method']
)

active_user_sessions = Gauge(
    'active_user_sessions',
    'Number of active user sessions'
)

order_total_counter = Counter(
    'orders_processed_total',
    'Total number of orders processed',
    ['status', 'region']
)
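
These definitions only declare the metrics; application code updates them at the relevant points. A minimal sketch of how the three metric types above might be used together (the handler and its arguments are illustrative):

# Updating the custom metrics defined above in request-handling code
# (the handler and its arguments are illustrative)
import time

def handle_order(order_type, payment_method, region):
    active_user_sessions.inc()   # gauge: one more session in flight
    start = time.time()
    try:
        # ... process the order ...
        order_total_counter.labels(status="success", region=region).inc()
    except Exception:
        order_total_counter.labels(status="error", region=region).inc()
        raise
    finally:
        order_processing_duration.labels(order_type, payment_method).observe(time.time() - start)
        active_user_sessions.dec()   # session finished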

Log Aggregation and Correlation

Structured logging with trace correlation enables powerful debugging capabilities:

# Structured logging with trace context
import logging
from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Auto-inject trace context into logs
LoggingInstrumentor().instrument(set_logging_format=True)

logger = logging.getLogger(__name__)

def process_payment(payment_data):
    logger.info(
        "Processing payment",
        extra={
            "payment_id": payment_data.id,
            "amount": payment_data.amount,
            "currency": payment_data.currency,
            "user_id": payment_data.user_id
        }
    )
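
LoggingInstrumentor injects otelTraceID and otelSpanID attributes into each log record, so a JSON formatter can emit them alongside the structured fields and the log backend can join log lines to traces. A minimal sketch using only the standard library (every field name other than the two injected trace attributes is an assumption):

# JSON formatter that carries the injected trace context into log output
# (field names other than otelTraceID/otelSpanID are assumptions)
import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "trace_id": getattr(record, "otelTraceID", None),
            "span_id": getattr(record, "otelSpanID", None),
            "payment_id": getattr(record, "payment_id", None),
        }
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)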

Alerting and SLI/SLO Management

# Prometheus alerting rules
groups:
- name: slo-alerts
  rules:
  - alert: HighErrorRate
    expr: |
      (
        sum(rate(http_requests_total{status=~"5.."}[5m])) /
        sum(rate(http_requests_total[5m]))
      ) > 0.05
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "High error rate detected"
      description: "Error rate is {{ $value | humanizePercentage }}"

  - alert: HighLatency
    expr: |
      histogram_quantile(0.95, 
        sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
      ) > 0.5
    for: 5m
    labels:
      severity: warning

SLO Dashboard Configuration

  • Availability SLO: 99.9% uptime (about 43.2 minutes of downtime per 30-day month; see the budget calculation after this list)
  • Latency SLO: 95% of requests under 200ms
  • Error Rate SLO: Less than 0.1% error rate
  • Throughput SLO: Handle 10,000 requests/minute
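
These targets translate directly into error budgets, as in the quick calculation below for the availability SLO (a 30-day month is assumed).

# Error-budget arithmetic for the 99.9% availability SLO (30-day month assumed)
slo = 0.999
minutes_per_month = 30 * 24 * 60                  # 43,200 minutes
downtime_budget = (1 - slo) * minutes_per_month
print(f"Allowed downtime: {downtime_budget:.1f} minutes/month")  # 43.2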

Cost Optimization in Observability

Observability can become expensive at scale. Implement these strategies:

💡 Tip: Use sampling for traces (1-10% in production), retain metrics for longer periods than logs, and implement log level filtering based on environment.
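
In the Python SDK, head-based sampling is set where the TracerProvider is created. A minimal sketch at 5%, respecting the parent's sampling decision so traces stay consistent across services (the ratio is just an example within the 1-10% range above):

# 5% head-based sampling; child services follow the parent's sampling decision
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.05))
trace.set_tracer_provider(TracerProvider(sampler=sampler))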

Intelligent Data Retention

  • Traces: 7 days full retention, 30 days sampled
  • Metrics: 1 year high-resolution, 2 years downsampled
  • Logs: 30 days application logs, 90 days audit logs

Troubleshooting Distributed Systems

Effective observability enables rapid incident resolution:

The Debugging Workflow

  1. Alert triggers: Automated detection of SLO violations
  2. Service map analysis: Identify affected components
  3. Trace analysis: Find specific failing requests
  4. Metric correlation: Understand system behavior
  5. Log analysis: Root cause identification

This comprehensive observability stack provides the foundation for reliable, scalable distributed systems. Remember to start simple and evolve your observability practice as your system grows.

Ops & Cloud