Modern distributed systems require sophisticated observability strategies. This guide walks through implementing production-ready observability using OpenTelemetry and cloud-native tools.
The Three Pillars of Observability
Effective observability combines metrics, logs, and traces to provide complete system visibility:
Metrics: System Health at a Glance
```yaml
# Prometheus metrics configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "first_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
```
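On the application side, each pod only needs to expose a /metrics endpoint and carry the prometheus.io/scrape: "true" annotation that the keep rule above matches. A minimal sketch using the Python prometheus_client library (an illustrative assumption, not something the scrape config mandates):

```python
# Minimal Python app exposing /metrics for the scrape config above
import time

from prometheus_client import Counter, start_http_server

heartbeats_total = Counter("demo_heartbeats_total", "Heartbeats emitted by the demo app")

if __name__ == "__main__":
    start_http_server(8000)  # serves /metrics on port 8000
    while True:
        heartbeats_total.inc()
        time.sleep(1)
```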
Distributed Tracing with OpenTelemetry
Implement distributed tracing to understand request flows across microservices:
```yaml
# OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  resource:
    attributes:
      - key: environment
        value: production
        action: upsert

exporters:
  jaeger:
    endpoint: jaeger-collector:14250
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      processors: [batch, resource]
      exporters: [prometheus]
```
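The metrics pipeline expects OTLP input from your services. One possible sketch of feeding it from a Python application using the OpenTelemetry SDK (the endpoint, export interval, and counter name are illustrative assumptions):

```python
# Sending OTLP metrics from a Python service to the collector's metrics pipeline
from opentelemetry import metrics
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader

# Push metrics to the collector's OTLP gRPC receiver every 30 seconds
reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="http://otel-collector:4317", insecure=True),
    export_interval_millis=30_000,
)
metrics.set_meter_provider(MeterProvider(metric_readers=[reader]))

meter = metrics.get_meter(__name__)
orders_processed = meter.create_counter(
    "orders_processed", description="Orders handled by this service"
)
orders_processed.add(1, {"status": "ok", "region": "eu-west-1"})
```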
Application Instrumentation Best Practices
Proper instrumentation is crucial for meaningful observability data:
Automatic vs Manual Instrumentation
Strategy: Start with automatic instrumentation for immediate value, then add manual instrumentation for business-specific metrics and traces.
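As a sketch of the automatic half of that strategy, assuming a Flask service with the opentelemetry-instrumentation-flask and opentelemetry-instrumentation-requests packages installed (neither is mandated elsewhere in this guide), auto-instrumentation takes only a couple of lines:

```python
# Automatic instrumentation sketch for a Flask service
from flask import Flask
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

app = Flask(__name__)

# Creates server spans for every incoming HTTP request
FlaskInstrumentor().instrument_app(app)
# Creates client spans for outgoing calls made with the requests library
RequestsInstrumentor().instrument()

@app.route("/healthz")
def healthz():
    return "ok"
```

Manual spans for business logic can then be layered on top of these framework-generated ones, as in the example below.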
```python
# Python application with OpenTelemetry
from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Configure tracing
trace.set_tracer_provider(TracerProvider())
tracer = trace.get_tracer(__name__)

# Export spans to the collector over plaintext gRPC
otlp_exporter = OTLPSpanExporter(endpoint="http://otel-collector:4317", insecure=True)
span_processor = BatchSpanProcessor(otlp_exporter)
trace.get_tracer_provider().add_span_processor(span_processor)

# Manual instrumentation for business logic
@tracer.start_as_current_span("process_order")
def process_order(order):
    span = trace.get_current_span()
    span.set_attribute("order.id", order.id)
    span.set_attribute("order.amount", order.amount)
    try:
        # Business logic here (validate_and_process is a placeholder)
        result = validate_and_process(order)
        span.set_attribute("order.status", "processed")
        return result
    except Exception as e:
        span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
        raise
```
Custom Metrics for Business KPIs
```python
# Custom Prometheus metrics
from prometheus_client import Counter, Gauge, Histogram

order_processing_duration = Histogram(
    'order_processing_seconds',
    'Time spent processing orders',
    ['order_type', 'payment_method']
)

# Gauges should not carry the _total suffix; that convention is reserved for counters
active_user_sessions = Gauge(
    'active_user_sessions',
    'Number of active user sessions'
)

order_total_counter = Counter(
    'orders_processed_total',
    'Total number of orders processed',
    ['status', 'region']
)
```
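How these metrics might be updated in request-handling code is sketched below; the handler, label values, and order fields are illustrative assumptions, not part of the metric definitions above:

```python
# Recording values for the custom metrics defined above
import time

def handle_order(order):
    start = time.perf_counter()
    try:
        # ... process the order (placeholder) ...
        order_total_counter.labels(status="processed", region=order.region).inc()
    except Exception:
        order_total_counter.labels(status="failed", region=order.region).inc()
        raise
    finally:
        order_processing_duration.labels(
            order_type=order.type, payment_method=order.payment_method
        ).observe(time.perf_counter() - start)

def on_login():
    active_user_sessions.inc()

def on_logout():
    active_user_sessions.dec()
```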
Log Aggregation and Correlation
Structured logging with trace correlation enables powerful debugging capabilities:
```python
# Structured logging with trace context
import logging

from opentelemetry.instrumentation.logging import LoggingInstrumentor

# Auto-inject trace context into logs
LoggingInstrumentor().instrument(set_logging_format=True)

logger = logging.getLogger(__name__)

def process_payment(payment_data):
    logger.info(
        "Processing payment",
        extra={
            "payment_id": payment_data.id,
            "amount": payment_data.amount,
            "currency": payment_data.currency,
            "user_id": payment_data.user_id,
        },
    )
```
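For the extra fields to actually appear in the log output, the handler needs a structured formatter. A sketch using the python-json-logger package (an assumption; any JSON formatter works), together with the otelTraceID/otelSpanID attributes that LoggingInstrumentor injects into log records:

```python
# Emit logs as JSON so the extra fields and trace context are machine-readable
import logging

from pythonjsonlogger import jsonlogger

handler = logging.StreamHandler()
# otelTraceID / otelSpanID are added to each record by LoggingInstrumentor above
handler.setFormatter(jsonlogger.JsonFormatter(
    "%(asctime)s %(levelname)s %(name)s %(message)s "
    "%(otelTraceID)s %(otelSpanID)s"
))
logging.getLogger().addHandler(handler)
logging.getLogger().setLevel(logging.INFO)
```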
Alerting and SLI/SLO Management
```yaml
# Prometheus alerting rules
groups:
  - name: slo-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[5m])) /
            sum(rate(http_requests_total[5m]))
          ) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
```
SLO Dashboard Configuration
- Availability SLO: 99.9% uptime (43.2 minutes downtime/month; see the error-budget sketch after this list)
- Latency SLO: 95% of requests under 200ms
- Error Rate SLO: Less than 0.1% error rate
- Throughput SLO: Handle 10,000 requests/minute
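A quick way to sanity-check availability targets like these is to convert them into a monthly error budget; a small illustrative calculation, assuming a 30-day month:

```python
# Convert an availability SLO into a monthly error budget
def error_budget_minutes(slo: float, days: int = 30) -> float:
    """Minutes of allowed downtime per month for a given availability SLO."""
    return (1 - slo) * days * 24 * 60

print(error_budget_minutes(0.999))   # 43.2 minutes/month for 99.9%
print(error_budget_minutes(0.9999))  # 4.32 minutes/month for 99.99%
```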
Cost Optimization in Observability
Observability can become expensive at scale. Implement these strategies:
Tip: Use sampling for traces (1-10% in production), retain metrics for longer periods than logs, and implement log level filtering based on environment.
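For the trace-sampling part, a minimal sketch using the OpenTelemetry Python SDK's ParentBased and TraceIdRatioBased samplers (the 10% ratio is simply the upper end of the range suggested above):

```python
# Keep roughly 10% of traces in production
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# ParentBased respects the caller's sampling decision, so a given trace
# is either kept or dropped consistently across every service it touches
sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))
```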
Intelligent Data Retention
- Traces: 7 days full retention, 30 days sampled
- Metrics: 1 year high-resolution, 2 years downsampled
- Logs: 30 days application logs, 90 days audit logs
Troubleshooting Distributed Systems
Effective observability enables rapid incident resolution:
The Debugging Workflow
1. Alert triggers: Automated detection of SLO violations
2. Service map analysis: Identify affected components
3. Trace analysis: Find specific failing requests
4. Metric correlation: Understand system behavior
5. Log analysis: Root cause identification
This comprehensive observability stack provides the foundation for reliable, scalable distributed systems. Remember to start simple and evolve your observability practice as your system grows.