Production Observability with OpenTelemetry

Implement production-grade observability using OpenTelemetry as the 2025 industry standard. Covers the three pillars (metrics, logs, traces), LGTM stack deployment, and critical log-trace correlation patterns.

When to Use

Use when:

  • Building production systems requiring visibility into performance and errors
  • Debugging distributed systems with multiple services
  • Setting up monitoring, logging, or tracing infrastructure
  • Implementing structured logging with trace correlation
  • Configuring alerting rules for production systems

Skip if:

  • Building proof-of-concept without production deployment
  • System has < 100 requests/day (console logging may suffice)

Multi-Language Support

This skill provides patterns for:

  • Python: opentelemetry-api, opentelemetry-sdk, structlog
  • Rust: opentelemetry, tracing-opentelemetry
  • Go: go.opentelemetry.io/otel, slog
  • TypeScript: @opentelemetry/api, pino

The OpenTelemetry Standard (2025)

OpenTelemetry is the CNCF project that unifies observability under a single standard:

┌────────────────────────────────────────────────────────┐
│ OpenTelemetry: The Unified Standard                    │
├────────────────────────────────────────────────────────┤
│                                                        │
│ ONE SDK for ALL signals:                               │
│ ├── Metrics (Prometheus-compatible)                    │
│ ├── Logs (structured, correlated)                      │
│ ├── Traces (distributed, standardized)                 │
│ └── Context (propagates across services)               │
│                                                        │
│ Language SDKs:                                         │
│ ├── Python: opentelemetry-api, opentelemetry-sdk       │
│ ├── Rust: opentelemetry, tracing-opentelemetry         │
│ ├── Go: go.opentelemetry.io/otel                       │
│ └── TypeScript: @opentelemetry/api                     │
│                                                        │
│ Export to ANY backend:                                 │
│ ├── LGTM Stack (Loki, Grafana, Tempo, Mimir)           │
│ ├── Prometheus + Jaeger                                │
│ ├── Datadog, New Relic, Honeycomb (SaaS)               │
│ └── Custom backends via OTLP protocol                  │
│                                                        │
└────────────────────────────────────────────────────────┘

The Three Pillars of Observability

1. Metrics (What is happening?)

Track system health and performance over time.

Metric Types: Counters (always increase), Gauges (up/down), Histograms (distributions), Summaries (percentiles).

Example (Python):

from opentelemetry import metrics

meter = metrics.get_meter(__name__)
http_requests = meter.create_counter("http.server.requests")
http_requests.add(1, {"method": "GET", "status": 200})
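
The other instrument types follow the same pattern. A minimal histogram sketch, reusing the meter above (the unit string and attribute names are illustrative):

# Histogram: captures a distribution, e.g. request latency in milliseconds
http_duration = meter.create_histogram("http.server.duration", unit="ms")
http_duration.record(42.1, {"method": "GET", "route": "/users"})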

2. Logs (What happened?)

Record discrete events with context.

CRITICAL: Always inject trace_id/span_id for log-trace correlation.

Example (Python + structlog):

import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "processing_request",
    trace_id=format(ctx.trace_id, '032x'),
    span_id=format(ctx.span_id, '016x'),
    user_id=user_id,
)

3. Traces (Where did time go?)

Track request flow across distributed services.

Key Concepts: Trace (end-to-end journey), Span (individual operation), Parent-Child (nested operations).

Example (Python + FastAPI):

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-traces all HTTP requests
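
Parent-child structure falls out of nesting context managers; a minimal sketch with hypothetical span names:

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("handle_order"):        # parent span
    with tracer.start_as_current_span("charge_card"):     # child span, linked automatically
        pass  # hypothetical work happens here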

The LGTM Stack (Self-Hosted Observability)

LGTM = Loki (Logs) + Grafana (Visualization) + Tempo (Traces) + Mimir (Metrics)

┌────────────────────────────────────────────────────────┐
│ LGTM Architecture                                      │
├────────────────────────────────────────────────────────┤
│                                                        │
│   ┌──────────────────────────────────────────┐         │
│   │      Grafana Dashboard (Port 3000)       │         │
│   │   Unified UI for Logs, Metrics, Traces   │         │
│   └───────┬─────────────┬─────────────┬──────┘         │
│           │             │             │                │
│           ▼             ▼             ▼                │
│      ┌──────────┐  ┌──────────┐  ┌──────────┐          │
│      │   Loki   │  │  Tempo   │  │  Mimir   │          │
│      │  (Logs)  │  │ (Traces) │  │(Metrics) │          │
│      │Port 3100 │  │Port 3200 │  │Port 9009 │          │
│      └────▲─────┘  └────▲─────┘  └────▲─────┘          │
│           │             │             │                │
│           └─────────────┼─────────────┘                │
│                         │                              │
│                ┌───────┴────────┐                      │
│                │ Grafana Alloy  │                      │
│                │  (Collector)   │                      │
│                │  Port 4317/8   │  ← OTLP gRPC/HTTP    │
│                └───────▲────────┘                      │
│                         │                              │
│          OpenTelemetry Instrumented Apps               │
│                                                        │
└────────────────────────────────────────────────────────┘

Critical Pattern: Log-Trace Correlation

The Problem: Logs and traces live in separate systems. You see an error log but can't find the related trace.

The Solution: Inject trace_id and span_id into every log record.

Python (structlog)

import structlog
from opentelemetry import trace

logger = structlog.get_logger()
span = trace.get_current_span()
ctx = span.get_span_context()

logger.info(
    "request_processed",
    trace_id=format(ctx.trace_id, '032x'),  # 32-char hex
    span_id=format(ctx.span_id, '016x'),    # 16-char hex
    user_id=user_id,
)
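
Rather than formatting the IDs at every call site, a structlog processor can inject them into every event automatically. A minimal sketch, assuming the OpenTelemetry SDK is already configured:

import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """Processor: copy the active span's IDs into every log event."""
    ctx = trace.get_current_span().get_span_context()
    if ctx.is_valid:
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
    return event_dict

structlog.configure(
    processors=[
        add_trace_context,
        structlog.processors.JSONRenderer(),  # structured JSON output
    ]
)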

Rust (tracing)

use tracing::{info, instrument};

#[instrument(fields(user_id = %user_id))]
async fn process_request(user_id: u64) -> Result<Response> {
    // trace_id/span_id attached automatically by the tracing-opentelemetry layer
    info!("processing request");
    let result = fetch_response(user_id).await?; // hypothetical helper
    Ok(result)
}

Query in Grafana

{job="api-service"} |= "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"

Quick Setup Guide

1. Choose Your Stack

Decision Tree:

  • Greenfield: OpenTelemetry SDK + LGTM Stack (self-hosted) or Grafana Cloud (managed)
  • Existing Prometheus: Add Loki (logs) + Tempo (traces)
  • Kubernetes: LGTM via Helm, Alloy DaemonSet
  • Zero-ops: Managed SaaS (Grafana Cloud, Datadog, New Relic)

2. Install OpenTelemetry SDK

Manual (Python):

pip install opentelemetry-api opentelemetry-sdk \
    opentelemetry-instrumentation-fastapi \
    opentelemetry-exporter-otlp
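
Installing packages is not enough; the SDK must be wired to an exporter at startup. A minimal sketch, assuming a collector (e.g., Alloy) listening for OTLP gRPC on localhost:4317 and an illustrative service name:

from opentelemetry import trace
from opentelemetry.sdk.resources import Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Name the service so backends can group its telemetry
resource = Resource.create({"service.name": "api-service"})

provider = TracerProvider(resource=resource)
provider.add_span_processor(
    BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
)
trace.set_tracer_provider(provider)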

3. Deploy LGTM Stack

Docker Compose (development):

cd examples/lgtm-docker-compose
docker-compose up -d
# Grafana: http://localhost:3000 (admin/admin)
# OTLP: localhost:4317 (gRPC), localhost:4318 (HTTP)
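
For quick local testing, Grafana also publishes an all-in-one image (grafana/otel-lgtm) that runs the full stack in a single container; a sketch, assuming Docker is available:

docker run -p 3000:3000 -p 4317:4317 -p 4318:4318 grafana/otel-lgtm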

Auto-Instrumentation

OpenTelemetry auto-instruments popular frameworks:

from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()
FastAPIInstrumentor.instrument_app(app)  # Auto-trace all HTTP requests

Supported: FastAPI, Flask, Django, Express, Gin, Echo, Nest.js
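
For Python, the opentelemetry-distro package also provides an opentelemetry-instrument launcher that applies all installed instrumentations with zero code changes. A sketch using the standard OTEL_* environment variables (main.py is a hypothetical entry point):

pip install opentelemetry-distro
OTEL_SERVICE_NAME=api-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4317 \
opentelemetry-instrument python main.py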

Common Patterns

Custom Spans

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("fetch_user_details") as span:
span.set_attribute("user_id", user_id)
user = await db.fetch_user(user_id)
span.set_attribute("user_found", user is not None)

Error Tracking

from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("process_payment") as span:
try:
result = process_payment(amount, card_token)
span.set_status(Status(StatusCode.OK))
except PaymentError as e:
span.set_status(Status(StatusCode.ERROR, str(e)))
span.record_exception(e)
raise

Integration with Other Skills

  • Dashboards: Embed Grafana panels, query Prometheus metrics
  • Feedback: Alert routing (Slack, PagerDuty), notification UI
  • Data-Viz: Time-series charts, trace waterfall, latency heatmaps
  • API Patterns: Auto-instrument REST/GraphQL APIs
  • Message Queues: Trace async job processing

Key Principles

  1. OpenTelemetry is THE standard - Use OTel SDK, not vendor-specific SDKs
  2. Auto-instrumentation first - Prefer auto over manual spans
  3. Always correlate logs and traces - Inject trace_id/span_id into every log
  4. Use structured logging - JSON format, consistent field names
  5. LGTM stack for self-hosting - Production-ready open-source stack

Common Pitfalls

Don't:

  • Use vendor-specific SDKs (use OpenTelemetry)
  • Log without trace_id/span_id context
  • Manually instrument what auto-instrumentation covers
  • Mix logging libraries (pick one: structlog, tracing, slog, pino)

Do:

  • Start with auto-instrumentation
  • Add manual spans only for business-critical operations
  • Use semantic conventions for span attributes (see the sketch after this list)
  • Export to OTLP (gRPC preferred over HTTP)
  • Test locally with LGTM docker-compose before production
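
A minimal sketch of semantic conventions in Python, using the attribute constants shipped in the opentelemetry-semantic-conventions package (span name and values are illustrative):

from opentelemetry import trace
from opentelemetry.semconv.trace import SpanAttributes

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("GET /users/{id}") as span:
    # Standard keys ("http.method", "http.route", ...) instead of ad-hoc names
    span.set_attribute(SpanAttributes.HTTP_METHOD, "GET")
    span.set_attribute(SpanAttributes.HTTP_ROUTE, "/users/{id}")
    span.set_attribute(SpanAttributes.HTTP_STATUS_CODE, 200)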

Success Metrics

  1. 100% of logs include trace_id when in request context
  2. Mean time to resolution (MTTR) decreases by >50%
  3. Developers use Grafana as first debugging tool
  4. 80%+ of telemetry from auto-instrumentation
  5. Alert noise < 5% false positives
