Kubernetes Observability: Metrics, Logs, and Traces Done Right

The difference between a Kubernetes cluster you can operate confidently and one that feels like a black box is observability. Not monitoring—observability. The distinction matters: monitoring tells you when things break. Observability lets you understand why.

The three pillars are well established: metrics, logs, and traces. Here’s how to implement them properly.

The Stack

The de facto standard for Kubernetes observability in 2026:

Metrics: Prometheus + Grafana (kube-prometheus-stack)
Logs: Loki + Grafana (or OpenSearch if you need full-text search at scale)
Traces: Tempo + Grafana (or Jaeger for standalone UI)
Instrumentation: OpenTelemetry

The Grafana ecosystem (Grafana, Loki, Tempo, Mimir) has a strong advantage: the components integrate with each other, and you use one UI for all three pillars. If you’re starting fresh, it’s a sensible default.

Metrics with Prometheus

The kube-prometheus-stack Helm chart deploys everything you need:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: ">=55.0.0 <60.0.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  values:
    grafana:
      enabled: true
      adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
      ingress:
        enabled: true
        ingressClassName: nginx
        hosts:
          - grafana.internal.example.com
    prometheus:
      prometheusSpec:
        retention: 30d
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi

This gives you:

Prometheus scraping all cluster components
Alertmanager for alert routing
Grafana with default Kubernetes dashboards
Node Exporter on every node
kube-state-metrics for Kubernetes object metrics

ServiceMonitors: The Right Way to Scrape

Don’t add scrape configs to Prometheus directly. Use ServiceMonitor resources:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Must match the Prometheus resource's serviceMonitorSelector (the chart defaults to its release label)
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  endpoints:
    - port: http
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s

The Prometheus Operator picks up every ServiceMonitor whose labels match that selector and regenerates the scrape configuration for you. Your application teams can deploy their own monitoring configuration without touching the Prometheus server.
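A ServiceMonitor selects Services, not Pods, and its port field refers to a named Service port. As a minimal sketch, a Service that the ServiceMonitor above would match could look like this; the port numbers are assumptions for illustration:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app
  labels:
    app.kubernetes.io/name: my-app  # Matched by the ServiceMonitor's spec.selector
spec:
  selector:
    app.kubernetes.io/name: my-app  # Pods backing this Service
  ports:
    - name: http        # The ServiceMonitor's endpoints[].port refers to this name
      port: 80          # Assumed Service port
      targetPort: 8080  # Assumed container port serving /metrics
```

If the port has no name, or the name differs from the one in the ServiceMonitor, no scrape target is generated for it.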

Alerting Rules That Matter

The kube-prometheus-stack ships with good default alerts for the cluster itself. Add application-specific rules alongside them:

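apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Needed with the chart's default ruleSelector, same as for ServiceMonitors
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppHighErrorRate
          # Aggregate both sides with sum(). Dividing the raw series would only
          # pair series with identical labels, so each 5xx series would be
          # divided by itself and the ratio would always be 1.
          expr: |
            sum(rate(http_requests_total{job="my-app", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="my-app"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "My App has high error rate"
            description: "Error rate is {{ humanizePercentage $value }} over the last 5 minutes"

        - alert: MyAppHighLatency
          # Sum the buckets across instances (keeping le) for a service-wide p99.
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[5m]))
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "My App p99 latency is high"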
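Rules only help if the resulting alerts reach someone. The severity label set in the rules above is what Alertmanager routes on. The fragment below is a minimal sketch of the alertmanager.config section of the kube-prometheus-stack values; the receiver name team-chat and the webhook URL are hypothetical placeholders, and the timing values are illustrative rather than recommendations:

```yaml
# Fragment of the kube-prometheus-stack values (sketch, not a complete config)
alertmanager:
  config:
    route:
      receiver: "null"                  # Default: drop anything not routed below
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      routes:
        - receiver: team-chat
          matchers:
            - 'severity="warning"'      # Matches the label set in the rules above
    receivers:
      - name: "null"
      - name: team-chat
        webhook_configs:
          # Hypothetical relay service; replace with your real integration
          - url: http://alert-relay.monitoring.svc.cluster.local:8080/hook
            send_resolved: true
```

Lists in Helm values replace the chart defaults rather than merging with them, so any default routes you still want need to be repeated here.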

Logs with Loki

Loki is like Prometheus, but for logs. It indexes only a small set of labels per log stream (the same label model Prometheus uses for metrics), not the log content itself. This makes it cheap to operate but less capable for full-text search than Elasticsearch. For most use cases, it’s more than sufficient.

Deploy it with the grafana/loki Helm chart (the older loki-stack chart is deprecated):

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: loki
      version: ">=5.0.0 <6.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    loki:
      auth_enabled: false  # Single-tenant: no X-Scope-OrgID header required on pushes
      commonConfig:
        replication_factor: 1
      storage:
        type: filesystem
    singleBinary:
      replicas: 1
      persistence:
        size: 20Gi
        storageClass: longhorn

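For log shipping, Promtail (also from Grafana) tails the Kubernetes pod log files and ships them to Loki. Grafana has since moved Promtail into long-term support in favor of Grafana Alloy, so check its support status before adopting it in a new cluster: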

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: promtail
  namespace: monitoring
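spec:
  interval: 30m
  chart:
    spec:
      chart: promtail
      version: ">=6.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    config:
      clients:
        - url: http://loki:3100/loki/api/v1/push
    tolerations:
      # Also run on control-plane nodes
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule
      # Legacy taint name, still present on older clusters
      - key: node-role.kubernetes.io/master
        operator: Exists
        effect: NoSchedule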

Promtail runs as a DaemonSet, collecting logs from every node. The Kubernetes metadata (namespace, pod name, container name) becomes Loki labels automatically.

Traces with Tempo and OpenTelemetry

Distributed tracing is often the most effective tool for understanding latency problems in microservices. A trace shows you the complete journey of a request across all services, with timing for each hop.

Tempo stores traces. OpenTelemetry instruments your applications.

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: tempo
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: tempo
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    tempo:
      storage:
        trace:
          backend: local
          local:
            path: /var/tempo/traces
    persistence:
      enabled: true
      size: 10Gi
      storageClassName: longhorn

For Spring Boot applications, adding OpenTelemetry takes one dependency in build.gradle and a few lines of configuration:

dependencies {
    implementation 'io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter:2.9.0'
}

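# application.yaml
otel:
  service:
    name: my-app
  exporter:
    otlp:
      endpoint: http://tempo.monitoring.svc.cluster.local:4318  # OTLP over HTTP
  # Sampler settings for the OpenTelemetry starter
  # (management.tracing.* only applies to Micrometer Tracing)
  traces:
    sampler: parentbased_traceidratio
    sampler.arg: 0.1  # Sample 10% of requests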

Zero code changes. The Spring Boot auto-instrumentation captures HTTP requests, database calls, external service calls—all as spans in a trace.
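The exporter and sampling settings can also be supplied through the standard OpenTelemetry SDK environment variables, which keeps environment-specific values out of the application artifact. The Deployment below is a minimal sketch under stated assumptions: the image name and container port are hypothetical, and the endpoint assumes the Tempo release from above is reachable at that service address.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0  # Hypothetical image
          ports:
            - name: http
              containerPort: 8080                   # Assumed application port
          env:
            - name: OTEL_SERVICE_NAME
              value: my-app
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://tempo.monitoring.svc.cluster.local:4318
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: http/protobuf                  # Port 4318 is OTLP over HTTP
            - name: OTEL_TRACES_SAMPLER
              value: parentbased_traceidratio
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"                          # Sample 10% of new traces
```

A parent-based sampler respects the sampling decision of the calling service, so a trace is either recorded end to end or not at all.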

The Grafana Unified View

The real value of the Grafana ecosystem is correlation. When you get an alert about high error rate:

Open Grafana, find the error rate spike on your Prometheus dashboard
Click into the time window, drill down to the service
Switch to Logs view (Loki)—see the actual error messages from that time window
See a trace ID in the logs, click it
Open Tempo—see the full distributed trace for a failing request

From alert to root cause in a handful of clicks. This is what good observability enables.
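The jump from a log line to its trace has to be configured: Grafana needs to know how to find a trace ID in the log text and which data source to open it in. The sketch below uses Grafana’s data source provisioning format with a derived field on the Loki data source. The service URLs, the Tempo query port, the uid value, and above all the regular expression are assumptions; adjust the regex to however your applications actually print trace IDs.

```yaml
# Grafana data source provisioning (sketch). With kube-prometheus-stack, these
# entries can be supplied through the chart's grafana.additionalDataSources value.
apiVersion: 1
datasources:
  - name: Tempo
    type: tempo
    uid: tempo                      # Referenced by the derived field below
    access: proxy
    url: http://tempo.monitoring.svc.cluster.local:3100  # Assumed Tempo HTTP port
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.monitoring.svc.cluster.local:3100
    jsonData:
      derivedFields:
        - name: TraceID
          # Assumes log lines contain e.g. "trace_id=4bf92f3577b34da6a3ce929d0e0e4736"
          matcherRegex: "trace_id=(\\w+)"
          datasourceUid: tempo      # Makes the link open the trace in Tempo
          # "$$" escapes "$" for Grafana's provisioning-time variable expansion;
          # further escaping may be needed if Flux variable substitution is enabled.
          url: "$${__value.raw}"
```

With this in place, every matching log line in the Loki view gets a link that opens the corresponding trace in Tempo.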

What to Actually Alert On

Alert on symptoms, not causes. The four golden signals (from Google’s SRE book):

Latency: How long does it take to serve a request?
Traffic: How many requests per second?
Errors: What’s the error rate?
Saturation: How full are your resources (CPU, memory, disk)?

Don’t alert on CPU > 80%. Alert on request latency exceeding your SLO threshold. High CPU is a cause; high latency is the symptom your users feel.
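A symptom-based latency alert can be expressed directly against an SLO. The rule below is a sketch that assumes an illustrative target of 99% of requests completing within 500ms, and that the application’s http_request_duration_seconds histogram has a bucket boundary at exactly 0.5 seconds; both the threshold and the bucket are assumptions to adapt.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slo
  namespace: my-app
  labels:
    release: kube-prometheus-stack  # Assumes the chart's default ruleSelector
spec:
  groups:
    - name: my-app-slo
      rules:
        - alert: MyAppLatencySLOBreach
          # Fraction of requests slower than 500ms over the last 5 minutes.
          # Requires a histogram bucket with le="0.5".
          expr: |
            1 - (
              sum(rate(http_request_duration_seconds_bucket{job="my-app", le="0.5"}[5m]))
              /
              sum(rate(http_request_duration_seconds_count{job="my-app"}[5m]))
            ) > 0.01
          for: 10m
          labels:
            severity: critical
          annotations:
            summary: "More than 1% of my-app requests are slower than 500ms"
```

Unlike a CPU threshold, this fires only when users are actually affected, whatever the underlying cause turns out to be.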

Closing Thoughts

Observability is infrastructure. It should be deployed the same way your applications are—via GitOps, with version-controlled configuration, with secrets managed properly, with alerts that route to wherever your team actually sees them.

The Grafana ecosystem gives you everything you need. The kube-prometheus-stack gives you metrics from day one. Add Loki for logs and Tempo for traces and you have a complete picture of what your cluster is doing. The marginal cost of adding the full stack versus just metrics is small, and the debugging capability delta is enormous.

Don’t wait until you have an incident to set this up.