The difference between a Kubernetes cluster you can operate confidently and one that feels like a black box is observability. Not monitoring—observability. The distinction matters: monitoring tells you when things break. Observability lets you understand why.
The three pillars are well established: metrics, logs, and traces. Here’s how to implement them properly.
The Stack
The de facto standard for Kubernetes observability in 2026:
- Metrics: Prometheus + Grafana (kube-prometheus-stack)
- Logs: Loki + Grafana (or OpenSearch if you need full-text search at scale)
- Traces: Tempo + Grafana (or Jaeger for standalone UI)
- Instrumentation: OpenTelemetry
The Grafana ecosystem (Grafana, Loki, Tempo, Mimir) has a strong advantage: everything integrates, and you use one UI for all three pillars. If you’re starting fresh, this is the right choice.
Metrics with Prometheus
The kube-prometheus-stack Helm chart deploys everything you need. The example here installs it through a Flux HelmRelease:
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: kube-prometheus-stack
  namespace: monitoring
spec:
  interval: 30m
  chart:
    spec:
      chart: kube-prometheus-stack
      version: ">=55.0.0 <60.0.0"
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
        namespace: flux-system
  values:
    grafana:
      enabled: true
      adminPassword: "${GRAFANA_ADMIN_PASSWORD}"
      ingress:
        enabled: true
        ingressClassName: nginx
        hosts:
          - grafana.internal.example.com
    prometheus:
      prometheusSpec:
        retention: 30d
        storageSpec:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 50Gi
    alertmanager:
      alertmanagerSpec:
        storage:
          volumeClaimTemplate:
            spec:
              storageClassName: longhorn
              resources:
                requests:
                  storage: 10Gi
This gives you:
- Prometheus scraping all cluster components
- Alertmanager for alert routing
- Grafana with default Kubernetes dashboards
- Node Exporter on every node
- kube-state-metrics for Kubernetes object metrics
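The HelmRelease above relies on two things it does not show: a HelmRepository named prometheus-community in flux-system, and a value for ${GRAFANA_ADMIN_PASSWORD}. The following is a minimal sketch of both, assuming Flux's post-build variable substitution is what fills in the password. The Kustomization name, the path, and the Secret name grafana-credentials are hypothetical.

```yaml
# Chart source referenced by the HelmRelease's sourceRef.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: flux-system
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
---
# Flux Kustomization that applies the monitoring manifests.
# postBuild.substituteFrom replaces ${GRAFANA_ADMIN_PASSWORD} with the
# value stored under that key in the referenced Secret.
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: monitoring                        # hypothetical name
  namespace: flux-system
spec:
  interval: 10m
  path: ./clusters/example/monitoring     # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system                     # assumes the default bootstrap repository
  postBuild:
    substituteFrom:
      - kind: Secret
        name: grafana-credentials         # hypothetical; must contain GRAFANA_ADMIN_PASSWORD
```

The Loki, Promtail, and Tempo releases later in this post reference a second HelmRepository named grafana, which would point at https://grafana.github.io/helm-charts in the same way.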
ServiceMonitors: The Right Way to Scrape
Don’t add scrape configs to Prometheus directly. Use ServiceMonitor resources:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: my-app
  labels:
    release: kube-prometheus-stack # Must match the Prometheus resource's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app # Selects Services (not pods) carrying this label
  endpoints:
    - port: http # Name of the Service port, not a port number
      path: /metrics
      interval: 30s
      scrapeTimeout: 10s
The Prometheus Operator discovers ServiceMonitor resources automatically. Your application teams can deploy their own monitoring configuration without touching the Prometheus server.
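A common stumbling block is that a ServiceMonitor selects Services, not pods, and its port field refers to a named Service port. The sketch below shows a Service the ServiceMonitor above would pick up. The port numbers are hypothetical, and it assumes the application's pods carry the same app.kubernetes.io/name label.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: my-app                    # same namespace as the ServiceMonitor
  labels:
    app.kubernetes.io/name: my-app     # matched by the ServiceMonitor's selector
spec:
  selector:
    app.kubernetes.io/name: my-app     # assumes the pods carry this label
  ports:
    - name: http                       # the ServiceMonitor's "port: http" refers to this name
      port: 80                         # hypothetical service port
      targetPort: 8080                 # hypothetical container port serving /metrics
```

If the target never shows up in Prometheus, the usual suspects are a missing label on the Service, an unnamed port, or a release label that does not match the Prometheus selector.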
Alerting Rules That Matter
The kube-prometheus-stack ships with good default alerts for the cluster itself. Add application-specific rules on top of them:
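apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-alerts
  namespace: my-app
  labels:
    release: kube-prometheus-stack # Must match the Prometheus resource's ruleSelector
spec:
  groups:
    - name: my-app
      rules:
        - alert: MyAppHighErrorRate
          # Aggregate with sum() so the ratio is 5xx requests over all requests.
          # Dividing the raw series would only match each 5xx series with itself.
          expr: |
            sum(rate(http_requests_total{job="my-app", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="my-app"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "My App has high error rate"
            description: "Error rate is {{ humanizePercentage $value }} over the last 5 minutes"
        - alert: MyAppHighLatency
          # Sum the buckets by le so the quantile covers the whole service, not each series.
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="my-app"}[5m]))
            ) > 1.0
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "My App p99 latency is high"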
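The severity label on these rules is what Alertmanager can route on. As a rough sketch, the kube-prometheus-stack values accept an Alertmanager configuration under alertmanager.config. The receiver name team-webhook and its URL are hypothetical, and a generic webhook stands in for whatever notification channel your team uses.

```yaml
# Fragment of the kube-prometheus-stack HelmRelease "values:" section.
alertmanager:
  config:
    route:
      receiver: "null"                       # catch-all for alerts not matched below
      group_by: ["alertname", "namespace"]
      routes:
        - receiver: team-webhook
          matchers:
            - severity = "warning"           # matches the label set in the rules above
    receivers:
      - name: "null"                         # receiver with no integrations: alerts are dropped
      - name: team-webhook                   # hypothetical receiver
        webhook_configs:
          - url: http://alert-relay.monitoring.svc.cluster.local:8080/alerts   # hypothetical endpoint
```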
Logs with Loki
Loki is Prometheus for logs. It stores logs indexed by labels (same as Prometheus metrics), not by content. This makes it cheap to operate but less capable for full-text search than Elasticsearch. For most use cases, it’s more than sufficient.
Deploy it with the grafana/loki Helm chart, shown here in single-binary mode (loki-stack is the older bundled alternative):
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: loki
  namespace: monitoring
spec:
  interval: 30m # Required by Flux; same value as the kube-prometheus-stack release
  chart:
    spec:
      chart: loki
      version: ">=5.0.0 <6.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    loki:
      auth_enabled: false # Single-tenant setup; otherwise clients must send a tenant ID
      commonConfig:
        replication_factor: 1
      storage:
        type: filesystem
    singleBinary:
      replicas: 1
      persistence:
        size: 20Gi
        storageClass: longhorn
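For log shipping, Promtail (also from Grafana) reads logs from Kubernetes pod log files and ships them to Loki. Grafana has since moved Promtail toward end of life in favor of Grafana Alloy, so check its support status before adopting it on a new cluster. The Promtail setup looks like this: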
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: promtail
  namespace: monitoring
spec:
  interval: 30m # Required by Flux
  chart:
    spec:
      chart: promtail
      version: ">=6.0.0"
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    config:
      clients:
        - url: http://loki:3100/loki/api/v1/push
    tolerations:
      - key: node-role.kubernetes.io/control-plane # Current control-plane taint
        operator: Exists
        effect: NoSchedule
      - key: node-role.kubernetes.io/master # Legacy taint on older clusters
        operator: Exists
        effect: NoSchedule
Promtail runs as a DaemonSet, collecting logs from every node. The Kubernetes metadata (namespace, pod name, container name) becomes Loki labels automatically.
Traces with Tempo and OpenTelemetry
Distributed tracing is the most impactful observability tool for understanding latency problems in microservices. A trace shows you the complete journey of a request across all services, with timing for each hop.
Tempo stores traces. OpenTelemetry instruments your applications.
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: tempo
  namespace: monitoring
spec:
  interval: 30m # Required by Flux
  chart:
    spec:
      chart: tempo
      sourceRef:
        kind: HelmRepository
        name: grafana
        namespace: flux-system
  values:
    tempo:
      storage:
        trace:
          backend: local
          local:
            path: /var/tempo/traces
    persistence:
      enabled: true
      size: 10Gi
      storageClassName: longhorn
For Spring Boot applications, adding OpenTelemetry is a few lines in build.gradle:
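dependencies {
    implementation 'io.opentelemetry.instrumentation:opentelemetry-spring-boot-starter:2.9.0'
}
# application.yaml
# The OpenTelemetry starter reads otel.* properties; management.tracing.* is
# only honored by Micrometer Tracing, so sampling is configured on the OTel side.
otel.traces.sampler: parentbased_traceidratio
otel.traces.sampler.arg: "0.1" # Sample 10% of requests
otel:
  exporter:
    otlp:
      endpoint: http://tempo.monitoring.svc.cluster.local:4318
  service:
    name: my-app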
Zero code changes. The Spring Boot auto-instrumentation captures HTTP requests, database calls, external service calls—all as spans in a trace.
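The same settings can also be supplied through the standard OpenTelemetry environment variables, which keeps the endpoint and sampling rate in the cluster manifests rather than in the application artifact. This is a sketch under that assumption; the image name and container port are hypothetical.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-app
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/name: my-app
  template:
    metadata:
      labels:
        app.kubernetes.io/name: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0    # hypothetical image
          ports:
            - name: http
              containerPort: 8080                      # hypothetical port
          env:
            - name: OTEL_SERVICE_NAME
              value: my-app
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: http://tempo.monitoring.svc.cluster.local:4318
            - name: OTEL_EXPORTER_OTLP_PROTOCOL
              value: http/protobuf                     # port 4318 is OTLP over HTTP
            - name: OTEL_TRACES_SAMPLER
              value: parentbased_traceidratio
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"                             # sample 10% of new traces
```

Setting these per environment makes it easy to sample more aggressively in staging than in production without rebuilding the application.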
The Grafana Unified View
The real value of the Grafana ecosystem is correlation. When you get an alert about high error rate:
- Open Grafana, find the error rate spike on your Prometheus dashboard
- Click into the time window, drill down to the service
- Switch to Logs view (Loki)—see the actual error messages from that time window
- See a trace ID in the logs, click it
- Open Tempo—see the full distributed trace for a failing request
From alert to root cause in a handful of clicks, provided your logs carry trace IDs and the data sources are linked to each other. This is what good observability enables.
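The jump from a log line to its trace does not happen on its own: Grafana needs a derived field on the Loki data source that extracts the trace ID and links it to Tempo. Below is a sketch of how that could look in the kube-prometheus-stack values. The regex assumes log lines containing trace_id=..., which depends entirely on your log format, and the service URLs assume both charts expose HTTP on port 3100 in the monitoring namespace.

```yaml
# Fragment of the kube-prometheus-stack HelmRelease "values:" section.
grafana:
  additionalDataSources:
    - name: Loki
      type: loki
      uid: loki
      access: proxy
      url: http://loki.monitoring.svc.cluster.local:3100
      jsonData:
        derivedFields:
          - name: TraceID
            matcherRegex: 'trace_id=(\w+)'    # hypothetical; adjust to your log format
            datasourceUid: tempo              # turns the match into an internal link to Tempo
            url: '$${__value.raw}'            # "$$" escapes "$" in Grafana provisioning files
    - name: Tempo
      type: tempo
      uid: tempo
      access: proxy
      url: http://tempo.monitoring.svc.cluster.local:3100   # assumed HTTP port of the Tempo service
```

If Flux variable substitution also processes this manifest, the dollar signs may need an extra level of escaping, so verify the rendered data source in Grafana after the first deploy.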
What to Actually Alert On
Alert on symptoms, not causes. The four golden signals (from Google’s SRE book):
- Latency: How long does it take to serve a request?
- Traffic: How many requests per second?
- Errors: What’s the error rate?
- Saturation: How full are your resources (CPU, memory, disk)?
Don’t alert on CPU > 80%. Alert on request latency > SLO threshold. CPU high is a cause; high latency is the symptom your users feel.
Closing Thoughts
Observability is infrastructure. It should be deployed the same way your applications are—via GitOps, with version-controlled configuration, with secrets managed properly, with alerts that route to wherever your team actually sees them.
The Grafana ecosystem gives you everything you need. The kube-prometheus-stack gives you metrics from day one. Add Loki for logs and Tempo for traces and you have a complete picture of what your cluster is doing. The marginal cost of adding the full stack versus just metrics is small, and the debugging capability delta is enormous.
Don’t wait until you have an incident to set this up.