Observability and Monitoring Production Systems: 2025 Guide
You can't fix what you can't see. A comprehensive guide to observability in production systems: OpenTelemetry, distributed tracing, metrics, logs, SLI/SLO/SLA, and practical implementations proven in real deployments.
Outage Cost: $5,600 per minute
According to Gartner, the average cost of IT downtime is $5,600 per minute. For e-commerce during Black Friday? Up to $300,000 per hour. You can't afford hours of debugging.
It's 3 AM. Alerts wake you up: "System is down".
You check the status page - everything green. CPU and memory normal. Logs show only standard entries. But users are screaming they can't place orders.
Where is the problem? Without proper observability you waste precious minutes jumping between dashboards, searching for a needle in a haystack.
This is the difference between monitoring and observability. Monitoring tells you that the system has a problem. Observability tells you why - and leads you directly to the cause. In the world of microservices, distributed systems and cloud deployments, observability isn't a luxury. It's a survival necessity.
01. Observability vs Monitoring - The Difference That Matters
Most companies think monitoring is observability. That's like thinking owning a thermometer means understanding medicine.
The difference is fundamental - and can save you hours of late-night debugging.
Monitoring
Answers the question: "What's happening?"
- You measure known metrics (CPU, RAM, request rate)
- You set alerts on threshold breaches
- You look at dashboards with metrics
- You react to symptoms
Example: an alert on "CPU > 80%" sends a notification, but you don't know which service is causing the problem or why.
Observability
Answers the question: "Why is this happening?"
- You understand the internal state of the system from external signals
- You can ask arbitrary questions about the system
- You trace requests through the entire system (distributed tracing)
- You find the root cause of problems
Example: you see the checkout API is slow, follow the trace, and discover that PaymentService waits 5s for a response from the payment gateway.
Key Difference
Monitoring answers questions you anticipated. Observability allows you to answer questions you didn't anticipate - and those are exactly the questions that arise at 3 AM during production incidents.
This is especially critical in microservices architectures, where a single request can pass through 10+ services before returning an error.
Honeycomb (observability pioneers) defines it as:
"Observability is the ability to understand the internal state of a system by analyzing its external outputs. A system is observable when you can ask arbitrary questions about its behavior without needing to predict the question in advance."
02. Three Pillars of Observability
Observability is based on three data pillars: metrics, logs and distributed tracing.
Each pillar provides a different view of the system. Individually? Limited value. Together? Complete picture of what's happening in production.
Metrics - Numbers Showing Trends
Aggregated numerical values measured over time. They answer the question: "How good/bad is the system's state?"
Example metrics:
- Request rate: 1,200 req/s (how many requests coming in)
- Error rate: 0.5% (percentage of requests ending in error)
- Latency P95: 150ms (95% of requests faster than 150ms)
- CPU usage: 65% (resource utilization)
- Active connections: 450 (how many concurrent connections)
Advantages:
Small data sizes, fast queries, great for alerts and dashboards. You can store metrics from many months without large costs.
Disadvantages:
Lack of context. You see error rate increased, but don't know which user, which endpoint, what conditions caused the error.
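Custom application metrics like these can be emitted straight from code. Below is a minimal sketch using .NET's System.Diagnostics.Metrics API - the meter and metric names are illustrative, chosen so they would match a wildcard registration like AddMeter("OrderService.*") shown later in the OpenTelemetry setup:

using System.Diagnostics.Metrics;

public static class OrderMetrics
{
    // Meter name would match an AddMeter("OrderService.*") registration
    private static readonly Meter Meter = new("OrderService.Metrics", "1.0.0");

    // Counter - rate(orders_created_total) gives the order/request rate
    public static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders_created_total", description: "Orders created");

    // Histogram - exported as buckets, so P95/P99 latency can be computed in Prometheus
    public static readonly Histogram<double> OrderDurationMs =
        Meter.CreateHistogram<double>("order_processing_duration_ms", unit: "ms");
}

// Usage in a handler:
// OrderMetrics.OrdersCreated.Add(1);
// OrderMetrics.OrderDurationMs.Record(stopwatch.Elapsed.TotalMilliseconds);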
Logs - Discrete Events with Context
Text records of what's happening in the application. Each log is a single event with timestamp and context.
Example structured log:
{
  "timestamp": "2025-11-22T14:23:45.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123xyz",
  "user_id": "user_456",
  "message": "Payment gateway timeout",
  "details": {
    "payment_method": "credit_card",
    "amount": 129.99,
    "gateway": "stripe",
    "timeout_ms": 5000
  }
}
Advantages:
Rich context, you can attach any data, great for debugging specific cases. Structured logs (JSON) enable advanced filters.
Disadvantages:
Large data volumes (gigabytes per day in production), expensive storage, hard to see the big picture. Logs alone say nothing about relationships between services.
Traces - Request Journey Through System
Distributed trace shows the exact path of a single request through all microservices, with execution times for each step.
Example trace (checkout request): each span in the waterfall shows which service handled a step and how long it took.
Advantages:
Shows exactly where request spends time, which services communicate, where bottlenecks are. Invaluable in microservices.
Disadvantages:
Requires instrumentation of all services and produces large data volumes (typically only 1-10% of requests are sampled). Configuration is complex.
How These Three Pillars Work Together in Practice
Scenario: Production incident. You got an alert about high API latency. How do you debug?
1. Metrics say: "P95 latency rose from 100ms to 2s at 14:30"
2. Traces show: "Slow requests spend most time in PaymentService"
3. Logs from PaymentService reveal: "Connection pool exhausted - 0 available connections"
Without all three, debugging takes hours. With all three - minutes.
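In practice, the glue between the three pillars is the trace ID. With OpenTelemetry instrumentation active, the current trace context is available via Activity.Current and can be attached to every log line - a minimal sketch (class name and log message are illustrative):

using System.Diagnostics;
using Microsoft.Extensions.Logging;

public class CheckoutLogger
{
    private readonly ILogger<CheckoutLogger> _logger;

    public CheckoutLogger(ILogger<CheckoutLogger> logger) => _logger = logger;

    public void LogPaymentTimeout(string userId)
    {
        // Populated by ASP.NET Core / OpenTelemetry instrumentation for the current request
        var traceId = Activity.Current?.TraceId.ToString() ?? "none";

        // The trace_id in the log lets you jump from this log line to the full trace
        _logger.LogError("Payment gateway timeout for user {UserId}, trace_id={TraceId}",
            userId, traceId);
    }
}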
03. OpenTelemetry - The Observability Standard
For years each vendor had their own format. Prometheus - its own. Jaeger - different. DataDog - yet another.
Integration? Nightmare. Changing vendors? Rewriting all instrumentation. Then came OpenTelemetry - vendor-neutral standard supported by Microsoft, Google, AWS and the entire CNCF. It's the de facto industry standard for observability in 2025.
What is OpenTelemetry (OTel)?
OpenTelemetry is a collection of APIs, SDKs, libraries and tools for automatic and manual application instrumentation. It generates, collects and exports telemetry data: metrics, logs and traces.
Born from the merger of OpenTracing (distributed tracing) and OpenCensus (metrics) in 2019. A CNCF (Cloud Native Computing Foundation) project at "graduated" level - the highest maturity level, alongside Kubernetes, Prometheus and Envoy.
What OTel provides:
- Vendor-neutral - not tied to one vendor
- Single SDK for all pillars (metrics, logs, traces)
- Automatic instrumentation for popular frameworks
- Export to any backend (Prometheus, Jaeger, DataDog, etc.)
Who uses OTel:
- Microsoft (Azure Monitor)
- Google (Cloud Trace)
- AWS (X-Ray)
- Uber, Netflix, Shopify
Why is observability crucial for distributed systems?
In a monolith: 1 process, 1 database, 1 log file. Debug? Check logs, see stack trace, find bug.
In microservices: Request passes through API Gateway → Auth Service → Order Service → Inventory Service → Payment Service → Notification Service. Each has its own logs, its own metrics, its own database. Where's the error? Without distributed tracing - good luck.
Practical Implementation - .NET API with OpenTelemetry
How to add full observability to a .NET 10 application - example from a real production project:
1. Install packages:
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Exporter.Prometheus.AspNetCore   # add --prerelease if this exporter is still in preview
dotnet add package OpenTelemetry.Exporter.Jaeger
2. Configuration in Program.cs:
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using OpenTelemetry.Metrics;
var builder = WebApplication.CreateBuilder(args);
// Define service name and version
var serviceName = "order-api";
var serviceVersion = "1.0.0";
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource => resource
.AddService(serviceName: serviceName, serviceVersion: serviceVersion))
// Traces configuration
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.Filter = (httpContext) =>
{
// Don't trace health checks
return !httpContext.Request.Path.StartsWithSegments("/health");
};
})
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation(options =>
{
options.RecordException = true;
options.SetDbStatementForText = true;
})
.AddSource("OrderService.*") // Custom traces
.AddJaegerExporter(options =>
{
options.AgentHost = "jaeger";
options.AgentPort = 6831;
}))
// Metrics configuration
.WithMetrics(metrics => metrics
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddMeter("OrderService.*") // Custom metrics
.AddPrometheusExporter());
var app = builder.Build();
// Expose Prometheus metrics endpoint
app.MapPrometheusScrapingEndpoint("/metrics");
app.Run();
Custom Traces and Spans
Automatic instrumentation is a great start, but for critical operations you should add custom spans:
using System.Diagnostics;
using OpenTelemetry.Trace; // RecordException extension method

public class OrderService
{
    private static readonly ActivitySource ActivitySource =
        new("OrderService.Processing");

    // Dependencies injected via DI (interface names are illustrative)
    private readonly IInventoryService _inventoryService;
    private readonly IPaymentService _paymentService;
    private readonly IOrderRepository _repository;

    public OrderService(
        IInventoryService inventoryService,
        IPaymentService paymentService,
        IOrderRepository repository)
    {
        _inventoryService = inventoryService;
        _paymentService = paymentService;
        _repository = repository;
    }

    public async Task<Order> ProcessOrderAsync(CreateOrderRequest request)
    {
        using var activity = ActivitySource.StartActivity("ProcessOrder");
        activity?.SetTag("order.amount", request.TotalAmount);
        activity?.SetTag("order.items", request.Items.Count);
        activity?.SetTag("user.id", request.UserId);

        try
        {
            // Validate inventory
            using (var inventoryActivity = ActivitySource.StartActivity("CheckInventory"))
            {
                await _inventoryService.ValidateStockAsync(request.Items);
                inventoryActivity?.SetTag("inventory.valid", true);
            }

            // Process payment
            using (var paymentActivity = ActivitySource.StartActivity("ProcessPayment"))
            {
                var payment = await _paymentService.ChargeAsync(
                    request.UserId,
                    request.TotalAmount);
                paymentActivity?.SetTag("payment.id", payment.Id);
                paymentActivity?.SetTag("payment.method", payment.Method);
            }

            // Create order
            var order = await _repository.CreateOrderAsync(request);
            activity?.SetTag("order.id", order.Id);

            return order;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            throw;
        }
    }
}
This code automatically creates a distributed trace showing the duration of each operation with business context (order ID, payment method, etc.).
04. Prometheus and Grafana - Metrics in Practice
Prometheus is the de facto standard for metrics in the Kubernetes and microservices ecosystem. Used by Google, Uber, SoundCloud, DigitalOcean and thousands of other companies.
Grafana is the visualization platform that turns raw metrics into readable dashboards. Together they create a powerful, open-source monitoring stack used worldwide.
Setup Prometheus + Grafana in Kubernetes
If you're using Kubernetes (especially Azure AKS), the fastest way is the kube-prometheus-stack Helm chart - complete stack out-of-the-box:
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install complete stack (Prometheus + Grafana + AlertManager + Node Exporter)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123 \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

# Port-forward Grafana (dev)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
This installs a complete stack in 2 minutes: Prometheus for scraping metrics, Grafana with predefined dashboards, AlertManager for alerts.
Key Metrics to Track (Golden Signals)
Google SRE defines the "Golden Signals" - the four metrics that say the most about system health:
1. Latency (Delay)
How long it takes to handle a request
PromQL query:
# P95 latency over the last 5 minutes
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
Alert: P95 > 500ms for 5 minutes
2. Traffic (Load)
How many requests are coming to the system
PromQL query:
# Requests per second
rate(http_requests_total[5m])
Alert: Traffic rose 3x above baseline (possible DDoS attack)
3. Errors
Percentage of requests ending in error
PromQL query:
# Error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m]) * 100
Alert: Error rate > 1% for 5 minutes
4. Saturation (Capacity)
How heavily the system is loaded
PromQL query:
# CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Alert: CPU > 80% or Memory > 85% for 10 minutes
Example Grafana Dashboards
Instead of building from scratch, use ready-made dashboards from the Grafana dashboards library:
- Dashboard 3662 - Prometheus 2.0 Overview
- Dashboard 1860 - Node Exporter Full
- Dashboard 7645 - Kubernetes Cluster Monitoring
- Dashboard 12708 - .NET Core / ASP.NET Core
05. SLI, SLO, SLA - Measuring Reliability
You have metrics. You have dashboards. But how do you determine if your system is "good enough"?
This is where the SLI/SLO/SLA framework comes in - Google Site Reliability Engineering's methodology for objectively measuring and ensuring system reliability. It's the difference between "I feel the system is running OK" and "I know exactly what level of service I guarantee".
Service Level Indicator
Definition: Quantifiable measure of service level. It's a specific metric saying "how well we're doing".
Example SLI:
- Availability: Percentage of requests ending in success (200-299, 300-399 status)
- Latency: Percentage of requests faster than 300ms
- Throughput: Requests per second
- Correctness: Percentage of operations giving the correct result
Service Level Objective
Definition: Target for SLI. It's a value saying "we want to be better than X for Y time".
Example SLO (for Order API):
- Availability: 99.9% of requests succeed in a 30-day period
- Latency: 95% of requests faster than 300ms in a 30-day period
- Durability: 99.999% of orders saved correctly (zero data loss)
SLO defines internal target. It's for you and your team.
Service Level Agreement
Definition: Contract with customer specifying service level guarantees and consequences of not meeting them.
Example SLA (for SaaS platform):
Uptime Commitment:
- 99.9% uptime in monthly billing period
- Maximum 43.2 minutes downtime/month allowed
Breach consequences:
- 99.0-99.9% uptime: 10% refund of monthly costs
- 95.0-99.0% uptime: 25% refund of monthly costs
- <95.0% uptime: 50% refund of monthly costs
An SLA is an external commitment. It has legal and financial consequences.
Relationship Between SLI/SLO/SLA
Key principle: SLO should be more restrictive than SLA. You need a buffer.
Example hierarchy: the SLA promises customers 99.5% uptime, while the internal SLO targets 99.9% - the gap is your reaction buffer.
If SLI falls below SLO, you know you must act before breaking SLA. This gives you time to react.
Error Budget - How Many Failures Can You Afford
Error budget is the acceptable percentage of errors resulting from SLO. It's a tool for balancing between reliability and velocity.
Calculating error budget:
If you have SLO = 99.9% availability:
- Allowed errors: 100% - 99.9% = 0.1%
- In a 30-day month (43,200 minutes): 43.2 minutes of downtime OK
- At 1M requests/month: 1,000 errors allowed
If you have error budget:
You can risk new features, aggressive deployments, experiments. Innovation > stability.
If you've exhausted error budget:
Freeze new features, focus on stability, more rigorous code reviews, more tests. Stability > innovation.
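The arithmetic behind an error budget is simple enough to automate as a guard in your release process. A minimal sketch with example inputs (the SLO value and request counts are illustrative):

public static class ErrorBudget
{
    // e.g. AllowedDowntimeMinutes(99.9, 30) -> roughly 43.2 minutes
    public static double AllowedDowntimeMinutes(double sloPercent, int windowDays)
        => windowDays * 24 * 60 * (1 - sloPercent / 100.0);

    // Remaining failures before the budget is spent; negative means "freeze features"
    public static double RemainingFailures(double sloPercent, long totalRequests, long failedRequests)
    {
        var allowedFailures = totalRequests * (1 - sloPercent / 100.0);
        return allowedFailures - failedRequests;
    }
}

// Example: 1,000,000 requests at a 99.9% SLO allows ~1,000 failures;
// ErrorBudget.RemainingFailures(99.9, 1_000_000, 400) -> ~600 failures of budget left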
06. Production Best Practices
1. Structured Logging - Logs as Data, Not Text
Forget about plain text logs. In 2025 every log is JSON with context.
Bad - Plain text:
ERROR: Payment failed for user 123
Hard to search, filter, analyze trends
Good - Structured JSON:
{
  "level": "ERROR",
  "timestamp": "2025-11-22T14:30:00Z",
  "message": "Payment failed",
  "user_id": "123",
  "payment_method": "credit_card",
  "amount": 99.99,
  "error_code": "insufficient_funds",
  "trace_id": "abc123xyz"
}
Easy to query, aggregate, correlate with traces
In .NET use Serilog with JSON sink. In Node.js - Winston or Pino. Always add trace_id for correlation with traces.
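A minimal Serilog setup along those lines - this sketch assumes the Serilog.AspNetCore and Serilog.Formatting.Compact packages and writes compact JSON to stdout:

using Serilog;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

// Emit every log entry as compact JSON, enriched with properties pushed via LogContext
builder.Host.UseSerilog((context, loggerConfig) => loggerConfig
    .ReadFrom.Configuration(context.Configuration)   // levels/overrides from appsettings.json
    .Enrich.FromLogContext()                          // picks up CorrelationId, trace context, etc.
    .WriteTo.Console(new CompactJsonFormatter()));    // structured JSON to stdout

var app = builder.Build();
app.Run();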
2. Correlation IDs - Track Requests Through System
In distributed systems one user request triggers 10+ services. How do you connect all those logs?
Implementation in API Gateway:
// Generate correlation ID at entry point (API Gateway)
// Requires: using Serilog.Context; (for LogContext)
app.Use(async (context, next) =>
{
    var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
        ?? Guid.NewGuid().ToString();

    // Indexer assignment won't throw if the header is already present
    context.Response.Headers["X-Correlation-ID"] = correlationId;

    // Add to all logs emitted while handling this request
    using (LogContext.PushProperty("CorrelationId", correlationId))
    {
        await next();
    }
});

// Propagate to downstream services (read the incoming header and forward it)
httpClient.DefaultRequestHeaders.Add("X-Correlation-ID", correlationId);
Now every log from this request has the same correlation_id. You can search for all logs related to a specific user request.
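Rather than adding the header by hand on every outgoing call, propagation can be centralized in a DelegatingHandler registered on your HttpClients. A sketch assuming IHttpContextAccessor is registered and implicit usings are enabled (the class name and client name are illustrative):

using Microsoft.AspNetCore.Http; // IHttpContextAccessor

public class CorrelationIdHandler : DelegatingHandler
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public CorrelationIdHandler(IHttpContextAccessor httpContextAccessor)
        => _httpContextAccessor = httpContextAccessor;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Reuse the correlation ID assigned by the middleware at the entry point
        var correlationId = _httpContextAccessor.HttpContext?.Request
            .Headers["X-Correlation-ID"].FirstOrDefault();

        if (!string.IsNullOrEmpty(correlationId))
        {
            request.Headers.TryAddWithoutValidation("X-Correlation-ID", correlationId);
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// Registration in Program.cs:
// builder.Services.AddHttpContextAccessor();
// builder.Services.AddTransient<CorrelationIdHandler>();
// builder.Services.AddHttpClient("payments").AddHttpMessageHandler<CorrelationIdHandler>();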
3. Sampling - Don't Trace Every Request
Tracing every request means huge storage costs and performance overhead. In production sample 1-10% of requests.
Sampling strategies:
- Random sampling: 5% randomly selected requests
- Error-based: Always trace requests ending in error
- Latency-based: Always trace slow requests (> 1s)
- Adaptive: Increase sampling when error rate rises
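In OpenTelemetry .NET, head-based sampling is configured with a sampler on the tracing pipeline. A minimal sketch of 5% parent-based sampling that slots into the Program.cs configuration shown earlier - the ratio is an example, and error- or latency-based strategies are typically implemented as tail sampling in the OpenTelemetry Collector rather than in the SDK:

using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        // Keep the parent's decision for downstream services; sample ~5% of new root traces
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05)))
        .AddAspNetCoreInstrumentation()
        .AddJaegerExporter());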
4. Alert Fatigue - Don't Alert on Everything
I've seen teams with 200+ alerts daily. The result? They ignore them all, including the critical ones.
Good alerting practices:
- ✓ Alert only on symptoms visible to users (latency, errors, downtime)
- ✓ Every alert requires action - if not, it's a notification, not an alert
- ✓ Use severity: P1 (wake people at 3 AM), P2 (handle during work hours), P3 (info)
- ✓ Add a playbook to every alert: "What to do when you get this alert?"
Anti-practices:
- ✗ Alerting on transient issues (a CPU spike for 10s may be OK)
- ✗ Alerts without context ("CPU high" - but which service? which node?)
- ✗ Identical alerts from 10 services (aggregate them into one)
5. Cost of Observability - It Can Be Expensive
Full observability typically costs 10-20% of the infrastructure budget. You must balance visibility against cost.
Example costs (average 5-node cluster):
Managed services (DataDog, New Relic) can cost 2-3x more, but you save setup and maintenance time.
07. Frequently Asked Questions
Can I use only monitoring without full observability?
Yes, if you have a simple monolith with little traffic. Basic monitoring (CPU, RAM, request count, error rate) is enough for small applications. But once you move to microservices or have > 50k requests/day, basic monitoring stops being sufficient - you lose too much time on debugging.
OpenTelemetry vs vendor-specific solutions (DataDog, New Relic)?
OpenTelemetry gives you vendor neutrality - you can change backends without changing instrumentation, but it requires more setup and maintenance. Vendor solutions (DataDog, New Relic) are plug-and-play, but expensive and come with lock-in.
Recommendation: Start with OTel + self-hosted stack (Prometheus/Grafana/Jaeger) to avoid vendor lock-in. As budget grows, consider managed service for convenience.
Does distributed tracing slow down the application?
Yes, but minimally. OpenTelemetry adds <1ms overhead per request (with 100% sampling). With 5% sampling overhead is practically unnoticeable (<0.05ms). It's an acceptable cost for the visibility you get. Without tracing, one incident costs you hours of debugging.
How long to retain metrics, logs and traces?
Standard retention strategy:
- Metrics: 30-90 days at high resolution, then downsample to lower resolution for a year
- Logs: 7-14 days hot storage (fast queries), 30-90 days cold storage (archive)
- Traces: 7-14 days (large data volumes, older traces are rarely needed)
What tools for small team/startup?
For a team of fewer than 10 people with a budget of up to $500/month:
Option 1: Self-hosted (cheapest)
- Prometheus + Grafana (metrics)
- Loki (logs)
- Jaeger (traces)
- Cost: ~$100-150/month (infrastructure only)
Option 2: Managed, free tiers
- Azure Monitor / Application Insights (if on Azure)
- Grafana Cloud (free tier: 10k series, 50GB logs)
- Sentry (error tracking, free tier: 5k events/month)
Key Takeaways - Observability in 2025
- Monitoring vs Observability - they're not the same. Monitoring shows "what", observability explains "why". In distributed systems and microservices, observability isn't optional - it's necessary.
- Three pillars: metrics, logs, traces - each provides a different view. Metrics show trends, logs provide context, distributed tracing follows requests through the entire system. All three together = full visibility.
- OpenTelemetry = industry standard - vendor-neutral, supported by the biggest players (Microsoft, Google, AWS). Invest in OTel now, save yourself from rewriting instrumentation when changing vendors.
- SLI/SLO/SLA framework - an objective way to measure reliability. Error budget balances innovation vs stability - when you exhaust the budget, freeze features and fix issues.
- Alert fatigue is real - 200 alerts daily = you ignore them all. Alert only on what requires immediate action and affects users.
- ROI of observability is obvious - cost: 10-20% of infrastructure budget. Benefit: debugging in minutes instead of hours, less downtime, smaller business losses. One prevented incident pays for a month of the stack.
I Implement Observability in Production Systems
I help teams implement comprehensive observability stack: OpenTelemetry instrumentation for .NET/Node.js, Prometheus/Grafana/Jaeger setup in Kubernetes, defining SLI/SLO aligned with business goals, and alerting strategy that eliminates alert fatigue.
From initial setup through tuning to production-ready monitoring - you'll end up with a system where debugging incidents takes minutes, not hours.
Sources
- [1] Google SRE Book - Site Reliability Engineering - https://sre.google/sre-book/table-of-contents/
- [2] OpenTelemetry - Official Documentation - https://opentelemetry.io/docs/
- [3] Prometheus - Official Documentation - https://prometheus.io/docs/introduction/overview/
- [4] Grafana - Official Documentation - https://grafana.com/docs/
- [5] Jaeger - Distributed Tracing Documentation - https://www.jaegertracing.io/docs/
- [6] CNCF - Cloud Native Computing Foundation - https://www.cncf.io/
- [7] Honeycomb - Observability Best Practices - https://www.honeycomb.io/what-is-observability