Observability and Monitoring Production Systems: 2025 Guide
You can't fix what you can't see. A comprehensive guide to observability in production systems: OpenTelemetry, distributed tracing, metrics, logs, SLI/SLO/SLA, and practical implementations proven in real deployments.
Outage Cost: $5,600 per minute
According to Gartner, the average cost of IT downtime is $5,600 per minute. For e-commerce during Black Friday? Up to $300,000 per hour. You can't afford hours of debugging.
It's 3 AM. Alerts wake you up: "System is down".
You check the status page - everything green. CPU and memory normal. Logs show only standard entries. But users are screaming they can't place orders.
Where is the problem? Without proper observability you waste precious minutes jumping between dashboards, searching for a needle in a haystack.
This is the difference between monitoring and observability. Monitoring tells you that the system has a problem. Observability tells you why - and leads you directly to the cause. In the world of microservices, distributed systems and cloud deployments, observability isn't a luxury. It's a survival necessity.
01. Observability vs Monitoring - The Difference That Matters
Most companies think monitoring is observability. That's like thinking owning a thermometer means understanding medicine.
The difference is fundamental - and can save you hours of late-night debugging.
Monitoring
Answers the question: "What's happening?"
- You measure known metrics (CPU, RAM, request rate)
- You set alerts on threshold breaches
- You look at dashboards with metrics
- You react to symptoms
Example: an alert on "CPU > 80%" sends a notification, but you don't know which service is causing the problem or why.
Observability
Answers the question: "Why is this happening?"
- You understand the internal state of the system from external signals
- You can ask arbitrary questions about the system
- You trace requests through the entire system (distributed tracing)
- You find the root cause of problems
Example: you see the checkout API is slow, follow the trace, and discover that PaymentService waits 5s for a response from the payment gateway.
Key Difference
Monitoring answers questions you anticipated. Observability allows you to answer questions you didn't anticipate - and those are exactly the questions that arise at 3 AM during production incidents.
This is especially critical in microservices architectures, where a single request can pass through 10+ services before returning an error.
Honeycomb (observability pioneers) defines it as:
"Observability is the ability to understand the internal state of a system by analyzing its external outputs. A system is observable when you can ask arbitrary questions about its behavior without needing to predict the question in advance."
02. Three Pillars of Observability
Observability is based on three data pillars: metrics, logs and distributed tracing.
Each pillar provides a different view of the system. Individually? Limited value. Together? Complete picture of what's happening in production.
Metrics - Numbers Showing Trends
Aggregated numerical values measured over time. They answer the question: "How good/bad is the system's state?"
Example metrics:
- Request rate: 1,200 req/s (how many requests coming in)
- Error rate: 0.5% (percentage of requests ending in error)
- Latency P95: 150ms (95% of requests faster than 150ms)
- CPU usage: 65% (resource utilization)
- Active connections: 450 (how many concurrent connections)
Advantages:
Small data sizes, fast queries, great for alerts and dashboards. You can store metrics from many months without large costs.
Disadvantages:
Lack of context. You see error rate increased, but don't know which user, which endpoint, what conditions caused the error.
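Custom application metrics like these can be emitted straight from code. Below is a minimal sketch using .NET's System.Diagnostics.Metrics API - the meter and metric names are illustrative, chosen so they would match a wildcard registration like AddMeter("OrderService.*") shown later in the OpenTelemetry setup:

using System.Diagnostics.Metrics;

public static class OrderMetrics
{
    // Meter name would match an AddMeter("OrderService.*") registration
    private static readonly Meter Meter = new("OrderService.Metrics", "1.0.0");

    // Counter - rate(orders_created_total) gives the order/request rate
    public static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders_created_total", description: "Orders created");

    // Histogram - exported as buckets, so P95/P99 latency can be computed in Prometheus
    public static readonly Histogram<double> OrderDurationMs =
        Meter.CreateHistogram<double>("order_processing_duration_ms", unit: "ms");
}

// Usage in a handler:
// OrderMetrics.OrdersCreated.Add(1);
// OrderMetrics.OrderDurationMs.Record(stopwatch.Elapsed.TotalMilliseconds);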
Logs - Discrete Events with Context
Text records of what's happening in the application. Each log is a single event with timestamp and context.
Example structured log:
{
  "timestamp": "2025-11-22T14:23:45.123Z",
  "level": "ERROR",
  "service": "order-service",
  "trace_id": "abc123xyz",
  "user_id": "user_456",
  "message": "Payment gateway timeout",
  "details": {
    "payment_method": "credit_card",
    "amount": 129.99,
    "gateway": "stripe",
    "timeout_ms": 5000
  }
}
Advantages:
Rich context, you can attach any data, great for debugging specific cases. Structured logs (JSON) enable advanced filters.
Disadvantages:
Large data volumes (gigabytes per day in production), expensive storage, hard to see the big picture. Logs alone say nothing about relationships between services.
Traces - Request Journey Through System
Distributed trace shows the exact path of a single request through all microservices, with execution times for each step.
Example trace (checkout request): each span in the waterfall shows which service handled a step and how long it took.
Advantages:
Shows exactly where request spends time, which services communicate, where bottlenecks are. Invaluable in microservices.
Disadvantages:
Requires instrumentation of all services and produces large data volumes (typically only 1-10% of requests are sampled). Configuration is complex.
How These Three Pillars Work Together in Practice
Scenario: Production incident. You got an alert about high API latency. How do you debug?
1. Metrics say: "P95 latency rose from 100ms to 2s at 14:30"
2. Traces show: "Slow requests spend most time in PaymentService"
3. Logs from PaymentService reveal: "Connection pool exhausted - 0 available connections"
Without all three, debugging takes hours. With all three - minutes.
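In practice, the glue between the three pillars is the trace ID. With OpenTelemetry instrumentation active, the current trace context is available via Activity.Current and can be attached to every log line - a minimal sketch (class name and log message are illustrative):

using System.Diagnostics;
using Microsoft.Extensions.Logging;

public class CheckoutLogger
{
    private readonly ILogger<CheckoutLogger> _logger;

    public CheckoutLogger(ILogger<CheckoutLogger> logger) => _logger = logger;

    public void LogPaymentTimeout(string userId)
    {
        // Populated by ASP.NET Core / OpenTelemetry instrumentation for the current request
        var traceId = Activity.Current?.TraceId.ToString() ?? "none";

        // The trace_id in the log lets you jump from this log line to the full trace
        _logger.LogError("Payment gateway timeout for user {UserId}, trace_id={TraceId}",
            userId, traceId);
    }
}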
03. OpenTelemetry - The Observability Standard
For years each vendor had their own format. Prometheus - its own. Jaeger - different. DataDog - yet another.
Integration? Nightmare. Changing vendors? Rewriting all instrumentation. Then came OpenTelemetry - vendor-neutral standard supported by Microsoft, Google, AWS and the entire CNCF. It's the de facto industry standard for observability in 2025.
What is OpenTelemetry (OTel)?
OpenTelemetry is a collection of APIs, SDKs, libraries and tools for automatic and manual application instrumentation. It generates, collects and exports telemetry data: metrics, logs and traces.
Born from the merger of OpenTracing (distributed tracing) and OpenCensus (metrics) in 2019. A CNCF (Cloud Native Computing Foundation) project at "graduated" level - the highest maturity level, alongside Kubernetes, Prometheus and Envoy.
What OTel provides:
- Vendor-neutral - not tied to one vendor
- Single SDK for all pillars (metrics, logs, traces)
- Automatic instrumentation for popular frameworks
- Export to any backend (Prometheus, Jaeger, DataDog, etc.)
Who uses OTel:
- Microsoft (Azure Monitor)
- Google (Cloud Trace)
- AWS (X-Ray)
- Uber, Netflix, Shopify
Why is observability crucial for distributed systems?
In a monolith: 1 process, 1 database, 1 log file. Debug? Check logs, see stack trace, find bug.
In microservices: Request passes through API Gateway → Auth Service → Order Service → Inventory Service → Payment Service → Notification Service. Each has its own logs, its own metrics, its own database. Where's the error? Without distributed tracing - good luck.
Practical Implementation - .NET API with OpenTelemetry
How to add full observability to a .NET 10 application - example from a real production project:
1. Install packages:
dotnet add package OpenTelemetry.Extensions.Hosting
dotnet add package OpenTelemetry.Instrumentation.AspNetCore
dotnet add package OpenTelemetry.Instrumentation.Http
dotnet add package OpenTelemetry.Instrumentation.SqlClient
dotnet add package OpenTelemetry.Exporter.Prometheus.AspNetCore   # add --prerelease if this exporter is still in preview
dotnet add package OpenTelemetry.Exporter.Jaeger
2. Configuration in Program.cs:
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
using OpenTelemetry.Metrics;
var builder = WebApplication.CreateBuilder(args);
// Define service name and version
var serviceName = "order-api";
var serviceVersion = "1.0.0";
builder.Services.AddOpenTelemetry()
.ConfigureResource(resource => resource
.AddService(serviceName: serviceName, serviceVersion: serviceVersion))
// Traces configuration
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation(options =>
{
options.RecordException = true;
options.Filter = (httpContext) =>
{
// Don't trace health checks
return !httpContext.Request.Path.StartsWithSegments("/health");
};
})
.AddHttpClientInstrumentation()
.AddSqlClientInstrumentation(options =>
{
options.RecordException = true;
options.SetDbStatementForText = true;
})
.AddSource("OrderService.*") // Custom traces
.AddJaegerExporter(options =>
{
options.AgentHost = "jaeger";
options.AgentPort = 6831;
}))
// Metrics configuration
.WithMetrics(metrics => metrics
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddRuntimeInstrumentation()
.AddMeter("OrderService.*") // Custom metrics
.AddPrometheusExporter());
var app = builder.Build();
// Expose Prometheus metrics endpoint
app.MapPrometheusScrapingEndpoint("/metrics");
app.Run();
Custom Traces and Spans
Automatic instrumentation is a great start, but for critical operations you should add custom spans:
using System.Diagnostics;
using OpenTelemetry.Trace; // RecordException extension method

public class OrderService
{
    private static readonly ActivitySource ActivitySource =
        new("OrderService.Processing");

    // Dependencies injected via DI (interface names are illustrative)
    private readonly IInventoryService _inventoryService;
    private readonly IPaymentService _paymentService;
    private readonly IOrderRepository _repository;

    public OrderService(
        IInventoryService inventoryService,
        IPaymentService paymentService,
        IOrderRepository repository)
    {
        _inventoryService = inventoryService;
        _paymentService = paymentService;
        _repository = repository;
    }

    public async Task<Order> ProcessOrderAsync(CreateOrderRequest request)
    {
        using var activity = ActivitySource.StartActivity("ProcessOrder");
        activity?.SetTag("order.amount", request.TotalAmount);
        activity?.SetTag("order.items", request.Items.Count);
        activity?.SetTag("user.id", request.UserId);

        try
        {
            // Validate inventory
            using (var inventoryActivity = ActivitySource.StartActivity("CheckInventory"))
            {
                await _inventoryService.ValidateStockAsync(request.Items);
                inventoryActivity?.SetTag("inventory.valid", true);
            }

            // Process payment
            using (var paymentActivity = ActivitySource.StartActivity("ProcessPayment"))
            {
                var payment = await _paymentService.ChargeAsync(
                    request.UserId,
                    request.TotalAmount);
                paymentActivity?.SetTag("payment.id", payment.Id);
                paymentActivity?.SetTag("payment.method", payment.Method);
            }

            // Create order
            var order = await _repository.CreateOrderAsync(request);
            activity?.SetTag("order.id", order.Id);

            return order;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            activity?.RecordException(ex);
            throw;
        }
    }
}
This code automatically creates a distributed trace showing the duration of each operation with business context (order ID, payment method, etc.).
04. Prometheus and Grafana - Metrics in Practice
Prometheus is the de facto standard for metrics in the Kubernetes and microservices ecosystem. Used by Google, Uber, SoundCloud, DigitalOcean and thousands of other companies.
Grafana is the visualization platform that turns raw metrics into readable dashboards. Together they create a powerful, open-source monitoring stack used worldwide.
Setup Prometheus + Grafana in Kubernetes
If you're using Kubernetes (especially Azure AKS), the fastest way is the kube-prometheus-stack Helm chart - complete stack out-of-the-box:
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Install complete stack (Prometheus + Grafana + AlertManager + Node Exporter)
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi \
  --set grafana.adminPassword=admin123 \
  --set grafana.persistence.enabled=true \
  --set grafana.persistence.size=10Gi

# Port-forward Grafana (dev)
kubectl port-forward -n monitoring svc/kube-prometheus-stack-grafana 3000:80
This installs a complete stack in 2 minutes: Prometheus for scraping metrics, Grafana with predefined dashboards, AlertManager for alerts.
Key Metrics to Track (Golden Signals)
Google SRE defines the "Golden Signals" - the four metrics that say the most about system health:
1. Latency (Delay)
How long it takes to handle a request
PromQL query:
# P95 latency over the last 5 minutes
histogram_quantile(0.95,
  rate(http_request_duration_seconds_bucket[5m])
)
Alert: P95 > 500ms for 5 minutes
2. Traffic (Load)
How many requests are coming to the system
PromQL query:
# Requests per second
rate(http_requests_total[5m])
Alert: Traffic rose 3x above baseline (possible DDoS attack)
3. Errors
Percentage of requests ending in error
PromQL query:
# Error rate (5xx responses)
rate(http_requests_total{status=~"5.."}[5m])
/
rate(http_requests_total[5m]) * 100
Alert: Error rate > 1% for 5 minutes
4. Saturation (Capacity)
How heavily the system is loaded
PromQL query:
# CPU usage
100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# Memory usage
(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100
Alert: CPU > 80% or Memory > 85% for 10 minutes
Example Grafana Dashboards
Instead of building from scratch, use ready-made dashboards from the Grafana dashboards library:
- Dashboard 3662 - Prometheus 2.0 Overview
- Dashboard 1860 - Node Exporter Full
- Dashboard 7645 - Kubernetes Cluster Monitoring
- Dashboard 12708 - .NET Core / ASP.NET Core
05. SLI, SLO, SLA - Measuring Reliability
You have metrics. You have dashboards. But how do you determine if your system is "good enough"?
This is where the SLI/SLO/SLA framework comes in - Google Site Reliability Engineering's methodology for objectively measuring and ensuring system reliability. It's the difference between "I feel the system is running OK" and "I know exactly what level of service I guarantee".
Service Level Indicator
Definition: Quantifiable measure of service level. It's a specific metric saying "how well we're doing".
Example SLI:
- Availability: Percentage of requests ending in success (200-299, 300-399 status)
- Latency: Percentage of requests faster than 300ms
- Throughput: Requests per second
- Correctness: Percentage of operations giving the correct result
Service Level Objective
Definition: Target for SLI. It's a value saying "we want to be better than X for Y time".
Example SLO (for Order API):
- Availability: 99.9% of requests succeed in a 30-day period
- Latency: 95% of requests faster than 300ms in a 30-day period
- Durability: 99.999% of orders saved correctly (zero data loss)
SLO defines internal target. It's for you and your team.
Service Level Agreement
Definition: Contract with customer specifying service level guarantees and consequences of not meeting them.
Example SLA (for SaaS platform):
Uptime Commitment:
- 99.9% uptime in monthly billing period
- Maximum 43.2 minutes downtime/month allowed
Breach consequences:
- 99.0-99.9% uptime: 10% refund of monthly costs
- 95.0-99.0% uptime: 25% refund of monthly costs
- <95.0% uptime: 50% refund of monthly costs
An SLA is an external commitment. It has legal and financial consequences.
Relationship Between SLI/SLO/SLA
Key principle: SLO should be more restrictive than SLA. You need a buffer.
Example hierarchy: the SLA promises customers 99.5% uptime, while the internal SLO targets 99.9% - the gap is your reaction buffer.
If SLI falls below SLO, you know you must act before breaking SLA. This gives you time to react.
Error Budget - How Many Failures Can You Afford
Error budget is the acceptable percentage of errors resulting from SLO. It's a tool for balancing between reliability and velocity.
Calculating error budget:
If you have SLO = 99.9% availability:
- Allowed errors: 100% - 99.9% = 0.1%
- In a 30-day month (43,200 minutes): 43.2 minutes of downtime OK
- At 1M requests/month: 1,000 errors allowed
If you have error budget:
You can risk new features, aggressive deployments, experiments. Innovation > stability.
If you've exhausted error budget:
Freeze new features, focus on stability, more rigorous code reviews, more tests. Stability > innovation.
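The arithmetic behind an error budget is simple enough to automate as a guard in your release process. A minimal sketch with example inputs (the SLO value and request counts are illustrative):

public static class ErrorBudget
{
    // e.g. AllowedDowntimeMinutes(99.9, 30) -> roughly 43.2 minutes
    public static double AllowedDowntimeMinutes(double sloPercent, int windowDays)
        => windowDays * 24 * 60 * (1 - sloPercent / 100.0);

    // Remaining failures before the budget is spent; negative means "freeze features"
    public static double RemainingFailures(double sloPercent, long totalRequests, long failedRequests)
    {
        var allowedFailures = totalRequests * (1 - sloPercent / 100.0);
        return allowedFailures - failedRequests;
    }
}

// Example: 1,000,000 requests at a 99.9% SLO allows ~1,000 failures;
// ErrorBudget.RemainingFailures(99.9, 1_000_000, 400) -> ~600 failures of budget left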
06. Production Best Practices
1. Structured Logging - Logs as Data, Not Text
Forget about plain text logs. In 2025 every log is JSON with context.
Bad - Plain text:
ERROR: Payment failed for user 123
Hard to search, filter, analyze trends
Good - Structured JSON:
{
  "level": "ERROR",
  "timestamp": "2025-11-22T14:30:00Z",
  "message": "Payment failed",
  "user_id": "123",
  "payment_method": "credit_card",
  "amount": 99.99,
  "error_code": "insufficient_funds",
  "trace_id": "abc123xyz"
}
Easy to query, aggregate, correlate with traces
In .NET use Serilog with JSON sink. In Node.js - Winston or Pino. Always add trace_id for correlation with traces.
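A minimal Serilog setup along those lines - this sketch assumes the Serilog.AspNetCore and Serilog.Formatting.Compact packages and writes compact JSON to stdout:

using Serilog;
using Serilog.Formatting.Compact;

var builder = WebApplication.CreateBuilder(args);

// Emit every log entry as compact JSON, enriched with properties pushed via LogContext
builder.Host.UseSerilog((context, loggerConfig) => loggerConfig
    .ReadFrom.Configuration(context.Configuration)   // levels/overrides from appsettings.json
    .Enrich.FromLogContext()                          // picks up CorrelationId, trace context, etc.
    .WriteTo.Console(new CompactJsonFormatter()));    // structured JSON to stdout

var app = builder.Build();
app.Run();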
2. Correlation IDs - Track Requests Through System
In distributed systems one user request triggers 10+ services. How do you connect all those logs?
Implementation in API Gateway:
// Generate correlation ID at entry point (API Gateway)
// Requires: using Serilog.Context; (for LogContext)
app.Use(async (context, next) =>
{
    var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
        ?? Guid.NewGuid().ToString();

    // Indexer assignment won't throw if the header is already present
    context.Response.Headers["X-Correlation-ID"] = correlationId;

    // Add to all logs emitted while handling this request
    using (LogContext.PushProperty("CorrelationId", correlationId))
    {
        await next();
    }
});

// Propagate to downstream services (read the incoming header and forward it)
httpClient.DefaultRequestHeaders.Add("X-Correlation-ID", correlationId);
Now every log from this request has the same correlation_id. You can search for all logs related to a specific user request.
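Rather than adding the header by hand on every outgoing call, propagation can be centralized in a DelegatingHandler registered on your HttpClients. A sketch assuming IHttpContextAccessor is registered and implicit usings are enabled (the class name and client name are illustrative):

using Microsoft.AspNetCore.Http; // IHttpContextAccessor

public class CorrelationIdHandler : DelegatingHandler
{
    private readonly IHttpContextAccessor _httpContextAccessor;

    public CorrelationIdHandler(IHttpContextAccessor httpContextAccessor)
        => _httpContextAccessor = httpContextAccessor;

    protected override Task<HttpResponseMessage> SendAsync(
        HttpRequestMessage request, CancellationToken cancellationToken)
    {
        // Reuse the correlation ID assigned by the middleware at the entry point
        var correlationId = _httpContextAccessor.HttpContext?.Request
            .Headers["X-Correlation-ID"].FirstOrDefault();

        if (!string.IsNullOrEmpty(correlationId))
        {
            request.Headers.TryAddWithoutValidation("X-Correlation-ID", correlationId);
        }

        return base.SendAsync(request, cancellationToken);
    }
}

// Registration in Program.cs:
// builder.Services.AddHttpContextAccessor();
// builder.Services.AddTransient<CorrelationIdHandler>();
// builder.Services.AddHttpClient("payments").AddHttpMessageHandler<CorrelationIdHandler>();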
3. Sampling - Don't Trace Every Request
Tracing every request means huge storage costs and performance overhead. In production sample 1-10% of requests.
Sampling strategies:
- Random sampling: 5% randomly selected requests
- Error-based: Always trace requests ending in error
- Latency-based: Always trace slow requests (> 1s)
- Adaptive: Increase sampling when error rate rises
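In OpenTelemetry .NET, head-based sampling is configured with a sampler on the tracing pipeline. A minimal sketch of 5% parent-based sampling that slots into the Program.cs configuration shown earlier - the ratio is an example, and error- or latency-based strategies are typically implemented as tail sampling in the OpenTelemetry Collector rather than in the SDK:

using OpenTelemetry.Trace;

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        // Keep the parent's decision for downstream services; sample ~5% of new root traces
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.05)))
        .AddAspNetCoreInstrumentation()
        .AddJaegerExporter());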
4. Alert Fatigue - Don't Alert on Everything
I've seen teams with 200+ alerts daily. The result? They ignore them all, including the critical ones.
Good alerting practices:
- ✓ Alert only on symptoms visible to users (latency, errors, downtime)
- ✓ Every alert requires action - if not, it's a notification, not an alert
- ✓ Use severity: P1 (wake people at 3 AM), P2 (handle during work hours), P3 (info)
- ✓ Add a playbook to every alert: "What to do when you get this alert?"
Anti-practices:
- ✗ Alerting on transient issues (a CPU spike for 10s may be OK)
- ✗ Alerts without context ("CPU high" - but which service? which node?)
- ✗ Identical alerts from 10 services (aggregate them into one)
5. Cost of Observability - It Can Be Expensive
Full observability typically costs 10-20% of the infrastructure budget. You must balance visibility against cost.
Example costs (average 5-node cluster):
Managed services (DataDog, New Relic) can cost 2-3x more, but you save setup and maintenance time.
07. Frequently Asked Questions
Can I use only monitoring without full observability?
Yes, if you have a simple monolith with little traffic. Basic monitoring (CPU, RAM, request count, error rate) is enough for small applications. But once you move to microservices or have > 50k requests/day, basic monitoring stops being sufficient - you lose too much time on debugging.
OpenTelemetry vs vendor-specific solutions (DataDog, New Relic)?
OpenTelemetry gives you vendor neutrality - you can change backends without changing instrumentation, but it requires more setup and maintenance. Vendor solutions (DataDog, New Relic) are plug-and-play, but expensive and come with lock-in.
Recommendation: Start with OTel + self-hosted stack (Prometheus/Grafana/Jaeger) to avoid vendor lock-in. As budget grows, consider managed service for convenience.
Does distributed tracing slow down the application?
Yes, but minimally. OpenTelemetry adds <1ms overhead per request (with 100% sampling). With 5% sampling overhead is practically unnoticeable (<0.05ms). It's an acceptable cost for the visibility you get. Without tracing, one incident costs you hours of debugging.
How long to retain metrics, logs and traces?
Standard retention strategy:
- Metrics: 30-90 days at high resolution, then downsample to lower resolution for a year
- Logs: 7-14 days hot storage (fast queries), 30-90 days cold storage (archive)
- Traces: 7-14 days (large data volumes, older traces are rarely needed)
What tools for small team/startup?
For a team of fewer than 10 people with a budget of up to $500/month:
Option 1: Self-hosted (cheapest)
- Prometheus + Grafana (metrics)
- Loki (logs)
- Jaeger (traces)
- Cost: ~$100-150/month (infrastructure only)
Option 2: Managed, free tiers
- Azure Monitor / Application Insights (if on Azure)
- Grafana Cloud (free tier: 10k series, 50GB logs)
- Sentry (error tracking, free tier: 5k events/month)
Key Takeaways - Observability in 2025
- Monitoring vs Observability - they're not the same. Monitoring shows "what", observability explains "why". In distributed systems and microservices, observability isn't optional - it's necessary.
- Three pillars: metrics, logs, traces - each provides a different view. Metrics show trends, logs provide context, distributed tracing follows requests through the entire system. All three together = full visibility.
- OpenTelemetry = industry standard - vendor-neutral, supported by the biggest players (Microsoft, Google, AWS). Invest in OTel now, save yourself from rewriting instrumentation when changing vendors.
- SLI/SLO/SLA framework - an objective way to measure reliability. Error budget balances innovation vs stability - when you exhaust the budget, freeze features and fix issues.
- Alert fatigue is real - 200 alerts daily = you ignore them all. Alert only on what requires immediate action and affects users.
- ROI of observability is obvious - cost: 10-20% of infrastructure budget. Benefit: debugging in minutes instead of hours, less downtime, smaller business losses. One prevented incident pays for a month of the stack.
I Implement Observability in Production Systems
I help teams implement comprehensive observability stack: OpenTelemetry instrumentation for .NET/Node.js, Prometheus/Grafana/Jaeger setup in Kubernetes, defining SLI/SLO aligned with business goals, and alerting strategy that eliminates alert fatigue.
From initial setup through tuning to production-ready monitoring - you'll end up with a system where debugging incidents takes minutes, not hours.
Sources
- [1] Google SRE Book - Site Reliability Engineering - https://sre.google/sre-book/table-of-contents/
- [2] OpenTelemetry - Official Documentation - https://opentelemetry.io/docs/
- [3] Prometheus - Official Documentation - https://prometheus.io/docs/introduction/overview/
- [4] Grafana - Official Documentation - https://grafana.com/docs/
- [5] Jaeger - Distributed Tracing Documentation - https://www.jaegertracing.io/docs/
- [6] CNCF - Cloud Native Computing Foundation - https://www.cncf.io/
- [7] Honeycomb - Observability Best Practices - https://www.honeycomb.io/what-is-observability