Azure Kubernetes Service (AKS) in Production: Lessons from Real Deployments
Running Kubernetes in production is different from running it locally. This guide covers what I've learned deploying AKS clusters that handle real traffic, real users, and real money. No buzzwords, just practical advice based on production experience.

I've deployed AKS clusters for platforms serving hundreds of thousands of users, handling millions of requests daily, and maintaining 99.9% uptime. Whether it's .NET application deployment or complex multi-tenant architecture, I've learned what actually matters in production through real-world experience.
This article isn't about getting started with Kubernetes (check out my Kubernetes for Beginners guide for that). This is about the real challenges you'll face when AKS is running your production workloads: cost overruns, security gaps, mysterious outages, and performance issues.
If you're considering AKS for production, already running it, or debugging why your bill is 3x what you expected, this guide is for you.
Why AKS for Production in 2025
According to the CNCF Annual Survey 2023, Kubernetes adoption in production reached 96% among organizations using containers. Azure Kubernetes Service holds about 25-30% of the managed Kubernetes market, competing with EKS and GKE.
When AKS Makes Sense
You're Already on Azure
Seamless integration with Azure AD, Key Vault, Application Gateway, Azure Monitor, and other Azure services. If your infrastructure is on Azure, AKS is the natural choice.
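As one example of that integration, Microsoft Entra Workload ID lets a pod authenticate to services like Key Vault with a managed identity instead of stored credentials. A minimal sketch, assuming the workload identity add-on and OIDC issuer are enabled on the cluster and a user-assigned managed identity with a federated credential already exists (the client ID is a placeholder):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: my-app
  namespace: production
  annotations:
    azure.workload.identity/client-id: "<managed-identity-client-id>"   # placeholder
---
apiVersion: v1
kind: Pod
metadata:
  name: my-app
  namespace: production
  labels:
    azure.workload.identity/use: "true"   # opts the pod into workload identity
spec:
  serviceAccountName: my-app
  containers:
  - name: app
    image: myapp:latest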
You Need Auto-Scaling and High Availability
AKS handles node-level failures automatically, scales pods based on metrics, and can span multiple availability zones. For applications that need 99.9%+ uptime, this is critical.
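As a concrete example, pod-level scaling is typically driven by a HorizontalPodAutoscaler. A minimal sketch that scales a Deployment on CPU utilization (the Deployment name api and the thresholds are illustrative):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2              # keep at least two replicas for availability
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70   # add pods when average CPU exceeds 70% of requests

The cluster autoscaler then adds or removes nodes when pods can no longer be scheduled, which ties this pod-level scaling to the node-level behavior described above.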
You Have Microservices or Complex Deployments
Managing 20+ microservices with manual deployments is painful. Kubernetes excels at orchestrating complex, distributed applications with a microservices architecture. Below 5 services, you probably don't need it.
You Want Managed Control Plane
Azure manages the Kubernetes control plane (API server, etcd, scheduler) for free. You only pay for worker nodes. Compare this to self-managed Kubernetes where you maintain everything.
When AKS Is Overkill
Be honest with yourself. Kubernetes adds complexity. If you have:
- Simple monolith: Azure App Service or Container Apps might be better (consider legacy modernization strategies first)
- Small team (<5 devs): Kubernetes operational overhead might not be worth it
- Low traffic (<10k requests/day): VMs or App Service will be cheaper and simpler
- No DevOps expertise: You'll need someone who understands Kubernetes, networking, and security
Cost Comparison: AKS vs Azure VMs
Real numbers from a production workload (web API + database, 100k requests/day, 99.9% uptime requirement):
Azure VMs Approach
- 3x Standard_D2s_v3 VMs for redundancy: $210/month
- Load Balancer: $20/month
- Managed Disks: $30/month
- Manual deployments, no auto-scaling
- Total: ~$260/month
AKS Approach
- 2x Standard_D2s_v3 nodes (cluster autoscaler adds a 3rd when needed): $140/month average
- Control plane: Free (managed by Azure)
- Managed Disks: $30/month
- Auto-scaling, zero-downtime deployments, self-healing
- Total: ~$170/month
AKS can be cheaper due to better resource utilization and auto-scaling. However, this assumes you configure it properly (see cost optimization section below).
Common Production Pitfalls (And How to Avoid Them)

I've debugged countless AKS issues in production. Here are the most common problems and their solutions:
1. Resource Limits and Requests Misconfiguration
Problem: Pods get OOMKilled (Out of Memory Killed) or nodes run out of resources because you didn't set proper limits/requests.
I've seen production outages where a single pod consumed all node memory, causing the entire node to become unresponsive. Kubernetes couldn't evict the pod because there were no resource limits.
WRONG - No limits:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: myapp:latest
    # No resources defined - dangerous in production!

CORRECT - Proper limits:

apiVersion: v1
kind: Pod
metadata:
  name: my-app
spec:
  containers:
  - name: app
    image: myapp:latest
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
      limits:
        memory: "512Mi"
        cpu: "500m"

Rule of thumb: Set requests to typical usage, limits to maximum expected usage. Monitor actual usage for 1-2 weeks, then adjust.
2. Cost Overruns from Over-Provisioning
Problem: Your AKS bill is 3x higher than expected because you're running nodes that are 80% idle.
Common scenario: You provision 5x Standard_D8s_v3 nodes (8 cores, 32GB RAM each) "to be safe", but your actual workload uses 30% of resources. That's $1,400/month wasted. This is a classic example of how technical debt accumulates when infrastructure decisions are made without proper monitoring and analysis.
Solution: Right-Sizing Strategy
- Use Azure Monitor to check actual CPU/memory usage over 30 days
- Enable cluster autoscaler to add/remove nodes based on demand
- Use smaller node sizes (D2s_v3, D4s_v3) with more instances instead of fewer large nodes
- Consider spot instances for non-critical workloads (70% cheaper; see the sketch after this list)
- Follow comprehensive cloud cost optimization strategies to reduce waste
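For spot nodes specifically, workloads have to opt in because AKS taints spot node pools. A minimal sketch of a pod that tolerates the spot taint and is pinned to spot nodes, assuming the default taint and label AKS applies to spot node pools (the image name is illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: batch-worker
  namespace: production
spec:
  tolerations:
  - key: "kubernetes.azure.com/scalesetpriority"
    operator: "Equal"
    value: "spot"
    effect: "NoSchedule"       # matches the taint AKS puts on spot node pools
  nodeSelector:
    kubernetes.azure.com/scalesetpriority: spot   # schedule only onto spot nodes
  containers:
  - name: worker
    image: myworker:latest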
3. Security Gaps: No Network Policies

Problem: By default, any pod can talk to any other pod in your cluster. If one pod gets compromised, attackers can access your database.
Network Policy Example:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-policy
  namespace: production
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
  - Ingress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: frontend
    ports:
    - protocol: TCP
      port: 8080

This policy ensures only frontend pods can reach backend pods on port 8080. All other ingress to the backend pods is denied.
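Two caveats: a NetworkPolicy only restricts pods it selects, and policies only take effect if the cluster runs a network policy engine (Azure Network Policy, Calico, or Cilium on AKS). A common companion is a namespace-wide default-deny rule so new pods aren't exposed by accident; a minimal sketch for the production namespace:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: production
spec:
  podSelector: {}        # selects every pod in the namespace
  policyTypes:
  - Ingress              # with no ingress rules listed, all ingress is denied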
4. Networking Complexity: Address Space Planning
Problem: You run out of IP addresses because you chose a /24 subnet for a cluster that needs to scale to 100+ pods.
With Azure CNI, every pod gets an IP from your virtual network. If you have 10 nodes with max 30 pods each, that's 300 IPs needed for pods alone. A /24 subnet has only 256 addresses, the nodes themselves need IPs too, and Azure reserves five addresses per subnet.
IP Planning Guide:
- Small cluster (10 nodes, 30 pods/node): /23 subnet (512 IPs)
- Medium cluster (30 nodes, 30 pods/node): /22 subnet (1024 IPs)
- Large cluster (100 nodes, 30 pods/node): /20 subnet (4096 IPs)
- Formula: (max_nodes × max_pods_per_node) + buffer
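Worked example for the medium cluster above: 30 nodes × 30 pods = 900 pod IPs, plus one IP per node, Azure's five reserved addresses, and headroom for surge nodes during upgrades, which is why a /22 (1024 addresses) fits while a /23 (512) does not.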
5. Monitoring Blind Spots
Problem: Your app is slow, but you don't know why. No metrics, no logs, no visibility into what's happening inside the cluster.
In production, you need metrics for: pod CPU/memory usage, request latency, error rates, node health, persistent volume usage, network traffic. Without monitoring, you're flying blind.
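One common approach is to pair Azure Monitor Container Insights for cluster-level metrics with Prometheus-style application metrics. As a starting point, here is a minimal sketch of a ServiceMonitor that scrapes application metrics, assuming the Prometheus Operator (for example via kube-prometheus-stack) is installed and the backend Service exposes a named metrics port; the names below are illustrative:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: backend-metrics
  namespace: production
spec:
  selector:
    matchLabels:
      app: backend        # must match the labels on the backend Service
  endpoints:
  - port: metrics         # named port on the Service that serves /metrics
    interval: 30s

Azure's managed Prometheus add-on accepts equivalent ServiceMonitor/PodMonitor resources, so the pattern carries over if you prefer the managed option.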


References
- [1] Microsoft Azure - Official Documentation - https://learn.microsoft.com/en-us/azure/
- [2] Microsoft Learn - Azure Training Center - https://learn.microsoft.com/en-us/training/azure/
- [3] Kubernetes - Official Documentation - https://kubernetes.io/docs/
- [4] CNCF Annual Survey 2023 - State of Kubernetes Adoption - https://www.cncf.io/reports/cncf-annual-survey-2023/
- [5] Flexera State of the Cloud Report 2024 - https://www.flexera.com/blog/cloud/cloud-computing-trends-2024-state-of-the-cloud-report/
- [6] FinOps Foundation - Best Practices - https://www.finops.org/
- [7] Gartner - Cloud Computing Research - https://www.gartner.com/en/information-technology/insights/cloud-computing