CKAD Certification Journey — Part 3: Application Observability & Maintenance (The Reality of Running Systems)

If deployment is where you introduce change, observability is where you survive that change.

And here is the uncomfortable truth:

Most engineers don’t fail at Kubernetes because they don’t know how to deploy — they fail because they don’t know how to understand what’s happening after deployment.

This article is not a list of tools. It is a practical mental model for diagnosing real systems under pressure.


🧠 Kubernetes Doesn’t Know Your App Is Broken

Let’s start with the most important idea:

Kubernetes does not know your application is failing.

It only knows:

  • if a process exists
  • if a container is running
  • if a node is healthy

👉 If your app is:

  • deadlocked
  • returning 500s
  • stuck waiting on a dependency

Kubernetes will happily say:

STATUS: Running

And that’s where probes become critical.


❤️ Liveness Probe — Detecting Dead Applications (Not Dead Containers)

A container can be:

  • running
  • consuming CPU
  • accepting connections

…and still be completely broken.

A liveness probe is your way of telling Kubernetes:

“If this condition fails, kill the container and start again.”


🔬 What You Should Actually Check (Not Just /health)

Bad example:

livenessProbe:
  httpGet:
    path: /health
    port: 8080

👉 Returns 200 even if:

  • DB is down
  • internal threads are stuck

Better approach

Your liveness endpoint should verify:

  • internal threads responsive
  • event loop alive
  • critical dependencies reachable (with care)
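A liveness endpoint built on these principles might be wired up like this. This is a sketch, not a prescription — the `/livez` path is an assumed application endpoint that checks only internal state (threads, event loop), and all timings are illustrative:

```yaml
# Illustrative liveness probe. /livez is an assumed endpoint that
# verifies internal state only — no external dependency checks,
# so a flaky DB cannot trigger restarts.
livenessProbe:
  httpGet:
    path: /livez
    port: 8080
  initialDelaySeconds: 15   # let the app boot before the first check
  periodSeconds: 10         # probe every 10 seconds
  timeoutSeconds: 3         # a slow response counts as a failure
  failureThreshold: 3       # restart only after 3 consecutive failures
```

Keeping dependency checks out of this endpoint is what prevents the cascading-restart scenario described above.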

⚠️ Why “with care” matters

If you include external dependencies (like DB):

DB temporary issue → probe fails → container restarts → cascading failure

👉 You’ve just amplified the problem.


🔁 What actually happens on failure

Liveness fails
Kubelet kills container
Container restarted
State lost (if no volume)

💥 Real-world anti-pattern

Aggressive probes:

failureThreshold: 1
periodSeconds: 5

👉 One slow response → restart → instability loop
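A less trigger-happy configuration gives the app room to be transiently slow before Kubernetes declares it dead. The values below are illustrative — the point is the arithmetic, not the exact numbers:

```yaml
# More forgiving timings: with periodSeconds 10 and
# failureThreshold 3, the app must fail for roughly 30 seconds
# in a row before the kubelet restarts it.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3
```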


🧠 Insight

Liveness is not a health check — it is a self-healing trigger


🚦 Readiness Probe — Traffic Control, Not Health

Readiness is not about fixing problems.

It is about:

👉 Protecting your system from bad instances


What readiness really does

If readiness fails:

Pod removed from Service endpoints
Traffic stops
Pod still running

Where this matters

Scenario: slow startup

Without readiness:

  • Pod starts
  • traffic arrives immediately
  • app not ready → errors

With readiness:

  • Pod hidden
  • app warms up
  • then receives traffic
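The warm-up scenario above maps to a readiness probe like this (the `/ready` path and timings are assumptions for illustration):

```yaml
# Readiness gates traffic only: failing it removes the Pod from
# Service endpoints without restarting the container.
readinessProbe:
  httpGet:
    path: /ready      # assumed endpoint; may check dependencies
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2   # hide the Pod after 2 consecutive failures
```

Because readiness never kills the container, it is safe (unlike liveness) for this endpoint to consider external dependencies.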

🧠 Critical distinction

Situation            Liveness               Readiness
App stuck            restart                no effect
App warming up       restart (bad)          no traffic (correct)
Dependency failure   restart (dangerous)    isolate instance

💥 Real production pattern

DB latency increases
Readiness fails
Pods temporarily removed
System stabilises

👉 Without readiness → cascading failure


🔁 Probes Working Together (Real Behaviour)

Startup phase:
  Readiness = false (no traffic)

Runtime failure:
  Liveness = fail → restart

Transient issues:
  Readiness = fail → isolate instance
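The three phases above can live on a single container. A startup probe (available since Kubernetes 1.18) suppresses the liveness probe until the app has booted, which protects slow starters from restart loops. Paths and timings here are illustrative:

```yaml
# All three probes together (paths and values are assumptions).
containers:
- name: app
  image: myapp:1.0
  startupProbe:
    httpGet: { path: /livez, port: 8080 }
    periodSeconds: 5
    failureThreshold: 30    # up to 30 × 5s = 150s allowed for startup
  livenessProbe:
    httpGet: { path: /livez, port: 8080 }   # internal state only
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet: { path: /ready, port: 8080 }   # gates traffic
    periodSeconds: 5
```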

📊 Monitoring — Not Dashboards, But Signal

Most teams confuse monitoring with graphs.

Monitoring is:

👉 Detecting abnormal behaviour early enough to act


🧱 The Observability Stack (Layered Thinking)

You must think in layers, or you will debug blindly.


1️⃣ Infrastructure (Node Level)

  • CPU saturation
  • memory pressure
  • disk I/O
  • network

👉 Example:

Node disk full → Pods evicted → app downtime

2️⃣ Kubernetes Control Plane

  • scheduling delays
  • API server latency
  • etcd health

👉 Example:

Scheduler slow → Pods Pending → deployment stalls

3️⃣ Pod / Container Level

  • restarts
  • OOMKilled
  • resource limits

👉 Example:

Memory limit too low → container killed → CrashLoopBackOff
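This failure mode is driven entirely by the container's resource spec. When actual usage exceeds `limits.memory`, the kernel OOM-kills the process and Kubernetes reports the reason as OOMKilled (values below are illustrative):

```yaml
# The limit is a hard ceiling, not a target: exceeding
# limits.memory → OOMKilled → restart → CrashLoopBackOff.
resources:
  requests:
    memory: "128Mi"   # used for scheduling decisions
    cpu: "100m"
  limits:
    memory: "256Mi"   # exceeding this kills the container
```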

4️⃣ Application Level

  • request latency
  • error rates
  • throughput

👉 This is where business impact lives.


🧠 Insight

If you don’t correlate these layers, you’re guessing — not debugging


🛠️ Metrics — What Kubernetes Actually Gives You

Kubernetes gives you just enough to be dangerous.


Metrics Server

kubectl top nodes
kubectl top pods

Shows:

  • CPU
  • memory

👉 Useful for:

  • quick diagnosis
  • autoscaling

👉 Useless for:

  • deep analysis
  • historical trends

cAdvisor

  • container-level metrics
  • filesystem, CPU, memory

kube-state-metrics

  • state of Kubernetes objects
  • desired vs actual

⚠️ Limitation

No history, no alerting, no correlation


📈 Real Monitoring Stack (Why Prometheus Exists)

Production requires:

  • time-series storage
  • alerting
  • correlation

Typical stack

Prometheus → collects metrics
Grafana → visualises
Alertmanager → alerts
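One common (but convention-based, not built-in) way to get Pod metrics into Prometheus is annotation-driven scraping. This only works if your Prometheus scrape configuration honours these annotations — port and path below are assumptions:

```yaml
# Conventional annotations telling a suitably configured
# Prometheus to scrape this Pod.
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9102"
    prometheus.io/path: "/metrics"
```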

Why this matters

Because:

CPU spike alone = useless
CPU spike + latency spike = insight

📜 Logging — Where Truth Actually Lives

Metrics tell you: 👉 something is wrong

Logs tell you: 👉 what is wrong


🔍 Kubernetes Logging Reality

Kubernetes does NOT provide:

  • log storage
  • log search
  • log correlation

What actually happens

Your app writes:

stdout / stderr

Kubernetes:

  • stores it temporarily
  • rotates logs
  • deletes when Pod dies

🧠 Insight

If you don’t aggregate logs, you are losing data continuously


🧱 Real Logging Architecture

Pod → stdout
Fluentd / Fluent Bit
Elasticsearch
Kibana / Grafana

Sources of logs

  • Application
  • kube-apiserver
  • scheduler
  • etcd
  • node system logs

Example debugging

kubectl logs <pod>
kubectl logs <pod> --previous

👉 --previous is critical for crash debugging


🔥 Troubleshooting Pods — Real Workflow

Forget theory. This is what you actually do.


Step 1 — Check state

kubectl get pods

Look for:

  • CrashLoopBackOff
  • Pending
  • Error

Step 2 — Describe

kubectl describe pod <name>

👉 Most useful part: Events


Step 3 — Logs

kubectl logs <pod>

Step 4 — Exec (if needed)

kubectl exec -it <pod> -- sh

💥 Common Failure Patterns


CrashLoopBackOff

Cause:

  • app crash
  • bad config
  • missing dependency

Pending

Cause:

  • no resources
  • node selector mismatch
  • PVC not bound
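The "no resources" case is easy to reproduce: a request no node can satisfy leaves the Pod Pending forever. The value below is deliberately oversized for illustration:

```yaml
# Unschedulable by design: no node has this much allocatable memory,
# so the Pod stays Pending with a FailedScheduling event.
resources:
  requests:
    memory: "512Gi"
```

`kubectl describe pod` then shows the reason in Events (a FailedScheduling message such as "Insufficient memory").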

OOMKilled

Cause:

  • memory limit too low

🧠 Debugging Mindset

Symptom ≠ Cause

Example:

Pod restarting

Is NOT the problem.

👉 It is a symptom.


🧠 Cluster-Level Troubleshooting

Sometimes the issue is not your app.


Check nodes

kubectl get nodes
kubectl describe node

Check system components

kubectl logs kube-apiserver-<node> -n kube-system
journalctl -u kubelet

👉 On kubeadm clusters the control-plane components run as static Pods, so their logs come via kubectl; the kubelet itself is the systemd unit.

Real failure example

etcd slow → API slow → scheduling delays → deployment issues

⚠️ What Separates Beginners from Experts


Beginners

  • look at Pod status
  • restart things
  • guess

Experts

  • correlate metrics + logs + events
  • understand layers
  • identify root cause

🧠 Final Mental Model

Probes → control lifecycle
Metrics → detect anomalies
Logs → explain anomalies
Events → show system decisions

🚀 Final Thought

Observability is not about tools — it is about reducing uncertainty.

Kubernetes gives you signals.

Your job is to:

  • interpret them
  • correlate them
  • act on them

🔜 Next Part

In Part 4, we’ll explore:

👉 Configuration, Secrets, Environment, and Security

Because:

misconfigured systems fail more often than badly deployed ones