CKAD Certification Journey — Part 3: Application Observability & Maintenance (The Reality of Running Systems)
If deployment is where you introduce change, observability is where you survive that change.
And here is the uncomfortable truth:
Most engineers don’t fail at Kubernetes because they don’t know how to deploy — they fail because they don’t know how to understand what’s happening after deployment.
This article is not a list of tools. It is a practical mental model for diagnosing real systems under pressure.
🧠 Kubernetes Doesn’t Know Your App Is Broken
Let’s start with the most important idea:
Kubernetes does not know your application is failing.
It only knows:
- if a process exists
- if a container is running
- if a node is healthy
👉 If your app is:
- deadlocked
- returning 500s
- stuck waiting on a dependency
Kubernetes will happily say:
STATUS: Running
And that’s where probes become critical.
❤️ Liveness Probe — Detecting Dead Applications (Not Dead Containers)
A container can be:
- running
- consuming CPU
- accepting connections
…and still be completely broken.
A liveness probe is your way of telling Kubernetes:
“If this condition fails, kill the container and start it again.”
🔬 What You Should Actually Check (Not Just /health)
Bad example:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
```
👉 Returns 200 even if:
- DB is down
- internal threads are stuck
Better approach
Your liveness endpoint should verify:
- internal threads responsive
- event loop alive
- critical dependencies reachable (with care)
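One way this can look in practice — a sketch, assuming your app exposes a dedicated `/livez` endpoint on port 8080 that checks only internal state (threads, event loop), not external dependencies:

```yaml
livenessProbe:
  httpGet:
    path: /livez           # assumed endpoint: internal health only, no DB checks
    port: 8080
  initialDelaySeconds: 15  # give the process time to boot before probing
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3      # tolerate transient slowness before restarting
```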
⚠️ Why “with care” matters
If you include external dependencies (like DB):
DB temporary issue → probe fails → container restarts → cascading failure
👉 You’ve just amplified the problem.
🔁 What actually happens on failure
Liveness fails
↓
Kubelet kills container
↓
Container restarted
↓
State lost (if no volume)
💥 Real-world anti-pattern
Aggressive probes:
```yaml
failureThreshold: 1
periodSeconds: 5
```
👉 One slow response → restart → instability loop
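A less trigger-happy configuration might look like this (the values are illustrative — tune them to your app's real response times under load):

```yaml
# More forgiving timing: a single slow response is not a death sentence.
failureThreshold: 3    # require several consecutive failures before restarting
periodSeconds: 10
timeoutSeconds: 3      # allow slow-but-alive responses
```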
🧠 Insight
Liveness is not a health check — it is a self-healing trigger
🚦 Readiness Probe — Traffic Control, Not Health
Readiness is not about fixing problems.
It is about:
👉 Protecting your system from bad instances
What readiness really does
If readiness fails:
Pod removed from Service endpoints
Traffic stops
Pod still running
Where this matters
Scenario: slow startup
Without readiness:
- Pod starts
- traffic arrives immediately
- app not ready → errors
With readiness:
- Pod hidden
- app warms up
- then receives traffic
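A readiness probe sketch for this scenario — endpoint name and timings are assumptions, not a standard:

```yaml
readinessProbe:
  httpGet:
    path: /readyz        # assumed endpoint; may check dependencies, since failing
    port: 8080           # here only removes traffic, it never restarts the container
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2    # drop out of the Service endpoints quickly
```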
🧠 Critical distinction
| Situation | Liveness | Readiness |
|---|---|---|
| App deadlocked | restart (fixes it) | traffic stops, app stays broken |
| App warming up | restart (bad) | no traffic (correct) |
| Dependency failure | restart (dangerous) | isolate instance |
💥 Real production pattern
DB latency increases
↓
Readiness fails
↓
Pods temporarily removed
↓
System stabilises
👉 Without readiness → cascading failure
🔁 Probes Working Together (Real Behaviour)
Startup phase:
Readiness = false (no traffic)
Runtime failure:
Liveness = fail → restart
Transient issues:
Readiness = fail → isolate instance
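Putting both probes together in one container spec — a sketch; the image, endpoint names, and timings are all placeholders:

```yaml
containers:
- name: app
  image: registry.example.com/app:1.0       # placeholder image
  ports:
  - containerPort: 8080
  livenessProbe:
    httpGet: { path: /livez, port: 8080 }   # internal state only -> restart on failure
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet: { path: /readyz, port: 8080 }  # may include dependencies -> traffic control
    periodSeconds: 5
    failureThreshold: 2
```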
📊 Monitoring — Not Dashboards, But Signal
Most teams confuse monitoring with graphs.
Monitoring is:
👉 Detecting abnormal behaviour early enough to act
🧱 The Observability Stack (Layered Thinking)
You must think in layers, or you will debug blindly.
1️⃣ Infrastructure (Node Level)
- CPU saturation
- memory pressure
- disk I/O
- network
👉 Example:
Node disk full → Pods evicted → app downtime
2️⃣ Kubernetes Control Plane
- scheduling delays
- API server latency
- etcd health
👉 Example:
Scheduler slow → Pods Pending → deployment stalls
3️⃣ Pod / Container Level
- restarts
- OOMKilled
- resource limits
👉 Example:
Memory limit too low → container killed → CrashLoopBackOff
4️⃣ Application Level
- request latency
- error rates
- throughput
👉 This is where business impact lives.
🧠 Insight
If you don’t correlate these layers, you’re guessing — not debugging
🛠️ Metrics — What Kubernetes Actually Gives You
Kubernetes gives you just enough to be dangerous.
Metrics Server
```shell
kubectl top nodes
kubectl top pods
```
Shows:
- CPU
- memory
👉 Useful for:
- quick diagnosis
- autoscaling
👉 Useless for:
- deep analysis
- historical trends
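Those same numbers are exactly what the Horizontal Pod Autoscaler consumes. A minimal sketch (the Deployment name is a placeholder):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                      # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out above 70% average CPU
```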
cAdvisor
- container-level metrics
- filesystem, CPU, memory
kube-state-metrics
- state of Kubernetes objects
- desired vs actual
⚠️ Limitation
No history, no alerting, no correlation
📈 Real Monitoring Stack (Why Prometheus Exists)
Production requires:
- time-series storage
- alerting
- correlation
Typical stack
Prometheus → collects metrics
Grafana → visualises
Alertmanager → alerts
Why this matters
Because:
CPU spike alone = useless
CPU spike + latency spike = insight
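In Prometheus terms, correlation often lands in an alerting rule. A sketch — the metric names are assumptions and depend entirely on how your app is instrumented:

```yaml
groups:
- name: app-alerts
  rules:
  - alert: HighErrorRate
    # http_requests_total is an assumed metric name from the app's instrumentation.
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m                # condition must persist, filtering out one-off spikes
    labels:
      severity: page
    annotations:
      summary: "More than 5% of requests are failing"
```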
📜 Logging — Where Truth Actually Lives
Metrics tell you: 👉 something is wrong
Logs tell you: 👉 what is wrong
🔍 Kubernetes Logging Reality
Kubernetes does NOT provide:
- log storage
- log search
- log correlation
What actually happens
Your app writes:
stdout / stderr
Kubernetes:
- stores it temporarily
- rotates logs
- deletes when Pod dies
🧠 Insight
If you don’t aggregate logs, you are losing data continuously
🧱 Real Logging Architecture
Pod → stdout
↓
Fluentd / Fluent Bit
↓
Elasticsearch
↓
Kibana / Grafana
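The collector layer is usually driven by a ConfigMap. A minimal Fluent Bit sketch — the namespace, Elasticsearch Service name, and parser choice are assumptions for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging                 # assumed namespace
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log   # where the kubelet writes container logs
        Tag     kube.*
    [OUTPUT]
        Name    es
        Match   *
        Host    elasticsearch.logging.svc   # assumed in-cluster Service name
        Port    9200
```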
Sources of logs
- Application
- kube-apiserver
- scheduler
- etcd
- node system logs
Example debugging
```shell
kubectl logs <pod>
kubectl logs <pod> --previous
```
👉 --previous is critical for crash debugging
🔥 Troubleshooting Pods — Real Workflow
Forget theory. This is what you actually do.
Step 1 — Check state
```shell
kubectl get pods
```
Look for:
- CrashLoopBackOff
- Pending
- Error
Step 2 — Describe
```shell
kubectl describe pod <name>
```
👉 Most useful part: Events
Step 3 — Logs
```shell
kubectl logs <pod>
```
Step 4 — Exec (if needed)
```shell
kubectl exec -it <pod> -- sh
```
💥 Common Failure Patterns
CrashLoopBackOff
Cause:
- app crash
- bad config
- missing dependency
Pending
Cause:
- no resources
- node selector mismatch
- PVC not bound
OOMKilled
Cause:
- memory limit too low
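The fix usually lives in the container's resource spec (the numbers here are illustrative — size them from observed usage, not guesses):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the Pod
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this is what triggers OOMKilled
```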
🧠 Debugging Mindset
Symptom ≠ Cause
Example:
Pod restarting
Is NOT the problem.
👉 It is a symptom.
🧠 Cluster-Level Troubleshooting
Sometimes the issue is not your app.
Check nodes
```shell
kubectl get nodes
kubectl describe node <name>
```
Check system components
```shell
kubectl logs kube-apiserver-<node> -n kube-system   # static Pod logs
journalctl -u kubelet                               # node-level component logs
```
Real failure example
etcd slow → API slow → scheduling delays → deployment issues
⚠️ What Separates Beginners from Experts
Beginners
- look at Pod status
- restart things
- guess
Experts
- correlate metrics + logs + events
- understand layers
- identify root cause
🧠 Final Mental Model
Probes → control lifecycle
Metrics → detect anomalies
Logs → explain anomalies
Events → show system decisions
🚀 Final Thought
Observability is not about tools — it is about reducing uncertainty.
Kubernetes gives you signals.
Your job is to:
- interpret them
- correlate them
- act on them
🔜 Next Part
In Part 4, we’ll explore:
👉 Configuration, Secrets, Environment, and Security
Because:
misconfigured systems fail more often than badly deployed ones