CKAD Certification Journey — Part 3: Application Observability & Maintenance (The Reality of Running Systems)
If deployment is where you introduce change, observability is where you survive that change.
And here is the uncomfortable truth:
Most engineers don’t fail at Kubernetes because they don’t know how to deploy — they fail because they don’t know how to understand what’s happening after deployment.
This article is not a list of tools. It is a practical mental model for diagnosing real systems under pressure.
🧠 Kubernetes Doesn’t Know Your App Is Broken
Let’s start with the most important idea:
Kubernetes does not know your application is failing.
It only knows:
- if a process exists
- if a container is running
- if a node is healthy
👉 If your app is:
- deadlocked
- returning 500s
- stuck waiting on a dependency
Kubernetes will happily say:
STATUS: Running
And that’s where probes become critical.
❤️ Liveness Probe — Detecting Dead Applications (Not Dead Containers)
A container can be:
- running
- consuming CPU
- accepting connections
…and still be completely broken.
A liveness probe is your way of telling Kubernetes:
“If this condition fails, kill the container and start it again.”
🔬 What You Should Actually Check (Not Just /health)
Bad example:
```yaml
livenessProbe:
  httpGet:
    path: /health
    port: 8080
```
👉 Returns 200 even if:
- DB is down
- internal threads are stuck
Better approach
Your liveness endpoint should verify:
- internal threads responsive
- event loop alive
- critical dependencies reachable (with care)
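One way this can look in practice — a sketch, assuming your app exposes a dedicated `/livez` endpoint on port 8080 that checks only internal state (threads, event loop), not external dependencies:

```yaml
livenessProbe:
  httpGet:
    path: /livez           # assumed endpoint: internal health only, no DB checks
    port: 8080
  initialDelaySeconds: 15  # give the process time to boot before probing
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3      # tolerate transient slowness before restarting
```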
⚠️ Why “with care” matters
If you include external dependencies (like DB):
DB temporary issue → probe fails → container restarts → cascading failure
👉 You’ve just amplified the problem.
🔁 What actually happens on failure
Liveness fails
↓
Kubelet kills container
↓
Container restarted
↓
State lost (if no volume)
💥 Real-world anti-pattern
Aggressive probes:
```yaml
failureThreshold: 1
periodSeconds: 5
```
👉 One slow response → restart → instability loop
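A less trigger-happy configuration might look like this (the values are illustrative — tune them to your app's real response times under load):

```yaml
# More forgiving timing: a single slow response is not a death sentence.
failureThreshold: 3    # require several consecutive failures before restarting
periodSeconds: 10
timeoutSeconds: 3      # allow slow-but-alive responses
```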
🧠 Insight
Liveness is not a health check — it is a self-healing trigger
🚦 Readiness Probe — Traffic Control, Not Health
Readiness is not about fixing problems.
It is about:
👉 Protecting your system from bad instances
What readiness really does
If readiness fails:
Pod removed from Service endpoints
Traffic stops
Pod still running
Where this matters
Scenario: slow startup
Without readiness:
- Pod starts
- traffic arrives immediately
- app not ready → errors
With readiness:
- Pod hidden
- app warms up
- then receives traffic
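A readiness probe sketch for this scenario — endpoint name and timings are assumptions, not a standard:

```yaml
readinessProbe:
  httpGet:
    path: /readyz        # assumed endpoint; may check dependencies, since failing
    port: 8080           # here only removes traffic, it never restarts the container
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 2    # drop out of the Service endpoints quickly
```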
🧠 Critical distinction
| Situation | Liveness | Readiness |
|---|---|---|
| App deadlocked | restart (fixes it) | traffic stops, app stays broken |
| App warming up | restart (bad) | no traffic (correct) |
| Dependency failure | restart (dangerous) | isolate instance |
💥 Real production pattern
DB latency increases
↓
Readiness fails
↓
Pods temporarily removed
↓
System stabilises
👉 Without readiness → cascading failure
🔁 Probes Working Together (Real Behaviour)
Startup phase:
Readiness = false (no traffic)
Runtime failure:
Liveness = fail → restart
Transient issues:
Readiness = fail → isolate instance
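Putting both probes together in one container spec — a sketch; the image, endpoint names, and timings are all placeholders:

```yaml
containers:
- name: app
  image: registry.example.com/app:1.0       # placeholder image
  ports:
  - containerPort: 8080
  livenessProbe:
    httpGet: { path: /livez, port: 8080 }   # internal state only -> restart on failure
    periodSeconds: 10
    failureThreshold: 3
  readinessProbe:
    httpGet: { path: /readyz, port: 8080 }  # may include dependencies -> traffic control
    periodSeconds: 5
    failureThreshold: 2
```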
📊 Monitoring — Not Dashboards, But Signal
Most teams confuse monitoring with graphs.
Monitoring is:
👉 Detecting abnormal behaviour early enough to act
🧱 The Observability Stack (Layered Thinking)
You must think in layers, or you will debug blindly.
1️⃣ Infrastructure (Node Level)
- CPU saturation
- memory pressure
- disk I/O
- network
👉 Example:
Node disk full → Pods evicted → app downtime
2️⃣ Kubernetes Control Plane
- scheduling delays
- API server latency
- etcd health
👉 Example:
Scheduler slow → Pods Pending → deployment stalls
3️⃣ Pod / Container Level
- restarts
- OOMKilled
- resource limits
👉 Example:
Memory limit too low → container killed → CrashLoopBackOff
4️⃣ Application Level
- request latency
- error rates
- throughput
👉 This is where business impact lives.
🧠 Insight
If you don’t correlate these layers, you’re guessing — not debugging
🛠️ Metrics — What Kubernetes Actually Gives You
Kubernetes gives you just enough to be dangerous.
Metrics Server
```shell
kubectl top nodes
kubectl top pods
```
Shows:
- CPU
- memory
👉 Useful for:
- quick diagnosis
- autoscaling
👉 Useless for:
- deep analysis
- historical trends
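Those same numbers are exactly what the Horizontal Pod Autoscaler consumes. A minimal sketch (the Deployment name is a placeholder):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app                      # placeholder Deployment name
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70     # scale out above 70% average CPU
```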
cAdvisor
- container-level metrics
- filesystem, CPU, memory
kube-state-metrics
- state of Kubernetes objects
- desired vs actual
⚠️ Limitation
No history, no alerting, no correlation
📈 Real Monitoring Stack (Why Prometheus Exists)
Production requires:
- time-series storage
- alerting
- correlation
Typical stack
Prometheus → collects metrics
Grafana → visualises
Alertmanager → alerts
Why this matters
Because:
CPU spike alone = useless
CPU spike + latency spike = insight
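In Prometheus terms, correlation often lands in an alerting rule. A sketch — the metric names are assumptions and depend entirely on how your app is instrumented:

```yaml
groups:
- name: app-alerts
  rules:
  - alert: HighErrorRate
    # http_requests_total is an assumed metric name from the app's instrumentation.
    expr: |
      sum(rate(http_requests_total{status=~"5.."}[5m]))
        / sum(rate(http_requests_total[5m])) > 0.05
    for: 5m                # condition must persist, filtering out one-off spikes
    labels:
      severity: page
    annotations:
      summary: "More than 5% of requests are failing"
```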
📜 Logging — Where Truth Actually Lives
Metrics tell you: 👉 something is wrong
Logs tell you: 👉 what is wrong
🔍 Kubernetes Logging Reality
Kubernetes does NOT provide:
- log storage
- log search
- log correlation
What actually happens
Your app writes:
stdout / stderr
Kubernetes:
- stores it temporarily
- rotates logs
- deletes when Pod dies
🧠 Insight
If you don’t aggregate logs, you are losing data continuously
🧱 Real Logging Architecture
Pod → stdout
↓
Fluentd / Fluent Bit
↓
Elasticsearch
↓
Kibana / Grafana
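The collector layer is usually driven by a ConfigMap. A minimal Fluent Bit sketch — the namespace, Elasticsearch Service name, and parser choice are assumptions for illustration:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  namespace: logging                 # assumed namespace
data:
  fluent-bit.conf: |
    [INPUT]
        Name    tail
        Path    /var/log/containers/*.log   # where the kubelet writes container logs
        Tag     kube.*
    [OUTPUT]
        Name    es
        Match   *
        Host    elasticsearch.logging.svc   # assumed in-cluster Service name
        Port    9200
```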
Sources of logs
- Application
- kube-apiserver
- scheduler
- etcd
- node system logs
Example debugging
```shell
kubectl logs <pod>
kubectl logs <pod> --previous
```
👉 --previous is critical for crash debugging
🔥 Troubleshooting Pods — Real Workflow
Forget theory. This is what you actually do.
Step 1 — Check state
```shell
kubectl get pods
```
Look for:
- CrashLoopBackOff
- Pending
- Error
Step 2 — Describe
```shell
kubectl describe pod <name>
```
👉 Most useful part: Events
Step 3 — Logs
```shell
kubectl logs <pod>
```
Step 4 — Exec (if needed)
```shell
kubectl exec -it <pod> -- sh
```
💥 Common Failure Patterns
CrashLoopBackOff
Cause:
- app crash
- bad config
- missing dependency
Pending
Cause:
- no resources
- node selector mismatch
- PVC not bound
OOMKilled
Cause:
- memory limit too low
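The fix usually lives in the container's resource spec (the numbers here are illustrative — size them from observed usage, not guesses):

```yaml
resources:
  requests:
    memory: "256Mi"   # what the scheduler reserves for the Pod
    cpu: "250m"
  limits:
    memory: "512Mi"   # exceeding this is what triggers OOMKilled
```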
🧠 Debugging Mindset
Symptom ≠ Cause
Example:
Pod restarting
Is NOT the problem.
👉 It is a symptom.
🧠 Cluster-Level Troubleshooting
Sometimes the issue is not your app.
Check nodes
```shell
kubectl get nodes
kubectl describe node <name>
```
Check system components
```shell
kubectl logs kube-apiserver-<node> -n kube-system   # static Pod logs
journalctl -u kubelet                               # node-level component logs
```
Real failure example
etcd slow → API slow → scheduling delays → deployment issues
⚠️ What Separates Beginners from Experts
Beginners
- look at Pod status
- restart things
- guess
Experts
- correlate metrics + logs + events
- understand layers
- identify root cause
🧠 Final Mental Model
Probes → control lifecycle
Metrics → detect anomalies
Logs → explain anomalies
Events → show system decisions
🚀 Final Thought
Observability is not about tools — it is about reducing uncertainty.
Kubernetes gives you signals.
Your job is to:
- interpret them
- correlate them
- act on them
🔜 Next Part
In Part 4, we’ll explore:
👉 Configuration, Secrets, Environment, and Security
Because:
misconfigured systems fail more often than badly deployed ones