Senior engineers do not get asked to define pods and deployments. They get asked about the things that break: scenarios where something failed in production and how they fixed it.
You have to answer with clarity if you want them to hire you. No more generic lines you read on X and LinkedIn.
Interviewers are asking these scenario-based Kubernetes questions in Senior DevOps/SRE interviews.
1. What happens to a StatefulSet pod when its node goes into NotReady state? How is that different from a Deployment pod?
This question looks simple. It is not.
Most candidates say the pod gets rescheduled. That is the wrong answer for a StatefulSet and it exposes that they have never actually run stateful workloads in production.
Here is what actually happens.
Suppose your node loses network connectivity. Kubernetes does not immediately know if that node is dead or just temporarily unreachable. So it waits. By default it waits five minutes before marking pods on that node as Terminating.
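That five-minute window comes from the default tolerations Kubernetes injects into every pod. Roughly, the injected spec fragment looks like this; you can shorten tolerationSeconds on stateless workloads if you want faster failover:

```yaml
# Tolerations Kubernetes adds to pods automatically.
# tolerationSeconds: 300 is the ~5 minute wait before eviction begins.
tolerations:
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 300
```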
Now here is where StatefulSet and Deployment behave completely differently.
For a Deployment pod, Kubernetes will reschedule that pod on another node after the timeout. Simple. The pod gets a new identity, a new IP, and life continues.
For a StatefulSet pod, Kubernetes will not reschedule it automatically. The pod stays in Terminating state indefinitely. Why? Because StatefulSet guarantees that no two pods with the same identity run at the same time. Suppose that node is not actually dead. It just lost network for a while. If Kubernetes rescheduled your Postgres pod-0 on another node, now you have two pod-0 instances both trying to write to the same data. That is a split-brain scenario. That corrupts your database.
So Kubernetes chooses to do nothing and wait for you to manually intervene.
In production this means you have to make a decision. Is that node actually dead? If yes, you manually force delete the pod with --force --grace-period=0. If no, you wait for the node to come back.
This is why running stateful workloads on Kubernetes is complex. The safety guarantee that protects you from corruption is the same thing that keeps your pod stuck when a node dies.
In my client’s banking environment we hit exactly this scenario. Node went NotReady at 11pm. The on-call engineer did not know about this behavior. They waited for automatic recovery that was never going to come. We lost two hours before someone force deleted that pod.
That is the kind of production context an interviewer is looking for when they ask this question.

25-Week AWS DevOps + MLOPS + AIOPS Bootcamp
2. Explain the difference between liveness probe, readiness probe, and startup probe. When does getting this wrong take down your production app?
Every candidate knows the textbook definition. Liveness restarts the container if it fails. Readiness removes the pod from service endpoints if it fails.
The interviewer is not testing definitions. They are testing whether you have seen what happens when these are configured wrong in production.
Let me show you three real scenarios where getting this wrong causes an incident.
Scenario one. Liveness probe that is too aggressive.
Suppose your Java application takes 90 seconds to start. Your liveness probe starts checking at 10 seconds with a 5 second timeout. The app is still loading. It does not respond. Liveness probe fails. Kubernetes restarts the container. Container starts loading again. Liveness probe fails again. Kubernetes restarts again.
Your pod is now in a restart loop and will never come up. This is called a CrashLoopBackOff and it has nothing to do with your application being broken. Your application is perfectly fine. Your probe configuration is wrong.
The fix is the startup probe. Startup probe runs first and gives your slow application time to initialize. Only after startup probe succeeds does Kubernetes start running liveness and readiness probes. In this case you give startup probe a failureThreshold of 30 with a 10 second period. That gives your app 5 minutes to start before Kubernetes gives up.
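A minimal sketch of that configuration follows. The /healthz path and port 8080 are assumptions; adjust them for your application:

```yaml
# Startup probe: 30 failures x 10s period = up to 5 minutes to start.
startupProbe:
  httpGet:
    path: /healthz   # hypothetical health endpoint
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
# Liveness only begins once the startup probe has succeeded.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
```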
Scenario two. Readiness probe checking the wrong endpoint.
Suppose your readiness probe is checking /health but your app marks itself ready before it has finished loading its configuration from a remote config service. Traffic starts hitting the pod. The pod is serving requests with incomplete configuration. Users are getting wrong data or errors.
In production what you want is your readiness probe to check a deeper endpoint that actually validates your app is truly ready. Not just that the HTTP server started. That your database connection pool is initialized. That your config is loaded. That your cache is warmed.
The difference between a shallow health check and a meaningful one is the difference between routing traffic to a broken pod or not.
Scenario three. No readiness probe on a StatefulSet.
This one is subtle. Suppose you have a Postgres StatefulSet with three replicas. You do a rolling upgrade. Pod-0 goes down for upgrade. It comes back up. But it has not finished its recovery from the WAL logs yet. It is not ready to accept connections.
Without a readiness probe, Kubernetes has no way to know this. It marks the pod as ready and starts routing traffic to it. Your application gets connection errors because Postgres is still replaying logs.
A proper readiness probe that checks if Postgres is accepting connections would have kept that pod out of the service endpoints until it was actually ready.
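One way to sketch that probe uses pg_isready, the standard Postgres client check, which fails while the server is still replaying WAL. The user and timings here are assumptions:

```yaml
# Pod stays out of Service endpoints until Postgres accepts connections.
readinessProbe:
  exec:
    command: ["pg_isready", "-U", "postgres", "-h", "127.0.0.1"]
  periodSeconds: 5
  failureThreshold: 3
```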
This is why probe configuration is not a minor detail. It is what stands between a smooth deployment and a 2am incident.
3. What is a PodDisruptionBudget and when does ignoring it cause a real production outage?
Most candidates have heard of PodDisruptionBudget. Very few understand what actually happens when it is missing.
Here is the scenario.
Suppose you are running a three replica deployment of your payment service. Your cluster needs to do node maintenance. Maybe Karpenter is consolidating underutilized nodes. Maybe your team is upgrading the EKS node group. Kubernetes starts draining nodes one by one.
Without a PodDisruptionBudget, Kubernetes can evict all three of your payment service pods at the same time. All three were running on nodes that needed to be drained. Within seconds your payment service has zero running pods. Your service is completely down. This is not a failure. This is Kubernetes doing exactly what you asked it to do. You just did not tell it any limits.
A PodDisruptionBudget lets you tell Kubernetes the minimum number of pods that must stay running during voluntary disruptions. You say minAvailable: 2. Now Kubernetes knows it can only evict one pod at a time from your payment service. It drains the node, waits for that pod to be rescheduled and healthy somewhere else, then proceeds with the next node.
The keyword here is voluntary disruptions. PDB only applies to voluntary disruptions. Node drains, cluster upgrades, Karpenter consolidation. It does not protect you from a node crashing or a pod being killed by OOM. That is a different problem.
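A sketch of that PDB for the three-replica payment service; the label selector is a hypothetical name:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: payment-service-pdb
spec:
  minAvailable: 2          # at most one pod may be evicted at a time
  selector:
    matchLabels:
      app: payment-service # hypothetical label on the deployment's pods
```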
In production I have seen this exact scenario play out. A team was upgrading their EKS node group. No PDBs. Three critical services went to zero pods simultaneously during the drain. The upgrade ran at 2am in a maintenance window and still caused a 20 minute outage because nobody had defined what the minimum acceptable state was during disruption.
That is what PDB is for. Not theory. That specific 2am scenario.
4. You have a memory leak in one of your microservices. The pod keeps getting OOMKilled. Walk me through how you would diagnose and fix it without taking down your production service.
This is a scenario question. The interviewer wants to see how you think under pressure, not just whether you know the commands.
Here is how you approach this in production.
First thing you do is understand the blast radius. How many replicas are running? What is the traffic impact of one pod being killed? If you have five replicas and one gets OOMKilled every 30 minutes, you have time to investigate. If you have two replicas and both are getting OOMKilled, you have an active incident and investigation comes second.
Suppose you have time. Here is your investigation path.
You run kubectl top pods to see current memory consumption across all pods. You run kubectl describe pod on the affected pod and look at the last state section. It will show a reason of OOMKilled and exit code 137, which is 128 + 9: the container was killed with SIGKILL because it exceeded its memory limit.
Now you need to understand if this is a real memory leak or if your memory limit is just too low for the actual workload. These are two completely different problems with completely different fixes.
To figure that out you look at your Prometheus metrics. Specifically container_memory_working_set_bytes over time. If memory is growing continuously with no plateau, that is a leak. If memory is stable but just above your limit, your limit is wrong.
If it is a real leak, that is a developer problem. Your job as a DevOps engineer is to give the team time to fix it without causing an outage. You increase the memory limit temporarily to stop the OOMKills. You set up an alert on memory usage at 80 percent of the new limit so you know when it is approaching again. You give the developers the metrics they need to find the leak.
If the limit was just too low, you right-size it. You look at the actual peak memory usage from Prometheus over the last 30 days and set your limit to something reasonable above that.
Now here is the part most people miss in this answer.
After you fix the immediate problem you need to make sure it does not happen again silently. You set up a Prometheus alert on OOMKilled events so you get notified immediately next time. You look at whether Vertical Pod Autoscaler can help right-size your requests and limits automatically over time.
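If you run the Prometheus Operator, both alerts can be sketched as a PrometheusRule. The 80 percent threshold, the 10 minute window, and the label join are assumptions to adapt; the OOMKilled signal comes from kube-state-metrics:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: memory-alerts
spec:
  groups:
  - name: memory
    rules:
    - alert: ContainerNearMemoryLimit
      # Working set above 80% of the memory limit for 10 minutes.
      expr: |
        container_memory_working_set_bytes
          / on (namespace, pod, container)
            kube_pod_container_resource_limits{resource="memory"}
          > 0.8
      for: 10m
    - alert: ContainerOOMKilled
      # Last termination reason for the container was OOMKilled.
      expr: kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} == 1
```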
The interviewer is checking whether you think in systems, not just in commands. Anyone can Google the kubectl commands. Not everyone thinks about the alert that catches the next incident before it becomes an outage.
5. What is the difference between RBAC and what Argo CD gives you for access control? Why do most production teams stop using raw RBAC for developer access?
This is the question that separates people who have worked in a real team from people who have only worked alone.
The textbook answer is that RBAC is Kubernetes native access control. You create Roles, ClusterRoles, RoleBindings. You give developers read access to their namespace. You give DevOps engineers full access. Done.
That works on paper. In production with a real team it becomes painful very quickly.
Here is why.
Suppose you have 20 developers across four teams. Each team owns two microservices. You want to give each team access to their own namespace only. You also have five DevOps engineers who need full cluster access. You also have product managers and stakeholders who want visibility into what is deployed without touching anything.
That means you need to create and manage Roles and RoleBindings for 25 people across multiple namespaces. You need to manage their kubeconfig files so they can actually connect. When someone joins the team you go through the whole setup. When someone leaves you revoke everything. When someone changes teams you update their bindings.
This is manageable for five people. It is painful for 25. It does not scale.
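To make the scaling pain concrete, here is one team's slice of that setup, multiplied across every team, namespace, and joiner/leaver. All names and the IdP group are hypothetical:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: team-a-readonly
  namespace: team-a
rules:
- apiGroups: ["", "apps"]
  resources: ["pods", "deployments", "services"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: team-a-devs
  namespace: team-a
subjects:
- kind: Group
  name: team-a-developers   # hypothetical identity-provider group
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: team-a-readonly
  apiGroup: rbac.authorization.k8s.io
```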
The other problem is visibility. Suppose your developer wants to know if their latest deployment went through fine. With raw RBAC you give them kubectl access. Now they need to know kubectl commands. They need to understand pod states and deployment conditions. Most developers do not want that. They want a simple dashboard that says green or red.
This is exactly why in my client’s environment we gave Argo CD access to the cluster, not the developers. Argo CD has the cluster access. Developers get access to the Argo CD dashboard only. They can see their deployments. They can see what version is running. They can see if a sync failed and why. They can trigger a manual sync if needed.
We control all of that on the Argo CD level, not the Kubernetes RBAC level. Much simpler to manage. No kubeconfig distribution. No RoleBinding for every person. And stakeholders can get read-only Argo CD access with zero Kubernetes exposure.
The other thing Argo CD gives you that raw RBAC does not is drift protection. Suppose a developer somehow has kubectl access and manually changes a deployment. With raw RBAC you will not know until something breaks. With Argo CD, the moment that manual change happens, Argo CD marks the app as OutOfSync and optionally auto-heals it back to what is in Git. Git is the source of truth and nobody can override that silently.
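The drift protection lives in the Application's sync policy. A sketch, with a hypothetical repo and app name:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy-repo  # hypothetical repo
    path: payments
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true
      selfHeal: true   # reverts manual kubectl changes back to Git
```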
That is the production answer. Not just what RBAC is. But why teams move away from managing it manually and what they use instead.
6. Explain how Karpenter is different from Cluster Autoscaler. In 2026 why would you still choose Cluster Autoscaler?
Most candidates know Karpenter is newer and faster. Very few can explain the architectural difference and when Cluster Autoscaler is still the right choice.
Here is the real difference.
Cluster Autoscaler works with your existing node groups. Suppose you have a node group with m5.xlarge instances. When pods are pending because there is no capacity, Cluster Autoscaler adds another m5.xlarge to that group. It can only add nodes of the type you have already configured in your node groups.
That means you have to predict your workload in advance. Suppose you have a machine learning job that suddenly needs GPU nodes. Cluster Autoscaler cannot provision a p3.2xlarge unless you have already created a node group with that instance type. If you have not, your pod stays pending.
Karpenter works differently. It watches for pending pods and reads their requirements directly. CPU, memory, GPU, architecture, spot or on-demand. Then it goes directly to the AWS EC2 API and provisions the exact right instance type for that workload. No predefined node groups. No waiting for a node group to scale. Karpenter can provision a node in under 60 seconds in most cases.
Karpenter also does something Cluster Autoscaler cannot do well. Consolidation. When your cluster is underutilized, Karpenter actively looks for nodes it can consolidate workloads off of and terminate. That means you are not paying for half-empty nodes sitting idle at 3am when your traffic is low.
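Both behaviors are driven by a NodePool. A sketch of a flexible pool with consolidation enabled; Karpenter's API has changed between releases, so check the version your cluster runs:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
      # Karpenter picks instance types matching these plus pod requests.
      - key: karpenter.sh/capacity-type
        operator: In
        values: ["spot", "on-demand"]
      - key: kubernetes.io/arch
        operator: In
        values: ["amd64", "arm64"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  disruption:
    # Actively repack workloads and terminate half-empty nodes.
    consolidationPolicy: WhenEmptyOrUnderutilized
```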
So why would you still use Cluster Autoscaler in 2026?
A few real reasons.
First, if you are not on EKS. Karpenter has strong AWS support but other cloud providers are still catching up. If your cluster is on GKE or AKS, Cluster Autoscaler is still the more mature and battle-tested option.
Second, compliance and predictability. Some regulated industries need to know exactly what instance types will run their workloads. A banking client that needs to run only approved, audited instance types cannot let Karpenter make that decision dynamically. They need a controlled, predefined node group that Cluster Autoscaler manages.
Third, migration risk. If you have a large existing cluster with complex node group configurations, migrating to Karpenter is not zero risk. Some teams choose to keep Cluster Autoscaler in production and run Karpenter experiments in lower environments first.
The real answer shows you understand both tools and can make a context-based decision. Not just that Karpenter is newer so it must be better.
7. What is etcd and what actually happens to your cluster if it goes down?
The interviewer is not asking what etcd is. Everyone knows it is a key-value store. The real question is what actually breaks, and in what order, when etcd becomes unavailable.
Here is what most people do not know.
When etcd goes down, your existing workloads keep running. If your pods are healthy and running on nodes, they continue running. Kubelet on each node is independent. It does not need etcd to keep running existing containers.
What stops working is everything that requires the control plane to make decisions.
You cannot deploy anything new. The API server cannot write desired state anywhere so it rejects all writes. You cannot scale. You cannot update a ConfigMap. You cannot create a secret. You cannot run any kubectl command that modifies cluster state.
Self-healing also stops. Suppose a pod crashes while etcd is down. The controller manager cannot create a replacement because it cannot write to etcd. Your deployment said three replicas. One died. It stays dead until etcd comes back.
Now here is the really dangerous part that most people miss.
etcd uses Raft consensus. For a three node etcd cluster you need two nodes to have quorum. Lose two out of three and you lose quorum. Now even reads start failing. Your API server cannot read cluster state. kubectl get anything starts returning errors. Your cluster is now effectively read-only at best and completely unavailable at worst.
This is why etcd backup is not optional in production. This is why we ran automated etcd snapshots every six hours in my client environment and why we kept those snapshots in a separate S3 bucket in a different AWS region. If you lose your etcd data with no backup, you have lost your entire cluster state. You know what is running because you can see the pods but Kubernetes has no record of the desired state. Recovery from that without backups is extremely painful.
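A hedged sketch of that snapshot job for a self-managed control plane. The image, cert paths, and schedule are assumptions, the S3 upload step is omitted, and the pod must be pinned to a control-plane node (nodeSelector/tolerations left out for brevity); on EKS the control plane is managed, so you rely on AWS-side backups instead:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-snapshot
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"          # every six hours
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          hostNetwork: true
          containers:
          - name: snapshot
            image: registry.k8s.io/etcd:3.5.12-0   # match your etcd version
            command: ["/bin/sh", "-c"]
            args:
            - >
              etcdctl snapshot save /backup/etcd-$(date +%s).db
              --endpoints=https://127.0.0.1:2379
              --cacert=/etc/kubernetes/pki/etcd/ca.crt
              --cert=/etc/kubernetes/pki/etcd/server.crt
              --key=/etc/kubernetes/pki/etcd/server.key
            volumeMounts:
            - { name: certs, mountPath: /etc/kubernetes/pki/etcd, readOnly: true }
            - { name: backup, mountPath: /backup }
          volumes:
          - name: certs
            hostPath: { path: /etc/kubernetes/pki/etcd }
          - name: backup
            hostPath: { path: /var/backups/etcd }
```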
The answer the interviewer wants is not just what etcd is. It is that you understand the blast radius of losing it and you have a real backup and recovery plan.
If you want to crack senior Kubernetes interviews in 2026, stop memorizing answers. Start understanding what breaks and why.
That is the difference between someone who has used Kubernetes and someone who has run it.
Learn real-world DevOps with MLOPS and AIOPS to crack any Senior DevOps/SRE job in 2026
