30 Kubernetes Questions with Answers for Your Next DevOps Interview
Quick-fire answers to the most challenging Kubernetes questions that separate experienced DevOps engineers from beginners
These 30 questions cover advanced Kubernetes concepts that every DevOps professional should know. Each answer is concise but comprehensive enough to demonstrate deep understanding during technical interviews.
Are you looking to advance your DevOps career?
My 20-week advanced, real-world, project-based DevOps Bootcamp is for you.
Advanced Kubernetes Knowledge Questions
1. If you have a Pod with initContainers that fail, but the main container has restartPolicy: Never, what happens to the Pod status?
Answer: The Pod will show `Init:Error` status and move to the Failed phase without ever starting the main container. The restartPolicy applies to all containers, including init containers, so with `Never` a failed init container won't be retried (you would only see `Init:CrashLoopBackOff` with `Always` or `OnFailure`). The Pod requires manual intervention or recreation.
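A minimal manifest to reproduce this behavior (names and images are illustrative): the init container exits non-zero, and because restartPolicy is Never, the Pod goes straight to Failed.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: init-fail-demo
spec:
  restartPolicy: Never              # applies to init containers too
  initContainers:
  - name: init-step
    image: busybox:1.36
    command: ["sh", "-c", "exit 1"] # simulate an init failure
  containers:
  - name: app
    image: nginx:1.25               # never starts because init failed
```

Running `kubectl get pod init-fail-demo` will show `Init:Error`, and the Pod never recovers on its own.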
2. When using a StatefulSet with 3 replicas and you delete replica-1, will replica-2 and replica-3 be renamed to maintain sequential ordering?
Answer: No, Kubernetes doesn’t rename existing Pods. If you delete pod-1, it will be recreated as pod-1 while pod-2 and pod-3 remain unchanged. StatefulSets maintain stable identity – each Pod keeps its ordinal index throughout its lifecycle. Only when scaling down does Kubernetes terminate Pods in reverse order (highest ordinal first).
3. Can a DaemonSet Pod be scheduled on a master node that has NoSchedule taint without explicitly adding tolerations?
Answer: No, DaemonSets don't automatically bypass NoSchedule taints on master nodes. You must explicitly add tolerations to the DaemonSet Pod template. However, DaemonSets do automatically get tolerations for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints with the NoExecute effect.
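A sketch of the toleration you would add to the DaemonSet's Pod template to allow scheduling onto control-plane nodes (older clusters used the key `node-role.kubernetes.io/master` instead):

```yaml
spec:
  template:
    spec:
      tolerations:
      - key: node-role.kubernetes.io/control-plane
        operator: Exists
        effect: NoSchedule   # tolerate the control-plane taint
```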
4. If you update a Deployment’s image while a rolling update is in progress, will K8s wait for the current rollout to complete or start a new one immediately?
Answer: Kubernetes immediately starts a new rollout, canceling the current one. The previous ReplicaSet being rolled out will be scaled down, and a new ReplicaSet with the updated image will be created. This ensures the latest desired state is always being pursued, but can lead to resource churn during rapid updates.
5. When a node becomes NotReady, how long does it take for Pods to be evicted, and can this be controlled per Pod?
Answer: By default, Pods are evicted about 5 minutes (300 seconds) after a node becomes NotReady. Historically this was governed by the `--pod-eviction-timeout` flag on the controller manager; with taint-based eviction (the default in current Kubernetes), it comes from the default `tolerationSeconds: 300` added to every Pod for the `node.kubernetes.io/not-ready` and `node.kubernetes.io/unreachable` taints. You can control it per Pod by setting `tolerationSeconds` explicitly in the Pod's tolerations, as in the sketch below.
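A Pod-spec fragment where the Pod tolerates an unhealthy node for only 30 seconds before eviction:

```yaml
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30   # evict 30s after the node goes NotReady
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
  tolerationSeconds: 30
```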
6. Is it possible for a Pod to have multiple containers sharing the same port on localhost, and what happens if they try to bind simultaneously?
Answer: No, multiple containers in the same Pod cannot bind to the same port on localhost. They share the same network namespace, so port conflicts will occur. The second container trying to bind will fail with an “address already in use” error. Containers must use different ports, or one should bind to a specific interface while the other uses localhost.
7. If you create a PVC with ReadWriteOnce access mode, can multiple Pods on the same node access it simultaneously?
Answer: It depends on the storage implementation. ReadWriteOnce means the volume can be mounted as read-write by a single node, not a single Pod. Some storage providers allow multiple Pods on the same node to access the volume simultaneously, while others enforce single Pod access. Check your storage class documentation for specific behavior.
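For reference, a minimal ReadWriteOnce claim (the storage class name is an assumption for your cluster); newer clusters also support the stricter `ReadWriteOncePod` access mode to enforce single-Pod access:

```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-claim
spec:
  accessModes:
  - ReadWriteOnce              # single-node, not single-Pod, semantics
  storageClassName: standard   # hypothetical class name
  resources:
    requests:
      storage: 10Gi
```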
8. When using Horizontal Pod Autoscaler with custom metrics, what happens if the metrics server becomes unavailable during high load?
Answer: HPA enters an “unable to get metrics” state and stops making scaling decisions. It won't scale up or down until metrics are available again, which can be dangerous during high load as the system cannot auto-scale to meet demand. Implement redundant metrics collection and consider using multiple metrics sources or falling back to resource-based metrics, as sketched below.
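One mitigation is to give the HPA a resource metric alongside the custom metric: the HPA computes a replica count per metric and uses the highest, so CPU can still drive scale-up if the custom metrics pipeline fails. A sketch (the custom metric name is hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 3
  maxReplicas: 30
  metrics:
  - type: Pods
    pods:
      metric:
        name: requests_per_second   # hypothetical custom metric
      target:
        type: AverageValue
        averageValue: "100"
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70      # fallback signal if custom metrics fail
```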
9. Can you run kubectl port-forward to a Pod that’s in CrashLoopBackOff state, and will it work?
Answer: You can attempt `kubectl port-forward` to a CrashLoopBackOff Pod, but it will only work while the Pod is actually running (between crashes). The port-forward will establish when the container starts but will break when it crashes. For persistent debugging, consider using `kubectl debug` or troubleshooting the crash cause first.
10. If a ServiceAccount is deleted while Pods using it are still running, what happens to the mounted tokens and API access?
Answer: Running Pods keep their existing mounted tokens and continue to work until the tokens expire (default 1 hour for projected tokens). However, token refresh will fail, and new Pods cannot be created with the deleted ServiceAccount. Existing Pods will eventually lose API access when tokens expire and cannot be renewed.
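Token lifetime is visible in a projected token volume; a Pod-spec fragment with an explicit expiry (the volume name is illustrative):

```yaml
volumes:
- name: api-token
  projected:
    sources:
    - serviceAccountToken:
        path: token
        expirationSeconds: 3600   # kubelet refreshes before expiry
```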
Production Troubleshooting Questions
11. When using anti-affinity rules, is it possible to create a “deadlock” where no new Pods can be scheduled?
Answer: Yes, overly restrictive anti-affinity rules can create scheduling deadlocks. For example, if you require Pods to be on different nodes but have insufficient nodes, or if conflicting affinity/anti-affinity rules make scheduling impossible. Use `preferredDuringSchedulingIgnoredDuringExecution` instead of `requiredDuringSchedulingIgnoredDuringExecution` for non-critical constraints to avoid deadlocks, as in the sketch below.
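The softer form, which the scheduler treats as a preference rather than a hard requirement (label values are illustrative):

```yaml
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchLabels:
            app: web                             # spread Pods of this app
        topologyKey: kubernetes.io/hostname      # prefer different nodes
```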
12. If you have a Job with parallelism: 3 and one Pod fails with restartPolicy: Never, will the Job create a replacement Pod?
Answer: Yes, the Job controller will create a replacement Pod to maintain the desired parallelism level. With `restartPolicy: Never`, failed Pods are not restarted in place; instead, the Job creates new Pods until it reaches the completion count or exhausts its `backoffLimit` (6 by default, after which the Job is marked Failed). Failed Pods remain in Failed state for debugging while new Pods are created.
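A minimal Job illustrating these fields (the workload command is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-demo
spec:
  parallelism: 3
  completions: 3
  backoffLimit: 6          # default; the Job fails once this is exceeded
  template:
    spec:
      restartPolicy: Never # failed Pods are replaced, not restarted
      containers:
      - name: worker
        image: busybox:1.36
        command: ["sh", "-c", "echo working && sleep 5"]
```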
13. Can a Pod’s resource requests be modified after creation, and what’s the difference between requests and limits during OOM scenarios?
Answer: Traditionally no; resource requests and limits could not be modified after Pod creation, so you had to recreate the Pod. Newer releases add in-place Pod resize (the InPlacePodVerticalScaling feature, alpha in v1.27), but recreating the Pod remains the safe assumption on most clusters. During OOM scenarios, the kernel kills processes in containers that exceed their memory limits first. If the node itself runs out of memory, Pods using more memory than their requests are prioritized for eviction, regardless of their limits.
14. When using network policies, if you don’t specify egress rules, are outbound connections blocked by default?
Answer: Only if the NetworkPolicy includes “Egress” in `policyTypes`. If you specify `policyTypes: ["Ingress", "Egress"]` without egress rules, all outbound traffic is blocked. If you only specify `policyTypes: ["Ingress"]`, or omit policyTypes with only ingress rules, egress traffic remains unrestricted.
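A sketch of the blocking case: Egress is declared in policyTypes but no egress rules follow, so all outbound traffic from the selected Pods is denied.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: deny-all-egress
spec:
  podSelector: {}    # all Pods in this namespace
  policyTypes:
  - Egress           # declared, but no egress rules -> egress blocked
```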
15. If a Persistent Volume gets corrupted, can multiple PVCs bound to it cause cascading failures across different namespaces?
Answer: A PV binds to exactly one PVC at a time; the ReadOnlyMany/ReadWriteMany access modes govern how many nodes and Pods can mount the volume, not how many PVCs can bind it. However, if a shared volume becomes corrupted, all Pods using it will fail, and when multiple PVs in different namespaces point at the same underlying storage (for example, the same NFS export), corruption can indeed cause cascading failures across namespaces.
Architecture and Scaling Questions
16. Your pod keeps getting stuck in CrashLoopBackOff, but logs show no errors. How would you approach debugging and resolution?
Answer: Check Pod events with `kubectl describe pod`, examine previous container logs with the `--previous` flag, and verify resource limits aren't causing OOMKilled. Test with disabled health checks, check volume mounts and permissions, and use debug containers or init containers to inspect the environment before the main application starts.
17. You have a StatefulSet deployed with persistent volumes, and one of the pods is not recreating properly after deletion. What could be the reasons, and how do you fix it without data loss?
Answer: Common causes include PVC stuck in terminating state, storage class issues, or node affinity conflicts with the persistent volume. Check PVC and PV status, verify storage class availability, and ensure the target node can access the volume. Force delete stuck resources if needed, but always verify data integrity first.
18. Your cluster autoscaler is not scaling up even though pods are in Pending state. What would you investigate?
Answer: Check if pending pods have resource requests (required for autoscaling), verify node group limits and quotas, examine pod scheduling constraints like affinity rules, and review autoscaler logs for errors. Ensure cloud provider APIs are accessible and IAM permissions are correct for node provisioning.
19. A network policy is blocking traffic between services in different namespaces. How would you design and debug the policy to allow only specific communication paths?
Answer: Use namespace selectors and pod selectors to create granular rules, implement default deny policies first, then add explicit allow rules. Debug by testing connectivity between pods, checking policy selectors match intended targets, and remembering that DNS resolution to kube-dns must be allowed for cross-namespace service discovery.
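A sketch of such an allow rule (namespace names, app labels, and the port are hypothetical); it relies on the `kubernetes.io/metadata.name` label that Kubernetes sets automatically on every namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api                  # protect the API Pods
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend   # only this namespace
      podSelector:
        matchLabels:
          app: web                                # only these Pods
    ports:
    - protocol: TCP
      port: 8080
```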
20. One of your microservices has to connect to an external database via a VPN inside the cluster. How would you architect this in Kubernetes with HA and security in mind?
Answer: Deploy VPN gateway pods with high availability across multiple nodes, use a service to provide stable endpoints, implement network policies to restrict VPN access, store VPN credentials in secrets, and consider using a database proxy pattern to abstract VPN complexity from application pods.
Multi-Tenancy and Security Questions
21. You’re running a multi-tenant platform on a single EKS cluster. How do you isolate workloads and ensure security, quotas, and observability for each tenant?
Answer: Use namespace-based isolation with resource quotas, implement network policies for traffic segregation, configure RBAC for access control, deploy monitoring per tenant with label-based filtering, and consider node-level isolation for highly sensitive tenants using taints, tolerations, and dedicated node groups.
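A per-tenant quota sketch (namespace name and limits are illustrative starting points):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a        # one namespace per tenant
spec:
  hard:
    requests.cpu: "10"       # total CPU requests across the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"               # cap on Pod count
```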
22. You notice the kubelet is constantly restarting on a particular node. What steps would you take to isolate the issue and ensure node stability?
Answer: Check system resources (CPU, memory, disk), examine kubelet logs for errors, verify container runtime health, check for node pressure conditions, and analyze system-level issues like OOM killer events. Consider cordoning the node, draining workloads, and potentially replacing it if hardware issues are suspected.
23. A critical pod in production gets evicted due to node pressure. How would you prevent this from happening again, and how do QoS classes play a role?
Answer: Implement Guaranteed QoS by setting equal requests and limits, use Pod priority classes for critical workloads, configure Pod disruption budgets, and ensure proper node resource reservation. Under node pressure, BestEffort Pods and Burstable Pods using more than their requests are evicted first; Guaranteed Pods (and Burstable Pods within their requests) go last, typically only when system daemons exceed their reservation.
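A Guaranteed-QoS Pod sketch (the PriorityClass is assumed to exist in your cluster):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: critical-app
spec:
  priorityClassName: high-priority   # hypothetical PriorityClass
  containers:
  - name: app
    image: nginx:1.25
    resources:
      requests:
        cpu: "500m"
        memory: 512Mi
      limits:                # equal to requests -> Guaranteed QoS
        cpu: "500m"
        memory: 512Mi
```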
24. You need to deploy a service that requires TCP and UDP on the same port. How would you configure this in Kubernetes using Services and Ingress?
Answer: A single Service can list TCP and UDP entries on the same port number as separate items in its ports array. The long-standing restriction was on type LoadBalancer, which required separate Services or load balancers for mixed protocols; since MixedProtocolLBService went GA in Kubernetes 1.26, one LoadBalancer can expose both, provided the cloud provider supports it. Ingress controllers only handle HTTP/HTTPS, so UDP traffic must be exposed directly through Services.
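A hedged sketch of a mixed-protocol Service (names and ports are illustrative; LoadBalancer support for this depends on your cloud provider):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: dns-service
spec:
  type: LoadBalancer    # mixed protocols require provider support
  selector:
    app: dns
  ports:
  - name: dns-tcp
    protocol: TCP
    port: 53
    targetPort: 5353
  - name: dns-udp
    protocol: UDP
    port: 53
    targetPort: 5353
```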
25. An application upgrade caused downtime even though you had rolling updates configured. What advanced strategies would you apply to ensure zero-downtime deployments next time?
Answer: Implement proper readiness probes with appropriate delays, use preStop hooks for graceful shutdown, configure maxUnavailable: 0 with maxSurge for extra capacity, consider blue-green or canary deployments, and ensure application supports graceful connection draining and handles SIGTERM signals properly.
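A Deployment sketch combining these settings (image, health endpoint, and timings are hypothetical and must match your application):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 4
  strategy:
    rollingUpdate:
      maxUnavailable: 0    # never dip below desired capacity
      maxSurge: 1          # roll with one extra Pod
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      terminationGracePeriodSeconds: 30
      containers:
      - name: web
        image: myapp:2.0              # hypothetical image
        readinessProbe:
          httpGet:
            path: /healthz            # hypothetical endpoint
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        lifecycle:
          preStop:
            exec:
              command: ["sh", "-c", "sleep 10"]  # let the LB drain connections
```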
Performance and Monitoring Questions
26. Your service mesh sidecar (e.g., Istio Envoy) is consuming more resources than the app itself. How do you analyze and optimize this setup?
Answer: Right-size sidecar resources based on actual usage, disable unnecessary features like tracing for non-critical services, limit configuration scope using Sidecar resources, optimize connection pools, and consider excluding certain workloads from the mesh entirely if they don’t need service mesh features.
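In Istio, limiting configuration scope looks like this: a Sidecar resource restricting each proxy to its own namespace plus the control plane, which shrinks the Envoy config and its memory footprint (the namespace is illustrative):

```yaml
apiVersion: networking.istio.io/v1beta1
kind: Sidecar
metadata:
  name: default
  namespace: my-app        # applies to all workloads in this namespace
spec:
  egress:
  - hosts:
    - "./*"                # only services in the same namespace
    - "istio-system/*"     # plus the mesh control plane
```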
27. You need to create a Kubernetes operator to automate complex application lifecycle events. How do you design the CRD and controller loop logic?
Answer: Design CRDs with clear spec (desired state) and status (observed state) fields, implement idempotent reconciliation loops that continuously observe and reconcile state differences, use owner references for resource cleanup, handle errors gracefully with exponential backoff, and follow Kubernetes API conventions for conditions and events.
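A trimmed CRD sketch showing the spec/status split and the status subresource (group and field names are hypothetical):

```yaml
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: appdeployments.example.com
spec:
  group: example.com
  scope: Namespaced
  names:
    kind: AppDeployment
    plural: appdeployments
    singular: appdeployment
  versions:
  - name: v1alpha1
    served: true
    storage: true
    subresources:
      status: {}           # status is updated only via the status subresource
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:            # desired state, written by users
            type: object
            properties:
              version:
                type: string
              replicas:
                type: integer
          status:          # observed state, written by the controller
            type: object
            properties:
              phase:
                type: string
```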
28. Multiple nodes are showing high disk IO usage due to container logs. What Kubernetes features or practices can you apply to avoid this scenario?
Answer: Configure container log rotation with size and file limits, implement centralized logging with DaemonSets, use structured logging to reduce volume, set log-level appropriately for production, configure ephemeral storage limits on containers, and automate log cleanup with scheduled jobs.
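Log rotation is configured on the kubelet; a KubeletConfiguration fragment (values are reasonable starting points, not prescriptions):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # rotate each container log at 10 MiB
containerLogMaxFiles: 3     # keep at most 3 rotated files per container
```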
29. Your Kubernetes cluster’s etcd performance is degrading. What are the root causes and how do you ensure etcd high availability and tuning?
Answer: Common causes include disk I/O bottlenecks, network latency between members, or large database size. Use dedicated SSD storage, implement proper backup strategies, configure auto-compaction, tune heartbeat and election timeouts, and deploy etcd across multiple availability zones for high availability.
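A sketch of commonly tuned settings in an etcd config file, which uses the flag names without the leading `--` (the values shown are starting points and must be validated against your network and workload):

```yaml
# etcd configuration file (fragment)
heartbeat-interval: 100          # ms; raise on high-latency networks
election-timeout: 1000           # ms; keep roughly 10x the heartbeat interval
auto-compaction-mode: periodic
auto-compaction-retention: "1"   # compact revision history older than 1 hour
quota-backend-bytes: 8589934592  # 8 GiB backend database size limit
```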
30. You want to enforce that all images used in the cluster must come from a trusted internal registry. How do you implement this at the policy level?
Answer: Use OPA Gatekeeper with constraint templates to validate image sources, implement ValidatingAdmissionWebhooks for custom logic, configure Pod Security Standards, use network policies to restrict registry access, and manage image pull secrets through service accounts with RBAC controls.
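A sketch using the K8sAllowedRepos constraint from the Gatekeeper policy library (assumes that constraint template is already installed; the registry prefix is illustrative):

```yaml
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sAllowedRepos
metadata:
  name: require-internal-registry
spec:
  match:
    kinds:
    - apiGroups: [""]
      kinds: ["Pod"]
  parameters:
    repos:
    - "registry.internal.example.com/"  # only images with this prefix are admitted
```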
Key Takeaways for Interview Success
Essential Skills Demonstrated:
- Troubleshooting methodology – Systematic approach to problem-solving
- Production awareness – Understanding real-world operational challenges
- Security considerations – Multi-layered security thinking
- Performance optimization – Capacity planning and resource management
- Architecture design – Scalable and resilient system design
Interview Preparation Tips:
- Practice explaining the “why” behind your answers, not just the “how”
- Understand trade-offs – Every solution has costs and benefits
- Think in production terms – Consider monitoring, alerting, and operational impact
- Know your debugging process – Systematic troubleshooting separates experts from beginners
- Stay current – Kubernetes evolves rapidly; understand new features and deprecations
These questions represent real scenarios you’ll encounter in production Kubernetes environments. Master these concepts, and you’ll be well-prepared for senior DevOps, Platform Engineer, or SRE interviews.
Ready to ace your Kubernetes interview? Practice these scenarios hands-on and focus on understanding the underlying principles, not just memorizing answers.