10 Real-World Kubernetes Troubleshooting Scenarios and Solutions

When Kubernetes breaks in production, you need practical solutions, not theory.
After years of firefighting Kubernetes issues across various environments, I’ve compiled these 10 real-world troubleshooting scenarios you’re likely to encounter. Each includes the exact commands and steps to diagnose and fix the problem.
1. ImagePullBackOff: Private Registry Authentication Failure
Scenario: Your pods are stuck in ImagePullBackOff status because they can’t pull images from your private registry.
Diagnosis:
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# frontend-85f7c5b66-j2z56 0/1 ImagePullBackOff 0 5m
kubectl describe pod frontend-85f7c5b66-j2z56
# Events:
# ... Error: ErrImagePull
# ... repository requires authentication
Solution:
- Create a Docker registry secret:
kubectl create secret docker-registry regcred \
  --docker-server=your-registry.example.com \
  --docker-username=username \
  --docker-password=password \
  --docker-email=email@example.com
- Update your deployment to reference the secret in the pod template:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: frontend
spec:
  replicas: 1
  selector:
    matchLabels:
      app: frontend
  template:
    metadata:
      labels:
        app: frontend
    spec:
      containers:
      - name: app
        image: your-registry.example.com/app:latest
      imagePullSecrets:
      - name: regcred
- Apply the updated configuration:
kubectl apply -f deployment.yaml
Note: On managed Kubernetes services such as EKS, GKE, or AKS, pulling from the cloud provider’s own container registry usually does not require an image pull secret; the nodes authenticate with an IAM role (or equivalent identity) instead. Ensure the role attached to the worker nodes has permission to pull images from that registry.
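For example, on EKS one way to grant nodes read access to ECR is to attach the AWS-managed read-only policy to the node group’s IAM role (the role name below is an assumption; substitute your own):
# Attach ECR read-only permissions to the worker node role (role name is illustrative)
aws iam attach-role-policy \
  --role-name my-eks-node-role \
  --policy-arn arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly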
2. CrashLoopBackOff: Application Error on Startup
Scenario: Your container starts but immediately crashes and enters a CrashLoopBackOff state.
Diagnosis:
kubectl logs crashing-pod-78f5b67d6d-2jxgt
# Error: Database connection failed
# Connection refused at db:5432
Solution:
- Check if the application’s dependencies are available:
# Verify the database service exists
kubectl get service db
# If not found, this is likely your issue
- Create a temporary debug pod to test connectivity:
kubectl run --rm -it --image=postgres:13 db-test -- bash
# Inside the container:
apt-get update && apt-get install -y netcat
nc -zv db 5432
# Test if database port is reachable
- Fix the service definition if needed:
kubectl edit service db
# Ensure selector matches your database pod labels
- If connectivity is confirmed but authentication fails, check environment variables:
kubectl exec -it crashing-pod-78f5b67d6d-2jxgt -- env | grep DB_
# Check if credentials are correct
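If the crash turns out to be a startup race (the app comes up before the database is ready), a common mitigation is an initContainer that waits for the dependency. A minimal sketch, assuming the database service is named db and listens on 5432:
spec:
  template:
    spec:
      initContainers:
      # Block pod startup until the database port accepts connections
      - name: wait-for-db
        image: busybox:1.36
        command: ['sh', '-c', 'until nc -z db 5432; do echo waiting for db; sleep 2; done']
      containers:
      - name: app
        image: your-app:latest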
3. Environment Variables Missing: ConfigMap Not Mounted
Scenario: Your application is failing because environment variables are missing.
Diagnosis:
kubectl exec -it app-pod-6f7b4d5c9-hj2k8 -- env
# Notice that expected environment variables are absent
Solution:
- Check if the ConfigMap exists:
kubectl get configmap app-config
- Inspect the ConfigMap content:
kubectl describe configmap app-config
- Verify your deployment is correctly referencing the ConfigMap:
# Update your deployment.yaml
spec:
  containers:
  - name: app
    envFrom:
    - configMapRef:
        name: app-config
- Apply the changes and verify:
kubectl apply -f deployment.yaml
kubectl exec -it app-pod-6f7b4d5c9-hj2k8 -- env
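For reference, the ConfigMap referenced by envFrom above might look like this (the keys and values are illustrative):
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  # Every key becomes an environment variable in the container
  DB_HOST: postgres-svc
  DB_PORT: "5432"
  LOG_LEVEL: info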
4. Service Connectivity: Microservice Cannot Reach Database
Scenario: Your application pod cannot connect to the database despite both running in the cluster.
Diagnosis:
kubectl exec -it app-pod-7d4f865bc9-k876h -- bash
# Inside container:
curl -v telnet://postgres-svc:5432
# Connection refused
Solution:
- Create a debug pod with database tools to isolate the issue:
kubectl run --rm -it --image=postgres:13 db-debugger -- bash
- Inside the debug pod, test DNS resolution:
getent hosts postgres-svc
# Check if the service name resolves to a ClusterIP (the postgres image does not ship nslookup by default)
- Check service endpoints:
kubectl get endpoints postgres-svc
# Should show pod IPs - if empty, labels may be mismatched
- Verify service selector matches pod labels:
kubectl get pods -l app=postgres --show-labels
kubectl describe service postgres-svc
# Compare selector in service with pod labels
- If both exist correctly but still can’t connect, check network policies:
kubectl get networkpolicies
kubectl describe networkpolicy restrict-db-access
# Ensure policies allow traffic from your app to the database
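If a deny-by-default policy is blocking traffic, a policy along these lines would admit the application pods to the database on port 5432 (the app: myapp and app: postgres labels are assumptions; match them to your own):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-db
spec:
  # Applies to the database pods
  podSelector:
    matchLabels:
      app: postgres
  policyTypes:
  - Ingress
  ingress:
  # Allow only the application pods to reach the database port
  - from:
    - podSelector:
        matchLabels:
          app: myapp
    ports:
    - protocol: TCP
      port: 5432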
5. Ingress Not Working: Load Balancer Creation Failed
Scenario: You created an Ingress resource, but no external IP is assigned and no load balancer is provisioned.
Diagnosis:
kubectl get ingress
# NAME CLASS HOSTS ADDRESS PORTS AGE
# app-ingress nginx app.example.com 80 10m
kubectl describe ingress app-ingress
# Events:
# ... Failed to create load balancer, no subnet found with required tags
Solution:
- Check if your cloud provider subnet has the required tags:
# AWS example - check subnet tags
aws ec2 describe-subnets --subnet-ids subnet-abc123 --query 'Subnets[0].Tags'
- Add the required tag to your subnet:
# AWS example
aws ec2 create-tags --resources subnet-abc123 --tags Key=kubernetes.io/role/elb,Value=1
- Alternative: Use a NodePort service temporarily:
kubectl expose deployment app-deployment --type=NodePort --port=80
kubectl get service app-deployment
# Access via node-ip:node-port
- Verify the ingress controller is running:
kubectl get pods -n ingress-nginx
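It is also worth sanity-checking the Ingress resource itself. A minimal definition for the NGINX ingress class (the backend service name app-service is illustrative) looks like this:
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80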
6. Pod Stuck in Pending: Resource Constraints
Scenario: Your pod remains in Pending state and isn’t scheduled to any node.
Diagnosis:
kubectl describe pod big-app-78fb967d88-2jht5
# Events:
# ... 0/3 nodes are available: 3 Insufficient memory
Solution:
- Check node resources:
kubectl describe nodes
# Look at Allocated resources section
- Check your pod’s resource requests:
kubectl get pod big-app-78fb967d88-2jht5 -o yaml
# Look for resources.requests
- Either reduce the resource requests:
resources:
  requests:
    memory: "2Gi" # Reduce this value
    cpu: "500m" # Reduce this value
- Or scale up your cluster if truly needed:
# Cloud provider-specific, example for GKE:
gcloud container clusters resize my-cluster --num-nodes=5
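On EKS with eksctl-managed node groups, the equivalent would be something like this (cluster and node group names are assumptions):
eksctl scale nodegroup --cluster=my-cluster --name=my-nodegroup --nodes=5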
7. Pod Evicted: Node Out of Resources
Scenario: Your pods keep getting evicted with status “Evicted”.
Diagnosis:
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# analytics-84d6f7f8b-j2h35 0/1 Evicted 0 5m
kubectl describe pod analytics-84d6f7f8b-j2h35
# Message: The node was low on resource: memory
Solution:
- Identify problematic nodes:
kubectl get pods --field-selector=status.phase=Failed -o wide
# See which nodes had evictions
- Check node resource pressure:
kubectl describe node problematic-node-name | grep Pressure
# Look for MemoryPressure: true
- Find resource-hungry pods:
kubectl top pods --all-namespaces --sort-by=memory
- Set resource limits for all deployments:
resources:
  limits:
    memory: "1Gi"
    cpu: "500m"
  requests:
    memory: "500Mi"
    cpu: "250m"
- Consider implementing a resource quota for the namespace:
kubectl create quota mem-cpu-quota --hard=requests.cpu=2,requests.memory=4Gi,limits.cpu=4,limits.memory=8Gi --namespace=default
Note: You should use node autoscaling solutions such as Karpenter to provision nodes when there aren’t enough resources to run the workload.
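If adding limits to every Deployment individually is impractical, a LimitRange can apply namespace-wide defaults to containers that don’t set their own (the values below are illustrative):
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    # Applied as limits when a container specifies none
    default:
      memory: 1Gi
      cpu: 500m
    # Applied as requests when a container specifies none
    defaultRequest:
      memory: 500Mi
      cpu: 250m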
8. Service Account Permissions: Pod Can’t Access API Server
Scenario: Your pod needs to interact with the Kubernetes API but receives authorization errors.
Diagnosis:
kubectl logs deployment-controller-7f8b5d4c6-2jhgt
# Forbidden: cannot list deployments.apps in the namespace default
Solution:
- Create a proper service account:
kubectl create serviceaccount deployment-controller
- Create a role with the necessary permissions:
kubectl create role deployment-manager --verb=get,list,watch,create,update,patch,delete --resource=deployments
- Bind the role to the service account:
kubectl create rolebinding deployment-controller-binding --role=deployment-manager --serviceaccount=default:deployment-controller
- Update your deployment to use this service account:
spec:
  template:
    spec:
      serviceAccountName: deployment-controller
- Apply and verify:
kubectl apply -f deployment.yaml
kubectl get pods
# Check if new pods are running
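If you keep RBAC in version control, the Role and RoleBinding created imperatively above are equivalent to the following manifests:
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-manager
  namespace: default
rules:
- apiGroups: ["apps"]
  resources: ["deployments"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-controller-binding
  namespace: default
subjects:
- kind: ServiceAccount
  name: deployment-controller
  namespace: default
roleRef:
  kind: Role
  name: deployment-manager
  apiGroup: rbac.authorization.k8s.io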
9. Persistent Volume Claim Stuck in Pending: Storage Class Issues
Scenario: Your PVC remains in Pending state and the pod can’t start because its volume isn’t ready.
Diagnosis:
kubectl get pvc
# NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
# data-pvc Pending standard 5m
kubectl describe pvc data-pvc
# Events:
# ... storageclass.storage.k8s.io "standard" not found
Solution:
- Check available storage classes:
kubectl get storageclass
- Create a storage class if needed:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
- Update your PVC to use an existing storage class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp2 # Change to match an available class
  resources:
    requests:
      storage: 10Gi
- Apply the changes:
kubectl apply -f pvc.yaml
Note: On managed Kubernetes services like EKS, GKE, and AKS, this can also be caused by the CSI driver that provisions persistent storage. Ensure a compatible CSI driver is installed and its pods are running, and that the IAM role (or equivalent identity) attached to the nodes has sufficient permissions to create volumes.
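For example, on EKS with the AWS EBS CSI driver installed, a StorageClass would reference the CSI provisioner rather than the legacy in-tree one. A minimal sketch (the class name gp3 is an assumption):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
# Delay volume creation until a pod is scheduled, so the volume lands in the right zone
volumeBindingMode: WaitForFirstConsumer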
10. Container Terminated with OOMKilled: Memory Limits Too Low
Scenario: Your container keeps getting terminated with “OOMKilled” status.
Diagnosis:
kubectl describe pod memory-hog-7f8c69d5b8-3jhtf
# State:
# Terminated:
# Reason: OOMKilled
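You can also confirm the exit code directly; 137 means the container was killed by the kernel’s OOM killer:
kubectl get pod memory-hog-7f8c69d5b8-3jhtf \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
# 137 indicates an out-of-memory kill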
Solution:
- Check your current memory limits:
kubectl get pod memory-hog-7f8c69d5b8-3jhtf -o yaml | grep -A 5 resources
- Increase the memory limit in your deployment:
resources:
  limits:
    memory: "2Gi" # Increase this value
  requests:
    memory: "1Gi" # Consider increasing this too
- Apply the changes:
kubectl apply -f deployment.yaml
- If increasing resources doesn’t help, profile your application for memory leaks:
# Create a temporary pod with debugging tools
kubectl run --rm -it --image=nicolaka/netshoot debugger -- bash
# Inside the container, install memory profiling tools appropriate for your application
- Consider implementing resource metrics monitoring:
# Install metrics server if not already installed
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# Monitor pod resource usage
kubectl top pods
Conclusion
These 10 scenarios cover the most common Kubernetes issues you’ll encounter in real-world environments.
The key to effective Kubernetes troubleshooting is methodical isolation of the problem:
- Check pod status and events
- Examine logs
- Verify networking and connectivity
- Validate configurations and permissions
- Test with temporary debug pods when needed
By following these steps and keeping the commands above handy, you’ll be able to resolve most Kubernetes issues efficiently and minimize production downtime.