10 Real-World Kubernetes Troubleshooting Scenarios and Solutions

When Kubernetes breaks in production, you need practical solutions, not theory.

After years of firefighting Kubernetes issues across various environments, I’ve compiled these 10 real-world troubleshooting scenarios you’re likely to encounter. Each includes the exact commands and steps to diagnose and fix the problem.

These scenarios also come up frequently in DevOps interviews.

1. ImagePullBackOff: Private Registry Authentication Failure

Scenario: Your pods are stuck in ImagePullBackOff status because they can’t pull images from your private registry.

Diagnosis:

kubectl get pods
# NAME                      READY   STATUS             RESTARTS   AGE
# frontend-85f7c5b66-j2z56  0/1     ImagePullBackOff   0          5m

kubectl describe pod frontend-85f7c5b66-j2z56
# Events:
# ... Error: ErrImagePull
# ... repository requires authentication

Solution:

  1. Create a Docker registry secret (a service-account-based alternative is sketched after these steps):
kubectl create secret docker-registry regcred \
  --docker-server=your-registry.example.com \
  --docker-username=username \
  --docker-password=password \
  --docker-email=email@example.com
  2. Reference the secret in your pod spec (for a Deployment, add it under spec.template.spec):
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  containers:
  - name: app
    image: your-registry.example.com/app:latest
  imagePullSecrets:
  - name: regcred
  3. Apply the updated configuration:
kubectl apply -f deployment.yaml
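
If every workload in a namespace pulls from the same registry, you can attach the secret to the namespace’s default service account instead of editing each manifest. A minimal sketch, assuming the regcred secret from step 1:

kubectl patch serviceaccount default \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
# New pods using this service account pull with these credentials;
# existing pods must be recreated to pick up the change.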


Note: If you are running on a managed Kubernetes service such as EKS, GKE, or AKS and pulling from the cloud provider’s container registry, you usually don’t need registry credentials; the IAM role (or equivalent cloud identity) attached to the worker nodes handles authentication. Ensure that role has permission to pull images from the registry.

2. CrashLoopBackOff: Application Error on Startup

Scenario: Your container starts but immediately crashes and enters a CrashLoopBackOff state.

Diagnosis:

kubectl logs crashing-pod-78f5b67d6d-2jxgt
# Error: Database connection failed
# Connection refused at db:5432

Solution:

  1. Check if the application’s dependencies are available:
# Verify the database service exists
kubectl get service db
# If not found, this is likely your issue
  2. Create a temporary debug pod to test connectivity:
kubectl run --rm -it --image=postgres:13 db-test -- bash
# Inside the container:
apt-get update && apt-get install -y netcat-openbsd
nc -zv db 5432
# Test if the database port is reachable
# Alternatively, pg_isready -h db -p 5432 already ships with the postgres image
  3. Fix the service definition if needed:
kubectl edit service db
# Ensure the selector matches your database pod labels
  4. If connectivity is confirmed but authentication fails, check environment variables:
kubectl exec -it crashing-pod-78f5b67d6d-2jxgt -- env | grep DB_
# Check if the credentials are correct
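
Because a crash-looping container restarts constantly, kubectl logs may show you a fresh (possibly empty) run. The --previous flag prints the logs of the last terminated instance, which usually contains the actual startup error:

# Logs from the container instance that just crashed
kubectl logs crashing-pod-78f5b67d6d-2jxgt --previous

# Restart count and the reason for the last termination
kubectl describe pod crashing-pod-78f5b67d6d-2jxgt | grep -A 5 "Last State"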

3. Environment Variables Missing: ConfigMap Not Mounted

Scenario: Your application is failing because environment variables are missing.

Diagnosis:

kubectl exec -it app-pod-6f7b4d5c9-hj2k8 -- env
# Notice that expected environment variables are absent

Solution:

  1. Check if the ConfigMap exists (if it’s missing entirely, see the creation sketch after these steps):
kubectl get configmap app-config
  2. Inspect the ConfigMap content:
kubectl describe configmap app-config
  3. Verify your deployment correctly references the ConfigMap:
# Update your deployment.yaml
spec:
  containers:
  - name: app
    envFrom:
    - configMapRef:
        name: app-config
  4. Apply the changes and verify (note that the pod name changes after a rollout, so list pods again first):
kubectl apply -f deployment.yaml
kubectl exec -it app-pod-6f7b4d5c9-hj2k8 -- env
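
If step 1 reports that the ConfigMap doesn’t exist, you can create it from literals or an env file. A sketch with made-up keys; substitute the variables your application actually expects:

# Create the ConfigMap from key/value literals (hypothetical keys)
kubectl create configmap app-config \
  --from-literal=DB_HOST=db \
  --from-literal=DB_PORT=5432

# Or load it from an env-style file
kubectl create configmap app-config --from-env-file=app.env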

4. Service Connectivity: Microservice Cannot Reach Database

Scenario: Your application pod cannot connect to the database despite both running in the cluster.

Diagnosis:

kubectl exec -it app-pod-7d4f865bc9-k876h -- bash
# Inside container:
curl -v telnet://postgres-svc:5432
# Connection refused

Solution:

  1. Create a debug pod with database tools to isolate the issue:
kubectl run --rm -it --image=postgres:13 db-debugger -- bash
  2. Inside the debug pod, test DNS resolution:
nslookup postgres-svc
# Check if the service resolves correctly
  3. Check service endpoints:
kubectl get endpoints postgres-svc
# Should show pod IPs - if empty, labels may be mismatched
  4. Verify the service selector matches the pod labels:
kubectl get pods -l app=postgres --show-labels
kubectl describe service postgres-svc
# Compare the selector in the service with the pod labels
  5. If both exist correctly but you still can’t connect, check network policies (see the example policy after these steps):
kubectl get networkpolicies
kubectl describe networkpolicy restrict-db-access
# Ensure policies allow traffic from your app to the database
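
For reference, a policy that admits app traffic to the database looks roughly like this. It’s a sketch with assumed labels (app=postgres on the database pods, app=myapp on the client pods), not the exact policy from the scenario:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-app-to-db
spec:
  podSelector:
    matchLabels:
      app: postgres      # applies to the database pods
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: myapp     # allow traffic from the app pods (assumed label)
    ports:
    - protocol: TCP
      port: 5432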

5. Ingress Not Working: Load Balancer Creation Failed

Scenario: You created an Ingress resource, but no external IP is assigned and no load balancer is provisioned.

Diagnosis:

kubectl get ingress
# NAME          CLASS   HOSTS               ADDRESS   PORTS   AGE
# app-ingress   nginx   app.example.com               80      10m

kubectl describe ingress app-ingress
# Events:
# ... Failed to create load balancer, no subnet found with required tags

Solution:

  1. Check if your cloud provider subnet has the required tags:
# AWS example - check subnet tags
aws ec2 describe-subnets --subnet-ids subnet-abc123 --query 'Subnets[0].Tags'
  2. Add the required tag to your subnet:
# AWS example
aws ec2 create-tags --resources subnet-abc123 --tags Key=kubernetes.io/role/elb,Value=1
  3. Alternative: use a NodePort service temporarily:
kubectl expose deployment app-deployment --type=NodePort --port=80
kubectl get service app-deployment
# Access via node-ip:node-port
  4. Verify the ingress controller is running (a minimal Ingress manifest for cross-checking follows these steps):
kubectl get pods -n ingress-nginx
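
While you’re at it, confirm the Ingress itself points at the right class and backend. A minimal manifest matching the diagnosis above; the backend service name app-svc is an assumption:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-svc   # assumed service name
            port:
              number: 80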

6. Pod Stuck in Pending: Resource Constraints

Scenario: Your pod remains in Pending state and isn’t scheduled to any node.

Diagnosis:

kubectl describe pod big-app-78fb967d88-2jht5
# Events:
# ... 0/3 nodes are available: 3 Insufficient memory

Solution:

  1. Check node resources (a few quick commands to quantify the shortfall follow these steps):
kubectl describe nodes
# Look at the Allocated resources section
  2. Check your pod’s resource requests:
kubectl get pod big-app-78fb967d88-2jht5 -o yaml
# Look for resources.requests
  3. Either reduce the resource requests:
resources:
  requests:
    memory: "2Gi"  # Reduce this value
    cpu: "500m"    # Reduce this value
  4. Or scale up your cluster if truly needed:
# Cloud provider-specific, example for GKE:
gcloud container clusters resize my-cluster --num-nodes=5
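
To quantify the shortfall quickly, these standard commands show pending pods and how much headroom each node actually has (kubectl top requires the metrics server):

# List every pending pod across the cluster
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Compare live usage against node capacity (needs metrics-server)
kubectl top nodes

# Summarize the requests and limits already allocated per node
kubectl describe nodes | grep -A 8 "Allocated resources"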

7. Pod Evicted: Node Out of Resources

Scenario: Your pods keep getting evicted with status “Evicted”.

Diagnosis:

kubectl get pods
# NAME                      READY   STATUS    RESTARTS   AGE
# analytics-84d6f7f8b-j2h35 0/1     Evicted   0          5m

kubectl describe pod analytics-84d6f7f8b-j2h35
# Message: The node was low on resource: memory

Solution:

  1. Identify problematic nodes:
kubectl get pods --field-selector=status.phase=Failed -o wide
# See which nodes had evictions
  2. Check node resource pressure:
kubectl describe node problematic-node-name | grep Pressure
# Look for MemoryPressure: true
  3. Find resource-hungry pods:
kubectl top pods --all-namespaces --sort-by=memory
  4. Set resource limits for all deployments (see the LimitRange sketch after these steps for namespace-wide defaults):
resources:
  limits:
    memory: "1Gi"
    cpu: "500m"
  requests:
    memory: "500Mi"
    cpu: "250m"
  5. Consider implementing a resource quota for the namespace:
kubectl create quota mem-cpu-quota --hard=requests.cpu=2,requests.memory=4Gi,limits.cpu=4,limits.memory=8Gi --namespace=default
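
To enforce sensible defaults without touching every deployment, a LimitRange applies default requests and limits to any container that doesn’t declare its own. A minimal sketch; the values are assumptions to adapt to your workloads:

apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: default
spec:
  limits:
  - type: Container
    default:            # applied as limits when a container sets none
      memory: "1Gi"
      cpu: "500m"
    defaultRequest:     # applied as requests when a container sets none
      memory: "500Mi"
      cpu: "250m"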

Note: You should use node autoscaling solutions such as Karpenter to provision nodes when there aren’t enough resources to run the workload.

8. Service Account Permissions: Pod Can’t Access API Server

Scenario: Your pod needs to interact with the Kubernetes API but receives authorization errors.

Diagnosis:

kubectl logs deployment-controller-7f8b5d4c6-2jhgt
# Forbidden: cannot list deployments.apps in the namespace default

Solution:

  1. Create a proper service account:
kubectl create serviceaccount deployment-controller
  2. Create a role with the necessary permissions:
kubectl create role deployment-manager --verb=get,list,watch,create,update,patch,delete --resource=deployments
  3. Bind the role to the service account:
kubectl create rolebinding deployment-controller-binding --role=deployment-manager --serviceaccount=default:deployment-controller
  4. Update your deployment to use this service account:
spec:
  template:
    spec:
      serviceAccountName: deployment-controller
  5. Apply and verify (see the impersonation check after these steps):
kubectl apply -f deployment.yaml
kubectl get pods
# Check if the new pods are running
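
You can also confirm the binding works before redeploying by impersonating the service account; kubectl auth can-i supports this directly:

# Should print "yes" once the role binding is in place
kubectl auth can-i list deployments.apps \
  --as=system:serviceaccount:default:deployment-controller \
  --namespace=default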

9. Persistent Volume Claim Stuck in Pending: Storage Class Issues

Scenario: Your PVC remains in Pending state and the pod can’t start because its volume isn’t ready.

Diagnosis:

kubectl get pvc
# NAME        STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS   AGE
# data-pvc    Pending                                      standard       5m

kubectl describe pvc data-pvc
# Events:
# ... storageclass.storage.k8s.io "standard" not found

Solution:

  1. Check available storage classes (to mark one as the cluster default, see the patch after these steps):
kubectl get storageclass
  2. Create a storage class if needed:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs  # legacy in-tree provisioner; use ebs.csi.aws.com with the EBS CSI driver
parameters:
  type: gp2
  3. Update your PVC to use an existing storage class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp2  # Change to match an available class
  resources:
    requests:
      storage: 10Gi
  4. Apply the changes:
kubectl apply -f pvc.yaml
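
If PVCs in your cluster routinely omit storageClassName, marking one class as the cluster default lets them bind without edits. This uses the standard default-class annotation:

# Mark an existing class (here gp2) as the default
kubectl patch storageclass gp2 -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'

# Verify - the default class is flagged as (default) in the output
kubectl get storageclass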

Note: On managed Kubernetes services like EKS, GKE, and AKS, this issue can also be caused by the CSI driver that provisions persistent storage on the worker nodes. Ensure a compatible CSI driver is installed and its pods are running. You can also hit this issue if the IAM role attached to the nodes lacks the permissions to create volumes.

10. Container Terminated with OOMKilled: Memory Limits Too Low

Scenario: Your container keeps getting terminated with “OOMKilled” status.

Diagnosis:

kubectl describe pod memory-hog-7f8c69d5b8-3jhtf
# State:
#   Terminated:
#     Reason:    OOMKilled

Solution:

  1. Check your current memory limits (a one-liner to confirm the OOM kill follows these steps):
kubectl get pod memory-hog-7f8c69d5b8-3jhtf -o yaml | grep -A 5 resources
  2. Increase the memory limit in your deployment:
resources:
  limits:
    memory: "2Gi"  # Increase this value
  requests:
    memory: "1Gi"  # Consider increasing this too
  3. Apply the changes:
kubectl apply -f deployment.yaml
  4. If increasing resources doesn’t help, profile your application for memory leaks:
# Create a temporary pod with debugging tools
kubectl run --rm -it --image=nicolaka/netshoot debugger -- bash

# Inside the container, install memory profiling tools appropriate for your application
  5. Consider implementing resource metrics monitoring:
# Install metrics server if not already installed
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Monitor pod resource usage
kubectl top pods
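
To confirm that the last termination really was an OOM kill without scrolling through describe output, query the container status directly (the [0] index assumes a single-container pod):

kubectl get pod memory-hog-7f8c69d5b8-3jhtf \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# Prints: OOMKilled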

Conclusion

These 10 scenarios cover the most common Kubernetes issues you’ll encounter in real-world environments.

The key to effective Kubernetes troubleshooting is methodical isolation of the problem:

  1. Check pod status and events
  2. Examine logs
  3. Verify networking and connectivity
  4. Validate configurations and permissions
  5. Test with temporary debug pods when needed
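
Condensed into a single first-response pass, that checklist looks like this; pod, namespace, and service-account names are placeholders:

kubectl get pods -n <namespace>                              # step 1: status at a glance
kubectl describe pod <pod> -n <namespace>                    # step 1: events and state
kubectl logs <pod> -n <namespace> --previous                 # step 2: logs from the last crash
kubectl get endpoints,networkpolicies -n <namespace>         # step 3: connectivity surface
kubectl auth can-i --list --as=system:serviceaccount:<namespace>:<sa>  # step 4: permissions
kubectl run --rm -it --image=nicolaka/netshoot dbg -- bash   # step 5: throwaway debug pod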

By following these steps and keeping the commands above handy, you’ll be able to resolve most Kubernetes issues efficiently and minimize production downtime.

Akhilesh Mishra

I am Akhilesh Mishra, a self-taught DevOps engineer with 11+ years of experience working with private and public cloud (GCP & AWS) technologies.

I also mentor DevOps aspirants on their journey by providing guided learning and mentorship.

Topmate: https://topmate.io/akhilesh_mishra/