Diagnosing and Resolving Kubernetes Pod Issues

Introduction / Why This Is Needed

Pods are the smallest deployable units in Kubernetes and the objects you will work with most often. However, they can stop working for many reasons, from insufficient resources to errors in the configuration or in the application itself. Systematic diagnostics with kubectl lets you pinpoint the issue quickly instead of guessing or blindly recreating resources. After completing this guide, you will be able to identify and fix the most common pod problems on your own.

Prerequisites / Preparation

Before you begin, ensure you have:

Access to a Kubernetes cluster with permissions to view pods (usually the view role or higher).
An installed and configured kubectl client (version compatible with your cluster).
A basic understanding of pod structure and its components (containers, Init Containers, volumes).
The name of the problematic pod and its namespace (if not default).

Step-by-Step Guide

Step 1: Check the Pod's Overall Status and the Cluster

First, ensure the pod exists and check its current state.

# Show all pods in all namespaces (or specify -n <namespace>)
kubectl get pods --all-namespaces

# If you know the namespace, look only within it
kubectl get pods -n <your-namespace>

Pay attention to the STATUS column. Critical states:

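Pending: the pod has not been scheduled onto a node yet.
CrashLoopBackOff / Error: one or more containers keep exiting with an error, and the kubelet waits longer and longer between restarts.
ImagePullBackOff / ErrImagePull: the node failed to pull the container image.
Terminating: the pod is being deleted. It is a problem only if the pod stays in this state (often because of finalizers).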

If the pod is not in the list, it may have been deleted or you are looking in the wrong namespace.
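
On a busy cluster the full listing can be long. As a minimal sketch, the hypothetical helper below (the function name is an assumption, not part of kubectl) filters the listing down to pods whose STATUS is neither Running nor Completed:

```shell
# Hypothetical helper: list pods whose STATUS is neither Running nor Completed.
# With --all-namespaces the columns are NAMESPACE NAME READY STATUS RESTARTS AGE,
# so STATUS is the fourth whitespace-separated field.
list_problem_pods() {
  kubectl get pods --all-namespaces --no-headers \
    | awk '$4 != "Running" && $4 != "Completed"'
}
```

Call it as list_problem_pods. Note that a pod can be Running and still not ready (for example READY 0/1 because a readiness probe fails), so this filter is a first pass, not a full health check.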

Step 2: Examine Detailed Events and Pod Configuration

The describe command is your primary diagnostic tool. It outputs all metadata about the pod and, most importantly, Events, which often hold the key to the solution.

kubectl describe pod <pod-name> -n <namespace>

What to look for in the output:

Events: Recent events (usually at the end of the output). Look for entries of type Warning or with a reason that starts with Failed. For example: Failed to pull image, Insufficient memory, NodeAffinity.
Containers: For each container, check Ready, State (Waiting/Running/Terminated), and Last State. The State and Last State fields show the Reason and the Exit Code.
Conditions: Conditions like PodScheduled, Initialized, ContainersReady. If PodScheduled has status False, the scheduler could not place the pod on any node.
Volumes: Check that every volume the pod references (ConfigMap, Secret, PersistentVolumeClaim) exists. Mount failures are reported in Events, typically with the reason FailedMount.
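
The Exit Code is easier to read once you know the usual convention: codes above 128 typically mean the process was killed by a signal, and the signal number is the code minus 128. The sketch below assumes that convention holds for your container runtime; the function name is hypothetical.

```shell
# Hypothetical helper: interpret a container exit code.
# Assumes the common convention that exit codes above 128 mean
# "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exit 0: container finished successfully"
  elif [ "$code" -gt 128 ] && [ "$code" -le 255 ]; then
    echo "exit $code: terminated by signal $((code - 128))"
  else
    echo "exit $code: the process exited on its own with an error status"
  fi
}
```

For example, explain_exit_code 137 reports signal 9 (SIGKILL), which is what you see when a container is killed for exceeding its memory limit, and explain_exit_code 143 reports signal 15 (SIGTERM).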

Step 3: Analyze Container Logs

Application logs are a direct source of information about internal errors. Use the logs command.

# Logs of the default (first) container in the pod
kubectl logs <pod-name> -n <namespace>

# Logs of a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>

# If the pod is constantly restarting, check the logs of the previous instance
kubectl logs <pod-name> --previous -n <namespace>

# Logs from the last 10 minutes
kubectl logs <pod-name> --since=10m -n <namespace>

If the output is empty, the container may be crashing before it writes anything to stdout/stderr (e.g., immediately after start). In this case, rely on Step 2 (events and state).
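
When you do not know whether the container has restarted, you can try the previous instance first and fall back to the current one. This is a sketch with a hypothetical function name; it relies on kubectl logs --previous returning a non-zero exit status when no previous instance exists.

```shell
# Hypothetical helper: show the last 50 log lines of the previous container
# instance, or of the current instance if there is no previous one.
pod_logs_with_fallback() {
  local pod="$1" ns="$2"
  kubectl logs "$pod" -n "$ns" --previous --tail=50 2>/dev/null \
    || kubectl logs "$pod" -n "$ns" --tail=50
}
```

Usage: pod_logs_with_fallback <pod-name> <namespace>. For a multi-container pod, add -c <container-name> to both commands.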

Step 4: Check Resource Usage (CPU/Memory)

A common cause of Pending or OOMKilled is insufficient compute resources. First, check the pod's requests and limits in its manifest. Then compare with actual consumption and available resources on nodes.

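# View resource requests/limits for the pod (if defined).
# The describe output lists them under "Limits:" and "Requests:" for each container.
kubectl describe pod <pod-name> -n <namespace> | grep -E -A3 "Limits:|Requests:"

# Check current resource consumption by all pods in the namespace
# (kubectl top requires the Metrics Server to be installed in the cluster)
kubectl top pods -n <namespace>

# Check current CPU and memory usage on nodes
kubectl top nodes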

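If the pod is Pending and an event says 0/1 nodes are available: 1 Insufficient memory, no node has enough unreserved memory to satisfy the pod's requests.memory. The scheduler compares requests with each node's allocatable capacity minus the requests of the pods already placed there, not with the live usage shown by kubectl top, so a node can look idle and still be full.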
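
To see how much of a node is already reserved by requests, look at the "Allocated resources" section of kubectl describe node. The helper below is a minimal sketch (the function name is hypothetical) that prints only that section, assuming it is followed by the "Events:" section as in current kubectl output.

```shell
# Hypothetical helper: print only the "Allocated resources" section
# of `kubectl describe node`, stopping before the "Events:" section.
node_allocated_resources() {
  kubectl describe node "$1" \
    | awk '/^Allocated resources:/ {show=1} /^Events:/ {show=0} show'
}
```

Usage: node_allocated_resources <node-name>. Compare the Requests column for memory with the pod's requests.memory to see whether the pod can fit.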

Step 5: Validate the Manifest and Container Image

Sometimes the problem lies in the pod definition itself (YAML/JSON) or in image unavailability.

# Validate the manifest syntax before applying (if you edited it)
kubectl apply --dry-run=client -f your-pod-manifest.yaml

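# Check the credentials used to pull the image (if using a private registry).
# Assumes a secret of type kubernetes.io/dockerconfigjson; the data is base64-encoded,
# and the decoded output contains the registry password in plain text.
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d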

# Try manually running a pod with a different image (e.g., alpine) in the same namespace
# to test general network and registry functionality
kubectl run test-pod --image=alpine --restart=Never -n <namespace> --command -- sleep 3600

If the test pod with alpine starts but yours does not, the problem is likely with your image or its parameters (tag, registry, secret).
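
The test pod keeps running for an hour unless you delete it. The sketch below wraps the same check in a hypothetical function that starts the pod, waits for it to become Ready, and cleans up afterwards. The pod name pull-probe and the 90-second timeout are arbitrary assumptions.

```shell
# Hypothetical helper: start a throwaway alpine pod in the given namespace,
# wait until it is Ready, then delete it. Returns non-zero if the pod
# could not be created or did not become Ready in time.
probe_image_pull() {
  local ns="$1" name="pull-probe" rc=0
  kubectl run "$name" --image=alpine --restart=Never -n "$ns" \
    --command -- sleep 300 >/dev/null || return 1
  kubectl wait --for=condition=Ready "pod/$name" -n "$ns" --timeout=90s || rc=$?
  # Clean up regardless of the outcome; do not block on the deletion.
  kubectl delete pod "$name" -n "$ns" --wait=false >/dev/null
  return "$rc"
}
```

Usage: probe_image_pull <namespace> && echo "scheduling and image pulls work in this namespace".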

Step 6: Check Node Status and Network Connectivity

Issues may be in the infrastructure, not the pod.

# Check node status (Ready, DiskPressure, MemoryPressure, etc.)
kubectl get nodes
kubectl describe node <node-name>

# If the pod cannot be scheduled because of taints or node selectors,
# check whether the node's labels and taints match the pod's nodeSelector, affinity, and tolerations.
kubectl get nodes --show-labels

# Check that the networking plugin (CNI) pods are running on the affected node.
# They often run in kube-system, so this check does not require SSH access to the node.
kubectl get pods -n kube-system -o wide
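
On clusters with many nodes, a quick filter helps to spot the unhealthy ones. A minimal sketch with a hypothetical function name:

```shell
# Hypothetical helper: list nodes whose STATUS is not exactly "Ready".
# The columns of `kubectl get nodes` are NAME STATUS ROLES AGE VERSION,
# so STATUS is the second field. Cordoned nodes ("Ready,SchedulingDisabled")
# are listed too, because they do not accept new pods.
list_unready_nodes() {
  kubectl get nodes --no-headers | awk '$2 != "Ready"'
}
```

Run kubectl describe node <node-name> on any node this prints to see its conditions and recent events.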

Verifying the Result

After performing the diagnostic steps and applying fixes (e.g., increasing limits, fixing the image, changing selectors), re-check the pod's status. A pod recreated by a controller such as a Deployment will have a new name.

kubectl get pod <pod-name> -n <namespace> --watch

Successful result: The pod transitions to Running state, and the READY column shows 1/1 (or the appropriate number of containers). The logs (kubectl logs) show normal application operation without fatal errors.

Potential Issues

kubectl cannot connect to the cluster: Check your configuration (kubectl config view), the KUBECONFIG variable, and network connectivity to the API server on the control plane.
kubectl describe pod shows no events: Events are kept only for a limited time (one hour by default). Try viewing events for the entire namespace: kubectl get events -n <namespace> --sort-by='.lastTimestamp'.
Pod in Completed: This is normal for Jobs or pods with a container that finished its task (e.g., a database migration). For long-running services, ensure the container runs a long-lived process (e.g., a web server).
Pod stuck in Terminating: Most often caused by finalizers in the pod metadata or a preStop hook that never finishes. As a last resort, force deletion: kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force. Force deletion removes the pod from the API without waiting for confirmation that its containers have stopped, so they may keep running on the node. A pod held by a finalizer stays until the finalizer is removed.
No access to logs (Error from server: ...): Check whether your user or ServiceAccount has permission to read logs (get on the pods/log resource).
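
Two of the checks above can be run directly from the command line. The sketch below uses hypothetical function names: the first prints the finalizers of a pod stuck in Terminating, the second asks the API server whether the current user may read pod logs in a namespace.

```shell
# Hypothetical helper: print the finalizers set on a pod (empty output means none).
pod_finalizers() {
  kubectl get pod "$1" -n "$2" -o jsonpath='{.metadata.finalizers}'
}

# Hypothetical helper: prints "yes" or "no" depending on whether the current
# user is allowed to read pod logs in the given namespace.
can_read_pod_logs() {
  kubectl auth can-i get pods --subresource=log -n "$1"
}
```

Usage: pod_finalizers <pod-name> <namespace> and can_read_pod_logs <namespace>.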

F.A.Q.

Why is my pod constantly in Pending state?

What to do if a pod exits with code 137 (OOMKilled)?

How to find out why a pod is restarting (CrashLoopBackOff)?

Can you diagnose a pod if it has no containers (e.g., an Init Container failed)?
