Introduction / Why This Is Needed
Pods are the basic and most frequently used objects in Kubernetes. However, they can stop working for many reasons, from insufficient resources to errors in configuration or in the application itself. Systematic diagnostics with kubectl lets you pinpoint the issue quickly instead of guessing or blindly recreating resources. After completing this guide, you will be able to identify and fix the most common pod problems on your own.
Prerequisites / Preparation
Before you begin, ensure you have:
- Access to a Kubernetes cluster with permissions to view pods (usually the view role or higher).
- An installed and configured kubectl client (version compatible with your cluster).
- A basic understanding of pod structure and its components (containers, init containers, volumes).
- The name of the problematic pod and its namespace (if not default).
Step-by-Step Guide
Step 1: Check the Pod's Overall Status and the Cluster
First, ensure the pod exists and check its current state.
# Show all pods in all namespaces (or specify -n <namespace>)
kubectl get pods --all-namespaces
# If you know the namespace, look only within it
kubectl get pods -n <your-namespace>
Pay attention to the STATUS column. Critical states:
- Pending: the pod has not been scheduled onto a node.
- CrashLoopBackOff / Error: container(s) are terminating with an error.
- ImagePullBackOff / ErrImagePull: failed to pull the image.
- Terminating: the pod cannot be terminated (often due to finalizers).
If the pod is not in the list, it may have been deleted or you are looking in the wrong namespace.
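In a busy cluster the full pod list is long, so it can help to filter it down to the rows that need attention. The following is a minimal sketch, not part of kubectl: unhealthy_pods is a hypothetical helper name, and it assumes the column layout of kubectl get pods --all-namespaces --no-headers, where STATUS is the fourth column.

```shell
# Hypothetical helper: keep only pods whose STATUS is neither Running nor Completed.
# Reads `kubectl get pods --all-namespaces --no-headers` output on stdin
# (columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 ": " $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```

Note that a pod can be Running while its READY column still shows 0/1, so this filter catches only problems visible in STATUS.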
Step 2: Examine Detailed Events and Pod Configuration
The describe command is your primary diagnostic tool. It outputs all metadata about the pod and, most importantly, Events, which often hold the key to the solution.
kubectl describe pod <pod-name> -n <namespace>
What to look for in the output:
- Events: Recent events (usually at the end of the output). Look for events of type Warning or with a reason that starts with Failed. For example: Failed to pull image, Insufficient memory, NodeAffinity.
- Containers: For each container, check Ready, State (Waiting / Running / Terminated), and Last State. State and Last State show the Reason and, for terminated containers, the Exit Code.
- Conditions: Conditions like PodScheduled, Initialized, ContainersReady. If PodScheduled has status False, the pod could not be placed on a node, and the scheduler's events explain why.
- Volumes: Ensure every volume is mounted successfully. Mount problems appear in Events as FailedMount.
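When the describe output is long, the warning events for a single pod can also be queried directly. The sketch below wraps kubectl get events with a field selector. The function name pod_warnings and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: list only Warning events that refer to one pod, oldest first.
# $1 = pod name, $2 = namespace (defaults to "default").
pod_warnings() {
  pod="$1"
  ns="${2:-default}"
  kubectl get events -n "$ns" \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$pod,type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage (requires cluster access):
#   pod_warnings <pod-name> <namespace>
```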
Step 3: Analyze Container Logs
Application logs are a direct source of information about internal errors. Use the logs command.
# Logs of the default (first) container in the pod
kubectl logs <pod-name> -n <namespace>
# Logs of a specific container in a multi-container pod
kubectl logs <pod-name> -c <container-name> -n <namespace>
# If the pod is constantly restarting, check the logs of the previous instance
kubectl logs <pod-name> --previous -n <namespace>
# Logs from the last 10 minutes
kubectl logs <pod-name> --since=10m -n <namespace>
If the output is empty, the container may not be getting far enough to write to stdout/stderr (e.g., it crashes immediately after start). In this case, rely on Step 2 (events and state).
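With no logs to read, the exit code of the last terminated container is often the best remaining clue. The helper below is an illustrative sketch (explain_exit_code is a hypothetical name) that maps common exit codes to a likely cause, based on the convention that codes above 128 mean 128 plus the signal number.

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited normally" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found (check the image entrypoint/command)" ;;
    137) echo "killed by SIGKILL (often OOMKilled)" ;;
    143) echo "terminated by SIGTERM" ;;
    *)
      # Codes above 128 follow the 128 + signal number convention.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error (check logs)"
      fi
      ;;
  esac
}

# Fetch the last exit code(s) of a pod's containers (requires cluster access):
#   kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[*].lastState.terminated.exitCode}'
```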
Step 4: Check Resource Usage (CPU/Memory)
A common cause of Pending or OOMKilled is insufficient compute resources. First, check the pod's requests and limits in its manifest. Then compare with actual consumption and available resources on nodes.
# View resource requests/limits for the pod (if defined)
kubectl describe pod <pod-name> -n <namespace> | grep -E -A3 "Limits:|Requests:"
# Check current resource consumption by all pods in the namespace (requires the Metrics Server)
kubectl top pods -n <namespace>
# Check current resource usage on nodes (requires the Metrics Server)
kubectl top nodes
If the pod is Pending and an event says 0/1 nodes are available: 1 Insufficient memory, no node has enough allocatable memory left, after subtracting the requests of pods already scheduled there, to satisfy the pod's memory request (resources.requests.memory).
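The scheduler's decision comes down to simple arithmetic on requests, which can be reproduced by hand. The sketch below uses two hypothetical helpers, to_bytes and request_fits. It handles only integer quantities with the Ki, Mi, Gi, M, and G suffixes (no fractions such as 1.5Gi), and it assumes you read the node's allocatable memory and already-requested memory from kubectl describe node <node-name>.

```shell
# Hypothetical helper: convert an integer Kubernetes memory quantity to bytes.
to_bytes() {
  q="$1"
  case "$q" in
    *Ki) echo $(( ${q%Ki} * 1024 )) ;;
    *Mi) echo $(( ${q%Mi} * 1024 * 1024 )) ;;
    *Gi) echo $(( ${q%Gi} * 1024 * 1024 * 1024 )) ;;
    *M)  echo $(( ${q%M} * 1000 * 1000 )) ;;
    *G)  echo $(( ${q%G} * 1000 * 1000 * 1000 )) ;;
    *)   echo "$q" ;;   # assume plain bytes
  esac
}

# Hypothetical helper: does a memory request fit on a node?
# $1 = pod memory request, $2 = node allocatable memory,
# $3 = memory already requested by pods on that node.
request_fits() {
  need=$(to_bytes "$1")
  alloc=$(to_bytes "$2")
  used=$(to_bytes "$3")
  if [ $(( alloc - used )) -ge "$need" ]; then
    echo "fits"
  else
    echo "insufficient memory"
  fi
}

# Example: request_fits 512Mi 4Gi 3800Mi
```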
Step 5: Validate the Manifest and Container Image
Sometimes the problem lies in the pod definition itself (YAML/JSON) or in image unavailability.
# Validate the manifest syntax before applying (if you edited it)
kubectl apply --dry-run=client -f your-pod-manifest.yaml
# Inspect the image pull secret (if using a private registry):
# decode it and verify the registry URL and credentials.
# Assumes a secret of type kubernetes.io/dockerconfigjson.
kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.\.dockerconfigjson}' | base64 --decode
# Try manually running a pod with a different image (e.g., alpine) in the same namespace
# to test general network and registry functionality
kubectl run test-pod --image=alpine --restart=Never -n <namespace> --command -- sleep 3600
If the test pod with alpine starts but yours does not, the problem is likely with your image or its parameters (tag, registry, secret).
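A frequent image problem is a missing or unexpected tag. As a rough sketch, the hypothetical helper image_tag below extracts the tag from an image reference and falls back to latest when none is given, which is how an untagged reference is resolved. It looks for the tag only in the last path component, so a registry port such as :5000 is not mistaken for a tag.

```shell
# Hypothetical helper: print the tag of an image reference.
# Prints "latest" when no tag is given and "digest" for digest-pinned references.
image_tag() {
  ref="$1"
  case "$ref" in
    *@*) echo "digest"; return 0 ;;
  esac
  last="${ref##*/}"   # final path component, e.g. "app:1.2"
  case "$last" in
    *:*) echo "${last##*:}" ;;
    *)   echo "latest" ;;
  esac
}

# List the images a pod uses (requires cluster access):
#   kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}'
```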
Step 6: Check Node Status and Network Connectivity
Issues may be in the infrastructure, not the pod.
# Check node status (Ready, DiskPressure, MemoryPressure, etc.)
kubectl get nodes
kubectl describe node <node-name>
# If the pod cannot be scheduled due to taints, tolerations, or selectors,
# check if the node labels match the pod's nodeSelector/affinity.
kubectl get nodes --show-labels
# Check if the networking solution (CNI) is running on the node.
# CNI components usually run as pods in kube-system; -o wide shows the node for each pod.
kubectl get pods -n kube-system -o wide
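To spot problem nodes at a glance, the node list can be filtered the same way as the pod list. This is a minimal sketch: problem_nodes is a hypothetical helper that assumes the default column layout of kubectl get nodes --no-headers, where STATUS is the second column.

```shell
# Hypothetical helper: print nodes whose STATUS is anything other than plain "Ready"
# (for example NotReady or Ready,SchedulingDisabled).
# Reads `kubectl get nodes --no-headers` output on stdin.
problem_nodes() {
  awk '$2 != "Ready" { print $1 ": " $2 }'
}

# Usage (requires cluster access):
#   kubectl get nodes --no-headers | problem_nodes
```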
Verifying the Result
After performing the diagnostic steps and applying fixes (e.g., increasing limits, fixing the image, changing selectors), watch the pod's status:
kubectl get pod <pod-name> -n <namespace> --watch
Successful result: The pod transitions to Running state, and the READY column shows 1/1 (or the appropriate number of containers). The logs (kubectl logs) show normal application operation without fatal errors.
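In scripts, where watching the output interactively is not practical, kubectl wait can block until the pod reports the Ready condition. The wrapper below is a sketch. The function name wait_until_ready and the 120s default timeout are assumptions.

```shell
# Hypothetical helper: block until the pod is Ready or the timeout expires.
# $1 = pod name, $2 = namespace (default "default"), $3 = timeout (default 120s).
wait_until_ready() {
  pod="$1"
  ns="${2:-default}"
  timeout="${3:-120s}"
  if kubectl wait --for=condition=Ready "pod/$pod" -n "$ns" --timeout="$timeout" >/dev/null; then
    echo "ready"
  else
    echo "not ready after $timeout"
    return 1
  fi
}

# Usage (requires cluster access):
#   wait_until_ready <pod-name> <namespace> 60s
```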
Potential Issues
- kubectl cannot connect to the cluster: Check your configuration (kubectl config view), the KUBECONFIG variable, and network connectivity to the API server.
- kubectl describe pod shows no events: Events are automatically deleted after some time (one hour by default). Try viewing events for the entire namespace: kubectl get events -n <namespace> --sort-by='.lastTimestamp'.
- Pod in Completed: This is normal for Jobs or pods with a container that finished its task (e.g., a database migration). For long-running services, ensure the container runs a long-lived process (e.g., a web server).
- Pod stuck in Terminating: Most often caused by finalizers in the pod's metadata or a preStop hook that never finishes. Force deletion: kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force.
- No access to logs (Error from server: ...): Check if your user or ServiceAccount has permissions to get logs (get on the pods/log resource).
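Permission problems can be checked up front with kubectl auth can-i, which exits with a non-zero status when an action is not allowed. The sketch below checks the permissions this guide relies on. The helper names can_i and check_debug_permissions are hypothetical.

```shell
# Hypothetical helper: report whether the current user may perform one action.
can_i() {
  if kubectl auth can-i "$@" >/dev/null 2>&1; then
    echo "ok: $*"
  else
    echo "denied: $*"
  fi
}

# Hypothetical helper: check the permissions needed for the steps in this guide.
# $1 = namespace (defaults to "default").
check_debug_permissions() {
  ns="${1:-default}"
  can_i get pods -n "$ns"
  can_i get pods --subresource=log -n "$ns"
  can_i list events -n "$ns"
}

# Usage (requires cluster access):
#   check_debug_permissions <namespace>
```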