Scheduling is the process in Kubernetes that decides which pod runs on which node. Effective scheduling ensures proper allocation of resources (e.g., CPU, memory) and optimizes load balancing to enhance overall system performance and stability.
By default, the kube-scheduler in the control plane is responsible for dynamically scheduling pods based on current resource availability and operational requirements.
The illustration above demonstrates how different components communicate when you apply the YAML file with the kubectl apply -f app-descriptor.yaml command. Once the control plane receives the YAML file:
1. The kube-apiserver validates and processes the YAML file by converting it into Kubernetes objects.
2. The kube-scheduler detects the new unscheduled pods and, according to the given resource requirements, selects the most suitable node for each pod.
3. The kube-scheduler updates the pod's information to reflect the chosen node and communicates this to the kubelet via the kube-apiserver.
4. The kubelet on the designated node takes over to pull the necessary container images, create the containers, and manage their lifecycle.

The kube-scheduler considers several factors when scheduling, including resource requirements. For example, assume the above circle image requests a minimum of 1 CPU and 1 GB of memory. The scheduler then looks for available resources on the worker nodes to identify the best node on which to place the pod. In addition to resource requests, there can be other constraints such as taints and tolerations, node selectors, affinity and anti-affinity rules, and custom schedulers for more customized scheduling operations.
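To see the outcome of this workflow in practice, you can check which node the scheduler assigned each pod to and review the pod's events (the pod name below is a placeholder):

# apply the descriptor and list the pods together with their assigned nodes
kubectl apply -f app-descriptor.yaml
kubectl get pods -o wide

# the Events section records the scheduling decision for a specific pod
kubectl describe pod <pod-name> | grep -A 10 Events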
Note: The kube-scheduler is responsible solely for scheduling pods onto nodes. It does not manage the running of pods; this task is handled by the kubelet on each node. The kubelet takes charge of running the containers in the pods once they are assigned to nodes by the kube-scheduler.
Kubernetes offers great flexibility in running and managing containers; however, this flexibility also introduces significant complexity. It’s common to encounter errors during the scheduling phase, which can be time-consuming and sometimes frustrating to resolve.
To help lower operational time and costs, developers should stay alert for common problems and know how to mitigate them.
It is common for pods to fail to start due to an insufficient resources error. Kubernetes sets no default CPU or memory requests or limits, so any pod can consume as many resources as it needs, potentially starving other pods or processes running on the same node.
To mitigate this, you need to know how to use the requests and limits fields in the pod configuration file. By setting these correctly, you can ensure sufficient resources are available.
apiVersion: v1
kind: Pod
metadata:
  name: app
spec:
  containers:
  - name: app
    image: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "0.5"
      limits:
        memory: "128Mi"
        cpu: "1"
The general best practice is to set minimum resources with requests but not limits. This approach guarantees the allocation of the minimum required resources while allowing a pod to consume as much available memory as needed.
In the figure above, on worker node 1, pod 1 consumes 2 GB of memory, exceeding its requested amount, while the 2 GB of memory requested by pod 2 remains guaranteed. On worker node 2, since no other processes are running, pod 1 can utilize all available memory.
There are exceptions to this scenario where setting limits is advantageous; for example, when a container is publicly accessible, limits can prevent misuse of the infrastructure by capping resource usage.
Matching the nodeSelector in the pod specification with the node label

By default, the kube-scheduler assigns pods to any node with sufficient available resources. However, there are scenarios where specific pods need to be hosted on specific nodes. For instance, if worker node 2 is equipped with a GPU, it would be optimal to schedule a pod that runs a machine-learning model on this node. In such cases, nodeSelector is used to match the labels on nodes and schedule pods accordingly.
If the node label does not match the label specified in the nodeSelector field of a pod's configuration, the Kubernetes scheduler will not schedule the pod on that node. If no node has a matching label, the pod will remain in a Pending state.
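As a minimal sketch, assuming a GPU-equipped node labeled gpu=true and an illustrative machine-learning image, the workflow looks like this:

# label the GPU-equipped worker node
kubectl label nodes worker-node-2 gpu=true

The pod then selects that label in its specification:

apiVersion: v1
kind: Pod
metadata:
  name: ml-model
spec:
  containers:
  - name: ml-model
    image: ml-model:latest   # illustrative image name
  nodeSelector:
    gpu: "true"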
Affinity and anti-affinity rules provide more granular control over scheduling logic. For example, rules can be declared as required, which must be satisfied before a pod is scheduled, or preferred, where the scheduler prioritizes matching nodes but will still schedule the pod elsewhere if no such node is available.
There are two types of affinity rules:

1. Node affinity: works like nodeSelector but provides more expressive and flexible options, including the ability to specify rules as required and preferred.
2. Inter-pod affinity and anti-affinity: constrain which nodes a pod can be scheduled on based on the labels of pods already running on those nodes.

Affinity and anti-affinity are advanced expressions embedded in pod configuration files. Due to their flexibility, they offer a large number of options that can be specified through affinity and anti-affinity rules. Note: Writing these complex YAML configurations manually can lead to errors, so it is recommended to refer to the Kubernetes documentation whenever setting up affinity or anti-affinity rules.
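For illustration, here is a minimal sketch of a node affinity rule that requires a node labeled gpu=true and prefers a particular zone; the label keys, values, and pod name are assumptions, not values taken from this article:

apiVersion: v1
kind: Pod
metadata:
  name: ml-model
spec:
  containers:
  - name: ml-model
    image: ml-model:latest
  affinity:
    nodeAffinity:
      # the pod is only scheduled on nodes labeled gpu=true
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: gpu
            operator: In
            values: ["true"]
      # among those nodes, prefer ones in zone-a when possible
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 1
        preference:
          matchExpressions:
          - key: topology.kubernetes.io/zone
            operator: In
            values: ["zone-a"]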
Kubernetes controls pod scheduling for each node via taints, which create restrictions for nodes, and tolerations, which allow pods to be scheduled on nodes with specific taints.
A common mistake in Kubernetes cluster management is the mismatch of taints and tolerations, which can lead to scheduling issues. When a pod's toleration does not correctly match the key, value, or effect of a node's taint, the pod may be unexpectedly rejected from scheduling on that node or evicted if already running.
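As a sketch, the taint below reserves a node for GPU workloads; the key, value, and effect in the toleration must match the taint exactly (the names are illustrative):

# taint the node so that only pods tolerating dedicated=gpu are scheduled on it
kubectl taint nodes worker-node-2 dedicated=gpu:NoSchedule

The corresponding toleration in the pod specification:

apiVersion: v1
kind: Pod
metadata:
  name: ml-model
spec:
  containers:
  - name: ml-model
    image: ml-model:latest
  tolerations:
  - key: "dedicated"
    operator: "Equal"
    value: "gpu"
    effect: "NoSchedule"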
Sometimes, pods fail to be scheduled because the scheduler is unavailable to process the scheduling task, either due to a misconfiguration or because it is not running. This problem is easily detected by checking the status and configuration of the Kubernetes scheduler:
# checking the status of the kube-scheduler in the kube-system namespace
kubectl get pods -n kube-system | grep kube-scheduler
# checking the configuration, if the scheduler exists
kubectl describe pod -n kube-system kube-scheduler
In self-hosted Kubernetes clusters where the kube-scheduler is absent, you have the option to manually schedule pods or configure a custom scheduler to meet specific requirements.
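For reference, manual scheduling bypasses the scheduler entirely by setting the node name directly in the pod specification (the node and pod names below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: manually-scheduled-pod
spec:
  nodeName: worker-node-1   # binds the pod to this node without involving the kube-scheduler
  containers:
  - name: app
    image: nginx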
Kubernetes has a complex architecture of multiple interacting components. Hence, cluster operators need to be prepared for potential operational challenges. Mastering the art of diagnosing and resolving issues will yield smoother operations as well as better overall cluster resilience.
Examining the cluster and the kube-scheduler can be the starting point of your troubleshooting workflow. There are no predefined steps to follow, but generally, we can start by checking the availability of the kube-scheduler or any custom schedulers:
kubectl get pods --all-namespaces --field-selector spec.nodeName=controlplane | grep kube-scheduler
Note: If you are using a managed Kubernetes service such as Amazon Elastic Kubernetes Service (EKS), Azure Kubernetes Service (AKS), or Google Kubernetes Engine (GKE), there is no need to worry about control plane components since they are managed by the cloud service provider (CSP).
If the scheduler is available, then conditions of the data plane nodes, such as CPU, memory, and disk space, may be causing the issue. We can troubleshoot node conditions by checking for memory pressure, disk pressure, and PID pressure on a node:
kubectl describe node node01 | grep -i pressure
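It can also help to compare what has already been requested on a node against its allocatable capacity, which kubectl describe node prints under the Allocated resources section:

kubectl describe node node01 | grep -A 8 "Allocated resources"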
If the cluster is not under CPU, memory, or disk stress, then additional node constraints may be preventing the pods from being scheduled. For example, when using taints and tolerations, the key, value, and effect must exactly match between the node and the pod:
kubectl describe node node01
kubectl describe pod image-square-pod
Once you run kubectl describe, it provides a more detailed explanation of why the scheduling failed.
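You can also query scheduling failures directly from cluster events; the pod name matches the example above:

kubectl get events --field-selector involvedObject.name=image-square-pod,reason=FailedScheduling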
Kubernetes cluster logs are stored and managed within the cluster itself, typically within the kube-system namespace. By accessing scheduler logs, you can gain insights into the operational dynamics and troubleshooting aspects of pod scheduling:
kubectl logs $(kubectl get pods --namespace=kube-system | grep kube-scheduler | awk '{print $1}') --namespace=kube-system
This will give you the list of logs generated by the kube-scheduler running in the kube-system namespace. Usually, logs are a mixture of informational messages, warnings, and error messages related to the scheduler.
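If your cluster was provisioned with kubeadm, the scheduler's static pod carries the component=kube-scheduler label, so a label selector is a simpler way to fetch the same logs (this assumes a kubeadm-style control plane):

kubectl logs -n kube-system -l component=kube-scheduler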
Due to the complexity of these logs, leveraging the ELK Stack (Elasticsearch, Logstash, and Kibana), Fluentd, or other similar tools for log management and analysis can be extremely beneficial. These tools collect, search, and visualize log data, uncovering trends and facilitating issue resolution.
Following log analysis, enhancing cluster monitoring with Kubernetes metrics provides a deeper understanding of scheduling dynamics. Prometheus and Grafana are integral for this purpose.
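For example, once Prometheus is scraping the kube-scheduler, a query such as the one below surfaces the rate of failed scheduling attempts; the Prometheus URL is a placeholder, and metric names can vary slightly across Kubernetes versions:

# query the rate of failed scheduling attempts over the last five minutes
curl -s 'http://<prometheus-host>:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(scheduler_schedule_attempts_total{result="error"}[5m]))'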
Pod scheduling is crucial for resource efficiency and system stability in Kubernetes. This article detailed essential techniques for diagnosing and addressing scheduling issues to enhance performance and operational efficiency.
By implementing these strategies, developers and cluster operators can improve system performance. However, given the complexity of these practices, leveraging a monitoring solution like Site24x7 will make managing your Kubernetes environments significantly easier.
Site24x7 simplifies tracking and analyzing Kubernetes metrics, enabling proactive issue resolution.
Experience the benefits of enhanced monitoring by starting a 30-day free trial with Site24x7 today.