Kubernetes Scheduler module

Pratham Sahu

Feb 23, 2024 5 min read

Kubernetes Scheduling Module

Kubernetes Architecture

Kubernetes is an orchestration platform for efficiently managing microservices in a distributed system. I have personally used Kubernetes to deploy PuppyLove, the cryptographically secure dating platform at IIT Kanpur.

You can find my blog about it here

Let us first talk about the high-level architecture of a Kubernetes cluster and how it interacts with the external world through its API, for example via kubectl.

Reference: https://kubernetes.io/docs/concepts/architecture/

A cluster consists of multiple nodes, each of which runs Pods. A Pod is a group of one or more containers, which can be created from Docker images. Kubernetes follows a client-server architecture in which the control plane acts as the master and controls the worker nodes that run the workloads. You can think of its role as analogous to an operating system: the entire state of the cluster is managed by the control plane.

That should be more than enough context for the next part, where we deep dive into the Kubernetes Scheduler. You can read more about the architecture here.

Scheduler

As per the Kubernetes documentation,

In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of terminating one or more Pods on Nodes.

We will look at the following directory to better understand how this works.

/pkg/scheduler

The high-level view of the scheduler is presented here

Reference: https://kubernetes.io/docs/concepts/scheduling-eviction/scheduling-framework/

The scheduling cycle selects a node for the Pod, and the binding cycle applies that decision to the cluster. The scheduler instance runs a loop indefinitely; every time a pod is available, the loop invokes the scheduling logic and makes sure the pod either gets a node assigned or is requeued for future processing. Each iteration consists of a blocking scheduling cycle and a non-blocking binding cycle.
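
The split between a blocking scheduling cycle and a non-blocking binding cycle can be illustrated with a small Go sketch. This is not the kube-scheduler code: `runLoop`, `scheduleCycle` and the round-robin node choice are hypothetical, and the loop is finite here so that the example terminates.

```go
package main

import (
	"fmt"
	"sync"
)

// binding records the decision made by the scheduling cycle.
type binding struct{ pod, node string }

// scheduleCycle is a hypothetical stand-in for the blocking scheduling
// cycle: it picks a node for the pod (round-robin by index).
func scheduleCycle(pod string, i int, nodes []string) string {
	return nodes[i%len(nodes)]
}

// runLoop handles pods one at a time. The scheduling cycle runs inline, so
// it is serialised. The binding cycle is handed to a goroutine, so the loop
// can move on to the next pod without waiting for the bind to finish.
func runLoop(pods, nodes []string) map[string]string {
	var (
		mu    sync.Mutex
		wg    sync.WaitGroup
		bound = make(map[string]string)
	)
	for i, pod := range pods {
		node := scheduleCycle(pod, i, nodes) // blocking
		wg.Add(1)
		go func(b binding) { // non-blocking binding cycle
			defer wg.Done()
			mu.Lock()
			defer mu.Unlock()
			bound[b.pod] = b.node
		}(binding{pod, node})
	}
	wg.Wait() // only needed because this sketch returns a result
	return bound
}

func main() {
	got := runLoop([]string{"a", "b", "c"}, []string{"n1", "n2"})
	fmt.Println(got["a"], got["b"], got["c"])
}
```

Node selection stays sequential, so two pods are never scheduled against the same view of the cluster at once, while the slower bind step overlaps with the next scheduling cycle.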

Let us walk through how the scheduler instance works

Scheduler instance is initialised

The scheduler is initialised here. It does the following tasks:

The scheduler cache is initialised.
The scheduler is initialised with its options.
The Configurator builds the instance (connects the scheduling algorithm, caches and plugins).
Event handlers are registered, which allow the scheduler to handle changes coming from external API calls.

Scheduling a pod

scheduleOne: This process is serialised. It gets the next pod to be scheduled using podInfo := sched.NextPod(). It then calls the scheduling algorithm using scheduleResult, err := sched.Algorithm.Schedule(schedulingCycleCtx, fwk, state, pod)

Once a node has been selected for the pod, the pod is added to the list of AssumePods. These are pods which have been assigned to a node but are not bound to it yet. This is done by the assume function, whose code can be found here.

This function adds the assumed pod to the cache. The cache is used to free the scheduling context for the next pod while the binding cycle is still running; the binding cycle reports back with a FinishBinding signal when it is done. The working of the Scheduler Cache is explained in detail in the next section. The scheduling cycle then calls the Permit Plugin.

Permit Plugin: Permit plugins are invoked at the end of the scheduling cycle for each Pod, to prevent or delay the binding to the candidate node. A permit plugin can return one of three results: approve, deny or wait.

Scheduler Cache

The cache is responsible for capturing the current state of the cluster. It allows the snapshot of the cluster to be updated with the latest state at the beginning of each scheduling cycle; the snapshot pins the cluster state while the scheduling algorithm runs.

The cache also supports an assume operation, which temporarily stores a pod in the cache and makes it look, to all consumers of the snapshot, as if the pod were already running on its designated node. This increases the throughput of the scheduler.
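
The assume operation can be sketched in Go as follows. This is a deliberately small model under stated assumptions: the `cache` type and its methods `assumePod`, `confirmPod`, `forgetPod` and `snapshot` are hypothetical names, and the real scheduler cache tracks far more state (node resources, expiry of assumed pods, and so on).

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// cache is a toy model: it only remembers which node each pod is on and
// whether that placement is still an assumption.
type cache struct {
	mu      sync.Mutex
	nodeOf  map[string]string // pod -> node, assumed and confirmed pods alike
	assumed map[string]bool   // pods placed optimistically, not yet bound
}

func newCache() *cache {
	return &cache{nodeOf: map[string]string{}, assumed: map[string]bool{}}
}

// assumePod records the pod on the node before the binding has happened.
func (c *cache) assumePod(pod, node string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.nodeOf[pod]; exists {
		return fmt.Errorf("pod %q is already in the cache", pod)
	}
	c.nodeOf[pod] = node
	c.assumed[pod] = true
	return nil
}

// confirmPod marks an assumed pod as really bound (simplified).
func (c *cache) confirmPod(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.assumed, pod)
}

// forgetPod undoes an assumption, for example after a failed binding.
func (c *cache) forgetPod(pod string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.assumed[pod] {
		delete(c.assumed, pod)
		delete(c.nodeOf, pod)
	}
}

// snapshot returns node -> pods. Assumed pods are included, so consumers
// see them as if they were already running.
func (c *cache) snapshot() map[string][]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	snap := map[string][]string{}
	for pod, node := range c.nodeOf {
		snap[node] = append(snap[node], pod)
	}
	for _, pods := range snap {
		sort.Strings(pods)
	}
	return snap
}

func main() {
	c := newCache()
	if err := c.assumePod("web-1", "node-a"); err != nil {
		fmt.Println(err)
	}
	fmt.Println(c.snapshot()) // web-1 is visible on node-a before binding
	c.forgetPod("web-1")
	fmt.Println(c.snapshot()) // the assumption has been rolled back
}
```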

Binding the pod to the node

The binding cycle binds the pod to the node and returns a signal to the Scheduler Cache when done. It consists of the following four steps, run in this order:

Invoking WaitOnPermit (an internal API) for plugins from the Permit extension point. Some plugins from this extension point may ask the pod to wait for a condition (e.g. wait for additional resources to become available, or wait for all pods in a gang to be assumed). Under the hood, WaitOnPermit waits for such a condition to be met within a timeout threshold.
Invoking plugins from PreBind extension point
Invoking plugins from Bind extension point
Invoking plugins from PostBind extension point

Scheduling Framework

The scheduling framework is used to filter and score the nodes on which pods can be scheduled. The plugins are initialized here by passing a framework handler, which provides interfaces to manage the pods and nodes and to query other handlers.

The scheduler interfaces with the scheduling algorithm. The algorithm does the following:

Takes a snapshot from the cache.
Finds the nodes on which the pod can be scheduled (code).
If there are at least two nodes on which the pod can be scheduled, uses a scoring mechanism to prioritize them.
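
The filter-then-score flow above can be sketched in Go. Everything here is illustrative: the `node` and `pod` types, the resource units, and the scoring rule (prefer the node with the most resources left over) are assumptions of this sketch, not the scoring used by the default plugins. Ties are broken by list order to keep the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
)

// node and pod carry only the two resources this sketch looks at.
type node struct {
	name             string
	freeCPU, freeMem int
}

type pod struct{ cpu, mem int }

// filter keeps the nodes that have enough free resources for the pod.
func filter(p pod, nodes []node) []node {
	var feasible []node
	for _, n := range nodes {
		if n.freeCPU >= p.cpu && n.freeMem >= p.mem {
			feasible = append(feasible, n)
		}
	}
	return feasible
}

// score is a hypothetical rule: more leftover resources is better.
func score(p pod, n node) int {
	return (n.freeCPU - p.cpu) + (n.freeMem - p.mem)
}

// schedule filters first and only scores when there is a real choice.
func schedule(p pod, nodes []node) (string, error) {
	feasible := filter(p, nodes)
	switch len(feasible) {
	case 0:
		return "", errors.New("no feasible node for pod")
	case 1:
		return feasible[0].name, nil // nothing to compare, skip scoring
	}
	best := feasible[0]
	for _, n := range feasible[1:] {
		if score(p, n) > score(p, best) {
			best = n
		}
	}
	return best.name, nil
}

func main() {
	nodes := []node{
		{"small", 500, 512},
		{"medium", 2000, 2048},
		{"large", 4000, 8192},
	}
	name, err := schedule(pod{cpu: 1000, mem: 1024}, nodes)
	fmt.Println(name, err)
}
```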

Queueing Mechanism

The queueing mechanism allows the scheduler to pick the best pod for the next scheduling cycle. A pod can have various dependencies to fulfil (e.g. Persistent Volumes, ConfigMaps), so the scheduler needs to be able to postpone scheduling a pod until it can be scheduled successfully.

The scheduler maintains 3 queues:

active (activeQ): provides pods for immediate scheduling. It is implemented as a heap and is the highest-priority queue.
unschedulable (unschedulableQ): parks pods which are waiting for certain condition(s) to happen. It is implemented as a map.
backoff (podBackoffQ): postpones pods which failed to be scheduled (e.g. volume still getting created) but are expected to get scheduled eventually. It is implemented as a queue. It also sets a timeout after which the pod is retried, and this timeout increases exponentially up to a threshold.

Suggested Improvement Ideas:

Advanced Resource Matching:

Implement more granular resource matching and scheduling based on detailed resource requirements (CPU, memory, I/O bandwidth, etc.) and node characteristics.

Predictive Scheduling:

Incorporate machine learning or statistical models to predict future resource requirements and usage patterns, allowing the scheduler to preemptively make decisions that optimize resource utilization and application performance.