This article is a quick introduction to the principles and implementation of the CFS scheduler, the completely fair scheduling class in the Linux kernel.
Principle
The core principle of the CFS (Completely Fair Scheduler) is simple: give each process as “fair” a share of runtime as possible, which means that on each scheduling decision, the process that has run the least so far is chosen to run. Of course, a real scheduler has to meet many more requirements than that, and features such as preemption and priority support in the completely fair class fair_sched_class make the design of CFS much more complex than simply picking the process with the least runtime. But don't fret, let's take a step-by-step approach to understanding the principles of CFS.
Minimum runtime
To avoid excessive preemption, the Linux kernel sets a minimum runtime (scheduling granularity) for each task (process), during which the process cannot be preempted off the CPU. Unless the process gives up the CPU or executes a blocking system call, it can generally run for at least this long once scheduled. The minimum runtime can be viewed via the kernel parameter sched_min_granularity_ns.
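To see where this granularity takes effect, here is a heavily simplified sketch of the preemption check in check_preempt_tick() from kernel/sched/fair.c. The real function also compares against the next entity's vruntime, so treat this as an illustration of the idea rather than verbatim kernel code:

```c
/*
 * Simplified sketch of check_preempt_tick() in kernel/sched/fair.c.
 * Called on every scheduler tick for the currently running entity.
 */
static void check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	/* The slice this task is entitled to in the current period. */
	u64 ideal_runtime = sched_slice(cfs_rq, curr);
	/* How long it has actually run since it was last picked. */
	u64 delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;

	/* Used up its slice: reschedule. */
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));
		return;
	}

	/* Never preempt before the minimum granularity has elapsed. */
	if (delta_exec < sysctl_sched_min_granularity)
		return;

	/* ... further preemption checks elided ... */
}
```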
Time Slices
CFS ensures that higher-priority processes get more CPU time by introducing weights: CPU time is divided between processes in proportion to their weights. The runtime allocated to a process within a scheduling period is calculated according to this formula:
$$\text{runtime} = \text{sched\_period} \times \frac{\text{weight of the process}}{\text{total weight of all runnable processes}}$$
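In the kernel this computation lives in sched_slice() in kernel/sched/fair.c. The following is a simplified sketch; the real function walks the group-scheduling hierarchy and uses the overflow-safe __calc_delta() helper:

```c
/*
 * Simplified sketch of sched_slice() from kernel/sched/fair.c:
 * a task's slice is its weight-proportional share of the period.
 */
static u64 sched_slice(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
	/* Scheduling period for the current number of runnable tasks. */
	u64 period = __sched_period(cfs_rq->nr_running);

	/* slice = period * weight / total weight of the runqueue. */
	return div64_u64(period * se->load.weight, cfs_rq->load.weight);
}
```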
The weight is a quantity derived from the nice value, defined in the Linux kernel in kernel/sched/core.c#L9516 with the following values:
```c
const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
```
The weight for a nice value of 0 is NICE_0_LOAD = 1024, and each step of 1 in nice value changes the weight by a factor of roughly 1.25. That 1.25 ratio comes from the design goal that a difference of 1 in nice value should result in about a 10% difference in CPU time.
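To make the 10% figure concrete, consider two runnable tasks at nice 0 and nice 1 (an illustrative calculation of my own, using the table above):

$$\frac{1024}{1024+820} \approx 55.5\%, \qquad \frac{820}{1024+820} \approx 44.5\%$$

The two CPU shares differ by roughly 10 percentage points, and 1024 / 820 ≈ 1.25, which is exactly the ratio between adjacent entries in the table.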
Virtual Runtime
Having solved the problem of how much time to allocate to each process, two questions remain: who runs next, and how can higher-priority processes be guaranteed more runtime when the CFS implementation is built on the principle of “complete fairness”? This is where the concept of “virtual runtime” comes in. Suppose we want the following scenario: a high-priority process with three times the weight of a low-priority one gets three times the real runtime, so that each runtime divided by its weight comes out equal:
$$\frac{\text{runtime}_{\text{high}}}{\text{weight}_{\text{high}}} = \frac{\text{runtime}_{\text{low}}}{\text{weight}_{\text{low}}}$$
We then define this equal quantity as the “virtual runtime”: as long as the scheduler keeps virtual runtimes equal, the high-priority process will in effect run three times as long as the low-priority one. In Linux, this virtual runtime is related to the real runtime by the weights just mentioned:
$$\text{virtual runtime} = \text{real runtime} \times \frac{\text{NICE\_0\_LOAD}}{\text{weight}}$$
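As an illustration using the weight table above: a nice 0 task (weight 1024) accumulates virtual runtime at exactly its real rate, while a nice -5 task (weight 3121) accumulates only 1024 / 3121 ≈ 1/3 of its real runtime. For the two vruntimes to stay equal, the scheduler must therefore give the nice -5 task roughly three times as much real CPU time.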
Linux defines the vruntime variable in the scheduling entity (include/linux/sched.h#L453), which records the cumulative virtual runtime of a process, with the following code:
```c
struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;
	struct rb_node			run_node;
	struct list_head		group_node;
	unsigned int			on_rq;

	u64				exec_start;
	u64				sum_exec_runtime;
	u64				vruntime;
	u64				prev_sum_exec_runtime;
	/* ... */
};
```
The vruntime variable is updated every time the process runs, as sketched below. As for how to pick the process with the smallest vruntime, that is the job of the red-black tree.
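The update happens in update_curr() in kernel/sched/fair.c. Here is a simplified sketch with clock handling and statistics stripped out (an illustration of the idea, not the verbatim function):

```c
/*
 * Simplified sketch of update_curr() from kernel/sched/fair.c:
 * advance the running entity's vruntime by its real runtime
 * scaled by NICE_0_LOAD / weight.
 */
static void update_curr(struct cfs_rq *cfs_rq)
{
	struct sched_entity *curr = cfs_rq->curr;
	u64 now = rq_clock_task(rq_of(cfs_rq));
	u64 delta_exec = now - curr->exec_start;

	curr->exec_start = now;
	curr->sum_exec_runtime += delta_exec;

	/* calc_delta_fair() computes delta_exec * NICE_0_LOAD / weight. */
	curr->vruntime += calc_delta_fair(delta_exec, curr);

	/* Keep the runqueue's cached minimum vruntime up to date. */
	update_min_vruntime(cfs_rq);
}
```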
Red-black Trees
A red-black tree is a self-balancing binary search tree in which the leftmost node always holds the smallest key. Rebalancing performed on every insertion (or update) and deletion guarantees that the tree remains approximately balanced at all times.
See https://www.jianshu.com/p/e136ec79235c for details of these operations.
The kernel arranges the processes on the run queue into a red-black tree keyed by vruntime, so picking the process with the smallest vruntime becomes a simple matter of picking the leftmost node.
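In the kernel this lookup is even cheaper than a tree walk, because CFS caches the leftmost node. The following is close to __pick_first_entity() in kernel/sched/fair.c (slightly trimmed):

```c
/*
 * Pick the entity with the smallest vruntime. The runqueue caches
 * the leftmost node of the red-black tree, so this is O(1).
 */
static struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, run_node);
}
```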
Minimum vruntime
In addition to the red-black tree, the kernel maintains a min_vruntime variable to keep track of the smallest virtual runtime on the queue at the current point in time.
Why is this min_vruntime needed? Consider the following situations:
- A new process is added to the red-black tree: what should its vruntime be?
- A process that has been dormant for 100,000 years finally receives the event it was waiting for and wakes up: does it keep its original vruntime?
- A process is migrated to another CPU (another run queue) by load balancing: how should its vruntime change?
In each of these situations, min_vruntime gives the kernel a reference point: a new process starts with a vruntime based on the queue's min_vruntime, a woken sleeper has its vruntime brought up to roughly min_vruntime so that it neither dominates the CPU nor is starved, and a migrating process has its vruntime rebased from the old queue's min_vruntime to the new queue's.
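The first two cases are handled by place_entity() in kernel/sched/fair.c. The sketch below is simplified (scheduler feature flags such as GENTLE_FAIR_SLEEPERS are folded into constants), so treat it as an illustration of the idea rather than the exact kernel code:

```c
/*
 * Simplified sketch of place_entity() from kernel/sched/fair.c:
 * anchor a new or newly woken entity's vruntime to min_vruntime.
 */
static void place_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int initial)
{
	u64 vruntime = cfs_rq->min_vruntime;

	if (initial)
		/* New task: charge one slice up front so it cannot flood the CPU. */
		vruntime += sched_vslice(cfs_rq, se);
	else
		/* Waking sleeper: grant a bounded bonus so it runs soon. */
		vruntime -= sysctl_sched_latency / 2;

	/* Never move an entity's vruntime backwards. */
	se->vruntime = max_vruntime(se->vruntime, vruntime);
}
```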