Digging into the Linux scheduler

Paulo Almeida
7 min readMar 10, 2020

One may argue that task scheduling is one of the most challenging problems to solve. Whether this is true remains to be seen but in this blog post, I will explain how Linux currently tackles this challenge.

What is the scope of this blog post?

In this blog post I will cover:

  1. some of the main problems that task scheduling has to deal with.
  2. the Linux Scheduler Core in some details.
  3. a high-level overview of some of the scheduling classes available.
  4. how all the above is implemented in the source code.

Are there pre-requisites to be able to understand this blog post?

My goal is to make it as easy as possible so that, with enough curiosity and some familiarity with algorithms, anyone will be able to grasp how most things play together.

Anything else that I need to know?

Yes, I chose to make this topic a series of blog posts. The reason behind it is because they are likely to get lengthy. Having said that, I will focus on explaining how the Linux Core Scheduler is implemented on a uni-processor (UP for short).

The next part of this series of blog posts will be about how the Completely Fair Scheduler (CFS) is implemented and how it plays its part alongside the Scheduler Core. After that, I will leave it open to what you guys want to learn about…

The task scheduling problem

In an over-simplified view of the problem, the operating system (OS) scheduler must, to the best of its ability, ensure that:

  1. every process running gets its chance of running on the CPU.
  2. some processes can have a higher priority than others.
  3. high priority processes don’t hog the CPU and prevent a low-priority process from running. (More on the second part of this blog post)
  4. a running process can be interrupted at any time if a higher priority process needs to run (we are assuming that preemption is enabled)
  5. Fend off from attacks (even unintentional ones) that can interfere with its balance. (Also on the second part of this blog post)

The anatomy of the Linux Scheduler

This is a 10,000 feet view of the most important Linux Scheduler components that will be discussed in this blog post. Honestly speaking, there is a whole lot more to it but I wanted to start simple and build on top of it.

10,000 feet view of the Linux scheduler anatomy

Runqueue

From the hierarchy point of view, the runqueue component is the rendezvous of all data structures involved in the scheduler. There will be exactly one runqueue per CPU as shown in the anatomy of the runqueue (and its main sub-components) below

We will be focusing on uni-processor (UP) so you should consider just one of them ;)

This is how the runqueue is defined in the source code:

[Figure 1] :: Runqueue being defined at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/sched.h#L1043-L1049

And this is its C definition and some comments for easier readability (some properties were omitted for brevity):

[Figure 2] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/sched.h#L847-L1002

Scheduling classes

Kernel developers realised over the years that trying to combine all different scheduling needs into a single scheduling algorithm would lead to code that would be hard to both understand and maintain. With that in mind, they had to break it down into smaller pieces called scheduling classes.

The basic idea behind it is that each scheduling class would have its own way of:

  • defining how tasks should be organised.
  • choosing which of them should be the next task to execute on the CPU.

The existing scheduling classes sit on a Linked list that establishes the precedence of each class over the next. Here is a visual representation of this list:

[Figure 3] :: Visual representation of the sched_class structure

The head of this Linked list is defined based on whether the kernel is compiled to support UP or SMP (Symmetric multiprocessor):

[Figure 4] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/sched.h#L1774-L1778

Considering that we are focusing on UP machines only, I will start straight from DL. Here is a brief description of each scheduling class:

DL (Deadline)
Used for tasks that must finish before a deadline. Audio processing and video encoding/decoding are among the typical use cases a developer would want to use this scheduling class for. DL has the highest priority in a UP machine

RT (Realtime)
Used for short latency-sensitive tasks, eg. IRQ threads. IOW, this is the POSIX ‘real-time’ task scheduling class.

FAIR (CFS)
Used for most tasks in the OS. We will take a closer look into it on the next part of this blogpost series. Stay tuned ;)

IDLE
When the scheduler has cleared all the tasks on the previous scheduling classes, an idle task is executed on the CPU. Try to look at Figure 1 now and connect the dots ;)

PS.: After you finish this article, you should have the necessary background to be able to understand this fantastic article on what an idle CPU does.

Scheduling class implementation

Implementing a scheduling class requires an instance of the sched_class struct. This is essentially a jump table containing several function pointers that will be called by the Scheduler Core when needed.

For better readability, I omitted some of the functions and added some comments.

[Figure 5] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/sched.h#L1702-L1760

And this is how one of them is actually implemented:

[Figure 6] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/rt.c#L2357-L2391

How does the Scheduler Core make use of the Scheduling classes?

To centralise the way of traversing the Linked list as new config options become available, kernel developers came up with these utility macros that will come in handy soon:

[Figure 7] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/sched.h#L1780-L1784

It’s worth mentioning that the ‘sched_class_highest’ parameter is the same variable defined in Figure 4 and NULL which is the end of the Linked list.

These macros are used in several places in the kernel, however, there is one place in the source code in particular that once you see it, you will likely be able to connect together all the pieces of the puzzle we’ve been talking about so far :)

This is place is the function called ‘pick_next_task’ from the Scheduler Core and this is the one that returns the next task that should run on the CPU.

[Figure 8] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/core.c#L3902-L3960

It’s worth mentioning that ‘class->pick_next_task’ is that function pointer defined at the jump table structure depicted in Figure 5.

The schedule() function

This is one of the most widely used functions in the kernel source code and this is for a very good reason.

Think of it as part of the public Scheduler Core API and its main purpose is to be called whenever technically applicable to give the scheduler the opportunity to make the decision about what should run on the CPU at a given moment.

There are so many DOs and DON’Ts when calling this function that I believe that it wouldn’t be feasible for me to collate them in this already lengthy article, so for the time being, it will suffice to say that a good rule of thumb (oversimplified too) is to call schedule() only when you are not in a critical section of the source code.

Once the schedule() is done with all the preparations required then it calls the main entry point of the Scheduler Core internal API: __schedule(bool preempt) which is explained below.

[Figure 9] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/core.c#L3962-L4088

The scheduling-clock interrupt

Remember that in Figure 2 although I omitted many fields, I left one called ‘u64 clock’ on the ‘struct rq’? (Take a look at there if you need to jog your memory). Here is the thing, a function called scheduler_tick() is triggered periodically.

The frequency that it happens depends on several factors that involve:

  • which architecture the kernel is running on.
  • which CONFIG_* options were enabled during the compilation.
  • which boot parameters have been used.
  • whether the CPU has already gotten to the ‘dyntick-idle mode’.

All of those details are explained in details here. Have a look at it when you finish this article.

[Figure 10] :: Full definition at https://github.com/torvalds/linux/blob/v5.5/kernel/sched/core.c#L3586-L3614

The main purpose of this function is to propagate its execution to the sched_class->task_tick function. This will give the scheduling class the opportunity to verify whether the current task is still deserving of running on the CPU and in case it isn’t then it will be marked for rescheduling.

Conclusion

I hope you guys liked this first article of mine on the subject. Feedback is always welcome :)

Stay tuned for the next part of this blogpost series on the deep dive into the CFS scheduling class.

Where do I go from now?

Well, I will list great references I used for creating this article and honestly speaking, their work is fantastic!

--

--

Paulo Almeida

Interested in technical deep dives and the Linux kernel; Opinions are my own;