Linux Scheduler Stuff Page
Davide Libenzi <>



The LatSched Kernel Patch

This light kernel patch enables fine-grained scheduler timing measurements by using the kernel function get_cycles() that, on x86 CPU families, uses the rdtsc instruction to fetch the CPU cycle counter. A new character device /dev/latsched ( MAJOR = 10 - MINOR = 117 ) has been introduced to control the behaviour of the kernel scheduler measurement code and to fetch data from it. Other than measuring the scheduler latency, this patch can be used to study process scheduling and migration between CPUs. To use the patch a new kernel must be built ( with the patch applied ) and the new character device /dev/latsched must be created with :

# mknod /dev/latsched c 10 117

The code that will make use of the LatSched patch must open the device with :

if ((lsfd = open("/dev/latsched", O_RDWR)) == -1) {

The next step is to set the size of the sample ( circular ) buffer with :

if ((res = ioctl(lsfd, LS_SAMPLES, samples))) {

Then the code will have to instruct the sampler to start collecting scheduler timings with :

if ((res = ioctl(lsfd, LS_START, 0))) {

To stop the sampling process a new ioctl() call is necessary :

if ((res = ioctl(lsfd, LS_STOP, 0))) {

At this point the collected data is held inside the scheduler data buffers and must be fetched with something like this :

int cpu, ncpus, ii;
struct lsctl_getdata lsgd;

ncpus = sysconf(_SC_NPROCESSORS_CONF);

memset(&lsgd, 0, sizeof(lsgd));
lsgd.size = samples;
lsgd.data = (struct latsched_sample *) malloc(samples * sizeof(struct latsched_sample));

for (cpu = 0; cpu < ncpus; cpu++) {
    lsgd.cpu = cpu;
    lsgd.size = samples;

    if ((res = ioctl(lsfd, LS_FETCH, &lsgd))) {

    for (ii = 0; ii < lsgd.rsize; ii++) {

The kernel patch can be downloaded here :









This is a simple program that interfaces with the LatSched kernel patch to fetch data from the scheduler cycle sampler. Its code is also an example of how to use the LatSched kernel patch.

gcc -o schedcnt schedcnt.c


schedcnt [--samples n] [--ttime s] [--sttime u] [--klimit k] [-- cmdpath [arg] ...]

--samples   = Set the size of the sample buffer
--ttime     = Set the sample time in seconds
--sttime    = Set the sample time in microseconds
--klimit    = Set the cut factor for samples. Samples > AVSC*klimit are not evaluated in KAVS
--          = Separates the optional command to be executed during the test time
cmdpath     = Command to be executed
arg         = Command arguments

The output is, for each cpu :


CPU   = CPU number
NSAMP = Number of samples read
AVSC  = Average schedule() cycles
CHSQ  = ChiSquare of the CSCH distribution
KAVS  = High cut average schedule() cycles
SAMP[i] is in the form :


CENT  = Cycle counter at schedule() entry
CEXT  = Cycle counter at schedule() exit
CSCH  = Cycle duration of schedule()
PPID  = New scheduled PID
RTIM  = Cycles run time ( this may sometimes be incorrect due to schedule() nesting )

The SchedCnt code can be downloaded here :



This program creates runqueue load to let the LatSched kernel patch work on different scheduler environments.

gcc -o cpuhog cpuhog.c


cpuhog [--ntasks n] [--ttime s] [--size b]

--ntasks   = Set the number of tasks ( runqueue length )
--ttime    = Set the time cpuhog should run in seconds
--size     = Set the cache drain size in Kb

The CpuHog code can be downloaded here :



Linux kernel scheduler latency test. To make the test work with a number of tasks less than or equal to the number of CPUs, the sys_sched_yield() optimization must be removed from kernel/sched.c. This program does not make use of the LatSched patch and the accuracy of its results can vary with the cache load size. When the size becomes large, the ratio between the schedule() latency and the time spent loading the cache becomes very small and the accuracy of the measurement could be compromised.

gcc -O0 -o lat-sched lat-sched.c


lat-sched [--ntasks n] [--ttime s] [--size b] [--stksize b] [--stkalign b]
          [--max-eck k] [--verbose] [--help]

--ntasks   = Set the number of tasks ( runqueue length )
--ttime    = Set the test measure time in seconds
--size     = Set the cache drain size in Kb
--stksize  = Set the stack size for tasks
--stkalign = Set the shift each task stack will have over stksize
--max-eck  = Set the maximum correction factor above which the measure is invalid
--verbose  = Activate verbose mode
--help     = Print help screen


NRTASK   = Number of test tasks
SBENCH   = Scheduler benchmark == counter / (time * ncpu)
MSRUN    = Number of milliseconds the test is run
CSSEC    = Number of context switches / sec
LATSCH   = Number of seconds for a context switch
AVGWRK   = Number of context switches / NRTASK
THRZ     = Number of tasks with zero context switches
CHISQR   = Context switches chi-square over tasks

If the measurement turns out to be invalid ( according to eck ) the output line
will begin with a '*' character.
On UP systems this tool gives precise timings, while on MP systems lat-sched
can only be used as a benchmark by means of the SBENCH field.
Messages are printed on <stderr> while results are printed on <stdout>
The Lat-Sched code can be downloaded here :


CPU History Scheduler Patch

This patch tries to achieve a better goodness() calculation through the introduction of the concept of 'cpu time spent onto a processor', aka the probable cache footprint of the task.
Right now, if we have a couple of cpu bound tasks together with the classical editor keyboard ticking, we can see the two cpu bound tasks swapped out when the editor task kicks in.
This is wrong because the cpu bound task has probably run for a bunch of ms and hence has quite a big memory footprint on the cpu,
while the editor task very likely runs for a very short time, keeping the footprint of the cpu bound tasks almost intact.
This patch also addresses, in a better way, the process change penalty problem which, right now, is constant.
It's obvious that, having to choose between the cpu bound task and the editor task, the better candidate should be the one with the smaller memory footprint, aka the editor task.
So we can change the constant penalty into a constant plus an estimate of the memory footprint.
The patch tracks the number of jiffies that a task runs as :

JR = Jout - Jin

where Jin is the jiffies value when the task is kicked in and Jout is the jiffies value when it's kicked out.
At a given time the value of the footprint is :

W = JR > (J - Jout) ? JR - (J - Jout): 0

where J is the current time in jiffies.
This means that if the task ran for 10 ms, 10 ms ago, its current weight will be zero.
This is quite intuitive because the longer a process does not run, the more of its footprint it will lose.
The patch can be downloaded here :




Old ( 2.2.14 ) Priority Queue Patch

This is an old patch for 2.2.14 that divided the runqueue into priority based clusters and achieved "pretty good"(tm) performance with long run queues while maintaining the same behavior with short ones :


Run Queue Length Sampler Patch

This patch plugs into the Balanced Multi Queue Scheduler ( BMQS ) to enable run queue length sampling. To use the patch a new device file must be created :

# mknod /dev/rqlsamp c 10 116

This is an example of how to use it :

if ((rqfd = open("/dev/rqlsamp", O_RDONLY)) == -1) {

#define MAX_CPUS    64

int i, cpu, rqlen, sread, retcpus;
int samples[MAX_CPUS][2];

if ((sread = read(rqfd, samples, sizeof(samples))) < 0) {

retcpus = sread / (2 * sizeof(int));

for (i = 0; i < retcpus; i++) {
    cpu = samples[i][0];
    rqlen = samples[i][1];


The returned data is a set of couples of integer values: the first is the CPU logical number ( -1 == global real time tasks queue ) and the second is the run queue length. The patch is available here :



This is a simple program that uses the "Run Queue Length Sampler Patch" to print out data read from the  /dev/rqlsamp  device file. The program is available here :


Time Slice Split Scheduler ( TSSS )

This scheduler implementation splits the task's time slice from the task's dynamic priority to improve on the current scheduler behavior, which uses the  task->counter  field as both the time slice and the dynamic priority value. The recalculation loop is now performed only for running tasks and a global variable  rcl_curr  keeps track of the number of recalculation loops done :

static unsigned long rcl_curr = 0;

asmlinkage void schedule(void)
{
        ...
        /* Do we need to re-calculate counters? */
        if (unlikely(!c)) {
                rcl_curr++;
                list_for_each(tmp, &runqueue_head) {
                        p = list_entry(tmp, struct task_struct, run_list);
                        p->time_slice = NICE_TO_TICKS(p->nice);
                        p->rcl_last = rcl_curr;
                }
                goto repeat_schedule;
        }
        ...
}

and the function  add_to_runqueue()  has now become :

static inline void add_to_runqueue(struct task_struct * p)
{
        p->dyn_prio += rcl_curr - p->rcl_last;
        p->rcl_last = rcl_curr;
        if (p->dyn_prio > MAX_DYNPRIO)
                p->dyn_prio = MAX_DYNPRIO;
        list_add(&p->run_list, &runqueue_head);
        nr_running++;
}

A new task struct field  rcl_last  is used to keep track of the last recalculation loop seen by the task and to add priority to the woken-up task. The  goodness()  function has been changed to do :

        if (!p->time_slice)
                return 0;

        weight = p->dyn_prio + 1;

using, in this way, the  dyn_prio  task struct member as the dynamic priority value. The  update_process_times()  function has been changed to :

void update_process_times(int user_tick)
{
        struct task_struct *p = current;
        int cpu = smp_processor_id(), system = user_tick ^ 1;

        update_one_process(p, user_tick, system, cpu);
        if (p->pid) {
                expire_task(p);
                if (p->nice > 0)
                        kstat.per_cpu_nice[cpu] += user_tick;
                else
                        kstat.per_cpu_user[cpu] += user_tick;
                kstat.per_cpu_system[cpu] += system;
        } else if (local_bh_count(cpu) || local_irq_count(cpu) > 1)
                kstat.per_cpu_system[cpu] += system;
}

void expire_task(struct task_struct *p)
{
        if (!p->time_slice)
                p->need_resched = 1;
        else {
                if (!--p->time_slice) {
                        if (p->dyn_prio > 0) {
                                p->time_slice--;
                                p->dyn_prio--;
                        } else
                                p->need_resched = 1;
                } else if (p->time_slice < -NICE_TO_TICKS(p->nice)) {
                        p->time_slice = 0;
                        p->need_resched = 1;
                }
        }
}
Basically the  dyn_prio  field accumulates dynamic priority when the task misses the recalculation loop and this, like the current scheduler, gives I/O bound tasks a nice interactive feel. But unlike the current scheduler, where a task having a high  counter  value can exercise its CPU time all at once, this implementation will grant tasks having  dyn_prio > 0  the ability to use an extra time slice by decreasing their dynamic priority by one. This will maintain the same interactive feel but will prevent tasks with a huge counter accumulation from freezing the system by exercising their time slices all at once. Another improved behavior of this implementation is that CPU bound tasks, having dynamic priority zero, will be correctly sorted out by the  goodness()  function using  mm  affinities :

        if (p->mm == this_mm || !p->mm)
                weight += 1;

where the +1 given, this time, matters. The function  sys_sched_yield()  can be written like :

asmlinkage long sys_sched_yield(void)
{
        int nr_pending = nr_running;
        int i;

        for (i = 0; i < smp_num_cpus; i++) {
                int cpu = cpu_logical_map(i);
                if (aligned_data[cpu].schedule_data.curr != idle_task(cpu))
                        nr_pending--;
        }
        if (nr_pending) {
                struct task_struct *ctsk = current;
                if (ctsk->policy == SCHED_OTHER)
                        ctsk->policy |= SCHED_YIELD;
                ctsk->need_resched = 1;

                ctsk->time_slice = 0;
                ctsk->dyn_prio++;
        }
        return 0;
}

where the task time slice is zeroed and its dynamic priority is increased. This fixes the infinite-switch behavior of  sys_sched_yield()  and the increased dynamic priority will grant the yielding task its time slice later. To show the effect of this patch on  sys_sched_yield()  two  schedcnt  dumps are reported. The test has been run by creating two CPU hog processes with  cpuhog  and by running  lat-sched  with two tasks using sched_yield() to switch. The first dump is the one generated by the current scheduler, where tasks 847 and 848 are the yield()ing tasks and tasks 843 and 844 are the CPU hog tasks. It clearly shows the huge number of switches that the system has to perform ( wasting CPU cycles ) before the CPU hog task has the opportunity to run. The second dump is generated by using the proposed scheduler, where tasks 1763 and 1764 are the yield()ing tasks and tasks 1758 and 1759 are the CPU hog tasks. It's clear that this scenario is much better: the CPU hog tasks get the CPU way earlier than in the previous dump, without wasting CPU time and without loading the scheduler with tons of switches. The patch is available here :


RTLats - Real Time Tasks Scheduler Latency Sampler

This program uses the LatSched patch to measure the scheduler latency for real time tasks. The source code is available here :