2.6 Kernel O(1) Sched Algorithm
- Scheduler Tuning
- Multi tasking kernels (Linux) allow more than 1 process to exist at any given time and each process is allowed to run as if it is the only one on the system. Many threads from many processes appear to be running at the same time. A scheduler acomplishes this goal and it runs in a thread and is woken up by a timer interrupt, another kernel thread, or a system call.
- kernel/sched.c : In the kernel source tree you can adjust these values to tune the scheduler.
- MIN_TIMESLICE (bare minimum timeslice that a task can receive)
- MAX_TIMESLICE (maximun timeslice that a task can receive)
- The average timeslice is determined by averaging the MIN & MAX Values.
Increasing the value of the MIN and MAx will increase timeslice lengths in general. Increasing TimeSlice lengths will increase EFFICIENCY because there will be less context switches (GOOD for HPC and server systems)
- PRIO_BONUS_RATIO : is the middle percentage of the total priority range that a task can receive as a bonus or punishment in dynamic priority calculations. (Default value is 25) When the value is high, using nice() to set the static priority is less effective and when it is low, static priorities are more effective.
- MAX_SLEEP_AVG :the larger this value gets, the longer tasks will need to sleep in order to be considered active. Increaing this value hurts interactivity, but for non interactive workloads, equality among tasks may be desirable.
- STARVATION_LIMIT : Decreasing this value hurts ineractivity because the task will have to give up the CPU more often, and increasing this value will increase interactive performance at the cost of non-ineractive tasks.
- Schedulers : enforce a thread scheduling policy, including when, for how long, and in some case where (in SMP kernels-where is a question) threads can execute.
- SMP Scheduling (scheduling across CPUs)
- SMT Scheduling (Hyper Threading or Symetric Mult-Thread scheduling)
- NUMA Scheduling(Non Uniform Memory Access which means the scheduler can run a single system image across multiple nodes or physical machines. (one instance of the kernel across many phyical machines , HPC) There is aconcept of a scheduler domain that contains all CPU's in the system. CPU's are divided into groups.
- Soft Real Time Scheduling : schedule tasks that have strict timing requirements. Real Time tasks are assigned special scheduling modes and the scheduler gives them priority over another task on the system. RT scheduling modes include FIFO (first in first out) and Round Robin. This effectively ignores non RT tasks on the system.
- Context Switches : is the process of switching from one thread of executuion to another.
- Source Code Organization
- arch/ : Architecture specific code
- include/ :Header files
- kernel/ : Main Kernel Code (non architecture specific)
- mm/ : Kernel's memory management code
- Programs & Processes :
- Program : A combination of instructions and data put together to perform a task when executed.
- Process : An instance of a program. (An abstraction to embody the state of a program in execution) A process is simply a group of threads that share something called a thread group id (TGID).
- Threads : A process can have multiple threads of execution that work together to accomplish its goals. Only 1 thread can be executed on a CPU at any given time.All threads are simply processes.
- CPU Bound Threads : threads that spend alot of time doing cpu computations. (ie HPC or alot of number crunching)
- I/O Bound Threads : threads that spend time waiting on slow I/O (ie read from disk)
- Scheduling Goals
- Efficiency : it must try to allow as much real work as possible to be done while staying within the restraints of other requirements. (ie Context Switching is expensive and allowing processes to run for longer periods of time increases efficency) Efficency suffers for the sake of other goals such as interactivity, because interactivity typicaly imples more context switches.
- Interactivity : An example of interactivity is a mouse click or a keystroke. Such events require a quick response or more context switches. Many context switches give the impression of immediate response at the cost of efficiency.
- Fairness and Starvation : Tasks should be treated with a certain degree of fairness, including the stipulation that no thread starves. Starvation happens when a thread is not allowed to run for an unacceptabilty long period of time due to the prioritization of other threads. Fairness does mean that every thread should ever starve or be able to trick the scheduler into giving a higher priority ot more CPU than it out have.
- Scheduler Performance Tuning
- There is no universal ideal for schedule performance. There is no single goal that the scheduler can strive for. An example is the give and take of a very interactive desktop sistem that requires a high level of interactivity (context switching) and HPC systems where less context switching results increases efficiency. With perceived performance (multi tasking and interactive desktops), the current executing thread must give up the processor before it's timeslice is up so the context can switch to a mouse click or a key stroke. A server system on the other hand is less concerned with perceived performance and seeks actual performance. In HPC, systems are solving very complex and large problems (takes days on end) and context switching is not as important and hurts efficiency. Hence the art of scheduler tuning.
- Linux 2.4 O(n) algorithm :old and inefficient legacy linux scheduler. The algorithms execution time grows linearly sd the input size grows.
- Linux 2.6 O(1) Alorithm : New Linux scheduler (rewriiten by Ingo Molnar) -tries to create a constant upper bound for the running time of the algorithm or the algorithm is guaranteed to complete in a certain amount of time regardless of the size of the input.
- O(1) algorithm :
- Each CPU is assigned a primitve called a runqueue and it contains 2 priority array. All tasks begin on the active priority array and when they run out of their timeslice, they are moved to a expired priority array. When there are no more tasks on the priority array, it is swapped with the expired array.
- Only 1 task may modify a CPU runqueue at any given time.
- The highest priority task on the system is always scheduled first and if multiple tasks exist at the same priority level, it uses round robin.
- All tasks have a static priority (unix nice value).
- The 2.6 scheduler rewards I/) bound tasks and punishes CPU bound tasks by adding or subtracting from a task's static priority.
- Calculating TimeSlices :TimeSlices are calculated by simply scaling a task's static priority onto the possibile timeslice range and making sure a certain minimum and maximum timeslice is enforeced.
- WaitQueues : are essentially a list of tasks waiting for some event or condition to occur(waiting for I/O etc).
- Schedule() : is the main scheduler function. You cannot preempt the Schedule() function when it is running
- On SMP machines, ther are migration threads that run with high priority and they make sure that runqueues across all CPU's are balanced.
- The Future of Linux Schedulers :
- Swappable Kernels : Being able to switch schedulers (more than 1 scheduler and you get to choose based the system's function.
- Scheduler Mode :means breaking scheduler work loads into categories, and allowing root users to pick scheduler behavior of a system dynamically. Using sysctl and /etc/sysctl.conf to create dynamic changes to the kernel's scheduler on the fly.