Björn B. Brandenburg, Manohar Vanga, Mahircan Gül
— July 2016 —
Getting Started with
Linux Testbed for Multiprocessor Scheduling in Real-Time Systems
MPI-SWS
Agenda
1. What? Why? How? The first decade of LITMUSRT
2. Major Features: What sets LITMUSRT apart?
3. Key Concepts: What you need to know to get started
Getting Started with
Linux Testbed for Multiprocessor Scheduling in Real-Time Systems
— Part 1 —
What? Why? How? The first decade of LITMUSRT
What is LITMUSRT?
A real-time extension of the Linux kernel.
What is LITMUSRT?
Linux kernel patch { RT schedulers, RT synchronization, [cache & GPU management] }
+ user-space interface { C API, device files, scripts & tools }
+ tracing infrastructure { overheads, schedules, kernel debug log }
Releases: 2007.1, 2007.2, 2007.3, 2008.1, 2008.2, 2008.3, 2010.1, 2010.2, 2011.1, 2012.1, 2012.2, 2012.3, 2013.1, 2014.1, 2014.2, 2015.1, 2016.1
Mission
Enable practical multiprocessor real-time systems research under realistic conditions.
Efficiently… ➡ enable apples-to-apples comparisons with existing systems (esp. Linux)
…support real applications… ➡ I/O, synchronization, legacy code
…on real multicore hardware… ➡ realistic overheads on commodity platforms
…in a real OS. ➡ realistic implementation constraints and challenges
practical and realistic:
[figure: example schedules of five tasks T1–T5 on two processors over time 0–20, annotated with releases, completions, and deadlines, alongside a Pfair schedule showing per-subtask windows and group deadlines (if past the window)]
Going from this…
“At any point in time, the system schedules the m highest-priority jobs, where a job’s current priority is given by…”
… to this!
[figure: histogram of measured Global EDF scheduling overhead for 72 tasks on 24 cores (3 tasks per processor, host=ludwig); bin size = 4.00us; min=0.79us, max=314.60us, avg=54.63us, median=40.15us, stdev=45.83us; samples: total=560090, IQR filter not applied]
Why You Should Be Using LITMUSRT
If you are doing kernel-level work anyway…
➡ Get a head start: simplified kernel interfaces, debugging infrastructure, user-space interface, tracing infrastructure.
➡ As a baseline: compare with the schedulers in LITMUSRT.
If you are developing real-time applications…
➡ Get a predictable execution environment with "textbook algorithms" matching the literature.
➡ Isolate processes with reservation-based scheduling!
➡ Understand kernel overheads with just a few commands!
If your primary focus is theory and analysis…
➡ Understand the impact of overheads.
➡ Demonstrate the practicality of proposed approaches.
Theory vs. Practice
Why is implementing “textbook” schedulers difficult?
Besides the usual kernel fun: restricted environment, special APIs, difficult to debug, …
Scheduling in Theory
Scheduler: a function that, at each point in time, maps elements from the set of ready jobs onto a set of m processors.
[figure: a ready queue feeding a scheduling algorithm that assigns jobs to CPUs 1…m]
➡ Global policies based on global state, e.g., "At any point in time, the m highest-priority…"
➡ Sequential policies, assuming a total order of events, e.g., "If a job arrives at time t…"
Practical scheduler: the job assignment changes only in response to well-defined scheduling events (or at well-known points in time).
[figure: the scheduling algorithm is invoked per current event, mapping the ready queue onto CPUs 1…m]
Scheduling in Practice
[figure: each of CPUs 1…m invokes schedule() locally in response to its own events 1…m, all sharing a single ready queue]
Each processor schedules only itself, locally.
➡ Multiprocessor schedulers are parallel algorithms.
➡ Concurrent, unpredictable scheduling events!
➡ New events occur while a decision is being made!
➡ No globally consistent atomic snapshot for free!
Original Purpose of LITMUSRT
[figure: example Pfair and EDF schedules of tasks T1–T5 on two processors, annotated with releases, completions, deadlines, Pfair windows, and group deadlines]
Theory ➔ Practice: develop efficient implementations.
Practice ➔ Theory: inform what works well and what doesn't.
History — The First Ten Years
Releases: first prototype by Calandrino et al. (2006) [RTSS'06, not publicly released]; then 2007.1, 2007.2, 2007.3, 2008.1, 2008.2, 2008.3, 2010.1, 2010.2, 2011.1, 2012.1, 2012.2, 2012.3, 2013.1, 2014.1, 2014.2, 2015.1, 2016.1 [2006–2011; 2011– ]
Project initiated by Jim Anderson (UNC); first prototype implemented by John Calandrino, Hennadiy Leontyev, Aaron Block, and Uma Devi.
Graciously supported over the years by: NSF, ARO, AFOSR, AFRL, and Intel, Sun, IBM, AT&T, and Northrop Grumman Corps. Thanks!
Continuously maintained
➡ reimplemented for 2007.1
➡ 17 major releases spanning 40 major kernel versions (Linux 2.6.20 – 4.1)
Impact
➡ used in 50+ papers, and in 7 PhD & 3 MSc theses
➡ several hundred citations
➡ used in South & North America, Europe, and Asia
Goals and Non-Goals
Goal: Make life easier for real-time systems researchers.
➡ LITMUSRT always was, and remains, primarily a research vehicle.
➡ Encourage systems research by making it more approachable.
Goal: Be sufficiently feature-complete & stable to be practical.
➡ There is no point in evaluating systems that can't run real workloads.
Non-goal: POSIX compliance.
➡ We provide our own APIs; POSIX is old and limiting.
Non-goal: API stability.
➡ We rarely break interfaces, but do so without hesitation if needed.
Non-goal: upstream inclusion.
➡ LITMUSRT is neither intended nor suited to be merged into Linux.
Getting Started with
Linux Testbed for Multiprocessor Scheduling in Real-Time Systems
— Part 2 —
Major Features: What sets LITMUSRT apart?
Partitioned vs. Clustered vs. Global
[figure: three real-time multiprocessor scheduling approaches on a four-core machine with two L2 caches sharing main memory, with jobs J1–J4: partitioned scheduling (one queue Q1–Q4 per core), clustered scheduling (one queue per L2-cache cluster), and global scheduling (a single queue Q1 serving all cores)]
Predictable Real-Time Schedulers (matching the literature!)
Maintained in mainline LITMUSRT:
➡ Global EDF
➡ Partitioned EDF
➡ Clustered EDF
➡ Pfair (PD²)
➡ Partitioned Fixed-Priority (FP)
➡ Partitioned reservation-based (polling + table-driven)
In addition, external branches & patches / paper-specific prototypes provide:
➡ Global FIFO, Global FP, RUN, QPS
➡ EDF-fm, EDF-WM, NPS-F, EDF-C=D
➡ CBS, Sporadic Servers, CASH, EDF-HSB
➡ Global & Clustered Adaptive EDF
➡ soft polling, slack sharing, slot shifting
➡ Global message-passing EDF & FP, Strong Laminar APA FP
➡ MC², …
Jump-Start Your Research
Bottom line:
➡ The scheduler that you need might already be available.
(Almost) never start from scratch:
➡ If you need to implement a new scheduler, a good starting point (e.g., of similar structure) likely already exists.
Plenty of baselines:
➡ At the very least, LITMUSRT can provide you with interesting baselines to compare against.
Predictable Locking Protocols (matching the literature!)
Maintained in mainline LITMUSRT:
➡ MPCP, DPCP, DFLP
➡ Global OMLP, Clustered OMLP, OMIP, FMLP+
➡ SRP, PCP
➡ non-preemptive spin locks, k-exclusion locks
External branches & patches / paper-specific prototypes:
➡ MPCP-VS, RNLP, MBWI, MC-IPC, …
Lightweight Overhead Tracing
minimal static trace points
+ binary rewriting (jmp ↔ nop)
+ per-processor, wait-free buffers
Evaluate Your Workload with Realistic Overheads
[figures: overhead histograms for 72 tasks on 24 cores (3 tasks per processor, host=ludwig):
➡ G-EDF context-switch overhead: min=0.62us, max=37.74us, avg=5.52us, median=5.31us, stdev=2.10us (560,087 samples; IQR outlier filter applied)
➡ P-EDF context-switch overhead: min=0.63us, max=44.59us, avg=5.70us, median=5.39us, stdev=2.39us (560,087 samples; IQR outlier filter applied)
➡ P-EDF job-release overhead: min=0.27us, max=23.54us, avg=5.48us, median=4.93us, stdev=2.72us (152,059 samples; no IQR filter)
➡ G-EDF job-release overhead: min=1.75us, max=291.17us, avg=62.05us, median=43.40us, stdev=52.43us (152,059 samples; no IQR filter)]
Note the scale!
Automatic Interrupt Filtering
Overhead tracing, ideally: the measured activity lies entirely between the start and end timestamps.
With outliers: an ISR fires between the two timestamps ➞ noise due to an untimely interrupt.
How to cope?
➡ Can't just turn off interrupts.
➡ Could use statistical filters… but which filter? And what if there are true outliers?
Since LITMUSRT 2012.2:
➡ ISRs increment a counter.
➡ Timestamps include counter snapshots & a flag.
➡ Interrupted samples are discarded automatically.
Cycle Counter Skew Compensation
Tracing inter-processor interrupts (IPIs): the start timestamp is taken on the sending core (e.g., core 1) and the end timestamp on the receiving core (e.g., core 27).
With non-aligned clock sources, the IPI can appear to be received before it was sent (e.g., sent at cycle 1000 on core 1, received at cycle 990 on core 27)!?
[➞ the subtraction overflows to extremely large outliers]
In LITMUSRT, simply run ftcat -c to measure and automatically compensate for unaligned clock sources.
Lightweight Schedule Tracing
task parameters
+ context switches & blocking
+ job releases & deadlines & completions
Built on top of the Feather-Trace overhead-tracing infrastructure.
Schedule Visualization: st-draw
Ever wondered what a Pfair schedule looks like in reality?
Schedule Visualization: st-draw
[figure: st-draw rendering of a recorded schedule of six rtspin tasks (PIDs 30084–30089, each with cost 17.00ms and period 30.00ms) over 0ms–90ms]
Ever wondered what a Pfair schedule looks like in reality? Easy! Just record the schedule with sched_trace and run st-draw!
Note: this is real execution data from a 4-core machine, not a simulation! [Color indicates CPU identity.]
Easy Access to Workload Statistics
[figure: page excerpt from the cited ECRTS'13 paper, including its Figure 3: response times measured in LITMUSRT under C-EDF scheduling with the OMIP, the C-OMLP, and without locks]
“We traced the resulting schedules using LITMUSRT’s sched_trace facility and recorded the response times of more than 45,000,000 individual jobs.”
[—, “A Fully Preemptive Multiprocessor Semaphore Protocol for Latency-Sensitive Real-Time Applications”, ECRTS’13]
(1) st-trace-schedule my-ecrts13-experiments-OMIP        ← the argument is just a name
    […run workload…]
(2) st-job-stats *my-ecrts13-experiments-OMIP*.bin
    # Task, Job, Period, Response, DL Miss?, Lateness, Tardiness, Forced?, ACET
    # task NAME=rtspin PID=29587 COST=1000000 PERIOD=10000000 CPU=0
    29587, 2, 10000000,    1884, 0, -9998116, 0, 0,    1191
    29587, 3, 10000000, 1019692, 0, -8980308, 0, 0, 1017922
    29587, 4, 10000000, 1089789, 0, -8910211, 0, 0, 1030550
    29587, 5, 10000000, 1034513, 0, -8965487, 0, 0, 1016656
    29587, 6, 10000000, 1032825, 0, -8967175, 0, 0, 1016096
    29587, 7, 10000000, 1037301, 0, -8962699, 0, 0, 1016078
    29587, 8, 10000000, 1033699, 0, -8966301, 0, 0, 1016535
    29587, 9, 10000000, 1037287, 0, -8962713, 0, 0, 1015794
    …
The last column (ACET) answers: how long did each job actually use the processor?
Synchronous Task System Releases
[figure: st-draw schedule of eight rtspin tasks with (cost, period) pairs ranging from (1.00ms, 10.00ms) to (10.00ms, 100.00ms) over 0ms–90ms]
All tasks release their first job at a common time "zero."
int wait_for_ts_release(void);   ➞ the task sleeps until the synchronous release
int release_ts(lt_t *delay);     ➞ triggers a synchronous release in <delay> nanoseconds
Asynchronous Releases with Phase/Offset
LITMUSRT also supports a non-zero phase/offset:
➡ the release of the first job occurs with some known offset after the task-system release.
[figure: st-draw schedule of eight rtspin tasks whose first jobs are released with per-task offsets]
The release of the first job is staggered w.r.t. time "zero" ➔ schedulability tests for asynchronous periodic tasks apply.
Easier Starting Point for New Schedulers
Simplified scheduler-plugin interface:
struct sched_plugin {
    […]
    schedule_t      schedule;
    finish_switch_t finish_switch;
    […]
    admit_task_t    admit_task;
    fork_task_t     fork_task;
    task_new_t      task_new;
    task_wake_up_t  task_wake_up;
    task_block_t    task_block;
    task_exit_t     task_exit;
    task_cleanup_t  task_cleanup;
    […]
}
+ richer task model
+ plenty of working code to steal from
Many More Features…
Support for true global scheduling
➡ supports proper pull migrations, i.e., moving tasks among Linux's per-processor runqueues
➡ Linux's SCHED_FIFO and SCHED_DEADLINE global-scheduling "emulation" is not 100% correct (races are possible)
Low-overhead non-preemptive sections
➡ non-preemptive spin locks without system calls
Wait-free preemption-state tracking
➡ "Does this remote core need to be sent an IPI?"
➡ a simple API suppresses superfluous IPIs
Debug tracing with TRACE()
➡ extensive support for "printf() debugging" ➞ dump from QEMU
LITMUSRT
Predictable execution platform and research accelerator.
Apply schedulability analysis
+ under consideration of overheads
+ jump-start your development
Getting Started with
Linux Testbed for Multiprocessor Scheduling in Real-Time Systems
— Part 3 —
Key Concepts: What you need to know to get started
Scheduler Plugins
The SCHED_LITMUS "class" invokes the active plugin.
➡ LITMUSRT tasks have the highest priority.
➡ SCHED_DEADLINE & SCHED_FIFO/RR ➔ best-effort from SCHED_LITMUS's point of view
Linux scheduler classes, from lowest to highest priority: SCHED_IDLE, SCHED_OTHER (CFS), SCHED_FIFO/RR, SCHED_DEADLINE, SCHED_LITMUS.
pick_next_task() ➔ active plugin (or the Linux dummy plugin)
LITMUSRT plugins: PSN-EDF, GSN-EDF, C-EDF, P-FP, P-RES
Plugin Switch
The active plugin can be switched at runtime, but only if no real-time tasks are present:
$ setsched PSN-EDF
Classic Process-Based Plugins
The plugin manages real-time tasks directly:
➡ one sporadic task (scheduling theory) = exactly one process in Linux
$ rt_launch -p 1 10 100 <my-app-binary>
[-p 1 ➞ core, 10 ➞ WCET, 100 ➞ period]
Reservation-Based Scheduling (RBS)
The plugin (e.g., P-RES) schedules reservations (= process containers):
➡ one sporadic task in the analysis = one reservation (➞ 0..n processes)
➡ a second-level dispatcher runs in each reservation
$ resctl -n 123 -c 1 -b 10 -p 100
[-n 123 ➞ arbitrary reservation ID, -c 1 ➞ core, -b 10 ➞ budget, -p 100 ➞ period]
$ rt_launch -p 1 -r 123 <my-app-component-1>
$ rt_launch -p 1 -r 123 <my-app-component-2>
[…]
RBS: Motivation (1/2)
Temporal decomposition: sequential, recurrent tasks
➡ sequential tasks are the basic element of RT scheduling theory
vs.
Logical decomposition: software modularization
➡ split complex applications into many logical modules or components to manage spheres of responsibility
Example: one real-time "task" may consist of multiple processes: a middleware process + redundant, isolated application threads + a database process.
RBS: Motivation (2/2)
Irregular execution patterns
➡ e.g., an HTTP server triggered by unpredictable incoming requests
➡ with RBS, arbitrary activation patterns can be safely encapsulated
"Legacy" and complex software
➡ existing, complex software (e.g., ROS) cannot be required to adopt and comply with the LITMUSRT API
➡ with RBS, any Linux process can be transparently encapsulated, even kernel threads (e.g., interrupt threads)!
Worst-case execution time is a fantasy
➡ most practical systems must live with imperfect measurements and cannot always provision on a (measured) worst-case basis
➡ with RBS, overruns can be managed predictably and slack can be actively exploited
Three Main Repositories
Linux kernel patch ➔ litmus-rt
+ user-space interface ➔ liblitmus
+ tracing infrastructure ➔ feather-trace-tools
liblitmus: The User-Space Interface
C API (task model + system calls)
+ user-space tools ➔ setsched, showsched, release_ts, rt_launch, rtspin
/proc/litmus/* and /dev/litmus/*
/proc/litmus/*
➡ exports information about the plugins and existing real-time tasks
➡ read- and writable files
➡ typically managed by higher-level wrapper scripts
/dev/litmus/*
➡ special device files based on custom character-device drivers
➡ primarily export trace data (use only with ftcat):
  ‣ ft_cpu_traceX — core-local overheads of CPU X
  ‣ ft_msg_traceX — IPIs related to CPU X
  ‣ sched_traceX — scheduling events on CPU X
➡ log — debug trace (use with regular cat)
Control Page: /dev/litmus/ctrl
A (private) per-process page mapped by each real-time task:
➡ a shared memory segment between kernel and task
➡ purpose: a low-overhead communication channel for the interrupt count, the preemption-disabled and preemption-needed flags, the current deadline, etc.
Second purpose, as of 2016.1:
➡ implements LITMUSRT "system calls" as ioctl() operations
➡ improves portability and reduces maintenance overhead
Transparent use: liblitmus takes care of everything.
(Lack of) Processor Affinities
In Linux, each process has a processor-affinity mask (bit X set ➜ the process may execute on core X).
Most LITMUSRT plugins ignore affinity masks.
➡ In particular, all plugins in the mainline version do so: global is global; partitioned is partitioned…
Recent out-of-tree developments:
➡ support for hierarchical affinities [ECRTS'16]
Things That Are Not Supported
With limited resources, we cannot possibly support & test all Linux features.
Architectures other than x86 and ARM
➡ though it would not be difficult to add support if someone cares…
Running on top of a hypervisor
➡ It works (➞ hands-on session), but it's not "officially" supported.
➡ You can use LITMUSRT as a real-time hypervisor by encapsulating kvm in a reservation.
CPU hotplug
➡ not supported by the existing plugins
Processor frequency scaling
➡ Plugins "work," but are oblivious to speed changes.
Integration with PREEMPT_RT
➡ For historical reasons, the two patches are incompatible.
➡ Rebasing on top of PREEMPT_RT has been on the wish list for some time…
Enable practical multiprocessor real-time systems research under realistic conditions.
Linux Testbed for Multiprocessor Scheduling in Real-Time Systems
Connect theory and practice. Don’t reinvent the wheel.
Use LITMUSRT as a baseline.