Storage Systems
CSE 598d, Spring 2007
Lecture 11: Disk scheduling
Feb 27, 2007
(ACK: Several slides borrowed from Shiva Chaitanya)
Disk Access Time: Components
• CPU time to issue and process I/O
• contention for the controller
• contention for the bus
• contention for memory
• verifying block correctness with checksums (retransmissions)
• waiting in scheduling queues
• ...
Disk Scheduling
Seek time is a dominant factor of total disk I/O time
Let operating system or disk controller choose which request to serve next depending on the head’s current position and requested block’s position on disk
Disk scheduling is much more difficult than CPU scheduling
– a mechanical device – hard to determine (accurate) access times
– disk accesses cannot be preempted – runs until it finishes
– disk I/O often the main performance bottleneck
Scheduling at Multiple Locations!
S/W, H/W Components between an application and the disk:
- File system
- Device driver
- SCSI bus
- RAID controller (if employing RAID)
- Some bus
- Disk controller
Why?
- Why not do it only at the FS/DD level?
- Why not do it only within the disk?
Scheduling locations
Scheduling at Multiple Locations!
Why? Key ideas that disk scheduling employs:
– Request re-ordering for seek/positioning minimization
– Exploit temporal locality: anticipation for sequential streams; introduce non-work-conserving behavior!
– Exploit spatial locality: coalesce consecutively placed requests; free-block scheduling
Different optimizations are best done at different locations
Furthermore, the best location to do an optimization depends on the workload!
Goals
– Short response time
– High overall throughput
– Fairness (equal probability for all blocks to be accessed in the same time)
Tradeoff: Throughput vs. Fairness Socialism vs. Capitalism?
Disk Scheduling
Several traditional algorithms
– First-Come-First-Serve (FCFS)
– Shortest Seek Time First (SSTF)
  • Shortest Positioning Time First (SPTF)
– SCAN
– C-SCAN
– LOOK
– C-LOOK
– …
First–Come–First–Serve (FCFS)
FCFS serves the first arriving request first: simple and fair, but long seeks and long average response times under load
[Figure: FCFS head movement over time across cylinders 1–25; incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24]
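As a concrete illustration, the FCFS order and its total seek distance can be computed directly. The request list is the one from the figure; the starting head position (cylinder 15) is an assumed value, not given in the slides:

```python
def fcfs(head, requests):
    """Serve requests in arrival order; return (schedule, total seek distance).
    head: starting cylinder (an assumed value in the example below)."""
    schedule, total = list(requests), 0
    for cyl in requests:
        total += abs(cyl - head)  # seek distance to the next request
        head = cyl
    return schedule, total

# Requests from the figure; starting cylinder 15 is hypothetical.
print(fcfs(15, [12, 14, 2, 7, 21, 8, 24]))  # -> ([12, 14, 2, 7, 21, 8, 24], 65)
```

The schedule is the arrival order itself; only the total seek distance depends on the head position.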
Shortest Seek Time First (SSTF)
SSTF serves the closest request first: short average seek times, but long worst-case waiting times – may lead to starvation
[Figure: SSTF head movement over time across cylinders 1–25; incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24]
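SSTF can be sketched as a greedy selection over the pending queue; as in the FCFS example, the starting head position (cylinder 15) is an assumption:

```python
def sstf(head, requests):
    """Greedy shortest-seek-time-first: repeatedly serve the pending request
    closest to the current head position."""
    pending, schedule, total = list(requests), [], 0
    while pending:
        nxt = min(pending, key=lambda c: abs(c - head))  # closest cylinder
        pending.remove(nxt)
        total += abs(nxt - head)
        head = nxt
        schedule.append(nxt)
    return schedule, total

# Requests from the figure; starting cylinder 15 is hypothetical.
print(sstf(15, [12, 14, 2, 7, 21, 8, 24]))  # -> ([14, 12, 8, 7, 2, 21, 24], 35)
```

Note the much smaller total seek distance than FCFS on the same input, at the cost of serving cylinder 2 late.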
SCAN
SCAN moves the head edge to edge and serves requests on the way: bi-directional; a compromise between response-time and seek-time optimization
[Figure: SCAN head movement over time across cylinders 1–25; incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24]
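The order in which SCAN serves a static set of requests can be sketched as one upward sweep followed by the reverse sweep (the head itself continues to the edge before reversing; that does not change the service order). Starting cylinder 15 is again an assumption:

```python
def scan(head, requests, upward=True):
    """SCAN (elevator): sweep in one direction serving requests on the way,
    then reverse and serve the rest. Returns the service order."""
    up = sorted(c for c in requests if c >= head)                 # ahead of head
    down = sorted((c for c in requests if c < head), reverse=True)  # behind head
    return up + down if upward else down + up

print(scan(15, [12, 14, 2, 7, 21, 8, 24]))  # -> [21, 24, 14, 12, 8, 7, 2]
```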
C–SCAN
Circular SCAN moves the head from edge to edge but serves requests in one direction only – uni-directional; improves response time (fairness)
[Figure: C-SCAN head movement over time across cylinders 1–25; incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24]
LOOK and C–LOOK
LOOK (C-LOOK) is a variation of SCAN (C-SCAN): same schedule as SCAN, but does not run to the edges – stops and returns at the outermost and innermost request; increased efficiency
[Figure: LOOK/C-LOOK head movement over time across cylinders 1–25; incoming requests (in order of arrival): 12, 14, 2, 7, 21, 8, 24]
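For a static request set, the C-LOOK service order is one upward pass followed by a jump back to the lowest pending request. Starting cylinder 15 is an assumption:

```python
def c_look(head, requests):
    """C-LOOK: serve requests at or above the head in increasing order,
    then jump back to the lowest pending cylinder and continue upward."""
    up = sorted(c for c in requests if c >= head)      # served on the sweep
    wrapped = sorted(c for c in requests if c < head)  # served after the jump
    return up + wrapped

print(c_look(15, [12, 14, 2, 7, 21, 8, 24]))  # -> [21, 24, 2, 7, 8, 12, 14]
```

Because every request is served during an upward pass, waiting times are more uniform across cylinders than with SCAN.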
V–SCAN(R)
V-SCAN(R) combines SCAN (or LOOK) and SSTF
– define an R-sized unidirectional SCAN (i.e., C-SCAN) window and use SSTF outside the window
– Example: V-SCAN(0.6)
  • makes a C-SCAN (C-LOOK) window over 60% of the cylinders
  • uses SSTF for requests outside the window
– V-SCAN(0.0) is equivalent to SSTF
– V-SCAN(1.0) is equivalent to SCAN
– V-SCAN(0.2) is said to be an appropriate configuration
Shortest Positioning Time First (SPTF)
Given the complete knowledge of the actual mapping of data blocks onto the media, the scheduler can choose the request with the minimum positioning delay (combined seek and rotational latency)
SPTF, like SSTF, suffers from poor starvation resistance. To reduce response-time variance, priority can be given to requests that have been in the pending queue for excessive periods of time
Aged Shortest Positioning Time First (ASPTF)
ASPTF(W) adjusts each positioning delay (Tpos) by subtracting a weighted value corresponding to the amount of time the request has been waiting for service (Twait)
Teff = Tpos – (W*Twait)
For large values of W, ASPTF behaves like FCFS
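The aging rule above can be sketched as a selection function over the pending queue; the (t_pos, t_wait) values in the example are hypothetical:

```python
def asptf_pick(pending, W):
    """ASPTF(W): choose the request minimizing T_eff = T_pos - W * T_wait.
    pending: list of (t_pos, t_wait) pairs (hypothetical units, e.g. ms)."""
    return min(pending, key=lambda r: r[0] - W * r[1])

# (positioning delay, time already spent waiting) -- hypothetical values
requests = [(5.0, 0.0), (8.0, 4.0)]
print(asptf_pick(requests, 0.0))  # W = 0: plain SPTF -> (5.0, 0.0)
print(asptf_pick(requests, 1.5))  # aging favors the long-waiting request -> (8.0, 4.0)
```

With W = 0 the rule degenerates to SPTF; as W grows, waiting time dominates the positioning delay and the pick order approaches arrival order (FCFS), as the slide states.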
Scheduling in Modern Disk Drives
Features of current disk drives that affect traditional scheduling algorithms:
– Host interface
– Data layout
– On-board cache
Ref: B. L. Worthington, G. R. Ganger, Y. N. Patt: "Scheduling Algorithms for Modern Disk Drives", ACM SIGMETRICS 1994
Host interface
Controller presents a request to the disk drive in terms of the starting logical block number and request size
Subsequent media access hidden from the host
Scheduling entities outside of the drive have little knowledge of overhead delays
Data Layout
Many systems assume sequentiality of LBN-to-PBN mappings in seek-reducing algorithms
Aggressive algorithms require highly accurate knowledge of the data layout which is typically hidden
Complexity of mappings increased by zoned recording, track/cylinder skew and defect management
On-Board Cache
Memory within disk drives has progressed from small speed-matching buffers to megabytes of cache memory
Disk logic typically prefetches data into the cache to satisfy sequential read requests. This affects scheduling in two ways:
– Position of the head cannot be determined easily
– Requests that can be satisfied by the cache could be given higher priority
Scheduling by Logical Block Number
• As expected, FCFS quickly saturates as workload increases
• SSTF provides lower mean response time
Scheduling by Logical Block Number
FCFS has the lowest coefficient of variation of response time for lighter workloads
As FCFS begins to saturate and its response time variance increases, C-LOOK emerges as a better algorithm for response time variance
Scheduling with Full knowledge
As W increases, the average response time slowly grows, though variance drops
Scheduling with Full Knowledge
Modern Disk Scheduling
In modern drives, C-LOOK best exploits the prefetching cache for workloads with significant read sequentiality
SSTF and LOOK perform better for random workloads
Powerful disk controllers use variants of Shortest Positioning Time First (SPTF).
Freeblock Scheduling
An approach to utilizing more of a disk’s potential media bandwidth
Fill rotational latency periods with useful media transfers for background applications
It has been observed that 20-50% of a never-idle disk’s bandwidth can often be provided to background applications without affecting foreground response times
Ref: Christopher R. Lumb, Jiri Schindler, Greg Ganger : “Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives”, OSDI , 2000
Disk-intensive background tasks
Disk Reorganization File system cleaning Backup Prefetching Write-back Integrity Checking RAID scrubbing Virus detection Index Reorganization …
Free Bandwidth
Time required for a disk media access
Taccess = Tseek + Trotate + Ttransfer
Freeblock scheduling uses the Trotate component of disk access to transfer additional data
Instead of just waiting for desired sector to arrive, this technique transfers the intermediate sectors
Steps in Freeblock Scheduling
Predict how much rotational latency will occur before the next foreground media transfer
– Requires detailed knowledge of disk attributes, including layout algorithms and time-dependent mechanical positioning overheads
Squeeze additional media transfers into that time
Get to the destination track in time for the foreground transfer
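The budget implied by these steps can be sketched as simple arithmetic: the rotational-latency gap, minus the time reserved to reach the destination track, bounds how many extra sectors can be squeezed in. All timing values below are hypothetical:

```python
def free_sectors(rot_latency_us, sector_time_us, reposition_us):
    """Sketch of the freeblock budget: how many extra sectors fit into a
    rotational-latency gap, reserving time to reach the destination track
    for the foreground transfer. All timing values are hypothetical (us)."""
    usable = rot_latency_us - reposition_us  # time left for free transfers
    return max(0, usable // sector_time_us)

# Hypothetical numbers: 4 ms rotational gap, 50 us per sector,
# 1 ms to reposition for the foreground request.
print(free_sectors(4000, 50, 1000))  # -> 60
```

If the gap is smaller than the repositioning time, no free transfer is attempted and the disk simply waits, as in normal scheduling.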
Anticipatory Disk Scheduling
Reorder available disk requests for performance by seek optimization, proportional resource allocation, etc.
Any policy needs multiple outstanding requests to make good decisions!
Ref: Sitaram Iyer, Peter Druschel : “Anticipatory scheduling : A disk scheduling framework to overcome deceptive idleness in synchronous I/O”, SOSP 2001
With enough requests…
[Figure: disk-head location over time for requests issued by processes A and B, with seeks between them; e.g., throughput = 21 MB/s (IBM Deskstar disk)]
With synchronous I/O…
[Figure: with synchronous I/O, each scheduling decision is forced before the issuing process can submit its next request ("too late!"), causing long seeks between A and B; e.g., throughput = 5 MB/s]
Deceptive idleness
Process A is about to issue its next request, but the scheduler hastily assumes that process A has no further requests!
Proportional scheduler
Allocate disk service in say 1:2 ratio:
Deceptive idleness causes 1:1 allocation:
[Figure: the intended schedule gives A:B = 1:2, but deceptive idleness yields the alternating schedule B, A, B, A – a 1:1 allocation]
Anticipatory scheduling
Key idea: Sometimes wait for process whose request was last serviced.
Keeps disk idle for short intervals.
But with informed decisions, this: Improves throughput Achieves desired proportions
Cost-benefit analysis
Balance expected benefits of waiting against cost of keeping disk idle.
Tradeoffs sensitive to scheduling policy, e.g.,
1. seek-optimizing scheduler
2. proportional scheduler
Statistics
For each process, measure:
1. Expected median and 95th-percentile think time
2. Expected positioning time
[Figure: per-process histogram of think times (number of requests vs. think time), marking the median and 95th percentile; timeline from the last serviced request to the expected next request]
Cost-benefit analysis for seek-optimizing scheduler
best := best available request chosen by scheduler
next := expected forthcoming request from the process whose request was last serviced

Benefit = best.positioning_time - next.positioning_time
Cost = next.median_thinktime
Waiting_duration = (Benefit > Cost) ? next.95percentile_thinktime : 0
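The decision rule above translates directly into code; the numeric values in the example are hypothetical milliseconds:

```python
def waiting_duration(best_pos_time, next_pos_time,
                     next_median_think, next_p95_think):
    """Anticipation decision for a seek-optimizing scheduler (per the slide):
    wait for the last-serviced process only if the expected positioning-time
    saving (benefit) exceeds its median think time (cost)."""
    benefit = best_pos_time - next_pos_time
    cost = next_median_think
    return next_p95_think if benefit > cost else 0

# Hypothetical values (ms): big saving -> wait up to the 95th-percentile think time
print(waiting_duration(10.0, 2.0, 3.0, 6.0))  # -> 6.0
# Small saving -> do not wait
print(waiting_duration(4.0, 2.0, 3.0, 6.0))   # -> 0
```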
Proportional scheduler
Costs and benefits are different.
e.g., proportional scheduler:
Wait for process whose request was last serviced,
1. if it has received less than its allocation, and
2. if it has think time below a threshold (e.g., 3 ms)
Waiting_duration = next.95percentile_thinktime
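The proportional-scheduler rule is a conjunction of the two conditions above; a minimal sketch, with the 3 ms threshold taken from the slide and the argument names being hypothetical:

```python
def should_wait(service_received, allocation, think_time_ms, threshold_ms=3.0):
    """Proportional-scheduler anticipation rule from the slide: wait for the
    last-serviced process iff it is still below its allocation AND its think
    time is below the threshold (default 3 ms, per the slide)."""
    return service_received < allocation and think_time_ms < threshold_ms

print(should_wait(1, 2, 1.5))  # below allocation, thinks quickly -> True
print(should_wait(2, 2, 1.5))  # allocation already met -> False
```

When the rule fires, the wait lasts up to the process's 95th-percentile think time, as stated above.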
Prefetch
Overlaps computation with I/O.
Side-effect: avoids deceptive idleness!
Application-driven vs. kernel-driven
Conclusion
Anticipatory scheduling:
– overcomes deceptive idleness
– achieves significant performance improvement on real applications
– achieves desired proportions
– and is easy to implement!
Fairness: Evaluating disk scheduling algorithms
Storage system designers prefer to keep the queue length at disks small regardless of the load
When queuing threshold is reached at the disk, the controller or the device driver queues the requests until disk queue is processed
Low queuing threshold minimizes request starvation at the disk level when unfair scheduling algorithms are deployed
Ref: Alma Riska, Erik Riedel : “It’s not fair – evaluating efficient disk scheduling” , MASCOTS 2003
Results
Queuing more requests at the disk provides the scheduling algorithms more information, used for better disk-resource utilization
Percentage of requests starved remains small even if longer queues build up at the disk
Overall request starvation is independent of the queuing threshold at the disk
Storage subsystem architecture
Queues at various levels
Outstanding requests queued at disk and at device driver in a single disk system
And, at the disks and the controller(s) in a multiple disk system
Impact of queuing thresholds
Average load of 16 outstanding requests in system
Average load of 64 outstanding requests in system
Response time distribution
The higher the load, the larger the gap between the performance of different scheduling algorithms
Fair and simple FCFS yields the longest average request response time
Best performance obtained when increasing the queue threshold under SPTF
How about request starvation and variability in the request response time ?
Response time distribution
Tail of response time distribution with average load of 16 outstanding requests and threshold of 8
Tail of response time distribution with average load of 16 outstanding requests and threshold of 16
Observations
– The majority of requests under FCFS exhibit long response times, while seek-reducing algorithms result in a majority of short response times
– More than 90% of requests under SPTF have shorter response times than under FCFS, and only 1% exhibit up to double the FCFS response time
Amount of starvation in position-based scheduling algorithms for both queuing thresholds is the same relative to FCFS
Hence, queuing more requests improves disk performance without introducing more request starvation
Scheduling at Device driver level
– Depends on workload and filesystem layout; e.g., with SCAN, seek times to sectors in the middle of the disk are shorter
– OS could choose between algorithms based on the current queue
  • Likely to be expensive in CPU cycles
  • Queue changes as new requests arrive
– SSTF or SCAN are reasonable defaults
– Allow algorithm selection as part of OS tuning
  • FreeBSD: C-SCAN
  • Linux 2.2: SCAN
  • Linux 2.6: four different versions of the elevator algorithm
Discussion: Scheduling at Multiple Locations
– Positioning-based optimizations best done within the disk
– Seek-based optimizations best done at the device driver
– Why do scheduling within the FS?
  • Device- and device-driver-independent
  • Aware of the buffer cache
  • Application isolation
– Disk queue length crucial
  • Short queue results in degraded throughput (locally good but globally bad schedules)
  • Long queue results in unfairness
– Non-work conservation can improve fairness and throughput! (anticipatory scheduling)
– Achieving proportional fairness is non-trivial; solutions based on a hierarchy of queues and anticipatory scheduling can help
– Request coalescing can result in a great improvement in throughput; the FS and device driver are good places; improves the sequentiality of the request stream seen by the disk
– Free-block scheduling can improve throughput; can be viewed as a "corrector" for the non-work-conserving nature of the disk
Additional slides on free-block scheduling
Illustration of two freeblock scheduling possibilities
Desired Characteristics of Tasks
– Low priority: freeblock requests will only be served opportunistically; not appropriate for a set of equally important requests
– Large sets of desired blocks: the larger the set of disk locations desired, the higher the probability of finding a free-bandwidth opportunity
Desired Characteristics …
– No particular order of access: ordering requirements restrict the set of requests the scheduler can consider; the effectiveness of freeblock scheduling is directly related to the number of outstanding requests
– Small working-memory footprints: needing to buffer multiple blocks before processing creates artificial ordering requirements due to memory limitations
A Simple Interface
No call into the freeblock scheduler subsystem waits for a disk access. Calls return immediately
Freeblock read requests do not specify memory locations for read data. Completion callbacks provide pointers to buffers owned by freeblock scheduling subsystem
Applications
– Scanning applications: tasks that scan a large portion of disk contents, like report generation, RAID scrubbing, virus detection, tamper detection, and backup
– Internal storage optimization: reorganizing stored data to improve performance, e.g., placing related data contiguously, placing hot data near the center of the disk, replicating data for subsequent reads, etc.
– Prefetching and prewriting: prewriting is the early writing out of dirty blocks under the assumption that they will not be overwritten or deleted before writeback is necessary
Availability of free bandwidth
Availability of potential free bandwidth = Total bandwidth * Fraction of time spent on rotational latency
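The formula is a one-liner; a minimal sketch, where the one-third rotational fraction matches the figure reported later in the slides and the 30 MB/s total bandwidth is a hypothetical example value:

```python
def potential_free_bandwidth(total_bw, rotational_fraction):
    """Formula from the slide: potential free bandwidth equals total
    bandwidth times the fraction of head time spent on rotational latency."""
    return total_bw * rotational_fraction

# About one third of head usage is rotational latency (per the later slide);
# 30 MB/s total media bandwidth is a hypothetical figure.
print(potential_free_bandwidth(30.0, 1 / 3))  # roughly 10 MB/s
```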
Results in the next few slides obtained using DiskSim
Default disk drive used is Quantum Atlas 10k
Default workload consists of 10,000 foreground requests issued one at a time with uniform distribution of starting locations
Impact of disk characteristics
Overall, about one third of each disk’s head usage is on rotational latency
Characteristics of simulated disk drives
Impact of workload characteristics
Impact of scheduling algorithm
• C-LOOK and SSTF reduce seek times without affecting transfer times and rotational latencies
• SPTF tends to decrease both overhead components. Figure shows rotational latency decreases to 22%
Feasibility
Freeblock scheduling relies heavily on the ability to accurately predict positioning delays
Firmware of most disk drives now supports SPTF which requires similar predictions
Freeblock scheduling resembles advanced disk schedulers for environments with a mixed workload of real-time and non-real-time activities
Additional slides on anticipatory scheduling - experimental evaluation
Experiments
• FreeBSD-4.3 patch + kernel module (1500 lines of C code)
• 7200 rpm IDE disk (IBM Deskstar)
• Also in the paper: 15000 rpm SCSI disk (Seagate Cheetah)
Microbenchmark
[Figure: throughput (MB/s, 0–25) of Original vs. Anticipatory for sequential, alternate, and random-within-file access patterns, each with prefetching off and on]
Real workloads
What's the impact on real applications and benchmarks?
– Andrew benchmark
– Apache web server (large working set)
– Database benchmark
• Disk-intensive
• Prefetching enabled
Andrew filesystem benchmark
Overall 8% performance improvement
[Figure: execution time (minutes, 0–30) of Andrew benchmark phases for Original vs. Anticipatory; per-phase change: mkdir -16%, cp -5%, stat -5%, scan -54%, gcc +1.7%]
62 (or more) concurrent clients
Apache web server
[Figure: Apache throughput (MB/s, 0–4) for Original vs. Anticipatory with prefetching disabled; read +29%, mmap +71%]
• CS.Berkeley trace
• Large working set
• 48 web clients