INF5070 – Media Servers and Distribution Systems
Storage Systems – Part I
3/10 - 2005
2005 Carsten Griwodz & Pål Halvorsen
Overview
- Disks: mechanics and properties
- Disk scheduling: traditional, real-time, stream oriented, mixed media
- Data placement
- Complicating factors
Disks
Disks
Disks are orders of magnitude slower than main memory, but they are cheaper and have far more capacity. Disks are used to make a system persistent and to manage huge amounts of information.

Because...
- ...there is a large speed mismatch compared to main memory (and this gap keeps growing),
- ...disk I/O is often the main performance bottleneck,
- ...we need to minimize the number of accesses,
...we must look closer at how to manage disks.
Disks
Two resources of importance:
- storage space
- I/O bandwidth

Several approaches to managing multimedia data on disks:
- specific disk scheduling and large buffers
- optimized data placement for continuous media
- replication / striping
- combinations of the above
Mechanics of Disks
- Platters: circular platters covered with magnetic material to provide nonvolatile storage of bits
- Tracks: concentric circles on a single platter
- Sectors: segments of the track circle – usually each contains 512 bytes – separated by non-magnetic gaps. The gaps are often used to identify the beginning of a sector
- Cylinders: corresponding tracks on the different platters are said to form a cylinder
- Spindle: the axis around which the platters rotate
- Disk heads: read or alter the magnetism (bits) passing under them. The heads are attached to an arm enabling them to move across the platter surface
Disk Specifications
Some existing (Seagate) disks today:

                                 Cheetah X15.3   Cheetah 36   Barracuda 180
Capacity (GB)                    73.4            36.4         181.6
Spindle speed (RPM)              15,000          10,000       7,200
#cylinders                       18,479          9,772        24,247
average seek time (ms)           3.6             5.7          7.4
min (track-to-track) seek (ms)   0.2             0.6          0.8
max (full stroke) seek (ms)      7               12           16
average latency (ms)             2               3            4.17
internal transfer rate (Mbps)    609 – 891       520 – 682    282 – 508
disk buffer cache                8 MB            4 MB         16 MB

Note 1: disk manufacturers usually denote GB as 10^9, whereas computer quantities often are powers of 2, i.e., GB is 2^30
Note 2: there is a difference between internal and formatted transfer rate. Internal is only between platter and head. Formatted is after the signals interfere with the electronics (cabling loss, interference, retransmissions, checksums, etc.)
Note 3: there is usually a trade-off between speed and capacity
Disk Capacity
The size of the disk depends on:
- the number of platters
- whether the platters use one or both sides
- the number of tracks per surface
- the (average) number of sectors per track
- the number of bytes per sector

Example (Cheetah X15):
- 4 platters using both sides: 8 surfaces
- 18497 tracks per surface
- 617 sectors per track (average)
- 512 bytes per sector
Total capacity = 8 x 18497 x 617 x 512 ≈ 4.6 x 10^10 bytes ≈ 42.8 GB
Formatted capacity = 36.7 GB

Note 1: the tracks on the edge of the platter are larger than the tracks close to the spindle. Today, most disks are zoned, i.e., the outer tracks have more sectors than the inner tracks
Note 2: there is a difference between formatted and total capacity. Some of the capacity is used for storing checksums, spare tracks, gaps, etc.
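The capacity arithmetic above can be checked with a couple of lines (a minimal sketch; the function name and the 2^30 convention are mine, the numbers are the slide's):

```python
def disk_capacity_bytes(surfaces, tracks_per_surface, sectors_per_track,
                        bytes_per_sector=512):
    """Total (unformatted) capacity: surfaces x tracks x sectors x sector size."""
    return surfaces * tracks_per_surface * sectors_per_track * bytes_per_sector

total = disk_capacity_bytes(8, 18497, 617)
print(total)          # total bytes before formatting
print(total / 2**30)  # "GB" as powers of two, cf. Note 1 on the specification slide
```

The exact product is about 4.67 x 10^10 bytes (≈ 43.5 GB in 2^30 units); the slide rounds to 4.6 x 10^10 before dividing, giving 42.8 GB. Formatted capacity (36.7 GB) is lower still, cf. Note 2.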
Disk Access Time
How do we retrieve data from disk?
- position the head over the cylinder (track) on which the block (consisting of one or more sectors) is located
- read or write the data block as the sectors move under the head while the platters rotate

The time between the moment a disk request is issued and the time the block is resident in memory is called disk latency or disk access time.
Disk Access Time

Disk access time = seek time + rotational delay + transfer time + other delays

[Figure: a request "I want block X" travels to the disk (platter, arm, head) and block X ends up in memory]
Disk Access Time: Seek Time
Seek time is the time to position the head:
- the heads require a minimum amount of time to start and stop moving
- some time is used for actually moving the head – roughly proportional to the number of cylinders traveled

Time to move the head: seek(n) = α + β·n, where α is the fixed overhead, β the seek time constant, and n the number of tracks (cylinders) traveled

[Figure: seek time vs. cylinders traveled (1 to N); a seek across all N cylinders takes only ~10x – 20x the time x of a single-cylinder seek]

"Typical" average: 10 ms → 40 ms; 7.4 ms (Barracuda 180); 5.7 ms (Cheetah 36); 3.6 ms (Cheetah X15)
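The linear model above can be written down directly (a sketch; the values of α and β below are illustrative placeholders, not measured constants for any of the listed disks):

```python
def seek_time_ms(cylinders, alpha=1.0, beta=0.0002):
    """Linear seek model seek(n) = alpha + beta * n: a fixed start/stop
    overhead alpha plus beta ms per cylinder traveled.
    Zero cost if the head does not move at all."""
    return 0.0 if cylinders == 0 else alpha + beta * cylinders
```

Because of α, one long seek is much cheaper than many short ones covering the same distance – the observation that disk scheduling exploits later in this lecture.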
Disk Access Time: Rotational Delay
Time for the disk platters to rotate so that the first of the required sectors is under the disk head.

[Figure: the head sits over one sector of the rotating platter, while the wanted block is elsewhere on the track]

Average delay is 1/2 revolution. "Typical" averages:
- 8.33 ms (3,600 RPM)
- 5.56 ms (5,400 RPM)
- 4.17 ms (7,200 RPM)
- 3.00 ms (10,000 RPM)
- 2.00 ms (15,000 RPM)
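The listed averages follow directly from the half-revolution rule (a small sketch; the function name is mine):

```python
def avg_rotational_delay_ms(rpm):
    """Average rotational delay is half a revolution;
    one revolution takes 60000/rpm milliseconds."""
    return 0.5 * 60_000 / rpm

for rpm in (3600, 5400, 7200, 10_000, 15_000):
    print(rpm, round(avg_rotational_delay_ms(rpm), 2))
```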
Disk Access Time: Transfer Time
Time for data to be read by the disk head, i.e., the time it takes the sectors of the requested block to rotate under the head:

Transfer rate = amount of data per track / time per rotation
Transfer time = amount of data to read / transfer rate

Example – Barracuda 180: 406 KB per track x 7,200 RPM ≈ 47.58 MB/s
Example – Cheetah X15: 316 KB per track x 15,000 RPM ≈ 77.15 MB/s

Transfer time depends on data density and rotation speed. If we have to change track, time must also be added for moving the head.

Note: one might achieve these transfer rates reading continuously on disk, but time must be added for seeks, etc.
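The two examples can be reproduced from the formula (a sketch; it assumes, as the slide's numbers do, MB = 1024 KB):

```python
def transfer_rate_mb_s(kb_per_track, rpm):
    """Transfer rate = data per track / time per rotation, in MB/s."""
    rotations_per_s = rpm / 60
    return kb_per_track * rotations_per_s / 1024

print(round(transfer_rate_mb_s(406, 7200), 2))    # Barracuda 180
print(round(transfer_rate_mb_s(316, 15_000), 2))  # Cheetah X15
```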
Disk Access Time: Other Delays
There are several other factors which might introduce additional delays:
- CPU time to issue and process I/O
- contention for the controller
- contention for the bus
- contention for memory
- verifying block correctness with checksums (retransmissions)
- waiting in the scheduling queue
- ...

Typical values: "0" (except, perhaps, waiting in the queue)
Disk Throughput
How much data can we retrieve per second?

Throughput = data size / transfer time (including all delays)

Example: for each operation we pay an average seek, an average rotational delay, and the transfer time (no gaps, etc.):
- Cheetah X15 (max 77.15 MB/s): 4 KB blocks → 0.71 MB/s; 64 KB blocks → 11.42 MB/s
- Barracuda 180 (max 47.58 MB/s): 4 KB blocks → 0.35 MB/s; 64 KB blocks → 5.53 MB/s
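The effective throughput above can be estimated with a small per-operation cost model (a sketch, with all parameters passed in explicitly):

```python
def throughput_mb_s(block_kb, avg_seek_ms, avg_rot_ms, rate_mb_s):
    """Effective throughput when every request pays a full average seek and
    rotational delay: data size divided by total time per operation."""
    transfer_ms = block_kb / 1024 / rate_mb_s * 1000
    total_ms = avg_seek_ms + avg_rot_ms + transfer_ms
    return (block_kb / 1024) / (total_ms / 1000)

# Cheetah X15: 3.6 ms avg seek, 2.0 ms avg rotational delay, 77.15 MB/s
print(round(throughput_mb_s(4, 3.6, 2.0, 77.15), 2))
print(round(throughput_mb_s(64, 3.6, 2.0, 77.15), 2))
```

This lands slightly below the slide's rounded 0.71 and 11.42 MB/s (which essentially ignore the small transfer term), but the point stands either way: with small blocks the positioning delays dominate, and only a tiny fraction of the 77.15 MB/s maximum is achieved.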
Block Size
The block size may have large effects on performance.
Example: assume random block placement on disk and sequential file access
- doubling the block size will halve the number of disk accesses
- each access takes somewhat longer to transfer the data, but the total transfer time is the same (i.e., more data per request)
- half the seek times and half the rotational delays are omitted
- e.g., when increasing the block size from 2 KB to 4 KB (no gaps, ...), for a Cheetah X15 typically an average of:
  ☺ 3.6 ms is saved in seek time
  ☺ 2 ms is saved in rotational delays
  while 0.026 ms is added in transfer time
  } saving a total of 5.6 ms when reading 4 KB (49.8 %)
- increasing from 2 KB to 64 KB saves ~96.4 % when reading 64 KB
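The 49.8 % and 96.4 % figures can be reproduced with the per-operation cost model from the throughput slide (a sketch; 0.013 ms/KB approximates the Cheetah X15's 77.15 MB/s transfer rate):

```python
def read_cost_ms(total_kb, block_kb, seek_ms=3.6, rot_ms=2.0, per_kb_ms=0.013):
    """Cost of reading total_kb as a series of block_kb requests, each paying
    an average seek and rotational delay plus its transfer time."""
    requests = total_kb / block_kb
    return requests * (seek_ms + rot_ms + block_kb * per_kb_ms)

saving_4k = 1 - read_cost_ms(4, 4) / read_cost_ms(4, 2)      # reading 4 KB
saving_64k = 1 - read_cost_ms(64, 64) / read_cost_ms(64, 2)  # reading 64 KB
print(round(saving_4k * 100, 1), round(saving_64k * 100, 1))
```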
[Figure: efficiency in % of max. throughput (0 – 100 %) vs. amount of data read per operation (0 – 500 KB)]

Block Size
Thus, increasing the block size can increase performance by reducing seek times and rotational delays (the figure shows a calculation for some older device).
But, blocks spanning several tracks still introduce latencies…
… and a large block size is not always best
small data elements may occupy only a fraction of the block (fragmentation)
Which block size to use therefore depends on data size and data reference patterns
The trend, however, is to use large block sizes as new technologies appear with increased performance – at least in high data rate systems
Writing and Modifying Blocks
A write operation is analogous to a read operation:
- must add time for block allocation
- a write operation may have to be verified – must wait another rotation and then read the block back to check that it is the block we wanted to write
- Total write time ≈ read time (+ time for one rotation)

Cannot modify a block directly:
- read the block into main memory
- modify the block
- write the new content back to disk
- (verify the write operation)
- Total modify time ≈ read time + time to modify + write time
Disk Controllers
To manage the different parts of the disk, we use a disk controller, which is a small processor capable of:
- controlling the actuator moving the head to the desired track
- selecting which platter and surface to use
- knowing when the right sector is under the head
- transferring data between main memory and disk

New controllers act like small computers themselves:
- both disk and controller now have their own buffers, reducing disk access time
- data on damaged disk blocks/sectors is simply moved to spare areas on the disk – the system above (OS) does not know this, i.e., a block may lie elsewhere than the OS thinks
Efficient Secondary Storage Usage
Must take into account the use of secondary storage:
- there are large access time gaps, i.e., a disk access will probably dominate the total execution time
- there may be huge performance improvements if we reduce the number of disk accesses
- a "slow" algorithm with few disk accesses will probably outperform a "fast" algorithm with many disk accesses

Several ways to optimize:
- block size – 4 KB
- file management / data placement – various
- disk scheduling – SCAN derivative
- multiple disks – a specific RAID level
- prefetching – read-ahead prefetching
- memory caching / replacement algorithms – LRU variant
- ...
Disk Scheduling
Disk Scheduling – I
Seek time is the dominant factor of total disk I/O time, so let the operating system or disk controller choose which request to serve next, depending on the head's current position and the requested block's position on disk (disk scheduling).

Note that disk scheduling ≠ CPU scheduling:
- a mechanical device – hard to determine (accurate) access times
- disk accesses cannot be preempted – a request runs until it finishes
- disk I/O is often the main performance bottleneck

General goals:
- short response time
- high overall throughput
- fairness (equal probability for all blocks to be accessed in the same time)

Tradeoff: seek and rotational delay vs. maximum response time
Disk Scheduling – II
Several traditional (performance oriented) algorithms:
- First-Come-First-Serve (FCFS)
- Shortest Seek Time First (SSTF)
- SCAN (and variations)
- LOOK (and variations)
- ...
First-Come-First-Serve (FCFS)
FCFS serves the first arriving request first:
- long seeks
- "short" average response time

[Figure: head position over time (cylinder number 1 – 25); the head starts at cylinder 12 and the incoming requests (in order of arrival) 14, 2, 7, 21, 8, 24 are served in arrival order]

Note: the lines only indicate some time – not the exact amount
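The head movement under FCFS is easy to tally (a sketch; the exact track count depends on assumptions about the starting position and the disk edges):

```python
def fcfs_head_movement(start, requests):
    """Total tracks traveled when serving requests strictly in arrival order."""
    pos, moved = start, 0
    for r in requests:
        moved += abs(pos - r)
        pos = r
    return moved

print(fcfs_head_movement(12, [14, 2, 7, 21, 8, 24]))
```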
Shortest Seek Time First (SSTF)
SSTF serves the closest request first:
- short seek times
- longer maximum seek times – may even lead to starvation

[Figure: head starts at cylinder 12; incoming requests (in order of arrival): 14, 2, 7, 21, 8, 24 are served closest-first]
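A greedy SSTF pass over the example queue looks like this (a simplification: all requests are assumed queued up front, whereas a real scheduler reorders as new requests arrive):

```python
def sstf_order(start, requests):
    """Repeatedly pick the pending request closest to the current head position."""
    pending, pos, order = list(requests), start, []
    while pending:
        nxt = min(pending, key=lambda r: abs(r - pos))
        pending.remove(nxt)
        order.append(nxt)
        pos = nxt
    return order

print(sstf_order(12, [14, 2, 7, 21, 8, 24]))
```

Note how requests 21 and 24 are served last: a steady stream of arrivals near the head could postpone them indefinitely – the starvation problem mentioned above.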
SCAN
SCAN (elevator) moves the head from edge to edge and serves requests on the way:
- bi-directional
- a compromise between response time and seek time optimization

[Figure: head starts at cylinder 12; incoming requests (in order of arrival): 14, 2, 7, 21, 8, 24 are served in sweep order]
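The resulting service order can be sketched as follows (service order only; a true SCAN also travels on to the disk edge before reversing, which this sketch does not model):

```python
def scan_order(start, requests, upwards=True):
    """Serve everything in the current sweep direction, then reverse."""
    above = sorted(r for r in requests if r >= start)
    below = sorted((r for r in requests if r < start), reverse=True)
    return above + below if upwards else below + above

print(scan_order(12, [14, 2, 7, 21, 8, 24]))
```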
SCAN vs. FCFS
Disk scheduling makes a difference! In this case, we see that SCAN requires much less head movement compared to FCFS (here 37 vs. 75 tracks).

[Figure: FCFS and SCAN head movement over time for start cylinder 12 and incoming requests (in order of arrival) 14, 2, 7, 21, 8, 24]
C-SCAN
Circular-SCAN moves the head from edge to edge:
- serves requests in one direction only – uni-directional
- improves response time (fairness)

[Figure: head starts at cylinder 12; incoming requests (in order of arrival): 14, 2, 7, 21, 8, 24 are served on upward sweeps only]
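The uni-directional order differs from SCAN only in how the lower requests are reached (service order only; the long return sweep itself is not modeled here):

```python
def cscan_order(start, requests):
    """Serve requests only on the upward sweep; then wrap around and start
    again from the lowest pending cylinder instead of sweeping back down."""
    above = sorted(r for r in requests if r >= start)
    below = sorted(r for r in requests if r < start)
    return above + below

print(cscan_order(12, [14, 2, 7, 21, 8, 24]))
```

Compared with SCAN's order (14, 21, 24, 8, 7, 2), the low cylinders are now served low-to-high, so no region of the disk systematically waits through two passes.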
SCAN vs. C-SCAN
Why is C-SCAN on average better in reality than SCAN, when both service the same number of requests in two passes?
- modern disks must accelerate (speed up and slow down) when seeking
- head movement formula: seek(n) = α + β·n, with fixed overhead α, seek time constant β, and n the number of tracks traveled

SCAN (bi-directional): n requests, average distance 2x, total distance n · 2x = 2nx
C-SCAN (uni-directional): n requests, average distance x, plus one long return sweep of n·x: total distance n·x + n·x = 2nx

With a linear model the two travel the same distance, but with acceleration a seek of length d costs roughly proportional to √d, so:
SCAN: n·√(2x) = √2 · n·√x
C-SCAN: n·√x + √(n·x)
and √2·n·√x > n·√x + √(n·x) if n is large
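The inequality above can be tried out numerically (a sketch under the stated square-root seek-cost assumption, with c an arbitrary proportionality constant):

```python
import math

def scan_seek_cost(n, x, c=1.0):
    """n seeks of average distance 2x, with seek cost ~ c * sqrt(distance)."""
    return n * c * math.sqrt(2 * x)

def cscan_seek_cost(n, x, c=1.0):
    """n seeks of average distance x plus one long return sweep over n*x."""
    return n * c * math.sqrt(x) + c * math.sqrt(n * x)

# The comparison only tips in C-SCAN's favour once n is large:
print(scan_seek_cost(2, 1) > cscan_seek_cost(2, 1))
print(scan_seek_cost(100, 1) > cscan_seek_cost(100, 1))
```

For very few requests the extra return sweep dominates and SCAN is cheaper; beyond a handful of requests per pass, C-SCAN's many short seeks win.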
LOOK and C-LOOK
LOOK (C-LOOK) is a variation of SCAN (C-SCAN):
- same schedule as SCAN, but does not run to the edges
- stops and returns at the outermost and innermost requests
- increased efficiency

[Figure – SCAN vs. LOOK example: head at cylinder 12; incoming requests (in order of arrival): 14, 2, 7, 21, 8, 24; LOOK turns around at requests 24 and 2 instead of the disk edges]
V-SCAN(R)
V-SCAN(R) combines SCAN (or LOOK) and SSTF:
- define an R-sized unidirectional SCAN window, i.e., C-SCAN, and use SSTF outside the window
- example: V-SCAN(0.6) makes a C-SCAN window over 60 % of the cylinders and uses SSTF for requests outside the window
- V-SCAN(0.0) is equivalent to SSTF; V-SCAN(1.0) is equivalent to C-SCAN
- V-SCAN(0.2) is said to be an appropriate configuration
What About Continuous Media?
Suitability of classical algorithms:
- minimal disk arm movement (short seek times)
- no provision for time or deadlines
- generally not suitable

Continuous media server requirements:
- serve both periodic and aperiodic requests
- never miss a deadline due to aperiodic requests
- aperiodic requests must not starve
- support multiple streams
- balance the buffer space vs. efficiency tradeoff
Real–Time Disk Scheduling
Traditional algorithms have no provision for time or deadlines; real-time algorithms are targeted at real-time applications with deadlines.

Several proposed algorithms:
- earliest deadline first (EDF)
- SCAN-EDF
- shortest seek and earliest deadline by ordering/value (SSEDO / SSEDV)
- priority SCAN (PSCAN)
- ...
Earliest Deadline First (EDF)
EDF serves the request with the nearest deadline first:
- non-preemptive (i.e., an arriving request with a shorter deadline must wait)
- excessive seeks → poor throughput

[Figure: head starts at cylinder 12 (deadline 5); incoming requests (<block, deadline>, in order of arrival): <14,6> <2,4> <7,7> <21,1> <8,2> <24,3> are served in deadline order]
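With all requests queued, EDF reduces to a sort on deadlines (a sketch using the slide's example queue; note how the service order jumps back and forth across the disk, which is exactly the excessive-seek problem):

```python
def edf_order(requests):
    """requests: (block, deadline) pairs; serve the nearest deadline first."""
    return [block for block, deadline in sorted(requests, key=lambda r: r[1])]

queue = [(12, 5), (14, 6), (2, 4), (7, 7), (21, 1), (8, 2), (24, 3)]
print(edf_order(queue))
```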
SCAN-EDF
SCAN-EDF combines SCAN and EDF:
- the real-time aspects of EDF
- the seek optimizations of SCAN
- especially useful if the end of the period of a batch is the deadline
- increases efficiency by modifying the deadlines

Algorithm:
- serve requests with earlier deadlines first (EDF)
- sort requests with the same deadline by track location (SCAN)

[Figure: head starts at cylinder 2 (deadline 3); incoming requests (<block, deadline>, in order of arrival): <14,1> <9,3> <7,2> <21,1> <8,2> <24,2> <16,1>]

Note: similarly, we can combine EDF with C-SCAN, LOOK or C-LOOK
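The two-step rule above amounts to sorting on the pair (deadline, track) (a simplification: each same-deadline group is swept in one direction from track 0, whereas a full implementation would track the head position and sweep direction across groups):

```python
def scan_edf_order(requests):
    """Sort primarily by deadline (EDF); requests sharing a deadline are
    ordered by track number, i.e., served in one SCAN sweep."""
    return [block for block, deadline
            in sorted(requests, key=lambda r: (r[1], r[0]))]

queue = [(2, 3), (14, 1), (9, 3), (7, 2), (21, 1), (8, 2), (24, 2), (16, 1)]
print(scan_edf_order(queue))
```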
Stream-Oriented Disk Scheduling
Streams often have soft deadlines and tolerate some slack due to buffering, i.e., pure real-time scheduling is inefficient. Stream-oriented algorithms are targeted at streaming continuous media data.

Several algorithms proposed:
- group sweep scheduling (GSS)
- mixed disk scheduling strategy
- continuous media file system (CMFS)
- lottery scheduling
- stride scheduling
- batched SCAN (BSCAN)
- greedy-but-safe EDF (GS_EDF)
- bubble up
- ...
Group Sweep Scheduling (GSS)
GSS combines Round-Robin (RR) and SCAN; requests are serviced in rounds (cycles).

Principle:
- divide the S active streams into G groups
- service the G groups in RR order
- service each stream in a group in C-SCAN order
- playout can start at the end of the group

Special cases:
- G = S: RR scheduling
- G = 1: SCAN scheduling

Tradeoff between buffer space and disk arm movement:
- try different values for G and select the one giving the minimum buffer requirement
- a large G → smaller groups, more arm movements, smaller buffers
- a small G → larger groups, fewer arm movements, larger buffers

With high loads and equal playout rates, GSS and SCAN often service streams in the same order; replacing RR with FIFO and grouping requests by deadline gives SCAN-EDF.
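One round of GSS can be sketched as follows (an illustration only: how streams are assigned to groups is a policy choice, and here they are simply dealt out round-robin by name):

```python
def gss_round_order(stream_tracks, g):
    """stream_tracks: stream name -> track of its next block.
    Deal the streams into g groups, serve the groups in RR order, and sweep
    each group's requests in ascending track order (one C-SCAN pass)."""
    names = sorted(stream_tracks)
    groups = [names[i::g] for i in range(g)]   # e.g. {A,C} and {B,D} for g=2
    order = []
    for group in groups:                       # one round: each group once
        order += sorted(group, key=lambda s: stream_tracks[s])
    return order

# hypothetical next-block positions for four streams
print(gss_round_order({"A": 5, "B": 20, "C": 2, "D": 9}, g=2))
```

With g equal to the number of streams each group holds one stream and the result is plain RR; with g = 1 the whole round is a single SCAN sweep.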
Group Sweep Scheduling (GSS)
GSS example: streams A, B, C and D; groups g1:{A,C} and g2:{B,D}
- RR group schedule
- C-SCAN block schedule within a group

[Figure: blocks A1–A3, B1–B3, C1–C3 and D1–D3 spread over cylinders 1 – 25; the groups are served alternately (g1, g2, g1, g2, ...), each group swept in C-SCAN order: {A,C}, {B,D}, {C,A}, {B,D}, {A,C}, {B,D}]
Mixed Disk Scheduling Strategy (MDSS)
MDSS combines SSTF with buffer overflow and underflow prevention:
- data is delivered to several buffers (one per stream)
- the share of disk bandwidth is allocated according to buffer fill level
- SSTF is used to schedule the requests

[Figure: per-stream buffers feed a share allocator in front of an SSTF scheduler]
Continuous Media File System Disk Scheduling
CMFS provides (proposes) several algorithms:
- determines a new schedule on completion of each request
- orders requests so that no deadline violations occur
- delays new streams until it is safe to proceed (admission control)

Based on slack time, i.e., on the amount of data in the buffers and the deadlines of the next requests (how long can I delay the request before violating the deadline / the buffer running empty?). The slack defines the amount of time that
- can be used for non-real-time requests, or
- can be used to work ahead for continuous media requests

Useful algorithms:
- greedy – serve one stream as long as possible
- cyclic – distribute the current slack time to maximize future slack time
- both always serve the stream with the shortest slack time, but cyclic is more CPU intensive
Mixed-Media-Oriented Disk Scheduling
Multimedia applications may require both RT and NRT data – it is desirable to have all of it on the same disk. These schedulers are targeted at mixed workloads of streaming continuous media and other data.

Several algorithms proposed:
- Fellini's disk scheduler
- Delta L
- Fair mixed-media scheduling (FAMISH)
- MARS scheduler
- Cello
- Adaptive disk scheduler for mixed media workloads (APEX)
- ...
MARS Disk Scheduler
The Massively-parallel And Real-time Storage (MARS) scheduler supports mixed media on a single system:
- two-level, round-based scheduling
- top level: 1 NRT queue and n (≥ 1) RT queues (SCAN, but in the future GSS, SCAN-EDF, or ...)
- uses deficit RR fair queuing to assign quanta to each queue per round – divides the total bandwidth among the queues
- bottom level: selects requests from the queues according to their quanta, in SCAN order
- work-conserving (variable round times, a new round starts immediately)

[Figure: NRT and RT queues feed a deficit round robin fair queuing job selector]
Cello
Cello is part of the Symphony FS supporting mixed media:
- two-level, round-based scheduling
- top level: n (here 3) service classes (queues):
  - deadline (= end-of-round) real-time (EDF)
  - throughput-intensive best effort (FCFS)
  - interactive best effort (FCFS)
- divides the total bandwidth among the queues according to a static proportional allocation scheme (equal to MARS' job selector)
- bottom level: class-independent scheduler (FCFS)
  - selects requests from the queues according to their bandwidth share
  - sorts the requests from each queue in SCAN order when they are transferred
- partially work-conserving (extra requests might be added at the end of the class-independent scheduler if there is room, but rounds are constant)

[Figure: deadline RT, throughput-intensive best-effort and interactive best-effort queues feed the class-independent scheduler; each queue is sorted in SCAN order when transferred]
Adaptive Disk Scheduler for Mixed Media Workloads (APEX)
APEX is another mixed-media scheduler:
- a two-level, round-based scheduler similar to Cello and MARS
- uses token buckets for traffic shaping (bandwidth allocation)
- the batch builder selects requests in FCFS order from the queues based on the number of tokens – each queue must be sorted according to deadline (or another strategy)
- work-conserving:
  - adds extra requests to a batch if possible
  - starts an extra batch between ordinary batches

[Figure: a request distributor/queue scheduler and a queue/bandwidth manager feed the batch builder]
APEX, Cello and C-LOOK Comparison
Results from Ketil Lund (2002)

Configuration:
- Atlas Quantum 10K: avg. seek 5.0 ms, avg. latency 3.0 ms, transfer rate 18 – 26 MB/s
- data placement: random, video and audio multiplexed
- round time: 1 second
- block size: 64 KB

Workload: video playback and user queries. Six video clients:
- each playing back a random video
- random start times (after 17 secs, all have started)
APEX, Cello and C-LOOK Comparison
Nine different user-query traces, each with the following characteristics:
- inter-arrival times of queries and disk requests are exponentially distributed
- start with one trace, then add traces to increase the workload (⇒ queries may overlap)
- video-data disk requests are assigned to a real-time queue, user-query disk requests to a best-effort queue
- bandwidth is shared 50/50 between the real-time queue and the best-effort queue
- we measure response times (i.e., the time from a request arriving at the disk scheduler until the data is placed in the buffer) for user-query disk requests, and check whether deadline violations occur for video-data disk requests
APEX, Cello and C-LOOK Comparison
Average response time for user-query disk requests:

[Figure: average response time (ms, 0 – 2000) vs. number of user-query traces (1 – 9) for APEX, Cello and C-LOOK]

Deadline violations (video): APEX and Cello have no violations at any load; C-LOOK violates deadlines at every load, growing from 180 violations with one trace to 3266 with nine traces.
Data Placement on Disk
Data Placement on Disk
Disk blocks can be assigned to files in many ways, and several schemes are designed for:
- optimized latency
- increased throughput
The right scheme is access pattern dependent.
Disk Layout
Constant angular velocity (CAV) disks:
- constant rotation speed
- equal amount of data in each track (and thus constant transfer time)

Zoned CAV disks:
- zones are ranges of tracks
- typically few zones
- the different zones hold different amounts of data and have different bandwidth, i.e., more data and better bandwidth on the outer tracks
Disk Layout

[Figure: a non-zoned disk has the same number of sectors on the outer and inner tracks (constant transfer rate); a zoned disk has more sectors on the outer tracks than the inner ones (variable transfer rate)]
Disk Layout
Cheetah X15.3 is a zoned CAV disk:

Zone  Cylinders  Sectors    Zone Transfer  Sectors     Efficiency  Formatted
      per Zone   per Track  Rate (Mbps)    per Zone                Capacity (MB)
1     3544       672        890.98         19014912    77.2%       9735.635
2     3382       652        878.43         17604000    76.0%       9013.248
3     3079       624        835.76         15340416    76.5%       7854.293
4     2939       595        801.88         13961080    76.0%       7148.073
5     2805       576        755.29         12897792    78.1%       6603.669
6     2676       537        728.47         11474616    75.5%       5875.003
7     2554       512        687.05         10440704    76.3%       5345.641
8     2437       480        649.41         9338880     75.7%       4781.506
9     2325       466        632.47         8648960     75.5%       4428.268
10    2342       438        596.07         8188848     75.3%       4192.690

Always place often-used or high-rate data on the outermost tracks (zone 1)...!?
NO, arm movement is often more important than transfer time
Data Placement on Disk
Contiguous placement stores the disk blocks of a file contiguously on disk:
- minimal disk arm movement when reading the whole file (no intra-file seeks)
- pros/cons:
  ☺ the head must not move between read operations – no seeks / rotational delays
  ☺ can approach the theoretical transfer rate
  - but usually we read other files as well (giving possibly large inter-file seeks)
- real advantage: we do not have to pre-determine the block (read operation) size (whatever amount we read, at most track-to-track seeks are performed)
- no inter-operation gain if we have unpredictable disk accesses

[Figure: files A, B and C laid out contiguously on disk]
Using Adjacent Sectors, Cylinders and Tracks
To avoid seek time (and possibly rotational delay), we can store data likely to be accessed together on:
- adjacent sectors (similar to using larger blocks)
- if the track is full, another track on the same cylinder (only switch to another head)
- if the cylinder is full, the next (adjacent) cylinder (a track-to-track seek)
Data Placement on Disk
Interleaved placement stores the blocks of a file with a fixed number of other blocks in between each block:
- minimal disk arm movement when reading files A, B and C (starting at the same time)
- fine for predictable workloads reading multiple files
- no gain if we have unpredictable disk accesses

Non-interleaved (or even random) placement can be used for highly unpredictable workloads.

[Figure: blocks of files A, B and C interleaved on disk]
Data Placement on Disk
Organ-pipe placement considers the 'average' disk head position:
- place the most popular data where the head most often is
- the center of the disk is closest to the head on CAV disks
- but a bit further out for zoned CAV disks (modified organ-pipe)

[Figure: block access probability vs. cylinder number (innermost to outermost) for organ-pipe and modified organ-pipe placement; the peak sits mid-disk for organ-pipe and is skewed outward for modified organ-pipe. Note: the skew depends on the tradeoff between zoned transfer time and storage capacity vs. seek time]
BSD Example: Fast File System
FFS is a general file system:
- the idea is to keep an inode and its associated blocks close (no long seeks between getting the inode and the data)
- organizes the disk into partitions of cylinder groups, each having several inodes, a free-block bitmap, ...
- tries to store a file within one cylinder group, trying in order:
  - the next block on the same cylinder
  - a block within the same cylinder group
  - a block in another group, found using a hash function
  - a search over all cylinder groups for a free block
BSD Example: Log-Structured File System
Log-structured placement is based on the assumptions (facts?) that:
- RAM memory is getting larger
- write operations are the most expensive
- reads can often be served from the buffer cache (!!??)

Organize the disk blocks as a circular log:
- periodically, all pending (so far buffered) writes are performed as a batch
- write to the next free block regardless of content (inode, directory, data, ...)
- a cleaner reorganizes holes and deleted blocks in the background
- stores blocks contiguously when writing a single file
- efficient for small writes; other operations perform as in a traditional UNIX FS
Linux Example: XFS, JFS, ...
Count-augmented address indexing in the extent sections:
- observation: indirect block reads introduce extra disk I/O and break access locality
- introduce a new inode structure: add a count field to the direct data pointer entries – direct points to a disk block and count indicates how many other blocks follow the first block (contiguously)
- if contiguous allocation is assured, each direct entry is able to address many more blocks without retrieving an indirect block

[Figure: an inode with attributes, direct entries 0 – 11 each augmented with a count, plus single, double and triple indirect pointers; a direct entry with count 3 covers three contiguous data blocks]
Windows Example: NTFS
Each partition contains a master file table (MFT):
- a linear sequence of records
- each record describes a directory or a file (attributes and disk addresses)
- the first 16 records are reserved for NTFS metadata
- a record consists of a record header, standard info, file name, data header, and info about the data blocks

A file can be ...
- stored within the record itself (an immediate file, < a few 100 B)
- represented by disk block addresses (which hold the data): runs of consecutive blocks (<addr, no>, like extents), e.g., run 1 = <20, 4>, run 2 = <30, 2>, run 3 = <74, 7>
- spread over several records if more runs are needed, e.g., a base record (24) pointing to extension records (26 and 27), each holding its own runs
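Looking up a block through such a run list is a simple walk (a sketch with the slide's example runs; the function name is mine):

```python
def resolve_block(runs, logical):
    """Map a logical file block number to a disk block through a list of
    NTFS-style runs, each a (start_address, length) pair."""
    for addr, count in runs:
        if logical < count:
            return addr + logical       # the block lies within this run
        logical -= count                # skip past this run and continue
    raise IndexError("logical block beyond end of file")

runs = [(20, 4), (30, 2), (74, 7)]      # the slide's example: 4 + 2 + 7 = 13 blocks
print(resolve_block(runs, 0))
print(resolve_block(runs, 5))
```

For a contiguously allocated file, one run describes arbitrarily many blocks – the same economy the count-augmented XFS inodes above aim for.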
BSD Example: Minorca File System
Minorca is a multimedia file system (from IFI/UiO):
- enhanced allocation of disk blocks for contiguous storage of media files
- supports both continuous and non-continuous files in the same system using different placement policies

Multimedia-Oriented Split Allocation (MOSA) – one file system, two sections:
- cylinder group sections for non-continuous files:
  - like traditional BSD FFS disk partitions
  - small block sizes (like 4 or 8 KB)
  - traditional FFS operations
- extent sections for continuous files:
  - extents contain one or more (adjacent) cylinder group sections: summary information, an allocation bitmap, and a data block area
  - each extent is expected to store one media file
  - large block sizes (e.g., 64 KB)
  - new "transparent" file operations; create a file using O_CREATEXT

[Figure: the disk is divided into cylinder groups followed by extents]
Other File System Examples
Contiguous allocation:
- Presto
  - similar to Minorca extents for continuous files
  - doesn't support small, discrete files
- Fellini
  - simple flat file system
  - maintains a free block list, grouping contiguous blocks
- Continuous Media File System
- ...
Modern Disks: Complicating Factors
Complicating Factors
Disks used to be simple devices, and disk scheduling used to be performed by the OS (file system or device driver) only…
… but new disks are more complex and hide their true layout, e.g.:
only logical block numbers are exposed
different numbers of surfaces, cylinders, sectors, etc.
[Illustration: OS view vs. real view of the disk layout]
New disks also transparently move blocks to spare cylinders, e.g., due to bad disk blocks
e.g., the Seagate X15.3 has spare cylinders in each zone – zones 1–10: (7, 7, 6, 6, 6, 5, 5, 5, 5, 5)
New disks also have different zones:
NB! illustration of transfer time, not rotation speed
Constant angular velocity (CAV) disks:
constant rotation speed
equal amount of data on each track
thus, constant transfer time

Zoned CAV disks:
constant rotation speed
zones are ranges of tracks
typically few zones
the different zones hold different amounts of data, i.e., more on the outer tracks
thus, variable transfer time
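The effect on transfer time can be seen with a back-of-the-envelope calculation (the rpm and sectors-per-track figures below are illustrative, not from a specific disk): at constant rotation speed, a track that holds more sectors delivers more bytes per revolution.

```python
def transfer_rate_bytes_per_sec(rpm, sectors_per_track, sector_size=512):
    """At constant angular velocity, one revolution sweeps one track."""
    revolutions_per_sec = rpm / 60.0
    return sectors_per_track * sector_size * revolutions_per_sec

RPM = 15000  # e.g., a 15k rpm disk
for zone, spt in [("outer", 800), ("middle", 650), ("inner", 500)]:
    rate = transfer_rate_bytes_per_sec(RPM, spt)
    print(f"{zone} zone: {rate / 1e6:.1f} MB/s")
# outer zone: 102.4 MB/s, middle zone: 83.2 MB/s, inner zone: 64.0 MB/s
```

This is why placing popular or bandwidth-hungry media files in the outer zones can pay off on zoned CAV disks.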
The disk head also accelerates – most algorithms assume a linear movement overhead:
[Plot: seek time vs. cylinders traveled – time grows non-linearly, from x at 1 cylinder to roughly 10x–20x at N cylinders]
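A commonly used approximation (the coefficients below are illustrative, not measured from a real disk) splits a seek into an acceleration phase, where time grows with the square root of the distance, and a coast phase, where it grows roughly linearly:

```python
import math

def seek_time_ms(cylinders, a=1.0, b=0.15, threshold=400, c=3.8, e=0.0005):
    """Two-phase seek-time model (illustrative coefficients).

    Short seeks (acceleration-dominated): a + b * sqrt(d)
    Long seeks  (coast-dominated):        c + e * d
    Coefficients are chosen so the two phases meet at the threshold.
    """
    if cylinders == 0:
        return 0.0
    if cylinders < threshold:
        return a + b * math.sqrt(cylinders)
    return c + e * cylinders

print(seek_time_ms(1))      # ~1.15 ms: barely more than the fixed overhead
print(seek_time_ms(100))    # ~2.5 ms: 100x the distance, only ~2x the time
print(seek_time_ms(10000))  # ~8.8 ms: another 100x, still only ~3.5x
```

The non-linearity is the point: schedulers that assume seek time is proportional to the number of cylinders traveled systematically mis-estimate short seeks.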
On-device (and controller) buffer caches may use read-ahead prefetching:
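A minimal sketch of read-ahead prefetching (illustrative Python, not real disk firmware): on a cache miss, the device fetches the requested block plus the next few consecutive blocks, so a sequential reader hits the on-device buffer.

```python
class ReadAheadCache:
    """On a miss, fetch the requested block and the next window-1 blocks."""

    def __init__(self, backend_read, window=4):
        self.backend_read = backend_read  # function: block number -> data
        self.window = window
        self.cache = {}

    def read(self, block):
        if block in self.cache:
            return self.cache[block]                 # hit: no disk access
        for b in range(block, block + self.window):
            self.cache[b] = self.backend_read(b)     # miss: read-ahead
        return self.cache[block]

accesses = []
def fake_disk(block):
    accesses.append(block)
    return f"data-{block}"

cache = ReadAheadCache(fake_disk, window=4)
cache.read(10)     # miss: fetches blocks 10-13 from the platter
cache.read(11)     # hit: served from the on-device buffer
print(accesses)    # [10, 11, 12, 13]
```

For the OS scheduler this is another complication: a request it carefully ordered may be answered from the buffer without any mechanical movement at all.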
There are also gaps and checksums between the sectors on each track.
In summary, modern disks hide their true layout, transparently move blocks to spare cylinders, have different zones, accelerate the head, may use read-ahead prefetching in on-device (and controller) buffer caches, and have gaps and checksums between the sectors.
They are also “smart”, with a built-in low-level scheduler (usually a SCAN derivative), so we cannot fully control the device (it is a black box).
Why Still Do Disk-Related Research?
If the disk is more or less a black box – why bother?
many (old) existing disks do not have the “new properties”
according to Seagate technical support:
“blocks assumed contiguous by the OS probably still will be contiguous, but the whole section of blocks might be elsewhere”
[private email from Seagate support]
But the new disk properties should be taken into account:
existing extent-based placement is probably still good
the OS could (should?) focus on high-level scheduling only, e.g., of delay-sensitive requests
Next Generation Disk Scheduling?
Thus, due to the complicating factors… (like head acceleration, disk buffer caches, hidden data layout, the built-in “SCAN” scheduler, …)
… a VoD server scheduler could be (???):
a hierarchical high-level software scheduler:
several top-level queues (at least RT & NRT)
process the queues in rounds (round-robin)
o dynamic assignment of quantums
o work-conservation with variable round length
(full disk bandwidth utilization vs. buffer requirements)
at the lowest level, only a simple collection of requests according to the quantums, forwarded to the disk, because …
… a fixed SCAN scheduler runs in hardware (on the disk)
Current research also looks at on-device programmable processors
…
[Diagram: an RT queue (EDF) and an NRT queue (FCFS) feed a collection stage that does no sorting; the disk's built-in SCAN scheduler orders the requests]
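Such a hierarchical scheduler might be outlined as follows (a sketch; all class/method names and quantum values are illustrative assumptions): an EDF-ordered RT queue and a FCFS NRT queue are drained in rounds and forwarded unsorted, since the disk's own SCAN scheduler does the final ordering.

```python
import heapq
from collections import deque

class HierarchicalScheduler:
    """Top-level queues only: EDF for real-time, FCFS for non-real-time.

    Each round, up to the queue's quantum of requests is collected and
    forwarded unsorted -- the on-disk SCAN scheduler orders them.
    """

    def __init__(self, rt_quantum=4, nrt_quantum=2):
        self.rt = []            # min-heap ordered by deadline (EDF)
        self.nrt = deque()      # FIFO (FCFS)
        self.rt_quantum = rt_quantum
        self.nrt_quantum = nrt_quantum

    def submit_rt(self, deadline, block):
        heapq.heappush(self.rt, (deadline, block))

    def submit_nrt(self, block):
        self.nrt.append(block)

    def next_round(self):
        """Collect one round of requests to forward to the disk."""
        batch = []
        for _ in range(self.rt_quantum):
            if not self.rt:
                break
            batch.append(heapq.heappop(self.rt)[1])
        for _ in range(self.nrt_quantum):
            if not self.nrt:
                break
            batch.append(self.nrt.popleft())
        return batch  # forwarded as-is; the disk's SCAN sorts them

s = HierarchicalScheduler()
s.submit_rt(deadline=20, block=500)
s.submit_rt(deadline=10, block=900)
s.submit_nrt(block=42)
print(s.next_round())  # [900, 500, 42]: EDF order first, then FCFS
```

A work-conserving variant would let the NRT queue borrow the RT quantum when the RT queue runs dry, trading round-length predictability for full disk bandwidth utilization.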
The End: Summary
Summary
The main bottleneck is disk I/O performance, due to disk mechanics: seek times and rotational delays

Many algorithms try to minimize seek overhead (most existing systems use a SCAN derivative)

Proper block allocation may improve transfer rates

The world today is more complicated (both different media and unknown disk characteristics)

Next week: storage systems (part II) – multiple disks, multimedia file system examples, …
Some References
1. Anderson, D. P., Osawa, Y., Govindan, R.: “A File System for Continuous Media”, ACM Transactions on Computer Systems, Vol. 10, No. 4, Nov. 1992, pp. 311–337
2. Elmasri, R. A., Navathe, S.: “Fundamentals of Database Systems”, Addison Wesley, 2000
3. Garcia-Molina, H., Ullman, J. D., Widom, J.: “Database Systems – The Complete Book”, Prentice Hall, 2002
4. Lund, K.: “Adaptive Disk Scheduling for Multimedia Database Systems”, PhD thesis, IFI/UniK, UiO
5. Plagemann, T., Goebel, V., Halvorsen, P., Anshus, O.: “Operating System Support for Multimedia Systems”, Computer Communications, Vol. 23, No. 3, February 2000, pp. 267–289
6. Seagate Technology, http://www.seagate.com
7. Sitaram, D., Dan, A.: “Multimedia Servers – Applications, Environments, and Design”, Morgan Kaufmann Publishers, 2000