IDO: Intelligent Data Outsourcing with Improved RAID Reconstruction Performance
in Large-Scale Data Centers
Suzhen Wu§*, Hong Jiang*, Bo Mao*
§Xiamen University
*University of Nebraska–Lincoln
Data Deluge
Social Network
Business Intelligence
Scientific Simulation
Mobile Apps
2,300 tweets per second
275 EB data flowing per day in 2020
Safely storing such a huge volume of data poses a big challenge for system administrators!
Where Are We?
Laptop and Desktop: Interruptible Event
Data Center: Common Case
Disk Failure in the Real World
• Higher error rates than expected
  – Complete disk failures, 2%~4% on average;
  – Latent sector errors, 3.45%;
• Correlation in drive failures
  – e.g., after one disk fails, another disk failure will likely occur soon.
• RAID reconstruction becomes an operational state in data centers
  – Increasing disk capacity and number of drives
More Observations
• Linux software RAID (MD) mailing list: too many complaints about slow recovery speed.
• Storage at Exascale: Some thoughts from Panasas CTO Garth Gibson. Disk failure is a normal case in exascale storage systems.
• ……
RAID Reconstruction Challenges
• Online RAID Reconstruction:
• Two challenges:
– Real-time user performance;
– Window of vulnerability.
[Figure labels: User I/O Requests; Reconstruction I/O Requests]
The number of user I/O requests that can be removed from the degraded RAID directly affects reconstruction performance.
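The relationship between user I/O load and reconstruction time can be illustrated with a back-of-the-envelope model. This is a minimal sketch under a simplifying assumption that is not from the talk: user I/O takes a fixed fraction of each disk's bandwidth and the rebuild gets the rest. The function name, parameters, and numbers are hypothetical.

```python
def reconstruction_time_s(capacity_mb, disk_bandwidth_mb_s, user_load_fraction):
    """Estimate rebuild time when user I/O takes a fixed share of disk bandwidth.

    Assumption (illustrative only): the rebuild proceeds at whatever bandwidth
    is left over after user I/O has been served.
    """
    if not 0 <= user_load_fraction < 1:
        raise ValueError("user_load_fraction must be in [0, 1)")
    rebuild_bandwidth = disk_bandwidth_mb_s * (1 - user_load_fraction)
    return capacity_mb / rebuild_bandwidth


# Hypothetical disk: 160,000 MB to rebuild at 100 MB/s.
busy = reconstruction_time_s(160_000, 100, user_load_fraction=0.5)
idle = reconstruction_time_s(160_000, 100, user_load_fraction=0.0)
print(busy, idle)
```

In this model, removing all user I/O from the degraded set halves the rebuild time relative to the 50% load case. Real systems are less linear because user and rebuild requests also cause extra seeks.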
The State of the Art
• Optimizing the reconstruction workflow:
  – DOR (CMU PDL)
  – Live-block recovery (USENIX FAST’04)
  – PRO (USENIX FAST’07)
• Optimizing the user I/O requests:
  – MICRO (IEEE TC’08)
  – WorkOut (USENIX FAST’09)
  – VDF (USENIX ATC’11)
Comparison with the State of the Art
Characteristics    PRO (FAST’07)    WorkOut (FAST’09)    VDF (USENIX’11)    IDO (LISA’12)
Proactive √
Temporal Locality √ √ √ √
Spatial Locality √ √
User I/O √ √ √
Reconstruction I/O √ √
Observation 1
• RAID reconstruction is an operational state in large-scale data centers, which means that reactive schemes are inefficient.
  – Reactive vs. proactive?
• Existing schemes are all reactive.
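The difference between the two approaches can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the scheme from the talk: the trace, the `hot_threshold` parameter, and the rule that a block is redirected once the tracker has seen it `hot_threshold` times are all hypothetical.

```python
from collections import Counter


def served_by_degraded(trace, failure_index, proactive, hot_threshold=2):
    """Count post-failure requests that still hit the degraded RAID.

    trace: block ids in arrival order (hypothetical workload).
    proactive: if True, popularity is tracked from the start of the trace;
               if False, tracking starts only when the disk fails.
    A block counts as redirected once the tracker has already seen it at
    least hot_threshold times (an assumption made for illustration).
    """
    counts = Counter()
    if proactive:
        counts.update(trace[:failure_index])  # history gathered before the failure
    hits_on_degraded = 0
    for block in trace[failure_index:]:
        if counts[block] < hot_threshold:
            hits_on_degraded += 1  # not yet known to be hot
        counts[block] += 1
    return hits_on_degraded


trace = ["a", "b", "a", "a", "b", "c", "a", "b", "a", "d"]
failure_index = 5  # the disk fails before the sixth request
reactive_hits = served_by_degraded(trace, failure_index, proactive=False)
proactive_hits = served_by_degraded(trace, failure_index, proactive=True)
print(reactive_hits, proactive_hits)
```

With this trace the reactive tracker has to relearn which blocks are hot after the failure, so more requests land on the degraded RAID than in the proactive case.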
Example 1: Reactive vs. Proactive
Observation 2
• With large RAM and SSDs, temporal locality is poor at the HDD level. However, spatial locality is good due to the sequential accesses of HDDs.
  – Temporal locality vs. spatial locality?
• Existing studies mostly focus on temporal locality and ignore spatial locality.
Example 2: Temporal vs. Spatial
Blocks: a b c d
(a) Request-based approach: migrate requested “a” to Surrogate Set
(b) Zone-based approach: migrate hot zone to Surrogate Set
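The two migration granularities can be compared on a small sequential trace. This is a minimal sketch, assuming that migration is triggered by the first request that misses the surrogate set; the function name, the integer block addresses, and `zone_size` are hypothetical and not taken from the talk.

```python
def redirected_requests(trace, zone_size, zone_based):
    """Count requests that find their block already in the surrogate set.

    trace: integer block addresses in arrival order (hypothetical workload).
    zone_based: if False, only the requested block is migrated on a miss;
                if True, the whole zone containing it is migrated.
    """
    surrogate = set()
    redirected = 0
    for block in trace:
        if block in surrogate:
            redirected += 1
            continue
        if zone_based:
            start = (block // zone_size) * zone_size
            surrogate.update(range(start, start + zone_size))
        else:
            surrogate.add(block)
    return redirected


# Sequential access to blocks a, b, c, d (addresses 0..3), then "a" again.
trace = [0, 1, 2, 3, 0]
request_based = redirected_requests(trace, zone_size=4, zone_based=False)
zone_based = redirected_requests(trace, zone_size=4, zone_based=True)
print(request_based, zone_based)
```

Request-based migration only helps when the same block is requested again, whereas zone-based migration also absorbs the sequential neighbors, which is where spatial locality pays off.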
The Motivation
[Figure: user I/O traffic removed from degraded RAID (0% to 100%) for the WebSearch2, Financial2, and Microsoft Project traces; series: Reactive-request, Reactive-zone, Proactive-request, Proactive-zone]
IDO: Intelligent Data Outsourcing
• The main idea:
  – Proactively identify the hot data zones;
  – Upon disk failure,
    • Recover the hot data zones first;
    • Migrate the hot data zones to the surrogate set;
    • Redirect the user I/O requests.
• The design objectives:
  – Reducing reconstruction time;
  – Improving the user I/O performance;
  – Applicable to other background tasks.
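The steps of the main idea can be sketched in a few functions. This is an illustrative sketch only, not the IDO implementation: `ZONE_SIZE`, the function names, the top-k selection rule, and the sample history are assumptions introduced here.

```python
from collections import Counter

ZONE_SIZE = 4  # blocks per zone; hypothetical value


def zone_of(block):
    """Map a block address to the zone that contains it."""
    return block // ZONE_SIZE


def identify_hot_zones(history, top_k):
    """Rank zones by access count seen before the failure; return the top_k."""
    counts = Counter(zone_of(block) for block in history)
    return [zone for zone, _ in counts.most_common(top_k)]


def reconstruction_order(num_zones, hot_zones):
    """Rebuild hot zones first, then the remaining zones in address order."""
    hot = set(hot_zones)
    cold = [zone for zone in range(num_zones) if zone not in hot]
    return list(hot_zones) + cold


def route(block, migrated_zones):
    """Send a request to the surrogate set if its zone has been migrated."""
    return "surrogate" if zone_of(block) in migrated_zones else "degraded"


# Accesses observed before the failure (hypothetical block addresses).
history = [8, 9, 10, 8, 9, 0, 1, 8, 13]
hot_zones = identify_hot_zones(history, top_k=2)
order = reconstruction_order(num_zones=4, hot_zones=hot_zones)
print(hot_zones, order, route(9, set(hot_zones)), route(5, set(hot_zones)))
```

Because the hot zones are known before the failure, they can be rebuilt and migrated first, and requests to them stop competing with reconstruction on the degraded RAID early in the rebuild.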
System Overview
[Figure: two storage nodes connected by a network. Each runs a Software RAID Controller with an IDO module made up of a Hot Zone Identifier, Data Migrator, Request Distributor, and Data Reclaimer. Other labels: Failed Disk, New Disk, RAID Reconstruction, Data Migration, Working / Degraded RAID, Surrogate RAID, Working / Surrogate RAID]
Performance Evaluation
• The IDO prototype is implemented as a built-in module in Linux MD and compared with WorkOut and VDF.
• Intel Xeon 3440 processor, 8GB DDR memory, WDC WD1600AAJS SATA disks.
• Trace-driven evaluations
RAID5 Results
(a) Average Response Time during Recovery [Figure: average response time (ms), 0 to 60, for Fin1, Fin2, Web2, Proj; series: WorkOut, VDF, IDO]
(b) Reconstruction Time [Figure: reconstruction time (s), 0 to 3500, for Fin1, Fin2, Web2, Proj; series: WorkOut, VDF, IDO]
RAID6 Results
(a) Average Response Time during Recovery [Figure: average response time (ms), 0 to 40, for Fin1, Fin2, Web2, Proj; series: WorkOut, VDF, IDO]
(b) Reconstruction Time [Figure: reconstruction time (s), 0 to 1800, for Fin1, Fin2, Web2, Proj; series: WorkOut, VDF, IDO]
Detailed Real-time Results
(a) WebSearch2.spc [Figure: user response time (ms, log scale, 0.1 to 1000) vs. reconstruction time (s, 0 to 2500); series: WorkOut, VDF, IDO; markers: VDF ends, WorkOut ends, IDO ends]
(b) Microsoft Project [Figure: user response time (ms, log scale, 1 to 1000) vs. reconstruction time (s, 0 to 500); series: WorkOut, VDF, IDO; markers: VDF ends, WorkOut ends, IDO ends]
Shorter reconstruction time; lower user response times
Reduced I/Os and Sensitivity Study
• Reduced I/Os: [Figure: percentage (% of total), 0 to 100, for Fin1, Fin2, Web2, Proj; series: WorkOut, VDF, IDO; bar labels: 3.4, 1.3]
• Sensitivity & overhead analysis (in the paper).
Extendibility Evaluation
(a) Re-synchronization Time [Figure: re-synchronization time (s), 0 to 3000, for Fin1, Fin2, Web2, Proj; series: Default, WorkOut, IDO]
(b) Average Response Time [Figure: average response time (ms), 0 to 40, for Fin1, Fin2, Web2, Proj; series: Default, WorkOut, IDO]
Summary of IDO
• RAID reconstruction is an operational state in large-scale data centers!
• Salient features of IDO:
  – Proactive;
  – Exploits both temporal and spatial localities;
  – Optimizes both user and reconstruction I/Os;
  – Portability and extendibility.
Thanks!