An Active and Hybrid Storage System for Data-intensive Applications
![Page 1: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/1.jpg)
04/08/2023
An Active and Hybrid Storage System for Data-intensive Applications
Ph.D. Candidate: Zhiyang Ding
Defense Committee Members: Dr. Xiao Qin, Dr. Kai H. Chang, Dr. David A. Umphress
University Reader: Prof. Wei Wang, Chair of the Art Design Dept.
![Page 2: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/2.jpg)
Cluster Computing
• Large-scale Data Processing is everywhere.
![Page 3: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/3.jpg)
Motivation
• Traditional Storage Nodes on the Cluster
[Figure: cluster diagram with Client, Internet, Head Node, Network switch, Compute Nodes, and Storage Node (or Storage Area Network)]
![Page 4: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/4.jpg)
Motivation
• What's next?
• More "Active".
[Figure: cluster diagram with Client, Internet, Head Node, Network switch, Compute Nodes, and Storage Node; arrows labeled Computation Offload, I/O Request, Raw Data, and Pre-processed Data]
![Page 5: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/5.jpg)
About the Active Storage
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage
[Figure label: Storage Node]
![Page 6: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/6.jpg)
McSD: A Multicore Active Storage Device
• I/O Wall Problem: CPU-I/O Gap
– Limited I/O Bandwidth
– CPU Waiting and Dissipating Power
• How to
– Bridge the CPU-I/O Gap
– Reduce I/O Traffic
![Page 7: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/7.jpg)
Why McSD?
• "Active":
– Leveraging the Processing Power of Storage Devices
• Benefits:
– Offloading Data-intensive Computation
– Reducing I/O Traffic
– Pipeline Parallel Programming
![Page 8: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/8.jpg)
Contributions
• Design a prototype of a multicore active storage
• Design a pre-assembled processing module
• Extend a shared-memory MapReduce system
• Emulate the whole system on a real testbed
![Page 9: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/9.jpg)
Background: Active Disks
• Traditional Smart/Active Disks
– On-board: Embedding a processor into the hard disk
– Various Research Models
  • e.g., active disk, smart disk, IDISK, SmartSTOR, etc.
• However, "active disk" is not adopted by hardware vendors
Improved attachment technologies
I/O Bound Workloads
Cost of the System
Reliability
![Page 10: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/10.jpg)
Background: Parallel Processing
• Multi-core Processors or Multi-processors
– A 45% increase in transistors yields a 20% increase in processing power
• MapReduce: a Parallel Programming Model
– MapReduce by Google
– Hadoop, Mars, Phoenix, etc.
• Multicore and Shared-memory Parallel Processing
![Page 11: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/11.jpg)
Design: System Overview
[Figure: Design of an Active Storage; labels: Multicore and Shared-memory Parallel Processing, Communication Mechanism, Hybrid Storage Disks, Pipeline Parallel Processing]
![Page 12: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/12.jpg)
Design and Implementation
• Computation Mechanism
– Pre-assembled Processing Model
– smartFAM
• Extend the Shared-Memory MapReduce by Partitioning
![Page 13: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/13.jpg)
Pre-assembled Processing Modules
• Pre-assembled Processing Modules
– Meet the nature of embedded services
– Reduce Complexity and Cost
– Provide Services
  • E.g., multi-version antivirus service, pre-processing of data-intensive apps, de-duplication, etc.
• How to invoke services?
![Page 14: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/14.jpg)
smartFAM
• smartFAM = Smart File Alteration Monitor
– Invokes the pre-assembled processing modules or functions by monitoring changes to the system log file.
• Two Components:
– an inotify function: a Linux system function
– a trigger daemon
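The log-monitoring idea can be illustrated with a minimal sketch. It makes several assumptions that are not in the slides: the log is plain text with one request per line in the form `<module> <argument>`, the `MODULES` registry and `poll_once` are hypothetical names, and polling replaces inotify because the Python standard library has no inotify binding. A real trigger daemon would block on inotify events rather than being polled.

```python
import os
import tempfile

# Hypothetical registry of pre-assembled modules, keyed by name.
MODULES = {
    "wordcount": lambda arg: f"wordcount started on {arg}",
    "dedup": lambda arg: f"dedup started on {arg}",
}

def poll_once(log_path, offset):
    """Process log lines appended since `offset`.

    Returns (new_offset, results). Polling is a stand-in for inotify.
    """
    with open(log_path, "rb") as log:
        log.seek(offset)
        data = log.read()
    # Consume complete lines only; a partially written line waits for the next poll.
    end = data.rfind(b"\n") + 1
    results = []
    for line in data[:end].decode("utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2 and parts[0] in MODULES:
            results.append(MODULES[parts[0]](parts[1]))
    return offset + end, results

def demo():
    """Append two requests to a temporary log and poll after each append."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        offset, seen = 0, []
        for entry in (b"wordcount /data/a.txt\n", b"dedup /data/b.bin\n"):
            with open(path, "ab") as log:
                log.write(entry)
            offset, results = poll_once(path, offset)
            seen.extend(results)
        return seen
    finally:
        os.remove(path)
```

Tracking a byte offset means each request is dispatched once, even if the daemon is polled many times between writes.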
![Page 15: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/15.jpg)
Design and Implementation
[Figure: diagram with callouts numbered 1, 2, 3]
![Page 16: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/16.jpg)
Extend the Phoenix: A Shared-memory MapReduce Model
• Extend the Phoenix MapReduce Programming Model by partitioning and merging
– New API: partition_input
– New Functions:
  • partition (provided by the new API)
  • merge (developed by the user)
• Example:
– wordcount [data-file][partition-size][]
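The partition-and-merge extension can be sketched for word count as follows. This is an illustration under assumptions, not the Phoenix API: only the name `partition_input` comes from the slide, its signature here is invented, partition size is measured in characters, and the partitions are processed sequentially rather than by a shared-memory MapReduce runtime.

```python
from collections import Counter

def partition_input(text, partition_size):
    """Split text into chunks of roughly `partition_size` characters,
    cutting only between words so that no word is split."""
    chunks, current, length = [], [], 0
    for word in text.split():
        # Start a new chunk once adding this word would exceed the limit.
        if current and length + len(word) + 1 > partition_size:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(word)
        length += len(word) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks

def word_count(chunk):
    """Count words in one partition (stand-in for a MapReduce run)."""
    return Counter(chunk.split())

def merge(partials):
    """User-supplied merge: sum the per-partition counts."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def partitioned_word_count(text, partition_size):
    return merge(word_count(c) for c in partition_input(text, partition_size))
```

Because each partition is counted independently, only one partition's working set has to fit in memory at a time; the merge step then restores the result a single pass would have produced.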
![Page 17: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/17.jpg)
Pipeline Processing
![Page 18: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/18.jpg)
Evaluation Environment
• Testbed
• Benchmarks
– Word Count
– String Match
– Matrix Multiplication
• Individual Node Performance
• System Performance
![Page 19: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/19.jpg)
Individual Node Performance

| | Word Count, 1 GB (s) | Word Count, 1.25 GB (s) | String Match, 1 GB (s) | String Match, 1.25 GB (s) |
|---|---|---|---|---|
| w/ Partition | 40.60 | 50.91 | 17.76 | 20.61 |
| w/o Partition | 85.74 | 139.54 | 17.62 | 21.00 |
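The effect of partitioning is easier to compare as a ratio. The short sketch below computes the speedup (time without partitioning divided by time with partitioning) from the times in the table; the dictionary layout and the `speedup` helper are illustrative names, not part of the original work.

```python
# Execution times in seconds, copied from the Individual Node Performance table.
TIMES = {
    ("Word Count", "1 GB"): {"with": 40.60, "without": 85.74},
    ("Word Count", "1.25 GB"): {"with": 50.91, "without": 139.54},
    ("String Match", "1 GB"): {"with": 17.76, "without": 17.62},
    ("String Match", "1.25 GB"): {"with": 20.61, "without": 21.00},
}

def speedup(entry):
    """Speedup of the partitioned run over the unpartitioned run."""
    return entry["without"] / entry["with"]

def all_speedups():
    return {key: round(speedup(entry), 2) for key, entry in TIMES.items()}

if __name__ == "__main__":
    for (benchmark, size), value in all_speedups().items():
        print(f"{benchmark:12s} {size:8s} {value:.2f}x")
```

On these numbers, Word Count roughly doubles or better with partitioning, while String Match stays close to 1x in both directions.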
![Page 20: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/20.jpg)
System Evaluation
Matrix-Multiplication and Word-Count (Speedups)

| Input Data Size | vs Single Machine | vs Single-core Active | vs McSD w/o Partition |
|---|---|---|---|
| 500 MB | 1.47 X | 2.15 X | 0.99 X |
| 750 MB | 1.45 X | 2.09 X | 1.04 X |
| 1 GB | 7.62 X | 2.14 X | 6.07 X |
| 1.25 GB | 19.01 X | 2.50 X | 15.39 X |
![Page 21: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/21.jpg)
Summary
• Offloading data-intensive computation can improve system performance
• McSD is a promising active storage model with
– Pre-assembled processing modules
– Parallel data processing
– Better performance in the evaluation
![Page 22: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/22.jpg)
About the Active Storage
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage
[Figure label: Storage Node]
![Page 23: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/23.jpg)
Apply Active Storages to a Cluster
• So far, we know the potential of Active Storage
• Challenge: How to coordinate active storage nodes with computing nodes?
• Propose a Pipeline-parallel Processing pattern
![Page 24: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/24.jpg)
Contributions
• Propose a pipeline-parallel processing framework to "connect" an Active Storage node with computing nodes.
• Evaluate the framework using both an analytic model and a real implementation.
• Case Study: Extend an existing bioinformatics application based on the framework.
![Page 25: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/25.jpg)
Background: Active Storage
[Figure labels: SSD, Mass Storage, Active Storage Node, SSD, Memory, Buff Disks, Processor, Computation, Bridge?]
![Page 26: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/26.jpg)
Background: Bioinformatics App
• BLAST*: Basic Local Alignment Search Tool
– Comparing primary biological sequence information
• mpiBLAST** is a freely available, open-source, parallel implementation of NCBI BLAST.
– Format raw data files
– Run a parallel BLAST function
*http://blast.ncbi.nlm.nih.gov/
**http://www.mpiblast.org/
![Page 27: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/27.jpg)
Pipeline-parallel Design
• Offload the raw-data formatting task to where the data is stored.
• Intra-application Pipeline-parallel Processing by "partition" and "merge".
• pp-mpiBlast, a case study.
![Page 28: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/28.jpg)
Pipelining Workflow
[Figure: pipelining workflow across the Active Storage Node and the Computing Nodes. The raw input file is split into Partitions 1..n; FormatDB turns them into Intermediates 1..n; mpiBlast produces Sub-outputs 1..n; Merge combines them into the output file. Stage labels: Partition, FormatDB, mpiBlast, Merge; annotations: "(n-1) times", "n", "1".]
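The workflow can be sketched as a two-stage pipeline in which one stage formats partition i+1 while the other searches partition i. Everything below is a stand-in under stated assumptions: `format_partition` and `search_partition` are hypothetical placeholders, not the FormatDB or mpiBLAST interfaces, and two threads in one process stand in for the active storage node and the computing nodes.

```python
import queue
import threading

def format_partition(partition):
    """Placeholder for the formatting step on the active storage node."""
    return [seq.upper() for seq in partition]

def search_partition(formatted, query):
    """Placeholder for the search step on the computing nodes."""
    return [seq for seq in formatted if query in seq]

def run_pipeline(partitions, query):
    # At most one formatted partition waits in the hand-off queue.
    handoff = queue.Queue(maxsize=1)
    sub_outputs = []

    def storage_stage():
        for part in partitions:
            handoff.put(format_partition(part))
        handoff.put(None)  # sentinel: no more partitions

    def compute_stage():
        while True:
            formatted = handoff.get()
            if formatted is None:
                break
            sub_outputs.append(search_partition(formatted, query))

    producer = threading.Thread(target=storage_stage)
    consumer = threading.Thread(target=compute_stage)
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    # Merge: concatenate the sub-outputs in partition order.
    return [hit for sub in sub_outputs for hit in sub]
```

With n partitions, the two stages can overlap on all but the first formatting step and the last search step, which is where the pipeline gains over running the stages back to back.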
![Page 29: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/29.jpg)
Analytic Model
• Three Critical Measures
![Page 30: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/30.jpg)
Evaluation Environment
CPU: Intel XEON X3430 (Computing Nodes Configuration); Intel Core 2 Q9400 (Active Storage Configuration)
Memory: 2 GB DDR3 (PC3-10600)
OS: Ubuntu 9.04 Jaunty Jackalope, 32-bit version
Kernel: 2.6.28-15-generic
Network: Gigabit LAN

| | Our Testbed | Opposite Testbed | Opposite Testbed |
|---|---|---|---|
| Name | "Pipeline-parallel" | "12-node Cluster" | "13-node Cluster" |
| Computing nodes | 12 Computing Nodes | 12 Computing Nodes | 13 Computing Nodes |
| Storage | 1 Active Storage Node | 1 Storage Node | 1 Storage Node |
![Page 31: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/31.jpg)
Pipeline-parallel Design
Results: Compared With 12-node System
Results: Compared With 13-node System
![Page 32: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/32.jpg)
Speedup Trends: Partition Size
![Page 33: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/33.jpg)
Summary
• We proposed a pipeline-parallel processing mechanism to apply an Active Storage Node.
• As a case study, we extended a classic bioinformatics application based on the pipeline-parallel style.
![Page 34: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/34.jpg)
About the Active Storage
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage
[Figure label: Storage Node]
![Page 35: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/35.jpg)
What's Hybrid?
A Hybrid Combination of a Gas Engine and an Electric Motor
Power Efficiency
![Page 36: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/36.jpg)
Hybrid Disk Drives
• A Hybrid Combination of Two Types of Storage Devices: HDD and SSD
– HDD: Magnetic Hard Disk
– SSD: Solid State Disk, built from NAND-based flash memory.
What are their roles?
![Page 37: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/37.jpg)
Motivation
• However, SSDs suffer from reliability issues.
• In a hybrid storage system, using SSDs as the buffer can boost performance.
![Page 38: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/38.jpg)
Limitations Related to SSDs
• Flash Memory:
– Each block consists of 32, 64, or 128 pages.
– Each page is typically 512, 2,048, or 4,096 bytes.
• "Erase-before-write" at block level.
• Lifespan is 10,000 Program/Erase cycles.
– E.g., *the lifespan of an 80 GB MLC SSD is only 106 days if the write rate is 30 MB/s.
• Rethink their roles?
*Based on the SSD lifespan calculator provided by Virident.com
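A back-of-the-envelope version of the lifespan estimate is sketched below. It is not the Virident calculator: the formula assumes perfect wear leveling, decimal units (1 GB = 10^9 bytes), and a `write_amplification` parameter that is introduced here for illustration. With a write amplification of 1, the idealized bound for the slide's parameters comes out near 309 days, well above the quoted 106 days, which suggests the calculator accounts for overheads that the slide does not state.

```python
SECONDS_PER_DAY = 86_400

def ssd_lifespan_days(capacity_bytes, pe_cycles, write_rate_bytes_per_s,
                      write_amplification=1.0):
    """Idealized SSD lifespan in days under sustained writes.

    Assumes every block can be programmed `pe_cycles` times and that wear
    is spread evenly. `write_amplification` (assumed parameter) scales the
    physical writes caused by each logical write.
    """
    total_logical_bytes = capacity_bytes * pe_cycles / write_amplification
    return total_logical_bytes / write_rate_bytes_per_s / SECONDS_PER_DAY

# Parameters from the slide: 80 GB drive, 10,000 P/E cycles, 30 MB/s writes.
ideal_days = ssd_lifespan_days(80e9, 10_000, 30e6)
```

Even the idealized bound is under a year at this write rate, which supports the slide's point that SSDs are a poor fit for absorbing sustained write traffic.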
![Page 39: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/39.jpg)
Contributions
• Hybrid Combination of HDD and SSD disks
• De-duplication Service using HDDs as a Write Buffer
• Internal-parallel Processing in SSD
• Simulation of the Whole System For Evaluation
![Page 40: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/40.jpg)
Hybrid Disk Configuration
[Figure: hybrid disk configuration; labels: I/O Requests, Read Requests, Data of Write Requests, HDD, De-duplication, Dedicated Processor, Pre-processing, Deduplicated data, SSD, Pre-processed Data]
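The write path suggested by this configuration (writes buffered first, de-duplicated, then stored once) can be modeled with a small sketch. It is a toy model under assumptions: the class and method names are hypothetical, an in-memory list stands in for the HDD write buffer and a dictionary for the SSD, and SHA-256 fingerprints over whole blocks are one common de-duplication technique, not necessarily the one used in HcDD.

```python
import hashlib

class DedupWriteBuffer:
    """Toy model of a buffered, de-duplicating write path."""

    def __init__(self):
        self.buffer = []        # pending (address, block) writes: HDD role
        self.ssd_blocks = {}    # fingerprint -> unique block: SSD role
        self.address_map = {}   # logical address -> fingerprint

    def write(self, address, block):
        """Absorb a write in the buffer without touching the SSD."""
        self.buffer.append((address, block))

    def flush(self):
        """De-duplicate buffered writes; return blocks written to the SSD."""
        written = 0
        for address, block in self.buffer:
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.ssd_blocks:
                self.ssd_blocks[fingerprint] = block
                written += 1
            self.address_map[address] = fingerprint
        self.buffer.clear()
        return written

    def read(self, address):
        """Serve a read from the SSD; only flushed addresses are readable."""
        return self.ssd_blocks[self.address_map[address]]
```

In this model, three writes of which two are identical cost only two SSD block writes, which is the mechanism by which de-duplication would reduce wear on the flash.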
![Page 41: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/41.jpg)
HcDD Architecture
![Page 42: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/42.jpg)
Deduplication Design
![Page 43: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/43.jpg)
Internal Parallel Processing
![Page 44: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/44.jpg)
Evaluation
![Page 45: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/45.jpg)
Internal Parallelism Evaluation: Single Node
![Page 46: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/46.jpg)
Single Node: Dedup Ratio
![Page 47: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/47.jpg)
System Performance Evaluation
![Page 48: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/48.jpg)
System Performance Evaluation
![Page 49: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/49.jpg)
Summary
![Page 50: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/50.jpg)
Conclusion
• pp-mpiBlast: How to deploy Active Storage?
• McSD: A Smart Disk Model
• HcDD: Hybrid Disk for Active Storage
[Figure label: Storage Node]
![Page 51: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/51.jpg)
Future Work
![Page 52: An Active and Hybrid Storage System for Data-intensive Applications](https://reader038.fdocuments.us/reader038/viewer/2022102804/546bb5c9af795900458b5665/html5/thumbnails/52.jpg)
Many Thanks! And Questions?