Post on 14-Apr-2018
5/28/2014
University of Minnesota Digital Technology Center Intelligent Storage Consortium
David Hung-Chang Du
Qwest Chair Professor
Computer Science and Engineering
University of Minnesota
du@cs.umn.edu
CRIS: NSF I/UCRC Center on Intelligent Storage
More information on http://cris.cs.umn.edu
Storage Research for Solving Big Data Problem
Outline of Talk
• Two Major Changes in Computing & Communication Environment
• Big Data Problem
• Solving Big Data Problem
– Software Defined Network vs. Software Defined Storage
• Storage Research Projects at NSF I/UCRC Center on Intelligent Storage
• Conclusions
Instrument and Connect the World!
Bridge Monitoring
Building Environment Controls
Earthquake Monitoring
Elder Care
Factories
Fire Response
First Responders
Forest Management
Soil Monitoring
Supply Chain
Wind Response
… and more
OOPSLA, Jeannette M. Wing
Sensors Everywhere
Sonoma Redwood Forest
smart buildings
Kindly donated by Stewart Johnston
smart bridges
Credit: MO Dept. of Transportation
Hudson River Valley
Credit: Arthur Sanderson at RPI
Digital Explosion: Data Centric
The digital universe will grow over six-fold, from 281 exabytes in 2007 to 1,773 exabytes in 2011
More than 90% of the information in the digital universe is unstructured, and the absolute number of files is growing faster than the capacity in TBs
Source: IDC survey presented at ISW 2008
Big Data Problem
Converting Analog to Digital
All Data Access Traces in Digital World
How to Gain Information from All Stored Data?
How to Make Better Decisions?
What to Keep and What to Preserve?
Can We Develop Knowledge from All These Data?
Intelligent Storage
Blocks, Files, Objects, Information, Knowledge
Traditional storage device view: raw bits, no associated semantics.
Extended-attributes-augmented view: high-level semantics associated.
Need New Architectures & Systems to Capture
Exploited to store and retrieve data more efficiently with Indexing/Search capability
[ INTELLIGENCE ]
Current Cyber Space
“A domain characterized by the use of electronics and the electromagnetic spectrum to store, modify, and exchange data via networked systems and associated physical infrastructure.”
Inside the ‘Net: A Different Story…
• Closed equipment
– Software bundled with hardware
– Vendor-specific interfaces
• Over specified
– Slow protocol standardization
• Few people can innovate
– Equipment vendors write the code
– Long delays to introduce new features
Do We Need Innovation Inside?
Many boxes (routers, switches, firewalls, …) with different interfaces and not programmable.
Proposed SDN Solution
Control Plane
Data Plane
Standard API to Enable Programmable
Separation of Control Plane and Data Plane
Logically Centralized Controller
Open API
Seamless Mobility
• See host sending traffic at new location
• Modify rules to reroute the traffic
Server Load Balancing
• Pre-install load-balancing policy
• Split traffic based on source IP
Rules: src=0*, dst=1.2.3.4 and src=1*, dst=1.2.3.4
Servers: 10.0.0.1 and 10.0.0.2
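The split by source IP can be illustrated with a minimal Python sketch. It treats `0*` and `1*` as prefixes over the binary form of the source address. The rule table, the function names, and the choice of which prefix maps to which server are assumptions made for illustration; this is not an OpenFlow API.

```python
import ipaddress

# Hypothetical rule table: (source-IP bit prefix, destination address, chosen server).
# Which prefix maps to which server is an assumption for illustration.
RULES = [
    ("0", "1.2.3.4", "10.0.0.1"),
    ("1", "1.2.3.4", "10.0.0.2"),
]


def src_bits(ip: str) -> str:
    """Return the 32-bit binary string of an IPv4 address."""
    return format(int(ipaddress.IPv4Address(ip)), "032b")


def pick_server(src: str, dst: str):
    """Return the server of the first rule matching (src, dst), or None."""
    bits = src_bits(src)
    for prefix, rule_dst, server in RULES:
        if dst == rule_dst and bits.startswith(prefix):
            return server
    return None
```

Because the rules are installed ahead of time, each packet only needs a table lookup; no per-flow decision by the controller is required in this sketch.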
Example SDN Applications
• Seamless mobility and migration
• Server load balancing
• Dynamic access control
• Using multiple wireless access points
• Energy-efficient networking
• Adaptive traffic monitoring
• Denial-of-Service attack detection
• Network virtualization
See http://www.openflow.org/videos/
Network Function Virtualization (NFV)
Slide from: http://docbox.etsi.org/Workshop/2013/201304_FNTWORKSHOP/S07_NFV/BT_REID.pdf
What is SDS?
1. Policy-Driven Storage (IOPS, latency, reliability, fault tolerance, provisioning, QoS)
2. Scale-out Architecture
3. Storage as a Seamless Pool of Resources (Storage Virtualization)
4. Control Integration from Multi-Vendors
5. Heterogeneous Storage Containers
6. Logically Centralized Resource Allocation
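To make points 1, 5, and 6 concrete, here is a small sketch of a logically centralized allocator that places a request on one of several heterogeneous containers according to a policy. The `Container` and `Policy` classes, their fields, and the "least capable container that still satisfies the policy" rule are all hypothetical choices for illustration, not part of any SDS product.

```python
from dataclasses import dataclass


@dataclass
class Container:
    """A storage container in the pool (fields are illustrative assumptions)."""
    name: str
    iops: int
    latency_ms: float
    free_gb: int


@dataclass
class Policy:
    """What a workload asks for (fields are illustrative assumptions)."""
    min_iops: int
    max_latency_ms: float
    size_gb: int


def allocate(policy: Policy, pool: list):
    """Pick a container that satisfies the policy, or return None."""
    candidates = [
        c for c in pool
        if c.iops >= policy.min_iops
        and c.latency_ms <= policy.max_latency_ms
        and c.free_gb >= policy.size_gb
    ]
    if not candidates:
        return None
    # Assumed heuristic: keep fast containers free for demanding workloads.
    chosen = min(candidates, key=lambda c: c.iops)
    chosen.free_gb -= policy.size_gb
    return chosen.name
```

For example, with a pool holding one SSD-backed and one SATA-backed container, a low-IOPS request lands on the SATA container even though the SSD container would also qualify.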
Software Defined Storage
Slide from One Vendor
Web 2.0 Pattern, J2EE/OLTP, Map/Reduce Pattern
Transactional, Analytics, Web
Availability
• Clustering
• Replication
Capacity/Performance
• Storage Class
• De-duplication/Compression/Thin Provisioning
Security & Compliance
• Encryption
• Archival/WORM
Data storage and retrieval services
Plan, Deploy, Optimize
Legacy high-function (external) storage systems
Portable storage software on commodity hardware
Public Cloud, Private Cloud, Hybrid Cloud, Bare Metal Cloud
Platinum, Gold, Silver, Bronze
Authentication/Auditing, Encryption, Mirroring/DR, High Availability, Striping, Clustering, Compression, Tiering/ILM, Backup & Recovery, Deduplication
Security and Availability
Performance and Opt.
Workload Abstraction, Resource Abstraction, Continuous Optimization, Mapping to Resource
Storage Services Layer
SOFTWARE DEFINED STORAGE
RESILIENCY, CAPABILITY, OPTIMIZATION, FABRIC, MANAGEMENT, HETEROGENEITY
• Storage Abstraction • Storage Provisioning • Storage Monitoring • SAN/GPFS/NAS/DAS
• FC/FCoE/iSCSI/Infiniband • Zone management
• Storage replication • Disaster recovery • Consistency groups • Backup
• Storage tiers • Performance aware placement • Continuous optimizations • Migration
SOFTWARE DEFINED COMPUTE
SOFTWARE DEFINED NETWORK
Service Abstractions Putting Things Together
SDN vs. SDS
SDN:
• Consensus on Definition
• OpenFlow Switches as De Facto Devices
• Wide Area Networks
• Benefit Big Network Users
• IP Network Focus
• Support Applications
SDS:
• No Clear Definition Yet
• Heterogeneous Types of Storage Containers
• Data Center Deployment
• Ensure QoS & Efficiency
• Virtual Machine Focus
• Integration with SDN and Compute
Current Research Thrusts
• Research on New Storage Technologies (Flash Memory based SSD, PCM, Shingled Write Disks): Seagate, LSI, SGI and Western Digital (HGST)
• Research on New Storage Hierarchies (multi-level caching/prefetching, data allocation/migration, and tiered storage): HP, NetApp and Dell
• Cloud Storage and Big Data (HP, NetApp, FedCentric and NEC-Labs)
• I/O Workload Characterization and Synthetic Workload Generation (Seagate, Xyratex and NetApp)
New Storage Technologies
Flash Memory based SSD
FTL Design
PCM Prototype
Shingled Write Disk
Design and Layout
Challenges in New Technologies
• Investigating and Understanding Fundamental Properties
• Research of Design Issues
• What are their impacts on applications?
• How to effectively integrate the new technologies into existing memory/storage hierarchies?
Summary of SSD Research Results
• Robust and Reliable Design of SSDs
• Integrating SSDs into Storage Hierarchy
• New FTL Design: A Convertible FTL Design
• Efficient Wear-Leveling Algorithm
• Optimal/Efficient Read/Write Caching
• Hot and Cold Data Classification
• Bloom Filter Design and Key-Value Store Based on Flash Memory
• Using Sampling Technique for Meta-Data Management in FTL
Non-Volatile Memory
• NVM Replaces DRAM as Main Memory
• NVM to Be Used As A Cache
• DRAM+NVM
CPU | Main Memory: NVM | Storage: HDD
CPU | Main Memory: NVM | Storage: SSD
CPU | Main Memory: DRAM, NVM | Storage: SSD
New Memory and Storage Hierarchies
• Data Storage
• Data Migration
• Multi-Level Caching
• Data Prefetching
• Tiered Storage
• “In-place Update”: many small bands
– Protect previously written data by Read-Modify-Write
– Behaves similarly to regular disks
• “Out-of-place Update”: few large bands
– Maintain data in a circular log structure
• Data addition at the head pointer
• Data removal from the tail pointer
– LBA-to-PBA mapping is not fixed
– Transforms random writes into sequential writes
– Compromises sequential read performance
Possible Methods
Indirect Addressing
Higher Space Overhead
Defragmentation (Garbage Collection)
Write Amplification
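The out-of-place update scheme can be sketched as a toy Python model of one large band managed as a circular log. The class name, its fields, and the cleaning policy (free the tail block, re-append it at the head if it is still live) are assumptions for illustration and not a description of any shipping drive. The model shows why the LBA-to-PBA mapping changes on every write and where write amplification comes from.

```python
class CircularLogBand:
    """Toy model of one shingled band managed as a circular log."""

    def __init__(self, capacity: int):
        self.capacity = capacity          # physical blocks in the band
        self.blocks = [None] * capacity   # PBA -> (lba, data) or None
        self.head = 0                     # next PBA to write
        self.tail = 0                     # oldest PBA still in the log
        self.used = 0
        self.lba_to_pba = {}              # not fixed: updated on every write
        self.host_writes = 0
        self.physical_writes = 0

    def _append(self, lba, data):
        self.blocks[self.head] = (lba, data)
        self.lba_to_pba[lba] = self.head
        self.head = (self.head + 1) % self.capacity
        self.used += 1
        self.physical_writes += 1

    def _clean_tail(self):
        """Free the tail block; re-append it at the head if still live."""
        pba = self.tail
        lba, data = self.blocks[pba]
        self.blocks[pba] = None
        self.tail = (self.tail + 1) % self.capacity
        self.used -= 1
        if self.lba_to_pba.get(lba) == pba:
            self._append(lba, data)       # extra write: write amplification

    def write(self, lba, data):
        self.lba_to_pba.pop(lba, None)    # the old copy, if any, becomes stale
        if len(self.lba_to_pba) >= self.capacity:
            raise RuntimeError("band is full of live data")
        while self.used == self.capacity:
            self._clean_tail()
        self._append(lba, data)
        self.host_writes += 1

    def read(self, lba):
        return self.blocks[self.lba_to_pba[lba]][1]
```

Every host write lands at the head regardless of its logical address, so random writes become sequential on the medium; the cost is that logically adjacent blocks may end up physically scattered, and cleaning adds physical writes beyond what the host issued.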
Current Research Focuses on New Storage Technologies
• How to build large-scale storage systems with SSD or SWD?
• Modeling multi-channel multi-chip SSD
• Investigating SSD reliability and performance with a wide set of metrics
• Investigating the impact of non-volatile memory as main memory
• Revisiting FTL design issues for when SSDs make up a large storage system instead of serving as caching devices
Storage Layer Management and Caching
SATA Disks (off, off, on)
SSD
Read Queues (RT)
Read Queues (Prefetch)
Write Queues (Offloading)
Big Memory with PCM
When/Where/How much
Cloud Storage
NAND Flash Package with Integrated ECC and General Purpose Processor
[Figure labels: Host CPU, DDR, PCIe; SSD Controller with Block Management, Data buffer, Host communication, DDR, Wear Leveling, Garbage Collection, …; four NAND Flash Packages, each with NAND Flash Dies, ECC, and Processor]
Manufacturers incorporated hardware in flash package
Accelerating Hadoop on SGI UV2000 (In-Memory System)
Hadoop & MapReduce Are for Data-Intensive Applications
How to Speed Them Up on High-Performance Computers?
Research Focuses of Cloud Storage + Big Data
• More emphasis on the Virtual Machine environment
• Ensure QoS support for VMs in Cloud (VDI as An Application)
• How can data deduplication be applied in cloud + big data (more on primary storage dedupe)?
• Integration of cloud and local storage
• Integration of various file systems with federated file system
Framework of I/O Workload Characterization
Original trace
Workload Parameters
Synthetic trace
Workload characterization
Adjusted Parameters
Parameter adjustment
Workload generation
Replay by workload replayer
Replayed trace
Changes to applications and/or system (either host or storage)
Arrival pattern, File/Data access pattern in the form of parameters
Replay on same/different storage system
Action
Output
Comparison 2
Comparison 1
Comparison 3
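The characterize-then-generate loop in this framework can be sketched in a few lines of Python. The trace format, the three parameters chosen (mean inter-arrival time, read ratio, mean request size), and the exponential inter-arrival model are assumptions for illustration; a real characterization tool would extract far richer arrival and access patterns.

```python
import random
from statistics import mean


def characterize(trace):
    """Reduce a trace to parameters.

    trace: list of (timestamp_seconds, op, size_bytes) sorted by time,
    where op is "R" or "W". This format is an assumption.
    """
    gaps = [b[0] - a[0] for a, b in zip(trace, trace[1:])]
    return {
        "mean_interarrival": mean(gaps),
        "read_ratio": sum(1 for _, op, _ in trace if op == "R") / len(trace),
        "mean_size": mean(size for _, _, size in trace),
    }


def generate(params, n, seed=0):
    """Generate a synthetic trace of n requests from the parameters.

    Assumes exponentially distributed inter-arrival times and a fixed
    request size equal to the mean.
    """
    rng = random.Random(seed)
    t = 0.0
    synthetic = []
    for _ in range(n):
        t += rng.expovariate(1.0 / params["mean_interarrival"])
        op = "R" if rng.random() < params["read_ratio"] else "W"
        synthetic.append((t, op, int(params["mean_size"])))
    return synthetic
```

Adjusting a parameter before calling `generate` (for example, lowering `read_ratio`) corresponds to the parameter adjustment step, and running `characterize` on the original, synthetic, and replayed traces gives the values that the comparisons would examine.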
Recent Accomplishments
• Completed a tool for I/O workload characterization and generation for parallel file systems
• Hfplayer v.2 (replay engine) is now available
• Proposed a new cache replacement scheme for non-volatile memory as main memory and disk as storage device
• A detailed design of integrating cloud storage with local storage
• Proposed a journaling based scheme for SSD reliability
Research Focuses on I/O Workload Characterization and Generation
• Further Integration with block I/O, parallel file system I/O and replay engine
• How to improve the performance of storage systems?
• I/O workload phase detection
• How to apply knowledge in I/O workload to multi-level caching?
Conclusions
• Storage Research Faces Challenges from Applications (Big Data, Long-Term Data Preservation, Cloud Storage, Scalability)
• Also Faces Challenges from New Technologies (Emerging Memory/Storage Hierarchies)
• Integrated Approach Including Compute, Storage and Network Systems Consideration Is A Must (SDS???)