Storage Research for Solving Big Data Problem

David Hung-Chang Du
Qwest Chair Professor
Computer Science and Engineering
University of Minnesota
[email protected]

University of Minnesota Digital Technology Center Intelligent Storage Consortium
CRIS: NSF I/UCRC Center on Intelligent Storage
More information on http://cris.cs.umn.edu

5/28/2014


Outline of Talk

• Two Major Changes in Computing & Communication Environment

• Big Data Problem

• Solving Big Data Problem

– Software Defined Network vs. Software Defined Storage

• Storage Research Projects at NSF I/UCRC Center on Intelligent Storage

• Conclusions

Instrument and Connect the World!

• Bridge Monitoring
• Building Environment Controls
• Earthquake Monitoring
• Elder Care
• Factories
• Fire Response
• First Responders
• Forest Management
• Soil Monitoring
• Supply Chain
• Wind Response
• … and more

Credit: Jeannette M. Wing, OOPSLA

Sensors Everywhere

Sonoma Redwood Forest

smart buildings

Kindly donated by Stewart Johnston

smart bridges

Credit: MO Dept. of Transportation

Hudson River Valley

Credit: Arthur Sanderson at RPI

Digital Explosion: Data Centric

The digital universe will grow over six-fold, from 281 exabytes in 2007 to 1,773 exabytes in 2011

> 90% of the information in the digital universe is unstructured and absolute # of files growing faster than the TBs

(from IDC Survey presented at ISW 2008)


Big Data Problem

Converting Analog to Digital

All Data Access Traces in Digital World

How to Gain Information from All Stored Data?

How to Make Better Decisions?

What to Keep and What to Preserve?

Can We Develop Knowledge from All These Data?

Intelligent Storage

Blocks, Files, Objects, Information, Knowledge

• Traditional storage device view: raw bits, no associated semantics.
• Extended attributes augmented view: high level semantics associated.
• Need New Architectures & Systems to Capture
• Exploited to store and retrieve data more efficiently with Indexing/Search capability [ INTELLIGENCE ]

Current Cyber Space

“A domain characterized by the use of electronics and the electromagnetic spectrum to store, modify, and exchange data via networked systems and associated physical infrastructure.”


Inside the ‘Net: A Different Story…

• Closed equipment

– Software bundled with hardware

– Vendor-specific interfaces

• Over specified

– Slow protocol standardization

• Few people can innovate

– Equipment vendors write the code

– Long delays to introduce new features

Do We Need Innovation Inside?

Many boxes (routers, switches, firewalls, …) with different interfaces and not programmable.

Proposed SDN Solution

• Separation of Control Plane and Data Plane
• Logically Centralized Controller
• Open API
• Standard API to Enable Programmability

Diagram layers: Control Plane, Data Plane

Seamless Mobility

• See host sending traffic at new location
• Modify rules to reroute the traffic

Server Load Balancing

• Pre-install load-balancing policy

• Split traffic based on source IP

src=0*, dst=1.2.3.4 → 10.0.0.1

src=1*, dst=1.2.3.4 → 10.0.0.2
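The source-IP split on this slide can be illustrated with a small rule table. This is a minimal sketch, not OpenFlow controller code: it assumes `src=0*` and `src=1*` are wildcards on the leading bit of the source address, and that the first rule forwards to 10.0.0.1 and the second to 10.0.0.2. The names `RULES`, `VIRTUAL_IP`, and `pick_replica` are hypothetical.

```python
import ipaddress

# Wildcard rules from the slide: (source-IP bit prefix, replica address).
# The rule-to-replica pairing is an assumption made for illustration.
RULES = [
    ("0", "10.0.0.1"),  # src=0*, dst=1.2.3.4
    ("1", "10.0.0.2"),  # src=1*, dst=1.2.3.4
]
VIRTUAL_IP = "1.2.3.4"


def pick_replica(src_ip, dst_ip):
    """Return the replica for a packet, or None if no rule matches."""
    if dst_ip != VIRTUAL_IP:
        return None
    # Render the source address as 32 bits so prefixes can be compared.
    src_bits = format(int(ipaddress.IPv4Address(src_ip)), "032b")
    for prefix, replica in RULES:
        if src_bits.startswith(prefix):
            return replica
    return None
```

Because the rules are pre-installed, no per-flow decision is needed at the controller in this sketch: every packet to the virtual address is classified by its source prefix alone.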


Example SDN Applications

• Seamless mobility and migration

• Server load balancing

• Dynamic access control

• Using multiple wireless access points

• Energy-efficient networking

• Adaptive traffic monitoring

• Denial-of-Service attack detection

• Network virtualization

See http://www.openflow.org/videos/

Network Function Virtualization (NFV)

Slide from: http://docbox.etsi.org/Workshop/2013/201304_FNTWORKSHOP/S07_NFV/BT_REID.pdf

Use Case: vWOC (virtualized WAN Optimization Controller)

What is SDS?

1. Policy-Driven Storage (IOPS, latency, reliability, fault tolerance, provisioning, QoS)
2. Scale-out Architecture
3. Storage as a Seamless Pool of Resources (Storage Virtualization)
4. Control Integration from Multi-Vendors
5. Heterogeneous Storage Containers
6. Logically Centralized Resource Allocation

Software Defined Storage (Slide from One Vendor)

• Web 2.0 Pattern, J2EE/OLTP, Map/Reduce Pattern
• Transactional, Analytics, Web
• Data storage and retrieval services
– Availability: Clustering, Replication
– Capacity/Performance: Storage Class, De-duplication/Compression/Thin Provisioning
– Security & Compliance: Encryption, Archival/WORM
• Plan, Deploy, Optimize
• Legacy high-function (external) storage systems
• Portable storage software on commodity hardware
• Public Cloud, Private Cloud, Hybrid Cloud, Bare Metal Cloud

Service Abstractions Putting Things Together

• Platinum, Gold, Silver, Bronze
• Workload Abstraction, Resource Abstraction, Mapping to Resource, Continuous Optimization
• Storage Services Layer
– Security and Availability; Performance and Opt.
– Authentication/Auditing, Encryption, Mirroring/DR, High Availability, Striping, Clustering, Compression, Tiering/ILM, Backup & Recovery, Deduplication
• SOFTWARE DEFINED STORAGE (alongside SOFTWARE DEFINED COMPUTE and SOFTWARE DEFINED NETWORK)
– CAPABILITY, RESILIENCY, OPTIMIZATION, FABRIC, MANAGEMENT, HETEROGENEITY
– Storage Abstraction, Storage Provisioning, Storage Monitoring, SAN/GPFS/NAS/DAS
– FC/FCoE/iSCSI/Infiniband, Zone management
– Storage replication, Disaster recovery, Consistency groups, Backup
– Storage tiers, Performance aware placement, Continuous optimizations, Migration

SDN vs. SDS

SDN:
• Consensus on Definition
• OpenFlow Switches as De Facto Devices
• Wide Area Networks
• Benefit Big Network Users
• IP Network Focus
• Support Applications

SDS:
• No Clear Definition Yet
• Heterogeneous Types of Storage Containers
• Data Center Deployment
• Ensure QoS & Efficiency
• Virtual Machine Focus
• Integration with SDN and Compute

CRIS Research Summary

http://cris.cs.umn.edu

Current Sponsor Companies

Two Memberships

One Membership

Current Research Thrusts

• Research on New Storage Technologies (Flash Memory based SSD, PCM, Shingled Write Disks): Seagate, LSI, SGI and Western Digital (HGST)
• Research on New Storage Hierarchies (multi-level caching/prefetching, data allocation/migration, and tiered storage): HP, NetApp and Dell
• Cloud Storage and Big Data: HP, NetApp, FedCentric and NEC-Labs
• I/O Workload Characterization and Synthetic Workload Generation: Seagate, Xyratex and NetApp

New Storage Technologies

• Flash Memory based SSD: FTL Design
• PCM Prototype
• Shingled Write Disk: Design and Layout

Challenges in New Technologies

• Investigating and Understanding Fundamental Properties

• Research of Design Issues

• What are their impacts on applications?

• How to effectively integrate the new technologies into existing memory/storage hierarchies?


Summary of SSD Research Results

• Robust and Reliable Design of SSDs

• Integrating SSDs into Storage Hierarchy

• New FTL Design: A Convertible FTL Design

• Efficient Wear-Leveling Algorithm

• Optimal/Efficient Read/Write Caching

• Hot and Cold Data Classification

• Bloom Filter Design and Key-Value Store Based on Flash Memory

• Using Sampling Technique for Meta-Data Management in FTL
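For readers unfamiliar with the Bloom filter named in the list above, the following is a minimal sketch of the general data structure, not the CRIS design. It assumes string keys and derives the bit positions by double hashing over a SHA-256 digest; the class name `BloomFilter` and its parameters are hypothetical. In a flash-based key-value store, a filter like this can answer "definitely not stored" from memory and so avoid some flash reads.

```python
import hashlib


class BloomFilter:
    """Set-membership sketch: no false negatives, some false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Double hashing: position_i = (h1 + i * h2) mod num_bits.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # keep h2 odd
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(key)
        )
```

A lookup that returns False can skip the flash read entirely; a lookup that returns True still has to check the store, because the filter can report keys that were never added.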


PCM Prototyping Effort

Non-Volatile Memory

• NVM Replaces DRAM as Main Memory
• NVM to Be Used As A Cache
• DRAM+NVM

Configurations shown:
– CPU; Main Memory: NVM; Storage: HDD
– CPU; Main Memory: NVM; Storage: SSD
– CPU; Main Memory: DRAM, NVM; Storage: SSD

New Memory and Storage Hierarchies

• Data Storage
• Data Migration
• Multi-Level Caching
• Data Prefetching
• Tiered Storage

Possible Methods

• “In-place Update”: many small bands
– Protect previously-written data by Read-Modify-Write
– Behaves similarly to regular disks
• “Out-of-place Update”: few large bands
– Maintain data in circular log structure
    • Data addition at head pointer
    • Data removal from tail pointer
– LBA-to-PBA mapping is not fixed
– Transforms random writes into sequential writes
– Compromises sequential read performance
• Indirect addressing: higher space overhead
• Defragmentation (garbage collection): write amplification
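The out-of-place update scheme can be sketched as a circular log with a mutable LBA-to-PBA map. This is a simplified illustration under stated assumptions, not the actual shingled-disk layout: one band, one block per slot, no guard region between head and tail, and cleaning done one block at a time when the band is full. The class `CircularLogBand` and its counters are hypothetical names.

```python
class CircularLogBand:
    """One band managed as a circular log with out-of-place updates."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity  # PBA -> (lba, data)
        self.head = 0                   # PBA of the next append
        self.tail = 0                   # PBA of the oldest entry
        self.used = 0
        self.lba_to_pba = {}            # mapping is not fixed
        self.host_writes = 0
        self.media_writes = 0

    def _append(self, lba, data):
        self.slots[self.head] = (lba, data)
        self.lba_to_pba[lba] = self.head
        self.head = (self.head + 1) % self.capacity
        self.used += 1
        self.media_writes += 1

    def _clean_one(self):
        # Remove the tail block; if it is still live, rewrite it at the head.
        pba = self.tail
        lba, data = self.slots[pba]
        self.slots[pba] = None
        self.tail = (self.tail + 1) % self.capacity
        self.used -= 1
        if self.lba_to_pba.get(lba) == pba:
            self._append(lba, data)     # extra write: write amplification

    def write(self, lba, data):
        self.host_writes += 1
        self.lba_to_pba.pop(lba, None)  # any older copy becomes stale
        attempts = 0
        while self.used == self.capacity:
            if attempts == self.capacity:
                raise RuntimeError("band holds only live data")
            self._clean_one()
            attempts += 1
        self._append(lba, data)

    def read(self, lba):
        return self.slots[self.lba_to_pba[lba]][1]
```

Every host write lands sequentially at the head, whatever its LBA. The cost appears in `media_writes`, which exceeds `host_writes` whenever cleaning has to move live blocks, and in reads, since logically adjacent LBAs may end up physically scattered.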

Current Research Focuses on New Storage Technologies

• How to build large scale storage systems with SSD or SWD?
• Modeling multi-channel multi-chip SSD
• Investigating SSD reliability and performance with a wide set of metrics
• Investigating the impact of non-volatile memory as main memory
• Revisit FTL design issues for SSDs when they make up a large storage system instead of serving as caching devices

Storage Layer Management and Caching

• SATA Disks: off, off, on
• SSD
• Read Queues (RT)
• Read Queues (Prefetch)
• Write Queues (Offloading)
• Big Memory with PCM
• When/where/how much
• Cloud Storage

Local Storage + Cloud Storage

Why? How? Where?

NAND Flash Package with Integrated ECC and General Purpose Processor

• Host CPU, DDR, PCIe
• SSD Controller, DDR: Block Management, Data buffer, Host communication, Wear Leveling, Garbage Collection, ……
• NAND Flash Packages (several shown), each containing NAND Flash Dies, ECC, and Processor
• Manufacturers incorporated hardware in flash package

Accelerating Hadoop on SGI UV2000 (In-Memory System)

• Hadoop & MapReduce Are for Data Intensive Applications
• How to Speed Up in High Performance Based Computers

Research Focuses of Cloud Storage + Big Data

• Emphasize more on Virtual Machine environment
• Ensure QoS support for VMs in Cloud (VDI as An Application)
• How data deduplication can be applied in cloud + big data (more on primary storage dedupe)?
• Integration of cloud and local storage
• Integration of various file systems with federated file system

Framework of I/O Workload Characterization

Legend: Action, Output

• Outputs: Original trace, Workload Parameters, Adjusted Parameters, Synthetic trace, Replayed trace
• Actions: Workload characterization, Parameter adjustment, Workload generation, Replay by workload replayer
• Arrival pattern, File/Data access pattern in the form of parameters
• Changes to applications and/or system (either host or storage)
• Replay on same/different storage system
• Comparison 1, Comparison 2, Comparison 3
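The characterize-then-generate steps of this framework can be sketched in a few lines. This is a minimal sketch and not the CRIS tool: the trace format `(timestamp_s, op, size_bytes)`, the three parameters, and the exponential inter-arrival model are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import random
from statistics import mean


def characterize(trace):
    """Reduce a trace of (timestamp_s, op, size_bytes) to a few parameters."""
    gaps = [later[0] - earlier[0] for earlier, later in zip(trace, trace[1:])]
    return {
        "mean_interarrival_s": mean(gaps),
        "read_fraction": sum(1 for _, op, _ in trace if op == "R") / len(trace),
        "mean_size_bytes": mean(size for _, _, size in trace),
    }


def generate(params, count, seed=0):
    """Produce a synthetic trace that follows the given parameters."""
    rng = random.Random(seed)
    now = 0.0
    synthetic = []
    for _ in range(count):
        # Assumed arrival model: exponentially distributed gaps.
        now += rng.expovariate(1.0 / params["mean_interarrival_s"])
        op = "R" if rng.random() < params["read_fraction"] else "W"
        synthetic.append((now, op, int(params["mean_size_bytes"])))
    return synthetic
```

Running `characterize` again on the output of `generate` gives parameters that can be compared with the originals, which is one way to read the comparison steps on the slide. Adjusting a parameter before generation models a change to the application or system.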

Recent Accomplishments

• Completed a tool for I/O workload characterization and generation for parallel file systems
• Hfplayer v.2 (replay engine) is now available
• Proposed a new cache replacement scheme for non-volatile memory as main memory and disk as storage device
• A detailed design of integrating cloud storage with local storage
• Proposed a journaling based scheme for SSD reliability

Research Focuses on I/O Workload Characterization and Generation

• Further Integration with block I/O, parallel file system I/O and replay engine
• How to improve the performance of storage systems?
• I/O workload phase detection
• How to apply knowledge in I/O workload to multi-level caching?

Conclusions

• Storage Research Faces Challenges from Applications (Big Data, Long-Term Data Preservation, Cloud Storage, Scalability)

• It Also Faces Challenges from New Technologies (Emerging Memory/Storage Hierarchies)

• An Integrated Approach That Considers Compute, Storage and Network Systems Is a Must (SDS???)


Thank You!

Questions?