

Transcript of "System Software for Big Data and Post Petascale Computing"

Page 1

System Software for Big Data and Post Petascale Computing

Osamu Tatebe University of Tsukuba

The Japanese Extreme Big Data Workshop February 26, 2014

Page 2

I/O performance requirement for exascale applications

• Computational Science (Climate, CFD, …)
  – Read initial data (100 TB~PB)
  – Write snapshot data (100 TB~PB) periodically (see the bandwidth sketch below)
• Data Intensive Science (Particle Physics, Astrophysics, Life Science, …)
  – Data analysis of 10 PB~EB experiment data
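The write bandwidth these sizes imply depends on how quickly a snapshot must be flushed. A minimal sketch of the unit arithmetic, assuming illustrative write windows (the windows are my assumptions, not figures from the talk):

```python
# Illustrative arithmetic only: required write bandwidth from snapshot size
# and an assumed write window (the windows are assumptions, not slide data).
PB = 1e15  # bytes

def required_bw_tb_per_s(snapshot_bytes, window_sec):
    """Sustained bandwidth (TB/s) needed to flush a snapshot within window_sec."""
    return snapshot_bytes / window_sec / 1e12

print(f"{required_bw_tb_per_s(1 * PB, 3600):.2f} TB/s for a 1 PB snapshot per hour")
print(f"{required_bw_tb_per_s(1 * PB, 600):.2f} TB/s if it must finish in 10 minutes")
```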

Page 3

Scalable performance requirement for Parallel File System

Year  FLOPS  #cores  IO BW     IOPS     Systems
2008  1P     100K    100 GB/s  O(1K)    Jaguar, BG/P
2011  10P    1M      1 TB/s    O(10K)   K, BG/Q
2016  100P   10M     10 TB/s   O(100K)
2020  1E     100M    100 TB/s  O(1M)

IO BW and IOPS are expected to scale out with the number of cores or nodes

Performance target

Page 4

Technology trend

• HDD performance does not increase much
  – 300 MB/s, 5 W in 2020
  – 100 TB/s would require O(2M) W (see the sketch below)
• Flash, storage class memory
  – 1 GB/s, 0.1 W in 2020
  – Cost, limited number of updates
• Interconnects
  – 62 GB/s (InfiniBand 4x HDR)

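A quick back-of-envelope check of the O(2M) W figure above, assuming only the per-device numbers quoted on this slide (HDD: 300 MB/s at 5 W; flash/SCM: 1 GB/s at 0.1 W) and counting device power alone:

```python
# Device count and power needed to sustain 100 TB/s, using the per-device
# figures quoted on this slide (assumed 2020 technology); servers and
# networking are not counted.
TARGET_BW = 100e12  # bytes/s

def devices_and_power(dev_bw_bytes_s, dev_watts):
    """How many devices reach TARGET_BW, and how much power they draw."""
    n = TARGET_BW / dev_bw_bytes_s
    return n, n * dev_watts

hdd_n, hdd_w = devices_and_power(300e6, 5.0)   # ~333K HDDs, ~1.7 MW -> O(2M) W
ssd_n, ssd_w = devices_and_power(1e9, 0.1)     # ~100K flash devices, ~10 kW

print(f"HDD  : {hdd_n:>9,.0f} devices, {hdd_w / 1e6:.2f} MW")
print(f"Flash: {ssd_n:>9,.0f} devices, {ssd_w / 1e3:.2f} kW")
```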

Page 5

Current parallel file system

• Central storage array
• Separate installation of compute nodes and storage
• Network BW between compute nodes and storage needs to be scaled up to scale out the I/O performance

[Diagram: compute nodes (clients) accessing a central storage array through an MDS; the network BW between clients and storage is the limitation]

Page 6

Remember memory architecture

[Diagram: shared memory (multiple CPUs attached to one memory) vs. distributed memory (each CPU with its own memory)]

Page 7

Scaled-out parallel file system

• Distributed storage in compute nodes
• I/O performance would scale out by accessing near storage, unless metadata performance is the bottleneck
  – Access to near storage mitigates the network BW requirement (a toy model follows the diagram below)
  – The performance may be non-uniform

[Diagram: compute nodes (clients) with storage distributed inside them, managed by an MDS cluster]
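A toy model (my illustration, not from the talk) of why the node-local layout scales: a central array's aggregate bandwidth is capped by the network attached to it, while node-local storage grows with the node count as long as most reads stay local. The per-node and network numbers below are illustrative assumptions.

```python
# Toy model: aggregate read bandwidth of a central storage array vs.
# node-local storage. All numbers are illustrative assumptions.
def central_array(n_nodes, array_network_bw_gbs):
    """All clients share the network into the central storage."""
    return array_network_bw_gbs

def node_local(n_nodes, per_node_bw_gbs, local_fraction):
    """Count only reads served from each node's local storage."""
    return n_nodes * per_node_bw_gbs * local_fraction

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} nodes: central {central_array(n, 1_000):>7,} GB/s, "
          f"node-local {node_local(n, 19.2, 0.9):>12,.0f} GB/s")
```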

Page 8

Example of Scale-out Storage Architecture

• A snapshot three years from now (~2017)
• Non-uniform but scale-out storage
• R&D of system software stacks is required to achieve maximum I/O performance for data-intensive science

[Diagram: per-node configuration and aggregates]
• Node: CPU (2 sockets × 2.0 GHz × 16 cores × 32 FPU), memory, chipset; 16 × 1 TB local storage on 12 Gbps SAS → 19.2 GB/s, 16 TB per node
• × 500 nodes → 9.6 TB/s, 8 PB per group; × 10 groups → 96 TB/s, 80 PB total (recomputed in the sketch below)
• Interconnect: InfiniBand HDR, 62 GB/s
• Metadata servers
• 5,000 IO nodes, 10 MDSs
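The sketch below only reproduces the diagram's aggregate arithmetic, assuming the quoted per-node design point (16 × 1 TB drives delivering 19.2 GB/s per node):

```python
# Recompute the diagram's aggregates from the per-node design point.
per_node_bw_gbs = 19.2   # 16 drives on 12 Gbps SAS
per_node_cap_tb = 16.0   # 16 x 1 TB local storage
nodes_per_group = 500
groups = 10

group_bw_tbs  = per_node_bw_gbs * nodes_per_group / 1_000   # -> 9.6 TB/s
group_cap_pb  = per_node_cap_tb * nodes_per_group / 1_000   # -> 8 PB
system_bw_tbs = group_bw_tbs * groups                       # -> 96 TB/s
system_cap_pb = group_cap_pb * groups                       # -> 80 PB

print(f"per group: {group_bw_tbs:.1f} TB/s, {group_cap_pb:.0f} PB")
print(f"system   : {system_bw_tbs:.0f} TB/s, {system_cap_pb:.0f} PB "
      f"over {nodes_per_group * groups:,} IO nodes")
```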

Page 9

Challenge

• File system (object store)
  – Central storage cluster to distributed storage cluster
  – Scaled-out parallel file system up to O(1M) clients
    • Scaled-out MDS performance
• Compute node OS
  – Reduction of OS noise
  – Cooperative cache
• Runtime system
  – Optimization for non-uniform storage access "NUSA"
• Global storage for data sharing of exabyte-scale data among machines

Page 10

Scaled-out parallel file system

• Federate local storage in compute nodes
  – Special purpose
    • Google file system [SOSP'03]
    • Hadoop file system (HDFS)
  – POSIX(-like)
    • Gfarm file system [CCGrid'02, NGC'10]

Page 11

Scaled-out MDS

• GIGA+ [Swapnil Patil et al., FAST'11]
  – Incremental directory partitioning
  – Independent locking in each partition
• skyFS [Jing Xing et al., SC'09]
  – Performance improvement during directory partitioning in GIGA+
• Lustre
  – MT scalability in 2.x
  – Proposed clustered MDS
• PPMDS [our JST CREST R&D]
  – Shared-nothing KV stores (a routing sketch follows the table below)
  – Nonblocking software transactional memory (no locks)

System      IOPS (file creates/sec)  #MDS (#cores)
GIGA+       98K                      32 (256)
skyFS       100K                     32 (512)
Lustre 2.4  80K                      1 (16)
PPMDS       270K                     15 (240)
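A minimal sketch of the shared-nothing idea (my illustration, not the PPMDS implementation): route each metadata operation to a key-value server chosen by hashing the parent directory and entry name, so file creates in one huge directory spread over all metadata servers instead of hitting a single one.

```python
# Minimal illustration of shared-nothing metadata partitioning: hash the
# (parent directory, name) pair to pick the responsible metadata server.
import hashlib

NUM_MDS = 15  # the PPMDS row in the table above used 15 metadata servers

def mds_for(parent_dir: str, name: str) -> int:
    """Metadata server responsible for the directory entry (parent_dir, name)."""
    key = f"{parent_dir}/{name}".encode()
    return int.from_bytes(hashlib.sha1(key).digest()[:8], "big") % NUM_MDS

# File creates in a single huge directory fan out over all servers.
hits = {mds_for("/data/run42", f"event-{i:06d}.dat") for i in range(10_000)}
print(f"10,000 creates touched {len(hits)} of {NUM_MDS} metadata servers")
```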

Page 12

Development of Pwrake: Data-intensive Workflow System

IO-aware task scheduling:
• Locality-aware scheduling
  – Selection of compute nodes by input files (see the sketch at the end of this page)
• Buffer cache-aware scheduling
  – Modified LIFO to ease the Trailing Task Problem

[Chart: workflow elapsed time (sec) with I/O file size 900 GB on 10 nodes, comparing naïve, locality-aware, and locality- plus cache-aware scheduling; locality-aware scheduling gives a 42% speedup, and cache-aware scheduling a further 23% speedup]

[Diagram: Pwrake dispatches processes over SSH to compute nodes, which read and write file1, file2, and file3 on the Gfarm file system]

Pwrake = Workflow System based on Rake (Ruby make)
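A sketch of the locality-aware selection idea in Python (Pwrake itself is Ruby code built on Rake, so this is not its actual scheduler). The replica table is a hypothetical stand-in for querying Gfarm for file locations; the policy simply prefers the idle node that already holds the most of the task's input files.

```python
# Sketch of locality-aware node selection: prefer the idle node that already
# stores the largest number of the task's input files.
from collections import Counter

# Hypothetical replica table standing in for a Gfarm file-location query.
REPLICAS = {
    "file1": ["node01", "node02"],
    "file2": ["node02"],
    "file3": ["node03"],
}

def replica_locations(path: str) -> list[str]:
    """Hostnames holding replicas of `path` (hypothetical lookup)."""
    return REPLICAS.get(path, [])

def choose_node(input_files, idle_nodes):
    """Pick the idle node with the most local input files (fallback: first idle node)."""
    votes = Counter()
    for path in input_files:
        for host in replica_locations(path):
            if host in idle_nodes:
                votes[host] += 1
    return votes.most_common(1)[0][0] if votes else idle_nodes[0]

print(choose_node(["file1", "file2"], ["node01", "node02", "node03"]))  # -> node02
```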

Page 13

Maximize Locality using Multi-Constraint Graph Partitioning

[Tanaka, CCGrid 2012]
• Task scheduling based on MCGP can minimize data movement (a formulation sketch follows below)
• Applied to the Pwrake workflow system and evaluated on the Montage workflow
  – Data movement reduced by 86%; execution time improved by 31%

[Figure: simple graph partitioning vs. multi-constraint graph partitioning; with simple partitioning, parallel tasks are unbalanced among nodes]
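A small sketch of the multi-constraint formulation (my illustration, not the CCGrid 2012 implementation): each task becomes a graph vertex whose weight is a one-hot vector over workflow stages, so a partitioner run in multi-constraint mode (e.g. METIS) must balance every stage separately across nodes while minimizing the edge cut, i.e. files moved between nodes. The stage names are Montage-style examples.

```python
# Build per-vertex weight vectors for multi-constraint graph partitioning:
# one component per workflow stage, so each stage's parallel tasks are
# balanced across partitions instead of piling up on one node.
from collections import namedtuple

Task = namedtuple("Task", ["name", "stage"])

def one_hot_weights(tasks, stages):
    """Weight vector per task: 1 in its own stage's component, 0 elsewhere."""
    index = {stage: i for i, stage in enumerate(stages)}
    return [[1 if i == index[t.stage] else 0 for i in range(len(stages))]
            for t in tasks]

stages = ["mProjExec", "mDiffExec", "mAdd"]        # Montage-style stage names
tasks = ([Task(f"proj-{i}", "mProjExec") for i in range(4)]
         + [Task(f"diff-{i}", "mDiffExec") for i in range(4)])
print(one_hot_weights(tasks, stages))
```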

Page 14

HPCI Shared Storage

• HPCI: High Performance Computing Infrastructure
  – "K", Hokkaido, Tohoku, Tsukuba, Tokyo, Titech, Nagoya, Kyoto, Osaka, Kyushu, RIKEN, JAMSTEC, AIST
• A 20 PB Gfarm distributed file system consisting of East and West sites
• Grid Security Infrastructure (GSI) for user IDs
• Parallel file replication among sites
• Parallel file staging to/from each center

[Diagram: West site (AICS) and East site (U Tokyo), together holding 11.5 PB (60 servers) plus 10 PB (40 servers), each site with MDSs, connected by a 10 (~40) Gbps link. Picture courtesy of Hiroshi Harada (U Tokyo)]

Page 15

Storage structure of HPCI Shared Storage

Objective, file system, and how to use, for each storage layer:

• Local file system
  – Objective: temporal space (I/O performance; no backup)
  – File system: Lustre, Panasas, GPFS, …
  – How to use: mv/cp, file staging
• Global file system
  – Objective: persistent storage (capacity and reliability; backup copy will be on tape or disk)
  – How to use: mv/cp, file staging
• Wide-area distributed file system (HPCI Shared Storage)
  – Objective: data sharing (capacity and reliability; secured communication; fair share and easy to use; no backup, but files can be replicated)
  – File system: Gfarm file system
  – How to use: Web I/F, remote clients

Page 16

Initial Performance Result

File copy performance of 300 × 1 GB files to the HPCI Shared Storage (I/O bandwidth, MB/s):

Hokkaido  898
Kyoto     847
Tokyo     1,107
AICS      1,073

Page 17

Related System

• XSEDE-Wide File System (GPFS)
  – Planned, but not in operation yet
• DEISA Global File System
  – Multicluster GPFS
    • RZG, LRZ, BSC, JSC, EPSS, HLRS, …
    • Site name included in the path name: no location transparency; files cannot be replicated across sites
  – PRACE does not provide a global file system
    • Limitation on which operating systems can mount it
    • PRACE does not assume the use of multiple sites

Page 18

Summary

• App IO requirements
  – Computational Science
    • Scaled-out IO performance up to O(1M) nodes (100 TB to 1 PB per hour)
  – Data Intensive Science
    • Data processing for 10 PB to 1 EB of data (>100 TB/sec)
• File system, object store, OS, and runtime R&D for scale-out storage architecture
  – Central storage cluster to distributed storage cluster
    • Network-wide RAID
    • Scaled-out MDS
  – Runtime system for non-uniform storage access "NUSA"
    • Locality-aware process scheduling
• Global file system