© Copyright 2013-2016 Levyx Inc. Proprietary and Confidential
Linux on Power Anwendertag (User Day) 2016
IBM – Böblingen
NoSQL database query software optimized for IBM Power
CAPI Flash-aware solutions
The Deluge of Data Is Here and Accelerating
Levyx: An Optimization Bridge Between SW and HW
Diagram: between software and hardware (NVMs, flash SSDs, multi-core processors), Levyx provides hardware-agnostic storage middleware designed to optimize software data transactions with the latest hardware.
IBM software applications that can benefit from Levyx's technology: Netezza analytics, Big Insight, Dr. Watson, Spark, BigSQL, Memcached, etc.
Secret Sauce: Helium™ Data Access Engine
• Patent-pending multi-core optimization tool
• Ultra-low latency indexing engine for billions of objects
• NVM/flash replaces DRAM
• Enables very dense nodes
• World's fastest key value store
Helium: Flash/Multi-Core Optimized
Traditional stack: application-analytics platform → database → database storage engine → OS file system → OS volume manager → OS device driver → disk controller → disk drive
Levyx stack: application-analytics platform → database → Levyx Helium → OS device driver → flash controller F/W → flash chips
Leverage multi-core/multi-channel parallelism to boost performance and reduce latency; reduce layers of abstraction and overhead.
Helium Key Attributes
• Compact RAM-based index: tens of billions of keys, PTBs of data
• Flash-optimized: tight 99th and 99.99th percentile latency
• Lock-free architecture
• Structured:
– Full SQL command set: sort, join, group-by, filter, aggregate, projections, etc.
• Unstructured:
– Get, put, delete, point/range query, point update
• ACID compliance/transaction groups
• In-line dictionary compression
• Snapshot
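The unstructured operations listed above (get, put, delete, point/range query, point update) can be illustrated with a small in-memory sketch. This is not Helium's API: the class `SortedKVStore` and its method names are hypothetical, and a sorted Python list merely stands in for a compact RAM-resident index so that range queries reduce to binary search.

```python
import bisect
from typing import Callable, Dict, List, Optional, Tuple


class SortedKVStore:
    """Illustrative in-memory key/value store (hypothetical, not Helium's API).

    A sorted list of keys acts as the RAM-resident index; a dict holds the values.
    """

    def __init__(self) -> None:
        self._keys: List[bytes] = []            # sorted index of live keys
        self._values: Dict[bytes, bytes] = {}   # key -> value

    def put(self, key: bytes, value: bytes) -> None:
        if key not in self._values:             # new key: add it to the index
            bisect.insort(self._keys, key)
        self._values[key] = value

    def get(self, key: bytes) -> Optional[bytes]:
        return self._values.get(key)

    def delete(self, key: bytes) -> bool:
        if key not in self._values:
            return False
        del self._values[key]
        del self._keys[bisect.bisect_left(self._keys, key)]
        return True

    def update(self, key: bytes, fn: Callable[[bytes], bytes]) -> bool:
        # Point update: replace an existing value with fn(old_value).
        if key not in self._values:
            return False
        self._values[key] = fn(self._values[key])
        return True

    def range_query(self, lo: bytes, hi: bytes) -> List[Tuple[bytes, bytes]]:
        # Range query over the half-open interval [lo, hi) via binary search.
        start = bisect.bisect_left(self._keys, lo)
        end = bisect.bisect_left(self._keys, hi)
        return [(k, self._values[k]) for k in self._keys[start:end]]
```

For example, after `store.put(b"user:1", b"...")` and similar calls, `store.range_query(b"user:", b"user;")` returns every `user:*` entry in key order, because `;` is the byte immediately after `:`.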
Helium: Programming Language/Platform/Wrapper Support
• Portable implementation with architecture- and OS-specific dependencies fully isolated
– Available on Unix/Linux/Windows/Mac platforms
• Distributed in the form of a library
– Fully documented key/value API
• Bundled as a server with client API support in popular languages
– C, C++, Java, Node.js, REST, etc.
• Wrappers for popular KVS
– RocksDB, Memcached
• Platform for integration with other technologies
– Support for structured data (to improve Spark's shuffle performance)
– Columnar database integration with SparkSQL
What We Have Innovated
“Traditional” data path (registers, caches, main memory → I/O subsystem → flash):
• Flash/NVM treated as second-class citizens
• File systems and OS kernels not designed to fully utilize bandwidth
• Block-oriented, unstructured access
Levyx data path (registers, caches, main memory/flash/NVM):
• A single, persistent, high-capacity, high-bandwidth, low-latency memory layer
• Scalable with the number of cores in a system
• Object-oriented, highly structured data access
Our innovation: a simpler, more scalable I/O stack
CAPI-Attached Flash Optimization
• CAPI changes the game: direct flash access feels more like memory and less like storage
• Attach IBM FlashSystem to POWER8 via CAPI
• Read/write commands issued via APIs from applications eliminate 97% of code path length
• Saves 10+ cores per 1M IOPS
Diagram, conventional I/O path: application → read/write syscall → file system → LVM → disk and adapter device drivers (strategy(), iodone()): pin buffers, translate, map DMA, start I/O; then interrupt, unmap, unpin, iodone scheduling.
Diagram, CAPI path: application → user library (POSIX async-I/O-style API: aio_read(), aio_write()) → shared memory work queue.
Result: 20K instructions reduced to <2,000.
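The CAPI path replaces the syscall/driver chain with a user library that drops requests onto a shared work queue and later collects completions. The sketch below models only that submit/complete pattern; it is not the CAPI Flash user library. `IORequest` and `WorkQueueDevice` are hypothetical names, a `queue.Queue` stands in for the shared memory work queue, and a worker thread stands in for the accelerator. The real POSIX `aio_read()`/`aio_write()` are C functions declared in `<aio.h>`.

```python
import queue
import threading
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class IORequest:
    """One asynchronous request (hypothetical; loosely modeled on a POSIX aiocb)."""
    op: str                       # "read" or "write"
    offset: int
    length: int = 0               # bytes to read; writes use len(data)
    data: bytes = b""
    result: Optional[bytes] = None
    done: threading.Event = field(default_factory=threading.Event)


class WorkQueueDevice:
    """Simulated flash device fed through a shared submission queue.

    submit() returns immediately, like aio_read()/aio_write(); a worker thread
    (standing in for the accelerator) services requests in FIFO order and
    signals completion on the request's Event.
    """

    def __init__(self, size: int) -> None:
        self._media = bytearray(size)                      # stand-in for flash
        self._sq: "queue.Queue[Optional[IORequest]]" = queue.Queue()
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def submit(self, req: IORequest) -> IORequest:
        self._sq.put(req)                                  # enqueue and return without waiting
        return req

    def close(self) -> None:
        self._sq.put(None)                                 # sentinel: stop the worker
        self._worker.join()

    def _serve(self) -> None:
        while True:
            req = self._sq.get()
            if req is None:
                return
            end = req.offset + (len(req.data) if req.op == "write" else req.length)
            if req.op == "write":
                self._media[req.offset:end] = req.data
                req.result = b""
            else:
                req.result = bytes(self._media[req.offset:end])
            req.done.set()                                 # completion ("iodone")
```

Because a single worker drains the queue in order, a read submitted after a write to the same offset observes the written bytes; the application only blocks when it chooses to wait on `req.done`.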
Performance of Spark on POWER: 7-node S812LC (10-core) vs. 7-node E5-2690 v3 (12-core)
Charts (machine learning, SQL, graph workloads): 1.7X system-to-system advantage; 2X core-to-core advantage; 1.5X price-performance advantage
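Taking the core counts exactly as labeled on the slide, the 2X core-to-core figure follows from the 1.7X system figure: with the same number of nodes on each side, the per-core advantage is the system advantage scaled by the core ratio. The helper name is illustrative.

```python
def per_core_advantage(system_speedup: float, cores_a: int, cores_b: int) -> float:
    """Scale a system-level speedup of A over B to a per-core speedup.

    Assumes equal node counts on both sides (7 nodes each on the slide), so the
    per-core figure is the system figure times cores_b / cores_a.
    """
    return system_speedup * cores_b / cores_a


# POWER S812LC nodes: 10 cores; E5-2690 v3 nodes: 12 cores (as labeled on the slide)
print(f"{per_core_advantage(1.7, cores_a=10, cores_b=12):.2f}X")  # 2.04X
```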
CAPI Flash Configurations
• External flash configuration: Power S822L / S812L + FlashSystem 900; up to 56TB of extended memory with one POWER8 server + CAPI-attached flash
• Integrated flash configuration (NEW): Power S822L / S812L / S822LC; up to 8TB of super-fast storage tier on one POWER8 server
Charts: IOPS per hardware thread and latency (microseconds), Conventional vs. CAPI-I vs. CAPI-E.
CAPI Flash Solution Use Cases
Memory Expansion
• Application is constrained by single-system memory capacity; typical growth is through additional compute nodes.
• CAPI Flash APIs offer highly efficient flash access and increased total capacity at better $/throughput.
Data Cache
• Application uses in-memory caches for data storage and is typically constrained by the ratio of memory to underlying storage.
• CAPI Flash APIs offer access to much larger ephemeral or persistent data in flash, freeing up RAM.
Fast Storage
• Application is constrained by the I/O overhead and throughput of the existing storage infrastructure.
• CAPI Flash APIs offer extremely high I/O per CPU thread with low latency.
Example applications: Netezza analytics, Big Insight, Dr. Watson, Spark, BigSQL, Memcached, etc.
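The Data Cache use case can be pictured as a small hot set in RAM in front of a much larger flash tier that holds the full data set. The sketch below is a hypothetical two-tier LRU cache; the flash tier is modeled as a plain dict, whereas a CAPI Flash deployment would reach it through the flash APIs (not shown). `TwoTierCache` and its counters are illustrative names.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TwoTierCache:
    """Hypothetical sketch: small LRU cache in RAM in front of a large flash tier.

    The flash tier is modeled as a plain dict holding every key.
    """

    def __init__(self, ram_capacity: int) -> None:
        self.ram_capacity = ram_capacity
        self.ram: "OrderedDict[str, bytes]" = OrderedDict()
        self.flash: Dict[str, bytes] = {}
        self.ram_hits = 0
        self.flash_hits = 0

    def put(self, key: str, value: bytes) -> None:
        self.flash[key] = value              # flash holds the full data set
        self._promote(key, value)

    def get(self, key: str) -> Optional[bytes]:
        if key in self.ram:
            self.ram.move_to_end(key)        # mark as most recently used
            self.ram_hits += 1
            return self.ram[key]
        value = self.flash.get(key)
        if value is not None:
            self.flash_hits += 1
            self._promote(key, value)        # bring the hot item back into RAM
        return value

    def _promote(self, key: str, value: bytes) -> None:
        self.ram[key] = value
        self.ram.move_to_end(key)
        while len(self.ram) > self.ram_capacity:
            self.ram.popitem(last=False)     # evict the least recently used entry
```

RAM capacity now bounds only the hot set, not the data set, which is the sense in which flash "frees up RAM".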
Helium vs. RocksDB vs. Aerospike (demo: http://www.levyx.com/content/helium-demo)
Levyx Products
• Helium-DB Storage Engine
– World's fastest key value store for big data analytics and operational databases
– In-memory speeds for very large data sets, with persistence
• LevyxSpark: Apache Spark + Helium
– Storage-optimized and accelerated open-source Spark for real-time, high-I/O-performance applications
– Full Spark SQL query pushdown (join, group-by, filter, etc.) and acceleration to machine-code speeds
– Node consolidation with a combined memory-flash storage layer
• Levyx Enhanced Memcached
– Maximize resource utilization: over 1 million transactions per second per node
– Data persistence without sacrificing ACID properties
Helium Accelerated Memcached
• Faster: with standard 90:10 and 70:30 (get:set) mixes, Helium-Memcached is at least 3.5 to 5.5 times better in TPS.
• Cheaper: a single Helium-Memcached node scales with cores/SSD, vs. stock Memcached (needs multiple nodes, large amounts of RAM).
• Simpler: plug and play with existing Memcached applications. Rapid automatic recovery from persisted SSD simplifies

70% get / 30% set
TPS                      Memcached   Levyx Memcached
SolarFlare               1.3 M       5.1 M
SolarFlare OpenOnload    1.5 M       7.4 M

90% get / 10% set
TPS                      Memcached   Levyx Memcached
SolarFlare               3.9 M       10.0 M
SolarFlare OpenOnload    4.2 M       14.8 M
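The per-configuration speedup of Levyx Memcached over stock Memcached can be computed directly from the quoted TPS figures. This uses only the numbers on the slide; no new measurements are implied.

```python
# Throughput in millions of TPS, copied from the Memcached tables:
# (get:set mix, network stack) -> (stock Memcached, Levyx Memcached)
tps = {
    ("70:30", "SolarFlare"): (1.3, 5.1),
    ("70:30", "SolarFlare OpenOnload"): (1.5, 7.4),
    ("90:10", "SolarFlare"): (3.9, 10.0),
    ("90:10", "SolarFlare OpenOnload"): (4.2, 14.8),
}

# Speedup of Levyx Memcached over stock Memcached, rounded to two decimals
speedup = {cfg: round(levyx / stock, 2) for cfg, (stock, levyx) in tps.items()}

for (mix, stack), ratio in speedup.items():
    print(f"{mix}  {stack:<22} {ratio:.2f}x")
```

On these figures the ratio ranges from about 2.56x (90:10, SolarFlare) to about 4.93x (70:30, SolarFlare OpenOnload).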
LevyxSpark Reduces Nodes and Cost
Cyber Security Real-Time Monitoring Use Case
• Spark without Levyx: 500 nodes (r3.8xlarge), $33,600/day
• Spark with Levyx: 50 nodes (c3.8xlarge), $1,920/day
15X lower cost!
“Often times technology vendors advertise scale-out as a way to reach high performance goals. It is a proven approach, but it is often used to mask single node inefficiencies. Without a solution where CPU, memory, network, and local storage are properly balanced, this is simply what we call ‘throwing hardware at the problem’. Hardware that, virtual or not, customers pay for.”
– Google Blog, 2015, in reference to Levyx and its groundbreaking technology
“Multi-million operations per second on a single Google Compute Engine instance”, Thursday, July 30, 2015: https://cloudplatform.googleblog.com/2015/07/Multi-million-operations-per-second-on-a-single-Google-Cloud-Platform-instance.html
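The cost figures on the slide combine two effects: ten times fewer nodes and a cheaper per-node daily cost. Working the quoted totals through shows the split; note that the daily totals themselves imply a ratio of 17.5, above the headline 15X.

```python
# Figures from the slide (daily cost in USD)
baseline_nodes, baseline_cost_per_day = 500, 33_600   # Spark without Levyx
levyx_nodes, levyx_cost_per_day = 50, 1_920           # Spark with Levyx

node_reduction = baseline_nodes / levyx_nodes                 # 10.0
node_savings = 1 - levyx_nodes / baseline_nodes               # 0.9, i.e. 90% fewer nodes
baseline_per_node = baseline_cost_per_day / baseline_nodes    # 67.2 USD/node/day
levyx_per_node = levyx_cost_per_day / levyx_nodes             # 38.4 USD/node/day
cost_ratio = baseline_cost_per_day / levyx_cost_per_day       # 17.5

print(f"nodes: {node_reduction:.0f}x fewer, cost: {cost_ratio:.1f}x lower")
```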
LevyxSpark Advantages
• Faster
– Combined solution provides superior performance vs. native Apache Spark, especially in situations involving:
• Large datasets with sorting, joins, group-by (heavy shuffling)
• Workloads involving small random inserts and point queries (an ideal fit)
• Leveraging index lookups vs. filtering
• Cheaper
– Up to 90% reduction in nodes/lower-cost nodes for equivalent or greater “in-memory” capacity
• Simpler
– Reduced network complexity
– No need to tier from memory to flash
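The difference between "index lookups" and "filtering" can be shown on plain Python lists: a filter examines every row for each query, while a sorted key column answers the same point query with a binary search. This is a generic illustration, not LevyxSpark's implementation; the function names and the synthetic data are assumptions.

```python
import bisect
import random
from typing import List, Tuple

Row = Tuple[int, str]


def filter_scan(rows: List[Row], key: int) -> List[Row]:
    # Filtering: every row is examined for every query, O(n).
    return [row for row in rows if row[0] == key]


def index_lookup(index: List[int], rows_sorted: List[Row], key: int) -> List[Row]:
    # Index lookup: binary search on the sorted key column, O(log n + matches).
    lo = bisect.bisect_left(index, key)
    hi = bisect.bisect_right(index, key)
    return rows_sorted[lo:hi]


rng = random.Random(42)                       # fixed seed for repeatability
rows = [(rng.randrange(1_000), f"payload-{i}") for i in range(10_000)]
rows_sorted = sorted(rows)                    # built once, reused by every query
index = [key for key, _ in rows_sorted]       # sorted key column acting as the index
```

Both functions return the same rows (up to ordering); the index pays a one-time sort so that each subsequent point query avoids a full pass over the data.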
LevyxSpark and OpenPower: Ideal Dense, “Scale-in” Platform
• POWER8
– High core count/relatively low cost
– CAPI high-performance interface
• 2-week porting effort
– Goal: compare native Spark (FC) vs. LevyxSpark (CAPI)
• Test unit
– Power System S822L 2-socket POWER8 server
– 20 POWER8 cores, 160 logical CPUs (SMT8, 8 threads per core)
– 256GB RAM
– Apache Spark 1.6
– FC and CAPI HBAs connected to IBM FlashSystem 840
– Ubuntu 16.04.01
Test Benchmarks
• Sort (integer, string, GenSort)
– Read an input table from the data ingestion drive
– Sort the table based on an integer column
– Write the sorted table to the flash subsystem
• Iterative Join
– Read 16 tables from the data ingestion drive
– Save the final join result to the flash subsystem
– For 10 iterations:
• Change one of the inputs of the join graph
• Calculate the new value of the final join result
• Write the new result to the flash subsystem
• Incremental Update to Sorted Table
– Read an input table from the data ingestion drive as a baseline data set
– For 10 iterations:
• Read another small table from the data ingestion drive
• Add all elements of the small table to the baseline data set
• Sort the baseline data based on the first integer column
• Write the sorted table to the flash subsystem
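The Incremental Update benchmark loop can be sketched in a few lines. Assumptions: synthetic rows replace the tables read from the data ingestion drive, and a caller-supplied `write` callback replaces the write to the flash subsystem; all names are hypothetical. The second function shows a swapped-in alternative (merging each sorted batch into an already sorted table with `heapq.merge`) that yields the same result; it is not described on the slide.

```python
import heapq
import random
from typing import Callable, List, Tuple

Row = Tuple[int, str]
Sink = Callable[[List[Row]], None]


def make_table(rng: random.Random, n: int, tag: str) -> List[Row]:
    # Stand-in for "read a table from the data ingestion drive" (synthetic data).
    return [(rng.randrange(1_000_000), f"{tag}-{i}") for i in range(n)]


def incremental_update_full_sort(base: List[Row], batches: List[List[Row]],
                                 write: Sink) -> List[Row]:
    # As specified: add each small table, re-sort on the first integer column, write.
    table = list(base)
    for small in batches:
        table.extend(small)
        table.sort(key=lambda r: r[0])
        write(table)
    return table


def incremental_update_merge(base: List[Row], batches: List[List[Row]],
                             write: Sink) -> List[Row]:
    # Alternative: keep the table sorted and merge in each sorted batch, O(n + m).
    table = sorted(base, key=lambda r: r[0])
    for small in batches:
        table = list(heapq.merge(table, sorted(small, key=lambda r: r[0]),
                                 key=lambda r: r[0]))
        write(table)
    return table
```

With a baseline of 2,000 rows and ten batches of 50, both variants call `write` ten times and end with the same 2,500 rows in key order.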
Specification – Test Bench Summary
Benchmark            Data set size (GB)   Comment
Sort                 64, 128, 256, 512    Highlight advantage of LevyxSpark in analytical use cases
Iterative Join       128, 256, 512        Highlight advantage of LevyxSpark in data persisting
Incremental Update   128, 256, 512        Highlight advantage of LevyxSpark in transactional use cases
PERFORMANCE COMPARISON
Integer Sort Test Bench
Charts: execution time and average CPU user % by input size (64, 128, 256, 512 GB), LevyxSpark vs. Spark.
String Sort Test Bench
Charts: execution time and average CPU user % by input size (64GB, 128GB, 256GB, 512GB), LevyxSpark vs. Spark.
GenSort Test Bench
Charts: execution time and average CPU user % by input size (64GB, 128GB, 256GB), LevyxSpark vs. Spark.
Iterative Graph Test Bench
Charts: execution time and average CPU user % by input size (128GB, 176GB, 256GB), LevyxSpark vs. Spark. Both charts are annotated “Stock Spark Error”.
Incremental Update
Charts: execution time and average CPU user % by input size (64GB, 128GB, 256GB), LevyxSpark vs. Spark.
Summary
LevyxSpark plus POWER8/CAPI integration: an ideal combination for Apache Spark I/O-intensive workloads
– Balanced scale-in platform: fewer nodes needed for a given workload
– Cores freed up by CAPI integration allow more analytical/computational workloads
– Larger datasets per node; reduced shuffling, spills, and crashes
Levyx is ready to collaborate on making applications Power CAPI Flash-aware
Contact within IBM
Randy [email protected]
Questions?
Levyx Contact
Kim [email protected]: +4916094878794
© 2016 IBM Corporation #ibmedge
www.levyx.com - copy of presentation
www.levyx.com
Thank you!