Post on 21-May-2020
POWER8 Scale Out, OpenPOWER and CAPI
Georgia IBM POWER User Group
16 APR 2015
JT Kellington
POWER8 Scale Out
© 2015 IBM Corporation
Power April 2014 Announcements
• New POWER8 Scale Out Servers
– IBM POWER8 2U 2 socket server: Power S822
– IBM POWER8 4U 1 socket server: Power S814
– IBM POWER8 4U 2 socket server: Power S824
• New POWER8 Linux Servers
– IBM POWER8 Linux 2U 1 socket server: Power S812L
– IBM POWER8 Linux 2U 2 socket server: Power S822L
• New Virtualization Management
– Enhanced HMC Functionality
– IBM PowerKVM – Kernel Virtual Machine
• New Linux Distro Offering
– Canonical Ubuntu
– Available on Linux Power servers with PowerKVM
Power April 2014 Announcements
• New I/O Options
– Ethernet
• New IBM i Releases
– IBM i 7.2 (1st new version in 4 years)
– IBM i 7.1 TR8
• POWER8 Hardware support
– IBM BLU Acceleration Solution - Power Systems Edition
– IBM PowerVP – Virtualization Performance
– IBM PowerSC – Security and Compliance
– IBM PowerVM
– IBM PowerVC
Innovation Drives Performance
[Chart: relative % of improvement (0% to 100%) by process node (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm), split into gain by technology scaling and gain by innovation]
POWER8: The First Processor Designed for Big Data
IBM 22nm Technology
• Silicon-on-Insulator
• 15 metal layers
• Deep trench eDRAM
POWER8 Processor Compute
• 12 cores (thread strength optimized)
• SMT8, 16-wide execution
• 2X internal data flows
• Transactional Memory
Cache
• 64KB L1 + 512KB L2 / core
• 96MB L3 + up to 128MB L4 / socket
• 2X bandwidths
System Interfaces
• 230 GB/s memory bandwidth / socket
• Up to 48x Integrated PCI gen 3 / socket
• CAPI (over PCI gen 3)
• Robust, Large SMP Interconnect
• On chip Energy Mgmt, VRM / core
POWER8 Memory Organization (Max Config shown)
[Diagram: POWER8 DCM connected to eight memory buffer chips, each with 16MB of cache in front of up to 128 GB of DRAM chips]
Up to 1 TB / Socket
First P8 Systems: 512 GB /Socket
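The per-socket totals follow from the diagram's repeated entries. A minimal sketch of that arithmetic, assuming the eight 16MB and eight 128 GB labels on the slide correspond to eight memory buffer chips per socket (the constant names are illustrative):

```python
# Per-socket maximums implied by the memory organization slide.
# Assumption: the eight "16MB" / "128 GB" labels are one per memory buffer chip.
BUFFERS_PER_SOCKET = 8
CACHE_PER_BUFFER_MB = 16      # buffer-chip cache (the L4 in the cache summary)
DRAM_PER_BUFFER_GB = 128      # DRAM behind each buffer chip

cache_total_mb = BUFFERS_PER_SOCKET * CACHE_PER_BUFFER_MB
dram_total_gb = BUFFERS_PER_SOCKET * DRAM_PER_BUFFER_GB

print(f"Buffer cache per socket: {cache_total_mb} MB")
print(f"DRAM per socket: {dram_total_gb} GB ({dram_total_gb // 1024} TB)")
```

The results line up with the "up to 128MB L4 / socket" and "Up to 1 TB / Socket" figures quoted on the slides.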
POWER8 Performance
[Charts: IO bandwidth (scale-out systems), per-socket performance gains (SMT8), memory BW per socket, and per-core performance gains (mixed workloads), compared across POWER5, POWER6, POWER7, POWER7+ and POWER8]
POWER8 Scale-Out Systems
Power Systems scale-out portfolio
• Power Systems S812L: 1-socket, 2U; Linux Only; KVM and PowerVM
• Power Systems S822L: 2-socket, 2U; Linux Only; KVM and PowerVM
• Power Systems S822: 2-socket, 2U; All Operating Systems; PowerVM only
• Power Systems S814: 1-socket, 4U; All Operating Systems; PowerVM only
• Power Systems S824: 2-socket, 4U; All Operating Systems; PowerVM only
• Power Systems S824L: 2-socket, 4U; Linux Only; Bare metal
POWER8 2U Scale Out Comparison
Feature | Power 730 | Power S822
Processor | POWER7+ | POWER8
Sockets | 2 | 2
Cores | 8 / 12 / 16 | 12 / 20
Maximum Memory | 512 GB @ 1066 MHz | 512 GB / 1 TB @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 68 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 LP | 4 / 5 PCIe x8 LP, 2 / 4 PCIe x16 LP
PCIe Hot Plug Support | No | Yes
IO bandwidth | 60 GB/sec | 192 GB/sec
Ethernet ports | Four 1 Gbt | Four 1 Gbt
SFF bays | 6 | 12
Easy Tier Support | No | Yes
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Service Processor | Generation 1 | Generation 2
POWER8 4U Scale Out Comparison
Feature | Power 720 | Power System S814
Processor | POWER7+ | POWER8
Sockets | 1 | 1
Cores | 4 / 6 / 8 | 6 / 8
Maximum Memory | 512 GB @ 1066 MHz | 512 GB @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 136 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 FH / HL, 4 PCIe x8 HH / HL (opt) | 5 PCIe x8 FH / HL, 2 PCIe x16 FH / FL
CAPI (Capable slots) | N / A | One
PCIe Hot Plug Support | No | Yes
IO bandwidth | 40 GB/sec | 96 GB/sec
Ethernet ports | Quad 1 Gbt | Quad 1 Gbt (x8 Slot)
SFF bays | 6 | 12
Easy Tier Support | No | Yes
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Service Processor | Generation 1 | Generation 2
POWER8 4U Comparison
Feature | Power 740 | Power Systems S824
Processor | POWER7+ | POWER8
Sockets | 2 | 2
Cores | 16 | 24
Maximum Memory | 1 TB @ 1066 MHz | 1 TB (2 TB) @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 68 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Drwr Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 FH / HL, 4 PCIe x8 HH / HL (opt) | 7 PCIe x8 FH / HL, 4 PCIe x16 FH / FL
PCIe Hot Plug Support | No | Yes
IO bandwidth | 60 GB/sec | 192 GB/sec
Ethernet ports | Quad 1 Gbt | Quad 1 Gbt
SFF bays | 6 | 12
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Easy Tier | No | Yes
Service Processor | Generation 1 | Generation 2
Performance / Benchmarks
POWER8 System Performance
[Chart: system performance of P4 690, P5+ 595 and P8 S824]
Power 740 vs Power S824
• 50% more Cores, More Internal Storage, More I/O Slots, Higher Perf Memory
• ~2x Better Performance
• Greater Energy Efficiency
• Better Thermal Characteristics
[Charts: Performance, Max Watts, Performance per KW and Performance per BTU for P 740+ vs P8 S824]
SAP Sales & Distribution 2-Tier ERP 6 Benchmark
• 24 Core Systems: IBM S824 vs Fujitsu RX300 S8, HP ProLiant BL460c and Cisco UCS C240 M3
• 2x+ Better Performance than nearest Intel competition
Siebel CRM Release 8.1.1.4 Benchmark
• Systems compared: Oracle SPARC T4-2 (16-core), Cisco UCS B200 M3 (16-core), IBM Power S824 (6-core)
• Per Core Performance: >3x
• Performance Leadership
eBS 12.1.3 Payroll Benchmark
• Systems compared: Oracle SPARC X3-2L (16-core), Cisco UCS B200 M3 (24-core), IBM Power S824 (12-core)
• Per Core Performance: 2x+
• Performance Leadership
Operating Systems
POWER8 AIX Levels
11 / 2012 12 / 2012 3 / 2013 5 / 2013 8 / 2013 9 / 2013 10 / 2013 12 / 2013 2Q / 2014 3Q / 2014
AIX 6.1 TL7
SP6 SP7 SP8 SP9 SP10
AIX 6.1 TL8
SP1 SP2 SP3 SP4 SP5
AIX 6.1 TL9
SP1
SP3
+ APAR
IV56366
AIX 7.1 TL1
SP6 SP7 SP8 SP9 SP10
AIX 7.1 TL2
SP1 SP2 SP3 SP4 SP5
AIX 7.1 TL3
SP1
SP3
+ APAR
IV56367
P8, P7 or P6 Modes with Full I/O Support
P7 or P6 Modes with Full I/O Support
P7 or P6 Modes with Virtual I/O
Why AIX……
• Best Performance and Scalability
– Scales to 256 Cores
– #1 SAP System performance
– #1 SAP per Core performance
• Most Available
– AIX & Power # 1 in availability (ITIC 2013 report)
• Most Secure
– CAPP/OSPP/EAL4+ Security Certification
– 0 reported security breaches with SAP and IBM DB2 or Oracle DB on AIX & Power
• Self Tuning (Dynamic System Optimization)
– Monitors and adjusts optimizations as needed
– Cache & Memory affinity
– Shared memory & Data Stream Prefetch optimization
• Minimize Memory requirements
– Active Memory Expansion
Investment being made into AIX……
• Hot patching of AIX Kernel
– Apply fix to “Live” AIX Kernel
– No reboot of the partition required
– No recycling of the applications
• CAPI Enablement
– Support of CAPI resources
• SRIOV Enhancements
– FCoE & Fibre Channel
• Performance improvements
– Pthreads Trans Memory
• Future Considerations
– AME Enhancements
– Larger Max memory
– Split Core support
– DSO Enhancements
IBM i Levels
IBM i 7.2
POWER7
Max Scale = 32 cores (SMT4)
Max Partition = 96 cores (SMT4)
Threads = ST, SMT2, SMT4 up to 384 threads in single partition
POWER8
Max Scale = 48 cores (SMT8)
Max Partition = 96 cores (SMT8)
Threads = ST, SMT2, SMT4, SMT8 up to 768 threads / single partition
IBM i 7.1 TR8
POWER7
Max Scale = 32 cores (SMT4)
Max Partition = 64 cores (SMT4)
Threads = ST, SMT2, SMT4 up to 256 threads in single partition
POWER8
Max Scale = 32 cores (SMT8)
Max Partition = 64 cores (SMT4)
Threads = ST, SMT2, SMT4, SMT8 up to 256 threads / single partition
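The thread limits on this slide are the maximum partition size multiplied by the SMT mode. A small sketch of that arithmetic, using only the core counts and SMT modes quoted above (the function and dictionary names are illustrative):

```python
def max_threads(cores: int, smt: int) -> int:
    """Hardware threads in a partition: one thread per SMT slot per core."""
    return cores * smt

# (max partition cores, SMT mode) as quoted on the slide
partitions = {
    "IBM i 7.2 on POWER7": (96, 4),
    "IBM i 7.2 on POWER8": (96, 8),
    "IBM i 7.1 TR8 on POWER7": (64, 4),
    "IBM i 7.1 TR8 on POWER8": (64, 4),
}

threads = {name: max_threads(*cfg) for name, cfg in partitions.items()}
for name, count in threads.items():
    print(f"{name}: up to {count} threads")
```

Note that IBM i 7.1 TR8 on POWER8 stays at 256 threads because the slide lists its maximum partition at SMT4.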
IBM i 7.2 and POWER8 Highlights
• Enhancing Systems of Engagement and Systems of Record:
– POWER8 enables new levels of performance, reliability and scalability making it simpler to integrate systems of engagement and systems of record on a single system and single architecture
– IBM i 7.2 locks down business data, increases security and improves performance, minimizing risk as you extend business systems to customers through mobile and cloud. Combined with the new encrypt/decrypt capabilities in POWER8, it makes protecting your data easier than before
• Key Capabilities:
– Powerful new features of DB2® for i ensure security of the data in a modern environment of mobile, social and network access
– IBM Navigator for i extends system management capabilities to manage and monitor performance services
– Integrated Security SSO application suite extended to include FTP and Telnet authentication with Kerberos
– PowerHA SystemMirror for i Express Edition introduces HyperSwap and improves system resiliency to ensure continual access for customers and employees
– Analytics: combined value of DB2 WebQuery & Cognos on Linux on Power
– Free Format RPG provides game changing enhancements for developers, making extension to mobile and social easier.
POWER8 Linux Distros
2Q / 2014
RHEL6 RHEL 6.5
P7 Mode in P8
RHEL 7 RHEL 7.0 - POWER8 Support
RHEL 7.1 – LE KVM Support
SLES 11 SLES 11 + SP3
P7 Mode in P8
SLES 12 POWER8 LE KVM
Ubuntu (LE) 14.04.00/01
P8 Support
Virtualization
Power System Software: An intelligent IT infrastructure for Cloud, Big Data, Analytics & Mobile
Simplified Virtualization and Cloud Management
Expanded choice and enhanced value for the industry’s most scalable & flexible virtualization infrastructure for UNIX, Linux and IBM i
PowerKVM: Open Virtualization for scale-out Linux Systems
• Kernel-Based Virtual Machine (KVM) Open Source Hypervisor for virtualizing Linux guest VMs on POWER8 Linux Scale-out servers
• Exploit existing Linux admin skills and tools
• Leverage Power systems performance and resiliency
PowerVM: Virtualization without Limits
• Delivers higher levels of utilization
• Simplified virtualization user experience with new performance views & capacity data
PowerVP: Virtualization Performance
• Improved memory and shared processor affinity to optimize performance and service levels
PowerVC (Virtualization Center): Increase IT productivity and agility
• Built on OpenStack
• Improved scalability, active directory support and shared storage pools enabling faster integration with clients' existing infrastructure
SmartCloud Entry for Power Systems*
• Extended capability to enable customization & quicker deployment of OpenStack-based cloud solutions
HMC Past:
• Disjoint set of tools
• Multiple agents need to be installed in OS
• Minimal or Lack of Visualization
HMC in 2Q-2014:
• Integrated Visual Monitor in HMC
• Standard set of Interfaces for external APIs to consume data
Power Systems Performance Monitoring
Performance metric indicators & utilization dashboard
Processor, memory & I/O
Server & LPAR level information
Basic trend data collection and visualization
Identify bottlenecks
Early problem detection
REST based API to access:
All platform (PHYP & VIOS) metrics for Tivoli
Third Party tools
Performance Monitoring – Metrics & Dashboard
Provides full PowerVM
performance and
capacity metrics
Via a single touch-point
(HMC).
Power Virtualization Options
PowerVM (Initial Offering: 2004)
• PowerVM is Power Virtualization that will continue to be enhanced to support AIX and IBM i workloads as well as Linux workloads
PowerKVM (Initial Offering: Q2 2014)
• PowerKVM provides an Open Source choice for Power Virtualization for Linux workloads. Best for clients that have Linux centric admins.
PowerVM vs PowerKVM Comparison
Feature | PowerVM | PowerKVM
GA Availability | 2004 | Q2 2014
Supported Hardware | All P6, P7, P7+, P8 Systems | PowerLinux P8 Systems
Supported OS | AIX, IBM i & Linux | Linux
Workload Mobility | Supports AIX, IBM i & Linux | Linux
Basic Virtualization Management | IVM / HMC / FSM | Virtman/libvirt
Advanced Virtualization Management | PowerVC/VMControl | PowerVC, Vanilla OpenStack
Admin Type | Power Centric | Linux/x86 Centric
Established Security Track Record on Power | Yes | No
Open Source Hypervisor | No | Yes
• First release available in 2014
• Focus: New Linux workloads for Power Systems
• Seamless transition for existing Linux admins to adopt Power Linux
Virtualization without any training
• No HMC or other traditional IBM consoles
• Normal Linux management and OpenStack options
• PowerKVM only supports Linux guest VMs
• Cloud potential: Have many more small VMs than traditional Power
Virtualization
• POWER8 PowerLinux hardware only
• Live Workload mobility support between PowerKVM servers
• Open Source Hypervisor: Hardware is abstracted by firmware
• Managed by OpenStack(PowerVC) or by off the shelf OpenStack or
local Linux Tools
PowerKVM Positioning
OpenPOWER
The Era of Heterogeneous Computing is Coming…
Microprocessors and technology alone are no longer driving cost/performance improvements
[Chart: contribution of processors and semiconductor technology for 2 socket systems vs 2 socket systems at constant cost (without price increases)]
System stack innovations are required to drive cost/performance
System Stack:
• Applications and services
• Systems Management & Cloud Deployment
• Systems Acceleration & HW/SW Optimization
• Firmware, Operating System and Hypervisor
• Processors
• Semiconductor Technology
Some Example Use Cases:
• Workload Acceleration
• Services Delivery Model
• Advanced Memories
• Optimized System Design
• Custom SOC’s
OpenPOWER Extends Moore’s Law to the System
OpenPOWER will enable data centers to rethink their approach to technology.
Member companies may use POWER for custom open servers and components for Linux based cloud data centers.
OpenPOWER ecosystem partners can optimize the interactions of server building blocks (microprocessors, networking, I/O & other components) to tune performance.
How will the OpenPOWER Foundation benefit clients?
– OpenPOWER technology creates greater choice for customers
– Open and collaborative development model on the Power platform will create more opportunity for innovation
– New innovators will broaden the capability and value of the Power platform
What does this mean to the industry?
– Game changer on the competitive landscape of the server industry
– Will enable and drive innovation in the industry
– Provide more choice in the industry
Platinum Members
Fueling an Open Development Community
Boards / Systems
I/O / Storage / Acceleration
Chip / SOC
System / Software / Integration
Implementation / HPC / Research
Complete member list at www.openpowerfoundation.org
OpenPOWER: Growing Fast
[Chart from April 2014: members grouped into Boards/Systems; I/O, Storage, Acceleration; Chip/SOC; System/Software/Services]
“POWER” Built for Open Innovation
• POWER Processors have a Leadership Set of Differentiated Interfaces
• Innovation with OpenPOWER is taking place on all interfaces and with custom SOC Designs
[Diagram: POWER8/8+ processors (PowerCore) with NVLINK to GPU/other devices, DMI to memory interface control and server class memory, and CAPI to IBM & partner devices]
Redesigning the Computer
• Extreme Parallelism available
• Targeted Software Accelerator packs
• IP Base Libraries
• Customer IP
• Reconfigurable Nature fights Commoditization
CPU’s + FPGA or GPU
CPU’s:
• Strong Cores for Serial Codes
• Runs Traditional & Legacy Software
• Runs OS (Security, Virtualization, etc)
Targeted Software Acceleration Packs, Transparent Tooling, Middleware Like Abstraction Services
Greater robustness is achieved by mating of specializations….
When to Use FPGAs
• Transistor Efficiency & Extreme Parallelism
– Bit-level operations
– Variable-precision floating point
• Power-Performance Advantage
– >2x compared to Multicore (MIC) or GPGPU
– Unused LUTs are powered off
• Technology Scaling better than CPU/GPU
– FPGAs are not frequency or power limited yet
– 3D has great potential
• Dynamic reconfiguration
– Flexibility for application tuning at run-time vs. compile-time
• Additional advantages when FPGAs are network connected ...
– allows network as well as compute specialization
When to Use GPGPUs
• Extreme FLOPS & Parallelism
– Double-precision floating point leadership
– Hundreds of GPGPU cores
• Programming Ease & Software Group Interest
– CUDA & extensive libraries
– OpenCL
– IBM Java (coming soon)
• Bandwidth Advantage on Power
– Start w/PCIe gen3 x16 and then move to NVLink
• Leverage existing GPGPU eco-system and development base
– Lots of existing use-Cases to build on
– Heavy HPC investment in GPGPU
Power8 Invents CAPI
[Diagram: Power Processor containing the CAPP, connected by CAPI over PCIe to a Coherently Attached Device]
• Coherent Attached Processor Proxy (CAPP) in processor
– Unit on processor that extends coherency to an attached device
– On processor directory responds on behalf of off-chip device (Filtering snoops)
• Coherency protocol tunneled over standard PCIe
– Eliminates the need for special I/Os and protocol logic
– CAPI utilizes standard Posted Write and Non-posted Reads
– Reduces the complexity and bandwidth requirements of the attached device
• Enables attached device to be a peer to the processor
– Simplifies programming model between application and attached device
– Enables device to use same effective address as application running in processor
– Eliminates the cumbersome I/O Device Driver requirements
– Pinned memory not required
Why CAPI is Better than Traditional PCIe
[Diagram: Power Processor with CAPP connected over PCIe (CAPI) to an FPGA containing the IBM Supplied POWER Service Layer and accelerator Functions 0, 1, 2 … n]
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Int Completion → Copy or Unpin Result Data → Ret. From DD Completion
Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
Advantages of Coherent Attachment Over I/O Attachment
• Virtual Addressing & Data Caching
– Shared Memory
– Lower latency for highly referenced
data
• Easier, More Natural Programming Model
– Traditional thread level
programming
– Long latency of I/O typically requires
restructuring of application
• Enables Applications Not Possible on I/O
– Pointer chasing, etc…
Workloads to Innovate
• Start with what FPGAs are good at: Embarrassingly Parallel Problems
• Combine with CAPI strengths:
– Ease of programming
– Lack of device driver
– Shared memory & caching (host to accelerator communication)
• What do you get:
– Bitwise data manipulation (e.g. Deep Compression)
– Pattern recognition
– Encryption
– Monte Carlo
Statistical modeling for complex predictions
– Image Analytics & Biometrics
Facial recognition
Feature detection (e.g. cancer)
– Network Packet Processing & Inspection
– Bioinformatics (e.g. Sequence alignment)
– Reverse time migration (Oil & Gas)
– Ensemble Calculations of Numerical Weather Prediction
– Machine Learning
– And on and on
Example: File System Acceleration with CAPI-FPGA
• Compression
– IBM Gzip offers best combination of performance and compression rate
• De-Duplication
– Signature calculation is easy to integrate with compression datapath
• Crypto
– Crypto acceleration on P8
– FPGA is also a good fit, especially if crypto algorithm is non-standard
• Content analytics for real-time tagging
– IBM CAPI/FPGA accelerated text analytics
– IBM CAPI/FPGA accelerated image analytics
• POWER8 / CAPI benefits
– Very strong memory & I/O bandwidth
– Seamless integration with CAPI shared memory interface (the accelerator is just like another core)
– Variety of accelerator partners through OpenPOWER (Altera, Xilinx, NVIDIA, ...)
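To illustrate why signature calculation fits naturally into a compression datapath, here is a minimal software sketch in which each chunk is hashed on its way to the compressor and duplicate chunks are stored only once. It is an assumption-laden illustration, not the FPGA design: SHA-256 and zlib stand in for whatever signature and GZIP engines the accelerator uses, and `store` / `restore` are hypothetical names.

```python
import hashlib
import zlib

def store(chunks):
    """Compress unique chunks once; duplicates become references."""
    index = {}    # signature -> position in `blocks`
    blocks = []   # compressed unique chunks
    recipe = []   # for every input chunk, which block holds its data
    for chunk in chunks:
        signature = hashlib.sha256(chunk).digest()  # computed on the same bytes being compressed
        if signature not in index:
            index[signature] = len(blocks)
            blocks.append(zlib.compress(chunk))
        recipe.append(index[signature])
    return blocks, recipe

def restore(blocks, recipe):
    """Rebuild the original chunk sequence from blocks and references."""
    return [zlib.decompress(blocks[i]) for i in recipe]

chunks = [b"a" * 4096, b"b" * 4096, b"a" * 4096]
blocks, recipe = store(chunks)
print(f"{len(chunks)} chunks stored as {len(blocks)} compressed blocks, recipe {recipe}")
```

Both steps read each chunk exactly once, which is the property that makes them easy to fuse in a streaming hardware pipeline.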
IBM Accelerated GZIP Compression
What it is:
An FPGA-based low-latency GZIP Compressor & Decompressor with single-thread throughput of ~2GB/s and a compression rate significantly better than low-CPU-overhead compressors like snappy.
IBM Accelerated Text Processing
What it is:
A compiler/runtime system for accelerating text analytics on a shared-memory CPU-FPGA
Results
Big Speedup vs. Multithread SW
To appear @:
Hot Chips 2014
[Diagram: AQL (rule language, SQL-like syntax) goes through the systemT optimizer to a compiled operator graph; the systemT runtime (Java + FPGA) applies it to input text to produce annotations. Sample input text: For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source”]
FPGA Image & Video Processing
Goal: Information Extraction → Object Recognition
• Extract relevant information from input image to enable object recognition
• Information located where pixels change color (edges, blobs)
• Intrinsic properties of objects
• Object boundaries
• Techniques: Template Matching, Edge Detection, Feature Extraction, Segmentation
Motivations:
• Applications requiring edge detection & feature extraction span a wide range of domains
• Computer/Machine Vision: Tracking, Object Recognition & Navigation
• General image proc.: Compression
• Quality Control: Unsupervised Defect Identification
• Medical Imaging: Analysis + Diagnosis & Computer Guided Surgery
Approach:
• Design fully-pipelined FPGA architectures (streaming application)
• Real-time, low-power, onboard image processing solution
• Sobel and Canny: extract contours/edges
• SURF: extract scale & rotation-invariant features
Custom Hardware Mapping
Theory:
• 2D convolution with Gaussian Filter: blur
• 2D convolution with Gaussian 1st derivative: extract edges
• 2D convolution with Gaussian 2nd derivative: extract features
Hardware Design: FPGA acceleration results from:
• Parallel 2D convolution
• Process all pixels inside filter in parallel
• Parallel 2D convolution in x, y, z direction
• Parallel 2D convolution for all filter scales
• Total of 33 filters
[Figure: Gaussian 1st derivative and 2nd derivative filters over X and Y]
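As a software reference for the edge-extraction step, the sketch below slides the two 3x3 Sobel kernels (a common discrete approximation of a smoothed first derivative) over a small image. It is a plain sequential illustration with hypothetical function names; the FPGA design described above evaluates all pixels inside the filter window in parallel, which this code does not attempt. The kernel is applied without flipping, which does not matter here because only the gradient magnitude is used.

```python
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # responds to horizontal intensity change
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # responds to vertical intensity change

def filter3x3(image, kernel):
    """Slide a 3x3 kernel over the image (valid region only, no padding)."""
    height, width = len(image), len(image[0])
    out = [[0] * (width - 2) for _ in range(height - 2)]
    for y in range(height - 2):
        for x in range(width - 2):
            out[y][x] = sum(kernel[j][i] * image[y + j][x + i]
                            for j in range(3) for i in range(3))
    return out

def sobel_magnitude(image):
    """Approximate gradient magnitude as |Gx| + |Gy|."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[abs(a) + abs(b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]

# 5x6 test image: dark left half, bright right half (one vertical edge)
image = [[0, 0, 0, 10, 10, 10] for _ in range(5)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)
```

Each output pixel depends only on its own 3x3 neighborhood, which is why the computation maps well onto a fully pipelined, parallel hardware filter.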
Results & Conclusions
OpenCL vs. VHDL performance table
Apps. | VHDL on Stratix 4: Frames/sec | Max freq. | VHDL on Stratix 5: Frames/sec | Max freq. | OpenCL on Stratix 5: Frames/sec | Max freq.
Sobel | 475 | 170 | 909 | 300 | 870 | 300
Canny | 470 | 170 | 890 | 300 | 823 | 309
SURF | 392 | 170 | 870 | 300 | 804 | 283
OpenCL vs. VHDL productivity table
Apps. | VHDL development time | OpenCL development time
Sobel, Canny, & SURF | 6 months | 1 month
Conclusions: Productivity, Performance
IBM Accelerated Image Processing
What it is:
A real-time, multi-stream HD Harris-Laplace feature detection algorithm implemented in an FPGA
Performance:
166M pixels per second (i.e. multi-stream HD video)
To appear:
IBM Journal of Research & Development
CAPI Attached Flash Optimization
– Attach TMS Flash to POWER8 via CAPI coherent Attach
– Issues Read/Write Commands from applications to eliminate 97% of code pathlength
– Saves 20-30 cores per 1M IOPs
– 20K instructions reduced to <500
Conventional I/O path: Application → Read/Write Syscall → FileSystem → LVM → Disk and Adapter DD
• strategy ( ): Pin buffers, Translate, Map DMA, Start I/O
• iodone ( ): Interrupt, unmap, unpin, Iodone scheduling
CAPI path: Application → User Library → Shared Memory Work Queue
• Posix Async I/O Style API: aio_read(), aio_write()
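The pathlength claim can be checked against the instruction counts on the same slide. A short worked calculation, treating the slide's "<500" as an upper bound of 500 instructions (variable names are illustrative):

```python
legacy_instructions = 20_000   # per I/O through the conventional stack, from the slide
capi_instructions = 500        # slide says "<500"; 500 is used as the upper bound

reduction = 1 - capi_instructions / legacy_instructions
print(f"Pathlength eliminated: at least {reduction:.1%}")

iops = 1_000_000
instructions_saved = (legacy_instructions - capi_instructions) * iops
print(f"Instructions avoided at 1M IOPS: {instructions_saved:.2e} per second")
```

The result is consistent with the "eliminate 97% of code pathlength" bullet; how that converts into the quoted 20-30 cores depends on per-core instruction throughput, which the slide does not state.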
Flash as Slow Memory
[Diagram: client, network, server and flash, comparing flash reached as CAPI memory with flash reached through conventional PCIe I/O and the network, relative to an acceptable latency level]
Monte-Carlo CAPI Acceleration
Running 1 million iterations: at least 250x faster with CAPI FPGA + POWER8 core
Full execution of a Heston model pricing for a single security:
1. SOBOL sequence generator (pRNG)
2. Inverse Normal to create the non-linear distribution
3. Path-generation
4. Pay-off function
Easier to Code:
Reduces C code writing by 40x compared to non-CAPI FPGA
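The four pipeline stages can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the accelerated implementation: Python's pseudo-random generator stands in for the Sobol sequence generator, path generation uses a simple Euler scheme with truncated variance, the payoff is a European call, and every parameter value is hypothetical.

```python
import math
import random
from statistics import NormalDist

def heston_call_price(s0, v0, strike, rate, kappa, theta, xi, rho,
                      maturity, steps, paths, seed=1):
    """Monte Carlo price of a European call under the Heston model."""
    rng = random.Random(seed)        # step 1 stand-in: pseudo-random, NOT a Sobol sequence
    inv_norm = NormalDist().inv_cdf  # step 2: inverse normal transform
    dt = maturity / steps
    payoff_sum = 0.0
    for _ in range(paths):
        s, v = s0, v0
        for _ in range(steps):       # step 3: path generation (Euler, variance floored at 0)
            z1 = inv_norm(max(rng.random(), 1e-12))
            z2 = inv_norm(max(rng.random(), 1e-12))
            zv = rho * z1 + math.sqrt(1.0 - rho * rho) * z2   # correlated variance shock
            v_pos = max(v, 0.0)
            s *= math.exp((rate - 0.5 * v_pos) * dt + math.sqrt(v_pos * dt) * z1)
            v += kappa * (theta - v_pos) * dt + xi * math.sqrt(v_pos * dt) * zv
        payoff_sum += max(s - strike, 0.0)                    # step 4: pay-off function
    return math.exp(-rate * maturity) * payoff_sum / paths

# Hypothetical parameters, far fewer paths than the slide's 1 million iterations
price = heston_call_price(s0=100.0, v0=0.04, strike=100.0, rate=0.01,
                          kappa=2.0, theta=0.04, xi=0.3, rho=-0.7,
                          maturity=1.0, steps=50, paths=2000)
print(f"Estimated call price: {price:.2f}")
```

Every path is independent of the others, so the inner loop is the part an FPGA can replicate across many parallel engines.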
POWER8-based Network Acceleration
Faster workloads with less infrastructure
IBM Power Systems and Mellanox® Technologies partnering to simultaneously accelerate the network and compute for NoSQL workloads.
• 10x higher throughput: dramatically less data center infrastructure
• 10x lower latency: dramatically faster responsiveness to customers
• Exploiting high speed networks with Remote DMA (RDMA)
• Leveraging POWER8 high throughput, low latency I/O
[Map: data exchanged between New York, Boston, Washington D.C. and Chicago (Eastern and Central regions)]
GPU Acceleration Example: Espresso
Large global retailers collect petabytes of data
• Transactions generate tens of millions of filing cabinets of paper
• We’re only just discovering how to make this data useful
• Impossible to make this much data useful through human inspection
How does a retailer translate all of this data to business value?
• Group customers in segments with similar behavior
• Customize products and marketing programs
IBM Power Systems GPU Acceleration of Java Applications
• Now possible: Big Data and Java workload acceleration
– Use of segmentation or clustering in the retail industry
• Look for non-obvious patterns in the sales data and react quickly
• Analyze across tens of thousands of dimensions quickly and accurately
• Lends itself nicely to a bit of computer science known as "k-means clustering"
– Outcome could lead to new products, revised products and advertising, launching new campaigns….wherever the data leads you….
Imagine generating 100 times more ideas for new products and campaigns – who can get you there?
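For readers unfamiliar with k-means clustering, here is a minimal pure-Python sketch of the algorithm (Lloyd's iteration) on a toy customer data set. It is only an illustration of the technique: the demo described in the slides runs on Hadoop / Mahout with GPU acceleration, while the data, the evenly spaced initialization and the function name here are all hypothetical.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    # Simple deterministic init: take k evenly spaced points as starting centroids.
    centroids = [list(points[i * len(points) // k]) for i in range(k)]

    def nearest(point):
        return min(range(k),
                   key=lambda c: sum((a - b) ** 2 for a, b in zip(point, centroids[c])))

    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            clusters[nearest(point)].append(point)
        for c, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    labels = [nearest(point) for point in points]
    return centroids, labels

# Toy data: (visits per month, average basket size) for six customers
customers = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centroids, labels = kmeans(customers, k=2)
print("Segment of each customer:", labels)
```

The distance computations for different points are independent, which is what makes the assignment step a good fit for a GPU when the data has thousands of dimensions.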
GPU Espresso Demo
• IBM and NVIDIA are demonstrating segmentation using GPU accelerated machine learning for clustering using Hadoop / Mahout
– OpenPower initiative with NVIDIA
– First product implementing GPU acceleration for Java
• Best-in-class ingredients
– IBM POWER8 – Designed for Big Data
– IBM Java
– NVIDIA CUDA GPU acceleration
– Ubuntu Little Endian Linux for POWER
• Achieving 8X performance improvement
OpenPOWER innovations benefit Clients
• NVIDIA acceleration built into IBM Power S824L
– 8x faster than x86 Ivy Bridge on pattern extraction
– 82x faster for Cognos BI and DB2 BLU
• Altera FPGA acceleration and IBM CAPI
– Monte Carlo 250x faster than POWER8 core alone, reduced C code 40x over non-CAPI FPGA
– Data Engine for NoSQL: 24:1 server consolidation, 3x lower cost per user, 40TB CAPI-attached flash
• CAPI dev kit with FPGA card from Nallatech
• Tyan OpenPOWER Customer Reference System
• US Dept of Energy $325M supercomputing contract awarded to IBM, Mellanox, and NVIDIA
– DoE systems for science and stockpile stewardship
– Sierra and Summit systems to be >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia
University Research on Power8 Accelerators
• Photodynamic Therapy @ University of Toronto
• fMRI @ Western University
• Genomics @ University of Illinois Urbana-Champaign & Rice & Delft
• Seismic @ University of Texas
• Data Analytics @ North Carolina State University
• Financial Risk @ University of Florida
• The list is growing rapidly…
What is CAPI?
What’s in a name?
FPGA as an Accelerator
• FPGA: Field Programmable Gate Array
– It’s a re-programmable chip
– It can run fast (clock rates of 250 – 500 MHz or more)
– It has Industry Standard Interfaces like PCIe Gen3
– The Major FPGA Suppliers, Altera and Xilinx, are OpenPOWER Foundation members
[Diagram: FPGA Library of accelerator functions (gzip, Encrypt, Monte Carlo) on an FPGA attached over PCIe]
Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now, we also have OpenCL, a more programmer friendly language.
*RTL = Register Transfer Level
**VHDL = VHSIC*** Hardware Description Language
***VHSIC = Very High Speed Integrated Circuit
Why is an Accelerator Faster?
Question: The POWER8 Processor runs at ~3 GHz while our FPGA runs at 250 MHz. So why would an accelerator be better?
Answer: The FPGA is better for certain algorithms, such as those that are numerically intensive or have parallelism.
• The POWER8 processor has a finite set of instructions to implement the algorithm in SW.
• The FPGA is customized logic built for specific processing of an algorithm.
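The question can be made concrete with a little arithmetic: throughput is clock rate times results completed per cycle, so the FPGA wins once its custom logic completes enough work per cycle to offset the slower clock. The sketch below uses only the two clock rates from the slide; the results-per-cycle figures are hypothetical inputs, and the model ignores data movement.

```python
CPU_HZ = 3.0e9    # "~3 GHz" POWER8 core, from the slide
FPGA_HZ = 250e6   # 250 MHz FPGA, from the slide

# How much more work per clock cycle the FPGA needs just to match the CPU
break_even = CPU_HZ / FPGA_HZ
print(f"Break-even: {break_even:.0f}x more work per cycle on the FPGA")

def speedup(fpga_results_per_cycle, cpu_results_per_cycle=1.0):
    """Throughput ratio FPGA/CPU under a simple results-per-cycle model."""
    return (FPGA_HZ * fpga_results_per_cycle) / (CPU_HZ * cpu_results_per_cycle)

# Hypothetical: 50 parallel engines each producing one result per cycle,
# against a software loop producing one result per cycle
print(f"Speedup with 50 engines: {speedup(50):.1f}x")
```

In practice the software side usually needs many instructions per result, which pushes the ratio further in the FPGA's favor for suitable algorithms.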
Why is an Accelerator Faster?
Example 1: Numerical Intensive Algorithm
[Diagram: SW side shows Main (n,a,v,w) with Integral (), Sigma (), Sin () and Cos () routines; FPGA side shows sin, cos, x, +, ∑ and ∫ logic blocks. Both receive the variables and signal Done!]
Why is an Accelerator Faster?
Example 2: Parallelism
Monte Carlo Risk Analysis to determine probability of financial success:
Given current finances, run 100 scenarios
[Diagram: SW Main (Vars) passes variables to the FPGA, where a variable distributor feeds Engine 1 through Engine 50 in parallel and a results accumulator collects their output (Monte). Figure labels: 50, 5, 10, 100]
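The distributor / engines / accumulator structure in this example can be mimicked in software. The sketch below is only a structural illustration: the return model (a yearly gain drawn from a normal distribution) and all names are hypothetical, and Python threads do not run these engines truly in parallel the way 50 hardware engines would.

```python
import random
from concurrent.futures import ThreadPoolExecutor

ENGINES = 50      # parallel engines in the diagram
SCENARIOS = 100   # scenarios to run, from the slide

def distribute(scenarios, engines):
    """Variable distributor: deal scenario ids round-robin to the engines."""
    return [list(range(e, scenarios, engines)) for e in range(engines)]

def engine(scenario_ids, start_balance=100.0, years=10):
    """One engine: run its scenarios and count those that end above the start."""
    successes = 0
    for scenario_id in scenario_ids:
        rng = random.Random(scenario_id)             # one reproducible stream per scenario
        balance = start_balance
        for _ in range(years):
            balance *= 1.0 + rng.gauss(0.05, 0.15)   # hypothetical yearly return
        successes += balance > start_balance
    return successes

def run():
    work = distribute(SCENARIOS, ENGINES)
    with ThreadPoolExecutor(max_workers=ENGINES) as pool:
        partial_counts = list(pool.map(engine, work))
    return sum(partial_counts) / SCENARIOS           # results accumulator

probability = run()
print(f"Estimated probability of financial success: {probability:.2f}")
```

With 100 scenarios and 50 engines, each engine handles two scenarios, so the hardware version finishes in roughly the time of two scenario runs rather than one hundred.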
So what is new?
Accelerators on FPGAs have been around for a long time…. So what is new?
Coherency makes the accelerator a peer to the POWER8 cores
What was done before CAPI?
• Prior to CAPI, an application called a device driver to utilize an FPGA Accelerator.
• The device driver performed a memory mapping operation.
• 3 versions of the data (not coherent).
• 1000s of instructions in the device driver.
[Diagram: App on the POWER8 cores holds variables, input data and output data in the memory subsystem (virtual address); the device driver (DD) copies variables and input data through a device driver storage area to the FPGA over PCIe, and output data back]
CAPI Coherency
With CAPI, the FPGA shares memory with the cores
• 1 coherent version of the data.
• No device driver call/instructions.
[Diagram: App on the POWER8 cores and the FPGA (through the PSL, over PCIe) both access the same variables, input data and output data in the memory subsystem (virtual address)]
CAPI vs. I/O Device Driver: Data Prep
Typical I/O Model Flow:
DD Call → Copy or Pin Source Data → MMIO Notify Accelerator → Acceleration → Poll / Interrupt Completion → Copy or Unpin Result Data → Ret. From DD Completion
• Instruction counts shown: 300, 10,000, 3,000, 1,000 and 1,000
• 7.9µs + 4.9µs: Total ~13µs for data prep
Flow with a Coherent Model:
Shared Mem. Notify Accelerator → Acceleration → Shared Memory Completion
• Instruction counts shown: 400 and 100
• 0.3µs + 0.06µs: Total 0.36µs
Acceleration time is application dependent, but equal in both flows
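Summing the figures from this slide gives a feel for the size of the data-prep overhead gap. The sketch below only adds up the numbers as printed; it makes no assumption about which flow step each instruction count belongs to, and the variable names are illustrative.

```python
# Figures as printed on the slide
io_model_instructions = [300, 10_000, 3_000, 1_000, 1_000]
io_model_us = [7.9, 4.9]
coherent_instructions = [400, 100]
coherent_us = [0.3, 0.06]

io_total_instr = sum(io_model_instructions)
coherent_total_instr = sum(coherent_instructions)
io_total_us = sum(io_model_us)
coherent_total_us = sum(coherent_us)

print(f"I/O model:      {io_total_instr} instructions, {io_total_us:.1f} us")
print(f"Coherent model: {coherent_total_instr} instructions, {coherent_total_us:.2f} us")
print(f"Instruction ratio: {io_total_instr / coherent_total_instr:.1f}x")
print(f"Latency ratio:     {io_total_us / coherent_total_us:.1f}x")
```

The acceleration step itself is excluded because the slide states it is the same in both flows.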
CAPI Differentiation
CAPI vs. I/O or Socket FPGA Solution (I/O Paradigm vs. CAPI Paradigm)
IBM Innovation | Customer Impact
FPGA is a peer to the processor (caching and translations by PSL) | Simple programming paradigm; higher performance
Architecture allows for any kind of FPGA or even an ASIC | Flexible solutions; connection to Flash, FC, EN….
Virtualization in the Architecture | Applications can share Accelerator
POWER8 Processor
Technology
• 22 nm SOI, eDRAM, 15 ML, 650 mm²
Caches
• 512 KB SRAM L2 / core
• 96 MB eDRAM shared L3
Memory
• Up to 230 GB/s sustained bandwidth
Bus Interfaces
• Durable open memory attach interface
• Integrated PCIe Gen3
• SMP interconnect
• CAPI
Energy Management
• On-chip power management microcontroller
Cores
• 12 cores (SMT8)
• 8 dispatch, 10 issue, 16 execution pipes
• 2x internal data flows/queues
• Enhanced prefetching
• 64 KB data cache, 32 KB instruction cache
Accelerators
• Crypto and memory expansion
• Transactional memory
• VMM assist
• Data move/VM mobility
POWER8 Scale-Out Dual Chip Module
[Figure: two chips in one module. Each chip shows six cores, one L2 per core, six L3 banks, a chip interconnect, a memory bus, SMP interconnect links, and CAPI/PCIe interfaces.]
Let’s take a closer look at how IBM Engineers made CAPI work
How CAPI Works
[Figure: POWER8 Processor connected over PCIe to the CAPI Developer Kit Card, with the algorithm split between the two.]
• Application Portion: Data Set-up, Control
• Acceleration Portion: Data or Compute Intensive, Storage or External I/O
• Sharing the same memory space
• Accelerator is a peer to POWER8 Core
CAPI technology connections
• Proprietary hardware to enable coherent acceleration
• Operating system enablement
– Ubuntu LE
– libcxl function calls
• Customer application and accelerator
• Application sets up data and calls the accelerator functional unit (AFU)
• AFU reads and writes coherent data across the PCIe and communicates with the application
– PSL cache holds coherent data for quick AFU access
[Figure: POWER8 Processor (Core, CAPP, OS, App, coherent Memory) connected over PCIe to the FPGA (IBM Supplied PSL, AFU).]
CAPI solution flow
Participants: App, OS, IBM Supplied PSL, AFU
1 Connect to accelerator
– Open device: cxl_afu_open_dev
– AFU reserved for work
2 Set Work Element Descriptor (WED) at AddrX
– May contain addresses of other data structures
– AFU understands WED content and any other addressed data structures
3 Start accelerator
– Attach device: cxl_afu_attach
– Reset AFU; PSL_WED_Ax is set to AddrX; AFU_CNTL_An[E] is set
– CTL interface: jea gets AddrX, jcom gets start
4 AFU fetches AddrX (the WED) and starts operation
– CMD interface, Buffer interface, Resp interface
5 MMIO interface
– If required, App can read or write AFU registers
– AFU continues to work using this interface
6 AFU finishes (mechanism is user defined)
– CTL interface: de-assert RUNNING, assert DONE
– App knows AFU is finished (mechanism is user defined)
– App can start again from top or free AFU
– Free device: cxl_afu_free
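The order of the steps in this flow can be sketched as a toy model. This is an assumption-laden illustration, not libcxl: the class ToyAfu, its methods, the WED fields "src" and "dst", the addresses, and the doubling job are all hypothetical, and a Python dict stands in for coherent shared memory. Step 5 (MMIO register access) is optional on the slide and is left out.

```python
# Toy model of the solution flow; it does not call libcxl or touch hardware.
# All class, method, and field names here are hypothetical.

class ToyAfu:
    def __init__(self, memory):
        self.memory = memory      # stands in for coherent shared memory
        self.reserved = False
        self.running = False
        self.done = False
        self.wed_addr = None

    def open(self):               # step 1 (the role cxl_afu_open_dev plays)
        self.reserved = True

    def attach(self, wed_addr):   # step 3 (the role cxl_afu_attach plays)
        if not self.reserved:
            raise RuntimeError("AFU must be opened before attach")
        self.wed_addr = wed_addr
        self.running, self.done = True, False
        self._work()

    def _work(self):              # step 4: fetch the WED, follow its addresses
        wed = self.memory[self.wed_addr]
        src = self.memory[wed["src"]]
        self.memory[wed["dst"]] = [x * 2 for x in src]  # hypothetical job
        self.running, self.done = False, True           # step 6: finished

    def free(self):               # the role cxl_afu_free plays
        self.reserved = False


memory = {0x100: [1, 2, 3], 0x200: None}
afu = ToyAfu(memory)
afu.open()                                      # 1: connect to accelerator
memory[0x10] = {"src": 0x100, "dst": 0x200}     # 2: set WED at AddrX (0x10)
afu.attach(0x10)                                # 3 and 4: start, AFU runs
finished = afu.done and not afu.running         # 6: app sees completion
output = memory[0x200]
afu.free()
print(finished, output)
```

The point of the sketch is that the App passes only one address; the AFU finds its input and output locations by reading the WED from the shared memory.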
POWER8 with CAPI Cards
POWER8 Modules
CAPI Dev Kit Cards
Front View
Side View
Basic concepts of CAPI
CAPI vs. CAPI Solutions
• CAPI is a platform to enable acceleration
• CAPI provides an infrastructure to improve performance of an application through FPGA acceleration
– Enables customer-defined acceleration within the processor complex
• CAPI allows implementation of a wide range of accelerators to optimally address many different customer challenges
– Each implementation is a unique CAPI Solution
• A CAPI Solution is a specific implementation of an algorithm that uses an FPGA + application
• A CAPI Solution requires logic designers and programmers to implement the solution
• CAPI Solution Examples:
– Flash Appliance (IBM Data Engine for NoSQL)
– Monte Carlo Algorithm
[Figure labels: Platform for Innovation; Specific Customer Solution]
Why Accelerate on CAPI?
• Reasons to consider CAPI Acceleration
– Higher Performance
  • If your customer has a complex application running on a core, consider CAPI for better performance
  • If your customer already does I/O attached FPGA acceleration, CAPI will simplify their software and provide better performance
– Lower IT Costs
  • By moving workload to CAPI, your customer will need fewer cores
  • In some cases, such as the IBM Data Engine for NoSQL, CAPI can do the same work with far less infrastructure
– Lower Power
  • Running acceleration on an FPGA can result in lower power consumption vs. running the application as software on a core
Note:
When considering CAPI for a particular solution, we compare it to:
1. The same solution running as software –OR–
2. The same solution running on an IO attached FPGA
CAPI ecosystem partners and consumers
• IBM CAPI Solutions: IBM Data Engine for NoSQL
• Partner Solutions
• CAPI-APPS For Clients
• Clients with their Own Proprietary Solutions
Have a client who wants their IBM Application to be accelerated on CAPI (ex: DB2, CPLEX, Streams)? Contact: Jonathan Dement (dementj@us.ibm.com)
Have a client or partner who wants to create a CAPI-App and sell it to others? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile (bwile@us.ibm.com) about the opportunity.
Have a client or partner who wants to create a proprietary CAPI Solution? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile (bwile@us.ibm.com).
Why tell Bruce Wile about the opportunity? Depending on the size of the opportunity, we will engage the CAPI Customer Enablement Team.
Two Paths into CAPI
• CAPI Developer Kit
– Clients create their own, proprietary business solution.
• CAPI Market Solutions (CAPI App Solutions)
– IBM & Partners create business solutions for the CAPI Market.
– Clients buy pre-packaged solutions from the CAPI Market.
CAPI Solutions
CAPI App Solutions
© 2014 OpenPOWER Foundation
Open Development Driving CAPI Solutions
Boards / Systems
I/O / Storage / Acceleration
Chip / SOC
System / Software / Integration
Implementation / HPC / Research
Complete member list at www.openpowerfoundation.org
Potential Markets for CAPI Solutions
• Markets: Medicine; Finance/Insurance; Visual/Biometric Analysis; Oil & Gas; Weather; Big Data/Database/Compute; Social/Media; Manufacturing/EDA
• Example workloads: Radiation Therapy; Pharmaceuticals; Public Health; Image Analysis; Genomics; Risk Analysis; Monte Carlo; Pattern Analysis; Retail Security; Facial Recognition; Network Packet Processing; Database Acceleration/KVS; Machine Learning; Bitwise Data Manipulation; Compression/Encryption; Ensemble Calculations of Numerical Weather Prediction; Reverse Time Migration; Data Analytics; Pattern Recognition; Fluid Dynamics; 3D Modeling CAD; Pipeline Analysis & Flow
• Solution categories: Specialized Algorithms; Deep Computation and Critical Runtime Jobs; Edge of Network, JPEG & Video processing; Database Acceleration & Fast Storage
CAPI Availability
• CAPI Developer Kit
– Procure through Nallatech
– For customers considering creating their own CAPI Solution
– CAPI Decision and Process Guide
– Requires POWER8 Server
– Available now
– See www.nallatech.com/capi
• First CAPI Solution: IBM Data Engine for NoSQL
– Procure through IBM
– GA in early 2015
• See: http://www.ibm.com/support/customercare/sas/f/capi/home.html
CAPI Developer Kit
CAPI Developer Kit – FPGA Card
Altera Stratix V FPGA
Dual 10G SFP+
2 Banks of SDRAM
PCI-E Gen3
Complete Datasheet
CAPI Developer Kit
IBM POWER8™ Server
CAPI Developer Kit
http://www.ibm.com/support/customercare/sas/f/capi/home.html
© Copyright International Business Machines Corporation 2015
Printed in the United States of America September 2015
IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp.,
registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.
A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at
www.ibm.com/legal/copytrade.shtml.
The following terms are trademarks or registered trademarks licensed by Power.org in the United States and/or other countries: Power ISA.
Information on the list of U.S. trademarks licensed by Power.org may be found at www.power.org/about/brand-center/.
Linux is a trademark of Linus Torvalds in the United States, other countries, or both.
Other company, product, and service names may be trademarks or service marks of others.
All information contained in this document is subject to change without notice. The products described in this document
are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction
could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not
affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied
license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document
was obtained in specific environments, and is presented as an illustration. The results obtained in other operating
environments may vary.
While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations
or warranties of accuracy or completeness are made.
Note: This document contains information on products in the design, sampling and/or initial production phases
of development. This information is subject to change without notice. Verify with your IBM field applications
engineer that you have the latest version of this document before finalizing a design.
You may use this documentation solely for developing technology products compatible with Power Architecture®. You may not modify or distribute this documentation. No license,
express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document.
THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. In no event will IBM be
liable for damages arising directly or indirectly from any use of the information contained in this document.
IBM Systems and Technology Group
2070 Route 52, Bldg. 330
Hopewell Junction, NY 12533-6351
The IBM home page can be found at ibm.com®.
Version 1.0
29 September 2014—IBM Confidential