POWER8 Scale Out, OpenPOWER and CAPI

Georgia IBM POWER User Group

16 APR 2015

JT Kellington

POWER8 Scale Out

Power April 2014 Announcements

• New POWER8 Scale Out Servers

– IBM POWER8 2U 2 socket server: Power S822

• New POWER8 Linux Servers

– IBM POWER8 Linux 2U 1 socket server: Power S812L

– IBM POWER8 Linux 2U 2 socket server: Power S822L

• New Virtualization Management

– Enhanced HMC Functionality

– IBM PowerKVM – Kernel Virtual Machine

• New Linux Distro Offering

– Canonical Ubuntu

– Available on Linux Power servers with PowerKVM

• New I/O Options

– Ethernet

• New IBM i Releases

– IBM i 7.2 (1st new version in 4 years)

– IBM i 7.1 TR8

• POWER8 Hardware support

– IBM BLU Acceleration Solution - Power Systems Edition

– IBM PowerVP – Virtualization Performance

– IBM PowerSC – Security and Compliance

– IBM PowerVM

– IBM PowerVC

Innovation Drives Performance

[Chart: relative % of improvement per technology node (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm, 22 nm), split into gain by technology scaling and gain by innovation]

POWER8: The First Processor Designed for Big Data

IBM 22nm Technology

• Silicon-on-Insulator

• 15 metal layers

• Deep trench eDRAM

POWER8 Processor Compute

• 12 cores (thread strength optimized)

• SMT8, 16-wide execution

• 2X internal data flows

• Transactional Memory

• 64KB L1 + 512KB L2 / core

• 96MB L3 + up to 128MB L4 / socket

• 2X bandwidths

System Interfaces

• 230 GB/s memory bandwidth / socket

• Up to 48x Integrated PCI gen 3 / socket

• CAPI (over PCI gen 3)

• Robust, Large SMP Interconnect

• On chip Energy Mgmt, VRM / core

POWER8 Memory Organization (Max Config shown)

[Diagram: POWER8 DCM attached to memory buffers; label: 128 GB]

• Up to 1 TB / socket

• First P8 systems: 512 GB / socket

POWER8 Performance

[Charts comparing POWER5, POWER6, POWER7, POWER7+ and POWER8: IO bandwidth (scale-out systems); per-socket performance gains (SMT8); memory bandwidth per socket; per-core performance gains (mixed workloads)]

POWER8 Scale-Out Systems

Power Systems scale-out portfolio

• Power Systems S812L: 1-socket, 2U; Linux only; KVM and PowerVM

• Power Systems S822L: 2-socket, 2U; Linux only; KVM and PowerVM

• Power Systems S822: 2-socket, 2U; all operating systems; PowerVM only

• Power Systems S814: 1-socket, 4U; all operating systems; PowerVM only

• Power Systems S824: 2-socket, 4U; all operating systems; PowerVM only

• Power Systems (model name not captured): 2-socket, 4U; Linux only; bare metal

POWER8 2U Scale Out Comparison

Feature | Power 730 | Power S822
Processor | POWER7+ | POWER8
Sockets | 2 | 2
Cores | 8 / 12 / 16 | 12 / 20
Maximum Memory | 512 GB @ 1066 MHz | 512 GB / 1 TB @ 1600 MHz
Memory Cache | No | Yes
Memory Bandwidth | 68 GB/sec | 192 GB/sec
Memory DRAM Spare | No | Yes
IO Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 LP | 4 / 5 PCIe x8 LP; 2 / 4 PCIe x16 LP
PCIe Hot Plug Support | No | Yes
IO bandwidth | 60 GB/sec | 192 GB/sec
Ethernet ports | Four 1 Gb | Four 1 Gb
SFF bays | 6 | 12
Easy Tier Support | No | Yes
Integrated split backplane | Yes (3 + 3) | Yes (6 + 6)
Service Processor | Generation 1 | Generation 2

POWER8 4U Scale Out Comparison

Feature | Power 720 | Power System S814
Processor | POWER7+ | POWER8
Sockets | 1 | 1
Cores | 4 / 6 / 8 | 6 / 8
Maximum Memory | 512 GB @ 1066 MHz | 512 GB @ 1600 MHz
Memory Cache | No | Yes
IO Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 FH / HL; 4 PCIe x8 HH / HL (opt) | 5 PCIe x8 FH / HL; 2 PCIe x16 FH / FL
CAPI (capable slots) | N / A | One
Ethernet ports | Quad 1 Gb | Quad 1 Gb (x8 slot)
SFF bays | 6 | 12
Easy Tier Support | No | Yes

POWER8 4U Comparison

Feature | Power 740 | Power Systems S824
Processor | POWER7+ | POWER8
Sockets | 2 | 2
Cores | 16 | 24
Maximum Memory | 1 TB @ 1066 MHz | 1 TB (2 TB) @ 1600 MHz
Memory Cache | No | Yes
IO Drawer Expansion Slots | Dual GX++ | 4 PCIe x16 G3
PCIe slots | 5 PCIe x8 FH / HL; 4 PCIe x8 HH / HL (opt) | 7 PCIe x8 FH / HL; 4 PCIe x16 FH / FL
Ethernet ports | Quad 1 Gb | Quad 1 Gb
SFF bays | 6 | 12
Easy Tier | No | Yes

Performance / Benchmarks

POWER8 System Performance

[Chart: system performance of P4 690, P5+ 595 and P8 S824]

Power 740 vs Power S824

• 50% more cores, more internal storage, more I/O slots, higher-performance memory

• ~2x better performance

• Greater energy efficiency

• Better thermal characteristics

[Charts: P 740+ vs P8 S824 on max watts, performance, performance per kW and performance per BTU]

SAP Sales & Distribution 2-Tier ERP 6 Benchmark

• 2x better performance than nearest Intel competition (24-core systems)

[Chart: IBM S824 vs Fujitsu RX300 S8, HP ProLiant BL460c and Cisco UCS C240 M3]

Siebel CRM Release 8.1.1.4 Benchmark

[Charts: per-core performance and performance leadership; bar labels: Oracle 16-core, B200 M3 16-core, 6-core]

eBS 12.1.3 Payroll Benchmark

[Charts: per-core performance and performance leadership; bar labels: Oracle 16-core, B200 M3 24-core, 12-core]

Operating Systems

POWER8 AIX Levels

[Timeline chart, 11 / 2012 through 3Q / 2014, showing service pack releases per technology level]

• AIX 6.1 TL7: SP6 through SP10

• AIX 6.1 TL8: SP1 through SP5

• AIX 6.1 TL9 + APAR IV56366

• AIX 7.1 TL1: SP6 through SP10

• AIX 7.1 TL2: SP1 through SP5

• AIX 7.1 TL3 + APAR IV56367

Support levels in the chart legend:

– P8, P7 or P6 modes with full I/O support

– P7 or P6 modes with full I/O support

– P7 or P6 modes with virtual I/O

Why AIX……

• Best Performance and Scalability

– Scales to 256 Cores

– #1 SAP System performance

– #1 SAP per Core performance

• Most Available

– AIX & Power # 1 in availability (ITIC 2013 report)

• Most Secure

– CAPP/OSPP/EAL4+ Security Certification

– 0 reported security breaches with SAP and IBM DB2 or Oracle Database on AIX & Power

• Self Tuning (Dynamic System Optimization)

– Monitors and adjusts optimizations as needed

– Cache & Memory affinity

– Shared memory & Data Stream Prefetch optimization

• Minimize Memory requirements

– Active Memory Expansion

Investment being made into AIX……

• Hot patching of AIX Kernel

– Apply fix to “Live” AIX Kernel

– No reboot of the partition required

– No recycling of the applications

• CAPI Enablement

– Support of CAPI resources

• SRIOV Enhancements

– FCoE & Fibre Channel

• Performance improvements

– Pthreads Trans Memory

• Future Considerations

– AME Enhancements

– Larger Max memory

– Split Core support

– DSO Enhancements

IBM i Levels

IBM i 7.2

• POWER7: Max Scale = 32 cores (SMT4); Max Partition = 96 cores (SMT4); Threads = ST, SMT2, SMT4, up to 384 threads in a single partition

• POWER8: Threads = ST, SMT2, SMT4, SMT8, up to 768 threads in a single partition

IBM i 7.1 TR8

• POWER7: Threads = ST, SMT2, SMT4, up to 256 threads in a single partition

• POWER8: Threads = ST, SMT2, SMT4, SMT8, up to 256 threads in a single partition
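
The thread limits above follow from cores multiplied by SMT ways, capped by what the OS level supports. A minimal sketch of that arithmetic is below; the helper `partition_threads` is hypothetical, and the 96-core figure for POWER8 is an assumption carried over from the POWER7 line, since the slide does not state it.

```python
# Hardware threads in a partition = cores x SMT ways, capped by the OS limit.
# The limits (384, 768, 256) come from the slide; the cap logic is an assumption.
SMT_WAYS = {"ST": 1, "SMT2": 2, "SMT4": 4, "SMT8": 8}


def partition_threads(cores, smt_mode, os_thread_cap):
    """Return the usable thread count for one partition."""
    return min(cores * SMT_WAYS[smt_mode], os_thread_cap)


print(partition_threads(96, "SMT4", 384))  # IBM i 7.2 on POWER7
print(partition_threads(96, "SMT8", 768))  # IBM i 7.2 on POWER8 (96 cores assumed)
print(partition_threads(96, "SMT8", 256))  # IBM i 7.1 TR8 cap on POWER8
```

Under these assumptions, SMT8 only increases the usable thread count on IBM i 7.2; on 7.1 TR8 the 256-thread cap applies first.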

IBM i 7.2 and POWER8 Highlights

• Enhancing Systems of Engagement and Systems of Record:

– POWER8 enables new levels of performance, reliability and scalability making it simpler to integrate systems of engagement and systems of record on a single system and single architecture

– IBM i 7.2 locks down business data, increases security and improves performance minimizing risk as you extend business systems to customers through mobile and cloud. And, combined with new encrypt/decrypt capabilities in POWER8, ensuring your data is protected has never been easier

• Key Capabilities:

– Powerful new features of DB2® for i ensure security of the data in a modern environment of mobile, social and network access

– IBM Navigator for i extends system management capabilities to manage and monitor performance services

– Integrated Security SSO application suite extended to include FTP and Telnet authentication with Kerberos

– PowerHA SystemMirror for i Express Edition introduces HyperSwap and improves system resiliency to ensure continual access for customers and employees

– Analytics: combined value of DB2 WebQuery & Cognos on Linux on Power

– Free Format RPG provides game changing enhancements for developers, making extension to mobile and social easier.

POWER8 Linux Distros

Status as of 2Q / 2014:

• RHEL 6: RHEL 6.5 (P7 mode on P8)

• RHEL 7: RHEL 7.0 (POWER8 support); RHEL 7.1 (LE, KVM support)

• SLES 11: SLES 11 + SP3 (P7 mode on P8)

• SLES 12: POWER8, LE, KVM

• Ubuntu (LE): 14.04.00/01 (P8 support)

Virtualization

PowerKVM: Open Virtualization for scale-out Linux Systems

• Kernel-Based Virtual Machine (KVM) open source hypervisor for virtualizing Linux guest VMs on POWER8 Linux scale-out servers

• Exploit existing Linux admin skills and tools

• Leverage Power systems performance and resiliency

PowerVM: Virtualization without Limits

• Delivers higher levels of utilization

• Simplified virtualization user experience with new performance views & capacity data

PowerVP: Virtualization Performance

• Improved memory and shared processor affinity to optimize performance and service levels

PowerVC (Virtualization Center): Increase IT productivity and agility

• Built on OpenStack

• Improved scalability, Active Directory support and shared storage pools enabling faster integration with clients' existing infrastructure

SmartCloud Entry for Power Systems*

• Extended capability to enable customization & quicker deployment of OpenStack-based cloud solutions

Power System Software: An intelligent IT infrastructure for Cloud, Big Data, Analytics & Mobile

Simplified Virtualization and Cloud Management

Expanded choice and enhanced value for the industry’s most scalable & flexible virtualization infrastructure for UNIX, Linux and IBM i

HMC Past

• Disjoint set of tools

• Multiple agents need to be installed in OS

• Minimal or lack of visualization

HMC in 2Q-2014

• Integrated Visual Monitor in HMC

• Standard set of interfaces for external APIs to consume data

Power Systems Performance Monitoring

Performance metric indicators & utilization dashboard

Processor, memory & I/O

Server & LPAR level information

Basic trend data collection and visualization

Identify bottlenecks

Early problem detection

REST based API to access all platform (PHYP & VIOS) metrics for Tivoli and third-party tools

Performance Monitoring – Metrics & Dashboard

Provides full PowerVM performance and capacity metrics via a single touch-point (HMC).

Power Virtualization Options

PowerVM (initial offering: 2004)

PowerVM is Power Virtualization that will continue to be enhanced to support AIX and IBM i workloads as well as Linux workloads.

PowerKVM (initial offering: Q2 2014)

PowerKVM provides an open source choice for Power Virtualization for Linux workloads. Best for clients that have Linux-centric admins.

PowerVM vs PowerKVM Comparison

Feature | PowerVM | PowerKVM
GA Availability | 2004 | Q2 2014
Supported Hardware | All P6, P7, P7+, P8 Systems | PowerLinux P8 Systems
Supported OS | AIX, IBM i & Linux | Linux
Workload Mobility | Supports AIX, IBM i & Linux | Linux
Basic Virtualization Management | IVM / HMC / FSM | Virtman/libvirt
Advanced Virtualization Management | PowerVC/VMControl | PowerVC, Vanilla OpenStack
Admin Type | Power Centric | Linux/x86 Centric
Established Security Track Record on Power | Yes | No
Open Source Hypervisor | No | Yes

PowerKVM Positioning

• First release available in 2014

• Focus: New Linux workloads for Power Systems

• Seamless transition for existing Linux admins to adopt Power Linux Virtualization without any training

• No HMC or other traditional IBM consoles

• Normal Linux management and OpenStack options

• PowerKVM only supports Linux guest VMs

• Cloud potential: Have many more small VMs than traditional Power Virtualization

• POWER8 PowerLinux hardware only

• Live Workload mobility support between PowerKVM servers

• Open Source Hypervisor: Hardware is abstracted by firmware

• Managed by OpenStack (PowerVC) or by off-the-shelf OpenStack or local Linux tools

OpenPOWER

The Era of Heterogeneous Computing is Coming…

Microprocessors and technology alone are no longer driving cost/performance improvements

[Chart: 2-socket systems vs 2-socket systems at constant cost (without price increases); contributions from processors and semiconductor technology]

System stack innovations are required to drive cost/performance

[Diagram: system stack layers: applications and services; systems management & cloud deployment; systems acceleration & HW/SW optimization; firmware, operating system and hypervisor; processors; semiconductor technology]

Some example use cases: workload acceleration, services delivery model, advanced memories, optimized system design, custom SOCs

OpenPOWER Extends Moore’s Law to the System

OpenPOWER will enable data centers to rethink their approach to technology.

Member companies may use POWER for custom open servers and components for Linux based cloud data centers.

OpenPOWER ecosystem partners can optimize the interactions of server building blocks (microprocessors, networking, I/O & other components) to tune performance.

How will the OpenPOWER Foundation benefit clients?

– OpenPOWER technology creates greater choice for customers

– Open and collaborative development model on the Power platform will create more opportunity for innovation

– New innovators will broaden the capability and value of the Power platform

What does this mean to the industry?

– Game changer on the competitive landscape of the server industry

– Will enable and drive innovation in the industry

– Provide more choice in the industry

Fueling an Open Development Community

[Logo chart: Platinum members, plus members grouped by Boards / Systems; I/O / Storage / Acceleration; Chip / SOC; System / Software / Integration; Implementation / HPC / Research]

Complete member list at www.openpowerfoundation.org

OpenPOWER: Growing Fast

[Logo chart, dated April 2014: members grouped by Boards/Systems; I/O, Storage, Acceleration; Chip/SOC; System/Software/Services]

“POWER” Built for Open Innovation

POWER Processors have a Leadership Set of Differentiated Interfaces

Innovation with OpenPOWER is taking place on all interfaces and with custom SOC designs

[Diagram: POWER8/8+ processors (PowerCore) with interfaces to GPU/other devices over NVLINK, to server class memory through the memory interface control, and to IBM & partner devices over CAPI]

Redesigning the Computer

• Extreme Parallelism available

• Targeted Software Accelerator packs

• IP Base Libraries

• Customer IP

• Reconfigurable Nature fights Commoditization

[Diagram labels: Transparent Tooling; Middleware-Like Abstraction; Services; CPUs; FPGA or GPU]

Strong Cores for Serial Codes

Runs Traditional & Legacy Software

Runs OS (Security, Virtualization, etc)

Greater robustness is achieved by mating of specializations….

When to Use FPGAs

• Transistor Efficiency & Extreme Parallelism

– Bit-level operations

– Variable-precision floating point

• Power-Performance Advantage

– >2x compared to Multicore (MIC) or GPGPU

– Unused LUTs are powered off

• Technology Scaling better than CPU/GPU

– FPGAs are not frequency or power limited yet

– 3D has great potential

• Dynamic reconfiguration

– Flexibility for application tuning at run-time vs. compile-time

• Additional advantages when FPGAs are network connected ...

– allows network as well as compute specialization

When to Use GPGPUs

• Extreme FLOPS & Parallelism

– Double-precision floating point leadership

– Hundreds of GPGPU cores

• Programming Ease & Software Group Interest

– CUDA & extensive libraries

– OpenCL

– IBM Java (coming soon)

• Bandwidth Advantage on Power

– Start w/PCIe gen3 x16 and then move to NVLink

• Leverage existing GPGPU eco-system and development base

– Lots of existing use-Cases to build on

– Heavy HPC investment in GPGPU

Power8 Invents CAPI

[Diagram: Power Processor, CAPI link, Coherently Attached Device]

• Coherent Attached Processor Proxy (CAPP) in processor

– Unit on processor that extends coherency to an attached device

– On-processor directory responds on behalf of off-chip device (filtering snoops)

• Coherency protocol tunneled over standard PCIe

– Eliminates the need for special I/Os and protocol logic

– CAPI utilizes standard Posted Writes and Non-posted Reads

– Reduces the complexity and bandwidth requirements of the attached device

• Enables attached device to be a peer to the processor

– Simplifies programming model between application

– Enables device to use same effective address as application running in processor

– Eliminates the cumbersome I/O device driver requirements

– Pinned memory not required

Why CAPI is Better than Traditional PCIe

[Diagram: Power Processor with CAPP, connected over PCIe to the IBM Supplied POWER Service Layer]

Typical I/O Model Flow: DD call → copy or pin source data → MMIO notify accelerator → acceleration → poll / interrupt completion → copy or unpin result data → return from DD completion

Flow with a Coherent Model: shared memory notify accelerator → acceleration → shared memory completion

Advantages of Coherent Attachment Over I/O Attachment

• Virtual Addressing & Data Caching

– Shared Memory

– Lower latency for highly referenced

• Easier, More Natural Programming Model

– Traditional thread level programming

– Long latency of I/O typically requires restructuring of application

• Enables Applications Not Possible on I/O

– Pointer chasing, etc…

Workloads to Innovate

• Start with what FPGAs are good at: Embarrassingly Parallel Problems

• Combine with CAPI strengths:

– Ease of programming

– Lack of device driver

– Shared memory & caching (host to accelerator communication)

• What do you get:

– Bitwise data manipulation (e.g. Deep Compression)

– Pattern recognition

– Encryption

– Monte Carlo

Statistical modeling for complex predictions

– Image Analytics & Biometrics

Facial recognition

Feature detection (e.g. cancer)

– Network Packet Processing & Inspection

– Bioinformatics (e.g. Sequence alignment)

– Reverse time migration (Oil & Gas)

– Ensemble Calculations of Numerical Weather Prediction

– Machine Learning

– And on and on

Example: File System Acceleration with CAPI-FPGA

• Compression

– IBM Gzip offers best combination of performance and compression rate

• De-Duplication

– Signature calculation is easy to integrate with compression datapath

• Crypto

– Crypto acceleration on P8

– FPGA is also a good fit, especially if crypto algorithm is non-standard

• Content analytics for real-time tagging

– IBM CAPI/FPGA accelerated text analytics

– IBM CAPI/FPGA accelerated image analytics

• POWER8 / CAPI benefits

– Very strong memory & I/O bandwidth

– Seamless integration with CAPI shared memory interface (the accelerator is just like another core)

– Variety of accelerator partners through OpenPOWER (Altera, Xilinx, NVIDIA, ...)

IBM Accelerated GZIP Compression

What it is:

An FPGA-based low-latency GZIP compressor & decompressor with single-thread throughput of ~2 GB/s and a compression rate significantly better than low-CPU-overhead compressors like snappy.

IBM Accelerated Text Processing

What it is:

A compiler/runtime system for accelerating text analytics on a shared-memory CPU-FPGA

Results:

Big speedup vs. multithreaded SW

To appear @: Hot Chips 2014

[Diagram: a rule language with SQL-like syntax feeds the systemT optimizer, which produces compiled operators for the systemT runtime (Java +); the runtime emits annotations on sample input text: "For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. 'We can be open source'"]

FPGA Image & Video Processing

[Diagram labels: Information Extraction; Object Recognition; Template Matching; Edge Detection, Feature Extraction, Segmentation]

Motivation

• Extract relevant information from input image to enable object recognition

– Information located where pixels change color (edges, blobs)

– Intrinsic properties of objects

– Object boundaries

• Applications requiring edge detection & feature extraction span a wide range of domains

– Computer/Machine Vision: Tracking, Object Recognition & Navigation

– General image processing: Compression

– Quality Control: Unsupervised Defect Identification

– Medical Imaging: Analysis + Diagnosis & Computer Guided Surgery

Approach

• Design fully-pipelined FPGA architectures (streaming application)

• Real-time, low-power, onboard image processing solution

• Sobel and Canny: extract contours/edges

• SURF: extract scale & rotation-invariant features

Custom Hardware Mapping

• 2D convolution with Gaussian filter: blur

• 2D convolution with Gaussian 1st derivative: extract edges

• 2D convolution with Gaussian 2nd derivative: extract features

FPGA acceleration results from:

• Parallel 2D convolution: process all pixels inside filter in parallel

• Parallel 2D convolution in x, y, z direction

• Parallel 2D convolution for all filter scales: total of 33 filters (Gaussian 1st derivative, 2nd derivative)
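
To make the 2D-convolution step concrete, here is a minimal software sketch of Sobel edge extraction. It is not the FPGA design from the slides: the function names are hypothetical, the loops run sequentially where the FPGA would evaluate every tap of the window in parallel, and the gradient magnitude uses the common |gx| + |gy| approximation.

```python
# Sobel kernels for horizontal and vertical intensity gradients.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def filter3x3(image, kernel):
    """Slide a 3x3 kernel over a 2D list of intensities (no padding, no kernel flip)."""
    height, width = len(image), len(image[0])
    out = [[0] * (width - 2) for _ in range(height - 2)]
    for y in range(height - 2):
        for x in range(width - 2):
            # An FPGA pipeline would compute these nine products in parallel.
            out[y][x] = sum(
                kernel[j][i] * image[y + j][x + i]
                for j in range(3)
                for i in range(3)
            )
    return out


def sobel_magnitude(image):
    """Approximate gradient magnitude as |gx| + |gy| per pixel."""
    gx = filter3x3(image, SOBEL_X)
    gy = filter3x3(image, SOBEL_Y)
    return [[abs(a) + abs(b) for a, b in zip(row_x, row_y)]
            for row_x, row_y in zip(gx, gy)]


# A 5x6 test image with a vertical edge between columns 2 and 3.
image = [[0, 0, 0, 9, 9, 9] for _ in range(5)]
edges = sobel_magnitude(image)
for row in edges:
    print(row)
```

The response is non-zero only in the output columns whose window straddles the edge, which is the "information located where pixels change" idea from the slide.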

Results & Conclusions

OpenCL vs. VHDL performance table

Algorithm | VHDL, Stratix 4: frames/sec | VHDL, Stratix 4: max freq. | VHDL, Stratix 5: frames/sec | VHDL, Stratix 5: max freq. | OpenCL, Stratix 5: frames/sec | OpenCL, Stratix 5: max freq.
Sobel | 475 | 170 | 909 | 300 | 870 | 300
Canny | 470 | 170 | 890 | 300 | 823 | 309
SURF | 392 | 170 | 870 | 300 | 804 | 283

OpenCL vs. VHDL productivity table

Algorithms | VHDL development | OpenCL development
Sobel, Canny, … | 6 months | 1 month

IBM Accelerated Image Processing

What it is:

Real-time, multi-HD-stream Harris-Laplace feature detection algorithms implemented in an FPGA

Performance:

166M pixels per second (i.e. multi-stream HD video)

To appear:

IBM Journal of Research & Development

CAPI Attached Flash Optimization

– Attach TMS Flash to POWER8 via CAPI coherent attach

– Issues read/write commands from applications to eliminate 97% of code path length

– Saves 20-30 cores per 1M IOPS

[Diagram, conventional path: Application → read/write syscall → FileSystem → disk and adapter DD (strategy(), iodone()); pin buffers, translate, map DMA, start I/O; interrupt, unmap, unpin, iodone scheduling; label: "20K instructions reduced to" (value not captured)]

[Diagram, CAPI path: Application → user library with POSIX async I/O style API (aio_read(), aio_write()) → shared memory work queue]

Flash as Slow Memory

[Diagram: client, network, server and flash; labels: acceptable latency, memory, conventional PCIe I/O]

Monte-Carlo CAPI Acceleration

Running 1 million iterations: at least 250x faster with CAPI FPGA + POWER8 core

Full execution of a Heston model pricing for a single security:

1. SOBOL sequence generator (pRNG)

2. Inverse Normal to create the non-linear distribution

3. Path generation

4. Pay-off function

Easier to code: reduces C code writing by 40x compared to non-CAPI FPGA
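
The four stages of the Heston pricing run can be sketched in software as follows. This is an illustrative CPU version, not the IBM FPGA implementation: it substitutes Python's pseudo-random generator for the Sobol sequence, assumes a European call pay-off, uses a simple Euler scheme with the variance floored at zero, and every parameter value in the usage line is hypothetical.

```python
import math
import random
from statistics import NormalDist


def uniform_stream(seed):
    """Stage 1 stand-in: pseudo-random uniforms (the slide uses a Sobol generator)."""
    rng = random.Random(seed)
    while True:
        # Keep values strictly inside (0, 1) so the inverse CDF is defined.
        yield min(max(rng.random(), 1e-12), 1.0 - 1e-12)


def heston_call_price(s0, v0, strike, r, kappa, theta, xi, rho,
                      t, steps, paths, seed=1):
    """Monte Carlo price of a European call under the Heston model."""
    inv_cdf = NormalDist().inv_cdf          # Stage 2: inverse normal transform
    uniforms = uniform_stream(seed)
    dt = t / steps
    payoff_sum = 0.0
    for _ in range(paths):
        s, v = s0, v0
        for _ in range(steps):              # Stage 3: path generation
            z1 = inv_cdf(next(uniforms))
            z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * inv_cdf(next(uniforms))
            vp = max(v, 0.0)                # floor the variance at zero
            s *= math.exp((r - 0.5 * vp) * dt + math.sqrt(vp * dt) * z1)
            v += kappa * (theta - vp) * dt + xi * math.sqrt(vp * dt) * z2
        payoff_sum += max(s - strike, 0.0)  # Stage 4: pay-off function
    return math.exp(-r * t) * payoff_sum / paths


price = heston_call_price(s0=100.0, v0=0.04, strike=100.0, r=0.02,
                          kappa=1.5, theta=0.04, xi=0.3, rho=-0.7,
                          t=1.0, steps=50, paths=2000)
print(f"estimated call price: {price:.2f}")
```

Each path is independent of the others, which is what lets an FPGA run many of them side by side.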

POWER8-based Network Acceleration

Faster workloads with less infrastructure

IBM Power Systems and Mellanox® Technologies partnering to simultaneously accelerate the network and compute for NoSQL workloads, exploiting high speed networks with Remote DMA and leveraging POWER8 high throughput, low latency I/O.

• 10x higher throughput: dramatically less data center infrastructure

• 10x lower latency: dramatically faster responsiveness to customers

[Map graphic: Eastern and Central regions; New York, Boston, Washington D.C., Chicago]

GPU Acceleration Example: Espresso

IBM Power Systems GPU Acceleration of Java Applications

Large global retailers collect petabytes of data

• Transactions generate tens of millions of filing cabinets of paper

• We’re only just discovering how to make this data useful

• Impossible to make this much data useful through human inspection

How does a retailer translate all of this data to business value?

• Group customers in segments with similar behavior

• Customize products and marketing programs

• Now possible on today's Big Data and Java Workload Acceleration

– Use of segmentation or clustering in the retail industry

• Look for non-obvious patterns in the sales data and react quickly

• Analyze across tens of thousands of dimensions quickly and accurately

• Lends itself nicely to a bit of computer science known as "k-means clustering"

– Outcome could lead to new products, revised products and advertising, launching new campaigns… wherever the data leads you…
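
For readers unfamiliar with k-means clustering, a minimal CPU-only sketch follows. It is not the GPU-accelerated Hadoop / Mahout implementation used in the demo; the `kmeans` function and the two-dimensional customer data (for example, visits per month and average basket size) are made up for illustration.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign each point to its nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its members; keep it if the cluster is empty.
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer features: two obvious segments.
customers = [(1.0, 2.0), (1.2, 1.8), (0.8, 2.1),
             (8.0, 9.0), (8.3, 9.2), (7.9, 8.8)]
centroids, clusters = kmeans(customers, k=2)
for centroid, members in zip(centroids, clusters):
    print(centroid, members)
```

The distance computations for different points are independent, which is why the assignment step maps well onto GPU hardware when the number of points and dimensions is large.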

GPU Espresso Demo

Imagine generating 100 times more ideas for new products and campaigns. Who can get you there?

• IBM and NVIDIA are demonstrating segmentation using GPU accelerated machine learning for clustering using Hadoop / Mahout

– OpenPOWER initiative with NVIDIA

– First product implementing GPU acceleration for

• Best-in-class ingredients

– IBM POWER8: designed for Big Data

– IBM Java

– NVIDIA CUDA GPU acceleration

– Ubuntu Little Endian Linux for POWER

• Achieving 8X performance improvement

OpenPOWER innovations benefit Clients

• NVIDIA acceleration built into IBM Power S824L: 8x faster than x86 Ivy Bridge on pattern extraction; 82x faster for Cognos BI and DB2 BLU

• Altera FPGA acceleration and IBM CAPI: Monte Carlo 250x faster than POWER8 core alone; reduced C code 40x over non-CAPI FPGA

• Data Engine for NoSQL: 24:1 server consolidation, 3x lower cost per user, 40TB CAPI-attached flash

• CAPI dev kit with FPGA card from Nallatech

• Tyan OpenPOWER Customer Reference System

• US Dept of Energy $325M supercomputing contract awarded to IBM, Mellanox, and NVIDIA

– DoE systems for science and stockpile stewardship

– Sierra and Summit systems to be >100 PF, 2 GB/core main memory, local NVRAM, and science performance 4x-8x Titan or Sequoia

University Research on Power8 Accelerators

• Photodynamic Therapy @ University of Toronto

• fMRI @ Western University

• Genomics @ University of Illinois Urbana-Champaign & Rice & Delft

• Seismic @ University of Texas

• Data Analytics @ North Carolina State University

• Financial Risk @ University of Florida

• The list is growing rapidly…

What is CAPI?

What’s in a name?

FPGA as an Accelerator

• FPGA: Field Programmable Gate Array

– It’s a re-programmable chip

– It can run fast (clock rates of 250 – 500 MHz or more)

– It has industry standard interfaces like PCIe Gen3

– The major FPGA suppliers, Altera and Xilinx, are OpenPOWER Foundation members

[Diagram: FPGA library with example blocks: gzip, Encrypt]

Source code for FPGAs has traditionally been written in RTL* (VHDL** or Verilog). Now, we also have OpenCL, a more programmer-friendly language.

*RTL = Register Transfer Level

**VHDL = VHSIC*** Hardware Description Language

***VHSIC = Very High Speed Integrated Circuit

Why is an Accelerator Faster?

Question: The POWER8 processor runs at ~3 GHz while our FPGA runs at 250 MHz. So why would an accelerator be better?

Answer: The FPGA is better for certain algorithms, such as those that are numerically intensive or have parallelism. The POWER8 processor has a finite set of instructions to implement the algorithm in SW. The FPGA is customized logic built for specific processing of an algorithm.
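
A rough way to see the trade-off is to compare work completed per second rather than clock rate. The sketch below is a back-of-the-envelope model, not a benchmark: the function names are hypothetical, and it assumes by default that the processor core retires one useful operation per cycle for the algorithm in question.

```python
def breakeven_ops_per_cycle(cpu_hz, fpga_hz, cpu_ops_per_cycle=1.0):
    """Operations the FPGA must complete per cycle to match the CPU's throughput."""
    return cpu_hz * cpu_ops_per_cycle / fpga_hz


def fpga_speedup(cpu_hz, fpga_hz, fpga_ops_per_cycle, cpu_ops_per_cycle=1.0):
    """Throughput ratio of FPGA to CPU under the simple ops-per-cycle model."""
    return (fpga_hz * fpga_ops_per_cycle) / (cpu_hz * cpu_ops_per_cycle)


CPU_HZ = 3e9      # ~3 GHz, from the slide
FPGA_HZ = 250e6   # 250 MHz, from the slide

print(breakeven_ops_per_cycle(CPU_HZ, FPGA_HZ))
# Hypothetical pipeline that completes 100 operations every FPGA cycle:
print(round(fpga_speedup(CPU_HZ, FPGA_HZ, fpga_ops_per_cycle=100), 2))
```

Under these assumptions the FPGA pulls ahead once its custom logic completes more than 12 operations per cycle, which is plausible for deeply pipelined or highly parallel algorithms.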

Example 1: Numerically Intensive Algorithm

[Diagram: variables (n, a, v, w) flow through dedicated FPGA blocks for Integral(), Sigma(), Sin() and Cos(); both paths end in "Done!"]

Example 2: Parallelism

Monte Carlo risk analysis to determine probability of financial success: given current finances, run 100 scenarios.

[Diagram: FPGA on PCIe; a variable distributor fans the variables out to parallel scenario engines and a results accumulator collects the outcomes; figure labels: 5, 10, 50, 100]
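
The distributor/accumulator structure of this example can be mimicked in software with a worker pool. The sketch below is only an analogy for the FPGA layout: the yearly return distribution, the spending figure and all function names are hypothetical, and threads stand in for the parallel hardware engines.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def run_scenario(job):
    """One scenario: does the balance stay positive for the whole horizon?"""
    seed, start_balance, years = job
    rng = random.Random(seed)              # independent stream per scenario
    balance = start_balance
    for _ in range(years):
        balance *= 1.0 + rng.gauss(0.05, 0.15)   # hypothetical yearly return
        balance -= 4.0                            # hypothetical yearly spending
        if balance <= 0:
            return False
    return True


def probability_of_success(start_balance, years, scenarios=100, workers=4):
    # "Variable distributor": one job per scenario.
    jobs = [(seed, start_balance, years) for seed in range(scenarios)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        outcomes = list(pool.map(run_scenario, jobs))
    # "Results accumulator": fraction of scenarios that succeeded.
    return sum(outcomes) / scenarios


p = probability_of_success(start_balance=100.0, years=30)
print(f"estimated probability of success: {p:.2f}")
```

Because each scenario depends only on its own inputs, the work scales with the number of engines, which is the property the slide attributes to the FPGA.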

So what is new?

Accelerators on FPGAs have been around for a long time…

Coherency makes the accelerator a peer to the POWER8 cores

What was done before CAPI?

[Diagram labels: POWER8; memory subsystem (virtual address); FPGA on PCIe; device driver storage area; input and output variables]

• Prior to CAPI, an application called a device driver to utilize an FPGA accelerator.

• The device driver performed a memory mapping operation.

• 3 versions of the data (not coherent).

• 1000s of instructions in the device driver.

CAPI Coherency

[Diagram labels: POWER8 core running the application; memory subsystem (virtual address); FPGA on PCIe; input and output variables]

• With CAPI, the FPGA shares memory with the cores.

• 1 coherent version of the data.

• No device driver call/instructions.

CAPI vs. I/O Device Driver: Data Prep

Typical I/O Model Flow:

DD call → copy or pin source data → MMIO notify accelerator → acceleration (application dependent, but equal to below) → poll / interrupt completion → copy or unpin result data → return from DD completion

• Instruction counts shown for the data prep steps: 300; 10,000; 3,000; 1,000; 1,000

• Latency: 7.9µs + 4.9µs, total ~13µs for data prep

Flow with a Coherent Model:

Shared memory notify accelerator → acceleration (application dependent, but equal to above) → shared memory completion

• Instruction counts shown: 400; 100

• Latency: 0.3µs + 0.06µs, total 0.36µs
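
The totals and the resulting ratios can be checked with a few lines of arithmetic. The figures below are copied from the slide; the variable names and the `summarize` helper are only for illustration.

```python
# Figures as shown on the slide for the data-prep portion of each flow.
IO_MODEL_INSTRUCTIONS = [300, 10_000, 3_000, 1_000, 1_000]
IO_MODEL_LATENCY_US = [7.9, 4.9]
COHERENT_INSTRUCTIONS = [400, 100]
COHERENT_LATENCY_US = [0.3, 0.06]


def summarize(instructions, latencies_us):
    """Return (total instructions, total latency in microseconds)."""
    return sum(instructions), round(sum(latencies_us), 2)


io_instr, io_us = summarize(IO_MODEL_INSTRUCTIONS, IO_MODEL_LATENCY_US)
capi_instr, capi_us = summarize(COHERENT_INSTRUCTIONS, COHERENT_LATENCY_US)

print(f"I/O model:      {io_instr} instructions, {io_us} us")
print(f"Coherent model: {capi_instr} instructions, {capi_us} us")
print(f"Instruction ratio: {io_instr / capi_instr:.1f}x")
print(f"Latency ratio:     {io_us / capi_us:.1f}x")
```

On these numbers the coherent flow needs roughly 30x fewer instructions and about 35x less time for data preparation; the acceleration step itself is the same in both flows.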

CAPI Differentiation: CAPI vs. I/O or Socket FPGA Solution

IBM Innovation | Customer Impact
FPGA is a peer to the processor (caching and translations by PSL) | Simple programming paradigm; higher performance
Architecture allows for any kind of FPGA or even an ASIC | Flexible solutions; connection to Flash, FC, EN…
Virtualization in the architecture | Applications can share accelerator

[Diagram: I/O paradigm vs CAPI paradigm]

POWER8 Processor

Technology

• 22 nm SOI, eDRAM, 15 ML, 650 mm²

Cores

• 12 cores (SMT8)

• 8 dispatch, 10 issue, 16 execution pipes

• 2x internal data flows/queues

• Enhanced prefetching

• 64 KB data cache, 32 KB instruction cache

Accelerators

• Crypto and memory expansion

• Transactional memory

• VMM assist

• Data move/VM mobility

Caches

• 512 KB SRAM L2 / core

• 96 MB eDRAM shared L3

Memory

• Up to 230 GB/s sustained bandwidth

Bus Interfaces

• Durable open memory attach interface

• Integrated PCIe Gen3

• SMP interconnect

• CAPI

Energy Management

• On-chip power management microcontroller

[Diagram: POWER8 Scale-Out Dual Chip Module: two chips, each with cores, per-core L2 caches and a chip interconnect]

Let’s take a closer look at how IBM Engineers made CAPI work

How CAPI Works

[Diagram: an algorithm split between the POWER8 processor and the CAPI Developer Kit card]

• Application portion (POWER8 processor): data set-up, control

• Acceleration portion (CAPI Developer Kit card): data or compute intensive, storage or external I/O

• Sharing the same memory space

• Accelerator is a peer to POWER8 core

CAPI technology connections

• Proprietary hardware to enable coherent acceleration

• Operating system enablement

– Ubuntu LE

– libcxl function calls

• Customer application and accelerator

• Application sets up data and calls the accelerator functional unit (AFU)

• AFU reads and writes coherent data across the PCIe and communicates with the application

– PSL cache holds coherent data for quick AFU access

CAPI solution flow

[Diagram: application on the POWER8 processor with coherent memory on one side; IBM-supplied PSL and the AFU on the other, connected by the CTL, MMIO, CMD, Buffer and Resp interfaces]

1. Connect to accelerator: the application opens the device (cxl_afu_open_dev). The AFU is reserved for work.

2. Set the Work Element Descriptor (WED) at AddrX. It may contain addresses of other data structures. The AFU understands the WED content and any other addressed data structures.

3. Start accelerator: the application attaches the device (cxl_afu_attach). Over the CTL interface the AFU is reset, PSL_WED_Ax is set to AddrX and AFU_CNTL_An[E] is set; jea gets AddrX and jcom gets start.

4. The AFU fetches AddrX (the WED) and starts operation, using the CMD, Buffer and Resp interfaces.

5. MMIO interface: if required, the application can read or write AFU registers. The AFU continues to work using this interface.

6. The AFU finishes (mechanism is user defined): it de-asserts RUNNING and asserts DONE. The application knows the AFU is finished (mechanism is user defined). The application can start again from the top or free the AFU by freeing the device (cxl_afu_free) over the CTL interface.
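
The sequence of steps in the solution flow can be illustrated with a small pure-software simulation. Nothing here calls libcxl or touches hardware: the `SimulatedAFU` class, the dictionary used as shared memory, the WED field names and the squaring "algorithm" are all hypothetical stand-ins chosen to show the order of operations.

```python
class SimulatedAFU:
    """Hypothetical stand-in for an accelerator functional unit (no hardware)."""

    def __init__(self):
        self.reserved = False
        self.running = False
        self.done = False
        self.wed_addr = None

    def open(self):
        # Step 1: connect to the accelerator (cxl_afu_open_dev on the slide).
        self.reserved = True

    def attach(self, wed_addr, memory):
        # Step 3: start the accelerator (cxl_afu_attach on the slide).
        if not self.reserved:
            raise RuntimeError("AFU must be opened before attach")
        self.wed_addr = wed_addr
        self.running, self.done = True, False
        self._work(memory)

    def _work(self, memory):
        # Step 4: fetch the WED at AddrX and follow its pointers.
        wed = memory[self.wed_addr]
        data = memory[wed["input_addr"]]
        memory[wed["output_addr"]] = [x * x for x in data]
        # Step 6: de-assert RUNNING, assert DONE.
        self.running, self.done = False, True

    def free(self):
        # Release the AFU (cxl_afu_free on the slide).
        self.reserved = False


# Shared, coherent memory modeled as address -> object.
memory = {0x2000: [1, 2, 3]}
# Step 2: the application places the WED at AddrX (0x1000 here).
memory[0x1000] = {"input_addr": 0x2000, "output_addr": 0x3000}

afu = SimulatedAFU()
afu.open()
afu.attach(0x1000, memory)
if afu.done:
    result = memory[0x3000]   # the application reads results from shared memory
    print(result)
afu.free()
```

The point of the sketch is that the application and the accelerator exchange only an address; the data itself stays in one shared memory and is never copied through a device driver.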

POWER8 with CAPI Cards

[Photos: front and side views showing POWER8 modules and CAPI Dev Kit cards]

Basic concepts of CAPI: CAPI vs. CAPI Solutions

CAPI (platform innovation)

• CAPI is a platform to enable acceleration

• CAPI provides an infrastructure to improve performance of an application through FPGA acceleration

– Enables customer-defined acceleration within the processor complex

• CAPI allows implementation of a wide range of accelerators to optimally address many different customer challenges

– Each implementation is a unique CAPI Solution

CAPI Solution (specific customer solution)

• A CAPI Solution is a specific implementation of an algorithm that uses an FPGA + application

• A CAPI Solution requires logic designers and programmers to implement the solution

• CAPI Solution examples:

– Flash Appliance (IBM Data Engine for NoSQL)

– Monte Carlo algorithm

Why Accelerate on CAPI?

• Reasons to consider CAPI Acceleration

– Higher Performance

   • If your customer has a complex application running on a core, consider CAPI for better performance

   • If your customer already does I/O attached FPGA acceleration, CAPI will simplify their software and provide better performance

– Lower IT Costs

   • By moving workload to CAPI, your customer will need fewer cores

   • In some cases, such as the IBM Data Engine for NoSQL, CAPI can do the same work with far less infrastructure

– Lower Power

   • Running acceleration on an FPGA can result in lower power consumption vs. running the application as software on a core

When considering CAPI for a particular solution, we compare it to:

1. The same solution running as software –OR–

2. The same solution running on an I/O attached FPGA
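For the first comparison (the same solution running as software), one rough way to estimate the benefit is Amdahl's law: only the acceleration portion of the runtime gets faster, while the application portion (data set-up, control) does not. The sketch below is a back-of-the-envelope model, not an IBM sizing method. The function name and the example inputs are hypothetical, and the model ignores data movement and set-up overhead.

```python
def overall_speedup(accel_fraction: float, accel_speedup: float) -> float:
    """Amdahl's law.

    accel_fraction: share of the original runtime that moves to the accelerator
    accel_speedup:  how much faster that share runs on the accelerator
    """
    if not 0.0 <= accel_fraction <= 1.0:
        raise ValueError("accel_fraction must be between 0 and 1")
    if accel_speedup <= 0:
        raise ValueError("accel_speedup must be positive")
    return 1.0 / ((1.0 - accel_fraction) + accel_fraction / accel_speedup)


# Hypothetical inputs: 80% of the runtime is offloaded and runs 10x faster.
example = overall_speedup(0.8, 10.0)
```

With these made-up inputs the overall gain is about 3.6x rather than 10x, which is why the size of the acceleration portion matters as much as the raw accelerator speed.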

CAPI ecosystem partners and consumers

• IBM CAPI Solutions (IBM Data Engine for NoSQL)

– Have a client who wants their IBM application to be accelerated on CAPI (ex: DB2, CPLEX, Streams)? Contact: Jonathan Dement (dementj@us.ibm.com)

• CAPI-Apps (partner solutions)

– Have a client or partner who wants to create a CAPI-App and sell it to others? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile (bwile@us.ibm.com) about the opportunity

• Clients with their own proprietary solutions

– Have a client or partner who wants to create a proprietary CAPI Solution? Point them to the CAPI resources in this doc (IBM and Nallatech websites) and email Bruce Wile (bwile@us.ibm.com)

• Why tell Bruce Wile about the opportunity?

– Depending on the size of the opportunity, we will engage the CAPI Customer Enablement Team

Two Paths into CAPI

• CAPI Developer Kit (CAPI Solutions): clients create their own, proprietary business solution

• CAPI Market Solutions (CAPI App Solutions): IBM & Partners create business solutions for the CAPI Market; clients buy pre-packaged solutions from the CAPI Market

Open Development Driving CAPI Solutions

Boards / Systems

I/O / Storage / Acceleration

Chip / SOC

System / Software / Integration

Implementation / HPC / Research

Complete member list at www.openpowerfoundation.org

Potential Markets for CAPI Solutions

• Markets: Medicine; Finance/Insurance; Visual/Biometric Analysis; Oil & Gas; Weather; Big Data/Database/Compute; Social/Media; Manufacturing

• Example workloads: Radiation Therapy; Pharmaceuticals; Public Health; Image Analysis; Genomics; Risk Analysis; Monte Carlo; Pattern Analysis; Retail Security; Facial Recognition; Network Packet Processing; Database Acceleration/KVS; Machine Learning; Bitwise Data Manipulation; Compression/Encryption; Ensemble Calculations of Numerical Weather Prediction; Reverse Time Migration; Data Analytics; Pattern Recognition; Fluid Dynamics; 3D Modeling CAD; Pipeline Analysis & Flow

• Solution types: Specialized Algorithms; Deep Computation and Critical Runtime Jobs; Edge of Network; JPEG & Video processing; Database Acceleration & Fast Storage

CAPI Availability

• CAPI Developer Kit

– Procure through Nallatech

– For customers considering creating their own CAPI Solution

– CAPI Decision and Process Guide

– Requires POWER8 Server

– Available now

– See www.nallatech.com/capi

• First CAPI Solution: IBM Data Engine for NoSQL

– Procure through IBM

– GA in early 2015

• See: http://www.ibm.com/support/customercare/sas/f/capi/home.html

CAPI Developer Kit – FPGA Card

• Altera Stratix V FPGA

• Dual 10G SFP+

• 2 banks of SDRAM

• PCI-E Gen3

Complete Datasheet

IBM POWER8™ Server

http://www.ibm.com/support/customercare/sas/f/capi/home.html

Printed in the United States of America September 2015

IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corp.,

registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies.

A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at

www.ibm.com/legal/copytrade.shtml.

The following terms are trademarks or registered trademarks licensed by Power.org in the United States and/or other countries: Power ISA.

Information on the list of U.S. trademarks licensed by Power.org may be found at www.power.org/about/brand-center/.

Linux is a trademark of Linus Torvalds in the United States, other countries, or both.

Other company, product, and service names may be trademarks or service marks of others.

All information contained in this document is subject to change without notice. The products described in this document

are NOT intended for use in applications such as implantation, life support, or other hazardous uses where malfunction

could result in death, bodily injury, or catastrophic property damage. The information contained in this document does not

affect or change IBM product specifications or warranties. Nothing in this document shall operate as an express or implied

license or indemnity under the intellectual property rights of IBM or third parties. All information contained in this document

was obtained in specific environments, and is presented as an illustration. The results obtained in other operating

environments may vary.

While the information contained herein is believed to be accurate, such information is preliminary, and should not be relied upon for accuracy or completeness, and no representations

or warranties of accuracy or completeness are made.

Note: This document contains information on products in the design, sampling and/or initial production phases

of development. This information is subject to change without notice. Verify with your IBM field applications

engineer that you have the latest version of this document before finalizing a design.

You may use this documentation solely for developing technology products compatible with Power Architecture®. You may not modify or distribute this documentation. No license,

express or implied, by estoppel or otherwise to any intellectual property rights is granted by this document.

THE INFORMATION CONTAINED IN THIS DOCUMENT IS PROVIDED ON AN “AS IS” BASIS. In no event will IBM be

liable for damages arising directly or indirectly from any use of the information contained in this document.

IBM Systems and Technology Group

2070 Route 52, Bldg. 330

Hopewell Junction, NY 12533-6351

The IBM home page can be found at ibm.com®.

Version 1.0

29 September 2014—IBM Confidential
