Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Architectures
ARM Research – Software & Large Scale Systems
Node Architecture: From Present Technology to Future Exascale Nodes
Balance, Flexibility, and Partnership: An ARM Approach to Future HPC Node Architectures
Eric Van Hensbergen, Senior Principal Research Engineer
ARM Background
ARM: Architecture
ARM: Microarchitecture
Low-power processing solutions for applications, real-time/control and microcontroller end markets
Scalable roadmap for application efficient computing
Software compatibility across a diverse application range
ARM: GPGPUs
Bringing visual computing to life
Combining the best of the CPU and GPU
Putting massive amounts of processing power into the hands of the application developer
ARM: Supporting IP
System performance with power efficiency
Enabling distributed processing with scalable architectures
Simplifying software elements through hardware coherency
ARM: Optimizations for Foundries
Advanced physical IP tuned for a specific foundry and process technology
Artisan Physical IP offered for more than 100 processes from 250nm to 20nm: broadest coverage in the industry
POP IP for ARM Cortex processors and Mali GPUs delivers fast time to market, low risk, and leadership performance
ARM: Software Tools and Energy Efficient Platforms
The broad ARM software ecosystem is continually advancing and evolving
Optimized software solutions enable increased system efficiency
ARM: Segments
Internet of Things, Embedded, Mobile, Laptops, Enterprise, Networking, Supercomputing
ARM: Business Model
[Diagram: ARM, Semiconductor Partner, OEM, Customer. Flow labels: licence, royalty, IP, chips]
ARM invests in ecosystem
Ecosystem provides value chain with support & products based on ARM technology
Exascale Challenge: Power Efficiency
[Chart: Top 500 MFLOPs/W over time, Nov-11 to May-15; y-axis 0 to 4500 MFLOPs/W; series: #1, #500, AVG, MIN, MAX]
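The MFLOPs/W trend above translates directly into a power budget. As a back-of-the-envelope sketch, sustaining one exaflop/s at a given efficiency needs 10^18 / (efficiency x 10^6) watts. The efficiency values below are illustrative points taken from the chart's axis range, not measured results, and the function name is hypothetical.

```python
EXAFLOP = 1e18  # floating-point operations per second in one exaflop/s


def exascale_power_megawatts(mflops_per_watt: float) -> float:
    """Power in MW needed to sustain 1 exaflop/s at the given efficiency."""
    if mflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    flops_per_watt = mflops_per_watt * 1e6  # MFLOPs/W -> FLOPs/W
    watts = EXAFLOP / flops_per_watt
    return watts / 1e6  # W -> MW


if __name__ == "__main__":
    # Illustrative efficiencies within the chart's 0-4500 MFLOPs/W axis range.
    for efficiency in (500, 2000, 4500):
        power = exascale_power_megawatts(efficiency)
        print(f"{efficiency:>5} MFLOPs/W -> {power:8.1f} MW per exaflop/s")
```

Even at the top of the chart's axis, the result is in the hundreds of megawatts, which is why power efficiency is framed here as the exascale challenge.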
Maximizing Throughput Density: per mm2, per Watt
[Chart: relative performance (Spec2K6 rate) on a 20-thread workload; y-axis 0 to 1.2. Bars: Xeon-E5 2650 V3 (10 cores, 20 threads; 105W*), Cortex-A57 (20 cores, 20 threads; <30W), Cortex-A72 (20 cores, 20 threads; <30W), Xeon-E5 2660 V3 (10 cores, 20 threads; 105W*). Frequency labels on the chart: 2.7GHz, 2.5GHz]
Comparison for equivalent number of threads. Platforms used:
• Xeon-E5 2660 10C20T platform (measured)
• Xeon-E5 2650 10C20T platform (measured)
• GCC compiler v4.9 with -O3 flag
• TDP rating source: ark.intel.com
Estimated result on example 20C ARM Cortex platforms with CCN-508, 28MB total L2+L3 cache:
• Per-core measurements on RTL with relevant memory system
• GCC compiler v4.9 with -O3 flag
• Scaled to 20T based on modelled and empirical results
• Power estimated in 16nm based on ARM internal implementations for entire CPU+interconnect complex including 20xCPU, CCN-508, L2+L3 caches
• Actual results on silicon platforms may vary
ARM Solution Benefits:
• Less than 1/3rd the power for equivalent performance*
• Allows power headroom for specialized computing or greater thread density
* A portion of Intel TDP power will be consumed by IO. The Cortex-A72 and Cortex-A57 estimates exclude IO power.
Cortex-A72: Ideal for dense compute environments
Cortex-A72 is <20% of the size
• Single Broadwell CPU + 256K L2 [1]: ~8mm2
• Single Cortex-A72 core [2]: ~1.15mm2
• Cortex-A72 MP4 + 2MB L2 [3]: ~8mm2
A quad-core Cortex-A72 with 8x the L2 cache RAM is the same size
[1] Source: Estimated from die-shot image provided by Intel at IDF 2014. [2], [3] Source: ARM trial implementations on TSMC 16FF+, using ARM Artisan libraries
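The headline ratios on the two slides above can be checked with simple arithmetic. This is a minimal sketch using only the figures quoted on the slides; the ARM power figure is an upper bound ("<30W") rather than a measurement, the two power figures are not like-for-like (the Intel TDP includes some IO power), and the relative performance bars are not recoverable from the transcript, so only the power and area ratios are computed. All variable names are illustrative.

```python
# Figures quoted on the slides (bounds and estimates, not measurements).
XEON_TDP_W = 105.0         # "105W*": Intel TDP rating, includes some IO power
ARM_POWER_BOUND_W = 30.0   # "<30W": estimated CPU + interconnect, excludes IO
BROADWELL_CORE_MM2 = 8.0   # "~8mm2": single Broadwell CPU + 256K L2
A72_CORE_MM2 = 1.15        # "~1.15mm2": single Cortex-A72 core
A72_MP4_MM2 = 8.0          # "~8mm2": quad-core Cortex-A72 + 2MB L2

# Upper bound on the ARM/Xeon power ratio, to compare with "less than 1/3rd".
power_ratio = ARM_POWER_BOUND_W / XEON_TDP_W

# Single-core area ratio, to compare with "<20% of the size".
core_area_ratio = A72_CORE_MM2 / BROADWELL_CORE_MM2

# L2 capacity ratio behind the "8x the L2 cache RAM" statement (2MB vs 256KB).
l2_ratio = (2 * 1024) / 256

print(f"power ratio (upper bound): {power_ratio:.3f}")
print(f"core area ratio:           {core_area_ratio:.3f}")
print(f"L2 capacity ratio:         {l2_ratio:.0f}x")
print(f"MP4 cluster vs Broadwell core area: {A72_MP4_MM2 / BROADWELL_CORE_MM2:.1f}x")
```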
Reminder: Embedded SoC in HPC is not a new concept
[Chart: Top 500 RMAX/Core over time, Nov-11 to May-15; y-axis 0 to 120; series: #1, #500, AVG, MIN, MAX]
Mont-Blanc
Objectives
• To develop a full energy-efficient HPC prototype using low-power commercially available embedded technology.
• To develop a portfolio of exascale applications to be run on this new generation of HPC systems.
• To design a next-generation HPC system together with a range of embedded technologies in order to overcome the limitations identified in the prototype system.
Status
• Prototype operational: 8 standard BullX chassis, 72 compute blades, 1080 compute cards, 2160 ARM Cortex-A15 processors, 1080 ARM Mali-T604 GPUs.
• 11 scientific applications ported and in use for evaluation of the prototype.
• Research ongoing into areas such as memory, on-chip and off-chip interconnect, and compute acceleration.
Mont-Blanc prototype installed in the Torre Girona chapel at BSC
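The prototype counts quoted above imply a regular packaging hierarchy. The sketch below derives the per-unit figures from the slide's totals alone; it assumes the components are distributed evenly, which the slide does not state explicitly, and the helper name is hypothetical.

```python
# Totals quoted on the Mont-Blanc status slide.
CHASSIS = 8
BLADES = 72
CARDS = 1080
CORTEX_A15 = 2160
MALI_T604 = 1080


def per_unit(total: int, units: int) -> int:
    """Return total/units, insisting that the division is exact."""
    quotient, remainder = divmod(total, units)
    if remainder:
        raise ValueError(f"{total} does not divide evenly into {units} units")
    return quotient


# Assumes an even distribution at every level of the hierarchy.
blades_per_chassis = per_unit(BLADES, CHASSIS)
cards_per_blade = per_unit(CARDS, BLADES)
a15_per_card = per_unit(CORTEX_A15, CARDS)
gpus_per_card = per_unit(MALI_T604, CARDS)

print(f"{blades_per_chassis} blades per chassis")
print(f"{cards_per_blade} compute cards per blade")
print(f"{a15_per_card} Cortex-A15 and {gpus_per_card} Mali-T604 per compute card")
```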
The Hartree Centre, Science & Technology Facilities Council, UK
The Energy Efficient Computing Research Programme has been established through a £19M capital grant from the Department for Business, Innovation and Skills to establish a centre of best practice in the UK that will enable users of computer systems to achieve the same outcomes while minimising the consumption of energy.
“This is a fantastic opportunity to meet the challenge of developing a computationally powerful and energy-efficient platform based on the 64-bit ARM v8 microprocessor … The Hartree centre will be actively developing a robust software ecosystem encompassing compilers, linkers, numerical libraries and tools – all of which are fundamental to the adoption of these types of technologies.”
Lenovo is providing a NeXtScale system: 1,152 64-bit Cavium ThunderX ARM cores in 6U.
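The density of the Hartree system above can be worked out from the quoted totals. This is a rough sketch: the cores-per-U figure follows directly from the slide, while the socket count assumes every ThunderX part is the 48-core maximum mentioned elsewhere in this deck, which the Hartree slide itself does not confirm.

```python
TOTAL_CORES = 1152   # quoted: 1,152 ThunderX cores
RACK_UNITS = 6       # quoted: in 6U

# Assumption: each socket carries the 48-core maximum ThunderX configuration.
ASSUMED_CORES_PER_SOCKET = 48

cores_per_rack_unit = TOTAL_CORES / RACK_UNITS
sockets, leftover_cores = divmod(TOTAL_CORES, ASSUMED_CORES_PER_SOCKET)

print(f"{cores_per_rack_unit:.0f} cores per rack unit")
print(f"{sockets} sockets under the 48-core assumption "
      f"({leftover_cores} cores left over)")
```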
Balance: One Size Core Doesn’t Fit All
[Chart: Top 500 Efficiency over Time (RMAX/RPEAK), Nov-11 to May-15; y-axis 0% to 120%; series: #1, #500, AVG, MIN, MAX. Annotation: HPCG (1.8%-4.07%) (1.8%-10%)]
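The efficiency metric in the chart above is the ratio of achieved to theoretical performance, RMAX/RPEAK. The sketch below computes it and, assuming the HPCG annotation expresses HPCG results as a fraction of peak (the transcript does not say so explicitly), shows what the 1.8% to 10% range would mean for a machine. The 1000 TFLOP/s peak and 800 TFLOP/s RMAX are made-up illustrative inputs, not data from the Top 500 list.

```python
def efficiency(rmax: float, rpeak: float) -> float:
    """Fraction of theoretical peak (RPEAK) achieved by a benchmark run (RMAX)."""
    if rpeak <= 0:
        raise ValueError("rpeak must be positive")
    return rmax / rpeak


def sustained(rpeak: float, fraction: float) -> float:
    """Performance sustained when a workload reaches `fraction` of peak."""
    return rpeak * fraction


if __name__ == "__main__":
    # Hypothetical machine: values chosen only to illustrate the ratio.
    rpeak_tflops = 1000.0
    rmax_tflops = 800.0
    print(f"HPL efficiency: {efficiency(rmax_tflops, rpeak_tflops):.0%}")

    # Range taken from the slide's HPCG annotation, read as fraction of peak.
    for fraction in (0.018, 0.10):
        print(f"HPCG at {fraction:.1%} of peak: "
              f"{sustained(rpeak_tflops, fraction):.0f} TFLOP/s")
```

The gap between the two benchmarks is the point of the slide: a node balanced for dense floating-point work can leave most of its peak unused on memory-bound workloads.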
Seeking Balance: FastForward II
ARM Focus Areas
• Evaluation of next-generation architecture in the context of DoE applications
• Evaluation of throughput and multithreaded core designs for HPC
• Next-generation memory technologies
• Design study to find the right balance of core types, memory, and interconnect
• Development and integration of full-system simulation technology with other partners
• Workload characterization and optimization for the ARM architecture
https://asc.llnl.gov/fastforward/
Flexibility
Current ARM Microarchitectural Flavors
• Big Cores
  • Performance-optimized cores
  • Pro: High single-thread performance
  • Challenge: Higher power and larger area
• Little Cores
  • Efficiency-optimized cores
  • Pro: Lowest energy
  • Challenge: Requires massive concurrency to yield performance
• GPU/Throughput Accelerator
  • Highly specialized processors adapted from the gaming/graphics market space
  • Pro: Extremely dense performance
  • Challenge: Productive programmability
What class of ARM IP?
RESEARCH & DEVELOPMENT
Source: HotChips 2014
Source: Broadcom presentation at IDC HPC User Forum, April 7, 2014
Cavium ThunderX
• Up to 48 custom ARMv8-A cores @ 2.5GHz
• 1S and 2S configuration
• Up to 4x 72-bit DDR3/4 memory controllers
• Family-specific I/Os
• Standards-based low-latency Ethernet fabric
• virtSOC™: Virtualization from core to I/O
• Family-specific accelerators: Storage/Networking/Compute/Security
The benefits of this workload-specific approach:
• Efficiency (performance, latency, power, and scalability)
• Best-in-class optimized solution for the specific workload
[Diagram labels: Fully Virtualized; Networking; Storage Controller; Accelerators; Optimized Power; Lower Cost; Security; Virtualized Network & Storage; Storage & Analytics Accelerator; High Speed Network; ARM 64-bit Processor; Security Accelerator; Network Accelerator]
ThunderX 2S Reference Platform
One size core doesn’t fit all, but one architecture can.
Partnership
Challenges
Challenges for Exascale
• Memory bandwidth
• But… with low memory latency
• And… with low cost
• But what about… data movement costs
• Making solutions to the above something that can become commodity, so that price is not the primary barrier to Exascale.
Challenge: Ecosystem
ARM Math Libraries
In November 2015, we plan to offer a commercially supported set of 64-bit ARMv8 numerical libraries for scientific computing, built on technology from NAG.
Enable ARM partners’ computational kernels tuned for their SoC implementation.
Unified, validated framework.
A57, A72 and Cavium® ThunderX optimizations available at launch date, others to follow.
All implementations hosted on arm.com
By the end of 2015, an HPC-specific ARM microsite will offer downloads, technical reference material, how-to-guides and third-party software recommendations for the scientific computing community.
2015 Focus: BLAS, LAPACK, FFT
Compilers
Commercial:
• PathScale (Alpha): C, C++, Fortran; OpenMP 4.0
• NAG (Alpha): Fortran; OpenMP 3.1
Open-Source:
• GCC: C, C++, Fortran; OpenMP 4.0
• LLVM: C, C++; OpenMP 3.1
November 2014: PathScale provides the full EKOPath compiler suite including OpenACC and OpenMP 4.0 C/C++/Fortran support for ARMv8 to support HPC and Enterprise customers exploring the power efficiencies of these devices. As an enabling technology, EKOPath gives our customers the ability to compile for native ARMv8 CPU or accelerated architectures that return the fastest time to solution. Your application defines the benchmark, EKOPath lets you evaluate the new architecture with your code, across either Intel64/AMD64 and now directly compare it against the performance of enterprise ready ARMv8 processors.
November 2013: The Numerical Algorithms Group (NAG), the global numerical software and HPC services company, announces a new technical collaboration with ARM®, the world's leading semiconductor IP supplier. NAG's highly skilled team of HPC experts, numerical analysts and computer scientists will ensure the algorithms in the NAG Numerical Library and the facilities of the NAG FORTRAN Compiler are available for use on ARM's 64-bit ARMv8-A architecture-based platforms.
• Open-source focus on AArch64 correctness up to 2014.
• Now improving core performance through mostly architectural (not microarchitectural) optimisations.
• Command-line enablement for new ARM cores (e.g. A72).
• Most focus and improvement in floating-point code.
Current work:
• Improvements for big-endian ARM.
• Floating-point rounding mode optimization.
• Making use of more sophisticated ARM instructions.
• Scheduler / register allocation improvements.
• Improved memcpy, memset, glibc string routines.
• Improved performance on NEON intrinsics.
Current work:
• Vectorizer improvement.
• Loop unrolling/interleaving.
• Improved register allocation.
• ABI conformance.
• Improve inliner heuristics.
• Scheduling for Cortex-A57.
• Software pipelining.
• Jump threading.
Research
• Co-Design
• Workload optimizations and characterization for HPC & big data
• Architectural & system design sensitivity sweeps for performance & energy
• Simulation and modeling infrastructure
Architecture
• ARM Architecture Partner Engagements
• Evolve architecture envelope, allowing partners to better accommodate requirements of HPC and Data Intensive Computing
• Improved support for massive concurrency
Ecosystem
• Software Ecosystem Enablement
• Operating systems and runtimes targeted and optimized for ARM HPC
• Optimized math library enablement of ARM architecture
• Parallel and vector optimizing compilers and runtimes
• Cross-stack optimizations for resiliency and energy efficiency
Microarchitecture
• Broader ARM Partner Engagement
• Higher performance core designs with increased computational throughput
• Decreased memory latency and increased bandwidth
• Multi-thread optimized cores
Thanks
We are growing the HPC research team and have several entry-level positions open for PhDs. Come talk to me if you are interested, or apply directly: http://goo.gl/re11oi