Post on 07-May-2015
Current Trends in High Performance Computing
Dr. Putchong Uthayopas, Department Head, Department of Computer Engineering,
Faculty of Engineering, Kasetsart University, Bangkok, Thailand.
pu@ku.ac.th
I am pleased to be here!
Introduction
• High Performance Computing – an area of computing that involves the hardware and software that help solve large and complex problems fast
• Many applications
– Science and engineering research
• CFD, genomics, automobile design, drug discovery
– High-performance business analysis
• Knowledge discovery
• Risk analysis
• Stock portfolio management
– Business is moving more toward the analysis of data from data warehouses
Why do we need HPC?
• Change in scientific discovery – from experiment to simulation and visualization
• Critical need to solve ever larger problems
– Global climate modeling
– Life science
– Global warming
• Modern business needs
– Design of more complex machinery
– More complex electronics design
– Complex and large-scale financial system analysis
– More complex data analysis
Top 500: Fastest Computer on Our Planet
• List of the 500 most powerful supercomputers generated twice a year (June and November)
• Latest was announced in June 2012
Sequoia @ Lawrence Livermore Lab
• BlueGene/Q
• 34 login nodes – 48 CPUs/node, 64 GB RAM
• 98,304 compute nodes – 16 CPUs/node, 16 GB RAM
• IBM Power-based: 1,572,864 CPU cores, 1.6 PB RAM
• Peak performance: 20,132 TFlops
Performance Development
Projected Performance Development
Top 500: Application Area
Processors are just not running faster
• Processor speed kept increasing for the last 20 years
• Common techniques
– Smaller process technology
– Increased clock speed
– Improved microarchitecture
• Pentium, Pentium II, Pentium III, Pentium IV, Centrino, Core
Pitfalls
• Smaller process technology leads to denser transistors, but…
– Heat dissipation
– Noise – reduced voltage
• Increased clock speed – more power used, since CMOS consumes power only when switching
• Improved microarchitecture – small improvement for a much more complex design
• The only solution left is to use concurrency: doing many things at the same time
Parallel Computing
• Speeding up execution by splitting a task into many independent subtasks and running them on multiple processors or cores
– Break a large task into many small subtasks
– Execute these subtasks on multiple cores or processors
– Collect the results together
How to achieve concurrency
• Adding more concurrency into hardware
• Processor
• I/O
• Memory
• Adding more concurrency into software
– How to express parallelism better in software
• Adding more concurrency into algorithms
– How to do many things at the same time
– How to make people think in parallel
The coming (back) of multicore
Hybrid Architecture
Rationale for Hybrid Architecture
• Most scientific applications have fine-grained parallelism inside
– CFD, financial computation, image processing
• Energy efficiency
– Employing a large number of slow processors in parallel can help lower power consumption and heat
Two main approaches
• Using multithreading and scaled-down processors compatible with conventional processors
– Intel MIC
• Using a very large number of small processor cores in a SIMD model, evolving from graphics technology
– NVIDIA GPU
– AMD Fusion
Many Integrated Core Architecture
• An effort by Intel to add a large number of cores into a computing system
Multithreading Concept
Challenges
• A large number of cores will have to divide memory among them
– Much smaller memory per core
– Demands high memory bandwidth
• Still needs an effective fine-grained parallel programming model
• No free lunch: programmers have to do some work
What is GPU Computing?
Computing with CPU + GPU: Heterogeneous Computing
146X – Medical Imaging, U of Utah
36X – Molecular Dynamics, U of Illinois, Urbana
18X – Video Transcoding, Elemental Tech
50X – Matlab Computing, AccelerEyes
100X – Astrophysics, RIKEN
149X – Financial Simulation, Oxford
47X – Linear Algebra, Universidad Jaime
20X – 3D Ultrasound, Techniscan
130X – Quantum Chemistry, U of Illinois, Urbana
30X – Gene Sequencing, U of Maryland
Not 2x or 3x: speedups are 20x to 150x
CUDA Parallel Computing Architecture
• Parallel computing architecture and programming model
• Includes a C compiler plus support for OpenCL and DX11 Compute
• Architected to natively support all computational interfaces (standard languages and APIs)
ATI’s Compute “Solution”
Compiling C for CUDA Applications
[Figure: compilation flow – NVCC splits C for CUDA (key kernels) into CUDA object files and CPU code; the rest of the C application is compiled into CPU object files; the linker combines both into a CPU-GPU executable.]
Simple “C” Description For Parallelism
Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}
// Invoke serial SAXPY kernel
saxpy_serial(n, 2.0, x, y);

Parallel C Code:

__global__ void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a*x[i] + y[i];
}
// Invoke parallel SAXPY kernel with 256 threads/block
int nblocks = (n + 255) / 256;
saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);
Computational Finance
Source: CUDA SDK
Financial computing software vendors:
– SciComp: derivatives pricing modeling
– Hanweck: options pricing & risk analysis
– Aqumin: 3D visualization of market data
– Exegy: high-volume tickers & risk analysis
– QuantCatalyst: pricing & hedging engine
– Oneye: algorithmic trading
– Arbitragis Trading: trinomial options pricing
Ongoing work:
– LIBOR Monte Carlo market model
– Callable swaps and continuous-time finance
Source: SciComp
Weather, Atmospheric, & Ocean Modeling
Source: Michalakes, Vachharajani
CUDA-accelerated WRF available; other kernels in WRF being ported
Ongoing work:
– Tsunami modeling
– Ocean modeling
– Several CFD codes
Source: Matsuoka, Akiyama, et al
New emerging standards
• OpenCL
– Supported by many vendors, including Apple
– Targets both GPU-based SIMD and multithreading
– More complex to program than CUDA
• OpenACC
– A programming standard for parallel computing developed by Cray, CAPS, Nvidia and PGI
– Simplifies parallel programming of heterogeneous CPU/GPU systems
– Directive-based
Cluster computing
• The use of a large number of servers, linked on a high-speed local network, as one single large supercomputer
• A popular way of building supercomputers
• Software
– Cluster-aware OS
• Windows Compute Cluster Server 2008
• NPACI Rocks Linux
• Programming systems such as MPI
• Used mostly in computer-aided design, engineering, and scientific research
Comment
• Cluster computing is a very mature discipline
• We know how to build a sizable cluster very well
– Hardware integration
– Storage integration: Lustre, GPFS
– Schedulers: PBS, Torque, SGE, LSF
– Programming: MPI
– Distribution: ROCKS
• Clusters are the foundation fabric for grid and cloud
TERA Cluster
• 1 frontend (HP ProLiant DL360 G5 server) and 192 compute nodes
– Intel Xeon 3.2 GHz (dual-core, dual-processor)
– Memory: 4 GB (8 GB for frontend & InfiniBand nodes)
– 70x4 GB SCSI HDD (RAID1)
• 4 storage servers
– Lustre file system for the TERA cluster's storage
– Attached with Smart Array P400i controller for 5 TB of space
August 29, 2008 – TGCC 2008, Khon Kaen University, Thailand
[Figure: TERA cluster network – a 200-port Gigabit Ethernet switch with edge switches connects the frontends (Sunyata, Araya, WinHPC, TERA, Anatta, two spares), four file servers in a 5 TB Lustre storage tier, and the compute nodes (96 nodes + 16 spares, plus smaller groups of 4, 15, and 64 nodes); the cluster links to the KU fiber backbone (1 Gbps Ethernet/fiber) and on to UniNet at 2.5 Gbps, with 48 TB of storage.]
Grid Computing Technology
• Grid computing enables the virtualization of distributed computing and data resources such as processing, network bandwidth and storage capacity to create a single system image, granting users and applications seamless access to vast IT capabilities.
• Just as an Internet user views a unified instance of content via the Web, a grid user essentially sees a single, large virtual computer.
Grid Architecture
• Fabric Layer
– Protocols and interfaces that provide access to computing resources such as CPU and storage
• Connectivity Layer
– Protocols for grid-specific network transactions, such as security (GSI)
• Resource Layer
– Protocols to access a single resource from an application
• GRAM (Grid Resource Allocation Management)
• GridFTP (data access)
• Grid Resource Information Service
• Collective Layer
– Protocols that manage and access groups of resources
[Figure: layer stack – Application, Collective, Resources, Connectivity, Fabric]
Globus as Service-Oriented Infrastructure
[Figure: user applications access computers, storage, and specialized resources through uniform interfaces, security mechanisms, Web service transport, and monitoring; services include GRAM, GridFTP, DAIS (database access), Reliable File Transfer, MyProxy, and MDS-Index.]
Introduction to ThaiGrid
• A national project under the Software Industry Promotion Agency (Public Organization), Ministry of Information and Communication Technology
• Started in 2005 from 14 member organizations
• Expanded to 22 organizations in 2008
Thai Grid Infrastructure
[Figure: ThaiGrid network – 19 sites with about 1,000 CPU cores, linked at speeds from 155 Mbps to 2.5 Gbps.]
ThaiGrid Usage
• ThaiGrid provides about 290 years of computing time for members
– 9 years on the grid
– 280 years on TERA
• 41 projects from 8 areas are being supported on the teraflop machine
• More small projects on each machine
Medicinal Herb Research
• Partner
– Cheminformatics Center, Kasetsart University (Chak Sangma and team)
• Objective
– Using a 3D molecular database and virtual screening to verify traditional medicinal herbs
• Benefits
– Scientific proof of ancient traditional drugs
– Benefits poor people who still rely on drugs from medicinal herbs
– Potential benefit for the local pharmaceutical industry
[Figure: workflow – virtual screening, infrastructure, lab test]
NanoGrid
• Objective
– A platform that supports computational nanoscience research
• Technology used
– Accelrys Materials Studio
– Cluster schedulers: Sun Grid Engine and Torque
[Figure: the MS-Gateway connecting users to ThaiGrid computing resources in three steps]
Challenges
• Size and scale
• Manageability
– Deployment
– Configuration
– Operation
• Software and Hardware Compatibility
Grid System Architecture
• Clusters
– Satellite sets
• 16 clusters delivered from ThaiGrid for initial members
• Each composed of 5 nodes of IBM eServer xSeries 336
– Intel Xeon 2.8 GHz (dual processor)
– x86_64 architecture
– Memory: 4 GB (DDR2 SDRAM)
– Other sets
• Various types of servers and numbers of nodes
• Provided by member institutes of ThaiGrid
Grid as a Super Cluster
[Figure: compute nodes (C) behind head nodes (H) at each member site, connected over the research and education network (REN) to a central grid scheduler at GCC.]
Is grid still alive?
• Yes, grid is a useful technology for certain tasks
– BitTorrent as a massive file-exchange infrastructure
– The European Grid is using it to share LHC data
• Pitfalls of the grid
– The network is still not reliable and fast enough for long-term operation
– The multi-site, multi-authority concept makes it very complex for
• System management
• Security
• Users to really use the system
• The recent trend is to move to centralized clouds
What is Cloud Computing?
Source: Wikipedia (cloud computing)
[Figure: cloud providers – Google, Amazon, Yahoo, Microsoft, Salesforce]
Why Cloud Computing?
• The illusion of infinite computing resources available on demand, thereby eliminating the need for cloud computing users to plan far ahead for provisioning.
• The elimination of an up-front commitment by Cloud users, thereby allowing companies to start small and increase hardware resources only when there is an increase in their needs.
• The ability to pay for use of computing resources on a short-term basis as needed (e.g., processors by the hour and storage by the day) and release them as needed, thereby rewarding conservation by letting machines and storage go when they are no longer useful.
Source: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD lab, UC Berkeley
Cloud Computing Explained
• SaaS (Software as a Service): applications delivered over the internet as services (e.g., Gmail)
• A cloud is the massive server and network infrastructure that serves SaaS to a large number of users
• The service being sold is called utility computing
Source: “Above the Clouds: A Berkeley View of Cloud Computing”, RAD lab, UC Berkeley
Enabling Technology for Cloud Computing
• Cluster and grid technology
– The ability to build a highly scalable computing system that consists of 100,000 to 1,000,000 nodes
• Service-oriented architecture
– Everything is a service
– Easy to build, distribute, and integrate into large-scale applications
• Web 2.0
– Powerful and flexible user interfaces for an internet-enabled world
Cloud Service Model
Cloud Computing Software Stack
Architecture of Service Oriented Cloud Computing Systems (SOCCS)
[Figure: SOCCS architecture – a user interface and cloud application running on CSM/DSS and CCR services, over operating systems and node hardware linked by an interconnection network.]
SOCCS can be constructed by combining CCR/DSS software to form a scalable service for a client application.
Cloud Service Management (CSM) acts as a resource management system that keeps track of the availability of services on the cloud.
Cloud System Configuration
A Proof-of-Concept Application
The Pickup and Delivery Problem with Time Windows (PDPTW) is the problem of serving a number of transportation requests with a limited number of vehicles.
The objective is to minimize the sum of the distances traveled by the vehicles and the sum of the time spent by each vehicle.
PDPTW on the cloud using SOCCS
The master/worker model is adopted as a framework for service interaction.
The algorithm is partitioned using a domain decomposition approach.
The cloud application controls the decomposition of the problem by sending each subproblem to a worker service and collecting the results back into the best answer.
[Figure: master/worker thread structure – the master thread enqueues the number of vehicles into a vehicle queue (port); worker threads from a thread pool take work from the dispatcher queue through the parallel runtime interface (arbiter), execute the PDPTW function, and send solutions to a solution queue (port); a gather-solution function collects the PDPTW solutions into the output.]
Results
Speedup on a single node with 4 cores
Results
Performance: speedup and efficiency derived from average runtime on 1, 2, 4, 8 and 16 compute nodes.
We are living in the world of Data
• Geophysical exploration
• Medical imaging
• Video surveillance
• Mobile sensors
• Gene sequencing
• Smart grids
• Social media
Big Data
“Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the strictures of your database architectures. To gain value from this data, you must choose an alternative way to process it.”
Reference: “What is big data? An introduction to the big data landscape.”, Edd Dumbill, http://radar.oreilly.com/2012/01/what-is-big-data.html
The Value of Big Data
• Analytical use
– Big data analytics can reveal insights previously hidden by data too costly to process
• e.g., peer influence among customers, revealed by analyzing shoppers' transactional, social, and geographical data
– Being able to process every item of data in reasonable time removes the troublesome need for sampling and promotes an investigative approach to data
• Enabling new products
– Facebook has been able to craft a highly personalized user experience and create a new kind of advertising business
3 Characteristics of Big Data
Big Data Challenges
• Volume
– How to process data so big that it cannot be moved or stored
• Velocity
– A lot of data arrives so fast that it cannot be stored, such as web usage logs and internet and mobile messages; stream processing is needed to filter unused data or extract knowledge in real time
• Variety
– So many types of unstructured data formats make conventional databases useless
How to deal with big data
• Integration of
– Storage
– Processing
– Analysis algorithms
– Visualization
[Figure: big data pipeline – massive data and streams flow through stream processing into storage and parallel processing, then analysis, with results visualized.]
A New Approach For Distributed Big Data
Storage islands (before):
• Disparate systems
• Manual administration
• One tenant, many systems
• IT-provisioned storage
Single storage pool (after):
• Single system across locations (L.A., Boston, London)
• Automated policies
• Many tenants, one system
• Self-service access
Hadoop
• Hadoop is a platform for distributing computing problems across a number of servers, first developed and released as open source by Yahoo
– It implements the MapReduce approach pioneered by Google in compiling its search indexes
– A dataset is distributed among multiple servers and operated on: the “map” stage; the partial results are then recombined: the “reduce” stage
• Hadoop uses its own distributed filesystem, HDFS, which makes data available to multiple computing nodes
• The Hadoop usage pattern involves three stages:
– loading data into HDFS,
– MapReduce operations, and
– retrieving results from HDFS.
WHAT FACEBOOK KNOWS
http://www.facebook.com/data
Cameron Marlow calls himself Facebook's "in-house sociologist." He and his team can analyze essentially all the information the site gathers.
The Links of Love
• Often young women specify that they are “in a relationship” with their “best friend forever”
– Roughly 20% of all relationships for the 15-and-under crowd are between girls
– This number dips to 15% for 18-year-olds and is just 7% for 25-year-olds
• For anonymous US users who were over 18 at the start of the relationship
– the average of the shortest number of steps to get from any one U.S. user to any other individual is 16.7
– This is much higher than the 4.74 steps you'd need to go from any Facebook user to another through friendship, as opposed to romantic, ties
http://www.facebook.com/notes/facebook-data-team/the-links-of-love/10150572088343859
Graph showing the relationships of anonymous US users who were over 18 at the start of the relationship.
Why?
• Facebook can improve the user experience
– Make useful predictions about users' behavior
– Make better guesses about which ads you might be more or less open to at any given time
• Right before Valentine's Day this year, a blog post from the Data Science Team listed the songs most popular with people who had recently signaled on Facebook that they had entered or left a relationship
Data Tsunami
• The data flood is coming; nowhere to run now!
– Data is being generated anytime, anywhere, by anyone
– Data is moving in fast
– Data is too big to move, too big to store
• Better be prepared
– Use this to enhance your business and offer better services to customers
The Opportunities and Challenges ofExascale Computing
• A summary of findings from many workshops in the US
• Lists the issues that need to be overcome
• We will present only some of the challenges
Hardware Challenges
• Major improvements in hardware are needed
Power Challenge
• Power consumption of the computers is the largest hardware research challenge
• Today, power costs for the largest petaflop systems are in the range of $5-10M annually
• For an exascale system using current technology:
– the annual power cost to operate the system would be above $2.5B per year
– the power load would be over a gigawatt
• The target of 20 megawatts, identified in the DOE Technology Roadmap, is primarily based on keeping the operational cost of the system in some kind of feasible range
Memory Challenge
• The memory subsystem is too slow
Data Movement Challenge
System Resiliency Challenge
• For exascale systems, the number of system components will increase faster than component reliability, with mean-time-between-failure projections in minutes or seconds
• Exascale systems will experience various kinds of faults many times per day
– Systems running 100 million cores will continually see core failures, and the tools for dealing with them will have to be rethought
“Co-Design” Challenge
The Computer Science Challenges
• A programming model effort is a critical component
– Clock speeds will be flat or even dropping to save energy; all performance improvements within a chip will come from increased parallelism, along with shrinking memory per arithmetic unit
– There is a need for fine-grained parallelism and a programming model other than message passing or coarse-grained threads
Under the radar
• Mobile processors run supercomputers
• Hybrid war! GPU vs. MIC
• I/O goes solid state
• Programming standards war
– CUDA / OpenCL / OpenMP / OpenACC
Summary
• We are in a challenging world
• Demand for HPC systems and applications will increase
– Software tools, technology, and hardware are changing to catch up
• The greatest challenge is how to quickly develop software for the next generation of computing systems
THANK YOU