
Parallel Computing

Benson Muite

[email protected]

http://math.ut.ee/~benson

https://courses.cs.ut.ee/2014/paralleel/fall/Main/HomePage

22 September 2014


Document Preparation: LaTeX and LyX

• https://en.wikibooks.org/wiki/LaTeX

• http://texblog.org/about/

• http://www.latex-project.org/

• http://www.lyx.org/


Computer Architecture


Parallel Computer Architecture

• Chip Architecture Review
• Accelerators
• Graphics Cards
• Intel Xeon Phi
• Parallel Computer Networking
• CM-5
• The Earth Simulator
• IBM Blue Gene
• K computer
• Titan
• Tianhe II


Chip Architecture Review

• A typical chip today has multiple cores
• Data may need to be fetched from a hard disk, RAM or cache before it can be processed
• For many applications, getting the data can be more of a constraint than computing with it
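One common way to quantify when data movement, rather than arithmetic, limits performance is the roofline model, which is not on the slide and is brought in here only as an illustration. The minimal sketch below assumes a hypothetical chip with a 500 Gflops peak and 50 GB/s memory bandwidth, and assumes every operand of the kernel comes from main memory (no cache reuse).

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float,
                      flops_per_byte: float) -> float:
    """Roofline estimate: performance is capped either by the compute peak
    or by how fast memory can deliver operands (bandwidth * intensity)."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# DAXPY, y[i] = a * x[i] + y[i], in double precision:
# 2 flops per element (one multiply, one add) and 24 bytes of memory traffic
# (read x[i], read y[i], write y[i], 8 bytes each).
DAXPY_INTENSITY = 2 / 24

# Hypothetical machine parameters, not those of a real chip.
PEAK_GFLOPS, BANDWIDTH_GBS = 500.0, 50.0

daxpy = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, DAXPY_INTENSITY)
dense = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, 20.0)  # 20 flops per byte

print(f"DAXPY:          {daxpy:.2f} Gflops")
print(f"High intensity: {dense:.2f} Gflops")
```

With these numbers DAXPY is limited to about 4.17 Gflops, under 1% of peak, purely by memory traffic, while a kernel doing 20 flops per byte moved reaches the 500 Gflops compute ceiling.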


Example HPC Chip Architectures

• Intel Haswell
• AMD Opteron
• SPARC64 XIfx
• NEC SX-ACE
• IBM Power 8
• IBM PowerPC A2
• Conferences: Hot Chips (http://www.hotchips.org/), COOL Chips (http://www.coolchips.org/2015/)


Accelerators

• External specialized device for floating point operations
• Typically good at executing many simple instructions in parallel
• High latency is compensated for by high bandwidth
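Why high latency can be compensated for by high bandwidth is easiest to see with the latency-bandwidth cost model t(n) = latency + n / bandwidth for moving n bytes. This is a standard textbook model, not something stated on the slide, and the two links below are hypothetical, chosen only to contrast a low-latency, low-bandwidth path with a high-latency, high-bandwidth one.

```python
def transfer_time(n_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Latency-bandwidth model: fixed start-up cost plus size / bandwidth."""
    return latency_s + n_bytes / bandwidth_bytes_per_s


def crossover_bytes(lat_a, bw_a, lat_b, bw_b):
    """Message size at which link b (higher latency, higher bandwidth)
    becomes faster than link a."""
    return (lat_b - lat_a) / (1 / bw_a - 1 / bw_b)


# Hypothetical links, not measurements of any real device:
LOW_LATENCY = (1e-6, 1e9)      # 1 microsecond start-up, 1 GB/s
HIGH_BANDWIDTH = (1e-5, 1e10)  # 10 microseconds start-up, 10 GB/s

for n in (1_000, 1_000_000):
    a = transfer_time(n, *LOW_LATENCY)
    b = transfer_time(n, *HIGH_BANDWIDTH)
    print(f"{n:>9} bytes: low-latency {a * 1e6:7.1f} us, "
          f"high-bandwidth {b * 1e6:7.1f} us")

print(f"Break-even size: {crossover_bytes(*LOW_LATENCY, *HIGH_BANDWIDTH):.0f} bytes")
```

Small transfers favour the low-latency link, but beyond about 10 kB the high-bandwidth link wins; this is one reason accelerators tend to be used most effectively when data is moved in large blocks.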


Graphics Cards and General Purpose Computing on Graphics Cards

• Nvidia – many simple cores; CUDA, CUDA Fortran, OpenACC, OpenCL and OpenGL application programming interfaces; strong support from the academic community

• AMD – many simple cores; OpenCL and OpenGL. Has launched the APU (Accelerated Processing Unit), which combines a CPU and a GPU

• Embedded graphics in AMD APUs and in cell phone chips such as the Qualcomm Snapdragon


Intel Xeon Phi

• About 1 Tflops of performance
• A mini-supercomputer on a compute card
• Simplified x86 cores
• Typically easy to get code to run, more difficult to get code to run efficiently
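The roughly 1 Tflops figure is a theoretical peak: cores × clock rate × floating point operations per cycle. The sketch below assumes the published Xeon Phi 5110P parameters (60 cores at 1.053 GHz, not given on the slide) and 16 double precision flops per cycle per core, from 512-bit vectors holding 8 doubles with fused multiply-add counting as 2 flops. It also shows how little of that peak plain scalar code on one core can reach.

```python
def peak_flops(cores: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak: every core completes flops_per_cycle every cycle."""
    return cores * clock_hz * flops_per_cycle


# Assumed Xeon Phi 5110P parameters (double precision).
DOUBLES_PER_VECTOR = 512 // 64   # 8 doubles in a 512-bit register
FLOPS_PER_FMA = 2                # one multiply plus one add
phi_peak = peak_flops(60, 1.053e9, DOUBLES_PER_VECTOR * FLOPS_PER_FMA)
print(f"Peak: {phi_peak / 1e12:.3f} Tflops")

# Unvectorized code on a single core, one flop per cycle.
scalar = peak_flops(1, 1.053e9, 1)
print(f"Scalar single core: {scalar / phi_peak:.2%} of peak")
```

Under these assumptions the card peaks at about 1.011 Tflops, while scalar code on one core gets about 0.10% of that, which illustrates why code that runs easily may still run far from efficiently.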


Parallel Computer Networks

• Bus – simple, cheap, poor communication performance
• Ring – simple, cheap, poor communication performance
• Mesh – simple, more expensive than a ring, better communication performance than a ring
• Hypercube – good communication performance, expensive at a large scale
• Torus (2D, 3D, 4D, 6D) – good communication performance
• Fat tree – commonly used; communication performance not quite as good as a torus, but cheaper
• Which topology is cost effective for a Monte Carlo simulation?
• What is the topology of Rocket?
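The cost and performance trade-offs above are often summarised by two numbers per topology: the number of links (a rough proxy for cost) and the diameter, the largest number of hops between any two nodes (a rough proxy for worst-case latency). The minimal sketch below builds small 16-node instances and measures both by breadth-first search; the node count and helper names are illustrative choices, not from the slides. The bus is omitted because it is a shared medium rather than a set of point-to-point links.

```python
from collections import deque
from itertools import product


def ring(p):
    return {i: {(i - 1) % p, (i + 1) % p} for i in range(p)}


def mesh2d(k, wrap=False):
    """k x k grid of nodes; wrap=True adds wraparound links (a 2D torus)."""
    adj = {}
    for r, c in product(range(k), repeat=2):
        nbrs = set()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            rr, cc = r + dr, c + dc
            if wrap:
                rr, cc = rr % k, cc % k
            if 0 <= rr < k and 0 <= cc < k and (rr, cc) != (r, c):
                nbrs.add(rr * k + cc)
        adj[r * k + c] = nbrs
    return adj


def hypercube(d):
    """2**d nodes; neighbours differ in exactly one bit of the node label."""
    return {i: {i ^ (1 << b) for b in range(d)} for i in range(2 ** d)}


def diameter(adj):
    """Longest shortest path, by breadth-first search from every node."""
    best = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best


def links(adj):
    """Each undirected link appears in two neighbour sets."""
    return sum(len(n) for n in adj.values()) // 2


topologies = {
    "ring": ring(16),
    "2D mesh": mesh2d(4),
    "2D torus": mesh2d(4, wrap=True),
    "hypercube": hypercube(4),
}
for name, adj in topologies.items():
    print(f"{name:10s} nodes={len(adj):3d} links={links(adj):3d} "
          f"diameter={diameter(adj)}")
```

For 16 nodes the ring has the fewest links (16) but the largest diameter (8), the mesh has 24 links and diameter 6, and the 2D torus and hypercube both have 32 links and diameter 4. As the node count p grows, a 2D torus keeps 2p links while a hypercube needs (p log2 p)/2, which is why hypercubes become expensive at a large scale.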


Parallel Computer Networks

• http://htor.inf.ethz.ch/research/topologies/


CM-5

• http://people.csail.mit.edu/bradley/cm5/, https://en.wikipedia.org/wiki/Connection_Machine

Figure: NAS Thinking Machines CM-5, photographer: Tom Trower, 1993 (This is probably a 256 processor machine.)

• 131 Gflops on 1024 processors
• World’s most powerful known computer in June 1993
• Fat tree topology network
• Thinking Machines grew out of Danny Hillis's doctoral research, but no longer produces supercomputers


The Earth Simulator

• https://en.wikipedia.org/wiki/Earth_Simulator, http://www.jamstec.go.jp/ceist/avcrg/index.en.html

Figure: Old Earth Simulator
Figure: Earth Simulator 2

• 35.86 Tflops on 5120 processors
• World’s most powerful known computer between March 2002 and November 2004
• Vector processors
• Five times faster than the previous number one computer on the Top500 list


IBM Blue Gene L

• https://en.wikipedia.org/wiki/Blue_Gene#Blue_Gene.2FL, https://asc.llnl.gov/computing_resources/bluegenel/photogallery.html

Figure: Adam Bertsch next to a Blue Gene L system at Lawrence Livermore National Laboratory

• 596 Tflops on 106,496 dual core processors
• World’s most powerful known computer between November 2004 and November 2007
• 3D torus network and many relatively slow cores
• More at https://asc.llnl.gov/computing_resources/bluegenel/configuration.html
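Blue Gene L's 3D torus, and the higher-dimensional tori mentioned for the K computer, generalise the 2D torus. For a torus with side lengths d_1 to d_n, the diameter is the sum of floor(d_i / 2), because in each dimension a message can travel either way round the ring. The sketch below evaluates this closed form; the dimensions in the example are hypothetical and are not the actual Blue Gene L or K computer configurations.

```python
from math import prod


def torus_stats(dims):
    """Nodes, links and diameter of a torus with the given side lengths."""
    p = prod(dims)
    # In a ring of d nodes the farthest node is floor(d / 2) hops away.
    diameter = sum(d // 2 for d in dims)
    links = 0
    for d in dims:
        if d > 2:
            links += p        # every node owns one "+1" link in this dimension
        elif d == 2:
            links += p // 2   # the wraparound link coincides with the direct one
    return p, links, diameter


for dims in [(4, 4), (2, 2, 2, 2), (64, 64), (16, 16, 16)]:
    p, n_links, dia = torus_stats(dims)
    print(f"{'x'.join(map(str, dims)):>8}: nodes={p:5d} "
          f"links={n_links:6d} diameter={dia}")
```

With the same 4096 nodes, a 64 × 64 torus has diameter 64 while a 16 × 16 × 16 torus has diameter 24, at the cost of 12,288 rather than 8,192 links. A 2 × 2 × 2 × 2 torus is the 16-node hypercube.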


K computer

• https://en.wikipedia.org/wiki/K_computer, http://www.aics.riken.jp/en/outreach/photo-gallery/

Figure: K computer at RIKEN, picture courtesy of RIKEN.

• Currently 10.5 Pflops on 88,128 SPARC64 VIIIfx processors with 8 cores per processor

• World’s most powerful known computer between June 2011 and June 2012

• 6D mesh/torus network and many fast and smart cores
• More at http://www.aics.riken.jp/en/k-computer/system


Titan

• https://en.wikipedia.org/wiki/Titan_%28supercomputer%29, https://www.olcf.ornl.gov/

Figure: Titan Supercomputer at Oak Ridge National Laboratory

• 27 Pflops on 18,688 AMD Opteron 6274 16-core CPUs and 18,688 Nvidia Tesla K20X GPUs

• World’s most powerful known computer between November 2012 and June 2013

• More at https://www.olcf.ornl.gov/computing-resources/titan-cray-xk7/
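Titan's 27 Pflops is a theoretical peak, and it can be reproduced from the node count with the same cores × clock × flops-per-cycle arithmetic. The sketch assumes published per-device double precision figures that are not on the slide: about 1.31 Tflops per Tesla K20X, and 16 cores × 2.2 GHz × 4 flops per cycle per core for the Opteron 6274, with one CPU and one GPU per node.

```python
NODES = 18_688  # one Opteron 6274 and one Tesla K20X per node (as on the slide)

# Assumed double precision peaks per device.
opteron_peak = 16 * 2.2e9 * 4   # cores * clock * flops/cycle, about 140.8 Gflops
k20x_peak = 1.31e12             # Tesla K20X

node_peak = opteron_peak + k20x_peak
system_peak = NODES * node_peak
gpu_share = k20x_peak / node_peak

print(f"System peak: {system_peak / 1e15:.1f} Pflops")
print(f"Share of peak on the GPUs: {gpu_share:.0%}")
```

On these assumptions the system peak comes out at about 27.1 Pflops, and roughly 90% of it sits on the GPUs, so code that runs only on the CPUs can use at most about a tenth of the machine.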


Tianhe II

• https://en.wikipedia.org/wiki/Tianhe-2 https://www.olcf.ornl.gov/

• https://duckduckgo.com/?q=tianhe+II+pictures

• 33.86 Pflops on 32,000 Intel Xeon E5-2692 chips with 48,000 Xeon Phi 31S1P coprocessors

• Fat tree topology; the chips are American, but the interconnect is made in China

• World’s most powerful known computer since June 2013
• More at www.netlib.org/utk/people/JackDongarra/PAPERS/tianhe-2-dongarra-report.pdf
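The K computer and Tianhe II figures on these slides are measured Linpack results, so they can be compared with each machine's theoretical peak. The sketch assumes published per-chip double precision parameters that are not on the slides: SPARC64 VIIIfx at 8 cores × 2.0 GHz × 8 flops per cycle, Xeon E5-2692 at 12 cores × 2.2 GHz × 8 flops per cycle, and Xeon Phi 31S1P at 57 cores × 1.1 GHz × 16 flops per cycle.

```python
def peak(chips: int, cores: int, clock_hz: float, flops_per_cycle: int) -> float:
    """Theoretical peak of `chips` identical processors."""
    return chips * cores * clock_hz * flops_per_cycle


# Measured results quoted on the slides.
k_measured = 10.5e15
tianhe_measured = 33.86e15

# Peaks from assumed per-chip parameters (double precision).
k_peak = peak(88_128, 8, 2.0e9, 8)
tianhe_peak = peak(32_000, 12, 2.2e9, 8) + peak(48_000, 57, 1.1e9, 16)

for name, measured, pk in [("K computer", k_measured, k_peak),
                           ("Tianhe II", tianhe_measured, tianhe_peak)]:
    print(f"{name:10s} peak {pk / 1e15:5.1f} Pflops, "
          f"measured/peak {measured / pk:.0%}")
```

On these assumptions the K computer sustains about 93% of its 11.3 Pflops peak and Tianhe II about 62% of its 54.9 Pflops peak. One plausible reading, in line with the Xeon Phi slide, is that the accelerator-heavy machine is harder to run efficiently.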


Summary

• Supercomputer architectures are still evolving
• Depending on the problem you are solving, choose the best computer architecture and algorithm if possible

• In many cases you have no choice in the architecture of a supercomputer, but you do have some choice in the algorithm

• Sometimes you are lucky and can choose both, but you may need to write a lot of code


New Key Concepts and References

• Parallel Computer Architecture; RR 2.1-2.3
• R. Rahman, Intel Xeon Phi Coprocessor Architecture and Tools: The Guide for Application Developers, Apress Open (2013), $0.35 on Amazon

• T. Hoefler, “Networking and Computer Architecture”, http://htor.inf.ethz.ch/teaching/CS498/

• A. Grama, A. Gupta, G. Karypis, V. Kumar, Introduction to Parallel Computing, 2nd Ed., Addison Wesley (2003)

• E. Wang, Q. Zhang, B. Shen, G. Zhang, X. Lu, Q. Wu, Y. Wang, High-Performance Computing on the Intel® Xeon Phi™, Springer (2014), http://www.springer.com/computer/communication+networks/book/978-3-319-06485-7?otherVersion=978-3-319-06486-4