Case studies in Optimizing High Performance Computing Software Jan Westerholm High performance...

Case studies in Optimizing High Performance Computing Software

Jan Westerholm

High performance computing

Department of Information Technologies

Faculty of Technology / Åbo Akademi University

FINHPC / Åbo Akademi Objectives

• Sub-project in FINHPC

• Three year duration: 01.07.2005-30.06.2008

• Objective: to improve code that individuals and research groups have written and are running on CSC machines
– faster code, in many cases with exactly the same numerical results as before
– ability to run bigger problems

• Work approach: apply well known techniques from computer science

• Faster programs may imply better quality for results

• Better throughput for everybody

FINHPC / Åbo Akademi Limitations

• We will use:
– parallelization techniques
– code optimization
  • cache utilization (particularly L2 cache)
  • microprocessor pipeline continuity
  • data blocking: grid scan order
– introduction of new data structures
– replacement of very simple algorithms
  • sorting (quicksort instead of bubble sort)
– open source libraries
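To illustrate why replacing a very simple algorithm can pay off, here is a minimal sketch that counts comparisons for bubble sort and for a simple quicksort. Both functions and their names are illustrative assumptions written for this example. They are not code from any of the projects discussed here.

```python
def bubble_sort(values):
    """Return (sorted copy, number of element comparisons)."""
    data = list(values)
    comparisons = 0
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data, comparisons


def quicksort(values):
    """Return (sorted copy, number of element-versus-pivot checks)."""
    comparisons = 0

    def sort(items):
        nonlocal comparisons
        if len(items) <= 1:
            return items
        pivot = items[len(items) // 2]
        less, equal, greater = [], [], []
        for x in items:
            comparisons += 1  # one pivot check per element per partition pass
            if x < pivot:
                less.append(x)
            elif x == pivot:
                equal.append(x)
            else:
                greater.append(x)
        return sort(less) + equal + sort(greater)

    return sort(list(values)), comparisons
```

Bubble sort always performs n(n-1)/2 comparisons, while quicksort on typical unordered input needs on the order of n log n, so the gap grows quickly with problem size.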

FINHPC / Åbo Akademi Limitations

• We will not:
– introduce better physics, chemistry, etc.
– replace chosen basic numerical technique
– replace individual algorithms unless they are clearly modularized (matrix inversion as library routine)

3 case studies

• Lattice-Boltzmann fluid simulation: 3DQ19

• Protein covariance analysis: Covana

• Fusion reactor simulation: Elmfire

3DQ19: Lattice Boltzmann fluid mechanics

• Jyväskylä University / Jussi Timonen, Keijo Mattila; ÅA / Anders Gustafsson

• Physical background:
– phase space distribution simulated in time
– Boltzmann's equation: drift term and collision term
– physical quantities = moments of distribution

3DQ19: Program Profiling

Flat profile:

  %    cumulative    self                self     total
 time    seconds    seconds     calls  ms/call  ms/call  name
33.96      43.65      43.65        50   873.00  1230.10  everything2to1()
30.79      83.22      39.57        50   791.40  1148.50  everything1to2()
27.79     118.93      35.71  49000000     0.00     0.00  relaxation_BGK()
 2.30     121.89       2.96                              shmem_msgs_available
 1.19     123.42       1.53       100    15.30    15.30  send_west()
 1.11     124.85       1.43       100    14.30    14.30  send_east()
 0.82     125.91       1.06                              recv_message
 0.45     126.49       0.58                              sock_msg_avail_on_fd
 0.37     126.97       0.48       100     4.80     4.80  per_bound_xslice()
 0.33     127.40       0.43         1   430.00   430.00  init_fluid()
 0.31     127.80       0.40         1   400.00   400.00  local_profile_y()
 0.23     128.10       0.30                              socket_msgs_available
 0.19     128.34       0.24         1   240.00   240.00  calc_mass()
 0.04     128.39       0.05                              net_recv
 0.03     128.43       0.04         1    40.00    40.00  allocation()
 0.02     128.46       0.03                              main

3DQ19: Optimizations

• Parallelization: well done already!

• Code optimization
– blocking: grid scan order
– anti-dependency: make blocks of code independent
– deep fluid: mark those grid points which do not have solids as neighbours
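The deep fluid marking could look roughly like the sketch below. It assumes a 3D grid stored as nested lists of booleans, periodic wrap-around at the grid edges, and the 18 non-rest link directions of a 19-velocity lattice as the neighbourhood. The function name `mark_deep_fluid` and the data layout are hypothetical, not taken from the 3DQ19 code.

```python
from itertools import product

# The 18 non-rest link directions of a 19-velocity lattice:
# 6 face neighbours and 12 edge neighbours (corner diagonals excluded).
OFFSETS = [d for d in product((-1, 0, 1), repeat=3)
           if 1 <= sum(map(abs, d)) <= 2]


def mark_deep_fluid(solid):
    """solid[x][y][z] is True for solid nodes.

    Returns a grid of the same shape that is True where a node is fluid
    and none of its 18 neighbours is solid (periodic wrap assumed).
    """
    nx, ny, nz = len(solid), len(solid[0]), len(solid[0][0])
    deep = [[[False] * nz for _ in range(ny)] for _ in range(nx)]
    for x, y, z in product(range(nx), range(ny), range(nz)):
        if solid[x][y][z]:
            continue
        deep[x][y][z] = not any(
            solid[(x + dx) % nx][(y + dy) % ny][(z + dz) % nz]
            for dx, dy, dz in OFFSETS
        )
    return deep
```

A plausible use of such a mask is to let the update loop skip boundary checks for deep fluid nodes. It only needs to be computed once if the geometry is static.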

3DQ19: Blocking
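As a minimal sketch of what a blocked grid scan order means, the generator below visits a 2D grid one tile at a time instead of row by row. The function name and the block sizes are assumptions for illustration. Python will not show the cache effect itself, only the traversal order that a compiled code would use.

```python
def blocked_scan(nx, ny, bx, by):
    """Yield (x, y) over an nx-by-ny grid, one bx-by-by block at a time."""
    for x0 in range(0, nx, bx):
        for y0 in range(0, ny, by):
            # Finish all points of the current block before moving on,
            # so the data touched stays within a small working set.
            for x in range(x0, min(x0 + bx, nx)):
                for y in range(y0, min(y0 + by, ny)):
                    yield x, y
```

Every grid point is still visited exactly once. Only the order changes, which is why the numerical results can stay the same.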

3DQ19: Results on three parallel systems

                     Athlon 1800    IBMSC    AMD64
everything1to2():          18.80    19.48    10.06
everything2to1():          19.34    18.78    10.52
send_west():                8.40     0.68     1.96
send_east():                8.31     1.17     3.14
Total time (s):            55.15    40.28    25.76
Time gained (s):           27.48    14.13    14.76
Speed up (%):                33%      26%      36%
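The slide does not say how the speed-up percentages were computed. They are consistent with the time gained divided by the original run time, assuming the total time shown is the optimized run time, so that original = total + gained. The short check below works through that arithmetic. The function name is an assumption.

```python
# (optimized total time in s, time gained in s) per system, from the table
results = {
    "Athlon 1800": (55.15, 27.48),
    "IBMSC": (40.28, 14.13),
    "AMD64": (25.76, 14.76),
}


def speedup_percent(total, gained):
    """Reduction in run time as a percentage of the original run time."""
    original = total + gained  # assumed: original = optimized + gained
    return 100.0 * gained / original


for system, (total, gained) in results.items():
    print(f"{system}: {speedup_percent(total, gained):.1f} %")
```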

2nd case study: Covana Protein Covariance analysis

• Institute of Medical Technology, University of Tampere / Mauno Vihinen, Bairong Chen; ÅA / André Norrgård

• Biological background
– physico-chemical groups of amino acids
– protein function from structure

• pair and triple correlations between amino acids

• web server for covariance analysis

Covana: Protein covariance analysis

• Protein sequences: calculate correlations between columns of amino acids

• Typical size
  • 50-150 sequences (rows)
  • 300-1500 amino acids in a sequence (columns)

>Q9XW32_CAEEL/9-307
IDVTKPTFLLTFYSIHGTFALVFNILGIFLIMK-NPKIVKMYKGFMINMQ-ILSLLADAQTTLLMQPVYILPIIGGYTNGLLWQVFR----LSSHIQMAMF---LLLLY---------LQVASIVCAIVTKYHVVSNIGKLSDRSI-LFWIF---VIVYHGCAFVITGFFSVS-CLARQ--EEENLIK------T-KFPNAISVFTLEN--VAIYDLQVN---KWMMITTILFAFMLTSSIVISFY--FSVRLLKTLPSKRNTISARSFRGHQIAVTSLM-AQAT-VPFLVL---IIP--IGTIVYLFVHVLP------NAQ-----EISNIMMAV--YSFHASLST---FVMIISTPQY
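A correlation between two columns starts from counting which amino acids occur together in the same row. The sketch below shows only that counting step, on equal-length aligned sequences where "-" marks a gap. The function name and the choice to skip rows with a gap are assumptions. The statistic Covana actually computes from such counts is not described here.

```python
from collections import Counter


def column_pair_counts(sequences, i, j, gap="-"):
    """Count residue pairs seen together in columns i and j.

    Rows with a gap in either column are skipped (an assumption).
    """
    counts = Counter()
    for seq in sequences:
        a, b = seq[i], seq[j]
        if a != gap and b != gap:
            counts[(a, b)] += 1
    return counts


# Toy alignment: 4 sequences (rows), 3 columns
alignment = ["AKL", "AKV", "GRL", "G-L"]
pairs = column_pair_counts(alignment, 0, 1)
```

With 300-1500 columns there are up to about a million column pairs, which is why the data structures used for such counts matter.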

Covana: Code optimization

• Effective data structures: dynamic memory allocation

• Effective generic algorithms: sorting

• Avoid recalculations

Covana: Run time

[Figure: Runtime. Time (s), axis 0 to 250, plotted against program version (1, 4 to 15, 24, 31).]

Covana: Results

– Runtime:
  • Original: 227.8 s
  • Final version: 2.0 s
  • Improvement: 112 times faster

– Computer memory usage:
  • Original: 3250 MB
  • Final version: 37 MB
  • Improvement: 88 times less

– Disk space usage:
  • Original: 277 MB
  • Final version: 21 MB
  • Improvement: 13 times less

3rd case study: Elmfire Tokamak fusion reactor simulation

• Jukka Heikkinen, Salomon Janhunen, Timo Kiviniemi / Advanced Energy Systems / HUT; ÅA / Artur Signell

• Physical background:
– particle simulation with averaged gyrokinetic Larmor orbits
– turbulence and plasma modes

Elmfire: Tokamak fusion reactor simulation

• Goal 1: Computer platform independence
– replacing proprietary library routines for random number generation with open source routines
– replacing proprietary library routines for distributed solution of sparse linear systems with open source library routines

• Goal 2: Scalability
– Elmfire ran on at most 8 processors
– new data structures for sparse matrices were invented, which make element updates efficient
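The slides do not describe the new sparse matrix structures beyond the chart labels "AVL" and "AVL + hash". As a rough illustration of why a keyed structure makes element updates cheap, the sketch below accumulates contributions in a Python dict (a hash table) and converts to compressed sparse row (CSR) arrays only once assembly is finished. The class name and its methods are hypothetical and this is not Elmfire's implementation.

```python
class SparseAccumulator:
    """Accumulate matrix entries keyed by (row, col), then export CSR."""

    def __init__(self):
        self._entries = {}

    def add(self, row, col, value):
        # Average-case constant-time update; no shifting of packed arrays.
        key = (row, col)
        self._entries[key] = self._entries.get(key, 0.0) + value

    def to_csr(self, n_rows):
        """Return (indptr, indices, data) with columns sorted within rows."""
        ordered = sorted(self._entries.items())
        indptr = [0] * (n_rows + 1)
        for (row, _), _ in ordered:
            indptr[row + 1] += 1
        for r in range(n_rows):
            indptr[r + 1] += indptr[r]
        indices = [col for (_, col), _ in ordered]
        data = [value for _, value in ordered]
        return indptr, indices, data


acc = SparseAccumulator()
acc.add(0, 1, 1.0)
acc.add(2, 0, 3.0)
acc.add(0, 1, 2.0)  # a repeated position is summed, not duplicated
acc.add(1, 1, 4.0)
indptr, indices, data = acc.to_csr(3)
```

Inserting directly into packed CSR arrays would require shifting all later entries on every new element, which is the cost a tree or hash keyed structure avoids during assembly.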

Elmfire

Small problem: 12M particles, 8 processors

[Figure: time (s), axis 0 to 700, by program version: IBMSC: Orig; Sepeli: Orig; Sepeli: AVL; Sepeli: AVL + hash.]

Elmfire

Big problem (60 times bigger than the small problem): 166M particles, 64 processors

[Figure: time (s), axis 0 to 3500, by program version: IBMSC: Orig; Sepeli: Orig; Sepeli: AVL; Sepeli: AVL + hash.]

Conclusions

• Software can be improved!
– modern microprocessor architecture is taken into account:
  • cache utilization
  • pipeline
– use of well-established computer science methods

Conclusions

• In 1 case out of 3, a clear impact on run time was made

• In 2 cases out of 3, previously intractable results can now be obtained

• Are these three cases representative of code running on CSC machines?
– the next two cases are under study!

What have we learnt?

• Computer scientists with minimal prior knowledge of e.g. physical sciences can contribute to HPC

• Are supercomputers needed to the extent they are used today at CSC?

• Interprocess communication is often a bottleneck
– Parallel computing with 1000 processors may become routine in the future for certain types of problems

• Who should do the coding?
– Should code for production use (intensive cycles of use, maintainability) be outsourced?

Co-workers:

• Mats Aspnäs, Ph.D.

• Anders Gustafsson, M.Sc.

• Artur Signell, M.Sc.

• André Norrgård

THANK YOU!
