Case studies in Optimizing
High Performance Computing Software
Jan Westerholm
High performance computing
Department of Information Technologies
Faculty of Technology / Åbo Akademi University
FINHPC / Åbo Akademi Objectives
• Sub-project in FINHPC
• Three year duration 01.07.2005-30.06.2008
• Objective: to improve code individuals and research groups have written and are running on CSC machines
  – faster code, with in many cases exactly the same numerical results as before
  – ability to run bigger problems
• Work approach: apply well known techniques from computer science
• Faster programs may imply better quality for results
• Better throughput for everybody
FINHPC / Åbo Akademi Limitations
• We will use:
  – parallelization techniques
  – code optimization
    • cache utilization (particularly L2-cache)
    • microprocessor pipeline continuity
    • data blocking: grid scan order
  – introduction of new data structures
  – replacement of very simple algorithms
    • sorting (quicksort instead of bubble sort)
  – open source libraries
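To illustrate the "replacement of very simple algorithms" item, the sketch below counts comparisons in a bubble sort and in a simple quicksort on the same input. Both functions (`bubble_sort`, `quicksort`) are hypothetical illustrations written for this transcript, not code from any of the projects.

```python
import random

def bubble_sort(items):
    """Return (sorted copy, number of comparisons) using bubble sort."""
    a = list(items)
    comparisons = 0
    for end in range(len(a) - 1, 0, -1):
        for i in range(end):
            comparisons += 1
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a, comparisons

def quicksort(items):
    """Return (sorted copy, number of comparisons) using a simple quicksort."""
    if len(items) <= 1:
        return list(items), 0
    pivot, rest = items[0], items[1:]
    smaller = [x for x in rest if x < pivot]
    larger = [x for x in rest if x >= pivot]
    left, c_left = quicksort(smaller)
    right, c_right = quicksort(larger)
    # Counted as one pivot comparison per remaining element.
    return left + [pivot] + right, len(rest) + c_left + c_right

random.seed(0)
data = [random.random() for _ in range(500)]
sorted_bubble, n_bubble = bubble_sort(data)
sorted_quick, n_quick = quicksort(data)
```

For n elements, this bubble sort always makes n(n-1)/2 comparisons, while quicksort on random input needs on the order of n log n.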
FINHPC / Åbo Akademi Limitations
• We will not:
  – introduce better physics, chemistry, etc.
  – replace chosen basic numerical technique
  – replace individual algorithms unless they are clearly modularized (matrix inversion as library routine)
3 case studies
• Lattice-Boltzmann fluid simulation: 3DQ19
• Protein covariance analysis: Covana
• Fusion reactor simulation: Elmfire
3DQ19: Lattice Boltzmann fluid mechanics
• Jyväskylä University / Jussi Timonen, Keijo Mattila; ÅA / Anders Gustafsson
• Physical background:
  – phase space distribution simulated in time
  – Boltzmann's equation: drift term and collision term
  – physical quantities = moments of distribution
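The physical background above can be sketched for a single grid point. This is a minimal illustration assuming the standard D3Q19 velocity set and weights and a single-relaxation-time (BGK) collision term. The names `moments`, `equilibrium` and `relax_bgk` and the parameter `tau` are hypothetical and are not taken from the 3DQ19 code.

```python
from itertools import product

# D3Q19 velocity set: 1 rest vector, 6 axis vectors, 12 face-diagonal vectors.
VELOCITIES = [c for c in product((-1, 0, 1), repeat=3)
              if sum(abs(x) for x in c) <= 2]
# Standard D3Q19 weights, chosen by the number of non-zero components.
WEIGHTS = [{0: 1 / 3, 1: 1 / 18, 2: 1 / 36}[sum(abs(x) for x in c)]
           for c in VELOCITIES]

def moments(f):
    """Density and velocity as the zeroth and first moments of f."""
    rho = sum(f)
    u = tuple(sum(fi * c[d] for fi, c in zip(f, VELOCITIES)) / rho
              for d in range(3))
    return rho, u

def equilibrium(rho, u):
    """Second-order equilibrium distribution for the given density and velocity."""
    uu = sum(x * x for x in u)
    feq = []
    for w, c in zip(WEIGHTS, VELOCITIES):
        cu = sum(ci * ui for ci, ui in zip(c, u))
        feq.append(w * rho * (1 + 3 * cu + 4.5 * cu * cu - 1.5 * uu))
    return feq

def relax_bgk(f, tau):
    """Collision term: relax f towards equilibrium with relaxation time tau."""
    rho, u = moments(f)
    feq = equilibrium(rho, u)
    return [fi - (fi - fe) / tau for fi, fe in zip(f, feq)]

# Example: a slightly perturbed distribution at one grid point.
f0 = [w * (1.0 + 0.01 * i) for i, w in enumerate(WEIGHTS)]
f1 = relax_bgk(f0, tau=0.8)
```

The collision step leaves density and momentum unchanged. The drift term (copying each value to the neighbouring grid point along its velocity) is not shown.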
3DQ19: Program Profiling
Flat profile:
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
33.96     43.65    43.65       50   873.00  1230.10  everything2to1()
30.79     83.22    39.57       50   791.40  1148.50  everything1to2()
27.79    118.93    35.71 49000000     0.00     0.00  relaxation_BGK()
 2.30    121.89     2.96                             shmem_msgs_available
 1.19    123.42     1.53      100    15.30    15.30  send_west()
 1.11    124.85     1.43      100    14.30    14.30  send_east()
 0.82    125.91     1.06                             recv_message
 0.45    126.49     0.58                             sock_msg_avail_on_fd
 0.37    126.97     0.48      100     4.80     4.80  per_bound_xslice()
 0.33    127.40     0.43        1   430.00   430.00  init_fluid()
 0.31    127.80     0.40        1   400.00   400.00  local_profile_y()
 0.23    128.10     0.30                             socket_msgs_available
 0.19    128.34     0.24        1   240.00   240.00  calc_mass()
 0.04    128.39     0.05                             net_recv
 0.03    128.43     0.04        1    40.00    40.00  allocation()
 0.02    128.46     0.03                             main
3DQ19: Optimizations
• Parallelization: well done already!
• Code optimization
  – blocking: grid scan order
  – anti-dependency: make blocks of code independent
  – deep fluid: mark those grid points which do not have solids as neighbours
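The "deep fluid" idea can be sketched as a one-time marking pass. This is a minimal illustration that assumes the 18 D3Q19 neighbour directions and treats points outside the grid as solid. The function name `mark_deep_fluid` and the nested-list grid layout are hypothetical, not taken from the 3DQ19 code.

```python
from itertools import product

# The 18 neighbour offsets of the D3Q19 lattice (axis and face-diagonal directions).
NEIGHBOURS = [c for c in product((-1, 0, 1), repeat=3)
              if 1 <= sum(abs(x) for x in c) <= 2]

def mark_deep_fluid(solid):
    """Return a grid that is True where a fluid point has no solid neighbour.

    solid[i][j][k] is True for solid points. Points outside the grid are
    treated as solid (an assumption made for this sketch).
    """
    nx, ny, nz = len(solid), len(solid[0]), len(solid[0][0])

    def is_solid(i, j, k):
        if not (0 <= i < nx and 0 <= j < ny and 0 <= k < nz):
            return True
        return solid[i][j][k]

    deep = [[[False] * nz for _ in range(ny)] for _ in range(nx)]
    for i, j, k in product(range(nx), range(ny), range(nz)):
        if not solid[i][j][k]:
            deep[i][j][k] = not any(is_solid(i + di, j + dj, k + dk)
                                    for di, dj, dk in NEIGHBOURS)
    return deep

# Example: a 5x5x5 box of fluid with one solid point next to the centre.
grid = [[[False] * 5 for _ in range(5)] for _ in range(5)]
deep_before = mark_deep_fluid(grid)
grid[2][2][3] = True
deep_after = mark_deep_fluid(grid)
```

Points marked as deep fluid can then be updated without any per-neighbour boundary checks in the inner loop.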
3DQ19: Blocking
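Blocking changes the grid scan order so that points in a small sub-block are visited together and are more likely to stay in cache. The sketch below only illustrates the visiting order. The generator name `blocked_scan` and the block size are hypothetical, and the cache benefit itself shows up in a compiled language rather than in Python.

```python
from itertools import product

def blocked_scan(nx, ny, nz, block=8):
    """Yield every (i, j, k) of an nx*ny*nz grid, one block at a time."""
    for i0 in range(0, nx, block):
        for j0 in range(0, ny, block):
            for k0 in range(0, nz, block):
                # Visit all points inside the current block before moving on.
                for i in range(i0, min(i0 + block, nx)):
                    for j in range(j0, min(j0 + block, ny)):
                        for k in range(k0, min(k0 + block, nz)):
                            yield i, j, k

order = list(blocked_scan(5, 4, 3, block=2))
```

Every grid point is still visited exactly once; only the order differs from a plain triple loop.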
3DQ19: Results on three parallel systems
                     Athlon 1800    IBMSC    AMD64
everything1to2():          18.8     19.48    10.06
everything2to1():         19.34     18.78    10.52
send_west():                8.4      0.68     1.96
send_east():               8.31      1.17     3.14
Total time (s):           55.15     40.28    25.76
Time gained (s):          27.48     14.13    14.76
Speed up (%):               33%       26%      36%
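The percentages in the results table appear consistent with dividing the time gained by the original run time. The sketch below assumes that "Total time" is the run time after optimization, so that the original run time is total time plus time gained. That reading is an assumption, not something the slide states.

```python
# Values from the results table: (total time, time gained), in seconds.
results = {
    "Athlon 1800": (55.15, 27.48),
    "IBMSC": (40.28, 14.13),
    "AMD64": (25.76, 14.76),
}

def speedup_percent(total_after, gained):
    """Time gained as a percentage of the assumed original run time."""
    original = total_after + gained
    return 100.0 * gained / original

percentages = {name: round(speedup_percent(*values))
               for name, values in results.items()}
```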
2nd case study: Covana Protein Covariance analysis
• Institute of Medical Technology, University of Tampere / Mauno Vihinen, Bairong Chen; ÅA / André Norrgård
• Biological background
  – physico-chemical groups of amino acids
  – protein function from structure
• pair and triple correlations between amino acids
• web server for covariance analysis
Covana: Protein covariance analysis
• Protein sequences: calculate correlations between columns of amino acids
• Typical size
  – 50-150 sequences (rows)
  – 300-1500 amino acids in a sequence (columns)
>Q9XW32_CAEEL/9-307
IDVTKPTFLLTFYSIHGTFALVFNILGIFLIMK-NPKIVKMYKGFMINMQ-ILSLLADAQTTLLMQPVYILPIIGGYTNGLLWQVFR----LSSHIQMAMF---LLLLY---------LQVASIVCAIVTKYHVVSNIGKLSDRSI-LFWIF---VIVYHGCAFVITGFFSVS-CLARQ--EEENLIK------T-KFPNAISVFTLEN--VAIYDLQVN---KWMMITTILFAFMLTSSIVISFY--FSVRLLKTLPSKRNTISARSFRGHQIAVTSLM-AQAT-VPFLVL---IIP--IGTIVYLFVHVLP------NAQ-----EISNIMMAV--YSFHASLST---FVMIISTPQY
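A pair correlation between two alignment columns can be sketched as follows. The slides do not say which correlation measure Covana uses. Mutual information is swapped in here purely as a common, easily checked example, and the helper names `column` and `mutual_information` are hypothetical.

```python
from collections import Counter
from math import log2

def column(alignment, j):
    """Return column j of an alignment given as equal-length strings (rows)."""
    return [seq[j] for seq in alignment]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    # p(a,b) * log2(p(a,b) / (p(a) * p(b))), written with raw counts.
    return sum((c / n) * log2(c * n / (pa[a] * pb[b]))
               for (a, b), c in pab.items())

# Toy alignment: columns 0 and 1 vary together, column 2 varies independently of column 0.
alignment = ["AKK", "AKE", "GEK", "GEE"]
mi_linked = mutual_information(column(alignment, 0), column(alignment, 1))
mi_independent = mutual_information(column(alignment, 0), column(alignment, 2))
```

With 300-1500 columns there are roughly 45,000 to 1.1 million column pairs, so the cost of each pair evaluation matters.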
Covana: Code optimization
• Effective data structures: dynamic memory allocation
• Effective generic algorithms: sorting
• Avoid recalculations
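"Avoid recalculations" can be illustrated with per-column residue counts that are computed once and reused for every column pair. This is only a sketch: the pair score `overlap` is a placeholder invented for the example, and none of the function names come from Covana.

```python
from collections import Counter
from itertools import combinations

def count_column(alignment, j):
    """Count how often each residue occurs in column j."""
    return Counter(seq[j] for seq in alignment)

def overlap(ci, cj):
    """Placeholder pair score: residue counts shared by two columns."""
    return sum(min(ci[r], cj[r]) for r in ci)

def pair_scores_naive(alignment):
    """Recount both columns for every pair; returns (scores, number of counts)."""
    ncols = len(alignment[0])
    scores, n_counts = {}, 0
    for i, j in combinations(range(ncols), 2):
        ci, cj = count_column(alignment, i), count_column(alignment, j)
        n_counts += 2
        scores[(i, j)] = overlap(ci, cj)
    return scores, n_counts

def pair_scores_cached(alignment):
    """Count each column once, then reuse the counts for every pair."""
    ncols = len(alignment[0])
    counts = [count_column(alignment, j) for j in range(ncols)]
    scores = {(i, j): overlap(counts[i], counts[j])
              for i, j in combinations(range(ncols), 2)}
    return scores, ncols

alignment = ["IDVTK", "IDV-K", "LDVTR", "IEVTK"]
naive, n_naive = pair_scores_naive(alignment)
cached, n_cached = pair_scores_cached(alignment)
```

Both versions return the same scores; the naive one performs a column count twice per pair, the cached one once per column.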
Covana: Run time
[Figure: Runtime. Time (s), axis from 0 to 250, plotted against program version (1, 4-15, 24, 31)]
Covana: Results
– Runtime:
  • Original: 227.8 s
  • Final version: 2.0 s
  • Improvement: 112 times faster
– Computer memory usage:
  • Original: 3250 MB
  • Final version: 37 MB
  • Improvement: 88 times less
– Disk space usage:
  • Original: 277 MB
  • Final version: 21 MB
  • Improvement: 13 times less
3rd case study: ELMFIRE Tokamak fusion reactor simulation
• Jukka Heikkinen, Salomon Janhunen, Timo Kiviniemi / Advanced Energy Systems / HUT; ÅA / Artur Signell
• Physical background:
  – particle simulation with averaged gyrokinetic Larmor orbits
  – turbulence and plasma modes
Elmfire: Tokamak fusion reactor simulation
• Goal 1: Computer platform independence
  – replacing proprietary library routines for random number generation with open source routines
  – replacing proprietary library routines for distributed solution of sparse linear systems with open source library routines
• Goal 2: Scalability
  – Elmfire ran on at most 8 processors
  – new data structures for sparse matrices were invented, which make element updates efficient
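The slides do not describe the new sparse matrix data structures beyond the chart labels "AVL" and "AVL + hash". As a loose illustration of why a lookup structure makes element updates cheap, the sketch below accumulates contributions in a hash table keyed by (row, column) and converts to a compressed row layout only once at the end. The class `HashSparseMatrix` and its methods are hypothetical and do not represent the Elmfire implementation.

```python
class HashSparseMatrix:
    """Sparse matrix assembled through a hash table keyed by (row, col)."""

    def __init__(self):
        self._data = {}

    def add(self, row, col, value):
        """Add value to element (row, col); constant time on average."""
        key = (row, col)
        self._data[key] = self._data.get(key, 0.0) + value

    def to_csr(self, nrows):
        """Convert to compressed sparse row arrays (indptr, indices, values)."""
        indptr = [0] * (nrows + 1)
        indices, values = [], []
        # Sorting the keys orders entries by row, then by column.
        for (row, col), value in sorted(self._data.items()):
            indptr[row + 1] += 1
            indices.append(col)
            values.append(value)
        # Turn per-row counts into cumulative row offsets.
        for r in range(nrows):
            indptr[r + 1] += indptr[r]
        return indptr, indices, values

# Example: two contributions land on the same element and are summed.
m = HashSparseMatrix()
m.add(0, 1, 1.0)
m.add(2, 0, 5.0)
m.add(0, 1, 2.0)
indptr, indices, values = m.to_csr(nrows=3)
```

Updating an existing element does not require shifting or searching through a row's stored entries, which is the property the scalability goal depends on.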
Elmfire
Small problem: 12M particles, 8 processors
[Figure: Time (s), axis from 0 to 700, by program version: IBMSC: Orig; Sepeli: Orig; Sepeli: AVL; Sepeli: AVL + hash]
Elmfire
Big problem (60 times bigger than the small problem): 166M particles, 64 processors
[Figure: Time (s), axis from 0 to 3500, by program version: IBMSC: Orig; Sepeli: Orig; Sepeli: AVL; Sepeli: AVL + hash]
Conclusions
• Software can be improved!
  – modern microprocessor architecture is taken into account:
    • cache utilization
    • pipeline
  – use of well-established computer science methods
Conclusions
• In 1 case out of 3, a clear impact on run time was made
• In 2 cases out of 3, previously intractable results can now be obtained
• Are these three cases representative of code running on CSC machines?
  – the next two cases are under study!
What have we learnt?
• Computer scientists with minimal prior knowledge of e.g. physical sciences can contribute to HPC
• Are supercomputers needed to the extent they are used today at CSC?
• Interprocess communication often a bottleneck
  – Parallel computing with 1000 processors may become routine in the future for certain types of problems
• Who should do the coding?
  – Code for production use (intensive cycles of use, maintainability) should be outsourced?
Co-workers:
• Mats Aspnäs, Ph.D.
• Anders Gustafsson, M.Sc.
• Artur Signell, M.Sc.
• André Norrgård
THANK YOU!