BU SciDAC Meeting
Transcript of BU SciDAC Meeting
BU SciDAC Meeting
Balint Joo, Jefferson Lab
Anisotropic Clover
Why do it?
Anisotropy -> fine temporal lattice spacing at moderate cost
Combine with group-theoretical baryon operators -> access to excited states
Nice preliminary results – with just Wilson
Excited states
States with spin 5/2+
http://arxiv.org/pdf/hep-lat/0609052
http://arxiv.org/pdf/hep-lat/0601029
Anisotropic Clover
Why do it?
Part of the JLab three-prong lattice QCD programme:
Prong 1: Dynamical anisotropic Clover
Prong 2: DWF on a staggered sea (MILC configs)
Prong 3: Large-scale dynamical DWF
This programme was specially commended by the DOE at our recent Science and Technology Review
Anisotropic Clover is a major part of the INCITE proposal (for the XT3 and BG/? machines)
Anisotropic Clover
Level 2 Clover term and inverse & force term wired into Chroma -> provides HMC/RHMC
Our choice of gauge action: plaquette + rectangle + adjoint term
Fermion action: anisotropic Clover + stout smearing
Stout force recursion
Usual barrage of DF techniques:
Hasenbusch + chronology for the 2 flavours
RHMC for the +1 flavour
Multi-timescale integrators
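The multi-timescale idea – put the cheap force on a fine inner timestep and the expensive force on a coarse outer step – can be sketched as a two-level nested leapfrog. This is a toy Python example: the force split and step counts are illustrative, not Chroma's actual integrator setup.

```python
def nested_leapfrog(q, p, force_cheap, force_costly, dt, n_outer, n_inner):
    """Two-level leapfrog: the costly force is applied on the coarse
    outer step, the cheap force is integrated on a finer inner step."""
    for _ in range(n_outer):
        p += 0.5 * dt * force_costly(q)      # half kick, expensive force
        h = dt / n_inner
        for _ in range(n_inner):             # inner leapfrog, cheap force
            p += 0.5 * h * force_cheap(q)
            q += h * p
            p += 0.5 * h * force_cheap(q)
        p += 0.5 * dt * force_costly(q)      # closing half kick
    return q, p
```

In an HMC context the cheap force would typically be the gauge force and the costly one the fermion force; the nested scheme remains reversible and area-preserving, which the Metropolis step requires.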
CG Inverter Performance
We only got 7.3 Tflops on 8K CPUs :( – but we didn't put much work at all into optimization
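For reference, the CG algorithm being timed here is short – the cost per iteration is dominated by the matrix apply (the Dslash in our case). A generic NumPy sketch of plain CG, not the optimized lattice kernel:

```python
import numpy as np

def cg(A, b, tol=1e-8, max_iter=1000):
    """Conjugate gradient for Ax = b with A symmetric positive definite.
    Each iteration costs one matrix apply plus a handful of vector ops."""
    x = np.zeros_like(b)
    r = b - A @ x          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol ** 2:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```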
Clover Work Under SciDAC 2
Performance is OK, but we want better... Optimizations:
Clover SSE optimizations for clusters & XT3
BAGEL terms for BG/???
Multi-mass inverter, trace terms
Would like to optimize the actual bottleneck:
The CG inverter is not the current bottleneck
Help from our friends at RENCI in identifying the exact hotspots? (Right now we rely on gprof)
Algorithmic: temporal preconditioning (later)
Thoughts at the back of my mind
Are we actually going to get any time at ORNL? We asked for a lot –
I think 20M CPU hours just for the Clover stuff
The INCITE proposal was extremely hurried; we had to respond very quickly
Many small groups did not stand a chance
How much effort should we be investing? Should we be focusing on BlueGene/? and clusters more?
CRE and ILDG
Progress on CRE has been slow. Why?
Manpower reasons in SciDAC 1?
People are happily running production already without it? In which case, is it just LOW VALUE? Where are the 'armies of new users' who need it?
What are the issues? Intimately tied to the infrastructure at each site:
Site infrastructure leverages off experiments – different everywhere
High maintenance: PBS, LoadLeveler, NFS? dcache anyone? Upgrades of mvapich, OpenMPI, the IB fabric, etc.
Inherently non-portable (what about ANL/ORNL?)
CRE and ILDG
If it has low value, no user demand, is high maintenance, and won't work outside our sites... is it worth doing? Can we just drop it? PLEASE?
Anyway, common environments are so passé and 90s. Nowadays we should think about 'interoperable grid environments' – they're IN!
ILDG
Middleware progressed, but still on eXist MDC
Dumb RC (just remap the LFN to a FNAL dcache name)
Issues:
Where is all the markup? Eventually need a more sophisticated RC?
Markup is NOT anisotropy-aware (future fights in the MDWG – will take time)
Working towards interoperability:
Meeting at JLab Dec 11-13. Can folks from BNL and FNAL come?
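The 'dumb RC' is a replica catalogue reduced to a deterministic remap from logical file name to storage URL – no lookup, no database. A minimal sketch; the prefix below is a made-up placeholder, the real FNAL dcache path is site-specific:

```python
def lfn_to_storage_url(lfn, prefix="srm://dcache.example.org/lqcd"):
    """Map an ILDG logical file name onto a storage URL by
    prepending a fixed site prefix (pure string rewriting)."""
    return prefix.rstrip("/") + "/" + lfn.lstrip("/")
```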
Testing and Release
Unit testing vs. end-to-end testing: too much existing code
We intermix QMP, QDP++, QIO, XpathReader, LIME, Chroma, Wilson Dslash or BAGEL Dslash, possibly BAGEL linear algebra, level 3 CG-DWF
Unit testing all of these is difficult
End-to-end tests: compare the final result, e.g. correlation functions
Lots of output – selective diffs?
QDP++ uses XML; selective diffs through XMLDiff
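A selective diff over XML output can be as simple as comparing only named numeric leaves to a tolerance and ignoring everything else (timestamps, hostnames, timings). This is a minimal stand-in for what XMLDiff does; the element paths and tolerance are illustrative:

```python
import xml.etree.ElementTree as ET

def selective_diff(xml_a, xml_b, paths, rtol=1e-6):
    """Compare only the numeric elements selected by `paths`
    (ElementTree path expressions); return the mismatching triples."""
    ra, rb = ET.fromstring(xml_a), ET.fromstring(xml_b)
    mismatches = []
    for path in paths:
        for ea, eb in zip(ra.iterfind(path), rb.iterfind(path)):
            va, vb = float(ea.text), float(eb.text)
            if abs(va - vb) > rtol * max(abs(va), abs(vb), 1.0):
                mismatches.append((path, va, vb))
    return mismatches
```

The metric file mentioned below would then just be the list of paths plus a tolerance per path.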
Structure: a test consists of
An executable, input XML, expected output XML
A metric file to decide which bits of the output we need to check
A runner – abstracts away running:
Trivial runner (just re-echoes your commands)
MPIRUN runner (runs on 2 JLab IB nodes)
Prototype YOD runner (for the XT3)
LoadLeveler runner (for BG/L) – yucky
Driver scripts: run interactively (e.g. scalar targets) & check, or submit jobs to a queue and check later (for queues)
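The runner abstraction above – same test, different launch mechanism per machine – might look like this in Python. The mpirun invocation is an assumption for illustration; the real YOD and LoadLeveler runners would wrap their own launchers the same way:

```python
import subprocess

class Runner:
    """Abstracts away how a test executable gets launched."""
    def run(self, exe, args):
        raise NotImplementedError

class TrivialRunner(Runner):
    """Runs the executable directly on the local host (scalar builds)."""
    def run(self, exe, args):
        return subprocess.run([exe, *args], capture_output=True,
                              text=True, check=True).stdout

class MPIRunner(Runner):
    """Wraps the executable in mpirun for parallel builds."""
    def __init__(self, nprocs=2):
        self.nprocs = nprocs
    def run(self, exe, args):
        cmd = ["mpirun", "-np", str(self.nprocs), exe, *args]
        return subprocess.run(cmd, capture_output=True,
                              text=True, check=True).stdout
```

The driver script only talks to `Runner.run`, so supporting a new machine means adding one subclass rather than touching every test.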
What has testing taught us?
We run through this regression framework nightly: gcc3, gcc4, scalar, parscalar-ib
What runs fine with gcc 3.x on RHEL won't necessarily run fine with gcc 4.x on FC5
Maintenance: keep up with compilers – identify problems:
ICC – 'catastrophic error: can't allocate register' (SSE inline)
VACPP (XLC) – 'Internal compiler error: Please contact IBM representative' on templates
PGI: no inline assembler? Intrinsics?
We really MUST focus on this issue, or will it be GCC 3.4.x forever? (Seems most stable so far)
SciDAC Release Pages?
What's the actual problem here?
The JLab page has releases that live in the JLab CVS release directory, plus previous versions (by vox populi)
We strive to keep the pages up to date
Not everyone uses JLab CVS. Why? Do you prefer to run your own repository? Do you want to use Subversion? Do you think only sissies use version control?
Centralizing release management is bad – imagine if I had to be responsible for the release of a code that I myself could only pick up from a web page
Is it only John Kogut who is unhappy?
A possible solution ...
... to a problem which may or may not exist
A SourceForge-like setup (GForge) provides, per project:
Web space, release tarball space
Source code management modules (CVS & SVN)
May be able to 'proxy' for your own repo
Mailing lists, bug tracker, news feeds, yadda yadda
Wiki-like authentication
Our new sysadmins are installing this at JLAB
But all the effort is wasted if folks don't use it...