The High Performance Simulation Project
Status and short term plans
17th April 2013
Federico Carminati
SFT: SoFTware Development for Experiments
Where are we now?
Present status
• Several investigations of possible alternatives for “extremely parallel – no lock” transport
  • Not much code written, several blackboards full
• Some investigation on a simplified but fully vectorized model to prove vectorization gain
• New design in preparation
Major points under discussion
• How to minimise locks and maximise local handling of particles
• How to handle hit and digit structures
• How to preserve the history of the particles
  • This point seems more difficult at the moment and it requires more design
• What is the possible speedup obtained by micro-parallelisation
• What are the bottlenecks and opportunities with parallel I/O
Current design
[Diagram labels: dispatcher thread; thread local; logical volume; logical volume basket; p array; p* array; transport; output particle store]
Features
Pros
• Good parallel performance but…
• Easy recording of particle history
• Limited data movement
Cons
• Possibly limited scalability with a large number of cores
• Non-locality of particles in memory
• Difficult to introduce hits and digits while maintaining locality
Design under study
[Diagram labels: list of logical volumes; logical volume lv; list of baskets for lv; input particle list and output particle list (p arrays); hits; history; sensitive volumes; list of active events for lv; digits for lv and event ev; active event list; event ev]
Design under study
[Same diagram as on the previous slide, with these elements added: transport thread; digitizer thread; event-builder thread; events]
Design under study
[Same diagram again, with two annotations added: “continuously rotated” and “flushed at the end of event”]
Features
Pros
• Excellent potential locality
• Easy to introduce hits and digits
Cons
• One more copy (but it is done in parallel)
• More difficult to preserve particle history (it is non-local!) and introduce particle pruning
Processing flow I
• The transport thread takes particles from the input buffer and transports them till they stop, interact or exit from the volume
  • At this point they are inserted in the output particle buffer for further processing
  • If the LV is a sensitive detector, hits are generated and stored per LV basket
  • An LV basket history record is kept (we have no idea how for the moment, we need more blackboard work!)
• Input and output particle buffers are fixed-size structures, which can however evolve (be optimised) during simulation
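The transport step on this slide can be sketched as follows. This is only an illustration under stated assumptions, not the project's code: the names `Particle`, `Basket` and `transport_basket`, the buffer capacity, and the toy one-dimensional geometry with a constant energy loss per step are all invented for the example. History recording is left out, since the slide itself says it is not designed yet.

```python
from dataclasses import dataclass, field

@dataclass
class Particle:
    event: int      # event the particle belongs to
    x: float        # position in a toy 1-D geometry (assumption)
    energy: float   # remaining energy, arbitrary units

@dataclass
class Basket:
    """Hypothetical LV basket: fixed-size particle buffers plus hits."""
    x_min: float
    x_max: float
    sensitive: bool
    capacity: int = 4
    input_buffer: list = field(default_factory=list)
    output_buffer: list = field(default_factory=list)
    hits: list = field(default_factory=list)   # (event, energy deposit)

def transport_basket(basket, step=1.0, loss=1.0):
    """Transport each input particle until it stops or exits the volume."""
    while basket.input_buffer:
        p = basket.input_buffer.pop()
        while p.energy > 0.0 and basket.x_min <= p.x < basket.x_max:
            deposit = min(loss, p.energy)
            p.energy -= deposit
            p.x += step
            if basket.sensitive:
                basket.hits.append((p.event, deposit))
        if len(basket.output_buffer) >= basket.capacity:
            raise RuntimeError("output buffer full, a new one is needed")
        # Stopped or exiting particles are handed on for further processing.
        basket.output_buffer.append(p)

basket = Basket(x_min=0.0, x_max=3.0, sensitive=True)
basket.input_buffer = [Particle(0, 0.0, 5.0), Particle(0, 1.0, 1.5)]
transport_basket(basket)
```

After the call the input buffer is empty, the output buffer holds one stopped and one exiting particle, and the hits are stored in the basket itself, which is what keeps them local to the LV.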
Design under study
[Same diagram, with the particle buffers marked “✔ full!” and “✗ empty!”]
Processing flow II
• When an input particle buffer is exhausted
  • It is marked as such by the transport thread
    • No lock if it is kept assigned to the LV basket, but possible memory waste
    • Can be passed to a queue of “used baskets”, but this implies a lock
    • In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are used, but this can be optimized
  • Used buffers are scanned by the dispatcher thread, which updates a global track counter per event
    • -1 for each stopped “dead” particle
  • And then they are declared “empty” to be reused
• The transport thread picks up another “ready” basket
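The flag-based variant of this recycling step might look like the sketch below. Everything named here (`BufferState`, `InputBuffer`, `recycle_used_buffers`) is hypothetical, and the sketch assumes that a used buffer still records, for each particle it held, the event id and whether the particle stopped. It is single-threaded and only shows the bookkeeping, not the synchronisation.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum, auto

class BufferState(Enum):
    EMPTY = auto()   # free to be refilled by the dispatcher
    READY = auto()   # filled with particles waiting for transport
    USED = auto()    # exhausted by a transport thread

@dataclass
class InputBuffer:
    state: BufferState = BufferState.EMPTY
    # One (event id, stopped?) pair per particle that was transported.
    transported: list = field(default_factory=list)

def recycle_used_buffers(buffers, live_tracks):
    """Dispatcher side: scan the flags, update counters, free the buffers."""
    for buf in buffers:
        if buf.state is not BufferState.USED:
            continue
        for event, stopped in buf.transported:
            if stopped:
                live_tracks[event] -= 1   # -1 for each stopped "dead" particle
        buf.transported.clear()
        buf.state = BufferState.EMPTY     # declared "empty", can be reused

live_tracks = Counter({0: 3, 1: 2})       # global track counter per event
buffers = [
    InputBuffer(BufferState.USED, [(0, True), (0, False), (1, True)]),
    InputBuffer(BufferState.READY, [(1, True)]),
]
recycle_used_buffers(buffers, live_tracks)
```

The scan touches every buffer, which is the cost the slide mentions for the flag approach; a queue of used buffers would avoid the scan at the price of a lock.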
Processing flow III
• When an output particle buffer is full, it is marked as such
  • Again queue insertion or just a flag
  • In case of a flag, the dispatcher thread has to scan all LVs -> all baskets -> all output buffers to know which ones are full, but this can be optimized
• The transport thread picks another empty output buffer
• The dispatcher thread copies particles from the full output particle buffer to LV-specific input particle buffers
  • Increasing the global particle event counter
• When an input particle buffer is full, the dispatcher declares it “ready to be transported”
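The dispatcher's copy step could be sketched like this. The function `dispatch`, the constant `CAPACITY` and the representation of a particle as an (event id, destination LV) pair are assumptions made for the example. The slide does not say whether a particle that is dispatched a second time is counted again; this sketch simply counts every copy.

```python
from collections import Counter, defaultdict

CAPACITY = 2   # hypothetical fixed size of an input buffer

def dispatch(full_output_buffer, filling, ready, live_tracks):
    """Copy particles from a full output buffer to per-LV input buffers.

    Each particle is an (event id, destination LV name) pair in this sketch.
    """
    for event, lv in full_output_buffer:
        filling[lv].append((event, lv))
        live_tracks[event] += 1              # global per-event particle counter
        if len(filling[lv]) == CAPACITY:     # this input buffer is now full
            ready.append((lv, filling.pop(lv)))   # "ready to be transported"
    full_output_buffer.clear()               # the output buffer is empty again

filling = defaultdict(list)   # LV name -> input buffer currently being filled
ready = []                    # buffers that transport threads may pick up
live_tracks = Counter()
output = [(0, "calo"), (0, "tracker"), (1, "calo"), (0, "calo")]
dispatch(output, filling, ready, live_tracks)
```

Here the second `calo` particle fills that LV's input buffer, so it moves to the ready list and a fresh buffer is started for the next one.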
Design under study
[Diagram labels: list of logical volumes; logical volume lv; list of baskets for lv; input particle list and output particle list (p arrays); empty buffer list; full buffer list; ready buffer list]
Processing flow IV
• Note an important point
  • The LV basket structure has input and output particle buffers and hits and history buffers
• Input and output particle buffers are
  • Multi-event
  • Volatile, they get emptied and filled during transport of a single event
• Hits and history buffers are
  • Per event
  • Permanent during the transport of a single event
• A basket of an LV can be handled by different threads successively, each one with new input and output buffers
  • …but all these threads will add to the hits and history data structure till the event is flushed
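One way to picture the two kinds of buffers is the data-structure sketch below. The class `LVBasket` and its methods are hypothetical; keying hits and history by event id is an assumption about how "per event" could be realised, and thread safety is ignored.

```python
from collections import defaultdict
from dataclasses import dataclass, field

@dataclass
class LVBasket:
    """Hypothetical layout of one basket of a logical volume (LV)."""
    # Volatile and multi-event: replaced while events are in flight.
    input_particles: list = field(default_factory=list)
    output_particles: list = field(default_factory=list)
    # Per event: kept until that event is flushed.
    hits: dict = field(default_factory=lambda: defaultdict(list))
    history: dict = field(default_factory=lambda: defaultdict(list))

    def swap_particle_buffers(self, new_input, new_output):
        """A new thread takes over the basket with fresh particle buffers."""
        self.input_particles, self.output_particles = new_input, new_output

    def flush_event(self, event):
        """Hand over and drop the per-event data once the event is finished."""
        return self.hits.pop(event, []), self.history.pop(event, [])

basket = LVBasket()
basket.hits[7].append("hit from thread A")
basket.swap_particle_buffers([], [])          # thread B takes over
basket.hits[7].append("hit from thread B")    # same per-event hit store
basket.hits[8].append("hit of another event")
hits, history = basket.flush_event(7)
```

Swapping the particle buffers leaves the hit store untouched, so both threads contribute to event 7, and flushing event 7 does not disturb event 8.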
Processing flow V
• When an event is finished, the digitizer thread kicks in, scans all the hits in all the baskets of all the LVs and digitises them, inserting them in the LV event->digit structure
• When this is over, the event is built into the event structure (to be designed!) by the event builder thread
• After that, the history for this event is assembled by the same thread
• Then the event is output
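The end-of-event sequence can be illustrated with the sketch below. It is a guess at the shape of the flow, not a design: the functions `digitize` and `build_event` are invented, hits are modelled as (event id, energy deposit) pairs, and digitisation is reduced to summing deposits per LV, whereas real digitisation is detector specific. The event structure is a plain dictionary because the slide marks it as still to be designed.

```python
from collections import defaultdict

def digitize(baskets_by_lv, event):
    """Digitizer thread: scan all baskets of all LVs, one digit per LV.

    Summing energy deposits is an assumption made for this sketch.
    """
    digits = defaultdict(float)
    for lv, baskets in baskets_by_lv.items():
        for basket_hits in baskets:
            for hit_event, deposit in basket_hits:
                if hit_event == event:
                    digits[lv] += deposit
    return dict(digits)

def build_event(event, digits, history):
    """Event-builder thread: assemble digits, then history, into one record."""
    return {"event": event, "digits": digits, "history": list(history)}

# Hits grouped per basket and per LV (toy values).
baskets_by_lv = {
    "calo": [[(0, 1.0), (1, 2.0)], [(0, 0.5)]],
    "tracker": [[(1, 0.25)]],
}
record = build_event(0, digitize(baskets_by_lv, 0), history=["track 1 stopped"])
```

The record returned by `build_event` is what would then be written out.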
Questions?
• How many dispatcher, digitizer and event-builder threads?
  • Difficult to say, we need some more quantitative design work
  • Measurements with G4 simulations could help
• Transport thread numbers will have to adapt to the size of simulation and of the detector
  • In ATLAS for instance 50% of the time is spent in 0.75% of the volumes
  • Threads could be distributed proportionally to the time spent in the different LVs
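Distributing threads in proportion to the time spent per LV could be done as in this sketch. The function `distribute_threads` is hypothetical, the largest-remainder rounding is one possible choice among several, and the time values are made-up placeholders rather than measurements.

```python
def distribute_threads(time_per_lv, n_threads):
    """Share n_threads among LVs in proportion to the time spent in each.

    Uses the largest-remainder method so the shares add up exactly.
    """
    total = sum(time_per_lv.values())
    quotas = {lv: n_threads * t / total for lv, t in time_per_lv.items()}
    shares = {lv: int(q) for lv, q in quotas.items()}
    leftover = n_threads - sum(shares.values())
    # Give the remaining threads to the largest fractional parts.
    by_remainder = sorted(quotas, key=lambda lv: quotas[lv] - shares[lv],
                          reverse=True)
    for lv in by_remainder[:leftover]:
        shares[lv] += 1
    return shares

# Placeholder time fractions, not measurements.
time_per_lv = {"calo": 50.0, "tracker": 30.0, "muon": 12.0, "other": 8.0}
shares = distribute_threads(time_per_lv, 8)
```

With a time profile as skewed as the ATLAS one quoted on the slide, most LVs would get a quota well below one thread, so in practice LVs would probably have to be grouped before applying a rule like this.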
Simple observation: HEP transport is mostly local!
• Locality not exploited by the classical transportation approach
• Existing code very inefficient (0.6-0.8 IPC, instructions per cycle)
• Cache misses due to fragmented code
50 per cent of the time is spent in 50 out of 7,100 volumes
Questions?
• What about memory?
  • Fortunately we do not have “that many” LVs
Detector   Physical volumes   Logical volumes
ALICE             4,354,735             4,764
ATLAS            29,046,966             7,143
CMS               1,166,318             1,537
LHCb             18,491,756               709
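A rough way to reason about the memory question is to multiply the number of logical volumes by the buffer space each one carries. The sketch below does that arithmetic with the logical volume counts from the table above. The function `buffer_memory_bytes` is hypothetical, and the four sizing parameters are pure placeholders, not project numbers.

```python
# Logical volume counts taken from the table above.
LOGICAL_VOLUMES = {"ALICE": 4764, "ATLAS": 7143, "CMS": 1537, "LHCb": 709}

def buffer_memory_bytes(n_lv, baskets_per_lv, buffers_per_basket,
                        particles_per_buffer, bytes_per_particle):
    """Upper bound on particle-buffer memory if every LV is fully populated."""
    return (n_lv * baskets_per_lv * buffers_per_basket
            * particles_per_buffer * bytes_per_particle)

# All four sizing parameters are placeholders chosen for illustration.
estimate = {
    name: buffer_memory_bytes(n_lv, baskets_per_lv=4, buffers_per_basket=2,
                              particles_per_buffer=256, bytes_per_particle=64)
    for name, n_lv in LOGICAL_VOLUMES.items()
}
```

The estimate grows linearly with the number of LVs, which is presumably why the comparatively small logical volume counts, rather than the millions of physical volumes, are the relevant figure here.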
Grand strategy
[Diagram labels: simulation job; create vectors; basic algorithms; use vectors. Annotations: “We are concentrating here” and “But we should look also here”]
Short term tasks
• Continue the design work, essential before any more substantial implementation
  • This is the most important task at the moment
  • We have to evaluate the potential bottlenecks before starting the implementation
• Implement the new design and evaluate it against the first
• Demonstrate speedup of some chosen geometry routines
  • Both on x86 CPUs and GPUs
• Demonstrate speedup of some chosen physics methods
  • Particularly in the EM domain
Possible timeline
• Summer 2013
  • Implement a prototype according to the present design
  • Get essential numbers from G4 (to be defined!)
    • Total particles in a shower, profile of development of a shower in terms of multiplicity, locality of transport, etc.
  • Vectorize, GPU-ize, Phi-ize at least three geometry classes (simple, intermediate, hard)
  • Vectorize, GPU-ize, Phi-ize at least a couple of EM simplified methods (from G4?)
• Fall 2013
  • Interface the methods above to the prototype to realise a first prototype of vectorized transport
Thank you!