Event Reconstruction in STS I. Kisel GSI CBM-RF-JINR Meeting Dubna, May 21, 2009.


Transcript of Event Reconstruction in STS I. Kisel GSI CBM-RF-JINR Meeting Dubna, May 21, 2009.

Page 1: Event Reconstruction in STS I. Kisel GSI CBM-RF-JINR Meeting Dubna, May 21, 2009.

Event Reconstruction in STS

I. Kisel, GSI

CBM-RF-JINR Meeting
Dubna, May 21, 2009

Page 2.

Many-core HPC

• Heterogeneous systems of many cores
• Uniform approach to all CPU/GPU families
• Similar programming languages (CUDA, Ct, OpenCL)
• Parallelization of the algorithm (vectors, multi-threads, many-cores)

• On-line event selection
• Mathematical and computational optimization
• Optimization of the detector

? OpenCL ?

• Gaming – STI: Cell
• GP CPU – Intel: Larrabee
• CPU – Intel: XX-cores
• FPGA – Xilinx: Virtex
• CPU/GPU – AMD: Fusion
• GP GPU – Nvidia: Tesla

Page 3.

Current and Expected Eras of Intel Processor Architectures

From S. Borkar et al. (Intel Corp.), "Platform 2015: Intel Platform Evolution for the Next Decade", 2005.

• Future programming is three-dimensional
• The amount of data is doubling every 18–24 months
• Massive data streams
• The RMS (Recognition, Mining, Synthesis) workload in real time
• Supercomputer-level performance in ordinary servers and PCs
• Applications such as real-time decision-making analysis

• Cores
• HW Threads
• SIMD width
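These three dimensions multiply: a rough upper bound on the number of concurrent scalar operations is cores × hardware threads per core × SIMD lanes. A small sketch (the example figures are illustrative, not numbers from this talk):

```python
# The three dimensions of parallelism multiply.
# Example figures below are illustrative, not from this talk.

def parallel_slots(cores, hw_threads_per_core, simd_lanes):
    """Rough upper bound on concurrent scalar operations."""
    return cores * hw_threads_per_core * simd_lanes

# e.g. a quad-core CPU with 2-way SMT and 4-wide single-precision SSE:
print(parallel_slots(4, 2, 4))    # 32
# e.g. a 32-core, 4-thread, 16-lane Larrabee-like chip:
print(parallel_slots(32, 4, 16))  # 2048
```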

Page 4.

Cores and HW Threads

CPU architecture in 19XX: 1 Process per CPU
CPU architecture in 2000: 2 Threads per Process per CPU
CPU architecture in 2009 = the CPU of your laptop in 2015

[Diagram: a process with Thread1 and Thread2, each interleaving execute (exe) and read/write (r/w) phases]

Cores and HW threads are seen by an operating system as CPUs:
> cat /proc/cpuinfo

At most half of the threads are executed at any given time.
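Since the operating system presents cores and hardware threads uniformly as logical CPUs, their number can be queried directly — a sketch using only the Python standard library (on Linux the count matches the number of processor entries in /proc/cpuinfo):

```python
# Count logical CPUs (cores x hardware threads) as the OS sees them.
import os

logical_cpus = os.cpu_count()  # same count as `cat /proc/cpuinfo` shows on Linux
print(f"The OS sees {logical_cpus} logical CPUs")
```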

Page 5.

SIMD Width

• Scalar double precision (64 bits): 1 lane
• Vector (SIMD) double precision (128 bits): 2 lanes → 2 or 1/2?
• Vector (SIMD) single precision (128 bits): 4 lanes → 4 or 1/4?
• Intel AVX (2010) vector single precision (256 bits): 8 lanes → 8 or 1/8?
• Intel LRB (2010) vector single precision (512 bits): 16 lanes → 16 or 1/16?

Faster or Slower?

SIMD = Single Instruction, Multiple Data. SIMD uses vector registers. SIMD exploits data-level parallelism.

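The vector idea can be mimicked in plain Python: one call processes four single-precision lanes instead of one value. This is only a sketch of the programming model — real SIMD executes the four additions in a single hardware instruction on a 128-bit register:

```python
# Conceptual SIMD sketch: one "operation" applied to 4 lanes at once.
# Real SIMD uses 128-bit vector registers (e.g. SSE); this only mimics
# the programming model in plain Python.

def simd_add4(a, b):
    """One vector add: 4 single-precision lanes in one 'instruction'."""
    return [a[i] + b[i] for i in range(4)]

x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(simd_add4(x, y))  # [11.0, 22.0, 33.0, 44.0]
```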

Page 6.

SIMD KF Track Fit on Intel Multicore Systems: Scalability

H. Bjerke, S. Gorbunov, I. Kisel, V. Lindenstruth, P. Post, R. Ratering

Real-time performance on different CPU architectures – speed-up 100 with 32 threads.

Speed-up 3.7 on the Xeon 5140 (Woodcrest). Real-time performance on different Intel CPU platforms.

[Plot: fit time per track (s/track, logarithmic scale 0.01–10.00) vs. number of threads (scalar double, single, 2, 4, 8, 16, 32); platforms: 2×Cell SPE (16), Woodcrest (2), Clovertown (4), Dunnington (6)]
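The fit itself is a Kalman filter: the track state is propagated station by station and updated with every measured hit. Below is a deliberately minimal one-dimensional sketch (straight line, no magnetic field or material effects — far simpler than the real SIMD KF track fit; all names and numbers are illustrative):

```python
# Minimal 1D Kalman filter for a straight track y(z) = y0 + t*z.
# Greatly simplified vs. the real SIMD KF track fit: no magnetic field,
# no material effects; state = (y, t), measurements of y at stations z.

def kf_fit(stations, measurements, sigma2=0.01):
    y, t = 0.0, 0.0                         # initial state
    C = [[100.0, 0.0], [0.0, 100.0]]        # large initial covariance
    z_prev = 0.0
    for z, m in zip(stations, measurements):
        dz = z - z_prev
        z_prev = z
        # propagate: y += t*dz; C -> F C F^T with F = [[1, dz], [0, 1]]
        y += t * dz
        c00 = C[0][0] + 2.0 * dz * C[0][1] + dz * dz * C[1][1]
        c01 = C[0][1] + dz * C[1][1]
        C = [[c00, c01], [c01, C[1][1]]]
        # update with measurement m of y (H = [1, 0])
        S = C[0][0] + sigma2                # innovation covariance
        K0, K1 = C[0][0] / S, C[0][1] / S   # Kalman gain
        r = m - y                           # residual
        y += K0 * r
        t += K1 * r
        C = [[(1 - K0) * C[0][0], (1 - K0) * C[0][1]],
             [C[0][1] - K1 * C[0][0], C[1][1] - K1 * C[0][1]]]
    return y, t

# hits from a perfect track y = 0.5*z:
zs = [1.0, 2.0, 3.0, 4.0]
ys = [0.5, 1.0, 1.5, 2.0]
y_fit, t_fit = kf_fit(zs, ys)
print(round(t_fit, 2))  # slope close to 0.5
```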

Page 7.

Intel Larrabee: 32 Cores

L. Seiler et al., "Larrabee: A Many-Core x86 Architecture for Visual Computing", ACM Transactions on Graphics, Vol. 27, No. 3, Article 18, August 2008.

LRB vs. GPU: Larrabee will differ from other discrete GPUs currently on the market, such as the GeForce 200 series and the Radeon 4000 series, in three major ways:
• it will use the x86 instruction set with Larrabee-specific extensions;
• it will feature cache coherency across all its cores;
• it will include very little specialized graphics hardware.

LRB vs. CPU: The x86 processor cores in Larrabee will differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo:
• LRB's 32 x86 cores will be based on the much simpler Pentium design;
• each core supports 4-way simultaneous multithreading, with 4 copies of each processor register;
• each core contains a 512-bit vector processing unit, able to process 16 single-precision floating-point numbers at a time;
• LRB includes explicit cache control instructions;
• LRB has a 1024-bit (512-bit each way) ring bus for communication between cores and to memory;
• LRB includes one fixed-function graphics hardware unit.

Page 8.

General Purpose Graphics Processing Units (GPGPU)

• Substantial evolution of graphics hardware over the past years
• Remarkable programmability and flexibility
• Reasonably cheap
• New branch of research – GPGPU

Page 9.

NVIDIA Hardware

S. Kalcher, M. Bach

• Streaming multiprocessors
• No-overhead thread switching
• FPUs instead of cache/control logic
• Complex memory hierarchy
• SIMT – Single Instruction, Multiple Threads

GT200:
• 30 multiprocessors
• 30 DP units
• 8 SP FPUs per MP
• 240 SP units
• 16,000 registers per MP
• 16 kB shared memory per MP
• ≥ 1 GB main memory
• 1.4 GHz clock
• 933 GFlops SP
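The quoted peak can be cross-checked as units × shader clock × flops per cycle. The assumptions in this back-of-the-envelope sketch are GT200's dual-issue MAD+MUL (3 flops per SP unit per cycle) and the 1.296 GHz shader clock of the Tesla C1060 variant (the 1.4 GHz above is the rounded clock):

```python
# Peak SP throughput: units x clock x flops/cycle.
# Assumptions: dual-issue MAD (2 flops) + MUL (1 flop) = 3 flops/cycle,
# and the 1.296 GHz shader clock of the Tesla C1060 variant of GT200.
sp_units = 240
shader_clock_ghz = 1.296
flops_per_cycle = 3  # MAD + MUL dual issue (assumption)

peak_gflops = sp_units * shader_clock_ghz * flops_per_cycle
print(f"{peak_gflops:.0f} GFlops")  # ~933 GFlops
```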

Page 10.

SIMD/SIMT Kalman Filter on the CSC-Scout Cluster

[Chart: CPU 1600 vs. GPU 9100]

M. Bach, S. Gorbunov, S. Kalcher, U. Kebschull, I. Kisel, V. Lindenstruth

18 × (2 × (Quad-Xeon, 3.0 GHz, 2×6 MB L2), 16 GB)
+
27 × Tesla S1070 (4 × (GT200, 4 GB))
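The aggregate resources of the cluster follow directly from the node counts above:

```python
# Aggregate resources of the CSC-Scout cluster as listed above.
nodes = 18
cpus_per_node = 2      # two Quad-Xeons per node
cores_per_cpu = 4
tesla_units = 27
gpus_per_tesla = 4     # each S1070 holds four GT200 boards

cpu_cores = nodes * cpus_per_node * cores_per_cpu
gpus = tesla_units * gpus_per_tesla
print(cpu_cores, gpus)  # 144 CPU cores, 108 GPUs
```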

Page 11.

CPU/GPU Programming Frameworks

• Cg, OpenGL Shading Language, DirectX
  • Designed for writing shaders
  • Require the problem to be expressed graphically

• AMD Brook
  • Pure stream computing
  • Not hardware-specific

• AMD CAL (Compute Abstraction Layer)
  • Generic usage of the hardware at assembler level

• NVIDIA CUDA (Compute Unified Device Architecture)
  • Defines the hardware platform
  • Generic programming
  • Extension to the C language
  • Explicit memory management
  • Programming on the thread level

• Intel Ct (C for Throughput)
  • Extension to the C language
  • Intel CPU/GPU specific
  • SIMD exploitation for automatic parallelism

• OpenCL (Open Computing Language)
  • Open standard for generic programming
  • Extension to the C language
  • Supposed to work on any hardware
  • Specific hardware capabilities exposed via extensions
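What CUDA and OpenCL share is the kernel model: one scalar function is written, and the runtime launches it once per data element. A plain-Python sketch of that model — the names saxpy_kernel and launch are illustrative, and the "launch" here is just a sequential loop:

```python
# Conceptual sketch (not real CUDA/OpenCL): the kernel programming model —
# one scalar "kernel" is written, and the runtime applies it to every
# element of the data. Here the "runtime" is a plain Python loop.

def saxpy_kernel(gid, a, x, y, out):
    """One work-item: out[gid] = a * x[gid] + y[gid]."""
    out[gid] = a * x[gid] + y[gid]

def launch(kernel, global_size, *args):
    """Stand-in for a kernel launch: run one work-item per index."""
    for gid in range(global_size):
        kernel(gid, *args)

n = 8
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
launch(saxpy_kernel, n, 2.0, x, y, out)
print(out)  # 2*x + y for each element
```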

Page 12.

Cellular Automaton Track Finder

[Figure: event display stages of the CA track finder; labels 500, 200, 10]
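A cellular automaton track finder builds "cells" (hit pairs on neighbouring stations), lets each cell count the longest chain of slope-compatible neighbours ending in it, and reads the highest-count chain out as a track candidate. A toy sketch under those assumptions — not the CBM L1 code; the geometry, cuts, and names are invented for illustration:

```python
# Toy cellular-automaton (CA) track finder — illustrative only, not the
# CBM L1 code. Hits sit on consecutive stations; a "cell" is a hit pair
# on neighbouring stations; cells are neighbours when they share a hit
# and have a similar slope.

STATIONS = 4
# hits[s]: y-coordinates measured on station s (straight track y = s, plus noise)
hits = [[0.0, 5.0], [1.0, 7.0], [2.0, 3.5], [3.0, 9.0]]

# 1. build cells: (station, y_left, y_right, slope)
cells = []
for s in range(STATIONS - 1):
    for yl in hits[s]:
        for yr in hits[s + 1]:
            cells.append((s, yl, yr, yr - yl))

# 2. CA step: counter[i] = length of the best chain ending in cell i
#    (cells are ordered by station, so counter[j] is final when used)
SLOPE_TOL = 0.1
counter = [1] * len(cells)
for i, (s, yl, yr, sl) in enumerate(cells):
    for j, (s2, yl2, yr2, sl2) in enumerate(cells):
        if s2 == s - 1 and yr2 == yl and abs(sl2 - sl) < SLOPE_TOL:
            counter[i] = max(counter[i], counter[j] + 1)

# 3. read out: backtrack from the cell with the highest counter
best = max(range(len(cells)), key=lambda i: counter[i])
chain = [cells[best]]
while counter[best] > 1:
    s, yl, yr, sl = cells[best]
    for j, (s2, yl2, yr2, sl2) in enumerate(cells):
        if (s2 == s - 1 and yr2 == yl and abs(sl2 - sl) < SLOPE_TOL
                and counter[j] == counter[best] - 1):
            best = j
            chain.insert(0, cells[j])
            break
    else:
        break

track_hits = [chain[0][1]] + [c[2] for c in chain]
print(track_hits)  # hits of the longest smooth chain
```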

Page 13.

L1 CA Track Finder: Efficiency

Track category                    Efficiency, %
Reference set (>1 GeV/c)          95.2
All set (≥4 hits, >100 MeV/c)     89.8
Extra set (<1 GeV/c)              78.6
Clone                              2.8
Ghost                              6.6
MC tracks/ev found                 672
Speed, s/ev                        0.8

I. Rostovtseva

• Fluctuating magnetic field?
• Too large STS acceptance?
• Too large distance between STS stations?

Page 14.

L1 CA Track Finder: Changes

I. Kulakov

Page 15.

L1 CA Track Finder: Timing

I. Kulakov

Time             old   new (1 thread)  new (2 threads)  new (3 threads)
CPU Time [ms]    575   278             321              335
Real Time [ms]   576   286             233              238

old – old version (from CBMRoot DEC08); new – new parallelized version.

Statistics: 100 central events. Processor: Pentium D, 3.0 GHz, 2 MB.

R [cm]           10    9    8    7    6    5    4    3    2    1  0.5    –
CPU time [ms]   320  285  254  220  192  171  149  132  123  113  106   96
Real time [ms]  233  213  193  175  154  144  129  120  108  100   94   85
Ref set        0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.97 0.96 0.96 0.96
All set        0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.92 0.91 0.91 0.91 0.91
Extra          0.81 0.81 0.81 0.81 0.81 0.82 0.82 0.82 0.81 0.80 0.80 0.80
Clone          0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
Ghost          0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04 0.04
tracks/event    686  687  687  687  688  688  688  688  687  684  682  682

Page 16.

On-line = Off-line Reconstruction?

• Off-line and on-line reconstruction will and should be parallelized
• Both versions will run on similar many-core systems, or even on the same PC farm
• Both versions will (probably) use the same parallel language(s), such as OpenCL
• Can we use the same code, but with some physics cuts applied when running on-line, like L1 CA?
• If the final code is fast, can we think about a global on-line event reconstruction and selection?

                      Intel SIMD   Intel MIMD   Intel Ct   NVIDIA CUDA   OpenCL
STS                       +            +            +           +           –
MuCh
RICH
TRD
Your Reco
Open Charm Analysis
Your Analysis

Page 17.

Summary
• Think parallel!
• Parallel programming is the key to the full potential of Tera-scale platforms
• Data parallelism vs. parallelism of the algorithm
• Stream processing – no branches
• Avoid direct access to main memory; no maps, no look-up tables
• Use the SIMD unit in the nearest future (many-cores, TF/s, …)
• Use single-precision floating point where possible
• In critical parts, use double precision if necessary
• Keep the code portable across heterogeneous systems (Intel, AMD, Cell, GPGPU, …)
• New parallel languages appear: OpenCL, Ct, CUDA
• A GPGPU is a personal supercomputer with 1 TFlops for 300 EUR!
• Should we start buying them for testing the algorithms now?

Page 18.

Back-up Slides (1–5)

Back-up

Page 19.

Back-up Slides (1/5)

Back-up

Page 20.

Back-up Slides (2/5)

Back-up

Page 21.

Back-up Slides (3/5)

Back-up

Page 22.

Back-up Slides (4/5)

Back-up

SIMD is out of consideration (I.K.)

Page 23.

Back-up Slides (5/5)

Back-up

Page 24.

Tracking Workshop

Please join the Tracking Workshop,
15–17 June 2009 at GSI.