Krzysztof M. Korcyl, Joanna Płażek, Janusz Chwastowski, Piotr Poznański Cracow University of...


Slide 1

Krzysztof M. Korcyl, Joanna Płażek, Janusz Chwastowski, Piotr Poznański
Cracow University of Technology, ul. Warszawska 24, 31-155 Cracow, Poland
emails: {kkorcyl, jplazek, jchwastowski, ppoznanski}@pk.edu.pl

Selected issues of histogramming on GPGPUs

Slide 2

Large Scale Data Quality Monitoring System

[Diagram labels: thousands of sensors; sensor interface cards; data collection nodes; network; data monitoring node(s); GPU histogramming card]

Slide 3

Data flow

[Diagram labels: Input data; NETWORK CARD; CPU; RAM; RING BUFFER; page-locked (pinned) memory; DEVICE; GLOBAL; SHARED BLOCK (four times)]

Slide 4

Storing histograms in shared memory: banked coding

Memory layout: thread_0 hist[0], thread_1 hist[0], ..., thread_30 hist[0], thread_31 hist[0], thread_0 hist[1], thread_1 hist[1], ..., thread_0 hist[255], thread_1 hist[255], ..., thread_31 hist[255]

32 histograms (threads) * 256 bins * 4 B = 32 768 B = 32 kB

Slide 5

Storing histograms in shared memory: notbanked coding

Memory layout: thread_0 hist[0], thread_0 hist[1], ..., thread_0 hist[30], thread_0 hist[31], thread_0 hist[32], thread_0 hist[33], ..., thread_31 hist[255]

48 histograms (threads) * 256 bins * 4 B = 49 152 B = 48 kB

Slide 6

Results

Zeus CPU - GPGPU
operating system: Scientific Linux 5
processors: 12-core Intel Xeon
RAM: 99 GB

Tesla M2090
Global memory available on device in bytes: 5636554752
Shared memory available per block in bytes: 49152
Warp size in threads: 32
Number of multiprocessors on device: 16

Slide 7

Results

Input data: 100 events; two data sets, one fully random and the other with half of the channels set to 0

Implementations: banked, banked_halfZero, notbanked, notbanked_halfZero, cpu_halfZero, GPU_FPoperation (uses some floating-point operation), GPU_pinned_RAM (uses page-locked memory)
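
The two shared-memory layouts on Slides 4 and 5 differ only in their index arithmetic, so the difference can be illustrated with a small host-side model. The sketch below is not the authors' GPU code. It assumes 32 shared-memory banks that are each one 4-byte word wide (the slides state only the warp size of 32), and all function names are hypothetical. It reproduces the memory totals from the slides and counts how many threads of one warp land in the same bank for a given set of bin updates.

```python
from collections import Counter

NUM_BANKS = 32      # assumption: 32 banks, each one 4-byte word wide
NUM_BINS = 256      # bins per histogram (Slides 4 and 5)
BYTES_PER_BIN = 4   # 4 B counters (Slides 4 and 5)


def banked_index(thread, bin_):
    """Slide 4 layout: the same bin of consecutive threads is adjacent."""
    return bin_ * NUM_BANKS + thread


def notbanked_index(thread, bin_):
    """Slide 5 layout: each thread owns one contiguous 256-bin histogram."""
    return thread * NUM_BINS + bin_


def bank_of(word_index):
    """Bank that holds a given 4-byte word under the assumed mapping."""
    return word_index % NUM_BANKS


def shared_bytes(num_histograms):
    """Shared memory needed for the given number of per-thread histograms."""
    return num_histograms * NUM_BINS * BYTES_PER_BIN


def worst_conflict(index_fn, bins_per_thread):
    """Largest number of threads in one warp that address the same bank.

    bins_per_thread[t] is the bin that thread t updates in this step.
    Every thread owns a distinct word, so sharing a bank means a conflict.
    """
    banks = Counter(bank_of(index_fn(t, b)) for t, b in enumerate(bins_per_thread))
    return max(banks.values())
```

In this model `banked_index` always maps thread t to bank t, so `worst_conflict` is 1 whatever the data looks like. With `notbanked_index` the bank is the bin number modulo 32, so the result depends on the data: a warp in which every thread updates bin 0 gives 32, while 32 distinct bins in the range 0 to 31 give 1. The trade-off visible on the slides is capacity: the banked layout is tied to 32 histograms (32 kB), whereas the notbanked layout fits 48 histograms into the 49152 bytes of shared memory per block reported on Slide 6.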
Slide 8

Slide 9

Slide 10

Future

Explore histogramming efficiency with CPU and GPGPU for other data types: bit, 16-bit integer, floating-point (range of interesting values + underflow and overflow)

Implement data transfer over the network:
Data computers send data to histogramming node(s)
Server at the histogramming node collects partial data and combines them in CPU RAM
Monitoring thread on CPU activates GPU kernel when data is ready

Look into removing the transmission bottleneck by installing a 10 Gb Ethernet card at the histogramming node
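
The floating-point case listed under Future (a range of interesting values plus underflow and overflow) comes down to a binning rule. The following is a minimal sketch of one common convention, not something taken from the slides: slot 0 collects underflow, slots 1 to num_bins cover the half-open range [low, high), and slot num_bins + 1 collects overflow. The function names and the slot numbering are assumptions made for illustration, and NaN inputs are not handled.

```python
def bin_index(value, low, high, num_bins):
    """Map a float to a histogram slot with underflow and overflow.

    Slot 0 is underflow, slots 1..num_bins cover [low, high),
    slot num_bins + 1 is overflow. NaN is not handled in this sketch.
    """
    if value < low:
        return 0
    if value >= high:
        return num_bins + 1
    width = (high - low) / num_bins
    # min() keeps a value just below `high` in the last regular bin
    # even if floating-point rounding pushes the quotient up to num_bins.
    return 1 + min(int((value - low) / width), num_bins - 1)


def fill(values, low, high, num_bins):
    """Fill a histogram that has two extra slots for underflow and overflow."""
    hist = [0] * (num_bins + 2)
    for v in values:
        hist[bin_index(v, low, high, num_bins)] += 1
    return hist
```

For example, `fill([-1.0, 0.0, 0.5, 0.99, 1.0, 2.0], 0.0, 1.0, 4)` returns `[1, 1, 0, 1, 1, 2]`: one underflow, three in-range entries, and two overflows. With 256 bins as on Slides 4 and 5, the two extra slots would raise the per-histogram storage from 256 to 258 counters, which matters when the shared-memory budget is already fully used.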