ORNL is managed by UT-Battelle for the US Department of Energy
Combining NVIDIA Docker and databases to enhance agile development and optimize resource allocation
Chris Davis, Sophie Voisin, Devin White, Andrew Hardin
Scalable and High Performance Geocomputation Team, Geographic Information Science and Technology Group, Oak Ridge National Laboratory
GTC 2017 – May 2017
2
Outline
• Background
• Example HPC Application
• Study Results
• Lessons Learned / Future Work
3
The Story
• We are:
  – Developing an HPC suite of applications
  – Spread across multiple R&D teams
  – In an Agile development process
  – Delivering to a production environment
  – Needing to support multiple systems / multiple capabilities
  – Collecting performance metrics for system optimization
4
Why We Use NVIDIA-Docker
[Figure: NVIDIA-Docker vs. Docker vs. Virtual Machine compared on resource optimization, GPU access, flexibility, and operating systems]
5
Hardware – Quadro: Compute + Display
Card        M4000   P6000
Capability  5.2     6.1
Block       32      32
SM          13      30
Cores       1664    3840
Memory      8GB     24GB
6
Hardware – Tesla: Compute Only
Card        K40     K80
Capability  3.5     3.7
Block       16      16
SM          15      13
Cores       2880    2496
Memory      12GB    12GB
7
Hardware – High End
DELL C4130
GPU 4 x K80
RAM 256GB
Cores 48
SSD Storage 400GB
8
Constructing Containers
• Build Container:
  – Based off NVIDIA images at gitlab.com
    – https://gitlab.com/nvidia/cuda/tree/centos7
  – CentOS 7
  – CUDA 8.0 / 7.5
  – cuDNN 5.1
  – GCC 4.9.2
  – Cores: 24
  – Mount local folder with code
• Compile against chosen compute capability
• Copy product inside container
• "docker commit" container updates to new image
• "docker save" to Isilon (scripted in the sketch after the figure below)
[Figure: build workflow – code from the Git repo is compiled inside containers on an HPC server (NVIDIA-Docker, GPUs/CPUs, local drive); the resulting container images and data are stored on Isilon, and compile/profile statistics go to the PostgreSQL Compile Stats and Profile Stats databases]
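A minimal sketch of how this build-and-save loop could be scripted per compute capability; the base image tag, paths, make invocation, and image names are illustrative assumptions rather than the team's actual scripts.

```python
import subprocess

CAPABILITIES = ["30", "35", "37", "50", "52", "60", "61"]

def build_and_save(capability, cuda="8.0"):
    base = f"nvidia/cuda:{cuda}-devel-centos7"            # assumed NVIDIA base image tag
    build_name = f"build_cc{capability}"
    # Start a build container with the source tree mounted from the local drive
    subprocess.run(["docker", "run", "-d", "--name", build_name,
                    "-v", "/local/code:/src", base, "sleep", "infinity"], check=True)
    # Compile for a single compute capability inside the container (24 cores)
    subprocess.run(["docker", "exec", build_name, "make", "-C", "/src",
                    f"CUDA_ARCH=sm_{capability}", "-j24"], check=True)
    # Snapshot the container as a new image and save the tarball to Isilon
    image = f"hpc_app:cc{capability}"
    subprocess.run(["docker", "commit", build_name, image], check=True)
    subprocess.run(["docker", "save", "-o",
                    f"/isilon/images/hpc_app_cc{capability}.tar", image], check=True)
    subprocess.run(["docker", "rm", "-f", build_name], check=True)

for cc in CAPABILITIES:
    build_and_save(cc)
```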
9
Running Containers
• For each compute capability:
  – "docker load" from Isilon storage
  – Run container & profile script
  – Send nvprof results to Profile Stats DB (a scripted sketch follows the figure below)
  – Container/Image removed
[Figure: run workflow – container images and data are pulled from Isilon onto the HPC server (NVIDIA-Docker, GPUs/CPUs, local drive); profiling results go to the PostgreSQL Compile Stats and Profile Stats databases]
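A minimal sketch of the per-capability load / run / clean-up loop, again with illustrative names: the image tags, mount points, and the `run.sh` entry point are assumptions, and the nvidia-docker v1 wrapper is assumed for GPU access.

```python
import subprocess

def profile_capability(capability, dataset):
    image = f"hpc_app:cc{capability}"
    tarball = f"/isilon/images/hpc_app_cc{capability}.tar"
    subprocess.run(["docker", "load", "-i", tarball], check=True)       # "docker load" from Isilon
    profile_csv = f"/profiles/cc{capability}_{dataset}.csv"             # path inside the container
    # nvidia-docker injects the driver and GPU devices; nvprof wraps the application run
    subprocess.run(["nvidia-docker", "run", "--rm",
                    "-v", "/isilon/data:/data:ro",
                    "-v", "/local/profiles:/profiles",
                    image,
                    "nvprof", "--csv", "--log-file", profile_csv,
                    "/opt/app/run.sh", f"/data/{dataset}"], check=True)
    # the CSV left on the local drive is then parsed into the Profile Stats DB
    subprocess.run(["docker", "rmi", image], check=True)                 # container/image removed
```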
10
Hooking It All Together
[Figure: overall architecture – one HPC server builds containers from the Git repo; all HPC servers (NVIDIA-Docker, GPUs/CPUs, local drive) pull container images and data from Isilon; compile and profile statistics are stored in the PostgreSQL Compile Stats and Profile Stats databases]
• One server generates containers
• All servers pull containers from Isilon
• Data to be processed pulled from Isilon
• Container build stats stored in Compiler DB
• Container execution stats stored in Profiler DB
11
Profiling Combinations
• nvprof
  – Output parsed (parsing sketch below)
  – Sent to Profile DB
• Containers for:
  – CUDA version
  – Each compute capability
  – All capabilities
  – CPU only
• Data sets: 4
• Total of 104 profiles
[Figure: profiling matrix – CUDA 7.5 and CUDA 8.0 containers built for CPU only, each compute capability (3.0, 3.5, 3.7, 5.0, 5.2, 6.0, 6.1), and all capabilities, run on the K40, K80, M4000, and P6000 against data sets D1–D4]
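A minimal sketch of how nvprof's `--csv` output might be parsed and pushed into the Profile Stats database with psycopg2; the exact column layout of the nvprof output and the `nvprof_stats` table and field names are assumptions.

```python
import csv
import psycopg2

def load_nvprof_csv(path, hostname, dataset, cuda_version, capability):
    """Assumed trailing nvprof CSV columns: Time(%), Time, Calls, Avg, Min, Max, Name."""
    conn = psycopg2.connect("dbname=profile_stats")
    with conn, conn.cursor() as cur, open(path, newline="") as f:
        rows = csv.reader(line for line in f if not line.startswith("=="))  # drop nvprof banners
        for row in rows:
            try:
                pct = float(row[-7])          # Time(%) column, counted from the right
            except (ValueError, IndexError):
                continue                       # skip header, units, and malformed rows
            step_time, calls, avg, tmin, tmax, name = row[-6:]
            cur.execute(
                """INSERT INTO nvprof_stats
                   (hostname, dataset, cuda_version, compute_capability,
                    step_time_percent, step_time, num_calls, ave_time,
                    min_time, max_time, step_name)
                   VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)""",
                (hostname, dataset, cuda_version, capability,
                 pct, step_time, calls, avg, tmin, tmax, name))
    conn.close()
```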
12
Database
• Postgres Databases
  – Shared fields: Hostname, Dataset, CUDA Version, Compute Capability, Num CPU Threads, Timestamp
  – Compile DB: Compile Time
  – Run Time DB: GPU Device, Execution Time
  – NVPROF DB: Kernel / API Call, Step Name, Step Time, Step Time Percent, Num Calls, Ave Time, Min Time, Max Time
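For illustration, a hypothetical DDL for the profiler table used in the parsing sketch above; the field names follow the slide, but the types and single-table layout are assumptions.

```python
import psycopg2

# Assumed schema for the NVPROF stats table; adjust types/constraints to taste
DDL = """
CREATE TABLE IF NOT EXISTS nvprof_stats (
    hostname            text,
    dataset             text,
    cuda_version        text,
    compute_capability  text,
    num_cpu_threads     integer,
    kernel_api_call     text,
    step_name           text,
    step_time_percent   double precision,
    step_time           double precision,
    num_calls           integer,
    ave_time            double precision,
    min_time            double precision,
    max_time            double precision,
    ts                  timestamptz DEFAULT now()
);
"""

with psycopg2.connect("dbname=profile_stats") as conn, conn.cursor() as cur:
    cur.execute(DDL)
```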
13
Outline
• Background
• Example HPC Application
• Study Results
• Lessons Learned / Future Work
14
Example HPC Application
• Geospatial metadata generator
  – Leverages open source 3rd-party libraries
    • OpenCV, Caffe, GDAL, …
  – Computer vision algorithms – GPU enabled
    • SURF, ORB, NCC, NMI, …
  – Automated matching against control data
  – Calculates geospatial metadata for input imagery
[Images: satellites, manned aircraft, unmanned aerial systems]
15
Example HPC Application - GTC16
• Two-step image re-alignment application using NMI
[Figure: processing pipeline – Input Image, Preprocessing, Source Selection, Global Localization, Registration, Resection, Metadata / Output Image, with CPU and GPU stages indicated]
Core Libraries: NITRO, GDAL, Proj.4, libpq (Postgres), OpenCV, CUDA, OpenMP
Normalized Mutual Information (computed from the source, control, and joint histograms):
$\mathrm{NMI} = \dfrac{H_S + H_C}{H_J}$
16
Example HPC Application - GTC16
• Global Localization
[Figure: processing pipeline with the Global Localization stage highlighted; example control image 382x100 pixels and tactical (source) image 258x67 pixels, giving 4250 candidate solutions (125 x 34 offsets)]
• Objective
  – Re-align the source image with the control image
• Method – in-house implementation
  – Roughly match source and control images
  – Coarse resolution
  – Mask for non-valid data
  – Exhaustive search
17
Example HPC Application - GTC16
• Global Localization
18
Example HPC Application - GTC16
• Similarity Metric
  – Normalized Mutual Information
  – Histogram with masked area
    • Missing data
    • Artifact
    • Homogeneous area
Source image and mask: $N_S \times M_S$ pixels
Control image and mask: $N_C \times M_C$ pixels
Solution space: $n \times m$ NMI coefficients

$\mathrm{NMI} = \dfrac{H_S + H_C}{H_J}$, with $H = -\sum_{x} p(x)\,\log_2 p(x)$
where $H$ is the entropy and $p(x)$ is the probability density function,
with $x \in [0..255]$ for $S$ and $C$, and $x \in [0..65535]$ for the joint histogram $J$.
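A minimal CPU-side sketch (NumPy and 8-bit imagery assumed) of the masked-histogram NMI defined above; it is not the GPU implementation, just the metric itself.

```python
import numpy as np

def entropy(hist):
    p = hist[hist > 0] / hist.sum()           # probability density over non-empty bins
    return -np.sum(p * np.log2(p))

def nmi(source, control, mask):
    s = source[mask]                           # keep only valid (unmasked) pixels
    c = control[mask]
    h_s = np.bincount(s, minlength=256)        # 256-bin source histogram
    h_c = np.bincount(c, minlength=256)        # 256-bin control histogram
    h_j = np.bincount(s.astype(np.uint32) * 256 + c, minlength=65536)  # 65536-bin joint histogram
    return (entropy(h_s) + entropy(h_c)) / entropy(h_j)
```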
19
Example HPC Application - GTC16
Summary
• Global Localization as coarse re-alignment
  – Problematic: joint histogram computation for each solution
    • No compromise on the number of bins (65536)
    • Exhaustive search (sketched below)
  – Solution: leverage the K80 specifications
    • 12 GB of memory
    • 1 thread per solution
    • Less than 25 seconds for 61K solutions on a 131K-pixel image

Kernel specifications (1 solution / thread):
  occupancy                   100%
  threads / block             128
  stack frame                 264192 bytes
  total memory / block        33.81 MB
  total memory / SM           541.06 MB
  total memory / GPU          7.03 GB
  memory %                    61.06%
  spill stores – spill loads  0 – 0
  registers                   27
  smem / block                0
  smem / SM                   0
  smem %                      0.00%
  cmem[0] – cmem[2]           448 – 20
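To make the "one NMI coefficient per candidate offset" structure concrete, here is a plain Python sketch of the exhaustive coarse search; it reuses the `nmi()` helper from the earlier sketch and is not the one-thread-per-solution CUDA kernel.

```python
import numpy as np

def global_localization(source, control, source_mask, control_mask):
    """Exhaustive search: score every placement of the source image over the control image."""
    ns, ms = source.shape
    nc, mc = control.shape
    best_score, best_offset = -np.inf, (0, 0)
    # (nc - ns + 1) * (mc - ms + 1) candidate offsets (e.g. the 4250 solutions above)
    for dy in range(nc - ns + 1):
        for dx in range(mc - ms + 1):
            patch = control[dy:dy + ns, dx:dx + ms]
            mask = source_mask & control_mask[dy:dy + ns, dx:dx + ms]
            score = nmi(source, patch, mask)        # nmi() from the earlier sketch
            if score > best_score:
                best_score, best_offset = score, (dy, dx)
    return best_offset, best_score
```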
20
Example HPC Application - GTC16
• Registration
[Figure: processing pipeline with the Registration stage highlighted; control image 382x100, tactical image 258x67]
21
Example HPC Application - GTC16
• Registration
[Figure: processing pipeline with the Registration stage highlighted; control 382x100 and tactical 258x67 at coarse resolution, tactical & control 4571x1555 at full resolution]
• Objective
  – Refine the localization
• Method
  – Use higher resolution (~400 times more pixels)
  – Keypoint matching
22
Example HPC Application - GTC16
• Registration Workflow
[Figure: registration workflow – keypoints are detected and described on the source image (11x11 intensity-value descriptors); corresponding 73x73-pixel search windows are taken from the control image; descriptors are matched against the search windows with the metric to produce a tiepoint list]
23
• Similarity Metric
  – Normalized Mutual Information
  – Small "images" but numerous keypoints
    • Numerous keypoints: up to 65536 with the GPU SURF detector
    • Image / descriptor size: 11 x 11 intensity values
    • Search area: 73 x 73 control sub-image
    • Solution space: 63 x 63 = 3969 per keypoint (73 − 11 + 1 = 63)
Application
  Descriptors: 11x11 intensity values
  Search windows: 73x73 pixels
  Solution spaces: 63x63 NMI coefficients

$\mathrm{NMI} = \dfrac{H_S + H_C}{H_J}$, with $H = -\sum_{x} p(x)\,\log_2 p(x)$
where $H$ is the entropy and $p(x)$ is the probability density function,
with $x \in [0..255]$ for $S$ and $C$, and $x \in [0..65535]$ for the joint histogram $J$.
24
Example HPC Application - GTC16
Summary
• Registration refines the re-alignment
  – Problematic: joint histogram computation for each solution
    • No compromise on the number of bins (65536)
    • Exhaustive search
  – Solution: leverage the K80 specifications
    • 12 GB of memory
    • 1 block per solution
    • Leverage the number of values in the descriptors: 121 (maximum) << 65536
    • Less than 100 seconds for 65K keypoints (260M NMI coefficients)
    • About 10K keypoints in less than 20 seconds

Kernel: find the best match for all keypoints (see the sketch below)
  – 1 block per keypoint, optimized for the 63 x 63 search windows
  – 64 threads / block (1 idle); each thread computes a "row" of solutions
  – Sparse joint histogram: 65536 bins but only 121 values
    • Leverage the 11 x 11 descriptor size
    • Create 2 lists (length 121) of intensity values
    • Update joint histogram counts from the lists
    • Loop over the lists to retrieve each aggregate count
    • Set the aggregate count to 0 after the first retrieval
[Figure: list of indices for the source descriptor and for the corresponding control subset combine into the joint histogram]
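A plain Python sketch of the sparse joint-histogram idea: with an 11x11 descriptor only 121 of the 65536 joint bins can be occupied, so the joint entropy can be computed from the two 121-value lists alone. This mirrors the idea, not the per-block CUDA kernel.

```python
from collections import Counter
from math import log2

def joint_entropy_sparse(src_vals, ctl_vals):
    """src_vals / ctl_vals: the 121 intensity values (0..255) from the 11x11
    descriptor and the matching 11x11 control sub-window."""
    n = len(src_vals)                              # 121 pixel pairs
    counts = Counter(zip(src_vals, ctl_vals))      # only the occupied joint bins
    h = 0.0
    for c in counts.values():                      # at most 121 occupied bins
        p = c / n
        h -= p * log2(p)
    return h
```

The kernel described above gets the same effect without a hash map: it keeps the two 121-entry lists, accumulates duplicate counts onto the first occurrence of each pair, and zeroes each aggregate count after its first retrieval so every occupied bin contributes exactly once.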
25
Outline
• Background
• Example HPC Application
• Study Results
• Lessons Learned / Future Work
26
Compile Time Results
[Chart: compile time in seconds and resulting binary size in MB vs. compute capability specification (OFF, 30, 35, 37, 50, 52, 60, 61, 30–52, 30–61), for CUDA 7.5 and CUDA 8.0]
27
Run Time Results
[Charts: average run time in seconds (0–200 scale) for data sets D1–D4, comparing CPU, CUDA 7.5, and CUDA 8]
28
K80 - Kernel Time Results in Seconds with nvprof
[Charts: Step 1 kernel timings (roughly 0.1–0.35 s) and Step 2 kernel timings (roughly 10–35 s) vs. CUDA version (7.5 and 8) for data sets D1–D4, showing average, min, max, and std]
29
Run Time Results
[Charts: Step 2 kernel time in seconds (0–200 scale) for data sets D1–D4 on the K40, K80, M4000, and P6000 with CUDA 7.5 and CUDA 8, showing average, min, max, and std]
30
Outline
• Background
• Example HPC Application
• Study Results
• Lessons Learned / Future Work
31
Lessons Learned
• GPU isolation: ran into an issue when swapping out the P6000 and K40
  – nvidia-smi swapped the GPU IDs for the K40 and M4000
  – This caused nvidia-docker to ignore the NV_GPU value
  – UUID vs. index
  – Our application can set the GPU index for multi-GPU environments (defaults to 0); a sketch follows below
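A hypothetical sketch of pinning a run to a specific GPU by UUID instead of by index, which sidesteps the re-enumeration issue above; it assumes the nvidia-docker v1 wrapper (which reads NV_GPU) and an illustrative image name.

```python
import os
import subprocess

def gpu_uuid(index):
    """Look up a stable UUID for the GPU currently enumerated at `index`."""
    out = subprocess.run(["nvidia-smi", "--query-gpu=uuid", "--format=csv,noheader"],
                         capture_output=True, text=True, check=True)
    return out.stdout.splitlines()[index].strip()

# Select the GPU by UUID so a later re-enumeration does not change the target device
env = dict(os.environ, NV_GPU=gpu_uuid(0))
subprocess.run(["nvidia-docker", "run", "--rm", "hpc_app:cc61", "/opt/app/run.sh"],
               env=env, check=True)
```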
32
Future Work
• Move off desktop machines to a full testing platform with dedicated hardware and multiple GPU types
• Investigate Docker Registry & Docker Swarm for managing containers
• Enhance Database analysis to autogenerate reports
• Generalize the process to containerize any GPU application to profile with this architecture
Thank you!
34
Customer Resources
DELL C4130
GPU 4 x K80
RAM 256GB
Cores 48
SSD Storage 400GB
[Chart: run time with 6 threads in seconds (0–50 scale) for data sets D1–D4, CPU vs. CUDA 7.5]