Maximizing GPU Power for Vision and Depth Sensor Processing
From NVIDIA's Tegra K1 to GPUs on the Cloud
Chen Sagiv, Eri Rubin
SagivTech Ltd.
Today's Talk
• Mobile Revolution
• Mobile – Cloud Concept
• 3D Imaging
• Two use cases:
  - SceneNet on Tegra K1
  - Depth Sensing on Tegra K1
• SagivTech Streaming Infrastructure
• Take-home Tips for Tegra K1
SagivTech Snapshot
• Established in 2009 and headquartered in Israel
• Core domain expertise: GPU computing and computer vision
• What we do: technology, solutions, projects, EU research, training
• GPU expertise:
  - Hard-core optimizations
  - Efficient streaming for single- or multi-GPU systems
  - Mobile GPUs
Mobile Revolution is happening now!
• In 1984, this was cutting-edge science fiction in The Terminator
• 30 years later, science fiction is becoming a reality!
Mobile – Cloud Concept
The Combined Model: Mobile & Cloud Computing
• Understanding, interpreting and interacting with our surroundings via a mobile device
• Demand for immense processing power to run computationally intensive algorithms in real time with low latency
• Computation tasks are divided between the device and the server
• With CUDA – it's simply easier!
3D Imaging is happening now!
• Acquisition – depth sensors
• Processing – modeling, segmentation, recognition, tracking
• Visualization – digital holography
Mobile Crowdsourcing Video Scene Reconstruction
• If you've been to a concert recently, you've probably seen how many people take videos of the event with mobile phone cameras
• Each user has only one video – taken from one angle and location, and of only moderate quality

The Idea behind SceneNet
Leverage the power of multiple mobile phone cameras to create a high-quality 3D video experience that is sharable via social networks.
Creation of the 3D Video Sequence
1. The scene is photographed by several people using their cell phone cameras.
2. The video data is transmitted via the cellular network to a high-performance computing server.
3. Following time synchronization, resolution normalization and spatial registration, the several videos are merged into a 3D video cube.

The Event Community
A 3D video event is created. The 3D video event will be available on the internet as a public or private event. The event will create a community, where each member may provide another piece of the puzzle and view, share and search the entire information.
GPU Computing in SceneNet
Computational acceleration of video registration & 3D reconstruction.
Bilateral Filter Acceleration on Tegra K1
Image Size     1 CPU Thread    4 CPU Threads    GPU       Speedup
256 x 256      630 ms          170 ms           2.8 ms    x60
512 x 512      2550 ms         690 ms           12 ms     x57
1024 x 1024    10300 ms        2720 ms          45 ms     x60
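For reference, the bilateral filter being accelerated can be sketched on the CPU as below. This is a textbook single-channel version; the window radius, the two sigmas and the float image layout are our assumptions, not details from the talk, and SagivTech's optimized GPU kernel is not shown here.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// CPU reference bilateral filter: each output pixel is a normalized sum of
// its neighbors, weighted by a spatial Gaussian (distance in the image plane)
// and a range Gaussian (difference in intensity). Edges are handled by
// skipping out-of-bounds neighbors.
std::vector<float> bilateral(const std::vector<float>& img, int w, int h,
                             int radius, float sigma_s, float sigma_r) {
    std::vector<float> out(img.size());
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            float center = img[y * w + x];
            float sum = 0.f, norm = 0.f;
            for (int dy = -radius; dy <= radius; ++dy) {
                for (int dx = -radius; dx <= radius; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || nx >= w || ny < 0 || ny >= h) continue;
                    float v = img[ny * w + nx];
                    float ws = std::exp(-(dx * dx + dy * dy) /
                                        (2.f * sigma_s * sigma_s));
                    float wr = std::exp(-(v - center) * (v - center) /
                                        (2.f * sigma_r * sigma_r));
                    sum += ws * wr * v;
                    norm += ws * wr;
                }
            }
            out[y * w + x] = sum / norm;
        }
    }
    return out;
}
```

The per-pixel window makes the cost grow with both image size and radius, which is why the CPU timings above scale roughly with pixel count and why the filter maps so well to one-thread-per-pixel GPU execution.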
First Depth Sensing Module for Mobile Devices – on Tegra K1
• The Mission: running a depth sensing technology on a mobile platform
• The Challenge: first time on Tegra K1 – extreme optimizations on a CPU–GPU platform to allow the device to handle other tasks in parallel
• The Expertise:
  - Mantis Vision – the 3D core technology and structured light algorithms
  - SagivTech – the GPU computing expertise
• The bottom line: depth sensing is running in real time, in parallel to other compute-intensive applications!
Migrating from Discrete Kepler to K1
• In one word: easy!
• Started with the most similar discrete platform – the GT 630, based on the GK208.
• Took only a few hours to port all the code.
• What's our secret?
SagivTech Infra Stack
Our infra is composed of a set of modules:
• STInfraSys
• STInfraGPU
• STStreamingGPU
• STMultiGPU
• STCudaKernels
• STCudaFunctions
• STGLInterop
Timing Code Sample
A single line of code times a block:

for (int i = 0; i < N; ++i) {
    START_BLOCK_TIME();
    // ... calculate some stuff ...
    TAKE_BLOCK_SUB_TIME("2. First Part");
    // ... calculate some stuff ...
    TAKE_BLOCK_SUB_TIME("3. Second Part");
}
Sample output:

Timers:
---------
BENCHMARK:                  Recent Avg    Global Avg    Max time    Count
--------------------------------------------------------------------------
|MyFunc.1. First Part       142.594       142.659       156.859     100
|MyFunc.2. Calculation      1706.63       1720.07       1987.78     100
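A minimal sketch of how such block-timing macros could be built on std::chrono. The macro names come from the slide; the implementation here (a single global start point that resets on every sub-time) is our assumption, and the real infra additionally tracks recent/global averages, max times and call counts per label.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>

// Start point of the current timed sub-block.
static std::chrono::steady_clock::time_point g_blockStart;

// Mark the start of a timed block.
#define START_BLOCK_TIME() (g_blockStart = std::chrono::steady_clock::now())

// Print the elapsed time since the last mark under a label, then restart the
// clock so the next label times only its own part.
#define TAKE_BLOCK_SUB_TIME(label)                                           \
    do {                                                                     \
        auto st_now_ = std::chrono::steady_clock::now();                     \
        double st_ms_ = std::chrono::duration<double, std::milli>(           \
                            st_now_ - g_blockStart).count();                 \
        std::printf("%-24s %.3f ms\n", label, st_ms_);                       \
        g_blockStart = st_now_;                                              \
    } while (0)
```

Resetting the start point inside the macro is what lets consecutive TAKE_BLOCK_SUB_TIME calls partition one loop body into separately timed parts.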
NDArray
The major functionalities provided by the NDArray are:
• Initialize an NDArray of any arbitrary size
• Bind to an existing pre-allocated device/host pointer
• Copy to/from host/device
• Load and save functionality to/from file – especially useful for regression purposes
• Most of the NDArray functionality is performed asynchronously

NDArray Code Sample
• STL-style code – no need to alloc and free
• Async is hidden from the user
st::CArray1D<int> arr_h1;
st::CArray1D<int> arr_d1(iArrayLength, false, 512);
arr_h1.Init(iArrayLength);
arr_h1.Fill(11);
arr_h1.CopyTo(arr_d1);
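As a rough illustration of the "STL style" idea, a host-only container with a similar Init/Fill/CopyTo surface might look like the sketch below. The class name Array1D and its members are simplified stand-ins for st::CArray1D, which additionally manages device allocations (the extra constructor arguments in the sample above) and performs copies asynchronously.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Host-only sketch: the container owns its storage, so the user never calls
// alloc/free explicitly -- destruction releases the memory automatically.
template <typename T>
class Array1D {
public:
    Array1D() = default;
    explicit Array1D(int n) : data_(n) {}

    // (Re)allocate to n default-constructed elements.
    void Init(int n) { data_.assign(n, T{}); }

    // Set every element to v.
    void Fill(const T& v) { std::fill(data_.begin(), data_.end(), v); }

    // Copy this array's contents into dst, resizing it as needed.
    void CopyTo(Array1D& dst) const { dst.data_ = data_; }

    int Size() const { return static_cast<int>(data_.size()); }
    const T& operator[](int i) const { return data_[i]; }

private:
    std::vector<T> data_;
};
```

In the real class, CopyTo between a host-backed and a device-backed array would issue a cudaMemcpyAsync-style transfer instead of a plain assignment, which is how the async behavior stays hidden from the user.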
Regression Code Sample
A single-line regression system:

st::RegressionParameters par = st::System::GetInstance().GetRegParams();
par.mode = regressionMode;
st::System::GetInstance().SetRegressionParams(par);
if (!ST_REGRESSION(h_cmpNDArr)) return 1;
return 0;
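Underneath, a one-line check like ST_REGRESSION plausibly reduces to record-or-compare against a stored golden reference. The sketch below is our guess at that core; the function name, the mode enum and the tolerance are hypothetical, and the real system persists references to file (via the NDArray load/save functionality) rather than keeping them in memory.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical regression core: in Record mode the current result becomes the
// golden reference; in Compare mode the result is checked element-wise
// against the golden reference within a tolerance.
enum class RegMode { Record, Compare };

bool regressionCheck(const std::vector<float>& data,
                     std::vector<float>& golden,
                     RegMode mode, float tol = 1e-5f) {
    if (mode == RegMode::Record) {
        golden = data;  // store as the new reference
        return true;
    }
    if (golden.size() != data.size()) return false;
    for (std::size_t i = 0; i < data.size(); ++i)
        if (std::fabs(data[i] - golden[i]) > tol) return false;
    return true;
}
```

A tolerance-based compare matters on the GPU, where reordered floating-point reductions can legitimately differ from the recorded reference in the last bits.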
ST MultiGPU – Real World Use Case
GPU streaming:
• Four GPUs, four pipes
• Utilization: 96%+
• FPS: 20.46
• Scaling: 3.79 – near-linear scaling!
• Note: no gaps in the profiler
Key Points for Developing on the K1
• Remember that Android is overlaid on a Linux base
• Code development and testing (including CUDA) can be done on any PC
• Profiling on Logan – NVProf for Logan – can be ported to your PC
• There is a strong separation between the Android system and the NDK:
  - A CUDA developer doesn't need to become an Android developer – from the Android developer's viewpoint this is simply a library
  - An Android developer doesn't need to become a CUDA developer
Take Home Tips for CUDA on Tegra K1
• Only 1 SMX (compared to 15 on the K20X)
• Only one RAM, shared by the CPU and the GPU
• Shared memory is similar in behavior to shared memory on discrete Kepler
• LDG – a very useful, easy optimization
• We used Thrust and moved to CUB (for streams)
• It will be possible to use existing library infrastructure on Logan
• Development methodology is similar to discrete GPU development
• No dynamic parallelism
• No Hyper-Q
• Don't underestimate Tegra's CPU – the challenge is to divide work between the various components
This project is partially funded by the European Union under the 7th Research Framework Programme, FET-Open SME, grant agreement no. 309169.
Mobile Crowdsourcing Video Scene Reconstruction
Thank You
For more information please contact:
Nizan Sagiv
nizan@sagivtech.com
+972 52 811 3456