Transcript of "Quasar on GDaaS" (GTC session S6789, Bart Goossens), from on-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf
QUASAR (GPU Programming Language) on GDaaS Accelerates Coding from Months to Days
OUTLINE
1. THE CAUSE
2. THE OFFER
3. HOW DOES IT WORK
4. DEMO
5. RESULTS
6. CONCLUSION
GPUs ARE EVERYWHERE… ALMOST EVERYWHERE
- Each HW platform requires a new implementation
- Long development lead times
- Strong coupling between algorithm development and implementation
- Low-level coding experts are required
OBSERVATION
While breakthrough results are achieved, GPU usage in research is still limited:
- Scientific articles mentioning CUDA: ~90K
- Scientific articles mentioning a specific scripting language: 400K–1900K
A continuous investment in easy access:
- Optimized libraries: cuFFT, cuDNN, cuBLAS, …
- Tools: DIGITS
High-level access increases the potential market by a factor of 7, but remains limited to specific applications.
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
QUASAR'S VALUE PROPOSITION
- Lowering the barrier of entry: days instead of months to get started; algorithm development decoupled from implementation → larger market and faster take-up
- Faster development: development cycle reduced by a factor of 3 to 10 → earlier product launch
- Efficient code: lines of code reduced by a factor of 2 to 3, at the same performance → reduced R&D and maintenance costs
- Future proof: early access to the highest performance; code can also target other GPU models
- Better algorithms: distinctive tools for coding, design, analysis and exploration → better products
HOW DOES IT WORK
Y = sum(A + B*C + D)
Code analysis and target-dependent lowering turn this one-line expression into an optimized kernel.
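To pin down what the one-liner computes, here is a NumPy sketch of its semantics (NumPy is used purely as an illustration and is not part of Quasar): an elementwise map over the vectors followed by a full reduction.

```python
import numpy as np

# Illustrative analogue of the Quasar expression Y = sum(A + B*C + D):
# A, B, D are vectors, C is a scalar; sum() reduces over all elements.
A = np.array([1.0, 2.0, 3.0])
B = np.array([0.5, 0.5, 0.5])
C = 2.0
D = np.array([1.0, 1.0, 1.0])

Y = np.sum(A + B * C + D)  # elementwise map, then a full reduction

# The same computation as an explicit loop -- this is the loop the
# compiler parallelizes across GPU threads:
acc = 0.0
for i in range(len(A)):
    acc += A[i] + B[i] * C + D[i]

assert abs(Y - acc) < 1e-9
```

The point of the lowering step is that the user writes only the first line; the strided loop, the shared-memory reduction and the launch configuration are all generated.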
HOW DOES IT WORK
Y = sum(A + B*C + D)
Code analysis and target-dependent lowering automatically generate a kernel function (reconstructed from the slide; operators and punctuation were lost in extraction):

    function [$out:scalar] = __kernel__ kernel$1(A:vec'unchecked, B:vec'unchecked, C:scalar,
            D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
        $bins:vec'unchecked = shared(blkdim)
        $accum0 = 0
        for $m = blkpos + blkidx*blkdim..64*blkdim..$datadims-1
            pos = $m
            $accum0 += A[pos] + B[pos]*C + D[pos]
        end
        $bins[blkpos] = $accum0
        syncthreads
        $bit = 1
        while $bit < blkdim
            if mod(blkpos, 2*$bit) == 0
                $bins[blkpos] += $bins[blkpos+$bit]
            endif
            syncthreads
            $bit *= 2
        end
        if blkpos == 0
            $out += $bins[0]
        endif
    end
    $out = parallel_do([($blksz*[16,4,1]), $blksz], A, B, C, D, numel(A), kernel$1)

- Parallel reduction algorithm using shared memory
- Automatic generation of a kernel function
- Compile-time handling of boundary checks
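The while-loop above is a classic shared-memory tree reduction: at each step the stride doubles and only the threads at multiples of twice the stride accumulate their partner element. The following Python sketch simulates that pattern sequentially (an illustration of the algorithm, not Quasar output):

```python
def block_reduce(bins):
    """Simulate the shared-memory tree reduction from the generated kernel:
    each 'thread' blkpos with blkpos % (2*bit) == 0 adds in the partner
    element bit positions away; the stride bit doubles every step."""
    blkdim = len(bins)
    bins = list(bins)
    bit = 1
    while bit < blkdim:
        for blkpos in range(blkdim):          # threads run in lockstep
            if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                bins[blkpos] += bins[blkpos + bit]
        bit *= 2                              # next level of the tree
    return bins[0]                            # thread 0 holds the block sum

assert block_reduce([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```

With blkdim elements, the reduction finishes in log2(blkdim) steps instead of blkdim-1 sequential additions, which is why the compiler emits this shape rather than a plain loop.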
HOW DOES IT WORK
Y = sum(A + B*C + D)
The same generated kernel is then lowered to target code, here CUDA (reconstructed from the slide; operators and punctuation were lost in extraction):

    __global__ void kernel(scalar* ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
    {
        shmem shmem;
        shmem_init(&shmem);
        int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
        Vector bins;
        scalar accum0;
        int m, bit;
        bins = shmem_alloc<scalar>(&shmem, blkdim);
        accum0 = 0.0f;
        for (m = blkpos + blkidx*blkdim; m <= _P_datadims - 1; m += 64*blkdim)
            accum0 += vector_get_at<scalar>(_PA, m) +
                vector_get_at_checked<scalar>(_PB, m) * _PC + vector_get_at_checked<scalar>(_PD, m);
        vector_set_at<scalar>(bins, blkpos, accum0);
        __syncthreads();
        for (bit = 1; bit < blkdim; bit *= 2)
        {
            if (mod(blkpos, 2*bit) == 0)
            {
                scalar t0 = vector_get_at_safe<scalar>(bins, blkpos + bit);
                scalar t1 = vector_get_at<scalar>(bins, blkpos);
                vector_set_at<scalar>(bins, blkpos, t1 + t0);
            }
            __syncthreads();
        }
        if (blkpos == 0)
            atomicAdd(ret, vector_get_at<scalar>(bins, 0));
    }

- Parallel reduction algorithm using shared memory
- Automatic generation of a kernel function
- Compile-time handling of boundary checks
- Automatic generation of CUDA/OpenCL/C++ code
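The generated CUDA combines three stages: a grid-stride loop in which each thread accumulates a strided slice of the input, the in-block tree reduction, and an atomicAdd of each block's partial sum into the final result. This Python simulation mirrors that structure (illustrative only; the block and grid sizes here are arbitrary, not the generated launch configuration):

```python
def gpu_style_sum(data, blkdim=4, nblocks=2):
    """Simulate the generated kernel's structure: per-thread grid-stride
    accumulation, per-block shared-memory tree reduction, then one
    'atomic' add of each block's partial sum into the result."""
    result = 0.0                               # the atomicAdd target
    stride = blkdim * nblocks                  # total threads in the grid
    for blkidx in range(nblocks):
        # Stage 1: each thread accumulates a strided slice of the input.
        bins = [0.0] * blkdim
        for blkpos in range(blkdim):
            tid = blkpos + blkidx * blkdim
            acc = 0.0
            for m in range(tid, len(data), stride):
                acc += data[m]
            bins[blkpos] = acc
        # Stage 2: tree reduction within the block's shared memory.
        bit = 1
        while bit < blkdim:
            for blkpos in range(blkdim):
                if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                    bins[blkpos] += bins[blkpos + bit]
            bit *= 2
        # Stage 3: thread 0 adds the block's partial sum to the result.
        result += bins[0]
    return result

assert gpu_style_sum(list(range(1, 101))) == 5050.0
```

The grid-stride loop is what lets a fixed-size grid handle inputs of any length, and the atomic add is needed only once per block, so contention on the output scalar stays low.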
QUASAR'S WORKFLOW
Development: scripting language → code analysis & lowering → compilation → hardware, with input data feeding the pipeline.
- High-level scripting: ideal for rapid prototyping, compact and readable code
- Code analysis & lowering uses optimization hints and automatic detection of parallelism
Optimal runtime execution: runtime information (HW setup, load, memory state, scheduling) is fed back into compilation, so the executed code adapts to the actual context.
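Feeding runtime information back into compilation amounts to a dispatch decision: pick a compiled variant based on the hardware and memory state observed at run time. A toy sketch of that idea follows; the variant names and thresholds are invented for illustration and are not Quasar's actual policy.

```python
def pick_variant(device, free_mem_bytes, n):
    """Toy illustration of runtime-information-driven dispatch: choose a
    compiled kernel variant from HW setup, memory state and problem size.
    All names and thresholds here are hypothetical."""
    if device == "cpu":
        return "openmp_variant"        # no GPU available: multi-core path
    if n * 4 > free_mem_bytes:         # float32 data does not fit on the GPU
        return "tiled_gpu_variant"     # stream the data in chunks
    return "fused_gpu_variant"         # single fused kernel on the device

assert pick_variant("cpu", 0, 10) == "openmp_variant"
assert pick_variant("gpu", 1 << 30, 10**6) == "fused_gpu_variant"
assert pick_variant("gpu", 1024, 10**6) == "tiled_gpu_variant"
```

The benefit claimed on the slides follows from this: the same source script runs on any system, and each system gets the optimization that fits it.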
QUASAR ON GDaaS: BENEFITS
1. Anyness (any screen, device, GPU power)
2. Hourly model (1–4 GPUs); monthly Quasar licenses
3. Instant app distribution
Today: M60. Coming: multi-GPU.
DEMO
RESULTS
[Chart: Lines of code (CUDA-LOC vs QUASAR-LOC) for three benchmarks: 32-tap filter with global memory, 2D spatial 32x32 separable filter with global memory, wavelet filter with global memory]
[Chart: Development time (CUDA vs QUASAR) for the same three benchmarks]
[Chart: Execution time in ms (CUDA vs QUASAR) for the same three benchmarks]
Implementation of an MRI reconstruction algorithm took <14 days using QUASAR versus 3 months using CUDA.
More efficient code and shorter development times, while keeping the same performance.
QUASAR APPLICATIONS
Quasar on GDaaS accelerates coding from months to days.
CONCLUSION
- High-level scripting language: ideal for rapid prototyping, fast development, maintainable and compact code
- Optimal usage of heterogeneous hardware (multi-core CPUs, GPUs)
- Context-aware execution: build once, execute on any system; different hardware, different optimization; future-proof code
- Better, faster and smarter development thanks to GDaaS
Quasar on GDaaS accelerates coding from months to days.
www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826. Leave your business card to request your free trial, or go to www.gdaas.com.
2
OUTLINE
1
2
3
CAUSE
THE
OFFER
HOW DOES
IT WORK
4
5
6
DEMO
RESULTS
CONCLUSION
GPUs are
everywhere
4
Each
HW platform
requires a new
implementation
Long
development
lead times
Strong coupling
between algorithm
development and
implementation
hellip ALMOST EVERYWHERE
Low level
coding experts
are required
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
GPUs are
everywhere
4
Each
HW platform
requires a new
implementation
Long
development
lead times
Strong coupling
between algorithm
development and
implementation
hellip ALMOST EVERYWHERE
Low level
coding experts
are required
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
4
Each
HW platform
requires a new
implementation
Long
development
lead times
Strong coupling
between algorithm
development and
implementation
hellip ALMOST EVERYWHERE
Low level
coding experts
are required
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B * C + D)
function $out:scalar = __kernel__ kernel$1(A:vec'unchecked, B:vec'unchecked, C:scalar,
        D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0
    for $m = blkpos + blkidx*blkdim..64*blkdim..$datadims-1
        pos = $m
        $accum0 += A[pos] + B[pos]*C + D[pos]
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while $bit < blkdim
        if mod(blkpos, 2*$bit) == 0
            $bins[blkpos] = $bins[blkpos] + $bins[blkpos+$bit]
        endif
        syncthreads
        $bit *= 2
    end
    if blkpos == 0
        $out += $bins[0]
    endif
end
$out = parallel_do([$blksz.*[1,64,1], $blksz], A, B, C, D, numel(A), kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using shared memory
Automatic generation of a kernel function
Compile-time handling of boundary checks
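The tree-shaped reduction described here can be modeled sequentially. The following is a hedged Python sketch (function name and setup are illustrative, not Quasar's API) of the halving scheme: on each pass, the thread at `blkpos` with `blkpos % (2*bit) == 0` adds in its neighbour `bit` positions away, and `bit` doubles until it reaches the block size:

```python
def block_reduce(bins):
    """Model of a shared-memory tree reduction within one thread block.

    Assumes a power-of-two block size, as GPU reduction kernels
    typically do. Each while-iteration corresponds to one barrier
    (__syncthreads) on the GPU.
    """
    bins = list(bins)           # shared-memory array, one slot per thread
    blkdim = len(bins)
    bit = 1
    while bit < blkdim:
        for blkpos in range(blkdim):
            if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                bins[blkpos] += bins[blkpos + bit]
        bit *= 2                # stride doubles after each barrier
    return bins[0]              # thread 0 now holds the block's partial sum

vals = [float(i) for i in range(8)]
assert block_reduce(vals) == sum(vals)  # 28.0
```

Thread 0's partial sum is then combined across blocks (the generated code does this with an atomic add on the output scalar).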
10
HOW DOES IT WORK
Y = sum(A + B * C + D)
Code analysis and target-dependent lowering: the Quasar kernel above, lowered to CUDA:
__global__ void kernel(scalar *ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
{
    shmem shmem;
    shmem_init(&shmem);
    int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
    Vector bins;
    scalar accum0;
    int m, _Lpos, bit;
    bins = shmem_alloc<scalar>(&shmem, blkdim);
    accum0 = 0.0f;
    for (m = blkpos + (blkidx * blkdim); m <= _P_datadims - 1; m += 64 * blkdim)
        accum0 += vector_get_at<scalar>(_PA, m) +
                  vector_get_at_checked<scalar>(_PB, _Lpos) * _PC + vector_get_at_checked<scalar>(_PD, m);
    vector_set_at<scalar>(bins, blkpos, accum0);
    __syncthreads();
    for (bit = 1; bit < blkdim; bit *= 2) {
        if (mod(blkpos, 2 * bit) == 0) {
            scalar t05 = vector_get_at_safe<scalar>(bins, blkpos + bit);
            scalar t15 = vector_get_at<scalar>(bins, blkpos);
            vector_set_at<scalar>(bins, blkpos, t15 + t05);
        }
        __syncthreads();
    }
    if (blkpos == 0)
        atomicAdd(ret, vector_get_at<scalar>(bins, 0));
}
Parallel reduction algorithm using shared memory
Automatic generation of a kernel function
Compile-time handling of boundary checks
Automatic generation of CUDA/OpenCL/C++ code
11-14
QUASAR'S WORKFLOW
DEVELOPMENT: scripting language → code analysis & lowering → compilation → hardware.
OPTIMAL RUNTIME EXECUTION: the runtime feeds information back (HW setup, load, memory state, scheduling) and uses the input data to decide how to execute.
High level scripting: ideal for rapid prototyping; compact, readable code.
Code analysis & lowering: optimization hints; automatic detection of parallelism.
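To make "optimal runtime execution" concrete, here is a hedged sketch of the kind of decision such a runtime can only take at run time, once problem size and device state are known. The function name and thresholds are illustrative, not Quasar's actual API:

```python
def choose_target(n_elements, gpu_free_bytes, bytes_per_element=4):
    """Pick an execution target per kernel launch from runtime information.

    Illustrative policy: tiny workloads are not worth the host-to-device
    round-trip, and oversized workloads must fall back when the data
    would not fit in the device's free memory.
    """
    if n_elements < 4096:
        return "cpu"
    if n_elements * bytes_per_element > gpu_free_bytes:
        return "cpu"
    return "gpu"

one_gib = 2**30
assert choose_target(1_000, one_gib) == "cpu"       # too small to offload
assert choose_target(1_000_000, one_gib) == "gpu"   # fits, worth offloading
assert choose_target(10**9, one_gib) == "cpu"       # 4 GB of data > 1 GiB free
```

A real scheduler would also weigh current device load and memory already resident on the GPU, as the slide's feedback loop suggests.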
QUASAR ON GDaaS: BENEFITS
Anyness (screen, device, GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today: M60
Coming: multi-GPU
DEMO
17
RESULTS
Lines of code
[Chart: CUDA-LOC vs QUASAR-LOC (scale 0-400) for three kernels: 32-tap filter with global memory; 2D spatial 32x32 separable filter with global memory; wavelet filter with global memory.]
Development Time
[Chart: CUDA-Dev Time vs QUASAR-Dev time (scale 0-4) for the same three kernels.]
Execution time (ms)
[Chart: CUDA-time (ms) vs QUASAR-time (ms) (scale 0-6000) for the same three kernels.]
Implementation of an MRI reconstruction algorithm in less than 14 days using QUASAR, versus 3 months using CUDA.
More efficient code and shorter development times while keeping the same performance.
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language: ideal for rapid prototyping; fast development; maintainable, compact code.
Optimal usage of heterogeneous hardware (multi-core, GPUs).
Context aware execution: build once, execute on any system; different hardware, different optimizations.
Future proof code.
Better, faster and smarter development thanks to GDaaS.
Quasar on GDaaS Accelerates Coding
from Months to Days
www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826
LEAVE YOUR BUSINESS CARD TO REQUEST YOUR FREE TRIAL OR GO TO www.gdaas.com
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom