Transcript of "Quasar on GDaaS" (GTC session S6789, Bart Goossens), from on-demand.gputechconf.com/.../s6789-bart-goossens-quasar.pdf
QUASAR (GPU Programming Language) on GDaaS Accelerates Coding from Months to Days
OUTLINE
1. THE CAUSE
2. THE OFFER
3. HOW DOES IT WORK
4. DEMO
5. RESULTS
6. CONCLUSION
GPUs ARE EVERYWHERE… ALMOST EVERYWHERE
- Each HW platform requires a new implementation
- Long development lead times
- Strong coupling between algorithm development and implementation
- Low-level coding experts are required
OBSERVATION
While breakthrough results are achieved, GPU usage in research is still limited:
- Scientific articles mentioning CUDA: ~90K
- Scientific articles mentioning a specific scripting language: 400K–1900K
A continuous investment in easy access:
- Optimized libraries: cuFFT, cuDNN, cuBLAS, …
- Tools: DIGITS
High-level access increases the potential market by a factor of 7, but remains limited to specific applications.
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
QUASAR'S VALUE PROPOSITION
- Lowering the barrier of entry: days instead of months to get started; algorithm development decoupled from implementation → larger market and faster take-up
- Faster development: development cycle reduced by a factor of 3 to 10 → earlier product launch
- Efficient code: lines of code reduced by a factor of 2 to 3, at the same performance → reduced R&D and maintenance costs
- Future proof: early access to the highest performance; code can also target other GPU models
- Better algorithms: distinctive tools for coding, design, analysis and exploration → better products
HOW DOES IT WORK
Y = sum(A + B*C + D)
Code analysis and target-dependent lowering turn this one-line expression into an optimized kernel.
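To pin down what the one-liner computes, here is a NumPy sketch of its semantics (NumPy is used purely as an illustration and is not part of Quasar): an elementwise map over the vectors followed by a full reduction.

```python
import numpy as np

# Illustrative analogue of the Quasar expression Y = sum(A + B*C + D):
# A, B, D are vectors, C is a scalar; sum() reduces over all elements.
A = np.array([1.0, 2.0, 3.0])
B = np.array([0.5, 0.5, 0.5])
C = 2.0
D = np.array([1.0, 1.0, 1.0])

Y = np.sum(A + B * C + D)  # elementwise map, then a full reduction

# The same computation as an explicit loop -- this is the loop the
# compiler parallelizes across GPU threads:
acc = 0.0
for i in range(len(A)):
    acc += A[i] + B[i] * C + D[i]

assert abs(Y - acc) < 1e-9
```

The point of the lowering step is that the user writes only the first line; the strided loop, the shared-memory reduction and the launch configuration are all generated.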
HOW DOES IT WORK
Y = sum(A + B*C + D)
Code analysis and target-dependent lowering automatically generate a kernel function (reconstructed from the slide; operators and punctuation were lost in extraction):

    function [$out:scalar] = __kernel__ kernel$1(A:vec'unchecked, B:vec'unchecked, C:scalar,
            D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
        $bins:vec'unchecked = shared(blkdim)
        $accum0 = 0
        for $m = blkpos + blkidx*blkdim..64*blkdim..$datadims-1
            pos = $m
            $accum0 += A[pos] + B[pos]*C + D[pos]
        end
        $bins[blkpos] = $accum0
        syncthreads
        $bit = 1
        while $bit < blkdim
            if mod(blkpos, 2*$bit) == 0
                $bins[blkpos] += $bins[blkpos+$bit]
            endif
            syncthreads
            $bit *= 2
        end
        if blkpos == 0
            $out += $bins[0]
        endif
    end
    $out = parallel_do([($blksz*[16,4,1]), $blksz], A, B, C, D, numel(A), kernel$1)

- Parallel reduction algorithm using shared memory
- Automatic generation of a kernel function
- Compile-time handling of boundary checks
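The while-loop above is a classic shared-memory tree reduction: at each step the stride doubles and only the threads at multiples of twice the stride accumulate their partner element. The following Python sketch simulates that pattern sequentially (an illustration of the algorithm, not Quasar output):

```python
def block_reduce(bins):
    """Simulate the shared-memory tree reduction from the generated kernel:
    each 'thread' blkpos with blkpos % (2*bit) == 0 adds in the partner
    element bit positions away; the stride bit doubles every step."""
    blkdim = len(bins)
    bins = list(bins)
    bit = 1
    while bit < blkdim:
        for blkpos in range(blkdim):          # threads run in lockstep
            if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                bins[blkpos] += bins[blkpos + bit]
        bit *= 2                              # next level of the tree
    return bins[0]                            # thread 0 holds the block sum

assert block_reduce([1, 2, 3, 4, 5, 6, 7, 8]) == 36
```

With blkdim elements, the reduction finishes in log2(blkdim) steps instead of blkdim-1 sequential additions, which is why the compiler emits this shape rather than a plain loop.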
HOW DOES IT WORK
Y = sum(A + B*C + D)
The same generated kernel is then lowered to target code, here CUDA (reconstructed from the slide; operators and punctuation were lost in extraction):

    __global__ void kernel(scalar* ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
    {
        shmem shmem;
        shmem_init(&shmem);
        int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
        Vector bins;
        scalar accum0;
        int m, bit;
        bins = shmem_alloc<scalar>(&shmem, blkdim);
        accum0 = 0.0f;
        for (m = blkpos + blkidx*blkdim; m <= _P_datadims - 1; m += 64*blkdim)
            accum0 += vector_get_at<scalar>(_PA, m) +
                vector_get_at_checked<scalar>(_PB, m) * _PC + vector_get_at_checked<scalar>(_PD, m);
        vector_set_at<scalar>(bins, blkpos, accum0);
        __syncthreads();
        for (bit = 1; bit < blkdim; bit *= 2)
        {
            if (mod(blkpos, 2*bit) == 0)
            {
                scalar t0 = vector_get_at_safe<scalar>(bins, blkpos + bit);
                scalar t1 = vector_get_at<scalar>(bins, blkpos);
                vector_set_at<scalar>(bins, blkpos, t1 + t0);
            }
            __syncthreads();
        }
        if (blkpos == 0)
            atomicAdd(ret, vector_get_at<scalar>(bins, 0));
    }

- Parallel reduction algorithm using shared memory
- Automatic generation of a kernel function
- Compile-time handling of boundary checks
- Automatic generation of CUDA/OpenCL/C++ code
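The generated CUDA combines three stages: a grid-stride loop in which each thread accumulates a strided slice of the input, the in-block tree reduction, and an atomicAdd of each block's partial sum into the final result. This Python simulation mirrors that structure (illustrative only; the block and grid sizes here are arbitrary, not the generated launch configuration):

```python
def gpu_style_sum(data, blkdim=4, nblocks=2):
    """Simulate the generated kernel's structure: per-thread grid-stride
    accumulation, per-block shared-memory tree reduction, then one
    'atomic' add of each block's partial sum into the result."""
    result = 0.0                               # the atomicAdd target
    stride = blkdim * nblocks                  # total threads in the grid
    for blkidx in range(nblocks):
        # Stage 1: each thread accumulates a strided slice of the input.
        bins = [0.0] * blkdim
        for blkpos in range(blkdim):
            tid = blkpos + blkidx * blkdim
            acc = 0.0
            for m in range(tid, len(data), stride):
                acc += data[m]
            bins[blkpos] = acc
        # Stage 2: tree reduction within the block's shared memory.
        bit = 1
        while bit < blkdim:
            for blkpos in range(blkdim):
                if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                    bins[blkpos] += bins[blkpos + bit]
            bit *= 2
        # Stage 3: thread 0 adds the block's partial sum to the result.
        result += bins[0]
    return result

assert gpu_style_sum(list(range(1, 101))) == 5050.0
```

The grid-stride loop is what lets a fixed-size grid handle inputs of any length, and the atomic add is needed only once per block, so contention on the output scalar stays low.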
QUASAR'S WORKFLOW
Development: scripting language → code analysis & lowering → compilation → hardware, with input data feeding the pipeline.
- High-level scripting: ideal for rapid prototyping, compact and readable code
- Code analysis & lowering uses optimization hints and automatic detection of parallelism
Optimal runtime execution: runtime information (HW setup, load, memory state, scheduling) is fed back into compilation, so the executed code adapts to the actual context.
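Feeding runtime information back into compilation amounts to a dispatch decision: pick a compiled variant based on the hardware and memory state observed at run time. A toy sketch of that idea follows; the variant names and thresholds are invented for illustration and are not Quasar's actual policy.

```python
def pick_variant(device, free_mem_bytes, n):
    """Toy illustration of runtime-information-driven dispatch: choose a
    compiled kernel variant from HW setup, memory state and problem size.
    All names and thresholds here are hypothetical."""
    if device == "cpu":
        return "openmp_variant"        # no GPU available: multi-core path
    if n * 4 > free_mem_bytes:         # float32 data does not fit on the GPU
        return "tiled_gpu_variant"     # stream the data in chunks
    return "fused_gpu_variant"         # single fused kernel on the device

assert pick_variant("cpu", 0, 10) == "openmp_variant"
assert pick_variant("gpu", 1 << 30, 10**6) == "fused_gpu_variant"
assert pick_variant("gpu", 1024, 10**6) == "tiled_gpu_variant"
```

The benefit claimed on the slides follows from this: the same source script runs on any system, and each system gets the optimization that fits it.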
QUASAR ON GDaaS: BENEFITS
1. Anyness (any screen, device, GPU power)
2. Hourly model (1–4 GPUs); monthly Quasar licenses
3. Instant app distribution
Today: M60. Coming: multi-GPU.
DEMO
RESULTS
[Chart: Lines of code (CUDA-LOC vs QUASAR-LOC) for three benchmarks: 32-tap filter with global memory, 2D spatial 32x32 separable filter with global memory, wavelet filter with global memory]
[Chart: Development time (CUDA vs QUASAR) for the same three benchmarks]
[Chart: Execution time in ms (CUDA vs QUASAR) for the same three benchmarks]
Implementation of an MRI reconstruction algorithm took <14 days using QUASAR versus 3 months using CUDA.
More efficient code and shorter development times, while keeping the same performance.
QUASAR APPLICATIONS
Quasar on GDaaS accelerates coding from months to days.
CONCLUSION
- High-level scripting language: ideal for rapid prototyping, fast development, maintainable and compact code
- Optimal usage of heterogeneous hardware (multi-core CPUs, GPUs)
- Context-aware execution: build once, execute on any system; different hardware, different optimization; future-proof code
- Better, faster and smarter development thanks to GDaaS
Quasar on GDaaS accelerates coding from months to days.
www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826. Leave your business card to request your free trial, or go to www.gdaas.com.
2
OUTLINE
1
2
3
CAUSE
THE
OFFER
HOW DOES
IT WORK
4
5
6
DEMO
RESULTS
CONCLUSION
GPUs are
everywhere
4
Each
HW platform
requires a new
implementation
Long
development
lead times
Strong coupling
between algorithm
development and
implementation
hellip ALMOST EVERYWHERE
Low level
coding experts
are required
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
GPUs are
everywhere
4
Each
HW platform
requires a new
implementation
Long
development
lead times
Strong coupling
between algorithm
development and
implementation
hellip ALMOST EVERYWHERE
Low level
coding experts
are required
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
4
Each
HW platform
requires a new
implementation
Long
development
lead times
Strong coupling
between algorithm
development and
implementation
hellip ALMOST EVERYWHERE
Low level
coding experts
are required
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
5
OBSERVATION
While breakthrough results are achieved still limited
usage in research
mdashScientific articles mentioning CUDA 90K
mdashScientific articles mentioning a specific scripting language
400K-1900K
A continuous investment in easy access
mdashOptimized libraries cuFFTcuDNN cuBLAShellip
mdashTools Digits
High level access increases potential market by a factor 7
Limited to specific applications
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
10
HOW DOES IT WORK
Y = sum(A + B C + D)
function $outscalar = __kernel__
kernel$1(AveccoluncheckedBvecuncheckedCscalar
Dveclsquounchecked$datadimsintblkposintblkdimintblkidxint)
$binsvecunchecked=shared(blkdim)
$accum0=0
for $m=(blkpos+(blkidxblkdim))(64blkdim)($datadims-1)
pos=$m
$accum0+=(A[pos]+(B[pos]C)+D[pos])
end
$bins[blkpos]=$accum0
syncthreads
$bit=1
while ($bitltblkdim)
if (mod(blkpos(2$bit))==0)
$bins[blkpos]=($bins[blkpos]+$bins[blkpos+$bit])
endif
syncthreads
$bit=2
continue
end
if (blkpos==0)
$out+=$bins[0]
endif
end
$out=parallel_do([($blksz[1641])$blksz]ABCDnumel(A)kernel$1)
Code analysis and target-dependent lowering
__global__ void kernel(scalar ret Vector _PA Vector _PB scalar _PC Vector _PD int _P_datadims)
shmem shmem
shmem_init(ampshmem)
int blkpos = threadIdxx blkdim = blockDimx blkidx = blockIdxx
Matrix o35 bins
scalar accum0
int m _Lpos bit
bins = shmem_allocltscalargt(ampshmemblkdim)
accum0 = 00f
for (m = (blkpos + (blkidx blkdim)) m lt= _P_datadims - 1 m += (64 blkdim))
accum0 += vector_get_atltscalargt(_PA m) +
vector_get_at_checkedltscalargt(_PB _Lpos) _PC + vector_get_at_checkedltscalargt(_PD m)
vector_set_atltscalargt(bins blkpos accum0)
__syncthreads()
for (bit = 1 bit lt blkdim bit = 2)
if (mod(blkpos(2 bit)) == 0)
scalar t05 = vector_get_at_safeltscalargt(bins blkpos + bit)
scalar t15 = vector_get_atltscalargt(bins blkpos)
vector_set_atltscalargt(bins blkpos (t15 + t05))
if (blkpos == 0)
atomicAdd(ret vector_get_atltscalargt(bins 0))
Parallel reduction algorithm using
shared memory
Automatic generation of a kernel
function
Compile-time handling of
boundary checks
Automatic generation of
CUDAOpenCLC++ code
11
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
DEVELOPMENT
Runtime information
High level scripting
Ideal for rapid prototyping
Compact readable code
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
12
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
Optimization
hints
Automatic
detection of
parallelism
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
13
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA
COMPILATION
HARDWARE
DEVELOPMENT
14
rsquoS WORKFLOW
OPTIMAL RUNTIME
EXECUTION
Runtime
information
SCRIPTING
LANGUAGE
CODE
ANALYSIS amp
LOWERING
INPUT
DATA HARDWARE
COMPILATION
HW-setup load memory state
scheduling
DEVELOPMENT
QUASAR ON
GDaaS
BENEFITS
Anyness (screen
device GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today M60
Coming Multi-GPU
1
2
3
DEMO
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
6
On GPU desktop as a service (GDAAS)
HIGH LEVEL
PROGRAMMING
LANGUAGE
IDE
amp RUNTIME
OPTIMIZATION
KNOWLEDGE
BASE amp
LIBRARIES
7
rsquoS VALUE PROPOSITION
Larger
market and
faster take
up
Days iso of
months
to get started
Algorithm
development
decoupled from
implementation
Lowering
barrier
of entry
Earlier
product
launch
Development
cycle reduction
by a factor of
3 to 10
Faster development
Reducing
RampD and
maintenance
costs
lines of codes
reduction by a
factor of 2 to 3
Same
performance
Efficient
code
Early access
to highest
performance
Future proof
code can
also target
other GPU
models
Future
proof
Better
products
Distinctive
tools for
coding
and design
analysis and
exploration
Better
algorithms
8
HOW DOES IT WORK
Y = sum(A + B C + D)
Code analysis and target-dependent lowering
9
HOW DOES IT WORK
Y = sum(A + B * C + D)
function $out:scalar = __kernel__ kernel$1(A:vec'unchecked, B:vec'unchecked, C:scalar,
        D:vec'unchecked, $datadims:int, blkpos:int, blkdim:int, blkidx:int)
    $bins:vec'unchecked = shared(blkdim)
    $accum0 = 0
    for $m = blkpos + blkidx*blkdim..64*blkdim..$datadims-1
        pos = $m
        $accum0 += A[pos] + B[pos]*C + D[pos]
    end
    $bins[blkpos] = $accum0
    syncthreads
    $bit = 1
    while $bit < blkdim
        if mod(blkpos, 2*$bit) == 0
            $bins[blkpos] = $bins[blkpos] + $bins[blkpos+$bit]
        endif
        syncthreads
        $bit *= 2
    end
    if blkpos == 0
        $out += $bins[0]
    endif
end
$out = parallel_do([$blksz.*[1,64,1], $blksz], A, B, C, D, numel(A), kernel$1)
Code analysis and target-dependent lowering
Parallel reduction algorithm using shared memory
Automatic generation of a kernel function
Compile-time handling of boundary checks
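The tree-shaped reduction described here can be modeled sequentially. The following is a hedged Python sketch (function name and setup are illustrative, not Quasar's API) of the halving scheme: on each pass, the thread at `blkpos` with `blkpos % (2*bit) == 0` adds in its neighbour `bit` positions away, and `bit` doubles until it reaches the block size:

```python
def block_reduce(bins):
    """Model of a shared-memory tree reduction within one thread block.

    Assumes a power-of-two block size, as GPU reduction kernels
    typically do. Each while-iteration corresponds to one barrier
    (__syncthreads) on the GPU.
    """
    bins = list(bins)           # shared-memory array, one slot per thread
    blkdim = len(bins)
    bit = 1
    while bit < blkdim:
        for blkpos in range(blkdim):
            if blkpos % (2 * bit) == 0 and blkpos + bit < blkdim:
                bins[blkpos] += bins[blkpos + bit]
        bit *= 2                # stride doubles after each barrier
    return bins[0]              # thread 0 now holds the block's partial sum

vals = [float(i) for i in range(8)]
assert block_reduce(vals) == sum(vals)  # 28.0
```

Thread 0's partial sum is then combined across blocks (the generated code does this with an atomic add on the output scalar).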
10
HOW DOES IT WORK
Y = sum(A + B * C + D)
Code analysis and target-dependent lowering: the Quasar kernel above, lowered to CUDA:
__global__ void kernel(scalar *ret, Vector _PA, Vector _PB, scalar _PC, Vector _PD, int _P_datadims)
{
    shmem shmem;
    shmem_init(&shmem);
    int blkpos = threadIdx.x, blkdim = blockDim.x, blkidx = blockIdx.x;
    Vector bins;
    scalar accum0;
    int m, _Lpos, bit;
    bins = shmem_alloc<scalar>(&shmem, blkdim);
    accum0 = 0.0f;
    for (m = blkpos + (blkidx * blkdim); m <= _P_datadims - 1; m += 64 * blkdim)
        accum0 += vector_get_at<scalar>(_PA, m) +
                  vector_get_at_checked<scalar>(_PB, _Lpos) * _PC + vector_get_at_checked<scalar>(_PD, m);
    vector_set_at<scalar>(bins, blkpos, accum0);
    __syncthreads();
    for (bit = 1; bit < blkdim; bit *= 2) {
        if (mod(blkpos, 2 * bit) == 0) {
            scalar t05 = vector_get_at_safe<scalar>(bins, blkpos + bit);
            scalar t15 = vector_get_at<scalar>(bins, blkpos);
            vector_set_at<scalar>(bins, blkpos, t15 + t05);
        }
        __syncthreads();
    }
    if (blkpos == 0)
        atomicAdd(ret, vector_get_at<scalar>(bins, 0));
}
Parallel reduction algorithm using shared memory
Automatic generation of a kernel function
Compile-time handling of boundary checks
Automatic generation of CUDA/OpenCL/C++ code
11-14
QUASAR'S WORKFLOW
DEVELOPMENT: scripting language → code analysis & lowering → compilation → hardware.
OPTIMAL RUNTIME EXECUTION: the runtime feeds information back (HW setup, load, memory state, scheduling) and uses the input data to decide how to execute.
High level scripting: ideal for rapid prototyping; compact, readable code.
Code analysis & lowering: optimization hints; automatic detection of parallelism.
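To make "optimal runtime execution" concrete, here is a hedged sketch of the kind of decision such a runtime can only take at run time, once problem size and device state are known. The function name and thresholds are illustrative, not Quasar's actual API:

```python
def choose_target(n_elements, gpu_free_bytes, bytes_per_element=4):
    """Pick an execution target per kernel launch from runtime information.

    Illustrative policy: tiny workloads are not worth the host-to-device
    round-trip, and oversized workloads must fall back when the data
    would not fit in the device's free memory.
    """
    if n_elements < 4096:
        return "cpu"
    if n_elements * bytes_per_element > gpu_free_bytes:
        return "cpu"
    return "gpu"

one_gib = 2**30
assert choose_target(1_000, one_gib) == "cpu"       # too small to offload
assert choose_target(1_000_000, one_gib) == "gpu"   # fits, worth offloading
assert choose_target(10**9, one_gib) == "cpu"       # 4 GB of data > 1 GiB free
```

A real scheduler would also weigh current device load and memory already resident on the GPU, as the slide's feedback loop suggests.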
QUASAR ON GDaaS: BENEFITS
Anyness (screen, device, GPU power)
Hourly model (1-4 GPUs)
Monthly Quasar licenses
Instant app distribution
Today: M60
Coming: multi-GPU
DEMO
17
RESULTS
Lines of code
[Chart: CUDA-LOC vs QUASAR-LOC (scale 0-400) for three kernels: 32-tap filter with global memory; 2D spatial 32x32 separable filter with global memory; wavelet filter with global memory.]
Development Time
[Chart: CUDA-Dev Time vs QUASAR-Dev time (scale 0-4) for the same three kernels.]
Execution time (ms)
[Chart: CUDA-time (ms) vs QUASAR-time (ms) (scale 0-6000) for the same three kernels.]
Implementation of an MRI reconstruction algorithm in less than 14 days using QUASAR, versus 3 months using CUDA.
More efficient code and shorter development times while keeping the same performance.
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language: ideal for rapid prototyping; fast development; maintainable, compact code.
Optimal usage of heterogeneous hardware (multi-core, GPUs).
Context aware execution: build once, execute on any system; different hardware, different optimizations.
Future proof code.
Better, faster and smarter development thanks to GDaaS.
Quasar on GDaaS Accelerates Coding
from Months to Days
www.gdaas.com/quasar
www.gepura.io
Visit us at booth 826
LEAVE YOUR BUSINESS CARD TO REQUEST YOUR FREE TRIAL OR GO TO www.gdaas.com
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
17
050
100150200250300350400
Filter 32 taps withglobal memory
2D spatial filter32x32 separable
with global memory
Wavelet filter withglobal memory
CUDA-LOC QUASAR-LOC
RESULTS
Lines of code
0
05
1
15
2
25
3
35
4
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-Dev Time QUASAR-Dev time
Development Time
0
1000
2000
3000
4000
5000
6000
Filter 32 taps withglobal memory
2D spatial filter 32x32separable with global
memory
Wavelet filter withglobal memory
CUDA-time (ms) QUASAR-time (ms)
Execution time (ms)
Implementation of MRI
reconstruction algorithm in lt14 days
using QUASAR
versus 3 months using CUDA
More efficient code and shorter
development times while keeping
same performance
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
18
QUASAR APPLICATIONS
Quasar on GDaaS Accelerates Coding
from Months to Days
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
19
CONCLUSION
High level scripting language
Ideal for rapid prototyping
Fast development
Maintainable compact code
Optimal usage of heterogeneous hardware (multi-core GPUs)
Context aware execution
Build once execute on any system
Different hardware different optimization
Future proof code
Better faster and smarter development thanks to GDaaS
Quasar on GDaaS Accelerates Coding
from Months to Days
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom
wwwgdaascomquasar
wwwgepuraio
Visit us on booth 826
LEAVE YOUR BUSINESS CARD TO
REQUEST YOUR FREE TRIAL OR
GO TO wwwgdaascom