Parallel Computing on the GPU
Tilani Gunawardena
Goals
• How to program heterogeneous parallel computing systems and achieve:
– High performance and energy efficiency
– Functionality and maintainability
– Scalability across future generations
• Technical subjects:
– Principles and patterns of parallel algorithms
– Programming APIs, tools, and techniques
Tentative Schedule
– Introduction
– GPU Computing and CUDA Intro
– CUDA threading model
– CUDA memory model
– CUDA performance
– Floating Point Considerations
– Application Case Study
Recommended Textbook/Notes
• D. Kirk and W. Hwu, “Programming Massively Parallel Processors – A Hands-on Approach”
• http://www.nvidia.com/ (Communities, CUDA Zone)
• Would you rather plow a field with two strong oxen or 1024 chickens?
How to Dig a Hole Faster?
1. Dig faster
2. Buy a more productive shovel
3. Hire more diggers (the best approach)

Problems:
1. How do we manage them?
2. Will they get in each other's way?
3. Will more diggers help dig the hole deeper, instead of just wider?

1. Dig faster: run the processor with a faster clock, so it spends less time on each step of a computation (limit: power consumption on a chip; increasing the clock speed increases power consumption)
2. Buy a more productive shovel: have the processor do more work on each clock cycle (how much instruction-level parallelism per clock cycle)
3. Hire more diggers: the best approach
Parallelism
• Solve large problems by breaking them into small pieces
• Then run the smaller pieces at the same time
Modern GPU
• 1000's of ALUs
• 100's of processors
• Tens of thousands of concurrent threads
• Example: GeForce GTX Titan X
– 3072 CUDA cores
– 8000 million transistors
– 12 GB GDDR5 memory
– Memory bandwidth: 336 GB/s
– ~65,000 concurrent threads
Feature Size of Processors over Time
As feature size decreases, transistors:
• get smaller
• run faster
• use less power
• and more of them fit on a chip
• As transistors improved, processor designers increased processor clock rates, running them faster and faster every year
• Why don't we keep increasing clock speed? Have transistors stopped getting smaller and faster?
– Problem: heat
• Even though transistors continue to get smaller, faster, and more energy-efficient individually, running billions of them generates a lot of heat, and we cannot keep all of these processors cool
• We cannot keep making a single processor faster and faster (we cannot keep such processors cool)
• Processor designers therefore build:
– Smaller processors that are more efficient in terms of power
– Larger numbers of efficient processors (rather than fewer, faster, less efficient processors)
• What kind of processors do we build?
• CPU
– Complex control hardware
– Flexibility in performance
– Expensive in terms of power
• GPU
– Simpler control hardware
– More hardware for computation
– Potentially more power efficient
– More restrictive programming model
Latency vs Throughput
• Latency: the amount of time to complete one task (time, e.g. seconds)
• Throughput: tasks completed per unit time (e.g. jobs/hour)

Your goals are not aligned with the post office's goals:
• Your goal: optimize for latency (you want to spend as little time as possible)
• Post office's goal: optimize for throughput (the number of customers served per day)

• CPU: optimized for latency (minimize the elapsed time of one particular task)
• GPU: chooses to optimize for throughput
Bandwidth
• How fast a device can send data over a single cable
Bandwidth vs Throughput vs Latency
– Bandwidth is the maximum amount of data that can travel through a 'channel'.
– Throughput is how much data actually does travel through the 'channel' successfully.
– Latency is how long it takes the data to get all the way from the start point to the end point.
Latency vs Bandwidth
• Drive from Colombo to Kandy (100 km)
– Car (5 people, 60 km/h)
– Bus (60 people, 20 km/h)
• Calculate:
– Latency?
– Throughput?
GPUs from the point of view of the software developer
• The importance of programming in parallel:
– 8-core Ivy Bridge processor (Intel)
– 8-wide AVX vector operations per core
– 2 threads per core (hyper-threading)
= 128-way parallelism

On this processor, if you run a completely serial C program with no parallelism at all, you are going to use less than 1% of the capability of this machine.
Introduction
• Microprocessors based on a single CPU drove rapid performance increases and cost reductions in computer applications for more than two decades.
– Users demand even more improvements once they become accustomed to them, creating a positive cycle for the computer industry.
• This drive has slowed since 2003 due to power consumption issues that limited the increase of the clock frequency and the level of productive activities that can be performed in each clock period within a single CPU.
– All microprocessor vendors have switched to multi-core and many-core models, where multiple processing units are used in each chip to increase processing power.
• The vast majority of software applications are written as sequential programs.
– The expectation was that programs run faster with each new generation of microprocessors. This is no longer valid from this day onward:
– No performance improvement
– Reducing the growth opportunities of the computer industry
• Software applications will continue to enjoy performance improvements as parallel programs, in which multiple threads of execution cooperate to achieve the functionality faster.
• Parallel programming is by no means new.
– The HPC community has been developing parallel programs for decades.
– But these programs run on large-scale, expensive computers, and only a few elite applications justify their cost, in practice limiting parallel programming to a small number of application developers.
• Now that all new microprocessors are parallel computers, the number of applications that need to be developed as parallel programs has increased.
GPUs as Parallel Computers
• Since 2003, a class of many-core processors called GPUs has led the race for floating-point performance.
• While the performance of general-purpose microprocessors has slowed, GPUs have continued to improve.
• Many application developers are motivated to move the computationally intensive parts of their software to GPUs for execution.

Why Is There Such a Large Gap?
• The answer lies in the differences in the fundamental design philosophies between the two types of processors:
– CPU: latency-oriented cores
– GPU: throughput-oriented cores
CPU: Latency Oriented Design
• CPU is optimized for sequential code performance
• Large caches
– Convert long-latency memory accesses to short-latency cache accesses
• Sophisticated control
– Branch prediction for reduced branch latency
– Data forwarding for reduced data latency
• Powerful ALU
– Reduced operation latency
GPU: Throughput Oriented Design
• GPU is optimized for the execution of a massive number of threads
• Small caches
– To boost memory throughput
• Simple control
– No branch prediction
– No data forwarding
• Energy-efficient ALUs
– Many, long-latency, but heavily pipelined for high throughput
• Requires a massive number of threads to tolerate latencies
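This design shows up directly in how CUDA programs are written: instead of making one thread fast, you launch one thread per data element, and the hardware hides memory latency by switching among the many threads in flight. The vector-add sketch below is a standard introductory CUDA example, not code from this course; the names (`vector_add`, `n`, block/grid sizes) are illustrative choices.

```cuda
#include <cstdio>

// One thread per element: throughput comes from the sheer number of
// threads in flight, not from making any single thread fast.
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                       // guard: the grid may be larger than n
        c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;           // ~1 million elements
    size_t bytes = n * sizeof(float);
    float *a, *b, *c;
    cudaMallocManaged(&a, bytes);    // unified memory, visible to CPU and GPU
    cudaMallocManaged(&b, bytes);
    cudaMallocManaged(&c, bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    // Launch ~1M threads; while some wait on memory, others execute.
    int block = 256;
    int grid  = (n + block - 1) / block;
    vector_add<<<grid, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    printf("c[0] = %.1f\n", c[0]);   // 1.0 + 2.0 = 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Note that no single GPU thread here is fast; the design wins only because roughly a million of them are available to keep the memory system busy.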
Winning Applications Use Both CPU and GPU
• CPUs for sequential parts where latency matters
– CPUs can be 10+X faster than GPUs for sequential code
• GPUs for parallel parts where throughput wins
– GPUs can be 10+X faster than CPUs for parallel code
Applications