Posted: 21-Dec-2015
Software Performance Tuning Project – Final Presentation
Prepared By: Eyal Segal Koren Shoval
Advisors: Liat Atsmon Koby Gottlieb
WavPack – Description
• WavPack is an open-source audio compression format.
  – Allows lossless audio compression.
• Compresses WAV files to WV files.
  – Average compression ratio is 30–70%.
• Support for Windows and mobile devices.
  – Cowon A3 PMP, iRiver, iPod, Nokia phones, and more.
Project Goals
• Enhance the WavPack performance by:
  – Working and analyzing with the Intel® VTune™ Performance Analyzer.
  – Studying and applying instructions of Intel®'s new processors.
  – Implementing multi-threading techniques in order to achieve high performance.
• Return the source code to the community.
Algorithm Description
• Input file is processed in blocks of 512 KB.
  – A global context exists for all blocks.
  – Blocks are divided into sub-blocks.
    • 24,000 samples, equivalent to 0.5 second of WAV at CD quality.
  – Encodes each block and writes to output.
  – Updates context data for the next block.
Encoding flow (slide diagram, de-duplicated):
• Init: configuration – stereo/mono, bps {8, 16, 24, 32}, pass count, etc.
• Read a buffer of 512 KB from the input file.
• Go over the buffer; take a block of 24,000 samples.
• If lossless & stereo: transform left channel & right channel to mid, diff (…more options).
• Perform the WavPack decorrelation algorithm on the buffer, repeated x pass count.
  – This is the 1st part of the WavPack algorithm.
• Calculate additional information for compression.
• Perform the compression bit by bit: count ones and zeros until a change occurs.
  – This is the 2nd part of the WavPack algorithm.
• Write the resulting buffer to the output; this is the compression stage.
• Finish.
• Context (global information) is passed down to each function.
• Each subset of bytes depends on an indeterminate subset of the previous bytes; this is why parallelizing the entire flow fails.
Testing Environment
• Hardware
  – Core i7 2.66 GHz CPU, Quad 6600 2.4 GHz.
  – 4 GB of RAM.
• Software
  – Windows XP/Vista.
  – Visual Studio 2008.
  – Intel VTune Toolkit.
  – Compiled with the Microsoft compiler.
• Tests are done on a 330 MB WAV file.
Original Implementation
• Single-threaded application:
  – Read from disk.
  – Encode.
  – Write to disk directly.
• Old MMX instructions are used.
• Processing a 330 MB WAV file takes about 30 seconds.
Optimizations: Parallel I/O/CPU
• General
  – Separate read, write and processing operations into several threads.
• Flow
  – Use the main thread to read the input file.
    • Create "jobs" and submit them into a work queue.
  – Use an additional thread to process the "jobs".
    • Output is redirected to memory instead of disk.
  – Another thread writes the processed output to the disk.
Optimizations: Parallel I/O/CPU – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 29 seconds.
  – Speedup is 1.026.
    • Relative to the original results.
• Conclusions
  – No significant improvement.
  – I/O operations take considerably less time than the block processing.
    • Reads are done long before the processing is done.
    • The writing thread is almost never busy.
Optimizations: Multi-Threaded Processing
• General
  – Obstacle: each block is dependent on the previously processed block.
    • Parallelizing the entire flow is impossible.
  – Multithread parts of the algorithm:
    • Locate parts of the code where the program spends most of the time.
    • Parallelize several functions in these parts.
• Implementation
  – Using a "Thread Pool".
  – Work is separated into left and right channels.
    • In each channel, each sample is dependent on the previous sample.
    • Can't use more than two threads.
  – Each thread uses a different memory area.
    • Results must be combined after the work is done.
Multi-threaded processing flow (slide diagram, de-duplicated):
• Processing thread (when lossless & stereo; …more options):
  – Fill two new "Thread Args" structures: one with the left channel data and one with the right.
  – Create a duplicate of each shared data structure to avoid cache conflicts.
  – Submit each work item to the "Thread Pool".
  – Wait on the "OnComplete" mutex.
  – Interleave the left & right channel data into one output buffer.
  – Write the resulting buffer to the output; this is the compression stage.
• Worker threads 1 and 2 (left channel / right channel), each:
  – Wait for work to arrive in the "Thread Pool" and start the work.
  – Perform the WavPack decorrelation algorithm on the buffer (x pass count).
  – Calculate additional information for compression.
  – Perform the compression bit by bit: count ones and zeros until a change occurs.
  – Return to the "Thread Pool".
Optimizations: Multi-Threaded Processing – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 25 seconds.
  – Speedup is 1.167.
    • Relative to the original results.
• Conclusions
  – About 17% of the running time is parallelized.
  – Total improvement: 0.17 × 30 s ≈ 5.1 s.
    • Due to overhead, the actual improvement is a little smaller.
Optimizations: Moving to SIMD
• General
  – Locate mathematical calculations and loops.
    • Where the program spends most of the time.
  – Use 128-bit-wide instructions.
  – Convert four 32-bit operations into one 128-bit operation.
    • Theoretically, performance can be 4x faster.
    • In practice, there is overhead (load, store).
• Implementation
  – Re-factor the code as a basis for adding SIMD operations.
  – Loop unrolling.
    • Make sure to complete the "leftovers" of the loop.
  – Re-implement using SIMD code.
Optimizations: Moving to SIMD – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 28 seconds.
  – Speedup is 1.043.
    • Relative to the original results.
• Conclusions
  – Mathematical calculations can mostly be done with SSE2 and SSE3.
  – SSE4 instructions were not useful for this application.
  – The improvement alone isn't significant.
    • More significant when combined with the multi-threading optimization.
Optimizations: Implementation Improvements
• General
  – We found several hot spots in the program that we couldn't improve using the previously mentioned methods.
    • Branch misprediction.
  – Re-implement in a more efficient way.
• Implementation
  – Focused on one main function.
    • Lots of branch mispredictions.
    • A 16-bit integer was used as the buffered output.
  – Removed most of the branch instructions.
  – Re-implemented the same logic with a 64-bit integer buffer.
    • Largest register size.
    • SIMD would require too much overhead.
Optimizations: Implementation Improvements – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 28 seconds.
  – Speedup is 1.06.
    • Relative to the original results.
• Conclusions
  – Branch instructions and branch mispredictions were reduced.
  – Improvement in performance: almost 2 seconds less.
  – The implementation is centered in one method.
    • Easy to re-factor.
    • Requires no major architecture changes.
Summary
• The most significant optimization was multithreading code sections.
  – 16% speedup.
• The least significant was the multithreaded I/O.
  – 2.6% speedup.
Summary – Cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 22 seconds.
  – Total speedup we achieved is 1.335.
    • The program runs faster by 33.5%.
Summary – Cont.
• Conclusions
  – Multithreading is something to be considered in the architectural stages of the application.
    • In this application, the performance improvement is not worth the development and maintenance effort.
  – SIMD optimizations should only be used in specific cases.
    • They make the code harder to use and understand.
  – Decreasing branch mispredictions and cache misses is a better way to improve performance.
    • Refactor only specific methods.
    • Easier to implement and usually simplifies the code.
    • Using VTune and similar analysis tools is a good practice.
  – Leveraging new CPU instructions should be the compiler's responsibility.
    • Developers shouldn't really need to do this job.
    • The code gets cluttered.
Sources
• WavPack official website – http://www.wavpack.com
• Intel® VTune™ Performance Analyzer
• SourceForge website – http://sourceforge.net/
• Software lab website – http://softlab.technion.ac.il/
• MSDN – http://msdn.microsoft.com
• Wikipedia – http://en.wikipedia.org/wiki/
• Intel website – http://www.intel.com/