Posted: 21-Dec-2015
Software Performance Tuning Project – Final Presentation
Prepared By: Eyal Segal Koren Shoval
Advisors: Liat Atsmon Koby Gottlieb
WavPack – Description
• WavPack is an open-source audio compression format.
  – Allows lossless audio compression.
• Compresses WAV files to WV files.
  – Average compression ratio is 30–70%.
• Support for Windows and mobile devices.
  – Cowon A3 PMP, iRiver, iPod, Nokia phones, and more.
Project Goals
• Enhance the WavPack performance by:
  – Working and analyzing with the Intel® VTune™ Performance Analyzer.
  – Studying and applying instructions of Intel®'s new processors.
  – Implementing multi-threading techniques in order to achieve high performance.
• Return the source code to the community.
Algorithm Description
• Input file is processed in blocks of 512 KB.
  – A global context exists for all blocks.
  – Blocks are divided into sub-blocks.
    • 24,000 samples, equivalent to 0.5 second of WAV at CD quality.
  – Encodes each block and writes to output.
  – Updates context data for the next block.
Encoding flow (slide diagram, de-duplicated):
• Init: configuration – stereo/mono, bps {8, 16, 24, 32}, pass count, etc.
• Read a buffer of 512 KB from the input file.
• Go over the buffer; take a block of 24,000 samples.
• If lossless & stereo: transform left channel & right channel to mid, diff (…more options).
• Perform the WavPack decorrelation algorithm on the buffer, repeated x pass count.
  – This is the 1st part of the WavPack algorithm.
• Calculate additional information for compression.
• Perform the compression bit by bit: count ones and zeros until a change occurs.
  – This is the 2nd part of the WavPack algorithm.
• Write the resulting buffer to the output; this is the compression stage.
• Finish.
• Context (global information) is passed down to each function.
• Each subset of bytes depends on an indeterminate subset of the previous bytes; this is why parallelizing the entire flow fails.
Testing Environment
• Hardware
  – Core i7 2.66 GHz CPU, Quad 6600 2.4 GHz.
  – 4 GB of RAM.
• Software
  – Windows XP/Vista.
  – Visual Studio 2008.
  – Intel VTune Toolkit.
  – Compiled with the Microsoft compiler.
• Tests are done on a 330 MB WAV file.
Original Implementation
• Single-threaded application:
  – Read from disk.
  – Encode.
  – Write to disk directly.
• Old MMX instructions are used.
• Processing a 330 MB WAV file takes about 30 seconds.
Optimizations: Parallel I/O/CPU
• General
  – Separate read, write and processing operations into several threads.
• Flow
  – Use the main thread to read the input file.
    • Create "jobs" and submit them into a work queue.
  – Use an additional thread to process the "jobs".
    • Output is redirected to memory instead of disk.
  – Another thread writes the processed output to the disk.
Optimizations: Parallel I/O/CPU – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 29 seconds.
  – Speedup is 1.026.
    • Relative to the original results.
• Conclusions
  – No significant improvement.
  – I/O operations take considerably less time than the block processing.
    • Reads are done long before the processing is done.
    • The writing thread is almost never busy.
Optimizations: Multi-Threaded Processing
• General
  – Obstacle: each block is dependent on the previously processed block.
    • Parallelizing the entire flow is impossible.
  – Multithread parts of the algorithm:
    • Locate parts of the code where the program spends most of the time.
    • Parallelize several functions in these parts.
• Implementation
  – Using a "Thread Pool".
  – Work is separated into left and right channels.
    • In each channel, each sample is dependent on the previous sample.
    • Can't use more than two threads.
  – Each thread uses a different memory area.
    • Results must be combined after the work is done.
Multi-threaded processing flow (slide diagram, de-duplicated):
• Processing thread (when lossless & stereo; …more options):
  – Fill two new "Thread Args" structures: one with the left channel data and one with the right.
  – Create a duplicate of each shared data structure to avoid cache conflicts.
  – Submit each work item to the "Thread Pool".
  – Wait on the "OnComplete" mutex.
  – Interleave the left & right channel data into one output buffer.
  – Write the resulting buffer to the output; this is the compression stage.
• Worker threads 1 and 2 (left channel / right channel), each:
  – Wait for work to arrive in the "Thread Pool" and start the work.
  – Perform the WavPack decorrelation algorithm on the buffer (x pass count).
  – Calculate additional information for compression.
  – Perform the compression bit by bit: count ones and zeros until a change occurs.
  – Return to the "Thread Pool".
Optimizations: Multi-Threaded Processing – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 25 seconds.
  – Speedup is 1.167.
    • Relative to the original results.
• Conclusions
  – About 17% of the running time is parallelized.
  – Total improvement: 0.17 × 30 s ≈ 5.1 s.
    • Due to overhead, the actual improvement is a little smaller.
Optimizations: Moving to SIMD
• General
  – Locate mathematical calculations and loops.
    • Where the program spends most of the time.
  – Use 128-bit-wide instructions.
  – Convert four 32-bit operations into one 128-bit operation.
    • Theoretically, performance can be 4x faster.
    • In practice, there is overhead (load, store).
• Implementation
  – Re-factor the code as a basis for adding SIMD operations.
  – Loop unrolling.
    • Make sure to complete the "leftovers" of the loop.
  – Re-implement using SIMD code.
Optimizations: Moving to SIMD – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 28 seconds.
  – Speedup is 1.043.
    • Relative to the original results.
• Conclusions
  – Mathematical calculations can mostly be done with SSE2 and SSE3.
  – SSE4 instructions were not useful for this application.
  – The improvement alone isn't significant.
    • More significant when combined with the multi-threading optimization.
Optimizations: Implementation Improvements
• General
  – We found several hot spots in the program that we couldn't improve using the previously mentioned methods.
    • Branch misprediction.
  – Re-implement in a more efficient way.
• Implementation
  – Focused on one main function.
    • Lots of branch mispredictions.
    • A 16-bit integer was used as the buffered output.
  – Removed most of the branch instructions.
  – Re-implemented the same logic with a 64-bit integer buffer.
    • Largest register size.
    • SIMD would require too much overhead.
Optimizations: Implementation Improvements – cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 28 seconds.
  – Speedup is 1.06.
    • Relative to the original results.
• Conclusions
  – Branch instructions and branch mispredictions were reduced.
  – Improvement in performance: almost 2 seconds less.
  – The implementation is centered in one method.
    • Easy to re-factor.
    • Requires no major architecture changes.
Summary
• The most significant optimization was multithreading code sections.
  – 16% speedup.
• The least significant was the multithreaded I/O.
  – 2.6% speedup.
Summary – Cont.
• Benchmark
  – VTune analysis showed the following results.
  – Average running time is about 22 seconds.
  – Total speedup we achieved is 1.335.
    • The program runs faster by 33.5%.
Summary – Cont.
• Conclusions
  – Multithreading is something to be considered in the architectural stages of the application.
    • In this application, the performance improvement is not worth the development and maintenance effort.
  – SIMD optimizations should only be used in specific cases.
    • They make the code harder to use and understand.
  – Decreasing branch mispredictions and cache misses is a better way to improve performance.
    • Refactor only specific methods.
    • Easier to implement and usually simplifies the code.
    • Using VTune and similar analysis tools is a good practice.
  – Leveraging new CPU instructions should be the compiler's responsibility.
    • Developers shouldn't really need to do this job.
    • The code gets cluttered.
Sources
• WavPack official website – http://www.wavpack.com
• Intel® VTune™ Performance Analyzer
• SourceForge website – http://sourceforge.net/
• Software lab website – http://softlab.technion.ac.il/
• MSDN – http://msdn.microsoft.com
• Wikipedia – http://en.wikipedia.org/wiki/
• Intel website – http://www.intel.com/