Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Environment (MARE)
-
Upload
qualcomm-developer-network -
Category
Technology
-
view
472 -
download
2
description
Transcript of Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Environment (MARE)
Power-Efficient Programming Using Qualcomm® Multicore Asynchronous Runtime Environment
Pablo Montesinos, Qualcomm Technologies, Inc.
2
How can we write native parallel applications that use
all available cores?
3
How can we easily write native parallel applications that use
all available cores?
4
How can we easily write native parallel applications that use
all available cores in a battery-powered device?
5
How can we easily write native parallel applications that use
all the available compute units in a battery-powered device?
6
Qualcomm Multicore Asynchronous
Runtime Environment
MARE
7
What is Qualcomm MARE? The effective solution for efficient mobile computing
Qualcomm MARE is a programming model and a runtime system that provides simple yet powerful abstractions and building blocks for writing concurrent, power-efficient software − Simple C++ API allows developers to express concurrency
− Enables heterogeneous execution to fully utilize a mobile SoC
− User-level library that runs on any Android device
Current version is v0.11, available at http://developers.qualcomm.com/mare
8
MARE Workflow Understand your algorithms and focus on the application logic, not on the hardware
1. Map your application logic to predefined MARE building blocks (patterns)
2. If the current MARE patterns do not capture part of your application: − Use MARE tasks and groups
− If your algorithm can be generalized as a pattern, let us know so that we can add it to our pattern collection
3. Link your app with Qualcomm MARE Runtime − Runtime schedules code in all available compute units
9
MARE Patterns A pattern is a commonly occurring combination of task relationships and data accesses
Pattern Name Description
mare::pfor_each Processes the elements of a collection in parallel
mare::pscan Performs and in-place parallel prefix operation for all elements of a collection
mare::ptransform Performs a map operation on all elements of a collection, returns a new collection
mare::pdivide_and_conquer Divides problem into subproblems, solves them, and merges their solutions in parallel
mare::pipeline A sequence of processing stages that can execute concurrently on a data stream
10
MARE Patterns A pattern is a commonly occurring combination of task distribution and data access
Pattern Name Description
mare::pfor_each Processes the elements of a collection in parallel
mare::pscan Performs and in-place parallel prefix operation for all elements of a collection
mare::ptransform Performs a map operation on all elements of a collection, returns a new collection
mare::pdivide_and_conquer Divides problem into subproblems, solves them, and merges their solutions in parallel
mare::pipeline A sequence of processing stages that can execute concurrently on a data stream
11
MARE pfor_each Pattern Boost the performance of your application by changing just one line
Exploit data parallelism in loops
Easy replacement for traditional “for” loops
MARE automatically splits the iteration space based on dynamic system load
12
Example: Vector Addition Advanced algorithms split iterations across cores for best performance and power
void foo(vector const& a, vector const& b, vector &c) { for(size_t i = 0; i < b.size(); ++i ) { c[i] = alpha * a[i] + b[i]; }); }
void foo(vector const& a, vector const& b, vector &c) { mare::pfor_each(0, b.size(), [&](size_t i) { c[i] = alpha * a[i] + b[i]; }); }
13
MARE Pipeline Pattern Use the pipeline pattern in streaming applications
A pipeline is a linear, unidirectional chain of stages (no feedback loops allowed) − A stage is a function that is executed repeatedly over a stream of data.
− Each stage iteration consumes the output of the previous and produces an output
Two types of stages: − Serial stages execute their iterations sequentially
− Parallel stages execute their iterations concurrently
14
MARE Pipeline Features Useful for computational photography algorithms
Programmers can specify the following parameters for each stage: − Iteration lag: minimum number of iterations that a stage runs ahead of its successor
− Degree of concurrency: number of consecutive stage iterations that can run in parallel
− Iteration rate: rate of iterations between two consecutive stages
Use these parameters to size the sliding window − The sliding window is a fixed-size buffer between stages
− Limits memory usage and improves locality
15
MARE Pipeline Example A pipeline that ages, blurs and scales and image
Read image row
Read image row Read image row Read image row Age and blur Scale and Save
16
MARE Pipeline Example Define stage functions using lambda expressions, function pointers or callable objects
// Create pipeline with pipeline instance specific data of type File* mare::pipeline<FileInfo*> pipe; // Add a serial stage pipe.add_stage(mare::serial_stage(), read_image_row); // Add a parallel stage, (lag = 2, doc = 4) to age and blur image pipe.add_stage(mare::parallel_stage(4), mare::iteration_lag(2), age_and_blur_row); // Add a serial stage to scale and save the image, runs 2x more iterations than previous stage pipe.add_stage(mare::serial_stage(), mare::iteration_rate(1, 2), scale_and_save_row); // Launch pipeline and process all rows pipe.launch_and_wait(&finfo, finfo.get_source_height());
17
MARE API in a Nutshell Enable expression of parallelism for large classes of applications
Two intuitive concepts: − Tasks are units of work that can be asynchronously executed
− Groups are sets of tasks that can be canceled or waited on
And a simple but powerful API: − Create CPU/GPU tasks and groups
− Setup dependencies between tasks
− Add tasks to one or more groups
− Launch tasks
− Cancel tasks and groups
− Wait for tasks and groups
− Finish after tasks and groups
18
Hello World! #include <stdio.h> #include <mare/mare.h> int main() { mare::runtime::init(); // Initialize MARE runtime auto hello = mare::create_task([]{ printf("Hello ”); }); // Create task that prints “Hello ” auto world = mare::create_task([]{ printf(“World!\n”); }); // Create task that prints “World!” hello >> world; // Ensure that “World!” Prints after “Hello ” mare::launch(hello); // Launch hello task mare::launch(world); // Launch world task mare::wait_for(world); // Wait for world to complete mare::runtime::shutdown(); // Shutdown the MARE runtime and exit return 0; }
19
MARE GPU Compute Seamless integration of CPU and GPU execution
Create GPU tasks by using OpenCL kernels as tasks bodies − The MARE Runtime dispatches them to the graphics driver
− MARE takes care of OpenCL boiler plate code
Automatic data movement between CPU and GPU based on usage patterns − Use mare::buffer to let the runtime manage data across devices
− Programmer can also explicitly synchronize storage between CPU and GPU
Adding GPU patterns to patterns library, current version supports mare::pfor_each
Both GPU and CPU tasks use the same API − MARE runtime manages the dependencies between all types of tasks
20
Task Cancelation Discard unwanted tasks by canceling them
Use mare::cancel(task) to cancel a task
What does it mean? − If the task hasn’t started running, it will never run. All its successors will get canceled too
− If the task has already finished, cancelation means nothing
− If the task is running, it’s up to the programmer to decide what to do.
Successfully canceling a task causes its successors to also get canceled − We call this cancelation propagation
21
Easy Non-blocking Parallelization in MARE Unleash asynchrony throughout your application using mare::finish_after
mare::wait_for is an easy way to create dynamic dependencies between tasks, however: − It might block − If you have many outstanding mare::wait_for in your app, performance may suffer
mare::finish_after allows the creation of dynamic dependencies without blocking − Enables high-performance continuation-passing-style parallelization
Many algorithms will benefit from its use, for example divide and conquer.
22
Advantages of Using Qualcomm MARE
Simple Productive Efficient
Tasks are a natural way to express parallelism Familiar C++ programming Uniform multithreading and heterogeneous programming
Focus on application logic, not on thread management
Task mapping and dependencies allow the MARE runtime to make intelligent scheduling decisions, optimizing both power and performance. Parallel and heterogeneous execution improves power and thermal efficiency.
23
Advantages of Using Qualcomm MARE
Simple Productive Efficient
Tasks are a natural way to express parallelism Familiar C++ programming Uniform multithreading and heterogeneous programming
Focus on application logic, not on thread management
Task mapping and dependencies allow the MARE runtime to make intelligent scheduling decisions, optimizing both power and performance. Parallel and heterogeneous execution improves power and thermal efficiency.
24
Achieve Power Efficiency Using MARE Runtime uses a holistic view of the application structure to make better scheduling decisions
Operating systems see applications as unstructured streams of instructions
Power and thermal management are therefore reactive
MARE uses a proactive approach that saves energy, reduces peak power, and avoids thermal throttling: − MARE makes scheduling decisions based on the task graph and the state of the system
− MARE provides APIs so that programmer can help runtime make these decisions
25
MARE Power API Only available in Qualcomm Snapdragon
Static power management: − User chooses amongst four predefined power modes: normal, efficient, perf_burst and saver. − mare::power::request_mode(mode, duration)
Dynamic Power Management: − Tries to minimize energy consumption preserving user-defined Quality of Service
− Works well with “main loop” based applications (games, streams, …) − mare::power:set_goal(desired, tolerance) // Before the main loop − mare::regulate(measured) // Within the main loop − mare::power::clear_goal() // After the main loop
26
Proactively Saving Energy and Lowering Temperature Parallelism is key for managing power/thermals issues
Increasing number of cores and lowering the frequency allows us to get the same performance with lower energy
27
Proactively Saving Energy and Lowering Temperature Parallelism is key for managing power/thermals issues
Increasing number of cores and lowering the frequency allows us to get the same performance with lower energy
Cor
e Po
wer
(mW
)
Cor
e Te
mpe
ratu
re (°
C)
1 core @ 2.1 GHz 1 core @ 2.1 GHz MARE
4 cores @ 1 GHz MARE
4 cores @ 1 GHz
Time Time
10C
5s 2.8s
28
Proactively Saving Energy and Lowering Temperature Parallelism is key for managing power/thermals issues
Increasing number of cores and lowering the frequency allows us to get the same performance with lower energy
Cor
e Po
wer
(mW
)
Cor
e Te
mpe
ratu
re (°
C)
1 core @ 2.1 GHz 1 core @ 2.1 GHz MARE
4 cores @ 1 GHz MARE
4 cores @ 1 GHz
Time Time
10oC
5s 2.8s
29
Case Study Partner Seth Bernsen, GM, Thundersoft America.
30
Qualcomm MARE The effective solution for efficient mobile computing
MARE patterns are an easy and powerful way to make your algorithms parallel
MARE tasks and groups abstractions enable the parallelization of irregular algorithms
MARE’s heterogeneous runtime allows you to exploit the whole SoC, not just the CPU − MARE v0.11 supports CPU and GPU
− DSP is on the works, stay tuned
MARE’s advanced power and thermal management uses programmer input to proactively save energy, reduce peak power, and avoid thermal throttling
Current version is v0.11, and it’s available at http://developers.qualcomm.com/mare
31
For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners
Thank you
FOLLOW US ON:
32
Canceling a Running Task Cooperative cancelation enables cancelation of running tasks
MARE won’t kill tasks that are canceled while executing
Running tasks may check whether they have been canceled
mare::abort_on_cancel() checks whether the task or any of the groups the task belongs to have been canceled. If so: − It does not return to the task
− MARE propagates the cancelation to its successors
− MARE chooses a new task to execute
33
Removes unwanted shaky motion from videos.
Complex process, with several stages: − Estimate the global inter-fame motion vectors
− Smooth the vectors
− Compute the transformation matrix
− Use the matrix to warp and stabilize frames.
Electronic Image Stabilization (EIS)
34
Pipeline Example - Backup Slide Define stage functions using lambda expressions, function pointers or callable objects using context = mare::pipeline<File*>::context; auto stage0 = [context& ctx] { File* input_file = ctx.get_data(); return do_something0(ctx.get_iter_id(), input_file); }; using st_input = mare::stage_input<size_t>; auto stage1 = [context& ctx, st_input& in] { return do_something1(ctx.get_iter_id(), in[0]); }; using db_input = mare::stage_input<double>; auto stage2_body = [context& ctx, db_input& in]->char{ return do_something2(ctx.get_iter_id(), in[1]); };