Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Environment (MARE)


The need to support both today’s multicore performance and tomorrow’s heterogeneous computing has become increasingly important. Qualcomm® Multicore Asynchronous Runtime Environment (MARE) provides powerful and easy-to-use abstractions to write parallel software. This session will provide a deep dive into the concepts of power-efficient programming and how to use Qualcomm MARE APIs to get energy and thermal benefits for Android apps. Qualcomm Multicore Asynchronous Runtime Environment is a product of Qualcomm Technologies, Inc. Learn more about Qualcomm Multicore Asynchronous Runtime Environment: https://developer.qualcomm.com/MARE Watch this presentation on YouTube: https://www.youtube.com/watch?v=RI8yXhBb8Hg

Transcript of Power-Efficient Programming Using Qualcomm Multicore Asynchronous Runtime Environment (MARE)

Power-Efficient Programming Using Qualcomm® Multicore Asynchronous Runtime Environment

Pablo Montesinos, Qualcomm Technologies, Inc.

2

How can we write native parallel applications that use all available cores?

3

How can we easily write native parallel applications that use all available cores?

4

How can we easily write native parallel applications that use all available cores in a battery-powered device?

5

How can we easily write native parallel applications that use all the available compute units in a battery-powered device?

6

Qualcomm Multicore Asynchronous Runtime Environment (MARE)

7

What is Qualcomm MARE? The effective solution for efficient mobile computing

Qualcomm MARE is a programming model and a runtime system that provides simple yet powerful abstractions and building blocks for writing concurrent, power-efficient software:
− Simple C++ API allows developers to express concurrency

− Enables heterogeneous execution to fully utilize a mobile SoC

− User-level library that runs on any Android device

Current version is v0.11, available at http://developers.qualcomm.com/mare

8

MARE Workflow Understand your algorithms and focus on the application logic, not on the hardware

1. Map your application logic to predefined MARE building blocks (patterns)

2. If the current MARE patterns do not capture part of your application:
− Use MARE tasks and groups

− If your algorithm can be generalized as a pattern, let us know so that we can add it to our pattern collection

3. Link your app with the Qualcomm MARE Runtime:
− Runtime schedules code in all available compute units

9

MARE Patterns A pattern is a commonly occurring combination of task relationships and data accesses

Pattern Name and Description:

mare::pfor_each: Processes the elements of a collection in parallel

mare::pscan: Performs an in-place parallel prefix operation on all elements of a collection

mare::ptransform: Performs a map operation on all elements of a collection and returns a new collection

mare::pdivide_and_conquer: Divides a problem into subproblems, solves them, and merges their solutions in parallel

mare::pipeline: A sequence of processing stages that can execute concurrently on a data stream


11

MARE pfor_each Pattern Boost the performance of your application by changing just one line

Exploit data parallelism in loops

Easy replacement for traditional “for” loops

MARE automatically splits the iteration space based on dynamic system load

12

Example: Vector Addition Advanced algorithms split iterations across cores for best performance and power

// alpha is not defined on the slide; an illustrative constant is used here
static const float alpha = 2.0f;

// Before: traditional sequential loop
void foo(std::vector<float> const& a, std::vector<float> const& b, std::vector<float>& c) {
  for (size_t i = 0; i < b.size(); ++i) {
    c[i] = alpha * a[i] + b[i];
  }
}

// After: the same loop parallelized with mare::pfor_each; only one line changes
void foo(std::vector<float> const& a, std::vector<float> const& b, std::vector<float>& c) {
  mare::pfor_each(0, b.size(), [&](size_t i) {
    c[i] = alpha * a[i] + b[i];
  });
}

13

MARE Pipeline Pattern Use the pipeline pattern in streaming applications

A pipeline is a linear, unidirectional chain of stages (no feedback loops allowed):
− A stage is a function that is executed repeatedly over a stream of data

− Each stage iteration consumes the output of the previous stage and produces an output for the next

Two types of stages:
− Serial stages execute their iterations sequentially

− Parallel stages execute their iterations concurrently

14

MARE Pipeline Features Useful for computational photography algorithms

Programmers can specify the following parameters for each stage:
− Iteration lag: minimum number of iterations that a stage runs ahead of its successor

− Degree of concurrency: number of consecutive stage iterations that can run in parallel

− Iteration rate: rate of iterations between two consecutive stages

Use these parameters to size the sliding window:
− The sliding window is a fixed-size buffer between stages

− Limits memory usage and improves locality

15

MARE Pipeline Example A pipeline that ages, blurs, and scales an image

[Pipeline diagram: Read image row → Age and blur → Scale and save]

16

MARE Pipeline Example Define stage functions using lambda expressions, function pointers or callable objects

// Create a pipeline with pipeline-instance-specific data of type FileInfo*
mare::pipeline<FileInfo*> pipe;

// Add a serial stage that reads one image row per iteration
pipe.add_stage(mare::serial_stage(), read_image_row);

// Add a parallel stage (iteration lag = 2, degree of concurrency = 4) to age and blur the image
pipe.add_stage(mare::parallel_stage(4), mare::iteration_lag(2), age_and_blur_row);

// Add a serial stage to scale and save the image; it runs 2x more iterations than the previous stage
pipe.add_stage(mare::serial_stage(), mare::iteration_rate(1, 2), scale_and_save_row);

// Launch the pipeline and process all rows
pipe.launch_and_wait(&finfo, finfo.get_source_height());

17

MARE API in a Nutshell Enable expression of parallelism for large classes of applications

Two intuitive concepts:
− Tasks are units of work that can be asynchronously executed

− Groups are sets of tasks that can be canceled or waited on

And a simple but powerful API:
− Create CPU/GPU tasks and groups

− Set up dependencies between tasks

− Add tasks to one or more groups

− Launch tasks

− Cancel tasks and groups

− Wait for tasks and groups

− Finish after tasks and groups
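As a rough illustration of how tasks and groups fit together, here is a minimal sketch that launches several tasks into a group and then waits on (or cancels) the whole group. The group helpers shown (mare::create_group and the launch-into-group overload) are assumptions based on the operations listed above, not verified signatures; check them against the MARE headers for your version.

#include <mare/mare.h>

int main() {
  mare::runtime::init();
  // Assumed helper: create a named group that collects related tasks
  auto g = mare::create_group("image_filters");
  for (int i = 0; i < 8; ++i) {
    // Assumed overload: create and launch a task directly into the group
    mare::launch(g, [i]{ /* process tile i */ });
  }
  // Wait for every task in the group to finish...
  mare::wait_for(g);
  // ...or, if the results are no longer needed, cancel the whole group instead:
  // mare::cancel(g);
  mare::runtime::shutdown();
  return 0;
}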

18

Hello World!

#include <stdio.h>
#include <mare/mare.h>

int main() {
  mare::runtime::init();                                     // Initialize the MARE runtime
  auto hello = mare::create_task([]{ printf("Hello "); });   // Create task that prints "Hello "
  auto world = mare::create_task([]{ printf("World!\n"); }); // Create task that prints "World!"
  hello >> world;                                            // Ensure that "World!" prints after "Hello "
  mare::launch(hello);                                       // Launch hello task
  mare::launch(world);                                       // Launch world task
  mare::wait_for(world);                                     // Wait for world to complete
  mare::runtime::shutdown();                                 // Shut down the MARE runtime and exit
  return 0;
}

19

MARE GPU Compute Seamless integration of CPU and GPU execution

Create GPU tasks by using OpenCL kernels as task bodies:
− The MARE Runtime dispatches them to the graphics driver

− MARE takes care of OpenCL boilerplate code

Automatic data movement between CPU and GPU based on usage patterns:
− Use mare::buffer to let the runtime manage data across devices

− Programmer can also explicitly synchronize storage between CPU and GPU

GPU patterns are being added to the pattern library; the current version supports mare::pfor_each

Both GPU and CPU tasks use the same API:
− MARE runtime manages the dependencies between all types of tasks

20

Task Cancelation Discard unwanted tasks by canceling them

Use mare::cancel(task) to cancel a task

What does it mean?
− If the task hasn’t started running, it will never run. All its successors will get canceled too

− If the task has already finished, cancelation means nothing

− If the task is running, it’s up to the programmer to decide what to do.

Successfully canceling a task causes its successors to also get canceled:
− We call this cancelation propagation

21

Easy Non-blocking Parallelization in MARE Unleash asynchrony throughout your application using mare::finish_after

mare::wait_for is an easy way to create dynamic dependencies between tasks, however:
− It might block

− If you have many outstanding mare::wait_for calls in your app, performance may suffer

mare::finish_after allows the creation of dynamic dependencies without blocking:
− Enables high-performance continuation-passing-style parallelization

Many algorithms will benefit from its use, for example divide and conquer.
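To make the divide-and-conquer point concrete, here is a hedged sketch of a recursive parallel sum in which the enclosing task finishes after its join task without blocking. It relies only on calls shown elsewhere in this deck (mare::create_task, the >> dependency operator, mare::launch) plus mare::finish_after as described above; treat the exact finish_after signature as an assumption, and note that the function is meant to run inside a MARE task.

// Continuation-passing-style divide and conquer with mare::finish_after
void parallel_sum(const int* data, size_t n, long long* result) {
  if (n <= 1024) {                       // small problem: compute sequentially
    long long s = 0;
    for (size_t i = 0; i < n; ++i) s += data[i];
    *result = s;
    return;
  }
  auto left  = new long long(0);
  auto right = new long long(0);
  auto t_left  = mare::create_task([=]{ parallel_sum(data, n / 2, left); });
  auto t_right = mare::create_task([=]{ parallel_sum(data + n / 2, n - n / 2, right); });
  auto t_join  = mare::create_task([=]{ *result = *left + *right; delete left; delete right; });
  t_left >> t_join;                      // the join runs after both halves complete
  t_right >> t_join;
  mare::launch(t_left);
  mare::launch(t_right);
  mare::launch(t_join);
  mare::finish_after(t_join);            // current task logically completes after t_join, without blocking
}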

22

Advantages of Using Qualcomm MARE

Simple: Tasks are a natural way to express parallelism; familiar C++ programming.

Productive: Uniform multithreading and heterogeneous programming; focus on application logic, not on thread management.

Efficient: Task mapping and dependencies allow the MARE runtime to make intelligent scheduling decisions, optimizing both power and performance; parallel and heterogeneous execution improves power and thermal efficiency.


24

Achieve Power Efficiency Using MARE Runtime uses a holistic view of the application structure to make better scheduling decisions

Operating systems see applications as unstructured streams of instructions

Power and thermal management are therefore reactive

MARE uses a proactive approach that saves energy, reduces peak power, and avoids thermal throttling:
− MARE makes scheduling decisions based on the task graph and the state of the system

− MARE provides APIs so that the programmer can help the runtime make these decisions

25

MARE Power API Available only on Qualcomm Snapdragon processors

Static power management:
− User chooses amongst four predefined power modes: normal, efficient, perf_burst and saver

− mare::power::request_mode(mode, duration)

Dynamic power management:
− Tries to minimize energy consumption while preserving a user-defined Quality of Service

− Works well with “main loop” based applications (games, streams, …)

− mare::power::set_goal(desired, tolerance) // Before the main loop

− mare::regulate(measured) // Within the main loop

− mare::power::clear_goal() // After the main loop
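For illustration, here is a minimal sketch of wrapping a frame-based main loop with the dynamic power calls named above. The 33 ms goal and 10% tolerance are made-up values, app_is_running() and render_frame() are hypothetical helpers, and the regulate call is spelled exactly as on the slide; verify the namespaces against your MARE version.

#include <chrono>
#include <mare/mare.h>

bool app_is_running();   // hypothetical
void render_frame();     // hypothetical: one iteration of the app's work

void main_loop() {
  using clock = std::chrono::steady_clock;
  mare::power::set_goal(33.0, 3.3);            // before the main loop: target ~33 ms/frame, 10% tolerance
  while (app_is_running()) {
    auto start = clock::now();
    render_frame();
    double measured_ms =
        std::chrono::duration<double, std::milli>(clock::now() - start).count();
    mare::regulate(measured_ms);               // within the main loop: report the measured value
  }
  mare::power::clear_goal();                   // after the main loop
}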

26

Proactively Saving Energy and Lowering Temperature Parallelism is key for managing power/thermal issues

Increasing number of cores and lowering the frequency allows us to get the same performance with lower energy

27

Proactively Saving Energy and Lowering Temperature Parallelism is key for managing power/thermal issues

[Charts: Core Power (mW) and Core Temperature (°C) over time, comparing 1 core @ 2.1 GHz with MARE running 4 cores @ 1 GHz; annotations: 10 °C, 5 s, 2.8 s]

29

Case Study Partner Seth Bernsen, GM, Thundersoft America.

30

Qualcomm MARE The effective solution for efficient mobile computing

MARE patterns are an easy and powerful way to make your algorithms parallel

MARE tasks and groups abstractions enable the parallelization of irregular algorithms

MARE’s heterogeneous runtime allows you to exploit the whole SoC, not just the CPU:
− MARE v0.11 supports CPU and GPU

− DSP support is in the works, stay tuned

MARE’s advanced power and thermal management uses programmer input to proactively save energy, reduce peak power, and avoid thermal throttling

Current version is v0.11, and it’s available at http://developers.qualcomm.com/mare

31

For more information on Qualcomm, visit us at: www.qualcomm.com & www.qualcomm.com/blog

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries. Other products and brand names may be trademarks or registered trademarks of their respective owners.

Thank you


32

Canceling a Running Task Cooperative cancelation enables cancelation of running tasks

MARE won’t kill tasks that are canceled while executing

Running tasks may check whether they have been canceled

mare::abort_on_cancel() checks whether the task or any of the groups the task belongs to have been canceled. If so:
− It does not return to the task

− MARE propagates the cancelation to its successors

− MARE chooses a new task to execute
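For illustration, a hedged sketch of a long-running task that cooperates with cancelation by calling mare::abort_on_cancel() between chunks of work; process_chunk() and the chunk count are hypothetical placeholders.

// process_chunk() is a hypothetical helper that does one slice of the work
void process_chunk(size_t chunk);

auto worker = mare::create_task([]{
  const size_t num_chunks = 64;   // illustrative
  for (size_t chunk = 0; chunk < num_chunks; ++chunk) {
    // If this task (or any group it belongs to) has been canceled, abort_on_cancel()
    // does not return; MARE propagates the cancelation and schedules another task.
    mare::abort_on_cancel();
    process_chunk(chunk);
  }
});
mare::launch(worker);
// Later, from elsewhere in the app:
mare::cancel(worker);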

33

Electronic Image Stabilization (EIS) Removes unwanted shaky motion from videos

Complex process, with several stages:
− Estimate the global inter-frame motion vectors

− Smooth the vectors

− Compute the transformation matrix

− Use the matrix to warp and stabilize frames

34

Pipeline Example - Backup Slide Define stage functions using lambda expressions, function pointers or callable objects

using context = mare::pipeline<File*>::context;

// Stage 0 takes only the pipeline context
auto stage0 = [](context& ctx) {
  File* input_file = ctx.get_data();
  return do_something0(ctx.get_iter_id(), input_file);
};

// Stage 1 also receives the previous stage's output through a stage_input
using st_input = mare::stage_input<size_t>;
auto stage1 = [](context& ctx, st_input& in) {
  return do_something1(ctx.get_iter_id(), in[0]);
};

// Stage 2 declares an explicit char return type
using db_input = mare::stage_input<double>;
auto stage2_body = [](context& ctx, db_input& in) -> char {
  return do_something2(ctx.get_iter_id(), in[1]);
};