Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

IMPROVING $PORT PERFORMANCE ON $ARCH: PLATFORM-BASED PERFORMANCE TUNING OF WEBKIT (PORT=QT ARCH=MIPS74KF). Embedded Linux Conference, April 29 to May 1, 2014. Adrián Pérez de Castro.

description

By Adrián Pérez

MIPS processor cores are widely used in embedded platforms, including TVs and set-top boxes. Most of those platforms have dedicated graphics hardware, but it is often specialized for audio and video signal processing, so web content has to be rendered in software. We implemented optimizations for the software-based QPainter renderer to improve the performance of Qt, including QtWebKit, on MIPS processors. The target platform was the modern 74Kf core, which includes SIMD instructions suitable for graphics operations (alpha blending, color space conversion and JPEG image decoding) as well as for non-graphics operations such as string functions. Our figures estimate that web pages render up to 30% faster using hand-coded assembler fast paths for those operations.

Transcript of Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

Page 1: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

IMPROVING $PORT PERFORMANCE ON $ARCH

PLATFORM-BASED PERFORMANCE TUNING OF WEBKIT (PORT=QT ARCH=MIPS74KF)

Embedded Linux Conference, April 29 to May 1, 2014

Adrián Pérez de Castro

Page 3: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

THE CHALLENGE

MAKE A QTWEBKIT-BASED BROWSER USEABLE ON LIMITED HARDWARE

MIPS 74Kf @ 500 MHz

RAM: 256 MB

No GPU

Page 4: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

MIPS74KF

“Classic” MIPS32

+FPU

+MMU

+DSP

Page 5: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

DSP?

No. Not really a DSP.

SIMD instructions suitable for signal processing.

Page 6: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

CAN WE USE THIS TO IMPROVE PERFORMANCE?

CHALLENGE ACCEPTED

Page 7: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

THE PLAN

PROFILE → OPTIMIZE → VALIDATE

Page 8: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

WHAT TO OPTIMIZE

Video/audio decoding.

Image operations.

Page 9: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

WHERE TO OPTIMIZE?

Can we improve the platform overall, not just WebKit?

Yes!

QtWebKit uses the Qt drawing functions.

A/V decoding uses GStreamer, which uses Orc.

Good candidates for SIMD code.

Page 10: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LIMITATIONS

No Valgrind.

No GDB.

No perf.

No performance counters.

↓

qemu + gdbserver.

gperftools.

CLOCK_PROCESS_CPUTIME_ID

Page 11: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

ROLL YOUR OWN TOOLS

(WITH HELP FROM EXISTING ONES)

Page 12: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

GNU HAMMER^WTIME!

# Use full path to avoid using the shell's time builtin
# One line per run: user time, system time, major page faults,
# exit status and the command itself
/usr/bin/time -a -o timings.txt \
    -f '%U %S %F %x %C' $COMMAND

# For example, measuring the qtdemux GStreamer component
/usr/bin/time -a -o timings.txt \
    -f '%U %S %F %x %C' gst-launch -q \
    filesrc location=file.mp4 ! qtdemux ! video/x-h264 ! fakesink

Page 13: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

TIMING

Beware of CLOCK_PROCESS_CPUTIME_ID's resolution!

#define CLOCK_MAX_RESOLUTION_DELTA (10000.0 * 1e-9)

bool usePosixClock() {
    static bool checked = false;
    static bool useposix;
    if (!checked) {
        if (posixClockAvailable()) {
            double res_theorical = posixClockTheoricalResolution();
            double res_empirical = posixClockEmpiricalResolution();
            useposix = fabs(res_theorical - res_empirical)
                       <= CLOCK_MAX_RESOLUTION_DELTA;
        } else {
            useposix = false;
        }
        checked = true;
    }
    return useposix;
}

clock.cc
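The check above compares an advertised and a measured resolution for CLOCK_PROCESS_CPUTIME_ID. As a rough sketch of how those two figures might be obtained on a POSIX system (the helper names below are hypothetical and are not taken from clock.cc), the advertised value can come from clock_getres() and the empirical one from spinning until the clock reading changes:

```cpp
#include <time.h>

// Hypothetical helpers for illustration; not the functions used in clock.cc.

// Resolution the system advertises for the per-process CPU-time clock,
// in seconds. Returns a negative value if the clock is not available.
double advertisedResolution() {
    timespec ts;
    if (clock_getres(CLOCK_PROCESS_CPUTIME_ID, &ts) != 0)
        return -1.0;
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Current per-process CPU time, in seconds.
static double cpuTimeNow() {
    timespec ts;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

// Smallest step actually observed between two different readings.
// The busy loop burns CPU time, so the reading eventually advances.
double empiricalResolution(int samples = 32) {
    double best = 1e9;
    for (int i = 0; i < samples; ++i) {
        const double start = cpuTimeNow();
        double now;
        do {
            now = cpuTimeNow();
        } while (now == start);
        if (now - start < best)
            best = now - start;
    }
    return best;
}
```

If the two values differ by more than a chosen tolerance, the advertised figure cannot be trusted and a different timing source is the safer choice.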

Page 14: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

WEBSNAP

% g++ -DMAIN -o clock clock.cc
% ./clock
CLOCK_PROCESS_CPUTIME_ID is supported
Resolution (advertised/empirical): 0.0000000010/0.0000002460s
Sampled resolution: 0.0000005470s
Printing the lines above took 0.0000483550s

% LD_PRELOAD=/usr/lib/libprofiler.so \
    ./websnap http://igalia.com 1000 pprof
Loading 100% Layout completed
Load successful
libprofile.so detected (0x7f77468e8f90, 0x7f77468e8fd0), output 'pprof'
Profiling started, code: 0x1, timeout: 0
PROFILE: interrupts/evictions/bytes = 634/537/22168
http://igalia.com 1000 6.2709987870s

% mkdir out && ./runtests 1000 < urls.txt

github.com/aperezdc/websnap

Page 15: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

...AND BEYOND

Ad-hoc Python/Bash scripts:

Fix library paths in profiler output.

Data munging.

Measurements comparison.

Generate CSV files.

Report generation.

Page 16: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

SOME RESULTS

(DETAILED)

Page 17: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LATIN-1→UTF16: V0

// "dst" array (uint16_t*)
// "src" array (uint8_t*)

while (len--)
    *dst++ = (uchar) *src++;

Page 18: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LATIN-1→UTF16: V1

; a0: "dst" array (uint16_t*)
; a1: "src" array (uint8_t*)
; a2: "len"

1:  lbu   t1, 0(a1)
    addiu a2, a2, -1       ; len--
    sh    t1, 0(a0)
    addiu a0, a0, 2        ; dst++
    bnez  a2, 1b
    addiu a1, a1, 1        ; src++ (runs in the branch delay slot)

Page 19: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LATIN-1→UTF16: V2

1:  lw    t1, (a1)         ; t1 = ABCD
    ;
    ; TODO: extract bytes from t1 to t2/t3, padding
    ; them with zeroes: t2 = 0A0B, t3 = 0C0D
    ;
    addiu a1, a1, 4        ; src += 4
    addiu a2, a2, -4       ; len -= 4
    sw    t2, 0(a0)
    sw    t3, 4(a0)
    bnez  a2, 1b
    addiu a0, a0, 8        ; dst += 4 (8 bytes)

Page 20: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LATIN-1→UTF16: V3

1:  lw    t1, (a1)         ; t1 = ABCD
    srl   t2, t1, 24       ; t2 = 000A
    sll   t2, t2, 16       ; t2 = 0A00
    sll   t1, t1, 8        ; t1 = BCD0
    srl   t4, t1, 24       ; t4 = 000B
    or    t2, t2, t4       ; t2 = 0A0B
    sll   t1, t1, 8        ; t1 = CD00
    srl   t1, t1, 16       ; t1 = 00CD
    andi  t4, t1, 0xFF00   ; t4 = 00C0
    sll   t4, t4, 8        ; t4 = 0C00
    andi  t1, t1, 0x00FF   ; t1 = 000D (without this, t3 would be 0CCD)
    or    t3, t1, t4       ; t3 = 0C0D
    addiu a1, a1, 4        ; src += 4
    addiu a2, a2, -4       ; len -= 4
    sw    t2, 0(a0)
    sw    t3, 4(a0)
    bnez  a2, 1b
    addiu a0, a0, 8        ; dst += 4 (8 bytes)
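The shift-and-mask unpacking can be easier to follow in portable code. The sketch below is an illustration only, with made-up function names, and is not the code that went into Qt. It builds the word the way a big-endian load would see it (A in the most significant byte), splits it into 0A0B and 0C0D, and adds a scalar tail so that lengths that are not a multiple of four are handled too, something the assembly loops on these slides do not show.

```cpp
#include <cstddef>
#include <cstdint>

// Split a word holding bytes A,B,C,D (A most significant) into
// hi = 0A0B and lo = 0C0D using only shifts and masks.
inline void unpackBytes(std::uint32_t abcd,
                        std::uint32_t& hi, std::uint32_t& lo) {
    hi = ((abcd >> 8) & 0x00FF0000u) | ((abcd >> 16) & 0x000000FFu);
    lo = ((abcd << 8) & 0x00FF0000u) | (abcd & 0x000000FFu);
}

// Latin-1 to UTF-16, four characters per iteration plus a scalar tail.
void latin1ToUtf16Unrolled(std::uint16_t* dst, const std::uint8_t* src,
                           std::size_t len) {
    while (len >= 4) {
        // Assemble the word as a big-endian load would, so the result
        // does not depend on the byte order of the host.
        const std::uint32_t word = (std::uint32_t(src[0]) << 24)
                                 | (std::uint32_t(src[1]) << 16)
                                 | (std::uint32_t(src[2]) << 8)
                                 |  std::uint32_t(src[3]);
        std::uint32_t hi, lo;
        unpackBytes(word, hi, lo);
        dst[0] = std::uint16_t(hi >> 16);  // 0A
        dst[1] = std::uint16_t(hi);        // 0B
        dst[2] = std::uint16_t(lo >> 16);  // 0C
        dst[3] = std::uint16_t(lo);        // 0D
        src += 4;
        dst += 4;
        len -= 4;
    }
    while (len--)  // remaining 0 to 3 characters
        *dst++ = *src++;
}
```

Zero-extension is sufficient here because every Latin-1 code point has the same numeric value in Unicode.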

Page 21: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LATIN-1→UTF16: V4

; DSP instructions can unpack bytes directly :-)

1:  lw    t1, (a1)         ; t1 = ABCD

    preceu.ph.qbl t2, t1   ; t2 = 0A0B
    preceu.ph.qbr t3, t1   ; t3 = 0C0D

    addiu a1, a1, 4        ; src += 4
    addiu a2, a2, -4       ; len -= 4
    sw    t2, 0(a0)
    sw    t3, 4(a0)
    bnez  a2, 1b
    addiu a0, a0, 8        ; dst += 4 (8 bytes)
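For readers without the MIPS DSP ASE manual at hand, here is a small portable model of what the two instructions compute. The C++ function names are invented for this sketch. Each instruction takes two bytes of the source register and zero-extends them into the two 16-bit halves of the result.

```cpp
#include <cstdint>

// Model of preceu.ph.qbl: zero-extend the two left (most significant)
// bytes of the source into the two halfwords of the result.
inline std::uint32_t preceu_ph_qbl(std::uint32_t rt) {
    return (((rt >> 24) & 0xFFu) << 16) | ((rt >> 16) & 0xFFu);
}

// Model of preceu.ph.qbr: the same for the two right (least
// significant) bytes.
inline std::uint32_t preceu_ph_qbr(std::uint32_t rt) {
    return (((rt >> 8) & 0xFFu) << 16) | (rt & 0xFFu);
}
```

With t1 = 0x41E97AFF, the model yields 0x004100E9 for the left pair and 0x007A00FF for the right pair, matching the 0A0B and 0C0D patterns in the slide.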

Page 22: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

LATIN-1 → UTF-16

Page 23: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

ALPHA BLENDING

Page 24: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

UTF-16 STRICMP()

Page 25: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

RESULTS

Speedup histogram

Page 26: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

UP TO 30% FASTER RENDERING

Thanks to:

Orc backend using MIPS DSP instructions

QImage composition operations

Color conversion (RGB16/888→ARGB32)

Alpha premultiplication and blending

String conversions and comparisons

Page 27: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

UPSTREAM STATUS

Orc backend complete upstream

Initial work based on Qt 4.8

Most of the code is already in Qt 5.2

Rest in the next release

No backport to Qt 4.8

Page 28: Improving Performance of a WebKit Port MIPS Platform (ELC 2014)

THANK YOU FOR YOUR ATTENTION

perezdecastro.org

+AdrianPerezDeCastro

@aperezdc