Faster expression evaluation on Protobuf data
Oscar Moll, MIT DB Group
Work done as an intern at Google (Summer '16)
Protocol buffers = data definition language + serialization spec + compiler + libraries
// Protobuf Data Definition
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

// example C++ user code
// Person is implemented as a combo of
// generated C++ and shared base classes
Person john;
if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
Protocol buffer backward compatibility - Adding fields

// Protobuf Data Definition
// with new field added
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  int32 new_field = 4;
}

// same old C++ user code
// using the stale data definition
// still works correctly
Person john;
if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
Protocol buffer backward compatibility - Deprecating fields

// Protobuf Data Definition removes
// some old field (email).
message Person {
  string name = 1;
  int32 id = 2;
  int32 new_field = 4;
}

// byte overhead is dependent only on fields actually present

// same old C++ user code
// using the stale data definition
// still works correctly,
// even though email is missing
Person john;
if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
Expression evaluation on Protobuf data

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}
message NestedProto {
  int32 bx = 4;
  ...
}

Let foo be a TestProto.
We want to eval: {"foo.x > foo.bar.bx", "foo.y > foo.x"};
E.g. "x:1 bar { bx:10 } y:20" => {1 > 10, 20 > 1} => {false, true}
TestProto::ParseFromString example (and baseline)

test::TestProto f;
auto b = f.ParseFromString(data);
if (!b) {
  e.Invalid("parse failed");
  return;
}
output_buffer[0] = f.x() > f.bar().bx();
output_buffer[1] = f.y() > f.x();

Type information received at runtime.
Expression list also received at runtime.

ParseFromString:
❏ work proportional to every (nested) value in the protobuf
❏ also limited by the interface (e.g. error checking)
❏ allocates memory dynamically for every variable-sized value
message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}
message NestedProto {
  int32 bx = 4;
  ...
}

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

We would prefer to access only what we need from the serialized buffer.
Which fields are needed is expression-dependent.
We can do this as follows...
The protobuf wireformat is a concatenation of binary tag-value pairs

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}
message NestedProto {
  int32 bx = 4;
  ...
}

// example data: x:1 bar { bx:10 } y:20
// wire layout: [tag][1] [tag][<len>] [tag][10] [tag][20]

Last binding wins: we must check every tag at the top level.
Implementing a general protobuf wireformat scanner

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    (field_label, type) = ParseTag(input, end);
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (type) {
          // how to parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (type) {
          // how to skip varints, nested messages ...
        }
      // Other actions (like counting, summing ...)
    }
  }
}
message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
  // padding
  repeated NestedProto pad = 8;
}
message NestedProto {
  optional int32 bx = 4;
  ...
}

Action table:
(x)   4 => store
(bar) 6 => recurse
(y)   7 => store
(pad) 8 => skip

{"foo.x > foo.bar.bx", "foo.y > foo.x"};
Fast expression evaluation via Scanning

# padding elts   ParseFromString* (ns)   per pad elt   Scanner (ns)   per pad elt   Speedup
             0                      72             -             33             -       2.2
             8                     163          11.4             53           2.5       3.1
            64                     535           7.2            187           2.4       2.9
           512                    2578           4.9           1197           2.3       2.2

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // vary amount of padding
  // which we need to (partially) parse to skip
  repeated NestedProto pad = 8;
}

CPU: Intel Sandybridge (2x8 cores) 2600 MHz, dL1: 32KB, L2: 256KB, L3: 20MB. Measuring 1/throughput.
*protoc set to optimize for speed.

{"foo.x > foo.bar.bx", "foo.y > foo.x"};
Generic scanner efficiency. How much faster is still useful?

Perf stat for the fixed32 microbenchmark:
● 4.48 instructions per cycle (i.e. high utilization of the CPU)
● 30% of all instructions are branches (i.e. likely not work-efficient)
● 0.03% mispredicted (i.e. the checks are predictable)
● At 2 GB/s: pretty good if the data comes from a 1-10 Gbit ethernet. But:
○ far from per-core memory bandwidth (~7-12 GB/s)
○ a single core would have lower throughput than a single PCIe SSD (4 GB/s)
○ a single core would have lower throughput than a 40 Gbit ethernet link

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // using fixed32 helps measure overhead
  repeated NestedProto→fixed32 pad = 8;
}
The Protobuf wireformat is a concatenation of (compressed tag, compressed value) pairs

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
}
message NestedProto {
  optional int32 bx = 4;
  ...
}

Tag = Varint(field_label << 3 | type), followed by Encoded(Value)

// example data: x:1 bar { bx:10 } y:20
// e.g. bx:10 is encoded as Varint(4 << 3 | 0) = 32, then Varint(10)

Varint encoding maps smaller integers to fewer bytes, at 1 bit/byte of metadata overhead:
  1 → 0b 0000 0001  // first byte intact
128 → 0b 1000 0000  // first 7 bits in first byte
      0b 0000 0001  // 8th bit in second byte
Overheads of generic scanning

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    tag = Varint::Parse(input, end);
    wiretype = tag & 0b111;
    field_label = tag >> 3;
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (wiretype) {
          // parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (wiretype) {
          // Amount of skip work varies per type
          len = Varint::Parse(input, end);
          input += len; // actual work
        }
    }
  }
}
// kTagX = encode(4,0) = (4 << 3 | 0) = 32
if (input < end && *input == kTagX) {
  // parse this one and save it
}
// kTagBar = encode(6,2) = (6 << 3 | 2) = 50
if (input < end && *input == kTagBar) {
  // recurse to get bar.bx
}
// kTagY = encode(7,0) = (7 << 3 | 0) = 56
if (input < end && *input == kTagY) {
  // parse this one and save it
}
// kTagPad = encode(8,2) = (8 << 3 | 2) = 66
while (input < end && *input == kTagPad) {
  // ParseVarint len
  // skip
}
if (input == end) { // yay
  return ErrorCode::kOk;
}
Removing overhead via codegen

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  repeated NestedProto pad = 8;
}

// expected wire layout: x, bar { <len> ... }, y, pad

if (input < end && *input == kTagX) {
  // parse this one and save it
}
if (input < end && *input == kTagBar) {}
if (input < end && *input == kTagY) {}
while (input < end && *input == kTagPad) {}
if (input == end) {
} else {
  return GeneralProtobufScan(input, end, actions);
}
Handling unexpected inputs

message TestProto {
  optional int32 x = 4;
  optional int32 old = 5;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  repeated NestedProto pad = 8;
}

// wire layout now carries the unexpected field "old": x, bar { <len> ... }, y, old, pad
Faster expression evaluation on protobuf via LLVM codegen

# padding elts   Scanner (ns)   per pad elt   Codegen (ns)   per pad elt   Speedup (over scanner)
             0             33             -             17             -                      1.9
             8             53           2.5             22          0.63                      2.4
            64            187           2.4             61          0.69                      3.1
           512           1197           2.3            342          0.63                      3.5

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // varying padding
  repeated NestedProto pad = 8;
}

{"foo.x > foo.bar.bx", "foo.y > foo.x"};
Updated stats for fixed32 benchmark

The codegen scanner is no longer far off per-core memory bandwidth for this microbenchmark.

                          scanner    codegen
instructions per cycle    4.48 ipc   2.68 ipc
branches/instructions     30%        30%
branch misprediction      0.03%      0.06%
exercised bandwidth       2 GB/s     7 GB/s
● Example of applying LLVM to a problem outside compilers
● In retrospect, some hallmarks of a good candidate problem for JITting:
○ Highly predictable paths, given that we know some more runtime information
○ E.g. mostly legal inputs, with stronger properties than guaranteed by the spec
○ Dealing correctly with input unpredictability by leveraging a fallback path
● It converts runtime tag decoding into a static number of compares (plus compile-time encoding)
● It inlines handlers
● It coalesces some length checks
○ It achieves much lower latency and CPU cost than the C++ protobuf library or the standalone scanner (7x and 3.5x respectively) in our microbenchmarks
Summary
My takeaways on using LLVM codegen as an outsider to compilers.
● Well-documented instruction set and tutorial (Kaleidoscope)
● Builder API is very helpful
● Type checks on LLVM code are very useful
● Some hiccups with using some LLVM instructions (coming from C)
○ GEP, alloca (vs. scoped local variable), i1 vs. boolean (True < False in signed i1)
● Generating debug information requires learning a different API (didn't do it)
● Tracking signed vs. unsigned integers becomes my problem
● Can link to C code in the host process, but it is not obvious how to inline useful pre-existing routines written in C
LLVM JIT library takeaways
● JIT latency overhead: for these microbenchmarks, ~3ms for plain codegen + mem2reg, up to ~15ms with extra passes such as function inlining and simplify-cfg
○ At 15ms, the JIT would be the bottleneck for data below 100MB
● Hard for a beginner to know which passes should be run, and in what order (multiple times?)
● Can be hard to intuit when passes fail to optimize critical points (e.g. mem2reg: iteration variable not in a register)
● Very tempting to make the codegen phase generate inlined code where it matters (but harder to debug and maintain)
Would it be easier to use C as a JIT initially?
● Often found myself making the generated LLVM code catch up to a reference C code.
● Could still “JIT“ compile it in process by using llvm + libclang● Gets you debug symbols for free (does it?)● Gets you signed vs. unsigned, implicit conversions between integer widths.● Can continue using existing inlined header-only utilities (eg. ParseVarint32)● Learn LLVM best practices from Clang.● Is there an easy to use builder class for C code in Clang?