Faster expression evaluation on Protobuf data
Oscar Moll, MIT DB Group
Work done as an intern at Google (Summer '16)
Protocol buffers = data definition language + serialization spec + compiler + libraries
// Protobuf Data Definition
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
}

// example C++ user code
// Person is implemented as a combo of
// generated C++ and shared base classes
Person john;
if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
Protocol buffer backward compatibility - Adding fields

// Protobuf Data Definition
// with new field added
message Person {
  string name = 1;
  int32 id = 2;
  string email = 3;
  int32 new_field = 4;
}

// same old C++ user code
// using the stale data definition
// still works correctly
Person john;
if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
Protocol buffer backward compatibility - Deprecating fields

// Protobuf Data Definition removes
// some old field (email).
message Person {
  string name = 1;
  int32 id = 2;
  int32 new_field = 4;
}

// byte overhead is dependent only on fields actually present

// same old C++ user code
// using the stale data definition
// still works correctly,
// even though email is missing
Person john;
if (john.ParseFromString(data_buffer)) {
  id = john.id();
  name = john.name();
}
Expression evaluation on Protobuf data

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}
message NestedProto {
  int32 bx = 4;
  ...
}

Let foo be a TestProto.
We want to eval: {"foo.x > foo.bar.bx", "foo.y > foo.x"};
E.g. "x:1 bar { bx:10 } y:20" => {1 > 10, 20 > 1} => {false, true}
TestProto::ParseFromString example (and baseline)

test::TestProto f;
auto b = f.ParseFromString(data);
if (!b) {
  e.Invalid("parse failed");
  return;
}
output_buffer[0] = f.x() > f.bar().bx();
output_buffer[1] = f.y() > f.x();

Type information received at runtime.
Expression list also received at runtime.

ParseFromString:
❏ work proportional to every (nested) value in the protobuf
❏ also limited by the interface (e.g. error checking)
❏ allocates memory dynamically for every variable-sized value
message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}
message NestedProto {
  int32 bx = 4;
  ...
}

{"foo.x > foo.bar.bx", "foo.y > foo.x"};

We would prefer to access only what we need from the serialized buffer.
Which fields are needed is expression-dependent.
We can do this as follows...
The protobuf wireformat is a concatenation of binary tag-value pairs

message TestProto {
  int32 x = 4;
  NestedProto bar = 6;
  int32 y = 7;
  ...
}
message NestedProto {
  int32 bx = 4;
  ...
}

// example data: x:1 bar { bx:10 } y:20
// wire layout: [tag][1] [tag][<len>] [tag][10] [tag][20]

Last binding wins: we must check every tag at the top level.
Implementing a general protobuf wireformat scanner

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    (field_label, type) = ParseTag(input, end);
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (type) {
          // how to parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (type) {
          // how to skip varints, nested messages ...
        }
      // Other actions (like counting, summing ...)
    }
  }
}
message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
  // padding
  repeated NestedProto pad = 8;
}
message NestedProto {
  optional int32 bx = 4;
  ...
}

Action table:
(x)   4 => store
(bar) 6 => recurse
(y)   7 => store
(pad) 8 => skip

{"foo.x > foo.bar.bx", "foo.y > foo.x"};
Fast expression evaluation via Scanning

# padding elts   ParseFromString* (ns)   per pad elt   Scanner (ns)   per pad elt   Speedup
             0                      72             -             33             -       2.2
             8                     163          11.4             53           2.5       3.1
            64                     535           7.2            187           2.4       2.9
           512                    2578           4.9           1197           2.3       2.2

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // vary amount of padding
  // which we need to (partially) parse to skip
  repeated NestedProto pad = 8;
}

CPU: Intel Sandybridge (2x8 cores) 2600 MHz, dL1: 32KB, L2: 256KB, L3: 20MB. Measuring 1/throughput.
*protoc set to optimize for speed.

{"foo.x > foo.bar.bx", "foo.y > foo.x"};
Generic scanner efficiency. How much faster is still useful?

Perf stat for the fixed32 microbenchmark:
● 4.48 instructions per cycle (i.e. high utilization of the CPU)
● 30% of all instructions are branches (i.e. likely not work-efficient)
● 0.03% mispredicted (i.e. the checks are predictable)
● At 2 GB/s: pretty good if the data comes from a 1-10 Gbit ethernet. But:
○ far from per-core memory bandwidth (~7-12 GB/s)
○ a single core would have lower throughput than a single PCIe SSD (4 GB/s)
○ a single core would have lower throughput than a 40 Gbit ethernet link

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // using fixed32 helps measure overhead
  repeated NestedProto→fixed32 pad = 8;
}
The Protobuf wireformat is a concatenation of (compressed tag, compressed value) pairs

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  ...
}
message NestedProto {
  optional int32 bx = 4;
  ...
}

Tag = Varint(field_label << 3 | type), followed by Encoded(Value)

// example data: x:1 bar { bx:10 } y:20
// e.g. bx:10 is encoded as Varint(4 << 3 | 0) = 32, then Varint(10)

Varint encoding maps smaller integers to fewer bytes, at 1 bit/byte of metadata overhead:
  1 → 0b 0000 0001  // first byte intact
128 → 0b 1000 0000  // first 7 bits in first byte
      0b 0000 0001  // 8th bit in second byte
Overheads of generic scanning

void GeneralProtobufScan(input, end, action_table) {
  while (input < end) {
    tag = Varint::Parse(input, end);
    wiretype = tag & 0b111;
    field_label = tag >> 3;
    switch (action_table[field_label]) {
      case Action::kStore:
        switch (wiretype) {
          // parse varints, nested messages ...
        }
      case Action::kSkip:
        switch (wiretype) {
          // Amount of skip work varies per type
          len = Varint::Parse(input, end);
          input += len; // actual work
        }
    }
  }
}
// kTagX = encode(4,0) = (4 << 3 | 0) = 32
if (input < end && *input == kTagX) {
  // parse this one and save it
}
// kTagBar = encode(6,2) = (6 << 3 | 2) = 50
if (input < end && *input == kTagBar) {
  // recurse to get bar.bx
}
// kTagY = encode(7,0) = (7 << 3 | 0) = 56
if (input < end && *input == kTagY) {
  // parse this one and save it
}
// kTagPad = encode(8,2) = (8 << 3 | 2) = 66
while (input < end && *input == kTagPad) {
  // ParseVarint len
  // skip
}
if (input == end) { // yay
  return ErrorCode::kOk;
}
Removing overhead via codegen

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  repeated NestedProto pad = 8;
}

// expected wire layout: x, bar { <len> ... }, y, pad

if (input < end && *input == kTagX) {
  // parse this one and save it
}
if (input < end && *input == kTagBar) {}
if (input < end && *input == kTagY) {}
while (input < end && *input == kTagPad) {}
if (input == end) {
} else {
  return GeneralProtobufScan(input, end, actions);
}
Handling unexpected inputs

message TestProto {
  optional int32 x = 4;
  optional int32 old = 5;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  repeated NestedProto pad = 8;
}

// wire layout now carries the unexpected field "old": x, bar { <len> ... }, y, old, pad
Faster expression evaluation on protobuf via LLVM codegen

# padding elts   Scanner (ns)   per pad elt   Codegen (ns)   per pad elt   Speedup (over scanner)
             0             33             -             17             -                      1.9
             8             53           2.5             22          0.63                      2.4
            64            187           2.4             61          0.69                      3.1
           512           1197           2.3            342          0.63                      3.5

message TestProto {
  optional int32 x = 4;
  optional NestedProto bar = 6;
  optional int32 y = 7;
  // varying padding
  repeated NestedProto pad = 8;
}

{"foo.x > foo.bar.bx", "foo.y > foo.x"};
Updated stats for fixed32 benchmark

The codegen scanner is no longer far off per-core memory bandwidth for this microbenchmark.

                          scanner    codegen
instructions per cycle    4.48 ipc   2.68 ipc
branches/instructions     30%        30%
branch misprediction      0.03%      0.06%
exercised bandwidth       2 GB/s     7 GB/s
● Example of applying LLVM to a problem outside compilers
● In retrospect, some hallmarks of a good candidate problem for JITting:
○ Highly predictable paths, given that we know some more runtime information
○ E.g. mostly legal inputs, with stronger properties than guaranteed by the spec
○ Dealing correctly with input unpredictability by leveraging a fallback path
● It converts runtime tag decoding into a static number of compares (plus compile-time encoding)
● It inlines handlers
● It coalesces some length checks
○ It achieves much lower latency and CPU cost than the C++ protobuf library or the standalone scanner (7x and 3.5x respectively) in our microbenchmarks
Summary
My takeaways on using LLVM codegen as an outsider to compilers.
● Well-documented instruction set and tutorial (Kaleidoscope)
● Builder API is very helpful
● Type checks on LLVM code are very useful
● Some hiccups with using some LLVM instructions (coming from C)
○ GEP, alloca (vs. scoped local variable), i1 vs. boolean (True < False in signed i1)
● Generating debug information requires learning a different API (didn't do it)
● Tracking signed vs. unsigned integers becomes my problem
● Can link to C code in the host process, but it is not obvious how to inline useful pre-existing routines written in C
LLVM JIT library takeaways
● JIT latency overhead: for these microbenchmarks, ~3ms for plain codegen + mem2reg, up to ~15ms with extra passes such as function inlining and simplify-cfg
○ At 15ms, the JIT would be the bottleneck for data below 100MB
● Hard for a beginner to know which passes should be run, and in what order (multiple times?)
● Can be hard to intuit when passes fail to optimize critical points (e.g. mem2reg: iteration variable not in a register)
● Very tempting to make the codegen phase generate inlined code where it matters (but harder to debug and maintain)
Would it be easier to use C as a JIT initially?
● Often found myself making the generated LLVM code catch up to a reference C code.
● Could still “JIT“ compile it in process by using llvm + libclang● Gets you debug symbols for free (does it?)● Gets you signed vs. unsigned, implicit conversions between integer widths.● Can continue using existing inlined header-only utilities (eg. ParseVarint32)● Learn LLVM best practices from Clang.● Is there an easy to use builder class for C code in Clang?