Gábor Horváth - Code Generation in Serializers and Comparators of Apache Flink
CODE GENERATION IN SERIALIZERS AND COMPARATORS OF APACHE FLINK
GÁBOR HORVÁTH
PARADIGM SHIFT IN BIG DATA PLATFORMS
• Applications used to be I/O bound (Network, Disk)
• InfiniBand, SSDs reduced I/O overhead significantly
• CPU increasingly became a bottleneck
• Even in I/O bound applications, reduced CPU usage might mean reduced electricity costs
SERIALIZATION IN FLINK
• Several methods: Avro, Kryo, Flink
• Flink serialization is more efficient than Kryo
• Not to mention the default Java serialization
• Crucial not just for I/O: Flink also operates on serialized data
• Still some room for improvements
SERIALIZATION IN FLINK
INEFFICIENCIES OF CURRENT FLINK SERIALIZERS
• Fields accessed using reflection
• Each iteration might dispatch to a different method, which inhibits inlining
• Null checks and null and subclass flags
• Extra code to deal with subclasses
• Hard to unroll the loop, upper bound is not a compile-time constant
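// Generic loop of the current POJO serializer: one reflective read and
// one dispatched serialize() call per field.
for (int i = 0; i < numFields; i++) {
    Object o = fields[i].get(value);   // reflective field access
    if (o == null) {
        target.writeBoolean(true);     // null flag
    } else {
        target.writeBoolean(false);
        fieldSerializers[i].serialize(o, target);
    }
}
NO SPECIALIZATION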
SEVERAL SERIALIZER RELATED INNOVATIONS IN APACHE FLINK
• Object reusing overloads
• Delicate type system
• Code generation (not mainline yet, this talk’s topic)
  • Fix the inefficiencies of Flink serializers
RUNTIME CODE GENERATION
• Focus on POJOs (Plain Old Java Objects)
  • Best ROI due to eliminating reflection
• Specialization
  • No reflection for serialization (direct field access code generated)
  • No null checks, subclass handling for primitive types
  • No subclass handling for final types
  • Unrolled loops, better for inlining
• Janino as runtime compiler, FreeMarker as template engine
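To make the specialization points concrete, here is a minimal sketch of the kind of code a generator might emit for one POJO. It is hand-written for illustration and is not the output of the actual generator: the WordCount class, the WordCountSerializerSketch name and the toBytes helper are all assumptions, and plain java.io streams stand in for Flink's output views.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical POJO, used only for illustration.
final class WordCount {
    public String word; // final reference type: may be null, but has no subclasses
    public int count;   // primitive: can never be null
}

// Hand-written stand-in for what a generated, specialized serializer might contain.
final class WordCountSerializerSketch {

    static void serialize(WordCount value, DataOutputStream target) throws IOException {
        // Direct field access replaces the reflective fields[i].get(value).
        if (value.word == null) {
            target.writeBoolean(true);   // null flag, same convention as the generic loop
        } else {
            target.writeBoolean(false);
            target.writeUTF(value.word); // String is final: no subclass flag needed
        }
        // Primitive field: no null flag and no subclass flag at all.
        target.writeInt(value.count);
    }

    // Convenience wrapper (an assumption of this sketch) that serializes into a byte array.
    static byte[] toBytes(WordCount value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            serialize(value, out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Because every field gets its own straight-line statements, there is no loop to unroll and no per-field dispatch, which is what should make the method easier for the JIT to inline.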
QUESTIONNAIRE
• Who has written a custom serializer to improve performance?
• Who has written a custom comparator to improve performance?
• Who used Tuples instead of POJOs only to improve performance?
OVER (soon)
Who wants performance close to Tuples with null value support?
LET’S SEE THE NUMBERS!
6X PERFORMANCE IMPROVEMENT
[Chart legend: Rest of Flink Job; Serializers/Comparators]
NINE MEN’S MORRIS BENCHMARK
• Calculates game-theoretical values of game states
• Iterative job
• Group by, reduce, outer joins, flat maps, and filter
• Heavy use of POJOs
• Real world complexity
LET’S SEE THE NUMBERS!
• Measured on ReducePerformance, WordCountPojo and Nine Men’s Morris on a local machine
• Measured ReducePerformance and Nine Men’s Morris on a cluster
• The results were consistent
LET’S SEE THE NUMBERS! (LOCAL MACHINE)
[Bar chart, vertical axis from 0 to 60. Configurations shown as serializer / comparator: Flink / Flink, Handwritten / Flink, Generated / Generated, Handwritten / Generated]
CLOSE TO HAND WRITTEN SERIALIZERS
• About 20% speedup compared to Flink serializers
• Some gap left to handwritten
  • Smarter getLength
  • Flattening
  • Null and subclass flags
  • Better handling of primitives (less boxing/unboxing, inlining)
  • Janino might generate a bit slower code
HOW DOES THIS WORK?
HIGH LEVEL OVERVIEW: THE TRADITIONAL WAY
[Diagram labels: POJO Object, Serialized POJO, TypeInfo, Serializer, POJO Class, Instantiate]
HIGH LEVEL OVERVIEW: THE NEW WAY
[Diagram labels: POJO Object, Generated Serializer, Serialized POJO, TypeInfo, FreeMarker Template, Janino, Serializer Generator, POJO Class, ClassLoader]
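The generation step in this pipeline can be sketched roughly as follows. This is an illustration under stated assumptions, not the talk's implementation: a plain StringBuilder stands in for the FreeMarker template, the resulting source string is what would be handed to Janino for compilation (no Janino call is made here), and the names SerializerSourceSketch, writeCall and generate are hypothetical. Only primitive fields are handled.

```java
import java.util.Map;

// Builds the source text of a specialized serializer from a field list.
final class SerializerSourceSketch {

    // Maps a primitive field type to the DataOutput call that writes it.
    static String writeCall(String type, String access) {
        switch (type) {
            case "int":     return "target.writeInt(" + access + ");";
            case "long":    return "target.writeLong(" + access + ");";
            case "double":  return "target.writeDouble(" + access + ");";
            case "boolean": return "target.writeBoolean(" + access + ");";
            default:
                throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    // fields maps field name to primitive type name; iteration order is the write order.
    static String generate(String pojoClass, Map<String, String> fields) {
        StringBuilder src = new StringBuilder();
        src.append("public final class ").append(pojoClass).append("GeneratedSerializer {\n");
        src.append("    public static void serialize(").append(pojoClass)
           .append(" value, java.io.DataOutput target) throws java.io.IOException {\n");
        // One statement per field: the loop runs at generation time, not at serialization time.
        for (Map.Entry<String, String> field : fields.entrySet()) {
            src.append("        ")
               .append(writeCall(field.getValue(), "value." + field.getKey()))
               .append("\n");
        }
        src.append("    }\n");
        src.append("}\n");
        return src.toString();
    }
}
```

Passing an ordered map such as a LinkedHashMap with x mapped to int and weight mapped to double yields a class whose serialize method contains exactly two write statements, in that order.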
HOW TO LOAD GENERATED CODE?
• We need to serialize serializers
• First step of deserialization: load the class
• Which ClassLoader to use?
• Custom ClassLoader to the rescue!
[Diagram labels: Source Code, ClassLoader]
MULTIPLE NODES/JVMS?
[Diagram labels: JVM A, JVM B, Serializer, ? Serializer]
MULTIPLE NODES/JVMS?
[Diagram labels: JVM A, JVM B, Wrapper, Serializer, Serializer]
LET’S TRY IT OUT!
Class cast exception:
SerializerA cannot be cast to SerializerA.
LET’S CACHE AND TRY IT OUT!
Class cast exception:
UserObjectA cannot be cast to UserObjectA.
LET’S CACHE AND INVALIDATE AND TRY IT OUT!
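The talk does not spell out the caching scheme, so the following is only a sketch of one way "cache and invalidate" could look. It relies on a JVM fact: two classes with the same name loaded by different ClassLoaders are distinct types, which is how a cast from UserObjectA to UserObjectA can fail. The sketch therefore keys the cache by the Class object rather than by class name, and drops every entry that belongs to a given ClassLoader. The class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache of generated serializers, keyed by the POJO's Class object.
// A Class object is identified by its name AND its defining ClassLoader,
// so the same class name from two loaders gets two separate entries.
final class GeneratedSerializerCacheSketch<S> {

    private final Map<Class<?>, S> cache = new HashMap<>();

    // Returns the cached serializer, running the generator only on a miss.
    synchronized S getOrGenerate(Class<?> pojoClass, Function<Class<?>, S> generator) {
        return cache.computeIfAbsent(pojoClass, generator);
    }

    // Drops every serializer generated for classes of the given loader,
    // for example when the user code of a finished job is discarded.
    synchronized void invalidate(ClassLoader loader) {
        cache.keySet().removeIf(pojoClass -> pojoClass.getClassLoader() == loader);
    }

    synchronized int size() {
        return cache.size();
    }
}
```

Explicit invalidation is used here instead of weak references because a generated serializer would normally hold strong references to the POJO class, which would keep a weakly keyed entry alive anyway.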
ACTUALLY... THERE ARE A COUPLE MORE
• Janino bugs
• Compatibility with Scala POJO-like classes
• Generated code is harder to debug
• …
WHAT’S NEXT?
• Versioning serialization format
• Replace reflection where performance matters
  • d.sortPartition("f0.author", Order.DESCENDING);
• Better utilization of getLength information
• Eliminate redundant null/subclass flags
• Beating Tuples!
DISTANT FUTURE
• Vision: more JVM independent optimizations!
• Columnar serialization format (end to end optimization)
• Final goal: Faster than naive handwritten serializers!
• Customized NormalizedKeySorter
• Lots of opportunities due to the delicate type system
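For context on the NormalizedKeySorter item: a normalized key is a byte sequence whose unsigned byte-wise order matches the order of the original key, so a sorter can compare raw bytes without deserializing. The sketch below shows the idea for a single int key. It is a simplified illustration with hypothetical names, not Flink's sorter code.

```java
// Order-preserving byte encoding of an int key.
final class NormalizedKeySketch {

    // Flips the sign bit and writes big-endian, so that unsigned byte-wise
    // comparison of the result agrees with signed int comparison.
    static byte[] normalizedKey(int value) {
        int flipped = value ^ 0x80000000;
        return new byte[] {
            (byte) (flipped >>> 24),
            (byte) (flipped >>> 16),
            (byte) (flipped >>> 8),
            (byte) flipped
        };
    }

    // Lexicographic comparison of two byte arrays, treating bytes as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }
}
```

A sorter customized per key type could, in principle, hard-code the key length and comparison for that type, which is presumably the kind of specialization this bullet has in mind.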
CONCLUSION
• Significant performance improvement
• Ground work for lots of possible performance improvements
• ClassLoader issues are not newcomer friendly
• Not part of mainline Flink yet, happy to receive reviews
  • Jira: FLINK-3599
ACKNOWLEDGEMENT
• Huge thanks to GSoC:
  • Márton Balassi
  • Gábor Gévay
• Thanks to data Artisans for brainstorming
• Thanks for your attention!