Avrilia Floratou (University of Wisconsin – Madison) Jignesh M. Patel (University of Wisconsin –...




  • Slide 1

Column-Oriented Storage Techniques for MapReduce
Avrilia Floratou (University of Wisconsin – Madison)
Jignesh M. Patel (University of Wisconsin – Madison)
Eugene J. Shekita (while at IBM Almaden Research Center)
Sandeep Tata (IBM Almaden Research Center)

  • Slide 2

Motivation
[Diagram labels: Databases; MapReduce; Column-Oriented Storage; Performance; Programmability; Fault tolerance]

  • Slide 3

Challenges
  • How to incorporate columnar storage into an existing MR system (Hadoop) without changing its core parts?
  • How can columnar storage operate efficiently on top of a DFS (HDFS)?
  • Is it easy to apply well-studied techniques from the database field to the MapReduce framework, given that:
      • It processes one tuple at a time.
      • It does not use a restricted set of operators.
      • It is used to process complex data types.

  • Slide 4

Outline
  • Column-Oriented Storage
  • Lazy Tuple Construction
  • Compression
  • Experimental Evaluation
  • Conclusions

  • Slide 5

Column-Oriented Storage in Hadoop

Name  | Age | Info
Joe   | 23  | hobbies: {tennis} friends: {Ann, Nick}
David | 32  | friends: {George}
John  | 45  | hobbies: {tennis, golf}
Smith | 65  | hobbies: {swimming} friends: {Helen}

Row layout, split across the 1st and 2nd nodes:

Name  | Age | Info
Joe   | 23  | hobbies: {tennis} friends: {Ann, Nick}
David | 32  | friends: {George}

Name  | Age | Info
John  | 45  | hobbies: {tennis, golf}
Smith | 65  | hobbies: {swimming} friends: {Helen}

Column layout of the same two splits:

Name: Joe, David
Age:  23, 32
Info: hobbies: {tennis} friends: {Ann, Nick}; friends: {George}

Name: John, Smith
Age:  45, 65
Info: hobbies: {tennis, golf}; hobbies: {swimming} friends: {Helen}

Eliminate unnecessary I/O.
Introduce a new InputFormat: ColumnInputFormat (CIF).

  • Slide 6

Replication and Co-location
[Diagram: Nodes A, B, C, and D. The Name, Age, and Info column files for the rows Joe and David, together with their replicas, are shown placed under the HDFS replication policy and under CPP.]

Introduce a new column placement policy (CPP).

  • Slide 7

Example

Record:

Age | Name
23  | Joe
32  | David
45  | John
30  | Mary
50  | Ann

Map Method: if (age < 35) return name
Records shown at the map method: (23, Joe), (32, David)

What if age > 35? Can we avoid reading and deserializing the name field?

  • Slide 8

Outline
  • Column-Oriented Storage
  • Lazy Tuple Construction
  • Compression
  • Experiments
  • Conclusions

  • Slide 9

Lazy Tuple Construction

Deserialization of each record field is deferred to the point where it is actually accessed, i.e. when the get() methods are called.

    Mapper(NullWritable key, Record value) {
        String name;
        int age = value.get("age");
        if (age < 35) name = value.get("name");
    }

    Mapper(NullWritable key, LazyRecord value) {
        String name;
        int age = value.get("age");
        if (age < 35) name = value.get("name");
    }

  • Slide 10

Skip List (Logical Behavior)
[Diagram: a skip list over records R1 ... R100, drawn as three levels:
  all records:      R1, R2, ..., R10, ..., R20, ..., R90, ..., R99
  Skip 10:          R1, R10, R20, ..., R90
  Skip 100 Records: R1, R100]

  • Slide 11

Example
[Diagram: the Age column (23, 39, 45, 30) next to the Name column (Joe, Jane, David, ..., Mary, ..., Ann, ..., John). The Name column embeds skip entries, labeled "Skip Bytes": Skip10 = 1002 and Skip100 = 9017, then Skip10 = 868 after 10 rows. Row labels shown: 0, 1, 2, 102. Spans marked: 10 rows, 100 rows.]

Map method: if (age < 35) return name

  • Slide 12

Example
[Diagram: the Age column (23, 39, 45, 30) next to the Info column (hobbies: tennis, friends: Ann, Nick; Null; friends: George; ...; hobbies: tennis, golf). The Info column embeds skip entries: Skip10 = 2013 and Skip100 = 19400, then Skip10 = 1246 after 10 rows. Spans marked: 10 rows, 100 rows.]

Map method: if (age < 35) return hobbies

  • Slide 13

Outline
  • Column-Oriented Storage
  • Lazy Record Construction
  • Compression
  • Experiments
  • Conclusions

  • Slide 14

Compression

Compressed Blocks:
  B1: # records in B1, then an LZO/ZLIB compressed block, RID: 0 - 9
  B2: # records in B2, then an LZO/ZLIB compressed block, RID: 10 - 35
  [Diagram label: Decompress]

Dictionary Compressed Skip Lists:
  Dictionary: hobbies: 0, friends: 1
  [Diagram: column entries with embedded skip entries, labeled "Skip Bytes": Skip10 = 210, Skip100 = 1709; 0: {tennis} 1: {Ann, Nick}; Null; 1: {George}; Skip10 = 304; 0: {tennis, golf}. Spans marked: 10 rows, 100 rows.]

  • Slide 15

Outline
  • Column-Oriented Storage
  • Lazy Record Construction
  • Compression
  • Experiments
  • Conclusions

  • Slide 16

RCFile

Source table:

Name  | Age | Info
Joe   | 23  | hobbies: {tennis} friends: {Ann, Nick}
David | 32  | friends: {George}
John  | 45  | hobbies: {tennis, golf}
Smith | 65  | hobbies: {swimming} friends: {Helen}

Row Group 1 (preceded by Metadata):
  Joe, David
  23, 32
  {hobbies: {tennis} friends: {Ann, Nick}}, {friends: {George}}

Row Group 2:
  John, Smith
  45, 65
  {hobbies: {tennis, golf}}, {hobbies: {swimming} friends: {Helen}}

  • Slide 17