Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark...
Transcript of Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark...
![Page 1: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/1.jpg)
Shark
Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica,
Scott Shenker
Hive on Spark
![Page 2: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/2.jpg)
Agenda
• Intro to Spark • Apache Hive • Shark • Shark’s Improvements over Hive • Demo • Alpha status • Future directions
![Page 3: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/3.jpg)
What Spark Is
• Not a wrapper around Hadoop • Separate, fast, MapReduce-‐like engine – In-‐memory data storage for very fast iterative queries
– Powerful optimization and scheduling – Interactive use from Scala interpreter
• Compatible with Hadoop’s storage APIs – Can read/write to any Hadoop-‐supported system, including HDFS, HBase, SequenceFiles, etc
![Page 4: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/4.jpg)
Project History
• Started in summer of 2009 • Open sourced in early 2010 • In use at UC Berkeley, UCSF, Princeton, Klout, Conviva, Quantifind, Yahoo! Research
![Page 5: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/5.jpg)
Spark Programming Model
Resilient distributed datasets (RDDs) Distributed collections of Scala objects Can be cached in memory across cluster nodes • Manipulated like local Scala collections Automatically rebuilt on failure
![Page 6: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/6.jpg)
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
lines = spark.textFile(“hdfs://...”)
errors = lines.filter(_.startsWith(“ERROR”))
messages = errors.map(_.split(‘\t’)(2))
cachedMsgs = messages.cache()
Block 1
Block 2
Block 3
Worker
Worker
Worker
Driver
cachedMsgs.filter(_.contains(“foo”)).count
cachedMsgs.filter(_.contains(“bar”)).count
. . .
tasks
results
Cache 1
Cache 2
Cache 3
Base RDD Transformed RDD
Action
Result: full-‐text search of Wikipedia in <1 sec (vs 20 sec for on-‐disk data)
Result: scaled to 1 TB data in 5-‐7 sec (vs 170 sec for on-‐disk data)
![Page 7: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/7.jpg)
Apache Hive
• Data warehouse solution developed at Facebook
• SQL-‐like language called HiveQL to query structured data stored in HDFS
• Queries compile to Hadoop MapReduce jobs
![Page 8: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/8.jpg)
![Page 9: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/9.jpg)
Hive Principles
• SQL provides a familiar interface for users • Extensible types, functions, and storage formats • Horizontally scalable with high performance on large datasets
![Page 10: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/10.jpg)
Hive Applications
• Reporting • Ad hoc analysis • ETL for machine learning …
![Page 11: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/11.jpg)
Hive Downsides • Not interactive – Hadoop startup latency is ~20 seconds, even for small jobs
• No query locality – If queries operate on the same subset of data, they still run from scratch
– Reading data from disk is often bottleneck
• Requires separate machine learning dataflow
![Page 12: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/12.jpg)
Shark Motivation • Exploit temporal locality: working set of data can often fit in memory to be reused between queries • Provide low latency on small queries • Integrate distributed UDF’s into SQL
![Page 13: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/13.jpg)
Introducing Shark • Shark = Spark +
Hive • Run HiveQL queries through Spark with Hive UDF, UDAF, SerDe • Utilize Spark’s in-‐memory RDD caching and flexible language capabilities
![Page 14: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/14.jpg)
Shark in the AMP Stack
Mesos
Spark
Private Cluster Amazon EC2
… Had
oop
MPI
Bagel (Pregel on Spark)
Shark …
Debug Tools
Streaming Spark
![Page 15: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/15.jpg)
Shark
• Physical operators using Spark • Rely on Spark’s fast execution, fault tolerance, and in-‐memory RDD’s
• Reuse as much Hive code as possible – Convert logical query plan generated from Hive into Spark execution graph
• Compatible with Hive – Run HiveQL queries on existing HDFS data using Hive metadata, without modifications
![Page 16: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/16.jpg)
![Page 17: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/17.jpg)
0.1 alpha: 84% Hive tests passing (575 out of 683)
http://github.com/amplab/shark
![Page 18: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/18.jpg)
0.1 alpha • Experimental SQL/RDD integration • User selected caching with CTAS • Columnar RDD cache • Some performance improvements
![Page 19: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/19.jpg)
Caching
• User selected caching with CTAS
• CREATE TABLE mytable_cached AS SELECT * FROM mytable WHERE count > 10;
• mytable_cached becomes an in-‐memory materialized view that can be used to answer queries.
![Page 20: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/20.jpg)
SQL/Spark Integration
• Allow users to implement sophisticated algorithms as UDFs in Spark
• Query processing UDFs are streamlined val rdd = sc.sql2rdd( "select foo, count(*) c from pokes group by foo")println(rdd.count)println(rdd.mapRows(_.getInt(“c”)).reduce(_+_))
![Page 21: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/21.jpg)
Performance optimizations
• Hash-‐based shuffle (speeds up group-‐bys) – Hadoop does sort-‐based shuffle which can be expensive.
• Limit push-‐down in order by – select * from pokes order by foo limit 10
• Columnar cache
![Page 22: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/22.jpg)
Large Caches in JVM • Java objects have large overhead (2-‐4x the actual data size) • Boxing/Unboxing is slow • GC is a bottleneck if we have many long-‐lived objects
![Page 23: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/23.jpg)
Example: Caching Rows Int, Double, Boolean 25 10.7 True
24 + 24 + 16 = 64 bytes* Should be ~ 13 bytes
*Estimated on 64 bit VM. Implementations may vary.
![Page 24: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/24.jpg)
Possible Solutions 1. Store deserialized Java objects 2. Serialize objects to in-‐memory
byte array 3. Use memory slab external to
JVM 4. Store data in primitive arrays
![Page 25: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/25.jpg)
Deserialized Java Objects
+ Fast (requires little CPU) (Up to 5X speedup) -‐ Poor memory usage (3X worse) -‐ Lots of Garbage Collection overhead
![Page 26: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/26.jpg)
Serialize to Bytes + Best memory efficiency + Efficient GC -‐ CPU usage to serialize/deserialize each obj
![Page 27: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/27.jpg)
Store in External Slab + Efficient memory + No GC -‐ CPU usage to serialize/deserialize each obj -‐ More difficult to implement
![Page 28: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/28.jpg)
Primitive Column Arrays
+ Efficient memory (similar to byte arrays)
+ Efficient GC + No extra CPU overhead
![Page 29: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/29.jpg)
Column vs Row Storage
1
Column Row 2 3 1
john mike sally
4.1 3.5 6.4
john 4.1
2 mike 3.5
3 sally 6.4
![Page 30: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/30.jpg)
Columnar Benefits • Better compression • Fewer objects to track • Only materialize necessary columns => Less CPU
![Page 31: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/31.jpg)
Columnar in Shark • Store all primitive columns in primitive arrays on each node • Less deserialization overhead • Space efficient and easier to garbage collect
![Page 32: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/32.jpg)
Result Achieved efficient storage space without sacrificing performance
![Page 33: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/33.jpg)
Limit Pushdown in Sort
SELECT * FROM table ORDER BY key LIMIT 10;
![Page 34: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/34.jpg)
Main Idea • The limit can be applied before sorting all data to increase performance. • Each mapper computes its Top K and then a single reducer merges these
![Page 35: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/35.jpg)
Possible Implementations
1. PriorityQueue (Heap) to keep running top k in one traversal on each mapper.
O(n logk)
![Page 36: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/36.jpg)
Possible Implementations
2. QuickSelect algorithm. Essentially quicksort, but prunes elements on one side of pivot when possible.
O(n)
![Page 37: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/37.jpg)
Result • We chose to use Google Guava’s implementation of QuickSelect. • We observed much better performance than Hive which applies the limit after sorting on each reducer.
![Page 38: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/38.jpg)
Demo
• 3-‐node EC2 large cluster (2 virtual cores, 7GB of RAM)
• Synthetic TPC-‐H data (2x scale factor)
![Page 39: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/39.jpg)
Some numbers while we wait for Hive to finish running…
• Brown/Stonebraker benchmark 70GB1
– Also used on Hive mailing list2
• 10 Amazon EC2 High Memory Nodes (30GB of RAM/node)
• Naively cache input tables • Compare Shark to Hive 0.7
1 http://database.cs.brown.edu/projects/mapreduce-‐vs-‐dbms/ 2 https://issues.apache.org/jira/browse/HIVE-‐396 1
![Page 40: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/40.jpg)
Benchmarks: Query 1
SELECT * FROM grep WHERE field LIKE ‘%XYZ%’;
• 30GB input table
![Page 41: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/41.jpg)
Benchmark: Query 2
• 5 GB input table
SELECT pagerank, pageURL FROM rankings WHERE pagerank > 10;
![Page 42: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/42.jpg)
Status
• Passing 575 Hive tests out of 683.
• You can help us decide what to implement next.
![Page 43: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/43.jpg)
Unsupported Hive Features
• ADD FILE (use Hadoop distributed cache to distribute user-‐defined scripts)
• STREAMTABLE hint not supported • Outer join with non-‐equi join filter • Table statistics (piggyback file sink) • Table with buckets • ORDER BY with multiple reducers has a bug (will be fixed this week!)
• MapJoin has a serialization bug (will be fixed this week!)
![Page 44: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/44.jpg)
Unsupported Hive Features
• Automatically convert joins to mapjoins • Sort-‐merge mapjoins • Partitions with different input formats • Unique joins • Writing to two or more tables in a single SELECT INSERT
• Archiving data • virtual columns (INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE)
• Merge small files from output
![Page 45: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/45.jpg)
What’s coming up (in a month)?
• Performance improvements (groupbys, joins)
• Demo of Shark on a 100-‐node cluster in SIGMOD 2012 (May 22, Scottsdale, AZ) mining Wikipedia visit log data
![Page 46: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/46.jpg)
What’s coming up (a year)?
• Shark-‐specific query optimizer • Automatic tuning of parameters (e.g. number of reducers)
• Off-‐heap caching (e.g. EhCache) • Automatic caching based on query analysis • Shared caching infrastructure
![Page 47: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/47.jpg)
Conclusion
• Shark is a warehouse solution compatible with Apache Hive
• Can be orders of magnitude faster • We are looking for people to try and we are happy to provide support
![Page 48: Hive%on%Spark%files.meetup.com/3138542/2012-04-23_Shark_alpha_release.pdf · 2012-04-24 · Shark CliffEngle,%Antonio%Lupher,Reynold%Xin, Matei%Zaharia,%Michael%Franklin,%Ion%Stoica,](https://reader035.fdocuments.us/reader035/viewer/2022070913/5fb488a0dca7f80d7c5f6ace/html5/thumbnails/48.jpg)
Thank you!
Questions?
http://shark.cs.berkeley.edu (will be updated tonight)