Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins
description
Transcript of Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins
![Page 1: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/1.jpg)
Chris Olston Benjamin ReedUtkarsh Srivastava
Ravi Kumar Andrew Tomkins
Pig Latin: A Not-So-Foreign Language For Data Processing
Research
![Page 2: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/2.jpg)
Data Processing Renaissance
Internet companies swimming in data• E.g. TBs/day at Yahoo!
Data analysis is “inner loop” of product innovation
Data analysts are skilled programmers
![Page 3: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/3.jpg)
Data Warehousing …?
Scale Often not scalable enough
$ $ $ $Prohibitively expensive at web scale• Up to $200K/TB
SQL• Little control over execution method• Query optimization is hard• Parallel environment• Little or no statistics• Lots of UDFs
![Page 4: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/4.jpg)
New Systems For Data Analysis
Map-Reduce
Apache Hadoop
Dryad
. . .
![Page 5: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/5.jpg)
Map-Reduce
Inputrecords
k1 v1
k2 v2
k1 v3
k2 v4
k1 v5
map
map
k1 v1
k1 v3
k1 v5
k2 v2
k2 v4
Outputrecords
reduce
reduce
Just a group-by-aggregate?
![Page 6: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/6.jpg)
The Map-Reduce Appeal
ScaleScalable due to simpler design• Only parallelizable operations• No transactions
$ Runs on cheap commodity hardware
Procedural Control- a processing “pipe”SQL
![Page 7: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/7.jpg)
Disadvantages
1. Extremely rigid data flow
Other flows constantly hacked in
Join, Union Split
M R
M M R M
Chains
2. Common operations must be coded by hand• Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions• Difficult to maintain, extend, and optimize
![Page 8: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/8.jpg)
Pros And Cons
Need a high-level, general data flow language
![Page 9: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/9.jpg)
Enter Pig Latin
Pig Latin
Need a high-level, general data flow language
![Page 10: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/10.jpg)
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Salient features
• Implementation
![Page 11: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/11.jpg)
Example Data Analysis Task
User Url Time
Amy cnn.com 8:00
Amy bbc.com 10:00
Amy flickr.com 10:05
Fred cnn.com 12:00
Find the top 10 most visited pages in each category
Url Category PageRank
cnn.com News 0.9
bbc.com News 0.8
flickr.com Photos 0.7
espn.com Sports 0.9
Visits Url Info
![Page 12: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/12.jpg)
Data Flow
Load Visits
Group by url
Foreach urlgenerate count
Load Url Info
Join on url
Group by category
Foreach categorygenerate top10 urls
![Page 13: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/13.jpg)
In Pig Latinvisits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(visits);
urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);visitCounts = join visitCounts by url, urlInfo by url;
gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into ‘/data/topUrls’;
![Page 14: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/14.jpg)
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Salient features
• Implementation
![Page 15: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/15.jpg)
Step-by-step Procedural ControlTarget users are entrenched procedural programmers
The step-by-step method of creating a program in Pig is much cleaner and simpler to use than the single block method of SQL. It is easier to keep track of what your variables are, and where you are in the process of analyzing your data.
Jasmine NovakEngineer, Yahoo!
• Automatic query optimization is hard • Pig Latin does not preclude optimization
With the various interleaved clauses in SQL, it is difficult to know what is actually happening sequentially. With Pig, the data nesting and the temporary tables get abstracted away. Pig has fewer primitives than SQL does, but it’s more powerful.
David CiemiewiczSearch Excellence, Yahoo!
![Page 16: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/16.jpg)
visits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(urlVisits);
urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into ‘/data/topUrls’;
Quick Start and Interoperability
Operates directly over files
![Page 17: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/17.jpg)
visits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(urlVisits);
urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into ‘/data/topUrls’;
Quick Start and Interoperability
Schemas optional; Can be assigned dynamically
![Page 18: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/18.jpg)
visits = load ‘/data/visits’ as (user, url, time);gVisits = group visits by url;visitCounts = foreach gVisits generate url, count(urlVisits);
urlInfo = load ‘/data/urlInfo’ as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;gCategories = group visitCounts by category;topUrls = foreach gCategories generate top(visitCounts,10);
store topUrls into ‘/data/topUrls’;
User-Code as a First-Class Citizen
User-defined functions (UDFs) can be used in every construct• Load, Store• Group, Filter, Foreach
![Page 19: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/19.jpg)
• Pig Latin has a fully-nestable data model with:– Atomic values, tuples, bags (lists), and maps
• More natural to programmers than flat tuples• Avoids expensive joins• See paper
Nested Data Model
yahoo ,financeemailnews
![Page 20: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/20.jpg)
Outline
• Map-Reduce and the need for Pig Latin
• Pig Latin example
• Novel features
• Implementation
![Page 21: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/21.jpg)
Implementation
cluster
Hadoop Map-Reduce
Pig
SQL
automaticrewrite +optimize
or
or
user
Pig is open-source.http://incubator.apache.org/pig
![Page 22: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/22.jpg)
Compilation into Map-Reduce
Load Visits
Group by url
Foreach urlgenerate count
Load Url Info
Join on url
Group by category
Foreach categorygenerate top10(urls)
Map1
Reduce1Map2
Reduce2
Map3
Reduce3
Every group or join operation forms a map-reduce boundary
Other operations pipelined into map and reduce phases
![Page 23: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/23.jpg)
Usage
• First production release about a year ago
• 150+ early adopters within Yahoo!
• Over 25% of the Yahoo! map-reduce user base
![Page 24: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/24.jpg)
Related Work
• Sawzall– Data processing language on top of map-reduce– Rigid structure of filtering followed by aggregation
• DryadLINQ– SQL-like language on top of Dryad
• Nested data models– Object-oriented databases
![Page 25: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/25.jpg)
Future Work
• Optional “safe” query optimizer– Performs only high-confidence rewrites
• User interface– Boxes and arrows UI– Promote collaboration, sharing code fragments and
UDFs
• Tight integration with a scripting language– Use loops, conditionals of host language
![Page 26: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/26.jpg)
Arun MurthyPi SongSanthosh SrinivasanAmir Youssefi
Shubham ChopraAlan GatesShravan NarayanamurthyOlga Natkovich
Credits
![Page 27: Chris Olston Benjamin Reed Utkarsh Srivastava Ravi Kumar Andrew Tomkins](https://reader036.fdocuments.us/reader036/viewer/2022081520/56813a8b550346895da28613/html5/thumbnails/27.jpg)
Summary
• Big demand for parallel data processing– Emerging tools that do not look like SQL DBMS– Programmers like dataflow pipes over static files
• Hence the excitement about Map-Reduce
• But, Map-Reduce is too low-level and rigid
Pig LatinSweet spot between map-reduce and SQL