CS346: Advanced Databases Alexandra I. Cristea [email protected] MapReduce and Hadoop.
MapReduce and Hadoop
Outline
Reading: find resources online, or pick from:
- Data Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Morgan & Claypool, Chapters 1-3
- www.coreservlets.com/hadoop-tutorial/, Marty Hall
- Hadoop: The Definitive Guide, Tom White, O'Reilly Media, Chapters 1-3; 16 (part of); 17 (part of); 20 (part of)
Outline: Data is big and getting bigger, and new tools are emerging:
- Hadoop: a file system and a processing paradigm (MapReduce)
- HBase: a way of storing and retrieving large amounts of data
- Pig and Hive: high-level abstractions to make Hadoop easier
CS346 Advanced Databases2
Why: Data is Massive
- Data is growing faster than our ability to store or index it
- There are 3 billion telephone calls in the USA each day, 30 billion emails daily, 1 billion SMS and IMs
- Scientific data: NASA's observation satellites each generate billions of readings per day
- IP network traffic: up to 1 billion packets per hour per router, and each ISP has many (hundreds of) routers!
- Whole-genome sequences for many species are now available, each megabytes to gigabytes in size
Other examples: massive data
- High-energy physics community: PB-scale databases by 2005
  - Now the Large Hadron Collider near Geneva, the world's largest particle accelerator, recreating Big Bang conditions: ~15 PB per year
- Google: in 2008, processing 20 PB a day!
- eBay: in 2009, 8.5 PB of user data, 170 trillion records, 150 billion new records per day
- Facebook: 2.5 PB of user data, 15 TB growth per day
- Petabyte datasets are the norm!
However: the bottleneck is disk access
- Moore's law: disk capacity went from tens of MB in the 1980s to a few TB now (several orders of magnitude of growth)
- Latency: only a 2x improvement in the last quarter century; bandwidth: 50x
- 1990s: 1,370 MB of storage, transferred at 4.4 MB/s, read in about 5 min
- Now: 1 TB of storage, transferred at 100 MB/s, read in about 2.5 h
- Writing is even slower!
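The read times above are just capacity divided by transfer rate; a quick plain-Python check, taking the 1990s drive as 1,370 MB (the figure usually quoted, e.g. in Hadoop: The Definitive Guide):

```python
def read_time_seconds(capacity_mb, rate_mb_per_s):
    """Time (s) to stream an entire disk sequentially."""
    return capacity_mb / rate_mb_per_s

# ~1990 drive: 1,370 MB at 4.4 MB/s -> roughly 5 minutes
t_1990 = read_time_seconds(1370, 4.4)
assert 300 < t_1990 < 320        # ~311 s

# Today: 1 TB (10^6 MB) at 100 MB/s -> a few hours
t_now = read_time_seconds(1_000_000, 100)
assert t_now == 10_000           # 10,000 s, i.e. between 2.5 and 3 hours
```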
Massive Data Management
Must perform queries on this massive data:
- Scientific research (monitor environment, species)
- System management (spot faults, drops, failures)
- Customer research (association rules, new offers)
- Revenue protection (phone fraud, service abuse)
- Natural language processing (for unstructured (user) data)
Else, why even collect this data?
Solution: Parallel Processing
- Use many machines (hardware)
- Parallel access
- Issue: hardware failure
  - RAID, for example, uses redundant copies
  - HDFS uses a different approach
- Issue: data combination
  - MapReduce abstracts the read/write problem, transforming it into a computation over sets of keys and values
- Hadoop: a reliable, scalable platform for storage and analysis
Hadoop
- Hadoop is an open-source architecture for handling big data
  - First developed by Doug Cutting (and named after a toy elephant)
  - Google's original implementations are not publicly available
  - Currently managed by the Apache Software Foundation
- Hadoop now: more than just MapReduce (so more than a batch query processor)
(some) Hadoop Tools and Products
- Many tools/products now sit on top of Hadoop
  - HBase: a (non-relational) distributed database; a key-value store over HDFS, with online read/write access as well as batch R/W
  - YARN: allows any distributed program to run on Hadoop
  - Hive: data warehouse infrastructure, developed at Facebook
  - Pig: a high-level language that compiles to Hadoop, from Yahoo!
  - Mahout: machine learning algorithms in Hadoop
Hadoop in business use
- Hadoop is widely used in technology-based businesses:
  - Became an Apache top-level project in 2008
  - Used by Facebook, LinkedIn, Twitter, IBM, Amazon, Adobe, eBay, Last.fm, the New York Times
  - Offered as part of vendor platforms: Amazon, Cloudera, Microsoft Azure, MapR, Hortonworks, EMC, IBM, Oracle
  - Sort benchmark: 1 TB in 209 s (2008); 4.27 TB per minute (2014); an ongoing competition
Hadoop Cluster
- A Hadoop cluster implements the MapReduce framework
  - Many commodity (off-the-shelf) machines with a fast network
  - Placed in physical proximity (allows fast communication)
  - Typically rack-mounted hardware
- Expect and tolerate failures
  - Disks have an MTBF of ~10 years
  - When you have 10,000 disks... expect ~3 to fail per day
  - Jobs can last minutes to hours, so the system must cope with failure!
  - Data is replicated and tasks that fail are retried; the developer does not have to worry about this
Building Blocks – Data Locality
Source: Barroso and Urs Hölzle (2009)
Hadoop philosophy
- "Scale-out, not scale-up"
  - Don't upgrade; add more hardware to the system
  - The end of Moore's law means CPUs are not getting faster
  - Individual disk size is not growing fast
  - So add more machines/disks (scale-out)
  - Allow hardware addition/removal mid-job
Hadoop philosophy (continued)
- "Move code to data, not vice versa"
  - Data is big and distributed, while code is fairly small
  - So do the processing locally, where the data resides
  - May have to move results across the network, though
Hadoop versus the RDBMS
- Hadoop and RDBMSs are not in direct competition
  - They solve different problems, on different kinds of data
- Hadoop: data processing on huge, distributed data (TB-PB)
  - Batch approach: data is not modified frequently, and results take time
  - No guarantees of resilience, no real-time response, no locking
  - Data is not in relations, but key-values
- RDBMS: resilient, reliable processing of large data (MB-GB)
  - Provides a high-level language (SQL) to deal with structured data
  - Hits a ceiling when scaling up beyond tens of TB
- But the gaps between the two are narrowing
  - Lots of work to make Hadoop look like a DB (Hive, HBase...)
  - Hadoop and RDBMSs can coexist: DB front-end, Hadoop log analysis
Hadoop versus the RDBMS
The differences are blurring
Running Hadoop
- Different parts of the Hadoop ecosystem have incompatibilities
  - They require certain versions to play well together
- This led to Hadoop distributions (like Linux distributions)
  - Curated releases, e.g. cloudera.com/hadoop
  - Available as a Linux package, or as a virtual machine image
- How to run Hadoop?
  - Run it on your own (multi-core) machine (for development/testing)
  - Use a local cluster that you have access to
  - Go to the cloud ($$$): Amazon S3, Cloudera, Microsoft Azure
See Jonny Foss's instructions
HDFS: the Hadoop Distributed File System
- The Hadoop Distributed File System is an important part of Hadoop
  - Good for storing truly massive data
- Some HDFS numbers:
  - Suitable for files in the TB-PB range
  - Can store millions to billions of files
  - Suits a minimum size of 100 MB+ per file
- Assumptions about the data
  - The data will be written once, read many times
  - No dynamic updates: append only
  - Optimized for streaming (sequential) reads, not random access
- Not good for low-latency reads, many small files, or multiple writers
Files and Blocks
- Files are broken into blocks, just like in traditional file systems
  - But each block is much larger: 64 MB or 128 MB (instead of 512 bytes)
  - Ensures that the time to seek << the time to transfer
  - Compare a 10 ms seek with a 100 MB/s read rate:
    if seek = 1% of transfer time, the block size works out to ~100 MB (default 128 MB)
  - Files smaller than the block size don't occupy a whole block!
  - Metadata is stored separately
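The block-size rule of thumb above follows from one line of arithmetic: if a 10 ms seek should cost at most 1% of the transfer time, the transfer must take 1 s, which at 100 MB/s means roughly 100 MB per block. A sketch:

```python
def min_block_size_mb(seek_ms, rate_mb_per_s, seek_fraction=0.01):
    """Smallest block size (MB) for which the seek time is at most
    the given fraction of the transfer time."""
    transfer_s = (seek_ms / 1000) / seek_fraction   # required transfer time
    return transfer_s * rate_mb_per_s

# 10 ms seek, 100 MB/s transfer, seek <= 1% of transfer -> ~100 MB blocks
assert min_block_size_mb(10, 100) == 100.0
```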
Files and Blocks
- Blocks are replicated across different datanodes
  - The default replication level is 3, all managed by the namenode
HDFS Daemons
- Namenode: the master
  - Manages the file system's namespace
  - Maps from file names to where the data is stored, like other file systems
  - Can be a single point of failure (SPOF) in the system
- Datanodes: the workers
  - Store and retrieve data blocks
  - Each datanode reports to the namenode
- Secondary namenode: does housekeeping (checkpointing, logging)
  - Not directly a backup for the namenode!
  - Not a namenode
- The client running the processes isn't aware of the configuration!
Last time:
- Big data
- Hadoop basics, philosophy, clusters, racks
- Started on HDFS
- HDFS main elements (daemons)
- Next:
  - HDFS, continued
  - MapReduce
Replication and Reliability
- The namenode is "rack aware": it knows how machines are arranged
  - The second replica is on the same rack as the first, but on a different machine
  - The third replica is on a different rack
  - This balances performance (failover time) vs. reliability (independence)
- The namenode does not directly read/write data
  - The client gets data locations from the namenode
  - The client interacts directly with datanodes to read/write data
- The namenode keeps all block metadata in (fast) memory
  - This puts a constraint on the number of files stored: millions of large files
  - Future iterations of Hadoop expect to remove these constraints
HDFS features
- Block caching
  - Datanodes read blocks from disk; frequently accessed files may be explicitly cached in a datanode's memory
- HDFS federation
  - Since the 2.x release series, allows a cluster to scale by adding namenodes, each managing a portion of the filesystem namespace
  - Managed with ViewFileSystem and viewfs:// URIs
Using HDFS file system
- HDFS gives similar control to a traditional file system
  - Paths in the form of directories below a root
  - Can ls (list directory), cat (read file), cd, rm, cp, etc.
  - put: copy a file from the local file system to HDFS
  - get: copy a file from HDFS to the local file system
  - File permissions similar to Unix/Linux
  - hadoop fs -help gives detailed help on every command
- Some HDFS-specific commands, e.g. change the file replication level
  - dfs.replication
  - Can rebalance data, ensuring datanodes are similarly loaded
  - Java API to read/write HDFS files
- Original use for HDFS: store data for MapReduce
MapReduce and Big Data
- MapReduce is a popular paradigm for analyzing massive data
  - Used when the data is much too big for one machine
  - Allows the parallelization of computations over many machines
- Introduced by Jeffrey Dean and Sanjay Ghemawat in 2004
  - The MapReduce model was implemented by the MapReduce system at Google
  - Hadoop MapReduce implements the same ideas
- Allows a large computation to be distributed over many machines
  - Brings the computation to the data, not vice versa
  - The system manages data movement, machine failures, errors
  - The user just has to specify what to do with each piece of data
Motivating MapReduce
- Many computations over big data follow a common outline:
  - The data is formed of many (many) simple records
  - Iterate over each record and extract a value
  - Group together intermediate results with the same properties
  - Aggregate these groups to get the final results
  - Possibly, repeat this process with different functions
- The MapReduce framework abstracts this outline
  - Iterate over records = Map
  - Aggregate the groups = Reduce
What is MapReduce?
- MapReduce draws inspiration from functional programming
  - Map: apply the "map" function to every piece of data
  - Reduce: form the mapped data into groups and apply a function to each group
- Designed for efficiency
  - Process the data in whatever order it is stored, avoiding random access (random access can be very slow over large data)
  - Split the computation over many machines: Map the data in parallel, and Reduce each group in parallel
  - Resilient to failure: if a Map or Reduce task fails, just run it again (requires that tasks are idempotent: they can be repeated on the same input)
MapReduce approach and terminology
- The entire dataset (or a large portion of it) is processed for each query
- Batch query processing: one query over all the data, a brute-force approach
- A MapReduce job = a unit of work from the client = input data + the MapReduce program + configuration information
- Hadoop divides jobs into tasks: map and reduce tasks (scheduled by YARN, running on nodes in the cluster)
- Hadoop divides the input into (input) splits, and runs one map task per split, applying the user-defined map function to each record in the split
Programming in MapReduce
- Data is assumed to be in the form of (key, value) pairs
  - E.g. (key = "CS346", value = "Advanced Databases")
  - E.g. (key = "111-222-3333", value = "(male, 29 years, married...)")
- Abstract view of programming MapReduce. Specify:
  - A Map function: take a (k, v) pair and output some number of (k', v') pairs
  - A Reduce function: take all (k', v') pairs with the same key k', and output a new set of (k'', v'') pairs
  - The "type" of the output (key, value) pairs can differ from the input's
- Many other options/parameters in practice:
  - Can specify a "partition" function for how to map k' to reducers
  - Can specify a "combine" function that aggregates the output of Map
  - Can share some information with all nodes via the distributed cache
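The abstract contract above can be simulated outside Hadoop. The following plain-Python sketch (illustrative, not Hadoop API code) pushes any map and reduce function through a map -> shuffle/sort -> reduce pipeline, with word count as the example job:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> shuffle/sort -> reduce on a list of (k, v) records."""
    # Map phase: each record may emit any number of (k', v') pairs
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)          # shuffle: group values by key
    # Reduce phase: one call per distinct intermediate key
    out = []
    for k2 in sorted(groups):              # sort the keys, as Hadoop does
        out.extend(reduce_fn(k2, groups[k2]))
    return out

# Word count expressed in this model
def wc_map(docid, text):
    for w in text.split():
        yield (w, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
assert result == [("a", 2), ("b", 2), ("c", 1)]
```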
MapReduce schematic
[Diagram: mappers consume input pairs (k1,v1)...(k6,v6) and emit intermediate pairs, e.g. (a,1)(b,2), (c,3)(c,6), (a,5)(c,2), (b,7)(c,8); "Shuffle and Sort" aggregates values by key: a -> (1,5), b -> (2,7), c -> (2,3,6,8); reducers then emit the final pairs (r1,s1), (r2,s2), (r3,s3).]
“Hello World”: Word Count
- The generic MapReduce computation that's always used...
  - Count the occurrences of each word in a (massive) document collection
Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);
lintool.github.io/MapReduce-course-2013s/syllabus.html
private static class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString(); // value is already a Text; no cast needed
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      WORD.set(itr.nextToken());
      context.write(WORD, ONE);
    }
  }
}
private static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}
Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
MapReduce and Graphs
- MapReduce is a powerful way of handling big graph data
  - Graph: a network of nodes linked by edges
  - Many big graphs: the web, (social network) friendships, citations
    - Often have millions of nodes and billions of edges
    - Facebook: > 1 billion nodes, 100 billion edges
- Many complex calculations are needed over large graphs
  - Rank the importance of nodes (for web search)
  - Predict which links will be added soon / suggest links (social networks)
  - Label nodes based on classification over graphs (recommendation)
- MapReduce allows computation over big graphs
  - Represent each edge as a value in a key-value pair
MapReduce example: compute degree
- The degree of a node is the number of edges incident on it
  - Here, assume undirected edges
- To compute degree in MapReduce:
  - Map: for edge (E, (v, w)), output (v, 1), (w, 1)
  - Reduce: for (v, (c1, c2, ..., cn)), output (v, sum_{i=1..n} c_i)
- Advanced: could use "combine" to compute partial sums
  - E.g. Combine((A, 1), (A, 1), (B, 1)) = ((A, 2), (B, 1))
Worked example: a graph with nodes A, B, C, D and edges E1=(A,B), E2=(A,C), E3=(A,D), E4=(B,C)
Map:     (E1,(A,B)) (E2,(A,C)) (E3,(A,D)) (E4,(B,C))
         -> (A,1)(B,1), (A,1)(C,1), (A,1)(D,1), (B,1)(C,1)
Shuffle: (A,(1,1,1)), (B,(1,1)), (C,(1,1)), (D,(1))
Reduce:  (A,3), (B,2), (C,2), (D,1)
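The same computation can be checked with a small plain-Python simulation (a sketch, not Hadoop code; the edge list below is the four-node example from the slide):

```python
from collections import defaultdict

def degree_map(edge):
    # Map: edge (v, w) emits (v, 1) and (w, 1)
    v, w = edge
    yield (v, 1)
    yield (w, 1)

def degree_reduce(node, counts):
    # Reduce: the degree of a node is the sum of its emitted 1s
    return (node, sum(counts))

edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

# Shuffle: group the emitted values by key, as Hadoop would
groups = defaultdict(list)
for edge in edges:
    for node, one in degree_map(edge):
        groups[node].append(one)

degrees = dict(degree_reduce(n, cs) for n, cs in groups.items())
assert degrees == {"A": 3, "B": 2, "C": 2, "D": 1}
```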
MapReduce Criticism (circa 2008)
Two prominent DB leaders (DeWitt and Stonebraker) complained:
- MapReduce is a step backward in database access:
  - Schemas are good
  - Separation of the schema from the application is good
  - High-level access languages are good
- MapReduce only allows poor implementations
  - Brute force and only brute force (no indexes, for example)
- MapReduce is missing features
  - Bulk loader, indexing, updates, transactions...
- MapReduce is incompatible with DBMS tools
Much subsequent debate and development to remedy these complaints.
Source: blog post by DeWitt and Stonebraker (http://craig-henderson.blogspot.co.uk/2009/11/dewitt-and-stonebrakers-mapreduce-major.html)
Relational Databases vs. MapReduce
- Relational databases:
  - Multipurpose: analysis and transactions; batch and interactive
  - Data integrity via ACID transactions [see later]
  - Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
  - Support SQL (and SQL integration, e.g., JDBC)
  - Automatic SQL query optimization
- MapReduce (Hadoop):
  - Designed for large clusters; fault tolerant
  - Data is accessed in "native format"
  - Supports many developing query languages (but not full SQL)
  - Programmers retain control over performance
Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)
Database operations in MapReduce
For SQL-like processing in MapReduce, we need relational operations.
- PROJECT in MapReduce is easy
  - Map over the tuples, emitting new tuples with the appropriate attributes
  - No reducers, unless for regrouping or resorting tuples
  - Or pipeline: perform it in the reducer, after some other processing
- SELECT in MapReduce is easy
  - Map over the tuples, emitting only tuples that meet the criteria
  - No reducers, unless for regrouping or resorting tuples
  - Or pipeline: perform it in the reducer, after some other processing
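Both operations are map-only, which the following plain-Python sketch makes concrete (the `visits` records and field names are purely illustrative):

```python
# Tuples represented as dicts; these records and field names are illustrative.
visits = [
    {"user": "u1", "url": "cnn.com", "time": 95},
    {"user": "u2", "url": "bbc.com", "time": 30},
    {"user": "u1", "url": "bbc.com", "time": 10},
]

# PROJECT: map each tuple to one containing only the wanted attributes
def project(tuples, attrs):
    for t in tuples:
        yield {a: t[a] for a in attrs}

# SELECT: map each tuple to itself if it passes the predicate, else emit nothing
def select(tuples, predicate):
    for t in tuples:
        if predicate(t):
            yield t

urls = list(project(visits, ["url"]))
long_visits = list(select(visits, lambda t: t["time"] > 20))
assert urls == [{"url": "cnn.com"}, {"url": "bbc.com"}, {"url": "bbc.com"}]
assert long_visits == [visits[0], visits[1]]
```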
Last time:
- HDFS features
- Using HDFS
- MapReduce philosophy, terminology, background, 'Hello World' (counting words in large files), criticism, comparison to DBMSs, emulating DB operations (project, select)
Today:
- Continuing with the emulation of DB operations (GROUP BY, JOINs)
- HBase, Pig
Group by… Aggregation
- Example: what is the average time spent per URL?
  - Given data for each visit to a URL, giving the time spent
- In SQL: SELECT url, AVG(time) FROM visits GROUP BY url;
- In MapReduce:
  - Map over the tuples, emitting the time, keyed by url
  - MapReduce automatically groups by keys
  - Compute the average in the reducer
  - Optimize with combiners
    - Not possible for combiners to emit averages directly (think about why not!)
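The combiner pitfall hinted at above is that an average of averages is wrong when the groups have different sizes; the standard fix is for the combiner to emit partial (sum, count) pairs and to divide only in the reducer. A plain-Python sketch with illustrative data:

```python
from collections import defaultdict

# (url, time) pairs as emitted by the mappers; illustrative data
emitted = [("cnn.com", 10), ("cnn.com", 20), ("bbc.com", 30),
           ("cnn.com", 30), ("bbc.com", 10)]

# Combiner: per-mapper partial aggregates as (sum, count), NOT averages
def combine(pairs):
    partial = defaultdict(lambda: [0, 0])
    for url, t in pairs:
        partial[url][0] += t
        partial[url][1] += 1
    return [(url, tuple(sc)) for url, sc in partial.items()]

# Reducer: merge the partials, then divide exactly once at the end
def reduce_avg(url, partials):
    total = sum(s for s, c in partials)
    count = sum(c for s, c in partials)
    return (url, total / count)

# Two mappers, each running the combiner over its own share of the data
partials = combine(emitted[:3]) + combine(emitted[3:])
groups = defaultdict(list)
for url, sc in partials:
    groups[url].append(sc)
averages = dict(reduce_avg(u, ps) for u, ps in groups.items())
assert averages == {"cnn.com": 20.0, "bbc.com": 20.0}
```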
Join Algorithms in MapReduce
- Joins are more difficult to do well
  - Could do a join as a Cartesian product followed by a select
  - But this will kill your system for even moderate data sizes
- Will exploit some "extensions" of MapReduce
  - These allow extra ways to access data (e.g. the distributed cache)
- Several approaches to join in MapReduce
  - Reduce-side join
  - Map-side join
  - In-memory join
Reduce-side Join
- Basic idea: group by the join key
  - Map over both sets of tuples
  - Emit each tuple as the value, with its join key as the intermediate key
  - Hadoop brings together tuples sharing the same key
  - Perform the actual join in the reducer
  - Similar to a "sort-merge join" (but in parallel)
- Different variants, depending on how the join goes:
  - 1-to-1 joins
  - 1-to-many and many-to-many joins
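A plain-Python sketch of the reduce-side join (illustrative tuples; each emitted value carries a tag recording which relation it came from, since both relations are mixed in one shuffle):

```python
from collections import defaultdict

# Illustrative relations, each a list of (join_key, payload) tuples
R = [("k1", "r1"), ("k2", "r2")]
S = [("k1", "s1"), ("k1", "s9"), ("k2", "s2")]

# Map: emit (join_key, (relation_tag, tuple)) for every tuple of both inputs
def join_map(relation, tag):
    for key, val in relation:
        yield (key, (tag, val))

# Shuffle: group by the join key
groups = defaultdict(list)
for key, tagged in list(join_map(R, "R")) + list(join_map(S, "S")):
    groups[key].append(tagged)

# Reduce: within each key, pair every R tuple with every S tuple
joined = []
for key in sorted(groups):
    r_side = [v for tag, v in groups[key] if tag == "R"]
    s_side = [v for tag, v in groups[key] if tag == "S"]
    for r in r_side:
        for s in s_side:
            joined.append((key, r, s))
assert joined == [("k1", "r1", "s1"), ("k1", "r1", "s9"), ("k2", "r2", "s2")]
```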
Reduce-side Join: 1-to-1
[Diagram: Map emits tuples R1, R4, S2, S3 as values, keyed by their join keys; after the shuffle, the reducer for each key sees the matching R tuple and S tuple together.]
Note: we need extra work if we want the attributes ordered!
Reduce-side Join: 1-to-many
[Diagram: Map emits R1, S2, S3, S9 keyed by the join key; the reducer for that key receives R1 together with S2, S3, S9, ...]
Need extra work to get the tuple from R out first.
Reduce-side Join: many to many
- Follow a similar outline in the many-to-many case
  - Need enough memory to store all the tuples from one relation
- Not particularly efficient
  - We end up sending all the data over the network in the shuffle step
Map-side Join: Basic Idea
Assume the two datasets are sorted by the join key (R1, R2, R3, R4 and S1, S2, S3, S4).
A sequential scan through both datasets joins them (equivalent to a merge join).
Doesn't seem to fit the MapReduce model?
Map-side Join: Parallel Scans
- If the datasets are sorted by the join key, then just scan over both
- How can we accomplish this in parallel?
  - Partition and sort both datasets with the same ordering
- In MapReduce:
  - Map over one dataset, reading from the corresponding partition of the other (requires reading from (distributed) data in Map)
  - No reducers necessary (unless to repartition or resort)
- Requires the data to be organized just how we want it
  - If not, fall back to the reduce-side join
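The per-partition scan described above is an ordinary two-pointer merge join. A plain-Python sketch over one sorted partition pair (illustrative data; join keys assumed unique within each input):

```python
# One partition of each dataset, both sorted by the join key
R = [(1, "r1"), (2, "r2"), (4, "r4")]
S = [(2, "s2"), (3, "s3"), (4, "s4")]

def merge_join(r, s):
    """Single sequential scan over two sorted inputs (keys unique per input)."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] == s[j][0]:
            out.append((r[i][0], r[i][1], s[j][1]))  # matching keys: join
            i += 1
            j += 1
        elif r[i][0] < s[j][0]:
            i += 1          # advance the side with the smaller key
        else:
            j += 1
    return out

assert merge_join(R, S) == [(2, "r2", "s2"), (4, "r4", "s4")]
```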
In-Memory (Memory-backed) Join
- Basic idea: load one dataset into memory, stream over the other
  - Works if R << S, and R fits into memory
  - Equivalent to a hash join
- MapReduce implementation
  - Distribute R to all nodes: use the distributed cache
  - Map over S; each mapper loads R into memory, hashed by the join key
  - For every tuple in S, look up its join key in R
  - No reducers, unless for regrouping or resorting tuples
- Striped variant (like a single-loop join): if R is too big for memory
  - Divide R into R1, R2, R3, ... such that each Rn fits into memory
  - Perform the in-memory join for each n: Rn ⋈ S
  - Take the union of all the join results
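The build/probe structure of this hash join can be sketched in plain Python (illustrative data; R stands for the small relation each mapper would load from the distributed cache):

```python
# Small relation R (fits in memory) and large relation S (streamed)
R = [(1, "r1"), (2, "r2")]
S = [(2, "s2"), (3, "s3"), (1, "s1"), (2, "s9")]

# Build phase: hash R by join key (done once per mapper, after fetching
# R from the distributed cache)
hashed_R = {}
for key, val in R:
    hashed_R.setdefault(key, []).append(val)

# Probe phase: stream over S, probing the hash table for each tuple
joined = []
for key, s_val in S:
    for r_val in hashed_R.get(key, []):   # unmatched S tuples emit nothing
        joined.append((key, r_val, s_val))

assert joined == [(2, "r2", "s2"), (1, "r1", "s1"), (2, "r2", "s9")]
```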
Summary: Relational Processing in Hadoop
- MapReduce algorithms for processing relational data:
  - Group by, sorting and partitioning are handled automatically by shuffle/sort in MapReduce
  - Selection, projection, and other computations (e.g., aggregation) are performed either in the mapper or in the reducer
  - Multiple strategies for relational joins:
    - Prefer in-memory over map-side over reduce-side
    - Reduce-side is the most general; in-memory is the most restricted
- Complex operations will need multiple MapReduce jobs
  - Example: the top ten URLs in terms of average time spent
  - Opportunities for automatic optimization
HBase
- HBase (Hadoop Database) is a column-oriented data store
  - Started in 2006 by Chad Walters and Jim Kellerman at Powerset (natural-language search for the web; now owned by Microsoft)
  - An example of a "NoSQL" database: not the full relational model
  - Open source, written in Java
  - Does allow update operations (unlike HDFS...)
- HBase is designed to handle very large tables
  - Billions of rows, millions of columns
  - Inspired by "BigTable", internal to Google
Suitability of HBase
- HBase suits applications that:
  - Don't need the full power of a relational database
  - Have a large enough cluster (5+ nodes)
  - Have very large data (obviously): 100 million to billions of rows
    - Typical use case: crawled webpages and their attributes
  - Don't need real-time responses: it can be slow to respond (latency)
  - Have many clients
  - Have an access pattern that is mostly selects or range scans by key
  - Have sparse data (many attributes, mostly null)
  - Don't need group by/join etc.
HBase data model
- The HBase data model is similar to the relational model:
  - Data is stored in tables, which have rows
  - Each row is identified/referenced by a unique key value
  - Rows have columns, which are grouped into column families
- Data (bytes) is stored in cells
  - Each cell is identified by (row, column-family, column)
  - Limited support for secondary indexes on non-key values
  - Cell contents are versioned: multiple values are stored (default: 3)
  - Optimized to provide access to the most recent version
  - Old versions can be accessed by timestamp
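The (row, column-family, column) addressing with versions can be modelled as a map from cell coordinates to a timestamped value list. A minimal plain-Python sketch (illustrative keys; not the HBase API) that keeps the 3 most recent versions, as the slide notes is the default:

```python
# Cell address: (row_key, "family:column") -> list of (timestamp, value),
# newest first, trimmed to MAX_VERSIONS. Purely illustrative, not HBase code.
MAX_VERSIONS = 3
table = {}

def put(row, family_col, ts, value):
    versions = table.setdefault((row, family_col), [])
    versions.append((ts, value))
    versions.sort(reverse=True)            # newest version first
    del versions[MAX_VERSIONS:]            # keep only the newest 3

def get(row, family_col, ts=None):
    versions = table.get((row, family_col), [])
    if ts is None:
        return versions[0][1] if versions else None   # most recent version
    for t, v in versions:                  # newest first: first t <= ts wins
        if t <= ts:
            return v
    return None

put("row1", "info:title", 1, "draft")
put("row1", "info:title", 2, "final")
assert get("row1", "info:title") == "final"        # latest version
assert get("row1", "info:title", ts=1) == "draft"  # older version by timestamp
```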
HBase data storage
- Rows are kept in sorted order of key; columns can be added on the fly, as long as the family exists
- Example of a (logical) data layout: [table not reproduced in this transcript]
- Data is stored in HFiles, usually under HDFS
  - Empty cells are not explicitly stored, which allows very sparse data
HFiles
- Since HDFS does not allow updates, HBase needs to use some tricks
- Data is stored in HFiles (still stored in HDFS)
- Newly added data is stored in a Write-Ahead Log (WAL)
- Delete markers are used to indicate records to delete
- When data is accessed, the HFiles and the WAL are merged
- HBase periodically applies compaction to the HFiles
  - Minor compaction: merges together multiple HFiles (fast)
  - Major compaction: more extensive merging and deletion
- Management of the data relies on a "distributed coordination service"
  - Provided by ZooKeeper (similar to Google's Chubby)
  - Maps names to locations
HBase column families and columns
- Columns are grouped into families to organize data
  - Referenced as family:column, e.g. user:first_name
- Family definitions are static: rarely added to or changed
  - Expect a small number of families
- Columns are not static; they can be updated dynamically
  - Can have millions of columns per family
HBase application example
- Use HBase to store and retrieve a large number of articles
- Example schema: two column families
  - Info, containing columns 'title', 'author', 'date'
  - Content, containing the column 'post'
- Can then access the data:
  - Get: retrieve a single row (or columns from a row, or other versions)
  - Scan: retrieve a range of rows
  - Edit and delete data
HBase conclusions
- HBase is best suited to storing/retrieving large amounts of data
  - E.g. managing a very large blogging network
  - Facebook uses HBase to store users' messages (since 2010):
    www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
- Need to think about how to design the data storage
  - E.g. one row per blog, or one row per article
  - A "tall-narrow" design (1 row per article) works well
    - It fits better with the way HBase structures HFiles
    - It scales better when blogs have many articles
- Can use Hadoop for heavy-duty processing
  - HBase can be the input (and output) for a Hadoop job
Hive and Pig
- Hive: a data warehousing application on top of Hadoop
  - Its query language is HQL, a variant of SQL
  - Tables are stored on HDFS with different encodings
  - Developed by Facebook, now open source
- Pig: a large-scale data processing system
  - Scripts are written in Pig Latin, a dataflow language
  - The programmer focuses on data transformations
  - Developed by Yahoo!, now open source
- Common idea:
  - Provide a higher-level language to facilitate large-data processing
  - The higher-level language "compiles down" to Hadoop jobs
Pig
- Pig is a "platform for analyzing large datasets"
  - High-level (declarative) language: Pig Latin
  - Compiled to MapReduce for execution on a Hadoop cluster
  - Developed at Yahoo!; used by Twitter, Netflix...
- Aim: make MapReduce coding easier for non-programmers
  - Data analysts, data scientists, statisticians...
- Various suggested use cases:
  - Extract, Transform, Load (ETL): analyze large log data (clean, join)
  - Analyze "raw" unstructured data from multiple sources, e.g. user logs
Pig concepts
- Field: a piece of data
- Tuple: an ordered set of fields
  - Example: (10.4, 5, word, 4, field1)
- Bag: a collection of tuples
  - Example: { (10.4, 5, word, 4, field1), (this, 1, blah) }
  - Similar to tables in a relational DB
  - But it is not required that all tuples in a bag have the same arity
  - Can be nested: a tuple can contain a bag, e.g. (a, {(1), (2), (3), (4)})
- Standard set of datatypes available:
  - int, long, float, double, chararray (string), bytearray (blob)
Pig Latin
- The Pig Latin language sits somewhere between SQL and imperative programming
- LOAD data AS schema;
  - t = LOAD 'mylog' AS (userId:chararray, timestamp:long, query:chararray);
- DUMP displays results to the screen; STORE saves them to disk
  - DUMP t: (u1, 12:34, "database"), (u3, 12:36, "work"), (u1, 12:37, "abc")...
- GROUP tuples BY field;
  - Creates new tuples, one for each different value of field
  - E.g. g = GROUP t BY userId;
  - Generates a bag of (timestamp, query) tuples for each user
  - DUMP g: (u1, {(12:34, "database"), (12:37, "abc")}), (u3, {(12:36, "work")})
Pig: Foreach
- FOREACH bag GENERATE data: iterate over all elements in a bag
  - r = FOREACH t GENERATE timestamp;
  - DUMP r: (12:34), (12:36), (12:37)
- GENERATE can also apply various built-in functions to the data
  - s = FOREACH g GENERATE group, COUNT(t);
  - DUMP s: (u1, 2), (u3, 1)
- Several built-in functions manipulate data
  - TOKENIZE: break strings into words
  - FLATTEN: remove structure, e.g. convert a bag of bags into a bag
  - Can also use User Defined Functions (UDFs) in Java, Python...
- The "word count" problem can be done easily with these tools
  - All the commands correspond to simple Map, Reduce or MapReduce tasks
Recall: t: (u1, 12:34, "database"), (u3, 12:36, "work"), (u1, 12:37, "abc")
        g: (u1, {(12:34, "database"), (12:37, "abc")}), (u3, {(12:36, "work")})
Joins in Pig
- Pig supports joins between two bags
- JOIN bag1 BY field1, bag2 BY field2;
  - Performs an equijoin, with the condition field1 = field2
- Can perform the join on a tuple of fields
  - E.g. join on (date, time): only join if both match
- Implemented via the join algorithms seen earlier
Pig: Example
Visits:
  User  Url         Time
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

Url Info:
  Url         Category  PageRank
  cnn.com     News      0.9
  bbc.com     News      0.8
  flickr.com  Photos    0.7
  espn.com    Sports    0.9

Task: find the top 10 most visited pages in each category
Pig slides adapted from Olston et al. (SIGMOD 2008)
Pig Script for example query
1. visits = load '/data/visits' as (user, url, time);
2. gVisits = group visits by url;
3. visitCounts = foreach gVisits generate url, count(visits);
4. urlInfo = load '/data/urlInfo' as (url, category, pRank);
5. visitCounts = join visitCounts by url, urlInfo by url;
6. gCategories = group visitCounts by category;
7. topUrls = foreach gCategories generate top(visitCounts,10);
8. store topUrls into '/data/topUrls';
Pig Query Plan for Hadoop Execution
[Diagram: Load Visits -> Group by url -> Foreach url generate count (Map1/Reduce1); Load Url Info -> Join on url (Map2/Reduce2); Group by category -> Foreach category generate top10(urls) (Map3/Reduce3)]
Pig slides adapted from Olston et al. (SIGMOD 2008)
Hive
- Hive is a data warehouse built on top of Hadoop
  - Originated at Facebook in 2007; now part of Apache Hadoop
  - Provides an SQL-like language called HiveQL
- Hive gives a simple interface for queries and analysis
  - Access to files stored via HDFS, HBase
  - Does not give fast "real-time" responses (inherited from Hadoop)
  - The minimum response time may be minutes: designed to scale
- Example use case at Netflix: log data analysis
  - 0.6 TB of log data per day, analyzed by 50+ nodes
  - Test quality: how well is the network performing?
  - Statistics: how many streams/day, errors/session, etc.
HiveQL to Hive
- Hive translates a HiveQL query into a set of MapReduce jobs and executes them
- To support persistent schemas, it keeps metadata in an RDBMS
  - Known as the metastore (implemented by the Apache Derby DBMS)
Hive concepts
- Hive presents a view of data similar to a relational DB
  - A database is a set of tables
  - Tables are formed from rows with the same schema (attributes)
  - A row of a table is a single record
  - A column in a row is an attribute of the record
HiveQL examples: Create and Load
- CREATE TABLE posts (user STRING, post STRING, time BIGINT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
- LOAD DATA LOCAL INPATH 'data/user-posts.txt'
  OVERWRITE INTO TABLE posts;
- SELECT COUNT(1) FROM posts;
  Total MapReduce jobs = 1
  Launching Job 1 out of 1 [...]
  Total MapReduce CPU Time Spent: 2 seconds 640 msec
  4          (the query result)
  Time taken: 14.204 seconds
HiveQL examples: querying
- SELECT * FROM posts WHERE user="u1";
  - Similar to SQL syntax
- SELECT * FROM posts WHERE time<=1343182133839 LIMIT 2;
  - Only returns the first 2 matching results
- GROUP BY and HAVING allow aggregation, as in SQL
  - SELECT category, count(1) AS cnt FROM items GROUP BY category HAVING cnt > 10;
- Can also specify how results are sorted
  - ORDER BY (totally ordered) and SORT BY (sorted within each reducer)
- Can specify how tuples are allocated to reducers
  - Via the DISTRIBUTE BY keyword
Hive: Bucketing and Partitioning
- Can use one column to partition data
  - Each partition is stored in a separate file
  - E.g. partition by country
  - No difference in query syntax, but querying on the partitioning attribute is fast
- Can cluster data into buckets: hash rows into buckets
  - Allows parallelization in MapReduce: one mapper per bucket
  - Can use buckets to evaluate a query on a sample (one bucket)
Summary
- A large, complex ecosystem for data management has grown around Hadoop
  - We have barely scratched the surface of this world
- Began with Hadoop and HDFS for MapReduce
  - HBase for storage/retrieval of large data
  - Hive and Pig for more high-level programming abstractions
Reading: www.coreservlets.com/hadoop-tutorial/; Data Intensive Text Processing with MapReduce, Chapters 1-3; Hadoop: The Definitive Guide (Chapters 1-3; 16, 17, 20)