CS346: Advanced Databases Alexandra I. Cristea [email protected] MapReduce and Hadoop.
MapReduce and Hadoop
Outline
Reading: find resources online, or pick from:
- Data Intensive Text Processing with MapReduce, Jimmy Lin and Chris Dyer, Morgan & Claypool, Chapters 1-3
- www.coreservlets.com/hadoop-tutorial/, Marty Hall
- Hadoop: The Definitive Guide, Tom White, O'Reilly Media, Chapters 1-3; 16 (part of); 17 (part of); 20 (part of)
Outline: Data is big and getting bigger, and new tools are emerging:
- Hadoop: a file system and a processing paradigm (MapReduce)
- HBase: a way of storing and retrieving large amounts of data
- Pig and Hive: high-level abstractions to make Hadoop easier
CS346 Advanced Databases2
Why: Data is Massive
- Data is growing faster than our ability to store or index it
- There are 3 billion telephone calls in the USA each day, 30 billion emails daily, 1 billion SMS and IMs
- Scientific data: NASA's observation satellites each generate billions of readings per day
- IP network traffic: up to 1 billion packets per hour per router, and each ISP has many (hundreds of) routers!
- Whole-genome sequences for many species are now available, each megabytes to gigabytes in size
Other examples: massive data
- High-energy physics community: PB-scale databases by 2005
  - Now the Large Hadron Collider near Geneva, the world's largest particle accelerator, recreating Big Bang conditions: ~15 PB per year
- Google: in 2008, processing 20 PB a day!
- eBay: in 2009, 8.5 PB of user data, 170 trillion records, 150 billion new records per day
- Facebook: 2.5 PB of user data, 15 TB growth per day
- Petabyte datasets are the norm!
However: the bottleneck is disk access
- Moore's law: disk capacity went from tens of MB in the 1980s to a few TB now (several orders of magnitude of growth)
- Latency: only a 2x improvement in the last quarter century; bandwidth: 50x
- 1990s: 1,370 MB of storage, transferred at 4.4 MB/s, read in about 5 min
- Now: 1 TB of storage, transferred at 100 MB/s, read in about 2.5 h
- Writing is even slower!
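The read times above are just capacity divided by transfer rate; a quick plain-Python check, taking the 1990s drive as 1,370 MB (the figure usually quoted, e.g. in Hadoop: The Definitive Guide):

```python
def read_time_seconds(capacity_mb, rate_mb_per_s):
    """Time (s) to stream an entire disk sequentially."""
    return capacity_mb / rate_mb_per_s

# ~1990 drive: 1,370 MB at 4.4 MB/s -> roughly 5 minutes
t_1990 = read_time_seconds(1370, 4.4)
assert 300 < t_1990 < 320        # ~311 s

# Today: 1 TB (10^6 MB) at 100 MB/s -> a few hours
t_now = read_time_seconds(1_000_000, 100)
assert t_now == 10_000           # 10,000 s, i.e. between 2.5 and 3 hours
```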
Massive Data Management
Must perform queries on this massive data:
- Scientific research (monitor environment, species)
- System management (spot faults, drops, failures)
- Customer research (association rules, new offers)
- Revenue protection (phone fraud, service abuse)
- Natural language processing (for unstructured (user) data)
Else, why even collect this data?
Solution: Parallel Processing
- Use many machines (hardware)
- Parallel access
- Issue: hardware failure
  - RAID, for example, uses redundant copies
  - HDFS uses a different approach
- Issue: data combination
  - MapReduce abstracts the read/write problem, transforming it into a computation over sets of keys and values
- Hadoop: a reliable, scalable platform for storage and analysis
Hadoop
- Hadoop is an open-source architecture for handling big data
  - First developed by Doug Cutting (and named after a toy elephant)
  - Google's original implementations are not publicly available
  - Currently managed by the Apache Software Foundation
- Hadoop now: more than just MapReduce (so more than a batch query processor)
(some) Hadoop Tools and Products
- Many tools/products now sit on top of Hadoop
  - HBase: a (non-relational) distributed database; a key-value store over HDFS, with online read/write access as well as batch R/W
  - YARN: allows any distributed program to run on Hadoop
  - Hive: data warehouse infrastructure, developed at Facebook
  - Pig: a high-level language that compiles to Hadoop, from Yahoo!
  - Mahout: machine learning algorithms in Hadoop
Hadoop in business use
- Hadoop is widely used in technology-based businesses:
  - Became an Apache top-level project in 2008
  - Used by Facebook, LinkedIn, Twitter, IBM, Amazon, Adobe, eBay, Last.fm, the New York Times
  - Offered as part of vendor platforms: Amazon, Cloudera, Microsoft Azure, MapR, Hortonworks, EMC, IBM, Oracle
  - Sort benchmark: 1 TB in 209 s (2008); 4.27 TB per minute (2014); an ongoing competition
Hadoop Cluster
- A Hadoop cluster implements the MapReduce framework
  - Many commodity (off-the-shelf) machines with a fast network
  - Placed in physical proximity (allows fast communication)
  - Typically rack-mounted hardware
- Expect and tolerate failures
  - Disks have an MTBF of ~10 years
  - When you have 10,000 disks... expect ~3 to fail per day
  - Jobs can last minutes to hours, so the system must cope with failure!
  - Data is replicated and tasks that fail are retried; the developer does not have to worry about this
Building Blocks – Data Locality
Source: Barroso and Urs Hölzle (2009)
Hadoop philosophy
- "Scale-out, not scale-up"
  - Don't upgrade; add more hardware to the system
  - The end of Moore's law means CPUs are not getting faster
  - Individual disk size is not growing fast
  - So add more machines/disks (scale-out)
  - Allow hardware addition/removal mid-job
Hadoop philosophy (continued)
- "Move code to data, not vice versa"
  - Data is big and distributed, while code is fairly small
  - So do the processing locally, where the data resides
  - May have to move results across the network, though
Hadoop versus the RDBMS
- Hadoop and RDBMSs are not in direct competition
  - They solve different problems, on different kinds of data
- Hadoop: data processing on huge, distributed data (TB-PB)
  - Batch approach: data is not modified frequently, and results take time
  - No guarantees of resilience, no real-time response, no locking
  - Data is not in relations, but key-values
- RDBMS: resilient, reliable processing of large data (MB-GB)
  - Provides a high-level language (SQL) to deal with structured data
  - Hits a ceiling when scaling up beyond tens of TB
- But the gaps between the two are narrowing
  - Lots of work to make Hadoop look like a DB (Hive, HBase...)
  - Hadoop and RDBMSs can coexist: DB front-end, Hadoop log analysis
Hadoop versus the RDBMS
The differences are blurring
Running Hadoop
- Different parts of the Hadoop ecosystem have incompatibilities
  - They require certain versions to play well together
- This led to Hadoop distributions (like Linux distributions)
  - Curated releases, e.g. cloudera.com/hadoop
  - Available as a Linux package, or as a virtual machine image
- How to run Hadoop?
  - Run it on your own (multi-core) machine (for development/testing)
  - Use a local cluster that you have access to
  - Go to the cloud ($$$): Amazon S3, Cloudera, Microsoft Azure
See Jonny Foss's instructions
HDFS: the Hadoop Distributed File System
- The Hadoop Distributed File System is an important part of Hadoop
  - Good for storing truly massive data
- Some HDFS numbers:
  - Suitable for files in the TB-PB range
  - Can store millions to billions of files
  - Suits a minimum size of 100 MB+ per file
- Assumptions about the data
  - The data will be written once, read many times
  - No dynamic updates: append only
  - Optimized for streaming (sequential) reads, not random access
- Not good for low-latency reads, many small files, or multiple writers
Files and Blocks
- Files are broken into blocks, just like in traditional file systems
  - But each block is much larger: 64 MB or 128 MB (instead of 512 bytes)
  - Ensures that the time to seek << the time to transfer
  - Compare a 10 ms seek with a 100 MB/s read rate:
    if seek = 1% of transfer time, the block size works out to ~100 MB (default 128 MB)
  - Files smaller than the block size don't occupy a whole block!
  - Metadata is stored separately
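The block-size rule of thumb above follows from one line of arithmetic: if a 10 ms seek should cost at most 1% of the transfer time, the transfer must take 1 s, which at 100 MB/s means roughly 100 MB per block. A sketch:

```python
def min_block_size_mb(seek_ms, rate_mb_per_s, seek_fraction=0.01):
    """Smallest block size (MB) for which the seek time is at most
    the given fraction of the transfer time."""
    transfer_s = (seek_ms / 1000) / seek_fraction   # required transfer time
    return transfer_s * rate_mb_per_s

# 10 ms seek, 100 MB/s transfer, seek <= 1% of transfer -> ~100 MB blocks
assert min_block_size_mb(10, 100) == 100.0
```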
Files and Blocks
- Blocks are replicated across different datanodes
  - The default replication level is 3, all managed by the namenode
HDFS Daemons
- Namenode: the master
  - Manages the file system's namespace
  - Maps from file names to where the data is stored, like other file systems
  - Can be a single point of failure (SPOF) in the system
- Datanodes: the workers
  - Store and retrieve data blocks
  - Each datanode reports to the namenode
- Secondary namenode: does housekeeping (checkpointing, logging)
  - Not directly a backup for the namenode!
  - Not a namenode
- The client running the processes isn't aware of the configuration!
Last time:
- Big data
- Hadoop basics, philosophy, clusters, racks
- Started on HDFS
- HDFS main elements (daemons)
- Next:
  - HDFS, continued
  - MapReduce
Replication and Reliability
- The namenode is "rack aware": it knows how machines are arranged
  - The second replica is on the same rack as the first, but on a different machine
  - The third replica is on a different rack
  - This balances performance (failover time) vs. reliability (independence)
- The namenode does not directly read/write data
  - The client gets data locations from the namenode
  - The client interacts directly with datanodes to read/write data
- The namenode keeps all block metadata in (fast) memory
  - This puts a constraint on the number of files stored: millions of large files
  - Future iterations of Hadoop expect to remove these constraints
HDFS features
- Block caching
  - Datanodes read blocks from disk; frequently accessed files may be explicitly cached in a datanode's memory
- HDFS federation
  - Since the 2.x release series, allows a cluster to scale by adding namenodes, each managing a portion of the filesystem namespace
  - Managed with ViewFileSystem and viewfs:// URIs
Using HDFS file system
- HDFS gives similar control to a traditional file system
  - Paths in the form of directories below a root
  - Can ls (list directory), cat (read file), cd, rm, cp, etc.
  - put: copy a file from the local file system to HDFS
  - get: copy a file from HDFS to the local file system
  - File permissions similar to Unix/Linux
  - hadoop fs -help gives detailed help on every command
- Some HDFS-specific commands, e.g. change the file replication level
  - dfs.replication
  - Can rebalance data, ensuring datanodes are similarly loaded
  - Java API to read/write HDFS files
- Original use for HDFS: store data for MapReduce
MapReduce and Big Data
- MapReduce is a popular paradigm for analyzing massive data
  - Used when the data is much too big for one machine
  - Allows the parallelization of computations over many machines
- Introduced by Jeffrey Dean and Sanjay Ghemawat in 2004
  - The MapReduce model was implemented by the MapReduce system at Google
  - Hadoop MapReduce implements the same ideas
- Allows a large computation to be distributed over many machines
  - Brings the computation to the data, not vice versa
  - The system manages data movement, machine failures, errors
  - The user just has to specify what to do with each piece of data
Motivating MapReduce
- Many computations over big data follow a common outline:
  - The data is formed of many (many) simple records
  - Iterate over each record and extract a value
  - Group together intermediate results with the same properties
  - Aggregate these groups to get the final results
  - Possibly, repeat this process with different functions
- The MapReduce framework abstracts this outline
  - Iterate over records = Map
  - Aggregate the groups = Reduce
What is MapReduce?
- MapReduce draws inspiration from functional programming
  - Map: apply the "map" function to every piece of data
  - Reduce: form the mapped data into groups and apply a function to each group
- Designed for efficiency
  - Process the data in whatever order it is stored, avoiding random access (random access can be very slow over large data)
  - Split the computation over many machines: Map the data in parallel, and Reduce each group in parallel
  - Resilient to failure: if a Map or Reduce task fails, just run it again (requires that tasks are idempotent: they can be repeated on the same input)
MapReduce approach and terminology
- The entire dataset (or a large portion of it) is processed for each query
- Batch query processing: one query over all the data, a brute-force approach
- A MapReduce job = a unit of work from the client = input data + the MapReduce program + configuration information
- Hadoop divides jobs into tasks: map and reduce tasks (scheduled by YARN, running on nodes in the cluster)
- Hadoop divides the input into (input) splits, and runs one map task per split, applying the user-defined map function to each record in the split
Programming in MapReduce
- Data is assumed to be in the form of (key, value) pairs
  - E.g. (key = "CS346", value = "Advanced Databases")
  - E.g. (key = "111-222-3333", value = "(male, 29 years, married...)")
- Abstract view of programming MapReduce. Specify:
  - A Map function: take a (k, v) pair and output some number of (k', v') pairs
  - A Reduce function: take all (k', v') pairs with the same key k', and output a new set of (k'', v'') pairs
  - The "type" of the output (key, value) pairs can differ from the input's
- Many other options/parameters in practice:
  - Can specify a "partition" function for how to map k' to reducers
  - Can specify a "combine" function that aggregates the output of Map
  - Can share some information with all nodes via the distributed cache
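The abstract contract above can be simulated outside Hadoop. The following plain-Python sketch (illustrative, not Hadoop API code) pushes any map and reduce function through a map -> shuffle/sort -> reduce pipeline, with word count as the example job:

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Simulate map -> shuffle/sort -> reduce on a list of (k, v) records."""
    # Map phase: each record may emit any number of (k', v') pairs
    groups = defaultdict(list)
    for k, v in records:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)          # shuffle: group values by key
    # Reduce phase: one call per distinct intermediate key
    out = []
    for k2 in sorted(groups):              # sort the keys, as Hadoop does
        out.extend(reduce_fn(k2, groups[k2]))
    return out

# Word count expressed in this model
def wc_map(docid, text):
    for w in text.split():
        yield (w, 1)

def wc_reduce(word, counts):
    yield (word, sum(counts))

result = run_mapreduce([("d1", "a b a"), ("d2", "b c")], wc_map, wc_reduce)
assert result == [("a", 2), ("b", 2), ("c", 1)]
```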
MapReduce schematic
[Diagram: mappers consume input pairs (k1,v1)...(k6,v6) and emit intermediate pairs, e.g. (a,1)(b,2), (c,3)(c,6), (a,5)(c,2), (b,7)(c,8); "Shuffle and Sort" aggregates values by key: a -> (1,5), b -> (2,7), c -> (2,3,6,8); reducers then emit the final pairs (r1,s1), (r2,s2), (r3,s3).]
“Hello World”: Word Count
- The generic MapReduce computation that's always used...
  - Count the occurrences of each word in a (massive) document collection
Map(String docid, String text):
    for each word w in text:
        Emit(w, 1);
lintool.github.io/MapReduce-course-2013s/syllabus.html
private static class MyMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final static Text WORD = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString(); // value is already a Text; no cast needed
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      WORD.set(itr.nextToken());
      context.write(WORD, ONE);
    }
  }
}
private static class MyReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final static IntWritable SUM = new IntWritable();

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    Iterator<IntWritable> iter = values.iterator();
    int sum = 0;
    while (iter.hasNext()) {
      sum += iter.next().get();
    }
    SUM.set(sum);
    context.write(key, SUM);
  }
}
Reduce(String term, Iterator<Int> values):
    int sum = 0;
    for each v in values:
        sum += v;
    Emit(term, sum);
MapReduce and Graphs
- MapReduce is a powerful way of handling big graph data
  - Graph: a network of nodes linked by edges
  - Many big graphs: the web, (social network) friendships, citations
    - Often have millions of nodes and billions of edges
    - Facebook: > 1 billion nodes, 100 billion edges
- Many complex calculations are needed over large graphs
  - Rank the importance of nodes (for web search)
  - Predict which links will be added soon / suggest links (social networks)
  - Label nodes based on classification over graphs (recommendation)
- MapReduce allows computation over big graphs
  - Represent each edge as a value in a key-value pair
MapReduce example: compute degree
- The degree of a node is the number of edges incident on it
  - Here, assume undirected edges
- To compute degree in MapReduce:
  - Map: for edge (E, (v, w)), output (v, 1), (w, 1)
  - Reduce: for (v, (c1, c2, ..., cn)), output (v, sum_{i=1..n} c_i)
- Advanced: could use "combine" to compute partial sums
  - E.g. Combine((A, 1), (A, 1), (B, 1)) = ((A, 2), (B, 1))
Worked example: a graph with nodes A, B, C, D and edges E1=(A,B), E2=(A,C), E3=(A,D), E4=(B,C)
Map:     (E1,(A,B)) (E2,(A,C)) (E3,(A,D)) (E4,(B,C))
         -> (A,1)(B,1), (A,1)(C,1), (A,1)(D,1), (B,1)(C,1)
Shuffle: (A,(1,1,1)), (B,(1,1)), (C,(1,1)), (D,(1))
Reduce:  (A,3), (B,2), (C,2), (D,1)
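The same computation can be checked with a small plain-Python simulation (a sketch, not Hadoop code; the edge list below is the four-node example from the slide):

```python
from collections import defaultdict

def degree_map(edge):
    # Map: edge (v, w) emits (v, 1) and (w, 1)
    v, w = edge
    yield (v, 1)
    yield (w, 1)

def degree_reduce(node, counts):
    # Reduce: the degree of a node is the sum of its emitted 1s
    return (node, sum(counts))

edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]

# Shuffle: group the emitted values by key, as Hadoop would
groups = defaultdict(list)
for edge in edges:
    for node, one in degree_map(edge):
        groups[node].append(one)

degrees = dict(degree_reduce(n, cs) for n, cs in groups.items())
assert degrees == {"A": 3, "B": 2, "C": 2, "D": 1}
```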
MapReduce Criticism (circa 2008)
Two prominent DB leaders (DeWitt and Stonebraker) complained:
- MapReduce is a step backward in database access:
  - Schemas are good
  - Separation of the schema from the application is good
  - High-level access languages are good
- MapReduce only allows poor implementations
  - Brute force and only brute force (no indexes, for example)
- MapReduce is missing features
  - Bulk loader, indexing, updates, transactions...
- MapReduce is incompatible with DBMS tools
Much subsequent debate and development to remedy these complaints.
Source: blog post by DeWitt and Stonebraker (http://craig-henderson.blogspot.co.uk/2009/11/dewitt-and-stonebrakers-mapreduce-major.html)
Relational Databases vs. MapReduce
- Relational databases:
  - Multipurpose: analysis and transactions; batch and interactive
  - Data integrity via ACID transactions [see later]
  - Lots of tools in the software ecosystem (for ingesting, reporting, etc.)
  - Support SQL (and SQL integration, e.g., JDBC)
  - Automatic SQL query optimization
- MapReduce (Hadoop):
  - Designed for large clusters; fault tolerant
  - Data is accessed in "native format"
  - Supports many developing query languages (but not full SQL)
  - Programmers retain control over performance
Source: O'Reilly blog post by Joseph Hellerstein (11/19/2008)
Database operations in MapReduce
For SQL-like processing in MapReduce, we need relational operations.
- PROJECT in MapReduce is easy
  - Map over the tuples, emitting new tuples with the appropriate attributes
  - No reducers, unless for regrouping or resorting tuples
  - Or pipeline: perform it in the reducer, after some other processing
- SELECT in MapReduce is easy
  - Map over the tuples, emitting only tuples that meet the criteria
  - No reducers, unless for regrouping or resorting tuples
  - Or pipeline: perform it in the reducer, after some other processing
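Both operations are map-only, which the following plain-Python sketch makes concrete (the `visits` records and field names are purely illustrative):

```python
# Tuples represented as dicts; these records and field names are illustrative.
visits = [
    {"user": "u1", "url": "cnn.com", "time": 95},
    {"user": "u2", "url": "bbc.com", "time": 30},
    {"user": "u1", "url": "bbc.com", "time": 10},
]

# PROJECT: map each tuple to one containing only the wanted attributes
def project(tuples, attrs):
    for t in tuples:
        yield {a: t[a] for a in attrs}

# SELECT: map each tuple to itself if it passes the predicate, else emit nothing
def select(tuples, predicate):
    for t in tuples:
        if predicate(t):
            yield t

urls = list(project(visits, ["url"]))
long_visits = list(select(visits, lambda t: t["time"] > 20))
assert urls == [{"url": "cnn.com"}, {"url": "bbc.com"}, {"url": "bbc.com"}]
assert long_visits == [visits[0], visits[1]]
```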
Last time:
- HDFS features
- Using HDFS
- MapReduce philosophy, terminology, background, 'Hello World' (counting words in large files), criticism, comparison to DBMSs, emulating DB operations (project, select)
Today:
- Continuing with the emulation of DB operations (GROUP BY, JOINs)
- HBase, Pig
Group by… Aggregation
- Example: what is the average time spent per URL?
  - Given data for each visit to a URL, giving the time spent
- In SQL: SELECT url, AVG(time) FROM visits GROUP BY url;
- In MapReduce:
  - Map over the tuples, emitting the time, keyed by url
  - MapReduce automatically groups by keys
  - Compute the average in the reducer
  - Optimize with combiners
    - Not possible for combiners to emit averages directly (think about why not!)
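The combiner pitfall hinted at above is that an average of averages is wrong when the groups have different sizes; the standard fix is for the combiner to emit partial (sum, count) pairs and to divide only in the reducer. A plain-Python sketch with illustrative data:

```python
from collections import defaultdict

# (url, time) pairs as emitted by the mappers; illustrative data
emitted = [("cnn.com", 10), ("cnn.com", 20), ("bbc.com", 30),
           ("cnn.com", 30), ("bbc.com", 10)]

# Combiner: per-mapper partial aggregates as (sum, count), NOT averages
def combine(pairs):
    partial = defaultdict(lambda: [0, 0])
    for url, t in pairs:
        partial[url][0] += t
        partial[url][1] += 1
    return [(url, tuple(sc)) for url, sc in partial.items()]

# Reducer: merge the partials, then divide exactly once at the end
def reduce_avg(url, partials):
    total = sum(s for s, c in partials)
    count = sum(c for s, c in partials)
    return (url, total / count)

# Two mappers, each running the combiner over its own share of the data
partials = combine(emitted[:3]) + combine(emitted[3:])
groups = defaultdict(list)
for url, sc in partials:
    groups[url].append(sc)
averages = dict(reduce_avg(u, ps) for u, ps in groups.items())
assert averages == {"cnn.com": 20.0, "bbc.com": 20.0}
```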
Join Algorithms in MapReduce
- Joins are more difficult to do well
  - Could do a join as a Cartesian product followed by a select
  - But this will kill your system for even moderate data sizes
- Will exploit some "extensions" of MapReduce
  - These allow extra ways to access data (e.g. the distributed cache)
- Several approaches to join in MapReduce
  - Reduce-side join
  - Map-side join
  - In-memory join
Reduce-side Join
- Basic idea: group by the join key
  - Map over both sets of tuples
  - Emit each tuple as the value, with its join key as the intermediate key
  - Hadoop brings together tuples sharing the same key
  - Perform the actual join in the reducer
  - Similar to a "sort-merge join" (but in parallel)
- Different variants, depending on how the join goes:
  - 1-to-1 joins
  - 1-to-many and many-to-many joins
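A plain-Python sketch of the reduce-side join (illustrative tuples; each emitted value carries a tag recording which relation it came from, since both relations are mixed in one shuffle):

```python
from collections import defaultdict

# Illustrative relations, each a list of (join_key, payload) tuples
R = [("k1", "r1"), ("k2", "r2")]
S = [("k1", "s1"), ("k1", "s9"), ("k2", "s2")]

# Map: emit (join_key, (relation_tag, tuple)) for every tuple of both inputs
def join_map(relation, tag):
    for key, val in relation:
        yield (key, (tag, val))

# Shuffle: group by the join key
groups = defaultdict(list)
for key, tagged in list(join_map(R, "R")) + list(join_map(S, "S")):
    groups[key].append(tagged)

# Reduce: within each key, pair every R tuple with every S tuple
joined = []
for key in sorted(groups):
    r_side = [v for tag, v in groups[key] if tag == "R"]
    s_side = [v for tag, v in groups[key] if tag == "S"]
    for r in r_side:
        for s in s_side:
            joined.append((key, r, s))
assert joined == [("k1", "r1", "s1"), ("k1", "r1", "s9"), ("k2", "r2", "s2")]
```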
Reduce-side Join: 1-to-1
[Diagram: Map emits tuples R1, R4, S2, S3 as values, keyed by their join keys; after the shuffle, the reducer for each key sees the matching R tuple and S tuple together.]
Note: we need extra work if we want the attributes ordered!
Reduce-side Join: 1-to-many
[Diagram: Map emits R1, S2, S3, S9 keyed by the join key; the reducer for that key receives R1 together with S2, S3, S9, ...]
Need extra work to get the tuple from R out first.
Reduce-side Join: many to many
- Follow a similar outline in the many-to-many case
  - Need enough memory to store all the tuples from one relation
- Not particularly efficient
  - We end up sending all the data over the network in the shuffle step
Map-side Join: Basic Idea
Assume the two datasets are sorted by the join key (R1, R2, R3, R4 and S1, S2, S3, S4).
A sequential scan through both datasets joins them (equivalent to a merge join).
Doesn't seem to fit the MapReduce model?
Map-side Join: Parallel Scans
- If the datasets are sorted by the join key, then just scan over both
- How can we accomplish this in parallel?
  - Partition and sort both datasets with the same ordering
- In MapReduce:
  - Map over one dataset, reading from the corresponding partition of the other (requires reading from (distributed) data in Map)
  - No reducers necessary (unless to repartition or resort)
- Requires the data to be organized just how we want it
  - If not, fall back to the reduce-side join
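The per-partition scan described above is an ordinary two-pointer merge join. A plain-Python sketch over one sorted partition pair (illustrative data; join keys assumed unique within each input):

```python
# One partition of each dataset, both sorted by the join key
R = [(1, "r1"), (2, "r2"), (4, "r4")]
S = [(2, "s2"), (3, "s3"), (4, "s4")]

def merge_join(r, s):
    """Single sequential scan over two sorted inputs (keys unique per input)."""
    out, i, j = [], 0, 0
    while i < len(r) and j < len(s):
        if r[i][0] == s[j][0]:
            out.append((r[i][0], r[i][1], s[j][1]))  # matching keys: join
            i += 1
            j += 1
        elif r[i][0] < s[j][0]:
            i += 1          # advance the side with the smaller key
        else:
            j += 1
    return out

assert merge_join(R, S) == [(2, "r2", "s2"), (4, "r4", "s4")]
```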
In-Memory (Memory-backed) Join
- Basic idea: load one dataset into memory, stream over the other
  - Works if R << S, and R fits into memory
  - Equivalent to a hash join
- MapReduce implementation
  - Distribute R to all nodes: use the distributed cache
  - Map over S; each mapper loads R into memory, hashed by the join key
  - For every tuple in S, look up its join key in R
  - No reducers, unless for regrouping or resorting tuples
- Striped variant (like a single-loop join): if R is too big for memory
  - Divide R into R1, R2, R3, ... such that each Rn fits into memory
  - Perform the in-memory join for each n: Rn ⋈ S
  - Take the union of all the join results
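The build/probe structure of this hash join can be sketched in plain Python (illustrative data; R stands for the small relation each mapper would load from the distributed cache):

```python
# Small relation R (fits in memory) and large relation S (streamed)
R = [(1, "r1"), (2, "r2")]
S = [(2, "s2"), (3, "s3"), (1, "s1"), (2, "s9")]

# Build phase: hash R by join key (done once per mapper, after fetching
# R from the distributed cache)
hashed_R = {}
for key, val in R:
    hashed_R.setdefault(key, []).append(val)

# Probe phase: stream over S, probing the hash table for each tuple
joined = []
for key, s_val in S:
    for r_val in hashed_R.get(key, []):   # unmatched S tuples emit nothing
        joined.append((key, r_val, s_val))

assert joined == [(2, "r2", "s2"), (1, "r1", "s1"), (2, "r2", "s9")]
```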
Summary: Relational Processing in Hadoop
- MapReduce algorithms for processing relational data:
  - Group by, sorting and partitioning are handled automatically by shuffle/sort in MapReduce
  - Selection, projection, and other computations (e.g., aggregation) are performed either in the mapper or in the reducer
  - Multiple strategies for relational joins:
    - Prefer in-memory over map-side over reduce-side
    - Reduce-side is the most general; in-memory is the most restricted
- Complex operations will need multiple MapReduce jobs
  - Example: the top ten URLs in terms of average time spent
  - Opportunities for automatic optimization
HBase
- HBase (Hadoop Database) is a column-oriented data store
  - Started in 2006 by Chad Walters and Jim Kellerman at Powerset (natural-language search for the web; now owned by Microsoft)
  - An example of a "NoSQL" database: not the full relational model
  - Open source, written in Java
  - Does allow update operations (unlike HDFS...)
- HBase is designed to handle very large tables
  - Billions of rows, millions of columns
  - Inspired by "BigTable", internal to Google
Suitability of HBase
- HBase suits applications that:
  - Don't need the full power of a relational database
  - Have a large enough cluster (5+ nodes)
  - Have very large data (obviously): 100 million to billions of rows
    - Typical use case: crawled webpages and their attributes
  - Don't need real-time responses: it can be slow to respond (latency)
  - Have many clients
  - Have an access pattern that is mostly selects or range scans by key
  - Have sparse data (many attributes, mostly null)
  - Don't need group by/join etc.
HBase data model
- The HBase data model is similar to the relational model:
  - Data is stored in tables, which have rows
  - Each row is identified/referenced by a unique key value
  - Rows have columns, which are grouped into column families
- Data (bytes) is stored in cells
  - Each cell is identified by (row, column-family, column)
  - Limited support for secondary indexes on non-key values
  - Cell contents are versioned: multiple values are stored (default: 3)
  - Optimized to provide access to the most recent version
  - Old versions can be accessed by timestamp
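The (row, column-family, column) addressing with versions can be modelled as a map from cell coordinates to a timestamped value list. A minimal plain-Python sketch (illustrative keys; not the HBase API) that keeps the 3 most recent versions, as the slide notes is the default:

```python
# Cell address: (row_key, "family:column") -> list of (timestamp, value),
# newest first, trimmed to MAX_VERSIONS. Purely illustrative, not HBase code.
MAX_VERSIONS = 3
table = {}

def put(row, family_col, ts, value):
    versions = table.setdefault((row, family_col), [])
    versions.append((ts, value))
    versions.sort(reverse=True)            # newest version first
    del versions[MAX_VERSIONS:]            # keep only the newest 3

def get(row, family_col, ts=None):
    versions = table.get((row, family_col), [])
    if ts is None:
        return versions[0][1] if versions else None   # most recent version
    for t, v in versions:                  # newest first: first t <= ts wins
        if t <= ts:
            return v
    return None

put("row1", "info:title", 1, "draft")
put("row1", "info:title", 2, "final")
assert get("row1", "info:title") == "final"        # latest version
assert get("row1", "info:title", ts=1) == "draft"  # older version by timestamp
```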
HBase data storage
- Rows are kept in sorted order of key; columns can be added on the fly, as long as the family exists
- Example of a (logical) data layout: [table not reproduced in this transcript]
- Data is stored in HFiles, usually under HDFS
  - Empty cells are not explicitly stored, which allows very sparse data
HFiles
- Since HDFS does not allow updates, HBase needs to use some tricks
- Data is stored in HFiles (still stored in HDFS)
- Newly added data is stored in a Write-Ahead Log (WAL)
- Delete markers are used to indicate records to delete
- When data is accessed, the HFiles and the WAL are merged
- HBase periodically applies compaction to the HFiles
  - Minor compaction: merges together multiple HFiles (fast)
  - Major compaction: more extensive merging and deletion
- Management of the data relies on a "distributed coordination service"
  - Provided by ZooKeeper (similar to Google's Chubby)
  - Maps names to locations
HBase column families and columns
- Columns are grouped into families to organize data
  - Referenced as family:column, e.g. user:first_name
- Family definitions are static: rarely added to or changed
  - Expect a small number of families
- Columns are not static; they can be updated dynamically
  - Can have millions of columns per family
HBase application example
- Use HBase to store and retrieve a large number of articles
- Example schema: two column families
  - Info, containing columns 'title', 'author', 'date'
  - Content, containing the column 'post'
- Can then access the data:
  - Get: retrieve a single row (or columns from a row, or other versions)
  - Scan: retrieve a range of rows
  - Edit and delete data
HBase conclusions
- HBase is best suited to storing/retrieving large amounts of data
  - E.g. managing a very large blogging network
  - Facebook uses HBase to store users' messages (since 2010):
    www.facebook.com/notes/facebook-engineering/the-underlying-technology-of-messages/454991608919
- Need to think about how to design the data storage
  - E.g. one row per blog, or one row per article
  - A "tall-narrow" design (1 row per article) works well
    - It fits better with the way HBase structures HFiles
    - It scales better when blogs have many articles
- Can use Hadoop for heavy-duty processing
  - HBase can be the input (and output) for a Hadoop job
Hive and Pig
- Hive: a data warehousing application on top of Hadoop
  - Its query language is HQL, a variant of SQL
  - Tables are stored on HDFS with different encodings
  - Developed by Facebook, now open source
- Pig: a large-scale data processing system
  - Scripts are written in Pig Latin, a dataflow language
  - The programmer focuses on data transformations
  - Developed by Yahoo!, now open source
- Common idea:
  - Provide a higher-level language to facilitate large-data processing
  - The higher-level language "compiles down" to Hadoop jobs
Pig
- Pig is a "platform for analyzing large datasets"
  - High-level (declarative) language: Pig Latin
  - Compiled to MapReduce for execution on a Hadoop cluster
  - Developed at Yahoo!; used by Twitter, Netflix...
- Aim: make MapReduce coding easier for non-programmers
  - Data analysts, data scientists, statisticians...
- Various suggested use cases:
  - Extract, Transform, Load (ETL): analyze large log data (clean, join)
  - Analyze "raw" unstructured data from multiple sources, e.g. user logs
Pig concepts
- Field: a piece of data
- Tuple: an ordered set of fields
  - Example: (10.4, 5, word, 4, field1)
- Bag: a collection of tuples
  - Example: { (10.4, 5, word, 4, field1), (this, 1, blah) }
  - Similar to tables in a relational DB
  - But it is not required that all tuples in a bag have the same arity
  - Can be nested: a tuple can contain a bag, e.g. (a, {(1), (2), (3), (4)})
- Standard set of datatypes available:
  - int, long, float, double, chararray (string), bytearray (blob)
Pig Latin
- The Pig Latin language sits somewhere between SQL and imperative programming
- LOAD data AS schema;
  - t = LOAD 'mylog' AS (userId:chararray, timestamp:long, query:chararray);
- DUMP displays results to the screen; STORE saves them to disk
  - DUMP t: (u1, 12:34, "database"), (u3, 12:36, "work"), (u1, 12:37, "abc")...
- GROUP tuples BY field;
  - Creates new tuples, one for each different value of field
  - E.g. g = GROUP t BY userId;
  - Generates a bag of (timestamp, query) tuples for each user
  - DUMP g: (u1, {(12:34, "database"), (12:37, "abc")}), (u3, {(12:36, "work")})
Pig: Foreach
- FOREACH bag GENERATE data: iterate over all elements in a bag
  - r = FOREACH t GENERATE timestamp;
  - DUMP r: (12:34), (12:36), (12:37)
- GENERATE can also apply various built-in functions to the data
  - s = FOREACH g GENERATE group, COUNT(t);
  - DUMP s: (u1, 2), (u3, 1)
- Several built-in functions manipulate data
  - TOKENIZE: break strings into words
  - FLATTEN: remove structure, e.g. convert a bag of bags into a bag
  - Can also use User Defined Functions (UDFs) in Java, Python...
- The "word count" problem can be done easily with these tools
  - All the commands correspond to simple Map, Reduce or MapReduce tasks
Recall: t: (u1, 12:34, "database"), (u3, 12:36, "work"), (u1, 12:37, "abc")
        g: (u1, {(12:34, "database"), (12:37, "abc")}), (u3, {(12:36, "work")})
Joins in Pig
- Pig supports joins between two bags
- JOIN bag1 BY field1, bag2 BY field2;
  - Performs an equijoin, with the condition field1 = field2
- Can perform the join on a tuple of fields
  - E.g. join on (date, time): only join if both match
- Implemented via the join algorithms seen earlier
Pig: Example
Visits:
  User  Url         Time
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

Url Info:
  Url         Category  PageRank
  cnn.com     News      0.9
  bbc.com     News      0.8
  flickr.com  Photos    0.7
  espn.com    Sports    0.9

Task: find the top 10 most visited pages in each category
Pig slides adapted from Olston et al. (SIGMOD 2008)
Pig Script for example query
1. visits = load '/data/visits' as (user, url, time);
2. gVisits = group visits by url;
3. visitCounts = foreach gVisits generate url, count(visits);
4. urlInfo = load '/data/urlInfo' as (url, category, pRank);
5. visitCounts = join visitCounts by url, urlInfo by url;
6. gCategories = group visitCounts by category;
7. topUrls = foreach gCategories generate top(visitCounts,10);
8. store topUrls into '/data/topUrls';
Pig Query Plan for Hadoop Execution
[Diagram: Load Visits -> Group by url -> Foreach url generate count (Map1/Reduce1); Load Url Info -> Join on url (Map2/Reduce2); Group by category -> Foreach category generate top10(urls) (Map3/Reduce3)]
Pig slides adapted from Olston et al. (SIGMOD 2008)
Hive
- Hive is a data warehouse built on top of Hadoop
  - Originated at Facebook in 2007; now part of Apache Hadoop
  - Provides an SQL-like language called HiveQL
- Hive gives a simple interface for queries and analysis
  - Access to files stored via HDFS, HBase
  - Does not give fast "real-time" responses (inherited from Hadoop)
  - The minimum response time may be minutes: designed to scale
- Example use case at Netflix: log data analysis
  - 0.6 TB of log data per day, analyzed by 50+ nodes
  - Test quality: how well is the network performing?
  - Statistics: how many streams/day, errors/session, etc.
HiveQL to Hive
- Hive translates a HiveQL query into a set of MapReduce jobs and executes them
- To support persistent schemas, it keeps metadata in an RDBMS
  - Known as the metastore (implemented by the Apache Derby DBMS)
Hive concepts
- Hive presents a view of data similar to a relational DB
  - A database is a set of tables
  - Tables are formed from rows with the same schema (attributes)
  - A row of a table is a single record
  - A column in a row is an attribute of the record
HiveQL examples: Create and Load
- CREATE TABLE posts (user STRING, post STRING, time BIGINT)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;
- LOAD DATA LOCAL INPATH 'data/user-posts.txt'
  OVERWRITE INTO TABLE posts;
- SELECT COUNT(1) FROM posts;
  Total MapReduce jobs = 1
  Launching Job 1 out of 1 [...]
  Total MapReduce CPU Time Spent: 2 seconds 640 msec
  4          (the query result)
  Time taken: 14.204 seconds
HiveQL examples: querying
- SELECT * FROM posts WHERE user="u1";
  - Similar to SQL syntax
- SELECT * FROM posts WHERE time<=1343182133839 LIMIT 2;
  - Only returns the first 2 matching results
- GROUP BY and HAVING allow aggregation, as in SQL
  - SELECT category, count(1) AS cnt FROM items GROUP BY category HAVING cnt > 10;
- Can also specify how results are sorted
  - ORDER BY (totally ordered) and SORT BY (sorted within each reducer)
- Can specify how tuples are allocated to reducers
  - Via the DISTRIBUTE BY keyword
Hive: Bucketing and Partitioning
- Can use one column to partition data
  - Each partition is stored in a separate file
  - E.g. partition by country
  - No difference in query syntax, but querying on the partitioning attribute is fast
- Can cluster data into buckets: hash rows into buckets
  - Allows parallelization in MapReduce: one mapper per bucket
  - Can use buckets to evaluate a query on a sample (one bucket)
Summary
- A large, complex ecosystem for data management has grown around Hadoop
  - We have barely scratched the surface of this world
- Began with Hadoop and HDFS for MapReduce
  - HBase for storage/retrieval of large data
  - Hive and Pig for more high-level programming abstractions
Reading: www.coreservlets.com/hadoop-tutorial/; Data Intensive Text Processing with MapReduce, Chapters 1-3; Hadoop: The Definitive Guide (Chapters 1-3; 16, 17, 20)