Have fun with Hadoop: Experiences with Hadoop and MapReduce. Jian Wen, DB Lab, UC Riverside.

Page 1: Have fun with Hadoop: Experiences with Hadoop and MapReduce

Jian Wen, DB Lab, UC Riverside

Page 2: Outline

Background on MapReduce
Summer 09 (freeman?): Processing joins using MapReduce
Spring 09 (Northeastern): NetflixHadoop
Fall 09 (UC Irvine): Distributed XML Filtering Using Hadoop

Page 3: Background on MapReduce

Started from Winter 2009
◦ Course work: Scalable Techniques for Massive Data by Prof. Mirek Riedewald.
◦ Course project: NetflixHadoop

Short exploration in Summer 2009
◦ Research topic: efficient join processing on the MapReduce framework.
◦ Compared the homogenization and Map-Reduce-Merge strategies.

Continued in California
◦ UCI course work: Scalable Data Management by Prof. Michael Carey.
◦ Course project: XML filtering using Hadoop.

Page 4: MapReduce Join: Research Plan

Focused on performance analysis of different implementations of join processors in MapReduce.
◦ Homogenization: add information about the source of each record in the map phase, then do the join in the reduce phase (a minimal sketch follows this list).
◦ Map-Reduce-Merge: a new primitive called merge is added to process the join separately.
◦ Other implementations: the map-reduce execution plans for joins generated by Hive.
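
To make the homogenization strategy concrete, here is a minimal reduce-side join sketch. It is my illustration rather than the project's code, and it assumes two line-oriented inputs whose file names start with R and S and whose first comma-separated field is the join key.

```java
// Hedged sketch of the "homogenization" reduce-side join described above.
// Assumptions (mine, not the slides'): both inputs are CSV text files whose
// first field is the join key, and relation files are named R* and S*.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class HomogenizationJoin {

  // Map phase: "homogenize" both inputs by tagging each record with its
  // source, emitting (join key, "R|record") or (join key, "S|record").
  public static class TagMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
      String tag = file.startsWith("R") ? "R" : "S";          // source tag
      String joinKey = line.toString().split(",", 2)[0];
      ctx.write(new Text(joinKey), new Text(tag + "|" + line));
    }
  }

  // Reduce phase: records from both sources with the same join key meet here;
  // buffer one side and cross it with the other to produce the join result.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> rSide = new ArrayList<String>();
      List<String> sSide = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        (s.startsWith("R|") ? rSide : sSide).add(s.substring(2));
      }
      for (String r : rSide)
        for (String s : sSide)
          ctx.write(key, new Text(r + "," + s));
    }
  }
}
```

Buffering both sides per key is the simplest choice; heavily skewed keys can exhaust reducer memory, which is one reason the different join plans are worth comparing.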

Page 5: MapReduce Join: Research Notes

Cost analysis model for processing latency.
◦ The whole map-reduce execution plan is divided into several primitives for analysis:
  Distribute Mapper: partition and distribute data onto several nodes.
  Copy Mapper: duplicate data onto several nodes.
  MR Transfer: transfer data between mapper and reducer.
  Summary Transfer: generate statistics of the data and pass them between working nodes.
  Output Collector: collect the outputs.

Some basic attempts at theta-joins using MapReduce.
◦ Idea: a mapper supporting a multi-cast key (see the sketch below).
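
The slide does not elaborate the multi-cast key idea. One plausible reading, sketched below purely as an assumption, is a mapper that emits the same record under several partition keys, so that every reducer that might hold a theta-join partner receives a copy and can evaluate the theta condition locally.

```java
// One possible "multi-cast key" mapper for a theta-join (my reading of the
// slide, not the project's code): every R record is replicated to all
// NUM_PARTITIONS reducers, while each S record goes to exactly one, so any
// (r, s) pair meets in some reducer, which then checks the theta condition.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class MultiCastKeyMapper
    extends Mapper<LongWritable, Text, IntWritable, Text> {

  private static final int NUM_PARTITIONS = 8;   // hypothetical setting

  @Override
  protected void map(LongWritable offset, Text line, Context ctx)
      throws IOException, InterruptedException {
    String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
    if (file.startsWith("R")) {
      // multi-cast: send a copy of the R record to every partition
      for (int p = 0; p < NUM_PARTITIONS; p++)
        ctx.write(new IntWritable(p), new Text("R|" + line));
    } else {
      // S records are hashed to a single partition
      int p = (line.hashCode() & Integer.MAX_VALUE) % NUM_PARTITIONS;
      ctx.write(new IntWritable(p), new Text("S|" + line));
    }
  }
}
```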

Page 6: NetflixHadoop: Problem Definition

From the Netflix Prize competition:
◦ Data: 100480507 ratings from 480189 users on 17770 movies.
◦ Goal: predict the unknown rating for any given (user, movie) pair.
◦ Measurement: use RMSE to measure prediction accuracy.

Our approach: Singular Value Decomposition (SVD).

Page 7: NetflixHadoop: SVD Algorithm

A feature means…
◦ For a user: a preference (I like sci-fi, or comedy, …).
◦ For a movie: genres, contents, …
◦ In general: an abstract attribute of the object it belongs to.

Feature vectors
◦ Each user has a user feature vector;
◦ Each movie has a movie feature vector.

The rating for a (user, movie) pair can be estimated by a linear combination of the feature vectors of the user and the movie.

Algorithm: Train the feature vectors to minimize the prediction error!
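
In symbols (my notation, not from the slides): writing p_u for the user feature vector and q_m for the movie feature vector, each of length k, the estimated rating and the training objective are (regularization terms omitted):

```latex
\hat{r}_{u,m} \;=\; \sum_{f=1}^{k} p_{u,f}\, q_{m,f},
\qquad
\min_{P,\,Q} \sum_{(u,m,r)\,\in\,\text{ratings}} \bigl( r - \hat{r}_{u,m} \bigr)^{2}
```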

Page 8: NetflixHadoop: SVD Pseudocode

Basic idea:
◦ Initialize the feature vectors;
◦ Iteratively: calculate the prediction error, then adjust the feature vectors.
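
As a single-machine illustration of that loop (not the project's code; the feature count and learning rate below are assumed values), the per-rating update typically looks like this:

```java
// Hedged sketch of the training loop on page 8: initialize feature vectors,
// then repeatedly compute each rating's prediction error and nudge the
// corresponding user and movie vectors. Constants are hypothetical.
import java.util.Random;

public class SvdTrainerSketch {
  static final int K = 16;            // number of features (assumed)
  static final double LRATE = 0.001;  // learning rate (assumed)

  double[][] userF;   // [numUsers][K]
  double[][] movieF;  // [numMovies][K]

  SvdTrainerSketch(int numUsers, int numMovies) {
    Random rnd = new Random(42);
    userF = new double[numUsers][K];
    movieF = new double[numMovies][K];
    for (double[] v : userF)  for (int f = 0; f < K; f++) v[f] = 0.1 * rnd.nextDouble();
    for (double[] v : movieF) for (int f = 0; f < K; f++) v[f] = 0.1 * rnd.nextDouble();
  }

  double predict(int u, int m) {
    double s = 0;
    for (int f = 0; f < K; f++) s += userF[u][f] * movieF[m][f];
    return s;
  }

  // one pass over the ratings: each row is a (user, movie, rating) triple
  void trainEpoch(int[][] ratings) {
    for (int[] r : ratings) {
      int u = r[0], m = r[1];
      double err = r[2] - predict(u, m);        // calculate the error
      for (int f = 0; f < K; f++) {             // adjust the feature vectors
        double uf = userF[u][f], mf = movieF[m][f];
        userF[u][f] += LRATE * err * mf;
        movieF[m][f] += LRATE * err * uf;
      }
    }
  }
}
```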

Page 9: NetflixHadoop: Implementation

Data pre-processing: randomize the order of the records (sketched below).
◦ Mapper: for each record, randomly assign an integer key.
◦ Reducer: do nothing; simply emit the records (the framework automatically sorts the output by key).
◦ A customized RatingOutputFormat, derived from FileOutputFormat, removes the key from the output.
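
A minimal sketch of that randomizing job, using the org.apache.hadoop.mapreduce API. The class names are mine; only RatingOutputFormat is named on the slide, and since its code is not shown it appears here only as a comment.

```java
// Hedged sketch of the randomizing pre-process described on page 9 (the real
// project code is not shown in the slides). A random IntWritable key is
// attached in the map phase; the shuffle/sort then yields a shuffled order,
// and an identity-style reducer writes the ratings back out.
import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class RandomizerSketch {

  public static class RandomKeyMapper
      extends Mapper<LongWritable, Text, IntWritable, Text> {
    private final Random rnd = new Random();

    @Override
    protected void map(LongWritable offset, Text rating, Context ctx)
        throws IOException, InterruptedException {
      // each (user, movie, rating) line gets a random integer key
      ctx.write(new IntWritable(rnd.nextInt(Integer.MAX_VALUE)), rating);
    }
  }

  public static class PassThroughReducer
      extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> ratings, Context ctx)
        throws IOException, InterruptedException {
      for (Text r : ratings) ctx.write(key, r);   // records arrive sorted by key
    }
  }

  // A custom RatingOutputFormat (extending FileOutputFormat, as on the slide)
  // would then drop the random key when writing the final files.
}
```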

Page 10: NetflixHadoop: Implementation

Feature vector training (a reducer sketch follows the list):
◦ Mapper: from an input (user, movie, rating), adjust the related feature vectors and output the vectors for the user and the movie.
◦ Reducer: compute the average of the feature vectors collected from the map phase for a given user/movie.
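
For the reduce side, a minimal averaging sketch (my illustration; the comma-separated vector encoding and the class name are assumptions):

```java
// Hedged sketch of the training reducer on page 10: average all copies of a
// user's (or movie's) feature vector emitted by the mappers. The vector is
// assumed to be serialized as a comma-separated list in a Text value.
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class VectorAverageReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text userOrMovieId, Iterable<Text> vectors, Context ctx)
      throws IOException, InterruptedException {
    double[] sum = null;
    int count = 0;
    for (Text v : vectors) {
      String[] parts = v.toString().split(",");
      if (sum == null) sum = new double[parts.length];
      for (int f = 0; f < parts.length; f++) sum[f] += Double.parseDouble(parts[f]);
      count++;
    }
    if (sum == null) return;
    StringBuilder out = new StringBuilder();
    for (int f = 0; f < sum.length; f++) {
      if (f > 0) out.append(',');
      out.append(sum[f] / count);               // the averaged feature value
    }
    ctx.write(userOrMovieId, new Text(out.toString()));
  }
}
```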

Challenge: global sharing of the feature vectors!

Page 11: NetflixHadoop: Implementation

Global sharing of feature vectors (a configuration-based sketch follows the list):
◦ Global variables: fail! Different mappers run in different JVMs, and no global variable is visible across JVMs.
◦ Database (DBInputFormat): fail! Errors on configuration, and bad performance is expected due to frequent updates (race conditions, query start-up overhead).
◦ Configuration files in Hadoop: fine! Data can be shared and modified by different mappers, but the approach is limited by the main memory of each working node.
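
One way to read the configuration-based approach, sketched here as an assumption rather than the project's actual mechanism: the driver serializes the current vectors into the job Configuration before each training round, and each mapper deserializes its own copy in setup(). The property name and encoding below are hypothetical.

```java
// Hedged sketch of the "configuration files" approach on page 11: the driver
// stashes the current feature vectors in the job Configuration before each
// training round, and every map task reads them back in setup().
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class SharedVectorsSketch {

  // Driver side: store the serialized vectors before submitting the round.
  static Job newTrainingRound(Configuration conf, String serializedVectors)
      throws IOException {
    conf.set("netflixhadoop.feature.vectors", serializedVectors);  // assumed name
    return new Job(conf, "svd-training-round");
  }

  // Mapper side: every task deserializes its own in-memory copy.
  public static class TrainingMapper
      extends Mapper<LongWritable, Text, Text, Text> {
    private String vectors;   // would be parsed into double[][] in real code

    @Override
    protected void setup(Context ctx) {
      vectors = ctx.getConfiguration().get("netflixhadoop.feature.vectors");
    }

    @Override
    protected void map(LongWritable offset, Text rating, Context ctx)
        throws IOException, InterruptedException {
      // ... adjust the local copy of the vectors using this rating, then emit
      // the updated user and movie vectors (as described on page 10) ...
    }
  }
}
```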

Page 12: NetflixHadoop: Experiments

Experiments using single-threaded, multi-threaded, and MapReduce implementations.

Test environment:
◦ Hadoop 0.19.1
◦ Single-machine, virtual environment:
  Host: 2.2 GHz Intel Core 2 Duo, 4 GB 667 MHz RAM, Mac OS X.
  Virtual machine: 2 virtual processors, 748 MB RAM each, Fedora 10.
◦ Distributed environment: 4 nodes (should be… currently 9 nodes), 400 GB hard drive on each node, Hadoop heap size 1 GB (failed to finish).

Page 13: NetflixHadoop: Experiments

[Chart: Randomizer, 1 mapper vs. 2 mappers; x-axis: # of records (770919, 113084, 1502071, 1894636); y-axis: time (sec), 0 to 60.]

[Chart: Learner, 1 mapper vs. 2 mappers; x-axis: # of records (770919, 113084, 1502071, 1894636); y-axis: time (sec), 0 to 120.]

Page 14: NetflixHadoop: Experiments

[Chart: Randomizer, Vector Initializer, and Learner on 1894636 ratings, run with 1 mapper, 2 mappers, 3 mappers, and 2 mappers+c; x-axis: time (sec), 0 to 120; y-axis: job types.]

Page 15: NetflixHadoop: Experiments

Page 16: XML Filtering: Problem Definition

Aimed at a pub/sub system utilizing a distributed computation environment.
◦ Pub/sub: the queries are known in advance, and data are fed into the system as a stream (in a DBMS it is the reverse: data are known, queries are fed in).

Page 17: XML Filtering: Pub/Sub System

[Diagram: XML queries and XML docs feeding into the XML filters.]

Page 18: XML Filtering: Algorithms

Uses the YFilter algorithm.
◦ YFilter: the XML queries are indexed as an NFA; XML documents are then fed through the NFA, and matches are reported from the final (accepting) states.
◦ Easy to parallelize: the queries can be partitioned and indexed separately.

Page 19: XML Filtering: Implementations

Three benchmark platforms are implemented in our project:
◦ Single-threaded: directly apply YFilter to the profiles and the document stream.
◦ Multi-threaded: parallelize YFilter across different threads.
◦ Map/Reduce: parallelize YFilter across different machines (currently in a pseudo-distributed environment).

Page 20: XML Filtering: Single-Threaded Implementation

The index (NFA) is built once on the whole set of profiles.

Documents are then streamed into YFilter for matching.

Matching results are then returned by YFilter.

Page 21: XML Filtering: Multi-Threaded Implementation

The profiles are split into parts, and each part is used to build its own NFA.

Each YFilter instance listens on a port for incoming documents, then returns its results through the socket (a minimal listener sketch follows).
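
A minimal sketch of one such worker, using plain Java sockets. The port handling and the one-document-per-line protocol are assumptions, and the YFilter matching call itself is left as a comment since its API is not shown in the slides.

```java
// Hedged sketch of one worker thread in the multi-threaded setup on page 21:
// it owns one partition of the profiles and serves matching requests over a
// socket. The actual YFilter calls are omitted; port and protocol are assumed.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

public class FilterWorker implements Runnable {
  private final int port;                 // each worker listens on its own port

  public FilterWorker(int port) { this.port = port; }

  @Override
  public void run() {
    try (ServerSocket server = new ServerSocket(port)) {
      while (true) {
        try (Socket client = server.accept();
             BufferedReader in = new BufferedReader(
                 new InputStreamReader(client.getInputStream()));
             PrintWriter out = new PrintWriter(client.getOutputStream(), true)) {
          String xmlDoc = in.readLine();   // one document per request (assumed)
          // ... run this worker's YFilter NFA on xmlDoc and collect matches ...
          out.println("matched-profile-ids-for-this-partition");
        }
      }
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
```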

Page 22: XML Filtering: Map/Reduce Implementation

Profile splitting: the profiles are read line by line, with the line number as the key and the profile as the value (see the sketch after this list).
◦ Map: for each profile, assign a new key using (old_key % split_num).
◦ Reduce: output all profiles with the same key into one file.
◦ Output: separate profile files, each holding the profiles that share the same (old_key % split_num) value.
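
A minimal sketch of that splitting job. The class names and the split count are assumptions; note that with the standard TextInputFormat the map key is a byte offset rather than a line number, which still serves as a stable old_key.

```java
// Hedged sketch of the profile-splitting job on page 22 (class names and the
// split count are mine). Each profile is re-keyed by (old_key % split_num),
// so the profiles are spread over split_num groups.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ProfileSplitSketch {
  static final long SPLIT_NUM = 4;   // hypothetical number of profile splits

  public static class SplitMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void map(LongWritable lineNo, Text profile, Context ctx)
        throws IOException, InterruptedException {
      // re-key each profile by (old_key % split_num); keep the old key inline
      ctx.write(new LongWritable(lineNo.get() % SPLIT_NUM),
                new Text(lineNo.get() + "\t" + profile));
    }
  }

  public static class SplitReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    @Override
    protected void reduce(LongWritable splitKey, Iterable<Text> profiles, Context ctx)
        throws IOException, InterruptedException {
      // all profiles sharing a split key are written together, forming one split
      for (Text p : profiles) ctx.write(splitKey, p);
    }
  }
}
```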

Page 23: XML Filtering: Map/Reduce Implementation

Document matching: the split profiles are read file by file, with the file number as the key and the profiles as the value (sketched below).
◦ Map: for each set of profiles, run YFilter on the document (fed in as a job configuration entry), and output the old_key of each matching profile as the key and the file number as the value.
◦ Reduce: just collect the results.
◦ Output: all keys (line numbers) of the matching profiles.
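
A sketch of the matching map phase under the same caveats: the configuration property name is assumed, and the YFilter invocation is elided because its API is not part of the slides.

```java
// Hedged sketch of the document-matching map phase on page 23. The document is
// read from the job Configuration (property name assumed); the YFilter calls
// themselves are left as a comment.
import java.io.IOException;
import java.util.Collections;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DocMatchMapper
    extends Mapper<LongWritable, Text, LongWritable, LongWritable> {

  private String xmlDoc;      // the document to match, shared via the job config

  @Override
  protected void setup(Context ctx) {
    xmlDoc = ctx.getConfiguration().get("xmlfilter.document");
  }

  @Override
  protected void map(LongWritable fileNo, Text profileSplit, Context ctx)
      throws IOException, InterruptedException {
    // Build (or load) the NFA for this split of profiles and run xmlDoc
    // through it; matchingOldKeys stands in for the YFilter result.
    List<Long> matchingOldKeys = Collections.emptyList();  // ... YFilter here ...
    for (long oldKey : matchingOldKeys) {
      // emit (old_key of matching profile, file number), as described above
      ctx.write(new LongWritable(oldKey), fileNo);
    }
  }
}
```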

Page 24: XML Filtering: Map/Reduce Implementation

Page 25: XML Filtering: Experiments

Hardware:
◦ MacBook, 2.2 GHz Intel Core 2 Duo
◦ 4 GB 667 MHz DDR2 SDRAM

Software:
◦ Java 1.6.0_17, 1 GB heap size
◦ Cloudera Hadoop Distribution (0.20.1) in a virtual machine

Data:
◦ XML docs: SIGMOD Record (9 files)
◦ Profiles: 25K and 50K profiles on SIGMOD Record

Page 26: XML Filtering: Experiments

Running out of memory: we encountered this problem in all three benchmarks, but Hadoop is much more robust against it:
◦ Smaller profile splits.
◦ The map-phase scheduler uses memory wisely.

Race conditions: the YFilter code we use is not thread-safe, so in the multi-threaded version race conditions corrupt the results; Hadoop works around this with its shared-nothing runtime:
◦ Separate JVMs are used for different mappers, instead of threads that may share lower-level state.

Page 27: XML Filtering: Experiments

Page 28: XML Filtering: Experiments

There are memory failures, and some jobs fail as well.

Page 29: XML Filtering: Experiments

Page 30: XML Filtering: Experiments

There are memory failures, but the jobs recover.

Page 31: Questions?