NoSQL - cs. · PDF fileExample MapReduce Problem Exercise: Write your own queries in Hadoop!...
-
Upload
truongdiep -
Category
Documents
-
view
220 -
download
3
Transcript of NoSQL - cs. · PDF fileExample MapReduce Problem Exercise: Write your own queries in Hadoop!...
Outline
● What is NoSQL?● What is MapReduce/Hadoop?● Example MapReduce Problem● Exercise: Write your own queries in
Hadoop!
● “No SQL”○ No relational database
● Umbrella term for many different types of datastores○ Key-Value Stores, Document Stores, Graph
Database systems, etc.● (Really, it’s more like “Not Only SQL” – we don’t
want to abandon the relational DBMS entirely)
What is NoSQL?
Why NoSQL?
● In general, we want our databases to be:○ Convenient○ Reliable○ Safe○ Scalable○ Efficient
Why NoSQL?
● In general, we want our databases to be:○ Convenient○ Reliable○ Safe○ Scalable○ Efficient
● Nowadays, we care a lot more about scalability and efficiency
RethinkDBRavenDB
Amazon DynamoDB
VoldemortBerkeleyDB
Amazon SimpleDB
Cloudata
ThruDB
Terrastore
RaptorDB
FoundationDB
LevelDB
RethinkDBRavenDB
Amazon DynamoDB
VoldemortBerkeleyDB
Amazon SimpleDB
Cloudata
ThruDB
Terrastore
RaptorDB
FoundationDB
LevelDB
What is MapReduce?
● Created in 2004 at Google● Problem: 100’s of data files distributed
across 1,000’s of machines○ how do we get that information, quickly?
● Solution: Extract the data from the files in parallel○ take advantage of the fact that the data is
distributed over 1,000’s of machines
What is MapReduce?
● No data model, data stored in files● Primarily used on distributed filesystems● Users provide two functions
○ map function (data transformation)○ reduce function (data aggregation)
● System takes care of parallelizing the process
Mapping and Reducing
● map: divide problem into subproblems○ input: single line from data file○ output: 0 or more (key, value) pairs
● reduce: work on each subproblem, combine results○ input: (key, list of values)○ output: 0 or more output records
Mapping and Reducing
● map: divide problem into subproblems○ input: single line from data file○ output: 0 or more (key, value) pairs
● reduce: work on each subproblem, combine results○ input: (key, list of values)○ output: 0 or more output records
What is Hadoop?
● An open source implementation of MapReduce○ Google didn’t want to share :-(
● Also used over distributed filesystems● Same mechanics as Google’s version of
MapReduce
Example Problem: Word Counts
How now brown cow
Brown cow is blue
InputHow now brown cowBrown cow is blue
Example Problem: Word Counts
How now brown cow
Brown cow is blue
Input MapHow now brown cowBrown cow is blue
Example Problem: Word Counts
How now brown cow
Brown cow is blue
Input Map (how, 1)
(now, 1)
(brown, 1)
(cow, 1)
(brown, 1)
(cow, 1)
(is, 1)
(blue, 1)
How now brown cowBrown cow is blue
Example Problem: Word Counts
How now brown cow
Brown cow is blue
Input Map (how, 1)
(now, 1)
(brown, 1)
(cow, 1)
(brown, 1)
(cow, 1)
(is, 1)
(blue, 1)
How now brown cowBrown cow is blue
Reduce
Example Problem: Word Counts
How now brown cow
Brown cow is blue
Input Map (how, 1)
(now, 1)
(brown, 1)
(cow, 1)
(brown, 1)
(cow, 1)
(is, 1)
(blue, 1)
How now brown cowBrown cow is blue
Reduce
how, 1now, 1brown, 2cow, 2is, 1blue, 1
Example Problem: Word Counts
How now brown cow
Brown cow is blue
Input Map (how, 1)
(now, 1)
(brown, 1)
(cow, 1)
(brown, 1)
(cow, 1)
(is, 1)
(blue, 1)
How now brown cowBrown cow is blue
Reduce
how, 1now, 1brown, 2cow, 2is, 1blue, 1
Example Problem: Word Counts
How now brown cow
Brown cow is blue
Input Map (how, 1)
(now, 1)
(brown, 1)
(cow, 1)
(brown, 1)
(cow, 1)
(is, 1)
(blue, 1)
How now brown cowBrown cow is blue
Reduce
how, 1now, 1brown, 2cow, 2is, 1blue, 1
Now it’s your turn!● ssh into corn, and copy the Hadoop starter code
○ cp -r /usr/class/cs145/NoSQL-activity .
○ cd NoSQL-activity/● Run the initialization script
○ local-hadoop/start-local-hadoop.py○ Don’t forget to run local-hadoop/stop-
local-hadoop.py before you log out!
Query #1: Word Counts (again!)● (We’ll do this one together.)● Starter code can be found in src/query1● Dataset can be found in /usr/class/cs145/NoSQL-data● Compile and run your code using query1-wordcount.sh● Results will show up in output1/ directory
○ check results by running diff output1/part-00000 /usr/class/cs145/NoSQL-answers/output1/part-00000
Query #2: Hashtag Counts
● Count the number of times each Hashtag appears in the Twitter dataset○ a hashtag is a term that starts with ‘#’
● Answer should be of the form:○ <Hashtag> <Integer>
● How many times does #goStanford appear in the dataset?