MapReduce and NoSQL
CMSC 461
Michael Wilson



Big data
- The term "big data" has become fairly popular as of late
- There is a need to store vast quantities of data and retrieve them in a short amount of time
  - Images, movies, etc.
  - Large files


MapReduce
- http://research.google.com/archive/mapreduce.html
- Concept pioneered by Google
- Performing operations on large volumes of data
  - Map function
  - Reduce function


Map function
- Receives a set of key-value pairs as input
- Performs some user-defined operation
- Produces a set of new key-value pairs


Reduce function
- Receives the intermediate key-value pairs
  - Can have multiple values for the same key
- Merges the values together in some way
- Produces a merged output
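
To make the map and reduce steps concrete, here is a minimal single-process sketch of the classic word-count example. The names map_words, reduce_counts, and run_mapreduce are made up for illustration and are not part of any library; a real framework would run the map and reduce calls on many machines and handle the grouping step itself.

```python
from collections import defaultdict


def map_words(key, value):
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    return [(word.lower(), 1) for word in value.split()]


def reduce_counts(key, values):
    """Reduce: merge every value seen for one key into a single output."""
    return (key, sum(values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical driver that imitates a framework on one machine."""
    # Map phase: every input pair is processed independently.
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))

    # Grouping phase: collect all intermediate values that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: one call per distinct intermediate key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


documents = [("doc1", "big data is big"), ("doc2", "data is everywhere")]
word_counts = run_mapreduce(documents, map_words, reduce_counts)
```

Note that "big" reaches the reduce function with two values, which is the "multiple values for the same key" case from the slide.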


When to use MapReduce
- MapReduce doesn’t work for all problems
- Problems have to be parallelizable
  - In other words, an algorithm that involves stateful steps is not necessarily a good candidate for MapReduce
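
A small sketch of the difference, under made-up assumptions: the chunk size, the sample values, and the step function are all invented for illustration. Summing works when the data is split into chunks because partial sums can be merged in any order. The stateful recurrence does not, because each step needs the result of the step before it.

```python
from functools import reduce


def chunked(items, size):
    """Split a list into consecutive chunks, as a cluster might split input."""
    return [items[i:i + size] for i in range(0, len(items), size)]


values = [1, 2, 3, 4, 5, 6, 7, 8]
chunks = chunked(values, 4)

# Parallelizable: each chunk is summed on its own, then the partials are merged.
partial_sums = [sum(chunk) for chunk in chunks]
total = sum(partial_sums)


def step(state, x):
    """Stateful update: the new state depends on the previous state."""
    return state * 2 + x


# Correct answer: one pass over the data, in order.
sequential = reduce(step, values, 0)

# Naive attempt: run each chunk from a fresh state and add the results.
# The chunks cannot see each other's state, so the answer comes out wrong.
naive_merge = sum(reduce(step, chunk, 0) for chunk in chunks)
```

Here total matches sum(values), while naive_merge differs from sequential.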


Commodity hardware
- MapReduce clusters are built from commodity hardware
  - x86 processors, several gigabytes of RAM
- In this day and age, adding more computers is cheap
  - Rather than beef up the machines, just use more


Hadoop
- Hadoop is a Java-based MapReduce implementation
  - Very popular
- Has a secondary component, HDFS
  - Hadoop Distributed File System
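
Hadoop jobs are normally written in Java, but Hadoop also ships a Streaming utility (not covered in these slides) that lets the mapper and reducer be any program that reads lines and writes tab-separated key/value lines. The sketch below assumes that convention and also assumes the reducer's input arrives sorted by key. The simulate function is a hypothetical local stand-in for the cluster, not a Hadoop API.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_lines):
    """Sum the counts for each word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def simulate(lines):
    """Local stand-in for the framework: map, sort by key, then reduce."""
    return list(reducer(sorted(mapper(lines))))


result = simulate(["big data is big", "data is everywhere"])
```

On a real cluster the mapper and reducer would be separate scripts reading standard input, and the sort between them would be done by the framework.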


HDFS
- File system spread across a Hadoop MapReduce cluster
- Large block sizes: 64 MB by default in early versions (128 MB in Hadoop 2 and later)
- Very popular base for other distributed applications
  - In particular, NoSQL applications
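
As a rough worked example of what a 64 MB block size means, the sketch below computes how a file would be cut into blocks. It is plain arithmetic, not an HDFS API; the function names are invented, and it assumes the last block only holds whatever is left over.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the default quoted on the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed for a file (integer ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def block_sizes(file_size_bytes, block_size=BLOCK_SIZE):
    """Size of each block; the final block holds the remainder, if any."""
    full_blocks, remainder = divmod(file_size_bytes, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


one_gib = 1024 ** 3
blocks_for_one_gib = block_count(one_gib)       # 16 full blocks
blocks_for_one_more = block_count(one_gib + 1)  # one extra, nearly empty block
```

Each block can live on a different machine, which is what lets a MapReduce job work on different parts of one large file at the same time.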


NoSQL
- NoSQL is a somewhat nebulous term
  - Basically means “not SQL,” or “something other than SQL”
- Many different approaches
- Key-Value stores are a big part of the NoSQL movement
  - Focus on them here


Key-Value?!
- This almost seems like a step backward
- Key-Value stores are far less structured
  - Can’t establish relations between entities in a key-value store
  - Can’t constrain data very well
- Why is reducing the structure gaining popularity?
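
To show what "less structured" looks like in practice, here is a toy in-memory key-value store. The class and its keys are invented for illustration and do not model any particular product. The store accepts values of any shape and never checks that a key referenced inside a value actually exists.

```python
class KeyValueStore:
    """Toy key-value store: opaque values looked up by key, nothing more."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No schema: whatever value is handed in gets stored as-is.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"name": "Ada", "dept": "dept:7"})
store.put("user:2", "just a string")  # a differently shaped value is accepted

# "dept:7" was never stored, and nothing stopped user:1 from pointing at it.
# A relational database could reject this with a foreign key constraint.
dangling = store.get(store.get("user:1")["dept"])
```

Any relationship between "user:1" and "dept:7" exists only in the application code that reads the values.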


Distributable nature
- Many Key-Value stores can be distributed amongst many nodes
  - By distributing the data across these nodes, searches and operations on vast swaths of data can be performed in a sensible amount of time
- Not all, however
  - Some are single-server applications that keep their data in RAM
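
One reason key-value data distributes so easily is that the key alone can decide which node holds a value. The sketch below assumes simple hash-modulo placement; the node names and the DistributedStore class are hypothetical, and real systems use more elaborate schemes.

```python
import hashlib


def node_for(key, node_names):
    """Pick a node from the key's hash, so every client picks the same one."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return node_names[int.from_bytes(digest[:8], "big") % len(node_names)]


class DistributedStore:
    """Toy cluster: one dict per node, with keys routed by hash."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.nodes = {name: {} for name in self.node_names}

    def put(self, key, value):
        self.nodes[node_for(key, self.node_names)][key] = value

    def get(self, key):
        # Only one node has to be consulted, however large the cluster is.
        return self.nodes[node_for(key, self.node_names)].get(key)


cluster = DistributedStore(["node-a", "node-b", "node-c"])
for i in range(1000):
    cluster.put(f"key{i}", i)

keys_per_node = {name: len(data) for name, data in cluster.nodes.items()}
```

Modulo placement moves most keys whenever the number of nodes changes, which is why production systems tend to use schemes such as consistent hashing instead.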


NoSQL Key-Value implementations
- HBase
- Accumulo
- Memcached
- Dynamo
- Many, many more