MapReduce and NoSQL
CMSC 461
Michael Wilson

Big data
  The term big data has become fairly popular as of late
  There is a need to store vast quantities of data and retrieve them in a short amount of time
    Images, movies, etc.
    Large files

MapReduce
  http://research.google.com/archive/mapreduce.html
  Concept pioneered by Google
  Performing operations on large volumes of data
    Map function
    Reduce function


Map function
  Receives a set of key value pairs as input
  Performs some operation (user defined)
  Produces a set of new key value pairs
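
A minimal sketch of a map function in Python, assuming the familiar word-count task. The name `map_words` and the (document name, text) input format are illustrative assumptions; they do not come from the slides or from any particular framework.

```python
from typing import Iterator, Tuple


def map_words(key: str, value: str) -> Iterator[Tuple[str, int]]:
    """Map step: take one (document name, text) pair and emit (word, 1) pairs."""
    # The input key (document name) is not needed for counting words.
    for word in value.lower().split():
        yield (word, 1)


# One input pair produces a set of new, intermediate key-value pairs.
pairs = list(map_words("doc1", "big data needs big clusters"))
print(pairs)
```

Note that the same key ("big") is emitted twice. The map step does no merging of its own; that is left to the reduce step.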

Reduce function
  Receives the intermediate key value pairs
    Can have multiple values for the same key
  Merges the values together in some way
  Produces a merged output
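
A minimal sketch of a reduce step in Python, again assuming a word-count task. The helper names `group_by_key` and `reduce_counts` are hypothetical. The grouping helper stands in for the work a MapReduce framework normally does between the map and reduce phases.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def group_by_key(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Collect every value that was emitted for the same intermediate key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_counts(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce step: merge all values for one key, here by summing them."""
    return (key, sum(values))


# Intermediate pairs as a map step might have produced them (made-up data).
intermediate = [("big", 1), ("data", 1), ("big", 1)]
merged = dict(reduce_counts(k, v) for k, v in group_by_key(intermediate).items())
print(merged)
```

Summing is only one way to merge. Any user-defined merge (maximum, concatenation, and so on) fits the same shape.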

When to use MapReduce
  MapReduce doesn’t work for all problems
  Problems have to be parallelizable
    In other words, an algorithm whose steps depend on state from earlier steps is not necessarily a good candidate for MapReduce
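
A small Python sketch of the difference, under stated assumptions: the functions `sum_of_squares` and `chained_steps` are made up for illustration, and a thread pool stands in for the separate machines of a real cluster. The first function can be applied to each chunk independently and the partial results merged afterwards. The second cannot be split up this way, because every step needs the result of the step before it.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def sum_of_squares(chunk: List[int]) -> int:
    """Independent work: needs nothing from any other chunk."""
    return sum(x * x for x in chunk)


def chained_steps(seed: int, steps: int) -> int:
    """Stateful work: each iteration depends on the previous state."""
    state = seed
    for _ in range(steps):
        state = (state * state + 1) % 1_000_003
    return state


data = list(range(1, 9))
chunks = [data[i:i + 2] for i in range(0, len(data), 2)]

# Each chunk goes to its own worker; the partial sums are merged at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(sum_of_squares, chunks))
total = sum(partials)
print(total)
```

The chunked result matches what a single pass over all the data would give, which is what makes the first computation a reasonable fit for the map and reduce pattern.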

Commodity hardware
  MapReduce clusters are built from commodity hardware
    x86 processors, several gigabytes of RAM
  In this day and age, additional computers are cheap
    Rather than beef up the machines, just use more

Hadoop
  Hadoop is a Java-based MapReduce implementation
    Very popular
  Has a secondary component, HDFS
    Hadoop Distributed File System

HDFS
  File system spread across a Hadoop MapReduce cluster
  Large block sizes – 64 MB by default in Hadoop 1.x (128 MB in later releases)
  Very popular base for other distributed applications
    In particular, NoSQL applications
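
To give a feel for what a large block size means, here is a back-of-the-envelope sketch in Python using the 64 MB figure from the slide. The function name `blocks_needed` and the example file sizes are assumptions for illustration, and replication is ignored.

```python
import math

BLOCK_SIZE_MB = 64  # block size cited on the slide


def blocks_needed(file_size_mb: float, block_size_mb: int = BLOCK_SIZE_MB) -> int:
    """Number of blocks a file of the given size is split into."""
    return math.ceil(file_size_mb / block_size_mb)


print(blocks_needed(1024))  # a 1 GB file
print(blocks_needed(100))   # a 100 MB file; the last block is only partly filled
```

Because each block can live on a different node, a file that spans many blocks can be processed by many machines at once.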

NoSQL
  NoSQL is a somewhat nebulous term
    Basically means “not SQL,” or “something other than SQL”
  Many different approaches
  Key-Value stores are a big part of the NoSQL movement
    Focus on them here

Key-Value?!
  This almost seems like a step backward
  Key-Value stores are far less structured
    Can’t establish relations between entities in a key value store
    Can’t constrain data very well
  Why is reducing the structure gaining popularity?
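
A toy in-memory key-value store in Python shows how little structure is involved. The class `KeyValueStore` and the keys used here are hypothetical and are not modeled on any real product's API.

```python
from typing import Any, Dict, Optional


class KeyValueStore:
    """Toy store: opaque values looked up by key, nothing more."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        return self._data.get(key)


store = KeyValueStore()

# No relations: "dept:9" is just text inside a value. Nothing checks that
# such a key exists, so this is not a foreign key in the SQL sense.
store.put("user:1", {"name": "Ada", "dept": "dept:9"})

# No constraints: the next value has a completely different shape.
store.put("user:2", "just a string")

print(store.get("dept:9"))  # the referenced department was never stored
```

Any consistency between entries has to be enforced by the application, since the store itself only knows about keys and values.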

Distributable nature
  Many Key-Value stores can be distributed amongst many nodes
    By distributing the data across these nodes, searches and operations on vast swaths of data can be performed in a sensible amount of time
  Not all, however
    Some can be single-server applications that keep their data in RAM
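
One simple way to spread keys over several nodes is to hash each key and take the result modulo the number of nodes. The Python sketch below assumes this scheme purely for illustration. The node names are made up, and real stores typically use more elaborate placement strategies.

```python
import hashlib
from typing import Dict, List

NODES: List[str] = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(key: str, nodes: List[str] = NODES) -> str:
    """Pick the node responsible for a key from a hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# Place a dozen made-up keys and record which node each one lands on.
placement: Dict[str, List[str]] = {node: [] for node in NODES}
for key in (f"user:{i}" for i in range(12)):
    placement[node_for(key)].append(key)

for node, keys in placement.items():
    print(node, keys)
```

Since the same key always hashes to the same node, a lookup only has to contact one machine, and each node holds only a share of the data.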

NoSQL Key-Value implementations
  HBase
  Accumulo
  Memcached
  Dynamo
  Many, many more
