MapReduce and NoSQL
CMSC 461
Michael Wilson

Big data
The term "big data" has become fairly popular of late.
There is a need to store vast quantities of data and retrieve them in a short amount of time.
Images, movies, and other large files are typical examples.

MapReduce
http://research.google.com/archive/mapreduce.html
Concept pioneered by Google.
Performs operations on large volumes of data.
Built from two functions: a map function and a reduce function.

Map function
Receives a set of key-value pairs as input.
Performs some user-defined operation on them.
Produces a set of new key-value pairs.

Reduce function
Receives the intermediate key-value pairs.
Can receive multiple values for the same key.
Merges the values together in some way.
Produces a merged output.
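
The map and reduce steps described above can be sketched with the classic word-count example. This is a minimal single-process illustration, not Hadoop's or Google's API: the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_fn(key, value):
    """Map: take one (document name, text) pair and emit (word, 1) pairs."""
    for word in value.lower().split():
        yield word, 1


def reduce_fn(key, values):
    """Reduce: merge every value collected for one key into a single output."""
    return key, sum(values)


def run_mapreduce(inputs, map_fn, reduce_fn):
    """Hypothetical driver: run map, group by key, then run reduce."""
    # Map phase: each input pair is processed independently of the others.
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            # Group intermediate values by key (one key, many values).
            intermediate[out_key].append(out_value)
    # Reduce phase: each key's values are merged independently.
    return dict(reduce_fn(k, vs) for k, vs in intermediate.items())


documents = [("doc1", "big data is big"), ("doc2", "data is everywhere")]
word_counts = run_mapreduce(documents, map_fn, reduce_fn)
print(word_counts)
```

Because no map call depends on another map call, and no reduce call depends on another key's values, both phases could be spread over many workers.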

When to use MapReduce
MapReduce doesn't work for all problems.
Problems have to be parallelizable.
In other words, an algorithm whose steps depend on state carried over from earlier steps is not necessarily a good candidate for MapReduce.
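
To make the distinction concrete, here is a small sketch contrasting a computation that splits cleanly into independent chunks with one that carries state from step to step. The `stateful` recurrence is an arbitrary toy chosen for illustration, and `chunks` is a hypothetical helper.

```python
def chunks(data, size):
    """Split a list into consecutive pieces of at most `size` items."""
    return [data[i:i + size] for i in range(0, len(data), size)]


data = [1, 2, 3, 4]

# Parallelizable: each chunk can be summed on a different machine,
# and the partial sums combine into the correct total.
partial_sums = [sum(c) for c in chunks(data, 2)]
total = sum(partial_sums)


def stateful(values):
    """Toy recurrence: every step needs the state left by the previous step."""
    state = 0
    for v in values:
        state = (state * state + v) % 1000
    return state


# Stateful: the second chunk cannot start until the first has finished,
# so running the chunks independently gives unrelated partial results.
sequential_result = stateful(data)
per_chunk_results = [stateful(c) for c in chunks(data, 2)]

print(total, sequential_result, per_chunk_results)
```

The sum gives the same answer however the input is split, while the per-chunk results of the recurrence cannot simply be merged into the sequential result.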

Commodity hardware
MapReduce clusters run on commodity hardware.
x86 processors, several gigabytes of RAM.
In this day and age, adding more computers is cheap.
Rather than beef up the machines, just use more of them.

Hadoop
Hadoop is a Java-based MapReduce implementation.
Very popular.
Has a secondary component, HDFS (Hadoop Distributed File System).

HDFS
File system spread across a Hadoop MapReduce cluster.
Large block sizes: 64 MB by default in early Hadoop releases (later versions default to 128 MB).
Very popular base for other distributed applications, NoSQL applications in particular.
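
As a rough illustration of what a large block size means, the following sketch works out how a file would be divided into blocks. It assumes the 64 MB figure from the slide and whole-megabyte file sizes; `block_sizes` is a hypothetical helper, not part of any HDFS API.

```python
BLOCK_SIZE_MB = 64  # block size quoted on the slide (assumed here)


def block_sizes(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes of the blocks a file of the given size splits into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The final block holds whatever is left over.
        sizes.append(remainder)
    return sizes


layout = block_sizes(200)
print(len(layout), layout)
```

A 200 MB file therefore needs only four blocks, each of which can be placed on a different machine in the cluster.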

NoSQL
NoSQL is a somewhat nebulous term.
Basically means “not SQL,” or “something other than SQL.”
There are many different approaches.
Key-value stores are a big part of the NoSQL movement, so we focus on them here.

Key-Value?!
This almost seems like a step backward.
Key-value stores are far less structured.
Can't establish relations between entities in a key-value store.
Can't constrain data very well.
Why is reducing the structure gaining popularity?
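
The lack of structure is easy to see in a minimal sketch of a key-value store. The `KeyValueStore` class below is a hypothetical in-memory stand-in, not the interface of any particular product; the keys and values are invented for illustration.

```python
class KeyValueStore:
    """Toy in-memory key-value store: opaque values looked up by key."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # No schema: any value is accepted under any key.
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)


store = KeyValueStore()
store.put("user:1", {"name": "Ada", "dept": "dept:42"})
store.put("user:2", "just a string")  # a different shape, still accepted

# "dept:42" was never stored, and nothing checked the reference above:
# there are no foreign keys or constraints to enforce it.
missing_dept = store.get("dept:42")
print(missing_dept)
```

Any relationship between `user:1` and `dept:42` exists only in the application's head, which is exactly the structure a relational database would enforce.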

Distributable nature
Many key-value stores can be distributed amongst many nodes.
By distributing the data across these nodes, searches and operations on vast swaths of data can be performed in a sensible amount of time.
Not all, however.
Some are single-server applications that keep their data in RAM.
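
One reason key-value data distributes so easily is that a key alone is enough to decide where a record lives. The sketch below uses simple hash-modulo placement over three hypothetical node names. Real systems generally use more elaborate schemes (consistent hashing, for example), so treat this only as an illustration of the idea.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names


def node_for(key, nodes=NODES):
    """Pick a node for a key by hashing the key (simple modulo placement)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


class DistributedStore:
    """Toy store that keeps one dictionary per node."""

    def __init__(self, nodes=NODES):
        self.nodes = list(nodes)
        self.shards = {name: {} for name in self.nodes}

    def put(self, key, value):
        # Only the node that owns the key is touched.
        self.shards[node_for(key, self.nodes)][key] = value

    def get(self, key):
        # A lookup goes straight to the owning node; no other node is asked.
        return self.shards[node_for(key, self.nodes)].get(key)


cluster = DistributedStore()
for i in range(100):
    cluster.put(f"user:{i}", i)

print({name: len(shard) for name, shard in cluster.shards.items()})
```

Because no record refers to any other record, each node can answer requests for its own keys without coordinating with the rest of the cluster.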

NoSQL key-value implementations
HBase
Accumulo
Memcached
Dynamo
Many, many more