Big Data Processing, 2014/15
Transcript of Big Data Processing, 2014/15
Lecture 1: Introduction
Claudia Hauff (Web Information Systems)
Course organisation

• 2 lectures a week (Tuesday, Thursday) for 7 weeks
• Content:
  • Data streaming algorithms
  • MapReduce (bulk of the course)
  • Approaches to iterative big-data problems
• One “free” lecture in January (suggestions?), e.g. big data visualisations, an industry talk, etc.
• Course style: interactive when possible
Course content

• Introduction
• Data streams 1 & 2
• The MapReduce paradigm
• Looking behind the scenes of MapReduce: HDFS & scheduling
• Algorithm design for MapReduce
• A high-level language for MapReduce: Pig 1 & 2
• MapReduce is not a database, but HBase nearly is
• Let’s iterate a bit: graph algorithms & Giraph
• How does all of this work together? ZooKeeper/YARN
Small changes are possible.
Assignments

• A mix of pen-and-paper and programming
• Individual work
• 5 lab sessions on Hadoop & Co. between Nov. 24 and Jan. 5, on Mondays 8.45am-12.45pm; lab attendance is not compulsory
• Two lab assistants will help you with your lab work (Kilian Grashoff and Robert Carosi)
• Assignments are handed out on Thursdays and due the following Thursday (6 in total)
Grading & exams

Assignments and quizzes? Yes!

• Course grade: 25% assignments, 75% final exam
• Weekly quizzes: up to 1 additional grade point
  • +0.5 if you answer >50% correctly
  • +1 if you answer >75% correctly
• You can pass without handing in the assignments (not recommended!)
• Final exam: January 28, 2-5pm (MC & open questions)
• Resit: April 15, 2-5pm (MC & open questions)
Questions?

• Questions are always welcome
• Contact preferably via mail: [email protected]
• Questions and answers may be posted on Blackboard if they are relevant for others as well
• If the pace is too slow/fast, please let me know
Reading material (lectures)

• Recommended material is either book chapters or papers, usually available through the TU Delft campus network
• The course covers a recent topic; no single book is available that covers all angles
• MapReduce: Data-Intensive Text Processing with MapReduce by Lin and Dyer, Morgan & Claypool Publishers, 2010. http://lintool.github.io/MapReduceAlgorithms/
Reading material (assignments)

• A book on Java if you are not yet comfortable with it
• Hadoop: Hadoop: The Definitive Guide by Tom White, O’Reilly Media, 2012 (3rd edition)
Big Data
Course objectives

• Explain the ideas behind the “big data buzz”
• Understand and describe the three different paradigms covered in class
• Code productively in one of the most important big data software frameworks we have to date: Hadoop (and tools building on it)
• Transform big data problems into sensible algorithmic solutions
Today’s learning objectives

• Explain and recognise the V’s of big data in use case scenarios
• Explain the main differences between data streaming and MapReduce algorithms
• Identify the correct approach (streaming vs. MapReduce) to be taken in an application setting
What is “big data”?

• A buzzword with fuzzy boundaries

“Massive amounts of diverse, unstructured data produced by high-performance applications.”

“Data too large & complex to be effectively handled by standard database technologies currently found in most organisations.”

• Requires novel infrastructure to support storage and processing
Large-scale computing is not new

• Weather forecasting has been a long-term scientific challenge
  • Supercomputers were already used in the 1970s
  • Equation crunching

[Image source: ECMWF]
Big data processing

• So-called big data technologies are about discovering patterns (in semi-/unstructured data)
• The main focus is on how to make computations on big data feasible, i.e. without a supercomputer; we use cluster(s) of commodity hardware!
• Next quarter: the Data Mining course (which will probably use Hadoop) focuses on how to discover those patterns
Just an academic exercise?

• Cloud computing: “anything running inside a browser that gathers and stores user-generated content”
• Utility computing
  • Computing as a metered service
  • A “cloud user” buys any amount of computing power from a “cloud provider” (pay-per-use)
  • Virtual machine instances
  • IaaS: infrastructure as a service
  • Amazon Web Services is the dominant provider
Just an academic exercise? No!

[Image: AWS EC2 pricing]

You can run your own big data experiments!
Progress often driven by industry

• Development of big data standards & (open source) software is commonly driven by companies such as Google, Facebook, Twitter, Yahoo! …
• Why do they care about big data?
• More knowledge leads to
  • better customer engagement
  • fraud prevention
  • new products

More data → more knowledge → more money
Big data analytics: IBM pitch
https://www.youtube.com/watch?v=1RYKgj-QK4I
A concrete example: big data vs. small data

Scaling to very very large corpora for natural language disambiguation. M. Banko and E. Brill, 2001.

Task: confusion set disambiguation, e.g. then vs. than, to vs. two vs. too.

[Figure: accuracy (poor to perfect) against the amount of training data (little to a lot)]

With little training data, simple algorithms fail and a complex algorithm is needed; but with a lot of data …
The 3 V’s
• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
The 5 V’s

• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? how accurate is it?
The 7 V’s

• Volume: large amounts of data
• Variety: data comes in many different forms from diverse sources
• Velocity: the content is changing quickly
• Value: data alone is not enough; how can value be derived from it?
• Veracity: can we trust the data? how accurate is it?
• Validity: ensure that the interpreted data is sound
• Visibility: data from diverse sources need to be stitched together
The 3 or 5 V’s are most commonly used.
Question: how do these attributes apply to the use case of Flickr?
Instantiations
Terminology
• Batch processing: running a series of computer programs without human intervention
• Near real-time: brief delay between the data becoming available and it being processed
• Real-time: guaranteed (bounded) delay between the data becoming available and it being processed
Terminology contd.

• Structured data (well-defined fields): standard in the past
• Semi-structured data
• Unstructured data (by humans, for humans): most common today
Unstructured text

• To get value out of unstructured text we need to impose structure automatically
  • Parse the text
  • Extract meaning from it (can be easy or difficult)
• The amount of data we create is more than doubling every two years; most new data is unstructured or at most semi-structured
• Text is not everything: images, video, audio, etc.
Extracting meaning

[Examples of sentences whose meaning is easy vs. difficult to extract]
IBM Watson: How it works

https://www.youtube.com/watch?v=_Xcmh1LQB9I

Eight minutes worth your time, even if not in class. Contains a nice piece about the use of unstructured text.
Examples of Volume & Velocity: Twitter

• >500 million tweets a day
• On average >5,700 tweets a second
• Peaks of >100,000 tweets a second
  • Super Bowl, US election, New Year’s Eve, Football World Cup (672M tweets in total for #WorldCup)
• Messages are instantly accessible for search
• Messages are used in post-hoc analyses to gather insights
#WorldCup tweet visualisation: http://bl.ocks.org/anonymous/raw/0c64880b3a791dffb6e4/
Examples of Volume: the Large Hadron Collider @ CERN

• The world’s largest particle accelerator, buried in a tunnel with a 27km circumference
• Enormous numbers of sensors register the passage of particles
• 40 million events/second (1MB of raw data per event)
• Generates 15 Petabytes a year (15-25 million GB)
Examples of Velocity: targeted advertising on the Web

• US revenues in 2013: ~$40 billion
• Advertisers usually get paid per click
• For each search request, search engines decide
  • whether to show an ad
  • which ad to show
• Users are at best willing to wait 2 seconds for their search results
• Feedback loop via user clicks, user searches, mouse movements, etc.
More interactions/data on the Web

• YouTube: 4 billion views a day, one hour of video uploaded every second
• Facebook: 483 million daily active users (Dec. 2011); 300 Petabytes of data
• Google: >1 billion searches per day (March 2011)
• Google processed 100 Terabytes of data per day in 2004 and 20 Petabytes of data per day in 2008
• Internet Archive: contains 2 Petabytes of data, grows 20 Terabytes per month (2011)
Use case: movie recommendations
[Diagram: a user-movie matrix for Bob, Alice, Tom, Jane, Frank and Joe, with most entries unknown (“?”)]

Bob likes X-MEN. Jane dislikes X-MEN. Should we suggest X-MEN to Frank?
Question: how can we come up with a suggestion for Frank? Think about data-intensive and non-intensive approaches.
Use case contd.

• Ignore the data, use experts instead (movie reviewers); assumes no large subscriber/reviewer divergence
• Use all data but ignore individual preferences; assumes that most users are close to the average
• Lump people into preference groups based on shared likes/dislikes; compute a group-based average score per movie
• Focus computational effort on difficult movies (some are universally liked/disliked)
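The preference-group idea above can be sketched in a few lines. This is a minimal, hypothetical illustration, not Netflix’s actual method: users who agree with you on some other movie form your “group”, and the prediction is that group’s average rating of the movie. The `predict` function, the ratings and all names are invented for the example.

```python
# A minimal sketch of group-based averaging (hypothetical data, not from
# the lecture): +1 means "likes", -1 means "dislikes".

def predict(ratings, user, movie):
    """Predict user's rating of movie as the average rating given by
    users who agree with `user` on at least one other movie."""
    peers = []
    for other, their in ratings.items():
        if other == user:
            continue
        # "shared likes/dislikes": an identical rating on some common movie
        agree = any(m in ratings[user] and ratings[user][m] == r
                    for m, r in their.items() if m != movie)
        if agree and movie in their:
            peers.append(their[movie])
    if not peers:  # fall back to the global average for this movie
        peers = [r[movie] for r in ratings.values() if movie in r]
    return sum(peers) / len(peers)

ratings = {
    "Bob":   {"X-MEN": 1, "Up": 1},
    "Jane":  {"X-MEN": -1, "Up": -1},
    "Frank": {"Up": 1},             # Frank agrees with Bob on "Up"
}
print(predict(ratings, "Frank", "X-MEN"))  # → 1.0, suggest X-MEN to Frank
```

Real recommender systems use far more robust similarity measures; the point here is only that shared likes/dislikes let us reuse other users’ ratings.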
A whole research field is concerned with this question: recommender systems.
Use case contd.

• Netflix Prize: an open competition for the best collaborative filtering algorithm to predict user ratings for films, based on previous ratings (>100 million ratings by 0.5 million users for ~17,000 movies)
• The first competitor to improve over Netflix’s baseline by 10% receives $1,000,000
• The competition started in 2006; the prize money was paid out in 2009 (the winner submitted 20 minutes earlier than the runner-up, with the same performance!)

Many research teams competed; innovation driven by industry again!
Example of Variety: restaurant locator

• Task: given a person’s location, list the top five restaurants in the neighbourhood
• Required data:
  • A world map
  • A list of all restaurants in the world (opening hours, GPS coordinates, menu, special offers)
  • Reviews/ratings
  • Optional: social media stream(s)
• The data is continuously changing (restaurants close, new ones open, data formats change, etc.)

Question: what kind of data do you need for this task?
Society can benefit too, not just companies

• Accurate predictions of natural disasters and diseases
• Better responses to disaster recovery
  • Timely & effective decisions
  • Provide resources where they are needed the most
• Complete disease/genomics databases to enable biomedical discoveries
• Accurate models to support forecasting of ecosystem developments
Idea: earthquake warnings

• Social sensors: users (humans) that use Twitter, Facebook, Instagram, i.e. portals with real-time posting abilities

[Diagram: an earthquake occurs; people in the area tweet about it; warn people further away]

• Challenges: how to detect whether a tweet is about an actual earthquake, which earthquake it is about, and where its centre is

Question: do you think this is possible? If so, what are the challenges?
Idea: earthquake warnings contd.

• Goal: the warning should reach people earlier than the seismic waves do
• Travel times of seismic waves: 3-7km/s; at 5km/s, a wave 100km away arrives after 20 seconds
• Performance of an existing system (Sakaki et al., 2010):

[Figure: warning times of the Twitter-based earthquake detector vs. the traditional warning system]

It is possible!
A brief introduction to Streaming & MapReduce
A brief introduction to Streaming
Data streaming scenario

• Continuous and rapid input of data (“stream of data”)
• Limited memory to store the data: less than linear in the input size
• Limited time to process each data item: sequential access
• Algorithms have one (or very few) passes over the data
• Can be approached from a practical or mathematical point of view: metric embeddings, pseudo-random computations, sparse approximation theory …

We go for the practical setup!
Data streaming example

What is the missing number in the stream 3 6 5 4 1 8 2 …?

Setup: a stream of n numbers, a permutation of 1 to n with one number missing; we are allowed one pass over the data.

Solution 1: memorise all numbers seen so far; memory requirement: n bits (impractical for large n)
Data streaming example contd.

Solution 2 (closed form): compute the sum of all numbers from 1 to n, i.e. n(n+1)/2, and subtract every number seen; what remains is the missing number. Memory: 2 log n bits.
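Solution 2 fits in a few lines of code; a minimal sketch of the single-pass version (the function name is our own):

```python
# One pass over the stream, constant state: start from the closed-form
# sum n(n+1)/2 and subtract every number seen; what is left over is the
# missing number.

def missing_number(stream, n):
    """Return the missing number of a permutation of 1..n seen as a stream."""
    remainder = n * (n + 1) // 2   # sum of 1..n
    for x in stream:               # single pass; O(log n) bits of state
        remainder -= x
    return remainder

print(missing_number([3, 6, 5, 4, 1, 8, 2], 8))  # → 7
```

Note that we only ever store the running remainder and the loop variable, matching the 2 log n memory bound.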
Data streaming contd.

What is the average? What is the median? Stream: 3 4 9 4 9 8 2. We are allowed one pass over the data and can only store 3 numbers.

• Average: can be computed by keeping track of two numbers (the sum and the count of numbers seen)
• Median: sample data points - but how?
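Both ideas can be sketched directly. The average really does need only two stored numbers; for the median, one standard answer to “sample - but how?” is reservoir sampling, which keeps a uniform random sample of fixed size from a stream of unknown length (the slide does not name a specific sampling scheme, so this is one illustrative choice):

```python
import random

def stream_average(stream):
    total, count = 0, 0            # the only two numbers we store
    for x in stream:
        total += x
        count += 1
    return total / count

def reservoir_sample(stream, k=3):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)                    # fill the reservoir first
        elif random.randrange(i + 1) < k:       # keep x with probability k/(i+1)
            sample[random.randrange(k)] = x     # evict a random current item
    return sample

stream = [3, 4, 9, 4, 9, 8, 2]
print(stream_average(stream))               # exact: 39/7
print(sorted(reservoir_sample(stream))[1])  # approximate median from the sample
```

The middle element of the sorted sample is only an approximation of the true median, which is the typical trade-off in the streaming setting.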
Data streaming contd.

• Typically, simple functions of the stream are computed and used as input to other algorithms
  • Median
  • Number of distinct elements
  • Longest increasing sequence
  • …
• Closed-form solutions are rare
• Common approaches are approximations of the true value: sampling, hashing
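As one illustration of the hashing approach (not covered in detail in this lecture), the number of distinct elements can be estimated in constant memory with the Flajolet-Martin idea: hash every item, remember only the maximum number of trailing zero bits R seen, and estimate 2^R distinct items. The sketch below is ours, with md5 standing in for a suitable hash function:

```python
import hashlib

def trailing_zeros(n, width=32):
    """Number of trailing zero bits of n (width if n == 0)."""
    if n == 0:
        return width
    count = 0
    while n % 2 == 0:
        n //= 2
        count += 1
    return count

def estimate_distinct(stream):
    """Rough distinct-count estimate: one pass, constant memory."""
    max_zeros = 0
    for item in stream:
        # 32-bit hash of the item; duplicates hash identically, so they
        # cannot inflate the estimate
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & 0xFFFFFFFF
        max_zeros = max(max_zeros, trailing_zeros(h))
    return 2 ** max_zeros

print(estimate_distinct([3, 4, 9, 4, 9, 8, 2]))
```

A single hash gives a very noisy estimate; practical systems average many such sketches, but the constant-memory, one-pass character is already visible here.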
A brief introduction to MapReduce
MapReduce is an industry standard

[QCon 2013 (San Francisco) programme] “QCon empowers software development by facilitating the spread of knowledge and innovation in the developer community. A practitioner-driven conference …”

Hadoop is the open-source implementation of the MapReduce framework.
Industry is moving fast

[QCon 2014 (San Francisco) programme]
MapReduce

• Designed for batch processing over large data sets
• No limits on the number of passes, memory or time
• A programming model for distributed computations, inspired by the functional programming paradigm
MapReduce example

WordCount: given an input text, determine the frequency of each word. The “Hello World” of the MapReduce realm.

Input text: The dog walks around the house. The dog is in the house.

[Diagram: the input text is split into chunks (text 1, text 2, text 3); each chunk goes to a Mapper, which emits pairs such as (dog,1), (walks,1), (house,1), (the,1), (in,1), …; the pairs are sorted by key and passed to a Reducer that adds the counts, producing dog: 2, walks: 1, house: 2, …]
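The data flow in the diagram can be simulated in plain Python (in real Hadoop the Mapper and Reducer would be Java classes and the framework would do the splitting, sorting and shuffling; this sketch just makes the three phases explicit):

```python
from itertools import groupby
from operator import itemgetter

def mapper(chunk):
    """Map phase: emit a (word, 1) pair for every word in the chunk."""
    for word in chunk.lower().replace(".", "").split():
        yield (word, 1)

def reducer(word, counts):
    """Reduce phase: add up all counts for one word."""
    return (word, sum(counts))

def word_count(chunks):
    # map: each chunk is processed independently (in parallel on a cluster)
    pairs = [pair for chunk in chunks for pair in mapper(chunk)]
    # shuffle/sort: group the (word, 1) pairs by key
    pairs.sort(key=itemgetter(0))
    # reduce: one reducer call per distinct word
    return dict(reducer(w, (c for _, c in grp))
                for w, grp in groupby(pairs, key=itemgetter(0)))

chunks = ["The dog walks around the house.", "The dog is in the house."]
print(word_count(chunks))  # dog: 2, walks: 1, house: 2, the: 4, …
```

The split into `mapper` and `reducer` is exactly what we will write ourselves in Hadoop; everything between them (sorting and grouping) is the framework’s job.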
We implement the Mapper and the Reducer. Hadoop (and other tools) are responsible for the “rest”.
Summary

• What are the characteristics of “big data”?
• Example use cases of big data
• Hopefully a convincing argument why you should care
• A brief introduction to data streams and MapReduce
Reading material

Required reading: none.

Recommended reading: Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information by Jules Berman. Chapters 1, 14 and 15.
THE END