CS-495/595 Big Data processing concepts (part 1) Lecture...


Big Data Overview Concepts (part 1) Break Assignment Conclusion References

CS-495/595 Big Data processing concepts (part 1)

Lecture #2

Dr. Chuck Cartledge

21 Jan. 2015


Table of contents I

1 Big Data

2 Overview

3 Concepts (part 1)

4 Break

5 Assignment

6 Conclusion

7 References


What is it?

And, why is it interesting?

Big data has emerged as a technology term and trend that is complementary to and considered to be equally as transformational as the cloud computing model. . . . represented as an “old” or “new” capability depending on the perspective of those defining it, . . .

Lee Badger [10]

Big Data can be characterized by the three V’s: volume (large amounts of data), variety (includes different types of data), and velocity (constantly accumulating new data).

Jules J. Berman [3]


Notional definition

We’ll be covering virtually “bleeding edge” stuff.

Data too big for a single machine.

Processing too long for a single machine.

Question/analysis is parallelizable.


Where does it come from?

Lots of places, lots of it, and fast.

230,000,000 tweets per day [8]

2,700,000,000 Facebook likes per day [2]

100 hours of YouTube video every minute [14]

Clickstream left on servers
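
To get a feel for "fast," the daily figures above can be turned into per-second rates. The short Python sketch below does only that arithmetic; the function and variable names are made up for illustration, and the rates are simple averages that assume traffic is spread evenly over the day (real traffic is bursty).

```python
# Back-of-the-envelope rates for the daily figures quoted on the slide.
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def per_second(events_per_day: float) -> float:
    """Convert a per-day count into an average per-second rate."""
    return events_per_day / SECONDS_PER_DAY


tweets_per_second = per_second(230_000_000)
likes_per_second = per_second(2_700_000_000)

# 100 hours of video uploaded per minute, as a multiple of real time:
# 100 hours = 6,000 minutes of video arriving every wall-clock minute.
video_realtime_factor = 100 * 60

print(f"tweets per second: {tweets_per_second:,.0f}")
print(f"likes per second:  {likes_per_second:,.0f}")
print(f"video arrives {video_realtime_factor:,} times faster than real time")
```

Even as averages, these rates suggest why a single machine struggles to keep up.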


Where does it come from?

Anatomy of a tweet

Originator name and bio

Location, time

Followers??

How active is this originator??

Image from: http://www.slaw.ca/2011/11/17/the-anatomy-of-a-tweet-metadata-on-twitter/


Where does it come from?

How are tweets being used?

To monitor the affluence of an area.

The density of tweets

“Characterization” of hashtags

Frequency, time, and location

Collect data in real-time, compare it to “old data,” and make predictions [12].

Combining location, time, and topic can give insight into trends based on freely given data.


Where does it come from?

Big data quality problems aren’t limited to big data.

What was the gender and age breakdown of the Titanic survivors? Was it really women and children to the lifeboats first?

Genders are: male, female, and "" (empty string)

Ages are: "" (empty string) and positive floats

Remember about sample size and statistics?

Data from: https://www.kaggle.com/c/titanic-gettingStarted/data
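
One way to see the quality problem is to count the unusable records before computing any breakdown. The Python sketch below is illustrative only: the column names (Survived, Sex, Age) are assumed, the four sample rows are invented and are not real passenger data, and the helper names are hypothetical.

```python
import csv
import io

# Invented rows for illustration only; NOT real passenger data.
# Column names are an assumption about the file layout.
SAMPLE = """Survived,Sex,Age
1,female,29
0,male,
1,,4.5
0,male,40
"""


def parse_age(text):
    """Return the age as a positive float, or None if empty or invalid."""
    try:
        age = float(text)
    except ValueError:
        return None
    return age if age > 0 else None


def summarize(rows):
    """Count usable and unusable records instead of silently dropping them."""
    summary = {"usable": 0, "missing_sex": 0, "missing_age": 0}
    for row in rows:
        sex_ok = row["Sex"] in ("male", "female")
        age_ok = parse_age(row["Age"]) is not None
        if not sex_ok:
            summary["missing_sex"] += 1
        if not age_ok:
            summary["missing_age"] += 1
        if sex_ok and age_ok:
            summary["usable"] += 1
    return summary


summary = summarize(csv.DictReader(io.StringIO(SAMPLE)))
print(summary)
```

Reporting the missing counts alongside the answer makes the effective sample size visible, which is the point of the sample size question above.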


The Big “Vs”

Volume, Velocity, Variety are hard problems.

Vocabulary started with e-commerce [9]:

Volume: lots of data

Velocity: data is created fast

Variety: data has different origins

Big Data addresses questions based on these Vs


Other Vs

Sometimes other Vs appear as well.

The original 3 Vs have been expanded by many [5]:

Veracity: is the data trustworthy?

Value: how “good” is the data?

Variability: is the data consistent?

Sometimes complexity gets added to the mix.


Other Vs

A 3Vs visual.

The further out on the rings, the more it is “Big Data” like.

Velocity: All data is real-time, only the interval changes

Variety: Is the data structured, or not?

Volume: How much data has to be processed?

Changes in any of these vectors can cause a reevaluation of the current approach [4].


Other Vs

Velocity — what does it mean for Big Data?

Frequency of data generation/delivery

Think of data from a device, or sensor, robots, click logs

Real-time analysis is small (9%) [13].

Most Big Data analytics is batch

Takeaway: data is generated at high speed, and it must be analyzed before the next set of data is delivered. Little’s Law: L = λW [11]
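
Little’s Law says that, for a stable system, the long-run average number of items in the system (L) equals the average arrival rate (λ) times the average time each item spends in the system (W). The Python sketch below applies it to a stream of records; the rate and processing time are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def items_in_system(arrival_rate: float, time_in_system: float) -> float:
    """Little's Law: L = lambda * W (long-run averages, stable system)."""
    return arrival_rate * time_in_system


# Hypothetical numbers, not measurements.
arrival_rate = 1000.0   # lambda: records arriving per second
time_in_system = 0.25   # W: seconds each record spends being handled

backlog = items_in_system(arrival_rate, time_in_system)
print(f"average records in flight: {backlog:.0f}")

# Halving the time per record halves the average number in flight.
faster_backlog = items_in_system(arrival_rate, time_in_system / 2)
print(f"with faster processing:    {faster_backlog:.0f}")
```

Read the other way, the law gives a capacity estimate: if only a fixed number of records can be in flight at once, the time spent per record bounds the arrival rate the system can sustain.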


Other Vs

Variety — what does it mean for Big Data?

Not all data is the same.

Data from a multitude of different sources.

Not all data is useful.

Data is lost during “normalization”

Hopefully not important data; when in doubt: keep it somehow

Gets away from relational databases


Other Vs

Volume — what does it mean for Big Data?

How much is there? And, how do we store it?

Store relational records?

Store transactional records?

How long to keep data available?

How to access data?

How to migrate data?

Figure: Exponential data growth [7]

See http://en.wikipedia.org/wiki/Metric_prefix for a list of prefixes.
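
Since volume discussions lean on metric prefixes, a small helper can make raw byte counts readable. The Python sketch below uses decimal (power of 1000) SI prefixes; the function name is hypothetical and the sample sizes are arbitrary.

```python
# Decimal (SI) prefixes, each step a factor of 1000.
PREFIXES = ["", "k", "M", "G", "T", "P", "E", "Z", "Y"]


def with_metric_prefix(num_bytes: float) -> str:
    """Format a byte count using the largest SI prefix that keeps it >= 1."""
    value = float(num_bytes)
    index = 0
    while value >= 1000 and index < len(PREFIXES) - 1:
        value /= 1000
        index += 1
    return f"{value:.1f} {PREFIXES[index]}B"


for size in (999, 1_500_000_000_000, 2_000_000_000_000_000):
    print(size, "->", with_metric_prefix(size))
```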


Other Vs

The Big Data challenges.

Heterogeneity

Scale

Timeliness

Complexity

Privacy

The Big Data user changes the question [1].


The Vs

Our friends the Vs

Classic Vs

Additional Vs


Lots of data

Data sources

Government:

1 Medicare data (we’ll see more of this later)

2 NSA, DoD, NASA

Private:

1 Clickstream

2 FICO

3 Walmart

4 Android devices

Figure: Sample Clickstream

[6]


What does data look like?

Data characteristics

Formatted/unformatted

Bits, bytes, tagged, freeform

Clean, messy

Complete, fragmented


What does data look like?

Torrents of data

Primary usage

Secondary usage

“Exhaust”

Storage

1 Accessibility

2 Longevity

3 Privacy


What does data look like?

Big data players

Brokers

Scientists

Visionaries


Break time.

Take about 10 minutes.


A “Hello World” level problem.

With a little license.

A simply stated problem: Count the number of unique words in Shakespeare’s Macbeth.

A few Java classes

A Hadoop environment

Process strings from a file

Summarize the results

Grad students have a little more to do.


A “Hello World” level problem.

The Hadoop “cook book”, simple things (on the surface).

Partition (parallelize) the source data

Create key value pairs

1 Receive line of text

2 Parse the text in some way

3 Create key/value pairs

Behind the scenes key value pairs are combined

Reduce key and multiple values

Produce something useful
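
The cook book steps can be imitated on one machine to see how the pieces fit. The plain Python sketch below is not Hadoop code and does not use the Hadoop API; the function names are hypothetical, the sample lines are invented, and the "shuffle" function stands in for the grouping that the framework does behind the scenes.

```python
from collections import defaultdict


def map_line(line):
    """Map step: parse one line of text into (word, 1) key/value pairs."""
    return [(word, 1) for word in line.lower().split()]


def shuffle(pairs):
    """Group all values by key; the framework normally does this for you."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: collapse one key and its many values into one result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over a list of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())


# Invented sample text; each line could live on a different partition.
lines = ["the cat sat", "the cat ran"]
counts = word_count(lines)
print(counts)
print("unique words:", len(counts))
```

Because each line is mapped independently, the map step can run on as many machines as there are partitions; only the grouping step needs to bring matching keys together.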


Undergraduate level problem

Shakespeare’s Macbeth (mechanics)

Things that need to get done:

1 Get a copy of the play

2 Get it onto the Hadoop Distributed File System (HDFS)

3 Write and compile a Mapper class

4 Write and compile a Reducer class

5 Write and compile a main class

6 Run it on the ODU CS Hadoop farm


Undergraduate level problem

Undergrad results: a simple textual listing

Some words are more important than others (stop words)

Only base words (stems == stem)

Words sorted alphabetically

Number of occurrences per word

Words don’t have punctuation

Case insensitive words
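
The listing requirements above amount to a small normalization pipeline. The Python sketch below shows one possible shape for it, outside of Hadoop; the stop word set is a tiny illustrative subset, `naive_stem` is a crude stand-in for a real stemming algorithm, and the sample sentence is invented.

```python
import re
from collections import Counter

# Tiny illustrative subset; a real stop word list is much longer.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def naive_stem(word):
    """Crude placeholder for a real stemmer: strip a trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def normalize(text):
    """Lowercase, drop punctuation, remove stop words, then stem."""
    words = [w.strip("'") for w in re.findall(r"[a-z']+", text.lower())]
    return [naive_stem(w) for w in words if w and w not in STOP_WORDS]


def listing(text):
    """Return (word, occurrences) pairs sorted alphabetically by word."""
    return sorted(Counter(normalize(text)).items())


# Invented sample text, not a quotation from the play.
sample = "Daggers! The dagger, and the DAGGERS. Sleep no more."
for word, occurrences in listing(sample):
    print(word, occurrences)
```

In a MapReduce job, the normalization would most naturally sit in the map step so that the reduce step only ever sees cleaned keys.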


Graduate level problem

Graduate challenges: undergrads worked with one file, graduates with two

How do the vocabularies of Romeo and Juliet, and Macbeth compare?

Slightly more work to do with data:

Work with two files

Compare first 50 words of both plays

1 Order

2 Usage (relative not absolute)

Interested in how similar the vocabularies are across the two plays.
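
A comparison by order and relative usage might look like the Python sketch below. It assumes that "first 50 words" means the 50 most frequent words in each play, which is one reading of the slide; the function names are hypothetical and the two word lists are invented stand-ins for the cleaned words of each play.

```python
from collections import Counter


def top_words(words, n=50):
    """Return the n most frequent words, ties broken alphabetically."""
    ranked = sorted(Counter(words).items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:n]]


def relative_usage(words):
    """Map each word to its share of all words (relative, not absolute)."""
    counts = Counter(words)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def compare(words_a, words_b, n=50):
    """For words in both top-n lists: (word, rank_a, rank_b, share_a, share_b)."""
    top_a, top_b = top_words(words_a, n), top_words(words_b, n)
    usage_a, usage_b = relative_usage(words_a), relative_usage(words_b)
    return [
        (w, top_a.index(w) + 1, top_b.index(w) + 1, usage_a[w], usage_b[w])
        for w in top_a
        if w in top_b
    ]


# Invented word lists standing in for the two plays.
play_a = ["love", "love", "night", "sword"]
play_b = ["night", "night", "blood", "love"]
for row in compare(play_a, play_b):
    print(row)
```

Relative usage matters because the plays differ in length, so raw counts alone would favor the longer play.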


What have we covered?

Big Data Vs

Big Data sources

Problems associated with Big Data

Assignment #1

Next time: Big Data processing concepts (part deux)


References I

[1] Divyakant Agrawal, Philip Bernstein, Elisa Bertino, Susan Davidson, Umeshwar Dayal, and Michael Franklin, Challenges and Opportunities with Big Data, Purdue e-Pubs (2011).

[2] Anson Alexander, Facebook User Statistics 2012 [Infographic], ansonAlex.com (2012).

[3] Jules J Berman, Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information, Newnes, 2013.

[4] Pinal Dave, Big Data Beginning Big Data Day 2 of 21, http://blog.sqlauthority.com/2013/10/02/, 2013.


References II

[5] Mike Ferguson, Architecting A Big Data Platform for Analytics, A Whitepaper Prepared for IBM (2012).

[6] Christian Hagen, Khalid Khan, Marco Ciobo, and Jason Miller, Big Data and the Creative Destruction of Today’s Business Models, http://www.atkearney.com/strategic-it/ideas-insights/article/-/asset_publisher/LCcgOeS4t85g/content/big-data-and-the-creative-destruction-of-today-s-business-models/10192, 2013.

[7] Applied Innovations, Track website visitors,http://www.appliedi.net/blog/track-website-visitors/, 2010.


References III

[8] Joab Jackson, The Big Promise of Big Data, Business Software (2012).

[9] Doug Laney, 3D data management: Controlling data volume, velocity and variety, META Group Research Note 6 (2001).

[10] Robert Bohn Lee Badger, David Bernstein, US Government Cloud Computing Technology Roadmap Volume I, Tech. report, National Institute of Standards and Technology, 2014.

[11] John DC Little, A Proof for the Queuing Formula: L = λW, Operations Research 9 (1961), no. 3, 383–387.


References IV

[12] Patrick Meier, Using big data to inform poverty reduction strategies, http://irevolution.net/2013/06/19/pulse-of-egypt-to-inform-poverty-reduction/, 2013.

[13] Philip Russom, Big Data Analytics, TDWI Best Practices Report, Fourth Quarter (2011).

[14] YouTube, Statistics, http://www.youtube.com/yt/press/statistics.html.