Datainnovation

Data Science Innovations: Roadmap to Hadoop Ecosystem & Spark [email protected] linkedin.com/in/sureshsood @soody http://www.slideshare.net/ssood/datainnovation February 4, 2015

Transcript of Datainnovation

Page 1: Datainnovation

Data Science Innovations: Roadmap to Hadoop Ecosystem & Spark

[email protected]

linkedin.com/in/sureshsood

@soody

http://www.slideshare.net/ssood/datainnovation

February 4, 2015

Page 2: Datainnovation

Topic Areas for Discussion

1. Statistics/Data mining or Data Science?

2. What is big data and the challenge today?

3. Data types

4. Data Science workflows & discovery

5. Hadoop

6. Data Science innovation

7. New Sources of Information (Big data) Data Driven Innovations

8. Internet of Things

9. Data Science Innovations

10. Apache Spark

Page 3: Datainnovation

Statistics, Data Mining or Data Science?

• Statistics – precise deterministic causal analysis over precisely collected data

• Data Mining – deterministic causal analysis over re-purposed data carefully sampled

• Data Science – trending/correlation analysis over existing data using bulk of population i.e. big data

Adapted from:

NIST Big Data taxonomy draft report (see http://bigdatawg.nist.gov/show_InputDoc.php)

Page 4: Datainnovation

What is Big Data?

Unknown relationships

Unstructured data

95% of data not collected

Social-Psychological-Local-Mobile-GPS-M2M

Beyond transactions, including interactions and observations

Page 5: Datainnovation

Big Data Challenge Today: Moving from Transactions Alone to Relationships and Empathy

Current State = Transactions $$$

We do this stuff well, e.g. collect payments …

Future State = Human Empathy (relationships)

We don’t really do this, e.g. user generated content, ratings, reviews, 1:1 dialogue, distress signals, geolocation


Page 6: Datainnovation

Data Types

• Astronomical

• Documents

• Earthquake

• Email

• Environmental sensors

• Fingerprints

• Health (personal)

• Images

• Graph data (social network)

• Location

• Marine

• Particle accelerator

• Satellite

• Scanned survey data

• Sound

• Text

• Transactions

• Video

Page 7: Datainnovation

Data Science Workflows & Discovery

Page 8: Datainnovation

Hadoop & Spark Explained

Page 9: Datainnovation

Hadoop Configurations (Single and Multi-Rack)

Adapted from: http://stackiq.com/

Cluster manager e.g. Apache Ambari, Apache Mesos, or Rocks

3 TB drives: an 18 data node configuration represents 648 TB of raw storage. With the HDFS standard replication factor of 3, this gives 216 TB of usable storage.

Name/secondary/data nodes – 6 core, 96 GB

Management node – 4 core, 16 GB
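
The storage figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated above (648 TB raw, 18 data nodes, 3 TB drives, replication factor 3). The drives-per-node value is not given on the slide; it is simply what those figures imply, and the function name is illustrative.

```python
# Capacity check for the cluster described on the slide.
RAW_TB = 648          # raw storage stated on the slide
DATA_NODES = 18       # data nodes stated on the slide
DRIVE_TB = 3          # drive size stated on the slide
REPLICATION = 3       # HDFS standard replication factor


def cluster_capacity(raw_tb=RAW_TB, nodes=DATA_NODES,
                     drive_tb=DRIVE_TB, replication=REPLICATION):
    """Return usable TB, raw TB per node, and implied drives per node."""
    usable_tb = raw_tb / replication        # each block is stored `replication` times
    raw_per_node_tb = raw_tb / nodes
    drives_per_node = raw_per_node_tb / drive_tb  # implied, not stated on the slide
    return usable_tb, raw_per_node_tb, drives_per_node


usable, per_node, drives = cluster_capacity()
print(f"usable: {usable:.0f} TB, per node: {per_node:.0f} TB, drives/node: {drives:.0f}")
```

In practice usable capacity is somewhat lower again, since space is also needed for temporary and non-HDFS data; that overhead is outside what the slide states.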

Page 10: Datainnovation

Data Science Innovation

Data science innovation is something an organization has not done before or even something nobody anywhere has done before. A data science innovation focuses on discovering and using new or untraditional data sources to solve new problems.

Adapted from: Franks, B. (2012) Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons

Page 11: Datainnovation

http://tacocopter.com/

New Sources of Information (Big data): Social Media + Internet of Things Data Science Innovations

Page 12: Datainnovation

Internet of Things (IoT) “trillion sensors”

Source: www.tsensorssummit.org

Page 13: Datainnovation

Data Science Innovations

ID Analytics | Innovative Info source | Innovation | Software/Platform
1. Node-Link (NLA) | Multiple | Reduce suspect list from 18 m to 230/32 | New version Spark GraphX
2. ANZ Truckometer | NZ transport authority real time traffic data | GDP forecast 6 months in advance | N/A
3. Driving (Usage Based) | Black box (telematics), unstructured data | Pay as you drive policy, pay how you drive | Hadoop Map Reduce
4a. Deception (veracity) | Found stories, online blogs | Flag fake stories: text, images and short video | MongoDB – Python dictionary
4b. Psychological State | Twitter and Instagram | Junk words | MongoDB – Python dictionary
4c. Thematic Apperception Technique | Mobile phone screen customisation | Automated informant testing | Sparkling Water (H2O/Spark), Deep Learning
5. Brand | Brand stories “found” online | Brand user profile | R/Hadoop
6. Supermarket shopper behavior | CCTV / beacon transmitters | “My store” product placement based on time of day, predictive shopping behaviour | MongoDB, Hadoop 2 Cluster, Spark GraphX, Spark MLlib
7. Sandbag exercise | Sandbag sensors | Virtual trainer | Spark GraphX, Spark MLlib
8. Oil reserves shipment monitoring | Skybox (Google) satellite images | Improved oil forecast | “Busboy” – C/Hadoop
9. J score for mobile energy usage | Sparse incomplete data from community of mobile users | Energy bug mgmt. | Spark/Amazon Web

Suresh Sood 2015

Page 14: Datainnovation

1. Node Link Analytics

• 1990’s: Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer

• Everyone in Australia was a suspect

• Large volumes of data from multiple sources

RTA vehicle records, gym memberships, gun licensing records, internal police records

• Police applied node link analysis techniques (NetMap) to the data

• Harness power of the human mind

• Analyst can spot indirect links, patterns, structure, relationships and anomalies

• A bottom-up approach with process of discovery to uncover structure

• Reduced the suspect list from 18 million to 230

• Further analysis with the use of additional satellite information reduced this to 32

Data Information Knowledge
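
As a rough illustration of the idea of linking records from multiple sources, the sketch below builds a tiny node-link structure and keeps only the people connected to every source of interest. It is not the NetMap method described on the slide, which relies on an analyst exploring the graph visually. The records, identifiers and source names are hypothetical toy data.

```python
from collections import defaultdict

# Hypothetical toy records (source, person_id); not real case data.
RECORDS = [
    ("vehicle_records", "P1"),
    ("vehicle_records", "P2"),
    ("gym_memberships", "P1"),
    ("gun_licences", "P1"),
    ("gun_licences", "P3"),
]


def build_links(records):
    """Link each person node to the set of source nodes that mention it."""
    links = defaultdict(set)
    for source, person in records:
        links[person].add(source)
    return links


def shortlist(links, required_sources):
    """Keep only people linked to every one of the required sources."""
    return sorted(p for p, sources in links.items() if required_sources <= sources)


links = build_links(RECORDS)
print(shortlist(links, {"vehicle_records", "gun_licences"}))
```

Each additional source that must link to a person shrinks the shortlist, which is the bottom-up narrowing effect the slide describes.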

Page 15: Datainnovation

2. ANZ Truckometer

The ANZ Heavy Traffic Index comprises flows of vehicles weighing more than 3.5 tonnes (primarily trucks) on 11 selected roads around NZ. It is contemporaneous with GDP growth.

The ANZ Light Traffic Index is made up of light or total traffic flows (primarily cars and vans) on 10 selected roads around the country. It gives a six month lead on GDP growth.

http://www.anz.co.nz/about-us/economic-markets-research/truckometer/

Page 16: Datainnovation

3. Black Box Insurance

• Big data transforms actuarial insurance from using probability methods to estimate premiums into dynamic risk management, using real data to generate individually tailored premiums

• Estimate: for a 20 km work or home journey, a data point is acquired every minute and the journey captures 12 points per km. Assuming 1,000 km of driving per month, this generates 12,000 points per month, or 144,000 points per car per annum. Hence, 1,000 cars lead to 144 million points per annum.

• A telematics (black box) monitor helps assess driving behavior and price the policy on true driver-centric premiums by capturing:

– Number of journeys

– Distances travelled

– Types of roads

– Speed

– Time of travel

– Acceleration and braking

– Any accidents

– Location?

• Benefits low mileage, smooth and safe drivers

• Privacy vs. saving money on insurance (Canada; http://bit.ly/Black_box)
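
The data-volume estimate in the second bullet above can be reproduced directly. The sketch below uses only the slide's own assumptions (12 points per km, 1,000 km per month, 1,000 cars); the function name is illustrative.

```python
# Telematics data-volume estimate, using the figures stated on the slide.
POINTS_PER_KM = 12
KM_PER_MONTH = 1_000
MONTHS_PER_YEAR = 12
CARS = 1_000


def telematics_points(cars=CARS):
    """Return points per car per month, per car per year, and per fleet per year."""
    per_month = POINTS_PER_KM * KM_PER_MONTH
    per_car_year = per_month * MONTHS_PER_YEAR
    fleet_year = per_car_year * cars
    return per_month, per_car_year, fleet_year


per_month, per_car_year, fleet_year = telematics_points()
print(f"{per_month:,} points/month, {per_car_year:,} points/car/year, "
      f"{fleet_year:,} points/year for {CARS:,} cars")
```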

Page 17: Datainnovation

Psychological analytics helps put human context into Business

• Behavior data links human emotions to business -> Analyse footprints left behind.

• What really does customer satisfaction mean ? Is the person actually happy?

• How do we take the emotional dimension into account for customer experience?

• How do we recognize someone is dissatisfied?

• How do we recognize a “distressed” person?

• Do we use text and voice? Will sleeping patterns and eating habits help?

• Would you act differently if someone is happy?

• How do you coach employees to see how someone sounds in emotional terms?

• Understanding when distress exists and when a customer needs enhanced service

• Behavior data reveals attitude and intent. This is more predictive of future opportunities and risk than historical data

Page 18: Datainnovation

4a. Deception (veracity)

Page 19: Datainnovation

Elaboration of Trip to Paris Blog Story (Means-End & Heider)

Woodside, Sood & Miller (2008), When Consumers and Brands Talk, Psychology & Marketing

1. Gayle

2. Paige

3. Paris

4. "The occasion was my cousin Paige’s 16th"

5. "I am a Canadian and get by in French."

6. "All I can say is WOW! We rented a 2 bedroom, 1 ½ bath apartment (two showers), "Merlot" from ParisPerfect http://www.parisperfect.com/ and boy was it ever perfect!"

7. "We had a full view of the Eiffel from our charming little terrace. ....We were within walking distance to two metro stops (Pont d'Alma or Ecole Militaire)"

8. "We were walkable to many good bistros, cafes and bakeries and only a few blocks from the wonderful market street Rue Cler."

9. "I bought a Paris Pratique pocket-sized book at a Metro station. This handy guide has detailed maps of each arrondisement, as well as the metro lines, the bus lines, the RER and the SCNF (trains). I'll never be without this again."

10. "Six months before our trip, I gave Paige a couple of good guide books on Paris and suggested she let me know what her interests were since after all, this was to be her trip."

11. Sites

• The Marais

• Notre Dame

• L'Arc de Triomphe - 248 steps up and 248 steps down...

• Champs Elysee

• Jacquemart Museum

• Louvre Lite

• Musee D'Orsay

• Les Invalides, Napoleon's Tomb and the Napoleon Museum

• Sacre Coeur

• Monmartre

• Rodin Museum

• Pompidou Museum

• Train to Vernon, bike to Giverny with Fat Tire Bike Tours

• http://www.fattirebiketoursparis.com/

• Eiffel Tower

12. Unforgettable Memories: "This trip had so many memories, but here are a few choice highlights........On our very first night, knowing that the Eiffel Tower light show started at 10:00 p.m.... she [Paige] dropped her camera…down 6 flights…we were stunned…Spanish Family below standing below [with pieces of the camera]"

13. "The father stretched out his cupped hands which held all of the pieces they were able to recover, including the memory stick and he very solemnly said, "El muerto..."."

14. "They had decide to come to Paris to find the Harley Davidson store so they could buy Harley Paris t-shirts."

15. "Michael Osman is an American artists living in Paris." "He supplements his income by being a tour guide." "I found out about him on Fodors" "So I engaged Michael for two days."

16. "On our trip to Giverny, we met a young woman from Brisbane, Australia who was traveling on her own and we invited her to join us. Three of us enjoyed delicious and innovative soufflés, while Paige had the rack of lamb. We shared two dessert soufflés, one chocolate and the other cherry/almond. Yum"

17. "I wanted Paige to get a feel for shopping experiences that she would not have at home (aka the ubiquitous mall)."

18. "We went on Fat Tire's day trip to Monet's gardens and house in Giverny, about an hour outside Paris."

19. ....."I know Paige will treasure the memory of this girl's trip for many years to come."

Page 20: Datainnovation


Page 21: Datainnovation

The Newman Model of Deception (Pennebaker et al.)

Key word categories for deception mapping:

1. Self words e.g. “I” and “me” – decrease when someone distances themselves from content

2. Exclusive words e.g. “but” and “or” – decrease with fabricated content owing to the complexity of maintaining deception

3. Negative emotion words e.g. “hate” – increase owing to feelings of shame or guilt

4. Motion verbs e.g. “go” or “move” – increase as exclusive words go down, to keep the story on track
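
A minimal sketch of how the four word categories above might be measured in a piece of text. The tiny vocabularies contain only the example words from this slide; a real analysis would use full dictionaries such as LIWC's, and the model's weighting of the categories is not given here, so the function only reports the share of words falling in each category. All names are illustrative.

```python
import re

# Example words taken from the slide; real category dictionaries are far larger.
CATEGORIES = {
    "self": {"i", "me"},
    "exclusive": {"but", "or"},
    "negative_emotion": {"hate"},
    "motion": {"go", "move"},
}


def category_rates(text):
    """Return the fraction of words in `text` that fall in each category."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {
        name: sum(word in vocab for word in words) / total
        for name, vocab in CATEGORIES.items()
    }


rates = category_rates("I hate to go but I move")
for name, rate in rates.items():
    print(f"{name}: {rate:.2f}")
```

Comparing these rates against a baseline for the same author or platform is one way to flag texts that move in the directions listed above.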

Page 22: Datainnovation

Instagram Deception (Suspects outside of -20 & +20)

Vine Deception (Suspects outside of -5 and +5)

Page 23: Datainnovation

4b. Psychological State

• LIWC (analyzewords.com)

– Reveal personality from word usage

– Uses LIWC classification of words

• TweetPsych (tweetpsych.com/)

– Linguistic analysis using:

– RID

– LIWC

Note: TweetPsych is not without critics: http://psychcentral.com/blog/archives/2009/06/18/putting-cool-ahead-of-science-tweetpsych/

Page 24: Datainnovation

4c. Thematic Apperception Technique

Page 25: Datainnovation

5. Brand User Analytics

Social CRM integrates “breadcrumb” data

Page 26: Datainnovation

Variables and Data Types in Big Data Set

Aquarius, Aries, Cancer, Capricorn, Gemini, Leo, Libra, Pisces, Sagittarius, Scorpio, Taurus, Virgo

Ambivalent, Employee, Opposer, Reporter, Supporter

11. Committed Partnerships, 12. Compartmentalised Friendship, 13. Childhood friendship, 14. Courtship, 15. Fling, 16. Secret-Affair, 17. Enslavement, 2. Marriages of Convenience, 3. Best Friendships, 4. Kinships, 5. Rebounds/Avoidance-Driven, 6. Courtships, 7. Dependencies, 8. Enmities, 9. Love-Hate (Sweeney and Chew)

Africa, Argentina, Australia, Australia/Hong Kong, Austria, California, Canada, China, Egypt, England, Finland, France, Germany, Guernsey, Holland, India, Indonesia, Ireland, Israel, Italy, Japan, Kuwait, Malaysia, Nepal, Paraguay, Philippines, Phillipines, Portugual, Saudi Arabia, Singapore, South Africa, Spain, Sweden, Taiwan, Thailand, UK, USA

A&F, Beijing, Gucci, LVMH, New York, Old Navy, Paris, Sydney, Tiffany, Tokyo, Tommy, Versace

An-Verb, An-Vis, Hol-Verb, Hol-Vis

Depriv/Enhance, Enhance/Depriv

Page 27: Datainnovation


Page 28: Datainnovation

Model Comparison By Variables/Predictors

Page 29: Datainnovation

6. Supermarket Shopper Behavior

Beacon

Active Card

Page 30: Datainnovation

7. Smart Sandbag System

smart-dove.com

The first 3 columns are the x, y, z axes of the gyroscope, followed by the x, y, z axes of the accelerometer. These are raw data from 40 repetitions of the shoulder press exercise. Standard deviation and a moving average algorithm are used to build the chart, and a Hidden Markov Model is used to extract features and build a model of the exercise. All models are put into the cloud for trainee exercise scoring.
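
The smoothing step mentioned above can be sketched with the standard library alone. This is a minimal illustration of a moving average and a rolling standard deviation over one sensor axis; the sample readings and window size are hypothetical, and the Hidden Markov Model stage is not shown.

```python
from statistics import mean, pstdev


def moving_average(values, window):
    """Mean of each run of `window` consecutive readings."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]


def rolling_std(values, window):
    """Population standard deviation of each run of `window` readings."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [pstdev(values[i:i + window]) for i in range(len(values) - window + 1)]


# Hypothetical accelerometer x-axis readings for part of one repetition.
x_axis = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0]
print(moving_average(x_axis, 3))
print(rolling_std(x_axis, 3))
```

A wider window gives a smoother curve at the cost of blurring the boundaries between repetitions.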

Page 31: Datainnovation

8. Oil reserves shipment monitoring

Ras Tanura Najmah compound, Saudi Arabia

Source: http://www.skyboximaging.com/blog/monitoring-oil-reserves-from-space

Page 32: Datainnovation

9. Carat: Collaborative Energy Diagnosis

Page 33: Datainnovation

Information Architecture

Source: http://carat.cs.berkeley.edu

Page 34: Datainnovation

Spark Streaming

GraphX

SparkSQL

MLlib

Page 35: Datainnovation

Square Kilometer Array (SKA)

• Data collected in a single day would take nearly two million years to play back on an MP3 player.

• Central computer has processing power of about one hundred million PCs.

• SKA will use enough optical fiber linking up all the radio telescopes to wrap twice around the Earth.

• Dishes of SKA when fully operational will produce 10 times the global internet traffic as of 2013.

• Aperture arrays in the SKA could produce more than 100 times the global internet traffic as of 2013.

• The SKA will generate enough raw data to fill 15 million 64 GB MP3 players every day.

• The SKA supercomputer will perform 10^18 operations per second - equivalent to the number of stars in three million Milky Way galaxies - in order to process all the data that the SKA will produce.

• So sensitive that it will be able to detect an airport radar on a planet 50 light years away.

• Thousands of antennas with collecting area of about one square kilometer (that's 1,000,000 square meters).

• Previous mapping of Centaurus A galaxy took a team 12,000 hours of observations or several years. SKA ETA 5 minutes!

• In first six hours of operation, SKA will generate more information than all previous radio telescopes in the world combined.
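
To put the "15 million 64 GB MP3 players every day" figure in more familiar units, the sketch below converts it to a daily volume and an average data rate. It assumes decimal units (1 PB = 1,000,000 GB) and a constant rate over 24 hours; both are simplifying assumptions, not statements from the source.

```python
# Convert the slide's "15 million 64 GB players per day" into PB/day and TB/s.
PLAYERS_PER_DAY = 15_000_000
PLAYER_GB = 64
SECONDS_PER_DAY = 24 * 60 * 60


def ska_raw_data():
    """Return daily raw data in GB, in PB, and the average rate in TB/s."""
    daily_gb = PLAYERS_PER_DAY * PLAYER_GB
    daily_pb = daily_gb / 1_000_000              # decimal petabytes (assumption)
    rate_tb_s = daily_gb / 1_000 / SECONDS_PER_DAY  # assumes a constant rate
    return daily_gb, daily_pb, rate_tb_s


daily_gb, daily_pb, rate_tb_s = ska_raw_data()
print(f"{daily_gb:,} GB/day = {daily_pb:,.0f} PB/day, about {rate_tb_s:.1f} TB/s")
```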

To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which, according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all came into existence. As a scientist, this is a once in a lifetime opportunity.”

Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska

Centaurus A

Page 36: Datainnovation

Caution!

“Children never put off till tomorrow what will keep them from going to bed tonight”

ADVERTISING AGE