Spark

Unleashing Data Science Innovations: Sparking Big Data linkedin.com/in/sureshsood @soody http://www.slideshare.net/ssood/spark-47741029 6 May, 2015


Topic Areas for Discussion

1. Statistics/Data mining or Data Science?

2. What is big data and the challenge today?

3. Data types

4. Hadoop File Storage System and Spark

5. Data Science innovation

6. Data Science discoveries and workflow

7. New Sources of Information (Big data) Data Innovations

8. Internet of Things

9. Data Science Innovations

10. Apache Spark

Statistics, Data Mining or Data Science?

• Statistics – precise deterministic causal analysis over precisely collected data

• Data Mining – deterministic causal analysis over re-purposed data carefully sampled

• Data Science – trending/correlation analysis over existing data using bulk of population i.e. big data

Adapted from:

NIST Big Data taxonomy draft report (see http://bigdatawg.nist.gov/show_InputDoc.php)

Big Data Challenge Today : Moving from Transactions Alone to Relationships and Empathy

Current State= Transactions $$$

We do this stuff well e.g. collect payments …

Future State= Human Empathy (relationships)

We don’t really do this e.g. user generated content, ratings, reviews, 1:1 dialogue, distress signals, geolocation


What is Big Data ?

Unknown relationships

Unstructured data

95% of data not collected

Social-Psychological- local-Mobile-GPS-M2M

Beyond Transactions including interactions and observations

Data Types

• Astronomical

• Documents

• Earthquake

• Email

• Environmental sensors

• Fingerprints

• Health (personal)

• Images

• Graph data (social network)

• Location

• Marine

• Particle accelerator

• Satellite

• Scanned survey data

• Sound

• Text

• Transactions

• Video

Hadoop Configurations (Single and Multi-Rack)

Adapted from: http://stackiq.com/

Cluster manager e.g. Apache Ambari, Apache Mesos, or Rocks

3 TB drives, 18 data nodes configuration represents 648 TB of raw storage

HDFS standard replication factor of 3

216 TB of usable storage

Name/secondary/data nodes – 6 core 96 GB

Management node – 4 core 16 GB
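The storage figures on this slide can be checked with a few lines of arithmetic. This is a minimal sketch: the drive size, node count and replication factor come from the slide, while the figure of 12 drives per node is an assumption, since it is not stated and is only implied by 648 / (18 × 3).

```python
# Raw vs. usable HDFS capacity for the cluster described on the slide.
DRIVE_TB = 3             # from the slide
DATA_NODES = 18          # from the slide
REPLICATION_FACTOR = 3   # HDFS standard replication, from the slide
DRIVES_PER_NODE = 12     # assumption: not on the slide, implied by 648 / (18 * 3)


def raw_storage_tb(nodes, drives_per_node, drive_tb):
    """Total disk capacity across all data nodes."""
    return nodes * drives_per_node * drive_tb


def usable_storage_tb(raw_tb, replication):
    """Capacity left once every block is stored `replication` times."""
    return raw_tb / replication


raw = raw_storage_tb(DATA_NODES, DRIVES_PER_NODE, DRIVE_TB)
usable = usable_storage_tb(raw, REPLICATION_FACTOR)
print(f"raw: {raw} TB, usable: {usable:.0f} TB")
```

Changing the replication factor or drive count shows how quickly usable capacity moves relative to raw capacity.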

Spark Explained

Full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Information Architecture

Source: http://carat.cs.berkeley.edu

AWS - Amazon Web Services

Data Science Innovation

Data science innovation is something an organization or individual has not done before using data. The innovation focuses on discovery using new or nontraditional data sources to solve new problems.

Adapted from: Franks, B. (2012) Taming the Big Data Tidal Wave, p. 255, John Wiley & Sons

Data Science Discoveries

1. Outlier / Anomaly / Novelty / Surprise detection

2. Clustering (= New Class discovery, Segmentation)

3. Correlation & Association discovery

4. Classification, Diagnosis, Prediction

Source: Borne, Kirk (2012) LSST All Hands Meeting, 13-17 August
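As an illustration of the first discovery type (outlier / anomaly detection), here is a minimal sketch using the median absolute deviation. The technique, the 3.5 threshold and the sample readings are assumptions chosen for illustration and do not come from the cited talk.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return the values whose modified z-score exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All values (or most of them) are identical; nothing can be ranked.
        return []
    # 0.6745 rescales the MAD so the score is comparable to a z-score.
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]


# Hypothetical sensor readings with one surprising value.
readings = [10.1, 9.8, 10.3, 10.0, 9.9, 10.2, 25.0, 10.1]
print(mad_outliers(readings))
```

The median-based score is used here because a single extreme value barely shifts the median, whereas it would inflate a mean and standard deviation.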


Data Science Workflows & Discovery

http://tacocopter.com/

New Sources of Information (Big data) : Social Media + Internet of Things Data Science Innovations

Internet of Things (IoT) “trillion sensors”

Source: www.tsensorssummit.org

Data Science Innovations

ID | Analytics | Innovative info source | Innovation | Platform/Library

1 | Graph Analytics | Multiple | Reduce suspect list from 18 million to 230/32 | Spark GraphX

2 | ANZ Truckometer | NZ transport authority real time traffic data | GDP forecast 6 months in advance | N/A, potential for combining with GDELT

3 | Driving (Usage Based Insurance) | Black box (telematics), unstructured data | Pay as you drive policy, pay how you drive | Hadoop MapReduce

4a | Deception (veracity) | Found stories, online blogs | Flag fake stories: text, images and short video | MongoDB/Spark, Python dictionary

4b | Psychological State | Twitter and Instagram | Junk words | MongoDB/Spark, Python dictionary

4c | Thematic Apperception Technique | Mobile phone screen customisation | Automated informant testing | Sparkling Water (H2O/Spark), Deep Learning

5 | Brand | Brand stories “found” online | Brand user profile | SparkR

6 | Supermarket shopper behavior | CCTV / beacon transmitters | “My store” product placement based on time of day, predictive shopping behaviour | MongoDB, Hadoop 2 Cluster, Spark GraphX, Spark MLlib

7 | Sandbag exercise | Sandbag sensors | Virtual trainer | Spark GraphX, Spark MLlib

8 | Oil reserves shipment monitoring | Skybox (Google) satellite images | Improved oil forecast | “Busboy” – C / Hadoop

Suresh Sood 2015

1. Graph Analytics

• In the 1990s Ivan Milat killed 7 backpackers, making him Australia's most notorious serial killer

• Everyone in Australia was a suspect

• Large volumes of data from multiple sources

– RTA vehicle records

– Gym memberships

– Gun licensing records

– Internal police records

• Police applied node link analysis techniques (NetMap) to the data

• Harness power of the human mind

• Analyst can spot indirect links, patterns, structure, relationships and anomalies

• A bottom-up approach with process of discovery to uncover structure

• Reduced the suspect list from 18 million to 230

• Further analysis with the use of additional satellite information reduced this to 32
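The idea of narrowing a suspect list by linking people across several data sources can be sketched in a few lines. This is a heavy simplification and an assumption on my part: NetMap is a visual node-link tool driven by an analyst, whereas the sketch below only counts how many distinct sources each person node is linked to. All records and identifiers are hypothetical.

```python
from collections import defaultdict

# Hypothetical (source, person_id) records; none of this is real case data.
records = [
    ("vehicle_records", "P1"), ("vehicle_records", "P2"), ("vehicle_records", "P3"),
    ("gym_memberships", "P2"), ("gym_memberships", "P4"),
    ("gun_licences", "P2"), ("gun_licences", "P3"),
    ("police_records", "P2"), ("police_records", "P3"),
]


def person_links(records):
    """Map each person node to the set of source nodes it is linked to."""
    links = defaultdict(set)
    for source, person in records:
        links[person].add(source)
    return links


def shortlist(records, min_sources):
    """Keep people linked to at least `min_sources` distinct sources."""
    links = person_links(records)
    return sorted(p for p, sources in links.items() if len(sources) >= min_sources)


print(shortlist(records, 3))
print(shortlist(records, 4))
```

Raising `min_sources` plays the role of adding further evidence: each extra source that must match shrinks the list again.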

Data Information Knowledge

2. ANZ Truckometer

The ANZ Heavy Traffic Index comprises flows of vehicles weighing more than 3.5 tonnes (primarily trucks) on 11 selected roads around NZ. It is contemporaneous with GDP growth.

The ANZ Light Traffic Index is made up of light or total traffic flows (primarily cars and vans) on 10 selected roads around the country. It gives a six-month lead on GDP growth.

http://www.anz.co.nz/about-us/economic-markets-research/truckometer/
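One way to look for a leading indicator such as the Light Traffic Index is to correlate the indicator with the target series at several leads and pick the strongest. The sketch below is an assumption about how such a check could be done, not ANZ's method, and both series are synthetic: the "GDP" series is simply the traffic series delayed by six months.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def best_lead(indicator, target, max_lead):
    """Return the lead (in periods) with the highest correlation, plus all scores."""
    scores = {}
    for lead in range(max_lead + 1):
        # Compare the indicator at time t with the target at time t + lead.
        xs = indicator[: len(indicator) - lead]
        ys = target[lead:]
        scores[lead] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Synthetic monthly series: the target repeats the indicator six months later.
traffic = [3, 5, 2, 8, 6, 1, 7, 4, 9, 2, 6, 3, 8, 5, 1, 7, 4, 9, 3, 6, 2, 8, 5, 7]
gdp = [0] * 6 + traffic[:-6]

lead, scores = best_lead(traffic, gdp, 9)
print(lead)
```

With real data the peak is rarely this clean, so the full `scores` dictionary is returned for inspection rather than only the winning lead.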

3. Black Box Insurance

• Big data transforms actuarial insurance from using probability methods to estimate premiums into dynamic risk management using real data to generate individually tailored premiums

• Estimate: for a 20 km work or home journey, a data point is acquired every minute and the journey captures 12 points per km. Assuming 1,000 km of driving per month, this generates 12,000 points per month, or 144,000 points per car per annum. Hence, 1,000 cars lead to 144 million points per annum.

• A telematics (black box) monitor helps assess driving behavior and prices the policy on true driver-centric premiums by capturing:

– Number of journeys

– Distances travelled

– Types of roads

– Speed

– Time of travel

– Acceleration and braking

– Any accidents

– Location?

• Benefits low mileage, smooth and safe drivers

• Privacy vs. saving money on insurance (Canada; http://bit.ly/Black_box)
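The data volume estimate in the second bullet can be reproduced directly. Every input below is taken from the slide; only the variable names are mine.

```python
# Telematics data volume, using the figures quoted on the slide.
POINTS_PER_KM = 12       # from the slide
KM_PER_MONTH = 1_000     # from the slide
MONTHS_PER_YEAR = 12
CARS = 1_000             # from the slide

points_per_month = POINTS_PER_KM * KM_PER_MONTH
points_per_car_year = points_per_month * MONTHS_PER_YEAR
points_per_fleet_year = points_per_car_year * CARS

print(f"{points_per_month:,} points per car per month")
print(f"{points_per_car_year:,} points per car per year")
print(f"{points_per_fleet_year:,} points per year for {CARS:,} cars")
```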

Psychological analytics helps put human context into Business

• Behavior data links human emotions to business -> analyse footprints left behind.

• What really does customer satisfaction mean ? Is the person actually happy?

• How do we take the emotional dimension into account for customer experience?

• How do we recognize someone is dissatisfied?

• How do we recognize a “distressed” person?

• Do we use text and voice? Will sleeping patterns and eating habits help?

• Would you act differently if someone is happy?

• How do you coach employees to see how someone sounds in emotional terms?

• Understanding when distress exists and when a customer needs enhanced service

• Behavior data reveals attitude and intent. This is more predictive of future opportunities and risk versus historical data


4a. Deception (veracity)

Elaboration of Trip to Paris Blog Story (Means-End & Heider)

1. Gayle

2. Paige

3. Paris

4. "The occasion was my cousin Paige's 16th"

5. "I am a Canadian and get by in French."

6. "All I can say is WOW! We rented a 2 bedroom, 1 ½ bath apartment (two showers), "Merlot" from ParisPerfect http://www.parisperfect.com/ and boy was it ever perfect!"

7. "We had a full view of the Eiffel from our charming little terrace. ....We were within walking distance to two metro stops (Pont d'Alma or Ecole Militaire)"

8. "We were walkable to many good bistros, cafes and bakeries and only a few blocks from the wonderful market street Rue Cler."

9. "I bought a Paris Pratique pocket-sized book at a Metro station. This handy guide has detailed maps of each arrondissement, as well as the metro lines, the bus lines, the RER and the SNCF (trains). I'll never be without this again."

10. "Six months before our trip, I gave Paige a couple of good guide books on Paris and suggested she let me know what her interests were since after all, this was to be her trip."

11. Sites: The Marais; Notre Dame; L'Arc de Triomphe - 248 steps up and 248 steps down...; Champs-Élysées; Jacquemart Museum; Louvre Lite; Musee D'Orsay; Les Invalides, Napoleon's Tomb and the Napoleon Museum; Sacre Coeur; Montmartre; Rodin Museum; Pompidou Museum; Train to Vernon, bike to Giverny with Fat Tire Bike Tours (http://www.fattirebiketoursparis.com/); Eiffel Tower

12. Unforgettable Memories: "This trip had so many memories, but here are a few choice highlights........On our very first night, knowing that the Eiffel Tower light show started at 10:00 p.m.... she [Paige] dropped her camera…down 6 flights…we were stunned…Spanish family standing below [with pieces of the camera]"

13. "The father stretched out his cupped hands which held all of the pieces they were able to recover, including the memory stick and he very solemnly said, 'El muerto...'."

14. "They had decided to come to Paris to find the Harley Davidson store so they could buy Harley Paris t-shirts."

15. "Michael Osman is an American artist living in Paris." "He supplements his income by being a tour guide." "I found out about him on Fodors" "So I engaged Michael for two days."

16. "On our trip to Giverny, we met a young woman from Brisbane, Australia who was traveling on her own and we invited her to join us. Three of us enjoyed delicious and innovative soufflés, while Paige had the rack of lamb. We shared two dessert soufflés, one chocolate and the other cherry/almond. Yum"

17. "I wanted Paige to get a feel for shopping experiences that she would not have at home (aka the ubiquitous mall)."

18. "We went on Fat Tire's day trip to Monet's gardens and house in Giverny, about an hour outside Paris."

19. ... "I know Paige will treasure the memory of this girl's trip for many years to come."

Woodside, Sood & Miller (2008) When Consumers and Brands Talk, Psychology & Marketing


The Newman Model of Deception (Pennebaker et al.)

Key word categories for deception mapping:

1. Self words e.g. “I” and “me” – decrease when someone distances themselves from content

2. Exclusive words e.g. “but” and “or” – decrease with fabricated content owing to the complexity of maintaining deception

3. Negative emotion words e.g. “hate” – increase owing to feelings of shame or guilt

4. Motion verbs e.g. “go” or “move” – increase as exclusive words go down to keep the story on track
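The four word categories lend themselves to a dictionary-based count. The sketch below is illustrative only: the word lists are tiny stand-ins I made up rather than the LIWC dictionaries, and the scoring formula (cues that increase minus cues that decrease) is a hypothetical simplification, so its scale is unrelated to the thresholds quoted for Instagram and Vine.

```python
import re

# Tiny illustrative word lists; assumptions, not the LIWC dictionaries.
CATEGORIES = {
    "self": {"i", "me", "my", "mine", "myself"},
    "exclusive": {"but", "or", "except", "without"},
    "negative_emotion": {"hate", "angry", "sad", "worthless"},
    "motion": {"go", "move", "walk", "run", "went"},
}


def category_rates(text):
    """Return each category's share of all words in the text (0.0 to 1.0)."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on empty text
    return {
        name: sum(word in vocab for word in words) / total
        for name, vocab in CATEGORIES.items()
    }


def deception_score(text):
    """Hypothetical score: higher means more of the cues linked to deception."""
    rates = category_rates(text)
    increases = rates["negative_emotion"] + rates["motion"]
    decreases = rates["self"] + rates["exclusive"]
    return increases - decreases


first_person = "I went to Paris but my camera broke, so I bought a new one myself."
distanced = "They go there, then move on, then go again and hate it."
print(deception_score(first_person), deception_score(distanced))
```

A plain Python dictionary of sets keeps the lookup cheap, which matters once the same count is mapped over millions of posts.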

Instagram Deception (Suspects outside of -20 & +20)

Vine Deception (Suspects outside of -5 and +5)

4b. Psychological State

• LIWC (analyzewords.com)

– Reveal personality from word usage

– Uses LIWC classification of words

• TweetPsych (tweetpsych.com/)

– Linguistic analysis using RID and LIWC

Note: TweetPsych is not without critics: http://psychcentral.com/blog/archives/2009/06/18/putting-cool-ahead-of-science-tweetpsych/

4c. Thematic Apperception Technique

Social CRM integrates “breadcrumb” data


5. Brand User Analytics

Aquarius, Aries, Cancer, Capricorn, Gemini, Leo, Libra, Pisces, Sagittarius, Scorpio, Taurus, Virgo

Ambivalent, Employee, Opposer, Reporter, Supporter

2. Marriages of Convenience, 3. Best Friendships, 4. Kinships, 5. Rebounds/Avoidance-Driven, 6. Courtships, 7. Dependencies, 8. Enmities, 9. Love-Hate, 11. Committed Partnerships, 12. Compartmentalised Friendship, 13. Childhood Friendship, 14. Courtship, 15. Fling, 16. Secret Affair, 17. Enslavement (Sweeney and Chew)

Africa, Argentina, Australia, Australia/Hong Kong, Austria, California, Canada, China, Egypt, England, Finland, France, Germany, Guernsey, Holland, India, Indonesia, Ireland, Israel, Italy, Japan, Kuwait, Malaysia, Nepal, Paraguay, Philippines, Phillipines, Portugual, Saudi Arabia, Singapore, South Africa, Spain, Sweden, Taiwan, Thailand, UK, USA

A&F, Beijing, Gucci, LVMH, New York, Old Navy, Paris, Sydney, Tiffany, Tokyo, Tommy, Versace

An-Verb, An-Vis, Hol-Verb, Hol-Vis

Depriv/Enhance, Enhance/Depriv

Variables and Data Types in Big Data Set


Model Comparison By Variables/Predictors

6. Supermarket Shopper Behavior

Beacon

Active Card

smart-dove.com

7. Smart Sandbag

The first 3 columns are the x, y, z axes of the gyroscope, followed by the x, y, z axes of the accelerometer. These are the raw data from 40 repetitions of a shoulder press exercise. Standard deviation and a moving average algorithm are used to build the chart, and a Hidden Markov Model to extract features and build a model of the exercise. All models are put into the cloud for trainee exercise scoring.
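The smoothing step described for the sandbag sensor data (moving average and standard deviation over the raw axes) can be sketched as follows. The samples, the window size and the choice to reduce each (x, y, z) reading to its magnitude are assumptions for illustration; the Hidden Markov Model stage is not covered.

```python
from math import sqrt
from statistics import mean, pstdev


def magnitude(sample):
    """Euclidean norm of one (x, y, z) accelerometer sample."""
    x, y, z = sample
    return sqrt(x * x + y * y + z * z)


def rolling(values, window, fn):
    """Apply fn to each full sliding window of the given size."""
    return [fn(values[i:i + window]) for i in range(len(values) - window + 1)]


# Hypothetical accelerometer samples (x, y, z); not data from the slides.
samples = [(0, 0, 1), (0, 3, 4), (0, 0, 1), (0, 3, 4), (0, 0, 1), (0, 3, 4)]

mags = [magnitude(s) for s in samples]
smooth = rolling(mags, 2, mean)      # moving average
spread = rolling(mags, 2, pstdev)    # moving (population) standard deviation
print(smooth, spread)
```

The smoothed series is what would be charted, while the moving standard deviation gives a simple signal of how jerky each part of a repetition is.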

8. Oil reserves shipment monitoring

Ras Tanura Najmah compound, Saudi Arabia

Source: http://www.skyboximaging.com/blog/monitoring-oil-reserves-from-space

Spark Streaming

GraphX

SparkSQL

MLlib

Square Kilometer Array (SKA)

• Data collected in a single day would take nearly two million years to play back on an MP3 player

• Central computer has processing power of about one hundred million PCs.

• SKA will use enough optical fiber linking up all the radio telescopes to wrap twice around the Earth.

• Dishes of SKA when fully operational will produce 10 times the global internet traffic as of 2013.

• Aperture arrays in the SKA could produce more than 100 times the global internet traffic as of 2013.

• The SKA will generate enough raw data to fill 15 million 64 GB MP3 players every day.

• The SKA supercomputer will perform 10^18 operations per second - equivalent to the number of stars in three million Milky Way galaxies - in order to process all the data that the SKA will produce.

• So sensitive that it will be able to detect an airport radar on a planet 50 light years away.

• Thousands of antennas with collecting area of about one square kilometer (that's 1,000,000 square meters).

• Previous mapping of Centaurus A galaxy took a team 12,000 hours of observations or several years. SKA ETA 5 minutes !

• In the first six hours of operation, SKA will generate more information than all previous radio telescopes in the world combined.

• The Square Kilometer Array will link 250,000 radio telescopes together, creating the most sensitive telescope.

To the scientists involved, however, the SKA is no testbed, it’s a transformative instrument which, according to Luijten, will lead to “fundamental discoveries of how life and planets and matter all came into existence. As a scientist, this is a once in a lifetime opportunity.”

Sources: http://bit.ly/amazin-facts & http://bit.ly/astro-ska
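Two of the SKA figures above can be turned into more familiar units with simple arithmetic. The inputs are the numbers quoted in the bullets; the decimal unit conversion (1 PB = 10^6 GB) and the derived stars-per-galaxy figure are my own working, shown only to make the scale concrete.

```python
# Daily raw data volume implied by "15 million 64 GB MP3 players every day".
PLAYERS_PER_DAY = 15_000_000   # from the slide
PLAYER_GB = 64                 # from the slide
GB_PER_PB = 1_000_000          # assumption: decimal units, 1 PB = 10**6 GB

raw_gb_per_day = PLAYERS_PER_DAY * PLAYER_GB
raw_pb_per_day = raw_gb_per_day / GB_PER_PB

# Stars per galaxy implied by "10^18 operations per second equals the
# number of stars in three million Milky Way galaxies".
OPS_PER_SECOND = 10 ** 18      # from the slide
GALAXIES = 3_000_000           # from the slide
implied_stars_per_galaxy = OPS_PER_SECOND / GALAXIES

print(f"{raw_pb_per_day:.0f} PB of raw data per day")
print(f"{implied_stars_per_galaxy:.2e} stars per galaxy implied")
```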

Centaurus A

Caution!

“Children never put off till tomorrow what will keep them from going to bed tonight”

ADVERTISING AGE