Stories in Data Science: It starts with a question · KNOWING THE KNOWABLE THROUGH DATA SCIENCE....

Kirk Borne (@KirkDBorne), Principal Data Scientist, Booz Allen Hamilton

http://www.boozallen.com/datascience

NOVA Data Science Meetup (12/7/2017)


Hanny’s Voorwerp –

It started with a question: “Anyone?”

How would you tag this image for easy search and discovery?

http://www.timesunion.com/news/slideshow/The-Northeast-Blackout-of-2003-47515.php

Galaxy Zoo helped scientists by asking the world a question: “Is it a Spiral Galaxy or Elliptical Galaxy?”

Galaxy Zoo project = Crowdsourcing massive data characterization (pattern recognition and detection), Social Engagement, Citizen Science!

– Over 1 million participants (and growing)

– Over 1 million galaxies have been labeled (classified)

– Over 300 million classifications have been collected

Hanny’s Voorwerp: “Anyone?”

What did Hanny do, how did she do it, and what was her discovery?

True color picture of Hanny’s Voorwerp: Hanny’s Object – the green blob is probably a light echo from an old Quasar that burned out 100,000 years ago

True color picture of Hanny van Arkel and KB!

https://twitter.com/dez_blanchfield/status/645139875440668672

THE MOST IMPORTANT “V” OF BIG DATA = VALUE!

THE 6 MOST IMPORTANT THINGS IN DATA SCIENCE

https://www.thisismetis.com/demystifying-data-science-recordings

1. The Data

2. The Science

3. Data Storytelling

4. Data Ethics

5. Data Literacy

6. The Data Scientist

KNOWING THE KNOWABLE THROUGH DATA SCIENCE

Don’t just explain to us how you used Machine Learning (= algorithms that learn from experience), but tell us what you discovered, why you did it, and what it now means!

1) Class Discovery: Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).

2) Correlation (Predictive and Prescriptive Power) Discovery: Find trends, patterns, and dependencies in data, which reveal new governing principles or behavioral patterns (the entity’s “DNA”).

3) Novelty (Surprise!) Discovery: Find new, rare, one-in-a-[million / billion / trillion] objects, events, and behaviors.

4) Association (or Link) Discovery: (Graph and Network Analytics) – Find the unusual (interesting) co-occurring associations / links / connections.

THE 5 LEVELS OF ANALYTICS MATURITY

Explain the level of analytics maturity that your Data Science is attempting to achieve.

1) Descriptive Analytics

– Hindsight (What happened?)

2) Diagnostic Analytics

– Oversight (real-time / What is happening? Why did it happen?)

3) Predictive Analytics

– Foresight (What will happen?)

4) Prescriptive Analytics

– Insight (How can we optimize what happens?) (Follow the dots / connections in the graph!)

5) Cognitive Analytics

– Right Sight (the 360 view, what is the right question to ask for this set of data in this context = Game of Jeopardy)

– Finds the right insight, the right action, the right decision,… right now! = Next Best Action!

– Moves beyond simply providing answers, to generating new questions and hypotheses.

HOT TRENDS TO WATCH

From the article: “Top 3 machine learning technologies to watch in the next 3 three years”

http://blogs.sas.com/content/sascom/2017/08/31/3-machine-learning-technologies-3-three-years/

1) Graph Analytics = “all the world is a graph!”

2) Geospatial Analytics = includes IoT, the Internet of “Context”

3) Natural Language Generation (NLG) and Narrative Science = automatic narrative generation that tells your data’s story

As data scientists, we must not only Walk The Talk, but we must also Talk The Walk.

MY #1 FAVORITE EMERGING HOT TOPIC: JOURNEY SCIENCE

1. Journey Science (see https://www.clickfox.com/)

- is an application of Graph Analytics (directed graphs) to infer predictive *and* prescriptive models of behavior;

- is applicable to many types of journeys: customer, employee, patient, cyber actor, and also products, processes, ideas, & machines.

2. Journeys are Stories (therefore, perfect for Data Storytelling)

3. Other applications:

- Discovery across disconnected document collections, through linked semantic assertions

- Causal Factor Analysis: Marketing Attribution, Safety Incident Investigation, …

- Fraud networks, money-laundering networks, illegal goods trafficking networks, …

https://mapr.com/blog/mapr-big-moves-marketing/

Context is King!

“You can see a lot just by looking.” – Yogi Berra

• Context is “other data” about your data = i.e., Metadata!

• The 3 most important things in your data are: Metadata, Metadata, Metadata!

• Metadata are…

– Other Data that describes Other Data

– Other Data that describes Your Data

– Your Data that describes Other Data

• Contextual data empowers both Prescriptive and Cognitive Analytics.

• IoT sensor data can provide a lot of contextual data (metadata!)

• The Internet used to be a thing. Now, things are the Internet.

The Internet of Context

https://www.geoforce.com/the-internet-of-things-iot-lives-in-the-oilfield-too/

The Internet of Things (IoT)

Data Stories – Some Atypical Applications of typical Machine Learning Algorithms

1) Neural Networks

2) Principal Component Analysis (PCA)

3) Graph Mining (Network Analysis)

4) Clustering and Validation

5) Association Mining (Link Analysis)

6) Bayesian Belief Networks

Before we start on those algorithms… let’s start with an easy one!

0) Counting!

Remember…

1) Neural Networks

Machine Learning = mathematical algorithms that learn from experience (i.e., learn patterns from previous data). Training the algorithms often requires back-propagation of errors. So, you need at least two wrong models in order to move toward the best!

Automated Wildfire Detection (and Prediction) through Artificial Neural Networks (ANN)

• Short Description of Wildfire Project:

– Identify all wildfires in Earth-observing satellite images

– Train ANN to mimic human analysts’ classifications

– Apply ANN to new data (from 3 remote-sensing satellites: GOES, AVHRR, MODIS)

– Extend NOAA fire product from USA to the whole Earth

Hazard Mapping System (HMS) ASCII Fire Product

OLD FORMAT

Lon, Lat
-80.531, 25.351
-81.461, 29.072
-83.388, 30.360
-95.004, 30.949
-93.579, 30.459
-108.264, 27.116
-108.195, 28.151
-108.551, 28.413
-108.574, 28.441
-105.987, 26.549
-106.328, 26.291
-106.762, 26.152
-106.488, 26.006
-106.516, 25.828

NEW FORMAT (as of May 16, 2003)

Lon, Lat, Time, Satellite, Method of Detection
-80.597, 22.932, 1830, MODIS AQUA, MODIS
-79.648, 34.913, 1829, MODIS, ANALYSIS
-81.048, 33.195, 1829, MODIS, ANALYSIS
-83.037, 36.219, 1829, MODIS, ANALYSIS
-83.037, 36.219, 1829, MODIS, ANALYSIS
-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA
-84.465, 48.926, 2130, GOES-WEST, ABBA
-84.481, 48.888, 2230, GOES-WEST, ABBA
-84.521, 48.864, 2030, GOES-WEST, ABBA
-84.557, 48.891, 1835, MODIS AQUA, MODIS
-84.561, 48.881, 1655, MODIS TERRA, MODIS
-84.561, 48.881, 1835, MODIS AQUA, MODIS
-89.433, 36.827, 1700, MODIS TERRA, MODIS
-89.750, 36.198, 1845, GOES, ANALYSIS
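As a rough illustration of how the new-format fire product could be read, here is a minimal Python sketch. The field names in `FIELDS` and the helper `parse_hms` are hypothetical, chosen only for this example; the sample lines are copied from the listing, and the time column is kept as text because the slide does not state its convention.

```python
import csv
from collections import Counter
from io import StringIO

# A few records copied from the NEW FORMAT listing.
SAMPLE = """\
-80.597, 22.932, 1830, MODIS AQUA, MODIS
-79.648, 34.913, 1829, MODIS, ANALYSIS
-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA
-84.465, 48.926, 2130, GOES-WEST, ABBA
"""

# Hypothetical field names for the five columns.
FIELDS = ["lon", "lat", "time", "satellite", "method"]

def parse_hms(text):
    """Parse new-format lines into dictionaries with numeric coordinates."""
    records = []
    for row in csv.reader(StringIO(text), skipinitialspace=True):
        if len(row) != len(FIELDS):
            continue  # skip blank or malformed lines
        rec = dict(zip(FIELDS, (cell.strip() for cell in row)))
        rec["lon"] = float(rec["lon"])
        rec["lat"] = float(rec["lat"])
        records.append(rec)
    return records

records = parse_hms(SAMPLE)
# Count detections per method of detection.
by_method = Counter(rec["method"] for rec in records)
```

The extra columns in the new format are what make per-satellite or per-method counts like `by_method` possible; the old format carries only coordinates.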

GOES CH2 (3.78 - 4.03 μm) – Northern Florida Fire

2003: Day 126, –82.10 Deg West Longitude, 30.49 Deg North Latitude

[Image: florida_ch2.png]

Zoom of GOES CH2 (3.78 - 4.03 μm) – Northern Florida Fire

2003: Day 126, –82.10 Deg W Long, 30.49 Deg N Lat

Local minimum in vicinity of core pixel used as fire location.

[Images: florida_fire_ch2_zoom.png, florida_ch2_zoom.png]

Neural Network Configuration for Wildfire Detection Neural Network

[Diagram: Input Layer 0 → Connections (weights) → Hidden Layer 1 → Connections (weights) → Output Layer 2 → Output Classification (fire / no-fire)]

Input Layer 0:

Band A – Inputs: 1 - 49

Band B – Inputs: 50 - 98

Band C – Inputs: 99 - 147
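The layer layout on this slide (147 inputs, one hidden layer, one fire / no-fire output) can be sketched as a plain forward pass. This is only a sketch under assumptions: the hidden-layer size `N_HIDDEN`, the sigmoid activation, the 0.5 threshold, and the random untrained weights are all hypothetical, since the slide gives none of them.

```python
import math
import random

N_INPUTS = 147   # 3 bands x 49 inputs each, as on the slide
N_HIDDEN = 10    # hypothetical: the slide does not give the hidden-layer size

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def init_layer(n_in, n_out, rng):
    """Small random weights, plus one bias stored as the last entry of each row."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]

def layer_forward(weights, inputs):
    """Weighted sum of the inputs plus the bias, passed through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(row, inputs)) + row[-1])
        for row in weights
    ]

def classify(pixels, hidden_w, output_w, threshold=0.5):
    """Input Layer 0 -> Hidden Layer 1 -> Output Layer 2 -> class label."""
    hidden = layer_forward(hidden_w, pixels)
    score = layer_forward(output_w, hidden)[0]
    return ("fire" if score >= threshold else "no-fire"), score

rng = random.Random(0)
hidden_w = init_layer(N_INPUTS, N_HIDDEN, rng)
output_w = init_layer(N_HIDDEN, 1, rng)
pixels = [rng.random() for _ in range(N_INPUTS)]  # made-up input values
label, score = classify(pixels, hidden_w, output_w)
```

With untrained weights the label is meaningless; training against the analysts' classifications is what would make the output useful.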

Classification Accuracy

Error Matrix (rows: classifier output; columns: TRAINING DATA, actual classes):

            Class-A      Class-B      Totals
Class-A     2834 (TP)    173 (FP)     3007
Class-B     318 (FN)     3103 (TN)    3421
Totals      3152         3276         6428

TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative

Typical Measures of Accuracy

• Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)

• Producer’s Accuracy (fire) = TP/(TP+FN)

• Producer’s Accuracy (nonfire) = TN/(FP+TN)

• User’s Accuracy (fire) = TP/(TP+FP)

• User’s Accuracy (nonfire) = TN/(TN+FN)

Accuracy of our Classification on previous slide

• Overall Accuracy = 92.4%

• Producer’s Accuracy (fire) = 89.9%

• Producer’s Accuracy (nonfire) = 94.7%

• User’s Accuracy (fire) = 94.2%

• User’s Accuracy (nonfire) = 90.7%
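The five accuracy measures can be checked directly from the error matrix counts (TP = 2834, FP = 173, FN = 318, TN = 3103). A minimal Python sketch; the function and key names are illustrative only:

```python
# Counts taken from the error matrix slide.
TP, FP, FN, TN = 2834, 173, 318, 3103

def accuracy_measures(tp, fp, fn, tn):
    """Compute the five measures using the formulas listed on the slide."""
    return {
        "overall": (tp + tn) / (tp + tn + fp + fn),
        "producer_fire": tp / (tp + fn),
        "producer_nonfire": tn / (fp + tn),
        "user_fire": tp / (tp + fp),
        "user_nonfire": tn / (tn + fn),
    }

measures = accuracy_measures(TP, FP, FN, TN)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

Rounded to one decimal place, these reproduce the percentages listed above.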

Ask me to compare this with the TSA example!! ☺

Schematic Approach to Avoiding Overfitting

[Figure: Error versus Training Epoch, with curves for Training Set error and Validation Set error; a marker reads “STOP Training HERE!”]

To avoid overfitting, you need to know when to stop training the model. Although the Training Set error may continue to decrease, you may simply be overfitting the Training Data. Test this by applying the model to Validation Data Set (not part of Training Set). If the Validation Data Set error starts to increase, then you know that you are overfitting the Training Set and it is time to stop!
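The stopping rule described here is commonly implemented as "early stopping". Below is a minimal sketch, assuming a made-up pair of error curves and a hypothetical `patience` parameter (how many epochs of no improvement to tolerate before stopping); neither comes from the slide.

```python
def early_stopping_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before the
    error has failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # validation error keeps rising: stop training
    return best_epoch

# Made-up curves: training error keeps falling, validation error turns upward.
training_errors   = [0.90, 0.70, 0.55, 0.45, 0.38, 0.33, 0.29, 0.26, 0.24, 0.22]
validation_errors = [0.92, 0.74, 0.61, 0.53, 0.49, 0.48, 0.50, 0.53, 0.57, 0.62]

stop_at = early_stopping_epoch(validation_errors)
```

In practice the model weights from the best epoch would be kept, so the extra `patience` epochs cost training time but not accuracy.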

2) Principal Component Analysis (PCA)

PCA vs ICA

Initial impression is that the data are extended in only one direction (one principal component).

But, there are 2 independent correlations here … hence there are 2 signal sources! (an example of Class Discovery using ICA = Independent Component Analysis)
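To see how PCA can give that "one direction" impression, here is a small sketch on made-up data: two sources with different slopes are mixed, and a closed-form 2-D PCA is applied. The data and the helper `pca_2d` are assumptions for illustration, and the sketch covers only the PCA half of the comparison; it does not implement ICA.

```python
import math

def pca_2d(points):
    """Return (eigenvalues, first principal axis) of the 2x2 covariance matrix."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    a = sum((x - mean_x) ** 2 for x, _ in points) / (n - 1)
    c = sum((y - mean_y) ** 2 for _, y in points) / (n - 1)
    b = sum((x - mean_x) * (y - mean_y) for x, y in points) / (n - 1)
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]].
    centre = (a + c) / 2
    radius = math.hypot((a - c) / 2, b)
    lam1, lam2 = centre + radius, centre - radius
    # Eigenvector for lam1 is (b, lam1 - a), normalised (valid when b != 0).
    norm = math.hypot(b, lam1 - a)
    axis = (b / norm, (lam1 - a) / norm)
    return (lam1, lam2), axis

# Made-up data: two signal sources, each a straight line with its own slope.
ts = range(-5, 6)
source_1 = [(t, 1.0 * t) for t in ts]
source_2 = [(t, 0.5 * t) for t in ts]

(lam1, lam2), axis = pca_2d(source_1 + source_2)
explained = lam1 / (lam1 + lam2)  # share of variance on the first component
```

Here the first component carries almost all of the variance and points between the two slopes, so PCA alone would not reveal that two separate correlations are present.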

3) Graph Mining (Network Analysis)

“All the World is a Graph” – Shakespeare?

The natural data structure of the world is not rows and columns, but a Graph!

(Graphic by Cray, for Cray Graph Engine CGE) http://www.cray.com/products/analytics/cray-graph-engine

The Human Connectome Project: mapping and linking the major pathways in the brain. http://www.humanconnectomeproject.org

“All the World is a Graph” – Shakespeare?

• Smart Data = annotating our data with its context and meaning…

• … the Semantics! This is based on Ontologies.

• My students memorized the definition of an Ontology…

– “is_a formal, explicit specification of a shared conceptualization.” from Tom Gruber (Stanford)

• Semantic “facts” can be expressed in a database as RDF triples:

{subject, predicate, object} = {noun, verb, noun}
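A minimal sketch of how {subject, predicate, object} triples can be stored and queried, using plain Python tuples rather than a real RDF store. The triples are illustrative, loosely based on the Galaxy Zoo slides, and the `match` helper is hypothetical.

```python
# Each fact is one (subject, predicate, object) triple.
triples = {
    ("Hanny", "discovered", "Hanny's Voorwerp"),
    ("Hanny", "participates_in", "Galaxy Zoo"),
    ("Galaxy Zoo", "is_a", "citizen science project"),
}

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

about_hanny = match(triples, subject="Hanny")
is_a_facts = match(triples, predicate="is_a")
```

Because every fact has the same three-part shape, following the object of one triple to the subject of another is what turns a set of triples into a graph.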


Simple Example of the Power of Graph: Semi-Metric Space

• Entity {1} is linked to Entity {2} (small distance A)

• Entity {2} is linked to Entity {3} (small distance B)

• Entity {1} is *not* linked directly to Entity {3} (Similarity Distance C = infinite)

• Similarity Distances between A, B, and C violate the triangle inequality!

• The connection between black hat entities {1} and {3} never appears explicitly in a link network, or within a transactional database.

• Examples: (a) Medical Research Discoveries across disconnected journals, through linked semantic assertions; (b) Customer Journey modeling; (c) Safety Incident Causal Factor Analysis; (d) Marketing Attribution Analysis; (e) Fraud networks, Illegal goods trafficking networks, Money-Laundering networks.
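The three-entity example can be sketched as a search for pairs whose direct distance is infinite (no link) but whose indirect path is short. This is an illustrative sketch only: the distances 0.2 and 0.3 stand in for A and B and are made up, and the function names are hypothetical.

```python
import math

def shortest_paths(nodes, direct):
    """Floyd-Warshall all-pairs shortest path lengths on an undirected graph."""
    dist = {(u, v): (0.0 if u == v else math.inf) for u in nodes for v in nodes}
    for (u, v), d in direct.items():
        dist[u, v] = dist[v, u] = min(dist[u, v], d)
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i, k] + dist[k, j] < dist[i, j]:
                    dist[i, j] = dist[i, k] + dist[k, j]
    return dist

def hidden_links(nodes, direct):
    """Pairs with no direct link (distance C = infinite) but a finite indirect path."""
    dist = shortest_paths(nodes, direct)
    linked = {frozenset(pair) for pair in direct}
    return {
        (u, v): dist[u, v]
        for i, u in enumerate(nodes) for v in nodes[i + 1:]
        if frozenset((u, v)) not in linked and math.isfinite(dist[u, v])
    }

nodes = [1, 2, 3]
direct = {(1, 2): 0.2, (2, 3): 0.3}  # A and B: made-up small distances
found = hidden_links(nodes, direct)
```

The pair (1, 3) is reported because C (infinite) exceeds A + B, which is exactly the triangle-inequality violation the slide points to.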

Research Example: Discovery in the NIH-NLM Semantic MEDLINE Database

Project Description: Conduct semantic graph mining of the NIH-NLM metadata repository from ~26 million medical research articles.

Graph Database: ~90 million RDF triples (predications; semantic assertions).

Research Project: (PhD dissertation at GMU) Novel subgraph discovery; Context-based discovery; New concept emergence in medical research; Story discovery in linked graph network; and Hidden knowledge discovery through semi-metrics.

https://skr3.nlm.nih.gov/SemMedDB/

4) Clustering and Validation

Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that cluster than to data elements belonging to other clusters.

= Grouping together similar items, and separating dissimilar items

= Identifying similar characteristics, patterns, or behaviors among subsets of the data elements.

Challenges:

#1) No prior knowledge of the number of clusters.

#2) No prior knowledge of semantic meaning of the clusters.

#3) Different clusters are possible from the same data set!

#4) Selecting different features can lead to different clusters.

How do you know if your clusters are good enough?

The number of clusters is not known

There might not exist a “correct” number of clusters

Results depend on which attributes are selected

Results depend on the choice of distance/similarity metric

Therefore, there is no “correct” set of clusters.

So, how do you know what is a good set of clusters?

You know the clusters are good …

… if the clusters are compact relative to their separation

… if the clusters are well separated from one another

… if the “within cluster” errors are small (low variance within)

… if the number of clusters is small relative to the number of data points

Various measures of cluster compactness exist, including the Dunn index, C-index, and the DBI (Davies-Bouldin Index)

Reference: http://www.biomedcentral.com/content/supplementary/1471-2105-9-90-S2.pdf

Application of Davies-Bouldin Index

Assume k (number of clusters) and assume other things (choice of clustering algorithm; the choice of clustering feature attributes; etc.)

Measure DBI

Test another set of values for the cluster input parameters (k, feature attributes, etc.)

Measure DBI

… continue iterating like this until you find the set of cluster input parameters that yields the best (minimum) value for DBI.
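The iterate-and-measure loop above can be sketched with a small pure-Python Davies-Bouldin function. This uses the standard definition (average, over clusters, of the worst ratio of summed within-cluster scatter to centroid separation). The points and the two candidate groupings are made up, and no clustering algorithm is run: the sketch only scores groupings that are already given.

```python
import math

def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))

def scatter(points, centre):
    """Average Euclidean distance from each point to its cluster centroid."""
    return sum(math.dist(p, centre) for p in points) / len(points)

def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (needs at least 2); lower is better."""
    centres = [centroid(c) for c in clusters]
    scatters = [scatter(c, m) for c, m in zip(clusters, centres)]
    k = len(clusters)
    worst = [
        max(
            (scatters[i] + scatters[j]) / math.dist(centres[i], centres[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst) / k

# Made-up data: three tight blobs.
a = [(0, 0), (0, 1), (1, 0), (1, 1)]
b = [(10, 0), (10, 1), (11, 0), (11, 1)]
c = [(0, 10), (0, 11), (1, 10), (1, 11)]

# Two candidate sets of cluster input parameters (here, only k differs).
candidates = {"k=2": [a + b, c], "k=3": [a, b, c]}
scores = {name: davies_bouldin(groups) for name, groups in candidates.items()}
best = min(scores, key=scores.get)
```

In a real study each candidate would come from rerunning the clustering algorithm with different parameters, and the minimum-DBI candidate would be kept, as the slide describes.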

Cluster Analysis: Find the clusters, then Evaluate them

Scientific Discovery from Cluster Analysis of data parameters from events on the Sun and around the Earth

[Figure: “DBI for Dst_Vsw_Bz”. x-axis: Time Shift, the delay (hr) of Dst from Vsw and Bz, 0 to 12; y-axis: D-B Index, 0.8 to 1.2; series: 2C DBI, 3C DBI, 4C DBI, Average.]

Figure 10. Davies-Bouldin index for various time delays of Dst from Vsw and Bz for cases of 2 (blue), 3 (red), 4 (yellow) clusters, and the overall average (purple), indicating an optimal delay of ~2-3 hours for Dst.

Good Clusters = Small Size relative to Cluster Separation.

DISCOVERY! ... Solar wind events have the strongest association (i.e., the tightest clusters) with the space plasma events within the Earth’s magnetosphere about 2-4 hours after a major plasma outburst occurs on the Sun.

5) Association Mining (Link Analysis)

4 Examples of Big Data Association Mining:

The goal of Rec Sys algorithms is Diversity!

Example #1

Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.

Example #2

Amazon.com mines its customers’ purchase logs to recommend books to you: “People who bought this book also bought this other one.”

Example #3

Netflix mines its video rental history database to recommend rentals to you based upon other customers who rented similar movies as you.

Example #4

Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida. Wal-Mart found that, before the hurricanes arrived, people purchased 7 times as many of {one particular product} compared to everything else.

Wal-Mart found that, before the hurricanes arrived, people purchased 7 times as many strawberry pop tarts compared to everything else.

Strawberry pop tarts???

http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html

http://www.hurricaneville.com/pop_tarts.html

http://bit.ly/1gHZddA
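Co-occurrence claims like "people who buy diapers also tend to buy beer" are usually quantified with support, confidence, and lift. A minimal sketch on made-up baskets (none of the numbers come from the slides):

```python
# Made-up shopping baskets for illustration.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def rule_metrics(antecedent, consequent, baskets):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    s_both = support(antecedent | consequent, baskets)
    s_ante = support(antecedent, baskets)
    s_cons = support(consequent, baskets)
    confidence = s_both / s_ante   # P(consequent | antecedent)
    lift = confidence / s_cons     # > 1 means bought together more often than by chance
    return {"support": s_both, "confidence": confidence, "lift": lift}

metrics = rule_metrics({"diapers"}, {"beer"}, baskets)
```

A lift above 1 is what separates an interesting association from two items that are each simply popular.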

Association Rule Mining for Hurricane Intensification Prediction

• Research by GMU geoscientists

• Predict the final strength of hurricane at landfall.

• Find co-occurrence of final hurricane strength with specific values of measured physical properties of the hurricane while it is still over the ocean.

• Result: the association mining model prediction is better than National Hurricane Center prediction!

• Research Paper by GMU scientists: https://ams.confex.com/ams/pdfpapers/84949.pdf

6) Bayesian Belief Networks

Bayes Theorem

• Bayes Theorem…

• Naïve Bayes assumption

http://www.datasciencecentral.com/profiles/blogs/6-easy-steps-to-learn-naive-bayes-algorithm-with-code-in-python
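The formula itself did not survive in this transcript. For reference, Bayes' theorem in its standard form is P(H|E) = P(E|H) P(H) / P(E), and the naïve Bayes assumption treats the features as conditionally independent given the class. A small numeric sketch, with made-up probabilities:

```python
def posterior(prior, likelihood, false_positive_rate):
    """Bayes' theorem: P(H|E) = P(E|H) P(H) / P(E), where
    P(E) = P(E|H) P(H) + P(E|not H) P(not H)."""
    evidence = likelihood * prior + false_positive_rate * (1.0 - prior)
    return likelihood * prior / evidence

# Made-up numbers: a rare hypothesis and a fairly reliable indicator.
p_h_given_e = posterior(prior=0.01, likelihood=0.90, false_positive_rate=0.05)
```

Even with a reliable indicator, a low prior keeps the posterior modest, which is the usual first lesson drawn from the theorem.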

Bayes Theorem

• Bayes Theorem… now with Legos


https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego

Bayes Theorem

• … for missing value imputation

• Bad idea: inserting estimated values of missing data elements = Data Creation!

• Better idea: predicting a value that is not knowable in advance = Predictive Analytics!

Bayes Belief Networks

• … for missing value imputation … Example:

• Use all conditional probabilities across all database attributes to predict the missing value.
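As a simplified stand-in for a full Bayesian belief network, the idea of predicting a missing value from conditional probabilities can be sketched with a naïve Bayes calculation. Everything here is an assumption for illustration: the records are made up, the attribute names are hypothetical, and the attributes are treated as conditionally independent, which a real belief network would not require.

```python
from collections import Counter

# Made-up complete records: known attributes, plus the attribute to impute.
complete = [
    ({"shape": "spiral", "color": "blue"}, "low"),
    ({"shape": "spiral", "color": "blue"}, "low"),
    ({"shape": "spiral", "color": "red"}, "high"),
    ({"shape": "elliptical", "color": "red"}, "high"),
    ({"shape": "elliptical", "color": "red"}, "high"),
    ({"shape": "elliptical", "color": "blue"}, "low"),
]

def impute_distribution(known, complete, alpha=1.0):
    """Probability distribution over the missing value: prior times the
    per-attribute conditional probabilities (Laplace-smoothed), normalised."""
    label_counts = Counter(label for _, label in complete)
    scores = {}
    for label, n_label in label_counts.items():
        score = n_label / len(complete)  # prior P(label)
        for attr, value in known.items():
            n_values = len({rec[attr] for rec, _ in complete})
            n_match = sum(
                1 for rec, lab in complete if lab == label and rec[attr] == value
            )
            score *= (n_match + alpha) / (n_label + alpha * n_values)
        scores[label] = score
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}

dist = impute_distribution({"shape": "spiral", "color": "blue"}, complete)
```

The result is a probability distribution rather than a single filled-in value, which fits the "better idea" framing above: the missing value is predicted, not created.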

Bayes Belief Networks for Cosmology: PhD dissertation by Dr. Pragyan Nayak

• …for missing value imputation: Galaxy Redshifts

• The problem: less than 0.1% of catalogued galaxies have a measured redshift = Distance!

• Bigger sky surveys are coming in the next 10 yrs

• Less than 0.001% of galaxies will have distance estimate!!

• Traditional method: use colors of galaxies (red-shift…)

• BBN method: use all properties of galaxies (shape, size, color, texture, concentration,…)

• Result: probability distribution of redshift for each galaxy!

• Consequence: map the galaxy mass density of Universe!

Finally… One more story…

Data Mining what?