April 2014 building data science keynote at Boston Data Science Meetup - Crowdsourcing Data Science
Stories in Data Science: It starts with a question
Principal Data Scientist
Booz Allen Hamilton
http://www.boozallen.com/datascience
Kirk Borne (@KirkDBorne)
NOVA Data Science Meetup (12/7/2017)
How would you tag this image for easy search and discovery?
http://www.timesunion.com/news/slideshow/The-Northeast-Blackout-of-2003-47515.php
Galaxy Zoo helped scientists by asking the world a question: “Is it a Spiral Galaxy or an Elliptical Galaxy?”
Galaxy Zoo project = Crowdsourcing massive data characterization (pattern recognition and detection), Social Engagement, Citizen Science!
– Over 1 million participants (and growing)
– Over 1 million galaxies have been labeled (classified)
– Over 300 million classifications have been collected
True color picture of Hanny’s Voorwerp: Hanny’s Object – the green blob is probably a light echo from an old Quasar that burned out 100,000 years ago
https://twitter.com/dez_blanchfield/status/645139875440668672
THE MOST IMPORTANT “V” OF BIG DATA = VALUE!
THE 6 MOST IMPORTANT THINGS IN DATA SCIENCE
https://www.thisismetis.com/demystifying-data-science-recordings
1. The Data
2. The Science
3. Data Storytelling
4. Data Ethics
5. Data Literacy
6. The Data Scientist
KNOWING THE KNOWABLE THROUGH DATA SCIENCE
Don’t just explain to us how you used Machine Learning (= algorithms that learn from experience), but tell us what you discovered, why you did it, and what it now means!
1) Class Discovery: Find the categories of objects (population segments), events, and behaviors in your data. + Learn the rules that constrain the class boundaries (that uniquely distinguish them).
2) Correlation (Predictive and Prescriptive Power) Discovery: Find trends, patterns, and dependencies in data, which reveal new governing principles or behavioral patterns (the entity’s “DNA”).
3) Novelty (Surprise!) Discovery: Find new, rare, one-in-a-[million / billion / trillion] objects, events, and behaviors.
4) Association (or Link) Discovery: (Graph and Network Analytics) – Find the unusual (interesting) co-occurring associations / links / connections.
THE 5 LEVELS OF ANALYTICS MATURITY
Explain the level of analytics maturity that your Data Science is attempting to achieve.
1) Descriptive Analytics
– Hindsight (What happened?)
2) Diagnostic Analytics
– Oversight (real-time / What is happening?
Why did it happen?)
3) Predictive Analytics
– Foresight (What will happen?)
4) Prescriptive Analytics
– Insight (How can we optimize what happens?)
(Follow the dots / connections in the graph!)
5) Cognitive Analytics
– Right Sight (the 360 view: what is the right question to ask for this set of data in this context = Game of Jeopardy)
– Finds the right insight, the right action, the right decision,… right now! = Next Best Action!
– Moves beyond simply providing answers, to generating new questions and hypotheses.
As data scientists, we must not only Walk The Talk, but we must also Talk The Walk.
From the article: “Top 3 machine learning technologies to watch in the next 3 three years”
http://blogs.sas.com/content/sascom/2017/08/31/3-machine-learning-technologies-3-three-years/
1) Graph Analytics = “all the world is a graph!”
2) Geospatial Analytics = includes IoT, the Internet of “Context”
3) Natural Language Generation (NLG) and Narrative Science = automatic narrative generation that tells your data’s story
HOT TRENDS TO WATCH
1. Journey Science (see https://www.clickfox.com/)
- is an application of Graph Analytics (directed graphs) to infer predictive *and* prescriptive models of behavior;
- is applicable to many types of journeys: customer, employee, patient, cyber actor, and also products, processes, ideas, & machines.
2. Journeys are Stories (therefore, perfect for Data Storytelling)
3. Other applications:
- Discovery across disconnected document collections, through linked semantic assertions
- Causal Factor Analysis: Marketing Attribution, Safety Incident Investigation, …
- Fraud networks, money-laundering networks, illegal goods trafficking networks, …
https://mapr.com/blog/mapr-big-moves-marketing/
MY #1 FAVORITE EMERGING HOT TOPIC: JOURNEY SCIENCE
Context is King!
“You can see a lot just by looking.” – Yogi Berra
• Context is “other data” about your data, i.e., Metadata!
• The 3 most important things in your data are: Metadata, Metadata, Metadata!
• Metadata are…
– Other Data that describes Other Data
– Other Data that describes Your Data
– Your Data that describes Other Data
• Contextual data empowers both Prescriptive and Cognitive Analytics.
• IoT sensor data can provide a lot of contextual data (metadata!)
• The Internet used to be a thing. Now, things are the Internet.
The Internet of Context
https://www.geoforce.com/the-internet-of-things-iot-lives-in-the-oilfield-too/
The Internet of Things (IoT)
Data Stories – Some Atypical Applications of typical Machine Learning Algorithms
1) Neural Networks
2) Principal Component Analysis (PCA)
3) Graph Mining (Network Analysis)
4) Clustering and Validation
5) Association Mining (Link Analysis)
6) Bayesian Belief Networks
Before we start on those algorithms… let’s start with an easy one!
0) Counting!
Remember…
1) Neural Networks
Machine Learning = mathematical algorithms that learn from experience (i.e., learn patterns from previous data). Training the algorithms often requires back-propagation of errors. So, you need at least two wrong models in order to move toward the best!
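To make "learning from experience" concrete, here is a minimal sketch, assuming a one-weight model y = w * x fitted by gradient descent on squared error. This is the single-weight special case of the error-driven updates that back-propagation applies layer by layer in a neural network. The sample data, learning rate, and the name `train_weight` are illustrative assumptions, not from the talk.

```python
def train_weight(samples, learning_rate=0.01, epochs=200):
    """Fit y = w * x by repeatedly correcting w in proportion to its error."""
    w = 0.0  # start from a deliberately wrong model
    history = []
    for _ in range(epochs):
        # Gradient of the mean squared error with respect to w
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= learning_rate * grad  # step against the error gradient
        mse = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        history.append(mse)
    return w, history


# Hypothetical training data generated from y = 3x
samples = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0), (4.0, 12.0)]
w, history = train_weight(samples)
print(f"learned w = {w:.4f}, first error = {history[0]:.3f}, last error = {history[-1]:.2e}")
```

Each pass produces a slightly less wrong model than the one before, which is the sense in which the wrong models are needed to move toward the best one.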
Automated Wildfire Detection (and Prediction)
through Artificial Neural Networks (ANN)
• Short Description of Wildfire Project:
– Identify all wildfires in Earth-observing satellite images
– Train ANN to mimic human analysts’ classifications
– Apply ANN to new data (from 3 remote-sensing satellites: GOES, AVHRR, MODIS)
– Extend NOAA fire product from USA to the whole Earth
Hazard Mapping System (HMS) ASCII Fire Product

OLD FORMAT
Lon, Lat
-80.531, 25.351
-81.461, 29.072
-83.388, 30.360
-95.004, 30.949
-93.579, 30.459
-108.264, 27.116
-108.195, 28.151
-108.551, 28.413
-108.574, 28.441
-105.987, 26.549
-106.328, 26.291
-106.762, 26.152
-106.488, 26.006
-106.516, 25.828

NEW FORMAT (as of May 16, 2003)
Lon, Lat, Time, Satellite, Method of Detection
-80.597, 22.932, 1830, MODIS AQUA, MODIS
-79.648, 34.913, 1829, MODIS, ANALYSIS
-81.048, 33.195, 1829, MODIS, ANALYSIS
-83.037, 36.219, 1829, MODIS, ANALYSIS
-83.037, 36.219, 1829, MODIS, ANALYSIS
-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA
-84.465, 48.926, 2130, GOES-WEST, ABBA
-84.481, 48.888, 2230, GOES-WEST, ABBA
-84.521, 48.864, 2030, GOES-WEST, ABBA
-84.557, 48.891, 1835, MODIS AQUA, MODIS
-84.561, 48.881, 1655, MODIS TERRA, MODIS
-84.561, 48.881, 1835, MODIS AQUA, MODIS
-89.433, 36.827, 1700, MODIS TERRA, MODIS
-89.750, 36.198, 1845, GOES, ANALYSIS
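The new-format rows above are plain comma-separated text, so they can be read into records with a few lines of code. This is a sketch only: the record type `FireDetection`, the field name `hhmm`, and the function `parse_new_format` are hypothetical names chosen for illustration, and the time field is kept as text because the slide does not state its time zone.

```python
from typing import NamedTuple


class FireDetection(NamedTuple):
    lon: float
    lat: float
    hhmm: str        # time of detection as printed in the file
    satellite: str
    method: str


def parse_new_format(line):
    """Split one 'Lon, Lat, Time, Satellite, Method of Detection' row."""
    lon, lat, hhmm, satellite, method = [field.strip() for field in line.split(",")]
    return FireDetection(float(lon), float(lat), hhmm, satellite, method)


# Three rows copied from the new-format listing
rows = [
    "-80.597, 22.932, 1830, MODIS AQUA, MODIS",
    "-85.767, 49.517, 1805, AVHRR NOAA-16, FIMMA",
    "-84.465, 48.926, 2130, GOES-WEST, ABBA",
]
detections = [parse_new_format(row) for row in rows]

# Group detections by detection method
by_method = {}
for detection in detections:
    by_method.setdefault(detection.method, []).append(detection)
print(sorted(by_method))
```

The extra columns in the new format (time, satellite, method) are exactly the kind of contextual metadata that makes the detections usable as training labels.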
GOES CH2 (3.78 - 4.03 μm) – Northern Florida Fire
2003: Day 126, –82.10 Deg West Longitude, 30.49 Deg North Latitude
File: florida_ch2.png
Zoom of GOES CH2 (3.78 - 4.03 μm) – Northern Florida Fire
2003: Day 126, –82.10 Deg W Long, 30.49 Deg N Lat
Local minimum in vicinity of core pixel used as fire location.
File: florida_fire_ch2_zoom.png
File: florida_ch2_zoom.png
Neural Network Configuration for Wildfire Detection
Input (Layer 0):
– Band A: Inputs 1 - 49
– Band B: Inputs 50 - 98
– Band C: Inputs 99 - 147
Connections (weights)
Hidden (Layer 1)
Connections (weights)
Output (Layer 2)
Output Classification: (fire / no-fire)
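The configuration above (147 inputs from three bands, one hidden layer, one fire / no-fire output) can be sketched as a forward pass. This is an illustration under stated assumptions, not the project's actual network: the hidden-layer size, the sigmoid activation, the 0.5 threshold, and the random untrained weights are all assumptions, since the slide gives only the layer layout.

```python
import math
import random

N_INPUTS = 147   # 3 bands x 49 inputs each, as on the slide
N_HIDDEN = 10    # assumption: the slide does not give the hidden-layer size


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_layer(n_in, n_out, rng):
    """One weight vector per output unit; the last element is the bias."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_out)]


def layer_forward(weights, inputs):
    """Weighted sum plus bias for each unit, passed through a sigmoid."""
    return [
        sigmoid(sum(w * x for w, x in zip(unit[:-1], inputs)) + unit[-1])
        for unit in weights
    ]


def classify(band_a, band_b, band_c, hidden_w, output_w, threshold=0.5):
    inputs = band_a + band_b + band_c          # inputs 1-49, 50-98, 99-147
    assert len(inputs) == N_INPUTS
    hidden = layer_forward(hidden_w, inputs)   # Layer 1
    score = layer_forward(output_w, hidden)[0] # Layer 2
    return ("fire" if score >= threshold else "no-fire"), score


rng = random.Random(0)
hidden_w = init_layer(N_INPUTS, N_HIDDEN, rng)
output_w = init_layer(N_HIDDEN, 1, rng)
band = [rng.random() for _ in range(49)]       # placeholder pixel values
label, score = classify(band, band, band, hidden_w, output_w)
print(label, round(score, 3))
```

With untrained weights the label is meaningless; training adjusts the two sets of connection weights until the output mimics the human analysts' classifications.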
Classification Accuracy
Error Matrix (columns: TRAINING DATA, actual classes; rows: classifier output):

            Class-A      Class-B      Totals
Class-A     2834 (TP)    173 (FP)     3007
Class-B     318 (FN)     3103 (TN)    3421
Totals      3152         3276         6428

TP = True Positive, FP = False Positive, FN = False Negative, TN = True Negative
Typical Measures of Accuracy
• Overall Accuracy = (TP+TN)/(TP+TN+FP+FN)
• Producer’s Accuracy (fire) = TP/(TP+FN)
• Producer’s Accuracy (nonfire) = TN/(FP+TN)
• User’s Accuracy (fire) = TP/(TP+FP)
• User’s Accuracy (nonfire) = TN/(TN+FN)
Accuracy of our Classification on the previous slide
• Overall Accuracy = 92.4%
• Producer’s Accuracy (fire) = 89.9%
• Producer’s Accuracy (nonfire) = 94.7%
• User’s Accuracy (fire) = 94.2%
• User’s Accuracy (nonfire) = 90.7%
Ask me to compare this with the TSA example!! ☺
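The five percentages follow directly from the four cells of the error matrix (TP = 2834, FP = 173, FN = 318, TN = 3103). A short sketch that applies the formulas as written; the function name `accuracy_measures` and the dictionary keys are illustrative choices.

```python
def accuracy_measures(tp, fp, fn, tn):
    """Apply the accuracy formulas from the slide to one error matrix."""
    return {
        "overall": (tp + tn) / (tp + tn + fp + fn),
        "producer_fire": tp / (tp + fn),
        "producer_nonfire": tn / (fp + tn),
        "user_fire": tp / (tp + fp),
        "user_nonfire": tn / (tn + fn),
    }


# Cell counts taken from the error matrix
measures = accuracy_measures(tp=2834, fp=173, fn=318, tn=3103)
percentages = {name: round(100 * value, 1) for name, value in measures.items()}
for name, value in percentages.items():
    print(f"{name}: {value}%")
```

In other vocabularies, producer's accuracy for a class is its recall and user's accuracy is its precision.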
Schematic Approach to Avoiding Overfitting
[Plot: Error vs. Training Epoch, with curves for Training Set error and Validation Set error, and a marker “STOP Training HERE!”]
To avoid overfitting, you need to know when to stop training the model. Although the Training Set error may continue to decrease, you may simply be overfitting the Training Data. Test this by applying the model to a Validation Data Set (not part of the Training Set). If the Validation Data Set error starts to increase, then you know that you are overfitting the Training Set and it is time to stop!
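The stopping rule described above can be sketched as a small function over the per-epoch validation errors. The `patience` parameter (how many non-improving epochs to tolerate before stopping) and the error values are assumptions added for illustration; the slide only states the principle.

```python
def early_stopping_epoch(validation_errors, patience=2):
    """Return the index of the epoch whose model should be kept."""
    best_epoch = 0
    best_error = float("inf")
    for epoch, error in enumerate(validation_errors):
        if error < best_error:
            best_epoch, best_error = epoch, error
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs: stop training
    return best_epoch


# Hypothetical error curves: training error keeps falling,
# validation error turns upward after epoch 3.
training_errors = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18, 0.15]
validation_errors = [0.95, 0.70, 0.55, 0.50, 0.52, 0.56, 0.61, 0.67]
stop_at = early_stopping_epoch(validation_errors)
print("keep the model from epoch", stop_at)
```

A small patience avoids stopping on a single noisy uptick while still catching a sustained rise in validation error.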
2) Principal Component Analysis (PCA)
PCA vs ICA
Initial impression is that the data are extended in only one direction (one principal component).
But there are 2 independent correlations here … hence there are 2 signal sources! (an example of Class Discovery using ICA = Independent Component Analysis)
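A small sketch of why PCA alone gives that initial impression. It builds synthetic data from two linear sources (slopes +0.3 and -0.3, chosen arbitrarily) and computes the 2-D principal components in closed form. It does not implement ICA; it only shows that the first principal component absorbs most of the variance while matching neither source direction. The function name `pca_2d` and the data are assumptions for illustration.

```python
import math


def pca_2d(points):
    """Eigenvalues and first-component angle of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    half_trace = (sxx + syy) / 2
    root = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, half_trace - root
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)  # direction of component 1
    return lam1, lam2, angle


xs = [i / 10 for i in range(-50, 51)]
source_a = [(x, 0.3 * x) for x in xs]    # first correlation
source_b = [(x, -0.3 * x) for x in xs]   # second, independent correlation
lam1, lam2, angle = pca_2d(source_a + source_b)
explained = lam1 / (lam1 + lam2)
print(f"component 1 explains {explained:.1%} of the variance")
print(f"component 1 direction: {math.degrees(angle):.1f} degrees")
```

Here the first component carries about 92% of the variance and points along the x-axis, between the two sources, so the variance summary alone hides the fact that two signals are present.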
3) Graph Mining (Network Analysis)
(Graphic by Cray, for Cray Graph Engine CGE)
http://www.cray.com/products/analytics/cray-graph-engine
“All the World is a Graph” – Shakespeare?
The natural data structure of the world is not rows and columns, but a Graph!
The Human Connectome Project: mapping and linking the major pathways in the brain.
http://www.humanconnectomeproject.org
• Smart Data = annotating our data with its context and meaning…
• … the Semantics! This is based on Ontologies.
• My students memorized the definition of an Ontology…
– “is_a formal, explicit specification of a shared conceptualization.” from Tom Gruber (Stanford)
• Semantic “facts” can be expressed in a database as RDF triples:
{subject, predicate, object} = {noun, verb, noun}
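A minimal sketch of the triple idea, assuming facts are stored as plain (subject, predicate, object) tuples and queried by pattern matching with wildcards. The example facts and the helper name `match` are made up for illustration; a real deployment would use an RDF store and a query language such as SPARQL.

```python
# Toy semantic facts as (subject, predicate, object) = (noun, verb, noun)
triples = {
    ("aspirin", "treats", "headache"),
    ("aspirin", "is_a", "drug"),
    ("headache", "is_a", "symptom"),
    ("migraine", "is_a", "headache"),
}


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )


is_a_facts = match(triples, predicate="is_a")
about_aspirin = match(triples, subject="aspirin")
print(about_aspirin)
```

Because every fact has the same three-part shape, the collection is already a graph: subjects and objects are nodes, predicates are labeled edges.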
Simple Example of the Power of Graph:
Semi-Metric Space
• Entity {1} is linked to Entity {2} (small distance A)
• Entity {2} is linked to Entity {3} (small distance B)
• Entity {1} is *not* linked directly to Entity {3} (Similarity Distance C = infinite)
• Similarity Distances between A, B, and C violate the triangle inequality!
• The connection between black hat entities {1} and {3} never appears explicitly in
a link network, or within a transactional database.
• Examples: (a) Medical Research Discoveries across disconnected journals,
through linked semantic assertions; (b) Customer Journey modeling; (c) Safety
Incident Causal Factor Analysis; (d) Marketing Attribution Analysis; (e) Fraud
networks, Illegal goods trafficking networks, Money-Laundering networks.
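The three-entity example can be sketched in code: store only the direct links, treat a missing link as infinite distance, and search for pairs that are unlinked directly yet joined through one intermediate entity. The distance values 0.2 and 0.3 and the function name `hidden_connections` are assumptions for illustration.

```python
import math


def hidden_connections(direct):
    """Find pairs with no direct link that are joined through one intermediate."""
    nodes = sorted({node for pair in direct for node in pair})

    def dist(a, b):
        # Links are undirected; a missing link means infinite distance
        return direct.get((a, b), direct.get((b, a), math.inf))

    found = []
    for i, a in enumerate(nodes):
        for c in nodes[i + 1:]:
            if dist(a, c) != math.inf:
                continue  # already linked directly
            for b in nodes:
                if b in (a, c):
                    continue
                via = dist(a, b) + dist(b, c)
                if via < math.inf:
                    found.append((a, b, c, via))
    return found


# A = 0.2 links {1}-{2}, B = 0.3 links {2}-{3}; no entry for {1}-{3}, so C is infinite
direct = {(1, 2): 0.2, (2, 3): 0.3}
links = hidden_connections(direct)
for a, b, c, via in links:
    print(f"{a} and {c} are connected through {b} (indirect distance {via:.1f})")
```

The direct distance C is infinite while A + B is finite, so C > A + B: the triangle inequality fails, and that failure is the signal that a hidden connection exists.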
analytics.gmu.edu – CDDA Spring 2014 Workshop
Research Example: Discovery in the NIH-NLM Semantic MEDLINE Database
Project Description: Conduct semantic graph mining of the NIH-NLM metadata repository from ~26 million medical research articles.
Graph Database: ~90 million RDF triples (predications; semantic assertions).
Research Project: (PhD dissertation at GMU) Novel subgraph discovery; Context-based discovery; New concept emergence in medical research; Story discovery in linked graph network; and Hidden knowledge discovery through semi-metrics.
https://skr3.nlm.nih.gov/SemMedDB/
4) Clustering and Validation
Clustering = the process of partitioning a set of data into subsets (segments or clusters) such that a data element belonging to any chosen cluster is more similar to data elements belonging to that cluster than to data elements belonging to other clusters.
= Grouping together similar items, and separating dissimilar items
= Identifying similar characteristics, patterns, or behaviors among subsets of the data elements.
Challenges:
#1) No prior knowledge of the number of clusters.
#2) No prior knowledge of semantic meaning of the clusters.
#3) Different clusters are possible from the same data set!
#4) Selecting different features can lead to different clusters.
How do you know if your clusters are good enough?
The number of clusters is not known
There might not exist a “correct” number of clusters
Results depend on which attributes are selected
Results depend on the choice of distance/similarity metric
Therefore, there is no “correct” set of clusters.
So, how do you know what is a good set of clusters?
Reference: http://www.biomedcentral.com/content/supplementary/1471-2105-9-90-S2.pdf
You know the clusters are good …
… if the clusters are compact relative to their separation
… if the clusters are well separated from one another
… if the “within cluster” errors are small (low variance within)
… if the number of clusters is small relative to the number of data points
Various measures of cluster compactness exist, including the Dunn index, C-index, and the DBI (Davies-Bouldin Index).
Application of Davies-Bouldin Index
Assume k (number of clusters) and assume other things (choice of clustering algorithm; the choice of clustering feature attributes; etc.)
Measure DBI
Test another set of values for the cluster input parameters (k, feature attributes, etc.)
Measure DBI
… continue iterating like this until you find the set of cluster input parameters that yields the best (minimum) value for DBI.
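The iteration above can be sketched as: compute the Davies-Bouldin index for each candidate clustering and keep the one with the minimum value. To stay short, this sketch skips the clustering algorithm itself and compares two hand-made partitions of the same twelve toy points; in practice each candidate would come from running a clustering algorithm with different inputs. The data and names are illustrative assumptions.

```python
import math


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points)."""
    cents = [centroid(c) for c in clusters]
    # Scatter: mean distance of each cluster's points to its centroid
    scatter = [
        sum(math.dist(p, cen) for p in c) / len(c)
        for c, cen in zip(clusters, cents)
    ]
    k = len(clusters)
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k) if j != i
        )
        for i in range(k)
    ]
    return sum(worst_ratios) / k


# Toy data: three tight groups of four points each
g1 = [(0, 0), (0, 1), (1, 0), (1, 1)]
g2 = [(10, 0), (10, 1), (11, 0), (11, 1)]
g3 = [(5, 10), (5, 11), (6, 10), (6, 11)]

candidates = {
    "k=2": [g1 + g2, g3],   # two groups merged into one cluster
    "k=3": [g1, g2, g3],    # each group is its own cluster
}
scores = {name: davies_bouldin(parts) for name, parts in candidates.items()}
best = min(scores, key=scores.get)
print(scores, "-> best:", best)
```

The merged partition scores worse because one cluster is large relative to its separation from the other, which is exactly what the index penalizes.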
Cluster Analysis: Find the clusters, then Evaluate them
Scientific Discovery from Cluster Analysis of data parameters from events on the Sun and around the Earth
[Plot: “DBI for Dst_Vsw_Bz”, Delay (hr) of Dst from Vsw and Bz; x-axis: Time Shift (0 to 12); y-axis: D-B Index (0.8 to 1.2); series: 2C DBI, 3C DBI, 4C DBI, Average]
Figure 10. Davies-Bouldin index for various time delays of Dst from Vsw and Bz for cases of 2 (blue), 3 (red), 4 (yellow) clusters, and the overall average (purple), indicating an optimal delay of ~2-3 hours for Dst.
Good Clusters = Small Size relative to Cluster Separation.
DISCOVERY! ... Solar wind events have the strongest association (i.e., the tightest clusters) with the space plasma events within the Earth’s magnetosphere about 2-4 hours after a major plasma outburst occurs on the Sun.
5) Association Mining (Link Analysis)
Example #1
Classic Textbook Example of Data Mining (Legend?): Data mining of grocery store logs indicated that men who buy diapers also tend to buy beer at the same time.
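A rule such as "diapers implies beer" is usually scored with support, confidence, and lift. The sketch below computes all three from a list of baskets. The eight transactions are invented for illustration (they are not real store logs), and the function name `rule_metrics` is a hypothetical choice.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Support, confidence, and lift for the rule antecedent -> consequent."""
    n = len(transactions)
    has_a = sum(1 for t in transactions if antecedent <= t)
    has_c = sum(1 for t in transactions if consequent <= t)
    has_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = has_both / n              # how often the items co-occur
    confidence = has_both / has_a       # P(consequent | antecedent)
    lift = confidence / (has_c / n)     # > 1 means more often than by chance
    return support, confidence, lift


# Made-up baskets for illustration only
transactions = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"diapers", "beer", "milk"},
    {"bread", "chips"},
    {"milk", "bread", "chips"},
]
support, confidence, lift = rule_metrics(transactions, {"diapers"}, {"beer"})
print(f"support={support}, confidence={confidence}, lift={lift}")
```

In this toy data the lift is 1.5, meaning beer appears in diaper baskets 1.5 times as often as it appears in baskets overall.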
Example #2
Amazon.com mines its customers’ purchase logs to recommend books to you: “People who bought this book also bought this other one.”
Example #3
Netflix mines its video rental history database to recommend rentals to you based upon other customers who rented similar movies as you.
Example #4
Wal-Mart studied product sales in their Florida stores in 2004 when several hurricanes passed through Florida.
Wal-Mart found that, before the hurricanes arrived, people purchased 7 times as many of {one particular product} compared to everything else.
{one particular product} = strawberry pop tarts
Strawberry pop tarts???
http://www.nytimes.com/2004/11/14/business/yourmoney/14wal.html
http://www.hurricaneville.com/pop_tarts.html
http://bit.ly/1gHZddA
Association Rule Mining for Hurricane Intensification Prediction
• Research by GMU geoscientists
• Predict the final strength of hurricane at landfall.
• Find co-occurrence of final hurricane strength with specific values of measured physical properties of the hurricane while it is still over the ocean.
• Result: the association mining model prediction is better than the National Hurricane Center prediction!
• Research Paper by GMU scientists: https://ams.confex.com/ams/pdfpapers/84949.pdf
6) Bayesian Belief Networks
Bayes Theorem
• Bayes Theorem…
• Naïve Bayes assumption
http://www.datasciencecentral.com/profiles/blogs/6-easy-steps-to-learn-naive-bayes-algorithm-with-code-in-python
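For reference, since the slide's formulas were images and are not in the transcript, the standard statements of the two items above are given here in LaTeX. The symbols A, B, C, and x_i are generic placeholders, not notation from the talk.

```latex
% Bayes' theorem: update the probability of A after observing B
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Naive Bayes assumption: the features x_1, ..., x_n are
% conditionally independent given the class C
P(C \mid x_1, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)
```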
• Bayes Theorem… now with Legos
https://www.countbayesie.com/blog/2015/2/18/bayes-theorem-with-lego
Bayes Theorem
• … for missing value imputation
• Bad idea: inserting estimated values of missing data elements = Data Creation!
• Better idea: predicting a value that is not knowable in advance = Predictive Analytics!
Bayes Belief Networks
• … for missing value imputation … Example:
• Use all conditional probabilities across all database attributes to predict the missing value.
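A hedged sketch of the imputation idea, using a naive Bayes model as a simplified stand-in for a full Bayesian belief network (a real BBN would also model dependencies between the attributes). It predicts a probability distribution for the missing value rather than inserting a single number. The toy records, attribute names, add-one smoothing, and the function name `impute_distribution` are all assumptions for illustration, not real measurements or the dissertation's method.

```python
from collections import Counter, defaultdict


def impute_distribution(records, target, observed):
    """Posterior over the missing `target` value (naive Bayes, add-one smoothing)."""
    complete = [r for r in records if r.get(target) is not None]
    class_counts = Counter(r[target] for r in complete)
    value_sets = defaultdict(set)  # distinct values seen for each attribute
    for r in complete:
        for attr, value in r.items():
            if attr != target:
                value_sets[attr].add(value)
    scores = {}
    for cls, n_cls in class_counts.items():
        score = n_cls / len(complete)  # prior P(class)
        for attr, value in observed.items():
            matches = sum(
                1 for r in complete if r[target] == cls and r.get(attr) == value
            )
            # Add-one smoothing keeps unseen combinations above zero
            score *= (matches + 1) / (n_cls + len(value_sets[attr] | {value}))
        scores[cls] = score
    total = sum(scores.values())
    return {cls: s / total for cls, s in scores.items()}


# Made-up toy catalogue; the last record has no measured value for the target
records = [
    {"shape": "spiral", "color": "blue", "redshift_bin": "low"},
    {"shape": "spiral", "color": "blue", "redshift_bin": "low"},
    {"shape": "elliptical", "color": "red", "redshift_bin": "high"},
    {"shape": "elliptical", "color": "red", "redshift_bin": "high"},
    {"shape": "spiral", "color": "red", "redshift_bin": "high"},
    {"shape": "elliptical", "color": "blue", "redshift_bin": None},
]
posterior = impute_distribution(
    records, "redshift_bin", {"shape": "elliptical", "color": "blue"}
)
print(posterior)
```

Returning the whole distribution keeps the uncertainty visible, in line with the "predict, don't create data" point on the previous slide.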
Bayes Belief Networks for Cosmology: PhD dissertation by Dr. Pragyan Nayak
• …for missing value imputation: Galaxy Redshifts
• The problem: less than 0.1% of catalogued galaxies have a measured redshift = Distance!
• Bigger sky surveys are coming in the next 10 yrs
• Less than 0.001% of galaxies will have distance estimate!!
• Traditional method: use colors of galaxies (red-shift…)
• BBN method: use all properties of galaxies (shape, size, color, texture, concentration,…)
• Result: probability distribution of redshift for each galaxy!
• Consequence: map the galaxy mass density of Universe!
@KirkDBorne
@BoozDataScience
LISTEN
READ, BUILD, and EXPLORE
www.boozallen.com/datascience
– Tips for Building a Data Science Capability
– The Mathematical Corporation
– 10 Signs of Data Science Maturity
– The Field Guide to Data Science
– The Data and Analytics Catalyst
Explore: sailfish.boozallen.com
PARTICIPATE
datasciencebowl.com
…Learn how AI and Machine Intelligence empower The Mathematical Corporation
in MachineIntelligence
THANK YOU! Check out some of these resources…