Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data
Transcript of Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data
![Page 1: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/1.jpg)
BETTER ALGORITHMS FROM BIGGER DATA
Chris Bingham, CTO, Crimson Hexagon
April 26th, 2012
![Page 2: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/2.jpg)
INTRODUCTION
Crimson Hexagon and me
![Page 3: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/3.jpg)
ABOUT CRIMSON HEXAGON
•Founded 4 years ago; now 40+ employees in Boston
•Help companies make actionable business decisions
•Based on unique analysis of social media and internal data
•Customers include F100, agencies, UN
•Tech stack:
  • Java, with R for algorithms
  • Massive Lucene infrastructure with custom shard management
  • Distributed computing framework for analysis
  • Hadoop increasingly used
![Page 4: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/4.jpg)
BIG DATA, BETTER DATA, BETTER ALGORITHMS
•World’s largest searchable social media archive
•>200 billion posts in 2012
•Adding 1 billion every 2-3 days
•Twitter, Facebook, blogs, forums, comments, news, etc.
![Page 5: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/5.jpg)
BIG DATA, BETTER DATA, BETTER ALGORITHMS
•Who’s talking and listening?
  • Demographics
  • Interests
  • Relationships
•Trends and comparisons
  • Compared to yourself, over time
  • Compared to industry, competitors, etc.
•Human input
  • Define specific business question and possible answers
  • Provides focus and context
![Page 6: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/6.jpg)
BIG DATA, BETTER DATA, BETTER ALGORITHMS
•Based on work by co-founder Gary King at Harvard
•Takes all those billions of posts, plus the human input
•Leverages that human judgment at massive scale
•Quantitative answers to specific business questions
•Accurate in any language
![Page 7: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/7.jpg)
ALGORITHMS AND BIG DATA
The problem of leverage
![Page 8: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/8.jpg)
MACHINE LEARNING
Let’s consider a typical data-analysis problem
using machine learning.
How does having more data help (or hurt) us?
![Page 9: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/9.jpg)
DEFINE CATEGORIES
Some set of user-defined categories (AKA topics, classes, etc.)
[Diagram: categories A, B, C, D]
![Page 10: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/10.jpg)
PROVIDE TRAINING
Training examples to map features to categories
[Diagram: categories A, B, C, D]
![Page 11: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/11.jpg)
LEARN A MODEL
Algorithm classifies items into categories based on training data
[Diagram: categories A, B, C, D]
![Page 12: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/12.jpg)
CLASSIFY ITEMS
Incoming unknown items to be classified
[Diagram: categories A, B, C, D; unknown items w, x, y, z]
![Page 13: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/13.jpg)
OBTAIN RESULTS
Result: Items are classified, hopefully correctly!
[Diagram: items w, x, y, z placed into categories A, B, C, D]
![Page 14: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/14.jpg)
DID IT WORK?
Compare algorithm to human(s) to measure accuracy; here “z” was incorrectly classified
[Diagram: two panels, each showing items w, x, y, z in categories A, B, C, D]
![Page 15: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/15.jpg)
ERROR RATE
We were wrong 25% of the time.
What happens when we add more data?
75% correct
25% wrong
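The 25% figure can be reproduced with a minimal sketch like the one below. The specific category labels are assumptions made for illustration; the slides only state that “z” was the one misclassified item out of four.

```python
# Hypothetical labels: only the fact that "z" was misclassified comes from
# the slides; which categories were involved is assumed here.
human = {"w": "A", "x": "B", "y": "C", "z": "D"}
algorithm = {"w": "A", "x": "B", "y": "C", "z": "C"}


def error_rate(predicted, truth):
    """Fraction of items whose predicted category differs from the human label."""
    wrong = sum(1 for item in truth if predicted[item] != truth[item])
    return wrong / len(truth)


rate = error_rate(algorithm, human)
print(f"{rate:.0%} wrong, {1 - rate:.0%} correct")  # 25% wrong, 75% correct
```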
![Page 16: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/16.jpg)
SCALE TO BIG DATA
We just make the same mistakes
on a larger scale.
[Diagram: the same split, 75% correct and 25% wrong, shown at small and large scale]
![Page 17: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/17.jpg)
CAN MORE DATA HELP?
Can bigger data help us? In some ways.
• It can enable more types of analysis
• It can enable analysis of more categories
• It can provide more raw material for training and validation
What about accuracy?
[Diagram: categories A, B, C, D, E, F]
![Page 18: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/18.jpg)
HUMAN SCALE
More training usually improves accuracy, but that takes not just more data but more humans.
[Diagram: categories A, B, C, D]
Humans don’t scale.
![Page 19: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/19.jpg)
FEEDBACK
For some applications, users can implicitly provide feedback through their use, e.g. ad placement or spam detection.
But this isn’t possible in all cases, and you can’t be too wrong to begin with.
[Diagram: items w, x, y, z and categories A, B, C, D]
![Page 20: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/20.jpg)
BOOTSTRAPPING
We can also feed the classified items back into the training set (no human intervention).
Some incorrect classifications will become part of the training! But that doesn’t necessarily hurt.
[Diagram: classified items w, x, y, z fed back into categories A, B, C, D]
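As a rough illustration of the idea, here is a generic self-training loop built on a toy word-count classifier. It is not the algorithm used in the talk: the classifier, the rule of abstaining on ties, and all example texts are assumptions chosen to keep the sketch small.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears under each category."""
    counts = defaultdict(Counter)
    for text, category in examples:
        counts[category].update(text.lower().split())
    return counts


def classify(counts, text):
    """Pick the category whose training words overlap most; None on a tie."""
    words = text.lower().split()
    scores = {cat: sum(c[w] for w in words) for cat, c in counts.items()}
    ranked = sorted(scores.values(), reverse=True)
    if len(ranked) > 1 and ranked[0] == ranked[1]:
        return None  # assumed rule: abstain when there is no clear winner
    return max(scores, key=scores.get)


def bootstrap(seed_examples, unlabeled, rounds=2):
    """Feed machine-labeled items back into the training set, no humans involved."""
    examples = list(seed_examples)
    for _ in range(rounds):
        model = train(examples)
        machine_labeled = [(t, classify(model, t)) for t in unlabeled]
        examples = list(seed_examples) + [
            (t, c) for t, c in machine_labeled if c is not None
        ]
    return train(examples)


# Hypothetical data: two human-labeled posts and two unlabeled ones.
seed = [("love this phone", "A"), ("battery died again", "B")]
unlabeled = ["love the camera", "battery drains fast"]

print(classify(train(seed), "camera is great"))                 # None
print(classify(bootstrap(seed, unlabeled), "camera is great"))  # A
```

The seed model has never seen the word "camera", so it cannot decide. After the unlabeled posts are classified and fed back in, the same two human examples are enough to label the new post.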
![Page 21: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/21.jpg)
BOOTSTRAPPING RESULT
The more data you have, the more you can classify.
The more you classify, the more training data you obtain.
The more training data, the more accurate the results.
And we didn’t have to scale the human involvement.
[Diagram: categories A, B, C, D holding items w, x, y, z plus additional machine-classified items (r, s, t, u, v)]
![Page 22: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/22.jpg)
INDIVIDUAL VS. AGGREGATE
So far we’ve considered classification of individual items. This is the conventional machine-learning approach.
[Diagram: items w, x, y, z each assigned to one of categories A, B, C, D]
![Page 23: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/23.jpg)
INDIVIDUAL VS. AGGREGATE
What if we want to know the size of each category, rather than which items are in which category?
e.g. epidemiology, polls, market research
[Diagram: items w, x, y, z summarized as category shares: 25% A, 25% B, 50% C, 0% D]
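The simplest way to get category sizes is to classify every item and count, as in the sketch below. The assignment of w, x, y, z to categories is an assumption, picked so that the shares match the 25/25/50/0 split shown on the slide.

```python
from collections import Counter


def category_proportions(labels, categories):
    """Share of items in each category, including categories with no items."""
    counts = Counter(labels)
    total = len(labels)
    return {cat: counts[cat] / total for cat in categories}


# Assumed labels for w, x, y, z, chosen to match the slide's shares.
labels = ["A", "B", "C", "C"]
props = category_proportions(labels, "ABCD")
for cat, share in props.items():
    print(f"{share:.0%} {cat}")
```

Note that the answer depends only on the counts, not on which item landed where, which is why individual mistakes matter less for this kind of question.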
![Page 24: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/24.jpg)
INDIVIDUAL VS. AGGREGATE
When considered individually, there’s a limited amount of information we have about each item.
As a result, there will be limited correlation with the training data, and therefore poor accuracy.
[Diagram: each of w, x, y, z matched separately against categories A, B, C, D; 75% correct, 25% wrong]
![Page 25: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/25.jpg)
INDIVIDUAL VS. AGGREGATE
When considered in the aggregate, there’s much more data correlating with the training data for each category.
As a result, we can make more accurate estimates of the category proportions.
[Diagram: W+X+Y+Z combined and matched against the training data to estimate %A, %B, %C, %D; 85% correct, 15% wrong]
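One way to see why aggregate estimates can beat individual ones is the standard "adjusted count" correction for two categories, sketched below. This is a stand-in technique for illustration, not necessarily the method described in the talk, and the error rates and the true share are made-up numbers.

```python
def adjusted_proportion(observed_rate, true_positive_rate, false_positive_rate):
    """Correct a classify-and-count estimate using known classifier error rates.

    observed = tpr * p + fpr * (1 - p)   =>   p = (observed - fpr) / (tpr - fpr)
    """
    p = (observed_rate - false_positive_rate) / (
        true_positive_rate - false_positive_rate
    )
    return min(1.0, max(0.0, p))  # clamp to a valid proportion


# Assumed numbers: the classifier is right 75% of the time on each category,
# and the true share of category A is 40%.
tpr, fpr, true_share = 0.75, 0.25, 0.40
observed = tpr * true_share + fpr * (1 - true_share)  # what naive counting reports
estimate = adjusted_proportion(observed, tpr, fpr)
print(f"naive: {observed:.2f}, adjusted: {estimate:.2f}")  # naive: 0.45, adjusted: 0.40
```

Even though a quarter of the individual labels are wrong, the errors are systematic, so the aggregate share can be recovered once the error rates are measured on the training data.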
![Page 26: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/26.jpg)
INDIVIDUAL VS. AGGREGATE
Now, increasing the amount of data can actually increase the accuracy, with the same amount of human training data.
[Diagram: S+T+U+V+W+X+Y+Z combined to estimate %A, %B, %C, %D; 95% correct, 5% wrong]
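A simplified way to quantify this effect is the standard error of an estimated proportion, which shrinks as the number of items grows. The sketch assumes independent items and a true share of 25%; it captures only sampling noise, so it does not reproduce the illustrative 85% and 95% figures on the slides, and any systematic classifier bias would not shrink this way.

```python
import math


def proportion_standard_error(p, n):
    """Standard error of a proportion estimated from n independent items."""
    return math.sqrt(p * (1 - p) / n)


# Assumed true share of 25%; item counts range from the toy example to big data.
for n in (4, 8, 1_000, 1_000_000):
    print(n, round(proportion_standard_error(0.25, n), 4))
```

Going from 4 to 8 items, as on the slide, already tightens the estimate, and at a million items the sampling noise is negligible without any additional human labeling.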
![Page 27: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/27.jpg)
CONCLUSION
•Bigger data is important
•Better data is important
•Better algorithms are important
•The sweet spot is when one leverages the other
Bigger data can lead to better algorithms.
![Page 28: Christopher Bingham, Crimson Hexagon: Better Algorithms from Bigger Data](https://reader033.fdocuments.us/reader033/viewer/2022060119/55906fb21a28ab642f8b4603/html5/thumbnails/28.jpg)
QUESTIONS?