Cognitive computing with big data, high tech and low tech approaches

49
© 2014 MapR Technologies 1 © 2014 MapR Technologies Cognitive Computing on Hadoop Low Tech and High Tech Approaches Ted Dunning

description

I explain some very approachable methods for analyzing big data via a detour through clipper ships and the 19th century open source scene. Note that I mixed up the route of the Flying Cloud record in this talk. The Flying Cloud's record was actually from New York to San Francisco and was even more impressive than what I said. The usual time had been about 180 days. With Maury's charts, the time was reduced to about 135 days. The Flying Cloud's time was 89 days. Thanks to Chen Kung for noticing my error.

Transcript of Cognitive computing with big data, high tech and low tech approaches

Page 1: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 1© 2014 MapR Technologies

Cognitive Computing on Hadoop

Low Tech and High Tech Approaches

Ted Dunning

Page 2: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 2© 2014 MapR Technologies

Cognitive Computing on Hadoop with Data

Low Tech and High Tech Approaches

Ted Dunning

Page 3: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 3

Who I am

Ted Dunning, Chief Applications Architect, MapR Technologies

Email [email protected] [email protected]

Twitter @Ted_Dunning

Apache Mahout https://mahout.apache.org/

Twitter @ApacheMahout

Page 4: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 4

The outline

• The first open source, big data project

• Another big data project

• Conclusions

Page 5: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 5

First:

An apology for going off-script

Page 6: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 6

Now, the story

Page 7: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 7

Page 8: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 8

In 1866, the top finishers in the tea race reached London in 99 days, within 2 hours of each other

Page 9: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 9

Page 10: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 10

But in 1851, the record had been set at 89 days by the Flying Cloud

Page 11: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 11

The difference was due (in part) to big data

Page 12: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 12

Page 13: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 13

Page 14: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 14

Page 15: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 15

These charts were free …

If you donated your data

Page 16: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 16

But how does this apply today?

Page 17: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 17

Key Points of Maury’s Work

• Give to get– Give the Abstract Log to captains, get data

• Data consortium wins– Merging data gives pictures nobody else can see

• Give back– Them that gives, also gets

• But this is just what every data driven web site does!– Just 150 years before everybody else

Page 18: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 18

The Real News in Behavioral Analysis

• Everybody knows that:

• You need ensembles of many models to do recommendations

• You need to use factorization models

• You predict what you observe

• (You should predict ratings)

Page 19: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 19

But …

none of this is really true

Page 20: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 20

In fact,

• Fancy models are rarely useful expenditures of time• Factorization can be good, but not much better (if at all)• Ratings are disastrously bad data• Cross-recommendation and multi-modal recommendations are

much more interesting– Multiple kinds of input are far better than multiple models

• The UI has a far larger impact than the models• The best algorithms combine simplicity with accuracy

– So simple you can embed them in a search engine

Page 21: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 21

Here’s how

Page 22: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 22

Cooccurrence Analysis

Page 23: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 23

How Often Do Items Co-occur

Page 24: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 24

Which Co-occurrences are Interesting?

Each row of indicators becomes a field in a search engine document

Page 25: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 25

Recommendations

Alice got an apple and a puppyAlice

Page 26: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 26

Recommendations

Alice got an apple and a puppyAlice

Charles got a bicycleCharles

Page 27: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 27

Recommendations

Alice got an apple and a puppyAlice

Charles got a bicycleCharles

Bob Bob got an apple

Page 28: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 28

Recommendations

Alice got an apple and a puppyAlice

Charles got a bicycleCharles

Bob What else would Bob like?

Page 29: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 29

Recommendations

Alice got an apple and a puppyAlice

Charles got a bicycleCharles

Bob A puppy!

Page 30: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 30

You get the idea of how recommenders can work…

Page 31: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 31

By the way, like me, Bob also wants a pony…

Page 32: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 32

Recommendations

?

Alice

Bob

Charles

Amelia

What if everybody gets a pony?

What else would you recommend for new user Amelia?

Page 33: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 33

Recommendations

?

Alice

Bob

Charles

Amelia

If everybody gets a pony, it’s not a very good indicator of what to else predict...

Page 34: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 34

Problems with Raw Co-occurrence

• Very popular items co-occur with everything or why it’s not very helpful to know that everybody wants a pony… – Examples: Welcome document; Elevator music

• Very widespread occurrence is not interesting to generate indicators for recommendation– Unless you want to offer an item that is constantly desired, such as

razor blades (or ponies)• What we want is anomalous co-occurrence

– This is the source of interesting indicators of preference on which to base recommendation

Page 35: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 35

Overview: Get Useful Indicators from Behaviors

1. Use log files to build history matrix of users x items– Remember: this history of interactions will be sparse compared to all

potential combinations

2. Transform to a co-occurrence matrix of items x items

3. Look for useful indicators by identifying anomalous co-occurrences to make an indicator matrix– Log Likelihood Ratio (LLR) can be helpful to judge which co-

occurrences can with confidence be used as indicators of preference– ItemSimilarityJob in Apache Mahout uses LLR

Page 36: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 36

Which one is the anomalous co-occurrence?

A not A

B 13 1000

not B 1000 100,000

A not A

B 1 0

not B 0 10,000

A not A

B 10 0

not B 0 100,000

A not A

B 1 0

not B 0 2

Page 37: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 37

Which one is the anomalous co-occurrence?

A not A

B 13 1000

not B 1000 100,000

A not A

B 1 0

not B 0 10,000

A not A

B 10 0

not B 0 100,000

A not A

B 1 0

not B 0 20.90 1.95

4.52 14.3

Page 38: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 38

Collection of Documents: Insert Meta-Data

Search Technology

Item meta-data

Document for “puppy” id: t4

title: puppydesc: The sweetest little puppy ever.keywords: puppy, dog, pet

Ingest easily via NFS

Page 39: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 39

From Indicator Matrix to New Indicator Field

id: t4title: puppydesc: The sweetest little puppy ever.keywords: puppy, dog, pet

indicators: (t1)

Solr document for “puppy”

Note: data for the indicator field is added directly to meta-data for a document in Apache Solr or Elastic Search index. You don’t need to create a separate index for the indicators.

Page 40: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 40

Going Further: Multi-Modal Recommendation

Page 41: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 41

Going Further: Multi-Modal Recommendation

Page 42: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 42

For example

• Users enter queries (A)– (actor = user, item=query)

• Users view videos (B)– (actor = user, item=video)

• ATA gives query recommendation– “did you mean to ask for”

• BTB gives video recommendation– “you might like these videos”

Page 43: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 43

The punch-line

• BTA recommends videos in response to a query– (isn’t that a search engine?)– (not quite, it doesn’t look at content or meta-data)

Page 44: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 44

Real-life example

• Query: “Paco de Lucia”• Conventional meta-data search results:

– “hombres de paco” times 400– not much else

• Recommendation based search:– Flamenco guitar and dancers– Spanish and classical guitar– Van Halen doing a classical/flamenco riff

Page 45: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 45

Real-life example

Page 46: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 46

Hypothetical Example

• Want a navigational ontology?• Just put labels on a web page with traffic

– This gives A = users x label clicks

• Remember viewing history– This gives B = users x items

• Cross recommend– B’A = label to item mapping

• After several users click, results are whatever users think they should be

Page 47: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 47

More Details Available

available for free at

http://www.mapr.com/practical-machine-learning

Page 48: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 48

Who I am

Ted Dunning, Chief Applications Architect, MapR Technologies

Email [email protected] [email protected]

Twitter @Ted_Dunning

Apache Mahout https://mahout.apache.org/

Twitter @ApacheMahout

Page 49: Cognitive computing with big data, high tech and low tech approaches

© 2014 MapR Technologies 49

Q & A

@mapr maprtech

[email protected]

Engage with us!

MapR

maprtech

mapr-technologies