LDA Topic Models

LDA Topic Models: turning words into meaning
Andrius Knispelis

description

LDA topic modelling is a powerful tool for extracting meaning from text. In this video I talk about the idea behind LDA itself, why it works, which free tools and frameworks can be used, which LDA parameters are tunable, what they mean for your specific use case, and what to look for when you evaluate a model.

Transcript of LDA Topic Models

Page 1: LDA Topic Models

LDA Topic Models turning words into meaning

Andrius Knispelis

Page 2: LDA Topic Models

In 2011 I joined the Danish startup issuu (the fastest growing online publishing platform) as their first Data Scientist.

Over the following 4 years I worked on many interesting things there. By far the coolest of them all was topic modelling.

Let me share with you:

What is LDA topic modelling?
Why do you need it?
How do you build it?

Page 3: LDA Topic Models
Page 4: LDA Topic Models
Page 5: LDA Topic Models

placement (where & when)

related

📄 content

similar

👥 reading patterns

Page 6: LDA Topic Models

[Slide graphic: the document shown as a raw stream of binary digits]

?👤 !

Page 7: LDA Topic Models


TROPICAL FRUIT

Serve up something new with...
RECIPES: GLYNIS MCGUINNESS, GREGOR MCMASTER. PHOTOGRAPHS: JONATHAN KENNEDY. STYLING: TAMZIN FERDINANDO, JENNY IGGLEDEN. FOOD STYLING: DENISE SMART, KATE BLINMAN

KIWI FRUIT: Cheesecake layers
Put a few digestive biscuits in a freezer bag and smash with a rolling pin. Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit. Layer in glasses until full.

MANGO: Spicy mango salad with pork
Peel and stone ripe mango and slice. Mix with sliced red onion, quartered cherry tomatoes, a sliced chilli, chopped coriander and a squeeze of lemon juice. Serve with grilled pork chops or steaks.

PASSION FRUIT: Tropical pavlova
Whip double cream until thick and spoon into meringue nests. Top with mango slices. Halve 2 passion fruits, scoop out the seeds and flesh. Spoon over the meringues.

PINEAPPLE: Rum-flavoured rings
Remove the pineapple ends and peel. Slice thickly and remove the core. Heat a little butter with brown sugar and stir until melted. Add a good splash of dark rum and the pineapple and simmer for 5-10 minutes.

Why not also try...
• Salsa: Peel and dice some kiwi fruits and peeled, stoned avocados. Toss in lime juice, then stir in a little finely chopped shallot and deseeded red chilli. Serve with meat or fish.
• Kiwi & chicken wraps
• Ice cream topping: Simply [...]
• Rice salad: Cook long [...]

knowledge about the world

?

👤

📹 👶 $ ⚛ ♥ 💰 🎂 ) 🍺 🏈 🚘


what is it about?
what is it related to?
what does it feel like?
what does it mean?

Page 8: LDA Topic Models

topic, word, context

👤 what is it about? what is it related to? what does it feel like? what does it mean?

knowledge about the world

📹 👶 $ ⚛ ♥ 💰 🎂 ) 🍺 🏈 🚘

right level of abstraction

Page 9: LDA Topic Models

knowledge about the world

📹 👶 $ ⚛ ♥ 💰 🎂 ) 🍺 🏈 🚘

right level of abstraction

👤 what is it about? what is it related to? what does it feel like? what does it mean?

topic, word, context

use the right words
set the right “window” of context
capture the widest range of topics

Page 10: LDA Topic Models

knowledge about the world

📹 👶 $ ⚛ ♥ 💰 🎂 ) 🍺 🏈 🚘

right level of abstraction

topic, word, context

use the right words
set the right “window” of context
capture the widest range of topics

🌎 millions of articles in Wikipedia, capturing the widest range of topics

Page 11: LDA Topic Models


Page 12: LDA Topic Models

how to…

preprocess the data
train the model
score it on a new document
evaluate the performance

gensim: topic modeling framework (free Python library)

Page 13: LDA Topic Models

LDA hierarchical LDA

dynamic LDA DeepLearning

Word2Vec Doc2Vec

POS …

Page 14: LDA Topic Models

[Slide graphic: 🌎 Wikipedia as a collection of articles, each shown as a title followed by its words]

Page 15: LDA Topic Models


Setting the right “window” of context
(a) Define a minimum number of words that must be present in an article.

Page 16: LDA Topic Models


Setting the right “window” of context
(a) Define a minimum number of words that must be present in an article.

Recommended: 100 - 300 words
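
The minimum-length rule above could be sketched roughly as follows. This is only an illustration: `MIN_WORDS`, `long_enough` and the two sample articles are made-up names and data, and counting whitespace-separated tokens is an assumed (very crude) definition of a "word".

```python
# Minimal sketch: drop articles too short to give a useful "window" of context.
# MIN_WORDS and the sample articles are assumptions made for this illustration.
MIN_WORDS = 100  # the slide recommends a threshold somewhere between 100 and 300


def long_enough(text, min_words=MIN_WORDS):
    """Return True if the article has at least `min_words` whitespace-separated tokens."""
    return len(text.split()) >= min_words


articles = {
    "Stub": "just a few words",
    "Full article": " ".join(["word"] * 250),
}

# Keep only the articles that pass the length check.
kept = {title: text for title, text in articles.items() if long_enough(text)}
print(sorted(kept))
```

A higher threshold gives each document more context but throws away more of the corpus, so the value is worth tuning for the collection at hand.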


Page 17: LDA Topic Models

Setting the right “window” of context
(b) Skip articles whose titles start with these namespaces:

Wikipedia:, Category:, File:, Portal:, Template:, MediaWiki:, User:, Help:, Book:, Draft:

[Bar chart: counts ranging from under a thousand up to 1,167,766; axis ticks at 500.000, 1.000.000 and 1.500.000]
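
A namespace filter along these lines might look like the sketch below. The prefixes are the ones listed on the slide; the function name `is_content_article` and the sample titles are assumptions for illustration.

```python
# Minimal sketch: skip pages whose titles start with a non-article namespace.
# The prefixes come from the slide; everything else here is illustrative.
SKIPPED_NAMESPACES = (
    "Wikipedia:", "Category:", "File:", "Portal:", "Template:",
    "MediaWiki:", "User:", "Help:", "Book:", "Draft:",
)


def is_content_article(title):
    """Return True if the title does not start with one of the skipped namespaces."""
    # str.startswith accepts a tuple and checks every prefix in it.
    return not title.startswith(SKIPPED_NAMESPACES)


titles = ["Mango", "Category:Tropical fruit", "Template:Infobox", "Kiwifruit"]
content_titles = [t for t in titles if is_content_article(t)]
print(content_titles)
```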

Page 18: LDA Topic Models


Page 19: LDA Topic Models


Page 20: LDA Topic Models

Let the right words in

remove a word if it appears in more than 10% of the articles
remove a word if it appears in fewer than 20 articles

Word length:
1: i
2: do, be, am, …
3: ice, was, who, …
16: videoconferences, …
17: superbillionaires, …
18: intellectualization, …

Stoplists:
general terms
last names, first names
countries, cities

Lemmatization:
am, are, is = be

Parts of Speech:
NN - noun (computer, car, cake, …)
VB - verb (play, install, commit, …)
RB - adverb (today, quickly, patiently, …)
JJ - adjective (red, awesome, big, …)
IN - preposition (of, about, from, …)
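
As a rough sketch of how the frequency, length and stoplist rules above could be combined: the 10% and 20-article thresholds are from the slide, while `build_vocabulary`, the `min_len` / `max_len` cutoffs and the toy corpus are assumptions (the slide shows very short and very long words as examples but does not state exact limits). Lemmatization and part-of-speech filtering are left out here.

```python
from collections import Counter


def build_vocabulary(docs, min_docs=20, max_doc_share=0.10,
                     min_len=3, max_len=15, stoplist=frozenset()):
    """Return the set of words that survive the document-frequency,
    word-length and stoplist filters. `docs` is a list of token lists."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = max_doc_share * len(docs)
    return {
        word for word, df in doc_freq.items()
        if min_docs <= df <= max_docs          # not too rare, not too common
        and min_len <= len(word) <= max_len    # hypothetical length cutoffs
        and word not in stoplist
    }


# Toy corpus: 200 documents, so 10% corresponds to 20 documents.
docs = []
for i in range(200):
    tokens = ["the", "unique{}".format(i)]   # "the" is everywhere, "unique…" appears once
    if i < 20:
        tokens.append("mango")               # appears in exactly 20 documents
    if i < 5:
        tokens.append("kiwi")                # appears in only 5 documents
    docs.append(tokens)

vocabulary = build_vocabulary(docs)
print(vocabulary)
```

In gensim, `Dictionary.filter_extremes` exposes comparable `no_below` and `no_above` parameters, so a hand-written filter like this is mainly useful for seeing what the thresholds do.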

Page 21: LDA Topic Models

Let the right words in

keep the top n words, discard the rest

Recommended: 50.000 - 100.000 words
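
Capping the vocabulary at the top n words might be sketched like this. The slide does not say what the words are ranked by; ranking by document frequency is an assumption here, and `keep_top_n` and the tiny corpus are illustrative only.

```python
from collections import Counter


def keep_top_n(docs, n):
    """Return the n words that occur in the most documents (assumed ranking)."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return [word for word, _ in doc_freq.most_common(n)]


docs = [["a", "b"], ["a", "c"], ["a", "b"]]
top_words = keep_top_n(docs, 2)
print(top_words)
```

With a real corpus, n would be in the recommended 50.000 - 100.000 range rather than 2.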

Page 22: LDA Topic Models

LATENT DIRICHLET ALLOCATION

A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003.

It tells us which topics are present in any given document by observing all the words in it and producing a topic distribution.

[Diagram: a documents × words matrix, labelled tfidf.mm and wordids.txt; N words, M documents]

W: the observed words in document i
Z: the topic for the j’th word in document i
Θ: the topic distribution for document i
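
In the usual textbook notation, the three quantities on this slide are tied together as shown below. The symbols $\alpha$ (the Dirichlet prior on topic distributions), $\varphi_k$ (the word distribution of topic $k$) and $N_i$ (the number of words in document $i$) do not appear on the slide and are introduced here as assumptions.

```latex
\begin{align*}
\theta_i &\sim \mathrm{Dirichlet}(\alpha), & i &= 1, \dots, M \\
z_{ij} \mid \theta_i &\sim \mathrm{Categorical}(\theta_i), & j &= 1, \dots, N_i \\
w_{ij} \mid z_{ij} &\sim \mathrm{Categorical}(\varphi_{z_{ij}})
\end{align*}
```

Only the words $w_{ij}$ are observed; the topic assignments $z_{ij}$ and the topic distributions $\theta_i$ are latent and have to be inferred.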

Page 23: LDA Topic Models

[Same diagram as on the previous slide, now annotated with the labels: topic, context, word]

Page 24: LDA Topic Models

let’s assume that…
Take this recipe and generate a document based on the model’s “rules”

recipe:
topic#1 50%
topic#2 30%
topic#3 20%

topics, themes, …
topic#1: P * word, P * word, P * word, …
topic#2: P * word, P * word, P * word, …
topic#3: P * word, P * word, P * word, …

Take this collection of documents and learn a model that describes it best…
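
The "recipe" direction (generating a document from topic proportions and per-topic word probabilities) can be sketched as below. The 50/30/20 recipe is from the slide; the words, their probabilities and the function name `generate_document` are invented for illustration.

```python
import random

# Topic proportions for one document (the "recipe" from the slide).
recipe = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}

# Each topic is a probability distribution over words (made-up values).
topics = {
    "topic#1": {"mango": 0.5, "kiwi": 0.3, "pineapple": 0.2},
    "topic#2": {"goal": 0.6, "match": 0.4},
    "topic#3": {"price": 0.7, "market": 0.3},
}


def generate_document(recipe, topics, length, rng):
    """For every word position: pick a topic from the recipe,
    then pick a word from that topic's word distribution."""
    words = []
    for _ in range(length):
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the sketch is reproducible
doc = generate_document(recipe, topics, 20, rng)
print(" ".join(doc))
```

Word order carries no information in this process, which is why LDA treats a document as a bag of words.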

Page 25: LDA Topic Models

let’s assume that…
Take this recipe and generate a document based on the model’s “rules”

what really happens…
Take this collection of documents and learn a model that describes it best…

words appearing in the same context (document) are related

Page 26: LDA Topic Models

What really happens…

Take this collection of documents and learn a model that describes it best…

[Figure: a collection of four documents, each shown as a row of placeholder words.]

…given these model parameters:

how many topics?

how are those topics assigned to a document?

[Figure: the learned topics (themes), topic#1, topic#2, topic#3, each a weighted list of words: P * word + P * word + ….]

Words appearing in the same context (document) are related.

Let's assume that…

Take this recipe and generate a document based on the model's “rules”.

[Figure: the model's topics, topic#1, topic#2, … topic#N, each a weighted list of words: P * word + P * word + ….]

recipe: topic#1 50%, topic#2 30%, topic#3 20%

[Figure: the generated document, shown as a row of placeholder words.]
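
The recipe above can be read as a two-step sampling procedure: pick a topic according to the recipe, then pick a word according to that topic. Below is a minimal sketch of that idea in plain Python. The topic contents and their weights are invented for illustration (only the 50% / 30% / 20% recipe comes from the slide), and full LDA would additionally draw the recipe itself from a Dirichlet prior, which this sketch skips.

```python
import random

# Hypothetical topics: each one is a probability distribution over words.
TOPICS = {
    "topic#1": {"recipe": 0.5, "chef": 0.3, "sauce": 0.2},
    "topic#2": {"wine": 0.6, "grape": 0.4},
    "topic#3": {"dutch": 0.7, "amsterdam": 0.3},
}

# The "recipe": the document's distribution over topics (50% / 30% / 20%).
RECIPE = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}


def generate_document(recipe, topics, n_words, seed=0):
    """Generate n_words by drawing a topic first, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Step 1: pick a topic according to the recipe.
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        # Step 2: pick a word according to that topic's word distribution.
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


document = generate_document(RECIPE, TOPICS, n_words=12)
print(" ".join(document))
```

Training a model runs this story in reverse: only the words are observed, and the topics and recipes that best explain them are inferred.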

Page 27: LDA Topic Models

LATENT DIRICHLET ALLOCATION

A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003.

It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.

[Figure: a documents × words matrix, labelled tfidf.mm and wordids.txt.]

[Plate diagram: N words, M documents.]

the topic distribution for document i

the topic for the j'th word in a document i

observed words in a document i

topic, context, word

Page 28: LDA Topic Models

LATENT DIRICHLET ALLOCATION

[Plate diagram: α, β, Θ, Z, W; N words, M documents.]

α: a parameter that sets the prior on the per-document topic distributions

β: a parameter that sets the prior on the per-topic word distributions

Θ: the topic distribution for document i

Z: the topic for the j'th word in a document i

W: observed words in a document i

[Figure: a documents × words matrix, labelled tfidf.mm and wordids.txt.]
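
To get a feel for what the prior α does to the per-document topic distributions, one can draw a few samples from a symmetric Dirichlet. The sketch below uses only the standard library and relies on the fact that normalised Gamma draws follow a Dirichlet distribution. The function name and the chosen α values are illustrative assumptions, not something taken from the slides.

```python
import random


def sample_dirichlet(alpha, n_topics, rng):
    """Draw one topic distribution from a symmetric Dirichlet(alpha)."""
    # A Dirichlet sample is a vector of Gamma(alpha, 1) draws, normalised to sum to 1.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(n_topics)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(42)
for alpha in (0.1, 1.0, 10.0):
    theta = sample_dirichlet(alpha, n_topics=5, rng=rng)
    print(alpha, [round(p, 2) for p in theta])
```

Small values of α tend to put most of a document's mass on a few topics, while large values tend to give more even mixtures. The same reasoning applies to β and the per-topic word distributions.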

Page 29: LDA Topic Models

LATENT DIRICHLET ALLOCATION

[Plate diagram as on the previous page: α, β, Θ, Z, W; N words, M documents.]

[Figure: the input, a documents × words matrix labelled tfidf.mm and wordids.txt, and the output, a topics × words matrix labelled model.lda.]

Page 30: LDA Topic Models

How many topics (dimensions)?

👤

Page 31: LDA Topic Models

How many topics (dimensions)?

topic, word, context

PERCEPTION: a combination of top-down and bottom-up processing

features, thresholds, context, meaning, dimensions, spaces, gestalts

Page 32: LDA Topic Models

A document is a probability distribution over topics.

A topic is a probability distribution over words.
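
Taken together, these two statements imply that the probability of seeing a word in a document is a mixture over topics. The notation below is an assumption made for this sketch rather than something shown on the slides: K is the number of topics, θ_{d,k} is the share of topic k in document d, and φ_{k,w} is the probability of word w under topic k.

```latex
p(w \mid d) = \sum_{k=1}^{K} \theta_{d,k} \, \phi_{k,w},
\qquad \sum_{k=1}^{K} \theta_{d,k} = 1,
\qquad \sum_{w} \phi_{k,w} = 1 .
```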

topic, word, context

PERCEPTION: a combination of top-down and bottom-up processing

features, thresholds, context, meaning, dimensions, spaces, gestalts

Page 33: LDA Topic Models

[Figure: topic indices 1 to 250; topic 16 is drawn separately from the rest.]

Page 34: LDA Topic Models

[Figure: topic indices 1 to 250; topic 16 is drawn separately from the rest.]

Page 35: LDA Topic Models

[Figure: topic indices 1 to 250; topic 16 is drawn separately from the rest.]

Each document gets represented as a pattern of LDA topics, making every document appear…

…different enough to be separable,

…similar enough to be grouped.

Page 36: LDA Topic Models

DNA

Page 37: LDA Topic Models

DNA

topic #81: 0.019*recipes + 0.017*chef + 0.017*peanut + 0.016*cuisine + 0.015*cooking + 0.015*meat + 0.015*restaurant + 0.015*dish + 0.014*cookery + 0.014*vegetables + 0.014*dishes + 0.012*rice + 0.012*chicken + 0.011*sauce + 0.010*fried + 0.010*beef + 0.009*chefs + 0.009*peanuts + 0.009*bean + 0.009*pork + 0.008*culinary + 0.008*restaurants + 0.008*cucumber + 0.008*recipe + 0.007*kitchen + 0.007*pepper + 0.007*melon + 0.007*ingredients + 0.007*eaten + 0.007*cooked + 0.007*cook + 0.006*potato + 0.006*soup + 0.006*cooks + 0.006*coconut + 0.005*onion + 0.005*meal + 0.005*sausage + 0.005*cabbage + 0.005*anise + 0.005*potatoes + …

topic #143: 0.057*wine + 0.056*plantings + 0.030*wines + 0.024*vineyard + 0.020*grape + 0.020*winery + 0.016*peaches + 0.016*vineyards + 0.015*grapes + 0.012*cabernet + 0.012*pinot + 0.012*vine + 0.012*napa + 0.011*blanc + 0.011*velvety + 0.010*mourad + 0.010*magie + 0.010*sauvignon + 0.010*trophic + 0.009*approachable + 0.009*neda + 0.009*vines + 0.009*gall + 0.009*bano + 0.008*powdery + 0.008*degraw + 0.007*kimiko + 0.007*viticulture + 0.007*dagupan + 0.007*noir + 0.006*haridas + 0.006*aphid + 0.006*mccray + 0.006*chardonnay + 0.006*osmotic + 0.006*tasting + 0.006*merlot + 0.006*benidorm + 0.006*kyōko + …

topic #270: 0.048*dutch + 0.034*netherlands + 0.029*amsterdam + 0.019*danish + 0.014*batavia + 0.014*denmark + 0.014*copenhagen + 0.012*rotterdam + 0.012*holland + 0.011*utrecht + 0.010*hague + 0.010*willem + 0.009*haarlem + 0.009*leiden + 0.008*pieter + 0.008*odense + 0.008*hansen + 0.008*cornelis + 0.007*congreve + 0.007*groningen + 0.007*sint + 0.007*hendrik + 0.007*frans + 0.006*lange + 0.006*roughriders + 0.006*rasmus + 0.005*wilhelmina + 0.005*jørgensen + 0.005*roskilde + 0.005*witton + 0.005*eskimos + 0.005*stampeders + 0.005*vries + 0.005*arnhem + 0.005*nijmegen + 0.005*delft + 0.004*johan + 0.004*niels + 0.004*johannes + …

[Figure: the three topics as axes from a common origin 0, each marked “?” (icons: 🍔, 🏢).]
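
The topic lines above list each topic's most probable words together with their weights. Here is a minimal sketch of how such a line can be produced from a word distribution. The helper `format_topic` is a hypothetical name, and the small dictionary holds only a few of the weights shown for topic #81.

```python
def format_topic(word_probs, top_n=5):
    """Render the top_n most probable words of a topic as 'P*word + P*word + ...'."""
    ranked = sorted(word_probs.items(), key=lambda item: item[1], reverse=True)
    return " + ".join(f"{prob:.3f}*{word}" for word, prob in ranked[:top_n])


# A few word weights from topic #81, deliberately given out of order.
topic_81 = {"chef": 0.017, "recipes": 0.019, "cuisine": 0.016, "meat": 0.015}
print("topic #81: " + format_topic(topic_81, top_n=3))
```

Reading the top words like this is how a human gives a topic its label, since the model itself only produces numbered topics.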

Page 38: LDA Topic Models


Page 39: LDA Topic Models

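LDA space: a simplex (in this example, 3 topics)

[Figure: a simplex with origin 0 and topic icons 🍔 and 🏢.]

Jensen-Shannon Distance = $\sqrt{\text{Jensen-Shannon Divergence}}$ (gives values between 0 and 1 when base-2 logarithms are used)

0.21: a threshold that defines what is considered similar (found experimentally). Distances under 0.21 are similar enough.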
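
As a rough illustration of the metric, the sketch below computes the Jensen-Shannon distance between two topic distributions using only the standard library. Base-2 logarithms are used so that the result stays between 0 and 1. The two example documents are made up, and the 0.21 threshold is the value from the slide, which was found experimentally for that particular data set.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (base-2 logs), a value between 0 and 1."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def js_distance(p, q):
    """Jensen-Shannon distance: the square root of the divergence."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


SIMILARITY_THRESHOLD = 0.21  # from the slide; found experimentally for that data

doc_a = [0.5, 0.3, 0.2]    # hypothetical topic distributions over 3 topics
doc_b = [0.45, 0.35, 0.2]
distance = js_distance(doc_a, doc_b)
print(round(distance, 3), distance < SIMILARITY_THRESHOLD)
```

Unlike the plain Kullback-Leibler divergence, this distance is symmetric and stays finite when one document has zero weight on a topic.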

Page 40: LDA Topic Models
Page 41: LDA Topic Models

features, thresholds, PERCEPTION, context, meaning, dimensions, spaces, gestalts

magazine level: high number of words; noise (ads, editorial stuff, etc.)

Does the model capture the right aspects of a magazine? What is the distance threshold under which magazines are perceived as similar?

[Figure: a scale running from more similar to less similar.]

“All models are wrong, but some are useful.” George E. P. Box

Page 42: LDA Topic Models

👤 Do the neighbours look similar? Where is the distance threshold?

Take this piece of text:

1. Preprocess it. Show me what was removed and what stayed.

2. Get the LDA topic distribution. Show me the topic distribution.

3. Calculate similarity between this page and the rest of the pages. Show me the nearest neighbours, sorted by the distance metric.
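
Step 3 above can be sketched as a simple ranking: compute the distance from the query page to every other page and sort. The page names and topic distributions below are invented for illustration, and the topic distributions are assumed to have been produced already in step 2.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs) between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))


def nearest_neighbours(query, pages, top_n=3):
    """Return (distance, page_id) pairs for the top_n pages closest to the query."""
    ranked = sorted((js_distance(query, dist), page_id) for page_id, dist in pages.items())
    return ranked[:top_n]


# Hypothetical topic distributions over 3 topics for already-scored pages.
pages = {
    "cooking-page": [0.80, 0.15, 0.05],
    "wine-page": [0.10, 0.85, 0.05],
    "travel-page": [0.05, 0.10, 0.85],
}
query = [0.70, 0.25, 0.05]  # topic distribution of the page under inspection

for distance, page_id in nearest_neighbours(query, pages):
    print(f"{distance:.3f}  {page_id}")
```

Looking at the sorted list by eye is what lets a person judge whether the neighbours look similar and where the distance threshold should sit.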

Page 43: LDA Topic Models


Page 44: LDA Topic Models


Page 45: LDA Topic Models


Page 46: LDA Topic Models


Page 47: LDA Topic Models

preprocess the data

The text corpus depends on the application domain.

It should be contextualised, since the window of context determines which words are considered related.

The only observable features for the model are words. Experiment with various stoplists to make sure only the right ones get in.

The training corpus can be different from the documents the model will be scored on.

A good all-purpose corpus is Wikipedia.
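
A minimal sketch of the stoplist step, written so that both the kept and the removed words can be inspected. The stoplist, the minimum word length and the example sentence are all assumptions made for illustration; a real stoplist would be tuned experimentally for the domain.

```python
import re

# Hypothetical stoplist; a real one would be tuned experimentally per domain.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "with"}


def preprocess(text, stoplist=STOPLIST, min_length=3):
    """Split text into kept and removed tokens so both can be inspected."""
    # Runs of letters only; digits and punctuation are dropped before filtering.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    kept, removed = [], []
    for token in tokens:
        if token in stoplist or len(token) < min_length:
            removed.append(token)
        else:
            kept.append(token)
    return kept, removed


kept, removed = preprocess("The chef cooked rice with a sauce of peanuts in 2011.")
print("kept:   ", kept)
print("removed:", removed)
```

Returning the removed tokens alongside the kept ones makes it easy to spot a stoplist that is too aggressive or too lenient.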

train the model

The key parameter is the number of topics. Again, it depends on the domain.

The other parameters are alpha and beta. You can leave them at their defaults to begin with and tune them later.

A good place to start is gensim, a free Python library.

score it on a new document

The goal of the model is not to label documents, but rather to give each one a unique fingerprint so that they can be compared to each other in a humanlike fashion.

evaluate the performance

Evaluation depends on the application.

Use the Jensen-Shannon Distance as the similarity metric.

Evaluation should show whether the model captures the right aspects compared to a human. It will also show what distance threshold is still perceived as similar enough.

Use perplexity to see whether your model is representative of the documents you're scoring it on.
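
As a rough sketch of what perplexity measures, the function below computes per-word perplexity as the exponential of the negative mean log-likelihood of the words. The tiny two-topic model and the documents are invented for illustration. In practice a library would compute this for you; gensim's `LdaModel`, for instance, exposes a `log_perplexity` method that returns a related per-word bound.

```python
import math


def perplexity(documents, doc_topics, topic_words, floor=1e-12):
    """Per-word perplexity: exp of the negative mean log-likelihood of the words."""
    log_likelihood, n_words = 0.0, 0
    for words, theta in zip(documents, doc_topics):
        for word in words:
            # p(word | document) is a mixture over topics.
            p = sum(t * phi.get(word, 0.0) for t, phi in zip(theta, topic_words))
            log_likelihood += math.log(max(p, floor))  # floor guards against log(0)
            n_words += 1
    return math.exp(-log_likelihood / n_words)


# Hypothetical 2-topic model over a 4-word vocabulary.
topic_words = [
    {"chef": 0.5, "sauce": 0.4, "wine": 0.05, "grape": 0.05},
    {"chef": 0.05, "sauce": 0.05, "wine": 0.5, "grape": 0.4},
]
documents = [["chef", "sauce", "chef"], ["wine", "grape"]]

good_fit = perplexity(documents, [[0.9, 0.1], [0.1, 0.9]], topic_words)
poor_fit = perplexity(documents, [[0.1, 0.9], [0.9, 0.1]], topic_words)
print(round(good_fit, 2), round(poor_fit, 2))
```

Lower perplexity means the model is less surprised by the documents, so a model that represents them well should score lower than one that does not.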

Page 48: LDA Topic Models


thank you

Andrius Knispelis