LDA Topic Models

Post on 25-Jul-2016

description

LDA topic models are a powerful tool for extracting meaning from text. In this video I talk about the idea behind LDA, why it works, which free tools and frameworks can be used, which LDA parameters are tunable, what they mean for your specific use case, and what to look for when you evaluate the model.

Transcript of LDA Topic Models

LDA Topic Models: turning words into meaning

Andrius Knispelis

In 2011 I joined the Danish startup issuu (the fastest growing online publishing platform) as their first Data Scientist.

Over the following 4 years I worked on many interesting things there. By far the coolest of all was topic modelling.

Let me share with you:

What is LDA topic modelling? Why do you need it? How do you build it?

[Figure labels: placement (where & when); content (related, similar); reading patterns]

TROPICAL FRUIT

Serve up something new with...

RECIPES: GLYNIS MCGUINNESS, GREGOR MCMASTER. PHOTOGRAPHS: JONATHAN KENNEDY. STYLING: TAMZIN FERDINANDO, JENNY IGGLEDEN. FOOD STYLING: DENISE SMART, KATE BLINMAN

KIWI FRUIT: Cheesecake layers. Put a few digestive biscuits in a freezer bag and smash with a rolling pin. Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit. Layer in glasses until full.

MANGO: Spicy mango salad with pork. Peel and stone ripe mango and slice. Mix with sliced red onion, quartered cherry tomatoes, a sliced chilli, chopped coriander and a squeeze of lemon juice. Serve with grilled pork chops or steaks.

PASSION FRUIT: Tropical pavlova. Whip double cream until thick and spoon into meringue nests. Top with mango slices. Halve 2 passion fruits, scoop out the seeds and flesh. Spoon over the meringues.

PINEAPPLE: Rum-flavoured rings. Remove the pineapple ends and peel. Slice thickly and remove the core. Heat a little butter with brown sugar and stir until melted. Add a good splash of dark rum and the pineapple and simmer for 5-10 minutes.

Why not also try...

• Salsa: Peel and dice some kiwi fruits and peeled, stoned avocados. Toss in lime juice, then stir in a little finely chopped shallot and deseeded red chilli. Serve with meat or fish.

• Kiwi & chicken wraps

• Ice cream topping: Simply

• Rice salad: Cook long

knowledge about the world

📹 👶 $ ⚛ ♥ 💰 🎂 ) 🍺 🏈 🚘

What is it about? What is it related to? What does it feel like? What does it mean?

topic, word, context

right level of abstraction

use the right words

set the right “window” of context

capture the widest range of topics

🌎 Millions of articles in Wikipedia, capturing the widest range of topics

How to…

preprocess the data?

train the model?

score it on a new document?

evaluate the performance?

gensim: topic modeling framework

Free Python library

LDA, hierarchical LDA, dynamic LDA, DeepLearning, Word2Vec, Doc2Vec, POS, …

[Figure: 🌎 Wikipedia as a collection of articles, each a title followed by its words]

Setting the right “window” of context

(a) Define the minimum number of words that must be present in an article.

Recommended 100 - 300
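The length filter can be sketched in a few lines. This is only an illustration, assuming articles arrive as (title, text) pairs and that splitting on whitespace is a good enough word count; `MIN_WORDS` and `keep_long_articles` are made-up names, not part of any library.

```python
MIN_WORDS = 100  # lower end of the recommended 100 - 300 range


def keep_long_articles(articles, min_words=MIN_WORDS):
    """Yield only the (title, text) pairs whose text has at least min_words words."""
    for title, text in articles:
        if len(text.split()) >= min_words:
            yield title, text


# Toy input: one stub and one article that is long enough.
articles = [
    ("Stub", "too short to give a topic any context"),
    ("Mango", "mango " * 150),
]
kept = [title for title, _ in keep_long_articles(articles)]
print(kept)
```

Raising the threshold towards 300 trades corpus size for richer context per article.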

Setting the right “window” of context

(b) Skip articles whose titles start with these namespaces:

Wikipedia:, Category:, File:, Portal:, Template:, MediaWiki:, User:, Help:, Book:, Draft:

[Chart values: 1,167,766; 907,811; 892,147; 571,248; 128,603; 8,893; 4,815; 2,324; 1,505; 915. Axis: 0 to 1,500,000]
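A minimal sketch of the namespace filter, assuming each article title is available as a plain string. The prefix list is the one from the slide; `SKIPPED_NAMESPACES` and `is_content_article` are illustrative names.

```python
SKIPPED_NAMESPACES = (
    "Wikipedia:", "Category:", "File:", "Portal:", "Template:",
    "MediaWiki:", "User:", "Help:", "Book:", "Draft:",
)


def is_content_article(title):
    """Return True unless the title starts with one of the skipped namespaces."""
    # str.startswith accepts a tuple and checks every prefix in it.
    return not title.startswith(SKIPPED_NAMESPACES)


titles = ["Mango", "Category:Tropical fruit", "Template:Infobox", "Pineapple"]
content_titles = [t for t in titles if is_content_article(t)]
print(content_titles)
```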

Let the right words in

Word length:
1: i
2: do, be, am, …
3: ice, was, who, …
16: videoconferences, …
17: superbillionaires, …
19: intellectualization, …

Stoplists: general terms; last names, first names; countries, cities

Lemmatization: am, are, is → be

Parts of Speech:
NN - noun (computer, car, cake, …)
VB - verb (play, install, commit, …)
RB - adverb (today, quickly, patiently, …)
JJ - adjective (red, awesome, big, …)
IN - preposition (of, about, from, …)
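The word-level filters above could be combined roughly as follows. This is a sketch under loud assumptions: the stoplist, the lemma table and the length cut-offs are tiny made-up stand-ins, and part-of-speech filtering is left out because it needs a trained tagger.

```python
import re

# Illustrative stand-ins: a real stoplist and lemma table would be far larger.
STOPLIST = {"the", "and", "with", "london", "smith"}
LEMMAS = {"am": "be", "are": "be", "is": "be", "mangoes": "mango"}
MIN_LEN, MAX_LEN = 3, 15  # hypothetical cut-offs, not taken from the talk


def clean_tokens(text):
    """Lowercase, tokenize, lemmatize, then drop stop words and odd lengths."""
    tokens = re.findall(r"[a-z]+", text.lower())
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    return [w for w in lemmas if MIN_LEN <= len(w) <= MAX_LEN and w not in STOPLIST]


cleaned = clean_tokens(
    "The mangoes are ripe and Smith is in London with videoconferences"
)
print(cleaned)
```

Lemmatizing before the length check means "are" and "is" collapse to "be" and then fall below the minimum length, so very common verbs disappear without needing a stoplist entry.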

Keep the top n words, discard the rest.

Recommended n: 50,000 - 100,000

Remove a word if it appears in more than 10% of the articles.

Remove a word if it appears in fewer than 20 articles.
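These document-frequency rules, together with a cap on vocabulary size, can be sketched in plain Python. The function `build_vocabulary` is hypothetical; its parameter names mirror gensim's `Dictionary.filter_extremes` (`no_below`, `no_above`, `keep_n`), which offers this kind of filtering out of the box.

```python
from collections import Counter


def build_vocabulary(docs, no_below=20, no_above=0.10, keep_n=100_000):
    """docs is a list of token lists; return the set of words that survive."""
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each word once per document
    max_docs = no_above * len(docs)
    # Drop words that are too rare or too common across documents.
    survivors = {w: df for w, df in doc_freq.items() if no_below <= df <= max_docs}
    # Keep the keep_n words with the highest document frequency.
    top = sorted(survivors.items(), key=lambda item: (-item[1], item[0]))[:keep_n]
    return {word for word, _ in top}


# Toy corpus of 10 documents, with thresholds scaled down to match.
docs = [
    ["fruit", "mango"], ["fruit", "mango"], ["fruit", "mango", "kiwi"],
    ["fruit", "kiwi"], ["fruit", "pavlova"],
] + [["fruit"]] * 5
vocab = build_vocabulary(docs, no_below=2, no_above=0.5)
print(sorted(vocab))
```

In the toy corpus "fruit" is in every document and "pavlova" in only one, so both are dropped, which is the behaviour the two rules aim for at Wikipedia scale.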

[Figure: the preprocessed corpus as a documents × words matrix, stored as tfidf.mm and wordids.txt; N words, M documents]

LATENT DIRICHLET ALLOCATION

A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003.

It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.

W: the observed words in document i

Z: the topic for the j’th word in document i

Θ: the topic distribution for document i

Let's assume that…

Take this recipe and generate a document based on the model’s “rules”.

Recipe: topic#1 50%, topic#2 30%, topic#3 20%

Topics (themes, …): topic#1, topic#2, topic#3. Each topic is a list of weighted words: P * word, P * word, P * word, …
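The generative story can be made concrete with a toy sampler. Everything here is made up for illustration: the recipe uses the 50/30/20 split from the slide, but the words and their probabilities are hypothetical, and `pick` and `generate_document` are not library functions.

```python
import random

# Hypothetical toy model: word probabilities are invented for the example.
recipe = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}
topics = {
    "topic#1": {"mango": 0.5, "kiwi": 0.3, "pineapple": 0.2},
    "topic#2": {"cream": 0.6, "sugar": 0.4},
    "topic#3": {"pork": 0.7, "chilli": 0.3},
}


def pick(distribution, rng):
    """Draw one key from a {key: probability} mapping."""
    items, weights = zip(*distribution.items())
    return rng.choices(items, weights=weights, k=1)[0]


def generate_document(n_words, rng):
    """For every word slot: pick a topic from the recipe, then a word from it."""
    words = []
    for _ in range(n_words):
        topic = pick(recipe, rng)
        words.append(pick(topics[topic], rng))
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
doc = generate_document(10, rng)
print(" ".join(doc))
```

Training runs this story in reverse: given only the generated words, it looks for the recipe and the topic word lists that would most plausibly have produced them.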

Take this collection of documents and learn a model that describes it best…

what really happens…

Words appearing in the same context (document) are related.

…given these model parameters:

how many topics?

how are those topics assigned to a document?

Take this recipe and generate a document based on the model’s “rules”

[Figure: topic#1, topic#2, …, topic#N, each a list of weighted words (P * word), generate the words of a document]

Let's assume that…

what really happens…

recipe: topic#1 50%, topic#2 30%, topic#3 20%
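The "recipe" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three topics and their word probabilities are made up, the recipe is fixed at the 50% / 30% / 20% split from the slide (in full LDA the recipe itself is drawn from a prior), and `generate_document` is a hypothetical helper name.

```python
import random

# Hypothetical per-topic word distributions (illustrative values only).
TOPICS = {
    "topic#1": {"recipe": 0.5, "chef": 0.3, "sauce": 0.2},
    "topic#2": {"wine": 0.6, "grape": 0.4},
    "topic#3": {"dutch": 0.7, "amsterdam": 0.3},
}

# The document's "recipe" from the slide: 50% / 30% / 20%.
RECIPE = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}


def generate_document(recipe, topics, n_words, seed=None):
    """Generate n_words: for each word, draw a topic, then a word from that topic."""
    rng = random.Random(seed)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's recipe.
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        # Pick a word according to that topic's word distribution.
        word_probs = topics[topic]
        word = rng.choices(list(word_probs), weights=list(word_probs.values()))[0]
        words.append(word)
    return words


if __name__ == "__main__":
    print(" ".join(generate_document(RECIPE, TOPICS, n_words=20, seed=0)))
```

Training runs this story backwards: it only sees the generated words and has to recover both the topics and each document's recipe.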

LATENT DIRICHLET ALLOCATION

A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003.

It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.

[Figure: the corpus as a documents × words matrix (tfidf.mm, wordids.txt)]

[Diagram labels: topic, context, word]

[Figure: plate diagram with N words in each of M documents]

α: a parameter that sets the prior on the per-document topic distributions

β: a parameter that sets the prior on the per-topic word distributions

Θ: the topic distribution for document i

Z: the topic for the j'th word in a document i

W: observed words in a document i
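In the usual textbook formulation (an assumption here, since the slide shows only the diagram), the symbols above fit together as follows. K (the number of topics) and φ_k (the word distribution of topic k) are names introduced for this sketch and do not appear on the slide.

```latex
\begin{align*}
\theta_i    &\sim \mathrm{Dirichlet}(\alpha),       & i &= 1, \dots, M \\
\varphi_k   &\sim \mathrm{Dirichlet}(\beta),        & k &= 1, \dots, K \\
z_{ij}      &\sim \mathrm{Categorical}(\theta_i),   & j &= 1, \dots, N \\
w_{ij}      &\sim \mathrm{Categorical}(\varphi_{z_{ij}})
\end{align*}
```

Only the words w are observed; everything else is latent and has to be inferred during training.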

[Figure: the trained model as a topics × words matrix (model.lda)]

How many topics (dimensions)?

[Diagram labels: topic, word, context]

PERCEPTION: a combination of top-down and bottom-up processing

[Diagram labels: features, thresholds, context, meaning, dimensions, spaces, gestalts]

A document is a probability distribution over topics.

A topic is a probability distribution over words.

[Figure: bar patterns over topic indices 1 to 250, one pattern per document]

Each document gets represented as a pattern of LDA topics, making every document appear…

…different enough to be separable,

…similar enough to be grouped.

DNA

topic #81: 0.019*recipes + 0.017*chef + 0.017*peanut + 0.016*cuisine + 0.015*cooking + 0.015*meat + 0.015*restaurant + 0.015*dish + 0.014*cookery + 0.014*vegetables + 0.014*dishes + 0.012*rice + 0.012*chicken + 0.011*sauce + 0.010*fried + 0.010*beef + 0.009*chefs + 0.009*peanuts + 0.009*bean + 0.009*pork + 0.008*culinary + 0.008*restaurants + 0.008*cucumber + 0.008*recipe + 0.007*kitchen + 0.007*pepper + 0.007*melon + 0.007*ingredients + 0.007*eaten + 0.007*cooked + 0.007*cook + 0.006*potato + 0.006*soup + 0.006*cooks + 0.006*coconut + 0.005*onion + 0.005*meal + 0.005*sausage + 0.005*cabbage + 0.005*anise + 0.005*potatoes + …

topic #143: 0.057*wine + 0.056*plantings + 0.030*wines + 0.024*vineyard + 0.020*grape + 0.020*winery + 0.016*peaches + 0.016*vineyards + 0.015*grapes + 0.012*cabernet + 0.012*pinot + 0.012*vine + 0.012*napa + 0.011*blanc + 0.011*velvety + 0.010*mourad + 0.010*magie + 0.010*sauvignon + 0.010*trophic + 0.009*approachable + 0.009*neda + 0.009*vines + 0.009*gall + 0.009*bano + 0.008*powdery + 0.008*degraw + 0.007*kimiko + 0.007*viticulture + 0.007*dagupan + 0.007*noir + 0.006*haridas + 0.006*aphid + 0.006*mccray + 0.006*chardonnay + 0.006*osmotic + 0.006*tasting + 0.006*merlot + 0.006*benidorm + 0.006*kyōko + …

topic #270: 0.048*dutch + 0.034*netherlands + 0.029*amsterdam + 0.019*danish + 0.014*batavia + 0.014*denmark + 0.014*copenhagen + 0.012*rotterdam + 0.012*holland + 0.011*utrecht + 0.010*hague + 0.010*willem + 0.009*haarlem + 0.009*leiden + 0.008*pieter + 0.008*odense + 0.008*hansen + 0.008*cornelis + 0.007*congreve + 0.007*groningen + 0.007*sint + 0.007*hendrik + 0.007*frans + 0.006*lange + 0.006*roughriders + 0.006*rasmus + 0.005*wilhelmina + 0.005*jørgensen + 0.005*roskilde + 0.005*witton + 0.005*eskimos + 0.005*stampeders + 0.005*vries + 0.005*arnhem + 0.005*nijmegen + 0.005*delft + 0.004*johan + 0.004*niels + 0.004*johannes + …

The LDA space is a simplex (in this example, 3 topics).

$\text{Jensen-Shannon Distance} = \sqrt{\text{Jensen-Shannon Divergence}}$ (with base-2 logarithms, gives values between 0 and 1)

A threshold defines what is considered similar (found experimentally): here, 0.21 still counts as similar enough.
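A minimal sketch of the distance and the threshold check, using only the standard library. The function names are hypothetical, and base-2 logarithms are assumed so that the result stays between 0 and 1; the 0.21 threshold is the experimentally found value quoted above and would need re-tuning for another corpus.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms with p_i == 0 contribute 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the Jensen-Shannon divergence."""
    # m is the midpoint distribution; it is positive wherever p or q is.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


SIMILARITY_THRESHOLD = 0.21  # value quoted on the slide; found experimentally


def is_similar(p, q, threshold=SIMILARITY_THRESHOLD):
    """True when two topic distributions are within the similarity threshold."""
    return js_distance(p, q) <= threshold
```

Identical distributions get distance 0, and distributions with no shared topics get distance 1, which is what makes a single fixed threshold workable.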

Magazine level: high number of words; noise (ads, editorial stuff, etc.).

Does the model capture the right aspects of a magazine?

What is the distance threshold under which magazines are perceived as similar?

“All models are wrong, but some are useful.” George E. P. Box

[Scale: from more similar to less similar]

Do the neighbours look similar? Where is the distance threshold?

Take this piece of text:

1. Preprocess it. Show me what was removed and what stayed.

2. Get the LDA topic distribution. Show me the topic distribution.

3. Calculate similarity between this page and the rest of the pages. Show me the nearest neighbours, sorted by the distance metric.
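Step 3 could be sketched as below. The page names and their 3-topic distributions are invented for illustration, `nearest_neighbours` is a hypothetical helper, and the distance is the Jensen-Shannon distance with base-2 logarithms.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance (base-2 logs, so values lie between 0 and 1)."""
    def kl(a, b):
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))


def nearest_neighbours(query, pages, top_n=3):
    """Return (page_id, distance) pairs sorted from most to least similar."""
    scored = [(page_id, js_distance(query, dist)) for page_id, dist in pages.items()]
    return sorted(scored, key=lambda pair: pair[1])[:top_n]


# Hypothetical topic distributions over 3 topics, for illustration only.
PAGES = {
    "cooking_page": [0.80, 0.15, 0.05],
    "wine_page": [0.10, 0.85, 0.05],
    "travel_page": [0.05, 0.10, 0.85],
}

if __name__ == "__main__":
    query = [0.70, 0.25, 0.05]
    for page_id, distance in nearest_neighbours(query, PAGES):
        print(f"{page_id}: {distance:.3f}")
```

Reading down the sorted list and noting where the neighbours stop looking similar is one way to find the distance threshold by eye.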

preprocess the data

The text corpus depends on the application domain.

It should be contextualised, since the window of context determines which words are considered related.

The only observable features for the model are words. Experiment with various stoplists to make sure only the right ones get in.

The training corpus can be different from the documents the model will be scored on.

A good all-purpose corpus is Wikipedia.
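A small sketch of stoplist-based preprocessing that also reports what was removed and what stayed. The stoplist, the minimum word length and the `preprocess` helper are all assumptions for illustration; a real stoplist would be tuned to the domain.

```python
import re

# A tiny illustrative stoplist; a real one would be tuned to the domain.
STOPLIST = {"the", "a", "an", "and", "of", "to", "in", "is", "with"}


def preprocess(text, stoplist=STOPLIST, min_length=3):
    """Lowercase, tokenise, and split tokens into kept and removed lists."""
    # Runs of letters only: digits, punctuation and underscores are dropped.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    kept, removed = [], []
    for token in tokens:
        if token in stoplist or len(token) < min_length:
            removed.append(token)
        else:
            kept.append(token)
    return kept, removed


if __name__ == "__main__":
    kept, removed = preprocess("The chef cooked rice with a spicy sauce in 2016.")
    print("stayed: ", kept)
    print("removed:", removed)
```

Keeping the removed list around makes it easy to spot a stoplist that is too aggressive or too lenient.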

train the model

The key parameter is the number of topics. Again, it depends on the domain.

Other parameters are alpha and beta. You can leave them aside to begin with and only tune them later.

A good place to start is gensim, a free Python library.

score it on a new document

The goal of the model is not to label documents, but rather to give them a unique fingerprint so that they can be compared to each other in a human-like fashion.

evaluate the performance

Evaluation depends on the application.

Use Jensen-Shannon Distance as the similarity metric.

Evaluation should show whether the model captures the right aspects compared to a human. It will also show what distance threshold is still perceived as similar enough.

Use perplexity to see if your model is representative of the documents you’re scoring it on.
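As a rough illustration of what perplexity measures, the sketch below computes it directly from a document's topic distribution and the topics' word distributions. The helper names and input layout are assumptions; libraries report their own variants (gensim, for example, exposes a related bound through `LdaModel.log_perplexity`).

```python
import math


def word_probability(word, doc_topics, topic_words):
    """p(word | document) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(
        p_topic * topic_words[k].get(word, 0.0)
        for k, p_topic in enumerate(doc_topics)
    )


def perplexity(documents, doc_topic_dists, topic_words):
    """exp of the negative average per-word log-likelihood; lower is better.

    Assumes every word has non-zero probability under the model.
    """
    log_likelihood, n_words = 0.0, 0
    for words, doc_topics in zip(documents, doc_topic_dists):
        for word in words:
            log_likelihood += math.log(word_probability(word, doc_topics, topic_words))
            n_words += 1
    return math.exp(-log_likelihood / n_words)
```

A model that spreads its probability evenly over V words has perplexity V, so a value far below the vocabulary size on held-out documents suggests the model is representative of them.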

thank you

Andrius Knispelis

andrius.knispelis@gmail.com