LDA Topic Models
LDA Topic Models: turning words into meaning
Andrius Knispelis
In 2011 I joined the Danish startup issuu (the fastest growing online publishing platform) as their first Data Scientist.
Over the following 4 years I worked on many interesting things there. By far the coolest of them was topic modelling.
Let me share with you:
What is LDA Topic Modelling? Why do you need one? How to build it?
[Slide diagram: placement (where & when), related content, similar content, reading patterns]
TROPICAL FRUIT
Serve up something new with...
RECIPES: GLYNIS MCGUINNESS, GREGOR MCMASTER. PHOTOGRAPHS: JONATHAN KENNEDY. STYLING: TAMZIN FERDINANDO, JENNY IGGLEDEN. FOOD STYLING: DENISE SMART, KATE BLINMAN
KIWI FRUIT: Cheesecake layers. Put a few digestive biscuits in a freezer bag and smash with a rolling pin. Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit. Layer in glasses until full.
MANGO: Spicy mango salad with pork. Peel and stone ripe mango and slice. Mix with sliced red onion, quartered cherry tomatoes, a sliced chilli, chopped coriander and a squeeze of lemon juice. Serve with grilled pork chops or steaks.
PASSION FRUIT: Tropical pavlova. Whip double cream until thick and spoon into meringue nests. Top with mango slices. Halve 2 passion fruits, scoop out the seeds and flesh. Spoon over the meringues.
PINEAPPLE: Rum-flavoured rings. Remove the pineapple ends and peel. Slice thickly and remove the core. Heat a little butter with brown sugar and stir until melted. Add a good splash of dark rum and the pineapple and simmer for 5-10 minutes.
Why not also try... • Salsa: Peel and dice some kiwi fruits and peeled, stoned avocados. Toss in lime juice, then stir in a little finely chopped shallot and deseeded red chilli. Serve with meat or fish. • Kiwi & chicken wraps
Why not also try... • Ice cream topping: Simply
Why not also try... • Rice salad: Cook long
knowledge about the world
📹 👶 $ ⚛ ♥ 💰 🎂 🍺 🏈 🚘
What is it about? What is it related to? What does it feel like? What does it mean?
topic, word, context
right level of abstraction
use the right words
set the right “window” of context
capture the widest range of topics
🌎 millions of articles in Wikipedia, capturing the widest range of topics
How to…
preprocess the data?
train the model?
score it on a new document?
evaluate the performance?
gensim: topic modeling framework
Free Python library
LDA, hierarchical LDA, dynamic LDA, Deep Learning, Word2Vec, Doc2Vec, POS, …
🌎 wikipedia
[Slide graphic: a collection of Wikipedia articles, each shown as a title followed by its words]
Setting the right “window” of context
(a) Define the minimum number of words to be present in an article.
Recommended: 100 - 300
(b) Skip articles whose titles start with these namespaces:
Wikipedia:, Category:, File:, Portal:, Template:, MediaWiki:, User:, Help:, Book:, Draft:
[Bar chart: page counts per namespace, axis from 0 to 1.500.000]
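The two context-window rules (a minimum article length and a namespace skip list) can be sketched in a few lines of plain Python. This is only an illustration: `keep_article`, `SKIPPED_NAMESPACES` and `MIN_WORDS` are hypothetical names, the threshold of 100 is simply the lower end of the recommended range, and splitting on whitespace stands in for a real tokenizer.

```python
# Namespaces taken from the slide; MIN_WORDS is an assumed value inside the
# recommended 100 - 300 range.
SKIPPED_NAMESPACES = (
    "Wikipedia:", "Category:", "File:", "Portal:", "Template:",
    "MediaWiki:", "User:", "Help:", "Book:", "Draft:",
)
MIN_WORDS = 100


def keep_article(title, text, min_words=MIN_WORDS):
    """Return True if the article passes both context-window filters."""
    # (b) Skip non-article namespaces; str.startswith accepts a tuple.
    if title.startswith(SKIPPED_NAMESPACES):
        return False
    # (a) Require a minimum number of words (whitespace split is a simplification).
    return len(text.split()) >= min_words


# Toy articles, invented for the example.
articles = [
    ("Mango", "mango " * 150),
    ("Category:Tropical fruit", "fruit " * 500),
    ("Kiwi (stub)", "kiwi " * 12),
]
kept = [title for title, text in articles if keep_article(title, text)]
print(kept)
```

Only the first toy article survives: the second is dropped for its namespace and the third for being too short.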
Remove a word if it appears in more than 10% of the articles.
Remove a word if it appears in fewer than 20 articles.
Let the right words in
Word length:
1: i
2: do, be, am, …
3: ice, was, who, …
16: videoconferences, …
17: superbillionaires, …
18: intellectualization, …
Stoplists: general terms; last names, first names; countries, cities
Lemmatization: am, are, is = be
Parts of Speech:
NN - noun (computer, car, cake, …)
VB - verb (play, install, commit, …)
RB - adverb (today, quickly, patiently, …)
JJ - adjective (red, awesome, big, …)
IN - preposition (of, about, from, …)
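A minimal sketch of the word-level filters, assuming specific cut-offs the slides do not state: the length bounds (3 to 15 characters), the stoplist entries and the tiny lemma table are all made up for illustration. Part-of-speech filtering is left out because it needs a tagger from an NLP library.

```python
# All thresholds and word lists below are illustrative assumptions.
MIN_LEN, MAX_LEN = 3, 15
STOPLIST = {"the", "and", "with", "denmark", "copenhagen", "andrius"}
LEMMAS = {"am": "be", "are": "be", "is": "be", "fruits": "fruit"}


def clean_tokens(text):
    """Lowercase, strip punctuation, lemmatize, then apply length and stoplist filters."""
    tokens = []
    for raw in text.lower().split():
        word = raw.strip(".,;:!?()\"'")
        word = LEMMAS.get(word, word)       # lemmatization via lookup table
        if not word.isalpha():
            continue                        # drop numbers and mixed tokens
        if not MIN_LEN <= len(word) <= MAX_LEN:
            continue                        # drop very short and very long words
        if word in STOPLIST:
            continue                        # drop general terms, names, places
        tokens.append(word)
    return tokens


tokens = clean_tokens("The kiwi fruits are served with superbillionaires in Copenhagen.")
print(tokens)  # ['kiwi', 'fruit', 'served']
```

A real pipeline would swap the lookup table for a proper lemmatizer and load the stoplists from files.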
Keep the top n words, discard the rest.
Recommended n: 50.000 - 100.000
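The three vocabulary rules (drop words found in more than 10% of the articles, drop words found in fewer than 20 articles, keep only the top n) can be sketched in plain Python as follows. `build_vocabulary` is a hypothetical helper, not the author's code. Its parameter names mirror the `no_below`, `no_above` and `keep_n` arguments of gensim's `Dictionary.filter_extremes`, which applies comparable filters.

```python
from collections import Counter


def build_vocabulary(docs, no_below=20, no_above=0.10, keep_n=100_000):
    """docs: list of token lists. Returns the set of words to keep."""
    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))

    max_docs = no_above * len(docs)
    candidates = [
        (word, df) for word, df in doc_freq.items()
        if df >= no_below and df <= max_docs
    ]
    # Keep the keep_n most frequent survivors, discard the rest.
    candidates.sort(key=lambda item: (-item[1], item[0]))
    return {word for word, _ in candidates[:keep_n]}


# Toy corpus with scaled-down thresholds so the effect is visible.
docs = [["fruit", "mango"], ["fruit", "kiwi"], ["fruit", "mango"], ["fruit", "rum"]]
vocab = build_vocabulary(docs, no_below=2, no_above=0.5, keep_n=10)
print(vocab)  # {'mango'}
```

In the toy corpus "fruit" is too common (4 of 4 documents), "kiwi" and "rum" are too rare (1 document each), and only "mango" remains.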
LATENT DIRICHLET ALLOCATION
A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003.
It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.
[Slide diagram: documents × words matrix; files tfidf.mm and wordids.txt]
N words, M documents
W: the observed words in document i
Z: the topic for the j’th word in document i
Θ: the topic distribution for document i
topic, context, word
Let's assume that…
Take this recipe and generate a document based on the model’s “rules”.
recipe: topic#1 50%, topic#2 30%, topic#3 20%
topics, themes, …: topic#1, topic#2, topic#3, each a list of weighted words (P * word, P * word, …)
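The "recipe" idea can be sketched as a small sampling loop: for every word position, pick a topic according to the recipe, then pick a word according to that topic's word probabilities. The 50/30/20 recipe comes from the slide. The word lists and their probabilities are invented for illustration, and `generate_document` is a hypothetical helper.

```python
import random

# The "recipe" from the slide: topic proportions for one document.
recipe = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}

# Hypothetical per-topic word probabilities (the "P * word" lists).
topics = {
    "topic#1": {"mango": 0.5, "kiwi": 0.3, "pineapple": 0.2},
    "topic#2": {"peel": 0.6, "slice": 0.4},
    "topic#3": {"cream": 0.7, "sugar": 0.3},
}


def generate_document(recipe, topics, n_words, rng):
    words = []
    for _ in range(n_words):
        # 1. Pick a topic for this word position according to the recipe.
        topic = rng.choices(list(recipe), weights=list(recipe.values()))[0]
        # 2. Pick a word according to that topic's word probabilities.
        dist = topics[topic]
        word = rng.choices(list(dist), weights=list(dist.values()))[0]
        words.append(word)
    return words


rng = random.Random(0)  # fixed seed so the run is repeatable
document = generate_document(recipe, topics, n_words=12, rng=rng)
print(" ".join(document))
```

Word order carries no meaning here: the generated document is a bag of words, which is the assumption LDA makes.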
What really happens…
Take this collection of documents and learn a model that describes it best…
Words appearing in the same context (document) are related.
…given these model parameters:
how many topics?
how are those topics assigned to a document?
Take this recipe and generate a document based on the model’s “rules”
words appearing in the same context (document) are
related
[Figure: topic#1, topic#2, … topic#N, each a list of probability-weighted words (P * word), are used to generate the words of a document.]
Let's assume that…
what really happens…
recipe: topic#1 50%, topic#2 30%, topic#3 20%
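The "recipe" idea can be sketched in a few lines of Python. This is only an illustration of the generative assumption, not how a real model is trained: the three toy topics and their word probabilities are made up, and only the 50/30/20 recipe comes from the slide.

```python
import random

# Hypothetical toy model: three topics, each a probability distribution over words.
topics = {
    "topic#1": {"recipe": 0.5, "chef": 0.3, "sauce": 0.2},
    "topic#2": {"wine": 0.6, "grape": 0.4},
    "topic#3": {"dutch": 0.7, "amsterdam": 0.3},
}

# The recipe from the slide: a probability distribution over topics.
recipe = {"topic#1": 0.5, "topic#2": 0.3, "topic#3": 0.2}


def generate_document(recipe, topics, n_words, rng):
    """For each word position, draw a topic from the recipe, then a word from that topic."""
    topic_names = list(recipe)
    topic_weights = [recipe[name] for name in topic_names]
    words = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = [topics[topic][word] for word in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words


rng = random.Random(0)  # fixed seed so the sketch is repeatable
document = generate_document(recipe, topics, n_words=20, rng=rng)
print(" ".join(document))
```

Training a model runs this story in reverse: only the words are observed, and the topics and recipes have to be inferred from them.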
LATENT DIRICHLET ALLOCATION
A topic model developed by David Blei, Andrew Ng and Michael Jordan in 2003.
It tells us what topics are present in any given document by observing all the words in it and producing a topic distribution.
[Figure: the input files tfidf.mm and wordids.txt, shown as a documents × words matrix.]
[Figure: plate diagram relating topic, context and word, for M documents of N words each.]
α: a parameter that sets the prior on the per-document topic distributions
β: a parameter that sets the prior on the per-topic word distributions
Θ: the topic distribution for document i
Z: the topic for the j'th word in document i
W: the observed words in document i
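Read as a generative story, the plate diagram labels above can be written out as below. This is a sketch in the commonly used smoothed-LDA notation. The symbols φ_k (the word distribution of topic k) and K (the number of topics) do not appear on the slide and are assumptions introduced here.

```latex
\begin{align*}
\theta_i  &\sim \mathrm{Dirichlet}(\alpha)            && i = 1, \dots, M \\
\varphi_k &\sim \mathrm{Dirichlet}(\beta)             && k = 1, \dots, K \\
z_{ij}    &\sim \mathrm{Categorical}(\theta_i)        && j = 1, \dots, N \\
w_{ij}    &\sim \mathrm{Categorical}(\varphi_{z_{ij}})
\end{align*}
```

Only the words w are observed. Everything else is latent and has to be inferred from the corpus.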
[Figure: the trained model (model.lda), shown as a topics × words matrix.]
How many topics (dimensions)?
PERCEPTION: a combination of top-down and bottom-up processing
[Figure labels: topic, word, context; features, thresholds, context, meaning, dimensions, spaces, gestalts]
A document is a probability distribution over topics.
A topic is a probability distribution over words.
[Figure: a document shown as a pattern of weights over topics 1 to 250.]
Each document is represented as a pattern of LDA topics, making every document appear different enough to be separable, yet similar enough to be grouped.
DNA
topic #81: 0.019*recipes + 0.017*chef + 0.017*peanut + 0.016*cuisine + 0.015*cooking + 0.015*meat + 0.015*restaurant + 0.015*dish + 0.014*cookery + 0.014*vegetables + 0.014*dishes + 0.012*rice + 0.012*chicken + 0.011*sauce + 0.010*fried + 0.010*beef + 0.009*chefs + 0.009*peanuts + 0.009*bean + 0.009*pork + 0.008*culinary + 0.008*restaurants + 0.008*cucumber + 0.008*recipe + 0.007*kitchen + 0.007*pepper + 0.007*melon + 0.007*ingredients + 0.007*eaten + 0.007*cooked + 0.007*cook + 0.006*potato + 0.006*soup + 0.006*cooks + 0.006*coconut + 0.005*onion + 0.005*meal + 0.005*sausage + 0.005*cabbage + 0.005*anise + 0.005*potatoes + …
topic #143: 0.057*wine + 0.056*plantings + 0.030*wines + 0.024*vineyard + 0.020*grape + 0.020*winery + 0.016*peaches + 0.016*vineyards + 0.015*grapes + 0.012*cabernet + 0.012*pinot + 0.012*vine + 0.012*napa + 0.011*blanc + 0.011*velvety + 0.010*mourad + 0.010*magie + 0.010*sauvignon + 0.010*trophic + 0.009*approachable + 0.009*neda + 0.009*vines + 0.009*gall + 0.009*bano + 0.008*powdery + 0.008*degraw + 0.007*kimiko + 0.007*viticulture + 0.007*dagupan + 0.007*noir + 0.006*haridas + 0.006*aphid + 0.006*mccray + 0.006*chardonnay + 0.006*osmotic + 0.006*tasting + 0.006*merlot + 0.006*benidorm + 0.006*kyōko + …
topic #270: 0.048*dutch + 0.034*netherlands + 0.029*amsterdam + 0.019*danish + 0.014*batavia + 0.014*denmark + 0.014*copenhagen + 0.012*rotterdam + 0.012*holland + 0.011*utrecht + 0.010*hague + 0.010*willem + 0.009*haarlem + 0.009*leiden + 0.008*pieter + 0.008*odense + 0.008*hansen + 0.008*cornelis + 0.007*congreve + 0.007*groningen + 0.007*sint + 0.007*hendrik + 0.007*frans + 0.006*lange + 0.006*roughriders + 0.006*rasmus + 0.005*wilhelmina + 0.005*jørgensen + 0.005*roskilde + 0.005*witton + 0.005*eskimos + 0.005*stampeders + 0.005*vries + 0.005*arnhem + 0.005*nijmegen + 0.005*delft + 0.004*johan + 0.004*niels + 0.004*johannes + …
LDA space: a simplex (in this example, 3 topics).
Jensen-Shannon Distance = √(Jensen-Shannon Divergence) (gives values between 0 and 1)
0.21: a threshold that defines what is considered similar enough (found experimentally).
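A minimal sketch of the distance computation, using only the standard library. It assumes base-2 logarithms, which is what keeps the result between 0 and 1. The three example documents are made up, and 0.21 is the experimentally found threshold quoted on the slide, which will differ for other corpora.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence in bits; terms where p[i] == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_distance(p, q):
    """Square root of the Jensen-Shannon divergence between two topic distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # the midpoint distribution
    divergence = 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
    return math.sqrt(max(divergence, 0.0))  # guard against tiny negative rounding


SIMILARITY_THRESHOLD = 0.21  # value from the slide, found experimentally

# Hypothetical topic distributions over 3 topics.
doc_a = [0.50, 0.30, 0.20]
doc_b = [0.45, 0.35, 0.20]
doc_c = [0.00, 0.10, 0.90]

for name, other in (("doc_b", doc_b), ("doc_c", doc_c)):
    distance = js_distance(doc_a, other)
    verdict = "similar enough" if distance <= SIMILARITY_THRESHOLD else "not similar"
    print(f"doc_a vs {name}: {distance:.3f} ({verdict})")
```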
Magazine level: high number of words; noise (ads, editorial stuff, etc.).
Does the model capture the right aspects of a magazine? What is the distance threshold under which magazines are perceived as similar?
“All models are wrong, but some are useful.” George E. P. Box
[Figure: neighbours ordered from more similar to less similar.]
Do the neighbours look similar? Where is the distance threshold?
Take this piece of text:
1. Preprocess it. Show me what was removed and what stayed.
2. Get the LDA topic distribution. Show me the topic distribution.
3. Calculate similarity between this page and the rest of the pages. Show me the nearest neighbours, sorted by the distance metric.
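Step 3 can be sketched as a simple ranking. The page names and their topic distributions below are hypothetical stand-ins for the output of step 2, and the distance is the Jensen-Shannon distance with base-2 logarithms.

```python
import math


def js_distance(p, q):
    """Jensen-Shannon distance with base-2 logs (0 = identical, 1 = no overlap)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return math.sqrt(max(0.5 * kl(p, m) + 0.5 * kl(q, m), 0.0))


def nearest_neighbours(query, pages, threshold=None):
    """Return (page_id, distance) pairs sorted from most to least similar."""
    ranked = sorted(
        ((page_id, js_distance(query, dist)) for page_id, dist in pages.items()),
        key=lambda pair: pair[1],
    )
    if threshold is not None:
        ranked = [pair for pair in ranked if pair[1] <= threshold]
    return ranked


# Hypothetical topic distributions for a handful of pages (3 topics each).
pages = {
    "cooking-1": [0.80, 0.15, 0.05],
    "cooking-2": [0.70, 0.20, 0.10],
    "wine-1": [0.10, 0.85, 0.05],
    "travel-1": [0.05, 0.05, 0.90],
}
query = [0.75, 0.20, 0.05]

for page_id, distance in nearest_neighbours(query, pages):
    print(f"{page_id}\t{distance:.3f}")
```

Showing the sorted list to a person, rather than a single score, is what makes it possible to see where the neighbours stop looking similar and to place the threshold there.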
preprocess the data
The text corpus depends on the application domain.
It should be contextualised, since the window of context determines which words are considered related.
The only observable features for the model are words. Experiment with various stoplists to make sure only the right ones get in.
The training corpus can be different from the documents the model will be scored on.
A good all-purpose corpus is Wikipedia.
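A minimal preprocessing sketch, reporting both what stayed and what was removed so that a stoplist can be inspected and adjusted. The stoplist and the minimum token length here are placeholders chosen for illustration. A real stoplist is tuned per domain.

```python
import re

# Hypothetical, deliberately tiny stoplist; a real one is tuned per domain.
STOPLIST = {"a", "an", "and", "the", "of", "in", "into", "with", "then", "some", "until"}


def preprocess(text, stoplist=STOPLIST, min_length=3):
    """Split text into lowercase word tokens and report which were kept or removed."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    kept, removed = [], []
    for token in tokens:
        if token in stoplist or len(token) < min_length:
            removed.append(token)
        else:
            kept.append(token)
    return kept, removed


kept, removed = preprocess(
    "Beat a little icing sugar into soft cheese, then peel and slice some kiwi fruit."
)
print("kept:   ", kept)
print("removed:", removed)
```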
train the model
The key parameter is the number of topics. Again, it depends on the domain.
The other parameters are alpha and beta. You can leave them aside to begin with and tune them later.
A good place to start is gensim, a free Python library.
score it on a new document
The goal of the model is not to label documents, but rather to give each one a unique fingerprint so that they can be compared to each other in a humanlike fashion.
evaluate the performance
Evaluation depends on the application.
Use Jensen-Shannon Distance as the similarity metric.
Evaluation should show whether the model captures the right aspects compared to a human. It will also show what distance threshold is still perceived as similar enough.
Use perplexity to check whether your model is representative of the documents you're scoring it on.
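To make the perplexity idea concrete, here is a simplified sketch using only the standard library. It treats the topic distribution of the document and the word distributions of the topics as known, which a real implementation does not: libraries such as gensim estimate a bound instead (gensim's `LdaModel.log_perplexity` returns a per-word bound). The two toy topics and the probability floor are assumptions made for illustration.

```python
import math


def perplexity(document, topic_distribution, topics, floor=1e-12):
    """exp of the negative mean log-probability per word; lower means a better fit."""
    log_prob = 0.0
    for word in document:
        # p(word) = sum over topics of p(topic | document) * p(word | topic)
        p_word = sum(
            weight * topics[k].get(word, 0.0)
            for k, weight in enumerate(topic_distribution)
        )
        log_prob += math.log(max(p_word, floor))  # floor avoids log(0) for unknown words
    return math.exp(-log_prob / len(document))


# Hypothetical toy topics: each maps words to probabilities.
topics = [
    {"wine": 0.5, "grape": 0.5},
    {"chef": 0.5, "sauce": 0.5},
]

in_domain = perplexity(["wine", "grape", "wine"], [0.9, 0.1], topics)
out_of_domain = perplexity(["tax", "audit"], [0.5, 0.5], topics)
print(f"in-domain: {in_domain:.2f}, out-of-domain: {out_of_domain:.2e}")
```

A document made of words the model has never seen gets a very high perplexity, which is the signal that the model is not representative of it.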
thank you
Andrius Knispelis