Information Retrieval to Knowledge Retrieval, one more step
Transcript of Information Retrieval to Knowledge Retrieval, one more step
Information Retrieval to Knowledge Retrieval, one more step
Xiaozhong Liu, Assistant Professor
School of Library and Information Science, Indiana University Bloomington
What is Information?
What is Retrieval?
What is Information Retrieval?
I am a Retriever
How to find this book in the library?
Search for something based on the user's information need!!
How to express your information need?
Query
User Information Need!!
Query
What is a good query? What is a bad query?
Good query: query ≈ information need
Bad query: query ≠ information need
Wait!!! Users NEVER make mistakes!!! It's OUR job!!!
Task 1: Given a user's information need, how do we help the user (or automatically help them) propose a better query?
If there is a query… Perfect query: Query_optimize
User input query: Query_user
User Information Need!!
Query → Results
Given a query, how do we retrieve results?
What are good results? What are bad results?
Task 2: Given a (not perfect) query, how do we retrieve documents from the collection?
Very large, unstructured text data!!!
F(query, doc)
Can you give me an example?
F(query, doc)
If the query term exists in the doc → yes, this is a result
If the query term does NOT exist in the doc → no, this is not a result
Is there any problem with this function? Brainstorm…
Query: Obama’s wife
Doc 1. My wife supports Obama’s new policy on…
Doc 2. Michelle, as the first lady of the United States…
Yes, this is a very challenging task!
Another problem: collection size is 5 billion; matching docs: 5
My algorithm successfully finds all 5 docs! In… 3 billion results…
User Information Need!!
Query Results
How to help the user find their results among all the retrieved results?
Task 3: Given the retrieved results, how do we help users find what they need?
If the retrieval algorithm retrieves 1 billion results from the collection, what will you do???
Search with Google, click "next"???
Yes, we can help users find what they need!
Query: Indiana University Bloomington
Can you read them one by one?
Would you use it??
User Information Need!!
Query → Results
[Diagram: tasks 1, 2, 3 connect the User and the System]
User Information Need!!
Query → Results
[Diagram: tasks 1, 2, 3 connect the User and the System]
They are not independent!
Information Retrieval
Text
Image
Music
Map
……
Information Retrieval
Text
Image
Music
Map
……
document: web, scholar, blog, news
Index
Documents vs. Database Records
• Relational database records are typically made up of well-defined fields:
Select * from students where GPA > 2.5
• Text, similar way? Find all the docs including "Xiaozhong":
Select * from documents where text like '%xiaozhong%'
We need a more effective way to index the text!
Collection C: doc1, doc2, doc3 ……… docN
Query q: q1, q2, q3 ……… qt, where qx is a query term
Document doci: di1, di2, di3 ……… dim, all dij ∈ V
Vocabulary V: w1, w2, w3 ……… wn
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
Doc1 1 0 0 1
Doc2 0 0 0 1
Doc3 1 1 1 1
DocN 1 0 1 1
………
Query q: 0, 1, 0 ………
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
Doc1 3 0 0 9
Doc2 0 0 0 7
Doc3 2 11 21 1
DocN 7 0 1 2
………
Query q: 0, 3, 0 ………
Normalization is very important!
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
Doc1 0.41 0 0 0.62
Doc2 0 0 0 0.12
Doc3 0.42 0.11 0.34 0.13
DocN 0.01 0 0.19 0.24
………
Query q: 0, 0.37, 0 ………
Normalization is very important!
Weight
Term weighting
TF * IDF
Term frequency: freq(w, doc) / |doc|, or…
Inverse document frequency: 1 + log(N/k)
N: total number of docs in the collection
k: number of docs containing word w
An effective way to weight each word in a document
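The weighting above can be sketched in a few lines of Python (the function names and the toy collection are mine, not from the slides):

```python
import math

def tf(word, doc):
    # term frequency, normalized by document length: freq(w, doc) / |doc|
    return doc.count(word) / len(doc)

def idf(word, docs):
    # inverse document frequency: 1 + log(N / k)
    k = sum(1 for d in docs if word in d)   # number of docs containing the word
    return 1 + math.log(len(docs) / k)

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# toy collection: each document is a list of tokens
docs = [["cat", "dog", "cat"], ["cat", "dog"], ["snake"]]
print(round(tf_idf("cat", docs[0], docs), 3))  # → 0.937
```

A rare word like "snake" gets a larger IDF than the common "cat", so it contributes more weight when it does occur.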
Index
Space?
Speed?
Retrieval Model?
Ranking?
Semantics?
The document representation must meet the requirements of the retrieval system.
Stemming
Education, Educational, Educate, Educating, Educations → Educat
Very effective for improving system performance.
Some risk! E.g., LA Lakers → LA Lake?
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
Inverted index
Tokens: i, love, my, cat, this, is, lovely, yellow, and, white
Stemmed terms: i, love, cat, thi, yellow, white
i – 1
love – 1, 2
thi – 2
cat – 1, 2, 3
yellow – 3
white – 3
We lose something?
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
Inverted index
i – 1
love – 1, 2
thi – 2
cat – 1, 2, 3
yellow – 3
white – 3
With term frequencies (doc:freq):
i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1
We still lose something?
Doc 1: I love my cat.
Doc 2: This cat is lovely!
Doc 3: Yellow cat and white cat.
Inverted index
i – 1:1
love – 1:1, 2:1
thi – 2:1
cat – 1:1, 2:1, 3:2
yellow – 3:1
white – 3:1
With term positions (doc:position):
i – 1:1
love – 1:2, 2:4
thi – 2:1
cat – 1:4, 2:2, 3:2, 3:5
yellow – 3:1
white – 3:4
Why do you need position info?
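A minimal sketch of building this positional inverted index (the tiny stemmer/stop-word table below is a stand-in for a real stemmer such as Porter's):

```python
from collections import defaultdict

# toy normalizer: maps a token to its stemmed form; None marks a stop word
STEM = {"lovely": "love", "this": "thi", "my": None, "is": None, "and": None}

def tokenize(text):
    return text.lower().replace(".", "").replace("!", "").split()

def build_positional_index(docs):
    # term -> list of (doc_id, position) postings
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, tok in enumerate(tokenize(text), start=1):
            term = STEM.get(tok, tok)
            if term is not None:        # skip stop words, but keep counting positions
                index[term].append((doc_id, pos))
    return index

docs = {1: "I love my cat.", 2: "This cat is lovely!", 3: "Yellow cat and white cat."}
idx = build_positional_index(docs)
print(idx["cat"])  # → [(1, 4), (2, 2), (3, 2), (3, 5)]
```

The postings match the positional index on the slide; storing positions is what makes proximity and phrase queries possible.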
Doc 1: information retrieval is important for digital library.
Doc 2: I need some information about the dogs, my favorite is golden retriever.
Proximity of query terms. Query: information retrieval
Doc 1: information retrieval is important for digital library.
Doc 2: I need some information about the dogs, my favorite is golden retriever.
Index – bag of words. Query: information retrieval
What’s the limitation of bag-of-words? Can we make it better?
n-gram:
Doc 1: information retrieval, retrieval is, is important, important for ……
bi-gram
Better semantic representation!
What's the limitation?
Doc 1: …… big apple ……
Doc 2: …… apple……
Index – bag of “phrase”?
More precision, less ambiguous
How to identify phrases from documents?
• Identify syntactic phrases using POS tagging
• n-grams
• From existing resources
Noise detection
What is the noise on a web page? Non-informative content…
Web Crawler - freshness
The Web is changing, but we cannot constantly check all the pages…
We need to find the most important pages and their change frequency.
www.nba.com
www.iub.edu
www.restaurant????.com
Sitemap: a list of URLs for each host, with modification times and frequencies
Retrieval
Model
Mathematical modeling is frequently used to understand, explain, reason about, and predict behavior or phenomena in the real world (Hiemstra, 2001).
e.g., some models help you predict tomorrow's stock price…
Hypothesis:
Retrieval and ranking problem = Similarity Problem!
Vector Space Model
Is that a good hypothesis? Why?
Retrieval Function: Similarity (query, Document)
Returns a score!!! We can rank the documents!!!
So, a query is a short document
Vector Space Model
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
Doc1 0.41 0 0 0.62
Doc2 0 0 0 0.12
Doc3 0.42 0.11 0.34 0.13
DocN 0.01 0 0.19 0.24
………
Query q: 0, 0.37, 0 ………
Collection C: doc1, doc2, doc3 ……… docN
V: w1, w2, w3 ……… wn
Doc1 0.41 0 0 0.62
Doc2 0 0 0 0.12
Doc3 0.42 0.11 0.34 0.13
DocN 0.01 0 0.19 0.24
………
Query q: 0, 0.37, 0 ………
Similarity
Doc Vector
Query Vector
Doc1: ……cat……dog……cat……
Doc2: ……cat……dog
Doc3: ……snake……
Query: dog cat cat
[2-D plot: the docs as points in (cat, dog) term space]
Doc1: ……cat……dog……cat……
Doc2: ……cat……dog
Doc3: ……snake……
Query: dog cat
F (q, doc) = cosine similarity (q, doc)
[2-D plot in (cat, dog) space: doc 2 points in the same direction as the query; θ is the angle between the query vector and a doc vector]
Why Cosine?
Vector Space Model
Dimension = n = vocabulary size
Query q: q1, q2, q3 ……… qn (same dimensional space!!!)
Document doci: di1, di2, di3 ……… din, all dij ∈ V
Vocabulary V: w1, w2, w3 ……… wn
Doc1: ……cat……dog……cat……
Doc2: ……cat……dog
Doc3: ……snake……
Query: dog cat
Try!
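A worked version of the example above, using raw term counts as the vectors (a sketch; a real system would use TF-IDF weights):

```python
import math

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the vector norms
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# dimensions: (cat, dog), raw counts from the toy docs
doc1, doc2, doc3 = (2, 1), (1, 1), (0, 0)   # doc3 contains neither term
query = (1, 1)                              # "dog cat"

for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, round(cosine(query, d), 3))
```

doc2 scores highest (it points in exactly the query's direction), doc1 next, doc3 zero, so cosine gives a ranking, not just a yes/no match.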
Term weighting
Doc [ 0.42 0.11 0.34 0.13 ]
weight, how?
TF * IDF
Term frequency: freq(w, doc) / |doc|, or…
Inverse document frequency: 1 + log(N/k)
N: total number of docs in the collection
k: number of docs containing word w
More TF
Weighting is very important for the retrieval model! We can improve TF by…
e.g., freq(term, doc) → log[freq(term, doc)]
BM25:
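The BM25 formula itself is not reproduced in this transcript; as a sketch, here is the standard Okapi BM25 scoring function (the defaults k1 = 1.2 and b = 0.75 are conventional choices, not from the slides):

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    # standard Okapi BM25: saturating TF, IDF, and document-length normalization
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    score = 0.0
    for w in query:
        k = sum(1 for d in docs if w in d)            # docs containing w
        idf = math.log((N - k + 0.5) / (k + 0.5) + 1)
        f = doc.count(w)                              # raw term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["cat", "dog", "cat"], ["cat", "dog"], ["snake"]]
print(round(bm25_score(["dog", "cat"], docs[0], docs), 3))  # → 0.957
```

Unlike raw TF, the k1 term makes the contribution of repeated words saturate, and b controls how strongly long documents are penalized.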
Vector Space Model
But…
Bag-of-words assumption = words are independent!
Query = document, maybe not true!
Vectors and SEO (Search Engine Optimization)…
Synonyms? Semantically related words?
How about these…
Pivoted Normalization Method
Dirichlet Prior Method
TF, IDF, normalization
+ parameters
Language model
Probability distribution over words
P (I love you) = 0.01P (you love I) = 0.00001P (love you I) = 0.0000001
If we have this information… we could build a generative model!
P(text | θ)
Language model - unigram
Generate text with bag-of-word assumption (word independent):
P (w1, w2,…wn) = P(w1) P(w2)…P(wn)
food orange desk USB computer Apple Unix …. …. …. milk sport superbowl
topic X = ???
food orange desk USB computer Apple Unix …. …. milk yogurt iPad NBA sport superbowl NHL score information unix USB
topic 1, topic 2
Doc: I’m using Mac computer… remote access another computer… share some USB device…
P(Doc | topic1) vs. P(Doc | topic2)
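This comparison can be sketched with two toy unigram topic models (the word probabilities below are invented for illustration, not taken from the slides):

```python
import math

# hypothetical word distributions for two topics (toy numbers)
topic_food = {"milk": 0.3, "orange": 0.3, "yogurt": 0.2, "computer": 0.1, "usb": 0.1}
topic_tech = {"computer": 0.4, "usb": 0.3, "unix": 0.2, "milk": 0.05, "orange": 0.05}

def log_likelihood(words, topic):
    # unigram (bag-of-words) model: log P(w1..wn | topic) = sum of log P(wi | topic)
    return sum(math.log(topic[w]) for w in words)

doc = ["computer", "computer", "usb"]
print(log_likelihood(doc, topic_tech) > log_likelihood(doc, topic_food))  # → True
```

The doc about computers and USB devices is far more likely under the tech topic, which is exactly the P(Doc | topic1) vs. P(Doc | topic2) comparison above.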
king ghost hamlet play …. …. romeo juliet iPad iphone 4s tv apple …… play store
food orange desk USB computer Apple Unix …. …. …. milk sport superbowl
topic X
How to estimate???
If we have enough data, e.g. docs about topic X:
10/10000 1000/10000 30/10000
P(“computer” | topic X)
food orange desk USB computer Apple Unix …. …. milk yogurt iPad NBA sport superbowl NHL score information unix USB
doc 1, doc 2
query: sport game watch
P(query | doc 1) vs. P(query | doc 2)
a document doc:
query likelihood → query term likelihood
Retrieval problem → query likelihood → term likelihood P(qi | doc)
But a document is only a small sample of a topic… Data like:
Smoothing!
P(qi | doc) What if qi is not observed in doc? P(qi | doc) = 0?
We want to give this a non-zero score!!!
Smoothing
i.e.
We can make it better!
Smoothing
First, it addresses the data sparseness problem. As a document is only a very small sample, the probability P (qi | Doc) could be zero for those unseen words (Zhai & Lafferty, 2004).
Second, smoothing helps to model the background (non-discriminative) words in the query.
Improve language model estimation by using Smoothing
Smoothing
Another smoothing method:
P(w | θ) =
P(w | doc), if the word exists in doc
P(w | collection), if the word does not exist in doc (the Collection Language Model)
Linear interpolation:
P(w | θ) = (1 − λ)·P(w | θdoc) + λ·P(w | θcollection)
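A minimal sketch of this interpolation smoothing (the toy collection and λ = 0.5 are my choices):

```python
def p_smoothed(w, doc, collection, lam=0.5):
    # linear interpolation: (1 - λ)·P(w | doc) + λ·P(w | collection)
    p_doc = doc.count(w) / len(doc)
    p_coll = sum(d.count(w) for d in collection) / sum(len(d) for d in collection)
    return (1 - lam) * p_doc + lam * p_coll

collection = [["cat", "dog", "cat"], ["cat", "dog"], ["snake"]]
doc = collection[2]                        # the doc contains only "snake"
print(p_smoothed("cat", doc, collection))  # → 0.25, non-zero although "cat" is absent
```

Without smoothing, P("cat" | doc) would be 0 and the whole query likelihood would collapse to zero; the collection model keeps it non-zero.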
Smoothing
We could use the collection language model:
TF-IDF is closely related to the language model and other retrieval models:
Term frequency
IDF
Document length normalization
Language model
Solid statistical foundation
Flexible parameter setting
Different smoothing methods
Language model in library?
If we have a paper… and a query…
Similarity(paper, query): Vector Space Model
If a query word is not in the paper…
Score = 0
If we use language model…
Language model in library?
The likelihood of a query given a paper can be estimated by:
P(query | θ) = α·P(query | paper) + β·P(query | author) + γ·P(query | journal) + ……
Likelihood of query given a paper & author & journal & ……
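A sketch of this interpolated likelihood (the weights α = 0.6, β = 0.3, γ = 0.1 and the probabilities are invented toy values):

```python
def query_likelihood(p_paper, p_author, p_journal, a=0.6, b=0.3, c=0.1):
    # α·P(query|paper) + β·P(query|author) + γ·P(query|journal), weights sum to 1
    return a * p_paper + b * p_author + c * p_journal

# the query word is absent from the paper text, but the author often writes about it
print(query_likelihood(0.0, 0.02, 0.01) > 0)  # → True
```

Even when P(query | paper) is zero, the author and journal models keep the overall likelihood non-zero, the same effect smoothing gives within a single document.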
e.g. what’s the difference between web and doc retrieval???
F (doc, query)
F (web page, query)
vs
web page = doc + hyperlinks + domain info + anchor text + metadata + …
Can you use those to improve system performance???
Knowledge
Score each topic, level of interest
Topic 1
Topic 2
![Page 69: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/69.jpg)
CI-n … CI-2 CI-1 CI-now
TopicScore(Zn): compare today's interest in topic Zn with its history:
Hot topic: P(Daytoday | Zn) > mean[P(Dayin | Zn)] + a·STD[P(Dayin | Zn)]
Diminishing topic: P(Daytoday | Zn) < mean[P(Dayin | Zn)] − b·STD[P(Dayin | Zn)]
Regular topic: otherwise
(the score is normalized by mean[P(Dayin | Zn)])
Current Interest vs. Historical Interest
![Page 70: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/70.jpg)
“Obama”, Nov 5th 2008 After Election
[Plot: daily interest curve for "Obama" over the 30 days around Nov 5th, 2008]
Nov 5th CIV:
Wiki:Barack_Obama; Wiki:Election; win; success; Wiki:President_of_the_United_States
Wiki:African_American; President; World; America; victory; record; first; president; 44th; History; Wiki:Victory_Records
Entity:first_black_president
Entity:first_black_president; Celebrate; black; african
Wiki:Colin_Powell; Wiki:Secretary_of_State; Wiki:United_States
Wiki:Sarah_Palin; sarah; palin; hillary; Secret; Wiki:Hillary_Rodham_Clinton
Clinton; newsweek; club; cloth
1. Win
2. Create history
3. First black president
![Page 71: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/71.jpg)
| Google web | NDCG3 | NDCG5 | NDCG10 | t-test |
|---|---|---|---|---|
| CIV | 0.35909366 | 0.399970894 | 0.479302401 | |
| CILM | 0.356652652 | 0.387120299 | 0.483420045 | |
| Google | 0.230423817 | 0.318737414 | 0.388792379 | ** |
| TFIDF | 0.27596245 | 0.333012091 | 0.437831859 | * |
| BM25 | 0.284599431 | 0.336961764 | 0.436466778 | * |
| LM (linear) | 0.32558799 | 0.382113457 | 0.473992963 | |
| LM (dirichlet) | 0.34665084 | 0.358128576 | 0.45150825 | |
| LM (twostage) | 0.349735965 | 0.358725227 | 0.450046444 | |
| BEST1 | CIV | CIV | CILM | |
| BEST2 | CILM | CILM | CIV | |

Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15
| Yahoo_web | NDCG3 | NDCG5 | NDCG10 | t-test |
|---|---|---|---|---|
| CIV | 0.351765133 | 0.38207777 | 0.475506721 | |
| CILM | 0.391807685 | 0.40623334 | 0.482464858 | |
| Yahoo | 0.288059321 | 0.326373542 | 0.410969176 | |
| TFIDF | 0.24320988 | 0.282799657 | 0.404092457 | *** |
| BM25 | 0.245263974 | 0.277579262 | 0.395953269 | *** |
| LM (linear) | 0.276208943 | 0.316889107 | 0.432428784 | * |
| LM (dirichlet) | 0.223253393 | 0.270017519 | 0.385936078 | *** |
| LM (twostage) | 0.219225991 | 0.266537146 | 0.384349848 | *** |
| BEST1 | CILM | CILM | CILM | |
| BEST2 | CIV | CIV | CIV | |

Significance test: *** t < 0.05, ** t < 0.10, * t < 0.15
![Page 72: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/72.jpg)
Knowledge Retrieval System
Knowledge-based Information Need
Knowledge within Scientific Literature
Matching
Query Knowledge Representation
How to help users propose knowledge-based queries?
How to represent knowledge?
How to match between the two?
![Page 73: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/73.jpg)
Academic Knowledge
![Page 74: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/74.jpg)
Query Recommendation & Feedback
Query Recommendation
Query Feedback
Structural Keyword Generation – Features

| Category | Feature | Description or Example |
|---|---|---|
| Keyword Content (text of the keyword, stemmed, case insensitive, stop words removed) | Content_Of_Keyword | a vector of all the tokens in the keyword |
| | CAP | whether the keyword is capitalized |
| | Contain_Digit | whether the keyword contains digits, e.g., TREC2002, value = true |
| | Character_Length_Of_Keyword | number of characters in the target keyword |
| | Token_Length_Of_Keyword | number of tokens in the keyword |
| | Category_Length_Of_Keyword | number of tokens in the keyword; lengths above four are capped at four |
| Title Context | Exist_In_Title | whether the keyword exists in the title (stemmed, case insensitive, stop words removed) |
| | Location_In_Title | the position where the keyword appears in the title |
| | Title_Text_POS | unigram and its part of speech in the title (in a text window) |
| | Title_Unigram | unigram of keyword in title (in a text window) |
| | Title_Bigram | bigram of keyword in title (in a text window) |
| Abstract Context | Location_In_Abstract | which sentence of the abstract the keyword appears in |
| | Keyword_Position_In_Sentence_Of_Abstract | the keyword's position in the sentence (beginning, middle, or end) |
| | Abstract_Freq | how many times the keyword appears in the abstract |
| | Abstract_Text_POS | unigram and its part of speech in the abstract (in a text window) |
| | Abstract_Unigram | unigram of keyword in abstract (in a text window) |
| | Abstract_Bigram | bigram of keyword in abstract (in a text window) |
Evaluation – Domain Knowledge Generation
| Features | Concept | Supervised | Semi-supervised |
|---|---|---|---|
| Keyword-based | Research Question | 0.637 | 0.662 |
| | Methodology | 0.479 | 0.516 |
| | Dataset | 0.824 | 0.816 |
| | Evaluation | 0.571 | 0.571 |
| Keyword + Title-based | Research Question | 0.633 | 0.667 |
| | Methodology | 0.498 | 0.534 |
| | Dataset | 0.824 | 0.816 |
| | Evaluation | 0.571 | 0.571 |
| Keyword + Title + Abstract-based | Research Question | 0.642 | 0.663 |
| | Methodology | 0.420 | 0.542 |
| | Dataset | 0.831 | 0.823 |
| | Evaluation | 0.621 | 0.662 |

F1 comparison for supervised vs. semi-supervised learning
GOOD! but not PERFECT…
Knowledge comes from…
• System? Machine learning, but… modest performance…
• User? No way! Very high cost! Authors won't contribute…
• System + User? Possible!
WikiBackyard
ScholarWiki
Edit trigger: 1. the wiki page improves; 2. the machine learning model improves; 3. all other wiki pages improve; 4. the KR index improves!
User + machine learning is powerful… YES! It helps!!!
• Knowledge retrieval for scholarly publications…
• Knowledge from the paper
• Knowledge from the user
– Knowledge feedback
– Knowledge recommendation
• Knowledge from User vs. from Machine learning
• ScholarWiki (user) + WikiBackyard (machine)
Knowledge via Social Network and Text Mining
CITATION? CO-OCCUR? CO-AUTHOR?
Content of each node? Motivation of each citation?
With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.
Full text citation analysis
Every word @ the citation context will VOTE!! Motivation? Topic? Reason??? Left and right N words?? N = ??????????
![Page 86: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/86.jpg)
With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.
A word's influence decays with its distance from the citation!!!
Closer words make a more significant contribution!!
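This distance-weighted voting can be sketched in a few lines. The window size `n` and the geometric decay factor are illustrative assumptions, not the talk's actual parameters:

```python
# Distance-weighted voting: words near the citation marker contribute
# more to the citation's topic profile than distant ones.
from collections import defaultdict

def citation_context_votes(tokens, cite_pos, n=10, decay=0.8):
    """Collect weighted word votes from the n words left/right of a citation.

    The weight decays geometrically with distance; `decay` and `n` are
    illustrative choices, not values fixed by the method.
    """
    votes = defaultdict(float)
    lo = max(0, cite_pos - n)
    hi = min(len(tokens), cite_pos + n + 1)
    for i in range(lo, hi):
        if i == cite_pos:
            continue
        dist = abs(i - cite_pos)
        votes[tokens[i]] += decay ** (dist - 1)
    return dict(votes)

tokens = ("full text analysis has compensated for the "
          "weaknesses of citation counts").split()
# Treat index 8 ("of") as a stand-in citation-marker position.
votes = citation_context_votes(tokens, cite_pos=8, n=3)
```

Adjacent words ("weaknesses", "citation") get weight 1; words two and three positions away get 0.8 and 0.64.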
![Page 87: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/87.jpg)
How about a language model? Each node and edge represented by a language model? High-dimensional space! Word difference?
![Page 88: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/88.jpg)
Topic modeling – each node is represented by a topic distribution (Prior Distribution); each edge is represented by a topic distribution (Transitioning Probability Distribution)
![Page 89: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/89.jpg)
Supervised topic modeling
1. Each topic has a label (YES! We can interpret each topic)
2. We DO know the total number of topics
Each paper is a mixture — a probability distribution over author-given keywords
Keywords
![Page 90: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/90.jpg)
Each paper: p_keyi(paper) = P(z_keyi | abstract, title)
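A naive bag-of-words scorer can stand in for the supervised topic model to illustrate the idea — score each labeled keyword topic against the text, then normalize. The topic word lists and function name are mine, not the talk's:

```python
from collections import Counter

def keyword_distribution(text, topic_words):
    """P(z_key | text): score each labeled (keyword) topic by its word
    overlap with the text, then normalize into a distribution.
    A toy stand-in for a supervised topic model."""
    tokens = Counter(text.lower().split())
    scores = {}
    for key, words in topic_words.items():
        scores[key] = sum(tokens[w] for w in words) + 1e-9  # smoothing
    total = sum(scores.values())
    return {key: s / total for key, s in scores.items()}

# Hypothetical author-given keywords with a few indicative words each.
topic_words = {
    "citation analysis": ["citation", "citations", "influence"],
    "full-text mining": ["full-text", "text", "analysis"],
}
dist = keyword_distribution(
    "full-text analysis compensates for weaknesses of citation counts",
    topic_words,
)
```

Because the topics are the author-given keywords, every component of `dist` is directly interpretable.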
With further study of citation analysis, increasing numbers of researchers have come to doubt the reasonableness of assuming that the raw number of citations reflects an article’s influence (MacRoberts & MacRoberts 1996). On the other hand, full-text analysis has to some extent compensated for the weaknesses of citation counts and has offered new opportunities for citation analysis.
![Page 91: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/91.jpg)
Paper importance
A worked example with 3 topics (keywords): key1, key2, key3.

Domain credit: 100, shared evenly — pub 1, pub 2, pub 3, pub 4 each start with 25.

For pub 1's text:
P(key1 | text) = 0.6, P(key2 | text) = 0.15, P(key3 | text) = 0.25
Key1–Pub1 credit: 25 * 0.6

For one citation's context:
P(key1 | citation) = 0.8, P(key2 | citation) = 0.1, P(key3 | citation) = 0.1
With two outgoing citations whose key1 weights are 0.8 and 0.2:
Key1–Citation1 credit: 25 * 0.6 * [0.8 / (0.8 + 0.2)]

Evenly share the credits? A citation is important if
1. the citation focuses on an important topic, and
2. the other citations focus on other topics.
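The credit arithmetic on this slide can be reproduced directly (a sketch; the function and variable names are mine):

```python
def topic_credit(pub_credit, p_topic_text, edge_weights):
    """Split a publication's per-topic credit across its citations.

    pub_credit   : credit the publication holds (e.g. 25)
    p_topic_text : P(topic | publication text), e.g. 0.6 for key1
    edge_weights : P(topic | citation context) for each outgoing citation
    Returns the credit assigned to each citation for this topic.
    """
    pub_topic_credit = pub_credit * p_topic_text
    total = sum(edge_weights)
    return [pub_topic_credit * w / total for w in edge_weights]

# Slide's numbers: domain credit 100 over 4 pubs -> 25 each;
# P(key1 | text) = 0.6; two citations with key1 weights 0.8 and 0.2.
credits = topic_credit(25, 0.6, [0.8, 0.2])
```

The first citation receives 25 * 0.6 * 0.8 = 12 units of key1 credit, the second only 3 — the citation aligned with the important topic dominates.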
![Page 92: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/92.jpg)
Paper importance
Same example with 3 keywords: key1, key2, key3.

Domain credit: 100 — pub 1, pub 2, pub 3, pub 4 each start with 25, i.e. each holds the credit vector [25, 25, 25].

Key1–Pub1 credit: 25 * 0.6
Key1–Citation1 credit: 25 * 0.6 * [0.8 / (0.8 + 0.2)]

Iterative propagation updates each publication's per-keyword credit vector, e.g. [25, 25, 25] → [29, 26, 28] or [27, 27, 26].

This yields:
Domain publication ranking
Domain keyword topical ranking
Topical citation tree

The citation count between a paper pair is IMPORTANT!
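A toy sketch of how per-keyword credit vectors like [25, 25, 25] → [29, 26, 28] could arise from iterative, topic-weighted propagation. The two-node graph, damping factor, and function names are invented for illustration:

```python
def propagate(credit, links, damping=0.85, iters=20):
    """Topical PageRank-style propagation on a tiny citation graph.

    credit : {node: [credit per topic]}
    links  : {node: {cited_node: [P(topic | citation context), ...]}}
    Each iteration redistributes a damped share of every node's
    per-topic credit along its citations, weighted per topic.
    """
    nodes = list(credit)
    k = len(next(iter(credit.values())))
    total = [sum(credit[n][t] for n in nodes) for t in range(k)]
    for _ in range(iters):
        new = {n: [(1 - damping) * total[t] / len(nodes) for t in range(k)]
               for n in nodes}
        for n in nodes:
            out = links.get(n, {})
            for t in range(k):
                norm = sum(w[t] for w in out.values())
                if norm == 0:
                    # Dangling node: spread its credit evenly.
                    for m in nodes:
                        new[m][t] += damping * credit[n][t] / len(nodes)
                else:
                    for m, w in out.items():
                        new[m][t] += damping * credit[n][t] * w[t] / norm
        credit = new
    return credit

credit = {"pub1": [25, 25, 25], "pub2": [25, 25, 25]}
links = {"pub1": {"pub2": [0.8, 0.5, 0.2]}}  # pub1 cites pub2
result = propagate(credit, links, iters=10)
```

Total credit per topic is conserved, while the cited publication accumulates more of it than the citing one.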
![Page 93: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/93.jpg)
Different citations make different contributions to the different topics (keywords) of the citing publication.
![Page 94: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/94.jpg)
Publication/venue/author topic prior
Citation transitioning topic prior
![Page 95: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/95.jpg)
[Chart: NDCG for review citation recommendation — nDCG@10 through nDCG@ALL; y-axis NDCG from 0 to 0.4]
![Page 96: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/96.jpg)
Literature Review Citation recommendation
Input: Paper Abstract
Output: A list of ranked citations
MAP and NDCG evaluation
![Page 97: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/97.jpg)
Given a paper abstract:
1. Word-level match (language model)
2. Topic-level match (KL divergence)
3. Topic importance
Use an Inference Network to integrate the three hypotheses
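A hedged sketch of the combination step — a simple log-linear mixture standing in for the actual inference network, with illustrative weights and my own function name:

```python
import math

def combine_evidence(word_lm, topic_kl, topic_importance,
                     weights=(0.5, 0.3, 0.2)):
    """Combine three relevance signals for a candidate citation.

    word_lm          : query-likelihood score from the language model (0..1]
    topic_kl         : KL divergence between topic distributions (lower = better)
    topic_importance : importance of the candidate's topics (0..1]
    A weighted log-linear mixture, in the spirit of how an inference
    network combines beliefs from several evidence nodes.
    """
    w1, w2, w3 = weights
    # Turn the KL divergence into a similarity in (0, 1].
    topic_match = math.exp(-topic_kl)
    return (w1 * math.log(word_lm)
            + w2 * math.log(topic_match)
            + w3 * math.log(topic_importance))

score = combine_evidence(word_lm=0.4, topic_kl=0.5, topic_importance=0.7)
```

A candidate that matches better at the word level and diverges less at the topic level scores higher, as expected.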
![Page 98: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/98.jpg)
Citation Recommendation
Content match
Publication topical prior:
1. PageRank
2. Full-text PageRank (greedy match)
3. Full-text PageRank (topic modeling)
Topic match
Combined via an Inference Network
![Page 99: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/99.jpg)
Input
Output:
1. [3] YES 3
2. [2] YES 2
3. [6] NO 0
4. [8] NO 0
5. [10] YES 1
6. [1] NO 0
……
MAP (cite or not?)

NDCG (important citation?)
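Both metrics can be computed for the example ranking above, whose relevance grades are [3, 2, 0, 0, 1, 0] (a sketch with my own helper names):

```python
import math

def average_precision(rels):
    """MAP component: binary relevance (cite or not)."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels, start=1):
        if r:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

def ndcg(grades):
    """NDCG: graded relevance (how important is the citation)."""
    dcg = sum(g / math.log2(i + 1) for i, g in enumerate(grades, start=1))
    ideal = sum(g / math.log2(i + 1)
                for i, g in enumerate(sorted(grades, reverse=True), start=1))
    return dcg / ideal if ideal else 0.0

grades = [3, 2, 0, 0, 1, 0]          # the slide's example output
ap = average_precision([g > 0 for g in grades])
nd = ndcg(grades)
```

MAP only asks whether each retrieved item should be cited at all; NDCG additionally rewards putting the most important citations first.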
![Page 100: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/100.jpg)
[Chart: NDCG for citation recommendation based on Abstract — nDCG@10 through nDCG@ALL; y-axis NDCG from 0.1 to 0.4]
Based on greedy match, 1 second
Based on topic inference, 30 seconds
![Page 101: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/101.jpg)
CONCLUSION
• Information Retrieval
• Index
• Retrieval Model
• Ranking
• User feedback
• Evaluation
• Knowledge Retrieval
• Machine Learning
• User Knowledge
• Integration
• Social Network Analysis
![Page 102: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/102.jpg)
![Page 103: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/103.jpg)
![Page 104: Information Retrieval to Knowledge Retrieval , one more step](https://reader035.fdocuments.us/reader035/viewer/2022062814/5681688d550346895ddf108a/html5/thumbnails/104.jpg)
Thank you!