1 Artificial Intelligence techniques for Information Retrieval in Web Presented by Hamid R. Chinaei 1 October 2007

Page 1

Artificial Intelligence techniques for Information Retrieval in Web

Presented by

Hamid R. Chinaei

1 October 2007

Page 2

Outline

Information Retrieval

Document Content

User Behavior

Markov Chains

The Proposed Models

Conclusion

Page 3

IR Architecture

[Diagram: a Query String and the Document corpus go into the IR System, which outputs Ranked Documents: 1. Doc1, 2. Doc2, 3. Doc3, ...]

Page 4

Document Content

Document Content (set of words + their weights)

Page 5

User Behavior

Query submissions

Clicks on documents

Time spent reading the document

Query refinements

Page 6

System Modeling

[Diagram labels: System; Query 1, Query 2, ..., Query n; Ranked Documents; Document Description; Clicks + Time; Ranks; User; Update]

Page 7

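Markov Chains [5]

Generation: $U \rightarrow \theta_Q \rightarrow q$ and $S \rightarrow \theta_D \rightarrow d$

Collection and action: $C = \{d_1, \ldots, d_n\}$, $a = [d_i, \ldots, d_j]$

Parameters: $\theta = (\theta_Q, \theta_D, R)$

Loss: $L(a, \theta)$

Risk: $R(a \mid U, q, S, C) = \int L(a, \theta)\, p(\theta \mid U, q, S, C)\, d\theta$

Relevance: $R \sim p(R \mid \theta_Q, \theta_D)$

Loss values: $L(\theta_Q, \theta_D, R = 0) = c_0$, $L(\theta_Q, \theta_D, R = 1) = c_1$

Resulting risk: $R(a = d, q) = c_0\, p(R = 0 \mid q, d) + c_1\, p(R = 1 \mid q, d)$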

Page 8

Markov Chains cont’d

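[Diagram: chain over query nodes $q_0$, $q_1$ and document nodes $d_0$, $d_1$, with transition probabilities $p(d_0 \mid q_0)$, $p(q_1 \mid d_0)$ and $p(d_1 \mid q_1)$]

$p(d \mid q) \propto p(q \mid d)\, p(d)$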
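The relation between p(d|q), p(q|d) and p(d) on this slide can be turned into a ranking rule: score each document by p(q|d) times p(d) and sort. The following is a minimal Java sketch under that reading. The class name `BayesRanker`, its method names and the map-based inputs are illustrative assumptions and do not come from the presentation; the probabilities themselves must be supplied by the caller.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks documents by the unnormalized posterior p(q|d) * p(d).
class BayesRanker {

    // Unnormalized score of one document for the current query.
    static double score(double queryLikelihood, double prior) {
        return queryLikelihood * prior;
    }

    // likelihood maps a document id to p(q|d); prior maps it to p(d).
    // Documents without a prior are treated as having p(d) = 0 (an assumption).
    // Returns the document ids sorted by descending score.
    static List<String> rank(Map<String, Double> likelihood, Map<String, Double> prior) {
        List<String> docs = new ArrayList<>(likelihood.keySet());
        docs.sort(Comparator.comparingDouble(
                (String d) -> -score(likelihood.get(d), prior.getOrDefault(d, 0.0))));
        return docs;
    }
}
```

Because the normalizing constant p(q) is the same for every document, leaving it out does not change the order of the ranking.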

Page 9

Inference Networks

[Diagram: inference network with a Document Layer (D0, D1, ..., Dn), a Concept Layer (w0, w1, ..., wm) and a Query Layer (Q); further labels in the figure: C N, R]

Page 10

Example

Page 11

Example Cont’d

Page 12

Example Cont’d

Page 13

Example Cont’d

Page 14

POMDPs

Observation: user query, document clicked by the user, time spent on the document

Rewards: time spent on a document

States: the concept the user is looking for

Action: ranking the documents

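$V_t(b) = \max_{a \in A} \Big[ R(b, a) + \gamma \sum_{o \in O} P(o \mid b)\, V_{t-1}(b') \Big]$, where $b'$ is the belief that follows $b$ after action $a$ and observation $o$

$\pi_t(b) = \arg\max_{a} \Big[ R(b, a) + \gamma \sum_{o \in O} P(o \mid b)\, V_{t-1}(b') \Big]$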
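To make the POMDP ingredients on this slide concrete, here is a minimal Java sketch of the simplest case: a belief over concepts, a reward table, a one-step (horizon 1) choice of ranking action, and a Bayes update of the belief after an observation. Everything named here (`BeliefPolicy`, the array layout, the numbers in any example) is a hypothetical illustration, not the presenter's implementation. The sketch also assumes the concept the user is looking for does not change between steps, and it ignores the future-value term of the full recursion.

```java
// Hypothetical sketch: a belief is a probability distribution over concepts (states).
class BeliefPolicy {

    // Expected immediate reward R(b, a) = sum over s of b(s) * reward[s][a].
    static double expectedReward(double[] belief, double[][] reward, int action) {
        double sum = 0.0;
        for (int s = 0; s < belief.length; s++) {
            sum += belief[s] * reward[s][action];
        }
        return sum;
    }

    // Horizon-1 choice: the action (ranking) with the highest expected immediate reward.
    static int greedyAction(double[] belief, double[][] reward) {
        int best = 0;
        double bestValue = expectedReward(belief, reward, 0);
        for (int a = 1; a < reward[0].length; a++) {
            double value = expectedReward(belief, reward, a);
            if (value > bestValue) {
                best = a;
                bestValue = value;
            }
        }
        return best;
    }

    // Bayes update after one observation: b'(s) is proportional to P(o | s) * b(s).
    // Assumes the hidden concept stays the same between steps.
    static double[] update(double[] belief, double[] obsLikelihood) {
        double[] next = new double[belief.length];
        double norm = 0.0;
        for (int s = 0; s < belief.length; s++) {
            next[s] = obsLikelihood[s] * belief[s];
            norm += next[s];
        }
        if (norm == 0.0) {
            throw new IllegalArgumentException("observation has zero probability under the belief");
        }
        for (int s = 0; s < next.length; s++) {
            next[s] /= norm;
        }
        return next;
    }
}
```

In this reading, `reward[s][a]` would hold the time a user looking for concept `s` is expected to spend on the documents under ranking `a`, and `obsLikelihood[s]` the probability of the observed query or click given concept `s`.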

Page 15

POMDPs cont’d

[Diagram: POMDP unrolled over time. Node labels: Q0, Q1, Q2; q0, q1, q2; U0, U1, U2; T1, T2; action a; documents d1 ... dn with times T1 ... Tn; UP]

Page 16

Example of a System Belief

Page 17

Conclusion

With AI techniques, the users (not the search engine) eventually rank the documents, which can improve any ranking algorithm

This counteracts the effect search engines have on which web pages survive and which drop out [2,3]

Page 18

Experiment Setup

Data
– AOL User Session Collection [1]

Database
– MySQL, 277 MB data, 216 MB index

Length
– At the moment, experiments run on 1,500,000 clickthrough records (one tenth of the available clickthrough data)

Application in Java
– So far more than 500 lines of code, not counting comments and test cases

Page 19

Classes

URL

Query

User (for the purpose of user modeling)

Term

IR (run class)

Page 20

Class Diagrams

Page 21

Data Schema

aolLogTable

– AnonID: 1205043
– Query: “public records”
– QueryTime: 2006-04-06 03:19:42.0
– URLRank: 1
– URL: http://www.searchsystems.net

Page 22

Example: SearchSystems.net

Search result snippet:

SearchSystems.net - The Largest Public Records Directory
SearchSystems.net is the internet's largest directory of public records databases, Search for all these records public, property, Federal, State,Local,...
www.searchsystems.net/ - 39k - Similar pages

Page meta tags:

<meta name="description" content="SearchSystems.net is the internet's largest directory of public records databases,Search for all these records public, property, Federal, State, Local, national, vital, Tax, geneaology, court, social security, documents, judgments, probation, laws, civil, suit, court" />

<meta name="keywords" content="records, public, directory, Federal, State, Local, national, vital, Tax, genealogy, court, social security, documents, judgments, probation, laws, civil, suit, court, action, lien, USA, certificates, lawsuits, offenders, court, civil, information" />

Page 23

Example cont’d

Result set for SearchSystems.net

resultSet =
    SELECT a.AnonID AS AnonID, a.Query AS Query, a.QueryTime AS QueryTime,
           a.URLRank AS URLRank, a.URL AS URL
    FROM aolLogTable a
    WHERE a.URL = 'http://www.searchsystems.net';

Page 24

Sample Results

AnonID | Query | QueryTime | Rank | URL
10422043 | germany 1850 | 2006-05-07 13:00:28.0 | 54 | http://www.searchsystems.net
10432858 | tax liens in gretna | 2006-05-28 14:30:04.0 | 2 | http://www.searchsystems.net
10434732 | search public records | 2006-05-22 21:12:41.0 | 1 | http://www.searchsystems.net
10559651 | free unclaimed propert search | 2006-03-28 17:10:35.0 | 3 | http://www.searchsystems.net
10825800 | free criminal offense search | 2006-04-06 23:15:20.0 | 1 | http://www.searchsystems.net
10971516 | public records | 2006-05-09 23:01:09.0 | 1 | http://www.searchsystems.net
11199274 | mentor ohio criminal records | 2006-05-22 19:42:12.0 | 1 | http://www.searchsystems.net
11412322 | texas public records of birth | 2006-04-14 10:51:14.0 | 6 | http://www.searchsystems.net
11412322 | free inmate locator | 2006-04-23 17:39:21.0 | 17 | http://www.searchsystems.net
11655138 | public court records bakersfield | 2006-04-09 15:56:52.0 | 2 | http://www.searchsystems.net
11752893 | free online public records | 2006-05-26 20:32:45.0 | 1 | http://www.searchsystems.net

Page 25

Observation 1

Number of clicks for URLs increases exponentially

www.microsoft.com

www.searchsystems.com

Page 26

Getting Query Chains

resultSet =
    SELECT a.AnonID AS AnonID, a.Query AS Query, a.QueryTime AS QueryTime,
           a.URLRank AS URLRank, a.URL AS URL
    FROM aollogtable1 a
    WHERE a.AnonID = _AnonID
      AND a.QueryTime < _QueryTime
    ORDER BY a.QueryTime DESC;

This result set (the user's earlier queries, newest first) feeds the recursive calls that build query chains (see next slide)

Page 27

Getting Query Chains cont’d

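import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

// resultSet holds the user's earlier queries, newest first (see previous slide);
// timeThresh is the largest gap, in milliseconds, allowed between two queries of one chain.
void preURLsRecursive(Timestamp _QueryTime, List<Timestamp> chain) throws SQLException {
    if (resultSet.next()) {
        Timestamp QueryTimePrime = resultSet.getTimestamp("QueryTime");
        if (_QueryTime.getTime() - QueryTimePrime.getTime() < timeThresh) {
            chain.add(QueryTimePrime);               // close enough: same chain
            preURLsRecursive(QueryTimePrime, chain); // look one query further back
        }
    }
}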
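The recursion on this slide can be tried without a database. The sketch below is a self-contained Java version that works on a plain list of timestamps instead of a JDBC result set. The class `QueryChain`, the method names and the use of epoch milliseconds are assumptions made for illustration; only the idea (walk backwards through one user's queries and stop at the first gap that reaches the time threshold) is taken from the slide.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for the JDBC result set used on the slide.
class QueryChain {

    // times: one user's earlier query times in milliseconds, newest first
    // (the order produced by "order by QueryTime desc").
    // Returns the earlier query times that belong to the chain ending at queryTime.
    static List<Long> chainBefore(long queryTime, List<Long> times, long timeThreshMillis) {
        List<Long> chain = new ArrayList<>();
        collect(queryTime, times, 0, timeThreshMillis, chain);
        return chain;
    }

    private static void collect(long queryTime, List<Long> times, int index,
                                long timeThreshMillis, List<Long> chain) {
        if (index >= times.size()) {
            return; // no earlier query left
        }
        long previous = times.get(index);
        if (queryTime - previous < timeThreshMillis) {
            chain.add(previous);                                          // same chain
            collect(previous, times, index + 1, timeThreshMillis, chain); // go further back
        }
        // otherwise the gap is too large and the chain ends here
    }
}
```

Each gap is measured between consecutive queries, not against the final query, so a long session of closely spaced queries forms a single chain.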

Page 28

Sample of Results

AnonID | Query | QueryTime | Rank | URL
484518 | indiana state prison | 2006-03-06 13:24:22.0 | 1 | http://www.in.gov
484518 | morgan county indiana jail | 2006-03-06 13:27:38.0 | 1 | http://scican3.scican.net
484518 | indiana inmate locator | 2006-03-06 13:28:54.0 | 1 | http://www.in.gov
484518 | fugitives of indiana | 2006-03-06 13:37:51.0 | 1 | http://www.criminalwatch.com
484518 | indiana fugitives caught | 2006-03-06 13:39:12.0 | 0 |
484518 | west virgina public records wills | 2006-03-06 13:40:48.0 | 0 |
484518 | west virgina public records | 2006-03-06 13:41:11.0 | 0 |
484518 | west virginia public records | 2006-03-06 13:41:18.0 | 1 | http://www.searchsystems.net

User has not clicked any result here (the rows with rank 0 and no URL)

Page 29

Observation 2: Term Weights

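We used the data logs to obtain the weight of word w for URL d, R(w,d), where

– $q_i$ ($i = 1, \ldots, n$) are the queries in which word w occurs
– $q_j$ ($j = 1, \ldots, m$) are all queries for URL d
– Rank($q_i$, d) is the rank of URL d for query $q_i$

$R(q, w, d) = \dfrac{1 / \mathrm{Rank}(q, d)}{\sum_{j=1}^{m} \mathrm{Rank}(q_j, d)}$

$R(w, d) = \sum_{i=1}^{n} R(q_i, w, d)$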
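A minimal Java sketch of this weighting follows. It reads the slide's two formulas as: the contribution of one query is 1/Rank(q,d) divided by the sum of Rank(q_j,d) over all queries for URL d, and the weight of word w is the sum of the contributions of the queries that contain w. The class `TermWeights`, the `Click` type, whitespace tokenization, lower-casing and the treatment of every logged click as one query are assumptions made for illustration, not details given in the presentation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;

// Hypothetical helper that derives word weights for one URL from its click log.
class TermWeights {

    // One logged click on URL d: the query text and the rank at which d was shown.
    static final class Click {
        final String query;
        final int rank;

        Click(String query, int rank) {
            this.query = query;
            this.rank = rank;
        }
    }

    // Returns R(w, d) for every word w seen in the queries that led to URL d.
    static Map<String, Double> weights(List<Click> clicksForUrl) {
        // Denominator: sum of Rank(q_j, d) over all queries for d (ranks below 1 are skipped).
        double rankSum = 0.0;
        for (Click c : clicksForUrl) {
            if (c.rank >= 1) {
                rankSum += c.rank;
            }
        }
        Map<String, Double> weights = new HashMap<>();
        for (Click c : clicksForUrl) {
            if (c.rank < 1) {
                continue;
            }
            double contribution = (1.0 / c.rank) / rankSum; // R(q, w, d)
            // Count each word once per query, even if it is repeated in the query text.
            String[] tokens = c.query.trim().toLowerCase().split("\\s+");
            for (String w : new LinkedHashSet<>(Arrays.asList(tokens))) {
                if (!w.isEmpty()) {
                    weights.merge(w, contribution, Double::sum); // accumulates R(w, d)
                }
            }
        }
        return weights;
    }
}
```

Sorting the resulting map by value in descending order gives a list of top terms for the URL.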

Page 30

Observation 2 cont’d

Top 40 terms for URL SearchSystems.net

– county, records, court, free, public, florida, cases, michigan, germany, probate, tax, pasco, oregon, nc, indiana, deeds, sheriff, ohio, search, hanover, etowah, criminal, texas, property, warrants, databases

Page 31

Next step

Obtain word weights for URLs more accurately
– Use the information in query chains to obtain the top terms of URLs
– Use other methods?

Obtain document summaries for several URLs and evaluate the results

Page 32

Thanks

Page 33

Discussions and Questions

Can the proposed model eventually give us a fixed document content? (Does the method converge?)

Are there any other techniques that might be helpful?

Page 34

References

[1] Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen: Web-page summarization using clickthrough data. SIGIR 2005: 194-201

[2] Alexandros Ntoulas, Junghoo Cho, Christopher Olston: What's new on the web?: the evolution of the web from a search engine perspective. WWW 2004: 1-12

[3] Junghoo Cho, Sourashis Roy: Impact of search engines on page popularity. WWW 2004: 20-29

Page 35

Reference

[4] G. Pass et al.: A Picture of Search. The First International Conference on Scalable Information Systems, Hong Kong, June 2006. Copyright (2006) AOL

[5] J. Lafferty, C. Zhai: Document Language Models, Query Models, and Risk Minimization. SIGIR 2001