Advanced information retrieval Computer engineering department Chapter 13 – Web IR.


• Introduction (History, Definitions, Aims, Statistics)
• Tasks of search engines
  – Gathering
  – Indexing
  – “Searching” (Querying and ranking algorithms)
  – Document and query management
• Metasearch
• Browsing and web directory
• Users and web search
• Research issues

Searching the web

Web Searching and Classical IR
[Figure: timeline over the 1970s, 1980s, 1990s and 2000s, labelled: classic IR research, TREC, then came the web, web searching]

Terminology and Definitions
• A web page corresponds to a document in traditional IR
• Web pages differ in their size and in the types of files that constitute them
  – text, graphics, sound, video, GIF, JPEG, ASCII, PDF, etc.
• IR on the web takes as its document collection the part of the web that is publicly indexable
  – excludes pages that cannot be indexed (authorization, dynamic pages)
• Location of web pages by navigation
• Location of web pages by searching (IR on the web)

Challenges for web search
• Problems with the data
  – Distributed data
  – High percentage of volatile data
  – Large volume
  – Unstructured data
  – Redundant data
  – Quality of data
  – Heterogeneous data
• Problems faced by users
  – How to specify a query?
  – How to interpret answers?

Search engines
• Portals
  – identification of real names
  – links to books from Amazon.com
  – send electronic postcards
  – translation to other languages
  – search of other media (metadata)
  – language-specific searches
  – weather, stock price, traffic
• Business model
  – targeted advertising with revenue from clickthroughs
  – word of mouth (since no industry standard evaluation)
  – fast response (<1s) 24 hours / 7 days a week
  – filter out spam

Examples
Search engine     URL
AltaVista         www.altavista.com
Excite            www.excite.com
Google            www.google.com
Infoseek          www.infoseek.com
Lycos             www.lycos.com
NorthernLight     www.nlsearch.com

• Only 25-55% of the web is covered!
• Some search engines are powered by the same IR engine
• Most are based in the US and are in English

Other search engines
• Specialised in different countries/languages (e.g., http://www.iol.it/)
• Specific
  – Rank according to popularity (e.g., DirectHit)
  – Topic oriented (e.g., SearchBroker)
  – Personal or institutional pages, electronic mail addresses, images, software applets

Tasks of a web search engine
• Document gathering
  – select the documents to be indexed
• Document indexing
  – represent the content of the selected documents
  – often 2 indices maintained (full + small for frequent queries)
• Searching
  – represent the user information need as a query
  – retrieval process (search algorithms, ranking of web pages)
• Document and query management
  – display the results
  – virtual collection (documents discarded after indexing) vs. physical collection (documents maintained after indexing)

Document gathering
• Document gathering = crawling the web
• Crawler
  – Robot, spider, wanderer, walker, knowbot, web search agent
  – Program that traverses the web to send new or updated pages to be indexed
  – Runs on a local server and sends requests to remote servers

Crawling the web (1)
• Crawling process
  – Start with a set of URLs
    • Submitted by users or companies
    • Popular URLs
  – Breadth-first or depth-first
  – Extract further URLs
• Up to 10 million pages per day
• Several crawlers
  – Problem of redundancy
  – Web partition: one robot per partition
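The crawling process above can be sketched as a breadth-first traversal. This is a minimal illustration, not a production crawler: the toy link graph `TOY_WEB`, the `fetch_links` callback and the `max_pages` limit are all assumptions made for the example, and no real HTTP requests are issued.

```python
from collections import deque

# Hypothetical in-memory "web": page URL -> URLs it links to.
TOY_WEB = {
    "a.example/": ["a.example/x", "b.example/"],
    "a.example/x": ["a.example/", "c.example/"],
    "b.example/": ["c.example/"],
    "c.example/": [],
}


def crawl(seeds, fetch_links, max_pages=100):
    """Breadth-first traversal from the seed URLs; returns pages in visit order."""
    frontier = deque(seeds)
    seen = set(seeds)          # avoids fetching the same URL twice
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()   # frontier.pop() would give a depth-first order
        visited.append(url)
        for link in fetch_links(url):   # extract further URLs
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


order = crawl(["a.example/"], lambda url: TOY_WEB.get(url, []))
```

With several crawlers, the `seen` set is the part that would have to be shared or partitioned to avoid redundant fetches.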

Crawling the web (2)
• Up-to-date?
  – Non-submitted pages need up to 2 months to be indexed
  – Search engine learns change frequency of pages
  – Popular pages (having many links to them) crawled more frequently
• Guideline for robot behaviours
  – File placed at root of web server
  – Indicates web pages that should not be indexed
  – Avoid overloading servers/sites
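The slide does not name the guideline file; by the usual convention (the Robots Exclusion Protocol) it is `robots.txt`. The sketch below checks URLs against such a file with Python's standard `urllib.robotparser`. The file content, the host `www.example.com` and the agent name `ExampleBot` are made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for an example site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())   # parse the rules without any network access

# A polite crawler asks before fetching each URL.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html")
```

Avoiding server overload is a separate concern (for example, pausing between requests to the same host) and is not shown here.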

Typical anatomy of a large-scale crawler.

Document indexing
• Document indexing = building the indices
• Indices are variants of inverted files
  – meta tag analysis
  – stop words removal + stemming
  – position data (for phrase searches)
  – weights
    • tf x idf
    • downweight long URLs (not important page)
    • upweight terms appearing at the top of the documents, or emphasised terms
  – use de-spamming techniques
• hyperlink information
  – count link popularity
  – anchor text from source links
  – hub and authority value of a page
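As a rough illustration of the index described above, the following sketch builds an inverted file with position data and computes a tf x idf weight. The stop-word list, the whitespace tokeniser and the particular idf formula (`log(N / df)`) are assumptions chosen for brevity; stemming, meta tags and link information are left out.

```python
import math
from collections import defaultdict

STOP_WORDS = {"the", "a", "of", "and"}   # tiny illustrative list


def build_index(docs):
    """Map term -> {doc_id: [positions]} after lower-casing and stop-word removal."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            if token not in STOP_WORDS:
                index[token].setdefault(doc_id, []).append(pos)
    return index


def tf_idf(index, term, doc_id, n_docs):
    """Raw term frequency times log inverse document frequency (one common variant)."""
    postings = index.get(term, {})
    if doc_id not in postings:
        return 0.0
    return len(postings[doc_id]) * math.log(n_docs / len(postings))


docs = {1: "the web of links", 2: "web search and web crawling", 3: "classic retrieval"}
index = build_index(docs)
weight = tf_idf(index, "web", 2, len(docs))
```

Keeping positions (rather than only counts) is what makes phrase searches possible: two terms form a phrase in a document when their positions are adjacent.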

Maintaining indices over dynamic collections.

Stop-press index
• Collection of documents in flux
  – Model document modification as deletion followed by insertion
  – Documents in flux represented by a signed record $(d, t, s)$
  – $s$ specifies whether $d$ has been deleted or inserted
• Getting the final answer to a query
  – Main index returns a document set $D_0$
  – Stop-press index returns two document sets
    • $D_+$: documents not yet indexed in $D_0$ matching the query
    • $D_-$: documents matching the query removed from the collection since $D_0$ was constructed
• Stop-press index getting too large
  – Rebuild the main index
    • signed $(d, t, s)$ records are sorted in $(t, d, s)$ order and merge-purged into the master $(t, d)$ records
  – Stop-press index can be emptied out
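A minimal sketch of answering a single-term query from the two indices follows. It assumes that `t` in a signed record is a term and that `s` is `"+"` for insertion and `"-"` for deletion; the slide does not spell out either detail, nor the combination rule. Records are replayed in arrival order, which amounts to $(D_0 \cup D_+) \setminus D_-$ when no document is both inserted and deleted for the term.

```python
def answer_query(term, main_index, stop_press):
    """Combine the main index with signed stop-press records (d, t, s)."""
    result = set(main_index.get(term, ()))   # D_0 from the main index
    for d, t, s in stop_press:               # records in arrival order
        if t != term:
            continue
        if s == "+":
            result.add(d)                    # document inserted since D_0 was built
        else:
            result.discard(d)                # document deleted since D_0 was built
    return result


main_index = {"web": {1, 2, 3}}
stop_press = [
    (4, "web", "+"),                   # new document 4
    (3, "web", "-"),                   # document 3 deleted
    (2, "web", "-"), (2, "web", "+"),  # document 2 modified: deletion then insertion
    (5, "crawler", "+"),
]
result = answer_query("web", main_index, stop_press)
```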

Searching
• Querying
  – 1 word or all words must be in the retrieved pages
  – normalisation (stop words removal, stemming, etc.)
  – complex queries (date, structure, region, etc.)
  – Boolean expressions (advanced search)
  – metadata
• Ranking algorithms
  – use of web links
  – web page authority analysis
    • HITS (Hyperlink Induced Topic Search)
    • PageRank (Google)

Meta-search systems
• Take the search engine to the document
  – Forward queries to many geographically distributed repositories
    • Each has its own search service
  – Consolidate their responses.
• Advantages
  – Perform non-trivial query rewriting
    • Suit a single user query to many search engines with different query syntax
  – Surprisingly small overlap between crawls
• Consolidating responses
  – Function goes beyond just eliminating duplicates
  – Search services do not provide standard ranks which can be combined meaningfully

Similarity search
• Cluster hypothesis
  – Documents similar to relevant documents are also likely to be relevant
• Handling “find similar” queries
  – Replication or duplication of pages
  – Mirroring of sites

Document similarity
• Jaccard coefficient of similarity between document $d_1$ and $d_2$
• $T(d)$ = set of tokens in document $d$
  – $r'(d_1, d_2) = \frac{|T(d_1) \cap T(d_2)|}{|T(d_1) \cup T(d_2)|}$
  – Symmetric, reflexive, not a metric
  – Forgives any number of occurrences and any permutations of the terms.
• $1 - r'(d_1, d_2)$ is a metric
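The Jaccard coefficient above is easy to compute with sets. The sketch below assumes a simple lower-casing whitespace tokeniser for $T(d)$ and, as a convention chosen here, returns 1.0 for two empty documents.

```python
def tokens(text):
    """T(d): the set of tokens in a document (whitespace tokeniser, lower-cased)."""
    return set(text.lower().split())


def jaccard(d1, d2):
    """|T(d1) & T(d2)| / |T(d1) | T(d2)|."""
    t1, t2 = tokens(d1), tokens(d2)
    if not t1 and not t2:
        return 1.0   # convention chosen here for two empty documents
    return len(t1 & t2) / len(t1 | t2)


a = "web search engines crawl the web"
b = "the web search engines crawl"
c = "hierarchical clustering of documents"

same = jaccard(a, b)       # repetitions and word order are forgiven
disjoint = jaccard(a, c)   # no shared tokens
```

Because `a` and `b` differ only in word order and in a repeated token, they get the maximal score, which is the "forgives any number of occurrences and any permutations" property.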

Finding near duplicates algorithm

Use of web links
• Web link: represents a relationship between the connected pages
• The main difference between standard IR algorithms and web IR algorithms is the massive presence of web links
  – web links are a source of evidence but also a source of noise
  – classical IR: citation-based IR
  – web track in TREC, 2000, TREC-9: Small Web task (2GB of web data); Large Web task (100GB of web data, 18.5 million documents)

Use of anchor text
• represent referenced document
  – why?
    • provides more accurate and concise description than the page itself
    • (probably) contains more significant terms than the page itself
  – used by ‘WWW Worm’, one of the first search engines (1994)
  – representation of images, programs, …
• generate page descriptions from anchor text

Algorithms
• Query independent page quality
  – global analysis
    • PageRank (Google): simulates a random walk across the web and computes the “score” of a page as the probability of reaching the page
• Query dependent page quality
  – local analysis
    • HITS (Hyperlink Induced Topic Search): focusses on broad topic queries that are likely to be answered with too many pages
      – the more a page is pointed to by other pages, the more popular is the page
      – popular pages are more likely to include relevant information than non-popular pages

PageRank (1)
• Designed by Brin and Page at Stanford University
• Used to implement Google
• Main idea:
  – a page has a high rank if the sum of the ranks of its in-links is high
    • in-link of page p: a link from a page to page p
    • out-link of a page p: a link from page p to a page
  – a high PageRank page has many in-links or few highly ranked in-links
• Retrieval: use cosine product (content, feature, term weight) combined with PageRank value

PageRank (2)
• Random Surfer Model: user randomly navigates
  – Initially the surfer is at a random page
  – At each step the surfer proceeds
    • to a randomly chosen Web page with probability $d$, called the “damping factor” (e.g. probability of random jump = 0.2)
    • to a randomly chosen page linked to the current page with probability $1-d$ (e.g. probability of following a random outlink = 0.8)
• Process modelled by Markov Chain
  – PageRank $PR$ of a page $a$ = probability that the surfer is at page $a$ at a given time

$PR(a) = K d + K (1-d) \sum_{i=1}^{n} \frac{PR(a_i)}{C(a_i)}$

  – $d$ set by system
  – $a$ = page pointed to by $a_i$ for $i = 1, \dots, n$
  – $K$ = normalisation factor
  – $C(a_i)$ = number of outlinks of $a_i$
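The formula above can be iterated to a fixed point. The following is a sketch under stated assumptions, not Google's implementation: scores are kept so that they sum to the number of pages, $K$ is interpreted as the factor that restores that sum after each pass, and the result is rescaled to probabilities at the end. The three-page graph is made up for illustration.

```python
def pagerank(out_links, d=0.2, iterations=50):
    """Iterate PR(a) = K*d + K*(1-d) * sum(PR(a_i) / C(a_i)) over in-linking pages a_i."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 for p in pages}                 # scores sum to n
    for _ in range(iterations):
        raw = {p: d for p in pages}              # random-jump term
        for src, targets in out_links.items():
            for dst in targets:                  # src passes rank along each outlink
                raw[dst] += (1 - d) * pr[src] / len(targets)
        k = n / sum(raw.values())                # normalisation factor K
        pr = {p: k * v for p, v in raw.items()}
    return {p: v / n for p, v in pr.items()}     # rescale to probabilities


# Hypothetical toy graph: page -> pages it links to.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(toy_graph)
```

In this toy graph page `c` collects rank from both `a` and `b`, so it ends up with the highest score, while `b`, with a single in-link shared with `c`, ends up lowest.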

HITS: Hyperlink Induced Topic Search
• Originated from Kleinberg, 1997
• Also referred to as “The Connectivity Analysis Approach”
• Broad topic queries produce large sets of retrieved results
  – abundance problem: too many relevant documents
  – new type of quality measure needed: distinguish the most “authoritative” pages, giving a high-quality response to a broad query
• HITS: for a certain topic, it identifies
  – good authorities
    • pages that contain relevant information (good sources of content)
  – good hubs
    • pages that point to useful pages (good sources of links)

HITS (2)
• Intuition
  – authority comes from inlinks
  – being a good hub comes from outlinks
  – better authority comes from inlinks from good hubs
  – being a better hub comes from outlinks to good authorities
• Mutual reinforcement between hubs and authorities
  – a good authority page is pointed to by many hub pages
  – a good hub page points to many authority pages

HITS (3)
• Set of pages S that are retrieved
  – sometimes the $k$ (e.g. $k$ = 200) top-ranked pages
• Set of pages T that point to or are pointed to by the retrieved set of pages S
  – ranking pages according to in_degree (number of in-links) is not effective
• Set of pages T:
  – authoritative pages relevant to the query should have a large in_degree
  – considerable overlap in the sets of pages that point to them: hubs

Algorithm for HITS (General Principle)
• Computation of hub and authority value of a page through the iterative propagation of “authority weight” and “hub weight”
• Initially all values equal to 1
• Authority weight of page $x(p)$
  – if $p$ is pointed to by many pages with large y-values, then it should receive a large x-value
    $x(p) = \sum_{q_i \to p} y(q_i)$
• Hub weight of page $y(p)$
  – if $p$ points to many pages with large x-values, then it should receive a large y-value
    $y(p) = \sum_{p \to q_i} x(q_i)$
• After each computation (iteration), weights are normalised
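The two update rules above can be sketched as follows. Several details are assumptions the slide leaves open: hub weights are computed from the freshly updated authority weights, normalisation uses the Euclidean norm, and a fixed number of iterations replaces a convergence test. The link graph is a toy example.

```python
import math


def normalise(weights):
    """Scale weights to unit Euclidean length (one possible normalisation)."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {p: v / norm for p, v in weights.items()}


def hits(links, iterations=20):
    """links: page -> pages it points to. Returns (authority x, hub y)."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    x = {p: 1.0 for p in pages}   # authority weights, initially 1
    y = {p: 1.0 for p in pages}   # hub weights, initially 1
    for _ in range(iterations):
        new_x = {p: 0.0 for p in pages}
        for q, targets in links.items():
            for p in targets:
                new_x[p] += y[q]                  # x(p) = sum of y(q) over q -> p
        new_y = {p: sum(new_x[q] for q in links.get(p, ()))   # y(p) = sum of x(q) over p -> q
                 for p in pages}
        x, y = normalise(new_x), normalise(new_y)
    return x, y


# Hypothetical toy graph: three hub-like pages pointing at two content pages.
toy_links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
authority, hub = hits(toy_links)
```

Here `a2` is pointed to by all three hubs and so gets the largest authority weight, and `h1` and `h2`, which point to both authorities, score higher as hubs than `h3`.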

Topic distillation
• Process of finding ‘quality’ (authority - hub) Web documents related to a query topic, given an initial user information need.
• Extensions of HITS
  – ARC (Automatic Resource Compilation)
    • distance-2 neighbourhood graph
    • anchor (& surrounding) text used in computation of hub and authority values
  – SALSA, etc.
• Problems with HITS
  – mutual reinforcing relationship between hosts
  – automatically generated links
  – non relevant highly connected pages
  – topic drift: generalisation of the query topic

Difference between PageRank and HITS
• PageRank is computed for all web pages stored in the database, prior to the query; HITS is performed on the set of retrieved web pages, for each query.
• HITS computes authorities and hubs; PageRank computes authorities only.
• PageRank: non-trivial to compute; HITS: easy to compute, but real-time execution is hard
• Implementation details of PageRank have been reported

Document and query management
• Results
  – Usually screens of 10 pages
  – Clustering
  – URL, size, date, abstract, etc.
  – Various sorting
  – Most similar documents options
  – Query refinement
• Virtual collection vs. physical collection
  – document can change over time
  – document may be different from the one that has been indexed
  – broken link

Metasearch (1)
• Problems of Web search engines:
  – limited coverage of the publicly indexable Web
  – index different overlapping sections of the Web
  – based on different IR models: different results to the same query
  – users do not have the time or knowledge to select the most appropriate search engines with regard to their information need
• Possible solution: metasearch engines
  – Web server that sends query to
    • Several search engines, Web directories, Databases
  – Collect results
  – Unify (merge) them: data fusion
• Aim: better coverage, increase effectiveness

Metasearch (2)
• Divided into phases
  – search engine selection
    • topic-dependent, past queries, network traffic, …
  – document selection
    • how many documents from each search engine?
  – merging algorithm
    • utilise rank positions, document retrieval scores, …

Metasearcher     URL                    Sources used
MetaCrawler      www.metacrawler.com    13
Dogpile          www.dogpile.com        25
SavvySearch      www.search.com         > 1000
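As an illustration of the merging phase, the sketch below fuses ranked result lists using rank positions only. The reciprocal-rank scoring rule is just one simple choice made for this example; it is not taken from the slides and is not claimed to be what MetaCrawler, Dogpile or SavvySearch use. The engine result lists are made up.

```python
from collections import defaultdict


def merge_by_rank(result_lists):
    """Fuse ranked URL lists: each engine contributes 1/rank, duplicates add up."""
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / rank
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda url: (-scores[url], url))


engine_a = ["u1", "u2", "u3"]   # hypothetical results from one engine
engine_b = ["u2", "u4"]         # hypothetical results from another
merged = merge_by_rank([engine_a, engine_b])
```

Using ranks rather than raw retrieval scores sidesteps the problem that scores from different engines are not directly comparable; duplicates are eliminated as a side effect, and a page returned by several engines is promoted.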

MetaCrawler

Browsing
• Web directory
  – Catalog, yellow pages, subject categories
  – Many standard search engines provide subject categories now
• Hierarchical taxonomy that classifies human knowledge

Arts & Humanities          Investing
Automotive                 Kids & Family
Business & Economy         Life & Style
Computers & Internet       News
Education                  People
Employment                 Philosophy & Religion
Entertainment & Leisure    Politics
Games                      Science & Technologies
Government                 Social Science …

Yahoo!
• Around one million classified pages
• More than 14 country specific directories (Chinese, Danish, French, …)
• Manual classification (has its problems)
• Acyclic graphs
• Search within the classified pages/taxonomy
• Problem of coverage

Users and the web (1)
• Queries on the web: average values

Measure                     Average value    Range
Number of words             2.35             0 - 393
Number of operators         0.41             0 - 958
Repetitions of queries      3.97             1 - 1.5 million
Queries per user session    2.02             1 - 173325
Screens per query           1.39             1 - 78496

Users and the web (2)
• Some statistics
  – Main purpose: research, leisure, business, education
    • products and services (e-commerce)
    • people and company names and home pages
    • factoids (from any one of a number of documents)
    • entire, broad documents
    • mp3, image, video, audio
  – 80% do not modify the query
  – 85% look at the first screen only
  – 64% of queries are unique
  – 25% of users use single keywords (problem for polysemic words and synonyms)
  – 10% of queries are empty

Users and the web (3)
• Results of search engines are identical, independent of
  – user
  – context in which the user made the request
• Adding context information for improving search results: focus on the user need and answer it directly
  – explicit context
    • query + category
  – implicit context
    • based on documents edited or viewed by user
  – personalised search
    • previous requests and interests, user profile

Motivation
• Problem: Query word could be ambiguous:
  – E.g. the query “Star” retrieves documents about astronomy, plants, animals etc.
  – Solution: Visualisation
    • Clustering document responses to queries along lines of different topics.
• Problem 2: Manual construction of topic hierarchies and taxonomies
  – Solution:
    • Preliminary clustering of large samples of web documents.
• Problem 3: Speeding up similarity search
  – Solution:
    • Restrict the search for documents similar to a query to most representative cluster(s).

Example
[Figure] Scatter/Gather, a text clustering system, can separate salient topics in the response to keyword queries. (Image courtesy of Hearst)

Clustering
• Task: Evolve measures of similarity to cluster a collection of documents/terms into groups such that similarity within a cluster is larger than across clusters.
• Cluster Hypothesis: Given a ‘suitable’ clustering of a collection, if the user is interested in document/term d/t, he is likely to be interested in other members of the cluster to which d/t belongs.
• Similarity measures
  – Represent documents by TFIDF vectors
  – Distance between document vectors
  – Cosine of angle between document vectors
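The similarity measures listed above can be sketched with sparse TFIDF vectors stored as dictionaries. The weighting variant (raw term frequency times `log(N / df)`), the whitespace tokeniser and the three toy documents around the ambiguous word "star" are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(docs)
    df = Counter(term for toks in tokenised for term in set(toks))   # document frequency
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(toks).items()}
            for toks in tokenised]


def cosine(u, v):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = ["star galaxy telescope", "galaxy telescope orbit", "movie star actor"]
vecs = tfidf_vectors(docs)
astro = cosine(vecs[0], vecs[1])   # two astronomy documents
mixed = cosine(vecs[0], vecs[2])   # share only the ambiguous term "star"
```

The first document is closer to the other astronomy document than to the film document, which is the kind of separation a clustering step relies on.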

Clustering: Parameters
• Similarity measure: $\rho(d_1, d_2)$ (e.g. cosine similarity)
• Distance measure: $\delta(d_1, d_2)$ (e.g. Euclidean distance)
• Number “k” of clusters
• Issues
  – Large number of noisy dimensions
  – Notion of noise is application dependent

Clustering: Formal specification
• Partitioning Approaches
  – Bottom-up clustering
  – Top-down clustering
• Geometric Embedding Approaches
  – Self-organization map
  – Multidimensional scaling
  – Latent semantic indexing
• Generative models and probabilistic approaches
  – Single topic per document
  – Documents correspond to mixtures of multiple topics

Partitioning Approaches
• Partition document collection into $k$ clusters $\{D_1, D_2, \dots, D_k\}$
• Choices:
  – Minimize intra-cluster distance $\sum_i \sum_{d_1, d_2 \in D_i} \delta(d_1, d_2)$
  – Maximize intra-cluster semblance $\sum_i \sum_{d_1, d_2 \in D_i} \rho(d_1, d_2)$
• If cluster representations $\vec{D}_i$ are available
  – Minimize $\sum_i \sum_{d \in D_i} \delta(d, \vec{D}_i)$
  – Maximize $\sum_i \sum_{d \in D_i} \rho(d, \vec{D}_i)$
• Soft clustering
  – $d$ assigned to $D_i$ with ‘confidence’ $z_{d,i}$
  – Find $z_{d,i}$ so as to minimize $\sum_i \sum_{d \in D_i} z_{d,i}\, \delta(d, \vec{D}_i)$ or maximize $\sum_i \sum_{d \in D_i} z_{d,i}\, \rho(d, \vec{D}_i)$
• Two ways to get partitions: bottom-up clustering and top-down clustering

Bottom-up clustering (HAC)
• Initially $G$ is a collection of singleton groups, each with one document $d$
• Repeat
  – Find $\Gamma$, $\Delta$ in $G$ with max similarity measure $s(\Gamma \cup \Delta)$
  – Merge group $\Gamma$ with group $\Delta$
• For each $\Gamma$ keep track of best $\Delta$
• Use above info to plot the hierarchical merging process (dendrogram)
• To get desired number of clusters: cut across any level of the dendrogram

Dendrogram
[Figure] A dendrogram presents the progressive, hierarchy-forming merging process pictorially.

Similarity measure
• Typically $s(\Gamma \cup \Delta)$ decreases with increasing number of merges
• Self-similarity
  – Average pairwise similarity between documents in $\Gamma$:
    $s(\Gamma) = \frac{1}{\binom{|\Gamma|}{2}} \sum_{d_1, d_2 \in \Gamma} s(d_1, d_2)$
  – $s(d_1, d_2)$ = inter-document similarity measure (say cosine of tfidf vectors)
  – Other criteria: maximum/minimum pairwise similarity between documents in the clusters
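Bottom-up clustering with the self-similarity criterion can be sketched as below. This version is deliberately naive: it recomputes every candidate merge from scratch and does not keep the "best partner" bookkeeping mentioned on the HAC slide. The one-dimensional "documents" and the similarity function `sim` are stand-ins invented for the example; in practice `sim` would be something like the cosine of tfidf vectors.

```python
from itertools import combinations


def self_similarity(group, sim):
    """Average pairwise similarity over all pairs of documents in the group."""
    pairs = list(combinations(group, 2))
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def hac(docs, sim, k):
    """Repeatedly merge the two groups whose union has maximal self-similarity."""
    groups = [(d,) for d in docs]          # singleton groups
    merges = []                            # merge history, enough to draw a dendrogram
    while len(groups) > k:
        g1, g2 = max(combinations(groups, 2),
                     key=lambda pair: self_similarity(pair[0] + pair[1], sim))
        groups.remove(g1)
        groups.remove(g2)
        groups.append(g1 + g2)
        merges.append((g1, g2))
    return groups, merges


points = [1.0, 1.1, 5.0, 5.2, 9.0]                 # toy one-dimensional "documents"
sim = lambda a, b: 1.0 / (1.0 + abs(a - b))        # toy similarity: closer means more similar
groups, merges = hac(points, sim, k=3)
```

Stopping at `k` groups corresponds to cutting the dendrogram at the level where `k` clusters remain.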

Switch to top-down
• Bottom-up
  – Requires quadratic time and space
• Top-down or move-to-nearest
  – Internal representation for documents as well as clusters
  – Partition documents into ‘k’ clusters
  – 2 variants
    • “Hard” (0/1) assignment of documents to clusters
    • “Soft”: documents belong to clusters, with fractional scores
  – Termination
    • when assignment of documents to clusters ceases to change much, OR
    • when cluster centroids move negligibly over successive iterations

Top-down clustering
• Hard k-Means: Repeat…
  – Choose $k$ arbitrary ‘centroids’
  – Assign each document to nearest centroid
  – Recompute centroids
• Soft k-Means:
  – Don’t break close ties between document assignments to clusters
  – Don’t make documents contribute to a single cluster which wins narrowly
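Hard k-means as outlined above can be sketched in a few lines. The choices here are assumptions for the example: documents are 2-D points rather than TFIDF vectors, the "arbitrary" initial centroids are simply the first `k` points (so the run is reproducible), distance is squared Euclidean, and iteration stops when the centroids no longer move.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(cluster):
    """Centroid of a non-empty list of points."""
    return tuple(sum(coord) / len(cluster) for coord in zip(*cluster))


def k_means(points, k, iterations=10):
    centroids = list(points[:k])           # arbitrary initial centroids: first k points
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                   # assign each point to its nearest centroid
            nearest = min(range(k), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [mean(c) if c else centroids[i]   # keep old centroid if cluster is empty
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:     # centroids stopped moving: terminate
            break
        centroids = new_centroids
    return centroids, clusters


points = [(0, 0), (0, 1), (10, 10), (10, 11), (1, 0)]
centroids, clusters = k_means(points, k=2)
```

A soft variant would replace the single `nearest` assignment with fractional weights per cluster, so that documents near a boundary contribute to several centroids.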

Research issues
• Modelling
• Querying
• Distributed architecture
• Ranking
• Indexing
• Dynamic pages
• Browsing
• User interface
• Duplicated data
• Multimedia
