Relevance Ranking and Clustering
Small steps towards making the
library catalogue more useful
Kent Fitch, 16 Sep 2006
Motivation
Help people find what they’re looking for
The problem
• A reference librarian often has lots of context when someone walks up to them and says “The Civil War”
  – Location
  – Age
  – Clothing
  – What’s on the local syllabus
  – Books they’re carrying
  – Past interactions
  – …
• A computer program has 13 characters
Diversion – improving the context?
• IP address
  – ANU, DFAT, BHP, Nicholls Primary
• Search history
  – “spanish history”, “franco”, “gettysburg”
• Referrer
  – ANU Library, Wikipedia, MySpace
• Browser
  – Visually impaired user?
Relevance ranking
“The Civil War”: more relevant if
• Occurs in Title/Subject/Author rather than notes/TOC; main Title/Author rather than added entry…
• Occurs as a phrase or near phrase rather than as scattered words
• Occurs as an exact match
• Occurs multiple times (especially the unusual words)
• Occurs as the only or main words (e.g., as the only subject rather than as 1 of 10)
• Is a collection level record
• Is widely held
• Is held by one of your libraries
• Is on the shelf at one of your libraries
• Is available online
• Is highly rated (sales/reviews) on Amazon or LibraryThing
• Is widely cited by other books or by credible web pages
• Is available for inexpensive purchase and quick delivery new or second hand
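A few of the criteria above can be sketched as a scoring function. This is a minimal illustration, not the prototype's code: the field weights, the match-quality values (0.25 for scattered words, 0.5 to 1.0 for a phrase) and the holdings boost are all assumed numbers, and records are plain dictionaries with hypothetical `title`, `subject`, `author`, `notes` and `holdings` keys.

```python
import math

# Assumed weights: a match in title/subject/author counts for more than notes.
FIELD_WEIGHTS = {"title": 4.0, "subject": 3.0, "author": 3.0, "notes": 1.0}
PUNCT = '.,;:!?"“”()'

def normalise(text):
    """Lower-case the text and split it into words without edge punctuation."""
    return [w for w in (t.strip(PUNCT).lower() for t in text.split()) if w]

def match_quality(query_words, field_words):
    """0.0 = no match, 0.25 = scattered words, up to 1.0 = exact match."""
    if not query_words or not field_words:
        return 0.0
    if not all(w in field_words for w in query_words):
        return 0.0
    n = len(query_words)
    is_phrase = any(field_words[i:i + n] == query_words
                    for i in range(len(field_words) - n + 1))
    if not is_phrase:
        return 0.25
    # A phrase scores higher the more of the field it makes up;
    # when it is the whole field (an exact match) this reaches 1.0.
    return 0.5 + 0.5 * n / len(field_words)

def score(query, record):
    """Weighted sum of per-field match quality, boosted if widely held."""
    q = normalise(query)
    text_score = sum(weight * match_quality(q, normalise(record.get(field, "")))
                     for field, weight in FIELD_WEIGHTS.items())
    holdings_boost = 1.0 + 0.1 * math.log10(1 + record.get("holdings", 0))
    return text_score * holdings_boost

# Hypothetical records for illustration only.
exact = {"title": "The Civil War", "holdings": 100}
scattered = {"title": "War and civil society: the aftermath", "holdings": 100}
in_notes = {"title": "Farming", "notes": "Includes a chapter on the civil war",
            "holdings": 100}
```

With these assumed weights, `score("The Civil War", exact)` is the highest of the three, and a record with no matching words scores zero.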
Relevance Ranking
Two approaches
– TeraText Gateway
  • Issue a series of searches, one for each successive criterion
  • Very hard to incorporate non-binary factors (such as quality of phrase match, number of holdings, …)
– Lucene
  • Combine a “score” for each criterion with an innate “score” for each work
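The difference between the two approaches can be sketched in a few lines. This is an assumption-laden illustration that uses neither the TeraText nor the Lucene API: `tiered_search` ranks a record by the first yes/no test it passes, while `scored_search` multiplies a weighted sum of per-criterion scores by an innate per-work score (here a holdings-based boost, which is an assumed choice, as is multiplication as the way of combining them).

```python
import math

def tiered_search(records, tests):
    """Successive searches: a record is ranked by the first test it passes."""
    seen, ranked = set(), []
    for test in tests:
        for rec in records:
            if rec["id"] not in seen and test(rec):
                seen.add(rec["id"])
                ranked.append(rec["id"])
    return ranked

def scored_search(records, criteria, innate):
    """Weighted per-criterion scores, multiplied by an innate work score."""
    scored = []
    for rec in records:
        query_score = sum(weight * fn(rec) for weight, fn in criteria)
        if query_score > 0:
            scored.append((query_score * innate(rec), rec["id"]))
    scored.sort(key=lambda pair: -pair[0])
    return [rec_id for _, rec_id in scored]

# Hypothetical records and criteria for the query "the civil war".
records = [
    {"id": "A", "title": "the civil war", "holdings": 2},
    {"id": "B", "title": "the civil war", "holdings": 500},
    {"id": "C", "title": "a history of the civil war", "holdings": 50},
]
is_exact = lambda rec: rec["title"] == "the civil war"
has_phrase = lambda rec: "the civil war" in rec["title"]

tiered = tiered_search(records, [is_exact, has_phrase])
scored = scored_search(
    records,
    [(2.0, lambda r: float(is_exact(r))), (1.0, lambda r: float(has_phrase(r)))],
    innate=lambda r: 1.0 + math.log10(1 + r["holdings"]),
)
```

In the tiered result A stays ahead of B because both pass the same test and nothing separates them; in the scored result the widely held B moves ahead of A, which is the kind of non-binary factor the tiered approach struggles to express.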
Relevance Ranking
Example
http://ll01.nla.gov.au/
Clustering
Relevance ranking only takes you so far
Relevant to what?
• English civil war
• US civil war
• Spanish civil war
• Angolan civil war
• The church and civil wars
• Post-colonial civil wars
Relevant to whom?
• Audience
• Date published
• Form
  – Picture book
  – Movie
  – Thesis…
Clustering
Group results by various criteria
• Subjects (hierarchy or parts/facets)
• Material type/form
• Genre
• When published
• Audience
• Classification (Dewey, LC)
• Author
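Grouping a result set by criteria like those above amounts to counting values per facet. A minimal sketch, assuming each result is a dictionary whose facet values are either a string or a list of strings; the facet names and sample records are hypothetical.

```python
from collections import Counter

def cluster(results, facets):
    """Count how many results carry each value of each facet."""
    clusters = {facet: Counter() for facet in facets}
    for rec in results:
        for facet in facets:
            values = rec.get(facet, [])
            if isinstance(values, str):
                values = [values]
            # set() so a value repeated within one record is counted once
            clusters[facet].update(set(values))
    return clusters

# Hypothetical results for illustration only.
results = [
    {"subject": ["US civil war"], "form": "Book", "audience": "Adult"},
    {"subject": ["US civil war", "Spanish civil war"], "form": "Book"},
    {"subject": ["Spanish civil war"], "form": "Movie", "audience": "Adult"},
]
clusters = cluster(results, ["subject", "form", "audience"])
```

`clusters["form"].most_common()` then gives the values to show for that facet, largest group first; records with no value for a facet are simply not counted in it.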
Extracting data from the MARC record for ranking and clustering
• What’s a “title”?
• Deriving ranking and clustering fields
  – Can we use LC/Dewey code names as “subjects”?
    http://ll01.nla.gov.au/search.jsp?topic=class%253A632%2BPlant%2Binjuries%252C%2Bdiseases%252C%2Bpests
  – Can we reliably set
    • “audience” based on 650 0 v Juvenile fiction
    • Genre: “percussion xylophone” based on 048 a pb01
    • Genre: “bibliography” and “technical report” based on 008 040308s2003 xraa bt f000 0 eng
    • Subject: “United States -- Florida” based on 043 a n-us-fl
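The four derivations quoted above could be sketched as follows. This is an illustration under stated assumptions, not the prototype's extraction code: the record is a simplified list of dictionaries rather than a real MARC parser's output, the lookup tables hold only the codes mentioned above, and the 008 string is re-padded to its fixed 40 characters on the assumption that the version quoted above had its runs of blanks collapsed. Positions 24-27 of a book 008 hold the nature-of-contents codes.

```python
# Illustrative lookup tables: only the codes quoted above are filled in.
INSTRUMENT_CODES = {"pb": "percussion xylophone"}               # 048 $a, first 2 chars
GEOGRAPHIC_CODES = {"n-us-fl": "United States -- Florida"}      # 043 $a
CONTENT_CODES = {"b": "bibliography", "t": "technical report"}  # 008/24-27 (books)

def derive_fields(record):
    """Derive audience, genre and subject values from a simplified record."""
    derived = {"audience": set(), "genre": set(), "subject": set()}
    for field in record:
        tag = field["tag"]
        subfields = field.get("subfields", [])
        if tag == "008":
            for code in field["data"][24:28]:
                if code in CONTENT_CODES:
                    derived["genre"].add(CONTENT_CODES[code])
        elif tag == "048":
            for code, value in subfields:
                if code == "a" and value[:2] in INSTRUMENT_CODES:
                    derived["genre"].add(INSTRUMENT_CODES[value[:2]])
        elif tag == "043":
            for code, value in subfields:
                if code == "a" and value in GEOGRAPHIC_CODES:
                    derived["subject"].add(GEOGRAPHIC_CODES[value])
        elif tag == "650":
            for code, value in subfields:
                if code == "v" and value.strip(" .").lower() == "juvenile fiction":
                    derived["audience"].add("juvenile")
    return derived

# Assumed re-padding of the quoted 008 to 40 fixed positions.
FIELD_008 = ("040308" "s" "2003" "    " "xra" "a   " " " " "
             "bt  " "f" "0" "0" "0" " " "0" " " "eng" " " " ")

# The four fields come from four different example records in the text;
# they are combined here only to exercise every branch.
record = [
    {"tag": "008", "data": FIELD_008},
    {"tag": "043", "ind": "  ", "subfields": [("a", "n-us-fl")]},
    {"tag": "048", "ind": "  ", "subfields": [("a", "pb01")]},
    {"tag": "650", "ind": " 0", "subfields": [("v", "Juvenile fiction")]},
]
derived = derive_fields(record)
```

Whether such rules are reliable across a whole catalogue is exactly the open question posed above; the sketch only shows that each rule is mechanically simple once the field layout is known.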
Clustering
Example
http://ll01.nla.gov.au/
Please Help
http://ll01.nla.gov.au/ is a prototype
• What do you like and dislike about it?
• How can it be improved?