Recommendation Engines for Scientific Literature


Description

I gave this talk at the Workshop on Recommender Engines @ TUG (http://bit.ly/yuxrAM) on 2012/12/19. It presents a selection of algorithms and experimental data that are commonly used in recommending scientific literature. Real-world results from Mendeley's article recommendation system are also presented. The work presented here has been partially funded by the European Commission as part of the TEAM IAPP project (grant no. 251514) within the FP7 People Programme (Marie Curie).

Transcript of Recommendation Engines for Scientific Literature

Recommendation Engines for Scientific

Literature

Kris Jack, PhD, Data Mining Team Lead

➔ 2 recommendation use cases

➔ literature search with Mendeley

➔ use case 1: related research

➔ use case 2: personalised recommendations

Summary

Use Cases

Two types of recommendation use cases:

1) Related Research
● given 1 research article
● find other related articles

2) Personalised Recommendations
● given a user's profile (e.g. interests)
● find articles of interest to them


My secondment (Dec-Feb):

Literature Search Using Mendeley

Use only Mendeley to perform literature search for:
● Related research
● Personalised recommendations

Challenge!

Eating your own dog food...

Queries: “content similarity”, “semantic similarity”, “semantic relatedness”, “PubMed related articles”, “Google Scholar related articles”

Found: 0 → 1 → 2 → 4 documents as the searches progressed

Literature Search Using Mendeley

Summary of Results

Strategy | Num Docs Found | Comment
Catalogue Search | 19 | 9 from “Related Research”
Group Search | 0 | Needs work
Perso Recommendations | 45 | Led to a group with 37 docs!

Found: 64

Literature Search Using Mendeley

Summary of Results

Eating your own dog food...  Tastes good!


64 => 31 docs, read 14 so far, so what do they say...?

Use Cases

1) Related Research
● given 1 research article
● find other related articles

Use Case 1: Related Research

User study (e.g. Likert scale to rate relatedness between documents). (Beel & Gipp, 2010)

TREC collections with hand classified 'related articles' (e.g. TREC 2005 genomics track). (Lin & Wilbur, 2007)

Try to reconstruct a document's reference list (Pohl, Radlinski, & Joachims, 2007; Vellino, 2009)

7 highly relevant papers (related research for scientific articles)

Q1/4: How are the systems evaluated?

Use Case 1: Related Research

Paper reference lists (Pohl et al., 2007; Vellino, 2009)

Usage data (e.g. PubMed, arXiv) (Lin & Wilbur, 2007)

Document content (e.g. metadata, co-citation, bibliographic coupling) (Gipp, Beel, & Hentschel, 2009)

Collocation in mind maps (Beel & Gipp, 2010)

7 highly relevant papers (related research for scientific articles)

Q2/4: How are the systems trained?

Use Case 1: Related Research

bm25 (Lin & Wilbur, 2007)

Topic modelling (Lin & Wilbur, 2007)

Collaborative filtering (Pohl et al., 2007)

Bespoke heuristics for feature extraction (e.g. in-text citation metrics for same sentence, paragraph). (Pohl et al., 2007; Gipp et al., 2009)

7 highly relevant papers (related research for scientific articles)

Q3/4: Which techniques are applied?

Use Case 1: Related Research

Topic modelling slightly improves on BM25 (MEDLINE abstracts) (Lin & Wilbur, 2007):
● BM25 = 0.383 precision @ 5
● PMRA = 0.399 precision @ 5
(precision @ 5 is illustrated in the short sketch after this slide)

Seeding CF with usage data from arXiv won out over using citation lists (Pohl et al., 2007)

No significant results found yet showing whether content-based or CF methods are better for this task

7 highly relevant papers (related research for scientific articles)

Q4/4: Which techniques have most success?
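For reference, precision @ 5 is simply the fraction of the top five suggested articles that are judged related. A minimal sketch of the metric, using made-up document IDs rather than data from the cited studies:

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved items that are in the relevant set."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Hypothetical example: 2 of the top 5 suggestions are truly related.
retrieved = ["d7", "d3", "d9", "d1", "d4", "d8"]
relevant = {"d3", "d4", "d6"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
```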

Use Case 1: Related Research

Progress so far...

Q1/2 How do we evaluate our system?

Construct a non-complex data set of related research:
● include groups with 10-20 documents (i.e. topics)
● no overlaps between groups (i.e. no documents in common)
● only take documents that are recognised as being in English
● document metadata must be 'complete' (i.e. has title, year, author, published in, abstract, filehash, tags/keywords/MeSH terms)

→ 4,382 groups → mean size = 14 → 60,715 individual documents

Given a doc, aim to retrieve the other docs from its group
● tf-idf with Lucene implementation (a minimal sketch of this retrieval set-up follows)
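The slides mention a Lucene tf-idf implementation; as a rough stand-in, here is a minimal sketch of the same retrieve-and-score idea using scikit-learn's TfidfVectorizer, with hypothetical documents and a single metadata field (abstract):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: each document has a group id and one metadata field
# (its abstract) used as the retrieval text.
docs = [
    {"id": "a1", "group": "g1", "abstract": "topic models for text collections"},
    {"id": "a2", "group": "g1", "abstract": "latent dirichlet allocation for documents"},
    {"id": "a3", "group": "g2", "abstract": "collaborative filtering for recommendations"},
    {"id": "a4", "group": "g2", "abstract": "item based collaborative filtering"},
]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([d["abstract"] for d in docs])

def related(query_index, k=5):
    """Return the indices of the k most similar documents (excluding the query)."""
    sims = cosine_similarity(matrix[query_index], matrix).ravel()
    ranked = sims.argsort()[::-1]
    return [i for i in ranked if i != query_index][:k]

# Precision @ k: how many retrieved documents share the query document's group?
query = 0
hits = related(query, k=5)
same_group = [i for i in hits if docs[i]["group"] == docs[query]["group"]]
print(len(same_group) / max(len(hits), 1))
```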


Use Case 1: Related Research

Progress so far...

Q1/2 How do we evaluate our system?


[Chart: Metadata Presence in Documents — % of documents that each metadata field (title, year, author, publishedIn, fileHash, abstract, generalKeyword, meshTerms, keywords, tags) appears in, comparing the evaluation data set (Group) against the Catalogue]

Use Case 1: Related Research

Progress so far...

Q2/2 What are our results?

[Chart: tf-idf Precision per Field for Complete Data Set — precision @ 5 (scale 0 to 0.3) by metadata field: abstract, title, generalKeyword, mesh-term, author, keyword, tag]

Use Case 1: Related Research

Progress so far...

Q2/2 What are our results?

[Chart: tf-idf Precision per Field when Field is Available — precision @ 5 (scale 0 to 0.5) by metadata field: tag, abstract, mesh-term, title, general-keyword, author, keyword]

Use Case 1: Related Research

Progress so far...

Q2/2 What are our results?

BestCombo = abstract+author+general-keyword+tag+title

[Chart: tf-idf Precision for Field Combos for Complete Data Set — precision @ 5 (scale 0 to 0.4) for bestCombo, abstract, title, generalKeyword, mesh-term, author, keyword, tag]
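One plausible way to realise a field combination such as bestCombo (abstract + author + general-keyword + tag + title) is to concatenate the selected metadata fields into a single text per document before tf-idf indexing. The field names and document below are illustrative, not Mendeley's actual schema:

```python
# Illustrative field combination: join the chosen metadata fields into one
# text per document, then index that text with tf-idf as before.
BEST_COMBO = ("abstract", "author", "generalKeyword", "tag", "title")

def combo_text(doc, fields=BEST_COMBO):
    """Join the selected metadata fields, skipping any that are missing."""
    parts = [str(doc[f]) for f in fields if doc.get(f)]
    return " ".join(parts)

doc = {"title": "Topic models", "author": "Blei", "abstract": "LDA for text"}
print(combo_text(doc))  # "LDA for text Blei Topic models"
```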

Use Case 1: Related Research

Progress so far...

Q2/2 What are our results?

BestCombo = abstract+author+general-keyword+tag+title

[Chart: tf-idf Precision for Field Combos when Field is Available — precision @ 5 (scale 0 to 0.5) for tag, bestCombo, abstract, mesh-term, title, general-keyword, author, keyword]

Use Case 1: Related Research

Future directions...?

Evaluate multiple techniques on same data set

Construct public data set
● similar to current one but with data from only public groups
● analyse composition of data set in detail

Train:
● content-based filtering
● collaborative filtering
● hybrid

Evaluate the different systems on same data set

...and let's brainstorm!

Use Cases

2) Personalised Recommendations
● given a user's profile (e.g. interests)
● find articles of interest to them

Use Case 2: Perso Recommendations

Cross validation on user libraries (Bogers & van Den Bosch, 2009; Wang & Blei, 2011)

User studies (McNee, Kapoor, & Konstan, 2006; Parra-Santander & Brusilovsky, 2009)

7 highly relevant papers (perso recs for scientific articles)

Q1/4: How are the systems evaluated?

Use Case 2: Perso Recommendations

CiteULike libraries (Bogers & van Den Bosch, 2009; Parra-Santander & Brusilovsky, 2009; Wang & Blei, 2011)

Documents represent users, and their citations represent documents of interest (McNee et al., 2006)

User search history (Kapoor et al., 2007)

7 highly relevant papers (perso recs for scientific articles)

Q2/4: How are the systems trained?

Use Case 2: Perso Recommendations

CF (Parra-Santander & Brusilovsky, 2009; Wang & Blei, 2011)

LDA (Wang & Blei, 2011)

Hybrid of CF + LDA (Wang & Blei, 2011)

BM25 over tags to form user neighbourhood (Parra-Santander & Brusilovsky, 2009)

Item-based and content-based CF (Bogers & van Den Bosch, 2009)

User-based CF, Naïve Bayes classifier, Probabilistic Latent Semantic Indexing, textual TF-IDF-based algorithm (uses document abstracts) (McNee et al., 2006)

7 highly relevant papers (perso recs for scientific articles)

Q3/4: Which techniques are applied?

Use Case 2: Perso Recommendations

CF is much better than topic modelling (Wang & Blei, 2011)

A CF + topic-modelling hybrid slightly outperforms CF alone (Wang & Blei, 2011)

Content-based filtering performed slightly better than item-based filtering on a test set with 1,322 CiteULike users (Bogers & van Den Bosch, 2009)

User-based CF and tf-idf outperformed Naïve Bayes and Probabilistic Latent Semantic Indexing significantly (McNee et al., 2006)

BM25 gave better results than CF, but the study involved just 7 CiteULike users, so it was small scale (Parra-Santander & Brusilovsky, 2009)

7 highly relevant papers (perso recs for scientific articles)

Q4/4: Which techniques have most success?

Use Case 2: Perso Recommendations

7 highly relevant papers (perso recs for scientific articles)

Q4/4: Which techniques have most success?

Approach | Advantage | Disadvantage
Content-based | Human-readable form of the user's profile; quickly absorbs new content without need for ratings | Tends to over-specialise
CF | Works on an abstract item-user level so you don't need to 'understand' the content; tends to give more novel and creative recommendations | Requires a lot of data

Use Case 2: Perso Recommendations

Our progress so far...

Q1/2 How do we evaluate our system?

Construct an evaluation data set from user libraries
● 50,000 user libraries
● 10-fold cross validation
● libraries vary from 20-500 documents
● preference values are binary (in library = 1; 0 otherwise)

Train:
● item-based collaborative filtering recommender

Evaluate:
● train the recommender and test how well it can reconstruct the users' hidden testing libraries
● multiple similarity metrics (e.g. co-occurrence, log-likelihood)
(a minimal sketch of this set-up follows)
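To make this concrete, here is a minimal sketch of item-based CF over binary libraries using a log-likelihood ratio similarity in the spirit of Dunning's LLR; the libraries and helper names are toy examples, not Mendeley's implementation:

```python
from collections import defaultdict
from itertools import combinations
from math import log

def _x_log_x(x):
    return x * log(x) if x > 0 else 0.0

def _entropy(*counts):
    """Unnormalised entropy (N * H) of a list of counts, Dunning-style."""
    return _x_log_x(sum(counts)) - sum(_x_log_x(c) for c in counts)

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio for a 2x2 co-occurrence contingency table."""
    row = _entropy(k11 + k12, k21 + k22)
    col = _entropy(k11 + k21, k12 + k22)
    mat = _entropy(k11, k12, k21, k22)
    return max(0.0, 2.0 * (row + col - mat))

# Hypothetical binary libraries: user -> set of article ids ("in library" = 1).
libraries = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "d"},
    "u4": {"c", "d"},
}
n_users = len(libraries)

# Item frequencies and pairwise co-occurrence counts across libraries.
item_count = defaultdict(int)
pair_count = defaultdict(int)
for lib in libraries.values():
    for item in lib:
        item_count[item] += 1
    for a, b in combinations(sorted(lib), 2):
        pair_count[(a, b)] += 1

def similarity(a, b):
    """LLR similarity between two items from their co-occurrence statistics."""
    k11 = pair_count.get(tuple(sorted((a, b))), 0)
    if k11 == 0:
        return 0.0  # ignore items that never co-occur in any library
    k12 = item_count[a] - k11
    k21 = item_count[b] - k11
    k22 = n_users - k11 - k12 - k21
    return llr(k11, k12, k21, k22)

def recommend(user, k=10):
    """Score unseen items by summed similarity to items already in the library."""
    owned = libraries[user]
    scores = defaultdict(float)
    for candidate in item_count:
        if candidate in owned:
            continue
        for item in owned:
            scores[candidate] += similarity(candidate, item)
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("u2"))  # ranked articles not yet in u2's library
```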

Use Case 2: Perso Recommendations

Our progress so far...

Q2/2 What are our results?

Cross validation:
● 0.1 precision @ 10 articles

Usage logs:
● 0.4 precision @ 10 articles

Use Case 2: Perso Recommendations

Our progress so far...

Q2/2 What are our results?


[Chart: Precision at 10 articles vs. number of articles in user library]

Use Case 2: Perso Recommendations

Future directions...?

Evaluate multiple techniques on same data set

Construct data set
● similar to current one but with more up-to-date data
● analyse composition of data set in detail

Train:
● content-based filtering
● collaborative filtering (user-based and item-based)
● hybrid

Evaluate the different systems on same data set

...and let's brainstorm!

➔ 2 recommendation use cases

➔ similar problems and techniques

➔ good results so far

➔ combining CF with content would likely improve both

Conclusion

www.mendeley.com

References

Beel, J., & Gipp, B. (2010). Link Analysis in Mind Maps: A New Approach to Determining Document Relatedness. Retrieved from http://scholar.google.com/scholar?hl=en&btnG=Search&q=intitle:Link+Analysis+in+Mind+Maps+:+A+New+Approach+to+Determining+Document+Relatedness#0

Bogers, T., & van Den Bosch, A. (2009). Collaborative and Content-based Filtering for Item Recommendation on Social Bookmarking Websites. ACM RecSys ’09 Workshop on Recommender Systems and the Social Web. New York, USA. Retrieved from http://ceur-ws.org/Vol-532/paper2.pdf

Gipp, B., Beel, J., & Hentschel, C. (2009). Scienstein: A research paper recommender system. Proceedings of the International Conference on Emerging Trends in Computing (ICETiC’09) (pp. 309-315). Retrieved from http://www.sciplore.org/publications/2009-Scienstein_-_A_Research_Paper_Recommender_System.pdf

Kapoor, N., Chen, J., Butler, J. T., Fouty, G. C., Stemper, J. A., Riedl, J., & Konstan, J. A. (2007). TechLens: a researcher’s desktop. Proceedings of the 2007 ACM conference on Recommender systems (pp. 183-184). ACM. doi:10.1145/1297231.1297268

Lin, J., & Wilbur, W. J. (2007). PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 8(1), 423. BioMed Central. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/17971238

McNee, S. M., Kapoor, N., & Konstan, J. A. (2006). Don’t look stupid: avoiding pitfalls when recommending research papers. Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work (p. 180). ACM. Retrieved from http://portal.acm.org/citation.cfm?id=1180875.1180903

Parra-Santander, D., & Brusilovsky, P. (2009). Evaluation of Collaborative Filtering Algorithms for Recommending Articles. Web 3.0: Merging Semantic Web and Social Web at HyperText ’09 (pp. 3-6). Torino, Italy. Retrieved from http://ceur-ws.org/Vol-467/paper5.pdf

Pohl, S., Radlinski, F., & Joachims, T. (2007). Recommending related papers based on digital library access records. Proceedings of the 7th ACM/IEEE-CS joint conference on Digital libraries (pp. 418-419). ACM. Retrieved from http://portal.acm.org/citation.cfm?id=1255175.1255260

Vellino, A. (2009). The Effect of PageRank on the Collaborative Filtering Recommendation of Journal Articles. Retrieved from http://cuvier.cisti.nrc.ca/~vellino/documents/PageRankRecommender-Vellino2008.pdf

Wang, C., & Blei, D. M. (2011). Collaborative topic modeling for recommending scientific articles. Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 448-456). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=2020480