Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of...

54
Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego [email protected] http://www.sdsc.edu/pb/Talks/ 1 ICMLA 2008

Transcript of Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of...

Page 1: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Machine Learning in the New World of Scholarly

Communication

Philip E. BourneUniversity of California San Diego

[email protected]://www.sdsc.edu/pb/Talks/

1ICMLA 2008

Page 2: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Disclaimer

I am not an expert in machine learning, but have applied SVMs to biological systems on occasion

2Protein Motions: Gu et al. PLoS Comp. Biol., 2006 2(7) e90P-P Interfaces: Chung et al. Proteins 2006 62(3) 630-640

Page 3: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

So Why am I Here?

There are events happening in what I broadly refer to as “scholarly communication” which I

believe offer new opportunities for those interested in machine learning

What are those opportunities and how can they be exploited?

3

Page 4: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

What If…

• What if … negative data was as easily obtainable as positive data

• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge on a scale not previously possible

• What if … that knowledge included rich media

• What if … the value of that knowledge could be weighted according to the authority of the source

4

Page 5: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Some big “Ifs”

But.. Would offer a much richer medium to learn from ..

Take home message – parts of this medium are already here and those

of us generating that medium are keen to collaborate

Page 6: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Some big “Ifs”

Lets take a step back and see where we are today

Page 7: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Today’s Research Cycle

Research[Grants]

JournalArticle

ConferencePaper

PosterSession

Feds

Societies

Publishers

Reviews

BlogsCommunity Service/Data

Page 8: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Tomorrows Research Cycle

• The relationship between scientist and publisher is quite different

• The publisher is a warehouse for the workflow of scientific endeavor not just a repository for the end product

8

Page 9: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Tomorrows Research Cycle: Evidence

• Publishers hubs:– Elsevier portals– PLoS collections

• Open Access/open review e.g. Biology Direct

• NIH Roadmap requires data be accessible• New Resources:

– www.researchgate.net– MetaLab (Borya Shakhnovich)

Page 10: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

What If…

• What if … negative data was as easily obtainable as positive data

• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

• What if … that knowledge included rich media

• What if … the value of that knowledge could be weighted according to the authority of the source

10

Page 11: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Example: The Protein Structure Initiative The X-ray Crystallography Pipeline

What if … negative data was as easily obtainable as positive data

Basic Steps

Target Selection

Crystallomics• Isolation,• Expression,• Purification,• Crystallization

DataCollection

StructureSolution

StructureRefinement

Functional Annotation Publish

Remains more of an Art than a Science

http://kb.psi-structuralgenomics.org/

Page 12: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Positive and Negative Data are Required by the NIH to be deposited immediately

• Data are described by an ontology

• Perhaps some underlying principles can be learnt, particularly as the amount of data is increasing rapidly

http://pepcdb.pdb.org/PepcDB/documentation/pepcDB-v9.3.jpg

Page 13: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

What If…

• What if … negative data was as easily obtainable as positive data

• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

• What if … that knowledge included rich media

• What if … the value of that knowledge could be weighted according to the authority of the source

13

Page 14: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

A First Step is to Have Open and Usable Access to the Scientific

Literature..

We are making steps in that direction

Page 15: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

15

NIH Public Access Policy“The research supported by the National Institutes of Health (NIH) is essential to improving human health. Public access to this research is vital – today and for generations to come.”From a letter from NIH Director Zerhouni to grantees, February 3rd, 2005

Page 16: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

16

More and more authors care about improving access to their papers…

“Faced with the option of submitting to an open-access or closed-access journal, we now wonder whether it is ethical for us to opt for closed access on the grounds of impact factor or preferred specialist audience.”

-- Costello and Osrin in The Lancet

Page 17: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

17

Where are we Today?

• NIH and other government funders have mandated open access

• Full text increasingly on-line and potentially usable

• Traditional publishers have used the internet as a distribution medium, but the power of the medium has yet to be realized

• Data increasingly on-line but not integrated with the publication derived from it

Page 18: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

The Growth of Open Access Literature

18

Page 19: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Open Access(Creative Commons License)

1. All published materials available on-line free to all (author pays model)

2. Unrestricted access to all published material in various formats eg XML provided attribution is given to the original author(s)

3. Copyright remains with the author

19

Page 20: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Open Access(Creative Commons License)

1. All published materials available on-line free to all (author pays model)

2. Unrestricted access to all published material in various formats eg XML provided attribution is given to the original author(s)

3. Copyright remains with the author The catalyst

PLoS Comp Biol 2008 4(3) e100003720

Page 21: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Community Reaction?

Most scientists have no idea that this implies that anyone can take their material and enhance it e.g., via

mashup and effectively republish it

21

What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

Page 22: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Consider an Example

What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

Fink & Bourne 2007 CT Watch 3(3) 26-31

Page 23: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Database and Journal Integration- The Test Bed

http://www.pdb.org/

Journals

Database

23

http://www.pubmedcentral.nih.gov

Page 24: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

The Protein Data Bank

• Paper not published unless data are deposited – strong data to literature correspondence

• Highly structured data conforming to an extensive ontology

• DOI’s assigned to every structure – http://www.doi.org

http://www.pdb.org24

Page 25: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Seamless Integration between Data and the Literature – What

Does That Imply?

• Improving semantic consistency in the literature – best done at the point of authoring

• Post processing to establish semantic content

• New forms of visualization and interaction at the presentation layer

25

Page 26: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Seamless Integration between Data and the Literature – What

Does That Imply?

• Improving semantic consistency in the literature – best done at the point of authoring

• Post processing to establish semantic content

• New forms of visualization and interaction at the presentation layer

26

Page 27: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

1. A link brings up figures from the paper

0. Full text of PLoS papers stored in a database

2. Clicking the paper figure retrievesdata from the PDB which is

analyzed

3. A composite view ofjournal and database

content results

BioLit: Tools for New Modes of Scientific Dissemination

• Biolit integrates biological literature and biological databases and includes:– A database of journal

text– Authoring tools to

facilitate database storage of journal text

– Tools to make static tables and figures interactive

4. The composite view haslinks to pertinent blocks

of literature text and back to the PDB

1.

2.

3.

4.

The Knowledge and Data Cycle

http://biolit.ucsd.edu27

Page 28: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

http://biolit.ucsd.eduNucleic Acids Research 2008 36(S2) W385-389

PSP Washington DC Feb. 2008 28

Page 29: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

29

Page 30: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

30

Page 31: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

31

Page 32: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

ICTP Trieste, December 10, 2007 32

Page 33: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

33

What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

Immunology Literature

Cardiac DiseaseLiterature

Page 34: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Semantic Consistency is Best Done at the Point of Authoring

What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

Page 35: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Author Paper

Word File in Docx formatPublisher

BioLit Plugin Project

35

Page 36: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

BioLit Plugin Project

• Leverages Office Open XML used in Microsoft Office 2007

• Custom schema attached to document and used to automatically XML tag ontology terms and database identifiers within a research paper

• Ontology tagging assists publication of scientific research by aiding efficient and accurate automated categorization and promotion of information dissemination

• Conversion of manuscript to NLM DTD for direct submission to publisher

Automated Ontology & ID Tagging within Microsoft Word Documents

36

Page 37: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

BioLit Plugin ProjectRather than Post-processing the Document the

Author Controls the Semantic Tagging

37

Page 38: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Plugin Architecture

38

Page 39: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Ontologies are Stored in a Local Database

39

Page 40: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

User Configurable Selection

• Fully user configuration ontology and database identifier selection

• All searches occur within the user’s desktop computer

• Desired ontologies are downloaded and installed automatically, and update periodically

• BioLit installer XML file provides the application with the information needed to download and install ontologies.

40

Page 41: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

What If…

• What if … negative data was as easily obtainable as positive data

• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

• What if … that knowledge included rich media

• What if … the value of that knowledge could be weighted according to the authority of the source

41

Page 42: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

What Do We Mean by Rich Media?

Non traditional ways of conveying scientific data and knowledge ..

Video, podcasts, postercasts, blogs…

What if … that knowledge included rich media

Page 43: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

YouTube for Scientists www.scivee.tv

43What if … that knowledge included rich media

Page 44: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Motivation

44What if … that knowledge included rich media

Page 45: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Pubcast – Video Integrated with the Full Text of the Paper

45

Page 46: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Pubcast

With voice to text conversion, presentation materials etc. new knowledge is available to supplement already existing knowledge from

the paper

46What if … that knowledge included rich media

Page 47: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Postercasts

What if … that knowledge included rich media

Again additional knowledge can be used which until now has not been captured

Page 48: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

What If…

• What if … negative data was as easily obtainable as positive data

• What if … the source of learning was expanded dramatically from noisy data to include automatically captured human knowledge

• What if … that knowledge included rich media

• What if … the value of that knowledge could be weighted according to the authority of the source

48

Page 49: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

First You Have to Identify the Source

What if … the value of that knowledge could be weighted according to the authority of the source

http://openid.nethttp://www.researcherid.com

Page 50: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

How Do we Weight the Various Knowledge Sources?

• Peer reviewed literature

• Reviews (papers, grants, proceedings)

• Blog postings• Database entries

What if … the value of that knowledge could be weighted according to the authority of the source

Page 51: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

How Do we Weight the Various Knowledge Sources?

• A token system• Tokens can be

authenticated by any user of that content

• Page ranking• ??

What if … the value of that knowledge could be weighted according to the authority of the source

PLoS Comp Biol Editorial this month

Page 52: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

In Conclusion

• Scholarly communication is in a state of rapid change

• Content easily available for machine learning is expanding and includes new content types

• New opportunities are here already

Page 53: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Acknowledgements

• SciVee Team– Apryl Bailey– Tim Beck

– Leo Chalupa

– Marc Friedman– Alex Ramos– Willy Suwanto

• BioLit Team• J. Lynn Fink• Sergey Kushch• Marco Martinez• Greg Quinn• Parker Williams

CT Watch 2007, 3(3) 26-31

53

Page 54: Machine Learning in the New World of Scholarly Communication Philip E. Bourne University of California San Diego pbourne@ucsd.edu

Questions?

[email protected]://www.sdsc.edu/pb/Talks/

54