Content Mining for Machines and Humans
Content-Mining for Machines and Humans
Peter Murray-Rust
contentmine.org
Wellcome Trust, London, 2015-03-06
• Extract 100 million facts (CC0) from the scientific literature per year
• Grow communities and give everyone the tools and know-how to mine science
What is “Content”?
http://www.plosone.org/article/fetchObject.action?uri=info:doi/10.1371/journal.pone.0111303&representation=PDF CC-BY
SECTIONS
MAPS
TABLES
CHEMISTRY
TEXT
MATH
contentmine.org tackles these
Machine-Human symbioses
• Wikipedia
• OpenStreetMap
We aim to make it trivial for a human+machine to mine the scientific literature, by building communities.
Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
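The "start simple" methods can be sketched in a few lines. This is a minimal illustration, not AMI code: the tokenizer pattern, the suffix list, and the function names `naive_stem` and `bag_of_words` are all assumptions made for the example, and the stemmer is a crude suffix stripper rather than a real stemming algorithm.

```python
import re
from collections import Counter

# Assumed tokenizer: runs of ASCII letters only (a simplification).
TOKEN = re.compile(r"[A-Za-z]+")

# Assumed suffix list for a deliberately crude stemmer.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Count stemmed, lower-cased tokens in a piece of text."""
    return Counter(naive_stem(token.lower()) for token in TOKEN.findall(text))


if __name__ == "__main__":
    bag = bag_of_words("Mining mines mined facts")
    print(bag.most_common())
```

A bag built this way can be compared across papers or sections before moving on to more specific regexes and templates.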
Oxford 2013
Berlin 2014
Delhi 2014
Jenny Molloy with mascot AMI
• CRAWL the web for scientific documents (articles, grey literature, repositories)
• quickSCRAPE pages (text, graphics, images, data)
• NORMA-lize page to semantic form
… Open semantic science …
• MINE pages with your methods and tools (AMI)
• CAT-alogue results in searchable index
• Automate daily process (CANARY)
contentmine.org Infrastructure
[Infrastructure diagram. Recoverable labels: Scientific literature, Repositories; URL, DOI; Crawl, Feed; quickscrape (per-journal XPath scrapers); Norma (per-journal taggers; XML, DOC, CSV, sHTML, bad HTML, OCR, diagrams); AMI plugins (Regex, Sequences, Species, Metadata, Chemistry, Phylogenetics, Farming, Bespoke); Index & Transform]
Open NORMA-lized Scientific Literature + Facts
CANARY pipeline
CAT-alogue index
https://commons.wikimedia.org/wiki/File:Flickr_-_DVIDSHUB_-_RSP_Warrior_Challenge_Prepares_Soldiers_Mentally,_Physically_%281%29.jpg
CRAWLing the Literature
NO Central Table of Contents
Massive technical, political, legal opposition
Little interest from Academia
Tedious
Few general tools
The Right to Read is The Right To Mine
PMR in 2012: http://blog.okfn.org/2012/06/01/the-right-to-read-is-the-right-to-mine/
SCRAPE
https://en.wikipedia.org/wiki/Gleaning#mediaviewer/File:Millet_Gleaners.jpg PublicDomain
HTML
XML quickscrape*
*Scrapers created by Richard Smith-Unna + Community
HTML, PDF, XML, PNG, SVG, CSV, DOC, LaTeX, CIF, …
Non-standard per-publisher site
https://en.wikipedia.org/wiki/W._Heath_Robinson#mediaviewer/File:Robinson%28WH%29-%28%27Uncle_Lubin%27%29.jpg PublicDomain
NORMA-lization of Scientific Literature
PDFs, broken HTML, PNGs for math, etc.
NORMA
Unicode, diacritics, well-formed, sectioned, tagged, SVG diagrams
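Three of these normalization targets (Unicode, diacritics, well-formedness) can be illustrated with the Python standard library. This is only a sketch of the idea, not the NORMA implementation; the function names are hypothetical.

```python
import unicodedata
import xml.etree.ElementTree as ET


def normalize_text(text):
    """Compose characters (NFC), so 'e' + combining acute becomes one 'é'."""
    return unicodedata.normalize("NFC", text)


def strip_diacritics(text):
    """Decompose (NFD) and drop combining marks, e.g. for tolerant matching."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    print(normalize_text("e\u0301"))        # composed form of e + acute accent
    print(strip_diacritics("Gra\u017eulis"))
    print(is_well_formed("<p><b>bad</p>"))
```

Scraped publisher HTML often fails the last check, which is why a normalization step sits between scraping and mining.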
AMI-plugins
• BagOfWords, Stemming and Regular Expressions
• Species
• Biological Sequences
• Chemical compounds & reactions
• Farming * (Rory Aaronson)
• Crystallography * (Saulius Grazulis, COD)
• Clinical Trials * (Amy Price)
• Phylogenetics * (Ross Mounce)
• Phytochemistry * (Chris Steinbeck, PMR)
* subcommunities
http://chemicaltagger.ch.cam.ac.uk/
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have processed 500,000 patents. There are more than 3,000,000 reactions per year. Added value > 1 billion EUR.
UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!! 2000+ points
VECTOR PDF
Dumb PDF
CSV
Semantic Spectrum
2nd Derivative
Smoothing Gaussian Filter
Automatic extraction
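The two operations named on this slide, Gaussian smoothing and the second derivative, are a common way to locate peaks in an extracted spectrum. The sketch below shows one such approach in plain Python. It is an assumption about how the steps could fit together, not the code behind the slide; the kernel width, edge handling, and function names are illustrative choices.

```python
import math


def gaussian_kernel(sigma, radius):
    """Return a normalized 1-D Gaussian kernel of length 2*radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth(values, sigma=1.0, radius=3):
    """Gaussian-smooth a sequence, clamping indices at both edges."""
    kernel = gaussian_kernel(sigma, radius)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * values[j]
        out.append(acc)
    return out


def second_derivative(values):
    """Central finite difference with unit spacing; endpoints are set to 0."""
    n = len(values)
    d2 = [0.0] * n
    for i in range(1, n - 1):
        d2[i] = values[i - 1] - 2.0 * values[i] + values[i + 1]
    return d2


def find_peaks(values, sigma=1.0, radius=3):
    """Indices where the smoothed second derivative has a negative local minimum."""
    d2 = second_derivative(smooth(values, sigma, radius))
    return [i for i in range(1, len(d2) - 1)
            if d2[i] < 0 and d2[i] < d2[i - 1] and d2[i] < d2[i + 1]]


if __name__ == "__main__":
    # Synthetic single-peak "spectrum" centred on index 10.
    signal = [math.exp(-((i - 10) ** 2) / 8.0) for i in range(21)]
    print(find_peaks(signal))
```

Smoothing first matters because differentiating twice amplifies noise in the digitized points.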
https://blogs.ch.cam.ac.uk/pmr/2014/06/25/content-mining-we-can-now-mine-images-of-phylogenetic-trees-and-more/ for story of extraction
Thinning Topology
Serialization
Newick
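Once the topology of a tree has been recovered from the image, serializing it to Newick is a short recursive walk. The sketch below assumes a hypothetical `Node` class with a name, an optional branch length, and children; it does not quote labels containing Newick punctuation, and it only replaces spaces with underscores.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: label, optional branch length, child nodes."""
    name: str = ""
    length: Optional[float] = None
    children: List["Node"] = field(default_factory=list)


def _subtree(node):
    """Serialize one node and everything below it (no trailing semicolon)."""
    text = ""
    if node.children:
        text = "(" + ",".join(_subtree(child) for child in node.children) + ")"
    # Unquoted Newick labels conventionally use underscores for blanks.
    text += node.name.replace(" ", "_")
    if node.length is not None:
        text += ":" + format(node.length, "g")
    return text


def to_newick(root):
    """Return the Newick string for the tree rooted at `root`."""
    return _subtree(root) + ";"


if __name__ == "__main__":
    tree = Node(children=[
        Node("A", 0.1),
        Node(children=[Node("B", 0.2), Node("C", 0.3)], length=0.4),
    ])
    print(to_newick(tree))
```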
https://en.wikipedia.org/wiki/Irrigation#mediaviewer/File:Pump-enabled_Riverside_Irrigation_in_Comilla,_Bangladesh,_25_April_2014.jpg CC BY-SA 3.0
Daily Stream of 100,000 Open Facts
Twitter?
Indexed by CAT
Phytochemistry extraction
O. dayi
“volatile composition of”
A. sibeiri
A. judaica
Displayed by CAT (CottageLabs)
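A phrase-anchored regex is one simple way to pull species like these out of running text. The sketch below is an assumption about how such a pattern might look, not the AMI phytochemistry plugin; the pattern and the function name `extract_species` are illustrative, and it only handles abbreviated binomials such as "A. judaica".

```python
import re

# Assumed pattern: the trigger phrase followed by an abbreviated binomial,
# i.e. a capital initial, a dot, optional whitespace, a lower-case epithet.
PATTERN = re.compile(r"volatile composition of\s+([A-Z])\.\s?([a-z]+)")


def extract_species(text):
    """Return species that follow the trigger phrase, in the form 'A. judaica'."""
    return [f"{m.group(1)}. {m.group(2)}" for m in PATTERN.finditer(text)]


if __name__ == "__main__":
    sentence = ("The volatile composition of A. judaica was compared "
                "with the volatile composition of O.dayi.")
    print(extract_species(sentence))
```

Each match could then be emitted as a small fact (species plus the matched phrase) for indexing.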
Workshops (1-hour -> full day or more)
2014, May to November
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• Sci DataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC, US
• JISC, London
Upcoming
• LIBER
• Cochrane
• BL
• Wellcome Trust (April)
• WHO
Collaborators
• Wikimedia/Wikidata
• Mozilla
• Open Knowledge
• LIBER (European Research Libraries)
• British Library
• Wellcome Trust
• EBI (Eur. Bioinf. Inst.)
• JISC
• Open Access Button
• SPARC
• Creative Commons
• CORE
contentmine.org proposed Services
• Workshops
• Repository indexing
• Funder Compliance
• Publication enhancement
• Extraction of scientific data
contentmine.org team
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type Culture Collection
https://en.wikipedia.org/wiki/Track_gauge#mediaviewer/File:IndianGauges.JPG CC-BY
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific literature
Science plugins
Science volunteers
Collaboration with Open Access Button
Daily stream of 300,000 facts
https://commons.wikimedia.org/wiki/File:Rapid_stream.jpg Public Domain
https://en.wikipedia.org/wiki/The_Cat_and_the_Canary_%281927_film%29#mediaviewer/File:Thecatandthecanary-windowcard-1927.jpg Public domain
CAT and CANARY
AMI Demo
http://www.mdpi.com/2218-1989/2/1/39/pdf
https://bitbucket.org/AndyHowlett/ami2-poc
ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor
May take time to start if not connected to web
Output:./target/output/reactionsexample/
SVG: ./page1annotated.svg
CML: image.g.1.4.svg.reaction0.cml
AvogadroViewer: