TheContentMine: Mining for Everyone


Page 1: TheContentMine: Mining for Everyone

TheContentMine: Mining for Everyone

Peter Murray-Rust

BL_Labs, London, 2014-11-27

Page 2: TheContentMine: Mining for Everyone

The Right to Read is the Right to Mine

http://contentmine.org

Page 3: TheContentMine: Mining for Everyone

ContentMine

• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
  – Bioscience (EuropePMC, BBSRC)
  – Disease Ebola
  – Astrophysics (Stray Toaster)
  – Chemistry (TSB, EBI, PennState - Citeseer)

• We fight for Justice and Freedom

Page 4: TheContentMine: Mining for Everyone

ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN

Page 5: TheContentMine: Mining for Everyone

ContentMine Workshops (1-hour -> full day or more)

2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC, US

Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO

Page 6: TheContentMine: Mining for Everyone

Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin

Page 7: TheContentMine: Mining for Everyone

Regular Expressions (Easier than Crosswords or Sudoku)

Text to find                        Regex                    Notes
Ebola                               Ebola
Mali (not Malicious)                Mali\W                   (end of word)
Bat or bat                          [Bb]at                   (alternatives)
bat or bats                         bats?                    (optional letter)
Bat or Bats or bat or bats          [Bb]ats?
Sudden onset                        [Ss]udden\s+onset        (space/s)
Panthera leo or Gorilla gorilla     [A-Z][a-z]+\s+[a-z]+     (ranges of letters)
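
The patterns on this slide can be tried directly in Python's standard re module. The following is a minimal sketch: the patterns are copied from the slide, while the sample sentences and the names patterns and find_all are invented here for illustration.

```python
import re

# Patterns copied from the slide; the keys are labels chosen for this sketch.
patterns = {
    "mali": r"Mali\W",                   # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                  # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset",
    "binomial": r"[A-Z][a-z]+\s+[a-z]+", # e.g. Panthera leo
}

def find_all(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(patterns[name], text)

# Sample sentences are invented for illustration.
mali_hits = find_all("mali", "Cases in Mali, not Malicious rumours.")
bat_hits = find_all("bat", "Bats roost here; one bat was sampled.")
onset_hits = find_all("sudden_onset", "Sudden  onset of fever")
species_hits = find_all("binomial", "Panthera leo and Gorilla gorilla")

print(mali_hits, bat_hits, onset_hits, species_hits)
```

Two things are worth noticing when experimenting. Mali\W includes the trailing punctuation in the match ("Mali," here), and [Bb]ats? on its own would also match inside longer words such as "combat", which is presumably why the Ebola regex file wraps it in \W on both sides.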

Page 8: TheContentMine: Mining for Everyone

Ebola regex

<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>


15 mins to create, 15 mins to install and test
Or run online at CottageLabs
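
To give a feel for how such a regex file could be applied, here is a minimal Python sketch. It is not the AMI implementation: it only assumes that each regex element carries a pattern, a fields name and a weight, and it simply passes the weight through because the slides do not say how weights are used. The names load_rules and search_lines are invented for this sketch; the first two sample lines are shortened from the report text quoted on the results slide and the third is made up.

```python
import re
import xml.etree.ElementTree as ET

# A subset of the slide's compoundRegex, reproduced verbatim.
COMPOUND_REGEX = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
</compoundRegex>
"""

def load_rules(xml_text):
    """Return (field, weight, compiled pattern) for each <regex> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.iter("regex")
    ]

def search_lines(rules, lines):
    """Yield (line number, field, weight, captured text) for every hit."""
    for number, line in enumerate(lines, start=1):
        for field, weight, pattern in rules:
            for match in pattern.finditer(line):
                # group(1) is the first parenthesised capture in the pattern
                yield number, field, weight, match.group(1)

rules = load_rules(COMPOUND_REGEX)
sample = [
    "There have been 14 413 reported Ebola cases in eight countries",
    "Case incidence continues to increase in Sierra Leone",
    "A case was reported in Mali.",  # invented line
]
hits = list(search_lines(rules, sample))
for hit in hits:
    print(hit)
```

Keeping the patterns in a data file rather than in code is what makes the "15 mins to create" claim plausible: a domain expert edits the XML, and the program that applies it stays unchanged.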

Page 9: TheContentMine: Mining for Everyone

Results of Regex on Ebola

<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
            name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
             lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
             lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
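
Output of this shape can be read back with Python's xml.etree.ElementTree. The sketch below is an illustration only: the slide's fragment is cut off, so the closing tags have been added here as an assumption, and only the second result is included. The function name read_hits is invented. The point to watch is the mix of namespaces: the outer regex element is in the ami namespace, while the elements marked xmlns="" are in no namespace.

```python
import xml.etree.ElementTree as ET

AMI = "{http://www.xml-cml.org/ami}"  # namespace URI taken from the slide

# One result from the slide; closing tags added here to make it well-formed.
RESULTS = r"""
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
             lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
  </results>
</resultsList>
"""

def read_hits(xml_text):
    """Return (line number, pattern, field, matched text) for every hit."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # The outer <regex> is namespaced; its children are not.
    for outer in root.iter(AMI + "regex"):
        line_number = int(outer.get("lineNumber"))
        pattern = outer.findtext("regex/pattern")
        for hit in outer.iter("hit"):
            for field, value in hit.attrib.items():
                rows.append((line_number, pattern, field, value))
    return rows

rows = read_hits(RESULTS)
print(rows)
```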

Page 10: TheContentMine: Mining for Everyone

Demo of Content Mining

ChemicalTagger (Lezan Hawizy) a shallow, domain-specific, semantic parser for un/natural language.

Page 11: TheContentMine: Mining for Everyone

Bacterial WP_phylogenetic tree

Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy

Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)

WP: Clostridium_butyricum

Genbank ID

American Type Culture Collection

Page 12: TheContentMine: Mining for Everyone

RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs

Queues

Repos

Scientific literature

Science Plugins

Science Volunteers

Collaboration with Open Access Button