ContentMine Architecture
Uploaded by: petermurrayrust
Category: Software
Transcript of ContentMine Architecture
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues, Repos
Scientific literature
Science Plugins
Science Volunteers
Collaboration with Open Access Button
quickscrape, Crawl, Feed
Norma Index & Transform
XML
URL
DOI
Scientific literature
Repositories, DOC
CSV
sHTML
Plugins
Regex
Sequences, Species
Bespoke
Scrapers
XPath Per-Journal
Taggers Per-Journal
Metadata, Chemistry
Phylogenetics Farming
AMI
Bad HTML
OCR
Diagrams
Open NORMA-lized Scientific Literature + Facts
CANARY pipeline
CAT-alogue index
Starting points
• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CMDir(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG): good
• PDF, XML, HTML -> Norma -> CMDir(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR -> CMDir(sHTML, TXT, SVG): variable
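The two starting points can be sketched as a small routing function. This is only an illustration of the slide, not ContentMine code: the function names, the returned dictionary, and the rules used to recognise a PMCID, DOI, URL or local file are all assumptions made for the example.

```python
from pathlib import PurePath

# Tool chains as listed on the slide for each starting point.
SCRAPE_ROUTE = ("quickscrape", "Norma")   # identifier -> CMDir -> normalised CMDir
FILE_ROUTE = ("Norma", "NormaOCR")        # local file -> CMDir -> normalised CMDir


def classify(start):
    """Guess what kind of starting point a string is (hypothetical rules)."""
    s = start.strip()
    if s.lower().startswith(("http://", "https://")):
        return "URL"
    if s.upper().startswith("PMC") and s[3:].isdigit():
        return "PMCID"
    if s.startswith("10.") and "/" in s:
        return "DOI"
    if PurePath(s).suffix.lower() in {".pdf", ".xml", ".html"}:
        return "file"
    raise ValueError(f"unrecognised starting point: {start!r}")


def route_for(start):
    """Return the tool chain and the quality note the slide attaches to it."""
    kind = classify(start)
    if kind == "file":
        return {"kind": kind, "tools": FILE_ROUTE, "quality": "variable"}
    return {"kind": kind, "tools": SCRAPE_ROUTE, "quality": "good"}


if __name__ == "__main__":
    for start in ("PMC3113902", "10.1371/journal.pone.0000001", "paper.pdf"):
        print(start, "->", route_for(start))
```

Identifiers go through quickscrape first, so the CMDir arrives with metadata; loose files skip that step, which is consistent with the slide's "meta?" and its lower quality rating for that route.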
Conversions
• Paper -> Scanned -> TIFF (avoid)
• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable
• PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish
• PDF -> PDF2TXT-N -> TXT: fast, variable
• PDF -> PDF2Image-N -> PNG: fast, accurate
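The conversion list reads naturally as a lookup table. The sketch below encodes it as data, assuming each speed/accuracy note belongs to the conversion it accompanies on the slide; the `Converter` class and `converters_for` function are hypothetical names for this example and do not reflect how Norma dispatches converters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Converter:
    name: str
    inputs: frozenset
    outputs: frozenset
    speed: str
    accuracy: str


# Rows transcribed from the slide; the "-N" suffix is kept as written there.
CONVERTERS = [
    Converter("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
              frozenset({"HTML", "SVG"}), "fast", "variable"),
    Converter("PDF2SVG-N", frozenset({"PDF"}),
              frozenset({"sHTML", "SVG", "images/"}), "slow", "accurate-ish"),
    Converter("PDF2TXT-N", frozenset({"PDF"}),
              frozenset({"TXT"}), "fast", "variable"),
    Converter("PDF2Image-N", frozenset({"PDF"}),
              frozenset({"PNG"}), "fast", "accurate"),
]


def converters_for(source, target):
    """List the converters that accept `source` and can produce `target`."""
    return [c.name for c in CONVERTERS
            if source in c.inputs and target in c.outputs]


if __name__ == "__main__":
    print(converters_for("PDF", "SVG"))
    print(converters_for("TIFF", "TXT"))
```

Keeping speed and accuracy next to each converter makes the trade-off explicit when more than one route reaches the same target format, as happens for PDF to SVG.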
Raw HTML: Not well-formed, Bad character semantics
ScholarlyHTML
Well-formed XHTML
PNG
Tagged Sections
Captioned Figures
Tables
Captioned Tables
XML, HtmlTidy, Jsoup, HtmlUnit
XSLT1/2
XSLT1/2
NORMA
Per-journal Stylesheets
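The "not well-formed" problem with raw HTML can be demonstrated with a small check. This sketch only tests well-formedness using Python's standard XML parser; it is not the Norma implementation, which according to the slide relies on tools such as HtmlTidy, Jsoup and HtmlUnit to repair the markup before XSLT is applied. The sample strings are invented.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup):
    """Return True if `markup` parses as XML, which XSLT processing requires."""
    try:
        ET.fromstring(markup)
    except ET.ParseError:
        return False
    return True


# Typical raw HTML: an unclosed <br> and a missing </p>.
RAW_HTML = "<html><body><p>Fig. 1<br>caption</body></html>"
# The same content as well-formed XHTML.
XHTML = "<html><body><p>Fig. 1<br/>caption</p></body></html>"
# HTML named entities such as &nbsp; are undefined in plain XML.
ENTITY = "<p>a&nbsp;b</p>"

if __name__ == "__main__":
    for label, markup in (("raw", RAW_HTML), ("xhtml", XHTML), ("entity", ENTITY)):
        print(label, is_well_formed(markup))
```

Only markup that passes a check like this can be handed to an XSLT stylesheet, which is why the diagram places a tidying stage between raw HTML and the per-journal stylesheets.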
End points
• Norma -> CMDir(OpenSHTML-SVG)
• Norma -> CMDir(sHTML sections) -> AMI -> all text + species, chemistry, sequences
• Norma -> CMDir(TXT (unsectioned)) -> AMI -> bagOfWords, regex
• Norma -> CMDir(PNG) -> AMI -> phylo, bar/xy-plots
• Norma -> CMDir(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
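The unsectioned-text end point (bagOfWords, regex) is simple enough to sketch. The code below is a generic bag-of-words count and a regex search, not AMI's plugins; the function names and the nucleotide pattern (runs of at least 12 A/C/G/T letters) are assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical pattern: an uninterrupted run of 12 or more nucleotide letters.
SEQUENCE = re.compile(r"\b[ACGT]{12,}\b")


def bag_of_words(text):
    """Count lower-cased alphabetic tokens, ignoring order and position."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def find_sequences(text):
    """Return every substring matching the nucleotide pattern."""
    return SEQUENCE.findall(text)


if __name__ == "__main__":
    txt = "Primer ACGTACGTACGTAA was used; the primer bound the gene."
    print(bag_of_words(txt).most_common(3))
    print(find_sequences(txt))
```

Both functions need nothing more than plain text, which is why this end point works even when Norma cannot recover sections or structure from the source.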
PDF: Non-Unicode, Pixel glyphs, No words, No structures
ScholarlyHTML
SVG
High-level graphics
PDF2SVG
characters
Sentences, Paras, tables
PNG OCR
Tagged Sections
SVGBuilder
Captioned Figures
NORMA
XSLT1/2