Architecture of ContentMine Components contentmine.org
![Page 1: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/1.jpg)
Architecture of TheContentMine
These slides are for enlightenment and presentations. Use http://discuss.contentmine.org/t/overall-architecture/142 for up-to-date info. Questions, comments and critiques welcome! All software is Open (BSD/Apache2).
Some diagrams are autogenerated from *.dot files, which are located in the projects (mainly Norma and AMI).
![Page 2: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/2.jpg)
[Diagram: overall ContentMine architecture. Labels in extraction order:]
catalogue; getpapers; query; DailyCrawl; EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services
PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS, URLs, DOIs
crawl; quickscrape; norma (Normalizer, Structurer, Semantic Tagger); Text, Data, Figures; ami; UNIV Repos; search; Lookup
CONTENT MINING; Chem; Phylo; Trials; Crystal; Plants; COMMUNITY; plugins; Visualization and Analysis
Publisher Sites: PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…
scrapers; queries; taggers; abstract; methods; references; Captioned Figures; Fig. 1; HTML tables
30,000 pages/day; Semantic ScholarlyHTML; Facts
Latest 20150908
![Page 3: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/3.jpg)
[Diagram: component pipeline. Labels in extraction order:]
quickscrape; Norma; Index & Transform
XML; URL; DOI; DOC; CSV; sHTML
Plugins; Sequences; Species
Bespoke Scrapers; XPath; Taggers; Per-Journal
Chemistry; Phylogenetics; Plants; AMI
Bad HTML; OCR; Diagrams
CAT-alogue index; getpapers query; Titles + links; DailyCrawl/feed; EuPMC; JToCs
Latest 20150908; limited in scope
![Page 4: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/4.jpg)
Starting points for ingestion (getpapers/quickscrape/Norma)
• Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CTree(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG|image) [quality: good]
• PDF, XML, TXT, HTML -> Norma -> CTree(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR|TXT2HTML -> CTree(sHTML, TXT, SVG) [quality: variable]
20150908
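The two ingestion routes on this slide can be written down as ordered stage lists and rendered as Graphviz DOT text, in the spirit of the deck's *.dot-generated diagrams. This is only a minimal sketch: the stage labels are copied from the slide, but the `ROUTES` layout, the route names, and the `route_to_dot` helper are hypothetical and are not taken from getpapers, quickscrape, or Norma.

```python
# Hypothetical sketch: the slide's two ingestion routes as ordered stages.
# Labels come from the slide; the names ROUTES and route_to_dot are
# illustrative assumptions, not part of any ContentMine tool.
ROUTES = {
    "search_crawl_feed": [
        "Search/Crawl/Feed",
        "PMCID, DOI, URL",
        "quickscrape",
        "CTree(PDF, HTML, XML, images/, meta)",
        "Norma",
        "CMDir(sHTML|TXT|SVG|image)",
    ],
    "local_files": [
        "PDF, XML, TXT, HTML",
        "Norma",
        "CTree(PDF, rawHTML, TXT, images/, meta?)",
        "NormaOCR|TXT2HTML",
        "CTree(sHTML, TXT, SVG)",
    ],
}


def route_to_dot(name, stages):
    """Render one route as a left-to-right Graphviz digraph (DOT text)."""
    lines = [f'digraph "{name}" {{', "  rankdir=LR;"]
    # Connect each stage to the one that follows it.
    for src, dst in zip(stages, stages[1:]):
        lines.append(f'  "{src}" -> "{dst}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    for route_name, route_stages in ROUTES.items():
        print(route_to_dot(route_name, route_stages))
```

Saving the printed text to a .dot file and running it through Graphviz would draw each route as a chain of boxes; the quality notes from the slide (good, variable) are deliberately left out of the graph.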
![Page 5: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/5.jpg)
Norma Conversions
• Paper -> Scanned -> TIFF (avoid)
• PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG (fast, variable)
• PDF -> PDF2SVG-N -> sHTML, SVG, images/ (slow, accurate-ish)
• PDF -> PDF2TXT-N -> TXT (fast, variable)
• PDF -> PDF2Image-N -> PNG (fast, accurate)
20150908
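The conversion list above reads naturally as a lookup table: given an input format and a wanted output, which converters apply? The sketch below encodes the slide's rows (tool, inputs, outputs, speed, quality) as plain data. The `Conversion` type and the `candidates` function are hypothetical helpers for illustration and do not reflect Norma's actual command-line interface.

```python
from typing import NamedTuple


class Conversion(NamedTuple):
    """One row of the slide's conversion list (illustrative structure only)."""
    tool: str
    inputs: frozenset
    outputs: tuple
    speed: str
    quality: str


# Rows transcribed from the slide; the "Paper -> Scanned -> TIFF (avoid)"
# row is omitted because it names no converter.
CONVERSIONS = [
    Conversion("Tesseract-N", frozenset({"PDF", "TIFF", "PNG"}),
               ("HTML", "SVG"), "fast", "variable"),
    Conversion("PDF2SVG-N", frozenset({"PDF"}),
               ("sHTML", "SVG", "images/"), "slow", "accurate-ish"),
    Conversion("PDF2TXT-N", frozenset({"PDF"}),
               ("TXT",), "fast", "variable"),
    Conversion("PDF2Image-N", frozenset({"PDF"}),
               ("PNG",), "fast", "accurate"),
]


def candidates(input_format, wanted_output):
    """Return the tools that accept input_format and can emit wanted_output."""
    return [c.tool for c in CONVERSIONS
            if input_format in c.inputs and wanted_output in c.outputs]


if __name__ == "__main__":
    print(candidates("PDF", "SVG"))
    print(candidates("PNG", "SVG"))
```

When more than one tool qualifies, as for PDF to SVG, the slide's speed and quality notes are the only guidance on which to prefer.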
![Page 6: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/6.jpg)
Norma End points
• Norma -> CTree(OpenSHTML-SVG) -> everything?
• Norma -> CTree(sHTML, sections) -> AMI -> all text + species, chemText, sequences
• Norma -> CTree(TXT (unsectioned)) -> AMI -> bagOfWords, regex, IDs, species?
• Norma -> CTree(PNG) -> AMI -> phylo, bar/xy-plots
• Norma -> CTree(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
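One way to read the end points above is as a mapping from the kind of content Norma leaves in a CTree to the AMI analyses the slide lists for it. The following is a rough sketch under that reading: the dictionary keys and the `analyses_for` helper are hypothetical, the entries are copied from the slide (question marks included), and the open-ended "OpenSHTML-SVG -> everything?" row is left out.

```python
# Hypothetical lookup built from the slide's bullets; not an AMI API.
AMI_ENDPOINTS = {
    "sHTML": ["all text", "species", "chemText", "sequences"],
    "TXT": ["bagOfWords", "regex", "IDs", "species?"],
    "PNG": ["phylo", "bar/xy-plots"],
    "SVG": ["phylo", "bar/xy-plots", "chemistry"],
}


def analyses_for(available):
    """Collect analyses for the content types present, without duplicates.

    Order follows first appearance; unknown content types contribute nothing.
    """
    seen = []
    for kind in available:
        for analysis in AMI_ENDPOINTS.get(kind, []):
            if analysis not in seen:
                seen.append(analysis)
    return seen


if __name__ == "__main__":
    print(analyses_for(["PNG", "SVG"]))
```

A CTree holding only PNG and SVG would, on this reading, support the diagram-oriented analyses but none of the text-based ones.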
![Page 7: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/7.jpg)
Pre/early Norma toolchain
Transforming PDF and PNG into higher value components
20150908
Diagram autogenerated from *.dot graph
![Page 8: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/8.jpg)
getpapers/quickscrape/Norma workflow
20150908
Diagram autogenerated from *.dot graph
![Page 9: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/9.jpg)
Getpapers/quickscrape/Norma: commonest uses
20150908
Diagram autogenerated from *.dot graph
![Page 10: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/10.jpg)
AMI: inputs and outputs for common plugins
20150908
Diagram autogenerated from *.dot graph
![Page 11: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/11.jpg)
Earlier diagrams
Probably significantly out of date, but may contain useful info.
![Page 12: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/12.jpg)
[Diagram: NORMALIZE stage. Labels in extraction order:]
NORMALIZE
Norma: Convert PDF, XML to sHTML; Tag sections
Normalized Scientific Literature
AMI: Index, Transform, Extract, Search
PDF2SVG; XSL stylesheets; Taggers
normalization Parameters
“Permanent” Filestore
Temporary Filestore
Extracted facts; indexes
Plugins; Regex
![Page 13: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/13.jpg)
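[Diagram: NORMA conversion of PDF to ScholarlyHTML. Labels in extraction order:]
PDF: Non-Unicode; Pixel glyphs; No words; No structures
ScholarlyHTML; SVG; High-level graphics
PDF2SVG; characters; Sentences, Paras, tables
PNG; OCR; Tagged Sections; SVGBuilder; Captioned Figures
NORMA; XSLT1/2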
![Page 14: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/14.jpg)
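[Diagram: NORMA conversion of raw HTML to ScholarlyHTML. Labels in extraction order:]
Raw HTML: Not well-formed; Bad character semantics
ScholarlyHTML; Well-formed XHTML; PNG
Tagged Sections; Captioned Figures; Tables; Captioned Tables
XML; HtmlTidy; Jsoup; HtmlUnit
XSLT1/2 (appears twice); NORMA; Per-journal Stylesheets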
![Page 15: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/15.jpg)
RSU: Richard Smith-Unna; PMR: Peter Murray-Rust; CL: CottageLabs
Queues; Repos
Scientific literature
Science Plugins
Science Volunteers
Collaboration with Open Access Button
![Page 16: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/16.jpg)
[Diagram: CANARY pipeline. Labels in extraction order:]
quickscrape; Crawl; Feed; Norma; Index & Transform
TXT; XML; URL; DOI; Scientific literature; Repositories; DOC; CSV; sHTML
Plugins; Regex; Sequences; Species
Bespoke Scrapers; XPath; Per-Journal; Taggers; Per-Journal
Metadata; Chemistry; Phylogenetics; Farming; AMI
Bad HTML; OCR; Diagrams
Open NORMA-lized Scientific Literature + Facts
CANARY pipeline
CAT-alogue index
![Page 17: Architecture of ContentMine Components contentmine.org](https://reader035.fdocuments.us/reader035/viewer/2022062522/589df4581a28ab1e718b4923/html5/thumbnails/17.jpg)