A Technical Introduction to the Semantic Search Engine SeMedico

Erik Fäßler

Jena University Language & Information Engineering (JULIE) Lab
Friedrich Schiller University Jena, Jena, Germany
http://www.julielab.de

Talk in the Semesterprojekt "Entwicklung einer Suchmaschine für Alternativmethoden zu Tierversuchen" (semester project "Development of a Search Engine for Alternative Methods to Animal Experiments")
January 12, 2018, Humboldt-Universität zu Berlin



SeMedico Front Page


SeMedico Auto Completion


SeMedico Result View I


SeMedico Result View II


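SeMedico System Overview

[Diagram; components shown: MEDLINE documents; JULIE Lab server; PostgreSQL; UIMA pipeline with collection reader (CR), analysis engines (AE) and consumer (CO); ElasticSearch; Concept Database; NCBI Gene; SeMedico web application (Java servlet); frontend (Tapestry/JavaScript)]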


MEDLINE Document Storage I

•  MEDLINE is distributed as (g)zipped XML files
•  30K documents per file

    <PubmedArticleSet>
      <PubmedArticle>
        <MedlineCitation>
          <PMID>1234567</PMID>
          <Article>
            <Journal>...</Journal>
            <ArticleTitle>...</ArticleTitle>
            <Abstract>...</Abstract>
            <AuthorList>...</AuthorList>
          </Article>
          <MeshHeadingList>...</MeshHeadingList>
        </MedlineCitation>
      </PubmedArticle>
      <PubmedArticle>
        <MedlineCitation>
          <PMID>...</PMID>
          ...
        </MedlineCitation>
      </PubmedArticle>
    </PubmedArticleSet>
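Reading such a file can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SeMedico import code: the function name iter_citations and the tiny in-memory sample are made up, and streaming with iterparse is one possible way to avoid holding all 30K documents of a file in memory at once.

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_citations(fileobj):
    """Yield (pmid, citation_xml) pairs from a gzip-compressed MEDLINE file.

    Hypothetical helper: streams the XML so that only one citation
    is held in memory at a time.
    """
    with gzip.open(fileobj, "rb") as stream:
        for _event, elem in ET.iterparse(stream, events=("end",)):
            if elem.tag == "MedlineCitation":
                pmid = elem.findtext("PMID")
                yield pmid, ET.tostring(elem, encoding="unicode").strip()
                elem.clear()  # drop the children we no longer need


# Tiny in-memory stand-in for a real MEDLINE file (made-up content).
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1234567</PMID>
    <Article><ArticleTitle>A title</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>1729454</PMID>
    <Article><ArticleTitle>Another title</ArticleTitle></Article>
  </MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

citations = list(iter_citations(io.BytesIO(gzip.compress(SAMPLE))))
print([pmid for pmid, _xml in citations])
```

With a real download, the same function could be given a file path instead of the BytesIO object.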

MEDLINE Document Storage II

•  Import of MEDLINE citations into a database table
•  Size of MEDLINE: 27M abstracts

         pmid      xml
    1    1234567   <MedlineCitation><PMID>1234567</PMID>...</MedlineCitation>
    2    1729454   <MedlineCitation><PMID>1729454</PMID>...</MedlineCitation>
    3    1785742   <MedlineCitation><PMID>1785742</PMID>...</MedlineCitation>
    4    2264674   <MedlineCitation><PMID>2264674</PMID>...</MedlineCitation>
    ...  ...       ...
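The two-column table can be sketched as follows. SQLite (from the Python standard library) is used here purely as a stand-in for PostgreSQL so that the example runs without a server; the table name medline, the helper names and the upsert behaviour are assumptions, not the actual SeMedico schema.

```python
import sqlite3


def create_table(conn):
    # One row per citation: the PMID as key, the raw citation XML as value.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS medline ("
        "pmid TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )


def import_citations(conn, citations):
    # INSERT OR REPLACE is SQLite syntax: a re-imported (updated) citation
    # overwrites the row that has the same PMID.
    conn.executemany(
        "INSERT OR REPLACE INTO medline (pmid, xml) VALUES (?, ?)", citations
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
create_table(conn)
import_citations(conn, [
    ("1234567", "<MedlineCitation><PMID>1234567</PMID></MedlineCitation>"),
    ("1729454", "<MedlineCitation><PMID>1729454</PMID></MedlineCitation>"),
])
# Importing an updated version of an existing citation keeps one row per PMID.
import_citations(conn, [
    ("1234567",
     "<MedlineCitation><PMID>1234567</PMID><!-- updated --></MedlineCitation>"),
])
row_count = conn.execute("SELECT COUNT(*) FROM medline").fetchone()[0]
updated_xml = conn.execute(
    "SELECT xml FROM medline WHERE pmid = ?", ("1234567",)
).fetchone()[0]
print(row_count)
```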


From the Database into the Pipeline I

[Figure: the pmid/xml database table from the previous slide, shown together with the system overview diagram]


From the Database into the Pipeline II

UIMA Medline DB Reader
•  DB concurrency handling
•  Parsing of XML
•  Populating UIMA CAS instance
   –  Title/Abstract
   –  Authors
   –  Journal info
   –  etc.

[Diagram: JULIE Lab server, PostgreSQL → CAS (Common Analysis System) → to text analysis components]
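What the reader does with one table row can be sketched like this. The class SimpleCas is a hypothetical, much simplified stand-in for a UIMA CAS (the real reader is a UIMA component and uses the UIMA API, which is not shown here); the function name read_citation and the sample citation are likewise made up. The XML element names are those of the PubMed format.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class SimpleCas:
    """Hypothetical, simplified stand-in for a UIMA CAS."""
    pmid: str
    title: str
    abstract: str
    authors: list = field(default_factory=list)
    journal: str = ""

    @property
    def document_text(self):
        # The text that later analysis components would work on.
        return f"{self.title}\n{self.abstract}"


def read_citation(xml):
    """Parse one <MedlineCitation> string and populate a SimpleCas."""
    root = ET.fromstring(xml)
    article = root.find("Article")
    authors = [
        " ".join(filter(None, (a.findtext("ForeName"), a.findtext("LastName"))))
        for a in article.iterfind("AuthorList/Author")
    ]
    abstract = " ".join(
        t.text or "" for t in article.iterfind("Abstract/AbstractText")
    )
    return SimpleCas(
        pmid=root.findtext("PMID"),
        title=article.findtext("ArticleTitle", default=""),
        abstract=abstract,
        authors=authors,
        journal=article.findtext("Journal/Title", default=""),
    )


# Made-up citation for illustration only.
EXAMPLE = """<MedlineCitation><PMID>1234567</PMID><Article>
  <Journal><Title>Example Journal</Title></Journal>
  <ArticleTitle>Example title</ArticleTitle>
  <Abstract><AbstractText>Example abstract.</AbstractText></Abstract>
  <AuthorList><Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author></AuthorList>
</Article></MedlineCitation>"""

cas = read_citation(EXAMPLE)
print(cas.pmid, cas.authors, cas.journal)
```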


SeMedico UIMA JCoRe Pipeline I

[Pipeline diagram, from reader to consumer; components shown:]
•  Sentences, Tokens, Abbreviations, Parts of Speech
•  GeNo: Genes/Proteins
   –  Recognition
   –  Normalization (NCBI Gene)
•  Species (LINNAEUS)
•  Semantic layer
•  MeSH Terms (Dictionary)
•  Ontology classes (GO, GRO; Dictionary)
•  Molecular Event Extraction (BioSem)
•  Event Certainty Assessment
   –  Scale 1 to 6
   –  1: Negation
   –  6: No doubt

https://github.com/JULIELab/, Hahn & Matthies et al., LREC 2016
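The general idea of such a pipeline, a chain of components that each add annotations to a shared document object, can be sketched as below. This is a toy under stated assumptions: the real JCoRe components are UIMA analysis engines, whereas here a plain dict stands in for the CAS, and the function names and the two one-entry dictionaries are hypothetical. Only the identifiers ncbigene:2475 (mTOR) and ncbitax:9606 (human) are taken from the slides.

```python
import re


def tokenizer(cas):
    """Add one Token annotation (with character offsets) per word."""
    for match in re.finditer(r"\w+", cas["text"]):
        cas["annotations"].append(
            {"type": "Token", "begin": match.start(), "end": match.end()}
        )
    return cas


def make_dictionary_tagger(annotation_type, dictionary):
    """Build an engine that tags tokens found in a (lower-cased) dictionary."""
    def tagger(cas):
        tokens = [a for a in cas["annotations"] if a["type"] == "Token"]
        for tok in tokens:
            surface = cas["text"][tok["begin"]:tok["end"]].lower()
            if surface in dictionary:
                cas["annotations"].append({
                    "type": annotation_type,
                    "begin": tok["begin"],
                    "end": tok["end"],
                    "id": dictionary[surface],
                })
        return cas
    return tagger


def run_pipeline(cas, engines):
    """Pass the document through all engines in order."""
    for engine in engines:
        cas = engine(cas)
    return cas


# Toy dictionaries with one entry each.
GENES = {"mtor": "ncbigene:2475"}
SPECIES = {"human": "ncbitax:9606"}

cas = run_pipeline(
    {"text": "mTOR signaling in human cells.", "annotations": []},
    [tokenizer,
     make_dictionary_tagger("Gene", GENES),
     make_dictionary_tagger("Species", SPECIES)],
)
genes = [a for a in cas["annotations"] if a["type"] == "Gene"]
species = [a for a in cas["annotations"] if a["type"] == "Species"]
print(genes, species)
```

Later engines read what earlier ones wrote (the taggers rely on the Token annotations), which is why the order of the components matters.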


SeMedico UIMA JCoRe Pipeline II

ElasticSearch CAS Consumer
•  Transforms CAS into preanalyzed JSON document
•  Transformation configurable via API
•  JULIE Lab ES plugin required

[Diagram: from analysis pipeline → CAS (title, abstract, species, genes, events) → transformation API → preanalyzed JSON → (http) → ElasticSearch]

preanalyzed JSON
{
  "title": {…}, "abstract": {…}, "authors": {…}, "…": {…}
}


Full texts from PubMed Central

•  SeMedico integrates the open access subset of PMC
•  Uses a dedicated reader from JCoRe: jcore-pmc-reader
•  The rest of the analysis is basically the same
•  But:

Matthies, Franz, & Hahn, Udo (2017). Scholarly information extraction is going to make a quantum leap with PubMed Central (PMC)® – But moving from abstracts to full texts seems harder than expected. in: MedInfo 2017: Precision Healthcare through Informatics – Proceedings of the 16th World Congress on Medical and Health Informatics. Hangzhou, China, 21-25 August 2017, 521-525.


Concept Database I

    Name                             Description                                          Number of Concepts
    Medical Subject Headings (MeSH)  Biomedical vocabulary, multi-hierarchy               26K
    MeSH Supplementary Concepts      Chemicals, proteins etc. connected to MeSH           150K
    NCBI Gene                        Gene database                                        650K (in SeMedico)
    NCBI Taxonomy                    Taxonomical classification of species                1.1M
    Gene Ontology (GO)               Ontology about gene products and related processes   50K
    Gene Regulation Ontology (GRO)   Ontology about gene regulation processes             507


Concept Database II

•  Concepts are arranged taxonomically
   –  Squamous Cell Carcinoma IS-A Carcinoma
•  Neo4j is a graph database
   –  Terminologies and arbitrary relations between concepts can be modeled explicitly
   –  Appropriate query language:
      ·  "Get descendants of concept"
      ·  "Compute shortest path between two concepts"
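The two example queries can be illustrated on a toy taxonomy. In SeMedico they would be answered by Neo4j through its query language; the plain Python below only stands in for that and shows what the queries compute. Apart from "Squamous Cell Carcinoma IS-A Carcinoma", which is from the slide, the concepts and edges are illustrative assumptions, and the function names are hypothetical.

```python
from collections import deque

# child -> parents (IS-A edges). Only the first edge is taken from the slide;
# the other two are illustrative.
IS_A = {
    "Squamous Cell Carcinoma": ["Carcinoma"],
    "Adenocarcinoma": ["Carcinoma"],
    "Carcinoma": ["Neoplasm"],
}

# Derived lookup tables: parent -> children, and undirected neighbours.
CHILDREN = {}
NEIGHBOURS = {}
for child, parents in IS_A.items():
    for parent in parents:
        CHILDREN.setdefault(parent, []).append(child)
        NEIGHBOURS.setdefault(parent, []).append(child)
        NEIGHBOURS.setdefault(child, []).append(parent)


def descendants(concept):
    """'Get descendants of concept': breadth-first walk down the taxonomy."""
    seen, queue = set(), deque([concept])
    while queue:
        for child in CHILDREN.get(queue.popleft(), ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def shortest_path(start, goal):
    """'Compute shortest path between two concepts', ignoring edge direction."""
    queue, visited = deque([[start]]), {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for neighbour in NEIGHBOURS.get(path[-1], ()):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None  # the two concepts are not connected


print(descendants("Carcinoma"))
print(shortest_path("Squamous Cell Carcinoma", "Adenocarcinoma"))
```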


Neo4j Example Graph

[Figure: Neo4j example graph around the concept "Tauopathies"; labels shown: type1, type2, type3, type4]


Neo4j Concept Node Properties


Zooming Out


Concept IDs

[Diagram: Concept Database (tid2341, tid914, tid42) → CAS → transformation API → JSON → ElasticSearch → SeMedico web application (Java servlet)]

CAS: abstract
   species: ncbitax:9606
   genes: mTOR, ncbigene:2475

JSON
{
  "abstract": {["human", "tid914", "mTOR", "tid42"]}
}

SeMedico web application (Java servlet)
   query: "match: tid914"
   facet: "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"}
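The role of the concept IDs can be sketched as two lookups: one at indexing time (source identifier to concept ID) and one at search time (concept ID to display information for a facet). The mapping below is read off the slide's example; the variable and function names are hypothetical, and in SeMedico the lookups go to the concept database rather than to Python dicts.

```python
# Source identifier -> concept ID, as suggested by the slide's example.
SOURCE_TO_TID = {
    "ncbitax:9606": "tid914",
    "ncbigene:2475": "tid42",
}

# Concept ID -> display information, used when rendering facets.
# The description is elided on the slide and is left elided here.
CONCEPTS = {
    "tid42": {"name": "mTOR", "synonym": "FRAP", "description": "…"},
}


def index_terms(mentions):
    """Turn (surface form, source ID) mentions into index terms.

    Each mention contributes its surface form and, if known, its concept ID,
    so that a document can be found by word and by concept.
    """
    terms = []
    for surface, source_id in mentions:
        terms.append(surface)
        tid = SOURCE_TO_TID.get(source_id)
        if tid is not None:
            terms.append(tid)
    return terms


def facet_label(tid):
    """Resolve a concept ID from a search result to a human-readable name."""
    concept = CONCEPTS.get(tid)
    return concept["name"] if concept else tid


terms = index_terms([("human", "ncbitax:9606"), ("mTOR", "ncbigene:2475")])
print(terms)
print(facet_label("tid42"))
```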


ElasticSearch I

•  Manages the Lucene index
•  Seamless index updates, no downtime
•  Easy-to-use index distribution model
•  Full-text search
•  Faceting
•  Highlighting


ElasticSearch II

•  Lucene generates index terms via "text analysis"
   –  Tokenization, case folding, synonym enrichment, stemming
   –  ElasticSearch does the same on the document text sent to it
•  How to integrate UIMA?
•  First idea: create a Lucene UIMA analyzer, but
   –  Moves a lot of processing load into the ElasticSearch cluster
   –  Requires loading dictionaries and machine learning models
   –  That memory is lost to Lucene and ElasticSearch
   –  Overall: diminishes search performance
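The four analysis steps named above can be made concrete with a toy analyzer. This is not Lucene's implementation: the suffix-stripping stemmer is a deliberately crude assumption, the synonym table has a single entry (mTOR/FRAP, as on the Concept IDs slide), and all names are hypothetical.

```python
import re

SYNONYMS = {"mtor": ["frap"]}
SUFFIXES = ("ing", "ed", "es", "s", "e")


def stem(token):
    """Crude suffix stripping; keeps at least three characters of the word."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Return (position, term) pairs: tokenize, lower-case, stem, add synonyms."""
    terms = []
    for position, match in enumerate(re.finditer(r"\w+", text)):
        token = match.group().lower()           # tokenization + case folding
        terms.append((position, stem(token)))   # stemming
        for synonym in SYNONYMS.get(token, ()):
            terms.append((position, synonym))   # synonym at the same position
    return terms


print(analyze("Immunohistochemistry performed to evaluate mTOR"))
```

A synonym shares the position of the word it belongs to, so phrase queries still line up with the original text.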


ElasticSearch III

•  JULIE Lab ElasticSearch plugin to specify index terms exactly, without ES-internal analysis
   –  https://github.com/JULIELab/elasticsearch-mapper-preanalyzed
•  Employs the JSON format created for the Solr JsonPreAnalyzedParser
   –  https://lucene.apache.org/solr/guide/6_6/working-with-external-files-and-processes.html#WorkingwithExternalFilesandProcesses-JsonPreAnalyzedParser
•  The preanalyzed JSON is created by a CAS consumer (currently JULIE Lab internal)


ElasticSearch IV Preanalyzed Format

{
  "v": "1",
  "str": "Immunohistochemistry performed to evaluate the expression of phosphorylated mTOR (p-mTOR), phosphorylated p70S6K (p-p70S6K), phosphorylated 4E-binding protein 1 (p-4E-BP1), and Ki-67 using 105 surgically resected ESCC correlated with treatment outcome.",
  "tokens": [
    {"t": "immunohistochemistry", "s": 0, "e": 20, "i": 1},
    {"t": "tid94702", "s": 0, "e": 20, "i": 0},
    {"t": "perform", "s": 21, "e": 30, "i": 1},
    {"t": "evaluat", "s": 34, "e": 42, "i": 1},
    {"t": "event", "s": 34, "e": 42, "i": 0}, …
  ]
}
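Producing this format from analysis results might look as follows. The sketch assumes, as the example above suggests, that "t" is the index term, "s" and "e" are the start and end character offsets, and "i" is the position increment, so a concept ID with "i": 0 sits on the same position as the word it annotates. The function names are hypothetical; this is not the JULIE Lab CAS consumer.

```python
import json


def to_preanalyzed(text, terms):
    """Build a preanalyzed field value from (term, start, end, increment) tuples."""
    return {
        "v": "1",
        "str": text,
        "tokens": [
            {"t": term, "s": start, "e": end, "i": increment}
            for term, start, end, increment in terms
        ],
    }


def stack_concept(term, start, end, concept_id):
    """A word term (increment 1) followed by its concept ID on the same position."""
    return [(term, start, end, 1), (concept_id, start, end, 0)]


text = "Immunohistochemistry performed"
terms = stack_concept("immunohistochemistry", 0, 20, "tid94702")
terms.append(("perform", 21, 30, 1))

doc = to_preanalyzed(text, terms)
print(json.dumps(doc, indent=2))
```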


ElasticSearch V Simple Query

{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "cancer" } } },
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [ { "match": { "events.allarguments": "mtor" } } ],
                "filter": { "range": { "events.likelihood": { "lte": 5 } } }
              }
            }
        } }
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}


ElasticSearch VI Concept Query

{
  "query": {
    "bool": {
      "must": [
        { "match": { "abstracttext": { "query": "tid52310" } } },
        { "nested": {
            "path": "events",
            "inner_hits": {},
            "query": {
              "bool": {
                "must": [ { "match": { "events.allarguments": "tid42" } } ],
                "filter": { "range": { "events.likelihood": { "lte": 5 } } }
              }
            }
        } }
      ]
    }
  },
  "fields": [ "abstracttext", "title" ]
}
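The simple query and the concept query have the same shape and differ only in the terms they search for: words in one case, concept IDs in the other. A small builder makes that explicit. The function name event_query and its parameters are hypothetical; the field names and the structure are copied from the two slides, and the request body is only built here, not sent to a server.

```python
import json


def event_query(text_term, argument_term, max_likelihood=5):
    """Build the request body shown on the slides for the given search terms."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"abstracttext": {"query": text_term}}},
                    {"nested": {
                        "path": "events",
                        "inner_hits": {},
                        "query": {
                            "bool": {
                                "must": [
                                    {"match": {"events.allarguments": argument_term}}
                                ],
                                "filter": {
                                    "range": {
                                        "events.likelihood": {"lte": max_likelihood}
                                    }
                                },
                            }
                        },
                    }},
                ]
            }
        },
        "fields": ["abstracttext", "title"],
    }


simple_query = event_query("cancer", "mtor")      # plain words
concept_query = event_query("tid52310", "tid42")  # concept IDs
print(json.dumps(concept_query, indent=2))
```

Because concept IDs are ordinary index terms, the concept query needs no special query type.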

ElasticSearch VII Highlighting


References

•  Semedico
   –  Faessler, Erik, & Hahn, Udo (2017). SEMEDICO: A comprehensive semantic search engine for the life sciences. in: ACL 2017 – Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Vancouver, British Columbia, Canada, August 1, 2017, 91–96.
•  GeNo
   –  Wermter, Joachim, & Tomanek, Katrin, & Hahn, Udo (2009). High-performance gene name normalization with GeNo. in: Bioinformatics, 25, 815–821.
•  BioSem
   –  Bui, Q., Mulligen, E. van, Campos, D., & Kors, J. (2013). A Fast Rule-based Approach for Biomedical Event Extraction. in: Proceedings of the BioNLP 2013 Shared Task Workshop. Sofia, Bulgaria: Association for Computational Linguistics, 104–108.
•  Certainty Assessment
   –  Engelmann, Christine, & Hahn, Udo (2014). An empirically grounded approach to extend the linguistic coverage and lexical diversity of verbal probabilities. in: CogSci 2014 – Proceedings of the 36th Annual Cognitive Science Conference. Cognitive Science Meets Artificial Intelligence: Human and Artificial Agents in Interactive Contexts. Québec City, Québec, Canada, July 23-26, 2014, 451–456.
•  JCoRe
   –  Hahn, Udo, & Matthies, Franz, & Faessler, Erik, & Hellrich, Johannes (2016). UIMA-based JCoRe 2.0 goes GitHub and Maven Central: State-of-the-art software resource engineering and distribution of NLP pipelines. in: LREC 2016 – Proceedings of the 10th International Conference on Language Resources and Evaluation. Portorož, Slovenia, 23-28 May 2016, 2502–2509.


Conclusion

[Figure: the system overview diagram from the "SeMedico System Overview" slide]

http://www.semedico.org/
