Post on 16-Apr-2017
Semantic Data Normalization For
Efficient Clinical Trial Research
September 8th, 2016
• The specifics of clinical data
• What is RDF and how can we use it together with text analytics (TA)?
• Semantic annotations and their limitations
• What is semantic data normalization?
• Current state and next steps
Outline
• Unstructured (semi-structured)
• Abundant
• Redundant
• Ambiguous
• Aggregated
Clinical Data
In order to transform your clinical data into information and even knowledge, you will have to analyze it!
… but before that you have to make it ready for the analysis!
What is RDF
The RDF data model resolves all syntax-level ambiguities
It lets you express all data in a common data model
ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
DT   01-OCT-1989 (Rel. 12, Created)
DT   01-OCT-1989 (Rel. 12, Last sequence update)
DT   15-JUN-2002 (Rel. 41, Last annotation update)
DE   Granzyme A precursor (EC 3.4.21.78) (Cytotoxic T-lymphocyte proteinase
DE   1) (Hanukkah factor) (H factor) (HF) (Granzyme 1) (CTL tryptase)
DE   (Fragmentin 1).
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).

<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="In-Process">
    <PMID Version="1">21500419</PMID>
    <DateCreated>
      <Year>2011</Year>
      <Month>04</Month>
      <Day>15</Day>
    </DateCreated>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Electronic">1520-6882</ISSN>
        <JournalIssue CitedMedium="Internet">
          <Volume>82</Volume>
          <Issue>20</Issue>
          <PubDate>
            <Year>2010</Year>
            <Month>Oct</Month>
            <Day>15</Day>
          </PubDate>
        </JournalIssue>
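To make the "common data model" point concrete, here is a minimal sketch that reads a UniProt-style flat-file record and a PubMed-style XML record and expresses both as subject-predicate-object triples. The namespaces (`http://example.org/...`), the predicate names, and the two function names are assumptions made for illustration only; they are not taken from the slides. The XML sample has closing tags added so that it parses, since the record shown on the slide is truncated.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces and predicates, chosen only for this illustration.
UNIPROT = "http://example.org/uniprot/"
PUBMED = "http://example.org/pubmed/"
EX = "http://example.org/schema/"

FLAT_FILE = """\
ID   GRAA_HUMAN     STANDARD;      PRT;   262 AA.
AC   P12544;
GN   GZMA OR CTLA3 OR HFSP.
OS   Homo sapiens (Human).
"""

# Shortened, well-formed version of the record on the slide.
PUBMED_XML = """\
<PubmedArticle>
  <MedlineCitation Owner="NLM" Status="In-Process">
    <PMID Version="1">21500419</PMID>
    <Article PubModel="Print">
      <Journal>
        <ISSN IssnType="Electronic">1520-6882</ISSN>
      </Journal>
    </Article>
  </MedlineCitation>
</PubmedArticle>
"""


def uniprot_to_triples(text):
    """Map a few flat-file line codes (ID, AC, GN, OS) to triples."""
    fields = {}
    for line in text.strip().splitlines():
        code, _, value = line.partition(" ")
        fields.setdefault(code, []).append(value.strip())
    subject = UNIPROT + fields["AC"][0].rstrip(";")
    triples = [
        (subject, EX + "mnemonic", fields["ID"][0].split()[0]),
        (subject, EX + "organism", fields["OS"][0].rstrip(".")),
    ]
    for gene in fields["GN"][0].rstrip(".").split(" OR "):
        triples.append((subject, EX + "geneName", gene))
    return triples


def pubmed_to_triples(xml_text):
    """Map the PMID and journal ISSN of a PubMed record to triples."""
    root = ET.fromstring(xml_text)
    pmid = root.findtext("MedlineCitation/PMID")
    issn = root.findtext("MedlineCitation/Article/Journal/ISSN")
    subject = PUBMED + pmid
    return [
        (subject, EX + "pmid", pmid),
        (subject, EX + "journalIssn", issn),
    ]


protein_triples = uniprot_to_triples(FLAT_FILE)
article_triples = pubmed_to_triples(PUBMED_XML)
for triple in protein_triples + article_triples:
    print(triple)
```

Both sources end up in the same shape, so a single store and a single query language can cover them, whatever the original syntax was.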
Linked Data
How well interlinked is the linked data cloud?
• Many interesting queries are difficult to express in SPARQL
• String functions cannot be indexed
• Identifiers are often misplaced
[Figure: RDF graph of a BioPAX record and the corresponding UniProt entry. Nodes: cpath:CPATH-94138, cpath:CPATH-LOCAL-8467065, cpath:CPATH-LOCAL-8749236, uniprot:P29965. Properties: biopax-2:PHYSICAL-ENTITY, biopax-2:SHORT-NAME, biopax-2:XREF, biopax-2:DB, biopax-2:ID, uniprot:mnemonic. Literals: P29965, UNIPROT, CD40L_HUMAN, TNF5_HUMAN, CD4L_HUMAN.]
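The "misplaced identifiers" problem can be sketched as follows: the BioPAX side stores the database name and the accession as two plain strings, while the UniProt side uses a URI, so the two graphs do not join without string manipulation. The arrangement of triples below is an assumption reconstructed from the labels in the figure; the `ex:resolvesTo` predicate, the URI pattern table, and the function name are hypothetical.

```python
# Assumed URI pattern per database; only UniProt is needed for this example.
URI_PATTERNS = {"UNIPROT": "http://purl.uniprot.org/uniprot/{id}"}

# Assumed arrangement of the triples shown in the figure.
TRIPLES = [
    ("cpath:CPATH-94138", "biopax-2:PHYSICAL-ENTITY", "cpath:CPATH-LOCAL-8467065"),
    ("cpath:CPATH-LOCAL-8467065", "biopax-2:SHORT-NAME", "CD40L_HUMAN"),
    ("cpath:CPATH-LOCAL-8467065", "biopax-2:XREF", "cpath:CPATH-LOCAL-8749236"),
    ("cpath:CPATH-LOCAL-8749236", "biopax-2:DB", "UNIPROT"),
    ("cpath:CPATH-LOCAL-8749236", "biopax-2:ID", "P29965"),
]


def resolve_xrefs(triples):
    """Turn (DB, ID) literal pairs into explicit links to URIs."""
    db, ident = {}, {}
    for s, p, o in triples:
        if p == "biopax-2:DB":
            db[s] = o
        elif p == "biopax-2:ID":
            ident[s] = o
    links = []
    # Only nodes that carry both a database name and an identifier can be resolved.
    for node in sorted(db.keys() & ident.keys()):
        pattern = URI_PATTERNS.get(db[node].upper())
        if pattern:
            # ex:resolvesTo is a hypothetical predicate for this sketch.
            links.append((node, "ex:resolvesTo", pattern.format(id=ident[node])))
    return links


links = resolve_xrefs(TRIPLES)
print(links)
```

Materializing such links once, at load time, avoids doing string concatenation inside every SPARQL query, where it cannot use an index.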
Semantic Annotations
[Figure: semantic annotation of the abstract pmid:17714090. Relations shown: mentions, broader, broaderTransitive, author, journal. Concepts shown: COPD, Asthma (umls:C000496), Chronic Obstructive Airway Diseases, Bronchial Diseases, Respiration Disorders; further identifiers shown: umls:C0035204, umls:C0006261. Metadata shown: author Ian A Yang; journal Clinical and experimental pharmacology …]

Asthma and chronic obstructive pulmonary disease (COPD) are chronic airway diseases characterized by airflow obstruction. The beta(2)-adrenoceptor mediates bronchodilatation in response to exogenous and endogenous beta-adrenoceptor agonists. Single nucleotide polymorphisms in the beta(2)-adrenoceptor gene (ADRB2) cause amino acid changes (e.g. Arg16Gly, Gln27Glu) that potentially alter receptor function. Recently, a large cohort study found no association between asthma susceptibility and beta(2)-adrenoceptor polymorphisms. In contrast, asthma phenotypes, such as asthma severity and bronchial hyperresponsiveness, have been associated with beta(2)-adrenoceptor polymorphisms.
• Good for:
  – Generation of machine-readable metadata
  – Semantic indexing of large sets of documents
  – Providing additional background knowledge
• Limitations:
  – Incomplete knowledge extraction
  – Does not completely capture the context
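As a rough illustration of semantic indexing with background knowledge, the sketch below finds a document through a broader concept that the text itself never mentions. The `broader` edges are assumptions made for the example (the exact hierarchy in the annotated abstract is not recoverable from the slide), and the function names are hypothetical.

```python
# Assumed "broader" edges between concept labels; for illustration only.
BROADER = {
    "COPD": ["Chronic Obstructive Airway Diseases"],
    "Chronic Obstructive Airway Diseases": ["Bronchial Diseases"],
    "Asthma": ["Bronchial Diseases"],
    "Bronchial Diseases": ["Respiration Disorders"],
}

# Concepts that the annotated abstract mentions.
MENTIONS = {"pmid:17714090": ["Asthma", "COPD"]}


def broader_transitive(concept):
    """Return every concept reachable by following broader edges."""
    seen = set()
    stack = list(BROADER.get(concept, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(BROADER.get(current, []))
    return seen


def search(concept):
    """Find documents mentioning the concept or any narrower concept."""
    hits = []
    for doc, concepts in MENTIONS.items():
        if any(c == concept or concept in broader_transitive(c) for c in concepts):
            hits.append(doc)
    return sorted(hits)


print(search("Respiration Disorders"))
```

The limitation listed above still applies: the index knows that the abstract mentions asthma, but not that the study found no association with asthma susceptibility.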
Semantic Annotations
• What is it?
  – A text analytics approach that aims to capture the full context of the information and to provide clear references to concepts/objects, so that it can be easily interpreted by machines.
• How do we do it?
  – Work on sentence level
  – Extract the key phrases from the sentence
  – Identify the main concept
  – Identify all the qualifiers and negations
  – Model the extracted data as RDF
Semantic Data Normalization
• Condition text:
  – “Advanced Biliary Tract Adenocarcinoma” (Study ID = NCT01506973)
• Text Analysis
  – One phrase is identified in the Condition text
  – Advanced Biliary Tract Adenocarcinoma
• Data Schema
  – One annotation object is created
  – Main concept is “Adenocarcinoma”
  – Qualifier concepts are “Advanced” and “Biliary tract”
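The analysis of this condition text can be approximated with a small dictionary lookup. This is only a sketch under stated assumptions: the lexicon is a three-entry toy, the pairing of the qualifier identifiers with "Advanced" and "Biliary tract" is inferred from their order in the RDF model, and `annotate` is a hypothetical name. The actual pipeline is not described at this level of detail in the slides.

```python
# Toy lexicon: token sequence -> (concept ID, role). The pairing of the two
# qualifier IDs with their labels is an assumption.
LEXICON = {
    ("advanced",): ("C0205179", "qualifier"),
    ("biliary", "tract"): ("C0005423", "qualifier"),
    ("adenocarcinoma",): ("C0001418", "disease"),
}
MAX_LEN = max(len(key) for key in LEXICON)


def annotate(phrase):
    """Greedy longest-match lookup that separates main concept and qualifiers."""
    tokens = phrase.lower().split()
    main_concept, qualifiers = None, []
    i = 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            entry = LEXICON.get(tuple(tokens[i:i + n]))
            if entry:
                concept_id, role = entry
                if role == "disease" and main_concept is None:
                    main_concept = concept_id
                else:
                    qualifiers.append(concept_id)
                i += n
                break
        else:
            i += 1  # unknown token: skip it
    return {"phrase": phrase, "main_concept": main_concept, "qualifiers": qualifiers}


annotation = annotate("Advanced Biliary Tract Adenocarcinoma")
print(annotation)
```

Trying the longest token sequence first is what keeps "biliary tract" together as one qualifier instead of two unrelated tokens.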
Semantic Data Normalization
[Figure: condition annotation model for study NCT01506973]
NCT01506973
    rdf:type ClinicalTrial
    ct:conditionText “Advanced Biliary Tract Adenocarcinoma”
    ct:conditionAnnotation ConditionAnnotationID
ConditionAnnotationID
    ca:hasDisease C0001418
    ca:hasPhrase “Advanced Biliary Tract Adenocarcinoma”
    ca:hasQualifiers QualifierGroupID
QualifierGroupID
    cg:hasQualifiers C0205179, C0005423
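The model above can be written down as plain triples. The sketch below uses the property names from the slide; the scheme for minting the annotation and qualifier-group identifiers (`_cond_1`, `_qualifiers`) and the function name are assumptions, since the slide only shows the placeholders ConditionAnnotationID and QualifierGroupID.

```python
def condition_triples(trial_id, text, disease, qualifiers):
    """Build the condition annotation triples for one study."""
    annotation = f"{trial_id}_cond_1"      # hypothetical ID scheme
    group = f"{annotation}_qualifiers"     # hypothetical ID scheme
    triples = [
        (trial_id, "rdf:type", "ClinicalTrial"),
        (trial_id, "ct:conditionText", f'"{text}"'),
        (trial_id, "ct:conditionAnnotation", annotation),
        (annotation, "ca:hasDisease", disease),
        (annotation, "ca:hasPhrase", f'"{text}"'),
        (annotation, "ca:hasQualifiers", group),
    ]
    # One triple per qualifier concept attached to the group node.
    triples += [(group, "cg:hasQualifiers", q) for q in qualifiers]
    return triples


triples = condition_triples(
    "NCT01506973",
    "Advanced Biliary Tract Adenocarcinoma",
    "C0001418",
    ["C0205179", "C0005423"],
)
for s, p, o in triples:
    print(s, p, o, ".")
```

Keeping the qualifiers in their own group node means a second phrase in the same condition text gets its own annotation and its own group, so qualifiers never leak between phrases.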
• Study Conditions
  – Multiple phrases in a text
  – Pre-coordinated concepts vs. post-coordinated
  – Scoring of matching concepts
• Study Interventions
  – Drug, route, form
  – Drug dosage
• Adverse Events
  – Normalization of AE
  – Post-coordinated concepts
• Eligibility Criteria
  – Semantic sectioning and categorization
  – Negations
  – Diseases, findings, treatments, age and gender
Demo Example
Intervention Annotation Model - Drugs
[Figure: intervention annotation model for drugs. Nodes: NCT01506973 (rdf:type ClinicalTrial), NCT01506973_1_2, DrugAnnotationID, DrugDosageID, SingleDoseID, PeriodID, FrequencyID, SCTID:111418, SCTID:121681. Properties: ct:hasIntervention, in:drugAnnotation, da:hasDrug (111418), da:hasAdministrationRoute, da:hasAdministrationForm, da:hasDosage, do:hasSingleDose, do:hasPeriod, do:hasFrequency. Dose fields: Value, Unit, Denominator Value, Denominator Unit.]
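The single-dose part of this model (Value, Unit, Denominator Value, Denominator Unit) maps naturally onto a small record type. The following is a minimal sketch, not the parser used for the data set: the class name, the regular expression, and the sample strings are all assumptions, and defaulting a missing denominator value to 1 is a design choice made for this example.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SingleDose:
    value: float
    unit: str
    denominator_value: Optional[float] = None
    denominator_unit: Optional[str] = None


# Matches strings such as "500 mg", "10 mg/kg" or "5 mg/2 ml".
DOSE_PATTERN = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]+)"
    r"(?:\s*/\s*(\d+(?:\.\d+)?)?\s*([A-Za-z][A-Za-z0-9]*))?\s*$"
)


def parse_single_dose(text):
    """Parse a simple dose string; return None when it does not match."""
    match = DOSE_PATTERN.match(text)
    if match is None:
        return None
    value, unit, den_value, den_unit = match.groups()
    if den_unit is None:
        return SingleDose(float(value), unit)
    # "mg/kg" means "per 1 kg", so a missing denominator value defaults to 1.
    return SingleDose(float(value), unit, float(den_value) if den_value else 1.0, den_unit)


print(parse_single_dose("10 mg/kg"))
print(parse_single_dose("5 mg/2 ml"))
print(parse_single_dose("500 mg"))
```

Storing the denominator separately is what allows "10 mg/kg" and "20 mg/2 kg" to be compared after normalization, which a free-text dosage string does not allow.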
Criteria Annotation Model
[Figure: criteria annotation model. Nodes: NCT01506973 (rdf:type ClinicalTrial), CriteriaSection (rdf:type “Inclusion”/“Exclusion”/“Not defined”), Criterion, AnnotationId (rdf:type “Disease”/“Drug”/…). Properties: ct:hasCriteriaSection, cs:hasText, cs:hasCriterion, cr:hasText, cr:hasAnnotation, sa:Negation (“True”/“False”/…), Property 1, Property 2, … Property N.]
Example section text: “… No extensive intraductal components on core biopsy, defined as intraductal carcinoma. Patients must not have recurrent invasive breast cancer. …”
Example criterion text: “Patients must not have recurrent invasive breast cancer.”
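A very rough approximation of the sectioning and negation steps is sketched below. It assumes that sections are introduced by header lines such as "Exclusion Criteria:" and that negation can be flagged by a handful of cue words; both are simplifications made for this example, and the last sample sentence is invented to show the section switch. Real criteria text needs negation scope handling that this sketch does not attempt.

```python
import re

# Assumed cue words and header format; a production system needs far more.
NEGATION_CUES = re.compile(r"\b(no|not|without|never)\b", re.IGNORECASE)
SECTION_HEADER = re.compile(r"^\s*(inclusion|exclusion)\s+criteria\s*:?\s*$", re.IGNORECASE)

CRITERIA_TEXT = """\
No extensive intraductal components on core biopsy, defined as intraductal carcinoma. Patients must not have recurrent invasive breast cancer.
Exclusion Criteria:
Pregnant or nursing women.
"""


def annotate_criteria(text):
    """Split criteria text into sentences tagged with section and negation."""
    section = "Not defined"
    criteria = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        header = SECTION_HEADER.match(line)
        if header:
            section = header.group(1).capitalize()
            continue
        # Naive sentence split on terminal punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", line):
            if sentence:
                criteria.append({
                    "section": section,
                    "text": sentence,
                    "negation": bool(NEGATION_CUES.search(sentence)),
                })
    return criteria


criteria = annotate_criteria(CRITERIA_TEXT)
for criterion in criteria:
    print(criterion)
```

The negation flag matters for querying: without it, a search for studies on recurrent invasive breast cancer would also return studies that explicitly rule those patients out.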
• Work with ClinicalTrials.gov data as a public showcase
  – > 215K clinical studies
  – > 76 million RDF statements
• Coverage
  – Conditions (197,154 objects): Diseases, Findings, Body locations, Qualifiers
  – Interventions (rdf:type = ‘Drug’ and rdf:type = ‘Biologics’) (381,590 objects): Drugs, Dosages, Administration form, Administration route, Population group
  – Adverse Events (1,226,754 objects): Diseases, Findings, Body locations, Qualifiers
  – Criteria (semantic sectioning and categorization, negations) (7,216,361 objects): Diseases, Findings, Drugs, Population groups
• In total, more than 80 million RDF triples
Current Status
• Directly mine the public enhanced CT.gov version
• Apply the same approach to your internal clinical trials data
• Once the data is semantically normalized, you can “slice and dice” it as your use case requires
• Examples
  – Top-down data exploration
  – Linked data browsing
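To give an idea of what "slice and dice" can look like, the sketch below filters studies by main disease concept and, optionally, by a qualifier concept. It runs over a small in-memory list of triples shaped like the condition annotation model; in practice this would be a SPARQL query against the triple store. The second study (TRIAL_B), all annotation identifiers, and the function name are invented for the example.

```python
from collections import defaultdict

# NCT01506973 follows the condition model; TRIAL_B is a hypothetical second
# study with the same disease concept but no qualifiers.
TRIPLES = [
    ("NCT01506973", "ct:conditionAnnotation", "ann1"),
    ("ann1", "ca:hasDisease", "C0001418"),
    ("ann1", "ca:hasQualifiers", "grp1"),
    ("grp1", "cg:hasQualifiers", "C0205179"),
    ("grp1", "cg:hasQualifiers", "C0005423"),
    ("TRIAL_B", "ct:conditionAnnotation", "ann2"),
    ("ann2", "ca:hasDisease", "C0001418"),
]


def trials_with(triples, disease, qualifier=None):
    """Return studies annotated with the disease and, optionally, a qualifier."""
    index = defaultdict(list)
    for s, p, o in triples:
        index[(s, p)].append(o)
    hits = set()
    for (subject, predicate), annotations in index.items():
        if predicate != "ct:conditionAnnotation":
            continue
        for annotation in annotations:
            if disease not in index.get((annotation, "ca:hasDisease"), []):
                continue
            # Collect the qualifiers from every group attached to the annotation.
            qualifiers = [
                q
                for group in index.get((annotation, "ca:hasQualifiers"), [])
                for q in index.get((group, "cg:hasQualifiers"), [])
            ]
            if qualifier is None or qualifier in qualifiers:
                hits.add(subject)
    return sorted(hits)


print(trials_with(TRIPLES, "C0001418"))
print(trials_with(TRIPLES, "C0001418", qualifier="C0005423"))
```

The first call returns every study on the disease; the second narrows the result to studies where the qualifier is attached to that disease, which is the distinction that free-text search over the condition field cannot make.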
How Can I Use This?
Next Steps
• Release an RDFized version of ClinicalTrials.gov
  – Pre-loaded in GraphDB Free
  – Pre-loaded on Ontotext S4 Cloud
  – As an RDF serialization distribution
• Release all semantically structured information under a license that is free for non-commercial use
• Extend the data schema to support not only concepts but also tokens that cannot be normalized to ontology instances