Natural Language Processing of the clinical narrative in ... · L’examen n’objective pas...


Natural Language Processing of the clinical narrative in French to support Public Health

Aurélie Névéol, CR1 CNRS

2/33

Natural Language Processing is needed to support Epidemiology and Public Health

…how?

3/33

Benefit of CT venography in the diagnosis of pulmonary embolism and thromboembolic disease?

Prevalence of Incidental Findings?

NLP can produce supporting evidence to address public health issues

4/33

NLP can create biomedical knowledge from multiple and heterogeneous documents

Example snippets: “Large thrombus burden in the proximal LAD artery”; “Obstruction totale des artères segmentaires” (total obstruction of the segmental arteries)

Sources: Electronic Health Records, Biological Data Repositories, Publications, Protocols, Social Media

5/33

CABeRneT: Automatic Understanding of Biomedical Text for Translational Research

[Figure labels: Publications, Patient record, NLP Analysis, Structured Representation, Links, New therapeutic insight, Retrospective Analysis]

Example: “Thrombose de la veine ovarienne droite sur 1.5cm de haut” (thrombosis of the right ovarian vein over a height of 1.5 cm) is represented as C1267486 (Entire Right Ovarian Vein) LOCATION OF C0040053 (Thrombosis)

http://cabernet.limsi.fr

In practice…

Je vois ce jour en consultation Monsieur Marc DURAND, né le 18/05/1954 avec les résultats d’un angioscanner thoracique et phleboscanner que j’avais demandés le 24 novembre 2004 pour suspicion d’embolie pulmonaire. L’examen n’objective pas d’EP, ni de TVP des membres inférieurs. On observe cependant un processus ganglio-tumoral hilaire gauche avec épanchement pleural bilatéral et collapsus pulmonaire associés.

(English: Today I saw in consultation Mr Marc DURAND, born 18/05/1954, with the results of a chest CT angiogram and CT venogram that I had requested on 24 November 2004 for suspected pulmonary embolism. The examination shows no PE and no DVT of the lower limbs. However, there is a left hilar nodal-tumoral process with associated bilateral pleural effusion and pulmonary collapse.)

7/33

Personal Health Identifiers

De-identification needed to process health data.

De-identification method [Grouin and Névéol 2014, Grouin et al. 2014]

Study of re-identification risks [Grouin et al. 2015]
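A minimal sketch of what de-identification involves, assuming a purely rule-based approach. It is not the method of the cited papers. The regular expressions, the `deidentify` function, and the `<DATE>` and `<NAME>` placeholders are illustrative assumptions.

```python
import re

# Illustrative patterns only (assumptions); real systems need far broader coverage.
MONTHS = (
    "janvier|février|mars|avril|mai|juin|juillet|août|"
    "septembre|octobre|novembre|décembre"
)
# Numeric dates (18/05/1954) and spelled-out French dates (24 novembre 2004).
DATE_RE = re.compile(
    rf"\b\d{{1,2}}/\d{{1,2}}/\d{{4}}\b|\b\d{{1,2}}\s+(?:{MONTHS})\s+\d{{4}}\b",
    re.IGNORECASE,
)
# A title followed by a first name and an upper-case surname (assumed convention).
NAME_RE = re.compile(r"\b(Monsieur|Madame|M\.|Mme)\s+[A-ZÉ][\w-]+\s+[A-ZÉ-]{2,}\b")


def deidentify(text: str) -> str:
    """Replace dates and titled person names with placeholder tags."""
    text = DATE_RE.sub("<DATE>", text)
    return NAME_RE.sub(lambda m: f"{m.group(1)} <NAME>", text)


letter = (
    "Je vois ce jour en consultation Monsieur Marc DURAND, né le 18/05/1954 "
    "avec les résultats d'un angioscanner demandé le 24 novembre 2004."
)
print(deidentify(letter))
```

Rules like these miss many identifiers (addresses, phone numbers, names without a title), which is one reason re-identification risk has to be studied separately.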

8/33

Entities and Relations

Representation Schema

18 entity types

37 relations

[Deléger et al. 2014; Deléger, Campillos et al. 2017]

9/33

Assertions, hedges

4 modalities

8 aspects

Abbreviations

…Annotated Corpus [Campillos, Deléger et al. 2016]

10/33

Normalization

Entity Linking with the UMLS

3 million concepts

170,000 with French terms

C0023216 C0149871 C0034065 C0582103


11/33

A multidisciplinary endeavour

Computer Science:

• Knowledge Representation

• Natural Language Processing

• Methods

• Annotated corpus, tools

Medicine, Public Health, Epidemiology:

• Retrospective analysis of patient records

• Biomedical Information retrieval

12/33

Challenges of clinical NLP

• Data processing and analysis

– Data confidentiality: de-identification

– Legal and technical access to data

• Accommodating the complexity of biomedical language

– Great language variety

– Several « sublanguages » according to Zellig Harris’ definition

• Make use of vast knowledge resources

– UMLS ~3 million concepts

– Terms associated with concepts are primarily in English

13/33

CABeRneT at a glance

Task 1: Preparation of annotated corpora

Task 2: Entity, concept extraction

Task 3: Relation extraction

Task 4: Retrospective analysis

Task 5: Linking EHRs

Task 6: Evaluation

14/33

Addressing an open public health question

What is the prevalence of Incidental Findings in patients with suspected Pulmonary Embolism or suspected thromboembolic disease? [Pham et al. 2014]

• Corpus comprising 615 deidentified radiology reports

• Annotations for entities, relations, modalities, sections

• Binary classification for Incidental Findings

Clinical insight

• Overall prevalence: 15%

• Classification to be based on follow-up

NLP insight

• Complex analysis useful

Features              P     R     F
Words                 0.43  0.32  0.37
Words + annotations   0.67  0.50  0.57
+ Sections            0.76  0.81  0.80

15/33

Linking from the electronic health record

• Information retrieval based on patient record [D’hondt et al. 2014]

• Participation in TREC 2014: retrieving articles from the literature based on clinical case description

• Methodological work on English

• Redundancy in electronic health records [D’hondt et al. 2015, 2016]

• Was shown to have an impact on language models [Cohen et al. 2013]

• Links between documents within the EHR:

• Identification of (near-)identical documents (duplicates)

• Identification of subsequent document versions
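The two identification steps above can be sketched as a similarity computation over word n-grams ("shingles"). This is a generic technique, not necessarily the one used in the cited work. The function names and both thresholds are hypothetical.

```python
def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of word n-grams of a document (lower-cased)."""
    tokens = text.lower().split()
    if len(tokens) < n:
        return {tuple(tokens)} if tokens else set()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity between the shingle sets of two documents."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def classify_pair(a: str, b: str, dup: float = 0.9, version: float = 0.5) -> str:
    """Label a document pair; the thresholds are illustrative assumptions."""
    score = jaccard(a, b)
    if score >= dup:
        return "duplicate"
    if score >= version:
        return "possible version"
    return "distinct"


print(classify_pair("a b c d e f", "a b c d e g"))
```

Telling an earlier version from a later one would additionally need document metadata such as timestamps, which this sketch does not model.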


16/33

Preparation of annotated corpora

• Legal aspects

• CNIL, IRB, other…

• Implications for distribution, e.g. through shared task

• Choice of a representation scheme

• Existing schemes... reviewed in [Deléger, Campillos et al. 2017]

• Links to a knowledge source

• Which source, or sources?

• Strict guidance from knowledge source?

• Choice of annotation tool and method

• Inline vs. Standoff annotations

• Use of pre-annotations?

• Human input [Grouin et al. 2014]

17/33

QUAERO French Medical Corpus https://quaerofrenchmed.limsi.fr [Névéol et al. 2014]

• Legal aspects

• Open source text: MEDLINE titles and OPUS EMEA subset

• Used in CLEF eHealth 2015, 2016

• Choice of a representation scheme

• Links to a knowledge source: UMLS

• Which source: all sources in UMLS

• Strict guidance: concepts, not terms

• Choice of annotation tool and method

• Inline vs. Standoff annotations: a little of both…

• Use of pre-annotations: yes [Névéol et al. 2010]

• Human input: two annotators, one reviser

18

QUAERO French Medical Corpus: contents

• Two Corpora

• MEDLINE titles: scientific literature, short, low redundancy

• EMEA documents: drug handouts, long, high redundancy

• Ten entities of clinical interest are annotated

• Defined according to UMLS Semantic Groups [Bodenreider & McCray, 2003]

• Anatomy, Chemicals & Drugs, Devices, Disorders, Geographic Areas, Living Beings, Objects, Phenomena, Physiology, Procedures

• Embedded and discontinuous entities

19

QUAERO French Medical Corpus: annotation methodology

• Corpora were pre-annotated automatically

• Two annotators initially contributed using

– Detailed annotation guide

– Reference UMLS tools: EHTOP (in French) http://www.hetop.eu/hetop/ and UTS metathesaurus browser (in English) https://uts.nlm.nih.gov/metathesaurus.html

• One expert annotator later contributed towards

– Annotation harmonization

– Annotation revision


20/33

Annotation decisions

• Concept without a French term

• NCI parathyroide intrathyroïdale C3272635

• GO pupaison C1326578

• Term not associated with concept

• MeSH « Système ABO » -> « système ABO de groupes sanguins » C0000778

• MeSH « français » -> « France » C0016674

• Concept not in the UMLS

• Daronrix (influenza vaccine)

• IONSYS (analgesic device)

21

QUAERO French Medical Corpus: corpus excerpt

MEDLINE: Contraception by intrauterine devices

EMEA: What is Tysabri used for? Tysabri is used to treat adults with highly active multiple sclerosis (MS).

[Figure: the excerpts are annotated with UMLS CUIs C1529600, C0087111, C0001675, C0026769, C0700589, C0344221, C0021900, C0042149]

22

QUAERO French Medical Corpus: corpus release

Corpus            EMEA    EMEA    EMEA    MEDLINE  MEDLINE  MEDLINE
Split             Train.  Dev.    Test    Train.   Dev.     Test
Tokens            14,944  13,271  12,042  10,552   10,503   10,871
Entities          2,695   2,260   2,204   2,994    2,977    3,103
Unique Entities   923     756     658     2,296    2,288    2,390
Unique CUIs       648     523     474     1,860    1,848    1,909

• Data Format

• Stand-off (BRAT) and BioC

• Evaluation Tool

• Brateval

• Dataset statistics

• ~20% of CUIs assigned do not have a French term associated
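As an illustration of the stand-off format, here is a minimal sketch that reads text-bound annotations (the `T` lines) from a BRAT `.ann` file, including discontinuous spans written as `start end;start end`. The sample content, its offsets, and its labels are made up for the example. The `Entity` class and `parse_ann` function are hypothetical names, and how the corpus files attach CUIs to entities is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Entity:
    ann_id: str
    label: str
    spans: list[tuple[int, int]]  # one (start, end) pair per fragment
    text: str


def parse_ann(content: str) -> list[Entity]:
    """Parse the text-bound ('T') lines of a BRAT stand-off annotation file."""
    entities = []
    for line in content.splitlines():
        if not line.startswith("T"):
            continue  # skip relations, notes, and other annotation kinds
        ann_id, middle, text = line.split("\t", 2)
        label, offsets = middle.split(" ", 1)
        spans = [
            (int(part.split()[0]), int(part.split()[1]))
            for part in offsets.split(";")
        ]
        entities.append(Entity(ann_id, label, spans, text))
    return entities


# Made-up sample: one continuous and one discontinuous entity, plus a note line.
sample = (
    "T1\tCHEM 0 7\tTysabri\n"
    "#1\tAnnotatorNotes T1\tsome note\n"
    "T2\tANAT 0 5;12 18\tfirst second\n"
)
for entity in parse_ann(sample):
    print(entity)
```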

23

CLEF eHealth 2015-2016: information extraction [Névéol et al., 2015, 2016]

• Task: Automatically identify clinically relevant entities in medical text in French

• Objective: Establish the state-of-the-art for a language other than English on core biomedical NLP tasks:

• Named Entity Recognition (with embedded entities)

– Mention level: “diabète de type 2”, “DNID”, “diabète non insulino dépendant”

• Entity Normalization

– Concept level: for instance the three mentions above can be normalized to the same UMLS concept: C0011860
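The mention-to-concept step can be sketched as a lookup in a lexicon keyed on a normalized form of each term. The toy lexicon below contains only the three mentions from the slide. The `normalize` and `link` functions and the normalization rules (lower-casing, accent stripping, hyphen removal) are assumptions made for illustration.

```python
import unicodedata
from typing import Optional


def normalize(mention: str) -> str:
    """Lower-case, strip accents, and collapse hyphens and whitespace."""
    decomposed = unicodedata.normalize("NFD", mention.lower())
    stripped = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
    return " ".join(stripped.replace("-", " ").split())


# Toy lexicon built from the slide's example; a real one would come from the UMLS.
LEXICON = {
    normalize(term): "C0011860"
    for term in ("diabète de type 2", "DNID", "diabète non insulino dépendant")
}


def link(mention: str) -> Optional[str]:
    """Return the CUI for a mention, or None when the lexicon has no entry."""
    return LEXICON.get(normalize(mention))


print(link("Diabete non insulino-dépendant"))
print(link("asthme"))
```

Exact lookup after normalization handles spelling variants but not paraphrases, so it is a baseline rather than a full normalization method.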

24

Plain Entity Recognition

– Given: plain text

MEDLINE: La contraception par les dispositifs intra utérins

EMEA: Dans quel cas Tysabri est-il utilisé? Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ).

– Expected: plain entity annotations

[Figure: the MEDLINE and EMEA excerpts with entity spans marked]

25

Normalized Entity Recognition

– Given: plain text

MEDLINE: La contraception par les dispositifs intra utérins

EMEA: Dans quel cas Tysabri est-il utilisé? Tysabri est utilisé dans le traitement des adultes atteints de sclérose en plaques ( SEP ).

– Expected: normalized entity annotations

[Figure: the MEDLINE and EMEA excerpts with entity spans marked and linked to UMLS CUIs C1529600, C0087111, C0001675, C0026769, C0700589, C0344221, C0021900, C0042149]

26

Entity Normalization

– Given: plain entity annotations

[Figure: the MEDLINE and EMEA excerpts with entity spans marked]

– Expected: normalized entity annotations

[Figure: the same excerpts with entity spans linked to UMLS CUIs C1529600, C0087111, C0001675, C0026769, C0700589, C0344221, C0021900, C0042149]

27

Methods used

• For entity recognition (7 teams in 2015, 5 teams in 2016)

– Machine learning, e.g. (cascading) CRF

– Lexical matching

– sources: dictionary built from training data or existing terminologies

– matching method: n-gram, bag-of-words

– Machine translation + MetaMap

• For entity normalization (3 teams in 2015, 2 teams in 2016)

– Lexical matching

– Machine translation + MetaMap
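Lexical matching with n-grams can be sketched as a greedy longest-match lookup of token sequences in a dictionary. This is a generic illustration, not any participating team's system. The `find_mentions` function and the two-entry lexicon with its labels are assumptions.

```python
def find_mentions(tokens, lexicon, max_n=5):
    """Greedy longest-match lookup of token n-grams in a lexicon.

    Returns (start, end, label) triples over token indices, end exclusive.
    """
    matches = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n]).lower()
            if candidate in lexicon:
                matches.append((i, i + n, lexicon[candidate]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return matches


# Toy lexicon (assumed labels); a real one would be built from training data
# or from existing terminologies.
lexicon = {"tysabri": "CHEM", "sclérose en plaques": "DISO"}
sentence = (
    "Tysabri est utilisé dans le traitement des adultes "
    "atteints de sclérose en plaques"
)
print(find_mentions(sentence.split(), lexicon))
```

Because the search jumps past each match, this sketch cannot return embedded entities. Supporting them would require collecting all matching n-grams at every position.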

28

Evaluation metrics

• Precision, recall and F-measure

– $P = \frac{\text{true positives}}{\text{true positives} + \text{false positives}}$

– $R = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$

– $F = \frac{(1 + \beta^2) \times P \times R}{\beta^2 \times P + R}$

• Primary metric was exact match micro-averaged F1 (β = 1) over all entity types

– Inexact match results were generally higher but exhibited the same trend as exact match
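A minimal sketch of exact-match, micro-averaged scoring, assuming each annotation is a `(document, start, end, type)` tuple and that gold and predicted annotations are pooled over all entity types before counting. The `prf` function and the example annotations are illustrative, not the official evaluation tool.

```python
def prf(gold: set, predicted: set, beta: float = 1.0) -> tuple[float, float, float]:
    """Precision, recall and F-measure under exact match.

    An annotation counts as a true positive only if an identical tuple
    appears in both sets. Pooling all entity types gives the micro-average.
    """
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    f = (1 + beta**2) * p * r / (beta**2 * p + r)
    return p, r, f


# Made-up annotations: the second prediction has a wrong end offset,
# so it is a false positive under exact match.
gold = {("d1", 0, 7, "CHEM"), ("d1", 40, 47, "LIVB"), ("d1", 71, 89, "DISO")}
predicted = {("d1", 0, 7, "CHEM"), ("d1", 71, 80, "DISO")}
print(prf(gold, predicted))
```

An inexact-match variant would count the second prediction as correct because its span overlaps a gold span of the same type, which is why inexact scores tend to be higher.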

29

Results: plain entity recognition, EMEA (2016)

Zipf (P, R, F): 0.734 0.434 0.546

30

Results: plain entity recognition, MEDLINE (2016)

Zipf (P, R, F): 0.726 0.300 0.425

31

CépiDC corpus excerpt [Lavergne et al. 2016]

2013;2;85;4;1;DENUTRITION DESHYDRATATION; E46 E86

2013;2;85;4;2. DEMENCE MIXTE EVOLUEE (stade sévère); F03

2013;2;85;4;5. Maladie de Parkinson idiopathique Angioedème des membres sup récent non exploré par TDM (à priori pas de cause médicamenteuse); G200 R600

English glosses of the three lines:

Malnutrition dehydration

Advanced mixed dementia (late stage)

Idiopathic Parkinson Disease. Recent angioedema in upper extremities, no CT (unlikely to be drug induced).

Field callouts: coded in 2013; deceased was an 85 y. o. female; death occurred in a hospice or retirement home; death certificate line number; ICD10 codes
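A minimal sketch of reading one line of the excerpt, assuming the field order suggested by the slide's callouts: year of coding, gender, age, location of death, certificate line number, raw text, then space-separated ICD10 codes. The `CertificateLine` class, its field names, and this layout are assumptions and may not match the actual corpus files. Only the first excerpt line is used, because the other two do not show a separator after the line number.

```python
from dataclasses import dataclass


@dataclass
class CertificateLine:
    year_coded: int
    gender: int        # coded value, e.g. 2 on the slide for a female
    age: int
    location: int      # coded value, e.g. 4 on the slide
    line_number: int
    raw_text: str
    icd10_codes: list[str]


def parse_line(line: str) -> CertificateLine:
    """Split a semicolon-separated line into the assumed fields."""
    year, gender, age, location, line_number, rest = line.split(";", 5)
    # The codes follow the last semicolon; the raw text may itself contain ';'.
    raw_text, _, codes = rest.rpartition(";")
    return CertificateLine(
        int(year), int(gender), int(age), int(location),
        int(line_number), raw_text.strip(), codes.split(),
    )


record = parse_line("2013;2;85;4;1;DENUTRITION DESHYDRATATION; E46 E86")
print(record)
```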

32

Results: ICD10 coding, CépiDC

Zipf (P, R, F): 0.531 0.245 0.336

# systems   5       4       3       2       1       0
# codes     29,100  25,215  20,743  15,933  10,685  7,714

33

Lessons learned

• Text pre-processing is important

– (Formatting yielded technical problems in 2015)

• Using terminology information in more than one language was successful

– Terms translated from English helped

– Even extensive French resources were limited

34

Additional remarks

• 2 out of 7 teams were from non-French speaking countries

• This continues to be the only biomedical international challenge addressing a language other than English

– Tasks were challenging

– ICD10 coding yielded higher performance

• Both knowledge-based and machine learning methods showed potential for ICD10 coding


35/33

Acknowledgments

LIMSI

L. Campillos

L. Deléger

E. D’hondt

C. Grouin

T. Hamon

T. Lavergne

A-L. Ligozat

F. Morlane-Hondère

C. Rabary

X. Tannier

MD. Tapi-Nzali

P. Zweigenbaum

HEGP

A. Burgun

J-B Escudié

A-S Jannot

A-D. Pham

B. Rance

Harvard Children’s Hospital

G. Savova

P. Chen

CHU de Rouen

S-J Darmoni

N. Griffon

J. Grosjean