download

28
Department of Computer Science University of Bari Knowledge Acquisition & Machine Learning Lab CILC 2006 Convegno Italiano di Logica Computazionale 26-27 giugno 2006, Dipartimento di Informatica, Learning for Biomedical Information Extraction with ILP Margherita Berardi Vincenzo Giuliano Donato Malerba

description

 

Transcript of download

Page 1: download

Department of Computer Science

University of Bari

Knowledge Acquisition &Machine Learning Lab

CILC 2006Convegno Italiano di Logica

Computazionale26-27 giugno 2006, Dipartimento di Informatica, Bari

Learning for Biomedical

Information Extraction with ILP

Margherita Berardi Vincenzo GiulianoDonato Malerba

Page 2: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Outline of the talk

IE for Biomedicine Looking around IE problem formulation

which representation model on data? which features?

which framework for reasoning? Mutual Recursion in IE Text processing & domain

knowledge Application to studies on

mitochondrial genome Conclusions & Future work

Page 3: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

What is “Information Extraction”

Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATION

Page 4: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

What is “Information Extraction”

Filling slots in a database from sub-segments of text.As a task:

October 14, 2002, 4:00 a.m. PT

For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.

Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.

"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access.“

Richard Stallman, founder of the Free Software Foundation, countered saying…

NAME TITLE ORGANIZATIONBill Gates CEO MicrosoftBill Veghte VP MicrosoftRichard Stallman founder Free Soft..

IE

Page 5: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

IE from Biomedical Texts: Motivation

Complexity of biological systems: Too many specialized biological tasks Several entities interacting in a single phenomenon Many conditions to simultaneously verify

Complexity of biomedical languages: Several nomenclatures, dictionaries, lexica tending to quickly become obsolete

Too much to read!

Genome decoding increasing amount of published literature

Page 6: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

IE History Message Understanding Conference (MUC) DARPA [’87-’95],

TIPSTER [’92-’96] Most early work dominated by hand-built models

E.g. SRI’s FASTUS, hand-built FSMs. But by 1990’s, some machine learning: Lehnert, Cardie,

Grishman and then HMMs: Elkan [Leek ’97], BBN [Bikel et al ’98] Wrapper Induction: initially hand-build, then ML [Soderland ’96],

[Kushmeric ’97], … Most learning attempts based on statistical approaches

Learning of production rules constrained by probability measures (e.g., HMMs, Probabilistic Context-free Grammars)

Some recent logic-based approaches Rapier (Califf ’98) SRV (Freitag ’98) INTHELEX (Ferilli et al. ’01) FOIL-based (Aitken ’02) Aleph-based (Goadrich et al. ’04)

Page 7: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Learning Language in biomedicine

BioCreAtIvE - Critical Assessment for Information Extraction in Biology (http://biocreative.sourceforge.net/)

BioNLP, Natural language processing of biology text (http://www.bionlp.org)

ACL/COLING Workshops on Natural Language Processing in Biomedicine

SIGIR Workshops on Text Analysis for Bioinformatics Special Interest Group in Text Mining since ISMB’03 (Intelligent

Systems for Molecular Biology): BioLINK (Biology Literature, Information and Knowledge)

PSB (Pacific Symposium on Biocomputing) tracks Genomic tracks in TREC (Text Retrieval Conference) PASCAL challenges on information extraction

http://nlp.shef.ac.uk/pascal/ Workshops: IJCAI, ECAI, ECML/PKDD, ICML (Learning Language in

Logic since ’99, challenge task on Extracting Relations from Biomedical Texts)

Page 8: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Is there “Logic” in language learning? IE systems limitations, in general:

Portability (domain-dependent, task-dependent) Scalability (work well on “relevant” data)

Statistics-based approaches wide coverage, scalability, no semantics, no domain knowledge

Logic-based approaches: natural encoding of natural language statements and queries

in first-order logic, human-comprehensible models, domain knowledge refinement of models

[R. J. Mooney, Learning for Semantic Interpretation: Scaling Up Without Dumbing Down, ICML Workshop on Language Learning in Logic, 1999]

Page 9: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

IE problem formulation for HmtDB HmtDB resource of variability data associated to

clinical phenotypes concerning human mithocondrial genome

(http://www.hmdb.uniba.it/)

Page 10: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Textual Entity ExtractionEx: “Cytoplasts from two unrelated patients with MELAS (mitochondrial myopathy, encephalopathy, lactic acidosis, and strokelike episodes) harboring an A-*G transition at nucleotide position 3243 in the tRNALeU(UUR) gene of the mitochondrial genome were fused with human cells lacking endogenous mitochondrial DNA (mtDNA)”

pathology associated to the mutation under study, substitution that causes the mutation, type of the mutation, position in the DNA where the mutation occurs, gene correlated to the mutation.

By modelling the sentence structure:

substitution(X) follows (Y,X), type (Y)

Extractors cannot be learned independently!!!

Page 11: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Textual Entity Extraction Each entity is characterized by some slots defining a template

The task is to learn rules to fill slots (template filling)

Relations in data may allow:

intra-template dependencies to be learned

context-sensitive application of “extractors”

MutationSampled populationDNA sample tissueDNA screening method…

Title

Abstract

Introduction

Methods

Page 12: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

The learning task Classification

Each class (slot) is a concept (target predicate), each model (template filler) induced for the class is a logical theory explaining the concept (set of predicate definitions)

Predefined models of classification should be provided

Importance of domain knowledge and first-order representations

Usefulness of mutual recursion (concept dependencies)

ILP = Inductive Learning Logic Programming From IL: inductive reasoning from observations and

background knowledge From LP: first-order logic as representation formalism

Page 13: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

ATRE (Apprendimento di Teorie Ricorsive da Esempi)

http://www.di.uniba.it/~malerba/software/atre/

Given a set of concepts C1, C2, ... , Cr a set of objects O described in a language LO a background knowledge BK described in a

language LBK a language of hypotheses LH that defines the space

of hypotheses SH a user’s preference criterion PCFinda (possibly recursive) logical theory T for the

concepts C1, C2, ... , Cr , such that T is complete and consistent with respect to the set of observations and satisfies the preference criterion PC.

Page 14: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

ATRE Main Characteristics

Learning problem: induce recursive theories from examples

ILP setting: learning from interpretations Observation language: ground multiple-head clauses Hypothesis language: non-ground definite clauses Constraints: linkedness + range-restrictedness Generalization model: generalized implication Search strategy for a recursive theory: separate-and-

parallel-conquer Continuous and discrete attributes and relations Background knowledge: intensionally defined

Page 15: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Data preparation

ATRE’s observation language: multiple-head clauses

Enumeration of positive and negative examples (expert users manual annotations + unlabelled tokens)

Descriptions of examples: which features? Statistical (frequencies) Lexical (alphanumeric, capitalized, …) Syntactical (nouns, verbs, adjectives, …) Domain-specific (dictionaries)

Page 16: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Page 17: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Text processing The GATE (A General Architecture for Text

Engeneering) framework (http://gate.ac.uk/) ANNIE is the IE core:

Tokeniser Sentence Splitter POS tagger Morphological Analyser Gazetteers Semantic tagger (JAPE transducer) Orthomatcher (orthographic coreference)

Some domain specific gazetteers have been added (diseases, enzymes, genes, methods of analysis)

Page 18: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Text processing Some reg. expr. to capture some domain specific

patterns (alphanumeric strings, appositions, etc.)

Shallow acronym resolution

Screening operations: Some POSs (nouns, verbs, adjectives, numbers,

symbols) Punctuation stopwords (glimpse.cs.arizona.edu. )

Stemming (Porter)

Page 19: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Text description word_to_string(token)Numerical: lenght(token), word_frequency(token),

distance_word_category(token1,token2) Structural: s_part_of(token1,token2), first(token), last(token),

first_is_char(token), first_is_numeric(token), middle_is_char(token), middle_is_numeric(token), last_is_char(token), last_is_numeric(token), single_char(token), follows(token1,token2)

Lexical: type_of(token), type_POS(token)Domain dependent: word_category(token)

Page 20: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Application

We considered 71 documents selected by biologists

Expert users manually annotated occurrences of entities of interest, namelyMutation: position, type, substitution, type_position, locusSubjects: nationality, method, pathology, category,

number

The extraction process (both learning and recognition) is locally performed to text portions of interest, automatically classified

Page 21: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Textual portions of papers were categorized in five classes: Abstract, Introduction, Materials & Methods, Discussion and Results

The abstract of each paper was processed

0,00

10,00

20,00

30,00

40,00

50,00

60,00

70,00

80,00

90,00

100,00

Abstract Introduction Methods Results Discussion

Co

rrec

tly

clas

sifi

ed (

%)

Avg. No. of categories correctly classified

Page 22: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

An A-to-G mutation at nucleotide position (np) 3243 in the mitochondrial tRNALeu(UUR) gene is closely associated with various clinical phenotypes of diabetes mellitus.

[annotation(3)=substitution, annotation(4)=no_tag, annotation(5)=no_tag, annotation(6)=no_tag, annotation(7)=position, annotation(8)=no_tag, annotation(9)=locus, annotation(10)=no_tag, annotation(11)=no_tag, annotation(12)=no_tag, annotation(13)=no_tag, annotation(14)=no_tag, annotation(15)=no_tag, annotation(16)=pathology],

[part_of(1,2)=true, contain(2,3)=true, …, contain(2,16)=true,

word_to_string(3)=‘A-to-G', word_to_string(4)='mutation', word_to_string(5)='nucleotid', word_to_string(6)='position',word_to_string(7)='3243', word_to_string(8)='mitochondri', word_to_string(9)='trnaleu(uur)', word_to_string(10)='gene', word_to_string(11)='clos', word_to_string(12)='associat', word_to_string(13)='variou', word_to_string(14)='clinic', word_to_string(15)='phenotyp', word_to_string(16)='diabetes_mellitus', type_of(3)=upperinitial, …, type_of(7)=numeric, type_POS(3)=jj, type_POS(4)=nn, …, type_POS(15)=nns, word_frequency(3)=3, word_frequency(4)=6, …, word_frequency(16)=1, word_category(9)=locus, word_category(16)=disease, distance_word_category(9,16)=1, follows(3,4)=true, follows(4,5)=true,…, follows(14,15)=true, follows(15,16)=true]).

Example description

Page 23: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Background knowledge follows(X,Z) follows(X,Y)=true, follows(Y,Z)=true

char_number_char(X)=true first_is_char(X)=true, middle_is_numeric(X)=true, last_is_char(X)=true

number_char_char(X)=true first_is_numeric(X)=true, middle_is_char(X)=true, last_is_char(X)=true

char_char_number(X)=true first_is_char(X)=true, middle_is_char(X)=true, last_is_numeric(X)=true

Domain knowledge: word_to_string(X)=transition word_to_string(X)=transversion word_to_string(X)=substitution word_to_string(X)=replacement

Page 24: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Experiments Mutation template 6-fold cross validation The user manually annotates 355 tokens (8.65 per

abstract) About 11% positives

Page 25: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Experiments

Page 26: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Learned theoriesannotation(X1)=position follows(X2,X1)=true, type_of(X1)=numeric, follows(X1,X3)=true,

word_category(X3)=gene, word_to_string(X2)=position.

annotation(X1)=type follows(X1,X2)=true, word_frequency(X2) in [8..140],

follows(X3,X1)=true, annotation(X3)=substitution

annotation(X1)=position follows(X2,X1)=true, annotation(X2)=substitution, follows(X3,X1)=true,

follows(X1,X4)=true, word_frequency(X4) in [6..6], annotation(X3)=type, follows(X1,X5)=true, annotation(X5)=locus, word_frequency(X1) in [1..2]

Page 27: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Wrap-up IE in Biomedicine The ILP approach to IE within a multi-relational

framework allows to implicitly define Domain knowledge Learning from users’ interaction Relational representations Learning relational patterns to allow context-sensitive

application of models Recursive Theory Learning in IE: ATRE Efforts on text processing level:

Ambiguities Data sparseness Noise

Encouraging results on a real-world data set

Page 28: download

CILC 2006, 26-27 giugno 2006, Dipartimento di Informatica, Bari

Where from here? Test on available corpus for Bio IE

Genia BioCreative NLPBA Genic interaction challenges

Investigation of semisupervised approaches: online extension of dictionaries

How to encapsulate taxonomical knowledge? Can information extracted by ATRE be really used as

background knowledge for genomic database mining?