Kevin.cohen@gmail.com Research Opportunities in Biomedical Text Mining Kevin Bretonnel Cohen...

Post on 01-Jan-2016


kevin.cohen@gmail.com
http://compbio.ucdenver.edu/Hunter_lab/Cohen

Research Opportunities in Biomedical Text Mining

Kevin Bretonnel Cohen
Biomedical Text Mining Group Lead

Information extraction

•Also known as “relation extraction”
•Limited to one or a small number of types of facts
– Contrast information retrieval or question-answering

Information extraction: relationships between things

BINDING_EVENT
Binder:
Bound:

Information extraction

Met28 binds to DNA.

BINDING_EVENT
Binder: Met28
Bound: DNA
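A minimal sketch of the sentence-to-frame step, assuming a single hypothetical surface pattern ("X binds to Y"). The class name `BindingEvent` and the function `extract_binding` are illustrative only and are not taken from any system described in these slides.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class BindingEvent:
    """Frame with the two slots shown on the slide."""
    binder: str
    bound: str


# Hypothetical pattern: it covers only the literal wording "X binds to Y".
BINDS_TO = re.compile(r"(\w+) binds to (\w+)")


def extract_binding(sentence: str) -> Optional[BindingEvent]:
    """Fill a BindingEvent frame if the sentence matches the pattern."""
    m = BINDS_TO.search(sentence)
    if m is None:
        return None
    return BindingEvent(binder=m.group(1), bound=m.group(2))


print(extract_binding("Met28 binds to DNA."))
```

Any rewording of the sentence falls outside this one pattern, so the function returns `None` for it.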

Why text mining is difficult

•Variability

•Pervasive ambiguity at every level of analysis


Why text mining is difficult

Met28 binds to DNA
…binding of Met28 to DNA…
…Met28 and DNA bind…
…binding between Met28 and DNA…
…Met28 is sufficient to bind DNA…
…DNA bound by Met28…
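To make the cost of variability concrete, here is a rough sketch that needs one hand-written regular expression per surface variant listed above, all normalized to the same (binder, bound) pair. The pattern list and the name `extract` are assumptions for illustration. Assigning Met28 as binder in the symmetric variants ("Met28 and DNA bind") is also an assumption.

```python
import re

# One (pattern, (binder_group, bound_group)) entry per surface variant.
PATTERNS = [
    (re.compile(r"(\w+) binds to (\w+)"), (1, 2)),
    (re.compile(r"binding of (\w+) to (\w+)"), (1, 2)),
    (re.compile(r"(\w+) and (\w+) bind\b"), (1, 2)),
    (re.compile(r"binding between (\w+) and (\w+)"), (1, 2)),
    (re.compile(r"(\w+) is sufficient to bind (\w+)"), (1, 2)),
    (re.compile(r"(\w+) bound by (\w+)"), (2, 1)),  # passive: roles swap
]


def extract(text):
    """Return (binder, bound) from the first matching pattern, else None."""
    for regex, (binder_group, bound_group) in PATTERNS:
        m = regex.search(text)
        if m:
            return (m.group(binder_group), m.group(bound_group))
    return None


variants = [
    "Met28 binds to DNA",
    "binding of Met28 to DNA",
    "Met28 and DNA bind",
    "binding between Met28 and DNA",
    "Met28 is sufficient to bind DNA",
    "DNA bound by Met28",
]
for v in variants:
    print(v, "->", extract(v))
```

Six patterns cover six wordings, and any inserted modifier (for example "binding under unspecified conditions of Met28 to DNA") already escapes all of them.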

Why text mining is difficult

…binding of Met28 to DNA…
…binding under unspecified conditions of Met28 to DNA…
…binding of this translational variant of Met28 to DNA…
…binding of Met28 to upstream regions of DNA…

Why text mining is difficult

…binding under unspecified conditions of this translational variant of Met28 to upstream regions of DNA…


Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

NACT:
neoadjuvant chemotherapy (PMID 8898170)
N-acetyltransferase (PMID 10725313)
Na+-coupled citrate transporter (PMID 12177002)

Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

•(liver), (testis) and (brain in rat)

•liver, (testis and brain in rat)

•(liver, testis and brain in rat)

Why text mining is difficult

NaCT is expressed in liver, testis and brain in rat and shows preference for citrate over dicarboxylates… (GeneRIF 266998:12177002)

•shows preference for (citrate over dicarboxylates)

•shows preference (for citrate) (over dicarboxylates)

Why text mining is difficult

regulation of cell migration and proliferation (PMID …)

serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)

!proliferation and regulation of cell migration
!regulation of proliferation and cell migration
regulation of cell migration and regulation of cell proliferation

Why text mining is difficult

regulation of cell migration and proliferation (PMID …)

serine phosphorylation, translocation, and degradation of IRS-1 (PMID 16099428)

!degradation of IRS-1, translocation, and serine phosphorylation

!serine phosphorylation, serine translocation, and serine degradation (of IRS-1)

2.5 types of solutions

•Rule-based
– Patterns
– Grammars
•Statistical/machine learning
– Labelled training data
– Noisy training data
•Hybrid statistical/rule-based

Classic work in molecular biology information extraction: pattern-based

•Blaschke et al. (1999): The beginning of biologists working in BioNLP
– Gene names assumed to be known a priori
– Patterns assume two gene names and an “action word”: proteinA action_word proteinB
– Action words: acetylate, acetylates, acetylated, acetylation, etc.
– Not traditionally evaluated

Classic work in molecular biology information extraction: pattern-based

•Blaschke et al. (2002): Biologists begin to be aware of linguistics

•Proteins assumed to be known a priori

[proteins] (0-5) [verbs] (6-10) [proteins]

•(Why not 0-5 twice? Different weight of rule)

•P 0.45, R 0.40 (traditional evaluation)
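A small sketch of how a gap-constrained rule of this shape might be applied, assuming that "(0-5)" and "(6-10)" give the allowed number of intervening tokens between protein and verb and between verb and protein. That reading, the function name `match_gap_rule`, and the toy gene names are assumptions, not details confirmed by the slide.

```python
def match_gap_rule(tokens, proteins, verbs, gap1=(0, 5), gap2=(6, 10)):
    """Return (protein, verb, protein) triples whose token gaps fall
    inside the inclusive ranges gap1 and gap2."""
    hits = []
    for i, first in enumerate(tokens):
        if first not in proteins:
            continue
        for j in range(i + 1, len(tokens)):
            if tokens[j] not in verbs:
                continue
            if not gap1[0] <= j - i - 1 <= gap1[1]:
                continue
            for k in range(j + 1, len(tokens)):
                if tokens[k] in proteins and gap2[0] <= k - j - 1 <= gap2[1]:
                    hits.append((first, tokens[j], tokens[k]))
    return hits


# Toy sentence with hypothetical gene names (names assumed known a priori).
tokens = ("GeneA strongly acetylates in vitro under most "
          "conditions the product GeneB").split()
proteins = {"GeneA", "GeneB"}
verbs = {"acetylates"}

print(match_gap_rule(tokens, proteins, verbs))
print(match_gap_rule(tokens, proteins, verbs, gap2=(0, 5)))
```

Keeping separate rules for separate gap ranges lets each rule carry its own weight, which is one way to read the remark about different rule weights.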

The Colorado solution: OpenDMAP

Classic DMAP

•Direct Memory Access Parser (Riesbeck, 1986; Martin, 1991; Fitzgerald, 1995)
– Belonging to the conceptual parser family
– Going as directly as possible from lexical input to concepts in memory
– Mostly toy prototype implementations with no real evaluation

Slide from Zhiyong Lu

New Features in OpenDMAP

•Open Source
– Implemented in Java
– Available at www.sourceforge.net
•OpenDMAP patterns are
– Richer (capable of using external information such as protein names and linguistic analyses rather than just strings and concepts)
– More flexible in terms of concept ordering
•First time in biomedical domain
– Well constructed ontologies
– Open Biomedical Ontologies (e.g. Gene Ontology)

Slide from Zhiyong Lu

Frame-based Representations

•Common representation for ontologies
•A unique name that refers to a concept
•A list of attributes (slots) with admissible values
•Frame slots describe logical relations between frames

Concept: Protein Transport
Slots:
  [transported entity]: protein or molecular complex
  [transporting entity]: protein or molecular complex
  [transport origin]: cellular component
  [transport destination]: cellular component
Phrasal-patterns:
  [transported entity] translocation to [transport destination]

Slide from Zhiyong Lu
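The frame above can be written down as a plain data structure. This is a sketch only: the `Frame` class and the helper `slots_in_pattern` are hypothetical names, and OpenDMAP itself keeps its frames in Protégé rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Frame:
    name: str
    # slot name -> names of the concepts admissible as fillers
    slots: dict = field(default_factory=dict)
    phrasal_patterns: list = field(default_factory=list)


protein_transport = Frame(
    name="Protein Transport",
    slots={
        "transported entity": ["protein", "molecular complex"],
        "transporting entity": ["protein", "molecular complex"],
        "transport origin": ["cellular component"],
        "transport destination": ["cellular component"],
    },
    phrasal_patterns=[
        "[transported entity] translocation to [transport destination]",
    ],
)


def slots_in_pattern(frame, pattern):
    """List the frame's slots that a phrasal pattern refers to."""
    return [slot for slot in frame.slots if f"[{slot}]" in pattern]


print(slots_in_pattern(protein_transport, protein_transport.phrasal_patterns[0]))
```

The pattern mentions only two of the four slots, so a match on it yields an incomplete frame.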

Transport Frame in Protégé

Concept: Protein Transport
Slots:
  [transported entity]: protein or molecular complex
  [transporting entity]: protein or molecular complex
  [transport origin]: cellular component
  [transport destination]: cellular component
Phrasal-patterns:
  [transported entity] translocation to [transport destination]

Slide from Zhiyong Lu

Slots Define Logical Relations Between Concepts

Concept: Protein Transport
Slots:
  [transported entity]: protein or molecular complex
  [transporting entity]: protein or molecular complex
  [transport origin]: cellular component
  [transport destination]: cellular component
Phrasal-patterns:
  [transported entity] translocation to [transport destination]

Concept: protein
Phrasal-patterns: none

Concept: molecular complex
Phrasal-patterns: none

Concept: cellular component
Phrasal-patterns: none

Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear

Concept: mitochondrion
Phrasal-patterns: := mitochondria := mitochondrion := mitochondrial

Legend: Subsumption link, Relation link

Slide from Zhiyong Lu

Patterns for Cellular Locations

•Names and synonyms from Gene Ontology terms

•Linguistic variations

Concept: cellular component

Concept: mitochondrion
Phrasal-patterns: := mitochondrion := mitochondria := mitochondrial

Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear

Slide from Zhiyong Lu
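As an illustration of how such lexical patterns and subsumption links could be consulted together, here is a minimal sketch. The dictionaries hold only the two concepts shown on the slide, and the names `concept_for` and `is_a` are hypothetical.

```python
# Subsumption (is-a) links from each concept to its parent.
IS_A = {
    "nucleus": "cellular component",
    "mitochondrion": "cellular component",
}

# Surface forms (names plus linguistic variations) for each concept.
PHRASAL_PATTERNS = {
    "nucleus": ["nucleus", "nuclei", "nuclear"],
    "mitochondrion": ["mitochondrion", "mitochondria", "mitochondrial"],
}

# Reverse index: surface string -> concept name.
SURFACE_TO_CONCEPT = {
    form: concept
    for concept, forms in PHRASAL_PATTERNS.items()
    for form in forms
}


def concept_for(token):
    """Look up the concept a token refers to, or None if unknown."""
    return SURFACE_TO_CONCEPT.get(token.lower())


def is_a(concept, ancestor):
    """Follow subsumption links upward from concept looking for ancestor."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = IS_A.get(concept)
    return False


print(concept_for("mitochondria"), is_a("mitochondrion", "cellular component"))
```

Because "cellular component" has no phrasal patterns of its own, it can only be recognized through one of its more specific children.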


Pattern Matching Process

Input: Bax translocation to mitochondria

Concept: Protein Transport
Phrasal-patterns: [transported entity] translocation to [transport destination]

Concept: protein
Phrasal-patterns: none

Concept: molecular complex
Phrasal-patterns: none

Concept: cellular component
Phrasal-patterns: none

Concept: nucleus
Phrasal-patterns: := nucleus := nuclei := nuclear

Concept: mitochondrion
Phrasal-patterns: := mitochondria := mitochondrion := mitochondrial

Pattern states: Awaiting, Recognized, Active
Legend: Subsumption link, Relation link

Slide from Zhiyong Lu
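The end result of matching the example phrase can be sketched as follows. This is a simplified illustration, not OpenDMAP's algorithm: it scans four-token windows instead of tracking awaiting, recognized and active patterns, and it assumes that protein names (here only "Bax") are supplied by an upstream recognizer. All names in the code are hypothetical.

```python
# Surface form -> concept, for concepts subsumed by "cellular component".
CELLULAR_COMPONENTS = {
    "nucleus": "nucleus", "nuclei": "nucleus", "nuclear": "nucleus",
    "mitochondrion": "mitochondrion", "mitochondria": "mitochondrion",
    "mitochondrial": "mitochondrion",
}

# Assumption: protein mentions are identified before pattern matching.
PROTEINS = {"Bax"}


def match_transport(text):
    """Match '[transported entity] translocation to [transport destination]'."""
    tokens = text.split()
    for i in range(len(tokens) - 3):
        entity, word1, word2, destination = tokens[i:i + 4]
        if (entity in PROTEINS
                and (word1, word2) == ("translocation", "to")
                and destination in CELLULAR_COMPONENTS):
            return {
                "concept": "Protein Transport",
                "transported entity": entity,
                "transport destination": CELLULAR_COMPONENTS[destination],
            }
    return None


print(match_transport("Bax translocation to mitochondria"))
```

The destination slot is filled with the concept "mitochondrion" rather than the string "mitochondria", which is the point of going from lexical input to concepts in memory.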

New Features in OpenDMAP Patterns

Data demonstrate that TFII-I, through a Src-dependent mechanism, translocates reversibly from the cytoplasm to the nucleus, leading to the transcription activation of growth-regulated genes.

[transported entity dep:x] _ [action c-action-transport head:x]
(by the? [transporting entity])? @ (to the? [transport destination])
@ (from the? [transport origin])

_ wildcard
dep:x/head:x placement of linguistic constraints
? optional concept match
@ optional concept ordering + match

Slide from Zhiyong Lu

Other Transport Patterns

Pattern: [transport destination] [action c-action-transport] _ (of the? [transported entity])? (by the? [transporting entity])?
GeneRIF: … nuclear translocation of the NF-kappaB (p65/p50) heterodimers

Pattern: [transported entity dep:x]? _ [transport destination] [action c-action-transport head:x] (by the? [transporting entity])?
GeneRIF: … is sufficient to degrade the AHR and that nuclear translocation

Pattern: [transported entity] (is|are|was|were) [action c-action-transport-passive] @ (by the? [transporting entity]) @ (from the? [transport origin]) @ (to the? [transport destination])
GeneRIF: the YY1 factor is translocated to the cytoplasm …

Slide from Zhiyong Lu

Evaluation of information extraction (and many other NLP tasks)

•Standard paradigm: “corpus”—body of texts with “gold standard” answers marked

•“Weakly annotated” data: publications with metadata only

•Test suites—see previous lectures

Evaluation of NLP systems

•Precision (aka positive predictive value) and recall (aka sensitivity). Tradeoffs between them.
•Against a “gold standard” of human-generated representations of texts
– Humans don’t always agree, therefore calculate inter-annotator agreement
•Post-hoc judgments (particularly of IR relevance)
•“Shared task” paradigm
– TREC Genomics (IR)
– BioCreative (IE)

Evaluation of NLP systems

•Precision:
– True positives / (True positives + False positives)
•Recall:
– True positives / (True positives + False negatives)
•F-measure: “harmonic mean” of precision and recall

Evaluation of NLP systems

•Formal definition:

$$F_\beta = \frac{(1 + \beta^2) \cdot \text{precision} \cdot \text{recall}}{(\beta^2 \cdot \text{precision}) + \text{recall}}$$

•Typical definition: β = 1, so…

Evaluation of NLP systems

•Typical definition:

$$F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$

•…or just F: β is usually assumed to be 1

Evaluation of NLP systems

•β allows you to weight precision and recall differently
– Increasing β weights recall more highly
– Decreasing β weights precision more highly
•Rarely used, but designated by value of β, e.g. F0.5 or F2
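The definitions above translate directly into code. This is a minimal sketch with hypothetical function names. The counts in the example are invented, and the precision and recall values 0.75 and 0.49 are used purely as sample inputs.

```python
def precision(tp, fp):
    """True positives / (true positives + false positives)."""
    return tp / (tp + fp)


def recall(tp, fn):
    """True positives / (true positives + false negatives)."""
    return tp / (tp + fn)


def f_beta(p, r, beta=1.0):
    """F-measure as defined above; beta = 1 gives the usual F1."""
    b2 = beta * beta
    denominator = b2 * p + r
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + b2) * p * r / denominator


# Invented counts: 3 true positives, 1 false positive, 3 false negatives.
print(precision(3, 1), recall(3, 3))

# With the formula as written, beta = 2 pulls the score toward recall
# and beta = 0.5 pulls it toward precision.
for beta in (0.5, 1.0, 2.0):
    print(beta, round(f_beta(0.75, 0.49, beta), 2))
```

When precision exceeds recall, as in this example, the score falls as β grows, because the lower of the two values (recall) counts for more.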

OpenDMAP performance

•Performance of any rule-based information extraction system is a function of two things:
– Overall architecture and abilities of the system
– Quality of the rules

OpenDMAP performance

•Protein transport
– Complete frame filled: P 0.75, R 0.49, F 0.59
– Incomplete frames: P 0.75, R 0.67, F 0.71
– Gold standard gene names: complete frame P 0.77, R 0.67, F 0.72; incomplete frame P 0.75, R 0.85, F 0.81
•Cell-type-specific gene expression
– Without gold standard gene names: P 0.64, R 0.16, F 0.26
– With gold standard gene names: P 0.85, R 0.36, F 0.51

OpenDMAP performance

•Protein-protein interactions
– BioCreative II shared task: Placed 1st
•F 0.29, ten percent higher than #2 system, more than 3 standard deviations above the mean; similar recall to others, but precision of 0.39 more than 20% higher than #2 system
•BioCreative II.5:
– Another team placed 1st using OpenDMAP and a much larger set of (automatically learned) rules


OpenDMAP performance

•“Event” recognition
– E.g. phosphorylation, expression, binding, localization (weird definition of “event”)
– Ranked 19 out of 24 groups, but… had the highest precision (0.71-0.72)

Paths to improving OpenDMAP performance

•Increase recall
– Pattern learning? See Haibin’s lecture
•Increase precision
– Leverage what we know about biology
– Huge knowledge-base construction effort underway here over the course of the past two years

Adding even small amounts of knowledge to the system helps

•Livingston (2011): Gene activation task
•Original system: enzymes and substrates both allowed to be of type protein
•Enhancement:
– Gene Ontology annotations
– Potential enzymes must have annotation catalytic activity
– Potential substrates must have annotation receptor activity
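The enhancement amounts to a type check on candidate slot fillers. Here is a sketch under simplifying assumptions: the annotation table is a hand-made toy with hypothetical protein names, and annotations are compared as flat strings, whereas real Gene Ontology annotations are hierarchical (a more specific term would have to be propagated up to "catalytic activity").

```python
# Toy annotation table; protein names and annotations are invented.
GO_ANNOTATIONS = {
    "ProtA": {"catalytic activity"},
    "ProtB": {"receptor activity"},
    "ProtC": set(),
}


def plausible_activation(enzyme, substrate, annotations=GO_ANNOTATIONS):
    """Accept a candidate only if both fillers carry the required annotation."""
    return ("catalytic activity" in annotations.get(enzyme, set())
            and "receptor activity" in annotations.get(substrate, set()))


def filter_candidates(candidates, annotations=GO_ANNOTATIONS):
    """Keep the (enzyme, substrate) pairs that pass the annotation check."""
    return [pair for pair in candidates if plausible_activation(*pair, annotations)]


candidates = [("ProtA", "ProtB"), ("ProtB", "ProtA"), ("ProtA", "ProtC")]
print(filter_candidates(candidates))
```

A filter of this kind can only remove candidates, so it tends to raise precision at some cost in recall.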

Adding even small amounts of knowledge to the system helps

            Original   Added knowledge   Difference
Precision   0.16       0.36              0.20
Recall      0.24       0.18              -0.06
F-measure   0.19       0.24              0.05

Nominalization

•Nominalization: noun derived from a verb
– Verbal nominalization: activation, inhibition, induction
– Argument nominalization: activator, inhibitor, inducer, mutant

Nominalizations are dominant in biomedical texts

Predicate       Nominalization   All verb forms
Express         2,909            1,233
Develop         1,408            597
Analyze         1,565            364
Observe         185              809
Differentiate   737              166
Describe        10               621
Compare         185              668
Lose            556              74
Perform         86               599
Form            533              511

Data from CRAFT corpus

Relevant points for text mining

•Nominalizations are an obvious route for scaling up recall

•Nominalizations are more difficult to handle than verbs…

•…but can yield higher precision (Cohen et al. 2008)

Alternations of nominalizations: positions of arguments

•Any combination of the set of positions for each argument of a nominalization
– Pre-nominal: phenobarbital induction, trkA expression
– Post-nominal: increases of oxygen
– No argument present: Induction followed a slower kinetic…
– Noun-phrase-external: this enzyme can undergo activation

Result 1: attested alternations are extraordinarily diverse

•Inhibition, a 3-argument predicate (Arguments 0 and 1 only shown)

Implications for system-building

•Distinction between absent and noun-phrase-external arguments is crucial and difficult, and finite state approaches will not suffice; merging data from different clauses and sentences may be useful

•Pre-nominal arguments are undergoers by a ratio of 2.5:1

•For predicates with agent and patient, post/post and pre/post patterns predominate, but others are common as well

What can be done?

•External arguments:
– semantic role labelling approach
  •…but, very important to recognize the absent/external distinction, especially with machine learning
– pattern-based approach
  •…but, approaches to external arguments (RLIMS-P) are so far very predicate-specific

What can be done?

•Pre-nominal arguments:
– apply heuristic that we have identified based on distributional characteristics
– for most frequent nominalizations, manual encoding may be tractable

So, how do you do text mining?

Two approaches that are not coexisting peacefully

Two approaches to NLP

Knowledge-based Statistical/machine learning

First approach to NLP

•Rule-based

•AI, linguistics
•Ontologies
•Knowledge bases

•Patterns (regular, context-free…)

•Procedures

K-based: procedural

•Patterns (regular, context-free, …)

•Procedures

if (currentWordEndsWith-ing) {

if (previousWordIsThe) {

if (nextWordIsOf) {
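The nested tests above could be made runnable along these lines. The slide does not show what the rule concludes, so treating a match as a candidate noun-like "-ing" form (as in "the binding of") is an assumption, as is the function name.

```python
def ing_nominal_candidates(tokens):
    """Return indices of tokens that end in -ing, follow 'the',
    and precede 'of' (the three conditions tested on the slide)."""
    hits = []
    for i in range(1, len(tokens) - 1):
        current_word_ends_with_ing = tokens[i].endswith("ing")
        previous_word_is_the = tokens[i - 1].lower() == "the"
        next_word_is_of = tokens[i + 1].lower() == "of"
        if current_word_ends_with_ing and previous_word_is_the and next_word_is_of:
            hits.append(i)
    return hits


print(ing_nominal_candidates("the binding of Met28 to DNA".split()))
print(ing_nominal_candidates("Met28 is binding DNA".split()))
```

Procedures like this are easy to write one at a time, but each new condition has to be discovered and encoded by hand.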

K-based: regex

•Patterns (regular, context-free, …)

•Procedures

# Gene names such as Met28: letters, optional hyphen, one or more digits
$geneName = "[A-Za-z]+-?[0-9]+";
$input =~ /interaction of ($geneName) with ($geneName)/;
$interactionAssertion->setGene1($1);
$interactionAssertion->setGene2($2);
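For readers who prefer Python, a roughly equivalent sketch follows. It assumes the gene-name pattern should allow more than one trailing digit (so that "Met28" matches as a whole), and it returns a plain tuple instead of calling the `setGene1`/`setGene2` methods, whose class is not shown on the slide.

```python
import re

# Assumption: letters, an optional hyphen, then one or more digits.
GENE_NAME = r"[A-Za-z]+-?[0-9]+"
INTERACTION = re.compile(rf"interaction of ({GENE_NAME}) with ({GENE_NAME})")


def extract_interaction(text):
    """Return (gene1, gene2) if the text asserts an interaction, else None."""
    m = INTERACTION.search(text)
    if m is None:
        return None
    return (m.group(1), m.group(2))


print(extract_interaction("the interaction of Met28 with Cbf1 was observed"))
```

Checking for a failed match before reading the groups avoids reusing stale capture values, a common pitfall with the Perl `$1`/`$2` idiom.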

K-based: CFGs

•Patterns (regular, context-free, …)

•Procedures

NounPhrase -> NounPhrase+ Conjunction NounPhrase

NounPhrase -> Predeterminer Determiner+ Adjective+ Noun

Knowledge-based approaches

Why they work

•Patterns are real
– Psychologically
– Formally adequate (mostly)

•Intuition works

•No need for training data

Knowledge-based approaches

Why they’re hard

•Knowledge takes time to get

•Process of developing large rule sets can be slow
– Consider English syntax…

Second approach to NLP

•Mosteller & Wallace

•Bayesian

•Other machine learning techniques

Statistical/ML approaches

•Frame the NLP task as a series of classification problems
– Which POS is this?
– Which word meaning?
– Which phrasal grouping?

Statistical approaches

Why they work

•Statistics can be proxy for knowledge

•Some interesting stuff is frequent enough to be tractable

Knowledge-based or statistical: what to do??

Knowledge-based vs. statistical approaches

•Pragmatic answer #1: if you must pick one...
– Is it cheaper to label more training data, or to put time into developing patterns?

Knowledge-based vs. statistical approaches

•Researcher’s answer:
– Use one as the baseline for the other

Knowledge-based vs. statistical approaches

•Pragmatic answer #2: combine them
– Do both together/iteratively
– Statistical solution first, then rule-based post-processing

the 2.5th approach

“Natural language processing is never pure and rarely simple.”

Which works better?

Pestian et al. (2007)

A rapprochement

Conceptual features for information retrieval

•Task: retrieve sentences that contain mentions of mutations.

•Keyword approach: 1,092

•Recognize mutation mentions: additional 2,171

Conceptual features in document classification

Caporaso et al. (2005)


Untapped conceptual types

Malignancies (F = 0.84)

Jin et al. (2006)

Mouse strains

•CAST/EiJ

•C57BL

•SJL/J

•SEG

•C3H/He

•RIII

• DBA/1

Caporaso et al. (2005)

Mutations

•Ala64->Gly

•Ala64Gly

•A376G

Caporaso et al. (2007)
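The three mention formats listed above can be recognized with a single regular expression. This is a rough sketch and not the pattern set of Caporaso et al. (2007). The expression and the name `find_mutations` are illustrative, and the one-letter form will also match other letter-digit-letter strings, such as some cell-line names.

```python
import re

# Three-letter and one-letter amino acid codes.
AA3 = ("Ala|Arg|Asn|Asp|Cys|Gln|Glu|Gly|His|Ile|"
       "Leu|Lys|Met|Phe|Pro|Ser|Thr|Trp|Tyr|Val")
AA1 = "ACDEFGHIKLMNPQRSTVWY"

MUTATION = re.compile(
    rf"\b(?:(?:{AA3})\d+(?:->)?(?:{AA3})"  # Ala64->Gly or Ala64Gly
    rf"|[{AA1}]\d+[{AA1}])\b"              # A376G
)


def find_mutations(text):
    """Return every substring that looks like a point-mutation mention."""
    return MUTATION.findall(text)


print(find_mutations("The Ala64->Gly, Ala64Gly and A376G variants were tested."))
```

A keyword search for "mutation" would miss all three mentions in this sentence, which is why recognizing the concept directly adds retrieved sentences.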

Point/Counterpoint

Contradictory findings

•TREC 2003: “...searching in the MeSH and substance name fields, along with filtering for species, accounted for the best performance” (Hersh and Bhupatiraju 2003, Caporaso et al. 2005)

•TREC 2004: “Approaches that attempted to map to controlled vocabulary terms did not fare as well” (Hersh et al. 2004)

Understanding the TREC 2004 results

• Poor choice of concepts
– MeSH terms only, which is known to have problems even if manually indexed
• “Conceptual” systems weren’t very good (or didn’t try very hard) at concept recognition
– Even synonymy not detected well (1 case)
– Methods not described, so presumably not a focus of the work (2 cases)
• Hersh et al. (2004) overstate role of concepts in these systems
– Synonym source only (1 case)
– Only one of several features (1 case)

I’m convinced in theory, but will it scale?

•Jin et al. (2006): for malignancy mentions, relatively small amount of training data sufficed

•Caporaso et al. (2007): mutation patterns were learnable with small person-hour investment

Conclusion

•Statistical and conceptual approaches to text mining can coëxist peacefully
– Statistical and rule-based concept recognizers can work well
– Concepts are good features for statistical systems