TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science...

65
TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester, UK Department of Computer Science School of Information Science and Technology University of Tokyo, JAPAN

Transcript of TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science...

Page 1: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

TM and NLP for BiologyResearch Issues in HPSG Parsing

Junichi TSUJII

School of Computer ScienceNational Centre for Text Mining

University of Manchester, UK

Department of Computer ScienceSchool of Information Science and Technology

University of Tokyo, JAPAN

Page 2: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

2

Increments

: accumulation

Increase in Medline

2002

2000

1998

199219941996

1990

1988

1980198219841986

1978

1970197219741976

1968

1966

1964

0

100,000

200,000

300,000

400,000

500,000

600,000

incr

emen

ts

0

2,000,000

4,000,000

6,000,000

8,000,000

10,000,000

12,000,000

14,000,000

acc

um

ula

tio

n

G-protein coupled receptor

Before 19889 papers

1992256 papers2005

14,000 papers

MEDLINE alone

More than 0.5 million per year More than 1.3 thousand per day

Articles added

Medline Access

1997: 0.163 M accesses/month2006: 82.027 M accesses/month

[D.L.Banville 2006]

500 times more

Page 3: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

3

NaCTeMwww.nactem.ac.uk

• First such centre in the world • Funding: JISC, BBSRC, EPSRC• Consortium investment

• Chair in TM (Prof. J. Tsujii, Univ. Tokyo)

• Location: Manchester Interdisciplinary Biocentre (MIB) www.mib.ac.uk funded by the Wellcome Trust

• Initial focus: biomedical academic community• Extend services to industry• Extend focus to other domains (social

sciences)

Page 4: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

4

Consortium

• Universities of Manchester, Liverpool• Service activity run by MIMAS (National

Centre for Dataset Services), within MC (Manchester Computing)

• Self-funded partners– San Diego Supercomputing Center – University of California, Berkeley – University of Geneva – University of Tokyo

• Strong industrial & academic support– IBM, AZ, EBI, Wellcome Trust, Sanger Institute,

Unilever, NowGEN, MerseyBio, …

Page 5: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

5

Page 6: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

6

Page 7: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

7

Page 8: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

8

Page 9: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

9

Page 10: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

10

Page 11: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

11

NLP and TM

Text Mining

Text as a bag of words

Words as surface strings

Natural Language Processing

Language as a complex system linking surfacestrings of characters with their meanings Text and words as structured objects

NLP-based TM

Linking text with knowledge

Page 12: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

12

Non-Trivial Mappings

Language Domain Knowledge Domain

Concepts and Relationships among Them

Linguistic expressions

Motivated Independently of language

TerminologyParsingParaphrasing

From surface diversities and ambiguities to

conceptual invariants

Page 13: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

13

Example

Page 14: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

14

Non-trivial Mapping

Language Domain Knowledge Domain 

Independently motivated of Language

Same relationswith differentStructures

Full-strength Straufen protein lacking this insertion is able to assocaite with osker mRNA and activate its translation, but fails to …..

[A] protein activates [B] (Pathway extraction)

Since ……., we postulate that only phosphorylated PHO2 protein could activate the transcription of PHO5 gene.

Transcription initiation by the sigma(54)-RNA polymerase holoenzyme requires an enhancer-binding protein that is thought to contact sigma(54) to activate transcription.

[sentence] > ([arg1_activate] > [protein])Retrieval usingRegional Algebra

Page 15: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

15

Predicate-argument structureParser based on Probabilistic HPSG (Enju)

S

p53 has been shown to directly activate the Bcl-2 protein

NP

VP

ADVP

S

VP

VP

VP

NP arg1arg2

arg2

arg3

Page 16: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

16

述語 /項構造確率HPSG解析器 (Enju) の出力

The protein is activated by it

DT NN VBZ VBN IN PRP

dt np vp vp pp np

np pp

vp

vp

s

arg1arg2mod

Semantic Retrieval SystemUsing Deep Syntax

MEDIE

Passive

Passive and Infinitival Clause

Page 17: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

17

Page 18: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

18

Page 19: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

19

Page 20: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

20

Page 21: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

21

Page 22: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

22

Page 23: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

23

Page 24: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

24

Page 25: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

25

Page 27: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

27

Predicate-argument structureParser based on Probabilistic HPSG (Enju)

S

p53 has been shown to directly activate the Bcl-2 protein

NP

VP

ADVP

S

VP

VP

VP

NP arg1arg2

arg2

arg3

Page 28: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

28

Page 29: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

29

Page 30: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

30

Page 31: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

31

Penn Treebank GENIA

Coverage 99.7% 99.2%

F-Value (PArelations) 87.4% 86.4%

Sentence Precison 39.2% 31.8%

Processing Time 0.68sec 1.00sec

Performance of Semantic Parser

Page 32: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

32

Scalability of TM Tools

The number of papers 14,792,890

The number of abstracts 7,434,879

The number of sentences 70,815,480

The number of words 1,418,949,650

Compressed data size 3.2GB

Uncompressed data size 10GB

Target Corpus: MEDLINE corpus

Suppose, for example, that it

takes one second for parsing one

sentence….70 million seconds, that is, about 2 years

Page 33: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

33

TM and GRID

• Solution– The entire MEDLINE were parsed by

distributed PC clusters consisting of 340 CPUs

– Parallel processing was managed by grid platform GXP [Taura2004]

• Experiments– The entire MEDLINE was parsed in 8 days

• Output– Syntactic parse trees and predicate

argument structures in XML format– The data sizes of compressed/uncompressed

output were 42.5GB/260GB.

Page 34: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

34

Efficient Parsing for HPSG

Page 35: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

35

Background: HPSG• Head-Driven Phrase Structure Grammar (HPSG) [Pollard and Sag, 1994]

– Lexicalized and Constraints-based Grammar–A few Rule Schema General constraints on linguistic constructions

–Constraints embedded in Lexicon Word-Specific Constraints

–Constraints between phrase structures and semantic structures

Page 36: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

36

I like it

Parsing by HPSG

Page 37: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

37

HEAD nounSUBJ < >COMPS < >

I

HEAD nounSUBJ < >COMPS < >

it

HEAD verbSUBJ COMPS

like

<NP><NP>

Parsing by HPSG

Assignment of Lexical Entries

Page 38: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

38

HEAD nounSUBJ < >COMPS < >

I

HEAD verbSUBJ COMPS

HEAD nounSUBJ < >COMPS < >

like it

1< >

2< >2

HEAD verb

SUBJ

COMPS < >

HEAD nounSUBJ < >COMPS < >

1< >

Head-Complement

Application of

Rule Schema

Page 39: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

39

HEAD nounSUBJ < >COMPS < >

HEAD nounSUBJ < >COMPS < >

I

HEAD verbSUBJ COMPS

like it

1< >2< >

2

HEAD verbSUBJ < >COMPS < >

1< >HEAD verbSUBJ COMPS < >

1

Subject-Head

Application of

Rule Schema

Page 40: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

40

Inefficiency of HPSG Parsing

• Complex DAG : Typed-feature structures– Abstract machine for Unification (LiLFeS)

• Unification: Expensive Operation (⇔ CFG Approximation: CFG Filtering )

• Assignment of Lexical Entries– High reduction of search space / Super

tagging

Page 41: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

41

Filtering with CFG (1/5)• 2-phased parsing

– Approximate HPSG with CFG with keeping important constraints.

– Obtained CFG might over-generate, but can be used in filtering.

– Rewriting in CFG is far less expensive than that of application of rule schemata, principles and so on.

CompileHPSG CFGFeature

Structures

Input Sentences

Built-in CFG Parser

LiLFeS UnificationParsing

+

Output

Complete parse trees

Page 42: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

47

System Overview

I like it

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

HEAD nounSUBJ < >

COMPS < >

P High

Supertagger

I like itInputsentence

CFG Filtering

I like it

HEAD nounSUBJ < >

COMPS < >

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD nounSUBJ < >

COMPS < >

I like it

HEAD nounSUBJ < >

COMPS < >

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD nounSUBJ < >

COMPS < >

I like it

HEAD nounSUBJ < >

COMPS < >

HEAD verbSUBJ <NP>

COMPS <NP>

HEAD nounSUBJ < >

COMPS < >

...

Deterministic Shift/Reduce Parser

I like it

Page 43: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

Experiment Results

LP(%) LR(%) F1(%) Avg. timeStaged/Deterministic model

86.93 86.47 86.70 30ms/snt

Previous method 1( Supertagger+ChartParser)

87.35 86.29 86.81 183ms/snt

Previous method 2( Unigram + ChartParser )

84.96 84.25 84.60 674ms/snt

6 times faster20 times faster than the initial model

Page 44: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

49

Domain/Text Type Adaptation

Page 45: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

50

F-score Training Time ( Sec )

Baseline ( PTB-trained, PTB-applied) 89.81 0

Baseline (PTB-trained, GENIA-applied) 86.39 0

Retraining ( GENIA ) 88.45 14,695

Retraining ( PTB+GENIA) ) 89.94 238,576

Structure with RefDist 88.18 21,833

Lexical with RefDist 89.04 12,957

Lex/Structure with RefDist 90.15 31,637

Page 46: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

51

Adaptation with Reference Distribution

)(

)|()|(

,)|()|(1

)|(

w ww

ww

l

lw

Tt wsyniilex

wsyniilexE

i

i

tqwlpZ

tqwlpZ

tp

Lexical Assignment Syntactic Preference

Original model

j

jjs

stgZ

stpM )|(exp1

)|(

Feature function

Feature weight)|(0 stp

Page 47: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

52

83

84

85

86

87

88

89

90

0 2000 4000 6000 8000

Number of Sentence of the GENIA Training Set

F-s

core

Baseline (PTB)

Simple Retraining ( GENIA)

Retraining (GENIA+PTB)Structure with Ref.DistLexical with RefDist

Lexical/Structure woth RefDist

Page 48: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

53

83

84

85

86

87

88

89

90

0 10000 20000 30000

Training Time ( Sec )

F- s

core

Retrinaing(GENIA)

Structure with RefDistLexicon woth RefDist

Lex/Str with RefDist

Page 49: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

54

F-score Training Time ( Sec )

Baseline ( PTB-trained, PTB-applied) 89.81 0

Baseline (PTB-trained, GENIA-applied) 86.39 0

Retraining ( GENIA ) 88.45 14,695

Retraining ( PTB+GENIA) ) 89.94 238,576

Structure with RefDist 88.18 21,833

Lexical with RefDist 89.04 12,957

Lex/Structure with RefDist 90.15 31,637

Page 50: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

55

Tool1: POS Tagger

• General-Purpose POS taggers, trained by WSJ– Brill’s tagger, TnT tagger, MX POST, etc. – 97%

• General-Purpose POS taggers do not work well for MEDLINE abstracts

The peri-kappa B site mediates human immunodeficiency DT NN NN NN VBZ JJ NNvirus type 2 enhancer activation in monocytes … NN NN CD NN NN IN NNS

Page 51: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

56

Errors seen in TnT tagger (Brants 2000)

A chromosomal translocation in … DT JJ NN IN… and membrane potential after mitogen binding. CC NN NN IN NN JJ… two factors, which bind to the same kappa B enhancers… CD NNS WDT NN TO DT JJ NN NN NNS … by analysing the Ag amino acid sequence. IN VBG DT VBG JJ NN NN… to contain more T-cell determinants than … TO VB RBR JJ NNS IN Stimulation of interferon beta gene transcription in vitro by NN IN JJ JJ NN NN IN NN IN

Page 52: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

57

Performance of GENIA Tagger

Training corpus WSJ GENIA

WSJ 97.0 84.3

GENIA 75.2 98.1

WSJ+GENIA 96.9 98.1

Training corpus

WSJ GENIA

WSJ 96.7 84.3

GENIA 80.1 97.9

WSJ+GENIA 96.5 97.5

• GENIA tagger (Ref.) TnT tagger

No degradation of the taggertrained by the mixed corpus

Some degradations (0.2 ~ 0.4) were observed, compared withthe taggers trained by “pure” corpora

Page 53: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

58

CRF-based POS + Active LearningGENIA

3,000 sentences : 98.420,000 sentences: 98.58

Page 54: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

59

10,000 sentences: 96.76Best Performance: 97.18

CRF-based POS + Active LearningPTB

Page 55: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

60

Applications

Page 56: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

GENIA Event Annotation - example

LinkCauseLinkCause

– For an identified event in the given sentence,• classify the type of events and record the text span giving the clue of it (ClueType).• identify the theme of the events and record the text span linking the theme to the event

(LinkTheme).• identify the cause of the events and record the text span linking the cause to the event

(LinkCause).

• record the environment (location, time) of the events (ClueLoc, ClueTime).

LinkThemeLinkTheme

ClueLocClueLoc

ClueTypeClueType

ClueTypeClueType

Page 57: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

Gene_expression• Theme patterns observed (2,958)

– Protein 2,308– DNA      591– RNA      25– Peptide 4– Protein Protein 2– Erroneous 27

• Keywords– coexpress, nonexpress, overexpress,

express, biosynthesis, product, synthesize, constitute, …

coexpression

Page 58: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

Transcription

• Theme patterns observed (929)– DNA      449– RNA     272– Protein 167– Peptide 2– Erroneous22

• Keyword– Transcrib, transcript, synthesi, express,

Page 59: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

Localization• Theme patterns observed

(730)– Protein 608– Lipid 31– Atom 29– Other_organic_compound

14– DNA 12– Virus 5– Carbohydrate 5– RNA 4– Inorganic 4– Peptide 3• Keywords

– Translocation, sectetion, release, localization, mobilization, uptake, secrete, import, transport, translocate, sequester, influx, mograte, localisation, move, delivery, export, …

• ClueLoc

– NONE 241

– nuclear 140

– to the nucleus 12

– into the nucleus 11

– Cytoplasmic 8

– in the cytoplasm 7

– macrophages 5

– nuclear … in t lymphocytes4

– monocytes 4

– in the nucleus 4

– in the cytosol 4

– in colostrum 4

– from the cytoplasm to the nucleus 4

Page 60: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

Localization• Keywords and Locations

– translocation (166)• nuclear 108• NONE

38• …

– secretion (100)• NONE

57• name_of_cells 43

– release (80)• NONE

51• name_of_cells 19• …

– localization (30)• nuclear 25• intracellular 3

– uptake (24)• NONE 14• name_of_cells 20

• Keywords and Themes– translocation (166)

• Protein 161• Virus

4• RNA

1– secretion (100)

• Protein 98• Lipid

1• Peptide 1

– release (80)• Protein 67• Other_organic_compoun 6• Lipid

3– localization (30)

• Protein 30– uptake (24)

• Lipid15

• Carbohydrate 5• Protein 4

Page 61: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

69

Future Plan

Kitano’s group, Kell’s group

Page 62: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

70

Page 63: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

71

Page 64: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

72

Future Directions

• Domain Adaptation + Inter-operability– High performance can be obtained by using domain specific

characteristics and domain semantics– Differences among abstracts, full papers, comments in DBs

– Standardized Interfaces (API) of NLP tools

• Text Archives – Abstracts + Full Papers + Comments/Summary Descriptions

in DBs

• Combining NLP tools with Mining tools – Knowledge Discovery (Disease Gene Association)– Hypotheses Generation– Automatic Data Interpretation

Page 65: TM and NLP for Biology Research Issues in HPSG Parsing Junichi TSUJII School of Computer Science National Centre for Text Mining University of Manchester,

73

Future Directions

• Domain Adaptation + Inter-operability– High performance can be obtained by using domain specific

characteristics and domain semantics– Differences among abstracts, full papers, comments in DBs

– Standardized Interfaces (API) of NLP tools

• Text Archives – Abstracts + Full Papers + Comments/Summary Descriptions

in DBs

• Combining NLP tools with Mining tools – Knowledge Discovery (Disease Gene Association)– Hypotheses Generation– Automatic Data Interpretation