Hierarchical Text Categorization and its Application to Bioinformatics


Stan Matwin and Svetlana Kiritchenko

joint work with Fazel Famili (NRC), and

Richard Nock (Université Antilles-Guyane)

School of Information Technology and Engineering

University of Ottawa

Outline

• What is hierarchical text categorization (HTC)
• Functional gene annotation requires HTC
• Ensemble-based learning and AdaBoost
• Multi-class multi-label AdaBoost
• Generalized local hierarchical learning method
• New global hierarchical learning algorithm
• New hierarchical evaluation measure
• Application to Bioinformatics

Text categorization

• Given: dj ∈ D - textual documents

C = {c1, …, c|C|} – predefined categories

• Task: <dj, ci> ∈ D × C → {True, False}

[Figure: categories c1 to c7 (TC)]

Hierarchical text categorization

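• Hierarchy of categories: ≤ ⊆ C × C - a reflexive, anti-symmetric, transitive binary relation on C

[Figure: the categories arranged in a tree, with c1 at the root, c2 and c3 below it, and c4 to c7 at the leaves (HTC)]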

Advantages of HTC

• Additional, potentially valuable information
  – Relationships between categories

• Flexibility
  – High levels: general topics
  – Low levels: more detail


Text classification and bioinformatics

• Clustering and classification of gene expression data
  – DNA chip time series: performance data
  – Gene function, process, …: genetic knowledge (GO)
  – Literature will connect the two: domain knowledge

• Validation of results from performance data

Example: Gene Ontology


From data to knowledge via literature

• Functional annotation of genes from biomedical literature


Other applications

• Web directories

• Digital libraries

• Patent databases

• Biological ontologies

• Email folders


Boosting

• Not a learning technique on its own, but a method in which a family of “weak” learners (simple learners) is combined

• Based on the fact that multiple classifiers that disagree with one another can together be more accurate than any of the component classifiers

• If there are L classifiers, each with an error rate < 1/2, and the errors are independent, then the probability that the majority vote is wrong is the area under the binomial distribution for more than L/2 wrong hypotheses
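The binomial argument for majority voting can be checked numerically. The following is a minimal sketch, not part of the original slides: it sums the binomial tail for more than L/2 wrong votes. The function name `majority_vote_error` and the example error rate of 0.3 are assumptions chosen for illustration.

```python
from math import comb


def majority_vote_error(n_classifiers: int, error_rate: float) -> float:
    """Probability that more than half of n independent classifiers are wrong."""
    smallest_wrong_majority = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * error_rate**k
        * (1 - error_rate) ** (n_classifiers - k)
        for k in range(smallest_wrong_majority, n_classifiers + 1)
    )


# With a per-classifier error rate of 0.3, larger committees err less often.
for n in (1, 5, 21):
    print(n, majority_vote_error(n, 0.3))
```

Under the independence assumption the error of the vote shrinks as the committee grows, which is the motivation for ensembles; real classifiers are rarely fully independent, so the figures are optimistic.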

Why do we have committees (ensembles)?


Boosting – the very idea

• Train an ensemble of classifiers, sequentially

• Each next classifier focuses more on the training instances on which the previous one has made a mistake

• The “focusing” is done through the weighting of the training instances

• To classify a new instance, make the ensemble vote

Boosting - properties

• If each hl is only better than chance, boosting can attain ANY accuracy!

• No need for new examples, additional knowledge, etc.

• The original AdaBoost works on single-labeled data


AdaBoost.MH [Schapire and Singer, 1999]

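• (di, Ci) → ((di, l), Ci[l]), l ∈ C

• Initialize distribution P1(i,l) = 1/(mk) (m documents, k categories)

• For t = 1, …, T:
  – Train weak learner using distribution Pt
  – Get weak hypothesis ht: D × C → ℝ

• Update (as in Schapire and Singer, 1999):

$P_{t+1}(i,l) = \frac{P_t(i,l)\,\exp\left(-\alpha_t\, C_i[l]\, h_t(d_i,l)\right)}{Z_t}$, where $Z_t$ normalizes $P_{t+1}$ to a distribution

• The final hypothesis:

$H(d,l) = \mathrm{sign}\left(\sum_{t=1}^{T} \alpha_t\, h_t(d,l)\right)$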
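The reweighting step of AdaBoost.MH can be sketched in a few lines. This is an illustrative sketch of the standard update from Schapire and Singer (1999), not the authors' code: each (document, category) pair is multiplied by exp(-alpha * C_i[l] * h_t(d_i, l)) and the result is renormalized. The function name `update_distribution`, the default `alpha=1.0`, and the toy data are assumptions.

```python
import math


def update_distribution(P, labels, predictions, alpha=1.0):
    """One AdaBoost.MH reweighting step over (document, category) pairs.

    P[i][l]           current weight of the pair (d_i, l)
    labels[i][l]      +1 if category l applies to d_i, else -1 (C_i[l])
    predictions[i][l] real-valued weak-hypothesis output h_t(d_i, l)
    """
    unnormalized = [
        [p * math.exp(-alpha * y * h) for p, y, h in zip(p_row, y_row, h_row)]
        for p_row, y_row, h_row in zip(P, labels, predictions)
    ]
    z = sum(sum(row) for row in unnormalized)  # normalization factor Z_t
    return [[w / z for w in row] for row in unnormalized]


# Toy run: m = 2 documents, k = 2 categories, uniform start P1 = 1/(mk).
m, k = 2, 2
P1 = [[1 / (m * k)] * k for _ in range(m)]
labels = [[+1, -1], [-1, +1]]
predictions = [[+1, -1], [+1, +1]]  # only the pair (d_2, first category) is wrong
P2 = update_distribution(P1, labels, predictions)
print(P2)
```

In this toy run the single misclassified pair ends up with the largest weight, so the next weak learner concentrates on it.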

BoosTexter [Schapire and Singer, 2000]

• “Weak” learner: decision stump

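word w

occurs:

$q_{1l} = \frac{1}{2} \ln \frac{\sum_{i:\, w \in d_i,\, C_i[l] = 1} P_t(i,l)}{\sum_{i:\, w \in d_i,\, C_i[l] = -1} P_t(i,l)}$

doesn’t occur:

$q_{0l} = \frac{1}{2} \ln \frac{\sum_{i:\, w \notin d_i,\, C_i[l] = 1} P_t(i,l)}{\sum_{i:\, w \notin d_i,\, C_i[l] = -1} P_t(i,l)}$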
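A decision stump of this kind can be sketched as follows. The code is a minimal illustration, not BoosTexter itself: for one candidate word it accumulates the weight of positive and negative (document, category) pairs separately for documents that contain the word and documents that do not, then takes half the log-ratio. The function name `stump_confidences`, the smoothing constant `eps` (added here only to avoid division by zero), and the toy documents are assumptions.

```python
import math


def stump_confidences(docs, labels, P, word, n_categories, eps=1e-6):
    """Return (q0, q1): per-category outputs when `word` is absent / present."""
    # W[j][l][s]: total weight of pairs with word presence j, category l,
    # and label sign s (index 0 for -1, index 1 for +1).
    W = [[[0.0, 0.0] for _ in range(n_categories)] for _ in range(2)]
    for i, doc in enumerate(docs):
        j = 1 if word in doc else 0
        for l in range(n_categories):
            s = 1 if labels[i][l] == +1 else 0
            W[j][l][s] += P[i][l]
    q = [
        [0.5 * math.log((W[j][l][1] + eps) / (W[j][l][0] + eps))
         for l in range(n_categories)]
        for j in range(2)
    ]
    return q[0], q[1]


# Toy data: categories are [hockey, football]; weights are uniform.
docs = [{"nhl", "goal"}, {"touchdown"}, {"nhl"}]
labels = [[+1, -1], [-1, +1], [+1, -1]]
P = [[1 / 6, 1 / 6] for _ in docs]
q0, q1 = stump_confidences(docs, labels, P, "nhl", n_categories=2)
print(q0, q1)
```

The sign of each output is the predicted label and its magnitude acts as a confidence, which is what the thresholds discussed next operate on.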

Thresholds for AdaBoost

• AdaBoost often underestimates its confidences

• 3 approaches to selecting better thresholds
  – single threshold for all classes
  – individual thresholds for each class
  – separate thresholds for each subtree rooted in the children of a top node (for tree-hierarchies only)


Hierarchical consistency

• if (dj, ci) → True, then (dj, Ancestor(ci)) → True

[Figure: two labelings of the tree c1; c2, c3; c4 to c7, one marked “consistent” and one marked “inconsistent”]
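Hierarchical consistency is easy to test mechanically. The sketch below is an illustration under stated assumptions: the hierarchy is stored as a child-to-parent map, and the tree shape (c4 and c5 under c2, c6 and c7 under c3) is inferred from the worked examples later in the talk, with the placement of c6 being a guess. The names `PARENT`, `ancestors`, and `is_consistent` are hypothetical.

```python
# Child -> parent map for the seven-category example tree (c1 is the root).
PARENT = {"c2": "c1", "c3": "c1", "c4": "c2", "c5": "c2", "c6": "c3", "c7": "c3"}


def ancestors(category, parent=PARENT):
    """All categories on the path from `category` up to the root."""
    result = set()
    while category in parent:
        category = parent[category]
        result.add(category)
    return result


def is_consistent(assigned, parent=PARENT):
    """True if every assigned category comes with all of its ancestors."""
    assigned = set(assigned)
    return all(ancestors(c, parent) <= assigned for c in assigned)


print(is_consistent({"c1", "c2", "c4"}))  # every ancestor of c4 is present
print(is_consistent({"c1", "c4"}))        # c2 is missing
```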

Hierarchical local approach

[Figure, built up over five slides: top-down classification through the tree c1; c2, c3; c4 to c7; c8, c9, ending in a consistent classification]

Generalized hierarchical local approach

• stop classification at an intermediate level if none of the child categories seems relevant

• a category node can be assigned only after all its parent nodes have been assigned

[Figure: the tree c1; c2, c3; c4 to c7; c8, c9]
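The two rules of the generalized local approach translate into a simple top-down traversal. The following is a sketch under stated assumptions rather than the authors' implementation: one binary classifier per category is abstracted as a callable `is_relevant(category, document)`, here replaced by a hypothetical keyword test, and the tree is the seven-category example with an assumed layout.

```python
def classify_top_down(document, children, is_relevant, root="c1"):
    """Assign categories top-down; a node is tested only after its parent is assigned."""
    assigned = []
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in children.get(node, []):
            if is_relevant(child, document):
                assigned.append(child)
                frontier.append(child)  # descend only below accepted nodes
        # If no child is relevant, classification simply stops at `node`.
    return assigned


# Hypothetical hierarchy and keyword-based stand-ins for the local classifiers.
CHILDREN = {"c1": ["c2", "c3"], "c2": ["c4", "c5"], "c3": ["c6", "c7"]}
KEYWORD = {"c2": "sports", "c3": "science", "c4": "hockey",
           "c5": "football", "c6": "physics", "c7": "biology"}


def keyword_classifier(category, document):
    return KEYWORD[category] in document.lower().split()


print(classify_top_down("sports hockey game", CHILDREN, keyword_classifier))
print(classify_top_down("sports news", CHILDREN, keyword_classifier))
print(classify_top_down("hockey game", CHILDREN, keyword_classifier))
```

The second call stops at the intermediate node c2, and the third assigns nothing because c4 is never tested once its parent c2 has been rejected. Every result is consistent by construction.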


New global hierarchical approach

• Make a dataset consistent with a class hierarchy
  – add ancestor category labels

• Apply a regular learning algorithm
  – AdaBoost

• Make prediction results consistent with a class hierarchy
  – for inconsistent labeling make a consistent decision based on confidences of all ancestor classes
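The first and last steps of the global approach can be sketched as follows. The label expansion follows the slide directly. The post-processing rule is only an assumption, since the slides do not spell out how the ancestor confidences are combined: here a category is kept only if its own confidence and the confidences of all its ancestors (root excluded) exceed a threshold. The names `PARENT`, `ROOT`, `with_ancestors`, and `consistent_prediction` are hypothetical, and the tree layout is inferred from the examples in the talk.

```python
PARENT = {"c2": "c1", "c3": "c1", "c4": "c2", "c5": "c2", "c6": "c3", "c7": "c3"}
ROOT = "c1"


def with_ancestors(categories):
    """Step 1: extend a label set with all ancestors, excluding the root."""
    expanded = set()
    for c in categories:
        while c != ROOT:
            expanded.add(c)
            c = PARENT[c]
    return expanded


def consistent_prediction(confidence, threshold=0.0):
    """Step 3 (assumed rule): keep a category only if it and all of its
    ancestors have confidence above the threshold."""
    kept = set()
    for c in confidence:
        chain = with_ancestors({c})
        if all(confidence.get(a, float("-inf")) > threshold for a in chain):
            kept.add(c)
    return kept


print(with_ancestors({"c4"}))
scores = {"c2": 0.8, "c3": -0.5, "c4": 0.3, "c5": -0.2, "c6": -0.1, "c7": 0.4}
print(consistent_prediction(scores))
```

In the example c7 has a positive score, but its parent c3 does not, so c7 is dropped and the output stays consistent with the hierarchy.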

New global hierarchical approach

• Hierarchical (shared) attributes

[Figure: category tree with typical attributes at each node]
  sports: team, game, winner, etc.
    hockey: NHL, Senators, goalkeeper, etc.
    football: Super Bowl, Patriots, touchdown, etc.


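Evaluation in TC

[Figure: the tree c1; c2, c3; c4 to c7 with one category marked as the correct category and one as the incorrect category]

$\text{precision} = \frac{\text{correctly predicted}}{\text{total predicted}}$

$\text{recall} = \frac{\text{correctly predicted}}{\text{total in category}}$

$F\text{-measure} = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}, \quad \beta \in [0, \infty)$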

Weaknesses of standard measures

[Figure: three classifications H1, H2, H3 of the same document, shown on the tree c1; c2, c3; c4 to c7]

P(H1) = P(H2) = P(H3)

R(H1) = R(H2) = R(H3)

F(H1) = F(H2) = F(H3)

Ideally, M(H1) > M(H3) and M(H2) > M(H3)

Requirements for a hierarchical measure

1. to give credit to partially correct classification

[Figure: two classifications H1 and H2 on the tree c1; c2, c3; c4 to c7; c8 to c11]

M(H1) > M(H2)


2. to punish distant errors more heavily:
  – to give a higher evaluation for correctly classifying one level down compared to staying at the parent node

[Figure: two classifications H1 and H2 on the tree c1; c2, c3; c4 to c7; c8 to c11]

M(H1) > M(H2)


2. to punish distant errors more heavily:
  – to give a lower evaluation for incorrectly classifying one level down compared to staying at the parent node

[Figure: two classifications H1 and H2 on the tree c1; c2, c3; c4 to c7; c8 to c11]

M(H1) > M(H2)


3. to punish errors at higher levels of a hierarchy more heavily

[Figure: two classifications H1 and H2 on the tree c1; c2, c3; c4 to c7; c8 to c11]

M(H1) > M(H2)

Advantages of the new measure

• Simple, straightforward to calculate

• Based solely on a given hierarchy (no parameters to tune)

• Satisfies all three requirements

• Has high discriminating power

• Allows a trade-off between classification precision and classification depth

Our new hierarchical measure

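[Figure: the tree c1; c2, c3; c4 to c7. Legend: correct category; incorrect category; correct category + all its ancestors (excluding root)]

$\text{precision} = \frac{\text{correctly predicted}}{\text{total predicted}}$

$\text{recall} = \frac{\text{correctly predicted}}{\text{total in category}}$

$F\text{-measure} = \frac{(\beta^2 + 1) \cdot P \cdot R}{\beta^2 \cdot P + R}, \quad \beta \in [0, \infty)$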

Our new hierarchical measure

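[Figure: three classifications H1, H2, H3 on the tree c1; c2, c3; c4 to c7]

H1: correct: {c4} → {c2, c4}; predicted: {c2} → {c2}

$hP(H_1) = \frac{|\{c_2\}|}{|\{c_2\}|} = 1, \quad hR(H_1) = \frac{|\{c_2\}|}{|\{c_2, c_4\}|} = \frac{1}{2}$

H2: correct: {c4} → {c2, c4}; predicted: {c5} → {c2, c5}

$hP(H_2) = \frac{|\{c_2\}|}{|\{c_2, c_5\}|} = \frac{1}{2}, \quad hR(H_2) = \frac{|\{c_2\}|}{|\{c_2, c_4\}|} = \frac{1}{2}$

H3: correct: {c4} → {c2, c4}; predicted: {c7} → {c3, c7}

$hP(H_3) = \frac{|\{\}|}{|\{c_3, c_7\}|} = 0, \quad hR(H_3) = \frac{|\{\}|}{|\{c_2, c_4\}|} = 0$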
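The three worked cases can be reproduced with a short script. This is a minimal sketch for a single document, not the authors' evaluation code: both the correct and the predicted label sets are extended with their ancestors (root excluded) before precision, recall, and the F-measure are computed on the extended sets. The names `PARENT`, `with_ancestors`, and `hierarchical_prf` are hypothetical, and the placement of c6 in the tree is an assumption.

```python
PARENT = {"c2": "c1", "c3": "c1", "c4": "c2", "c5": "c2", "c6": "c3", "c7": "c3"}
ROOT = "c1"


def with_ancestors(categories):
    """Extend a label set with all ancestors, excluding the root."""
    expanded = set()
    for c in categories:
        while c != ROOT:
            expanded.add(c)
            c = PARENT[c]
    return expanded


def hierarchical_prf(correct, predicted, beta=1.0):
    """Hierarchical precision, recall and F-measure for one document."""
    correct_ext = with_ancestors(correct)
    predicted_ext = with_ancestors(predicted)
    overlap = len(correct_ext & predicted_ext)
    hp = overlap / len(predicted_ext) if predicted_ext else 0.0
    hr = overlap / len(correct_ext) if correct_ext else 0.0
    denominator = beta**2 * hp + hr
    hf = (beta**2 + 1) * hp * hr / denominator if denominator > 0 else 0.0
    return hp, hr, hf


for name, predicted in (("H1", {"c2"}), ("H2", {"c5"}), ("H3", {"c7"})):
    print(name, hierarchical_prf({"c4"}, predicted))
```

All three predictions are equally wrong under the flat measures, yet the hierarchical values rank the parent (H1) above the sibling (H2) and both above the distant leaf (H3).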

Measure consistency

• Definition [Huang & Ling, 2005]:
  f, g – measures on domain Ψ
  R = {(a,b) | a,b ∈ Ψ, f(a)>f(b), g(a)>g(b)}
  S = {(a,b) | a,b ∈ Ψ, f(a)>f(b), g(a)<g(b)}
  f is statistically consistent with g if |R|>|S|

• Experiment:
  – 100 randomly chosen hierarchies
  – New hierarchical F-measure and standard accuracy were consistent on 85% of random classifiers (|R|>5|S|)

Measure discriminancy

• Definition [Huang & Ling, 2005]:
  f, g – measures on domain Ψ
  P = {(a,b) | a,b ∈ Ψ, f(a)>f(b), g(a)=g(b)}
  Q = {(a,b) | a,b ∈ Ψ, f(a)=f(b), g(a)>g(b)}
  f is statistically more discriminating than g if |P|>|Q|

• Examples:

[Figure: three classifications H1, H2, H3 on the tree c1; c2, c3; c4 to c7]

For one accuracy value: 3 different hierarchical values
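The pair-counting definitions of consistency (R, S) and discriminancy (P, Q) can be evaluated directly on a list of classifiers scored by two measures. The sketch below is illustrative: `compare_measures` is a hypothetical helper, the first three score pairs correspond to H1, H2, H3 with hierarchical F-values 2/3, 1/2, 0 and accuracy 0, and the fourth entry (a fully correct classifier) is an invented addition so that R is not empty.

```python
from itertools import permutations


def compare_measures(f_scores, g_scores):
    """Count the ordered pairs (a, b) that fall into the sets R, S, P and Q."""
    counts = {"R": 0, "S": 0, "P": 0, "Q": 0}
    for a, b in permutations(range(len(f_scores)), 2):
        fa, fb, ga, gb = f_scores[a], f_scores[b], g_scores[a], g_scores[b]
        if fa > fb and ga > gb:
            counts["R"] += 1      # both measures agree on the order
        elif fa > fb and ga < gb:
            counts["S"] += 1      # the measures contradict each other
        elif fa > fb and ga == gb:
            counts["P"] += 1      # f separates what g cannot
        elif fa == fb and ga > gb:
            counts["Q"] += 1      # g separates what f cannot
    return counts


hierarchical_f = [2 / 3, 0.5, 0.0, 1.0]   # H1, H2, H3, plus a perfect classifier
accuracy = [0.0, 0.0, 0.0, 1.0]
counts = compare_measures(hierarchical_f, accuracy)
print(counts)
```

Here |R| > |S| and |P| > |Q|, so on this toy domain the hierarchical F-measure is consistent with accuracy and more discriminating than it.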

Results: Hierarchical vs. Flat

levels  branching  Flat   H. AdaBoost
2       2          68.30  76.22
3       2          58.35  74.21
4       2          44.90  73.22
5       2          20.88  72.70
2       3          53.47  63.45
3       3          29.51  60.69
4       3           2.67  58.22
2       4          41.35  55.25
3       4           6.98  50.70
2       5          29.99  47.87

Synthetic data (hierarchical attributes)


levels  branching  Flat   H. AdaBoost
2       2          61.69  65.95
3       2          42.47  51.53
4       2          24.49  40.18
5       2           8.45  32.61
2       3          41.53  48.02
3       3          14.50  29.97
4       3           0.79  21.91
2       4          26.72  35.01
3       4           2.46  19.70
2       5          17.14  27.12

Synthetic data (no hierarchical attributes)


dataset     Flat   H. AdaBoost
newsgroups  75.51  79.26
reuters     87.06  88.31

Real data

Results: Hierarchical vs. Local

levels  branching  Local  H. AdaBoost
2       2          73.42  76.22
3       2          69.40  74.21
4       2          68.18  73.22
5       2          68.44  72.70
2       3          61.99  63.45
3       3          58.81  60.69
4       3          57.40  58.22
2       4          54.26  55.25
3       4          50.66  50.70
2       5          47.26  47.87

Synthetic data (hierarchical attributes)


levels  branching  Local  H. AdaBoost
2       2          59.83  65.95
3       2          44.00  51.53
4       2          33.44  40.18
5       2          26.03  32.61
2       3          43.87  48.02
3       3          26.33  29.97
4       3          17.97  21.91
2       4          32.51  35.01
3       4          17.96  19.70
2       5          26.04  27.12

Synthetic data (no hierarchical attributes)


dataset     Local  H. AdaBoost
newsgroups  80.01  79.26
reuters     89.11  88.31

Real data



Application to Bioinformatics

• Functional annotation of genes from biomedical literature


Learning (from fully-annotated genes in the db)

Genomic database (SGD)

ID        Symbol    Name                  Medline reference  Evidence  GO ID       …
S0007287  15S_RRNA                        PMID:6261980       ISS       GO:0003735
S0007287  15S_RRNA                        PMID:6280192       IGI       GO:0006412
S0004660  AAC1      ADP/ATP translocator  PMID:2167309       TAS       GO:0005743
…         …         …                     …                  …         …           …

1. retrieve GO codes and IDs of Medline entries from the db records


Medline

2. retrieve the corresponding Medline abstracts

PMID / Abstract

PMID:6261980
  Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast
  Sor F, Fukuhara H.
  We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. …

PMID:6280192
  Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA.
  Fox TD, Staempfli S.
  A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. …

PMID:2167309
  Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast.
  Gawaz M, Douglas MG, Klingenberg M.
  AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. …

…


Training set

3. form the training set: words from Medline abstracts (features) and GO codes (categories)

Abstract / GO ID

GO:0003735
  Nucleotide sequence of the gene for the mitochondrial 15S ribosomal RNA of yeast
  Sor F, Fukuhara H.
  We have determined the nucleotide sequence of a DNA segment carrying the entire 15S ribosomal RNA gene of yeast mitochondrial genome. …

GO:0006412
  Suppressor of yeast mitochondrial ochre mutations that maps in or near the 15S ribosomal RNA gene of mtDNA.
  Fox TD, Staempfli S.
  A polypeptide chain-terminating mutation in the yeast mitochondrial oxi 1 gene has been shown to be an ochre (TAA) mutation by DNA sequence analysis. …

GO:0005743
  Structure-function studies of adenine nucleotide transport in mitochondria. II. Biochemical analysis of distinct AAC1 and AAC2 proteins in yeast.
  Gawaz M, Douglas MG, Klingenberg M.
  AAC1 and AAC2 genes in yeast each encode functional ADP/ATP carrier (AAC) proteins of the mitochondrial inner membrane. …

…


Learning Algorithm

Classifier: Abstracts → GO codes

4. learn a classifier from the training set
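Steps 1 to 3 of the learning pipeline amount to a join between annotation records and abstracts. The sketch below illustrates that join on the three records shown in the slides; it is not the authors' code. The function name `build_training_set`, the in-memory data layout, the use of titles only, and the lowercase whitespace tokenization are all assumptions; a real pipeline would query SGD and Medline.

```python
from collections import defaultdict

# (gene ID, Medline reference, GO ID) triples as retrieved from the database.
annotations = [
    ("S0007287", "PMID:6261980", "GO:0003735"),
    ("S0007287", "PMID:6280192", "GO:0006412"),
    ("S0004660", "PMID:2167309", "GO:0005743"),
]

# Abstract text keyed by PMID (titles only, to keep the example short).
abstracts = {
    "PMID:6261980": "Nucleotide sequence of the gene for the mitochondrial "
                    "15S ribosomal RNA of yeast",
    "PMID:6280192": "Suppressor of yeast mitochondrial ochre mutations that maps "
                    "in or near the 15S ribosomal RNA gene of mtDNA",
    "PMID:2167309": "Structure-function studies of adenine nucleotide transport "
                    "in mitochondria",
}


def build_training_set(annotations, abstracts):
    """Pair the words of each abstract (features) with its GO codes (categories)."""
    go_codes_by_pmid = defaultdict(set)
    for _gene_id, pmid, go_id in annotations:
        go_codes_by_pmid[pmid].add(go_id)
    training_set = []
    for pmid, go_codes in go_codes_by_pmid.items():
        if pmid not in abstracts:
            continue  # skip references whose abstract could not be retrieved
        words = set(abstracts[pmid].lower().split())
        training_set.append((words, go_codes))
    return training_set


training_set = build_training_set(annotations, abstracts)
print(len(training_set), training_set[0][1])
```

Grouping by PMID means that an abstract cited for several GO codes becomes one multi-label training example, which matches the multi-label setting of AdaBoost.MH.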

Classification (for genes with missing annotation)

Medline

1. retrieve Medline abstracts mentioning the gene

Gene / Abstract

YLL057C
  Cloning and characterization of a sulfonate/alpha-ketoglutarate dioxygenase from Saccharomyces cerevisiae.
  Hogan DA, Auchtung TA, Hausinger RP.
  The Saccharomyces cerevisiae open reading frame YLL057c is predicted to encode a gene product with 31.5% amino acid sequence identity to Escherichia coli taurine/alpha-ketoglutarate dioxygenase and 27% identity to Ralstonia eutropha TfdA, a herbicide-degrading enzyme. …

Classification (for genes with missing annotation)

Classifier: Abstracts → GO codes

2. classify these abstracts into GO codes

Gene     GO code     GO function
YLL057C  GO:0006790  sulfur metabolism

Results

dataset          levels  branching  Flat   Local  H. AdaBoost
biol. process    12       5.41      15.06  59.27  59.31
mol. function    10      10.29       8.78  43.36  38.17
cell. component   8       6.45      44.18  72.07  73.35

Conclusion

We have presented:

• hierarchical categorization task (categories are partially ordered)
• generalized hierarchical local approach
• new hierarchical global approach (hierarchical AdaBoost)
• new hierarchical evaluation measure
• application to gene annotation task

Future work

• to try the global hierarchical approach with other learning algorithms

• to extend the gene annotation training sets with similar documents from Medline

• to perform a similar task for other organisms

• to use gene annotations in gene classification and clustering

Gene expression analysis with functional annotations

GO:0006790

GO:0006798

GO:0007315

GO:0007289

GO:0002132

GO:0002166
