
Cancer Hallmark Text Classification

Using Convolutional Neural Networks

Simon Baker, Anna Korhonen, Sampo Pyysalo

Cambridge Language Technology Lab

Introduction and motivation

• A major goal of cancer research is to understand the biological mechanisms involved: how tumorous growths start in the body, how they are sustained, and how they turn malignant.

• Cancer is often described in the biomedical literature by its

hallmarks: a set of interrelated biological properties and

behaviours that enable cancer to thrive in the body.


Introduction and motivation

• The hallmarks of cancer were first introduced in the seminal paper of Hanahan and Weinberg (2000), the most cited paper in the journal Cell.

• The paper introduces six hallmarks, which were then extended in a follow-up paper (Hanahan and Weinberg 2011) with another four, forming the set of ten hallmarks known today.


Introduction and motivation

In recent work, a corpus of over 1,800 abstracts from biomedical publications was annotated with the ten hallmarks of cancer (Baker et al. 2016).

A machine learning based method for classifying abstracts according to the hallmarks was also proposed. The approach utilizes a conventional NLP pipeline that extracts a feature-rich representation, which is used to train support vector machine (SVM) classifiers.

The method achieves an average F-score of 77%.
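As a concrete reference point, the evaluation measure can be sketched in a few lines of Python. The per-class counts below are toy numbers for illustration, not the paper's results.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(counts):
    """Average the F-score over per-class (tp, fp, fn) counts."""
    return sum(f_score(*c) for c in counts) / len(counts)

# Toy counts for three hypothetical hallmark classifiers.
print(round(macro_f([(8, 2, 2), (5, 5, 5), (9, 1, 1)]), 3))  # 0.733
```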


Introduction and motivation

A conventional pipeline method is expensive:

• Computationally demanding.

• Requires handcrafting and feature engineering.

• Error propagation through the pipeline.

Our goal is to overcome these challenges by applying

Convolutional Neural Networks to this task.


The Hallmarks of Cancer

• Sustaining proliferative signalling

• Evading growth suppressors

• Resisting cell death

• Enabling replicative immortality

• Inducing angiogenesis

• Activating invasion & metastasis

• Genome instability & mutation

• Tumor-promoting inflammation

• Deregulating cellular energetics

• Avoiding immune destruction


Data

• Corpus of 1853 scientific abstracts.

• Labelled with zero or more hallmarks.

• Inter-annotator agreement (Kappa) on a 155-abstract subset: 0.81.

• Split data into three sets: train (70%), development (10%), test (10%).

• We used a sampling strategy that preserves the overall

distribution of the 10 classes.

• We train ten independent binary classifiers (one for each

hallmark).
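A distribution-preserving split of the kind described above can be sketched as follows. The fractions, the toy data, and the function name are illustrative, not the paper's exact setup.

```python
import random

def stratified_split(examples, train_frac=0.7, dev_frac=0.1, seed=0):
    """Split (text, label) pairs so that each label keeps roughly the
    same proportion in every split; the remainder becomes the test set."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[1], []).append(ex)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_dev = int(len(group) * dev_frac)
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test

# Toy corpus: 100 abstracts, half labelled 0 and half labelled 1.
data = [("abstract %d" % i, i % 2) for i in range(100)]
train, dev, test = stratified_split(data)
print(len(train), len(dev), len(test))  # 70 10 20
```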


Data

Class distribution (share of positive vs. negative abstracts per hallmark):

Hallmark                        Positives  Negatives
Proliferative signaling            25%        75%
Evading growth                     13%        87%
Resisting cell death               23%        77%
Replicative immortality             6%        94%
Angiogenesis                        8%        92%
Invasion & metastasis              16%        84%
Genomic instability                18%        82%
Tumor promoting inflammation       13%        87%
Cellular energetics                 6%        94%
Avoiding immune destruction         6%        94%

Convolutional Neural Networks

We base our CNN architecture on the simple model of Kim (2014).


Convolutional Neural Networks

We implemented the neural network using Keras.

Model hyperparameters and the training setup were initially fixed to those applied by Kim (2014), summarized in the following:

• Word vector size: 300 (Google News vectors)

• Filter sizes: 3, 4, 5

• Number of filters: 300 (100 of each size)

• Dropout probability: 0.5

• Minibatch size: 50
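The core operation of this architecture, sliding a filter over the word-vector sequence and then max-pooling over positions ("max-over-time" pooling), can be sketched in plain Python. The sentence vectors and filter weights below are toy values, not trained parameters.

```python
def conv_max_pool(word_vecs, filt):
    """Apply one convolutional filter over a sentence of word vectors
    and return the max-over-time pooled activation."""
    dim = len(word_vecs[0])
    n = len(filt) // dim  # filter width in words
    best = float("-inf")
    for start in range(len(word_vecs) - n + 1):
        # Flatten the n-word window and dot it with the filter weights.
        window = [x for vec in word_vecs[start:start + n] for x in vec]
        act = sum(w * x for w, x in zip(filt, window))
        best = max(best, act)
    return best

# Toy 2-dimensional "embeddings" for a 4-word sentence.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filt3 = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # one width-3 filter (3 words x 2 dims)
print(conv_max_pool(sent, filt3))  # 4.0
```

A real model applies 300 such filters (100 each of widths 3, 4, 5) and feeds the concatenated pooled values, with dropout, into the output layer.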


Convolutional Neural Networks

Adapting the model to our task:

• Oversampling positive examples

• Pre-train embeddings

• Tune filter sizes


Oversampling

• We oversampled the positive examples (2X, 4X, 8X, 16X). We selected a balanced oversampling strategy in which the numbers of positive and negative examples are equal.

• Oversampling improves F-score to 86.1%, compared to 85.1% without oversampling.
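A minimal sketch of a balanced oversampling strategy, assuming binary labels where 1 marks a positive example (the function name and toy data are illustrative):

```python
import random

def oversample_positives(examples, seed=0):
    """Duplicate positive examples (label 1) until the two classes
    are the same size, then shuffle the result."""
    pos = [ex for ex in examples if ex[1] == 1]
    neg = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    balanced = list(examples)
    while sum(1 for ex in balanced if ex[1] == 1) < len(neg):
        balanced.append(rng.choice(pos))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced set: 3 positives, 9 negatives.
data = [("pos %d" % i, 1) for i in range(3)] + [("neg %d" % i, 0) for i in range(9)]
balanced = oversample_positives(data)
print(len(balanced))  # 18: 9 positives + 9 negatives
```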


Pre-training Embeddings

We consider a variety of word embeddings:

• general-domain Google News vectors

• PubMed (PM)

• PMC

• Wikipedia texts

• PMC-based vectors introduced for the BioASQ shared task

• two variants of PubMed-based vectors introduced by Chiu et al. (2016)
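For illustration, vectors in the word2vec text format (one word per line followed by its components) can be read with a few lines of Python; real pre-trained files from the sources above are large and are typically loaded with a library such as gensim. The toy lines below are invented.

```python
def load_text_vectors(lines):
    """Parse word vectors in the word2vec text format into a dict
    mapping each word to a list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines or a leading "count dim" header
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

toy = ["cancer 0.1 0.2 0.3", "tumor 0.1 0.1 0.4"]
vecs = load_text_vectors(toy)
print(vecs["cancer"])  # [0.1, 0.2, 0.3]
```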



Pre-training Embeddings

On Development data:

Embedding      F-score (%)
Chiu-win-30    85.6
Chiu-win-2     86.1
Wiki+PM+PMC    84.9
PM+PMC         85.2
PM             85.2
PMC            85.3
GoogleNews     86.1

Pre-training Embeddings

On Development data:

Embedding      AUC (%)
Chiu-win-30    97.3
Chiu-win-2     97.6
Wiki+PM+PMC    97.2
PM+PMC         97.2
PM             97.3
PMC            97.2
GoogleNews     97.5

Selecting Filter Sizes

• The base model uses three filter sizes: 3, 4, 5.

• We investigate what happens to performance when changing the filter sizes (1-10) and the number of filter sizes (1-5).

• We keep the total number of filters fixed for each filter size.
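One way to enumerate such a search space is sketched below. Whether the search was restricted to, say, consecutive size ranges is not stated in these slides, so the full combination grid here is an assumption for illustration.

```python
from itertools import combinations

def filter_size_grid(max_size=10, max_sizes=5):
    """Enumerate candidate filter-size sets: every combination of
    1 to max_sizes distinct sizes drawn from 1..max_size."""
    sizes = range(1, max_size + 1)
    grid = []
    for k in range(1, max_sizes + 1):
        grid.extend(combinations(sizes, k))
    return grid

grid = filter_size_grid()
print(len(grid))          # 10 + 45 + 120 + 210 + 252 = 637 configurations
print((3, 4, 5) in grid)  # True: the base model's setting is one of them
```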



Baseline

• CNN with original Kim 2014 hyperparameters

• SVM with Bag-of-Words features

• SVM with rich features (Baker et al. 2016)


Baseline – Feature-rich SVM

[Pipeline diagram (stages numbered 1-12 in the original figure): the input article passes through Data Cleaning, Metadata Extraction, Tokenisation, POS Tagging, Lemmatisation, Dependency Parsing, Named Entity Recognition, Verb Class Clustering, and N-gram Extraction; Feature Encoding and Feature Selection then feed the extracted features (BoW, Lemma BoW, GR, Noun Bigrams, Verb Classes, Chem & MeSH Named Entities) into the classifier.]

Results (Average F-score %)

SVM-BoW     69.2
SVM-Rich    76.8
CNN-Base    76.6
CNN-Tuned   81.0

Results (Average AUC %)

SVM-BoW     93.1
SVM-Rich    94.9
CNN-Base    97.1
CNN-Tuned   97.8

Results (F-score %)

[Figure: per-hallmark F-scores (axis range 50-95%) for SVM-BoW, SVM-Rich, CNN-Base, and CNN-Tuned across the ten hallmarks: Proliferative signaling, Evading growth, Resisting cell death, Replicative immortality, Angiogenesis, Invasion & metastasis, Genomic instability, Tumor promoting inflammation, Cellular energetics, Avoiding immune destruction.]

Results

• CNNs greatly reduce the burden of handcrafting and feature engineering for text classification.

• More portable than an SVM pipeline.

• The hyperparameter space is large, and exhaustive search is prohibitive.


Conclusions

• We investigated the application of CNNs to the biomedical

domain text classification task of identifying the Hallmarks of

Cancer.

• We demonstrated that a CNN using only text and embeddings can achieve performance competitive with a feature-heavy SVM classifier.

• We further adapted the CNN to the task by oversampling

positive examples, using tuned embeddings induced from

biomedical text, and tuning hyperparameters. We achieve a

substantive improvement over the previous state-of-the-art.


Thank you for listening!
