Cancer Hallmark Text Classification
Using Convolutional Neural Networks
Simon Baker, Anna Korhonen, Sampo Pyysalo
Cambridge Language Technology Lab
Introduction and motivation
• A major goal of cancer research is to understand the biological
mechanisms involved: how tumorous growths start in the
body, how they are sustained, and how they turn malignant.
• Cancer is often described in the biomedical literature by its
hallmarks: a set of interrelated biological properties and
behaviours that enable cancer to thrive in the body.
2
Introduction and motivation
• The hallmarks of cancer were first introduced in the seminal
paper of Hanahan and Weinberg (2000), the most cited paper in
the journal Cell.
• The paper introduces six hallmarks, which were then extended
in a follow-up paper (Hanahan and Weinberg 2011) by another four,
forming the set of ten hallmarks that are known today.
3
Introduction and motivation
In recent work, a corpus of over 1,800 abstracts from
biomedical publications was annotated with the ten hallmarks of
cancer (Baker et al. 2016).
A machine learning method for classifying abstracts
according to the hallmarks was also proposed. The approach
uses a conventional NLP pipeline to extract a feature-rich
representation, which is used to train support vector machine
(SVM) classifiers.
The method achieves an average F-score of 77%.
4
Introduction and motivation
A conventional pipeline method is expensive:
• Computationally demanding.
• Requires handcrafting and feature engineering.
• Error propagation through the pipeline.
Our goal is to overcome these challenges by applying
Convolutional Neural Networks to this task.
5
The Hallmarks of Cancer
• Sustaining proliferative signalling
• Evading growth suppressors
• Resisting cell death
• Enabling replicative immortality
• Inducing angiogenesis
• Activating invasion & metastasis
• Genome instability & mutation
• Tumor-promoting inflammation
• Deregulating cellular energetics
• Avoiding immune destruction
6
Data
• Corpus of 1853 scientific abstracts.
• Labelled with zero or more hallmarks.
• Inter-annotator agreement on a subset of 155 abstracts: kappa = 0.81.
• Split data into three sets: train (80%), development (10%), test
(10%).
• We used a sampling strategy that preserves the overall
distribution of the 10 classes.
• We train ten independent binary classifiers (one for each
hallmark).
7
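For each of the ten binary classifiers, a split that preserves the class distribution can be sketched in plain Python: group the abstracts by label and split each group proportionally. The function name and the split fractions are illustrative, not the authors' actual code.

```python
import random

def stratified_split(items, label_of, fracs=(0.8, 0.1, 0.1), seed=0):
    """Split items into train/dev/test while preserving the
    per-label distribution (a per-group proportional split)."""
    rng = random.Random(seed)
    groups = {}
    for it in items:
        groups.setdefault(label_of(it), []).append(it)
    train, dev, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n = len(members)
        a = int(n * fracs[0])
        b = a + int(n * fracs[1])
        train += members[:a]
        dev += members[a:b]
        test += members[b:]
    return train, dev, test

# toy usage: 100 positive / 100 negative "abstracts"
data = [(i, i % 2) for i in range(200)]
tr, dv, te = stratified_split(data, label_of=lambda x: x[1])
```

Because each label group is split with the same fractions, the positive/negative ratio in every subset matches the full corpus.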
Data
Hallmark                        Positives  Negatives
Proliferative signaling             25%       75%
Evading growth suppressors          13%       87%
Resisting cell death                23%       77%
Replicative immortality              6%       94%
Angiogenesis                         8%       92%
Invasion & metastasis               16%       84%
Genomic instability                 18%       82%
Tumor-promoting inflammation        13%       87%
Cellular energetics                  6%       94%
Avoiding immune destruction          6%       94%
8
Convolutional Neural Networks
We base our CNN architecture on the simple model of Kim
(2014).
9
Convolutional Neural Networks
We implemented the neural network using Keras.
Model hyperparameters and the training setup were initially fixed
to those of Kim (2014), summarized in the following:
• Word vector size: 300 (Google News vectors)
• Filter sizes: 3, 4, 5
• Number of filters: 300 (100 of each size)
• Dropout probability: 0.5
• Minibatch size: 50
10
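The core operation behind the filter sizes above can be sketched in plain Python: a width-n filter slides over the sequence of word vectors, and max-over-time pooling keeps the strongest response. This is a toy illustration with tiny dimensions; the actual model is built in Keras with 300-dimensional vectors, bias terms, and a nonlinearity.

```python
def conv_max_pool(seq, filt):
    """Slide a width-n filter over a sequence of word vectors and
    return the max-over-time pooled response (no bias, no ReLU)."""
    n = len(filt)          # filter width, e.g. 3, 4 or 5 words
    dim = len(seq[0])      # word-vector dimensionality
    responses = []
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        # dot product of the flattened window with the filter weights
        r = sum(filt[j][d] * window[j][d]
                for j in range(n) for d in range(dim))
        responses.append(r)
    return max(responses)

# toy example: four 2-d word vectors, one width-2 filter
seq = [[1, 0], [0, 1], [1, 1], [0, 0]]
filt = [[1, 0], [0, 1]]
print(conv_max_pool(seq, filt))
```

In the full model, 100 such filters of each width run in parallel and their pooled responses are concatenated before the classification layer.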
Convolutional Neural Networks
Adapting the model to our task:
• Oversampling positive examples
• Pre-training embeddings
• Tuning filter sizes
11
Oversampling
• We oversampled the positive examples (2X, 4X, 8X, 16X).
We selected a balanced oversampling strategy in which the
numbers of positive and negative examples are equal.
• Oversampling improves F-score to 86.1%, compared to 85.1%
without oversampling.
12
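The balanced strategy above can be sketched as duplicating positives until the two classes are the same size. This is a simplification (the slide also reports fixed 2X to 16X multipliers), and the function name is illustrative.

```python
def balanced_oversample(pos, neg):
    """Repeat the positive examples until the classes are equal
    in size (assumes positives are the minority class)."""
    if not pos:
        return pos, neg
    k, r = divmod(len(neg), len(pos))
    return pos * k + pos[:r], neg

# toy usage: 3 positives vs 10 negatives
pos = ["p1", "p2", "p3"]
neg = ["n%d" % i for i in range(10)]
pos_os, neg_os = balanced_oversample(pos, neg)
print(len(pos_os), len(neg_os))
```

Duplication changes only the class prior seen during training; no synthetic examples are created.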
Pre-training Embeddings
We consider a variety of word embeddings:
• general-domain Google News vectors
• PubMed (PM)
• PMC
• Wikipedia texts (Wiki)
• PMC-based vectors introduced for the BioASQ shared task
• two variants of PubMed-based vectors introduced by Chiu et al. (2016)
13
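Whichever vectors are chosen, wiring them into the network follows the same pattern: read the pretrained file, then build an embedding matrix whose rows align with the task vocabulary. A toy word2vec-style text format is used here; the real files are 200 to 300-dimensional, and the helper name is illustrative.

```python
def build_embedding_matrix(pretrained_lines, vocab, dim):
    """Map each vocabulary word to its pretrained vector,
    falling back to a zero vector for out-of-vocabulary words."""
    table = {}
    for line in pretrained_lines:
        parts = line.split()
        table[parts[0]] = [float(x) for x in parts[1:]]
    return [table.get(w, [0.0] * dim) for w in vocab]

# toy 2-d "pretrained" vectors in word2vec text format
lines = ["tumor 0.1 0.2", "cell 0.3 0.4"]
matrix = build_embedding_matrix(lines, vocab=["cell", "tumor", "unk"], dim=2)
```

The resulting matrix initializes the network's embedding layer, so swapping embedding sources only changes this one input file.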
Pre-training Embeddings
14
Pre-training Embeddings
On Development data:
Embedding      F-score (%)
Chiu-win-30        85.6
Chiu-win-2         86.1
Wiki+PM+PMC        84.9
PM+PMC             85.2
PM                 85.2
PMC                85.3
GoogleNews         86.1
15
Pre-training Embeddings
On Development data:
Embedding      AUC (%)
Chiu-win-30        97.3
Chiu-win-2         97.6
Wiki+PM+PMC        97.2
PM+PMC             97.2
PM                 97.3
PMC                97.2
GoogleNews         97.5
16
Selecting Filter Sizes
• The base model uses three filter sizes: 3, 4, 5.
• We investigate what happens to performance when changing
the filter sizes (1-10)
• and the number of distinct filter sizes (1-5).
• We keep the total number of filters fixed for each filter size.
17
Selecting Filter Sizes
18
Baseline
• CNN with original Kim 2014 hyperparameters
• SVM with Bag-of-Words features
• SVM with rich features (Baker et al 2016)
19
Baseline – Feature-rich SVM
[Pipeline diagram: an input article passes through 12 stages, including
data cleaning, metadata extraction, tokenisation, lemmatisation, POS
tagging, dependency parsing, named entity recognition, verb class
clustering, n-gram extraction, feature encoding, feature selection, and
the classifier. Extracted features include lemmatised BoW, noun bigrams,
verb classes, GRs, Chem & MeSH terms, and named entities.]
20
Results (Average F-score %)
SVM-BoW       69.2
SVM-Rich      76.8
CNN-Base      76.6
CNN-Tuned     81
21
Results (Average AUC %)
SVM-BoW       93.1
SVM-Rich      94.9
CNN-Base      97.1
CNN-Tuned     97.8
22
Results (F-score %)
[Bar chart: per-hallmark F-scores (50-95% range) for SVM-BoW, SVM-Rich,
CNN-Base, and CNN-Tuned across all ten hallmarks: Proliferative
signaling, Evading growth suppressors, Resisting cell death, Replicative
immortality, Angiogenesis, Invasion & metastasis, Genomic instability,
Tumor-promoting inflammation, Cellular energetics, Avoiding immune
destruction.]
23
Results
• CNNs greatly reduce the burden of handcrafting and feature
engineering for text classification.
• More portable than an SVM pipeline.
• The hyperparameter space is large, and exhaustive search is
prohibitive.
24
Conclusions
• We investigated the application of CNNs to the biomedical
domain text classification task of identifying the Hallmarks of
Cancer.
• We demonstrated that a CNN using only text and embeddings
can achieve performance competitive with a feature-rich
SVM classifier.
• We further adapted the CNN to the task by oversampling
positive examples, using tuned embeddings induced from
biomedical text, and tuning hyperparameters. We achieve a
substantial improvement over the previous state of the art.
25
Thank you for listening!
26