
Cancer Hallmark Text Classification

Using Convolutional Neural Networks

Simon Baker, Anna Korhonen, Sampo Pyysalo

Cambridge Language Technology Lab

Introduction and motivation

• A major goal of cancer research is to understand the biological mechanisms involved: how tumorous growths start in the body, how they are sustained, and how they turn malignant.

• Cancer is often described in the biomedical literature by its

hallmarks: a set of interrelated biological properties and

behaviours that enable cancer to thrive in the body.


Introduction and motivation

• The hallmarks of cancer were first introduced in the seminal paper of Hanahan and Weinberg (2000), the most cited paper in the journal Cell.

• The paper introduces six hallmarks, which were then extended in a follow-up paper (Hanahan and Weinberg 2011) with another four, forming the set of ten hallmarks known today.


Introduction and motivation

In recent work, a corpus of over 1,800 abstracts from biomedical publications was annotated with the ten hallmarks of cancer (Baker et al. 2016).

A machine learning based method for classifying abstracts according to the hallmarks was also proposed. The approach utilizes a conventional NLP pipeline that extracts a feature-rich representation, which is used to train support vector machine (SVM) classifiers.

The method achieves an average F-score of 77%.
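As a concrete reference point, the evaluation measure can be sketched in a few lines of Python. The per-class counts below are toy numbers for illustration, not the paper's results.

```python
def f_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def macro_f(counts):
    """Average the F-score over per-class (tp, fp, fn) counts."""
    return sum(f_score(*c) for c in counts) / len(counts)

# Toy counts for three hypothetical hallmark classifiers.
print(round(macro_f([(8, 2, 2), (5, 5, 5), (9, 1, 1)]), 3))  # 0.733
```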


Introduction and motivation

A conventional pipeline method is expensive:

• Computationally demanding.

• Requires handcrafting and feature engineering.

• Error propagation through the pipeline.

Our goal is to overcome these challenges by applying

Convolutional Neural Networks to this task.


The Hallmarks of Cancer

• Sustaining proliferative signalling

• Evading growth suppressors

• Resisting cell death

• Enabling replicative immortality

• Inducing angiogenesis

• Activating invasion & metastasis

• Genome instability & mutation

• Tumor-promoting inflammation

• Deregulating cellular energetics

• Avoiding immune destruction


Data

• Corpus of 1853 scientific abstracts.

• Labelled with zero or more hallmarks.

• Inter-annotator agreement (Kappa) on a 155-abstract subset: 0.81.

• Split data into three sets: train (70%), development (10%), test (10%).

• We used a sampling strategy that preserves the overall

distribution of the 10 classes.

• We train ten independent binary classifiers (one for each

hallmark).
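A distribution-preserving split of the kind described above can be sketched as follows. The fractions, the toy data, and the function name are illustrative, not the paper's exact setup.

```python
import random

def stratified_split(examples, train_frac=0.7, dev_frac=0.1, seed=0):
    """Split (text, label) pairs so that each label keeps roughly the
    same proportion in every split; the remainder becomes the test set."""
    by_label = {}
    for ex in examples:
        by_label.setdefault(ex[1], []).append(ex)
    rng = random.Random(seed)
    train, dev, test = [], [], []
    for group in by_label.values():
        rng.shuffle(group)
        n_train = int(len(group) * train_frac)
        n_dev = int(len(group) * dev_frac)
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test

# Toy corpus: 100 abstracts, half labelled 0 and half labelled 1.
data = [("abstract %d" % i, i % 2) for i in range(100)]
train, dev, test = stratified_split(data)
print(len(train), len(dev), len(test))  # 70 10 20
```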


Data

Class distribution (share of positive vs. negative abstracts per hallmark):

Hallmark                        Positives  Negatives
Proliferative signaling            25%        75%
Evading growth                     13%        87%
Resisting cell death               23%        77%
Replicative immortality             6%        94%
Angiogenesis                        8%        92%
Invasion & metastasis              16%        84%
Genomic instability                18%        82%
Tumor promoting inflammation       13%        87%
Cellular energetics                 6%        94%
Avoiding immune destruction         6%        94%

Convolutional Neural Networks

We base our CNN architecture on the simple model of Kim (2014).


Convolutional Neural Networks

We implemented the neural network using Keras.

Model hyperparameters and the training setup were initially fixed to those applied by Kim (2014), summarized in the following:

• Word vector size: 300 (Google News vectors)

• Filter sizes: 3, 4, 5

• Number of filters: 300 (100 of each size)

• Dropout probability: 0.5

• Minibatch size: 50
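The core operation of this architecture, sliding a filter over the word-vector sequence and then max-pooling over positions ("max-over-time" pooling), can be sketched in plain Python. The sentence vectors and filter weights below are toy values, not trained parameters.

```python
def conv_max_pool(word_vecs, filt):
    """Apply one convolutional filter over a sentence of word vectors
    and return the max-over-time pooled activation."""
    dim = len(word_vecs[0])
    n = len(filt) // dim  # filter width in words
    best = float("-inf")
    for start in range(len(word_vecs) - n + 1):
        # Flatten the n-word window and dot it with the filter weights.
        window = [x for vec in word_vecs[start:start + n] for x in vec]
        act = sum(w * x for w, x in zip(filt, window))
        best = max(best, act)
    return best

# Toy 2-dimensional "embeddings" for a 4-word sentence.
sent = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
filt3 = [1.0, 0.0, 0.0, 1.0, 1.0, 1.0]  # one width-3 filter (3 words x 2 dims)
print(conv_max_pool(sent, filt3))  # 4.0
```

A real model applies 300 such filters (100 each of widths 3, 4, 5) and feeds the concatenated pooled values, with dropout, into the output layer.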


Convolutional Neural Networks

Adapting the model to our task:

• Oversampling positive examples

• Pre-train embeddings

• Tune filter sizes


Oversampling

• We oversampled the positive examples (2X, 4X, 8X, 16X). We selected a balanced oversampling strategy in which the numbers of positive and negative examples are equal.

• Oversampling improves F-score to 86.1%, compared to 85.1% without oversampling.
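A minimal sketch of a balanced oversampling strategy, assuming binary labels where 1 marks a positive example (the function name and toy data are illustrative):

```python
import random

def oversample_positives(examples, seed=0):
    """Duplicate positive examples (label 1) until the two classes
    are the same size, then shuffle the result."""
    pos = [ex for ex in examples if ex[1] == 1]
    neg = [ex for ex in examples if ex[1] == 0]
    rng = random.Random(seed)
    balanced = list(examples)
    while sum(1 for ex in balanced if ex[1] == 1) < len(neg):
        balanced.append(rng.choice(pos))
    rng.shuffle(balanced)
    return balanced

# Toy imbalanced set: 3 positives, 9 negatives.
data = [("pos %d" % i, 1) for i in range(3)] + [("neg %d" % i, 0) for i in range(9)]
balanced = oversample_positives(data)
print(len(balanced))  # 18: 9 positives + 9 negatives
```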


Pre-training Embeddings

We consider a variety of word embeddings:

• general-domain Google News vectors

• PubMed (PM)

• PMC

• Wikipedia texts

• PMC-based vectors introduced for the BioASQ shared task

• two variants of PubMed-based vectors introduced by Chiu et al. (2016)
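For illustration, vectors in the word2vec text format (one word per line followed by its components) can be read with a few lines of Python; real pre-trained files from the sources above are large and are typically loaded with a library such as gensim. The toy lines below are invented.

```python
def load_text_vectors(lines):
    """Parse word vectors in the word2vec text format into a dict
    mapping each word to a list of floats."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank lines or a leading "count dim" header
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

toy = ["cancer 0.1 0.2 0.3", "tumor 0.1 0.1 0.4"]
vecs = load_text_vectors(toy)
print(vecs["cancer"])  # [0.1, 0.2, 0.3]
```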



Pre-training Embeddings

On Development data:

Embedding      F-score (%)
Chiu-win-30    85.6
Chiu-win-2     86.1
Wiki+PM+PMC    84.9
PM+PMC         85.2
PM             85.2
PMC            85.3
GoogleNews     86.1

Pre-training Embeddings

On Development data:

Embedding      AUC (%)
Chiu-win-30    97.3
Chiu-win-2     97.6
Wiki+PM+PMC    97.2
PM+PMC         97.2
PM             97.3
PMC            97.2
GoogleNews     97.5

Selecting Filter Sizes

• The base model uses three filter sizes: 3, 4, 5.

• We investigate what happens to performance when changing the filter sizes (1-10) and the number of filter sizes (1-5).

• We keep the total number of filters fixed for each filter size.
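One way to enumerate such a search space is sketched below. Whether the search was restricted to, say, consecutive size ranges is not stated in these slides, so the full combination grid here is an assumption for illustration.

```python
from itertools import combinations

def filter_size_grid(max_size=10, max_sizes=5):
    """Enumerate candidate filter-size sets: every combination of
    1 to max_sizes distinct sizes drawn from 1..max_size."""
    sizes = range(1, max_size + 1)
    grid = []
    for k in range(1, max_sizes + 1):
        grid.extend(combinations(sizes, k))
    return grid

grid = filter_size_grid()
print(len(grid))          # 10 + 45 + 120 + 210 + 252 = 637 configurations
print((3, 4, 5) in grid)  # True: the base model's setting is one of them
```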



Baseline

• CNN with original Kim 2014 hyperparameters

• SVM with Bag-of-Words features

• SVM with rich features (Baker et al. 2016)


Baseline – Feature-rich SVM

[Pipeline diagram (stages numbered 1-12 in the original figure): the input article passes through Data Cleaning, Metadata Extraction, Tokenisation, POS Tagging, Lemmatisation, Dependency Parsing, Named Entity Recognition, Verb Class Clustering, and N-gram Extraction; Feature Encoding and Feature Selection then feed the extracted features (BoW, Lemma BoW, GR, Noun Bigrams, Verb Classes, Chem & MeSH Named Entities) into the classifier.]

Results (Average F-score %)

SVM-BoW     69.2
SVM-Rich    76.8
CNN-Base    76.6
CNN-Tuned   81.0

Results (Average AUC %)

SVM-BoW     93.1
SVM-Rich    94.9
CNN-Base    97.1
CNN-Tuned   97.8

Results (F-score %)

[Figure: per-hallmark F-scores (axis range 50-95%) for SVM-BoW, SVM-Rich, CNN-Base, and CNN-Tuned across the ten hallmarks: Proliferative signaling, Evading growth, Resisting cell death, Replicative immortality, Angiogenesis, Invasion & metastasis, Genomic instability, Tumor promoting inflammation, Cellular energetics, Avoiding immune destruction.]

Results

• CNNs greatly reduce the burden of handcrafting and feature engineering for text classification.

• More portable than an SVM pipeline.

• The hyperparameter space is large, and exhaustive search is prohibitive.


Conclusions

• We investigated the application of CNNs to the biomedical

domain text classification task of identifying the Hallmarks of

Cancer.

• We demonstrated that a CNN using only text and embeddings can achieve performance competitive with a feature-heavy SVM classifier.

• We further adapted the CNN to the task by oversampling

positive examples, using tuned embeddings induced from

biomedical text, and tuning hyperparameters. We achieve a

substantive improvement over the previous state-of-the-art.


Thank you for listening!
