
DEEP LEARNING TECHNIQUES FOR ANALYZING CLINICAL LUNG CANCER DATA

BY

HAOZE DU

A Thesis Submitted to the Graduate Faculty of

WAKE FOREST UNIVERSITY GRADUATE SCHOOL OF ARTS AND SCIENCES

in Partial Fulfillment of the Requirements

for the Degree of

MASTER OF SCIENCE

Computer Science

August 2019

Winston-Salem, North Carolina

Copyright © 2019 by Haoze Du

Approved By:

Samuel S. Cho, Ph.D., Advisor

William Turkett, Ph.D., Chair

V. Paul Pauca, Ph.D.

Acknowledgments

First, I would like to thank my advisor, Samuel Cho, Ph.D., for offering me such a great opportunity to work in his research group, and for providing support, resources, and training. He gave me a lot of helpful advice for my study and my life. He has also shown a great sense of responsibility for my academic career.

I am very grateful to my committee members, Dr. Pauca and Dr. Turkett. Many thanks for your time and help on this thesis. Also, thank you for sharing your academic experiences with me.

To all professors in the Department of Computer Science, thank you all for your warm help. In particular, I would like to thank Dr. Fulp, the first professor I met at Wake Forest University, who helped me a lot at the very beginning; and Dr. Torgersen, who is very friendly and easy-going in class and after class.

Lastly, I would like to thank my dearest friend Liang Li and my family for their support. Their selfless encouragement and support made it much easier for me to study abroad and make progress in my life.


Table of Contents

Acknowledgments
List of Figures
List of Tables
List of Abbreviations
Abstract
Chapter 1 Introduction
Chapter 2 Overview of Machine Learning Techniques
    2.1 Supervised Learning
        2.1.1 Notational Conventions and Types of Supervised Learning
        2.1.2 Model Evaluation
        2.1.3 Decision Tree
        2.1.4 Support Vector Machine
        2.1.5 Artificial Neural Networks
    2.2 Unsupervised Learning
    2.3 Reinforcement Learning
Chapter 3 Ensemble Methods and Cascade Forest
    3.1 Basic Theory of Ensemble Methods
    3.2 Ensemble Strategies
    3.3 Cascade Forest
        3.3.1 Motivation
        3.3.2 Structure of Cascade Forest
        3.3.3 Base Learners of Cascade Forest
Chapter 4 Applying Cascade Forest on the SEER Dataset for Survivability Prediction of Lung Cancer
    4.1 Data Acquisition and Preprocessing
        4.1.1 SEER Dataset
        4.1.2 Data Re-encoding
        4.1.3 Dimensional Reduction
        4.1.4 Training Set and Test Set
    4.2 Building Cascade Forest Model
        4.2.1 Hyperparameter Setting and Tuning
        4.2.2 Modified Cascade Forest for Feature Importance Analysis
        4.2.3 Model Training
        4.2.4 Result and Analysis
    4.3 Evaluation and Comparison
Chapter 5 Conclusion and Future Work
Bibliography
Appendix A Description of Variables in SEER Dataset
Appendix B An Example of Record in SEER Dataset
Curriculum Vitae Haoze Du


List of Figures

1.1 Number of citations (excluding self-citations) related to machine learning and cancer over the past decade.
2.1 The relationship of generalization error, bias, and variance.
2.2 Generated decision tree, trained on Iris dataset.
2.3 Support vector and margin.
2.4 XOR problem using SVM.
2.5 Biological neuron and artificial neuron.
2.6 The graph of ReLU, sigmoid, and tanh activation functions.
2.7 Multi-layer feedforward neural network solving XOR problem.
2.8 Schematic of neural network.
2.9 An example of applying BP on a feedforward neural network.
2.10 An example of CNN.
2.11 The structure of LSTM memory unit.
2.12 PCA on Iris dataset, with 3 principal components selected.
2.13 Reinforcement learning structure.
3.1 The diagram of general ensemble methods.
3.2 Structure of cascade forest.
4.1 Non-zero Lasso coefficients (λ ≥ 0.001 in log scale) for different values of λ.
4.2 The correlation matrix for input features after Lasso regression.
4.3 The correlation matrix after dropping highly correlated features.
4.4 Modified cascade forest.
4.5 Training accuracy of estimators in cascade forest, trained by Lasso data.
4.6 Performance comparison of different dimensional reduction methods.
4.7 Importance of features, generated by cascade forest using Lasso data.
4.8 Importance of features, generated by cascade forest using data from prior works.
4.9 ROC curves on Lasso data.
4.10 ROC curves on data from PCA.
4.11 ROC curves on data from prior works.
4.12 Elapsed time for Cascade Forest, Random Forest, SVM and DNN.


List of Tables

1.1 GENIE, TCGA, and SEER cancer databases overview.
1.2 Geographic areas and years covered in database “Incidence - SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2017 Sub (1973-2015 varying)”.
2.1 Confusion matrix for binary classification.
2.2 Common kernel functions.
2.3 Common activation functions.
2.4 Definition of notations in backpropagation NN.
4.1 SEER variables categories.
4.2 Features subset from prior studies that we used for comparison.
4.3 The value R² with different values of λ.
4.4 Features after Lasso regression.
4.5 Result analysis for cascade forest on different datasets.
4.6 The optimal hyperparameters set from this study.
4.7 Comparison of cascade forest and other methods.
4.8 The execution time of Cascade Forest, SVM, DNN, and RF.


List of Abbreviations

AUC Area Under the (ROC) Curve

Bagging Bootstrap aggregation

BP error BackPropagation

CNN Convolutional neural network

CPH Cox proportional hazard model

DNN Deep neural network

ET Extra Trees classifier

GAN Generative adversarial network

GCForest Multi-Grained Cascade Forest

GENIE Genomics Evidence Neoplasia Information Exchange Program

ICD-O-3 International Classification of Diseases for Oncology, 3rd Edition

LASSO Least absolute shrinkage and selection operator

LSTM Long short-term memory

MDG Mean Decrease Gini

MP neuron McCulloch-Pitts neuron

MSE Mean Squared Error

NN Neural network

PCA Principal Component Analysis

RBF Radial basis function

ROC Receiver Operating Characteristic

ReLU Rectified Linear Unit

RF Random forest classifier

RNN Recurrent neural network

SEER the Surveillance, Epidemiology, and End Results Program

SVM Support vector machine

tanh Hyperbolic tangent

TCGA The Cancer Genome Atlas Program


Abstract

Haoze Du

With the continued public concern about cancer identification in patients, many methods have been implemented to analyze clinical records to gain actionable information and make meaningful predictions of cancer patients' outcomes. It is necessary to accurately predict the efficacy of a specific therapy, or to identify a combination of actionable treatments in clinical practice, based on clinical datasets. While conventional machine learning methods such as artificial neural networks and support vector machines have shown promise, they have significant room for improvement. In this thesis, we attempted to train and optimize an innovative deep learning method called cascade forest, which is inspired by artificial neural networks, as well as a number of traditional machine learning methods and deep neural networks. Cutting-edge machine learning tools such as TensorFlow and Scikit-learn on the GPU platform, which allows parallel computation to enhance their performance, were used to improve time efficiency. The outcomes of this thesis include: 1) predicting the outcomes of a cancer patient based on clinical data from the publicly available SEER database; 2) evaluating the patient outcomes by comparing the models based on different datasets; 3) attempting to increase the accuracy and reduce the execution time for model training by optimizing machine learning models.


Chapter 1: Introduction

Global Cancer Statistics 2018 reports that there were 2,093,876 new cases of patients diagnosed with lung cancer and 1,761,007 deaths related to lung cancer in 2018 [1]. With the continued growth in the incidence of lung cancer and the accumulation of patient data, it is now possible to use statistical analyses to predict patient outcomes accurately. A precise survival prognosis not only helps patients understand their survival expectation, but also helps researchers understand the development of the disease and guides clinical therapy. The prediction of a specific lung cancer patient's outcome based on the input clinical data is usually an important factor in deciding the proper treatment for that patient [2].

According to the National Cancer Institute, many types of lung cancer grow quickly and spread rapidly, so early detection and prompt treatment are vital to patients [3]. Analyzing data related to lung cancer and accurately predicting the outcomes of lung cancer patients is therefore critical. The leading lung cancer research databases include “the Surveillance, Epidemiology, and End Results program” (SEER), “The Cancer Genome Atlas Program” (TCGA), and “Genomics Evidence Neoplasia Information Exchange” (GENIE). SEER and TCGA are both provided by the National Cancer Institute, but they were established with different goals. SEER focuses on collecting national cancer case data in order to provide cancer statistics that help reduce the cancer burden among the U.S. population [4]. TCGA, on the other hand, concentrates on characterizing cancer with genomic, epigenomic, and clinical data and on finding the connections between them, in order to improve the ability to diagnose, treat, and prevent cancer [5]. GENIE is a program sponsored by the American Association for Cancer Research, which aims to provide the statistical power to link clinical-grade cancer genomic data with clinical outcomes [6]. A comparison of the GENIE, TCGA, and SEER datasets is given in Table 1.1. For clinical data research, SEER is more commonly used than the other two datasets because it offers much more clinical data and more features with which to analyze them.

| Database | GENIE | TCGA | SEER |
|---|---|---|---|
| Number of patients | 56,970 patients in total (in GENIE 5.0), (9,438 related to lung cancer) | 11,315 cases in total (by 2019), (1,176 related to lung cancer) | 106,879,966 patients in total (by 2019), covering approximately 34.6% of the U.S. population (SEER 21) |
| Cancer classes | Cancer cases are classified in 82 main classes | Cancer cases are classified in 33 main classes | Cancer cases are classified in 22 main classes based on site and histology |
| Data classes | Genomic data and related clinical data | Genomic, epigenomic and clinical data | Clinical data |
| Number of features for clinical data | 13 | Number varies for different projects | 191 |
| Data types | Text | Image/Text | Text |

Table 1.1: GENIE, TCGA, and SEER cancer databases overview.

In this thesis, the clinical data come from a specific SEER database published in 2017, “Incidence - SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2017 Sub (1973-2015 varying)” [7]. This database covers approximately 27.8% of the U.S. population (based on the 2010 census) and contains records for 10,050,814 tumors in total [4]. The geographic areas and years covered by this database are listed in Table 1.2.

| Geographic Area | Year range | Geographic Area | Year range |
|---|---|---|---|
| San Francisco-Oakland SMSA | 1973+ | San Jose-Monterey | 1992+ |
| Connecticut | 1973+ | Los Angeles | 1992+ |
| Detroit (Metropolitan) | 1973+ | Alaska Natives | 1992+ |
| Hawaii | 1973+ | Rural Georgia | 1992+ |
| Iowa | 1973+ | California excluding SF/SJM/LA | 2000+ |
| New Mexico | 1973+ | Kentucky | 2000+ |
| Seattle (Puget Sound) | 1974+ | Louisiana | 2000+ |
| Utah | 1973+ | New Jersey | 2000+ |
| Atlanta (Metropolitan) | 1975+ | Greater Georgia | 2000+ |

Table 1.2: Geographic areas and years covered in database “Incidence - SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2017 Sub (1973-2015 varying)”.

Conventional statistical methods are commonly used to predict the outcomes of cancer patients based on clinical data. The Cox proportional hazard (CPH) model, one of the most frequently used models, is designed for survival analyses of cancer patients [8]. The CPH model is expressed by the hazard function for subject $i$, denoted $h_i(t)$. This function, shown in Equation (1.1), can be briefly explained as the risk of dying at time $t$ [9].

$$h_i(t) = h_0(t) \cdot e^{\beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}} \tag{1.1}$$

where $t$ represents the survival time; the covariates $(x_{i1}, x_{i2}, \cdots, x_{ik})$ are the $k$ input features for subject $i$; the coefficients $(\beta_1, \beta_2, \cdots, \beta_k)$ estimate the impact of the corresponding covariates; and $h_0(t)$ is the baseline hazard at time $t$, that is, the hazard when all covariates are equal to zero. To simplify the calculation, Equation (1.1) is usually transformed using the natural logarithm ($\ln$):

$$\ln \frac{h_i(t)}{h_0(t)} = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} \tag{1.2}$$
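As a small numerical illustration of Equations (1.1) and (1.2), the sketch below evaluates the hazard for one subject and checks that the logarithm of the hazard ratio recovers the linear predictor. The coefficients, covariates, and baseline hazard are hypothetical values chosen only for illustration; they are not estimated from SEER data, and the function names are our own.

```python
import math

def linear_predictor(beta, x):
    """Right-hand side of Equation (1.2): beta_1*x_i1 + ... + beta_k*x_ik."""
    return sum(b * xj for b, xj in zip(beta, x))

def hazard(h0_t, beta, x):
    """Equation (1.1): baseline hazard at time t scaled by exp(linear predictor)."""
    return h0_t * math.exp(linear_predictor(beta, x))

# Hypothetical values, assumed only for this example.
beta = [0.5, -0.2]   # assumed coefficients
x_i = [1.0, 3.0]     # assumed covariates of subject i
h0_t = 0.01          # assumed baseline hazard at some time t

h_i = hazard(h0_t, beta, x_i)
# ln(h_i(t) / h_0(t)) should equal the linear predictor, as in Equation (1.2).
log_ratio = math.log(h_i / h0_t)
```

A positive coefficient raises the hazard relative to the baseline and a negative one lowers it, which is how the model expresses the influence of each covariate.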

The CPH model is used to identify the significance of each feature for the survival of cancer patients. However, as Equations (1.1) and (1.2) show, it assumes that the logarithm of the hazard ratio is a linear combination of the covariates $\mathbf{x}$. Because patient outcomes usually depend on complex interactions and relationships between variables, this assumption can be too simple to predict cancer patients' outcomes accurately, which may lead to insufficient or unnecessary treatment.

In contrast, modern machine learning methods generate prediction models by learning representations of the training data, and these models can be considerably more accurate than conventional statistical methods. As a branch of artificial intelligence that enables the detection of relationships in datasets, machine learning has recently been applied to lung cancer clinical data to predict patients' outcomes [10]. Figure 1.1 shows that the number of citations of academic papers related to machine learning and cancer increased from 2010 to 2018.

Figure 1.1: Number of citations (excluding self-citations) related to machine learning and cancer over the past decade.

Agrawal et al. used multiple supervised machine learning techniques, including support vector machines, artificial neural networks, decision trees, random forests, and others, to analyze the survivability of lung cancer patients and to compare the performance of these methods on data from the SEER database. In their paper, they also designed a user-friendly online outcome calculator for patients, which requires no professional knowledge to use [11].

Lynch et al. applied a number of supervised learning techniques, including linear regression, decision trees, and gradient boosting machines (GBM), to the Surveillance, Epidemiology, and End Results program (SEER) [3] database to classify lung cancer patients in terms of survival [12]. Lynch et al. also applied unsupervised machine learning techniques for classification and clustering to a collection of descriptive variables from 10,442 lung cancer patient records in the SEER database. Their results show that unsupervised data analysis techniques may be useful for classifying patients, with the resulting classes serving as effective proxies for survival prediction [13].

Wang et al. proposed a two-stage machine learning model to enhance cancer survival prediction, based on a decision-tree-based imbalanced ensemble classification method and a selective ensemble regression method. This approach can effectively handle the imbalanced colorectal cancer data from the SEER database, and the proposed regression method outperforms several state-of-the-art methods [14].

The machine learning methods listed above have made remarkable achievements in analyzing large sets of clinical data to draw conclusions and to predict the survivability of a specific lung cancer patient. However, as the datasets continue to grow, predicting lung cancer patients' outcomes may become increasingly difficult for two main reasons: one is the number of training samples, and the other is model complexity, which affects the execution time of a given method. As such, there is strong motivation to develop efficient methods that analyze clinical data accurately.

A new ensemble method based on decision trees, named GCForest, was first introduced and implemented by Zhou and his colleagues [15]. GCForest performs well on a variety of classification and regression tasks involving image and text input. This thesis focuses on making and evaluating meaningful predictions from data in the SEER clinical database using the GCForest ensemble of decision trees.

This thesis is organized as follows. Chapter 2 gives a brief overview of typical machine learning methods for context. Chapter 3 introduces the development of ensemble methods and deep forests. Chapter 4 presents the implementation of a deep forest method optimized for clinical data and compares its training efficiency and classification accuracy with those of conventional ML methods, including support vector machines, random forests, and deep neural networks. Finally, Chapter 5 presents conclusions and suggests future work.

Chapter 2: Overview of Machine Learning Techniques

Machine learning, and deep learning in particular, provides the ability to automatically recognize unknown patterns and create high-performing predictive models from data, and it has become a very active topic with many applications in recent years [16]. Based on the kind of data available and the specific research task, machine learning can generally be divided into at least three types [17]:

• Supervised learning. The machine learning model learns from a labeled dataset. The labels provide the value associated with each data item, which the algorithm uses both to construct the model and to evaluate the constructed model's accuracy by comparing the predicted labels/values with the actual labels/values on a test set (a subset of the dataset).

• Unsupervised learning. The machine learning model attempts to extract features and patterns on its own from unlabeled input data.

• Reinforcement learning. The machine learning model learns through a reward system: reward feedback is provided to the learner when it performs a better action in a particular situation.

2.1 Supervised Learning

2.1.1 Notational Conventions and Types of Supervised Learning

As described earlier, a supervised learning model uses two different datasets, a training set $S$ and a test set $T$. They are generated from the dataset $D$, usually by hold-out, cross-validation, or bootstrap. After the model is trained on $S$, the test error is evaluated by applying the model to $T$ in order to estimate the model's generalization error in real applications. In supervised learning, the labeled training dataset $S$ is given as the pair $(\mathbf{X}, \mathbf{Y})$, and $(\mathbf{x}_i, y_i)$ is one sample from the training set. For each sample $(\mathbf{x}_i, y_i)$, $\mathbf{x}_i = \{x_{i,1}, x_{i,2}, \ldots, x_{i,m}\}$ is the feature vector, where $m$ is the total number of features in the training set, and $y_i$ is the actual label. For the test set $T = \{(\mathbf{x}_{test}, y_{test}), \cdots\}$, $\mathbf{Y}_{predict}$ is predicted from $\mathbf{X}_{test}$ by applying the learned model. Comparing the actual values $\mathbf{Y}_{test}$ with the predicted values $\mathbf{Y}_{predict}$ determines the accuracy of the given machine learning model, as described in the following section.

The common tasks in supervised learning are classification and regression. Classification is the task of predicting the output $y$ as a discrete label, which indicates that a sample $\mathbf{x}$ belongs to a specific class or category. Regression predicts continuous quantities.

2.1.2 Model Evaluation

(1) Evaluation Metrics

In machine learning, error is one of the most common metrics for evaluating a model's performance. In general, the error is the difference between the actual value and the value predicted by a learner. The error measured during training is called the training error or empirical error, and the error measured on the test set is called the test error, which is used to estimate the model's generalization error in real applications.

Because classification and regression produce different kinds of outputs, the generalization error of classification models and of regression models is estimated with different methods.

Typically, the evaluation of classification models is based on the indicator function $I(p)$, which accepts a proposition $p$ as input and is defined as:

$$I(p) = \begin{cases} 1, & p \text{ is true} \\ 0, & p \text{ is false} \end{cases} \tag{2.1}$$

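The error of a trained learner $f(\cdot)$ on a specific dataset $D$, denoted $E(f; D)$, and the accuracy of the same learner on the same dataset are defined as follows, where $m$ here denotes the number of samples in $D$:

$$E(f; D) = \frac{1}{m} \sum_{i=1}^{m} I\big(f(\mathbf{x}_i) \neq y_i\big) \tag{2.2}$$

$$\mathrm{Accuracy}(f; D) = \frac{1}{m} \sum_{i=1}^{m} I\big(f(\mathbf{x}_i) = y_i\big) = 1 - E(f; D) \tag{2.3}$$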

Regression models are often evaluated by mean squared error (MSE), defined as:

$$\mathrm{MSE} = E(f; D) = \frac{1}{m} \sum_{i=1}^{m} \big(f(\mathbf{x}_i) - y_i\big)^2 \tag{2.4}$$
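The metrics in Equations (2.2) to (2.4) can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in this thesis; it assumes the indicator function returns 1 when the proposition holds, and the function names and sample values are hypothetical.

```python
def indicator(p):
    """Indicator function: 1 if the proposition p holds, 0 otherwise."""
    return 1 if p else 0

def error_rate(y_pred, y_true):
    """Equation (2.2): fraction of samples whose prediction differs from the label."""
    return sum(indicator(p != t) for p, t in zip(y_pred, y_true)) / len(y_true)

def accuracy(y_pred, y_true):
    """Equation (2.3): fraction of samples predicted correctly."""
    return sum(indicator(p == t) for p, t in zip(y_pred, y_true)) / len(y_true)

def mse(y_pred, y_true):
    """Equation (2.4): mean squared difference between prediction and label."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

# Hypothetical classification labels and predictions.
labels = [1, 0, 1, 1]
predictions = [1, 1, 1, 0]
clf_error = error_rate(predictions, labels)   # two of four samples are wrong
clf_accuracy = accuracy(predictions, labels)

# Hypothetical regression targets and predictions.
reg_mse = mse([2.0, 4.0], [1.0, 5.0])
```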

For classification problems, the confusion matrix and related indicators are also commonly used to evaluate the performance of learners. For a binary classification problem, the confusion matrix is defined as shown in Table 2.1.

| Predicted value \ Actual value | Positive | Negative |
|---|---|---|
| Positive | True Positive (TP) | False Positive (FP) |
| Negative | False Negative (FN) | True Negative (TN) |

Table 2.1: Confusion matrix for binary classification.

where TP and TN are the numbers of correctly classified positive and negative samples, respectively, and FN and FP are the numbers of incorrectly classified positive and negative samples, respectively.

Some commonly used indicators are defined as follows:

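$$\mathrm{Accuracy} = \frac{TN + TP}{TN + TP + FN + FP} \tag{2.5}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{2.6}$$

$$\mathrm{Recall} = \mathrm{TPR}\ (\text{True Positive Rate}) = \frac{TP}{TP + FN} \tag{2.7}$$

$$\mathrm{FPR}\ (\text{False Positive Rate}) = \frac{FP}{FP + TN} \tag{2.8}$$

$$F1\ \mathrm{score} = 2 \times \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{2.9}$$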
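To make Equations (2.5) to (2.9) concrete, the following sketch counts the four confusion-matrix cells from a list of predictions and derives the indicators from them. The labels and predictions are hypothetical, the helper names are our own, and the sketch assumes that none of the denominators is zero.

```python
def confusion_counts(y_pred, y_true, positive=1):
    """Count TP, FP, FN, TN for a binary problem."""
    tp = fp = fn = tn = 0
    for p, t in zip(y_pred, y_true):
        if p == positive and t == positive:
            tp += 1          # predicted positive, actually positive
        elif p == positive:
            fp += 1          # predicted positive, actually negative
        elif t == positive:
            fn += 1          # predicted negative, actually positive
        else:
            tn += 1          # predicted negative, actually negative
    return tp, fp, fn, tn

def indicators(tp, fp, fn, tn):
    """Equations (2.5)-(2.9); assumes every denominator is non-zero."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tn + tp) / (tn + tp + fn + fp),
        "precision": precision,
        "recall": recall,
        "fpr": fp / (fp + tn),
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
counts = confusion_counts(y_pred, y_true)
scores = indicators(*counts)
```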

(2) Overfitting

Ideally, the training process of a machine learning model minimizes the generalization error. However, the generalization error depends on unknown data that cannot be used for training, so in practice it is approximated by the test error. Overfitting usually happens when the model fits its parameters too closely to the particular observations in the training dataset and does not fit new data well, which means that the model has a low training error but a relatively high test error. In contrast, underfitting usually happens when the model has a high training error and therefore represents the training data poorly.

Variance, bias, and noise are used to analyze the generalization error for regression tasks. For a specific test sample $\mathbf{x}$, let $y_D$ denote the label of $\mathbf{x}$ in dataset $D$ and $y$ the actual label of $\mathbf{x}$. Let $f(\mathbf{x}; D)$ be the prediction for input $\mathbf{x}$ of the model trained on dataset $D$, which is an estimate of the actual model $f$. The expected prediction over training sets $D$ is:

$$\bar{f}(\mathbf{x}) = \mathbb{E}_D\left[f(\mathbf{x}; D)\right] \tag{2.10}$$

The variance across different training sets of the same size, denoted $\mathrm{var}(\mathbf{x})$, indicates the impact of changing the training data and is defined as:

$$\mathrm{var}(\mathbf{x}) = \mathbb{E}_D\left[\left(f(\mathbf{x}; D) - \bar{f}(\mathbf{x})\right)^2\right] \tag{2.11}$$

The noise, which represents the lower bound of the generalization error, is defined as:

$$\varepsilon^2 = \mathbb{E}_D\left[(y_D - y)^2\right] \tag{2.12}$$

The bias is defined as the distance between the expected predicted output and the actual output, and represents the fitting ability of a specific learner:

$$\mathrm{bias}^2(\mathbf{x}) = \left(\bar{f}(\mathbf{x}) - y\right)^2 \tag{2.13}$$

To analyze the generalization error, Geman et al. introduced the bias-variance decomposition in 1992, which divides the generalization error into three parts, bias, variance, and noise, as Equation (2.14) shows [18]:

$$E(f; D) = \mathrm{bias}^2(\mathbf{x}) + \mathrm{var}(\mathbf{x}) + \varepsilon^2 \tag{2.14}$$
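The decomposition in Equation (2.14) can be checked numerically on a tiny example by exact enumeration. The sketch below assumes three equally likely training sets, each giving one prediction for the same test sample, and two equally likely noisy labels whose mean equals the actual label (zero-mean noise, independent of the training set). All numbers are hypothetical and serve only to illustrate the identity.

```python
from itertools import product

# Hypothetical predictions f(x; D) from three equally likely training sets.
predictions = [1.0, 2.0, 3.0]
# Actual label y, and two equally likely noisy labels y_D with mean y.
y = 2.5
noisy_labels = [2.0, 3.0]

# Equation (2.10): expected prediction over training sets.
f_bar = sum(predictions) / len(predictions)
# Equation (2.11): variance of the prediction around its expectation.
variance = sum((f - f_bar) ** 2 for f in predictions) / len(predictions)
# Equation (2.12): noise, the spread of observed labels around the actual label.
noise = sum((yd - y) ** 2 for yd in noisy_labels) / len(noisy_labels)
# Equation (2.13): squared bias of the expected prediction.
bias_sq = (f_bar - y) ** 2

# Expected squared error, enumerating every (training set, noisy label) pair.
pairs = list(product(predictions, noisy_labels))
expected_error = sum((f - yd) ** 2 for f, yd in pairs) / len(pairs)
```

Under these assumptions the expected squared error equals the sum of the three terms; if the noise did not have zero mean, an additional cross term would appear.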

As Figure 2.1 shows, as the number of training iterations increases, the bias decreases whereas the variance increases. At the beginning of training, the model is not yet well trained, so the generalization error is dominated by the bias. Toward the end of training, the variance increases because the learner has picked up features that are specific to the training set rather than general, which indicates overfitting.

Figure 2.1: The relationship of generalization error, bias, and variance.

Early stopping, cross validation, and regularization are three common methods to prevent a model from overfitting. As its name suggests, early stopping terminates the training process before the variance becomes large, in order to decrease the generalization error [19]. For example, tracking the accuracy on the test set is a simple but effective way to prevent overfitting: the model is trained until the accuracy no longer increases. However, early stopping may be too conservative, in which case it can lead to underfitting.

Cross validation splits the whole dataset $D$ into $k$ mutually exclusive subsets of the same size, $D = D_1 \cup D_2 \cup \cdots \cup D_k$, $D_i \cap D_j = \emptyset\ (i \neq j)$. In each round, cross validation uses $k - 1$ subsets for training and the remaining subset for testing, so it runs $k$ rounds of training and returns $k$ results. This is called k-fold cross validation. Usually, k-fold cross validation is repeated $p$ times with different random splits of the data, and the final evaluation is the mean over the $p$ repetitions of k-fold cross validation.
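The splitting step of k-fold cross validation can be sketched as follows. This is a minimal plain-Python illustration with hypothetical function and parameter names; when the number of samples is not divisible by $k$, the fold sizes differ by at most one. Repeating the procedure $p$ times corresponds to calling the function with $p$ different seeds.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)          # random split of the data
    folds = [indices[i::k] for i in range(k)]     # k mutually exclusive subsets
    for i in range(k):
        test_idx = folds[i]                       # one subset for testing
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx                 # remaining k - 1 subsets for training

# Example: 10 samples split into 5 folds.
splits = list(k_fold_indices(10, 5))
```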

Regularization aims to decrease model complexity by adding a regularization term, or penalty term, to the loss function in order to reduce the risk of overfitting. The basic form of the regularization term is $L(\mathbf{w})$, where $\mathbf{w}$ contains the weight of each input feature; such a term can, for example, represent the number of non-zero terms in $\mathbf{w}$. One common regularization term is the L2 norm, defined as follows:

$$L_2: \|\mathbf{w}\|_2 = \sqrt{\sum_{i=1}^{m} w_i^2} \tag{2.15}$$

Another regularization term is the L1 norm, which uses the sum of the absolute values of the weights instead of the sum of their squares:

$$L_1: \|\mathbf{w}\|_1 = \sum_{i=1}^{m} |w_i| \tag{2.16}$$

Regularization is also used in dimensionality reduction. The least absolute shrinkage and selection operator (Lasso) [20] performs both feature selection and regularization in order to enhance the prediction accuracy and interpretability of the resulting machine learning model. Lasso adds the L1 norm to the loss function of the linear regression model, which decreases the risk of overfitting:

$$\min_{\mathbf{w}} \sum_{i=1}^{m} \left(y_i - \mathbf{w}^T \mathbf{x}_i\right)^2 + \lambda \|\mathbf{w}\|_1 \tag{2.17}$$

where $D = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \cdots, (\mathbf{x}_m, y_m)\}$ is the input set of samples, $\lambda \geq 0$ controls the strength of the penalty, and $\mathbf{w} = \{w_1, w_2, \cdots, w_n\}$ are the Lasso coefficients, each of which reflects the importance of the corresponding feature $x_i \in \mathbf{x}$, $i = 1, 2, \cdots, n$, to the target $y$.
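The norms in Equations (2.15) and (2.16) and the Lasso objective in Equation (2.17) can be evaluated directly for a given weight vector. The sketch below only evaluates the objective; it does not perform the minimization, which in practice is left to a solver. The data, weights, and value of $\lambda$ are hypothetical.

```python
import math

def l2_norm(w):
    """Equation (2.15): square root of the sum of squared weights."""
    return math.sqrt(sum(wi * wi for wi in w))

def l1_norm(w):
    """Equation (2.16): sum of absolute weights."""
    return sum(abs(wi) for wi in w)

def lasso_objective(w, X, y, lam):
    """Equation (2.17): squared residuals plus lambda times the L1 norm."""
    residual = sum(
        (yi - sum(wj * xj for wj, xj in zip(w, xi))) ** 2
        for xi, yi in zip(X, y)
    )
    return residual + lam * l1_norm(w)

# Hypothetical weights, samples, targets, and penalty strength.
w = [3.0, -4.0]
X = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, -3.0]
objective = lasso_objective(w, X, y, lam=0.1)
```

Because the penalty grows with every non-zero weight, increasing $\lambda$ pushes more coefficients to exactly zero, which is what makes Lasso usable for feature selection.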


2.1.3 Decision Tree

The decision tree is a very common method in machine learning. A decision tree is “learned” by recursively splitting the original set into subsets based on the information gain, following a divide-and-conquer procedure.

Assume that the $k$-th class in the whole dataset $D$ has frequency $p_k$ $(k = 1, 2, \ldots, m)$. The information gain obtained by splitting $D$ on a feature $a$ is

$$IG(D, a) = I(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} I(D_v) \tag{2.18}$$

Here $I(D)$ is the impurity of $D$, which can be measured in multiple ways. Suppose the discrete feature $a$ has $V$ possible values $\{a_1, a_2, \ldots, a_V\}$. If feature $a$ is used to split the dataset $D$, it generates $V$ sub-nodes, where the $v$-th sub-node contains the subset $D_v$ of samples in $D$ whose value on feature $a$ is $a_v$. A larger information gain indicates a larger increase in purity obtained by splitting $D$ on $a$.

The basic algorithm is given as pseudocode below:

Algorithm 1 The decision tree learning algorithm.

Input:
    Training dataset D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)};
    Feature set A = {a_1, a_2, ..., a_d}

procedure TreeGenerate(D, A)
    Generate a decision tree node k
    if ∀(x, y) ∈ D, y is in the same specific category C then
        Label node k as a leaf node in category C
        return
    end if
    if A = ∅ OR all samples in D have the same values on the features in A then
        Label node k as a leaf node in the category which contains the most samples in D
        return
    end if
    Select the best feature from A, denoted a*, to split the decision tree node
        (the methods to select the best feature are described in the next part)
    for each value a*_v of a* do
        Generate a branch node b for node k
        Let D_v = {(x, y) ∈ D | x has the value a*_v on a*}
        if D_v = ∅ then
            Label branch node b as a leaf node in the category which contains the most samples in D
        else
            Make the root node of TreeGenerate(D_v, A \ {a*}) the branch node b
        end if
    end for
end procedure

Output:
    A decision tree whose root is node k.

One of the most popular algorithms to generate decision trees is ID3 [21]. To

build a decision tree, this algorithm uses entropy as impurity of features, illustrated

as:

\[
I(D) = \mathrm{Ent}(D) = -\sum_{k=1}^{m} p_k \log p_k \tag{2.19}
\]

Substituting I(D) in Equation (2.18),

\[
IG(D, a) = \mathrm{Ent}(D) - \sum_{v=1}^{V} \frac{|D_v|}{|D|} \mathrm{Ent}(D_v) \tag{2.20}
\]


To split a decision tree node, the optimal solution is using feature a∗, which brings

the maximum of information gain:

\[
a^* = \arg\max_{a \in A} IG(D, a) \tag{2.21}
\]
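Equations (2.19) to (2.21) can be sketched in a few lines of Python. This is an illustrative sketch only: the base-2 logarithm, the dictionary-based sample representation, and the toy "outlook"/"windy" data are assumptions, not part of this work.

```python
import math
from collections import Counter


def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), following Equation (2.19)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """IG(D, a) = Ent(D) - sum_v |D_v|/|D| * Ent(D_v), Equation (2.20)."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[feature], []).append(label)
    n = len(labels)
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


def best_feature(rows, labels, features):
    """a* = argmax_a IG(D, a), Equation (2.21)."""
    return max(features, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: "outlook" determines the label, "windy" does not.
rows = [
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
]
labels = ["play", "play", "stay", "stay"]
chosen = best_feature(rows, labels, ["outlook", "windy"])
```

In this toy case, splitting on "outlook" yields pure subsets, so it has the larger information gain and is selected as the split feature.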

Another popular method to split the decision tree node is Classification And

Regression Trees (CART) [22]. In this method, the Gini value is used to indicate the

purity of D.

\[
\mathrm{Gini}(D) = \sum_{k=1}^{m} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{m} p_k^2 \tag{2.22}
\]

From the equation, $\mathrm{Gini}(D)$ is the probability that two samples drawn at random from D belong to different classes. A smaller value of $\mathrm{Gini}(D)$ means a higher purity of D. The Gini index of feature a measures the purity obtained by splitting D on a, and it is defined analogously to Equation (2.18), with $\mathrm{Gini}$ as the impurity measure:

\[
IG(D, a) = \mathrm{Gini\_index}(D, a) = 1 - \sum_{v=1}^{V} \frac{|D_v|}{|D|} \mathrm{Gini}(D_v) \tag{2.23}
\]

So the best feature to split the decision node can be annotated as:

\[
a^* = \arg\max_{a \in A} IG(D, a) \tag{2.24}
\]
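The Gini-based split selection can be sketched as follows. The sketch assumes a binary threshold split on one continuous feature, with candidate thresholds taken at midpoints between sorted values; these choices and the toy measurements are illustrative assumptions, not a description of the CART implementation used in this work.

```python
from collections import Counter


def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2, Equation (2.22)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted_gini(values, labels, threshold):
    """Weighted Gini impurity of the two children of 'value <= threshold'."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


def best_threshold(values, labels):
    """Try midpoints between sorted distinct values; keep the purest split."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return min(candidates, key=lambda t: weighted_gini(values, labels, t))


# Hypothetical petal-width-like measurements with two classes.
widths = [0.2, 0.3, 0.4, 1.3, 1.5, 2.0]
species = ["A", "A", "A", "B", "B", "B"]
split = best_threshold(widths, species)
```

Minimizing the weighted Gini impurity of the children is equivalent to maximizing Equation (2.23), because the two quantities differ only by a sign and a constant.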

Figure 2.2 shows a constructed CART decision tree, trained on the Iris dataset [23]. In this CART, the Gini index is used to split each tree node. For example, when splitting the root node, the attribute and threshold that give the lowest weighted Gini impurity of the resulting child nodes (equivalently, the maximum of Equation (2.23)), “petal width” and “0.8”, are selected. Then the root node is divided into two nodes by the condition “petal width ≤ 0.8”. Applying the split process to all nodes recursively generates the decision tree illustrated in Figure 2.2.


Figure 2.2: Generated decision tree, trained on Iris dataset.

Moreover, the Gini value is also used to evaluate the importance of features in a generated decision tree. This measure is called the Gini importance, and the importance of feature a is defined by Equation (2.25) [22].

\[
\mathrm{Imp}(a) = \sum_{t \in \phi} \Delta IG(t) \tag{2.25}
\]

where $t \in \phi$ is a node in decision tree $\phi$. In addition, the feature importance generated by an ensemble model is based on the importance of features in its decision trees, as Equation (2.25) shows. The application of feature importance in this work is described in Chapter 4.


2.1.4 Support Vector Machine

Support vector machines (SVMs), first introduced in 1963 [24], are a well-established supervised machine learning algorithm. SVMs have been widely used in cancer data research. Listgarten et al. used SVM to analyze the susceptibility of breast cancer for multiple treatments [25]. Ehlers and Harbour applied an SVM model to genomic cancer data to rank 25 primary uveal melanoma tumors, in order to find the correlations between the ranking of the tumors and NBS1 protein [26].

The main idea of SVM is to construct one or more hyperplanes in a high-dimensional space for classification or regression by maximizing the distance, known as the margin, from the hyperplanes to the nearest points of the input data. The hyperplane dividing different classes is usually described as:

\[
w^T x + b = 0 \tag{2.26}
\]

where w = {w1, w2, · · · , wn} is the normal vector which determines the direction of

the hyperplane, and b determines the distance from that hyperplane to the origin

of the multi-dimensional space. So the distance r from one sample x ∈ X to the

hyperplane (w , b) is given as:

\[
r = \frac{|w^T x + b|}{\|w\|} \tag{2.27}
\]

where ‖w‖ is the Euclidean norm of w , as Equation (2.28) shows.

\[
\|w\| = \sqrt{\sum_{i=1}^{n} w_i^2} \tag{2.28}
\]

Ideally, the hyperplane $(w, b)$ can divide all of the training data correctly, which means that for each pair $(x_i, y_i) \in D$:

\[
\begin{cases}
w^T x_i + b \geq +1, & y_i = +1 \\
w^T x_i + b \leq -1, & y_i = -1
\end{cases} \tag{2.29}
\]

As Figure 2.3 illustrates, only the training samples closest to the hyperplane (those with minimum r) satisfy the equality in Equation (2.29), and these training samples are the so-called support vectors. So the distances from the support vectors $(x_+, +1)$ and $(x_-, -1)$ to the hyperplane are:

Figure 2.3: Support vector and margin.

\[
r_+ = \frac{|w^T x_+ + b|}{\|w\|}, \qquad r_- = \frac{|w^T x_- + b|}{\|w\|} \tag{2.30}
\]

Because the support vectors satisfy the equality in Equation (2.29), $w^T x_+ + b = +1$ and $w^T x_- + b = -1$. So Equation (2.30) can be represented as:


\[
r_+ = \frac{|+1|}{\|w\|}, \qquad r_- = \frac{|-1|}{\|w\|} \tag{2.31}
\]

The sum $\gamma$ of the distances from the support vectors of the two categories to the hyperplane, known as the margin between these two categories, is defined as below:

\[
\gamma = r_+ + r_- = \frac{2}{\|w\|} \tag{2.32}
\]
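A small numeric check of Equations (2.27) and (2.32) may help here. The hyperplane and the two points below are hypothetical values chosen so that the points lie exactly on the margin boundaries; they are not data from this work.

```python
import math


def distance_to_hyperplane(w, b, x):
    """r = |w^T x + b| / ||w||, Equation (2.27)."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def margin(w):
    """gamma = 2 / ||w||, Equation (2.32)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, b = [3.0, 4.0], -5.0
x_pos = [2.0, 0.0]  # w^T x + b = +1, a support vector of the positive class
x_neg = [0.0, 1.0]  # w^T x + b = -1, a support vector of the negative class

r_pos = distance_to_hyperplane(w, b, x_pos)
r_neg = distance_to_hyperplane(w, b, x_neg)
gamma = margin(w)
```

Both support vectors lie at distance 1/‖w‖ from the hyperplane, so their distances add up to the margin 2/‖w‖.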

SVM attempts to find the hyperplane which has the maximum margin, i.e. to

find a specific pair of (w , b) to let γ achieve its maximum, as Equation (2.33) shows:

\[
\begin{aligned}
\max_{w, b} \quad & \frac{2}{\|w\|} \\
\text{s.t.} \quad & y_i (w^T x_i + b) \geq 1, \quad i = 1, 2, \ldots, m
\end{aligned} \tag{2.33}
\]

In order to maximize the margin as Equation (2.33) shows, $\|w\|^{-1}$ needs to be maximized, which is equivalent to minimizing $\|w\|^2$. So Equation (2.33) can be recast as below, which is the standard form of SVM.

\[
\begin{aligned}
\min_{w, b} \quad & \frac{1}{2} \|w\|^2 \\
\text{s.t.} \quad & y_i (w^T x_i + b) \geq 1, \quad i = 1, 2, \ldots, m
\end{aligned} \tag{2.34}
\]

Usually, the dataset D cannot be linearly separated in its original feature space, so additional dimensions are needed to generate a separating hyperplane. For example, in the exclusive-or problem (XOR problem) shown in Figure 2.4, a proper function can map the original input feature set x to $\phi(x)$, which has a higher dimension, so that a hyperplane separating the different classes can be generated as below:


Figure 2.4: XOR problem using SVM.

\[
f(x) = w^T \phi(x) + b \tag{2.35}
\]

Then, similar to Equation (2.34), the problem to find the optimal hyperplane in

mapped feature set φ(x ) can be described as:

\[
\begin{aligned}
\min_{w, b} \quad & \frac{1}{2} \|w\|^2 \\
\text{s.t.} \quad & y_i (w^T \phi(x_i) + b) \geq 1, \quad i = 1, 2, \ldots, m
\end{aligned} \tag{2.36}
\]

The original problem on φ(x ), Equation (2.36), can be transformed in dual space

by means of Lagrangian [27]. The transformed problem is given as below, and the α

here is Lagrangian multiplier.


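\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j \phi(x_i)^T \phi(x_j) \\
\text{s.t.} \quad & \sum_{i=1}^{m} \alpha_i y_i = 0, \\
& \alpha_i \geq 0, \quad i = 1, 2, \ldots, m
\end{aligned} \tag{2.37}
\]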

Usually, the calculation of $\phi(x_i)^T \phi(x_j)$ is difficult because $\phi(x)$ may have a very high dimension. The kernel trick, which avoids computing the mapping to the higher-dimensional space explicitly, was introduced in 1992 by B. Boser et al. [27]. The key point of the kernel trick is to find a function $K(\cdot, \cdot)$ such that $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$. The function $K(\cdot, \cdot)$ is a so-called kernel function. Then Equation (2.37) can be written as:

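\[
\begin{aligned}
\max_{\alpha} \quad & \sum_{i=1}^{m} \alpha_i - \frac{1}{2} \sum_{i=1}^{m} \sum_{j=1}^{m} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \\
\text{s.t.} \quad & \sum_{i=1}^{m} \alpha_i y_i = 0, \\
& \alpha_i \geq 0, \quad i = 1, 2, \ldots, m
\end{aligned} \tag{2.38}
\]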

The optimal coefficients of the hyperplane, as Equation (2.39) shows, are obtained from the solution of Equation (2.38). The optimal solution can be evaluated by expanding the kernel function $K(\cdot, \cdot)$ over the training samples, which is known as the support vector expansion.

\[
f(x) = w^T \phi(x) + b = \sum_{i=1}^{m} \alpha_i y_i K(x, x_i) + b. \tag{2.39}
\]


Some common kernel functions are listed in Table 2.2, where $x_i$ and $x_j$ denote two input samples.

Name                                  Equation
Linear kernel                         $K(x_i, x_j) = x_i^T x_j$
Polynomial kernel                     $K(x_i, x_j) = (x_i^T x_j)^d$
Radial basis function (RBF) kernel    $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$

Table 2.2: Common kernel functions.

where d ≥ 1 is the order of polynomial, and σ > 0 is the width of RBF kernel.

A linear kernel is used when the dataset D is linearly separable; it requires fewer parameters and less time to execute than the other kernels. A polynomial kernel is able to map the input feature x into a higher dimensional space $\phi(x)$. However, the values in the kernel matrix become more expensive to calculate as the order of the polynomial kernel grows: the higher d is, the higher the time complexity. The RBF kernel performs well regardless of the size of the dataset, and requires fewer parameters than the polynomial kernel. Thus, based on practical experience, a common way to train an SVM model is to start with the RBF kernel.
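The kernels in Table 2.2 and the idea behind the kernel trick can be sketched as below. The explicit feature map `phi_degree2` is an assumption introduced for illustration: it is the standard map for the degree-2 polynomial kernel on two-dimensional inputs, and it is used only to show that the kernel value equals an inner product in the mapped space.

```python
import math


def linear_kernel(xi, xj):
    """K(xi, xj) = xi^T xj."""
    return sum(a * b for a, b in zip(xi, xj))


def polynomial_kernel(xi, xj, d=2):
    """K(xi, xj) = (xi^T xj)^d."""
    return linear_kernel(xi, xj) ** d


def rbf_kernel(xi, xj, sigma=1.0):
    """K(xi, xj) = exp(-||xi - xj||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def phi_degree2(x):
    """Explicit feature map of the degree-2 polynomial kernel for 2D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


u, v = [1.0, 2.0], [3.0, 0.5]
k_implicit = polynomial_kernel(u, v, d=2)                    # no mapping needed
k_explicit = linear_kernel(phi_degree2(u), phi_degree2(v))   # same value via phi
```

The two values agree up to floating-point rounding, which shows why the kernel function can stand in for the inner product of the mapped features in Equation (2.38).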

2.1.5 Artificial Neural Networks

The basic model of neural networks, the so-called McCulloch-Pitts neuron (MP neuron), was introduced in 1943 by W. McCulloch and W. Pitts [28]. By mathematically simulating the behavior of human neurons, the MP neuron model marked the beginning of research on artificial neural networks. The first artificial neural network model for pattern recognition was the perceptron model, introduced by F. Rosenblatt in 1958 [29]. The two-layer perceptron model was able to generate output by applying arithmetic operations on inputs. One major improvement of neural networks was backpropagation, introduced by P. Werbos in 1974 and successfully applied in LeNet by Y. LeCun et al. in 1989 to recognize handwritten zip codes [30]. During the 1990s, with the development of SVM, the improvement of neural networks temporarily stalled due to the large amount of calculations required. In 2012, A. Krizhevsky et al. developed AlexNet [31], using CUDA [32] on GPUs to accelerate the training process of neural networks, and achieved great success in ImageNet [33] classification, which is now often recognized as the beginning of the deep learning trend.

The following parts describe the structure of artificial neural networks in detail.

(1) Artificial Neuron

Artificial neural networks are inspired by the behavior of biological neural networks studied in brain and neural science [34]. With a layer-by-layer structure similar to biological neural systems, as Figure 2.5(a) shows, the neural network is able to “generate a response” on the basis of stimulation input. The basic unit in a neural network is the neuron, also called a node or a unit, which comes from the MP neuron model. As Figure 2.5(b) shows, each input $x_i$ has an associated weight $w_i$, which indicates the relative importance of this input compared to the other inputs.

(a) biological neuron    (b) artificial neuron model

Figure 2.5: Biological neuron and artificial neuron. Figure (a) is adapted from “Wikimedia Commons”, https://commons.wikimedia.org/wiki/File:Neuron.svg. Source: “Anatomy and Physiology” by the US National Cancer Institute’s Surveillance, Epidemiology and End Results (SEER) Program. Adapted with permission under CC BY-SA 3.0.

The node applies the activation function to the sum of weighted inputs, and

generates the output Y as the equation shows:

\[
Y = f\left( \sum_{i=1}^{n} x_i w_i + b - \theta \right) \tag{2.40}
\]

In the equation, b is the bias, $\theta$ is the threshold against which the sum of weighted inputs is compared, and $f(\cdot)$ is the activation function, which determines the output of the neuron, generating output like “positive” or “negative”. The activation function should be differentiable and monotonic, so that its gradient can provide the direction and step length for updating the weights. Table 2.3 lists the common activation functions.

Activation function    Equation
Sigmoid                $f(t) = \frac{1}{1 + e^{-t}}$
tanh                   $f(t) = \frac{e^{t} - e^{-t}}{e^{t} + e^{-t}}$
ReLU                   $f(t) = \max(0, t)$

Table 2.3: Common activation functions.

The Sigmoid and hyperbolic tangent (tanh) functions are both S-shaped, monotonic, differentiable functions. They are widely used as the activation functions in neural networks that find the minimum of a loss function using minimization approaches such as gradient descent. The differences between the Sigmoid and hyperbolic tangent functions are illustrated in Figure 2.6. From the graph, it is clear that the Sigmoid function varies in the range (0, 1), whereas tanh varies in the range (−1, 1). The tanh function changes more rapidly than Sigmoid when the input x is near 0, so it yields larger gradients there, which can help the whole model converge faster.

Figure 2.6: The graph of ReLU, Sigmoid, and tanh activation functions.

The major drawback of Sigmoid and tanh is that, in a deep neural network (which has multiple layers), the gradient may become too small to update the weights effectively. As a result, the model converges very slowly; this is the so-called vanishing gradient problem. Motivated by the vanishing gradient problem, the Rectified Linear Unit (ReLU) is an activation function with a constant gradient of 1 for positive inputs and 0 for negative inputs [35], as Equation (2.41) shows:

\[
f(x) =
\begin{cases}
x, & x > 0 \\
0, & x \leq 0
\end{cases} \tag{2.41}
\]

When a ReLU is activated with an input above 0, the partial derivative is 1, which allows ReLU to avoid the vanishing gradient problem in multi-layer neural networks. If the input $x \leq 0$, the gradient of ReLU is a constant 0, which is described as a saturated ReLU.

However, this is a potential disadvantage of ReLUs during the training process. A saturated ReLU never activates, so a gradient-based method will not adjust its weights, which could result in slow convergence of the model. This is the so-called “dying ReLU problem”.

To alleviate the potential dying ReLU problems caused by constant 0, a possible

solution is using leaky ReLU [36], a typical variant for ReLU, defined as below:

\[
f(x) =
\begin{cases}
x, & x > 0 \\
0.01x, & x \leq 0
\end{cases} \tag{2.42}
\]

Leaky ReLU has a small but non-zero gradient for negative inputs, compared with a gradient of 1 for positive inputs. This feature allows a gradient-based optimization method to keep adjusting the weights slightly when leaky ReLU is not active, which prevents the unit from becoming a dying ReLU.
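The activation functions discussed above and their derivatives can be written down directly. The last lines give a rough illustration of the vanishing gradient problem under a simplifying assumption: the gradient reaching the first layer of a 10-layer chain is treated as a product of per-layer activation derivatives, ignoring the weights. The depth and the evaluation points are arbitrary choices for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def sigmoid_grad(t):
    s = sigmoid(t)
    return s * (1.0 - s)  # f'(t) = f(t)(1 - f(t))


def tanh_grad(t):
    return 1.0 - math.tanh(t) ** 2


def relu(t):
    return max(0.0, t)


def relu_grad(t):
    return 1.0 if t > 0 else 0.0  # 0 for a saturated ReLU


def leaky_relu(t, slope=0.01):
    return t if t > 0 else slope * t


def leaky_relu_grad(t, slope=0.01):
    return 1.0 if t > 0 else slope  # small but non-zero for negative inputs


# Product of per-layer derivatives over a hypothetical 10-layer chain.
depth = 10
sigmoid_chain = sigmoid_grad(0.0) ** depth  # at most 0.25 per layer
relu_chain = relu_grad(1.0) ** depth        # stays at 1 for active units
```

Even at its steepest point the Sigmoid derivative is 0.25, so the product shrinks quickly with depth, while the derivative of an active ReLU does not shrink at all.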

(a) XOR problem    (b) Multi-layer feedforward neural network structure

Figure 2.7: Multi-layer feedforward neural network solving XOR problem.

(2) Structure of Feedforward Neural Networks

The learning ability of a single neuron may not be sufficient for complex problems, such as the exclusive or (XOR) problem, which is not linearly separable, as Figure 2.7(a) shows. So a model with multiple layers of neurons is implemented. Figure 2.7(b) shows a simple neural network with two layers for solving the XOR problem.

With more layers and neurons, a neural network model is able to fit more complex non-linear problems. Figure 2.8 shows the structure of a neural network with multiple layers as an example, which has one hidden layer and one node in the output layer. This kind of neural network is also called a feedforward neural network. In a feedforward neural network, the neurons in each layer are fully connected with the previous layer and the next layer, but neurons in the same layer are not connected. With this structure, the output of each neuron is passed, scaled by its connection weights, to the neurons of the next layer.

Figure 2.8: Schematic of neural network, with an input layer (Input #1 to Input #4), a hidden layer, and an output layer (Output).

However, as the number of neurons grows, it may become difficult to train the whole multi-layer network, because more neurons in the neural network means more connection weights to be trained. One important method for training the connection weights between the layers of a feedforward neural network is error backpropagation (BP). With BP, the connection weights usually converge quickly during the training process, because BP is able to bring “feedback” to the previous layers. The BP algorithm was originally introduced in the 1970s by Werbos [37], and was fully appreciated after Rumelhart et al. published their work [38].

The BP algorithm works as follows. Figure 2.9 shows a feedforward neural network which has d neurons in the input layer, q neurons in the hidden layer, and l neurons in the output layer. Suppose all of the neurons in that neural network use Sigmoid as the activation function. The notations used in this neural network are defined in Table 2.4.


Figure 2.9: An example of applying BP on a feedforward neural network.

Notation      Description
$\theta_j$    The threshold of the j-th output neuron
$\gamma_h$    The threshold of the h-th hidden neuron
$v_{ih}$      The weight from the i-th input neuron to the h-th hidden neuron
$w_{hj}$      The weight from the h-th hidden neuron to the j-th output neuron
$\alpha_h$    The input of the h-th hidden neuron, $\alpha_h = \sum_{i=1}^{d} v_{ih} x_i$
$\beta_j$     The input of the j-th output neuron, $\beta_j = \sum_{h=1}^{q} w_{hj} b_h$
$b_h$         The output of the h-th hidden neuron

Table 2.4: Definition of notations in backpropagation NN.

For a specific training sample $(x_k, y_k)$ from dataset D, suppose the output of the neural network shown in Figure 2.9 is $\hat{y}_k = (\hat{y}_1^k, \hat{y}_2^k, \ldots, \hat{y}_l^k)$. The output $\hat{y}_k$ generated by the model is an estimation of $y_k$. So $\hat{y}_j^k$ is calculated as:

\[
\hat{y}_j^k = f(\beta_j - \theta_j), \tag{2.43}
\]

so that the MSE of the neural network model on sample (x k,yk) is:

\[
E_k = \frac{1}{2} \sum_{j=1}^{l} \left( \hat{y}_j^k - y_j^k \right)^2. \tag{2.44}
\]

The BP algorithm is based on gradient descent. So with the given learning rate $\eta$ and the error $E_k$, the weight $w_{hj}$ is adjusted by:

\[
\Delta w_{hj} = -\eta \frac{\partial E_k}{\partial w_{hj}}. \tag{2.45}
\]

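Applying the chain rule,

\[
\frac{\partial E_k}{\partial w_{hj}} = \frac{\partial E_k}{\partial \hat{y}_j^k} \cdot \frac{\partial \hat{y}_j^k}{\partial \beta_j} \cdot \frac{\partial \beta_j}{\partial w_{hj}}. \tag{2.46}
\]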

Because $\beta_j = \sum_{h=1}^{q} w_{hj} b_h$, it follows that $\frac{\partial \beta_j}{\partial w_{hj}} = b_h$. The derivative of the Sigmoid function ($f(x) = \frac{1}{1 + e^{-x}}$) is given below:

\[
f'(x) = f(x)(1 - f(x)). \tag{2.47}
\]

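Based on Equations (2.43) and (2.44), the gradient term $g_j$ of the j-th output neuron can be calculated as below:

\[
\begin{aligned}
g_j &= -\frac{\partial E_k}{\partial \hat{y}_j^k} \cdot \frac{\partial \hat{y}_j^k}{\partial \beta_j} \\
&= -(\hat{y}_j^k - y_j^k) f'(\beta_j - \theta_j) \\
&= \hat{y}_j^k (1 - \hat{y}_j^k)(y_j^k - \hat{y}_j^k)
\end{aligned} \tag{2.48}
\]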

So Equation (2.46) can be written as $\frac{\partial E_k}{\partial w_{hj}} = -g_j b_h$ by substituting Equation (2.48) and $\frac{\partial \beta_j}{\partial w_{hj}} = b_h$. So Equation (2.45) can be written as below.

\[
\Delta w_{hj} = \eta g_j b_h \tag{2.49}
\]


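Similarly, the updates for the other parameters in the neural network can be calculated,

\[
\begin{aligned}
\Delta \theta_j &= -\eta g_j, \\
\Delta v_{ih} &= \eta e_h x_i, \\
\Delta \gamma_h &= -\eta e_h, \\
\text{where } e_h &= -\frac{\partial E_k}{\partial b_h} \cdot \frac{\partial b_h}{\partial \alpha_h} \\
&= -\sum_{j=1}^{l} \frac{\partial E_k}{\partial \beta_j} \cdot \frac{\partial \beta_j}{\partial b_h} f'(\alpha_h - \gamma_h) \\
&= \sum_{j=1}^{l} w_{hj} g_j f'(\alpha_h - \gamma_h) \\
&= b_h (1 - b_h) \sum_{j=1}^{l} w_{hj} g_j.
\end{aligned} \tag{2.50}
\]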

The pseudo code below shows how the BP algorithm works:

Algorithm 2 Backpropagation Algorithm.
Input: Training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$; learning rate $\eta$
Procedure:
1: Initialize all the weights and thresholds in (0, 1)
2: repeat
3:     for all $(x_k, y_k) \in D$ do
4:         Calculate $\hat{y}_k$ by Equation (2.43) with the current weights and thresholds.
5:         Calculate $g_j$ by Equation (2.48)
6:         Calculate $e_h$ by Equation (2.50)
7:         Update $w_{hj}$, $v_{ih}$, $\theta_j$, $\gamma_h$ by Equations (2.49) and (2.50)
8:     end for
9: until the training error reaches the threshold, or the iteration count reaches the threshold.
Output: A multi-layer feedforward neural network with trained weights and thresholds.
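Algorithm 2 can be sketched in plain Python for a network with one hidden layer. The variable names follow Table 2.4; the choice of four hidden neurons, the learning rate, the number of epochs, the random seed, and the XOR training set are assumptions made for illustration and are not settings used in this work.

```python
import math
import random


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def forward(x, v, gamma, w, theta):
    """Return hidden outputs b_h and network outputs y_hat_j (Equation 2.43)."""
    d, q, l = len(x), len(gamma), len(theta)
    b = [sigmoid(sum(v[i][h] * x[i] for i in range(d)) - gamma[h]) for h in range(q)]
    y_hat = [sigmoid(sum(w[h][j] * b[h] for h in range(q)) - theta[j]) for j in range(l)]
    return b, y_hat


def total_error(data, v, gamma, w, theta):
    """Sum of E_k (Equation 2.44) over all samples."""
    err = 0.0
    for x, y in data:
        _, y_hat = forward(x, v, gamma, w, theta)
        err += 0.5 * sum((yh - yj) ** 2 for yh, yj in zip(y_hat, y))
    return err


def train(data, q=4, eta=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    d, l = len(data[0][0]), len(data[0][1])
    # Step 1: initialize all weights and thresholds in (0, 1).
    v = [[rng.random() for _ in range(q)] for _ in range(d)]
    w = [[rng.random() for _ in range(l)] for _ in range(q)]
    gamma = [rng.random() for _ in range(q)]
    theta = [rng.random() for _ in range(l)]
    initial = total_error(data, v, gamma, w, theta)
    for _ in range(epochs):
        for x, y in data:
            b, y_hat = forward(x, v, gamma, w, theta)
            # g_j from Equation (2.48) and e_h from Equation (2.50).
            g = [y_hat[j] * (1 - y_hat[j]) * (y[j] - y_hat[j]) for j in range(l)]
            e = [b[h] * (1 - b[h]) * sum(w[h][j] * g[j] for j in range(l)) for h in range(q)]
            # Parameter updates from Equations (2.49) and (2.50).
            for h in range(q):
                for j in range(l):
                    w[h][j] += eta * g[j] * b[h]
            for j in range(l):
                theta[j] -= eta * g[j]
            for i in range(d):
                for h in range(q):
                    v[i][h] += eta * e[h] * x[i]
            for h in range(q):
                gamma[h] -= eta * e[h]
    return initial, total_error(data, v, gamma, w, theta)


xor = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
initial_error, final_error = train(xor)
```

The sketch updates the parameters after every sample and stops after a fixed number of epochs, which corresponds to the iteration threshold in step 9; an error threshold could be added in the same loop.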

The BP algorithm makes training multi-layer neural networks possible. Deep neural networks (DNNs), or neural networks with multiple hidden layers, have been introduced in [17]. The major difference between DNNs and conventional artificial neural networks is the number of hidden layers. Typically, a conventional artificial neural network has three layers (the input layer, the hidden layer, and the output layer), and is trained to be optimized for a specific task. In contrast, DNNs have more layers, and each layer in a DNN produces a representation of the patterns based on the input data from the previous layer [17]. DNNs have recently been applied to speech recognition, computer vision, and clinical data research [16]. The following parts show some representative models based on DNNs.

(3) Convolutional Neural Network

Convolutional neural networks (CNNs), as a special version of DNNs, contain one or more convolutional layers. This special structure allows CNNs to extract features from the spatial domain [39], which gives them better performance in image processing and natural language processing. LeNet-5 [40] was one of the famous early applications of convolutional neural networks; it is able to recognize hand-written digits automatically. An illustration of a 2D CNN is given in Figure 2.10, which shows examples of a max-pooling layer and a convolution layer.

Figure 2.10: An example of CNN.


A convolution layer usually has a convolution kernel, which slides over the whole input data in a specific order (usually from left to right for 1D convolution, or from top left to bottom right for 2D convolution) and extracts the relationships in the spatial domain. A max-pooling layer is a special layer that outputs the maximum of the values in the neighborhood of a specific data point.
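The two layer types can be sketched for the 1D case as follows. The sketch assumes stride 1 with no padding for the convolution and non-overlapping windows for the pooling, and, as is common in deep learning libraries, it does not flip the kernel (so it is strictly a cross-correlation). The signal and the kernel are hypothetical.

```python
def conv1d(signal, kernel):
    """Slide the kernel from left to right over every full-overlap position."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(values, size=2):
    """Keep the maximum of each non-overlapping window of length `size`."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


signal = [0, 0, 1, 1, 1, 0, 0]
edge_kernel = [-1, 1]                  # responds to a rising edge in the signal
features = conv1d(signal, edge_kernel)
pooled = max_pool1d(features, size=2)
```

The convolution output is largest where the signal rises from 0 to 1, and the pooling step keeps that response while halving the length of the feature sequence.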

Convolutional neural networks are widely applied to clinical image recognition. Cireşan et al. applied a convolutional neural network with max-pooling layers to breast cancer histology image data in order to detect mitosis, and won the ICPR 2012 mitosis detection competition [41]. Shen et al. proposed multi-scale convolutional neural networks to automatically classify malignant and benign nodules from computed tomography screening data without an additional nodule segmentation procedure [42]. Esteva et al. trained a single CNN to classify skin cancer by using disease-labeled images as input data [43].

(4) Other Neural Networks

Recurrent neural networks (RNN) are a special type of deep learning model where

the neural networks contain additional weighted edges to create cycles in the network,

in order to extract meaningful information in time series of data [44]. A special type

of RNN called long short-term memory neural network (LSTM) was developed, and

it repeats the specific memory unit to maintain the information from the previous

state [45]. Applications include text and speech recognition, music composition, and

language translation [46]. Figure 2.11 shows the structure of the memory unit in

LSTM, which includes a series of gate functions in the unit to determine whether the

information from the previous states should be kept or ignored.

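Figure 2.11: The structure of LSTM memory unit. The unit takes the previous cell state c〈t−1〉, the previous hidden state h〈t−1〉, and the current input x〈t〉, and produces the new cell state c〈t〉 and hidden state h〈t〉 through sigmoid (σ) and tanh gates.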

Recently, Razavian et al. applied LSTM to predict disease onset based on clinical data [47]. Guan et al. applied three types of RNNs (gated recurrent unit, LSTM, and bidirectional LSTM) to electronic medical records to classify documents into different groups in order to evaluate the impact of treatments [48, 49].

Generative adversarial networks (GANs) represent another approach to machine learning: using a neural network to generate simulated data which is similar to the given input data. GANs usually contain two parts: the generator model, which generates input-like data, and the discriminator model, which determines the source of the given data (original input data or generated data). GANs were recently applied to image processing, computer vision, speech recognition, and so on. Sun et al. used a GAN to develop a method to recognize speech content across multiple Chinese dialects (e.g. Cantonese, Wu and so on) spoken by different people [50]. Evtimov et al. showed that a GAN can be used to mislead CNN-based computer vision algorithms into making incorrect predictions [51].

GAN has also been applied widely in clinical research. Beaulieu-Jones et al.


trained pairs of neural networks to generate simulated data from actual data, which

provided a method to share the simulated patients’ data while preserving their privacy

[52]. Shin et al. used GAN to generate synthetic abnormal MRI images with brain

tumors from public databases in order to increase the diversity of clinical MRI image

data [53]. Rezaei et al. applied GAN on generating segmentation label maps for

images of brain lesions [54].

(5) Platforms Related to Neural Networks

Neural network models with multi-neuron architecture are computationally intensive

but can be computed using parallel algorithms. To carry out these calculations, highly

parallelization-optimized hardware and software tools are strongly needed. A high

performance GPU with multicores and shareable large-capacity cache is needed to

accelerate the training process [55]. Multiple software platforms and tools for working

with and parallelizing neural networks have been developed, such as CUDA [32],

Tensorflow [56], and Keras [57]. The most common languages in machine learning,

especially for neural networks for academic research use, are Python and R, which

are easy to use and have a large number of relevant packages and resources.

In 2017, Nvidia introduced the Volta GPU microarchitecture with a new specialized hardware unit called the Tensor Core, which is able to perform one matrix-multiply-and-accumulate operation on 4 × 4 matrices in a single clock cycle [56]. Tensor Cores are designed to make a tradeoff between calculation precision and time efficiency, as mixed datatypes are used during the calculations, such as half precision float (float 16) and full precision float (float 32) [58]. Research shows that NVIDIA Tensor Cores can strongly accelerate high performance computing through efficient matrix multiplications with an acceptable loss of calculation precision, which can be exploited in training deep learning models and related activities [59].


2.2 Unsupervised Learning

Unsupervised learning is used when the dataset is not labelled. In general, unsupervised learning attempts to find the implicit relations between data, in order to extract meaningful information from the data. Two examples are reducing the dimension of the data (e.g., PCA) and performing clustering (e.g., K-means).

Principal component analysis (PCA) is a common unsupervised learning method for dimensionality reduction. It represents the original input feature set (annotated as $X = \{x_1, x_2, \ldots, x_m\}$) by generating principal components $X' = \{x'_1, x'_2, \ldots, x'_k\}$, $k < m$, which are in a lower dimension, using singular value decomposition. The pseudo code below describes how PCA works.

Algorithm 3 Principal Component Analysis.
Input: The dataset with m input features $X = \{x_1, x_2, \ldots, x_m\}$; the number k of principal components to be generated.
Procedure:
1: $x_i \leftarrow x_i - \frac{1}{m} \sum_{j=1}^{m} x_j$    ▷ Centralizing $x_i$
2: Calculate the covariance matrix $XX^T$ for X.
3: $v, S \leftarrow \mathrm{SVD}(XX^T)$    ▷ Singular values $v = \{v_1, v_2, \ldots, v_m\}$ and singular vectors $S = \{S_1, S_2, \ldots, S_m\}$
4: Select the k singular vectors S with the k largest singular values.
5: Put the selected singular vectors S in a new set $X'$
Output: Principal components $X'$
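The main steps of the procedure can be sketched without external libraries. Note the substitution: instead of a full singular value decomposition, the sketch uses power iteration, which only recovers the direction with the largest eigenvalue of the covariance matrix (the first principal component). The sample-by-feature data layout, the function names, and the toy data are assumptions for illustration.

```python
import math


def center(rows):
    """Subtract the per-feature mean from every sample (step 1)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[r[j] - means[j] for j in range(d)] for r in rows]


def covariance(rows):
    """d x d covariance matrix of already-centered samples (step 2)."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)] for a in range(d)]


def first_component(cov, iters=100):
    """Power iteration: converges to the direction with the largest eigenvalue."""
    d = len(cov)
    vec = [1.0] * d
    for _ in range(iters):
        nxt = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(c * c for c in nxt))
        vec = [c / norm for c in nxt]
    return vec


# Hypothetical 2-feature data lying almost on the line x2 = 2 * x1.
data = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8]]
centered = center(data)
pc1 = first_component(covariance(centered))
# Project every sample onto the first principal component (2D -> 1D).
projected = [sum(r[j] * pc1[j] for j in range(2)) for r in centered]
```

For this toy data the first component points roughly along the direction (1, 2), so a single coordinate per sample preserves most of the variance.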

Here is an example of PCA on the Iris dataset [23], which selects three principal components instead of the four original features in order to reduce the dimension, not only to reduce the complexity for further steps, but also to support visualization of the original data.

Figure 2.12: PCA on Iris dataset, with 3 selected as the number of principal components.

Another common task in unsupervised learning is clustering, which aims to find the internal similarity relationships between samples. K-means is one of the most often-used methods for clustering. The main procedure of K-means is:

1. Input K as the number of the clustering centers;

2. Randomly choose K samples as initial clustering centers;

3. For each sample $x_i$ in sample set X, calculate its distances to all K clustering centers;

4. Assign $x_i$ to the nearest clustering center, and update each clustering center to be the average of the samples associated with that center.

Steps 3 and 4 are iteratively executed until the procedure fulfills some termination condition(s), such as all samples being clustered, no clustering center changing, and/or the MSE of all samples reaching a minimum.
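The four steps above can be sketched as follows. The sketch stops when no sample changes its cluster, which is one of the termination conditions listed; the squared Euclidean distance, the fixed random seed, and the one-feature toy samples are assumptions for illustration.

```python
import random


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(samples, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 2: randomly choose K samples as the initial clustering centers.
    centers = [list(c) for c in rng.sample(samples, k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 3 and 4: assign every sample to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(x, centers[c]))
            for x in samples
        ]
        if new_assignment == assignment:  # no sample changed its cluster
            break
        assignment = new_assignment
        # Step 4: move each center to the mean of its assigned samples.
        for c in range(k):
            members = [x for x, a in zip(samples, assignment) if a == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return centers, assignment


# Hypothetical one-feature samples forming two well-separated groups.
points = [(1.0,), (1.2,), (0.8,), (5.0,), (5.2,), (4.8,)]
centers, assignment = k_means(points, k=2)
```

Because the initial centers are chosen at random, the cluster indices may be swapped between runs with different seeds, but the grouping of the samples is what matters.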

Because of its efficiency and simplicity, K-means clustering has been used in clinical data research for unsupervised learning. Haldar et al. used K-means clustering on three independent asthma datasets of patients’ records to identify asthma phenotypes for making different treatment decisions [60]. Tothill et al. attempted to identify novel molecular subtypes of ovarian cancer by using K-means and to evaluate patient survival within the K-means groups by Cox proportional hazards models [61].

However, the main drawback of K-means is that the number of clustering centers K must be estimated and specified in advance, and it is not straightforward to estimate a proper K. Also, instead of clustering samples by generating borders, K-means clusters the samples by optimizing the clustering centers, which often leads to incorrectly clustered samples [17].

2.3 Reinforcement Learning

In reinforcement learning, the training target is to develop a model (agent) which is

able to improve its performance by interacting with the environment [62], as Figure

2.13 illustrates. A so-called reward signal is generated to indicate how well the model

is interacting with the environment as defined by a reward function, which is different

from the value or label used in supervised learning.

Figure 2.13: Reinforcement learning structure.


During the training process of reinforcement learning, the agent attempts to learn a policy $\pi$ that generates the action $a = \pi(x)$ based on the environment state x so as to obtain the optimal reward. One famous example of reinforcement learning is AlphaGo, an AI for Go (a kind of board game) developed by Google DeepMind [63]. By using reinforcement learning to train itself, AlphaGo defeated some top Go players around the world, including Ke Jie and Lee Sedol, which shows the strong power of reinforcement learning. Moreover, reinforcement learning has a broad future with potential uses in industrial manufacturing, game AI design, and even tuning the hyperparameters of other machine learning models [64].


Chapter 3: Ensemble Methods and Cascade Forest

3.1 Basic Theory of Ensemble Methods

As the old saying goes, “A jack of all trades is a master of none, but oftentimes better than a master of one”. Similarly, ensemble methods, which are a machine learning strategy rather than a specific machine learning method, combine multiple basic individual machine learning models, as Figure 3.1 shows, to optimize the prediction. Ensemble methods can be used for classification, regression, feature selection, outlier detection, and so on.

Figure 3.1: The diagram of general ensemble methods.

Ensemble methods attempt to combine several weak models together in order to

decrease variance (bagging), bias (boosting), or improve predictions (stacking) [65].

There are two different kinds of ensemble methods in general to integrate multiple

learners:

• Homogeneous ensemble. All the individual learners used to construct the ensemble learner are of the same kind, or homogeneous, such as a perceptron unit in a neural network. These learners are called base learners, and the learning algorithm of the learners is a base learning algorithm.

• Heterogeneous ensemble. Some individual learners are not of the same kind, or heterogeneous. The learners in a heterogeneous ensemble are called component learners, which are generated from different machine learning algorithms. For example, considering a specific classification problem, different models including SVM, logistic regression, and neural network are applied to the training data.

From the point of view of how the base learners are organized, ensemble methods can also be divided into 2 groups:

• Sequential ensemble methods. Each base learner is generated sequentially

to exploit the dependence between the base learners. Boosting [66] is one of the

most representative examples of sequential ensemble methods.

• Parallel ensemble methods. Each base learner is generated in parallel to

exploit the independence between the base learners in order to reduce the error.

One of the most popular parallel ensemble methods is bagging [67].

3.2 Ensemble Strategies

Ensemble strategies integrate the outputs from individual learners. Averaging, voting,

and stacking are the three typical ensemble strategies.

Assume the ensemble model H contains T base learners, denoted as {h_1, h_2, · · · , h_T}. The output of each base learner is h_i(x), where x is the given input. The following parts describe how averaging, voting, and stacking integrate these outputs.

For regression tasks, the common strategy is averaging. Averaging is described as

Equation (3.1), where wt represents the weight for learner ht(·).


\[
H(x) = \sum_{t=1}^{T} w_t h_t(x), \qquad \sum_{t=1}^{T} w_t = 1 \tag{3.1}
\]

This equation describes simple averaging when all of the base learners have the same weight, w_t = 1/T. Otherwise, if the weights differ, it is called weighted averaging.

Unlike averaging, voting is a method that performs better on classification tasks [65]. For a given sample x_i, the voting strategy lets the ensemble model output the class predicted by the majority of the individual learners h_t.
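The two strategies can be sketched in a few lines of Python. This is only a minimal illustration: the function names `weighted_average` and `majority_vote` and the toy inputs are assumptions made for the example, not code from this thesis.

```python
from collections import Counter


def weighted_average(outputs, weights=None):
    """Combine the regression outputs h_t(x) with weights that sum to 1."""
    if weights is None:
        # Simple averaging: every learner gets the same weight 1/T.
        weights = [1.0 / len(outputs)] * len(outputs)
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * h for w, h in zip(weights, outputs))


def majority_vote(labels):
    """Return the label predicted by most learners (first seen wins a tie)."""
    return Counter(labels).most_common(1)[0][0]


avg = weighted_average([2.0, 4.0, 6.0])                       # simple averaging
wavg = weighted_average([2.0, 4.0, 6.0], [0.5, 0.25, 0.25])   # weighted averaging
vote = majority_vote([1, 0, 1])                               # two of three say 1
```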

Stacking is an ensemble learning technique that combines multiple learners via an integrating learner, called the meta-learner (a meta-classifier or meta-regressor). The individual learners at the base level are trained on all of the input training data. The meta-learner is then trained on the labels Y of the training data together with the outputs of the base-level models, which serve as its features z, to generate the output H. The basic stacking algorithm is given below:


Algorithm 4 Stacking.
Input:
    Training dataset D = {(x_1, y_1), (x_2, y_2), · · · , (x_m, y_m)};
    Base-level learners L_1, L_2, · · · , L_T;
    Meta-learner L
Procedure:
    for t = 1, 2, · · · , T do
        h_t = L_t(D)
    end for
    D′ = ∅
    for i = 1, 2, · · · , m do
        for t = 1, 2, · · · , T do
            z_it = h_t(x_i)
        end for
        D′ = D′ ∪ ((z_i1, z_i2, · · · , z_iT), y_i)
    end for
    h′ = L(D′)
Output: H(x) = h′(h_1(x), h_2(x), · · · , h_T(x))

During training, the meta-learner will overfit if it is trained on the same training set as the base learners. Thus, cross validation is usually applied to generate the training samples for the meta-learner.
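One common way to apply cross validation here is to build the meta-learner's training set from out-of-fold predictions, so that every meta-feature comes from a base learner that never saw the corresponding sample. The sketch below assumes contiguous folds and two toy base learners (`fit_mean`, `fit_max`); all names are hypothetical and chosen only for illustration.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for f in range(k):
        size = base + (1 if f < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def out_of_fold_features(X, y, learners, k=3):
    """Build D' = [((z_i1, ..., z_iT), y_i)] from held-out predictions.

    Each learner is a function fit(X_train, y_train) that returns predict(x).
    """
    Z = [[None] * len(learners) for _ in X]
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        X_tr = [x for i, x in enumerate(X) if i not in held]
        y_tr = [t for i, t in enumerate(y) if i not in held]
        for t, fit in enumerate(learners):
            predict = fit(X_tr, y_tr)
            for i in held_out:
                Z[i][t] = predict(X[i])  # x_i was not used to fit this model
    return list(zip(Z, y))


# Toy base learners (assumptions): predict the training mean or maximum.
def fit_mean(X_tr, y_tr):
    m = sum(y_tr) / len(y_tr)
    return lambda x: m


def fit_max(X_tr, y_tr):
    m = max(y_tr)
    return lambda x: m


X = [[0], [1], [2], [3], [4], [5]]
y = [0, 1, 2, 3, 4, 5]
D_prime = out_of_fold_features(X, y, [fit_mean, fit_max], k=3)
```

The meta-learner would then be trained on `D_prime` instead of on predictions made for samples the base learners had already seen.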

3.3 Cascade Forest

3.3.1 Motivation

As mentioned in Chapter 2, deep neural networks have recently achieved great success in many different fields, especially in image and voice processing and recognition. However, deep neural networks still have two main drawbacks: they are very complex, with many hyperparameters to tune, and they may have low accuracy when the amount of input data is limited.

Cascade Forest, a part of the multi-Grained Cascade Forest (GCForest), was first developed by Zhou et al. [15, 68] in 2017, and it is a decision tree ensemble method. Inspired by the layered structure of deep neural networks, cascade forest also has a layer-by-layer structure.

3.3.2 Structure of Cascade Forest

The structure of cascade forests, as illustrated in Figure 3.2, is inspired by deep neural networks. Through layer-by-layer processing of raw features, deep neural networks are able to perform representation learning. In cascade forests, each layer accepts the outputs generated by the base learners, i.e. estimators, of the previous layer as input, and passes its own result to the next layer. Each layer of a cascade forest contains an ensemble of heterogeneous base learners, whereas the layers themselves are homogeneous: every layer has the same composition. In the next section, we describe in detail the three types of base learners used in this study, namely Random Forest, Extra Trees, and Logistic Regression.

Figure 3.2: Structure of cascade forest. Suppose there are 2 classes to predict; each layer consists of m base learners, and the whole deep forest may have n layers.

To reduce the risk of overfitting during training, the output produced by each layer of the cascade forest is generated by k-fold cross validation. In detail, each sample from the training set is used as training data k − 1 times, generating k − 1 outputs, and the output of the layer for that sample is the average of these k − 1 outputs. Before a new layer is generated, the performance of the whole cascade forest is evaluated on the validation set. If the performance does not improve significantly, training is terminated, which means that the number of layers in the cascade forest is decided automatically. In contrast to most deep neural networks, whose model complexity is fixed in advance by hyperparameters, cascade forest can terminate training adaptively. This lets the ensemble method decide its own model complexity and enables GCForest to process both small-scale and large-scale training data [15].

Since GCForest was introduced in 2017, research on applications and improvements of the deep forest model [68] has been active. Utkin et al. weighted the outputs of the base learners in each layer and used the weighted average as the output of that layer. These weights are trainable, with the aim of improving the accuracy of cascade forest and speeding up convergence [69].

Some recent studies that apply deep forest models to clinical cancer data are listed below. In 2018, Guo et al. developed BCDForest by modifying GCForest [70] to address cancer subtype classification on small-scale genomic datasets. By adding boosting to the standard cascade forest model, they used BCDForest to analyze genomic data from TCGA and distinguish 11 different types of cancer, including breast cancer and lung cancer. Su et al. proposed Deep-Resp-Forest, based on GCForest, to evaluate the response to anti-cancer drugs by training the model to classify labeled data as “sensitive” or “resistant” [71].


3.3.3 Base Learners of Cascade Forest

In this thesis, we choose random forest, extra trees, and logistic regression as individual learners because these methods need less time and fewer hyperparameters than neural networks and SVM. In addition, using more heterogeneous individual learners in the cascade forest ensemble improves the diversity of the whole model, which helps it make more accurate predictions [68]. The following parts briefly describe these three base learners.

(1) Random Forest

Just like the relationship between trees and forests in the real world, a random forest (RF) contains a set of decision trees [67]. Specifically, random forests use Bootstrap AGGregation (bagging) to sample the data, and combine the results of a set of decision trees to generate the output. Bagging uses bootstrap sampling to draw subsets of the original training samples. Each decision tree is then trained as a base learner on a different one of these sample subsets. To aggregate the outputs of the base learners, bagging uses majority voting for classification and the average of the outputs for regression. The pseudocode of bagging is given below:


Algorithm 5 Bagging algorithm.
Input:
    Training dataset D = {(x_1, y_1), (x_2, y_2), · · · , (x_m, y_m)};
    Base decision method L;
    Maximum training iteration T
Procedure:
    for t = 1, 2, · · · , T do
        h_t = L(D, D_bs)    ▷ D_bs ⊂ D is generated from bootstrap sampling.
    end for
Output: H(x) = argmax_{y∈Y} ∑_{t=1}^{T} I(h_t(x) = y)
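The bagging procedure can be illustrated with a short, self-contained Python sketch. It is not the implementation used in this thesis: the one-dimensional threshold learner `fit_stump`, the toy data, and the fixed seed are assumptions made only so that the example runs on its own.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) items from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]


def fit_stump(sample):
    """Toy base learner: threshold halfway between the two class means."""
    xs0 = [x for x, label in sample if label == 0]
    xs1 = [x for x, label in sample if label == 1]
    if not xs0 or not xs1:  # the bootstrap sample contains a single class
        only = 1 if xs1 else 0
        return lambda x: only
    m0, m1 = sum(xs0) / len(xs0), sum(xs1) / len(xs1)
    cut = (m0 + m1) / 2
    if m1 > m0:
        return lambda x: int(x > cut)
    return lambda x: int(x < cut)


def bagging(data, fit, T, seed=0):
    """Train T learners on bootstrap samples and combine them by voting."""
    rng = random.Random(seed)
    learners = [fit(bootstrap_sample(data, rng)) for _ in range(T)]

    def H(x):
        votes = Counter(h(x) for h in learners)
        return votes.most_common(1)[0][0]  # argmax over the vote counts

    return H


data = [(0.0, 0), (1.0, 0), (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]
H = bagging(data, fit_stump, T=11)
```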

(2) Extra Trees

Another method for creating an ensemble of decision trees is Extra Trees (EXTremely RAndomized trees, ET) [72]. Extra Trees are built with more randomness than a random forest: the thresholds used to split nodes while growing the decision trees are randomized. Specifically, a threshold is drawn at random for each candidate feature, and the best of these thresholds is picked as the splitting rule. The Extra Trees algorithm is described below:

Algorithm 6 Extra Trees algorithm.
Input:
    Training dataset D = {(x_1, y_1), (x_2, y_2), · · · , (x_m, y_m)};
    Feature set A = {a_1, a_2, · · · , a_d};
    Base decision tree method L;
    Maximum training iteration T
Procedure:
    for t = 1, 2, · · · , T do
        Select a feature a∗ ∈ A randomly
        Annotate the maximum a∗_max and minimum a∗_min of D on feature a∗
        Randomly pick a cut point a_c between a∗_min and a∗_max
        h_t = L(D, a_c)
    end for
Output: H(x) = argmax_{y∈Y} ∑_{t=1}^{T} I(h_t(x) = y)
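The randomized split described above can be sketched as follows. The sketch assumes binary labels in {0, 1}, uses the weighted Gini impurity as the split score, and considers every feature as a candidate; these choices and the name `random_split` are illustrative assumptions rather than details taken from this thesis.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary labels in {0, 1}."""
    n = len(labels)
    if n == 0:
        return 0.0
    p1 = sum(labels) / n
    return 2 * p1 * (1 - p1)


def random_split(X, y, rng):
    """Draw one random cut per feature and keep the cut with the best score.

    Returns (weighted Gini impurity, feature index, cut point).
    """
    best = None
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        cut = rng.uniform(min(column), max(column))  # random threshold
        left = [t for v, t in zip(column, y) if v < cut]
        right = [t for v, t in zip(column, y) if v >= cut]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if best is None or score < best[0]:
            best = (score, j, cut)
    return best


X = [[0.0, 5.0], [1.0, 5.0], [9.0, 5.0], [10.0, 5.0]]
y = [0, 0, 1, 1]
score, feature, cut = random_split(X, y, random.Random(0))
```

Unlike a standard decision tree, no search over all possible thresholds takes place: only the randomly drawn cuts compete with each other.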


(3) Logistic Regression

Logistic regression, despite its name, is a linear model for classification rather than regression [73]. The model and the cost function of binary logistic regression with an ℓ2 penalty are given in Equation (3.2).

\[
y = \frac{1}{1 + e^{-(w^{T} X + b)}}, \qquad
\min_{w,\,b} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{m} \log\left(\exp\left(-y_i (X_i^{T} w + b)\right) + 1\right) \tag{3.2}
\]

Here, the input data is given as D = {(X_1, y_1), (X_2, y_2), · · · , (X_m, y_m)}, and each input feature has an associated weight w_i, which indicates the importance of this feature relative to the other features. C is a constant that sets the weight of the loss term relative to the ℓ2 penalty, so a smaller C means stronger regularization.
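As a small numerical illustration of Equation (3.2), the sketch below evaluates the sigmoid and the penalized cost for a given weight vector. It assumes labels coded as −1 and +1, which is the usual convention for this form of the logistic loss; the function names and the toy data are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def l2_logistic_objective(w, b, X, y, C=1.0):
    """0.5 * w.w + C * sum_i log(1 + exp(-y_i * (x_i.w + b))).

    Labels y_i are assumed to be coded as -1 or +1.
    """
    penalty = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for xi, yi in zip(X, y):
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        loss += math.log1p(math.exp(-margin))
    return penalty + C * loss


X = [[1.0, 2.0], [-1.0, -2.0]]
y = [1, -1]
value_at_zero = l2_logistic_objective([0.0, 0.0], 0.0, X, y)
value_at_fit = l2_logistic_objective([0.5, 0.5], 0.0, X, y)
```

With all weights at zero, each sample contributes log 2 to the loss, so the cost is 2 log 2; weights that separate the two samples lower the loss term at the price of a non-zero penalty.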


Chapter 4: Applying Cascade Forest on the SEER Dataset for Survivability Prediction of Lung Cancer

This chapter applies the cascade forest model described in detail in Chapter 3 to clinical data analysis and compares it with the conventional methods described in Chapters 2 and 3. The following sections cover data acquisition, data preprocessing, model construction, model evaluation, and comparisons for the survivability classification task.

4.1 Data Acquisition and Preprocessing

4.1.1 SEER Dataset

In this thesis, the clinical data used is “Incidence - SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2017 Sub (1973-2015 varying)” from the Surveillance, Epidemiology, and End Results Program (SEER) [7]. This database covers approximately 27.8% of the U.S. population (based on the 2010 census) and contains records for 10,050,814 tumors in total [4]. All tumors recorded in SEER are categorized into 22 main classes. Lung cancer is classified in the main class “Respiratory System” and the branch “Lung and Bronchus”. In this database, the clinical record for one specific case, i.e. the record of one tumor, has 191 features and is encoded in a single row. A brief description of all features from SEER is given in Appendix B [74]. These features can be categorized as Table 4.1 shows.


Category number    Category of variables
1                  Record identification
2                  Information source
3                  Demographic information
4                  Description of neoplasm
5                  First course of therapy
6                  Follow up information
7                  Record variables

Table 4.1: SEER variables categories.

The clinical data for lung cancer (“Site recode ICD-O-3/WHO 2008” is “Lung and Bronchus”) with “year of diagnosis” between 2013 and 2015 is extracted from the SEER database mentioned above, because the recently published records are in a complete and clear format. Additionally, records with missing values are dropped. The resulting dataset is composed of 46,088 records with 191 feature columns. The target feature to predict is “Vital status recode (study cutoff used)”, which takes two values (“alive” or “dead”) describing the status of the patient. There are 26,631 cases marked as “alive” and 19,457 cases marked as “dead”. An example of a clinical record in this database is listed in Appendix C.

4.1.2 Data Re-encoding

Some values in the dataset extracted from the SEER database are formatted in natural language and stored as strings, so it is not possible to train or test machine learning models on the raw extracted dataset directly. For the input features X, the class sklearn.preprocessing.OrdinalEncoder from scikit-learn [75] is used to transform all of the string or integer values that represent different categories to integers automatically. Each distinct value is treated as a category, and an integer is generated to replace the original value. This transforms each feature of the original input data into a single column of integers (0 to number of categories − 1). For the target feature Y, all records labeled as “alive” are re-encoded as “1”, while the others are re-encoded as “0”.
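The effect of this re-encoding can be shown with a small pure-Python sketch. It imitates the default behavior of an ordinal encoder by sorting the distinct values of each column before numbering them; the helper names and the example values are assumptions made for illustration and are not taken from the SEER data.

```python
def ordinal_encode(rows):
    """Encode every column of rows (a list of equal-length lists) as integers.

    The distinct values of each column are sorted and mapped to
    0 .. n_categories - 1.
    """
    n_cols = len(rows[0])
    mappings = []
    for j in range(n_cols):
        categories = sorted({row[j] for row in rows})
        mappings.append({c: i for i, c in enumerate(categories)})
    encoded = [[mappings[j][row[j]] for j in range(n_cols)] for row in rows]
    return encoded, mappings


def encode_target(labels, positive="alive"):
    """Re-encode the target: 1 for the positive label, 0 for everything else."""
    return [1 if v == positive else 0 for v in labels]


rows = [["Male", "Grade II"], ["Female", "Grade I"], ["Male", "Grade I"]]
encoded, mappings = ordinal_encode(rows)
target = encode_target(["alive", "dead", "alive"])
```

Note that the integers produced this way carry an order that the original categories may not have, which tree-based learners tolerate better than linear models do.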

4.1.3 Dimensional Reduction

Dimensional reduction is applied to the extracted SEER dataset in order to reduce the complexity of the machine learning model. First, because the type of cancer selected is “Lung and Bronchus”, some of the features do not apply to lung cancer and should be eliminated from the feature set. For example, features like “Histology recode - Brain groupings”, “Breast - Adjusted AJCC 6th T (1988+)”, “Breast - Adjusted AJCC 6th N (1988+)”, and “Breast - Adjusted AJCC 6th M (1988+)” are designed for tumors at other sites (brain or breast) and are not related to lung cancer. Second, features that are highly correlated with the target feature “Vital status recode (study cutoff used)” are dropped. Third, some columns of the extracted data contain just one value, which makes them uninformative for training the model. In total, 113 features are dropped for the reasons described above. Additionally, “Patient ID” in the SEER database is only used to distinguish patients and has little significance for the prediction result, so it is dropped manually. As a result, our extracted dataset contains 77 features.

To compare our results with recent studies [11] [76] [77], we manually generated a subset with the 11 features listed in Table 4.2, which were selected for further analysis of lung cancer data from the SEER database. Since the content and feature names of the SEER database were updated after the prior studies, we attempted to reproduce their feature set using the updated SEER database.


Number    Name of feature
1         Sex
2         Age at diagnosis
3         Year of diagnosis
4         Histologic Type ICD-O-3
5         Grade
6         Survival months
7         CS tumor size (2004+)
8         RX Summ–Surg Prim Site (1998+)
9         County
10        CS lymph nodes (2004+)
11        Vital status recode (study cutoff used)

Table 4.2: Features subset from prior studies that we used for comparison.

(1) Lasso Regression

In this part, Lasso regression as described in Equation (2.17) is used to reduce the

dimension of input data.

On the extracted SEER data (46,088 × 77), Lasso regression is run with several different values of λ, so that multiple sets of Lasso coefficients are generated. As λ increases, the Lasso coefficients of all features shrink toward 0, and the less significant a feature is, the faster its coefficient shrinks to 0. For a specific λ, a subset D′ that represents the original D can be generated in order to reduce the dimension of the data. The R² score is used to evaluate how well D′ represents D, and is listed in Table 4.3.

λ        R²
0.001    Not converged
0.005    0.9825
0.01     0.9824
0.05     0.9814
0.1      0.9806
0.5      0.9563
1        0.8807

Table 4.3: The value of R² for different values of λ.


Because the Lasso algorithm does not converge at λ = 0.001, λ ∈ (0.005, 0.01) are selected instead. Figure 4.1 shows the relationship between the non-zero Lasso coefficients and λ. Both the x and y axes are in log scale because the range of values is too wide for a linear scale. As a consequence, negative Lasso coefficients are transformed to small positive ones. In addition, to keep the figure clear, features whose Lasso coefficient is zero are not shown. The features selected by Lasso regression are listed in Table 4.4. Compared with the feature set from prior works, similar features are selected for training. Instead of selecting features manually, Lasso regression offers an efficient way to generate the feature set automatically.

Figure 4.1: Non-zero Lasso coefficients (λ ≥ 0.001 in log scale) for different values of λ.


Number    Name of feature
1         State-county
2         Derived AJCC Stage Group, 7th ed (2010+)
3         RX Summ–Surg Prim Site (1998+)
4         CS tumor size (2004+)
5         CS extension (2004+)
6         CS mets at dx (2004+)
7         Regional nodes examined (1988+)
8         Regional nodes positive (1988+)
9         Survival months
10        Histology ICD-O-2
11        Race/ethnicity
12        Year of birth
13        Vital status recode (study cutoff used)

Table 4.4: Features after Lasso regression.

The feature set listed in Table 4.4 is extracted as input data for training and

testing the model. However, Figure 4.2 shows the correlation matrix of input features

after Lasso regression, which indicates some input features are highly correlated.


Figure 4.2: The correlation matrix for input features after Lasso regression.

We drop the features whose correlation is larger than 0.4 (“Derived AJCC Stage Group, 7th ed (2010+)”, “RX Summ–Surg Prim Site (1998+)”, and “CS mets at dx (2004+)”) from Table 4.4. The correlation matrix after these features are dropped is shown in Figure 4.3.

The dataset obtained after dropping these features is also used as input data, for comparison with the other dimensional reduction methods.
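A correlation-based filter of this kind can be sketched as below. The thesis does not state how the feature to drop is chosen within a correlated pair, so the greedy rule used here (keep a column only if its absolute Pearson correlation with every column already kept is at most the threshold) is an assumption, as are the function names and the toy columns.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def drop_correlated(columns, threshold=0.4):
    """Greedy filter: keep a column if |r| <= threshold against all kept ones."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept


columns = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.1],    # almost a multiple of "a"
    "c": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with "a"
}
kept = drop_correlated(columns)
```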

(2) Principal Components Analysis

After the process of PCA described in Chapter 2, 12 principal components are selected

in order to reduce the dimension of dataset.


Figure 4.3: The correlation matrix after dropping highly correlated features.

4.1.4 Training Set and Test Set

After each of the three parallel dimensional reduction methods, the resulting dataset is split randomly into a training set and a test set. For every dimensional reduction method and all comparative experiments, 70% of the data (32,261 lines) is used as the training set, and the rest (13,827 lines) constitutes the test set.
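The reported split sizes follow from rounding the 30% test share of the 46,088 records up to a whole number. A minimal sketch of such a random split, assuming a fixed seed and a hypothetical helper name `split_indices`, is:

```python
import math
import random


def split_indices(n_samples, test_fraction=0.3, seed=0):
    """Shuffle the row indices and return (train indices, test indices)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_fraction * n_samples)  # round the test size up
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(46088)
```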

4.2 Building Cascade Forest Model

4.2.1 Hyperparameter Setting and Tuning

The hyperparameters of cascade forest are: the random state of model, the maximum

number of layers, the early-stopping rounds, the number of classes to classify, and

the base learners used in the model. Here, we use the current time in seconds as

the random state, set 100 as maximum number of layers, and set 3 as the early-

stopping rounds. Random forest, extra trees, and Logistic regression described in


Chapter 3 are used as base learners. In the experiments made in this thesis, the hyperparameters of the cascade forest itself did not need much tuning, whereas the hyperparameters of the base learners needed to be well tuned. Instead of tuning hyperparameters manually, we applied randomized grid search [78], which automatically tries hyperparameter values drawn at random from a designed set, to the three base learners of the cascade forest to improve its performance.

The Python code below shows the tuning process for the random forest base learner as an example:

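    from scipy.stats import randint as sp_randint
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RandomizedSearchCV

    # Base learner to tune.
    RF = RandomForestClassifier()

    # Search space: tree depth, number of trees (1 to 10), split criterion.
    param_dist_RF = {
        "max_depth": [5, 10, 50, 100, None],
        "n_estimators": sp_randint(1, 11),
        "criterion": ["gini", "entropy"]
    }

    # Try 20 random parameter sets, each scored by 5-fold cross validation.
    n_iter = 20
    random_RF = RandomizedSearchCV(RF,
                                   param_distributions=param_dist_RF,
                                   n_iter=n_iter, cv=5)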

In this code, the hyperparameters listed in param_dist_RF describe the shape of the random forest (maximum tree depth and number of trees) and the splitting criterion used by its decision trees. RandomizedSearchCV automatically generates a number of parameter sets (here 20) and measures the performance of each using cross validation (here 5-fold).

Finally, the optimal hyperparameters for the cascade forest’s base learners are obtained, as the Python code below shows:

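    # Tuned base learners of the cascade forest.
    ca_config["estimators"].append({"n_folds": 5,
        "type": "RandomForestClassifier",
        "n_estimators": 10, "max_depth": None,
        "criterion": "gini"})
    # n_folds for the next two entries is assumed to match the first entry.
    ca_config["estimators"].append({"n_folds": 5,
        "type": "ExtraTreesClassifier",
        "n_estimators": 8, "max_depth": None,
        "criterion": "gini"})
    ca_config["estimators"].append({"n_folds": 5,
        "type": "LogisticRegression",
        "solver": "saga"})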

where ca_config is the hyperparameter set for cascade forest, which has three tuned base learners (“RandomForestClassifier”, “ExtraTreesClassifier”, and “LogisticRegression”).

4.2.2 Modified Cascade Forest for Feature Importance Analysis

In this study, we used Gini importance to evaluate the feature importance of the proposed cascade forest model, which is based on the feature importance of a decision tree as Equation (2.25) shows. Previous works [67] [75] [79] show that the mean feature importance of the decision trees in tree-ensemble methods such as random forest can be used as the feature importance of the whole ensemble model; this is also known as Mean Decrease Gini (MDG). Similarly, the MDG of the whole proposed cascade forest model is calculated from the MDG of the base learners in the optimal layer, i.e. the layer with the best average accuracy in the cascade forest, as illustrated in Figure 4.4.

Figure 4.4: Modified cascade forest.


The original cascade forest does not output feature importance, so we modified it to generate not only the prediction result but also the importance of features, based on the input training data and the outputs of the base learners. While each layer of the cascade forest is trained, the feature importance values generated by the decision-tree-based individual learners, such as random forest and extra trees, are averaged and stored. After training, the averaged values from the optimal layer are selected as the feature importance of the whole model, as Figure 4.4 illustrates, which provides a way to evaluate the significance of each input feature.
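The selection step can be sketched as follows. The sketch assumes that each layer is summarized by its accuracy and by one importance vector per tree-based learner, restricted to the original input features; the data layout, the function names, and the numbers are hypothetical.

```python
def mean_importance(importances_by_learner):
    """Average per-feature importances over the tree-based learners."""
    n = len(importances_by_learner)
    n_features = len(importances_by_learner[0])
    return [sum(imp[j] for imp in importances_by_learner) / n
            for j in range(n_features)]


def importance_of_best_layer(layers):
    """layers: list of (layer accuracy, [importance vector per learner])."""
    best_accuracy, best_importances = max(layers, key=lambda layer: layer[0])
    return mean_importance(best_importances)


layers = [
    (0.70, [[0.25, 0.75], [0.25, 0.75]]),  # e.g. random forest, extra trees
    (0.80, [[0.75, 0.25], [0.50, 0.50]]),  # the most accurate layer
]
importance = importance_of_best_layer(layers)
```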

4.2.3 Model Training

This cascade forest model is trained on the following hardware platform:

• GPU: Nvidia GeForce GTX 1080, which has 2,560 CUDA cores and 8 GB of memory with a memory speed of 10 Gbps.

• CPU: Intel i7-7820HK, which has 4 cores and 8 threads with a 2.9 ∼ 3.9 GHz clock frequency.

• RAM: 32 GB.

With the GeForce GTX 1080 GPU [55] from NVIDIA, and using Keras [57] with TensorFlow [56] as the backend, the running time of the code improved noticeably compared with the same method on the CPU.

The pseudo code listed below describes training the tuned cascade forest model

for the outcomes prediction by using SEER lung cancer datasets.


Algorithm 7 Cascade forest to predict the outcomes of lung cancer patients.
Input:
    Training dataset D = {(x_1, y_1), (x_2, y_2), · · · , (x_m, y_m)};
    Base learners L_RF (Random Forest), L_ET (Extra Trees), L_LR (Logistic Regression)
Procedure:
    i ← 0    ▷ The i-th layer.
    D′ = ∅    ▷ Annotate output from prior layer as D′
    L_optimal ← ∅
    Acc_optimal ← 0
    i_optimal ← 0
    while True do
        i ← i + 1
        Train L_RF, L_ET, L_LR with D ∪ D′
        D′ ← the outputs of L_RF, L_ET, L_LR
        L_i ← averaging{L_RF, L_ET, L_LR}
        Calculate the accuracy Acc_i for L_i by Equation (2.5)
        if Acc_i > Acc_optimal then
            L_optimal ← L_i
            Acc_optimal ← Acc_i
            i_optimal ← i
        end if
        if i ≥ 3 then
            if Acc_i ≤ Acc_{i−1} & Acc_i ≤ Acc_{i−2} then
                Break    ▷ Early stop if the accuracy for this layer L does not increase in 3 iterations.
            end if
        end if
    end while
    Calculate the average importance of features f_i in L_optimal using the method described in the previous section.
Output:
    The optimal layer L_optimal.
    The average importance of features f_i from L_optimal.
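The control flow of Algorithm 7 (track the best layer, stop when the newest layer is no better than either of the two layers before it) can be sketched in Python as below. Training itself is abstracted into a callback; the name `grow_cascade`, the callback interface, and the accuracy sequence are assumptions used only to exercise the stopping rule.

```python
def grow_cascade(layer_accuracy, max_layers=100):
    """Grow layers until the early-stopping rule fires.

    layer_accuracy(i) is assumed to train layer i (1-based) and return its
    accuracy. Returns (best layer index, best accuracy, layers trained).
    """
    history = []
    best_layer, best_accuracy = 0, 0.0
    for i in range(1, max_layers + 1):
        acc = layer_accuracy(i)
        history.append(acc)
        if acc > best_accuracy:
            best_layer, best_accuracy = i, acc
        # Early stop: no gain over either of the two previous layers.
        if i >= 3 and acc <= history[-2] and acc <= history[-3]:
            break
    return best_layer, best_accuracy, len(history)


accuracies = [0.70, 0.74, 0.76, 0.75, 0.78, 0.77, 0.77, 0.79]
result = grow_cascade(lambda i: accuracies[i - 1])
```

In this toy run the eighth layer is never trained: the loop stops after layer 7 and reports layer 5 as the optimal layer.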

4.2.4 Result and Analysis

The change in accuracy during training on the Lasso data is illustrated in Figure 4.5. The accuracy of each base learner (or estimator) and of the layer that averages all base learners is shown. “AVG” denotes the accuracy of the ensemble layer, calculated by averaging the accuracy of the base learners in this layer. As Figure 4.5 shows, the algorithm detected that the layer accuracy had not increased over the three layers generated after layer 6 was trained, so it terminated the training process with an early stop.

Figure 4.5: Training accuracy of estimators in cascade forest, trained by Lasso data.

After training the cascade forest models, the test sets from the different datasets

are analyzed by the respective cascade forest models. As Table 4.5 shows, some

indicators listed below are used to evaluate the performance of cascade forest.


Indicator    Dataset Lasso    Dataset Lasso Drop    Dataset PCA    Dataset Prior
TP           6412             6118                  7694           7522
TN           4396             4331                  4724           4585
FP           1370             1523                  966            1111
FN           1649             1855                  443            609
TP+TN        10808            10449                 12418          12107
FP+FN        3019             3378                  1409           1720
Accuracy     0.7817           0.7557                0.8981         0.8756
Precision    0.8240           0.8007                0.8885         0.8713
Recall       0.7954           0.7673                0.9456         0.9251
F1-score     0.8094           0.7837                0.9161         0.8974

Table 4.5: Result analysis for cascade forest on different datasets.
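The derived indicators in Table 4.5 follow directly from the four confusion-matrix counts. The short sketch below recomputes them for the Lasso and PCA columns; the function name `classification_metrics` is an assumption made for the example.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1-score from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


lasso = classification_metrics(tp=6412, tn=4396, fp=1370, fn=1649)
pca = classification_metrics(tp=7694, tn=4724, fp=966, fn=443)
```

Rounded to four decimals, both results match the corresponding columns of the table.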

As Table 4.5 shows, the data from PCA gives better performance than the datasets generated from Lasso, from Lasso with some features dropped, and from prior works. After highly correlated features are dropped, the performance of cascade forest is slightly lower than when the feature set from Lasso is used directly, because a dropped feature may be important to the result. An example is the feature “RX Summ–Surg Prim Site (1998+)”, which has high importance to the target output, as Figure 4.7 shows, but is dropped. With the error rate kept in a reasonable range, the principal components, which have the least correlation with one another, can represent the original input data effectively. Figure 4.6(a) and Figure 4.6(b) illustrate the differences.


Figure 4.6: Performance comparison of different dimensional reduction methods. (a) P-R curve; (b) ROC curve.

In Figure 4.6(a), the proposed Cascade Forest trained with the PCA data shows the best balance of precision and recall among the dimensional reduction methods. In Figure 4.6(b), the Cascade Forest model using PCA data achieves the highest AUC among these methods, indicating that the data processed by PCA may represent the original data better.

The cascade forest also produces the importance of each feature, as Figures 4.7 and 4.8 show.


Figure 4.7: Importance of features, generated by cascade forest using Lasso data.

Figure 4.8: Importance of features, generated by cascade forest using data from prior works.

As Figures 4.7 and 4.8 show, “survival months” and “RX Summ-Surg Prim Site” are more important to the outcome of a case than most of the other features. Lung cancer usually develops rapidly, and surgery is currently the common treatment [3], which indicates that these two features have a great impact on the survivability of lung cancer. For the PCA data, however, although a bar plot of the importance of the “features”, i.e. the principal components, can be generated, it is hard to attach a physical meaning to the principal components.

4.3 Evaluation and Comparison

To evaluate the performance of the cascade forest method proposed in this thesis for cancer survivability prediction, we compare it with a set of common machine learning methods: SVM, DNN, and random forest. Hyperparameter tuning is also applied to SVM, DNN, and random forest before their performance is compared with that of the proposed deep forest method. Randomized grid search is used to optimize the hyperparameters of DNN and random forest. The SVM hyperparameters, including the kernel function, the penalty parameter (C), and the tolerance for the stopping criterion (tol), are tuned manually by comparing the accuracy of each SVM, because a single SVM run takes a relatively long time. The parameter sets for SVM, DNN, and random forest are listed in Table 4.6.


Classifier    Hyperparameters
DNN           4 ReLU layers, each layer contains 32 neurons.
              1 softmax layer as output layer
              optimizer='Adadelta', loss='categorical_crossentropy', metrics=['accuracy']
SVM           kernel='rbf', probability=True, gamma='scale', C=1.0, tol=0.001
RF            n_estimators=10, criterion=gini

Table 4.6: The optimal hyperparameters set from this study.

All of the methods are evaluated, compared, and analyzed using a subset of the usual performance indicators described in Chapter 2, on the datasets mentioned above (Lasso, PCA, prior works). The test set contains 13,827 lines of labeled data (30% of the whole data). The proposed method and the methods it is compared with are each tested 20 times on data sampled independently at random, with replacement, from the test set, in order to make the results more statistically convincing.
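A repeated evaluation of this kind can be sketched as follows. The thesis does not state the size of each resample or how the spread is computed, so the sketch assumes resamples as large as the test set and reports the sample standard deviation; the function name and the toy labels are hypothetical.

```python
import random
import statistics


def bootstrap_accuracy(y_true, y_pred, n_rounds=20, seed=0):
    """Mean and standard deviation of accuracy over bootstrap resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_rounds):
        # Sample n test indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    return statistics.mean(scores), statistics.stdev(scores)


y_true = [1] * 80 + [0] * 20
y_pred = [1] * 100  # a classifier that always predicts "alive"
mean_acc, std_acc = bootstrap_accuracy(y_true, y_pred)
```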


Method    Indicator    Lasso data         Lasso Drop         PCA data           Prior Works
CF        Accuracy     0.7799 ± 0.0037    0.7522 ± 0.0072    0.8918 ± 0.0028    0.8695 ± 0.0035
CF        Precision    0.8182 ± 0.0042    0.8023 ± 0.0071    0.8851 ± 0.0052    0.8643 ± 0.0042
CF        Recall       0.7938 ± 0.0063    0.7563 ± 0.0092    0.9361 ± 0.0035    0.9217 ± 0.0037
CF        F1-score     0.8058 ± 0.0037    0.7786 ± 0.0068    0.9099 ± 0.0029    0.8921 ± 0.0031
SVM       Recall       0.7632 ± 0.0073    0.7246 ± 0.0090    0.7267 ± 0.0079    0.8531 ± 0.0055
SVM       F1-score     0.7972 ± 0.0048    0.7676 ± 0.0069    0.7835 ± 0.0058    0.8509 ± 0.0051
DNN       Recall       0.7976 ± 0.0062    0.7369 ± 0.0090    0.8989 ± 0.0051    0.9495 ± 0.0041
DNN       F1-score     0.8032 ± 0.0036    0.7716 ± 0.0071    0.9023 ± 0.0025    0.8805 ± 0.0043
RF        Recall       0.7938 ± 0.0061    0.7569 ± 0.0082    0.9126 ± 0.0045    0.9144 ± 0.0046
RF        F1-score     0.7961 ± 0.0046    0.7694 ± 0.0064    0.8915 ± 0.0039    0.8899 ± 0.0035

Table 4.7: Comparison of cascade forest and other methods, using data from Lasso, PCA, and prior works, executed 20 times.

As Table 4.7 shows, the cascade forest has the best mean accuracy on all of the datasets (0.7799 on Lasso, 0.7522 on Lasso after dropping features, 0.8918 on PCA, and 0.8695 on the prior works dataset) compared with SVM, DNN, and RF. Cascade forest also has a relatively high F1 score (0.8058 on Lasso, 0.7786 on Lasso after dropping features, 0.9099 on PCA, and 0.8921 on the prior works dataset), which means it has a better balance between precision and recall. As another common ensemble method, random forest performs slightly worse than the cascade forest and deep neural network models. However, taking into account the training time of the four methods (reported later in this section), random forest has the lowest training cost among them, which makes it suitable when hardware performance is limited. Among the datasets generated by the different dimensional reduction methods, Lasso shows the worst performance for all four methods: the average accuracy of the four methods on the Lasso dataset (77.44%) is 7.9 percentage points lower than their average accuracy on the PCA dataset (85.34%). A possible reason is that PCA generates data with lower correlations, because the principal components are orthogonal to each other. To conclude, cascade forest produces the most accurate predictions of the four methods in this experiment, while random forest balances execution time against reasonable performance.

Two commonly used tools for illustrating the diagnostic ability of a binary classifier are the Receiver Operating Characteristic (ROC) curve and the Area Under the Curve (AUC). An ROC curve plots the True Positive Rate (TPR; recall) against the False Positive Rate (FPR) (Section 2.1.2). For a given threshold, samples with predicted probabilities greater than the threshold are assigned to the positive class, and those with probabilities smaller than the threshold are assigned to the negative class, which yields one pair of TPR and FPR per threshold. By varying the discrimination threshold across multiple values from 0 to 1, an ROC curve is generated from the resulting pairs of TPR and FPR. A method with a greater AUC is considered better than one with a smaller AUC.
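The construction of an ROC curve and its AUC can be made concrete with a small pure-Python sketch. It assumes binary labels in {0, 1}, treats a score equal to the threshold as positive, and integrates with the trapezoidal rule; the function names and the toy scores are illustrative assumptions.

```python
def roc_curve_points(y_true, scores):
    """(FPR, TPR) pairs obtained by lowering the threshold over the scores."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing is positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 1)
        fp = sum(1 for t, s in zip(y_true, scores) if s >= threshold and t == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Trapezoidal area under a curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(roc_curve_points(y_true, scores))
```

For these toy scores, 8 of the 9 positive-negative pairs are ranked correctly, so the AUC is 8/9.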

In the ideal case, the positive and negative classes are perfectly distinguished by
the machine learning algorithm (no false positives or false negatives). The ROC curve
then has a TPR of 1.0 for all FPR values, and the AUC is 1.0. When the algorithm does
no better than random chance, the ROC curve follows the diagonal and the AUC is 0.5.
There is one additional scenario, a “worst” case in which the algorithm predicts the
incorrect class every single time, so that the AUC is 0.0. For a binary problem this
is equivalent to the ideal case, since inverting every prediction gives the correct
classification every single time. In this thesis, the machine learning algorithms
predicted whether each sample corresponded to an outcome of “alive” or “died”.
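The equivalence between an AUC of 0.0 and the ideal case can be checked numerically. The sketch below relies on the standard interpretation of the AUC as the fraction of (positive, negative) pairs in which the positive sample receives the higher score. The function name `pairwise_auc` and the scores are hypothetical and chosen only for illustration.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


y_true = [1, 1, 1, 0, 0, 0]
# Every positive sample is scored below every negative sample.
always_wrong = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]
# Flip each predicted probability.
inverted = [1.0 - s for s in always_wrong]

auc_wrong = pairwise_auc(y_true, always_wrong)
auc_inverted = pairwise_auc(y_true, inverted)
```

Here `auc_wrong` is 0.0 and `auc_inverted` is 1.0, which matches the argument that a classifier that is always wrong carries the same information as one that is always right.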

Figures 4.9, 4.10, and 4.11 compare the ROC curves of the four methods mentioned
above on the three different datasets. The proposed cascade forest performs slightly
better, since its AUC is larger than the AUC of the other methods on all three
datasets. SVM performs worst among the four methods.

Figure 4.9: ROC curves on Lasso data.


Figure 4.10: ROC curves on data from PCA.


Figure 4.11: ROC curves on data from prior works.

The elapsed time of the training process is also an important metric for evaluating
the methods. We extracted six subsets containing different numbers of records
(5,544/11,087/16,631/22,174/27,718/32,261) from the Lasso training data to evaluate
the training time. For each method and each subset, the training time was calculated
by averaging the elapsed time over 10 training runs. Table 4.8 and Figure 4.12 show
the training times of the proposed Cascade Forest and of the methods used for
comparison (SVM, DNN, and RF).
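A timing procedure of this kind can be sketched as follows. This is an illustrative harness under stated assumptions, not the script used for the experiments: `time_training` and `dummy_train` are hypothetical names, the dummy function stands in for fitting a real model, the subsets are placeholders for the first n training records, and reporting the sample standard deviation next to the mean is an assumption about how a spread could be summarized.

```python
import statistics
import time


def time_training(train_fn, data, repeats=10):
    """Run train_fn(data) `repeats` times; return (mean, std) of elapsed seconds."""
    elapsed = []
    for _ in range(repeats):
        start = time.perf_counter()
        train_fn(data)
        elapsed.append(time.perf_counter() - start)
    return statistics.mean(elapsed), statistics.stdev(elapsed)


def dummy_train(data):
    """Hypothetical stand-in for fitting a model; it only sums the records."""
    return sum(data)


subset_sizes = [5544, 11087, 16631, 22174, 27718, 32261]
results = {}
for n in subset_sizes:
    subset = list(range(n))  # placeholder for the first n training records
    results[n] = time_training(dummy_train, subset)
```

Replacing `dummy_train` with a function that fits one of the four models would give one (mean, std) pair per subset size, which is the shape of the entries in a table such as Table 4.8.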

Number of Lines   Cascade Forest    SVM                DNN               RF
5,544             11.188 ± 0.206    3.932 ± 0.032      6.546 ± 0.916     0.046 ± 0.005
11,087            13.649 ± 1.052    17.279 ± 0.324     10.869 ± 1.332    0.090 ± 0.009
16,631            19.660 ± 1.028    39.653 ± 0.589     15.112 ± 1.907    0.141 ± 0.006
22,174            16.905 ± 1.174    69.924 ± 1.126     19.887 ± 2.341    0.178 ± 0.003
27,718            18.317 ± 0.540    108.878 ± 2.121    24.574 ± 2.936    0.239 ± 0.013
32,261            30.189 ± 0.862    146.323 ± 0.858    29.417 ± 3.595    0.277 ± 0.012

Table 4.8: The execution time of Cascade Forest, SVM, DNN, and RF.

Figure 4.12: Elapsed time for Cascade Forest, Random Forest, SVM and DNN.

According to Table 4.8 and Figure 4.12, random forest (RF) took the shortest time to
train on every subset. The training time of SVM grew fastest with the number of
records and was the longest on every subset except the smallest, where Cascade Forest
was slowest. The proposed cascade forest and the neural network have similar training
times.


Chapter 5: Conclusion and future work

This work aimed to apply a new ensemble model, Cascade Forest, to clinical data to
predict lung cancer survivability. From this work, we conclude that an ensemble deep
forest model such as Cascade Forest is suitable for clinical data research whether or
not the input is large, because this ensemble decision tree model grows its layers
adaptively, balancing an acceptable execution time with higher accuracy. Moreover,
with the future development of multi-core hardware such as GPUs, parallel and
distributed algorithms will become increasingly efficient compared with conventional
methods. The work described in this thesis also brings a new ensemble decision tree
method for generating predictions, which can help physicians evaluate the status of
their patients and assist in making decisions about the proposed therapy for a
specific patient.

Possible directions for future work on applications of the deep forest method to
clinical research are abundant. Naturally, the next step is to apply the approach to
genetic data and clinical records together to evaluate the significance of all
features, in order to improve the accuracy of the prediction and indicate which
genetic features are related to a specific type of cancer. We can use multi-grained
scanning on the genetic data to extract more useful relationships in the spatial
domain. Additionally, there is still room to improve the time efficiency of the model:
the redundant cross-validation procedure for each node may be optimized to save time.
Moreover, the hyperparameters of a specific node can be generated randomly, for
example, to increase the diversity of the model. The hyperparameters can also be
inherited from the previous layer and mutated by a genetic algorithm (GA), as done by
Friedrichs and Igel [80], who used a GA to tune the parameters of SVMs. Different
random forest nodes in a given layer could then have different hyperparameters, which
would increase the diversity of the layer. This improvement in diversity may have a
positive effect on the accuracy and require a smaller number of iterations.
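One way such inheritance and mutation could look is sketched below. It is a simplified select-and-mutate scheme written for illustration, not the procedure of [80] and not part of the implementation in this thesis. The helper names `mutate` and `next_layer`, the mutation ranges, and the validation scores are assumptions. The keys `n_estimators` and `max_depth` follow the naming of scikit-learn's random forest parameters, but no scikit-learn call is made here.

```python
import random


def mutate(params, rng, rate=0.5):
    """Return a copy of params in which each value is perturbed with probability `rate`."""
    child = dict(params)
    if rng.random() < rate:
        # Scale the number of trees by a random factor, keeping at least 10.
        child["n_estimators"] = max(10, int(child["n_estimators"] * rng.uniform(0.5, 1.5)))
    if rng.random() < rate:
        # Move the maximum depth by -2..+2 levels, keeping at least 2.
        child["max_depth"] = max(2, child["max_depth"] + rng.randint(-2, 2))
    return child


def next_layer(parents, scores, n_nodes, rng):
    """Inherit from the better-scoring half of the parents, then mutate."""
    ranked = [p for _, p in sorted(zip(scores, parents), key=lambda pair: pair[0], reverse=True)]
    survivors = ranked[: max(1, len(ranked) // 2)]
    return [mutate(rng.choice(survivors), rng) for _ in range(n_nodes)]


rng = random.Random(0)
layer = [
    {"n_estimators": 100, "max_depth": 8},
    {"n_estimators": 200, "max_depth": 4},
    {"n_estimators": 50, "max_depth": 12},
    {"n_estimators": 150, "max_depth": 6},
]
scores = [0.81, 0.84, 0.79, 0.83]  # hypothetical validation accuracies
children = next_layer(layer, scores, n_nodes=4, rng=rng)
```

Each node of the next layer then starts from the hyperparameters of a well-performing node in the previous layer, while the random perturbation keeps the nodes within a layer from being identical.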

Another possible method is to use weighted averaging (described in Section 3.2) to
integrate the outputs of the individual learners in a layer using those weights, with
the addition of a loss function, and to use the result as part of the training data in
the cascade forest.
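As a small illustration of the weighted averaging idea, the sketch below combines the class-probability vectors of several learners for one sample. The function name `weighted_average`, the three probability vectors, and the weights are hypothetical. How the weights would be chosen or learned (for instance, through the loss function mentioned above) is left open.

```python
def weighted_average(prob_vectors, weights):
    """Combine per-learner class-probability vectors with non-negative weights."""
    total = sum(weights)
    n_classes = len(prob_vectors[0])
    return [
        sum(w * p[c] for w, p in zip(weights, prob_vectors)) / total
        for c in range(n_classes)
    ]


# Hypothetical outputs of three learners in one layer for a single sample,
# given as [P(alive), P(died)], with hypothetical weights.
outputs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7]]
weights = [0.5, 0.3, 0.2]
combined = weighted_average(outputs, weights)
```

With these numbers the combined vector is approximately [0.69, 0.31]. With equal weights the function reduces to plain averaging of the learners' outputs.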


Bibliography

[1] F. Bray, J. Ferlay, I. Soerjomataram, R. L. Siegel, L. A. Torre, and A. Jemal,

“Global cancer statistics 2018: GLOBOCAN estimates of incidence and mor-

tality worldwide for 36 cancers in 185 countries,” CA: A Cancer Journal for

Clinicians, vol. 68, no. 6, pp. 394–424, 2018, issn: 1542-4863. doi: 10.3322/

caac.21492. [Online]. Available: https://onlinelibrary.wiley.com/doi/

abs/10.3322/caac.21492 (visited on 05/16/2019).

[2] EAPC. (). European association for palliative care, EAPC home, [Online].

Available: https://www.eapcnet.eu/home (visited on 05/17/2019).

[3] SEER Training Modules, Lung cancer, Jan. 5, 2019. [Online]. Available: https:

//training.seer.cancer.gov/lung/.

[4] Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov).

(Nov. 2017). SEER*stat database: Incidence - SEER 18 regs research data +

hurricane katrina impacted louisiana cases, nov 2017 sub (1973-2015 varying) -

linked to county attributes - total u.s., 1969-2016 counties.

[5] (Jun. 13, 2018). The cancer genome atlas, National Cancer Institute, [Online].

Available: https://www.cancer.gov/tcga (visited on 05/29/2019).

[6] A. P. G. Consortium and others, “AACR project GENIE: Powering precision

medicine through an international consortium,” Cancer discovery, vol. 7, no. 8,

pp. 818–831, 2017.

[7] (). Surveillance, epidemiology, and end results program, SEER, [Online]. Avail-

able: https://seer.cancer.gov/index.html (visited on 04/03/2019).


[8] D. W. Kim, S. Lee, S. Kwon, W. Nam, I.-H. Cha, and H. J. Kim, “Deep learning-based
survival prediction of oral cancer patients,” Scientific Reports, vol. 9, no. 1,
p. 6994, May 6, 2019, issn: 2045-2322. doi: 10.1038/s41598-019-43372-7. [Online].
Available: https://www.nature.com/articles/s41598-019-43372-7 (visited on 05/16/2019).

[9] D. R. Cox, “Regression models and life-tables,” Journal of the Royal Statistical

Society: Series B (Methodological), vol. 34, no. 2, pp. 187–202, 1972.

[10] K. Kourou, T. P. Exarchos, K. P. Exarchos, M. V. Karamouzis, and D. I.

Fotiadis, “Machine learning applications in cancer prognosis and prediction,”

Computational and Structural Biotechnology Journal, vol. 13, pp. 8 –17, 2015,

issn: 2001-0370. doi: https://doi.org/10.1016/j.csbj.2014.11.005.

[Online]. Available: http://www.sciencedirect.com/science/article/pii/

S2001037014000464.

[11] A. Agrawal, S. Misra, R. Narayanan, L. Polepeddi, and A. Choudhary. (2012).

Lung cancer survival prediction using ensemble data mining on seer data, Scien-

tific Programming, [Online]. Available: https://www.hindawi.com/journals/

sp/2012/920245/abs/ (visited on 04/05/2019).

[12] C. M. Lynch, B. Abdollahi, J. D. Fuqua, A. R. de Carlo, J. A. Bartholomai,

R. N. Balgemann, V. H. van Berkel, and H. B. Frieboes, “Prediction of lung can-

cer patient survival via supervised machine learning classification techniques,”

International Journal of Medical Informatics, vol. 108, pp. 1–8, 2017, issn:

1872-8243. doi: 10.1016/j.ijmedinf.2017.09.013.

[13] C. M. Lynch, V. H. van Berkel, and H. B. Frieboes, “Application of unsupervised
analysis techniques to lung cancer patient data,” PLOS ONE, vol. 12, no. 9, e0184370,
Sep. 14, 2017, issn: 1932-6203. doi: 10.1371/journal.pone.0184370. [Online].
Available: https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0184370
(visited on 05/16/2019).


[14] Y. Wang, D. Wang, X. Ye, Y. Wang, Y. Yin, and Y. Jin, “A tree ensemble-

based two-stage model for advanced-stage colorectal cancer survival predic-

tion,” Information Sciences, vol. 474, pp. 106–124, Feb. 1, 2019, issn: 0020-

0255. doi: 10.1016/j.ins.2018.09.046. [Online]. Available: http://www.

sciencedirect.com/science/article/pii/S002002551830759X (visited on

05/22/2019).

[15] Z.-H. Zhou and J. Feng, “Deep forest: Towards an alternative to deep neural net-

works,” in Proceedings of the Twenty-Sixth International Joint Conference on

Artificial Intelligence, IJCAI-17, 2017, pp. 3553–3559. doi: 10.24963/ijcai.

2017/497. [Online]. Available: https://doi.org/10.24963/ijcai.2017/497.

[16] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, p. 436,

May 27, 2015. [Online]. Available: https://doi.org/10.1038/nature14539.

[17] Z.-H. Zhou, Machine Learning. Tsinghua University Press, 2016.

[18] S. Geman, E. Bienenstock, and R. Doursat, “Neural networks and the bias/variance
dilemma,” Neural Computation, vol. 4, no. 1, pp. 1–58, Jan. 1, 1992, issn: 0899-7667.
doi: 10.1162/neco.1992.4.1.1. [Online]. Available:
https://doi.org/10.1162/neco.1992.4.1.1 (visited on 05/25/2019).

[19] R. Caruana, S. Lawrence, and C. L. Giles, “Overfitting in neural nets: Back-

propagation, conjugate gradient, and early stopping,” in Advances in neural

information processing systems, 2001, pp. 402–408.

[20] R. Tibshirani, “Regression shrinkage and selection via the lasso,” Journal of the

Royal Statistical Society: Series B (Methodological), vol. 58, no. 1, pp. 267–288,

1996.

[21] J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1,

pp. 81–106, Mar. 1, 1986, issn: 1573-0565. doi: 10.1007/BF00116251. [Online].

Available: https://doi.org/10.1007/BF00116251 (visited on 04/21/2019).


[22] L. Breiman, Classification and Regression Trees. Routledge, Oct. 19, 2017, isbn:

978-1-351-46049-1. doi: 10.1201/9781315139470. [Online]. Available: https:

//www.taylorfrancis.com/books/9781351460491 (visited on 04/21/2019).

[23] D. Dua and C. Graff, UCI Machine Learning Repository. University of Cal-

ifornia, Irvine, School of Information and Computer Sciences, 2017. [Online].

Available: http://archive.ics.uci.edu/ml.

[24] V. Vapnik and A. Y. Lerner, “Recognition of patterns with help of generalized

portraits,” Avtomat. i Telemekh, vol. 24, no. 6, pp. 774–780, 1963.

[25] J. Listgarten, S. Damaraju, B. Poulin, L. Cook, J. Dufour, A. Driga, J. Mackey,

D. Wishart, R. Greiner, and B. Zanke, “Predictive models for breast cancer

susceptibility from multiple single nucleotide polymorphisms,” Clinical cancer

research, vol. 10, no. 8, pp. 2725–2737, 2004.

[26] J. P. Ehlers and J. W. Harbour, “NBS1 expression as a prognostic marker in uveal
melanoma,” Clinical Cancer Research, vol. 11, no. 5, pp. 1849–1853, Mar. 1, 2005,
issn: 1078-0432, 1557-3265. doi: 10.1158/1078-0432.CCR-04-2054. [Online]. Available:
http://clincancerres.aacrjournals.org/content/11/5/1849 (visited on 06/06/2019).

[27] B. E. Boser, I. M. Guyon, and V. N. Vapnik, “A training algorithm for optimal

margin classifiers,” in Proceedings of the Fifth Annual Workshop on Computa-

tional Learning Theory, ser. COLT ’92, event-place: Pittsburgh, Pennsylvania,

USA, New York, NY, USA: ACM, 1992, pp. 144–152, isbn: 978-0-89791-497-0.

doi: 10.1145/130385.130401. [Online]. Available: http://doi.acm.org/10.

1145/130385.130401 (visited on 05/15/2019).

[28] W. S. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in

nervous activity,” The bulletin of mathematical biophysics, vol. 5, no. 4, pp. 115–

133, 1943.


[29] F. Rosenblatt, “The perceptron: A probabilistic model for information storage

and organization in the brain.,” Psychological review, vol. 65, no. 6, p. 386, 1958.

[30] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard,

and L. D. Jackel, “Backpropagation applied to handwritten zip code recog-

nition,” Neural Computation, vol. 1, no. 4, pp. 541–551, Dec. 1, 1989, issn:

0899-7667. doi: 10.1162/neco.1989.1.4.541. [Online]. Available: https:

//doi.org/10.1162/neco.1989.1.4.541 (visited on 06/06/2019).

[31] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with

deep convolutional neural networks,” in Advances in Neural Information Pro-

cessing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Wein-

berger, Eds., Curran Associates, Inc., 2012, pp. 1097–1105. [Online]. Available:

http://papers.nips.cc/paper/4824-imagenet-classification-with-

deep-convolutional-neural-networks.pdf (visited on 06/06/2019).

[32] (Oct. 6, 2015). CUDA toolkit 10.0 download, NVIDIA Developer, [Online].
Available: https://developer.nvidia.com/cuda-downloads (visited on 02/15/2019).

[33] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A

large-scale hierarchical image database,” in CVPR09, 2009.

[34] T. Kohonen, “An introduction to neural computing,” Neural Networks, vol. 1,
no. 1, pp. 3–16, Jan. 1, 1988, issn: 0893-6080. doi: 10.1016/0893-6080(88)90020-2.
[Online]. Available: http://www.sciencedirect.com/science/article/pii/0893608088900202
(visited on 04/20/2019).

[35] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltz-

mann machines,” in Proceedings of the 27th international conference on machine

learning (ICML-10), 2010, pp. 807–814.

[36] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier nonlinearities improve

neural network acoustic models,” in Proc. icml, vol. 30, 2013, p. 3.


[37] P. Werbos, “Beyond regression: New tools for prediction and analysis in the

behavioral sciences,” Ph. D. dissertation, Harvard University, 1974.

[38] D. E. Rumelhart, G. E. Hinton, R. J. Williams, and others, “Learning repre-

sentations by back-propagating errors,” Cognitive modeling, vol. 5, no. 3, p. 1,

1988.

[39] Y. LeCun, Y. Bengio, and T. B. Laboratories, “Convolutional networks for

images, speech, and time-series,” The handbook of brain theory and neural net-

works, vol. 3361, no. 10, p. 15, 1995.

[40] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, and others, “Gradient-based learn-

ing applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11,

pp. 2278–2324, 1998.

[41] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Mitosis

detection in breast cancer histology images with deep neural networks,” in In-

ternational Conference on Medical Image Computing and Computer-assisted

Intervention, Springer, 2013, pp. 411–418.

[42] W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian, “Multi-scale convolutional

neural networks for lung nodule classification,” in International Conference on

Information Processing in Medical Imaging, Springer, 2015, pp. 588–599.

[43] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and

S. Thrun, “Dermatologist-level classification of skin cancer with deep neural

networks,” Nature, vol. 542, no. 7639, p. 115, 2017.

[44] J. J. Hopfield, “Neural networks and physical systems with emergent collec-

tive computational abilities,” Proceedings of the National Academy of Sciences,

vol. 79, no. 8, pp. 2554–2558, Apr. 1, 1982, issn: 0027-8424, 1091-6490. doi:

10.1073/pnas.79.8.2554. [Online]. Available: https://www.pnas.org/

content/79/8/2554 (visited on 04/21/2019).


[45] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Com-

putation, vol. 9, no. 8, pp. 1735–1780, Nov. 1, 1997, issn: 0899-7667. doi:

10.1162/neco.1997.9.8.1735. [Online]. Available: https://doi.org/

10.1162/neco.1997.9.8.1735 (visited on 04/21/2019).

[46] Z. C. Lipton, J. Berkowitz, and C. Elkan, “A critical review of recurrent neural

networks for sequence learning,” arXiv preprint arXiv:1506.00019, 2015.

[47] N. Razavian, J. Marcus, and D. Sontag, “Multi-task prediction of disease on-

sets from longitudinal laboratory tests,” in Machine Learning for Healthcare

Conference, 2016, pp. 73–100.

[48] M. Guan, S. Cho, R. Petro, W. Zhang, B. Pasche, and U. Topaloglu, “Natural
language processing and recurrent network models for identifying genomic
mutation-associated cancer treatment change from patient progress notes,” JAMIA Open,
vol. 2, no. 1, pp. 139–149, Jan. 3, 2019, issn: 2574-2531. doi:
10.1093/jamiaopen/ooy061. [Online]. Available: https://doi.org/10.1093/jamiaopen/ooy061
(visited on 06/21/2019).

[49] M. Guan, “INCORPORATING EMR AND GENOMIC DATA USING NLP

AND MACHINE LEARNING TO REFINE CANCER TREATMENT,” Dis-

sertation/Thesis, PhD thesis, Wake Forest University, 2018.

[50] S. Sun, C. Yeh, M. Hwang, M. Ostendorf, and L. Xie, “Domain adversarial train-

ing for accented speech recognition,” in 2018 IEEE International Conference on

Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 4854–4858.

doi: 10.1109/ICASSP.2018.8462663.

[51] I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rah-

mati, and D. Song, “Robust physical-world attacks on deep learning models,”

in Computer Vision and Pattern Recognition, 2018.

[52] B. K. Beaulieu-Jones, Z. S. Wu, C. Williams, R. Lee, S. P. Bhavnani, J. B. Byrd,
and C. S. Greene, “Privacy-preserving generative deep neural networks support clinical
data sharing,” bioRxiv, 2018. doi: 10.1101/159756. [Online]. Available:
https://www.biorxiv.org/content/early/2018/12/20/159756.

[53] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem,

J. L. Gunter, K. P. Andriole, and M. Michalski, “Medical image synthesis for

data augmentation and anonymization using generative adversarial networks,”

in Simulation and Synthesis in Medical Imaging, A. Gooya, O. Goksel, I. Oguz,

and N. Burgos, Eds., Springer International Publishing, 2018, pp. 1–11, isbn:

978-3-030-00536-8.

[54] M. Rezaei, K. Harmuth, W. Gierke, T. Kellermeier, M. Fischer, H. Yang, and C.

Meinel, “A conditional adversarial network for semantic segmentation of brain

tumor,” in Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain

Injuries, A. Crimi, S. Bakas, H. Kuijf, B. Menze, and M. Reyes, Eds., Springer

International Publishing, 2018, pp. 241–252, isbn: 978-3-319-75238-9.

[55] (). GeForce GTX 1080 graphics cards | NVIDIA GeForce, [Online]. Available:

https://www.nvidia.com/en-us/geforce/products/10series/geforce-

gtx-1080/ (visited on 04/16/2019).

[56] (). TensorFlow, TensorFlow, [Online]. Available: https://www.tensorflow.

org/ (visited on 02/15/2019).

[57] (). Home - keras documentation, [Online]. Available: https://keras.io/
(visited on 02/15/2019).

[58] (). Tensor cores in NVIDIA volta GPU architecture, NVIDIA, [Online]. Avail-

able: https://www.nvidia.com/en-us/data-center/tensorcore/ (visited

on 06/11/2019).

[59] S. Markidis, S. W. D. Chien, E. Laure, I. B. Peng, and J. S. Vetter, “NVIDIA

tensor core programmability, performance precision,” in 2018 IEEE Interna-

tional Parallel and Distributed Processing Symposium Workshops (IPDPSW),

May 2018, pp. 522–531. doi: 10.1109/IPDPSW.2018.00091.


[60] P. Haldar, I. D. Pavord, D. E. Shaw, M. A. Berry, M. Thomas, C. E. Brightling,

A. J. Wardlaw, and R. H. Green, “Cluster analysis and clinical asthma pheno-

types,” American journal of respiratory and critical care medicine, vol. 178,

no. 3, pp. 218–224, 2008.

[61] R. W. Tothill, A. V. Tinker, J. George, R. Brown, S. B. Fox, S. Lade, D. S.

Johnson, M. K. Trivett, D. Etemadmoghadam, B. Locandro, and others, “Novel

molecular subtypes of serous and endometrioid ovarian cancer linked to clinical

outcome,” Clinical cancer research, vol. 14, no. 16, pp. 5198–5208, 2008.

[62] S. J. Russell and P. Norvig, Artificial intelligence: a modern approach. Malaysia;

Pearson Education Limited, 2016.

[63] J. X. Chen, “The evolution of computing: AlphaGo,” Computing in Science &

Engineering, vol. 18, no. 4, p. 4, 2016.

[64] I. Goodfellow, Y. Bengio, and A. Courville, Deep learning. MIT press, 2016.

[65] Z. Zhou, Ensemble methods: foundations and algorithms. Chapman and Hall/CRC, 2012.

[66] Y. Freund and R. E. Schapire, “A decision-theoretic generalization of on-line

learning and an application to boosting,” Journal of Computer and System

Sciences, vol. 55, no. 1, pp. 119–139, Aug. 1, 1997, issn: 0022-0000. doi: 10.

1006/jcss.1997.1504. [Online]. Available: http://www.sciencedirect.com/

science/article/pii/S002200009791504X (visited on 05/19/2019).

[67] L. Breiman, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32,

Oct. 1, 2001, issn: 1573-0565. doi: 10.1023/A:1010933404324. [Online]. Avail-

able: https://doi.org/10.1023/A:1010933404324 (visited on 04/17/2019).

[68] Z.-H. Zhou and J. Feng, “Deep forest,” National Science Review, vol. 6, no. 1,

pp. 74–86, Jan. 1, 2019, issn: 2095-5138. doi: 10.1093/nsr/nwy108. [Online].

Available: https://academic.oup.com/nsr/article/6/1/74/5123737

(visited on 04/17/2019).


[69] L. V. Utkin, M. S. Kovalev, and A. A. Meldo, “A deep forest classifier with

weights of class probability distribution subsets,” Knowledge-Based Systems,

vol. 173, pp. 15–27, Jun. 1, 2019, issn: 0950-7051. doi: 10.1016/j.knosys.

2019.02.022. [Online]. Available: http://www.sciencedirect.com/science/

article/pii/S0950705119300838 (visited on 05/20/2019).

[70] Y. Guo, S. Liu, Z. Li, and X. Shang, “BCDForest: A boosting cascade deep forest

model towards the classification of cancer subtypes based on gene expression

data,” BMC bioinformatics, vol. 19, pp. 118–13, Suppl 5 2018, issn: 1471-2105.

doi: 10.1186/s12859-018-2095-4.

[71] R. Su, X. Liu, L. Wei, and Q. Zou, “Deep-resp-forest: A deep forest model to

predict anti-cancer drug response,” Methods, Feb. 14, 2019, issn: 1046-2023.

doi: 10.1016/j.ymeth.2019.02.009. [Online]. Available: http://www.

sciencedirect.com/science/article/pii/S1046202318303232 (visited on

06/12/2019).

[72] P. Geurts, D. Ernst, and L. Wehenkel, “Extremely randomized trees,” Machine

Learning, vol. 63, no. 1, pp. 3–42, Apr. 1, 2006, issn: 1573-0565. doi: 10.

1007/s10994-006-6226-1. [Online]. Available: https://doi.org/10.1007/

s10994-006-6226-1 (visited on 04/17/2019).

[73] D. G. Kleinbaum, K Dietz, M Gail, and M. Klein, Logistic regression. Springer,

2002.

[74] (). SEER*stat databases: November 2017 submission, [Online]. Available: https:

//seer.cancer.gov/data-software/documentation/seerstat/nov2017/

(visited on 04/17/2019).

[75] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel,

M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, and others, “Scikit-learn:

Machine learning in python,” Journal of machine learning research, vol. 12,

pp. 2825–2830, Oct 2011.


[76] S. J. Wang, S. G. Patel, J. P. Shah, D. P. Goldstein, J. C. Irish, A. L. Carvalho,

L. P. Kowalski, J. L. Lockhart, J. M. Holland, and N. D. Gross, “An oral cav-

ity carcinoma nomogram to predict benefit of adjuvant radiotherapy,” JAMA

Otolaryngology Head and Neck Surgery, vol. 139, no. 6, pp. 554–559, Jun. 1,

2013, issn: 2168-6181. doi: 10.1001/jamaoto.2013.3001. [Online]. Available:

https://jamanetwork.com/journals/jamaotolaryngology/fullarticle/

1686141 (visited on 05/16/2019).

[77] J. Kim and H. Shin, “Breast cancer survivability prediction using labeled, unla-

beled, and pseudo-labeled patient data,” Journal of the American Medical In-

formatics Association, vol. 20, no. 4, pp. 613–618, Jul. 1, 2013, issn: 1067-5027.

doi: 10.1136/amiajnl-2012-001570. [Online]. Available: https://academic.

oup.com/jamia/article/20/4/613/819025 (visited on 01/31/2019).

[78] J. Bergstra and Y. Bengio, “Random search for hyper-parameter optimization,”

Journal of Machine Learning Research, vol. 13, pp. 281–305, Feb 2012.

[79] G. Louppe, “Understanding random forests: From theory to practice,” PhD

thesis, University of Liege, Jul. 28, 2014. arXiv: 1407.7502. [Online]. Available:

http://arxiv.org/abs/1407.7502 (visited on 06/12/2019).

[80] F. Friedrichs and C. Igel, “Evolutionary tuning of multiple SVM parameters,”
Neurocomputing, Trends in Neurocomputing: 12th European Symposium on Artificial Neural
Networks 2004, vol. 64, pp. 107–117, Mar. 1, 2005, issn: 0925-2312. doi:
10.1016/j.neucom.2004.11.022. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0925231204005223
(visited on 04/17/2019).


Appendix A: Description of Variables in SEER Dataset

The descriptions of the variables in the SEER dataset are taken from the official SEER
website [4].


Dict

iona

ry o

f SEE

R*St

at V

aria

bles

N

ovem

ber 2

017

Sub

mis

sion

(rel

ease

d Ap

ril 2

018)

http

s://

seer

.can

cer.g

ov/d

ata-

soft

war

e/do

cum

enta

tion/

seer

stat

/nov

2017

/1

of 1

1

Nov

embe

r 201

7 D

ata

Subm

issi

onIte

m #

refe

rs to

the

NAA

CCR

item

num

ber -

see

http

s://

ww

w.n

aacc

r.org

/Sta

ndar

dsan

dReg

istr

yOpe

ratio

ns/V

olum

eII.a

spx

CS=

Colla

bora

tive

Stag

ing

SSF

= Si

te-s

peci

fic F

acto

r

Fiel

d nu

mbe

rN

ame

NAA

CCR

Item

#D

escr

iptio

nCa

tego

ry n

ame

Cate

gory

nu

mbe

r

1Ag

e re

code

with

<1

year

old

s

The

age

reco

de v

aria

ble

is ba

sed

on A

ge a

t Dia

gnos

is (s

ingl

e-ye

ar a

ges)

. The

gro

upin

gs u

sed

in th

e ag

e re

code

var

iabl

e ar

e de

term

ined

by

the

age

grou

ping

s in

the

popu

latio

n da

ta. T

his r

ecod

e ha

s 19

age

grou

ps in

the

age

reco

de v

aria

ble

(< 1

yea

r, 1-

4 ye

ars,

5-9

yea

rs, .

.., 8

5+ y

ears

).

Will

be

in R

ace

and

Age

(cas

e da

ta o

nly)

in ra

te o

r pre

vale

nce

sess

ions

if a

n al

tern

ate

age

is us

ed a

s the

pop

ulat

ion

age

varia

ble.

See

ASCI

I tex

t file

des

crip

tion:

http

s://

seer

.can

cer.g

ov/d

ata-

soft

war

e/do

cum

enta

tion/

seer

stat

/nov

2017

/Tex

tDat

a.Fi

leDe

scrip

tion.

pdf#

AGE_

RECO

DE__

1_YE

AR_O

LDS

Age

at D

iagn

osis

(or R

ace

and

Age

(cas

e da

ta o

nly)

)1

(or 1

3)

2Ra

ce re

code

(Whi

te, B

lack

, Oth

er)

Race

reco

de is

bas

ed o

n th

e ra

ce v

aria

bles

and

the

Amer

ican

Indi

an/N

ativ

e Am

eric

an IH

S lin

k va

riabl

e. T

his r

ecod

e sh

ould

be

used

to li

nk to

the

popu

latio

ns fo

r whi

te, b

lack

and

oth

er.

It is

inde

pend

ent o

f Hisp

anic

eth

nici

ty.

For

mor

e in

form

atio

n, se

e ht

tps:

//se

er.c

ance

r.gov

/see

rsta

t/va

riabl

es/s

eer/

race

_eth

nici

ty/

Race

, Sex

, Yea

r Dx,

Reg

istry

, Cou

nty

2

3Se

x22

0In

clud

es 1

= M

ale

and

2=Fe

mal

e fr

om S

ex [N

AACC

R Ite

m #

220]

plu

s a to

tal o

f mal

e an

d fe

mal

e. T

his i

s use

d to

link

to th

e co

rrec

t pop

ulat

ions

for m

ales

and

fe

mal

es w

hen

calc

ulat

ing

sex-

spec

ific

rate

s.Ra

ce, S

ex, Y

ear D

x, R

egist

ry, C

ount

y2

4Ye

ar o

f dia

gnos

is39

0Ye

ar o

f Dia

gnos

is: v

alue

s are

197

3-20

14 b

ut m

ay b

e a

subs

et d

epen

ding

on

the

file

that

is u

sed

and

the

regi

stry

that

is se

lect

ed.

Ther

e ar

e no

unk

now

n va

lues

on

the

file.

Race

, Sex

, Yea

r Dx,

Reg

istry

, Cou

nty

2

5SE

ER re

gist

ry40

This

field

show

the

SEER

regi

strie

s whi

ch c

ontr

ibut

e da

ta to

this

file.

Aft

er th

e na

me

of th

e re

gist

ry, t

here

is th

e be

ginn

ing

year

of d

iagn

osis

for t

hat

regi

stry

. Thi

s dat

a ite

m v

arie

s by

whi

ch d

ata

file

is se

lect

ed. S

ee A

SCII

text

file

des

crip

tion:

ht

tps:

//se

er.c

ance

r.gov

/dat

a-so

ftw

are/

docu

men

tatio

n/se

erst

at/n

ov20

17/T

extD

ata.

File

Desc

riptio

n.pd

f#RE

GIS

TRY_

IDRa

ce, S

ex, Y

ear D

x, R

egist

ry, C

ount

y2

6Lo

uisi

ana

2005

- 1s

t vs 2

nd h

alf o

f yea

rTh

is fie

ld is

use

d to

sepa

rate

Lou

isian

a ca

ses d

iagn

osed

in th

e fir

st h

alf o

f 200

5 fr

om th

ose

diag

nose

d in

the

seco

nd h

alf o

f 200

5 to

link

to d

iffer

ent

popu

latio

n es

timat

es u

sed

to a

ccou

nt fo

r disp

lace

d pe

rson

s due

to K

atrin

a/Ri

ta.

See

http

s://

seer

.can

cer.g

ov/d

ata/

hurr

ican

e.ht

ml

Race

, Sex

, Yea

r Dx,

Reg

istry

, Cou

nty

2

7Co

unty

90Co

unty

of r

esid

ence

at d

iagn

osis.

Thi

s mus

t be

used

in c

onju

nctio

n w

ith S

EER

regi

stry

or u

se S

tate

-cou

nty

varia

ble.

For

mor

e in

form

atio

n, se

e ht

tps:

//se

er.c

ance

r.gov

/man

uals/

2004

Revi

sion%

201/

SPM

_App

endi

xA.p

dfRa

ce, S

ex, Y

ear D

x, R

egist

ry, C

ount

y2

8St

ate-

coun

tySt

ate

and

coun

ty a

t dia

gnos

is. C

an b

e us

ed to

link

to th

e po

pula

tions

to p

rodu

ce ra

tes a

t the

stat

e/co

unty

leve

l.Ra

ce, S

ex, Y

ear D

x, R

egist

ry, C

ount

y2

9In

rese

arch

dat

a

Flag

indi

cate

s whe

ther

it is

supp

lem

enta

l dat

a or

not

. In

rese

arch

dat

abas

es w

hich

incl

ude

Loui

siana

, Jul

y-De

cem

ber 2

005

case

s are

con

sider

ed

supp

lem

enta

l dat

a. T

hese

cas

es/p

opul

atio

ns a

re se

t to

"No"

for t

his f

ield

and

are

typi

cally

exc

lude

d in

SEE

R an

alys

es.

Thi

s fie

ld is

ass

ocia

ted

with

the

Rese

arch

dat

a ch

eck

box

on th

e se

lect

ion

tab

in a

ll SE

ER*S

tat s

essio

ns o

ther

than

pre

vale

nce.

For

mor

e in

form

atio

n, se

e:

http

s://

seer

.can

cer.g

ov/d

ata/

hurr

ican

e.ht

ml

Race

, Sex

, Yea

r Dx,

Reg

istry

, Cou

nty

2

10CH

SDA

2012

This

data

item

iden

tifie

s whe

ther

or n

ot th

e co

unty

of d

iagn

osis

is se

rved

by

CHSD

A. T

he p

rimar

y us

e of

this

field

is to

be

able

to li

mit

anal

yses

of A

I/AN

ra

ce to

are

as se

rved

by

CHSD

A.

See

http

s://

seer

.can

cer.g

ov/s

eers

tat/

varia

bles

/see

r/ra

ce_e

thni

city

O

R ht

tps:

//se

er.c

ance

r.gov

/see

rsta

t/va

riabl

es/c

ount

yatt

ribs/

Race

, Sex

, Yea

r Dx,

Reg

istry

, Cou

nty

2

11CH

SDA

Regi

onTh

is da

ta it

em is

a g

roup

ing

of c

ount

ies t

hat i

s prim

arily

use

d w

hen

wor

king

with

CHS

DA 2

006.

See

ht

tps:

//se

er.c

ance

r.gov

/see

rsta

t/va

riabl

es/s

eer/

race

_eth

nici

ty O

R ht

tps:

//se

er.c

ance

r.gov

/see

rsta

t/va

riabl

es/c

ount

yatt

ribs/

Race

, Sex

, Yea

r Dx,

Reg

istry

, Cou

nty

2

12Si

te re

code

ICD-

O-3

/WHO

200

8

A re

code

bas

ed o

n Pr

imar

y Si

te a

nd H

istol

ogy

in o

rder

to m

ake

anal

yses

of s

ite/h

istol

ogy

grou

ps e

asie

r. F

or e

xam

ple,

the

lym

phom

as a

re e

xclu

ded

from

st

omac

h an

d Ka

posi

and

mes

othe

liom

a ar

e se

para

te c

ateg

orie

s bas

ed o

n hi

stol

ogy.

For

mor

e in

form

atio

n, se

e ht

tps:

//se

er.c

ance

r.gov

/site

reco

de/ic

do3_

dwho

hem

e/Si

te a

nd M

orph

olog

y3

13Be

havi

or re

code

for a

naly

sis

This

reco

de w

as c

reat

ed so

that

dat

a an

alys

es c

ould

elim

inat

e m

ajor

gro

ups o

f hist

olog

ies/

beha

vior

s tha

t wer

en't

colle

cted

con

siste

ntly

ove

r tim

e, fo

r ex

ampl

e be

nign

bra

in, m

yelo

dypl

astic

synd

rom

es, a

nd b

orde

rline

tum

ors o

f the

ova

ry.

Crea

ted

from

ICD-

O-3

beh

avio

r and

hist

olog

y. F

or m

ore

info

rmat

ion,

see

http

s://

seer

.can

cer.g

ov/b

ehav

reco

deSi

te a

nd M

orph

olog

y3

14 | AYA site recode/WHO 2008 |  | A site/histology recode that is mainly used to analyze data on adolescents and young adults. The recode was applied to all cases regardless of age so that age comparisons can be made with these groupings. For more information, see https://seer.cancer.gov/ayarecode/ | Site and Morphology | 3

15 | Lymphoma subtype recode/WHO 2008 |  | A site/histology recode that is mainly used to analyze data on lymphoma sub-types. Based on ICD-O-3. Note that cases diagnosed before 2001 were not coded under ICD-O-3 and were converted to ICD-O-3 and may not have the specificity of cases after 2000 that were coded directly under ICD-O-3. For more information, see https://seer.cancer.gov/lymphomarecode/ | Site and Morphology | 3

16 | ICCC site recode ICD-O-3/WHO 2008 |  | A site/histology recode that is mainly used to analyze data on children. The recode was applied to all cases regardless of age so that age comparisons can be made with these groupings. Based on ICD-O-3. Note that cases diagnosed before 2001 were not coded under ICD-O-3 and were converted to ICD-O-3 and may not have the specificity of cases after 2000 that were coded directly under ICD-O-3. For more information on this International Classification of Childhood Cancer (ICCC) site recode, see https://seer.cancer.gov/iccc | Site and Morphology | 3

Dictionary of SEER*Stat Variables, November 2017 Submission (released April 2018)
https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/

Field number | Name | NAACCR Item # | Description | Category name | Category number

17 | CS Schema v0204+ |  | CS information is collected under the specifications of a particular schema based on site and histology. This recode should be used in any analysis of AJCC 7th ed stage and T, N, M. For more information see ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#CS_SCHEMA_v0204_ | Site and Morphology | 3

18 | CS Schema - AJCC 6th ed |  | CS information is collected under the specifications of a particular schema based on site and histology. This recode should be used in any analysis of AJCC 6th ed stage and T, N, M. Based on CS version 1, it should not be used for SSFs collected or modified under CS v02. https://cancerstaging.org/cstage/schema.html. For more information see ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#CS_SCHEMA_AJCC_6TH_ED__PREVIOUS | Site and Morphology | 3

19 | Primary Site - labeled | 400 | This provides the primary site code in ICD-O-3 and a descriptive primary site label. Note that the label is the preferred ICD-O-3 bolded name and there may be other sites or sub-sites included in the code but not reflected in the preferred term. Refer to ICD-O-3 for further information. Cases with years of diagnosis before 1992 were converted to ICD-O-3 from earlier versions. | Site and Morphology | 3

20 | Primary Site | 400 | Codes are found in the Topography section of the International Classification of Diseases for Oncology (ICD-O) 3rd edition. ICD-O-2 codes used for 1992-2000 are similar. Primary site codes for 1973-1991 were converted to ICD-O-3 and may lack the specificity of the ICD-O-3 primary site codes. | Site and Morphology | 3

21 | Histologic Type ICD-O-3 | 522 | Based on histology codes in ICD-O-3. Cases diagnosed in 1973-2000 were coded in earlier versions and converted to ICD-O-3; cases diagnosed in 2001+ were coded directly. | Site and Morphology | 3

22 | Behavior code ICD-O-3 |  | Based on behavior codes in ICD-O-3. Behavior code has had the same definitions in previous versions but may be associated differently with histologies over time. For example, borderline tumors of the ovary were considered malignant in ICD-O-2 but benign in ICD-O-3. Cases diagnosed in 1973-2000 were coded in earlier versions and converted to ICD-O-3; cases diagnosed in 2001+ were coded directly. In situ bladder cases have been converted to malignant in this field. See Behavior recode for consistency over time. For more information see ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#BEHAVIOR_CODE_ICD_O_3 | Site and Morphology | 3

23 | Grade | 440 | Based on grade codes in ICD-O-3. Cases diagnosed in 1973-2000 were coded in earlier versions and may lack the specificity of the 2001+ cases that were coded directly, especially for lymphomas/leukemias. For more information see ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#GRADE | Site and Morphology | 3

24 | Laterality | 410 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#LATERALITY | Site and Morphology | 3

25 | Diagnostic Confirmation | 490 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DIAGNOSTIC_CONFIRMATION | Site and Morphology | 3

26 | ICD-O-3 Hist/behav |  | Labeled version of ICD-O-3 values for all behaviors. See SEER*Stat dictionary for labels. | Site and Morphology | 3

27 | ICD-O-3 Hist/behav, malignant |  | Labeled version of ICD-O-3 values for malignant tumors. All non-malignant tumors are grouped into one value. See SEER*Stat dictionary for labels. | Site and Morphology | 3

28 | Histology recode - broad groupings |  | Based on Histologic type ICD-O-3. For more information see ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#HISTOLOGY_RECODE_BRAIN_GROUPING | Site and Morphology | 3

29 | Histology recode - brain groupings |  | Based on Histologic type ICD-O-3. For more information see ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#HISTOLOGY_RECODE_BRAIN_GROUPING | Site and Morphology | 3

30 | ICCC site rec extended ICD-O-3/WHO 2008 |  | Based on ICD-O-3. For more information on this International Classification of Childhood Cancer (ICCC) site/histology recode, see https://seer.cancer.gov/iccc. While the recode is normally used for childhood cancers, it is on the file for all ages so that childhood cancers could be compared across age groups. For more information, see https://seer.cancer.gov/iccc/iccc-who2008.html | Site and Morphology | 3

31 | Site recode B ICD-O-3/WHO 2008 |  | A recode based on Primary Site and Histology in order to make analyses of site/histology groups easier for multiple primary analyses. For example, the lymphomas are excluded from stomach and Kaposi and mesothelioma are separate categories based on histology. For more information, see https://seer.cancer.gov/siterecode_b/icdo3_who2008/ | Site and Morphology | 3

32 | Derived AJCC Stage Group, 7th ed (2010+) | 3430 | https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_AJCC_7_STAGE_GRP | Stage - AJCC | 4

33 | Derived AJCC Stage Group, 6th ed (2004+) | 3000 | The stage category for AJCC 6th edition is derived from Collaborative Staging data elements for 2004+ cases. See the CS site-specific schema for details (https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage) and the ASCII text file description for allowable values for display codes and storage values. See https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_AJCC_6_STAGE_GRP | Stage - AJCC | 4

34 | Breast - Adjusted AJCC 6th Stage (1988+) |  | Created from merged EOD 3rd Edition and Collaborative Stage disease information. For more information see: https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/6th/. | Stage - AJCC | 4
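The histology, behavior, and grade entries above separate cases coded directly under ICD-O-3 (2001+) from cases converted from earlier versions (1973-2000), which may be less specific. The year cutoff can be sketched in Python as below; the function name and the returned labels are illustrative assumptions, not part of SEER*Stat.

```python
def icdo3_histology_provenance(year_of_diagnosis: int) -> str:
    """Classify how a case's ICD-O-3 histology code was obtained.

    Year cutoffs follow the dictionary text: 1973-2000 cases were coded in
    earlier ICD-O versions and converted; 2001+ cases were coded directly.
    The function name and return labels are illustrative assumptions.
    """
    if year_of_diagnosis < 1973:
        raise ValueError("the dictionary describes cases from 1973 onward")
    if year_of_diagnosis >= 2001:
        return "coded_directly"
    return "converted"
```

A flag like this could be used to check whether a model's behavior differs between converted and directly coded cases.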


35 | Derived AJCC - Flag (2004+) | 3030 | This flag currently reflects only when AJCC stage is derived based on CS. If AJCC stage is derived based on cases before 2004, it will currently be found in a separate field and not overlaid in the Derived AJCC fields. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_AJCC_FLAG | Stage - AJCC | 4

36 | AJCC stage 3rd edition (1988-2003) |  | Derived by algorithm from extent of disease (EOD). Not available for all years or for all sites. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#AJCC_STAGE_3rd_EDITION__1988_20 | Stage - AJCC | 4

37 | SEER modified AJCC stage 3rd (1988-2003) |  | Derived by algorithm from extent of disease (EOD). Not available for all years or for all sites. The modified version stages cases that would be unstaged under strict AJCC staging rules. For example, it assumes NX is N0. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#SEER_MODIFIED_AJCC_STAGE_3rd_ED | Stage - AJCC | 4

38 | Lymphoma - Ann Arbor Stage (1983+) |  | For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/ann-arbor/ | Stage - AJCC | 4

39 | Derived AJCC T, 7th ed (2010+) | 3400 | For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/ | Stage - TNM | 5

40 | Derived AJCC N, 7th ed (2010+) | 3410 | For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/ | Stage - TNM | 5

41 | Derived AJCC M, 7th ed (2010+) | 3420 | For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/ | Stage - TNM | 5

42 | Derived AJCC T, 6th ed (2004+) | 2940 | The T category for AJCC 6th edition is derived from Collaborative Staging data elements for 2004+ cases. See the CS site-specific schema for details (https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage) and the ASCII text file description for allowable values for display codes and storage values. See https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_AJCC_6_T | Stage - TNM | 5

43 | Derived AJCC N, 6th ed (2004+) | 2960 | The N category for AJCC 6th edition is derived from Collaborative Staging data elements for 2004+ cases. See the CS site-specific schema for details (https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage) and the ASCII text file description for allowable values for display codes and storage values. See https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_AJCC_6_N | Stage - TNM | 5

44 | Derived AJCC M, 6th ed (2004+) | 2980 | The M category for AJCC 6th edition is derived from Collaborative Staging data elements for 2004+ cases. See the CS site-specific schema for details (https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage) and the ASCII text file description for allowable values for display codes and storage values. See https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_AJCC_6_M | Stage - TNM | 5

45 | T value - based on AJCC 3rd (1988-2003) |  | Derived by algorithm from extent of disease (EOD). Not available for all years or for all sites. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#T_VALUE___BASED_ON_AJCC_3rd__19 | Stage - TNM | 5

46 | N value - based on AJCC 3rd (1988-2003) |  | Derived by algorithm from extent of disease (EOD). Not available for all years or for all sites. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#N_VALUE___BASED_ON_AJCC_3rd__19 | Stage - TNM | 5

47 | M value - based on AJCC 3rd (1988-2003) |  | Derived by algorithm from extent of disease (EOD). Not available for all years or for all sites. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#M_VALUE___BASED_ON_AJCC_3rd__19 | Stage - TNM | 5

48 | Breast - Adjusted AJCC 6th T (1988+) |  | Created from merged EOD 3rd Edition and Collaborative Stage disease information. For more information see: https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/6th/ | Stage - TNM | 5

49 | Breast - Adjusted AJCC 6th N (1988+) |  | Created from merged EOD 3rd Edition and Collaborative Stage disease information. For more information see: https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/6th/ | Stage - TNM | 5

50 | Breast - Adjusted AJCC 6th M (1988+) |  | Created from merged EOD 3rd Edition and Collaborative Stage disease information. For more information see: https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/6th/ | Stage - TNM | 5

51 | Derived SS1977 (2004+) | 3010 | Derived Summary Stage 1977 is derived from Collaborative Staging data elements for 2004+ cases. See the CS site-specific schema for details (https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage) and the ASCII text file description for allowable values for display codes and storage values. See https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_SS1977 | Stage - LRD (Summary and Historic) | 6

52 | Derived SS2000 (2004+) | 3020 | Derived Summary Stage 2000 is derived from Collaborative Staging data elements for 2004+ cases. See the CS site-specific schema for details (https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage) and the ASCII text file description for allowable values for display codes and storage values. See https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#DERIVED_SS2000 | Stage - LRD (Summary and Historic) | 6
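The AJCC stage group fields above cover different diagnosis years: the 3rd edition field spans 1988-2003, the derived 6th edition field starts in 2004, and the derived 7th edition field starts in 2010. A minimal Python sketch of that lookup follows; the function name is a hypothetical helper, and the dictionary also warns that some fields are not available for all years or all sites, which this sketch does not model.

```python
from typing import List


def ajcc_stage_group_fields(year_of_diagnosis: int) -> List[str]:
    """List the AJCC stage group fields whose stated year range covers a case.

    Year ranges are copied from the field names in the dictionary. Site-level
    availability is not modeled here (an assumption of this sketch).
    """
    fields = []
    if year_of_diagnosis >= 2010:
        fields.append("Derived AJCC Stage Group, 7th ed (2010+)")
    if year_of_diagnosis >= 2004:
        fields.append("Derived AJCC Stage Group, 6th ed (2004+)")
    if 1988 <= year_of_diagnosis <= 2003:
        fields.append("AJCC stage 3rd edition (1988-2003)")
    return fields
```

For example, a case diagnosed in 2012 falls in the range of both derived fields, while a case diagnosed before 1988 matches none of them.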


53 | Summary stage 2000 (1998+) |  | Summary Stage 2000 is derived from Collaborative Stage (CS) for 2004+ and Extent of Disease (EOD) from 1998-2003. It is a simplified version of stage: in situ, localized, regional, distant, & unknown. Used in the SEER CSR and more recent SEER publications. For more information including sites and years for which it isn't calculated, see https://seer.cancer.gov/seerstat/variables/seer/lrd-stage/ | Stage - LRD (Summary and Historic) | 6

54 | SEER historic stage A |  | SEER Historic Stage A is derived from Collaborative Stage (CS) for 2004+ and Extent of Disease (EOD) from 1973-2003. It is a simplified version of stage: in situ, localized, regional, distant, & unknown. For more information including sites and years for which it isn't calculated, see https://seer.cancer.gov/seerstat/variables/seer/lrd-stage/ | Stage - LRD (Summary and Historic) | 6

55 | SEER summary stage 2000 (2001-2003) |  | Summary Stage 2000 for 2001-2003 is based on SEER Extent of Disease (EOD) following a SEER algorithm. This variable is provided on the NAACCR call for data. For more information, see https://seer.cancer.gov/tools/ssm/ | Stage - LRD (Summary and Historic) | 6

56 | SEER summary stage 1977 (1995-2000) |  | SEER summary stage 1977 (1995-2000) is based on SEER Extent of Disease (EOD) following a SEER algorithm. This variable is provided on the NAACCR call for data. For more information, see https://seer.cancer.gov/manuals/historic/ssm_1977.pdf | Stage - LRD (Summary and Historic) | 6

57 | RX Summ--Surg Prim Site (1998+) | 1290 | NAACCR Name=RX Summ--Surg Prim Site, Item #=1290. The information in this field is site-specific. The amount/detail of information has varied over time and caution should be used when looking at trends over time. For site-specific codes, see Appendix C of https://seer.cancer.gov/tools/codingmanuals/. Data for 1998-2002 are converted from Surgery of primary site (1998-2002), documented here: https://seer.cancer.gov/manuals/historic/AppendC.pdf. See here for changes between the two coding systems: https://seer.cancer.gov/tools/SEER2003.code.changes.122302.pdf | Therapy | 7

58 | RX Summ--Scope Reg LN Sur (2003+) | 1292 | Changed to Unknown or not applicable for breast cancer cases. See https://seer.cancer.gov/seerstat/variables/seer/regional_ln/ for more information. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#RX_SUMM_SCOPE_REG_LN_SUR | Therapy | 7

59 | RX Summ--Surg Oth Reg/Dis (2003+) | 1294 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#RX_SUMM_SURG_OTH_REG_DIS | Therapy | 7

60 | Reason no cancer-directed surgery | 1340 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#REASON_FOR_NO_SURGERY | Therapy | 7

61 | Scope of reg lymph nd surg (1998-2002) | 1647 | Changed to Unknown or not applicable for breast cancer cases. See https://seer.cancer.gov/seerstat/variables/seer/regional_ln/ for more information. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#RX_SUMM_SCOPE_REG_98_02 | Therapy | 7

62 | RX Summ--Reg LN Examined (1998-2002) | 1296 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#RX_SUMM_REG_LN_EXAMINED | Therapy | 7

63 | Surgery of oth reg/dis sites (1998-2002) | 1648 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#RX_SUMM_SURG_OTH_98_02 | Therapy | 7

64 | Site specific surgery (1973-1997 varying detail by year and site) | 1640 | NAACCR Name=RX Summ--Surgery Type, Item #=1640. For more information, see https://seer.cancer.gov/seerstat/variables/seer/surgery. | Therapy | 7

65 | CS tumor size (2004+) | 2800 | Information on tumor size. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

66 | CS extension (2004+) | 2810 | Information on extension of the tumor. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. Note: this item was originally a 2 digit field and was expanded to 3 digits during conversion. Generally, a zero was added to the right of the existing 2 digit field except for 99 which became 999. | Extent of Disease - CS | 8

67 | CS lymph nodes (2004+) | 2830 | Information on involvement of lymph nodes. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. Note: this item was originally a 2 digit field and was expanded to 3 digits during conversion. Generally, a zero was added to the right of the existing 2 digit field except for 99 which became 999. | Extent of Disease - CS | 8

68 | CS mets at dx (2004+) | 2850 | Information on distant metastasis. Available for 2004+. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

69 | ER Status Recode Breast Cancer (1990+) |  | Created by combining information from Tumor marker 1 (1990-2003) (NAACCR Item #=1150), with information from CS site-specific factor 1 (2004+) (NAACCR Item #=2880). This field is blank for non-breast cases and cases diagnosed before 1990. | Extent of Disease - CS | 8

70 | PR Status Recode Breast Cancer (1990+) |  | Created by combining information from Tumor marker 2 (1990-2003) (NAACCR Item #=1150), with information from CS site-specific factor 2 (2004+) (NAACCR Item #=2880). This field is blank for non-breast cases and cases diagnosed before 1990. | Extent of Disease - CS | 8
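The CS extension and CS lymph nodes entries above note that the original 2 digit codes were expanded to 3 digits, generally by appending a zero, with 99 becoming 999. A minimal Python sketch of that general rule is shown below; the function name is hypothetical, and because the dictionary says "generally", the rule may not hold for every code.

```python
def expand_cs_code(two_digit_code: str) -> str:
    """Expand a 2 digit CS code to 3 digits using the general rule.

    Rule from the dictionary text: append a zero on the right, except that
    99 becomes 999. Exceptions to the general rule are not modeled here.
    """
    if len(two_digit_code) != 2 or not two_digit_code.isdigit():
        raise ValueError("expected a string of exactly two digits")
    if two_digit_code == "99":
        return "999"
    return two_digit_code + "0"
```

Codes are kept as strings so that leading zeros survive the conversion.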


71 | Derived HER2 Recode (2010+) |  | Created with combined information from several CS site-specific factors. For more information, see https://seer.cancer.gov/seerstat/databases/ssf/her2-derived.html. | Extent of Disease - CS | 8

72 | Breast Subtype (2010+) |  | Created with combined information from ER Status Recode Breast Cancer (1990+), PR Status Recode Breast Cancer (1990+), and Derived HER2 Recode (2010+). | Extent of Disease - CS | 8

73 | CS site-specific factor 1 (2004+ varying by schema) | 2880 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

74 | CS site-specific factor 2 (2004+ varying by schema) | 2890 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

75 | CS site-specific factor 3 (2004+ varying by schema) | 2900 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

76 | CS site-specific factor 4 (2004+ varying by schema) | 2910 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

77 | CS site-specific factor 5 (2004+ varying by schema) | 2920 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

78 | CS site-specific factor 6 (2004+ varying by schema) | 2930 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

79 | CS site-specific factor 7 (2004+ varying by schema) | 2861 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

80 | CS site-specific factor 8 (2004+ varying by schema) | 2862 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

81 | CS site-specific factor 9 (2004+ varying by schema) | 2863 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

82 | CS site-specific factor 10 (2004+ varying by schema) | 2864 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

83 | CS site-specific factor 11 (2004+ varying by schema) | 2865 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

84 | CS site-specific factor 12 (2004+ varying by schema) | 2866 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8


85 | CS site-specific factor 13 (2004+ varying by schema) | 2867 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

86 | CS site-specific factor 15 (2004+ varying by schema) | 2869 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

87 | CS site-specific factor 16 (2004+ varying by schema) | 2870 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

88 | CS site-specific factor 25 (2004+ varying by schema) | 2879 | Each CS site-specific factor (SSF) is schema dependent. They can provide information needed to stage the case, clinically relevant information, or prognostic information. Available for varying years and schemas depending on standard setter requirements. Earlier cases may be converted and new codes added which weren't available for use prior to the current version of CS. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

89 | CS Tumor Size/Ext Eval (2004+) | 2820 | Available for 2004+, but not required for the entire timeframe. Will be blank in cases not collected. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

90 | CS Reg Node Eval (2004+) | 2840 | Available for 2004+, but not required for the entire timeframe. Will be blank in cases not collected. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

91 | CS Mets Eval (2004+) | 2860 | Available for 2004+, but not required for the entire timeframe. Will be blank in cases not collected. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

92 | Regional nodes examined (1988+) | 830 | Currently coded under CS - https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. Cases coded 1988-2003 used slightly different definitions - see SEER coding manual for those years. https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#REGIONAL_NODES_EXAMINED | Extent of Disease - CS | 8

93 | Regional nodes positive (1988+) | 820 | Currently coded under CS - see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. Cases coded 1988-2003 used slightly different definitions - see SEER coding manual for 1988. https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#REGIONAL_NODES_POSITIVE | Extent of Disease - CS | 8

94 | Lymph-vascular Invasion (2004+ varying by schema) | 1182 | Required for cases originally coded under CSv2 or diagnosed 2010+ for the schemas for penis and testis only. On the research file LVI is shown for testis and penis because it is needed for AJCC 6th & 7th ed staging. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#LYMPH_VASCULAR_INVASION | Extent of Disease - CS | 8

95 | CS mets at DX-bone (2010+) | 2851 | Available for 2010+. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#CS_METS_AT_DX_BONE | Extent of Disease - CS | 8

96 | CS mets at DX-brain (2010+) | 2852 | Available for 2010+. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#CS_METS_AT_DX_BRAIN | Extent of Disease - CS | 8

97 | CS mets at DX-liver (2010+) | 2853 | Available for 2010+. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#CS_METS_AT_DX_LIVER | Extent of Disease - CS | 8

98 | CS mets at DX-lung (2010+) | 2854 | Available for 2010+. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#CS_METS_AT_DX_LUNG | Extent of Disease - CS | 8

99 | CS version input current (2004+) | 2937 | Data item shows what version was in effect after input fields have been updated or recoded for this case. This data item along with CS version input original gives info on what document to use for the CS codes. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

100 | CS version input original (2004+) | 2935 | Data item shows what version was in effect the first time that CS was coded for this case. This data item along with CS version input current gives info on what document to use for the CS codes. For more information, see https://seer.cancer.gov/seerstat/variables/seer/ajcc-stage/. | Extent of Disease - CS | 8

101 | CS version derived (2004+) | 2936 | Data item shows what CS version was used to derive the CS Derived item fields including T, N, M, AJCC stage and SEER Summary stages 1977 and 2000. | Extent of Disease - CS | 8

102 | EOD 10 - Prostate path ext (1995-2003) | 800 | Information on extension of the tumor from the primary based on information from the prostatectomy for prostate cancer only. For more information, see https://seer.cancer.gov/manuals/EOD10Dig.pub.pdf. Note: for 2004+ similar type of information was collected in CS SSF 3 in the collaborative stage variables. | Extent of Disease - Historic | 9


103 | EOD 10 - extent (1988-2003) | 790 | Information on extension of the tumor from the primary and distant metastases. For more information, see https://seer.cancer.gov/manuals/EOD10Dig.pub.pdf. Note: for 2004+ similar type of information was collected in CS extension (2004+) and CS Mets at DX (2004+) in the collaborative stage variables. | Extent of Disease - Historic | 9

104 | EOD 10 - nodes (1988-2003) | 810 | Information on lymph node involvement. Information is site-specific. For more information, see https://seer.cancer.gov/manuals/EOD10Dig.pub.pdf. Note: for 2004+ similar type of information was collected in CS Lymph Nodes (2004+) in the collaborative stage variables. | Extent of Disease - Historic | 9

105 | EOD 10 - size (1988-2003) | 780 | Information on size of tumor. For more information, see https://seer.cancer.gov/manuals/EOD10Dig.pub.pdf. Note: for 2004+ similar type of information was collected in CS Tumor Size (2004+) in the collaborative stage variables. | Extent of Disease - Historic | 9

106 | Tumor marker 1 (1990-2003) | 1150 | This data item records prognostic indicators for breast cases (ERA 1990-2003), prostate cases (PAP 1998-2003) and testis cases (AFP 1998-2003). Please see CS SSFs for similar information for 2004+. For breast cancer cases ERA over time is available in ER Status Recode Breast Cancer (1990+). https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#TUMOR_MARKER_1 | Extent of Disease - Historic | 9

107 | Tumor marker 2 (1990-2003) | 1160 | This data item records prognostic indicators for breast cases (PRA 1990-2003) and testis cases (hCG 1998-2003). Please see CS SSFs for similar information for 2004+. For breast cancer cases PRA over time is available in PR Status Recode Breast Cancer (1990+). https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#TUMOR_MARKER_2 | Extent of Disease - Historic | 9

108 | Tumor marker 3 (1998-2003) | 1170 | This data item records prognostic indicators for testis cases (LDH 1998-2003). Please see CS SSFs for similar information for 2004+. https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#TUMOR_MARKER_3 | Extent of Disease - Historic | 9

109 | Coding system-EOD (1973-2003) | 870 | Flag to indicate which type of EOD was coded: 0 Non-specific (N); 1 two-digit; 2 Expanded (EEOD) | Extent of Disease - Historic | 9

110 | 2-Digit NS EOD part 1 (1973-1982) | 850 | This field is used in conjunction with 2-digit NS EOD part 2 to describe a very rudimentary stage. Used in 1973-1982 for some years and some sites/histologies. Use Coding system-EOD with a 0 to use these definitions. For more information see the introduction of https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

111 | 2-Digit NS EOD part 2 (1973-1982) | 850 | See the description for 2-Digit NS EOD part 1 (1973-1982). | Extent of Disease - Historic | 9

112 | 2-Digit SS EOD part 1 (1973-1982) | 850 | This field is used in conjunction with 2-digit SS EOD part 2 to describe a two-digit site-specific EOD. Used in 1973-1982 for some years and some sites/histologies. Use Coding system-EOD with a 1 to use these definitions. For more information see the appropriate site-specific page of https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

113 | 2-Digit SS EOD part 2 (1973-1982) | 850 | See the description for 2-Digit SS EOD part 1 (1973-1982). | Extent of Disease - Historic | 9

114 | Expanded EOD(1) - CP53 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--1st Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

115 | Expanded EOD(2) - CP54 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--2nd Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

116 | Expanded EOD(1,2) - CP53,54 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--1st and 2nd Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

117 | Expanded EOD(3) - CP55 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--3rd Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

118 | Expanded EOD(4) - CP56 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--4th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

119 | Expanded EOD(5) - CP57 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--5th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

120 | Expanded EOD(6) - CP58 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--6th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

121 | Expanded EOD(7) - CP59 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--7th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

122 | Expanded EOD(8) - CP60 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--8th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

123 | Expanded EOD(9) - CP61 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--9th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

124 | Expanded EOD(10) - CP62 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--10th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

125 | Expanded EOD(11) - CP63 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--11th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9


126 | Expanded EOD(12) - CP64 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--12th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

127 | Expanded EOD(13) - CP65 (1973-1982) | 840 | NAACCR Name=EOD--Old 13 Digit--13th Digit, Item #=840. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1977.pdf. | Extent of Disease - Historic | 9

128 | EOD 4 - extent (1983-1987) | 860 | NAACCR Name=EOD--Old 4 Digit--3rd Digit, Item #=860. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1984.pdf. | Extent of Disease - Historic | 9

129 | EOD 4 - nodes (1983-1987) | 860 | NAACCR Name=EOD--Old 4 Digit--4th Digit, Item #=860. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1984.pdf. | Extent of Disease - Historic | 9

130 | EOD 4 - size (1983-1987) | 860 | NAACCR Name=EOD--Old 4 Digit--1st and 2nd Digit, Item #=860. For more information, see https://seer.cancer.gov/manuals/historic/EOD_1984.pdf. | Extent of Disease - Historic | 9

131 | COD to site recode | | The underlying cause of death from the death certificate was grouped into a recode similar to the incidence site recode. For more information, see https://seer.cancer.gov/codrecode/1969_d04162012. Study cutoff date has been applied, i.e. coded as alive if death occurred after study cutoff. | Cause of Death (COD) and Follow-up | 10

132 | SEER cause-specific death classification | | Created for use in cause-specific survival. This variable designates that the person died of their cancer for cause-specific survival. For more information, see https://seer.cancer.gov/causespecific. | Cause of Death (COD) and Follow-up | 10

133 | SEER other cause of death classification | | Created for use in left-truncated life table session. This variable designates that the person died of causes other than their cancer. For more information, see https://seer.cancer.gov/causespecific. | Cause of Death (COD) and Follow-up | 10

134 | Survival months | | Created using complete dates, including days, therefore may differ from survival time calculated from year and month only. For more information, see https://seer.cancer.gov/survivaltime/. | Cause of Death (COD) and Follow-up | 10

135 | Survival months flag | | Created using complete dates, including days, therefore may differ from survival time calculated from year and month only. For more information, see https://seer.cancer.gov/survivaltime/. | Cause of Death (COD) and Follow-up | 10

136 | COD to site rec KM | | This is a recode based on underlying cause of death to designate cause of death into groups similar to the incidence site recode with KS and mesothelioma. For more information, see https://seer.cancer.gov/codrecode/1969_d04162012/. Study cutoff date has been applied, i.e. coded as alive if death occurred after study cutoff. | Cause of Death (COD) and Follow-up | 10

137 | Vital status recode (study cutoff used) | | Any patient that dies after the follow-up cut-off date is recoded to alive as of the cut-off date. https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#VITAL_STATUS_RECODE | Cause of Death (COD) and Follow-up | 10

138 | Type of follow-up expected | 2180 | NAACCR Name=SEER Type of Follow-Up, Item #=2180 | Cause of Death (COD) and Follow-up | 10

139 | Sequence number | 380 | NAACCR Name=Sequence Number--Central, Item #=380 | Multiple Primary Fields | 11

140 | First malignant primary indicator | | Based on all the tumors in SEER. Tumors not reported to SEER are assumed malignant. | Multiple Primary Fields | 11

141 | Primary by international rules | | Created using IARC multiple primary rules. Did not include benign tumors or non-bladder in situ tumors in algorithm. No tumor information was modified on any records. https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#PRIMARY_BY_INTERNATIONAL_RULES | Multiple Primary Fields | 11

142 | Record number | 2190 | NAACCR Name=SEER Record Number, Item #=2190. Sequentially numbers a person's tumors within each SEER submission. Order is based on sequence #. All federally reportable tumors (sequence # < 60) are prior to all state/registry reportable tumors (sequence # 60+) regardless of diagnosis date. | Multiple Primary Fields | 11

143 | Record number recode | | Sequentially numbers a person's tumors within each SEER submission. Order is based on date of diagnosis and then sequence #. | Multiple Primary Fields | 11

144 | Total number of in situ/malignant tumors for patient | | Based on maximum sequence number of any malignant/in situ tumors in SEER through the last released year of diagnosis. This value is the same across all tumors for a person. | Multiple Primary Fields | 11

145 | Total number of benign/borderline tumors for patient | | Based on maximum sequence number of any benign/borderline tumors in SEER through the last released year of diagnosis. This value is the same across all tumors for a person. | Multiple Primary Fields | 11

146 | Behavior code ICD-O-2 | | Converted from ICD-O-3 for 2001+ and coded directly for 1973-2000. | Site and Morphology - Historic (ICD-O-1 and I | 12

147 | Histology ICD-O-2 | 420 | NAACCR Item #=420 | Site and Morphology - Historic (ICD-O-1 and I | 12

148 | Recode ICD-O-2 to 9 | | Primary site/type recoded into ICD-9. An underscore in the right-most position of an unlabeled value represents a blank. i.e., C00_-C009 is all values starting with C00. All tumors not originally coded in ICD-O-2 were first converted from ICD-O-1 or ICD-O-3 and then converted to ICD-9. | Site and Morphology - Historic (ICD-O-1 and I | 12

149

Reco

de IC

D-O

-2 to

10

Pr

imar

y sit

e/ty

pe re

code

d in

to IC

D-10

. An

und

ersc

ore

in th

e rig

ht-m

ost p

ositi

on o

f an

unla

bele

d va

lue

repr

esen

ts a

bla

nk.

i.e.,

C00_

-C00

9 is

all v

alue

s st

artin

g w

ith C

00.

All t

umor

s not

orig

inal

ly c

oded

in IC

D-O

-2 w

ere

first

con

vert

ed fr

om IC

D-O

-1 o

r ICD

-O-3

and

then

con

vert

ed to

ICD-

10.

Site

and

Mor

phol

ogy

- Hist

oric

(ICD

-O-1

and

I12

150

Age

reco

de w

ith si

ngle

age

s and

85+

Pa

tient

's ag

e at

dia

gnos

is w

ith a

ll ag

es o

ver 8

5 gr

oupe

d to

geth

er.

In ra

te a

nd p

reva

lenc

e se

ssio

ns, t

his f

ield

can

be

sele

cted

as t

he p

opul

atio

n ag

e va

riabl

e (o

n th

e da

ta ta

b), i

n w

hich

cas

e it

will

be

in th

e Ag

e at

Dia

gnos

is ca

tego

ry (1

)Ra

ce a

nd A

ge (c

ase

data

onl

y)(o

r Age

at D

iagn

osis)

13(o

r 1)

95

Dict

iona

ry o

f SEE

R*St

at V

aria

bles

N

ovem

ber 2

017

Sub

mis

sion

(rel

ease

d Ap

ril 2

018)

http

s://

seer

.can

cer.g

ov/d

ata-

soft

war

e/do

cum

enta

tion/

seer

stat

/nov

2017

/9

of 1

1

Fiel

d nu

mbe

rN

ame

NAA

CCR

Item

#D

escr

iptio

nCa

tego

ry n

ame

Cate

gory

nu

mbe

r

151 | Race recode (W, B, AI, API) |  | Caution should be exercised when using this variable. For more information, see https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/. | Race and Age (case data only) | 13
152 | Origin recode NHIA (Hispanic, Non-Hisp) |  | Caution should be exercised when using this variable. For more information, see https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/. | Race and Age (case data only) | 13
153 | Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) |  | Caution should be exercised when using this variable. For more information, see https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/. | Race and Age (case data only) | 13
154 | Age at diagnosis | 230 | Age at Diagnosis is the patient's age at diagnosis of this cancer and is a three digit field based on single year of age. This cannot be used in a rate or prevalence session to link to populations. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#AGE_AT_DIAGNOSIS. | Race and Age (case data only) | 13
155 | Race/ethnicity |  | Recode which gives priority to non-white races for persons of mixed races. Note that not all codes were in effect for all years. For more information, see https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/. | Race and Age (case data only) | 13
156 | NHIA Derived Hisp Origin | 191 | NAACCR Item #=191. For more information, see https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/. | Race and Age (case data only) | 13
157 | IHS Link | 192 | Incidence files are periodically linked with Indian Health Service (IHS) files to identify Native Americans. The race recode uses information from this field and race to determine if a person is Native American or not. See https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity/ | Race and Age (case data only) | 13
158 | Year of birth | 240 | NAACCR Name=Birth Date--Year, Item #=240. The SEER dates on this file do not have the corresponding date flag field included. Blank means unknown. | Dates | 15
159 | Month of diagnosis | 390 | NAACCR Name=Date of Diagnosis--Month, Item #=390. SEER dates on this file do not have the corresponding date flag field included. Blank means unknown. | Dates | 15
160 | Month of diagnosis recode |  | Estimates month of diagnosis, based on other known dates for that patient, when actual month of diagnosis is unknown. | Dates | 15
161 | SS seq # - mal+ins (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (most detail). Based on all the tumors in SEER. | Site Specific Sequence Numbers | 18
162 | SS seq # 1975+ - mal+ins (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (most detail). Based on tumors diagnosed 1975+. | Site Specific Sequence Numbers | 18
163 | SS seq # 1992+ - mal+ins (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (most detail). Based on tumors diagnosed 1992+. | Site Specific Sequence Numbers | 18
164 | SS seq # 2000+ - mal+ins (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (most detail). Based on tumors diagnosed 2000+. | Site Specific Sequence Numbers | 18
165 | Site - mal+ins (most detail) |  | Should be used in conjunction with and only with the variables SS seq # - mal+ins (most detail), SS seq # 1975+ - mal+ins (most detail), SS seq # 1992+ - mal+ins (most detail), or SS seq # 2000+ - mal+ins (most detail). Groupings should not be created. | Site Specific Sequence Numbers | 18
166 | SS seq # - mal (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (most detail). Based on all the tumors in SEER. | Site Specific Sequence Numbers | 18
167 | SS seq # 1975+ - mal (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (most detail). Based on tumors diagnosed 1975+. | Site Specific Sequence Numbers | 18
168 | SS seq # 1992+ - mal (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (most detail). Based on tumors diagnosed 1992+. | Site Specific Sequence Numbers | 18
169 | SS seq # 2000+ - mal (most detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (most detail). Based on tumors diagnosed 2000+. | Site Specific Sequence Numbers | 18
170 | Site - malignant (most detail) |  | Should be used in conjunction with and only with the variables SS seq # - mal (most detail), SS seq # 1975+ - mal (most detail), SS seq # 1992+ - mal (most detail), or SS seq # 2000+ - mal (most detail). Groupings should not be created. | Site Specific Sequence Numbers | 18
171 | SS seq # - mal+ins (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (mid detail). Based on all the tumors in SEER. | Site Specific Sequence Numbers | 18


172 | SS seq # 1975+ - mal+ins (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (mid detail). Based on tumors diagnosed 1975+. | Site Specific Sequence Numbers | 18
173 | SS seq # 1992+ - mal+ins (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (mid detail). Based on tumors diagnosed 1992+. | Site Specific Sequence Numbers | 18
174 | SS seq # 2000+ - mal+ins (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (mid detail). Based on tumors diagnosed 2000+. | Site Specific Sequence Numbers | 18
175 | Site - mal+ins (mid detail) |  | Should be used in conjunction with and only with the variables SS seq # - mal+ins (mid detail), SS seq # 1975+ - mal+ins (mid detail), SS seq # 1992+ - mal+ins (mid detail), or SS seq # 2000+ - mal+ins (mid detail). Groupings should not be created. | Site Specific Sequence Numbers | 18
176 | SS seq # - mal (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (mid detail). Based on all the tumors in SEER. | Site Specific Sequence Numbers | 18
177 | SS seq # 1975+ - mal (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (mid detail). Based on tumors diagnosed 1975+. | Site Specific Sequence Numbers | 18
178 | SS seq # 1992+ - mal (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (mid detail). Based on tumors diagnosed 1992+. | Site Specific Sequence Numbers | 18
179 | SS seq # 2000+ - mal (mid detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (mid detail). Based on tumors diagnosed 2000+. | Site Specific Sequence Numbers | 18
180 | Site - malignant (mid detail) |  | Should be used in conjunction with and only with the variables SS seq # - mal (mid detail), SS seq # 1975+ - mal (mid detail), SS seq # 1992+ - mal (mid detail), or SS seq # 2000+ - mal (mid detail). Groupings should not be created. | Site Specific Sequence Numbers | 18
181 | SS seq # - mal+ins (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (least detail). Based on all the tumors in SEER. | Site Specific Sequence Numbers | 18
182 | SS seq # 1975+ - mal+ins (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (least detail). Based on tumors diagnosed 1975+. | Site Specific Sequence Numbers | 18
183 | SS seq # 1992+ - mal+ins (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (least detail). Based on tumors diagnosed 1992+. | Site Specific Sequence Numbers | 18
184 | SS seq # 2000+ - mal+ins (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - mal+ins (least detail). Based on tumors diagnosed 2000+. | Site Specific Sequence Numbers | 18
185 | Site - mal+ins (least detail) |  | Should be used in conjunction with and only with the variables SS seq # - mal+ins (least detail), SS seq # 1975+ - mal+ins (least detail), SS seq # 1992+ - mal+ins (least detail), or SS seq # 2000+ - mal+ins (least detail). Groupings should not be created. | Site Specific Sequence Numbers | 18
186 | SS seq # - mal (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (least detail). Based on all the tumors in SEER. | Site Specific Sequence Numbers | 18
187 | SS seq # 1975+ - mal (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (least detail). Based on tumors diagnosed 1975+. | Site Specific Sequence Numbers | 18
188 | SS seq # 1992+ - mal (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (least detail). Based on tumors diagnosed 1992+. | Site Specific Sequence Numbers | 18
189 | SS seq # 2000+ - mal (least detail) |  | Site specific sequence number of the tumor associated with the site classification scheme in the variable Site - malignant (least detail). Based on tumors diagnosed 2000+. | Site Specific Sequence Numbers | 18
190 | Site - malignant (least detail) |  | Should be used in conjunction with and only with the variables SS seq # - mal (least detail), SS seq # 1975+ - mal (least detail), SS seq # 1992+ - mal (least detail), or SS seq # 2000+ - mal (least detail). Groupings should not be created. | Site Specific Sequence Numbers | 18
191 | Patient ID | 20 | This field used in conjunction with SEER registry to uniquely identify a person. One person can have multiple primaries but has the same Patient ID. See the sequence number for more information about the primary. This is a dummy number and is not the number used by the registry to identify the patient. The same number is not used across all submissions for each patient. | Other | 19


192 | Type of Reporting Source | 500 | Contains information on where the information from this case came. Is used in survival analysis to eliminate cases which are found at autopsy or the only information is from a death certificate. See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#TYPE_OF_REPORTING_SOURCE | Other | 19
193 | Insurance Recode (2007+) |  | Created from NAACCR Field Primary Payer at DX, Item #=630. Caution is urged when using this variable. Caution should be exercised when using this variable. For more information, see https://seer.cancer.gov/seerstat/variables/seer/insurance-recode/. | Other | 19
194 | Marital status at diagnosis | 150 | See ASCII text file description: https://seer.cancer.gov/data-software/documentation/seerstat/nov2017/TextData.FileDescription.pdf#MARITAL_STATUS_AT_DX | Other | 19


Appendix B: An Example of Record in SEER Dataset

The following is an example of one tumor record (patient ID: 4552105) from the database “Incidence - SEER 18 Regs Research Data + Hurricane Katrina Impacted Louisiana Cases, Nov 2017 Sub (1973-2015 varying)”:

Feature Name | Value
Age recode with <1 year olds | 65-69 years
Race recode (White, Black, Other) | Other (American Indian/AK Native, Asian/Pacific Islander)
Sex | Female
Year of diagnosis | 2013
SEER registry | San Francisco-Oakland SMSA - 1973+
Louisiana 2005 - 1st vs 2nd half of year | Not applicable (1973-2004, 2005 not Louisiana, or 2006+)
County | 1
State-county | CA: Alameda County (06001)
In research data | Yes
CHSDA 2012 | Not CHSDA
CHSDA Region | Pacific Coast
State | California
Site recode ICD-O-3/WHO 2008 | Lung and Bronchus
Behavior recode for analysis | Malignant
AYA site recode/WHO 2008 | 8.3 Carcinoma of trachea, bronchus, and lung
Lymphoma subtype recode/WHO 2008 | Unclassified
ICCC site recode ICD-O-3/WHO 2008 | XI(f) Other and unspecified carcinomas
CS Schema v0204+ | Lung
CS Schema - AJCC 6th Edition | Lung
Primary Site - labeled | C34.1-Upper lobe, lung
Primary Site | 341
Histologic Type ICD-O-3 | 8140
Behavior code ICD-O-3 | Malignant
Grade | Moderately differentiated; Grade II
Laterality | Left - origin of primary
Diagnostic Confirmation | Positive histology
ICD-O-3 Hist/behav | 8140/3: Adenocarcinoma, NOS
ICD-O-3 Hist/behav, malignant | 8140/3: Adenocarcinoma, NOS
Histology recode - broad groupings | 8140-8389: adenomas and adenocarcinomas
Histology recode - Brain groupings | Not Brain
ICCC site rec extended ICD-O-3/WHO 2008 | XI(f.4) Carcinomas of lung
Site recode B ICD-O-3/WHO 2008 | Lung and Bronchus
Derived AJCC Stage Group, 7th ed (2010+) | IA
Derived AJCC Stage Group, 6th ed (2004+) | IA
Breast - Adjusted AJCC 6th Stage (1988+) | Blank(s)
Derived AJCC - Flag (2004+) | AJCC 6th ed derived from CS manual/coding instructions, v1.0
AJCC stage 3rd edition (1988-2003) | Blank(s)
SEER modified AJCC stage 3rd (1988-2003) | Blank(s)
Lymphoma - Ann Arbor Stage (1983+) | N/A
Derived AJCC T, 7th ed (2010+) | T1a
Derived AJCC N, 7th ed (2010+) | N0
Derived AJCC M, 7th ed (2010+) | M0
Derived AJCC T, 6th ed (2004+) | T1
Derived AJCC N, 6th ed (2004+) | N0
Derived AJCC M, 6th ed (2004+) | M0
T value - based on AJCC 3rd (1988-2003) | Blank(s)
N value - based on AJCC 3rd (1988-2003) | Blank(s)
M value - based on AJCC 3rd (1988-2003) | Blank(s)
Breast - Adjusted AJCC 6th T (1988+) | Blank(s)
Breast - Adjusted AJCC 6th N (1988+) | Blank(s)
Breast - Adjusted AJCC 6th M (1988+) | Blank(s)
Derived SS1977 (2004+) | L
Derived SS2000 (2004+) | L
Summary stage 2000 (1998+) | Localized
SEER historic stage A | Localized
SEER summary stage 2000 (2001-2003) | Blank(s)
SEER summary stage 1977 (1995-2000) | Blank(s)
RX Summ–Surg Prim Site (1998+) | 33
RX Summ–Scope Reg LN Sur (2003+) | 4 or more regional lymph nodes removed
RX Summ–Surg Oth Reg/Dis (2003+) | None; diagnosed at autopsy
Reason no cancer-directed surgery | Surgery performed
Scope of reg lymph nd surg (1998-2002) | Blank(s)
RX Summ–Reg LN Examined (1998-2002) | Blank(s)
Surgery of oth reg/dis sites (1998-2002) | Blank(s)
CS tumor size (2004+) | 20
CS extension (2004+) | 100
CS lymph nodes (2004+) | 0
CS mets at dx (2004+) | 0
ER Status Recode Breast Cancer (1990+) | Not 1990+ Breast
PR Status Recode Breast Cancer (1990+) | Not 1990+ Breast
Derived HER2 Recode (2010+) | Not 2010+ Breast
Breast Subtype (2010+) | Not 2010+ Breast
CS site-specific factor 1 (2004+ varying by schema) | 0
CS site-specific factor 2 (2004+ varying by schema) | 0
CS site-specific factor 3 (2004+ varying by schema) | Blank(s)
CS site-specific factor 4 (2004+ varying by schema) | Blank(s)
CS site-specific factor 5 (2004+ varying by schema) | Blank(s)
CS site-specific factor 6 (2004+ varying by schema) | Blank(s)
CS site-specific factor 7 (2004+ varying by schema) | Blank(s)
CS site-specific factor 8 (2004+ varying by schema) | Blank(s)
CS site-specific factor 9 (2004+ varying by schema) | Blank(s)
CS site-specific factor 10 (2004+ varying by schema) | Blank(s)
CS site-specific factor 11 (2004+ varying by schema) | Blank(s)
CS site-specific factor 12 (2004+ varying by schema) | Blank(s)
CS site-specific factor 13 (2004+ varying by schema) | Blank(s)
CS site-specific factor 15 (2004+ varying by schema) | Blank(s)
CS site-specific factor 16 (2004+ varying by schema) | Blank(s)
CS site-specific factor 25 (2004+ varying by schema) | 988
Regional nodes examined (1988+) | 6
Regional nodes positive (1988+) | 0
Lymph-vascular Invasion (2004+ varying by schema) | Blank(s)
CS mets at DX-bone (2010+) | No
CS mets at DX-brain (2010+) | No
CS mets at DX-liver (2010+) | No
CS mets at DX-lung (2010+) | No
CS version input current (2004+) | 20540
CS version input original (2004+) | 20440
CS version derived (2004+) | 20550
EOD 10 - Prostate path ext (1995-2003) | Blank(s)
EOD 10 - extent (1988-2003) | Blank(s)
EOD 10 - nodes (1988-2003) | Blank(s)
EOD 10 - size (1988-2003) | Blank(s)
Tumor marker 1 (1990-2003) | Blank(s)
Tumor marker 2 (1990-2003) | Blank(s)
Tumor marker 3 (1998-2003) | Blank(s)
Coding system-EOD (1973-2003) | Blank(s)
2-Digit NS EOD part 1 (1973-1982) | Blank(s)
2-Digit NS EOD part 2 (1973-1982) | Blank(s)
2-Digit SS EOD part 1 (1973-1982) | Blank(s)
2-Digit SS EOD part 2 (1973-1982) | Blank(s)
Expanded EOD(1) - CP53 (1973-1982) | Blank(s)
Expanded EOD(2) - CP54 (1973-1982) | Blank(s)
Expanded EOD(1,2) - CP53,54 (1973-1982) | Blank(s)
Expanded EOD(3) - CP55 (1973-1982) | Blank(s)
Expanded EOD(4) - CP56 (1973-1982) | Blank(s)
Expanded EOD(5) - CP57 (1973-1982) | Blank(s)
Expanded EOD(6) - CP58 (1973-1982) | Blank(s)
Expanded EOD(7) - CP59 (1973-1982) | Blank(s)
Expanded EOD(8) - CP60 (1973-1982) | Blank(s)
Expanded EOD(9) - CP61 (1973-1982) | Blank(s)
Expanded EOD(10) - CP62 (1973-1982) | Blank(s)
Expanded EOD(11) - CP63 (1973-1982) | Blank(s)
Expanded EOD(12) - CP64 (1973-1982) | Blank(s)
Expanded EOD(13) - CP65 (1973-1982) | Blank(s)
EOD 4 - extent (1983-1987) | Blank(s)
EOD 4 - nodes (1983-1987) | Blank(s)
EOD 4 - size (1983-1987) | Blank(s)
COD to site recode | Lung and Bronchus
SEER cause-specific death classification | Dead (attributable to this cancer dx)
SEER other cause of death classification | Alive or dead due to cancer
Survival months | 13
Survival months flag | Complete dates are available and there are more than 0 days of survival
COD to site rec KM | Lung and Bronchus
Vital status recode (study cutoff used) | Dead
Type of follow-up expected | Active follow-up
Sequence number | One primary only
First malignant primary indicator | Yes
Primary by international rules | Yes
Record number | 1
Record number recode | 2
Total number of in situ/malignant tumors for patient | 1
Total number of benign/borderline tumors for patient | 1
Behavior code ICD-O-2 | Malignant
Histology ICD-O-2 | 8140
Recode ICD-O-2 to 9 | 1623
Recode ICD-O-2 to 10 | C341
Age recode with single ages and 85+ | 65 years
Race recode (W, B, AI, API) | Asian or Pacific Islander
Origin recode NHIA (Hispanic, Non-Hisp) | Non-Spanish-Hispanic-Latino
Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) | Non-Hispanic Asian or Pacific Islander
Age at diagnosis | 65
Race/ethnicity | Chinese
NHIA Derived Hisp Origin | Non-Spanish-Hispanic-Latino
IHS Link | Record sent for linkage, no IHS match
Year of birth | 1947
Month of diagnosis | August
Month of diagnosis recode | August
SS seq # - mal+ins (most detail) | 1
SS seq # 1975+ - mal+ins (most detail) | 1
SS seq # 1992+ - mal+ins (most detail) | 1
SS seq # 2000+ - mal+ins (most detail) | 1
Site - mal+ins (most detail) | Lung and Bronchus - mal+ins
SS seq # - mal (most detail) | 1
SS seq # 1975+ - mal (most detail) | 1
SS seq # 1992+ - mal (most detail) | 1
SS seq # 2000+ - mal (most detail) | 1
Site - malignant (most detail) | Lung and Bronchus - mal
SS seq # - mal+ins (mid detail) | 1
SS seq # 1975+ - mal+ins (mid detail) | 1
SS seq # 1992+ - mal+ins (mid detail) | 1
SS seq # 2000+ - mal+ins (mid detail) | 1
Site - mal+ins (mid detail) | Respiratory System - mal+ins
SS seq # - mal (mid detail) | 1
SS seq # 1975+ - mal (mid detail) | 1
SS seq # 1992+ - mal (mid detail) | 1
SS seq # 2000+ - mal (mid detail) | 1
Site - malignant (mid detail) | Respiratory System - mal
SS seq # - mal+ins (least detail) | 1
SS seq # 1975+ - mal+ins (least detail) | 1
SS seq # 1992+ - mal+ins (least detail) | 1
SS seq # 2000+ - mal+ins (least detail) | 1
Site - mal+ins (least detail) | Respiratory System - mal+ins
SS seq # - mal (least detail) | 1
SS seq # 1975+ - mal (least detail) | 1
SS seq # 1992+ - mal (least detail) | 1
SS seq # 2000+ - mal (least detail) | 1
Site - malignant (least detail) | Respiratory System - mal
Patient ID | 4552105
Type of Reporting Source | Hospital inpatient/outpatient or clinic
Insurance Recode (2007+) | Insured
Marital status at diagnosis | Married (including common law)
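Many values in the record above are the literal string "Blank(s)", and numeric fields are exported as text. The following is a minimal sketch of how a few of these fields might be cleaned before further processing. It is an illustration only, not the preprocessing used in this thesis; the names `record`, `MISSING`, and `clean_value` are hypothetical, and the field values are copied from the example record.

```python
from typing import Optional, Union

# A few fields copied from the example record (values as exported strings).
record = {
    "Age at diagnosis": "65",
    "Year of diagnosis": "2013",
    "Survival months": "13",
    "CS tumor size (2004+)": "20",
    "Regional nodes positive (1988+)": "0",
    "Vital status recode (study cutoff used)": "Dead",
    "EOD 10 - size (1988-2003)": "Blank(s)",
}

# Assumption: "Blank(s)" is treated as the missing-value marker.
MISSING = "Blank(s)"


def clean_value(raw: str) -> Optional[Union[int, str]]:
    """Return None for missing fields, an int for all-digit values, else the label."""
    text = raw.strip()
    if text == MISSING:
        return None
    if text.isdigit():
        return int(text)
    return text


cleaned = {name: clean_value(value) for name, value in record.items()}
```

Fields that hold codes rather than quantities (for example Primary Site or Histology ICD-O-2) would also be converted to integers by this rule, so in practice they would likely need to be excluded and kept as strings.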


Curriculum Vitae: Haoze Du

Personal Details

Gender: Male

Date of birth: May 20th, 1995

Place of birth: Xinxiang, China

Email: [email protected]

Research Interests

I am interested in machine learning and its applications, and in digital image processing and recognition.

Education

09/2013–06/2017

Nanjing University of Aeronautics and Astronautics, China
Bachelor of Engineering in Computer Science
Thesis: Research on Migrating xv6 OS to MIPS platform

Since 8/2017

Wake Forest University
Graduate student in the Department of Computer Science

Working Experience

07/2016–01/2017

Software Engineering Intern
Pacteria, Wuxi

Since 8/2017

Teaching Assistant
Wake Forest University


Scholarships

• Award of the Third Prize Outstanding Student Scholarship, Nanjing University of Aeronautics and Astronautics (No. 1321361), Nov. 25, 2014.

• Award of the Third Prize Outstanding Student Scholarship, Nanjing University of Aeronautics and Astronautics (No. 1336388), Nov. 20, 2015.

Papers

• Xianfang Wang, Haoze Du, Shuai Zhang. Dynamic multi-objective cooperative optimization of biochemical process based on kinetic model and MOPSO. Metallurgical and Mining Industry, 2015, 7(6), pp. 392-399.

• Xianfang Wang, Haoze Du, Jinglu Tan. Online Fault Diagnosis for Biochemical Process Based on FCM and SVM. Interdiscip Sci Comput Life Sci. Published online 29 April 2016.

• WANG Xian-fang, WANG Sui-hua, DU Hao-ze, WANG Ping. Fault diagnosis of chemical industry process based on FRS and SVM. Control and Decision, 2015, 30(2), pp. 353-356. (In Chinese)

Activities

• Member of UPE.

• Member of ACM and China Computer Federation.


• Award of Honorable Mention in the Mid-Atlantic Regional of ACM-ICPC, Nov. 2017.

• Award of Honorable Mention in the 18th Annual MCM/ICM Competition.

USA. Apr 25, 2016.

• Award of the Second Prize of College Group A for C/C++ Program Design in the 6th Annual Blue Bridge Cup National Software and Information Technology Professional Talent Contest, Jiangsu Division (No. 010601451), Apr. 17, 2015.
