
COMP 204 - Assignment #5

Due date: Friday, April 12, 23:59

What to submit: This Jupyter notebook, with code completed for each question

Background: Predicting IDC in Breast Cancer Histology Images

Breast cancer is the most common form of cancer in women, and invasive ductal carcinoma (IDC) is the most common form of breast cancer. Accurately identifying and categorizing breast cancer subtypes is an important clinical task, and automated methods can be used to save time and reduce error.

In this assignment, we will use a machine learning approach to identify IDC when it is present in otherwise unlabeled histopathology images. The dataset consists of approximately five thousand 50x50 pixel RGB digital images of color-stained breast histopathology samples that are labeled as either IDC or non-IDC.

For easy importing, the digital images are extracted from breast tissue samples and saved as numpy arrays. Each image patch is a 50x50x3 image with 3 color channels representing red, green and blue (RGB). The breast tissue contains many cells but only some of them are cancerous. Patches that are labeled "1" contain cells that are characteristic of invasive ductal carcinoma.

For more information about the data, see https://www.ncbi.nlm.nih.gov/pubmed/27563488 and http://spie.org/Publications/Proceedings/Paper/10.1117/12.2043872.

For more information about IDC and breast cancer, please review the following publications:

https://www.ncbi.nlm.nih.gov/pubmed/27864452
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3893344/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4952020/
https://arxiv.org/abs/1802.00752 (deep learning approach)

First, let's import all of the required modules


In [1]:
import numpy as np
from matplotlib import pyplot
import matplotlib.pylab as plt
from scipy.misc import imresize, imread
import sklearn
from sklearn import model_selection
from sklearn.model_selection import train_test_split, KFold, cross_val_score, StratifiedKFold, learning_curve, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score
from sklearn.metrics import confusion_matrix
import random

For the purpose of this assignment, we will be working with only 500 out of the 5000 images and labels. The following code snippet (commented out) randomly subsamples 500 examples and saves them in the data directory.

DO NOT RUN THIS block; instead, use the files A5_X.npy and A5_Y.npy that are provided to you in this assignment.

In [2]:
# X = np.load('X.npy')  # images
# Y = np.load('Y.npy')  # labels associated to images (0 = no IDC, 1 = IDC)
# random.seed(9001)
# idx = np.array(random.sample(range(Y.shape[0]), 500))
# X = X[idx] / 256.0  # normalize the data within range 0 and 1
# Y = Y[idx]
# np.save('A5_X.npy', X)
# np.save('A5_Y.npy', Y)

Load the subsampled image and label data

In [3]:
X = np.load('A5_X.npy')  # images
Y = np.load('A5_Y.npy')  # labels associated to images (0 = no IDC, 1 = IDC)

We can print some simple statistics of the data as follows


In [4]:
print('Total number of images: {}'.format(len(X)))
print('Number of IDC(-) Images: {}'.format(np.sum(Y==0)))
print('Number of IDC(+) Images: {}'.format(np.sum(Y==1)))
print('Percentage of IDC(+) images: {:.2f}%'.format(100*np.mean(Y)))
print('Image shape (Width, Height, Channels): {}'.format(X[0].shape))

We can visualize the image data as follows. Do you notice the distinct patterns between the IDC(-) and IDC(+) samples?

In [5]:
def plotImg(X0, X1):
    for row in range(3):
        plt.figure(figsize=(20, 10))
        for col in range(3):
            # plot 3 IDC(-) and 3 IDC(+) in one row
            plt.subplot(1, 8, col+1)
            plt.title('IDC (-)')
            plt.imshow(X0[row+col])
            plt.axis('off')
            plt.subplot(1, 8, col+4)
            plt.title('IDC (+)')
            plt.imshow(X1[row+col])
            plt.axis('off')

plotImg(X[Y==0], X[Y==1])

Total number of images: 500
Number of IDC(-) Images: 235
Number of IDC(+) Images: 265
Percentage of IDC(+) images: 53.00%
Image shape (Width, Height, Channels): (50, 50, 3)


The machine learning methods we are using cannot operate directly on two-dimensional images with three color channels.

To make them amenable to our data, we need to flatten each image and represent it as a 1D feature vector.

DO NOT MODIFY THIS BLOCK OF CODE OTHERWISE YOUR RESULTS WILL NOT BE CORRECT

In [ ]:
X_shape = X.shape[1] * X.shape[2] * X.shape[3]
X_flat = X.reshape(X.shape[0], X_shape)
X_flat.shape

This gives us a 500 x 7500 array: 500 images, each with 7500 features (i.e., 50 x 50 x 3 = 7500).

Question 1 (10 points) Create training and testing sets

First, split the data into an 80% training set and a 20% testing set.

In [ ]:
# YOUR CODE HERE
print("Training Data Shape:", X_train.shape, Y_train.shape)
print("Testing Data Shape:", X_test.shape, Y_test.shape)
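As an illustration only (not necessarily the intended solution), the split could be produced with train_test_split, which was imported above; the random_state value is an arbitrary assumption.

# Sketch of an 80/20 split of the flattened data; random_state=42 is arbitrary.
X_train, X_test, Y_train, Y_test = train_test_split(
    X_flat, Y, test_size=0.2, random_state=42)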

Question 2 (20 points) Train and evaluate logistic regression classifier

Next, write a function evaluateAccuracy_LR to evaluate the accuracy of a logistic regression (LR) classifier.

You will use the training set (i.e., 80% of the dataset) to train the LR classifier, and use the testing set (i.e., 20% of the dataset) to evaluate your trained LR model.

In [ ]:
# runLogisticRegression
def evaluateAccuracy_LR(train_x, train_y, test_x, test_y):
    # YOUR CODE HERE
    return acc_train, tpr_train, fpr_train, acc_test, tpr_test, fpr_test

In [ ]:
acc_train, tpr_train, fpr_train, acc_test, tpr_test, fpr_test = evaluateAccuracy_LR(X_train, Y_train, X_test, Y_test)
print(f"Train: accuracy: {acc_train:.2}, TPR: {tpr_train:.2}, FPR: {fpr_train:.2}")
print(f"Testing: accuracy: {acc_test:.2}, TPR: {tpr_test:.2}, FPR: {fpr_test:.2}")
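For reference, accuracy, TPR and FPR can all be derived from a confusion matrix. The helper below is only an illustrative sketch (the function name rates is hypothetical and not part of the assignment), using the confusion_matrix import from the first cell.

# Sketch: derive accuracy, TPR and FPR from true and predicted labels.
def rates(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    acc = (tp + tn) / (tp + tn + fp + fn)   # fraction of correct predictions
    tpr = tp / (tp + fn)                    # true positive rate (sensitivity)
    fpr = fp / (fp + tn)                    # false positive rate
    return acc, tpr, fpr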


As we can see above, we obtained perfect training performance but much worse testing performance. This implies severe overfitting of the data.

Dimensionality reduction

We have 7500 features for only 500 training examples. Even if we use the full 5000 examples, the number of examples is still lower than the number of features. This is commonly known as the "curse of dimensionality", which in some sense implies that models like logistic regression (without sparse regularization) are doomed to overfit the data.

To address this issue, we will use a dimensionality-reduction technique called Principal Component Analysis (PCA). PCA is an unsupervised learning technique. The basic idea behind PCA is to rotate the coordinate axes of the feature space. We first find the direction in which the data varies the most and set up one coordinate axis along this direction; this is called the first principal component. We then look for a perpendicular direction in which the data varies the second most; this is the second principal component. The diagram illustrates this process. There are as many principal components as the feature dimension: all we have done is a rotation.

We'll reduce our dimensionality from 7500 to 10.

(Diagram taken from http://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues, which also has a nice discussion.)

In [ ]:
from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(X_flat)
X_pca = pca.transform(X_flat)
X_pca.shape

We can examine how much total variance the 10 PCs explain:


In [ ]:
pca.explained_variance_ratio_.sum()

In total, the 10 PCs explain about 60% of the total variance, which is sufficient for our purpose.

Also, we can examine how much variance each PC explains:

In [ ]:
plt.bar(range(10), pca.explained_variance_ratio_*100)
plt.xlabel("Principal Components (PCs)")
plt.ylabel("Variance explained")
plt.show()

From the above bar plot, we see that the first three PCs explain substantially more variance than the remaining seven PCs.

Question 3 (10 points) Apply PCA to training and testing sets

Re-train the PCA ONLY on the training data set, and then apply the trained PCA to transform both the training and testing data sets.

In [ ]: # YOUR CODE HERE
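A minimal sketch of this step (one possible approach; the variable names X_train_pca and X_test_pca are assumptions, and X_train/X_test are the splits from Question 1):

pca = PCA(n_components=10)
pca.fit(X_train)                      # fit the PCA on the training data only
X_train_pca = pca.transform(X_train)  # project the training data onto the 10 PCs
X_test_pca = pca.transform(X_test)    # reuse the same projection for the test data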

Plot the first three PCs in a three-dimensional scatter plot with PC1 on the x-axis, PC2 on the y-axis, and PC3 on the z-axis. Visualize the data points by coloring IDC(+) in red and IDC(-) in blue. Use a circle 'o' to represent training data points and 'x' to represent testing data points.

Learn how to generate a 3D scatter plot here: https://matplotlib.org/mpl_toolkits/mplot3d/tutorial.html

In [ ]: # YOUR CODE HERE
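One possible way to draw such a plot with mpl_toolkits (an illustrative sketch only; it assumes the PCA-transformed arrays and names from the previous sketch):

from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
# circles for training points, crosses for testing points; red = IDC(+), blue = IDC(-)
for data, labels, marker in [(X_train_pca, Y_train, 'o'), (X_test_pca, Y_test, 'x')]:
    for cls, color in [(0, 'blue'), (1, 'red')]:
        pts = data[labels == cls]
        ax.scatter(pts[:, 0], pts[:, 1], pts[:, 2], c=color, marker=marker)
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
plt.show()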


The expected plot is shown below

Question 4 (10 points) Evaluate the logistic regression using the 10 PCs

In [ ]:
# YOUR CODE HERE
print(f"Train: accuracy: {acc_train:.2}, TPR: {tpr_train:.2}, FPR: {fpr_train:.2}")
print(f"Testing: accuracy: {acc_test:.2}, TPR: {tpr_test:.2}, FPR: {fpr_test:.2}")
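If the evaluateAccuracy_LR function from Question 2 and the PCA-transformed arrays from the Question 3 sketch are reused, the call might look like the sketch below (illustrative only):

# Sketch: evaluate the same LR pipeline on the 10 PCA features.
acc_train, tpr_train, fpr_train, acc_test, tpr_test, fpr_test = \
    evaluateAccuracy_LR(X_train_pca, Y_train, X_test_pca, Y_test)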

We see much better performance on the testing set, and no obvious overfitting occurs after the PCA dimensionality reduction!


Question 5 (20 points) Five-fold Cross Validation

As we discussed in Lecture 31, instead of evaluating the model only on 20% of the data, we want to evaluate the model on ALL data points. Implement a 5-fold cross-validation to evaluate the model as follows:

1. Divide the dataset into 5 folds.
2. Train a PCA model on 4 folds and train the LR model on the PCA-transformed 4 folds.
3. Apply the trained 10-component PCA to the held-out fold.
4. Apply the LR to the PCA-transformed held-out fold. Save the probabilities of the model predictions.
5. Repeat steps 2-4 on each fold (i.e., 5 times in total).
6. Calculate and draw the ROC curve on ALL of the predictions over the 5 folds.
7. Calculate the AUC from the ROC curve.

Also, implement a 5-fold CV for LR operating on the untransformed 7500-feature data.

Compare the performance of LR on the PCA-transformed 10-dimensional dataset with LR on the original 7500-dimensional dataset.

In [ ]:
def cross_validate(model, X_input, Y_output, pca_transform=False):
    """
    Input:
        model: a sklearn machine learning model object
        X_input: a numpy array for the input data
        Y_output: output labels
        pca_transform: a boolean indicating whether PCA should be applied to
            transform the training and validation data to 10 PCs before
            applying the machine learning model
    Returns:
        true_labels: Y_output but saved in the order of the cross-validation folds
        pred_scores: predicted scores or probabilities for each CV fold
    """
    # YOUR CODE HERE
    return (true_labels, pred_scores)
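For illustration only, here is a sketch of one way such a function could be organized (not necessarily the expected solution); it uses StratifiedKFold and PCA, which were imported earlier, and the internal names (fold_pca, x_tr, x_te) are hypothetical:

def cross_validate_sketch(model, X_input, Y_output, pca_transform=False):
    # Sketch: collect held-out labels and predicted probabilities over 5 folds.
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    true_labels, pred_scores = [], []
    for train_idx, test_idx in skf.split(X_input, Y_output):
        x_tr, x_te = X_input[train_idx], X_input[test_idx]
        if pca_transform:
            fold_pca = PCA(n_components=10)
            x_tr = fold_pca.fit_transform(x_tr)  # fit PCA on the 4 training folds only
            x_te = fold_pca.transform(x_te)      # transform the held-out fold
        model.fit(x_tr, Y_output[train_idx])
        pred_scores.extend(model.predict_proba(x_te)[:, 1])  # probability of IDC(+)
        true_labels.extend(Y_output[test_idx])
    return (np.array(true_labels), np.array(pred_scores))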

Run the cross-validation implemented above using LR with PCA transformation and LR on the original data.

Save the FPR, TPR, and AUC for each choice for plotting the ROC

In [ ]: # YOUR CODE HERE
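One possible way to run the two cross-validations and summarize them with roc_curve and roc_auc_score (a sketch only; cross_validate refers to the function to be implemented above, and the variable names are assumptions):

lr = LogisticRegression(solver="liblinear")
labels_pca, scores_pca = cross_validate(lr, X_flat, Y, pca_transform=True)
labels_raw, scores_raw = cross_validate(lr, X_flat, Y, pca_transform=False)

fpr_pca, tpr_pca, _ = roc_curve(labels_pca, scores_pca)
fpr_raw, tpr_raw, _ = roc_curve(labels_raw, scores_raw)
auc_pca = roc_auc_score(labels_pca, scores_pca)
auc_raw = roc_auc_score(labels_raw, scores_raw)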

Plot the ROC curves for both the PCA-LR and LR on the same plot.

Annotate the plot with the corresponding AUCs for the two models.

In [ ]: # YOUR CODE HERE
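Continuing the sketch above, the two curves could then be drawn roughly as follows (again illustrative, not the required solution):

plt.plot(fpr_pca, tpr_pca, label='PCA-LR (AUC = {:.2f})'.format(auc_pca))
plt.plot(fpr_raw, tpr_raw, label='LR (AUC = {:.2f})'.format(auc_raw))
plt.plot([0, 1], [0, 1], linestyle='--', color='grey')  # chance line
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()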


The expected ROC plot is shown below. We see a 6% improvement for LR operating on the 10 PCs compared to LR operating on the 7500 features!

Question 6 (30 points) Comparing several classifiers

Using the function implemented above, experiment with the following classifiers:

1. LogisticRegression(solver="liblinear")
2. KNeighborsClassifier()
3. SVC(probability=True, gamma='auto')
4. DecisionTreeClassifier()
5. RandomForestClassifier(n_estimators=100)

Note that each of these initializers creates a classifier object of the corresponding class, so we can pass them directly to the cross_validate function without hard-coding a specific classifier inside the function.

Experiment with each of the five methods both with and without the 10-PC PCA transformation. Therefore, in total there are 10 methods to compare.

Use the same 5-fold cross-validation to get the ROC and AUC for each method, and plot them in a single plot for comparison.

In [ ]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# YOUR CODE HERE
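As a sketch of one way to organize the comparison (the dictionary and loop structure are assumptions, and cross_validate is the function from Question 5):

classifiers = {
    'LR': LogisticRegression(solver="liblinear"),
    'KNN': KNeighborsClassifier(),
    'SVC': SVC(probability=True, gamma='auto'),
    'DT': DecisionTreeClassifier(),
    'RF': RandomForestClassifier(n_estimators=100),
}

results = {}  # method name -> (fpr, tpr, auc)
for name, clf in classifiers.items():
    for use_pca in (False, True):
        labels, scores = cross_validate(clf, X_flat, Y, pca_transform=use_pca)
        fpr, tpr, _ = roc_curve(labels, scores)
        key = name + (' + PCA' if use_pca else '')
        results[key] = (fpr, tpr, roc_auc_score(labels, scores))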

Plot the ROC curves in a single plot.

In [ ]: # YOUR CODE HERE
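Continuing the sketch above, the ten curves stored in the hypothetical results dictionary could be drawn as follows:

for key, (fpr, tpr, auc) in results.items():
    plt.plot(fpr, tpr, label='{} (AUC = {:.2f})'.format(key, auc))
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.legend()
plt.show()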


The expected plot is displayed below.

As we can see, not all of the methods benefit significantly from the PCA transformation. Some of the reasons are:

- Some of the methods, such as RF and DT, have a feature selection mechanism built in, instead of trying to fit all of the features as LR does.
- SVC transforms the data into a space where the classes become linearly separable, and therefore often does not benefit from an external PCA transformation.

The best result we obtain from this analysis is Random Forest (RF) without PCA transformation, with AUC = 0.83.

We will use RF to compare with a convolutional neural network in the bonus question below.


Bonus (10 points) Running a Convolutional Neural Network (CNN) with Keras

Until now, we have treated each image as a flat vector of 7500 independent features. However, images contain spatial information. For instance, pixels adjacent to each other tend to look more similar than those far apart.

We would like to move an imaginary lens across the image and extract meaningful image feature patches that are helpful in classifying breast cancer samples.

To this end, we will use a Convolutional Neural Network (CNN). CNNs are a family of deep learning models that are specifically suitable for image inputs.

The above figure was extracted from the following paper: https://arxiv.org/abs/1802.00752

We use Keras to train the CNN. Keras is a high-level wrapper library around Google TensorFlow that provides a simpler API for training neural networks with various architectures. When running on a computer with a GPU, Keras/TensorFlow will automatically use the GPU for the computation without additional configuration.

Feel free to compare the training speed on a GPU-enabled computer (e.g., a gaming desktop/laptop) versus a CPU-only computer.

Our CNN will not be as "deep" as the one shown above because that would take a long time to run.


In [ ]:
import keras
from keras import backend as K
from keras.callbacks import Callback, EarlyStopping, ReduceLROnPlateau, ModelCheckpoint
from keras.preprocessing.image import ImageDataGenerator
from keras.utils.np_utils import to_categorical
from keras.models import Sequential, model_from_json
from keras.optimizers import SGD, RMSprop, Adam, Adagrad, Adadelta
from keras.layers import Dense, Dropout, Activation, Flatten, BatchNormalization, Conv2D, MaxPool2D, MaxPooling2D
%matplotlib inline


In [ ]:
def trainCNN(x_train, y_train):
    """
    Run Keras CNN: https://github.com/fchollet/keras/blob/master/examples/mnist_cnn.py
    """
    num_classes = 2
    img_rows, img_cols = x_train.shape[1], x_train.shape[2]
    input_shape = (img_rows, img_cols, 3)

    model = Sequential()
    model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=input_shape))
    model.add(Conv2D(64, (3, 3), activation='relu'))
    model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Dropout(0.25))
    model.add(Flatten())
    model.add(Dense(128, activation='relu'))
    model.add(Dropout(0.5))
    model.add(Dense(num_classes, activation='softmax'))

    datagen = ImageDataGenerator(
        featurewise_center=False,             # set input mean to 0 over the dataset
        samplewise_center=False,              # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,   # divide each input by its std
        zca_whitening=False,                  # apply ZCA whitening
        rotation_range=10,                    # randomly rotate images in the range (degrees, 0 to 180)
        width_shift_range=0.1,                # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,               # randomly shift images vertically (fraction of total height)
        horizontal_flip=True,                 # randomly flip images horizontally
        vertical_flip=False)                  # do not flip images vertically

    model.compile(loss=keras.losses.categorical_crossentropy,
                  optimizer=keras.optimizers.Adadelta(),
                  metrics=['accuracy'])

    model.fit_generator(datagen.flow(x_train, y_train, batch_size=32),
                        steps_per_epoch=len(x_train) / 32,
                        epochs=12,
                        verbose=1)

    return model


The model is a Sequential model, which allows us to create the network layer by layer. The architecture consists of two convolutional layers, one max-pooling layer, one dropout layer, one flatten layer (2D to 1D), one fully connected layer, another dropout layer, and finally the output layer with softmax activation for prediction.

We also include a data augmentation step to learn rotation- and translation-invariant features.

In [ ]:
def cross_validate_cnn(X, Y):
    # YOUR CODE HERE
    return (true_labels, pred_scores)
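One possible shape for this function is sketched below (illustrative only, not the expected solution); it assumes trainCNN from above, KFold from sklearn, and to_categorical from keras, all already imported:

def cross_validate_cnn_sketch(X, Y):
    # Sketch: 5-fold CV for the CNN, collecting held-out labels and P(IDC+).
    kf = KFold(n_splits=5, shuffle=True, random_state=0)
    true_labels, pred_scores = [], []
    for train_idx, test_idx in kf.split(X):
        y_train_onehot = to_categorical(Y[train_idx], num_classes=2)  # one-hot labels for softmax
        model = trainCNN(X[train_idx], y_train_onehot)
        probs = model.predict(X[test_idx])   # shape (n_test, 2)
        pred_scores.extend(probs[:, 1])      # probability of IDC(+)
        true_labels.extend(Y[test_idx])
    return (np.array(true_labels), np.array(pred_scores))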

Let's do a 5-fold CV on the CNN model using all of the 5000 images

In [ ]:
X = np.load('X.npy') / 256.0
Y = np.load('Y.npy')
true_labels, pred_scores = cross_validate_cnn(X, Y)

Let's plot the ROC based on the 5-fold cross-validation results for the CNN. This will take 1-3 hours depending on your computer.

In [ ]: # YOUR CODE HERE


The expected plot is displayed below.

We see that the CNN we trained above is not as good as RF (besides taking much longer to train).

Training a deep neural network requires a lot of data points and a lot of engineering of the model parameters and architectures (usually lots of bells and whistles).

There are efficient automated ways to alleviate some of the manual tuning. See this guide: https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/

Play around with various architectures and settings to see whether you can improve it.