Unifying Structure Analysis and Surrogate-driven Function ... · portant role in glaucoma diagnosis...

Unifying Structure Analysis andSurrogate-driven Function Regression for

Glaucoma OCT Image Screening

Xi Wang1, Hao Chen2?, Luyang Luo1, An-ran Ran3, Poemen P. Chan3,Clement C. Tham3, Carol Y. Cheung3, and Pheng-Ann Heng1,4

{xiwang,hchen}@cse.cuhk.edu.hk

1 Department of Computer Science and Engineering,The Chinese University of Hong Kong, Hong Kong, China

2 Imsight Medical Technology, Co., Ltd., China3 Department of Ophthalmology and Visual Sciences,

The Chinese University of Hong Kong, Hong Kong, China4 Guangdong Provincial Key Laboratory of Computer Vision and Virtual Reality

Technology, Shenzhen Institutes of Advanced Technology, Chinese Academy ofSciences, Shenzhen, China

Abstract. Optical Coherence Tomography (OCT) imaging plays an im-portant role in glaucoma diagnosis in clinical practice. Early detectionand timely treatment can prevent glaucoma patients from permanentvision loss. However, only a dearth of automated methods has been de-veloped based on OCT images for glaucoma study. In this paper, wepresent a novel framework to effectively classify glaucoma OCT imagesfrom normal ones. A semi-supervised learning strategy with smoothnessassumption is applied for surrogate assignment of missing function re-gression labels. Besides, the proposed multi-task learning network is ca-pable of exploring the structure and function relationship from the OCTimage and visual field measurement simultaneously, which contributesto classification performance boosting. Essentially, we are the first tounify the structure analysis and function regression for glaucoma screen-ing. It is also worth noting that we build the largest glaucoma OCTimage dataset involving 4877 volumes to develop and evaluate the pro-posed method. Extensive experiments demonstrate that our frameworkoutperforms the baseline methods and two glaucoma experts by a largemargin, achieving 93.2%, 93.2% and 97.8% on accuracy, F1 score andAUC, respectively.

1 Introduction

Glaucoma is the most frequent cause of irreversible blindness worldwide, whichis a heterogeneous group of disease that damages the optic nerve and leads tovision loss [1]. It is featured by loss of retinal ganglion cells, thinning of retinal

? Corresponding author

arX

iv:1

907.

1292

7v1

[cs

.CV

] 2

6 Ju

l 201

9

2 X. Wang et al.

nerve fibre layer (RNFL), and cupping of the optic disc. Early detection and di-agnosis are essential as they can facilitate immediate treatment and prevent theprogression of the disease, especially for early-stage glaucoma patients. On clin-ical grounds, diagnosis of glaucoma usually requires a variety of tests, includingfunctional measurement (e.g., visual field test) and structure assessment (e.g.,optical coherence tomography). In general, visual field (VF) test provides threeimportant global indices: visual field index (VFI), mean deviation (MD) andpattern standard deviation (PSD) to depict the functional changes. Optical co-herence tomography (OCT) is a non-contact and non-invasive imaging modalitythat generates high-resolution, cross-sectional images (B-scans) of the retina. Itcan provide objective and quantitative assessment of various retinal structures.However, the examination of OCT imaging requires highly trained ophthalmol-ogist, which is always partly subjective and time-consuming. Thus, automatedglaucoma OCT image screening tool is of great needs in clinical practice. More-over, evaluation of the relationship between structural and functional damagecan provide valuable insight into how visual function works according to the de-gree of structural damage, which can help our understanding of glaucoma. Thisis quite significant in diagnosing, staging and monitoring glaucomatous patients.

Many researchers have devoted their efforts to solving the glaucoma OCT im-age screening problem and have achieved significant progress. Nevertheless, mostof the previous works were based on machine learning methods and heavily reliedon established features, such as the measurements on retinal nerve fiber layerthickness and ganglion cell layer thickness [2–4]. Differently, Ramzan et al. [5]proposed an approach that first segmented inner limiting membrane and reti-nal pigment epithelium layers in OCT images for cup-to-disc ratio calculation,and then differentiated glaucoma based on the calculated ratio. A pioneeringwork [6] recently proposed a 3D convolutional neural network (CNN) to directlyclassify the downsampled OCT volumes into glaucoma or normal, which consid-erably outperformed various feature-based machine learning algorithms. Withrespect to modelling the structure-function relationship [7, 8], Leite et al. uti-lized locally weighted scatterplot smoothing and regression analysis to evaluateparapapillary RNFL thickness sectors and corresponding topographic standardautomated perimetry locations [8]. However, there are still three main challengesthat have not been fully investigated yet. First, heretofore there is a dearth ofstudies that explores the structure and function relationship based on the rawOCT image and visual field measurement for glaucoma screening. Second, owingto very limited images in current datasets, validation experiments of previousmethods were not comprehensive, which indeed constrained the development ofrobust and reliable approaches. Third, building a large medical dataset is alwaysconfronted with many difficulties, e.g., it is inevitable to acquire incomplete clin-ical records due to some unexpected reasons, which is a common phenomenonin retrospective studies.

In this paper, we carried thorough investigation on all of the aforementionedchallenges. To the best of our knowledge, we are the first to unify structureanalysis and function regression for glaucoma screening based on OCT images.

Structure Analysis and Function Regression for Glaucoma OCT Screening 3

We develop a novel framework that explores the structure and function relation-ship between OCT image and visual field measurement via a semi-supervisedmulti-task learning network. Specifically, a semi-supervised learning method isfirst introduced to find the surrogates to fill the vacancy of missing visual fieldmeasurements. Afterwards, both class labels (glaucoma/normal) and visual fieldmeasurements (ground truth and pseudo labels) are utilized to jointly train amulti-task learning network. In particular, we concatenate the classification fea-tures with those learned from the regression module to identify glaucoma. Itindicates that the structure and function relationship learned by our network in-deed contributes to classification performance improvement. It is worthwhile toemphasize that to our best knowledge, the largest glaucoma OCTimage datasetcomposed of 4877 volumes of the optic disc is constructed in this study foralgorithm development and evaluation.

2 Method

Our main goal is to effectively classify glaucoma OCT images assisted by ex-ploring the structure and function relationship of glaucoma. Fig.1 illustratesthe overview of the proposed method, which consists of two parts. The firstpart uses a semi-supervised learning technique to solve the missing label prob-lem for function regression. In particular, a B-scan-based CNN is trained underfull supervision of class labels, aiming to extract the holistic representation foreach OCT volume. Afterwards, pseudo labels are automatically generated byprobing the nearest neighbour of OCT images without VF measurement amonghomogeneous groups with CNN-encoded features. In the second part, a multi-task learning network is trained to unify structure analysis and surrogate-drivenfunction regression for more accurate glaucoma screening.

2.1 Surrogate-driven Labelling with Semi-supervised Learning

In order to solve the problem of missing VF measurement labels, we borrow thespirit from semi-supervised learning and come up with an appropriate solution.In semi-supervised learning domain, the smoothness assumption points out thatfeatures close to each other are more likely to share the same label. This as-sumption is intimately linked to a definition of what it means for one feature tobe near another feature, which can be embodied in a similarity function S(·, ·)on input space [9].

To project OCT volumes into the input space for similarity calculation, weadopted ResNet18 [10] as the feature representation model. Specifically, thisCNN is first trained based on B-scans under full supervision of class labels.Then we apply the trained model to extract the embedded feature of each B-scanfrom the Global Average Pooling (GAP) layer. Finally following [11], the norm-3

pooling, F = (∑n

i=1 f3i )

13 , is applied to aggregate B-scan features into a holistic

representation F for each OCT volume. Here, f is the feature representation ofB-scan input, and n is the number of B-scans from the same volume.

4 X. Wang et al.

GAP

OCT volume

B-scan deep feature

holistic feature: feature vector of Gl (or Nl ) : feature vector of Gu (or Nu)

Norm-3 Pooling

Feature Extraction Module

MD

PSD

VFI

Normal

Glaucoma

Conv x 2

Conv x 2

GAP FC

FC

(a) (b)Classification Module

Regression Module

(c)

B-scan

Different colors denote different VF values.

concatenation

Conv: 3 x 3 convolutional layer GAP: global average pooling FC: fully connected layer: feature map concatenation

Shared weights

ResNet

ResNet

ResNet

200

1024

200

1024200

visual field measurement

class label

2.7

7.4

5.1

similarity calculation surrogate discovery label assignment

Fig. 1: An overview of the proposed method. (a) Initially, we train a CNN modelwith class labels (glaucoma/normal) to extract features of B-scans. Then theextracted features are aggregated to form a global representation of an OCTvolume. (b) Next, we compute similarities between homogeneous groups to findthe nearest neighbour of OCT images without VF measurement and enable itsVF values for surrogate assignment. Here, Gl and Gu indicate glaucoma imageswith and without VF labels while N l and Nu denote normal images with andwithout VF labels. (c) Lastly, class labels and VF labels (ground truth andpseudo labels) are unified to train a multi-task learning network.

Based on the availability of class labels and the VF measurements, we groupthe training set into four clusters: (1) Gl: glaucoma images with VF measure-ment; (2) Gu: glaucoma images without VF measurement; (3) N l: normal imageswith VF measurement and (4) Nu: normal images without VF measurement.Through the feature representation model, each OCT volume is represented bya feature vector, F ∈ R512. The similarity between any pair of homogeneousOCT images, e.g., F l ∈ Gl and Fu ∈ Gu, is measured by Euclidean distance:

S(F l,Fu) = [

m∑i=1

(F li −Fu

i )2]12 (1)

Inspired by the semi-supervised smoothness assumption, for each OCT imagewithout VF measurement Fu

j ∈ Gu (or Fuj ∈ Nu), we find its nearest neighbour

in the group of the same class that has VF measurement F li∗ ∈ Gl (or F l

i∗ ∈ N l),where i∗ = arg mini S(F l

i ,Fuj ), and appoint the VF measurement of F l

i∗ toFu

j . Consequently, we successfully find suitable surrogates for all missing VFmeasurements in this semi-supervised learning fashion.

2.2 Multi-task Learning for Structure and Function Analysis

We proposed an end-to-end multi-task learning CNN with a primary task to clas-sify B-scan images into glaucomatous and normal, and an auxiliary task to inves-


tigate the relationship between structural and functional changes of glaucomaeyes. Particularly, this network is composed of three components, the sharedfeature extraction module, the classification module and the regression module.As illustrated in Fig.1(c), we employ ResNet18, initialized with ImageNet pre-trained weights, as the backbone in the feature extraction module whose weightsare shared by both classification and regression tasks. The rest of the networkis composed of two branches, one for glaucoma discrimination and the other forvisual field measurement regression. These tasks are trained jointly.

Visual Field Measurement Regression. For each attribute of visual fieldmeasurement, i.e., VFI, MD and PSD, we formulate it as an individual regressiontask. Within the regression module, two convolutional layers with ReLU acti-vation and batch-normalization are inserted before GAP. Fully connected layerswith sigmoid activation are then used to regress these attributes. The regressiontasks are driven by minimizing the mean square error:

Lreg =1

N

N∑i=1

‖yri − yri (xi; θs, θreg)‖22 (2)

where N denotes the number of training samples, xi is the input of three adjacentB-scan images, yri is the clinical measurement of visual field attribute and yri isthe corresponding prediction of the network. θs denotes shared weights in thefeature extraction module and θreg denotes features in the regression module,respectively. In this paper, we use superscripts r and c for discrimination betweenregression and classification tasks.

Glaucoma Classification. The classification module performs the primarytask for glaucoma screening. In routine clinical practice, visual field measurementis an important indicator of functional change for glaucoma diagnosis. Hence, itis reasonable to hypothesize that if the relationship between structure and func-tion is appropriately discovered, the learned features in the regression modulecould exert a positive influence on the classification task. Based on this assump-tion, we concatenate the attribute regression feature maps with those originatedfrom the feature extraction module, supposing that the features learned in theregression module could provide the classifier with helpful guidance. Here, thetypical binary cross-entropy loss is utilized to train this classifier:

Lcls = − 1

N

N∑i=1

yci log yci (xi; θs, θcls) (3)

where yci is the ground truth label while yci is the likelihood predicted by theclassifier. Similarly, θcls stands for weights in the classification module.

Joint Training of Multi-task Learning Network. Finally, the multi-tasklearning network is trained by minimizing the weighted combination of the meansquare error losses and the binary cross-entropy loss.

6 X. Wang et al.

Ltotal = Lcls +

3∑j=1

αjLjreg (4)

where αj is the hyper-parameter balancing Lcls and Ljreg determined by cross

validation. At the inference stage, we take the average of multiple B-scan-levelprobabilities as a single volume-level prediction.

3 Experiments and Results

Dataset: In this study, we constructed the largest scale cohort for evaluation.This dataset consists of 4877 volumetric OCT images (glaucoma: 2926; normal:1951) from 930 subjects. It is worth noting that each investigated subject (oneeye or both two eyes involved) might have several follow-ups, and also severalOCT images may be taken during each follow-up, which eventually results in3182 eye visits (glaucoma: 1901; normal: 1281) in total. Specially, we denote theeye-visit result as case-level result. A part of VF measurements for some follow-ups are unavailable in this study. Two glaucoma specialists worked individuallyto label all the OCT images into glaucoma and normal, taking VF tests andother clinical records as reference. A senior glaucoma expert was consulted incase of disagreement. Subsets of 2895, 1015 and 967 images are randomly se-lected for training, validation and testing, respectively. The random samplingis at patient level so as to prevent leakage and biased estimation of the testingperformance. According to the accessibility of VF measurements in the trainingset, we re-configure the training set as follows: (i). Part : 1979 images whose VFmeasurements exist. (ii). All : all 2895 images.Quantitative Evaluation and Comparison: The performance is measuredvia three criteria: classification accuracy, F1 score and Area Under ROC Curve(AUC). To obtain the case-level prediction, averaging method is used to aggre-gate the results of several images during each eye visit to a single one.

At present, there is only a dearth of studies on glaucoma OCT image clas-sification. Hence, several baseline methods, including an existing method aswell as three variants of the proposed method, were implemented for compari-son: (i). 3D-CNN: the implementation of the approach proposed in [6] which istrained with downsampled 3D volumes. (ii). 3D-ResNet: A 3D implementationof ResNet18 [10] that takes raw volumes as input. (iii). 2D-ResNet: ResNet18trained with B-scan images from OCT volumes. (iv). 2D-ResNet-MT: 2D-ResNetwith Multi-Task learning network as shown in Fig.1(c) without surrogate la-bel assignment. Specifically, if the training sample is lack of visual field mea-surement, the regression loss is ignored. (v). 2D-ResNet-SEMT: The proposedSEmi-supervised Multi-Task learning network with surrogate label assignment.

The experimental results are listed in Table 1. At first, trained with the sameset All, 2D-ResNet is superior to 3D-ResNet. There are two possible reasons. Oneis that training a 3D network with such high-dimensional input is extremelydifficult. In fact, validation loss oscillates wildly during training, which makes


Table 1: Comparison with different methods and expert performance.

Data MethodsAccuracy F1 Score AUC

image-level case-level image-level case-level image-level case-level

Part2D-ResNet 0.823 0.829 0.835 0.830 0.960 0.9552D-ResNet-MT 0.878 0.882 0.890 0.887 0.971 0.968

All

3D-CNN [6] 0.884 0.889 0.881 0.890 0.962 0.9593D-ResNet 0.880 0.875 0.878 0.874 0.958 0.9562D-ResNet 0.908 0.911 0.904 0.909 0.968 0.9642D-ResNet-MT 0.915 0.912 0.923 0.917 0.975 0.9712D-ResNet-SEMT 0.932 0.924 0.932 0.924 0.978 0.973

Expert 1 0.912 0.912 0.917 0.917 0.918 0.918Expert 2 0.905 0.905 0.913 0.913 0.914 0.914

it hard for selecting models. The other is that there is a deficiency of 3D pre-trained model available, thus training from scratch could readily result in localoptima. By exploring the structure and function relationship in 2D-ResNet-MT,the classification performance is improved, which verifies our hypothesis that theextra information from the regression module is helpful. Inspiringly, the proposedframework achieves the best results among the aforementioned methods on allmetrics. With surrogate label assignment for VF measurement, the classificationmodule can receive more reliable information from the regression branch.

When compared to the pioneering work 3D-CNN [6], our method outperformsit by a large margin, with 4.8%, 5.1%, and 1.6% performance improvement onaccuracy, F1 score and AUC at image level. By and large, 3D-CNN has twodrawbacks. First, the input volumes are compressed intensively, which may leadto discriminative information loss. Second, its capability is quite limited due tothe shallow network structure. To further explore the efficacy of our method, weinvited another two glaucoma experts to identify glaucoma based on the printoutof the OCT images in the format that ophthalmologists usually read in clinic.They reviewed the printouts individually masked from any other clinical notes tomake the decision, either glaucoma or normal, for each testing image. Apparently,the proposed method exceeds expert performance significantly, particularly onAUC.Qualitative Evaluation Class Activation Maps (CAMs) [12] were computedto visualize the discriminative regions that played a vital role in class prediction.For each pair in Fig.2, the upper one is the input, and the lower one is the cor-responding CAM overlapped with the input. Clearly, the discriminative regionsfound by CNN for glaucoma and normal images are quite different. In normal B-scan images, there is barely any response within retinal areas while such regionscontrarily have high responses in glaucoma images. It is corresponding with theclinical diagnosis of glaucoma.

4 Conclusion

In this study, we present a deep learning framework to screen glaucoma basedon volumetric OCT images of the optic disc. We first use a semi-supervised

8 X. Wang et al.

Norm

alGl

auco

ma

Fig. 2: Visualization of discriminative regions. The first two rows show the pairsof normal B-scan and corresponding CAM. The last two rows show the pairs ofglaucomatous ones. (Best viewed in color)

learning method to address the problem of incomplete visual field measurementlabels of training images. Then a multi-task learning network is built to explorethe relationship between functional and structural changes of glaucoma, whichis verified beneficial to classification improvement. Extensive experiments on ourlarge-scale dataset manifested the effectiveness of the proposed approach, whichoutperforms the baseline methods by a large margin. In addition, the comparisonwith glaucoma specialists provides strong evidence that our proposed frameworkhas promising potential for automated glaucoma screening in the near future.

Acknowledgements This project is supported in part by the National BasicProgram of China 973 Program under Grant 2015CB351706, grants from theNational Natural Science Foundation of China with Project No. U1613219, Re-search Grants Council - General Research Fund, Hong Kong (Ref: 14102418) andShenzhen Science and Technology Program (No. JCYJ20180507182410327).

References

1. J. B. Jonas, T. Aung, R. R. Bourne, A. M. Bron, R. Ritch, and S. Panda-Jonas,“Glaucoma,” The Lancet, vol. 390, pp. 2183–2193, 2017.

2. M.-L. Huang and H.-Y. Chen, “Development and comparison of automated clas-sifiers for glaucoma diagnosis using stratus optical coherence tomography,” Inves-tigative ophthalmology & visual science, vol. 46, no. 11, pp. 4121–4129, 2005.

3. H. J. Kim, S.-Y. Lee, K. H. Park, D. M. Kim, and J. W. Jeoung, “Glaucomadiagnostic ability of layer-by-layer segmented ganglion cell complex by spectral-domain optical coherence tomography,” Investigative ophthalmology & visual sci-ence, vol. 57, no. 11, pp. 4799–4805, 2016.

4. M. Christopher, A. Belghith, R. N. Weinreb, C. Bowd, M. H. Goldbaum, L. J.Saunders, F. A. Medeiros, and L. M. Zangwill, “Retinal nerve fiber layer fea-tures identified by unsupervised machine learning on optical coherence tomographyscans predict glaucoma progression,” Investigative ophthalmology & visual science,vol. 59, no. 7, pp. 2748–2756, 2018.


5. A. Ramzan, M. U. Akram, A. Shaukat, S. G. Khawaja, U. U. Yasin, and W. H.Butt, “Automated glaucoma detection using retinal layers segmentation and opticcup-to-disc ratio in optical coherence tomography images,” IET Image Processing,2018.

6. S. Maetschke, B. Antony, H. Ishikawa, and R. Garvani, “A feature agnostic ap-proach for glaucoma detection in oct volumes,” arXiv preprint arXiv:1807.04855,2018.

7. T. A. El Beltagi, C. Bowd, C. Boden, P. Amini, P. A. Sample, L. M. Zangwill, andR. N. Weinreb, “Retinal nerve fiber layer thickness measured with optical coherencetomography is related to visual function in glaucomatous eyes,” Ophthalmology,vol. 110, no. 11, pp. 2185–2191, 2003.

8. M. T. Leite, L. M. Zangwill, R. N. Weinreb, H. L. Rao, L. M. Alencar, and F. A.Medeiros, “Structure-function relationships using the cirrus spectral domain op-tical coherence tomograph and standard automated perimetry,” Journal of glau-coma, vol. 21, no. 1, p. 49, 2012.

9. O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning (chapelle, o. etal., eds.; 2006)[book reviews],” IEEE Transactions on Neural Networks, vol. 20,no. 3, pp. 542–542, 2009.

10. K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recog-nition,” in Proceedings of the IEEE conference on computer vision and patternrecognition, 2016, pp. 770–778.

11. X. Wang, H. Chen, C. Gan, H. Lin, Q. Dou, Q. Huang, M. Cai, and P.-A. Heng,“Weakly supervised learning for whole slide lung cancer image classification,” Med-ical Imaging with Deep Learning, 2018.

12. B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deepfeatures for discriminative localization,” in Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition, 2016, pp. 2921–2929.

Unifying Structure Analysis and Surrogate-driven Function ... · portant role in glaucoma diagnosis...

Documents

Transcript of Unifying Structure Analysis and Surrogate-driven Function ... · portant role in glaucoma diagnosis...