Build and evaluate an interactive dashboard to visualize lung

adenocarcinoma data

A study submitted in partial fulfilment

of the requirements for the degree of

MSc Data Science

at

THE UNIVERSITY OF SHEFFIELD

by

Hailing Lu

September 2016

Build and evaluate an interactive dashboard to

visualise lung adenocarcinoma data

Abstract

Background

A substantial number of web-based portals have been built to offer various types of visualisations and analyses for exploring genomics datasets via interactive dashboards. These visualisations support not only scientists with cancer-related expertise, but also those without a professional computational or bioinformatics background. However, few studies have evaluated whether these visualisations are useful enough for cancer research scientists, or how helpful they are.

Aim

This research aims to create an interactive dashboard for visualising lung adenocarcinoma data and then develop measurable approaches for evaluating its effectiveness in order to identify potential problems of existing visualisations for future dashboard development.

Methods

The dashboard was created by visualising genomics datasets from The Cancer Genome Atlas (TCGA). Code packages provided by TCGA were used and run in RStudio, and the dashboard was published on the RShiny application platform. The dashboard was then assessed through an evaluation form completed by 17 participants from the departments of Oncology & Metabolism and Biomedical Science and from the MSc Data Science programme at the University of Sheffield.

Results

Academic research (or working) experience, domain knowledge and familiarity with visualisation each affected how participants interpreted the visualisations and the dashboard. Participants with data science expertise were better at locating answers, while those with oncological or biomedical knowledge tended to think beyond searching for answers and to consider how the dashboard's functions could be developed for further research use. Annotations and help notes also proved important for better comprehension of the dashboard.

Conclusions

When designing a dashboard for visualising genomics datasets, design principles should

focus more on meeting the requirements of users with different domain knowledge. This

study could be improved by using larger datasets and evaluation samples, a wider range of analysis and

visualisation methods, and more multidimensional indicators and approaches for evaluation.

Acknowledgement

I would like to gratefully acknowledge various people who have offered support and

inspired me when I worked on this dissertation. First, I would like to thank all the lecturers

who made contributions to the courses of MSc Data Science programme, especially my

supervisor, Dr Gianluca Demartini and the programme coordinator, Professor Paul Clough. I

have learned a lot during this year. Second, I would like to thank my dear parents and sister,

friends and classmates who have encouraged me when I felt frustrated. Third, I would like

to thank Phil Chapman and Ketaki B Patil, who provided useful suggestions for my

dashboard development. Fourth, I would like to thank all the students, academic staff and

professors who assessed the dashboard for me. Finally, I would like to thank Marcin Kosiński,

who designed the useful RTCGA packages for visualisations and patiently answered my

questions on his blog.

Contents

Abstract

Acknowledgement

List of Figures

List of Tables

1.0 Introduction and Context

2.0 Research aim and Objectives

2.1 Aim

2.2 Objectives

3.0 Literature review

3.1 Heat maps

3.2 Networks

3.3 Survival Plots

3.4 Genomic Coordinates

3.5 Scatter Plots and Box Plots

3.6 Bar Charts

3.7 Evaluation

3.7.1 Case Studies

3.7.2 Evaluation Methods

3.8 Techniques and Data

4.0 Methodology

4.1 Research plan summary

4.2 Dataset Description

4.2.1 Term Explanation

4.2.2 General Description

4.2.3 Sub-dataset Description

4.3 Dashboard Development

4.3.1 Platform and Packages

4.3.2 Visualisation Creating

4.3.3 Dashboard Publishing and Updating

4.3.4 User Interface Designing

4.4 Design Evaluation and Questionnaire

4.4.1 Background Information Section

4.4.2 Evaluation Section & Dashboard usage impression Section

4.4.3 Select participants and Conduct assessment

4.5 Research Ethics

4.5.1 Questionnaire and Participants

4.5.2 Datasets and Data storage

5.0 Results

5.1 Participant Background Information

5.2 Dashboard description and Evaluation results

5.2.1 Survival Plot Section 1: “Different cancer types”

5.2.2 Survival Plot Section 2: “Comparison in different genes”

5.2.3 Box Plot Section

5.2.4 Heat map Section

5.2.5 The whole impression of this dashboard for participants

5.2.6 Further validation

6.0 Discussion and Dashboard design recommendations

6.1 Participants

6.2 Academic research (working) experience impacts

6.3 Domain knowledge familiarity impacts

6.4 Dashboard Comprehension

6.5 Dashboard design recommendations

7.0 Conclusions and Recommendations

7.1 Conclusions

7.2 Recommendations for future research

7.2.1 For data samples and participants

7.2.2 For the interactive dashboard

7.2.3 For the evaluation methods

8.0 References

9.0 Appendix

9.1 Mutated genes in lung adenocarcinoma

9.2 Template of the email for inviting participants for dashboard evaluation

9.3 Research Ethics

9.3.1 Ethics Application Form

9.3.2 Ethics Approval Letter

9.3.3 Consent Form

9.4 Questionnaire

9.5 Details of participants’ current profession

List of Figures

Figure 4.1 Case distributions of LUAD and LUSC in Lung Cancer Project of TCGA

Figure 4.2 Demographic distributions of cases within LUAD

Figure 4.3 Portal interface of TCGA (GDC) Data Portal

Figure 4.4 Portal interface of cBioPortal

Figure 4.5 Interface of the dashboard built in this study

Figure 4.6 Question examples of visualisation creating experience

Figure 4.7 Question examples for case study 1



Figure 4.10 Question examples for evaluating the graph itself

Figure 4.11 Question examples for evaluating the dashboard utility

Figure 5.1 Distribution of age, education level, profession year experience

Figure 5.2 Frequency of creating data visualisations

Figure 5.3 Preference of using visualisations and text

Figure 5.4 First page of the dashboard

Figure 5.5 Help note agreement

Figure 5.6 “Different cancer types” section: Two cancer type comparison (left) and three cancer type comparison

Figure 5.7 Answer locating distribution of “Three cancer type comparison”

Figure 5.8 Easy level agreements of graphs in the “Different cancer types” section

Figure 5.9 Agreement of too much information for graphs in the “Different cancer types” section

Figure 5.10 Possible elements that could cause erroneous answers of the “Different cancer types” section

Figure 5.11 Agreement of comparison among different cancers

Figure 5.12 Insight gaining agreement of the graphs in the “Different cancer types” section

Figure 5.13 “Comparison in different genes” section: MET gene (left) and TP53 gene

Figure 5.14 Agreement of comparison between mutation type and wild type

Figure 5.15 Answer locating distribution of “TP53” survival plot

Figure 5.16 Easy level agreements of graphs in the “Comparison in different genes” section

Figure 5.17 Agreement of too much information for graphs in the “Comparison in different genes” section

Figure 5.18 Insight gaining agreement of the graphs in the “Comparison in different genes” section

Figure 5.19 Possible elements that could cause erroneous answers of the “Comparison in different genes” section

Figure 5.20 Box Plot Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA

Figure 5.21 Answer locating distribution of LUAD & LUSC & BRCA & ACC box plot

Figure 5.22 Easy level agreements of graphs in the Box Plot section

Figure 5.23 Utility agreements of graphs in the Box Plot section

Figure 5.24 Insight gaining agreement of the graphs in the Box Plot section

Figure 5.25 Possible elements that could cause erroneous answers of the Box Plot section

Figure 5.26 Heat map Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA

Figure 5.27 Utility agreements of graphs in the Heat map section

Figure 5.28 Insight gaining agreement of the graphs in the Heat map section

Figure 5.30 Answer locating distribution of LUAD & LUSC & BRCA & BLCA heat map

Figure 5.31 Easy level agreements of graphs in the Heat map section

Figure 5.32 Agreement of too much information for graphs in the Heat map section

Figure 5.33 Possible elements that could cause erroneous answers of the Heat map section

Figure 5.34 Familiarity of the similar dashboard

Figure 5.35 Willingness of using the dashboard in the future works (or study)

Figure 5.36 Interpretation measures of different visualisation types

Figure 5.37 Interpretation measures of participants with different profession experience

Figure 5.38 Interpretation measures of participants with different levels of visualisation familiarity

Figure 5.39 Interpretation measures of participants with different domain knowledge

Figure 5.40 Willingness of participants with different domain knowledge

Figure 5.41 Willingness of participants with different frequencies of creating visualisations

List of Tables

Table 3.1 Main visualisations in the selected portals

Table 4.1 Term Dictionary

Table 4.2 Summary of datasets building process

Table 4.3 Summary of main packages used for building the dashboard

1.0 Introduction and Context

Changes to the DNA letters of normal cells can give rise to cancer cells, and the cancer genome is

made up of those changed genetic letters in chromosomes (The Cancer Genome Atlas,

2016). Due to the multidimensional characteristics of genomics data, large-scale cancer

genomic projects, such as the International Cancer Genome Consortium (ICGC), Cancer Cell

Line Encyclopaedia (CCLE), and The Cancer Genome Atlas (TCGA), have been set up to gather

and study cancer genome data. These data can be analysed and interpreted

thanks to improvements in genomics technologies (Chin, Hahn, Getz & Meyerson, 2011).

These computational and statistical advances in analysing cancer genomics data drive the

evolution of visualisation techniques and tools for this type of data.

A number of web-based portals have been built to visualise genomics data for those cancer

genomic projects (Klonowska et al., 2015). These portals offer various types of visualisations

by representing them via interactive visual dashboards. It is stated that by visualising and

exploring the genomics data and its clinical information, cancer researchers can gain a

better understanding of the mechanisms and character of specific genes or cancer types

(Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013). Moreover, Schroeder, Gonzalez-Perez and

Lopez-Bigas (2013) also emphasise that these visualisations help the process of discovering

potential cancer drug targets.

These visualisations support not only cancer scientists from different cancer-related fields,

but also those without a professional computational or bioinformatics background

(Klonowska et al., 2015). However, it is argued that some bioinformatics and programming

knowledge is still necessary to utilise these visualisations efficiently (Schroeder,

Gonzalez-Perez & Lopez-Bigas, 2013). Although some portals have made great improvements in

providing helpful analyses and visualisations, few studies have evaluated whether these

visualisations are useful enough for cancer research scientists, or how helpful they are. This

means that it remains unclear whether all the existing visualisations on those portals are useful or effective for

cancer researchers. The knowledge barrier between cancer scientists and experts with a

computational and bioinformatics background is not easy to reduce. Hence, this

research tries to reduce this barrier by creating and then assessing an interactive

dashboard for visualising cancer genome data and its clinical data. Specifically, after

comparing existing visualisations, potential visualisations will be created before the

assessment.

In order to reduce bias for comparison and evaluation, this research will implement more

than one visualisation based on a single cancer type, lung adenocarcinoma. Before

implementing potential visualisations, several oncological research questions based on lung

adenocarcinoma will be designed as the case studies. Then visualisations will be created to

support these case studies. For evaluation, a questionnaire will be designed and then

completed by both research staff and students from the department of Oncology &

Metabolism and Biomedical Science and students from the MSc Data Science programme at the

University of Sheffield. This evaluation is designed to find out which aspects are likely

to influence interactive dashboard usage, in order to provide suggestions for the

development of online genomics research portals.

2.0 Research aim and Objectives

2.1 Aim

This research aims to create an interactive dashboard for visualising lung adenocarcinoma

data and then develop measurable approaches for evaluating its effectiveness in order to

identify potential problems of existing visualisations for future dashboard development.

2.2 Objectives

- Explore different visualisations for visualising genomics datasets

- Build an interactive dashboard by integrating those selected visualisations

- Assess the effectiveness of different visualisations based on specific cancer research case studies

- Identify probable issues of the interactive dashboard from the evaluation results

- Provide suggestions for this type of dashboard development

3.0 Literature review

Lung cancer can be categorised into two main types, small cell lung cancer (SCLC) and

non-small cell lung cancer (NSCLC), which account for around 12% and 87% of cases, respectively

(Cancer Research UK, 2016). A 2015 report found lung cancer to be

the leading cause of cancer-associated mortality in the USA (Siegel, Miller & Jemal, 2015). Lung

adenocarcinoma is one of the frequently diagnosed subtypes of NSCLC (Travis, 2002).

Hence, lung adenocarcinoma is chosen as the research target for this dashboard building and

evaluation, with the aim of supporting further research.

To better interpret multidimensional oncogenomics data, a number of web-based and

user-friendly portals have been built to support scientists from various cancer-related fields

(Klonowska et al., 2015). Most of these portals offer a variety of visualisation tools for

oncogenomics analyses, which include mutations, gene/protein expression, copy number

variation (CNV), and survival analyses (Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013; Gao

et al., 2013). To visualise these genetic events, different types of graphical representations

are used, including heat maps, networks, survival plots, scatter plots, genomic coordinates

and volcano plots (Klonowska et al., 2015; Gao et al., 2013; Schroeder, Gonzalez-Perez &

Lopez-Bigas, 2013). The sections below review each commonly

used visualisation, elaborating on its principles and utility in cancer and oncology research.

The main visualisation types of different cancer research portals are summarised in Table 3.1,

based on their versions from 2016.

Table 3.1 Main visualisations in the selected portals

Visual Types | Portals (Links)

Heat maps | cBioPortal (http://www.cbioportal.org/); Tumorscape (http://www.broadinstitute.org/); UCSC Genome Bioinformatics (https://genome.ucsc.edu/); COSMIC (http://cancer.sanger.ac.uk/cosmic); Oncomine (https://www.oncomine.org/resource/login.html)

Networks | cBioPortal; Oncomine

Survival Plots | cBioPortal; UCSC Genome Bioinformatics; PPISURV (http://www.bioprofiling.de/GEO/PPISURV/ppisurvD.html); Oncomine

Genomic coordinates | cBioPortal; Tumorscape; UCSC

Scatter Plots | cBioPortal; Oasis (http://www.oasis-genomics.org/); Oncomine

Box Plots | cBioPortal; Oasis; Oncomine

Bar Plots | COSMIC; Oasis; IntOGen (https://www.intogen.org/search); ICGC (https://dcc.icgc.org/)

Volcano Plots | Oasis

Tree maps | cBioPortal

3.1 Heat maps

The clustered heat map, described as the most popular visual display of genomics

data (Rajaram & Oono, 2010; Wilkinson & Friendly, 2012), can “compact large amounts of

information into a small space to bring out coherent patterns in the data” (Weinstein, 2008).

By representing genomics data in a matrix, the columns and rows of a heat map can be

clustered hierarchically without being limited by the order of the data (Schroeder, Gonzalez-Perez &

Lopez-Bigas, 2013). The patches of colour in a heat map reveal different levels of

value, measuring the relationship between the vertical axis (e.g. genes) and the horizontal axis (e.g.

tumour samples) (Weinstein, 2008). In other words, this makes it possible to group and

compare genomes in distant loci (Schroeder, Gonzalez-Perez & Lopez-Bigas, 2013).
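As an illustration of the hierarchical clustering that underlies a clustered heat map, the following minimal Python sketch orders the rows of a small expression matrix using average-linkage agglomerative clustering. The function name cluster_order and the toy matrix are hypothetical and are not taken from any of the portals or packages cited here; real tools would normally rely on optimised library routines.

```python
from itertools import combinations
from math import dist


def cluster_order(rows):
    """Return a row order in which similar rows sit next to each other.

    Uses average-linkage agglomerative clustering with Euclidean distance.
    """
    # Start with every row in its own cluster.
    clusters = [[i] for i in range(len(rows))]

    def avg_distance(a, b):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist(rows[i], rows[j]) for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the two closest clusters and merge them.
        x, y = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: avg_distance(clusters[p[0]], clusters[p[1]]),
        )
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)] + [merged]
    return clusters[0]


# Hypothetical expression values: rows are genes, columns are tumour samples.
expression = [[1.0, 1.0], [10.0, 10.0], [1.1, 0.9], [9.8, 10.2]]
print(cluster_order(expression))  # prints [0, 2, 1, 3]
```

Reordering the rows (and, in the same way, the columns) before colouring the matrix is what lets coherent blocks of similar values become visible.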


Most of the portals selected in Table 3.1 use heat maps for similar

purposes, for instance to represent various types of alterations in genes as well as somatic

mutations across tumour types (Schroeder, Gonzalez-Perez & Lopez-Bigas,

2013; Stephens et al., 2009). Additionally, clustering the expression data or copy number

variations can identify cancer subtypes or reveal subtype interactions (Vaske et al., 2010;

Czubak et al., 2015). Furthermore, molecular profile features can also be visualised with

heat maps (Cancer Genome Atlas Research Network, 2014).

In some portals, the circle map, another type of heat map, is used to demonstrate relationships

among genes in a circular plot rather than a rectangular one. The summary of copy number

variations of lung adenocarcinoma can be shown in circular plots (Klonowska et al., 2015). In

addition, each circle map can represent one gene, and the circle maps representing

different genes together make up a network layout showing connections among them (Schroeder,

Gonzalez-Perez & Lopez-Bigas, 2013).

However, a prominent limitation of the heat map still needs to be considered. Nonlinear

relationships among genes are difficult to extract, which may result in

misunderstandings (Weinstein, 2008). Some portals try to combine heat maps with other graphs to

display the hidden structural relationships among genes, for example by using circle maps as

nodes to create a network diagram or by constructing pathway diagrams to derive functional

relationships among genes (Vaske et al., 2010).

3.2 Networks

Network diagrams present possible connections between targets (nodes) without being restricted by

the order of the data. Unlike heat maps, network diagrams can demonstrate functional

relationships among different entities, such as genes and protein

expressions (Cline et al., 2007). In gene networks, the neighbour genes are ranked by

genomic alteration frequency before being represented, which reduces the complexity of

multidimensional genomics data (Gao et al., 2013). In addition, by using networks, potential

cancer drug targets and cancer drivers can be discovered as well (Schroeder,

Gonzalez-Perez & Lopez-Bigas, 2013).

3.3 Survival Plots

Kaplan-Meier analyses can be visualised as survival plots if the clinical data are available

(Klonowska et al., 2015). Survival plots can be used to compare the impacts of different entities,

such as gene mutations, copy number and expression of genes. Vaske et al. (2010)

combined survival plots with a heat map to identify which of the clusters obtained from the

heat map have the most influence on patient survival. Survival comparisons

can be made within a single plot to gain more insight into different oncogene-driven lung

cancers (Arcila et al., 2015). The overall survival outcome of a specific gene in breast cancer


is compared between mutated cases and wild-type cases in order to learn more about the

impact of that gene on breast cancer (Ciriello, Cerami, Sander & Schultz, 2012).

In addition, survival plots and boxplots can be combined to answer the same

question. Czubak et al. (2015) compared the effects of the copy number and expression of

different genes by using survival analysis and revealed the results of correlation analysis of

copy number types and expression level via boxplots. Gao et al. (2013) queried gene

mutations in cancer and demonstrated the comparisons between patients with and without

such mutations in survival plots.
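The Kaplan-Meier estimate behind such survival plots multiplies, at each observed event time, the fraction of at-risk patients who survive that time. The sketch below is a minimal Python illustration of this product-limit calculation. The function name kaplan_meier and the toy follow-up data are hypothetical, and the code is not the R implementation used by any of the cited portals.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] for each time with an observed event.

    times  -- follow-up time for each patient
    events -- True if the event (death) was observed, False if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += 1 if data[i][1] else 0
            leaving += 1
            i += 1
        if deaths:
            # Product-limit step: multiply by the fraction surviving time t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both deaths and censored patients leave the risk set.
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (months); the second patient is censored.
print(kaplan_meier([1, 2, 3, 4], [True, False, True, True]))
```

Computing one such curve per group (for example, mutated versus wild-type cases) and drawing the curves on shared axes gives the kind of comparison described above.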

3.4 Genomic Coordinates

The genomic coordinate plot, which shows alterations at their genomic positions, is

another common visualisation for oncogenomics data (Schroeder, Gonzalez-Perez & Lopez-Bigas,

2013). It is stated that the genomic coordinate is suitable for showing the loci and

frequency of all alterations in a single graph, making the best of horizontal space (Gao et al.,

2013; Yin, Cook & Lawrence, 2012). In cBioPortal, the genomic coordinate is used as one of

the summary graphs to show where alterations are located within a specific gene (Klonowska et

al., 2015). Cancer Genome Atlas Research Network (2014) combined chromosome

coordinates and pie charts to identify potential candidate driver genes. Similarly, based on a

novel cancer gene, somatic mutations of non-small cell lung carcinoma (NSCLC) were

visualised in a genomic coordinate to show its gene expression (Yin et al., 2014). In addition,

OASIS, an open-access cancer research portal, combines a genomic coordinate with a table to

display tumour samples with somatic mutations and copy number alteration (CNA) data

(Fernandez-Banet et al., 2016). On Tumorscape, a cancer program resource gateway, the

genomic coordinate is used to visualise chromosomal regions and is then combined with a

heat map to show genomic relationships based on their specific location (Gao et al., 2013).

3.5 Scatter Plots and Box Plots

By comparing the boxplots of different cancer project portals, it can be found that the

boxplot is commonly used to present the basic distribution of gene expressions classified by

different entities. For instance, the expression of a specific cancer gene can

be grouped by normal and tumour samples or by different cancer types (Wu et al., 2015;

Fernandez-Banet et al., 2016; Czubak et al., 2015; Yin et al., 2014). Correlation analysis

between copy number alterations and gene expression of a specific cancer type or gene can

be visualised by box plots (Czubak et al., 2015; Klonowska et al., 2015; Gao et al., 2013). The

box-and-whisker plots demonstrate and compare the distribution of mRNA expression for

each subtype (Jordi et al., 2012).
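The distribution that a box plot summarises for each group can be reduced to a few numbers: the quartiles, the whisker ends and any outliers. The following Python sketch computes them using the common 1.5 x IQR whisker rule, which is an assumption here since the cited portals do not state their rule. The function name box_stats and the expression values are hypothetical.

```python
from statistics import quantiles


def box_stats(values):
    """Return the quartiles, whisker ends and outliers drawn in a box plot."""
    data = sorted(values)
    q1, median, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme values that lie inside the fences.
    inside = [v for v in data if low_fence <= v <= high_fence]
    return {
        "whisker_low": inside[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": inside[-1],
        "outliers": [v for v in data if v < low_fence or v > high_fence],
    }


# Hypothetical expression values for one gene, grouped by cancer type.
groups = {"LUAD": [1, 2, 3, 4, 5, 100], "LUSC": [2, 4, 6, 8, 10]}
for name, values in groups.items():
    print(name, box_stats(values))
```

Placing the summaries for several cancer types side by side is what allows the expression of one gene to be compared across groups at a glance.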

Scatter plots are useful for displaying, for instance, discrete genetic events and correlations between genes, gene

expressions or copy number alterations (Gao et al., 2013). Both Suh et al. (2014)


and Yin et al. (2014) presented the correlations between mRNA expressions by scatter plots

which were also used by Klonowska et al. (2015) to show the correlations between copy

number alterations and mRNA expression. Gene expressions and copy number ratios can be

the entities in scatter plots (Fernandez-Banet et al., 2016). Moreover, a combination plot

with scatter plots and boxplots is created to visualise both correlations and distribution of

gene expressions (Jordi et al., 2012).
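The correlation that such a scatter plot conveys visually is often reported alongside it as a coefficient. The sketch below computes the Pearson correlation coefficient in plain Python; the choice of Pearson (rather than, say, Spearman) is an assumption, and the function name and the copy number and expression values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sum of products of deviations from the means (unnormalised covariance).
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Hypothetical copy number values and mRNA expression for the same samples.
copy_number = [1.0, 2.0, 3.0, 4.0]
expression = [2.1, 3.9, 6.2, 7.8]
print(round(pearson(copy_number, expression), 3))
```

A value close to 1 or -1 corresponds to points lying near a straight line in the scatter plot, while a value near 0 corresponds to a diffuse cloud.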

3.6 Bar Charts

Given the features of oncogenomics data and clinical information, bar plots are suited to

categorising genetic data (e.g. gene expression or copy number alterations) based upon

specific cancer questions. For example, bar plots are able to compare gene expression

patterns classified by different categories (e.g. male and female, normal and tumour

samples) (Fernandez-Banet et al., 2016; Suh et al., 2014). Cancer Genome Atlas Research

Network (2014) used bar plot to classify mutation patterns across different samples in order

to identify new candidate driver genes. Additionally, it is useful to indicate alterations such

as mutations, deletion and amplification of a specific gene across different cancer types by

bar plots (Fernandez-Banet et al., 2016; Gao et al., 2013; Klonowska et al., 2015; Stephens et

al., 2009). Moreover, the proportion or frequency of mutated genes of a specific cancer

type can be visualised by bar plots as well (Cancer Genome Atlas Research Network, 2014;

Devarakonda, Morgensztern & Govindan, 2015).

3.7 Evaluation

3.7.1 Case Studies

In order to implement and evaluate visualisations more efficiently and practically, three case

studies were designed according to previous case studies associated with cancer research

and lung adenocarcinoma research.

As mentioned above, cancers arise because of mutations in genes that lead to an abnormal

increase of cells. However, not all mutations contribute to tumour development,

because some of them are “passenger” mutations rather than “driver” mutations (Ciriello,

Cerami, Sander & Schultz, 2012). Additionally, although over 400 such “cancer genes” have

been identified and recorded, there are still potential cancer genes to be detected (Futreal

et al., 2004). Thus, identifying the drivers more precisely is of vital importance in

oncogenomics research (Gonzalez-Perez & Lopez-Bigas, 2012). However, distinguishing

cancer drivers is both a means of capturing the mutation patterns of

cancers and a challenge in itself. Because experimental validation of mutations cannot keep pace with

the soaring capacity for mutation identification, computational approaches to distinguishing

cancer drivers are needed (Gonzalez-Perez & Lopez-Bigas, 2012).


To identify significantly mutated genes, a common approach is to detect recurrently

mutated genes. Mutual exclusivity analysis is one useful method that applies this principle

to distinguish driver genes (Klonowska et al., 2015). During mutual exclusivity

analysis, the probability of gene alteration (amplification or deletion) is calculated

across a substantial number of tumour samples. Then, the features of gene expression (gene

product: RNA or protein) are also considered to filter candidate genes (Ciriello, Cerami,

Sander & Schultz, 2012).
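One simple way to quantify mutual exclusivity between two genes, shown here only as an assumption and not necessarily the statistic used in the cited work, is a one-sided hypergeometric test: given how many samples carry an alteration in each gene, how likely is it to see as few shared samples as were observed if the alterations were independent? The function name and the sample identifiers in this Python sketch are hypothetical.

```python
from math import comb


def co_occurrence_p_value(n_samples, altered_a, altered_b):
    """Probability of seeing at most the observed overlap under independence.

    n_samples -- total number of tumour samples
    altered_a -- sample ids with an alteration in gene A
    altered_b -- sample ids with an alteration in gene B
    A small value suggests the two genes tend to be mutually exclusive.
    """
    a, b = set(altered_a), set(altered_b)
    observed = len(a & b)
    total = comb(n_samples, len(b))
    # Hypergeometric tail: overlaps of size 0 .. observed.
    tail = sum(
        comb(len(a), k) * comb(n_samples - len(a), len(b) - k)
        for k in range(observed + 1)
    )
    return tail / total


# Hypothetical cohort of 10 samples: gene A altered in samples 0-4,
# gene B altered in samples 5-9, so the two never co-occur.
print(co_occurrence_p_value(10, range(0, 5), range(5, 10)))
```

In this toy cohort only one of the 252 equally likely arrangements has no overlap, so the probability is 1/252; in practice such a test would be run over many gene pairs and corrected for multiple comparisons.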

To support these two processes, survival plots as well as box plots can be used. In a survival

plot, the survival analysis can represent the survival characteristics of amplified genes,

which can illustrate the survival differences between genes and ultimately help identify driver

genes (Ciriello, Cerami, Sander & Schultz, 2012). A box plot can represent the

features of gene expression across different cancers. Driver genes can be overexpressed,

which means that their gene expression is outside the normal range (Santarius, Shipley, Brewer,

Stratton & Cooper, 2010). Hence, it is important to show the gene expression

features of different cancers in order to detect driver genes across various cancers more

precisely.

To understand the biology of cancers, cancer profiling research has been shown to have a

significant impact in numerous ways (Cowin, Anglesio, Etemadmoghadam & Bowtell, 2010).

Consequently, according to what has been discussed above, Case study 1, 2 and 3 were built

to support the procedure of identifying cancer drivers as well as reveal the molecular profile

characteristics of lung adenocarcinoma.

Case study 1: To support the detection of lung adenocarcinoma drivers

For lung adenocarcinoma (one of non-small-cell lung cancer types), its mutational landscape

particularly differs from small cell lung cancer (Devarakonda, Morgensztern & Govindan,

2015). Therefore, by comparing the mutational features with other cancer types, the

specific mutation patterns of lung adenocarcinoma can be recognised. As survival

characteristics are one of the mutation features, the comparison of survival analysis results

between lung adenocarcinoma and other cancer types can be implemented. It can offer

help for the procedure of identifying driver genes of lung adenocarcinoma.

In addition to comparing survival characteristics among different cancer types, it is also

common to make comparisons among different genes. The TP53 and MET genes have been

identified as significantly mutated genes in lung adenocarcinoma (Devarakonda,

Morgensztern & Govindan, 2015) (details shown in Appendix 9.1). Consequently, the

survival characteristics of these two genes can be shown in survival plots to distinguish their

differences.


Case study 2: To support the identification of overexpressed genes of lung adenocarcinoma

It is stressed that gene amplification sometimes occurs with gene overexpression (Prelich,

2012). That is also one of the reasons why gene expression should be used to filter

candidate genes when implementing driver gene identification (Ciriello, Cerami, Sander &

Schultz, 2012). In addition, it is emphasised that correctly establishing the gene expression

level is of vital importance to gene therapy strategies (Prelich, 2012). This implies that

identifying overexpressed genes can not only help with the recognition of driver

genes, but also support the development of gene therapy tactics.

For lung adenocarcinoma, identifying epidermal growth factor receptor (EGFR)

mutations can lead to therapeutic opportunities (Landi, Minuti & Salvini, 2013). Additionally,

the MET gene has been identified as one of the important genes in resisting EGFR, which

indicates potential tactics for overcoming one of the challenges in curing lung adenocarcinoma

(Landi, Minuti & Salvini, 2013). Furthermore, it is highlighted that 25% to 75% of NSCLC cases

show MET overexpression (Olivero et al., 1996). Thus, the MET expression of lung

adenocarcinoma can be visualised to illustrate its

expression features.

Case study 3: To visualise molecular profile features of lung adenocarcinoma

Lung cancer is consistently described as a “molecularly heterogeneous disease”

(Devarakonda, Morgensztern & Govindan, 2015, paragraph 8). Also, tumours are

so complex that they might be diagnosed differently from the molecular viewpoint

(Gonzalez-Perez & Lopez-Bigas, 2012). Consequently, considering the above

as well as the significantly different mutational landscape of lung adenocarcinoma

(Devarakonda, Morgensztern & Govindan, 2015), being able to comprehend the molecular

profile of lung adenocarcinoma is beneficial to its identification (Cancer Genome Atlas

Research Network, 2014).

The heat map is an important visualisation tool for profiling the cancer genome. Six molecular

subtypes of ovarian cancer were identified in a heat map by correlating tumour samples

(x-axis) with genes (y-axis) and gene expression (patches of colour) (Cowin, Anglesio,

Etemadmoghadam & Bowtell, 2010). Similarly, it is also emphasised that molecular subtypes

in lung adenocarcinomas can be revealed by heat maps (Cancer Genome Atlas Research

Network, 2014). Moreover, circle maps representing the alteration landscape could also be tried

for lung adenocarcinoma, similar to the breast cancer visualisations (Vaske et al.,

2010). Based on the previous practices mentioned above, a heat map presenting the molecular

profile features of lung adenocarcinoma should be built in this case.


3.7.2 Evaluation Methods

It is underlined that the wide variety of visualisation tools aims to support cancer

researchers, especially those without computational and bioinformatics expertise.

However, the development of evaluation methods for these visualisations has not kept pace

with the development of the visualisations themselves. Specific means of evaluating genetics

visualisations are limited.

Controlled experiments are an existing evaluation methodology for measuring the effectiveness of visualisations (Saraiya, North & Duca, 2005). Although predefined tasks commonly restrict what evaluators look at, the controlled experiment is still widely used because it is well suited to testing informative tasks (Saraiya, North & Duca, 2005). Thus, this research will use the methodology of controlled experiments by designing a questionnaire based on the three case studies presented above.

As traditional procedures require, the environment and background of evaluators will be controlled (Isenberg, Zuk, Collins & Carpendale, 2008), and evaluation time and scores will be calculated as the measure of effectiveness (Saraiya, North & Duca, 2005). There are many methods for evaluating an information retrieval platform that presents visualisations (Morse, Lewis & Olsen, 2000). It is stated that the visual taxonomy of the platform will influence how effectively the graphs are understood. Consequently, this evaluation will try to design measurable questions for assessment so that different variables can be compared.

3.8 Techniques and Data

This research will mainly use RStudio and RShiny to visualise lung adenocarcinoma data, with SPSS and Tableau as supporting tools. RShiny makes it convenient to visualise statistical analysis when the code is well designed. As a variety of genetics related R packages for visualisation are available online, suitable and similar references can be found for most of the visualisations introduced above. R code for visualising heat maps, networks, survival plots, scatter plots, box plots and bar plots can be accessed from RPubs (RPubs, 2015). The code for creating circle maps and genomic coordinates is provided in the ggbio package (Yin, Cook & Lawrence, 2012). Moreover, websites such as GitHub and Gitools provide instructions and examples for visualising genomic data.

For data, this research will use the cancer genomic data sets of The Cancer Genome Atlas (TCGA) portal, which covers various types of cancer as well as genes. At the time of writing, it holds 39 cancer types and 14,531 cases in total, of which 585 are lung adenocarcinoma cases (The Cancer Genome Atlas, 2016). As this research focuses only on lung adenocarcinoma, only lung adenocarcinoma data will be accessed for visualisations.

4.0 Methodology

4.1 Research plan summary

In order to make the dashboard more practical for cancer researchers and to create an effective evaluation process, six steps need to be completed in this research.

Step 1: To explore potential visualisations based on existing studies;

Step 2: To design suitable case studies for researching lung adenocarcinoma;

Step 3: To clean data sets for visualising;

Step 4: To conduct potential visualisations based on the three case studies;

Step 5: To develop and publish the interactive dashboard;

Step 6: To design the questionnaire and then evaluate the interactive dashboard.

As a number of portals for visualising and analysing cancer projects are available online, the principles and features of existing visualisations have been summarised in the literature review. To produce a better summary, the principles of each type of visualisation used on those portals should be understood first. Then, by adding other visualisations presented in the previous literature, all of the potential visualisations can be summarised to meet the requirements of current cancer research.

Graphical expression aims to convey the characteristics of data more directly, which can support the solving of practical problems. Therefore, three case studies are designed according to current lung adenocarcinoma research problems. These three case studies not only allow the chosen dataset to be visualised more practically, but also offer benchmarks for the later evaluation.

Before building the dashboard, some calculation and integration of the data sets extracted from TCGA had to be done. During the building and updating process, great help was received from GitHub, RPubs, Phil Chapman (a staff member at Cancer Research UK) and fellow student Ketaki B Patil.

After controlling for potential influences, a questionnaire testing the utility of the visualisations and the dashboard was designed around the different case studies. After the evaluation, the assessments will be analysed and compared to identify the effectiveness of the dashboard.

4.2 Dataset Description

4.2.1 Term Explanation

The table below explains the term abbreviations used in the dashboard; some of this information is also shown in the dashboard as the help note. The information on LUAD and LUSC is derived from the National Cancer Institute (2016). The information about genes is derived from Gene Cards (2016).

Table 4.1 Term Dictionary

Term abbreviation | Term | Note
LUAD | Lung adenocarcinoma | It is one of the non-small cell lung cancers, accounting for about 40% of all lung cancers.
LUSC | Lung squamous cell carcinoma | Another one of the non-small cell lung cancers, accounting for about 30% of all lung cancers.
BRCA | Breast cancer | Information derived from TCGA.
BLCA | Bladder carcinoma | Information derived from TCGA.
ACC | Adenoid cystic carcinoma | Information derived from TCGA.
MET | MET gene | It is a protein coding gene.
TP53 | Tumour protein p53 | It is a protein coding gene.
Mut.disease | Mutant disease | Information derived from RTCGA.
WILDorNOINFO | Wild type disease or no information | Information derived from RTCGA.
ZNF500 expression | Zinc finger protein 500 | It is a protein coding gene.
LUAD.rnaseq | RNA sequence of Lung adenocarcinoma |

4.2.2 General Description

As mentioned, this study used the cancer genomic data sets of TCGA, in which there are 1093 lung cancer cases. Figure 4.1 shows that 585 of these are lung adenocarcinoma cases, representing 54%, and Figure 4.2 shows the demographic information of the lung adenocarcinoma cases. The numbers of male and female cases are nearly equal and most donors are still alive. Of the donors, 75% are not Hispanic or Latino, 24% did not report their ethnicity and only 1% are Hispanic or Latino. For race distribution, white donors represent 75%, Black or African American donors rank third with 10%, and only 2% are Asian.

Figure 4.1 Case distributions of LUAD and LUSC in Lung Cancer Project of TCGA (pie chart: LUAD 585 cases, 54%; LUSC 504 cases, 46%)

Figure 4.2 Demographic distributions of cases within LUAD (pie charts. Vital status: alive 64%, dead 36%. Gender: female 54%, male 46%. Ethnicity: not Hispanic or Latino 75%, not reported 24%, Hispanic or Latino 1%. Race: White 75%, not reported 13%, Black or African American 10%, plus Asian and Other)


4.2.3 Sub-dataset Description

As mentioned in the literature review, there are 39 cancer types in the TCGA data sets, covering 2011 to 2016. In this study, five cancer types across these six years were used: LUAD, LUSC, BRCA, BLCA and ACC. For these five cancer types, RNA sequence datasets, Clinical datasets and Mutation datasets were extracted.

After extracting these three kinds of sub-datasets for the five cancer types, some calculation and dataset integration were done before developing the dashboard. The steps are listed below and the final results are presented in Table 4.2. After the datasets were combined, specific columns (gene type or expression type) were extracted when needed.

Step 1: The Clinical datasets of LUAD and LUSC were integrated;

Step 2: The Clinical datasets of LUAD, LUSC and BRCA were integrated;

Step 3: The Mutation datasets of LUAD and ACC were integrated;

Step 4: The Clinical datasets of LUAD and ACC were integrated;

Step 5: The RNA sequence datasets of LUAD, LUSC, BRCA and ACC were integrated;

Step 6: The RNA sequence datasets of LUAD, LUSC, BRCA and BLCA were integrated.
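
The integration steps above can be illustrated with a minimal sketch. It is written in Python purely for illustration (the study itself used R and the RTCGA packages) and is not the code used to build the dashboard. The function name `integrate_cohorts`, the record fields `patient`, `days` and `dead`, and all values are hypothetical; only the idea of tagging every row with a disease code column, as listed in the Notes column of Table 4.2, is taken from the text.

```python
def integrate_cohorts(cohorts, code_column="admin.disease_code"):
    """Stack per-cancer tables into one table, tagging each row with its cohort code."""
    combined = []
    for code, rows in cohorts.items():
        for row in rows:
            tagged = dict(row)          # copy so the input tables are not modified
            tagged[code_column] = code  # column later used to group curves or boxes
            combined.append(tagged)
    return combined


# Hypothetical, made-up records; real TCGA clinical tables have many more columns.
luad_clinical = [
    {"patient": "A1", "days": 400, "dead": 1},
    {"patient": "A2", "days": 900, "dead": 0},
]
lusc_clinical = [{"patient": "S1", "days": 250, "dead": 1}]

luadlusc_survinfo = integrate_cohorts({"luad": luad_clinical, "lusc": lusc_clinical})
for record in luadlusc_survinfo:
    print(record)
```

Keeping the cohort code as an ordinary column means that a single combined table can later be filtered or grouped by cancer type without keeping the separate tables around.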

Table 4.2 Each step of datasets building process

Final data sets | Separated data sets | Notes
“luadlusc.survinfo” | LUAD.clinical + LUSC.clinical | Integrated by column “admin.disease_code”
“luadluscbrca.survinfo” | LUAD.clinical + LUSC.clinical + BRCA.clinical | Integrated by column “admin.disease_code”
“luad_acc.clinical_mutations” | LUAD.clinical + ACC.clinical + LUAD.mutations + ACC.mutations | Integrated by column “admin.disease_code”; Filter by “MET” or “TP53” gene; Filter by “Mutated” or “Wild” type
“luad_lusc_brca_acc.rnaseq” | LUAD.rnaseq + LUSC.rnaseq + BRCA.rnaseq + ACC.rnaseq | Filter by “MET” and “ZNF500”
“luad_lusc_brca_blca.rnaseq” | LUAD.rnaseq + LUSC.rnaseq + BRCA.rnaseq + BLCA.rnaseq | Filter by “MET” and “ZNF500”
“luad_lusc_brca_acc.1rnaseq” | | Filter by “MET”; Calculate log1p (MET)
“luad_lusc_brca_acc.2rnaseq” | | Filter by “MET”; Calculate log1p (MET)


4.3 Dashboard Development

4.3.1 Platform and Packages

To build the dashboard, R Shiny was used and all code was implemented in two files, “server.R” and “ui.R”. Seven main packages were used, which are summarised in Table 4.3. During the building process, the tutorials on RPubs and the instructions on GitHub were of great help.

Table 4.3 Summary of main packages used for building the dashboard

Packages | Description | Accessed from
RTCGA, RTCGA.clinical, RTCGA.mutations, RTCGA.rnaseq | Nine specific functions were used in the study: boxplotTCGA, checkTCGA, datasetsTCGA, heatmapTCGA, infoTCGA, kmTCGA, mutationsTCGA, readTCGA, survivalTCGA | All of these packages are offered by TCGA and contain various cancer related analyses. Authors: Marcin Kosinski, Przemyslaw Biecek, Witold Chodor
shiny | “Web application framework for R” | RStudio
ggvis | “Visualise a data set with a ggvis graphic” | RStudio
rsconnect | “Deployment interface for R Markdown documents and Shiny applications” | RStudio

4.3.2 Visualisation Creating

Using the data sets and R packages described above, along with design guidelines on GitHub and tutorials on RPubs, and considering the requirements of the three case studies designed in the literature review, three kinds of visualisations were chosen for the final interactive dashboard.


For case study 1, a survival plot was chosen to visualise the results of Kaplan-Meier survival analysis in order to demonstrate the survival characteristics of lung adenocarcinoma.
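
To make the quantity drawn in a survival plot concrete, the following is a minimal sketch of the standard Kaplan-Meier product-limit estimator. It is written in Python for illustration only; the dashboard itself relied on R packages, and the function name `kaplan_meier` and the five made-up patients are assumptions, not data from the study.

```python
def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each time where a death occurred.

    times  -- follow-up time of each patient
    events -- 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share this time point.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            # Multiply by the share of at-risk patients who survived this time point.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving  # deaths and censored patients both leave the risk set
    return curve


# Hypothetical follow-up times with death (1) or censoring (0) flags.
curve = kaplan_meier([5, 8, 8, 12, 15], [1, 1, 0, 1, 0])
for time, probability in curve:
    print(time, round(probability, 3))
```

Each step down in a survival curve corresponds to one entry of `curve`; censored patients do not create a step, but they reduce the number at risk for later steps.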

For case study 2, even though heat maps, scatter plots, networks and circle maps are all able to represent gene expression patterns, as summarised in the literature review, a box plot was finally used in this dashboard. There are two reasons for this decision. Heat maps, networks and circle maps are multi-dimensional visuals. They can represent gene expression patterns and their correlation with more than one variable at the same time, which means that a great deal of information is packed into one graph. This may make the visualisations time consuming and more difficult for users to understand. Because scatter plots and box plots have fewer dimensions, both can visualise data more directly. A scatter plot is regularly used to emphasise the correlation between two variables, such as gene and gene expression (Gao et al., 2013). A box plot highlights the distribution of the results, for instance the median and the lower and upper quartiles.
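
As a hedged illustration of what a box plot summarises, the sketch below computes the five values a box is drawn from, after applying the log1p transform that Table 4.2 lists for MET. It uses only the Python standard library; the function name `box_stats` and the expression values are made up, and the quartile convention (`method="inclusive"`) is one of several in common use, so a plotting package may draw slightly different quartiles.

```python
import math
import statistics


def box_stats(expression):
    """Five-number summary of log1p-transformed expression values."""
    # log1p(x) = log(1 + x), so a zero expression value maps to 0 instead of failing.
    logged = [math.log1p(x) for x in expression]
    q1, median, q3 = statistics.quantiles(logged, n=4, method="inclusive")
    return {"min": min(logged), "q1": q1, "median": median, "q3": q3, "max": max(logged)}


# Hypothetical expression values for one cancer type.
summary = box_stats([0, 1, 3, 7, 15])
for name, value in summary.items():
    print(name, round(value, 3))
```

Comparing the `median` entries of several cancer types side by side is the numerical counterpart of reading which box sits highest in the plot.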

Thus, by visualising the MET expression of lung adenocarcinoma alongside other cancer types for comparison, the cancer with the highest MET expression can be recognised, which could potentially indicate gene overexpression. In addition, by doing so, the results of case study 2 can be used to compare the results demonstrated in case study, so that they can be used as validation of previous research on gene driver detection.

For case study 3, a heat map was finally used to visualise the molecular profile features of lung adenocarcinoma. It is noted that

“The cancer genome is characterized by point mutations, aberrant methylation and

gene expression patterns, and altered chromosomal and genomic structure including

changed DNA copy number and gene fusions” (Cowin, Anglesio, Etemadmoghadam

& Bowtell, 2010, chapter 2. paragraph 1)

This reveals that cancer profiling covers a variety of features. In order to keep consistency and allow comparison among the different case studies, visualisations of gene expression patterns were produced.

4.3.3 Dashboard Publishing and Updating

During the dashboard building process, problems were solved with help received by email from Phil Chapman, a cancer research scientist at Cancer Research UK, as well as direct help from fellow student Ketaki B Patil. In addition, a meeting with Phil Chapman was arranged to further improve the initial dashboard. This meeting lasted for over an hour and numerous observations and pieces of feedback were received. The amendments are listed as follows:

1. The note of technical terms can be explained before the dashboard evaluation;

2. Some “radio buttons” functions in the user interface could be changed into “select box” functions, because more choices can then be added for future or larger-scale studies and hiding the options gives a cleaner user interface.

3. The dashboard could be published using R Markdown, which makes it accessible and more convenient for participants to evaluate via the Internet.

4. The datasets could be downloaded and added to the dashboard before running the code, which would enable the dashboard to run faster.

4.3.4 User Interface Designing

In order to create a more user-friendly interface, the current interfaces of the TCGA Data Portal and cBioPortal were compared. The main sidebar of the TCGA Data Portal is located on the left of the screen, while in cBioPortal it is located at the top as a navigation menu, as the screenshots depict:

Figure 4.3 Portal interface of TCGA (CDC) Data Portal

Figure 4.4 Portal interface of cBioPortal

For the dashboard implemented here, fewer selections need to be offered in the sidebar and the graphs to be displayed are not as large as those shown in the TCGA Data Portal. Hence, combining the advantages of the sidebar and the navigation menu with the suggestions from Phil Chapman and Ketaki B Patil, the final dashboard is shown below. This interactive dashboard has been published via shinyapps.io (a platform for publishing Shiny web applications) and is available at: http://tcgaplotappluad.shinyapps.io/TCGA_Plot_Validation

Figure 4.5 Interface of the dashboard built in this study

4.4 Design Evaluation and Questionnaire

As mentioned in the literature review and the dashboard development section, three case studies were designed and all of the visualisations were built to achieve the aims of each case study. Thus, some of the questions in the evaluation questionnaire are also related to the case studies. The questionnaire contains three sections: background information, dashboard evaluation and impression of using the dashboard. The parts below elaborate on the questions in each section (the questionnaire is attached in Appendix 9.4).

4.4.1 Background Information Section

This section aims to gather participants' demographic information (e.g. gender), profession and experience of using visualisations. Because the dashboard built here aims to support cancer research scientists and students in their research, participants' related knowledge would influence their judgement and assessment criteria. The example below is one of the questions about visualisation experience.

Figure 4.6 Question examples of visualisation creating experience

4.4.2 Evaluation Section & Dashboard usage impression Section

These two sections aim to find out whether the dashboard can help participants locate the answer to a specific question, whether the graphs are clear enough for them to understand, whether the graphs help them learn more about lung adenocarcinoma and, finally, whether the dashboard could help their further study. Two kinds of questions are therefore included: questions evaluating how well the dashboard supports the case studies, and questions evaluating the graphs themselves.

As three types of visualisations were built to support the three case studies respectively, the evaluation section consists of three parts: “Survival Plot”, “Heatmap” and “Box Plot”. The following three questions are examples designed according to each case study, and the other two are used to indicate the utility of the graphs and the dashboard.


The three figures above are question examples for finding out whether the graphs achieve the aims of each case study. The questions were designed not to be too difficult for participants to answer, but because the graphs differ in how difficult they are to understand, the questions also differ in difficulty.

Figure 4.7 is a question on one of the survival plots, aiming to find out whether participants could recognise the survival features of TP53, which links back to the requirement of case study 1. Figure 4.8 relates to the box plot section, aiming to indicate whether participants could identify the differences in MET expression between cancers, which links back to case study 2. The question in Figure 4.9 is more difficult as it is drawn from the heat map, which is used to visualise the relationship among the MET gene, ZNF500 expression and the RNA sequence of each cancer, linking back to case study 3.

The two figures below are examples assessing the utility of each graph and of the dashboard. Most of the questions in these two sections were designed with a linear scale, which illustrates the level of participants' agreement more precisely and objectively than a fixed set of answers. The scores of each question can also be measured and compared.
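
How linear-scale answers can be "measured and compared" is sketched below, under the assumption of a 1 to 5 scale and 17 respondents (the number of participants stated in the abstract). The Python function `scale_summary` and the answer list are hypothetical and made up for illustration; they are not the study's responses or its analysis code.

```python
from collections import Counter


def scale_summary(responses, scale=(1, 2, 3, 4, 5)):
    """Return the percentage of respondents at each scale point and the mean score."""
    counts = Counter(responses)
    n = len(responses)
    percentages = {point: round(100 * counts[point] / n, 1) for point in scale}
    mean_score = sum(responses) / n
    return percentages, mean_score


# Hypothetical answers from 17 participants (1 = completely disagree, 5 = completely agree).
answers = [1, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 4, 5, 5, 5, 5]
percentages, mean_score = scale_summary(answers)
print(percentages)
print(round(mean_score, 2))
```

With 17 respondents, one answer is worth about 5.9 percentage points, so every reported percentage corresponds to a whole number of participants (for example, 7 of 17 is 41.2%).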

Figure 4.10 Question examples for evaluating the graph itself

Figure 4.11 Question examples for evaluating the dashboard utility

4.4.3 Select participants and Conduct assessment

As this dashboard is designed to help cancer research scientists and students, participants with an oncology related background or who are conducting cancer research are the priority. In addition, data scientists and clinical data managers should also be targeted. A search of the University of Sheffield website also showed that some biomedical scientists contribute to cancer research. Consequently, the sample was selected from the staff lists of the departments of Oncology & Metabolism and Biomedical Science. Students of the MSc Data Science programme were selected as well. In addition, one of the candidates is a student who graduated from the MSc Data Science last year and now works at Cancer Research UK.

In order to improve the likelihood of a response, each email was tailored to the research interests of the candidate and then sent as an invitation, with the research statement, the ethics approval letter and the links to the evaluation form and dashboard attached (the template of the email is attached in Appendix 9.2).

4.5 Research Ethics

This study has been identified as “low risks” and approved on ethics grounds by the Information School. The ethics application form, ethics approval letter and consent form are attached in Appendix 9.3.

4.5.1 Questionnaire and Participants

Although the questionnaire collects demographic information about the participants, all the information will be anonymised and no identifying information will be shown. Although this research involves human participants, the questions are not culturally or politically sensitive.

4.5.2 Datasets and Data storage

This research, as described above, used data from TCGA, which offers open-access cancer project data and donor sample data. All of the donor data are anonymous and show no personal information, in order to protect confidentiality. As TCGA is a public website for visualising and accessing data, signing a consent form is not necessary. All the genomics datasets were downloaded from the TCGA website and stored in the online drive of my university account. The questionnaires were sent out via a “google form” spreadsheet and stored in my university account as well. After the dissertation is submitted, all the data will be sent to the database of the Information School.


5.0 Results

5.1 Participant Background Information

Most participants have rich research experience and a high education level. Figure 5.1 presents the background information of participants: the distribution of age, education level and years of professional experience, respectively. As this study mainly focuses on participants with rich academic experience or a cancer research related background, most of them are over 26 years old (58.8%): the “26-35” and “36-45” groups account for the same proportion (17.6% each), and the “46-55” and “56-65” groups also share the same percentage (11.8% each). 41.2% of them have attained a postdoctoral degree and 52.9% hold a master's degree. The largest group has been in their current profession for less than 1 year (41.2%), while 29.4% have more than 10 years of professional experience. They were also asked to write down their current profession (shown in Appendix 9.5). This further confirms that they have rich research experience in their profession, which is important for gaining valuable and helpful suggestions from this dashboard evaluation.

Figure 5.1 Distribution of age, education level, profession year experience

Data visualisation is in great demand and is a necessary part of participants' daily work. Figure 5.2 shows how often they need to create visualisations (e.g. bar charts, box plots, survival plots) in their work or study. It indicates that 47.1% of participants always need to do so and another 47.1% often need to do so. When asked whether they rely more upon visualisations or text to present their work, 41.2% of them rely more on visualisations and 29.4% stay neutral (Figure 5.3).

Figure 5.2 Frequency of creating data visualisations

Figure 5.3 Preference of using visualisations and text

5.2 Dashboard description and Evaluation results

Figure 5.4 First page of the dashboard

On the first page of the dashboard (Figure 5.4), the explanation of the term abbreviations used in this dashboard is presented as the help note, in order to give users a general understanding before they explore the dashboard. This help note is also shown below each graph. When asked whether they agree that the note helps them understand the graphs better (Figure 5.5), 70.6% of participants agree, including 35.3% who completely agree.

Figure 5.5 Help note agreement

Figure 5.6 “Different cancer types” section: Two cancer type comparison (left) and three cancer type comparison (right)

5.2.1 Survival Plot Section 1: “Different cancer types”

The Survival Plot section is divided into two sections, “Different cancer types” and “Comparison in different genes”, and Figure 5.6 presents two of the graphs in the “Different cancer types” section. The left graph compares the survival patterns of LUAD and LUSC; because both are non-small cell lung cancers, it is meaningful to look for their differences or similarities. The right graph compares LUAD, LUSC and BRCA, which adds a totally different type of cancer, breast cancer, to the comparison. It shows that LUAD and LUSC have a much lower survival probability than breast cancer, which is consistent with what is generally believed in medical research (Siegel, Miller & Jemal, 2015).

The task of case study 1 is to visualise the survival characteristics of LUAD, so the graphs presented here achieve the aim of case study 1. However, in the evaluation result for the right-hand graph (Figure 5.7), only 41.2% of participants could recognise that LUAD has the lowest survival probability when compared with the other two types of cancer, and 23.5% of them were unsure which answer was right. This may explain why 35.3% of participants stayed neutral when asked whether they agree that the graphs are easy enough to understand (see Figure 5.8 below). Nevertheless, more than half of the participants (58.8%) think that the graphs are not difficult to comprehend, including 23.5% who completely agree. Moreover, only 23.6% of them think there is too much information in these graphs (Figure 5.9 below), while 47% disagree. Hence, this evidence indicates that the survival plot itself is not the main cause of the confusion.

Figure 5.7 Answer locating distribution of “Three cancer type comparison” (top right)

Figure 5.8 Easy level agreements of graphs in the “Different cancer types” section (graphs shown in Figure 5.6)

Figure 5.9 Agreement of too much information for graphs in the “Different cancer types” section (graphs shown in Figure 5.6)

In order to find out what may cause confusing or erroneous answers, a multiple choice question was asked; its results are presented in Figure 5.10. Most participants (52.9%) agree that the legend and labels of these graphs could result in errors. Specifically, one of the participants explained that a fixed time frame should be defined when making the comparison, as LUAD and LUSC survival changes over time. In addition, the shaded regions representing the standard error of the mean (SEM) ought to be defined in the legend of the graph as well. For the labels, the unit of the time axis should be given to avoid misunderstandings. Besides the legend and labels, unfamiliarity with the medical terms is mentioned as one of the causes, because it takes time to match the terms in the graphs with the help note at the bottom in order to locate the answer.

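Figure 5.10 Possible elements that could cause erroneous answers of the “Different cancer types” section (answer options: the legend of the graph, the title of the graph, the labels of the graph, none of them, other)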

There are three graphs in this section in order to find out whether it is more useful for cancer researchers to compare more types of cancer at one time. From Figure 5.11, 64.7% of participants agree that it is useful to make comparisons among different cancers. However, Phil Chapman stated that this sometimes depends on what kind of problem needs to be solved. This may be why 23.5% of them stay neutral and 11.8% disagree. Hence, this implies that existing online portals for visualising genomics data sets should provide various choices to meet the different needs of scientists. In this case, it may be useful to offer a function that can present graphs both with and without the comparison of different cancers.

Figure 5.11 Agreement of comparison among different cancers

When asked whether these graphs gave them insight into researching LUAD, about half of the participants (52.9%: 29.4% chose 4 and 23.5% chose 5) agree, while 23.5% (5.9% chose 1 and 17.6% chose 2) disagree (Figure 5.12).

Figure 5.12 Insight gaining agreement of the graphs in the “Different cancer types” section

5.2.2 Survival Plot Section 2: “Comparison in different genes”

Figure 5.13 “Comparison in different genes” section: MET gene (left) and TP53 gene (right)

Similarly, the second section of “Survival Plot” aims to represent the survival patterns of LUAD, but this time for different genes (Figure 5.13). One of the two graphs demonstrates the survival patterns for the MET gene in LUAD and ACC (left) and the other shows those for the TP53 gene (right). Both genes have been identified as significantly mutated genes in LUAD, although TP53 showed a higher significance (Devarakonda, Morgensztern & Govindan, 2015). Thus, it may be meaningful to know how their survival patterns differ in LUAD. In addition, the gene status of both LUAD and ACC was divided into mutated type and wild type. As Figure 5.14 shows, 64.7% of participants think it is useful to compare the mutated type and the wild type, although 29.4% stay neutral.

Figure 5.14 Agreement of comparison between mutation type and wild type

Figure 5.15 Answer locating distribution of “TP53” survival plot

Compared with the first question in the last section, Figure 5.15 demonstrates a higher accuracy for this question: 70.6% of participants (compared with 41.2%) could identify that ACC with mutated TP53 has the lowest survival probability. After comparing individual responses, it is interesting to find that 6 participants who located the wrong answer in the last section gained the correct answer in this section. In addition, more participants (82.3% compared with 58.8%) agree that the graphs in this section are easy enough to read (Figure 5.16). Moreover, only 5.9% agree that these graphs present too much information, although more than half stay neutral (Figure 5.17). What should be highlighted is that no one considers these graphs to be of no use for deriving insight into researching LUAD, and 70.5% indicate that they are helpful (Figure 5.18). All this evidence demonstrates that participants comprehended this section better.

Figure 5.16 Easy level agreements of graphs in the “Comparison in different genes” section (graphs shown in Figure 5.13)

Figure 5.17 Agreement of too much information for graphs in the “Comparison in different genes” section (graphs shown in Figure 5.13)

Figure 5.18 Insight gaining agreement of the graphs in the “Comparison in different genes” section

There may be two reasons for this. First, the four survival curves in this section do not overlap or cross very much, whereas in the last section the survival curves of LUAD and LUSC cross once during the whole time period. This may result in less confusion. Second, participants may have become more familiar with the survival plot after exploring the last section.

Although the answers are more accurate, elements that may result in erroneous answers still exist. Among them, the labels of the graph are mentioned most (52.9%) and the legend comes second (35.3%) (Figure 5.19).

Figure 5.19 Possible elements that could cause erroneous answers of the “Comparison in different genes” section




5.2.3 Box Plot Section

Figure 5.20 Box Plot Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA (right)

In the “Box Plot” section, there are two graphs, each of which compares the logarithm of MET expression from RNA sequencing data across four cancer types at a time. As stated in the methodology, this section addresses the task of case study 2: the box plot visualises the MET expression of LUAD alongside the other cancer types. With this visualisation, the cancer type with the highest MET expression can be recognised, which could potentially indicate gene overexpression (Prelich, 2012). In Figure 5.20, the x-axis label “logarithm of MET” stands for the MET expression feature. As “LUAD.rnaseq” has the highest median logarithm of MET, this implies that the MET gene is more likely to be overexpressed in LUAD than in the other cancer types included here.
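The comparison behind the box plot can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used for the dashboard (which was built in R): the expression values are invented, and the log2(x + 1) transform is an assumption, since the text does not state which logarithm was applied.

```python
import math
from statistics import median

# Hypothetical MET expression values per cohort (arbitrary units).
# These numbers are invented for illustration and are NOT TCGA data.
expression = {
    "LUAD": [120.0, 340.0, 560.0, 910.0, 1500.0],
    "LUSC": [80.0, 150.0, 260.0, 400.0, 700.0],
    "BRCA": [30.0, 60.0, 95.0, 140.0, 220.0],
    "ACC": [10.0, 25.0, 40.0, 70.0, 130.0],
}


def median_log_expression(values):
    """Median of log2(value + 1); the +1 offset (an assumption) avoids log(0)."""
    return median(math.log2(v + 1.0) for v in values)


# One median per cohort: the quantity drawn as the centre line of each box.
medians = {cohort: median_log_expression(vals) for cohort, vals in expression.items()}

# The cohort with the highest median is the candidate for overexpression.
highest = max(medians, key=medians.get)
print(highest, round(medians[highest], 2))
```

Comparing medians rather than means keeps the comparison robust to the extreme values that appear as isolated points in a box plot.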

Compared with the first “Survival Plot” section, the accuracy of the case-related question is higher in this section. As Figure 5.21 shows, 70.6% of participants chose the right answer, “LUAD”, which is higher than the 41.2% in that section. Additionally, no one disagreed that these graphs are easy enough to understand (Figure 5.22). Together, these results imply that, in this dashboard, the box plot is to a degree easier for participants to read than the survival plot.

Figure 5.21 Answer locating distribution of LUAD & LUSC & BRCA & ACC (left) box plot

Figure 5.22 Easy level agreements of graphs in the Box Plot section (graphs shown in Figure 5.20)


To find out how useful the box plots presented here are, related questions were asked. 76.5% of respondents agree that it is useful to visualise the gene feature in this way (Figure 5.23). In addition, 76.4% of respondents agree that they could gain insight into researching LUAD from these graphs (Figure 5.24), which implies that the box plots have achieved the task of case study 2, although some elements still need to be improved (Figure 5.25).

Figure 5.23 Utility agreements of graphs in the Box Plot section

Figure 5.24 Insight gaining agreement of the graphs in the Box Plot section


Figure 5.25 Possible elements that could cause erroneous answers of the Box Plot section

Among those who got the wrong answer or were not sure about the answer, most suppose that the labels (52.9%) of these graphs are the element most likely to lead to a wrong answer. Specifically, some participants mentioned that the font of the labels is too small to read. In addition, one of the researchers advised that the spots and the differing lengths of the lines should be explained. Furthermore, one participant, a research scientist with over 10 years of experience, suggested that the “.rnaseq” suffix is not necessary in the label. Even though the data used here are RNA sequencing data, a label such as “LUAD.rnaseq” still simply represents the MET expression of LUAD. The suffix may also confuse users about the difference in meaning between “LUAD” and “LUAD.rnaseq”. Therefore, it is more direct to use “LUAD” instead of “LUAD.rnaseq”.





5.4 Heat map Section

Figure 5.26 Heat map Section: LUAD & LUSC & BRCA & ACC (left) and LUAD & LUSC & BRCA & BLCA (right)

As in the last section, the “Heat map” section consists of two graphs, each of which contains data from four different cancer types. Each graph shows three dimensions of data: the y-axis shows four intervals of MET expression, the x-axis shows the RNA sequencing data of the four cancer types, and the expression of the other gene, ZNF500, is shown as the colour of each patch. Hence, the heat map is used to reveal the correlation between ZNF500 expression and MET expression across the four cancer types. As stated above, the task of case study 3 is to visualise molecular profile features of LUAD, and visualising the gene expression pattern was the approach finally chosen. In this case, the gene expression correlation of each cancer type is one of its molecular profile features.
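The structure of such a heat map can be sketched in Python as a grid of cells keyed by MET interval and cancer type. This is a minimal sketch under stated assumptions, not the dashboard's implementation: the samples are invented, the four intervals are assumed to be equal-width, and each cell is assumed to show the mean ZNF500 expression of the samples that fall into it.

```python
from collections import defaultdict

# Hypothetical (cancer type, MET expression, ZNF500 expression) samples.
# Invented for illustration; NOT TCGA data.
samples = [
    ("LUAD", 2.0, 5.0), ("LUAD", 2.5, 6.0), ("LUAD", 9.5, 7.0), ("LUAD", 6.1, 6.0),
    ("LUSC", 1.0, 4.0), ("LUSC", 4.2, 4.5), ("LUSC", 10.0, 8.0),
]


def met_interval(value, lo, hi, n_bins=4):
    """Index (0..n_bins-1) of the equal-width interval that contains value."""
    if hi == lo:
        return 0
    idx = int((value - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)  # the maximum value belongs to the last interval


def heatmap_cells(rows, n_bins=4):
    """Mean ZNF500 expression per (MET interval, cancer type) cell."""
    met_values = [met for _, met, _ in rows]
    lo, hi = min(met_values), max(met_values)
    grouped = defaultdict(list)
    for cancer, met, znf in rows:
        grouped[(met_interval(met, lo, hi, n_bins), cancer)].append(znf)
    return {cell: sum(vals) / len(vals) for cell, vals in grouped.items()}


cells = heatmap_cells(samples)
for (interval, cancer), value in sorted(cells.items()):
    print(f"MET interval {interval}, {cancer}: mean ZNF500 = {value:.2f}")
```

Each printed value corresponds to the colour of one patch, so a cancer type whose colour changes steadily across the MET intervals is one where the two genes appear correlated.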


To validate whether the heat map used in the dashboard is useful for visualising the molecular profile features of LUAD, related questions were asked. According to Figure 5.27, 64.7% of respondents agree that it is helpful for understanding LUAD molecular profile patterns. In addition, more than half (58.8%) of them agree that they gained insight into researching LUAD from these visualisations (Figure 5.28).

Figure 5.27 Utility agreements of graphs in the Heat map section

Figure 5.28 Insight gaining agreement of the graphs in the Heat map section


Figure 5.29 shows that the accuracy of the case-related question in this section is much lower than in the last section. Only one respondent could identify that the ZNF500 expression of BLCA has the highest correlation with the MET gene. In addition, 29.4% of respondents found the graphs difficult to understand, which is much higher than in the “Box Plot” section (0%). Furthermore, 17.6% of them agree that the graphs present too much information, which is also more than in the “Box Plot” section (11.8%).

Figure 5.30 Answer locating distribution of LUAD & LUSC & BRCA & BLCA (right) heat map


Figure 5.31 Easy level agreements of graphs in the Heat map section (graphs shown in Figure 5.26)

Figure 5.32 Agreement of too much information for graphs in the Heat map section (graphs shown in Figure 5.26)


Most of them (52.9%) suppose that the labels of the graph could lead to wrong answers, while the legend and the title rank second (23.5%) and third (17.6%), as shown in Figure 5.33. It should be stressed that no one mentioned the title as a problem in the last two sections. Hence, the title and legend of these graphs could be a main cause of errors in this case. Specifically, one respondent, an academic researcher with 6 to 10 years of experience, stated that it is difficult to work out what kind of information the heat map is intended to present and why.

Figure 5.33 Possible elements that could cause erroneous answers of the Heat map section





5.5 The whole impression of this dashboard for participants

Figure 5.34 shows that more than half (52.9%) of the participants were quite familiar with similar dashboards (visualisations) before using the one presented here, while 29.4% were not familiar with them. After using this dashboard, 68.8% of them state that they are willing to use it in their future work or study if needed, while those who remain neutral or are unwilling account for 25% and 6.3%, respectively (Figure 5.35). Some of them explained their reasons, and all the responses can be summarised as follows.

1. This kind of dashboard could be a useful learning tool. It not only simplifies the process of interpretation, but also shows users, as a reference, what types of visualisations and analyses could be used for these types of datasets. This is especially helpful for beginners who are learning to analyse genomic data.

2. This dashboard needs to be more flexible and sophisticated. One of the respondents stressed that this dashboard can do the job as a basic one with limited data, but it would not be needed again because of its limited information and functions. In addition, it was suggested that the dashboard would be more functional if users could import external data for comparison. Thus, it would be more useful if it had a function allowing users to import their own data and visualise it with the survival plot, heat map and box plot options.

3. Much more annotation and guidance are required. As mentioned above, the legend, labels and titles of the graphs in this dashboard need to be improved for better understanding. In addition, one of the respondents highlighted that it takes time for beginners to recognise the different cancer types and gene types when locating answers. This means that the annotation and guidance on the dashboard are not sufficient.


Figure 5.34 Familiarity of the similar dashboard

Figure 5.35 Willingness of using the dashboard in the future works (or study)


5.6 Further validation

As mentioned above, the box plot was interpreted better by participants. To validate and compare this further in a quantitative way, scores were calculated. In the questionnaire, questions with the same structure were asked in each section of graphs. Thus, the average score of each of three questions was calculated for comparison. The specific questions are:

1. Which level do you agree that these graphs are easy enough to read (or understand)?

2. Which level do you agree that these graphs present too much information?

3. Which level do you agree that these graphs have derived insight into researching lung adenocarcinoma?
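The scoring of these three questions can be illustrated with a short Python sketch. It assumes a five-point agreement scale coded from 1 (strongly disagree) to 5 (strongly agree); the answer labels and the response counts below are hypothetical and are not taken from the questionnaire data.

```python
from statistics import mean

# Assumed coding of a five-point agreement scale (labels are hypothetical).
LIKERT = {
    "Strongly disagree": 1,
    "Disagree": 2,
    "Neutral": 3,
    "Agree": 4,
    "Strongly agree": 5,
}


def average_score(responses):
    """Mean Likert score of a list of answer labels, rounded to two decimals."""
    return round(mean(LIKERT[r] for r in responses), 2)


# Hypothetical answers from 17 participants to one question for one graph type.
responses = (
    ["Strongly agree"] * 2 + ["Agree"] * 9 + ["Neutral"] * 4 + ["Disagree"] * 2
)

score = average_score(responses)
print(score)
```

A higher average means stronger agreement, so a high score is favourable for the first and third questions but unfavourable for the second ("too much information").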

Figure 5.36 Interpretation measures of different visualisation types

Figure 5.36 values (Easy level agreement / Too much information / Insight gaining agreement): Survival 1: 3.76 / 2.65 / 3.47; Survival 2: 3.88 / 2.47 / 3.88; Heat map: 3.24 / 2.76 / 3.29; Box Plot: 3.88 / 2.65 / 3.94.

This figure provides further and more direct validation that participants find it easier to understand and gain insight from the box plot and the second survival plot. In the first measure, “Easy level agreement”, the second survival plot and the box plot gain the same score (3.88), which is slightly higher than the first survival plot (3.76) and much higher than the heat map (3.24). In the second measure, “Too much information”, the second survival plot received the lowest score (2.47), and the box plot and the first survival plot come second (2.65), which means that more participants agree that the heat map (2.76) contains too much information for them. In the last measure, which asks how much insight was gained, the box plot got the highest score (3.94) and the heat map the lowest (3.29). This could indicate that the difficulty of interpreting a graph has a direct impact on how well it conveys information.

Figure 5.37 Interpretation measures of participants with different profession experience

Figure 5.37 values (Easy level agreement / Too much information / Insight gaining agreement): Less than 1 year: 3.68 / 2.75 / 3.68; 1 to 3 years: 3.77 / 2.80 / 3.61; 3 to 6 years: 4.00 / 2.53 / 3.75; 6 to 10 years: 3.25 / 2.50 / 1.75; More than 10 years: 3.72 / 2.64 / 3.77.

Figure 5.37 uses the same measures as the previous figure, but groups participants by category. As participants were asked how long they have been in their current profession, a comparison can be made to find out whether professional experience influences their understanding of the visualisations. Notably, those with less than 1 year of professional experience show lower scores in “Easy level agreement” and “Insight gaining agreement” and higher scores in “Too much information”. Those who have worked for more than 10 years received the highest score in insight gaining, but still found that the graphs contain too much information when compared with those who have been in their profession for 3 to 10 years. This suggests that, even though they could not interpret the visualisations completely, they could still use their professional experience to understand the graphs. Interestingly, participants with 3 to 6 years of professional experience obtained the best interpretation results overall.
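The grouped comparison described here amounts to averaging scores per category and measure. The Python sketch below shows one way this could be done; the records are hypothetical, and reporting the group size next to each mean is an addition of this sketch, included because some categories in this study contain a single respondent.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (years in profession, measure, Likert score 1-5).
# Invented for illustration; NOT the questionnaire data.
records = [
    ("Less than 1 year", "Easy level agreement", 4),
    ("Less than 1 year", "Easy level agreement", 3),
    ("3 to 6 years", "Easy level agreement", 4),
    ("3 to 6 years", "Easy level agreement", 5),
    ("3 to 6 years", "Too much information", 2),
    ("Less than 1 year", "Too much information", 3),
]


def group_means(rows):
    """Return {(group, measure): (mean score, number of answers)}."""
    buckets = defaultdict(list)
    for group, measure, score in rows:
        buckets[(group, measure)].append(score)
    return {key: (round(mean(vals), 2), len(vals)) for key, vals in buckets.items()}


summary = group_means(records)
for (group, measure), (avg, n) in sorted(summary.items()):
    print(f"{group} | {measure}: mean = {avg} (n = {n})")
```

Keeping n alongside each mean makes it easy to see when a category average rests on one or two answers and should be read with caution.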

Figure 5.38 Interpretation measures of participants with different levels of visualisation familiarity

Figure 5.38 values by legend category (Easy level agreement / Too much information / Insight gaining agreement): 1: 4.25 / 1.00 / 5.00; 2: 3.69 / 2.63 / 3.65; 3: 3.68 / 2.75 / 3.68; 4: 4.00 / 2.53 / 3.75; 5: 3.77 / 2.80 / 3.61.

To find out how familiarity with data visualisations affects interpretation of this dashboard, the measures were grouped by participants' answers to a question on whether they prefer visualisations or text when presenting their work.

On the scale of this question, “1” in the legend stands for those who rely more on visualisations, while “5” represents those who prefer to use text. Although there is no obvious linear trend in any measure, some findings are still worth stressing. The participant who relies completely on visualisations at work shows the highest score (5.00) in insight gaining and the lowest score (1.00) in the “Too much information” measure. On the other hand, those who completely prefer to use text for presenting have the lowest score (3.61) in insight gaining and agreed most (2.80) that the graphs contain too much information. For the “Easy level agreement” measure, the strongest preference for visualisations also shows the highest score (4.25).


Figure 5.39 Interpretation measures of participants with different domain knowledge

Figure 5.39 values (Easy level agreement / Too much information / Insight gaining agreement): Academic staff & students: 3.69 / 2.63 / 3.65; Data Science: 4.00 / 2.59 / 3.72; Others: 4.00 / 2.00 / 4.00.

In Figure 5.39, the measures are categorised by participants' domain expertise. The category “Academic staff & students” stands for participants selected from the staff list of the departments of Oncology & Metabolism and Biomedical Science. “Data Science” represents the participants from the MSc Data Science programme, while “Others” contains only one participant, a data scientist who has worked at Cancer Research UK for a year. This comparison is used to find out how different domain knowledge influences the interpretation of these visualisations. In Figure 5.39, “Others” presents the best comprehension, with the highest scores in both “Easy level agreement” (4.00) and “Insight gaining agreement” (4.00) and the lowest score (2.00) in the “Too much information” measure. However, it should be emphasised that, as this category represents the results of only one person, it cannot be treated as statistically significant.

Moreover, it is important to note that participants with data science expertise show better understanding than those with cancer or biomedical knowledge. This may be because, in this dashboard, users were not required to explore the genomics data sets. They could use the help note for basic understanding, so they did not need to comprehend the original data sets when interpreting the graphs. In addition, as they are familiar with data visualisation and data analysis, they could first recognise what kind of analysis was used in the graphs in order to understand what the visualisations express. This may also be why they found the graphs easier to interpret (4.00) and gained slightly more insight (3.72) than those with cancer or biomedical knowledge.

However, a review of the comments and recommendations shows that most of the suggestions from participants with cancer or biomedical knowledge focus on functional improvement rather than only on improving the visualisations. This means that these participants could think beyond the visualisations and the dashboard itself, and consider the data sets as well as the development of related functions.

Figure 5.40 Willingness of participants with different domain knowledge

Figure 5.40 values (willingness to use it in the future): Academic staff & students: 4.19; Data Science: 4.63; Others: 3.00.

Participants with different domain knowledge were asked about their willingness to use the dashboard in their future work. Participants with data science expertise obtained a higher score (4.63 > 4.19) than those from the departments of Oncology & Metabolism or Biomedical Science. This means that the former are more willing than the latter to use this type of dashboard when necessary.

Figure 5.41 Willingness of participants with different frequencies of creating visualisations

Figure 5.41 values (willingness to use it in the future) by frequency category: 3: 3.00; 4: 4.13; 5: 4.19.

Willingness was also measured among participants who create data visualisations with different frequencies, based on a question asking how often they create data visualisations. All participants in the study create data visualisations fairly regularly, as the minimum frequency reported is “3”. According to Figure 5.41, those who create data visualisations most frequently (5) gained the highest score (4.19) in willingness to use the dashboard, which is slightly higher (4.19 > 4.13) than those in the second rank of frequency (4). Those who create visualisations less often in their work received the lowest score (3.00).

6.0 Discussion and Dashboard design recommendations

6.1 Participants

The participants involved in this dashboard evaluation cover a broad and balanced range of backgrounds, which improves the diversity of this study. In terms of age, apart from the largest group of 18-25 (41.2%), the other groups, ranging from 26 to 65 years old, have similar numbers of respondents. In addition, students and working people account for the same proportion (47.1% each). The education level of participants ranges from master's to postdoctoral. Their professions include data science student, data scientist, and professor, academic staff and research technician from the Oncology & Metabolism and Biomedical Science departments. Moreover, their time in their profession ranges from less than 1 year to more than 10 years.

6.2 Academic research (working) experience impacts

Academic research experience or working experience shows a positive impact not only on interpreting the visualisations and using the dashboard, but also on gaining insight. As shown in Figure 5.37, those with less professional experience demonstrate a lower level of agreement on insight gaining from the visualisations than those who have worked for more than 10 years. In addition, those with rich research experience expressed less agreement that the visualisations contain too much information.

However, the gaps between the groups are small, except for the single respondent who has worked for 6 to 10 years. This may be because, even though participants come from different disciplines, all of them possess part of the knowledge required to interpret the visualisations. Additionally, the help note compensated for the gaps in some participants' knowledge to a certain extent. Furthermore, the limited sample size may affect the significance of the findings as well.

6.3 Domain knowledge familiarity impacts

It is interesting to find that participants with a data science background show slightly more positive results than those who are experts in cancer or biomedicine. This result, to some extent, supports the statement of Klonowska et al. (2015) that visualisations on genomics research portals are able to help those without a bioinformatics background. This is also the ambition of existing cancer research portals. However, further exploration shows that those with only a data science background cannot completely understand the meaning of the data, even if they could choose the right answers.

What should be emphasised here is that the participant who possesses both domains of knowledge (“Others” in Figure 5.39) presents a much better interpretation result. This, at the same time, provides evidence for the argument of Schroeder, Gonzalez-Perez and Lopez-Bigas (2013) that programming and bioinformatics knowledge are both necessary to use the visualisations efficiently. Consequently, completely interpreting this dashboard requires not only general knowledge of interpreting visualisations, but also knowledge related to oncology, genetics, cancer or biomedicine.

Additionally, visualisation usage is another kind of domain knowledge that could affect comprehension of the visualisations. As shown in Figure 5.38, the participant who completely prefers to use visualisations when presenting work shows clearly higher agreement on insight gaining, much less agreement that there is too much information, and higher agreement on ease of interpretation. Further exploration shows that this participant is a professor of bioinformatics imaging, which indicates frequent usage of visualisations. This can be linked to the study of Ellard (2015), which found that both familiarity with the dataset and experience of dashboard usage could influence users’ confidence in referencing the information gained from the dashboard. That is why the participants in this study were selected from the departments of Oncology & Metabolism and Biomedical Science and from the MSc Data Science programme, so that the gap in domain knowledge could be recognised to a degree.

6.4 Dashboard comprehension

In addition to domain knowledge and associated technical expertise, other elements could contribute to comprehending the dashboard more efficiently. In the evaluation, participants commented on the unclear annotation and limited guidance. Even those with expertise in cancer or biomedical research still need clear annotation to interpret the dataset and analysis methods used in the visualisations, not to mention beginners. One of the students commented that she did not understand the information in the visualisations, and another mentioned that it was time-consuming to identify the labels of cancer types and gene types when locating the answers.

Furthermore, the annotation and guidance (help note) should be designed according to the different types of graphs. In this dashboard, the second survival plot and the box plot demonstrate better interpretation results and the heat map obtained the worst. This means that more explanation of the heat map should be provided for users in this case.

The improved comprehension of the second survival plot may also result from learning by exploring. Elias (2012) states that the design principles of a dashboard influence the learning process of users because an interactive dashboard is a storytelling tool. Hence, when arranging the visualisations and designing the annotation or guidance of an interactive dashboard, the potential links and relationships between them should be considered.

6.5 Dashboard design recommendations

The annotation and guidance (help note) should be clear and precise. In this dashboard, due to the unclear legend and labels of the graphs, participants found it difficult to quickly locate the answers to the questions and to understand the visualisations. Specifically, as discussed above, the annotation and guidance should cater for users with different domain knowledge and technical expertise, both of which influence visualisation comprehension.

An interactive dashboard is more than just a combination of visualisations. It can be used as a self-learning tool to simplify the process of extracting information from large, multidimensional data (Few, 2007), such as genomics data. Participants commented that the dashboard created here not only helps with interpreting this type of data set, but also demonstrates approaches for visualising and analysing genomics data sets.

In addition, some participants mentioned that the dashboard should be more functional and intelligent. For example, it could be developed into a tool that is able to import data from existing genomics research portals such as ICGC and cBioPortal, as well as data from users. In this way, users could use the visualisations designed in this dashboard to make comparisons and explore the data, which would improve the level of comprehension.

The design of a dashboard for visualising genomic data should meet the different requirements of scientists from different domains. As presented in Figure 5.40, participants with different domain knowledge expressed different levels of willingness to use this kind of dashboard in their future study or work. Those with a data science background are more willing to use this dashboard. This may be because, for those without cancer or bioinformatics expertise, using the interactive dashboard to process data and access the visualisations is more direct and requires less specialist exploration.


7.0 Conclusions and Recommendations

7.1 Conclusions

By exploring various existing online portals for analysing and visualising genomics datasets, many attractive and sophisticated interactive visualisations can be found. This indicates that great progress has been made in visualising multidimensional genomics data, which has been helping cancer research scientists to make decisions more efficiently. However, little research has been conducted to evaluate whether these visualisations or portals are useful enough for scientists, whether their efficiency of use varies between different types of scientists, and how helpful they are. Thus, this study used the genomics datasets of TCGA to create an interactive dashboard containing several visualisations that aim to support research into lung adenocarcinoma. A systematic assessment was then set up to evaluate the efficiency of the dashboard. Some issues with the dashboard were identified, and they could inform the development and improvement of existing online cancer research portals.

Academic research experience or working experience is identified as having a positive impact on comprehending the visualisations as well as on using the interactive dashboard. For instance, those with rich professional experience showed less agreement that the visualisations contain too much information. In addition, compared with participants with oncological or biomedical expertise, those from the MSc Data Science programme performed better at locating the right answers. However, participants with oncological or biomedical knowledge thought beyond improvements to the dashboard interface and provided suggestions on how to make the dashboard more intelligent and practical.

In addition to domain knowledge, previous visualisation usage could also affect comprehension of the visualisations. The type of graph also leads to different interpretation results. Furthermore, annotation and guidance play an important role in understanding the visualisations, because the indistinct legend and labels in the dashboard led to inefficient dashboard usage. Consequently, given the impact of domain knowledge and technical expertise, the annotation and guidance (help note) should be designed according to the different types of graphs and users.

It is interesting to find that, during the dashboard assessment, this interactive dashboard was used as more than a visualisation tool: it also served as a self-learning tool for learning cancer-related knowledge as well as approaches for analysing and visualising genomics data sets. For this reason, some participants could identify more functional future developments for this type of dashboard and suggested that being able to import data for further exploration and to make comparisons with existing datasets (e.g. ICGC, TCGA) could be an active area. Other participants, for example, noted that it offers potential methods for visualising and analysing these types of datasets as a reference.


7.2 Recommendations for future research

7.2.1 For data samples and participants

The sample size should be increased to obtain more solid results and to allow statistical significance testing, as the number of participants in this study was limited. This, to some extent, limits the reliability of the findings. A larger dataset and many more types of cancer data should also be included, as this dashboard focused on only a few cancer types and genes.

7.2.2 For the interactive dashboard

A broader range of analysis approaches could be evaluated, which also means that a wider variety of visualisations should be created and validated by users. Before evaluation, more functions could be designed and added to make the dashboard more functional and intelligent.

7.2.3 For the evaluation methods

The approach to evaluating the dashboard should be improved by designing and using more reliable and multidimensional assessment indicators. In addition to the questionnaire, other assessment approaches (e.g. interviews, real-time monitoring) could be used to improve the measurement process.


8.0 References

Arcila, M. E., Drilon, A., Sylvester, B. E., Lovly, C. M., Borsu, L., Reva, B., ... & Ladanyi, M. (2015). MAP2K1 (MEK1) mutations define a distinct subset of lung adenocarcinoma associated with smoking. Clinical Cancer Research, 21(8), 1935-1943. Retrieved from: http://clincancerres.aacrjournals.org/content/21/8/1935.short

Cancer Genome Atlas Research Network. (2014). Comprehensive molecular profiling of lung adenocarcinoma. Nature, 511(7511), 543-550. Retrieved from: http://www.nature.com/nature/journal/v511/n7511/abs/nature13385.html

Ciriello, G., Cerami, E., Sander, C., & Schultz, N. (2012). Mutual exclusivity analysis identifies oncogenic network modules. Genome research, 22(2), 398-406. Retrieved from: http://genome.cshlp.org/content/22/2/398.short

Cline, M. S., Smoot, M., Cerami, E., Kuchinsky, A., Landys, N., Workman, C., ... & Hanspers, K. (2007). Integration of biological networks and gene expression data using Cytoscape. Nature protocols, 2(10), 2366-2382. Retrieved from: http://www.nature.com/nprot/journal/v2/n10/abs/nprot.2007.324.html

Cowin, P. A., Anglesio, M., Etemadmoghadam, D., & Bowtell, D. D. (2010). Profiling the cancer genome. Annual review of genomics and human genetics, 11, 133-159. Retrieved from: http://www.annualreviews.org/doi/full/10.1146/annurev-genom-082509-141536

Czubak, K., Lewandowska, M. A., Klonowska, K., Roszkowski, K., Kowalewski, J., Figlerowicz, M., & Kozlowski, P. (2015). High copy number variation of cancer-related microRNA genes and frequent amplification of DICER1 and DROSHA in lung cancer. Oncotarget, 6(27), 23399-23416. Retrieved from: http://europepmc.org/abstract/med/26156018

Devarakonda, S., Morgensztern, D., & Govindan, R. (2015). Genomic alterations in lung adenocarcinoma. The Lancet Oncology, 16(7), e342-e351. Retrieved from: http://www.sciencedirect.com/science/article/pii/S1470204515000777

Elias, M. (2012). Enhancing User Interaction with Business Intelligence Dashboards (Doctoral dissertation, Ecole Centrale Paris). Retrieved from: https://tel.archives-ouvertes.fr/file/index/docid/969170/filename/elias_thesis.pdf

Fernandez-Banet, J., Esposito, A., Coffin, S., Horvath, I. B., Estrella, H., Schefzick, S., ... & Roberts, P. (2016). OASIS: web-based platform for exploring cancer multi-omics data. Nature methods, 13(1), 9-10. Retrieved from: http://www.nature.com/nmeth/journal/v13/n1/full/nmeth.3692.html

Few, S. (2007). Dashboard confusion revisited. Perceptual Edge, 1-6. Retrieved from:

http://perceptualedge.com/articles/03-22-07.pdf

Futreal, P. A., Coin, L., Marshall, M., Down, T., Hubbard, T., Wooster, R., ... & Stratton, M. R.

(2004). A census of human cancer genes. Nature Reviews Cancer, 4(3), 177-183.

Retrieved from: http://www.nature.com/nrc/journal/v4/n3/full/nrc1299.html

Gao, J., Aksoy, B. A., Dogrusoz, U., Dresdner, G., Gross, B., Sumer, S. O., ... & Cerami, E.

(2013). Integrative analysis of complex cancer genomics and clinical profiles using the

cBioPortal. Science signaling, 6(269), pl1. . Retrieved from:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4160307/

Gonzalez-Perez, A., & Lopez-Bigas, N. (2012). Functional impact bias reveals cancer drivers.

Nucleic acids research, gks743. Retrieved from:

http://nar.oxfordjournals.org/content/early/2012/08/14/nar.gks743.short

Isenberg, P., Zuk, T., Collins, C., & Carpendale, S. (2008, April). Grounded evaluation of

information visualizations. In Proceedings of the 2008 Workshop on BEyond time and errors:

novel evaLuation methods for Information Visualization (p. 6). ACM. Retrieved from:

http://dl.acm.org/citation.cfm?id=1377974

Jordi, B., Giordano, C., Nicolas, S., Kavitha, V., Adam, A. M., Sungjoon, K., ... & Levi, A. G.

(2012). The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug

sensitivity. Nature, 483(7391), 603-607. Retrieved from:

http://www.nature.com/nature/journal/v483/n7391/full/nature11003.html

Klonowska, K., Czubak, K., Wojciechowska, M., Handschuh, L., Zmienko, A., Figlerowicz,

M., ... & Kozlowski, P. (2015). Oncogenomic portals for the visualization and analysis of

genome-wide cancer data. Oncotarget. Retrieved from:


Landi, L., Minuti, G., & Salvini, J. (2013). MET overexpression and gene amplification in

NSCLC: a clinical perspective. Dovepress, 4, 15-25. Retrieved from:

https://www.dovepress.com/met-overexpression-and-gene-amplification-in-nsclc-a-clinical-perspect-peer-reviewed-article-LCTT

Morse, E., Lewis, M., & Olsen, K. A. (2000). Evaluating visualizations: using a taxonomic

guide. International Journal of Human-Computer Studies, 53(5), 637-662. Retrieved from:


Olivero, M., Rizzo, M., Madeddu, R., Casadio, C., Pennacchietti, S., Nicotra, M. R., ... &

Comoglio, P. M. (1996). Overexpression and activation of hepatocyte growth factor/scatter

factor in human non-small-cell lung carcinomas. British journal of cancer, 74(12), 1862.

Retrieved from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2074802/

Prelich, G. (2012). Gene overexpression: uses, mechanisms, and interpretation. Genetics,

190(3), 841-854. Retrieved from: http://www.genetics.org/content/190/3/841

Rajaram, S., & Oono, Y. (2010). NeatMap-non-clustering heat map alternatives in R. BMC

bioinformatics, 11(1), 1. Retrieved from:

http://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-11-45

RPubs. (2015). Visualisations Vignette for Cancer Cell Lines Package, Retrieved December 2,

2015 from http://rpubs.com/chapmandu2/120228

Santarius, T., Shipley, J., Brewer, D., Stratton, M. R., & Cooper, C. S. (2010). A census of

amplified and overexpressed human cancer genes. Nature Reviews Cancer, 10(1), 59-64.

Retrieved from: http://www.nature.com/nrc/journal/v10/n1/full/nrc2771.html

Saraiya, P., North, C., & Duca, K. (2005). An insight-based methodology for evaluating

bioinformatics visualizations. Visualization and Computer Graphics, IEEE Transactions on,

11(4), 443-456. Retrieved from:

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1432690&tag=1

Schroeder, M. P., Gonzalez-Perez, A., & Lopez-Bigas, N. (2013). Visualizing multidimensional

cancer genomics data. Genome Med, 5(1), 9. Retrieved from:

http://repositori.upf.edu/handle/10230/23185

Siegel, R. L., Miller, K. D., & Jemal, A. (2015). Cancer statistics, 2015. CA: a cancer journal for

clinicians, 65(1), 5-29. Retrieved from:

http://onlinelibrary.wiley.com/doi/10.3322/caac.21254/

Stephens, P. J., McBride, D. J., Lin, M. L., Varela, I., Pleasance, E. D., Simpson, J. T., ... &

Greenman, C. D. (2009). Complex landscapes of somatic rearrangement in human breast

cancer genomes. Nature, 462(7276), 1005-1010. Retrieved from:


Suh, S. S., Yoo, J. Y., Cui, R., Kaur, B., Huebner, K., Lee, T. K., ... & Croce, C. M. (2014). FHIT

suppresses epithelial-mesenchymal transition (EMT) and metastasis in lung cancer through

modulation of microRNAs. PLoS Genet, 10(10), e1004652. Retrieved from:

http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004652


Travis, W. D. (2002). Pathology of lung cancer. Clinics in chest medicine, 23(1), 65-81.

Retrieved from: http://www.sciencedirect.com/science/article/pii/S0272523103000613

Vaske, C. J., Benz, S. C., Sanborn, J. Z., Earl, D., Szeto, C., Zhu, J., ... & Stuart, J. M. (2010).

Inference of patient-specific pathway activities from multi-dimensional cancer genomics

data using PARADIGM. Bioinformatics, 26(12), i237-i245. Retrieved from:

https://bioinformatics.oxfordjournals.org/content/26/12/i237.full

Weinstein, J. N. (2008). Biochemistry. A postgenomic visual icon. Science (New York, NY),

319(5871), 1772-1773. Retrieved from:

http://science.sciencemag.org/content/319/5871/1772

Wilkinson, L., & Friendly, M. (2012). The history of the cluster heat map. The American

Statistician. Retrieved from:

http://cda.psych.uiuc.edu/multivariate_fall_2013/lincoln/heatmap.pdf

Wu, Y., Liu, H., Shi, X., Yao, Y., Yang, W., & Song, Y. (2015). The long non-coding RNA HNF1A-AS1

regulates proliferation and metastasis in lung adenocarcinoma. Oncotarget, 6(11), 9160.

Retrieved from: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4496209/

Yin, T., Cook, D., & Lawrence, M. (2012). ggbio: an R package for extending the grammar of

graphics for genomic data. Genome Biol, 13(8), R77. Retrieved from:

http://download.springer.com/static/pdf/

Yin, S., Yang, J., Lin, B., Deng, W., Zhang, Y., Yi, X., ... & Zhao, G. (2014). Exome sequencing

identifies frequent mutation of MLL2 in non–small cell lung carcinoma from Chinese

patients. Scientific reports, 4. Retrieved from:

http://www.nature.com/articles/srep06036?WT.ec_id=SREP-631-20140819


9.0 Appendix

9.1 Mutated genes in lung adenocarcinoma (source: Devarakonda, Morgensztern & Govindan, 2015)

9.2 Template of the email for inviting participants for dashboard evaluation

9.3 Research Ethics

9.3.1 Ethics Application Form

9.3.2 Ethics Approval Letter

9.3.3 Consent Form

9.4 Questionnaire

9.5 Details of participants’ current profession

Information School.

Access to Dissertation

A Dissertation submitted to the University may be held by the Department (or School) within which the Dissertation was undertaken and made available for borrowing or consultation in accordance with University Regulations.

Requests for the loan of dissertations may be received from libraries in the UK and overseas. The Department may also receive requests from other organisations, as well as individuals. The conservation of the original dissertation is better assured if the Department and/or Library can fulfill such requests by sending a copy. The Department may also make your dissertation available via its web pages.

In certain cases where confidentiality of information is concerned, if either the author or the supervisor so requests, the Department will withhold the dissertation from loan or consultation for the period specified below. Where no such restriction is in force, the Department may also deposit the Dissertation in the University of Sheffield Library.

To be completed by the Author – Select (a) or (b) by placing a tick in the appropriate box

If you are willing to give permission for the Information School to make your dissertation available in these ways, please complete the following:

(a) Subject to the General Regulation on Intellectual Property, I, the author, agree to this dissertation being made immediately available through the Department and/or University Library for consultation, and for the Department and/or Library to reproduce this dissertation in whole or part in order to supply single copies for the purpose of research or private study

(b) Subject to the General Regulation on Intellectual Property, I, the author, request that this dissertation be withheld from loan, consultation or reproduction for a period of [ ] years from the date of its submission. Subsequent to this period, I agree to this dissertation being made available through the Department and/or University Library for consultation, and for the Department and/or Library to reproduce this dissertation in whole or part in order to supply single copies for the purpose of research or private study

Name Hailing Lu

Department Information School

Signed Hailing Lu Date 31/8/2016

To be completed by the Supervisor – Select (a) or (b) by placing a tick in the appropriate box

(a) I, the supervisor, agree to this dissertation being made immediately available through the Department and/or University Library for loan or consultation, subject to any special restrictions (*) agreed with external organisations as part of a collaborative project.

*Special restrictions

(b) I, the supervisor, request that this dissertation be withheld from loan, consultation or reproduction for a period of [ ] years from the date of its submission. Subsequent to this period, I agree to this dissertation being made available through the Department and/or University Library for loan or consultation, subject to any special restrictions (*) agreed with external organisations as part of a collaborative project.

Name

Department

Signed Date

THIS SHEET MUST BE SUBMITTED WITH DISSERTATIONS BY DEPARTMENTAL REQUIREMENTS.
