MSKCC Publish JSW

Pathway-Based Approach to Analyze Genome-Wide Association Study of Pancreatic Adenocarcinoma Survival Using Pre-Defined Gene Sets and Pathway Analysis Software

by Jeanette Wong

Jason A. Willis, Memorial Sloan Kettering Cancer Center

Robert J. Klein, Principal Investigator, Memorial Sloan Kettering Cancer Center

Advisor: Dr. Erin O’Leary, Ph.D.

Pathway-based Approach to Analyze Genome-Wide Association Study of Pancreatic Adenocarcinoma Survival Using Pre-Defined Gene Sets and Pathway Analysis Software

Jeanette Wong

Mentor: Jason Willis, Memorial Sloan Kettering Cancer Center

Robert J. Klein, Memorial Sloan Kettering Cancer Center

Advisor: Dr. O’Leary, Bronx High School of Science

Genome-wide association studies (GWAS) have identified single-locus markers and SNPs associated with pancreatic cancer; however, complex diseases such as pancreatic cancer develop due to multiple rare genetic mutations or variations rather than a single SNP or gene mutation. Pathway analyses draw supplemental information from GWAS results to further analyze and understand disease etiology. Using two publicly available pathway analysis software programs, GSA-SNP and ICSNPathway, standard parameters are set and the data are analyzed with computational algorithms. The goal of this research is to assess results from a GWAS of pancreatic cancer survival data in order to identify pathways associated with disease progression and to locate genetic mutations that predispose some individuals to pancreatic cancer and influence a patient’s overall prognosis. Results from this study provide insight into mechanisms of pancreatic cancer and their relationship to candidate pathways derived from pathway analyses. A literature survey confirms the significance and relevance of the candidate pathways to pancreatic cancer.

Pathway-based Approach to Analyze Genome-Wide Association Study of Pancreatic Adenocarcinoma Survival Using Pre-Defined Gene Sets and Pathway Analysis Software

Introduction

Pancreatic adenocarcinoma kills 95.4% of patients within five years of initial diagnosis.1 Pancreatic cancer is one of the deadliest cancers because symptoms do not become apparent until late stages, so only 10-20% of patients are candidates for resection. After resection, the median survival time is approximately 11-20 months, and the 5-year survival rate is approximately 7-25%. Resection is the only treatment with the potential to cure pancreatic adenocarcinoma. While treatments such as chemotherapy may improve survival by 10-15%, they cannot cure pancreatic cancer. Patients diagnosed at a late stage are usually not eligible for resection and have a median survival time of 6-11 months after diagnosis. Patients with metastatic pancreatic cancer have a median survival time of 2-6 months.2 The key to surviving pancreatic adenocarcinoma is early detection and diagnosis, when resection is still a possible treatment with the potential for a cure.

Approximately 10% of patients with pancreatic cancer have a family history of the disease.3 Familial pancreatic cancer is transmitted in an autosomal dominant manner, and approximately 17-19% of affected families carry BRCA2 mutations. Pancreatic cancer may also arise from other disease syndromes such as familial atypical multiple mole melanoma syndrome (FAMMM) and Peutz-Jeghers syndrome. Molecular alterations such as Kras (proto-oncogene) activation, p53 (tumor-suppressor gene) inactivation, and changes in SMAD4 and p16 signaling can be found in approximately 80% of pancreatic adenocarcinoma patients.2 Known germ-line mutations account for approximately 10-20% of the clustering of pancreatic cancer in families with a history of the disease.4 Germ-line refers to the DNA that offspring inherit from their parents, whereas somatic mutations are not inherited and arise in body cells during a person’s lifetime.

Pathway-based approaches examine whether a group of genes in the same functional biological pathway is associated with a trait or disease of interest.5, 6-9 Previous studies have hypothesized that disease risk may be caused by numerous rare variants, and pathway analysis can surface less obvious genetic factors associated with disease.10, 11 Previous studies have also shown that molecular pathways leading from benign lesions to malignant pancreatic cancer play a role in metastasis and therefore survival.12 A greater understanding of the molecular pathogenesis of pancreatic cancer may allow for the development of novel targeted treatments and the identification of early precursor lesions.13 Genome-wide association studies (GWAS) have typically focused on single markers, testing for association between a single-nucleotide polymorphism (SNP) marker and a trait of interest. An essential goal of GWAS is to find the genetic mechanisms that drive disease by identifying germ-line variants at loci associated with the disease. Pathway-based approaches have been developed that use biological knowledge of gene function to draw more power from GWAS result data. Previous GWAS pathway analyses have targeted diseases other than pancreatic cancer, such as breast cancer and Alzheimer’s disease.14, 15

In pathway analysis, a ‘pathway’ is defined as a set of related genes, and not necessarily a physically networked pathway. Using prior biological knowledge about overrepresented pathways in GWAS data, pathway classification analysis can help prioritize the pathways most likely to be associated with disease. By incorporating gene networks and pathway classification tools into the analysis of GWAS data, molecular pathways can take single-locus genome-wide association studies further in depth. Several pathway classification analysis tools and databases are currently available; these tools sort genes into pre-defined pathways of cellular processes based on genomic and molecular information. Parts of genomes are inherited together, so every SNP gives information about several other genetic variations on the same chromosome. To account for the linkage disequilibrium (LD) patterns within a genome, pathway analysis maps each SNP back to an LD gene block, which contains several genes within a specified parameter. A threshold p-value is selected in order to prioritize output.5 Larger pathways, whose gene sets contain more genes and therefore more genotyped SNPs, are expected to show more associated SNPs by chance alone.11
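The effect of pathway size on chance findings can be illustrated with a short calculation. The sketch below is not part of the study; it assumes that null SNP p-values are uniform and independent (LD between SNPs violates independence in real data), and the SNP counts in the demo loop are hypothetical.

```python
def expected_chance_hits(n_snps: int, alpha: float = 0.001) -> float:
    """Expected number of null SNPs with p < alpha (uniform p-values assumed)."""
    return n_snps * alpha


def prob_at_least_one_hit(n_snps: int, alpha: float = 0.001) -> float:
    """Chance that at least one of n independent null SNPs has p < alpha."""
    return 1.0 - (1.0 - alpha) ** n_snps


if __name__ == "__main__":
    # Hypothetical numbers of genotyped SNPs mapped to a pathway.
    for n in (100, 1000, 5000):
        print(n, expected_chance_hits(n), prob_at_least_one_hit(n))
```

Under these assumptions, both quantities grow with the number of SNPs in a pathway, which is why gene set size limits and multiple-testing corrections are applied.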

Pathway-based approaches assess whether the test statistics for a group of related genes show a consistent yet moderate deviation from chance. Genes are not fully functional in isolation, and complex molecular pathways tend to be more closely related to disease susceptibility and progression. For pathway-based association tests of GWAS data, a database of predefined gene sets for pathways has been created based on prior biological knowledge. The significance of each pathway can be summarized from the association of markers in or near the genes that make up that pathway. Multiple related genes in the same functional pathway may contribute to disease progression and pathogenesis. Pathway analysis complements conventional GWAS by identifying additional susceptibility genes, and it can be used to investigate missing heritability in genome-wide association studies.16

Testing one gene marker at a time has been identified as a problem: coherent patterns cannot be found among the significant genes, which makes biological interpretation of GWAS difficult. Gene set analysis (GSA) methods use pre-defined gene sets that are grouped by biological function and expression. GSA determines the significance of these pre-defined gene sets with respect to an outcome variable. In this study, the outcome variable is a quantitative measure of disease survival. Gene sets can capture coordinated expression patterns among genes of interest. An essential goal of genome-wide association studies is to prioritize the biological functions or related biological networks tied to a trait of interest. Pre-defined gene sets or pathways can further refine the results of a GWAS.17 The goal of this research is to assess the significance of pathways from germ-line mutation studies in order to identify pathways significantly associated with pancreatic cancer. The results provide insight into mechanisms of pancreatic cancer and their relationship to candidate pathways derived from pathway-based analyses. It is hypothesized that pathway analysis based on results from genome-wide association studies will be a reliable indicator of candidate pathways associated with the development and metastasis of pancreatic cancer.

Materials and Methods

Input for Pathway Analysis - Pancreatic Cancer Survival GWAS

A genome-wide association study (GWAS) was conducted prior to pathway analysis. DNA samples from 252 patients diagnosed with pancreatic adenocarcinoma were collected via blood or cheek cell samples. The 252 patients were enrolled in a study at a research institution and consented to provide a DNA sample. The DNA was genotyped using an Illumina CNV370-duo SNP genotyping array (~340k SNP markers). After the DNA samples were collected, clinical information for each patient was tracked (e.g., survival time, treatment plan). The GWAS results used here are p-values assigned to SNPs, without individual-level genotypes.

Input for Pathway Analysis – Example Dataset: Height

GSA-SNP provides an example dataset of 100 DNA samples from a Korean population, in the format of SNPs and p-values for height (P_GWAS < 4 × 10^-6). SNPs were obtained by computing labels of SNP microarray data from the Korean Association Resources (KARE) project, and genotyping was then completed with PLINK software. The genotypes of a total of 2,168,896 SNPs were imputed using PLINK, and 799,492 of them passed P_Height > 1 × 10^-6. The p-values for all resulting SNPs were gathered and used as the input variable.17

Standard Thresholds and Parameters for Pathway Analyses

The p-value of the input data is a probability statement used to test the null hypothesis: the smaller the p-value, the stronger the evidence against the null hypothesis. The p-value is compared to a significance threshold. For the pathway analyses in this study, the standard cutoff is 0.001 (1 × 10^-3), and any p-value below this threshold is considered statistically significant. When SNPs are mapped to genes, a SNP located between the 5’ end of the first exon and the 3’ end of the last exon of a gene is assigned to that gene. A SNP located within 20 kb upstream or downstream of those ends is also assigned to the gene, in order to account for surrounding regulatory regions and linkage disequilibrium (LD) neighborhoods. (Linkage disequilibrium occurs between disease alleles and marker alleles, so GWAS can identify disease-associated alleles by mapping from significant SNPs.)10 If a SNP is assigned to more than one gene, it is subject to reanalysis. The Gene Ontology gene set database is used to provide a broad spectrum of gene sets for enrichment testing in genomics research.1 The standard gene set size of 10-200 genes (minimum-maximum) per pathway was selected to avoid overly narrow or overly broad functional categories in the Gene Ontology database. The q-value in GSA-SNP represents the false discovery rate (FDR), a correction applied to limit false positive results. The standard FDR cutoff for pathway-based analysis was set at ≤0.05.
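The mapping and size-filter rules above can be sketched in a few lines. This is an illustration under stated assumptions, not the code used by GSA-SNP or ICSNPathway: the `Gene` and `Snp` record types and both function names are hypothetical, coordinates are treated as simple forward-strand base-pair positions, and a SNP that falls near several genes is assigned to all of them rather than being reanalyzed.

```python
from collections import defaultdict
from typing import NamedTuple

PADDING_BP = 20_000   # +/-20 kb padding around each gene, as in the text
MIN_SET_SIZE = 10     # minimum genes per gene set
MAX_SET_SIZE = 200    # maximum genes per gene set


class Gene(NamedTuple):
    name: str
    chrom: str
    start: int  # start of the first exon (hypothetical forward-strand coordinate)
    end: int    # end of the last exon


class Snp(NamedTuple):
    rsid: str
    chrom: str
    pos: int


def map_snps_to_genes(snps, genes, padding=PADDING_BP):
    """Return {gene name: [rsids]} for SNPs inside a gene or within `padding` bp of it."""
    mapping = defaultdict(list)
    for snp in snps:
        for gene in genes:
            same_chrom = snp.chrom == gene.chrom
            if same_chrom and gene.start - padding <= snp.pos <= gene.end + padding:
                mapping[gene.name].append(snp.rsid)
    return dict(mapping)


def filter_gene_sets(gene_sets, min_size=MIN_SET_SIZE, max_size=MAX_SET_SIZE):
    """Keep only gene sets whose gene count lies within [min_size, max_size]."""
    return {name: members for name, members in gene_sets.items()
            if min_size <= len(members) <= max_size}
```

The nested loop is quadratic and only suitable for small examples; a real tool would index genes by chromosome and position.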

GSA-SNP: Gene Set Analysis with SNP Input

Gene set analysis (GSA) has been introduced to genome-wide association studies with the goal of identifying associations between disease and groups of genes that share a common biological function. GSA can substantially increase the power of GWAS, as association patterns may be found across gene sets. The data input windows are shown in Figures 1, 2, 3, and 4, which correspond to the steps taken to enter properly formatted data. GSA-SNP is computational software that is freely available, along with an example dataset (http://gsa.muldas.org).17

The input for GSA-SNP was a list of p-values, one for each SNP from a GWAS. Gene set analysis begins by taking the “–log” of every individual SNP p-value. A feature of GSA-SNP is the use of the “k-th best p-value”, where k = 1, 2, 3, 4, or 5, for each gene, which allows gene scores to be more evenly distributed. For this experiment, k=2 (the second-best SNP in each gene) was set as the standard for summarizing the values of multiple SNPs. If k=1 were used, each gene’s significance would rest on its single best SNP alone. Each SNP is mapped to its nearest gene within 20 kb. Larger k-values tend to lower the power of the results.17
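The k-th best p-value summary can be illustrated as follows. This is a minimal sketch rather than GSA-SNP's own code: the function name `gene_score` is hypothetical, the logarithm base (10) is an assumption since the text only says “–log”, and the fallback for genes with fewer than k SNPs is also an assumption.

```python
import math


def gene_score(snp_pvalues, k=2):
    """Score a gene as -log10 of the k-th smallest p-value among its SNPs.

    Assumption: if the gene has fewer than k SNPs, the largest available
    p-value is used instead of discarding the gene.
    """
    if not snp_pvalues:
        raise ValueError("gene has no mapped SNPs")
    ranked = sorted(snp_pvalues)            # smallest (best) p-value first
    kth = ranked[min(k, len(ranked)) - 1]   # k-th best, 1-indexed
    return -math.log10(kth)
```

With k=2, a gene whose SNPs have p-values 1e-6, 1e-3, and 0.5 is scored from 1e-3, so one isolated extreme SNP does not dominate the gene score.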

Procedures for using GSA-SNP

1. Run GSA-SNP: Execute run.sh (Unix/Linux) on a computer. (Figure 1)

2. Breakdown of pathway analysis using GSA-SNP program

- Click the “…” button and choose a data file.

- Click the “Upload” button to detect the data type (SNP, Gene, or Haplotype).

- The program will show relevant input options.

The input to GSA-SNP is a SNP data file. The program automatically detects the data type by reading the first ten lines of the input file. The row identifier for SNP data is rs#####. The first column of the input file is the rs number of a SNP, and the second column is the p-value for that SNP. Figure 1 shows the initial window after the GSA-SNP Java file is executed. Figure 2 shows the pop-up window that appears after the “open” button is clicked. Figure 3 shows how the parameters for analysis are set; in this experiment, the standard parameters described above were entered. Figure 4 shows the window once all data and parameters have been properly entered and the pathway analysis is ready to run.
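A two-column SNP input file of the kind described above could be checked and read along these lines. The sketch is illustrative only: the function names are hypothetical, whitespace-separated columns with no header row are assumed, and the detection rule (every one of the first ten non-empty rows starts with an rs number) is a guess at the behavior described, not GSA-SNP's actual logic.

```python
import re

RS_ID = re.compile(r"^rs\d+$")  # row identifier pattern for SNP data


def looks_like_snp_file(lines, n_check=10):
    """Guess whether the input is SNP data from its first `n_check` non-empty rows."""
    rows = [ln.split() for ln in lines if ln.strip()][:n_check]
    return bool(rows) and all(RS_ID.match(row[0]) for row in rows)


def parse_snp_pvalues(lines):
    """Parse 'rsID p-value' rows into {rsID: p}; malformed rows are skipped."""
    pvalues = {}
    for ln in lines:
        parts = ln.split()
        if len(parts) < 2 or not RS_ID.match(parts[0]):
            continue
        try:
            p = float(parts[1])
        except ValueError:
            continue
        if 0.0 < p <= 1.0:  # p = 0 would break the -log transform
            pvalues[parts[0]] = p
    return pvalues
```

Both functions accept any iterable of text lines, so they work equally on an open file handle or a list of strings.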

Figure 1. Click the “…” button to manually select the input data file.

Figure 2. Select a file. Click the “open” button, then “upload”.

Data Parameters: GSA-SNP applies “–log” to every p-value in the input data. For SNP data, padding is used when mapping SNPs to genes to account for LD; ±20 kb is the standard threshold set.

Figure 3. Gene set parameters of the Gene Ontology (GO) database. Gene set size is set to range from 10 (minimum) to 200 (maximum) genes per gene set to avoid overly narrow or broad functional gene sets. The q-value is the false discovery rate (FDR), and is set by default to ≤ 0.05.

Figure 4. The analysis begins promptly when “Run” is clicked. The progress of the analysis is shown in the bottom bar of the program window. When the analysis is complete, results appear on the right of the program window. Within the GSA-SNP software program, the Z-statistic method is employed to provide a corrected p-value. In the output, the “z-score” column holds the results of this algorithm.

3. Results: When the analysis computation is complete, the results appear on the right side of the program window, formatted into rows and columns. The columns are: gene set name, gene count in each gene set, gene set size, z-score, corrected p-value (q-value), and names of the genes within each gene set.
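To make the z-score and q-value columns concrete, the sketch below shows one common form of a gene set Z-statistic together with a Benjamini-Hochberg FDR adjustment. Both are assumptions for illustration: the text names the Z-statistic method and an FDR correction but does not give formulas, so the exact computation inside GSA-SNP may differ, and all function names here are hypothetical.

```python
import math
from statistics import mean, pstdev


def gene_set_z(all_scores, set_genes):
    """Z = (mean score in set - mean over all genes) * sqrt(m) / sd over all genes."""
    in_set = [all_scores[g] for g in set_genes if g in all_scores]
    if not in_set:
        raise ValueError("no scored genes in this gene set")
    background = list(all_scores.values())
    sigma = pstdev(background)
    if sigma == 0:
        raise ValueError("gene scores have zero variance")
    return (mean(in_set) - mean(background)) * math.sqrt(len(in_set)) / sigma


def upper_tail_p(z):
    """One-sided p-value P(Z > z) under a standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg q-values in the same order as the input p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value stays monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues
```

In this form, gene sets with a q-value at or below the 0.05 cutoff would be the ones reported as significant.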

Figure 5. The computation results of GSA-SNP of pancreatic cancer survival data with the application of the Z-statistic method and all standardized parameters. Results are ordered in decreasing significance of pathways based on p-value of gene sets.

ICSNPathway: Identification of Candidate Causal SNPs and Pathways

ICSNPathway is a freely available online web server developed to analyze SNPs from GWAS and identify pathways associated with a trait of interest. ICSNPathway takes a distinctive approach to linkage disequilibrium (LD) analysis: it applies HapMap population data to map SNPs to genes more accurately for pathway analysis. Figure 6 shows the ICSNPathway web page with all the parameters set for the pathway analysis. Figure 6 is not the initial web page, but it closely resembles the input page shown in Figure 7. Figure 8 shows how the data are analyzed and how the chosen parameters are applied within ICSNPathway itself. Results of the pathway analysis can be downloaded as a text file, as displayed in Figure 9. Output data are ordered from lowest to highest p-value. With properly prepared input data and parameters, ICSNPathway completes an analysis within minutes.18

Figure 6. Parameters set for the KARE Height Data Input. Standardized set parameters are applied to analysis.

Figure 7. The home page of the ICSNPathway web server program. All input information is completed before the analysis is started by clicking “RUN”. The GWAS SNP p-value file is uploaded, the LD neighborhood parameters are selected, and the standardized parameters chosen for this experiment are applied.

Figure 8. Diagram of how ICSNPathway functions overall.18

Figure 9. After the ICSNPathway analysis is completed, the output is listed on the result page online. The output is also available to be downloaded as a text file. The results are categorized in columns: Index (ranking), Candidate causal pathway, Gene set URL, Description of Gene Set, Nominal P-value, and FDR.

Results

Output data from the GSA-SNP software are formatted as a spreadsheet of rows and columns, so the data can be sorted in different ways (e.g., ascending or descending by p-value or z-score). Table 1 compares the output values of the two pathway analysis tools in order to assess the stability of pathway analyses across different tools. Figures 10 and 11 are graphs showing how far the results of the two programs diverge and where their output values are similar.

Results of Two Different Pathway Tools Analyzing the Same GWAS Data Input

All data presented in the results of this study are part of a broader genetics study analyzing germ-line pathways that trigger, or are associated with, an inherited trait or the development of disease.5

Table 1. Comparison of pathway analysis results from two software programs, GSA-SNP and ICSNPathway. Similar pathway names, their rankings, and their p-values are organized in the table.

Figure 10. Ranking comparison of overlapping pathways appearing in the results of both GSA-SNP and ICSNPathway for the control KARE Height dataset. There is no consensus in rankings, even though the same parameters were set.

Figure 11. Pathway p-value comparison of overlapping pathways appearing in the results of both GSA-SNP and ICSNPathway for the control KARE Height dataset, demonstrating how applying different algorithms yields different computation results.
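One way to put a number on the rank agreement between two tools is Spearman's rank correlation over the pathways both tools report. The study itself does not report such a statistic; the sketch below is a suggestion only, the function name is hypothetical, and it assumes that no two pathways share a rank within a tool.

```python
def overlap_rank_agreement(ranks_a, ranks_b):
    """Spearman's rho over pathways present in both {pathway: rank} mappings.

    Ranks are recomputed within the overlap so both lists run 1..n.
    Assumes no tied ranks within either tool's output.
    """
    shared = sorted(set(ranks_a) & set(ranks_b))
    n = len(shared)
    if n < 2:
        raise ValueError("need at least two overlapping pathways")

    def rerank(ranks):
        ordered = sorted(shared, key=lambda name: ranks[name])
        return {name: i + 1 for i, name in enumerate(ordered)}

    ra, rb = rerank(ranks_a), rerank(ranks_b)
    d_squared = sum((ra[name] - rb[name]) ** 2 for name in shared)
    rho = 1.0 - 6.0 * d_squared / (n * (n * n - 1))
    return shared, rho
```

A rho near 1 would indicate that the tools order the shared pathways similarly, while a value near 0 would match the lack of consensus described in Figure 10.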

Comparison of GSA-SNP and ICSNPathway

The results from ICSNPathway are less specific than those of GSA-SNP. With the same standardized parameters, similar results might have been expected, but there is very little overlap. GSA-SNP uses a more finely subdivided Gene Ontology database, in which certain pathways are classified in greater detail and carry different p-values. This may have skewed the comparison of the two software programs. Regardless, there is some consensus on the top pathways in the output of both pathway analyses. Table 4 shows the output values of the GSA-SNP pathway analysis, ordered by ascending p-value. Table 5 shows the top-ranked pathways and their statistical values as computed by GSA-SNP. Figure 10 shows the rankings of overlapping pathways appearing in the results of both GSA-SNP and ICSNPathway. Figure 11 compares the p-values of overlapping pathways appearing in the results of both programs, demonstrating the different values.

Table 4. List of the highest-ranked Gene Ontology categories for SNP association with the GWAS pancreatic cancer survival data (p-values ≤ 0.001) from the GSA-SNP pathway analysis. A total of 253 pathways appeared in the results of the GSA-SNP pathway analysis of the pancreatic cancer survival GWAS results, and only the most highly significant pathways were selected for a literature survey to search for relevance to pancreatic cancer.

Table 5. After the top-ranked pathways for pancreatic cancer were selected from the GSA-SNP pathway analysis, a literature survey was completed. The survey used search engines to find literature containing terms such as: pancreatic cancer, metastasis, survival, progression, and the name of a pathway. This table cites one example of published literature found in the search, as evidence supporting the association between a pathway and pancreatic cancer survival factors (metastasis, tumor growth, cancer progression, cell regulation, cellular invasion, etc.). The top ten most strongly associated pathways are presented.

Discussion

Interpretation of Results

The results of this study identify candidate pathways associated with pancreatic cancer survival, metastasis, carcinogenesis, and the underlying biological and genetic mechanisms. The analysis does not necessarily identify the most highly associated pathways accurately, as the rankings of pathways do not correlate with the pathogenesis of the targeted disease. This study supplements other findings within the same research discipline, which have held that somatic mutations are predominantly responsible for the development, risk, and metastasis of pancreatic cancer.

Analysis in Context

Do pathway analyses effectively further the findings from genome-wide association studies? They do, to an extent. Because the two programs produced numerous differences in output rank even though the same input data and parameters were used, the results are ambiguous. In addition, the false discovery rates indicate that the pathway analyses may contain false positives, so the output values may not be reliable or biologically correct.

How reliable are the quantitative results from the two pathway-based analysis programs used? A problem arises because some output gene set names differ between the programs even though the gene sets may share some, but not all, of their genes. Further biological research is necessary to establish whether certain genes belong to a given gene set.

Are pathway analyses an efficient and meaningful means of leveraging GWAS of diseases other than pancreatic cancer? The goal of this study was to analyze the data with as little bias as possible by using standardized thresholds, so that the data are represented as meaningfully as possible in the analysis. Since it has been claimed that complex diseases such as cancer are driven by multiple rare pathways or genetic mutations, pathway analysis tools are a possible solution to a limitation of GWAS, which identifies germ-line variants and SNPs but not the underlying pathways involved.

Differences of How a Pathway is Defined

One possible explanation for the differences in outcome between GSA-SNP and ICSNPathway when analyzing the same dataset is that the tools use differently updated Gene Ontology databases, which differ in, among other things, their gene sets, pathways, and the genes within each gene set; the tools also use different statistical algorithms to prioritize outcomes. In ICSNPathway, the server first searches for SNPs in linkage disequilibrium with the most significant SNPs, based on the linkage disequilibrium of the European American (CEU) HapMap population. By doing so, the genetics of human biology are better assessed. Second, ICSNPathway annotates SNPs with functions in order to link the functional SNPs to their corresponding genes and pathways. Afterwards, pathway-based analysis of the GWAS SNP p-values is performed using the Gene Ontology database to identify candidate pathways and SNPs that may correspond to a biological trait of interest, such as disease.19

It is difficult to compare results from different pathway analysis tools accurately, as there are different definitions of exactly what a pathway is and what it contains. Since pathways interact with networks to carry out biological functions, one pathway alone may not be enough to contribute to disease etiology. Table 6 lists the disadvantages and limitations of pathway analysis methods and how these methods can be improved so that future pathway analyses offer better results. Although well-defined pathways have yet to be established and more consensus is needed on how a pathway is defined, pathway analyses can make credible computational predictions of how biological processes (e.g., cancer metastasis) are associated with cellular and molecular pathways.5

Pathway-based association approaches may be susceptible to false positive results, but their findings can be replicated with independent data sets. Pathway analysis is relatively flexible, as it can be conducted on GWAS data from different genotyping platforms. Pathway-based approaches are a possible means of identifying novel genes or gene sets that contribute to disease pathogenesis.

Table 6. Limitations of Pathway-Based Analysis Approaches

Problem: Outcome differences between GSA-SNP and ICSNPathway

Description:

• Different gene updates of gene builds: certain genes may not be recognized

• Different human reference gene sets: there may be a quantitative difference in the number of genes in reference sets

• Different statistical algorithms in tools: Z-Statistic, GSA Restandardization, HapMap

• Different understanding of pathways: names of pathways and gene sets, the number of genes in each gene set, and more minute classification of pathways

Explanation:

• Freely available pathway analysis tools may not be up-to-date, and variations between different tools may exist and affect outcome

• Reference gene sets should be easily accessed

• There needs to be more consensus on pathway classification, and greater understanding of how the pathways behind biological processes form networks and interact

Problem: Over-represented Pathways

Description:

• There may be significantly overrepresented pathways within a pathway analysis tool, in which gene sets appear to be importantly classified in output, but no correlation is found between GWAS dataset input and pathway outcome results. Larger pathways tend to give larger outcomes, and the numbers that are compared become larger

Explanation:

• Statistically or programmatically, there may be flaws that bias a tool towards specific pathways within its database and reference gene set list

In tumor progression, molecular changes occur and improve specificity without significantly compromising sensitivity. Successful molecular screening can be defined as the identification of genetic alterations that occur at a specified point in DNA. Pathway analysis is still relatively new to genomic studies, as it is a more integrated approach that uses multiple data types together in the same pathway-based tool. Since complex diseases such as cancer can involve multiple pathways, including interactions between the various affected genes associated with disease development, it may be ideal to combine several analysis tools for genome-wide association studies. An important limitation of pathway-based analysis of GWAS is the incomplete annotation of the human genome. The function of many human genes is currently unknown, which prevents those genes from being classified into pathways. Overall, there is no specifically defined standard for what a pathway is, and as a result, different software using different databases will produce different analysis results. Another limitation of this study is the lack of validation of results with the control dataset. Improvements in the organization and consensus of gene-set pathway databases may greatly improve understanding of the cellular mechanisms of genes, pathways, and disease association.14

Conclusions and Future Work

There is evidence that candidate pathways from the GSA-SNP pathway analysis of the pancreatic cancer survival GWAS results are associated with pancreatic cancer. The reliability of the pathway analysis results was assessed through a comprehensive literature survey using the Gene Ontology terms of the gene sets together with pancreatic cancer as a key term. Each candidate pathway was supported by multiple published papers describing an association between the pathway and the disease. The example dataset of KARE Height GWAS data served as a control dataset for this study to establish a case-control experiment. However, there are no established ways to determine the significance of the association between pathways, gene sets, and a specific disease or biological trait.

By using pathway analysis tools to further analyze genome-wide association study results, overrepresented cell processes can be classified. Pathway analysis is a potential way to gain greater understanding and value from GWAS, and it can prove useful for gaining insight into disease etiology, risk, diagnosis, and survival time.5 Prior studies have suggested that GWAS are insufficiently powered to detect small main effects (overrepresentations) of genes, and that gene-gene interactions may have a significant role in disease pathogenesis, so GWAS alone do not realize the full potential of associating genes in pathways with disease. The application of pathway analyses following GWAS can be considered a novel extension of traditional genome-wide association study methods.20

GWAS aim to find associations between disease phenotypes and genetic alterations. Pathway analyses offer a simple supplement to traditional genetic association studies. As a result, pathway analyses may identify relevant gene sets and subsets in pancreatic cancer phenotypes. The use of pathway-based analysis in this study proved useful for examining the effects of a pathway or group of genes on disease, through the testing of established gene sets from the Gene Ontology database. Results from this study offer insight into how pathway analysis methods could increase the power of GWAS results to detect underlying associated pathways.21

The results of a GWAS of pancreatic cancer survival in a population of unrelated patients suggest that patients with pancreatic cancer share a hereditary cancer phenotype. Pathway-based approaches extract more biological information from GWAS, making GWAS results more powerful for gaining insight into disease pathogenesis. Further development of sequencing and analysis tools will continue to improve the power of pathway analysis and of genome-wide association studies.

Future steps for pathway analysis include improving computational predictions of cellular processes from genomic and molecular biology data, as presented in Table 6. Research must continue into the molecular components that underlie the biology of pancreatic cancer. Developing more sensitive and specific molecular approaches to understanding the disease is essential to gaining knowledge of the pancreatic cancer progression model. 22 GWAS combined with pathway analysis provides insight into the genetics of cancer in individual patients, which can improve early prevention and diagnosis. Knowledge of the underlying biological, molecular, and functional pathways can allow novel gene-targeted therapies to be designed and developed. GWAS can identify associations among common alterations within a genome using high-density SNP markers. 10

A preliminary study has been carried out to focus on the genes within the gene sets identified from the pancreatic cancer survival data, in order to see which genes have the greatest statistical and quantitative correlation with the biology of pancreatic cancer and its related diseases (e.g., diabetes and pancreatic inflammation). This study is an extension of the pathway analysis results; its statistical methods were based on the p-values and the number of genes within each gene set and/or pathway.
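As an illustration only, the idea of ranking genes within a gene set from their p-values can be sketched in Python. This is not the statistic used in the preliminary study, which is not specified here beyond its reliance on p-values and gene counts. The function name `rank_genes`, the scoring rule (minus log10 of a gene's smallest SNP p-value), and the gene names and values are all assumptions made for the example.

```python
import math


def rank_genes(gene_to_pvalues):
    """Score each gene by -log10 of its smallest SNP p-value and return
    (gene, score) pairs sorted from strongest to weakest signal.

    The scoring rule is an assumption chosen for illustration.
    """
    scores = {
        gene: -math.log10(min(pvals))
        for gene, pvals in gene_to_pvalues.items()
        if pvals  # genes with no mapped SNPs are skipped
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical gene set: gene name -> p-values of the SNPs mapped to it.
example_gene_set = {
    "GENE_A": [0.01, 0.2],
    "GENE_B": [0.0001],
    "GENE_C": [],
}

ranked = rank_genes(example_gene_set)
for gene, score in ranked:
    print(f"{gene}\t{score:.2f}")
```

In this toy gene set, GENE_B ranks above GENE_A because its smallest p-value is lower, and GENE_C is left out because no SNPs map to it. A real analysis would also need to account for gene size and the number of SNPs per gene.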

Further pathway analysis experiments on GWAS data can be performed to validate the results of this study and to support its conclusions with more reliable, and therefore more significant, results. For example, a larger sample of SNPs can be used as input to the same analysis methods, providing a broader library of disease pathogenesis information. Comparing the associations of somatic mutations with those of germ-line mutations can also offer greater comprehension of disease mechanisms. Another future improvement would be greater consensus on what a pathway is, so that databases containing p-values, pathways, and gene sets become better defined and yield more consistent results regardless of differences in software algorithms. A further extension could simply be to modify the standardized parameters. As the SNP-to-gene mapping and the maximum p-value threshold are changed, pathway analyses may give results that differ significantly, both quantitatively and biologically. Questions that further research could answer include: (1) Can pathway analyses determine and rank which genes within gene sets are most highly associated with disease incidence? (2) Which pathways or genes interact with one another to trigger disease or advance disease stages? (3) Are single-locus mutations identifiable and practical enough for understanding disease? Better insight into pancreatic cancer can be gained as pathway analysis methods and concepts advance. If a locus and gene are identified, the gene can be targeted in the search for therapies to treat and cure cancer.
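The sensitivity of pathway analysis to its parameters can be illustrated with a minimal Python sketch. It is not the GSA-SNP or ICSNPathway implementation. The SNP identifiers, positions, p-values, gene boundaries, the function name `map_snps_to_genes`, and the `padding` and `p_max` parameters are all hypothetical, chosen only to show how changing the SNP-to-gene mapping window and the maximum p-value changes which genes carry signal into a pathway analysis.

```python
# Hypothetical toy data: (SNP id, position on one chromosome, GWAS p-value).
SNPS = [
    ("rs_a", 1_000, 0.0004),
    ("rs_b", 12_500, 0.03),
    ("rs_c", 31_000, 0.0009),
    ("rs_d", 70_000, 0.20),
]

# Hypothetical gene boundaries on the same chromosome: gene -> (start, end).
GENES = {
    "GENE1": (5_000, 10_000),
    "GENE2": (40_000, 50_000),
}


def map_snps_to_genes(snps, genes, padding, p_max):
    """Return gene -> list of SNP ids whose p-value is at most p_max and
    whose position lies within `padding` base pairs of the gene."""
    mapped = {gene: [] for gene in genes}
    for snp_id, position, p_value in snps:
        if p_value > p_max:
            continue  # SNP does not pass the p-value cutoff
        for gene, (start, end) in genes.items():
            if start - padding <= position <= end + padding:
                mapped[gene].append(snp_id)
    return mapped


# Setting 1: narrow mapping window, lenient p-value cutoff.
narrow = map_snps_to_genes(SNPS, GENES, padding=5_000, p_max=0.05)

# Setting 2: wide mapping window, strict p-value cutoff.
wide = map_snps_to_genes(SNPS, GENES, padding=20_000, p_max=0.001)

print(narrow)
print(wide)
```

With the first setting only GENE1 receives SNPs (rs_a and rs_b). With the second, GENE1 keeps only rs_a and GENE2 gains rs_c. The same GWAS output therefore yields different gene-level inputs, and so potentially different candidate pathways, depending on the parameters chosen.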

Development of various computational tools to further analyze biological databases such as Gene Ontology will allow greater understanding of results from genomic studies. Together with a comprehensive examination of the relevant published literature, additional experimental validation can confirm the results of computational algorithms that analyze genomic and gene-profiling data, allowing direct comparison and accounting for the biological complexities of cancer survival mechanisms that involve specific genes, gene types, and gene interactions. Ultimately, if the proper signaling pathways for pancreatic cancer and other types of cancer are identified, software programs can be developed to provide accurate statistical analysis. Gene networks may also reveal abnormalities in gene signaling, leading to the possibility of rational therapeutic gene selection and of finding a potential identifying mechanism for pancreatic cancer patients.

References

1. Ries LAG, Eisner MP, Kosary CL, Hankey BF, Miller BA, Clegg L, Mariotto A, Feuer EJ, Edwards BK (eds). SEER Cancer Statistics Review, 1975-2002, National Cancer Institute. Bethesda, MD, <http://seer.cancer.gov/csr/1975_2002>
2. Thomasset S.C., Lobo D.N. “Pancreatic Cancer”. Hepatobiliary Surgery II (2010) 28:5, 198-204.
3. Klein, Alison P, et al. “Prospective Risk of Pancreatic Cancer in Familial Pancreatic Cancer Kindreds.” Cancer Research 64.7 (2004): 2634-8. PubMed.
4. Klein, Alison P, et al. “Prospective Risk of Pancreatic Cancer in Familial Pancreatic Cancer Kindreds.” Cancer Research 64.7 (2004): 2634-8. PubMed. Web. 6 September 2011. <http://cancerres.aacrjournals.org/content/64/7/2634.long>.
5. Elbers C.C., Eijk K.R., Franke L., Mulder F., Schouw Y.T., Wijmenga C., Onland-Moret N.C. “Using Genome-Wide Pathway Analysis to Unravel the Etiology of Complex Diseases”. Genetic Epidemiology 33: 419-431 (2009). Doi: 10.1002/gepi.20395

6. Visscher, Peter M., WG Hill, and NR Wray. “Heritability in the Genomics Era – Concepts and Misconceptions.” Nature Reviews. Genetics 9.4 (2008): 255-66. PubMed. Web. 26 Aug. 2010. <http://www.nature.com/doifinder/10.1038/nrg2322>.
7. Li, J, et al. “A Combined Analysis of Genome-Wide Association Studies in Breast Cancer.” Breast Cancer Research and Treatment (Sept. 2010): PubMed. Web. 21 Aug. 2011. <http://www.springerlink.com/content/kl32p6271h141716/>.
8. Naj, AC, et al. “Dementia Revealed: Novel Chromosome 6 Locus for Late-onset Alzheimer Disease Provides Genetic Evidence for Folate-pathway Abnormalities.” PLoS Genetics 6.9 (2010): e1001130. PubMed. Web. 15 Aug. 2010. <http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1001130>.
9. Lambert, JC, et al. “Implication of the Immune System in Alzheimer’s Disease: Evidence from Genome-Wide Pathway Analysis.” Journal of Alzheimer’s Disease: JAD 20.4 (2010): 1107-18. PubMed. Web. 7 September 2011. <http://iospress.metapress.com/content/mj6t4h073843501l/>.
10. Galvan, Antonella, JP Ioannidis, and TA Dragani. “Beyond Genome-Wide Association Studies: Genetic Heterogeneity and Individual Predisposition to Cancer.” Trends in Genetics: TIG 26.3 (2010): 132-41.
11. Cantor, Rita M., K Lange, JS Sinsheimer. “Prioritizing GWAS Results: A Review of Statistical Methods and Recommendations for Their Application.” American Journal of Human Genetics 86.1 (2010): 6-22. PubMed. Web. 5 September 2011. <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2801749/?tool=pubmed>.
12. Raimondi S., Lowenfels A.B., Morselli-Labate A.M., Maisonneuve P., Pezzilli R. “Pancreatic cancer in chronic pancreatitis; aetiology, incidence, and early detection”. Best Practice & Research Clinical Gastroenterology 24 (2010) 349-358.
13. Vitone L.J., Greenhalf W., McFaul C.D., Ghaneh P., Neoptolemos J.P. “The inherited genetics of pancreatic cancer and prospects for secondary screening”. Best Practice & Research Clinical Gastroenterology 20 (2006) 253-283.
14. Li J, et al. “A Combined Analysis of Genome-Wide Association Studies in Breast Cancer.” Breast Cancer Research and Treatment. (2010) PubMed. <http://www.springerlink.com/content/kl32p6271h141716/>.
15. Lambert JC, Boley BG, Chouraki V, Heath S, Zelenika D, Fievet N, Hannequin D, Pasquier F. “Implication of the Immune System in Alzheimer’s Disease: Evidence from Genome-Wide Pathway Analysis.” Journal of Alzheimer’s Disease 20 (2010) 1107-1118. Doi: 10.3233/JAD-2010-100018
16. Wang, K., Li M., Hakonarson, H. “Analysing biological pathways in genome-wide association studies”, Nature Reviews, Volume 11, December 2010. Doi: 10.1038/nrg2884

17. Nam, D., Kim, J., Kim, S.Y., Kim, S. “GSA-SNP: a general approach for gene set analysis of polymorphisms”. Nucleic Acids Research 2010, Vol. 38, 749-754, doi:10.1093/nar/gkq428
18. K. Zhang, S. Chang, et al. (2011). "ICSNPathway: identify candidate causal SNPs and pathways from genome-wide association study by one analytical framework." Nucleic Acids Res. 39(suppl 2): W437-W443.
19. Zintzaras E, Lau J (2008) Trends in meta analysis of genetic association studies. J Hum Genet 53, 1-9.
20. Zhao et al. “Pathway-based analysis using reduced gene subsets in genome-wide association studies”. BMC Bioinformatics 2011, 12:17. Doi: 10.1186/1471-2105-12-17
21. Vitone L.J., Greenhalf W., McFaul C.D., Ghaneh P., Neoptolemos J.P. “The inherited genetics of pancreatic cancer and prospects for secondary screening”. Best Practice & Research Clinical Gastroenterology, Vol. 20, No. 2, 253-283, 2006. Doi: 10.1016/j.bpg.2005.10.007
22. Rhee SY, Wood V, Dolinski K, Draghici S (2008) Use and misuse of the gene ontology annotations. Nature Reviews Genetics 9, 509-515.
23. Torkamani, Ali, EJ Topol, and NJ Schork. “Pathway Analysis of Seven Common Diseases Assessed by Genome-Wide Association.” Genomics 92.5 (2008): 265-72. PubMed.
24. Medina, I., Montaner, D., Bonifaci, N., Pujana, M.A., Carbonell, J., Tarraga, J., Al-Shahrour, F. and Dopazo, J. (2009) Gene set-based analysis of polymorphisms: finding pathways or biological processes associated to traits in genome-wide association studies. Nucleic Acids Res, 37, W340-344.
25. Wang, K., Li, M. and Hakonarson, H. (2010) Analysing biological pathways in genome-wide association studies. Nat Rev Genet, 11, 843-854.
26. Zhong, Hua, et al. “Integrating Pathway Analysis and Genetics of Gene Expression for Genome-wide Association Studies.” American Journal of Human Genetics 86.4 (2010): 581-91. PubMed. Web. 30 August 2011. <http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2703874/?tool=pubmed>.
27. “Whole Genome Association Analysis Toolset.” PLINK. Web. 28 August 2011. <http://pngu.mgh.harvard.edu/~purcell/plink/gplink.shtml>.
28. You L, Chen G, Zhao Y.P. Core signaling pathways and new therapeutic targets in pancreatic cancer. Chin Medical Journal 2010; 123 (9): 1210-1215. Doi: 10.3760/cma.j.issn.0366-6999-2010.09.020
29. “ICSNPathway”. Web. 28 August 2011. <http://icsnpathway.psych.ac.cn>
