The prognostic, diagnostic, and therapeutic potential of tumor antigens

This thesis has been submitted to the PhD School of The Faculty of Science, University of Copenhagen, 21.11.2013

Author
Lars Rønn Olsen
The Bioinformatics Centre, Department of Biology, University of Copenhagen

Academic supervisors
Professor Anders Krogh
The Bioinformatics Centre, Department of Biology, University of Copenhagen

Associate Professor Ole Winther
Center for Cognitive Systems, DTU Compute, Technical University of Denmark

Professor Vladimir Brusic
Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School




Summary

Tumor antigens are a group of proteins recognized by the cells of the immune system. Specifically, they are recognized in tumor cells, where they are present in larger than usual amounts or are physicochemically altered to a degree at which they no longer resemble native human proteins. Their presence or abundance in cancer cells is often unique, and their roles and functions in tumorigenesis have, in many cases, been studied extensively. They therefore have the potential to be highly specific biomarkers as well as therapeutic targets, but realizing this potential requires complex analyses that combine basic science, high-throughput methods of genomics and proteomics, and clinical studies. These analyses produce large amounts of data that require advanced bioinformatics methods for collection, management, integration, and interpretation. In this thesis, I have explored the potential of tumor antigens as biomarkers and therapeutic agents by developing and implementing several computational tools and databases for immunotherapy target discovery, and I have analyzed the potential of tumor antigens as proteogenomic biomarkers in invasive ductal carcinomas. In this analysis, I have shown that combining proteomics and genomics data with a focus on tumor antigens can provide biological insights into molecular pathways involved in tumorigenesis.


Resumé

Tumor antigens are a group of proteins that can be recognized by the immune system. They are recognized in cancer cells, where they are often present in larger than usual amounts or are chemically altered to a degree at which they differ from ordinary human proteins. Their presence, or increased abundance, is often unique to cancer cells, and their function and role in cancer have been studied closely. They therefore have the potential to be highly specific biomarkers as well as therapeutic targets, but achieving this requires complex analysis that combines basic research, high-throughput methods in gene and protein research, and clinical studies. These analyses produce enormous amounts of data that require advanced bioinformatics methods for collection, storage, integration, and interpretation. In this thesis, I explore the diagnostic and therapeutic potential of tumor antigens by developing and applying bioinformatics tools and databases to define immunotherapeutic targets. I have also analyzed the potential of tumor antigens as proteogenomic markers of inflammatory breast cancer. In this analysis, I show, among other things, that combining gene- and protein-based data focused on tumor antigens can provide insight into molecular systems involved in the development of breast cancer.


Preface

Mephistopheles: "Tell me first, without more ado, Which Faculty appeals to you?" Student: "I want to be a learned man, And find out everything I can -- All the whole universe contains, All about Nature, which Science explains." Mephistopheles: "Well, you’re on the right road, that’s clear. But you will find a lot to distract you here!" Johann Wolfgang von Goethe - Faust

Upon returning to Copenhagen after almost two years of living in Boston, I approached Ole with a somewhat daring plan: I wanted to finish my PhD in half the standard time. This was our first meeting, but my optimism did not seem to faze him and he agreed to advise me in my endeavor. Ole's group at the time consisted of three other PhD students and a postdoctoral fellow. He managed this group part time, while also managing a successful research group at the Technical University of Denmark. After 18 months of working with Vladimir in his exceptionally well-managed lab, the independence and self-governance struck me as daunting. I soon realized that Ole had carefully assembled a group of extremely qualified individuals, complementing each other's skills and personalities to a degree where autonomy equals creativity and productivity. It was the perfect place to write my PhD, under the supervision of two excellent academic advisors of complementary skills and styles, namely Ole and Vladimir. It seemed, almost, as if the thesis would write itself.

As Faust was warned in Goethe's famous play of the same name, distractions are plentiful in the world of science. Alas, as I struggled to define the boundaries of my endeavors, I did not finish my PhD thesis in half the standard time. Instead, I traded the time for, I would like to believe, quality, and I produced a thesis that I am far more proud of than I ever hoped for. Just as my Master's thesis paved the way for this PhD thesis, I feel that the work presented here has left me with multiple exciting prospective projects, all of which I am aching to explore further.

Acknowledgements

After two years of enjoying the support and company of my supervisors and colleagues at work and friends and family in my spare time, it seems a token gesture to spend a few lines expressing my gratitude to the individuals who made all of this possible, and an enjoyable experience as well. First and foremost, I would like to extend my sincere gratitude to Ole Winther for believing in me. Ole's refreshing take on science and management makes for an exceptionally dynamic and inspiring research environment. On a personal level, Ole is a true pleasure to be around, whether at work, the occasional post-work beer, or Saturday morning footraces (one of these days, he might just win...), and I am excited about the prospects of continuing our collaboration. No one has influenced me academically more than Vladimir Brusic, and I am extremely grateful that Vladimir agreed to co-supervise me during this project. Although it is not always clear to me at first, Vladimir never fails to provide me with advice that eventually benefits me enormously. I owe him thanks for a great many things, but above all, I am sincerely thankful to him for never allowing me to settle for anything less than excellence. I also owe thanks to Anders Krogh for supervising the more practical aspects of this PhD work, and for making the Bioinformatics Centre the exceptionally inspiring and pleasant workplace that it is. During this work, I have had the good fortune of sharing my workspace with three talented scientists of the Winther group, Frederik Otzen Bagger, Nicolas Rapin, and Tomas Martin-Bertelsen. Although we work on very different projects, these three individuals can always be counted on for help, advice, and the occasional post-work beer. They have not only made my PhD better, but also infinitely more enjoyable to work on every day.


I have also had the absolute pleasure of working with a uniquely gifted medical doctor, scientist, and equally great friend, Benito Campos. Benito belongs to a rare breed of people who, in the course of a day, probably get more inspired ideas than I will over the course of my lifetime. And even more incredibly, he somehow manages to carry out the vast majority. Thanks also go to his wife and my friend, Julia, without whom Benito would have long ago changed his permanent address to the lab, and surely be a lot less fun.

I would also like to thank my colleagues from the Cancer Vaccine Center in Boston: Guanglan Zhang (now assistant professor at Boston University), Derin Keskin, Jing Sun, Ulrich Johan Kudahl (now PhD student at University of Cambridge), and Christian Simon (now PhD student at University of Copenhagen). As great friends as they are colleagues, they have made my visits to Boston rewarding both personally and academically. Special thanks go to Christian, whose friendship goes back more than twenty years.

Mads Hald Andersen and Mike Stein Barnkob also deserve my gratitude for their valuable feedback and for the help they offered me in better understanding the intricacies of cancer immunology as it is practiced in the lab and in the clinic.

A big thanks to the past and present members of the Sandelin lab: Albin Sandelin, Robin Andersson, Berit Lilje, Mette Jørgensen, Jette Bornholdt Lange, Mette Boyd, Yun Chen, Morana Vitezic, Ilka Hoof, and Kristoffer Vitting-Serrup for providing a pleasant atmosphere and making the office a great place to go to every day.

I would also like to offer my thanks to my friends "on the outside": Mikkel, Peter, Anders, Klaus, Laura, Joe, Mads, Hans, Sebastien, Ricko, Michael, and Abbi, and of course my wonderful girlfriend, Maya, for sharing with me all the finer (non-work-related) things in life, and for generally being an awesome group of people.
Lastly, I would like to thank my dear family, my mother, my father, and my sister Louise and Rasmus, for their love and support throughout my years of studying, and for dutifully visiting me wherever these endeavors took me.

Lars Rønn Olsen Copenhagen, Denmark

November 2013


Publications

The results presented in this thesis have formed the basis of the following peer reviewed publications, submitted manuscripts, and draft manuscripts (corresponding author marked with asterisk):

Olsen LR*, Campos B, Barnkob MS, Winther O, Brusic V, Andersen MH: Bioinformatics for cancer immunotherapy target discovery. Cancer Immunology, Immunotherapy (submitted).

Olsen LR, Simon C, Kudahl UJ, Bagger FO, Winther O, Reinherz EL, Zhang GL, Brusic V*: Identification of T cell vaccine targets from multiple sequence alignments. Journal of Immunology (in review).

Olsen LR, Kudahl UJ, Simon C, Sun J, Schönbach C, Reinherz EL, Zhang GL, Brusic V*: BlockLogo: visualization of peptide and sequence motif conservation. Journal of Immunological Methods 2013.

Tongchusak S, Zhang GL*, Olsen LR, Lin HH, Reinherz EL, Brusic V: TANTIGEN: a tumor antigens database and analysis platform for vaccine target discovery. Cancer Research (submitted).

Olsen LR, Kudahl UJ, Winther O, Brusic V*: Literature classification for semi-automated updating of biological knowledgebases. BMC Genomics 2013, 14 Suppl 6:S14.

Olsen LR, Campos B, Winther O, Brusic V*: Tumor antigens as proteogenomic biomarkers of invasive breast carcinomas. 2013 (manuscript in preparation).


The work performed for this thesis was used for the following relevant publications not included due to space limitations:

Olsen LR, Sun J, Simon C, Zhang GL, Brusic V: Genomics in vaccine development. Chapter in: Genomics in Drug Discovery. Editors: Sakharkar MK, Sakharkar RK, Chandra R. River Publishers, Denmark 2014 (in press).

Zhang GL, Chitkushev L, Olsen LR, Kudahl UJ, Simon C, Brusic V: Streamlining the development process of immunological knowledge-based systems. Chapter in: Genomics in Drug Discovery. Editors: Sakharkar MK, Sakharkar RK, Chandra R. River Publishers, Denmark 2014 (in press).

Sun J, Zhang GL, Olsen LR, Reinherz EL, Brusic V*: Landscape of neutralizing monoclonal antibodies against dengue virus. In Proc. ACM-BCB, 2013 Sep 22-25. Washington DC, USA 2013.

Additionally, the following research articles are currently in preparation or have been published prior to the work for this thesis:

Genee HJ, Bonde MT, Bagger FO, Jespersen JB, Sommer MOA, Wernersson R, Olsen LR*: PHUSER 2.0: a primer design software for accelerated USER cloning-based DNA engineering. 2014 (manuscript in preparation).

Olsen LR, Zhang GL, Keskin DB, Reinherz EL, Brusic V*: Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy. Frontiers in Immunology 2011, 2:1–15.

Olsen LR, Zhang GL, Reinherz EL, Brusic V*: FLAVIdB: A data mining system for knowledge discovery in flaviviruses with direct applications in immunology and vaccinology. Immunome Research 2011, 7:1–9.

Olsen LR*, Hansen NB, Bonde MT, Genee HJ, Holm DK, Carlsen S, Hansen BG, Patil KR, Mortensen UH, Wernersson R: PHUSER (Primer Help for USER): a novel tool for USER fusion primer design. Nucleic Acids Research 2011, 39 Suppl 2:W61–7.


Abbreviations

CTL: cytotoxic T lymphocyte
HLA: human leukocyte antigen
IDC: invasive ductal carcinoma
k-NN: k-nearest neighbor
MGH: Massachusetts General Hospital
MSA: multiple sequence alignment
NEU: Northeastern University
pHLA: peptide-HLA complex
TA: tumor antigen
TCGA: The Cancer Genome Atlas


Content

Summary
Resumé
Preface
Publications
Abbreviations

Chapter 1: Introduction
    The structure of the thesis
    Contributions
    Comments

Chapter 2: Bioinformatics for cancer immunotherapy target discovery
    In silico screening for immunotherapy targets

Chapter 3: Bioinformatics tools for vaccine target selection
    Epitope conservation analysis

Chapter 4: TANTIGEN: an updated database of tumor antigens to support immunotherapy target selection
    The TANTIGEN database of tumor T cell antigens
    Keeping TANTIGEN up-to-date

Chapter 5: Tumor antigens as proteogenomic biomarkers in invasive ductal carcinomas
    Tumor antigens as biomarkers for cancer

Chapter 6: Conclusion
    Future perspectives

References

Paper I: Bioinformatics for cancer immunotherapy target discovery
Paper II: Identification of T cell vaccine targets from multiple sequence alignments
Paper III: BlockLogo: visualization of peptide and sequence motif conservation
Paper IV: TANTIGEN: a tumor antigens database and analysis platform for vaccine target discovery
Paper V: Literature classification for semi-automated updating of biological knowledgebases
Paper VI: Tumor antigens as proteogenomic biomarkers of invasive breast carcinomas

Contributions

Supplementary materials
    Supplementary materials for Paper I
    Supplementary materials for Paper II
    Supplementary materials for Paper VI


Figures

Chapter 2
Figure 1: The workflow for discovery of potential T cell epitopes and immunotherapy targets

Chapter 3
Figure 2: Subdivision of MSA into blocks of peptides
Figure 3: Example of the visualization of the block conservation analysis and HLA class I binding affinity predictions
Figure 4: Example of BlockLogo

Chapter 4
Figure 5: Tumor antigen classification chart in TANTIGEN
Figure 6: Growth of PubMed, UniProt/Swiss-Prot, and COSMIC
Figure 7: Literature classification signature

Chapter 5
Figure 8: Density distribution of correlation coefficients
Figure 9: Correlation of mRNA and protein expression
Figure 10: Functional modules of tumor antigens and interacting proteins


Tables

Chapter 3

Table 1: Predicted HLA binding affinities of each peptide in the example block


Chapter 1 Introduction

Gil: "I would like you to read my novel and get your opinion." Ernest Hemingway: "I hate it." Gil: "You haven't even read it yet." Ernest Hemingway: "If it's bad, I'll hate it. If it's good, then I'll be envious and hate it even more. You don't want the opinion of another writer." From Woody Allen's Midnight in Paris

The work presented in this PhD thesis was performed between February 2011 and November 2013, partly as a visiting research scholar at the Cancer Vaccine Center, Dana-Farber Cancer Institute from February 2011 to September 2011, partly as a research assistant at the Bioinformatics Centre, Department of Biology, University of Copenhagen from February 2012 to September 2012, and partly as a PhD candidate in bioinformatics at the Bioinformatics Centre at University of Copenhagen from October 2012 to November 2013. The work was supervised by Professor Anders Krogh, Associate Professor Ole Winther, and Professor Vladimir Brusic. Funding was provided by the Dana-Farber Cancer Institute and the Novo Nordisk Foundation.

The work in this PhD thesis is a theoretical and practical extension of preliminary studies performed during the work for my Master's thesis, written during 2010 and 2011 at the Cancer Vaccine Center, under the supervision of Vladimir Brusic and Søren Brunak. The Master's thesis, titled "An integrated computational framework for vaccine target discovery in flavivirus spp.", resulted in the publication of two peer reviewed research articles addressing the selection of T cell vaccine targets for polyvalent vaccines against dengue virus and related flaviviruses. The first of the two articles presented the FLAVIdB database and analysis resource for flavivirus protein sequences [1], while the second presented the results of T cell epitope predictions for vaccine target discovery in dengue virus [2].

The problems of vaccine design against well-characterized viral pathogens differ significantly in complexity from those that burden cancer immunotherapy design. Although there is an overlap in the methodologies for vaccine target selection and biological database construction, there are significant differences due to the nature of cancer antigens and aberrations of regulatory, metabolic, and signaling pathways in cancer. My work involved development of new methods, refinement of existing ones, and significant extension of their functionality for applications in cancer immunotherapy.

The aim of this work is to provide a clear definition of the roles of bioinformatics and to develop novel solutions for advancing target discovery for cancer immunotherapy. In doing so, I identified and addressed three major issues where adequate solutions have been lacking: i) the lack of a well-defined workflow for bioinformatics applications in the field of immunotherapy; ii) the lack of specialized tools for defining therapy targets for multi T cell epitope strategies; and iii) the lack of an up-to-date specialized database of tumor antigens (TAs). In addition to the technological advances achieved by addressing these three key issues, this work makes a scientific contribution.
Using data and tools developed in the first part of the thesis work, I studied the expression of TAs and elucidated how gene and protein expression can be used to clarify mechanisms of tumorigenesis. A case study of genomics and proteomics data from invasive ductal carcinomas (IDCs) was performed to identify aberrations of molecular pathways involved in tumorigenesis.

The structure of the thesis

In addition to this introduction and the final conclusion, this PhD thesis contains four chapters that summarize the key findings of six research articles produced during the course of this PhD work. Chapter 2, titled "Bioinformatics for cancer immunotherapy target discovery" (Paper I), provides highlights from the review article of the same name. In the review, we (the authors) discuss the main challenges in T cell-based cancer immunotherapy, with a particular emphasis on the current and prospective applications of bioinformatics for the discovery and selection of immunotherapy targets. In this article, we defined a workflow for selection of T cell-based cancer immunotherapy targets, using a selection of established bioinformatics tools. This article has been submitted for publication in Cancer Immunology, Immunotherapy.

In chapter 3, I have provided an overview of a conceptual framework for the methodology used for in silico selection of vaccine targets and two software implementations of the developed methods. This chapter summarizes two articles. The first article describes the block conservation method and a webserver that facilitates conservation and variability analysis of T cell epitopes for multi epitope vaccines against human pathogens and cancers. The second article describes a software implementation of BlockLogo, an extension of the traditional sequence logo tool that enables visualization of continuous and discontinuous immunological motifs. The methods were applied to viral pathogens as case studies, but they are applicable to any pathogen or TA. The first article is titled "Identification of T cell vaccine targets from multiple sequence alignments" (Paper II) and is currently undergoing review for publication in the Journal of Immunology. The second article, titled "BlockLogo: visualization of peptide and sequence motif conservation" (Paper III), is published in the Journal of Immunological Methods [3].

A central resource in this work is the TANTIGEN database of tumor T cell antigens. As such, it is paramount that this database is kept updated to the highest degree possible. In chapter 4, I provide an overview of the structure and content of TANTIGEN as well as a methodological framework for semi-automated updating of TANTIGEN using text mining.
The article describing TANTIGEN is titled "TANTIGEN: a tumor antigens database and analysis platform for vaccine target discovery" (Paper IV) and is submitted to Cancer Research. The article describing the methodology for updating TANTIGEN is titled "Literature classification for semi-automated updating of biological knowledgebases" (Paper V) and is published in BMC Genomics [4].

TAs are often unique to cancer cells, a property that makes them interesting targets to study for therapeutic as well as diagnostic and prognostic purposes. In chapter 5, I have summarized the results of my preliminary study of the role of TAs in tumorigenesis of IDCs.


This chapter summarizes findings that are described in detail in the research article draft titled "Tumor antigens as proteogenomic biomarkers in invasive ductal carcinomas" (Paper VI), which is intended for submission to BMC Genomics. Lastly, chapter 6 contains an overall conclusion to the thesis, including future work and perspectives.

Contributions

As each chapter represents the work done by myself in collaboration with multiple contributors, the plural personal pronoun, "we", is consistently used throughout the chapters and papers of this thesis. The individuals significantly contributing to the different subprojects are acknowledged in the list of authors on the Papers, and their contributions are detailed in the "Contributions" section immediately following the papers appended to this thesis.

Comments

The six papers forming the basis of this thesis are interrelated, which allowed me to present this thesis in the form of a monograph, and it is my hope that this format will increase the readability of the six chapters. The papers, however, are written to present the results of individual research projects, resulting in a certain amount of overlap in the content, particularly in the introductory and methodology sections. As the reader may have noticed by now, I have added minor informalities in the form of quotes at the beginning of each chapter. Some hold relevance to the content of the chapter, some are merely curious musings of great people past or present, while some are quotes from unrelated works of fiction that have inspired me during the course of this PhD.


Chapter 2 Bioinformatics for cancer immunotherapy target discovery

"Down to their innate molecular core, cancer cells are hyperactive, survival-endowed, scrappy, fecund, inventive copies of ourselves." Siddhartha Mukherjee - The Emperor of All Maladies

Cancer immunotherapy is a major modality of cancer treatment. It is often used in combination with other treatments such as surgery, radiotherapy, and chemotherapy [5]. Immunotherapy is a broad term that encompasses a variety of treatments, of which some enhance immune response in a very general way, while others direct the immune system to specifically target cancer cells. Their common aim is a productive modulation of immune responses for effective targeting and elimination of tumor cells [6]. In this thesis, I focus specifically on cancer immunotherapy treatments that involve targeting of TAs by T cells of the immune system. The cytotoxic capacity of T cells can be utilized to target TAs by conventional in vivo stimulation with whole TAs, TA subunits, synthetic peptides, or recombinant viruses [7], by the adoptive transfer of T cells from another source, or by T cells expanded and modified ex vivo [8]. The adaptive immune system responds to tumor cells by recognizing tumor-associated antigens (antigens that are overexpressed in cancer cells) [9], or tumor-specific antigens (unique antigens that are not expressed in most normal tissues) [10]. Genetic mutations in tumor cells can lead to changes in the primary or secondary protein structure that may affect immunogenicity of antigens [11]. Changes in protein sequences can alter peptide binding affinity to the human leukocyte antigen (HLA), and potentially change subsequent responses by cytotoxic T lymphocytes (CTLs) [12].

The control of tumor proliferation by CTLs can be highly efficient, and expression of T cell epitopes has been associated with lower tumor survival rates [13]. However, harnessing the capacity of CTLs into an efficient cancer immunotherapy remains elusive. While T cell responses can successfully be induced in vitro, this has not yet been reliably reproduced with the same efficiency in clinical trials. This difference is largely due to the immunosuppressive strategies deployed by tumors [14] and immune escape by immunoediting of antigenic sites leading to therapy resistance [15]. Various strategies have been devised to overcome the latter, for example by targeting functionally essential protein regions, or by multi epitope vaccination [16]. Therefore, identification and detailed characterization of vaccine targets is an essential step in tumor vaccine development.

Technical advances in instrumentation, sample processing, immunological assays, and bioinformatics techniques have generated large amounts of immunological data, including experimentally identified TAs and T cell epitopes, novel tumor biomarkers, and differentially expressed genes or proteins. The majority of these targets have been identified through genomics, proteomics, or other high-throughput methods. Our data and methods provide means for in silico prescreening of immunotherapy targets and enable the streamlining of target discovery for immunotherapy.

In silico screening for immunotherapy targets

Identification and selection of antigens is a multifaceted task that depends both on the type and the intended application of antigens. In 2000, Rino Rappuoli formalized the role of computational analyses in vaccinology in a conceptual framework termed "Reverse Vaccinology" [17]. Originally formulated to facilitate vaccine target discovery in pathogens, the concepts of reverse vaccinology can be expanded for applications in cancer immunology. Reverse vaccinology revolves around sequence analysis, in which genomic sequence is used to catalogue all potential antigens. In simple viral and prokaryotic pathogens, essentially all protein products are potential antigens, whereas the majority of tumor tissue proteins are non-aberrant native proteins, rendering them poor therapy targets. Therefore, cataloguing antigens in tumor cells requires additional genomic and proteomic screening. Once a catalogue of potential antigens has been established, the reverse vaccinology pipeline calls for in silico prediction of vaccine targets - a process that is seen as naïve, in that targets predicted from sequence, such as predicted HLA binders, are not characterized in terms of intra-cellular pre-processing, conservation, or in vivo expression. Additional computational and experimental prescreening of epitope candidates must therefore be performed before they can be included as therapy targets.

Potential T cell epitopes must be examined for pre-processing by the proteasome, transport by the transporter associated with antigen processing (TAP), HLA binding, stability of the peptide-HLA (pHLA) complex [18], pHLA binding to the T cell receptor [19], and in vivo expression. T cell epitopes are short peptide fragments of 8-12 and 13-25 amino acids in length for HLA class I and II, respectively [20, 21]. Proteins are intracellularly processed by proteasomal cleavage in the cytosol, after which the resulting peptides are transported to the endoplasmic reticulum (ER), where they bind the HLA. The HLA is then relocated to the cell surface, where the pHLA complex is recognized by the receptors of circulating T cells [22]. Each of these cellular processes is a limiting step in the classical T cell-mediated immunity - the prediction of immunogenicity is therefore a non-trivial task.

The conservation and variability analysis for selection of immunotherapy targets is a multidimensional problem. If one aims to define targets for general immunotherapies applicable to a broad cohort of patients, antigen diversity must be studied across the patient population. Addressing patient diversity alone is a complex problem, and targeting a disease of such vast heterogeneity as cancer further increases the complexity of the studied system. Due to high variability of cancer targets, even personalized vaccine targets are likely to be unstable over time, and the somatic processes driving the selection are difficult to predict. It is therefore very important that conservation analysis and definition of multi epitope strategies encompass as much data and information as possible, and the related computational analysis tools must be customized for the intended purpose.
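As a minimal illustration of the first computational step in such a screen, the Python sketch below enumerates every candidate peptide of class I length (8-12 residues) from a protein sequence. The function name and the toy sequence are assumptions introduced here for illustration; the sketch only lists candidates and does not model proteasomal cleavage, TAP transport, or HLA binding, which require dedicated predictors.

```python
def candidate_peptides(protein, min_len=8, max_len=12):
    """Yield (start, peptide) for every window of min_len..max_len residues.

    Start positions are 1-based. Lengths default to the HLA class I range;
    use min_len=13, max_len=25 for the class II range quoted in the text.
    """
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            yield start + 1, protein[start:start + length]


# Toy sequence (hypothetical input, 15 residues long)
protein = "AGVLWDVPSPPPMGK"
class_i = list(candidate_peptides(protein))
nine_mers = [peptide for _, peptide in class_i if len(peptide) == 9]
```

Each candidate would then be passed to the downstream filters described above before being considered a potential epitope.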

Page 30: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

8

Traditionally, large-scale experimental screening has been the primary tool to elucidate cancer immunotherapy targets. This process can be streamlined by systematic application of bioinformatics on patient antigen expression and sequence profiles, as well as analyses in conjunction with the ever-growing body of publicly available biological data. In Paper I, we have outlined a proposed approach that uses bioinformatics-driven selection of immunotherapy targets (Figure 1), and suggest a workflow that combines bioinformatics tools and databases for the purpose. We have reviewed current immunotherapy treatment modalities and have discussed the key issues pertaining to T cell-based therapy design. The main issue is selection of immunotherapy targets that help achieve an efficient and lasting protection against malignant tumors.


Figure 1: The workflow for discovery of potential T cell epitopes and immunotherapy targets in tumor tissue. Green boxes represent the preliminary and confirmatory laboratory analyses; red boxes represent bioinformatics analyses performed using computational tools (described in further detail in section 4.1 of Paper I); blue cylinders represent cross-referencing with biological databases (described in further detail in section 4.2 of Paper I), and yellow boxes denote intermediary outputs from the analyses and cross-referencing.

[Figure 1 workflow stages: Cataloguing potential antigens; Identification of potential T cell epitopes; Epitope conservation analysis; Co-target analysis; Validation. Box labels: Tumor and healthy tissue samples; Healthy tissue sample; Gene expression profiling (microarray, RNA seq); Protein expression profiling (immunohistochemistry, mass spectrometry); Differentially expressed genes/proteins; Potential protein targets; Mutation and splice variant mapping (RNA seq, DNA seq); Potential antigens; HLA profiling (microarray, RNA seq, DNA seq); Predict epitopes (netMHC, netMHCII); Potential T cell epitopes; Epitope stability analysis (sequence alignment); Pathway/interactant analysis; Potential co-targets; Experimental validation. Database labels: TANTIGEN, UniGene, Human Protein Atlas; TANTIGEN, COSMIC, UniProt; TANTIGEN, KEGG, STRING.]


Chapter 3 Bioinformatics tools for vaccine target selection

"I don't think we're here for anything, we're just products of evolution. You can say 'Gee, your life must be pretty bleak if you don't think there's a purpose' but I'm anticipating a good lunch." James Watson

Knowledge-based selection of immunotherapy targets relies on the use of specialized bioinformatics tools. Multiple steps are involved, spanning the initial sequence analysis utilizing local alignment algorithms and multiple sequence alignment algorithms, through sequence analysis algorithms for the assessment of variability and conservation, to functional annotations, such as predictions of epitopes. The biggest success of immunological bioinformatics is the development of algorithms for prediction of peptide binding affinity to the HLA - one of the rate-limiting steps in T cell-based immune response [23]. Although current forms of these algorithms are highly accurate [24–26], the output alone is not the end product that informs the selection of therapeutic applications. In the conceptual framework for reverse vaccinology, Rino Rappuoli described in silico predictions of immune epitopes from biological sequence data as a "naive approach" when compared with experimental elucidation of immunogenic peptides. He argued that many parameters of a good vaccine target conferring efficient, lasting immunity still remain to be considered after prediction of HLA binding: multiple rate-limiting steps of peptide pre-processing, confirming in vivo expression, considering dynamics of expression in different developmental stages and cellular environments, presence of epitope across pathogen or tumor population, response across host population, epitope stability over time, and others [17].

In this thesis, I have addressed the issue of eliciting a lasting immunity through vaccination. I proposed a method for selection of vaccine targets for multi epitope, or polyvalent, vaccine designs. In this method, multiple targets collectively confer broad coverage and lasting immunity. This proposed solution addresses a major part of the vaccine target selection, but vaccinology as a discipline contains many facets beyond target selection. Nevertheless, target selection is a bottleneck in the design process (discussed in Paper I). The central hypothesis behind this approach is anchored in the natural selection theories, and a detailed description of algorithms and applications can be found in Paper II. The hypothesis presented in Paper II is based on three basic assumptions: (i) if an immunogen is capable of eliciting a strong response, it is under high selective pressure and therefore likely to be of lower frequency in the viral or tumor population, (ii) some immunogens will be phased out of the pathogen or tumor population this way, (iii) while others, for example those located in functionally essential regions of the protein that harbors them, will have only limited leeway for mutating.

Traditionally, T-cell epitope diversity analysis is done to identify conserved peptides with desired immunogenic properties. Studies of HLA evolution suggest that HLA evolved to recognize peptides derived from conserved hydrophobic cores of pathogen proteins [27].
However, since the human immune system evolves on a significantly longer time scale than rapidly mutating pathogens or tumors [28], high selective pressure causes the latter to alter expression of some immunogenic antigens faster than the immune system can evolve to keep up with the changes [29]. The HLA binding affinity of a peptide relative to its frequency in a viral or malignant cell population is known as its targeting efficiency (TE). It has been shown that the targeting efficiency of peptides varies in different organisms, and in some highly variable viruses it tends to be low [30]. Regions of high TE comprise peptides that are highly conserved, most likely owing to the functional relevance limiting the capacity of a cell to change antigens while maintaining its fitness [31–33]. Regions of low TE comprise one or more peptides, potentially of high HLA binding affinity, but each of them will have a low frequency in the viral or cell population. For rapidly mutating viruses, such as RNA viruses [34–39], the selective pressure exerted on HLA-binding peptides means that host immunity will often, and in some cases preferentially, target low frequency epitopes [30].

Although data for the frequency of tumor T-cell epitopes has not yet been sufficiently accumulated to enable drawing such conclusions, one can speculate that the issues with variability of TAs resemble those of highly variable viruses. However, the immune escape mechanisms of tumors are vastly more complex. In addition to the loss of epitopes [40], tumors are also known to lose expression of antigens altogether [41], or to downregulate or lose expression of HLA [40] or antigen processing and presentation components [42]. The resistance that is often observed after limited periods of immunotherapy treatment is indicative of the escape mechanisms employed by the tumor cells. Additional complications include the immunosuppressive properties of tumors in advanced stages of progression [43], which may be ameliorated by increasing response efficiency against early stage tumors.

The software tools presented here were initially developed specifically for viral pathogens and later adapted to TAs, and in principle they should be able to capture antigen diversity in a similar manner. This notion is supported by our results from exploring the COSMIC database for mutations in TAs (discussed in Paper I), where we often observed mutations in predicted and known T-cell epitopes in tumor T-cell antigens leading to the loss of HLA binding properties.

Epitope conservation analysis

Conservation and variability analysis for immunological applications is traditionally done by calculating the frequency of each amino acid in a multiple sequence alignment (MSA) of homologous protein sequences. Regions, in which nine or more consecutive residues display conservation over a given threshold (typically > 90%), are then further analyzed for their potential as immune response targets either by prediction, experimental testing, or both [44, 45]. However, since epitopes bind to HLA as peptides rather than as individual amino acids, they should be analyzed as peptides in the initial conservation analyses. This approach is especially useful to ensure the inclusion of low frequency peptides as potential vaccine targets, since analysis of blocks of homologous peptides may reveal peptide pools with similar immunological properties.
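The traditional residue-level analysis described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the implementation used in the cited studies: the function names are hypothetical, conservation is taken as the frequency of the most common residue in each column, gap characters are treated like any other residue, and the toy alignment is borrowed from the example sequences shown in Figure 2.

```python
from collections import Counter


def column_conservation(msa):
    """Frequency of the most common residue in each alignment column."""
    n = len(msa)
    return [Counter(column).most_common(1)[0][1] / n for column in zip(*msa)]


def conserved_regions(conservation, threshold=0.9, min_len=9):
    """Return 1-based inclusive (start, end) runs of at least min_len
    consecutive columns whose conservation exceeds the threshold."""
    regions, run_start = [], None
    # A trailing 0.0 sentinel closes a run that reaches the last column.
    for i, value in enumerate(conservation + [0.0]):
        if value > threshold:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_len:
                regions.append((run_start + 1, i))
            run_start = None
    return regions


# Toy alignment: seven sequences differing only at column 8
msa = ["AGVLWDVPSPPPMGK"] * 3 + ["AGVLWDVASPPPMGK"] * 3 + ["AGVLWDVSSPPPMGK"]
conservation = column_conservation(msa)
regions = conserved_regions(conservation, threshold=0.9, min_len=9)
```

In this toy alignment a single variable column interrupts the sequence, so no run of nine conserved columns exists and every 9-mer spanning that column is discarded, which illustrates how the residue-level approach can overlook low-frequency peptide variants.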


The block conservation approach is based on prediction of HLA binders from MSAs of homologous protein sequences. Once protein sequences are aligned, frequency and information content can be calculated for blocks of aligned peptides within the MSA (Shannon entropy [46]). Block conservation is based on segmenting a protein MSA into windows of a given size (Figure 2) (for T-cell epitope applications, the size can vary from 8 to 25 amino acids in length), and calculating the conservation of each peptide in the block as well as the entropy for the block as a whole.
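A minimal Python sketch of the block segmentation step is given below. It is an illustration under stated assumptions rather than the code behind the web server described in Paper II: the function name is hypothetical, entropy is computed in bits over the peptide frequencies within each block, and peptides containing gap characters are not treated specially.

```python
from collections import Counter
from math import log2


def block_stats(msa, l=9):
    """Slide a window of l columns over the alignment and summarise each block.

    Returns one dict per block with the 1-based start position, the peptide
    counts, and the Shannon entropy (bits) of the peptide distribution.
    """
    length = len(msa[0])
    blocks = []
    for start in range(length - l + 1):
        peptides = Counter(seq[start:start + l] for seq in msa)
        total = sum(peptides.values())
        entropy = -sum((c / total) * log2(c / total) for c in peptides.values())
        blocks.append({"start": start + 1, "peptides": peptides, "entropy": entropy})
    return blocks


# Toy alignment: seven sequences differing only at column 8
msa = ["AGVLWDVPSPPPMGK"] * 3 + ["AGVLWDVASPPPMGK"] * 3 + ["AGVLWDVSSPPPMGK"]
blocks = block_stats(msa, l=9)
```

A block with many equally frequent peptides yields a high entropy, whereas a block dominated by one peptide yields a value close to zero, so the entropy complements the simple count of distinct peptides.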

Figure 2: Subdivision of a multiple sequence alignment into blocks of peptides, l amino acids in length. In this example, l = 9. Figure from Paper II.

[Figure 2 content: MSA of N protein sequences of length L, giving rise to L - l + 1 blocks of peptides of length l. Block 1 is marked on the example alignment below.]
AGVLWDVPSPPPMGK...
AGVLWDVPSPPPMGK...
AGVLWDVPSPPPMGK...
AGVLWDVASPPPMGK...
AGVLWDVASPPPMGK...
AGVLWDVASPPPMGK...
AGVLWDVSSPPPMGK...

A bar plot displaying the number of different peptides in each block allows for easy overview of peptide conservation (Figure 3). Additionally, the entropy of each block provides insight into the nature of variation in the block, i.e. whether variability is conferred by a large number of equal frequency variants or whether a few high-frequency variants exist as well [2]. Following the conservation analysis, each block can be analyzed for potential HLA (class I or II) binders for a large number of HLA alleles. The number of predicted binders and their accumulated frequency in a given block determine the potential of each block for polyvalent vaccine designs: if all peptides are predicted binders to the same HLA with similar affinity, the block may be valuable for further examination for potential use in multi epitope vaccine constructs. The predicted binding affinities of peptide blocks to the selected HLA alleles can be visualized using a heat map displayed below the conservation graph. In this heat map, each column corresponds to the binding affinity of the given position in the MSA, and each row corresponds to an HLA allele. This approach to visualization allows simultaneous display of peptide conservation and overview of predicted binding to multiple HLA alleles (Figure 3).

Page 37: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

15

Figure 3: Visualization of block conservation analysis and HLA class I binding affinity predictions in 82 isolates of influenza A H7N9 hemagglutinin, in which at least 95% of the peptides were predicted to bind at least one of eight HLA alleles used in this example (A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, B*08:01, B*15:01). The bars show the minimum number of peptides in a block (Y axis) at a given starting position in the multiple sequence alignment (X axis) required for fulfilling the user-defined coverage threshold (in this case 95%). The heat map below the bars shows the percentage of peptides in the block predicted to bind to each of the HLA alleles used in this example. The color of each position in the heat map matrix ranges from gray (0% accumulated conservation by predicted binders in the block for the given allele) to red (peptides predicted to bind to the given allele with a minimum binding affinity of 500 nM represent 95% conservation in the block). In this figure, only blocks in which 95% of the peptides are predicted binders are shown. The starting positions of the blocks are shown below the heat map. The figure was generated using the Block conservation web server (http://met-hilab.bu.edu/blockcons).

[Figure 3 axes: X axis, alignment position; Y axis, number of peptides in blocks (1 to 5); heat map rows, HLA alleles A0101, A0201, A0301, A1101, A2402, B0702, B0801, B1501; heat scale, 0% to 100%. Block starting positions shown: 5, 36, 82, 101, 151, 154, 160, 162, 167, 202, 214, 244, 252, 260, 295, 324, 340, 355, 382, 429, 430, 473, 474, 494, 524, 527, 528, 536, 537, 538, 539, 541, 542.]


The algorithms and software implementation of the block conservation analysis for visual data mining of multi epitope conservation are described in detail in Paper II, and the tool is available at http://met-hilab.bu.edu/blockcons. Additionally, the block conservation web server integrates BlockLogo (available at: http://research4.dfci.harvard.edu/cvc/blocklogo/ and described in Paper III) for visualization of information content in individual blocks of peptides (Figure 4 and Table 1). As detailed in Paper III, the standalone version of the BlockLogo tool can also be used for visualization of B-cell epitopes and other discontinuous peptide motifs. This functionality is further explored and its potential applications described by Sun et al. (2013) [47].

Figure 4: Top panel: Sequence logo plot of the residues in the 9-residue block starting at position 244 of 82 isolates of influenza A H7N8 hemagglutinin generated using WebLogo. Bottom panel: BlockLogo of the peptides in the 244-252 block. The residue position in the multiple sequence alignment is shown on the X-axis, and the information content is shown on the Y-axis. See Table 1 for peptide frequencies and HLA binding affinity predictions. The figure was generated using the BlockLogo webserver (freely accessible at: http://research4.dfci.harvard.edu/cvc/blocklogo/).

Table 1: Predicted HLA binding affinities of each peptide present in the block of 9-mer peptides starting at position 244 of 82 isolates of influenza A H7N8 hemagglutinin. Prediction for HLA A*02:01 allele is listed.

#   Peptide     Frequency   Accumulated frequency   Number of sequences   Binding affinity to HLA A*02:01
1   LMLNPNDTV   71.25%      71.25%                  57                    53 nM
2   LILNPNDTV   10.00%      81.25%                  8                     458 nM
3   LLLDPNDTV   10.00%      91.25%                  8                     20 nM
4   LMLNPNDTI   7.50%       98.75%                  6                     312 nM
5   MLLDPNDTV   1.25%       100.00%                 1                     10 nM
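The accumulated frequencies in Table 1 and the bar heights in Figure 3 both follow from ranking the peptides of a block by frequency. The Python sketch below reproduces that arithmetic for the Table 1 counts. It is a hedged illustration: the function name is hypothetical, the threshold is assumed to be inclusive, and binding predictions are ignored, whereas the web server additionally restricts the count to predicted binders.

```python
def min_peptides_for_coverage(counts, threshold=0.95):
    """Pick the most frequent peptides until their accumulated frequency
    reaches the coverage threshold; return (peptides, coverage)."""
    total = sum(counts.values())
    covered, chosen = 0, []
    for peptide, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        chosen.append(peptide)
        covered += n
        if covered / total >= threshold:
            break
    return chosen, covered / total


# Number of sequences per peptide, taken from Table 1
table1 = {
    "LMLNPNDTV": 57,
    "LILNPNDTV": 8,
    "LLLDPNDTV": 8,
    "LMLNPNDTI": 6,
    "MLLDPNDTV": 1,
}
chosen, coverage = min_peptides_for_coverage(table1, threshold=0.95)
```

With these counts the three most frequent peptides accumulate to 91.25%, so a fourth peptide is needed to pass the 95% threshold used in Figure 3.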


Chapter 4 TANTIGEN: an updated database of tumor antigens to support immunotherapy target selection

"The most exciting phrase to hear in science, the one that heralds new discoveries, is not 'Eureka!' but 'That's funny...'" Isaac Asimov

Databases are essential resources that enable a multitude of bioinformatics analyses. As experimental methods keep advancing and high-throughput methods keep producing increasing volumes of raw data, the number of biological data repositories grows rapidly [48]. Similarly, the quantity and complexity of the data are growing, thus increasing the demands for computational resources to store, access, and analyze the data. As discussed in Paper I, a comprehensive database of TAs is essential for the reverse vaccinology approach and immunotherapy target selection. A comprehensive catalogue of classified TAs helps direct the search for epitopes embedded in these proteins, and a list of experimentally validated epitopes represents a valuable resource for target selection, therapy assembly, and design of validation strategies.

Several data sources provide information on tumor T cell antigens. The Cancer Immunity Peptide Database developed by the Ludwig Institute for Cancer Research defined four data tables containing 150 TAs with defined T cell epitopes: 56 TAs resulting from mutations, 31 shared tumor-specific antigens, 12 differentiation antigens, and 51 antigens overexpressed in tumors [49]. A list of human TAs reported in the literature as of February 2004 was collected by Parmiani and colleagues [50]. This list includes T cell-defined epitopes, while analogs, artificially modified epitopes, and virus-encoded and antibody-recognized antigens have been excluded. These sources provide valuable data on human tumor T cell antigens, but they are not up to date and do not provide any bioinformatics tools for data analysis.

The TANTIGEN database of tumor T cell antigens

To provide a comprehensive resource of tumor T cell antigens, we assembled the TANTIGEN database. We have cataloged 4014 TA entries representing variants of 258 unique protein TAs reported in the literature. KB-builder, an in-house database development framework, was deployed to construct this database [51]. Each record contains the information on TA sequence, variants (splice isoforms and mutation variants), known T cell epitopes and HLA ligands, and corresponding literature references. A set of computational tools for in-depth analysis of TAs has been integrated into TANTIGEN. These tools include TA classification and TA nomenclature (Figure 5), sequence comparison using BLAST search, multiple sequence alignments of antigens, mutation mapping, and T-cell epitope/HLA ligand visualization. Predicted Class I and Class II HLA binding peptides for 15 common HLA alleles are included in this database as putative targets. TANTIGEN is available at http://cvc.dfci.harvard.edu/tadb/. The assembly, structure, and content of TANTIGEN are described in detail in Paper IV.

Keeping TANTIGEN up-to-date

Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. Collecting these annotations in a single accessible database yields a valuable resource, but doing so is a non-trivial task. Data for updating TANTIGEN were extracted from three primary databases: UniProt/Swiss-Prot [52] for protein data and information, COSMIC [53] for data about known somatic mutations, and PubMed for published literature about TAs. All three sources are extensive repositories for their respective data types, and the quantity of data is steadily increasing (Figure 6).

Figure 5: Tumor antigen classification chart in TANTIGEN. Each new addition to TANTIGEN is classified according to the principles established by Van den Eynde and van der Bruggen [10].

[Figure 5 chart labels: Tumor antigen; Unique antigen; Shared antigen; Substitution mutation; Intron encoding; Alternative ORF; Chromosomal translocation; Internal tandem repeat; Shared tumor specific; Differentiation; Overexpressed; Unclassified.]

The first build of TANTIGEN was assembled in 2007 using manual extraction of relevant articles and data from the literature. This effort was incredibly time-consuming, but provided valuable insight into the nature of the data, which allowed for automation of the process for subsequent builds. However, automating the collection of the types of data stored in TANTIGEN was non-trivial. To do so, one is required to address the diversity of data types, diversity of data formats, and dispersion of data across different sources. The data is stored and retrieved either as structured text, table formats, semantic web formats (such as RDF, OWL, or XML), or non-structured text. Depending on the target data format, retrieval can be performed by direct extraction, parsing, text mining, or manual extraction. Text mining, manual extraction, or a combination of these two is common in extracting the high-level data, such as functional annotations. Data availability and individual entry size vary between different data types, presenting a computational challenge in terms of retrieval, handling, analysis, and storage. Additional factors that affect the complexity of the updating task are data heterogeneity, integration of multiple data types after retrieval, as well as provenance tracking for quality assessment [54].

Figure 6: Number of entries in PubMed, UniProt/Swiss-Prot, and COSMIC from their respective launch dates to present. Entries in PubMed were filtered by the search term "(tumor OR cancer) AND (antigen OR epitope)".

[Figure 6 panels: PubMed, UniProt/Swiss-Prot, and COSMIC. X-axis label: "Year". Y-axis label: "Entries (thousands)".]

To address these challenges, we designed and applied a machine learning approach for literature classification to support updating of TANTIGEN. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-nearest neighbor (k-NN) classifier on 310 abstracts yielded classification accuracy of 0.95 (with sensitivity of 0.96 and specificity of 0.93), thus showing significant value in support of extraction of relevant literature. Manual extraction of new antigenic proteins and tumor T cell antigens was subsequently performed from the classified literature.

A k-NN classifier with k=6 yielded the best performance, thus the body of classified abstracts was divided into seven groups, corresponding to whether an abstract had from zero to six relevant neighbors in the training set. Retrieving the literature published since the last TANTIGEN update was done by querying PubMed. Entries were pre-filtered by the search term "(tumor OR cancer) AND (antigen OR epitope)", which yielded 48,130 abstracts for classification. Out of these, 117 had six relevant neighbors, 156 had five, 212 had four, 859 had three, 3,489 had two, 12,738 had one, and 30,856 abstracts had zero relevant neighbors. We manually examined the top 273 scoring papers, where we found 13 new antigenic proteins with 32 newly reported tumor T-cell epitopes. Additionally, we found more than 100 new T cell epitopes discovered in proteins that were previously recorded as TAs in TANTIGEN.

The document-term matrix (DTM) of the training corpus contains more than 5,600 terms. Most are very rare terms present in only one or a few abstracts, and have no significance for abstract classification. Examination of the ten terms that are most discriminative between abstracts of relevant and irrelevant articles (determined by t test) shows a distinct signature and reveals particular emphasis on such terms as "immunotherapy", "epitope", "T cell", and "CTL" (Figure 7). These terms are likely to be the main drivers of classification and may very well be sufficient to support the main task of classification. Notably, all discriminating terms are predominant in, but not exclusive to, relevant abstracts, indicating that a machine learning approach to classification is likely to outperform a simple keyword search. For a full discussion of the results, please refer to Paper V.
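The grouping of abstracts by their number of relevant neighbors can be illustrated with a small, self-contained Python sketch. Everything in it is an assumption made for illustration: the bag-of-words features, the cosine similarity measure, the function names, and the toy training sentences are not taken from Paper V, which should be consulted for the features and distance measure actually used.

```python
import re
from collections import Counter
from math import sqrt


def vectorize(text):
    """Bag-of-words term counts for one abstract (lowercased, letters only)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def relevant_neighbors(abstract, training, k=6):
    """Count relevant abstracts among the k most similar training abstracts.

    training is a list of (text, is_relevant) pairs; the result ranges
    from 0 to k, which gives the k + 1 groups described in the text.
    """
    query = vectorize(abstract)
    ranked = sorted(training, key=lambda item: cosine(query, vectorize(item[0])),
                    reverse=True)
    return sum(1 for _, is_relevant in ranked[:k] if is_relevant)


# Hypothetical toy training set (the real one held 310 labelled abstracts)
training = [
    ("tumor antigen epitope recognized by CTL", True),
    ("T cell epitope of melanoma antigen for immunotherapy", True),
    ("HLA restricted CTL epitope in tumor", True),
    ("serum antigen level as prognostic marker", False),
    ("imaging of tumor volume after chemotherapy", False),
    ("cancer incidence in a national cohort", False),
]
score = relevant_neighbors("novel CTL epitope from a tumor antigen for immunotherapy",
                           training, k=3)
```

Abstracts with the highest neighbor counts would be read first, which mirrors the manual examination of the 273 abstracts that had five or six relevant neighbors.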

Figure 7: Average frequency of the top ten most discriminative terms between relevant and irrelevant abstracts. Significance of difference is based on a t test of term frequency between corpora, and p-values are shown above the bars. Terms are stemmed to ensure completeness in term count.

[Figure 7 plot: "Literature classification signature". x-axis: Terms; y-axis: Average term frequency; legend: Relevant, Irrelevant. The ten p-values shown range from 4.4E-18 to 3.5E-9.]
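The term ranking behind Figure 7 can be sketched as follows. The text states only that a t test was used, so the choice of Welch's unequal-variance statistic is an assumption, as are the function names and the toy frequencies. The sketch ranks terms by the absolute t statistic and does not compute p-values.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se else 0.0


def rank_terms(relevant_freqs, irrelevant_freqs):
    """Rank terms by |t| between per-abstract frequencies in the two corpora.

    Both arguments map a (stemmed) term to a list of per-abstract frequencies.
    """
    scores = {
        term: welch_t(relevant_freqs[term], irrelevant_freqs[term])
        for term in relevant_freqs
    }
    return sorted(scores, key=lambda term: abs(scores[term]), reverse=True)


# Hypothetical per-abstract frequencies for two stemmed terms.
relevant = {"epitop": [3, 4, 2, 5], "cancer": [2, 3, 2, 3]}
irrelevant = {"epitop": [0, 1, 0, 1], "cancer": [2, 2, 3, 3]}
ranking = rank_terms(relevant, irrelevant)
```

In this toy input, "epitop" separates the two corpora while "cancer" does not, so "epitop" is ranked first.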


Paper IV describes in detail the assembly, structure, and content of the TANTIGEN database of tumor T cell antigens. To our knowledge, TANTIGEN is the most comprehensive database of TAs, and the only one of its kind to be kept up-to-date, in spite of an ever-growing body of the complex, unstructured, and dispersed type of data that TAs and epitopes represent. In Paper V, we present a conceptual framework for streamlining periodic updating of TANTIGEN with standardized and non-standardized data from primary and secondary data repositories, as well as from the literature. The framework is based on text mining methods for categorizing literature. It utilizes term signatures defined from freely available article abstracts, which enable significantly faster manual extraction of relevant data. This method was used successfully and efficiently to update TANTIGEN and will be applied periodically to ensure that TANTIGEN remains up-to-date.


Chapter 5 Tumor antigens as proteogenomic biomarkers in invasive ductal carcinomas

"If I had asked people what they wanted, they would have said faster horses." Henry Ford (attributed)

Statistical testing for patterns in data from high-throughput mRNA expression experiments has long been the primary method for elucidating candidate biomarkers from human cancers [55]. The experimental methods are constantly refined with the inclusion of epigenetic experimentation [56], measurements of miRNAs [57], and protein expression [58]. The analyses are expanded with ontology enrichment analyses [59], pathway analyses [59], and co-analyses of different data types [60]. Additionally, as experimental methods increase in efficiency and resolution, the bodies of data examined keep growing. A large number of diagnostic and prognostic biomarkers have been reported for different cancers and a small number are regularly utilized in clinics, but it is believed by some that the majority of reported biomarker candidates are the result of stochastic noise within data sets [61]. To extract meaningful knowledge about the etiology of cancer using these data, it is generally desired to provide a biological context for the biomarkers derived from data-driven methods. This knowledge includes the understanding of respective biological pathways that provide clues for the role of these candidate biomarkers in disease progression. If the role of genes in the diseases for which they serve as markers can be defined, they have the potential to become vastly more useful, as exceptions and variations may more readily be anticipated and explained in clinical models.

Tumor antigens as biomarkers for cancer

TAs are a group of proteins against which the immune system has been recorded to react. Specifically, they are recognized in cells where they are present in larger than usual amounts, or are physicochemically altered to a degree at which they no longer resemble native human proteins [62]. Their presence or abundance in cancer cells is often unique and their roles and functions are, in many cases, studied extensively. It can be hypothesized that proteins that are frequently observed (and autologously recognized by the immune system) in tumor cells play a significant role in tumorigenesis. They, therefore, hold the potential to be highly specific biomarkers for the cancers of their origin. Challenges pertaining to the utility of TA biomarkers are similar to those we face when we statistically filter out potential biomarkers from vast amounts of high-throughput genomics data: our understanding of their function and role in cancer must be elevated to a degree where we can account for outliers and exceptions to the general rules we define in clinical models.

To achieve this, we analyzed the mRNA and protein expression of 32 TAs in normal tissue versus IDC tissue, supplied by collaborators at Massachusetts General Hospital (MGH) [63] and Northeastern University (NEU) [64]. In addition to these data, we also analyzed the correlation between mRNA and protein expression for 86 gene/product pairs from 404 IDC patients, collected by The Cancer Genome Atlas (TCGA) consortium [65]. This approach helps provide an IDC-specific assessment of mRNA/protein expression correlation, as the mRNA and protein expression data from MGH and NEU were unpaired. The mean Spearman's correlation of expression of the 86 mRNA/protein IDC pairs was found to be 0.35 (Figure 8). The threshold for correlation was set at ρ > 0.455 to compare our results with a previous study of correlation between mRNA and protein expression in 23 different cancer cell lines by Gry et al. (2009) [66].


We found 29 gene/product pairs with a correlation coefficient > 0.455, six of which are known TAs (Figure 9). These results warrant caution in proteogenomic analyses, as well as in studies of TAs as therapy targets: only 29 of the 86 pairs (34%) exceeded the threshold, so mRNA expression cannot be assumed to reflect protein expression.
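The correlation analysis can be illustrated with a short sketch. It computes Spearman's rank correlation as the Pearson correlation of average ranks, then keeps the pairs above the ρ > 0.455 threshold used in the text. The gene labels and expression values below are hypothetical, and the sketch makes no claim about the software used for the original analysis.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


THRESHOLD = 0.455  # cutoff quoted in the text, following Gry et al. [66]


def correlated_pairs(expression):
    """Return {gene: rho} for pairs whose rho exceeds THRESHOLD.

    `expression` maps a gene to (mRNA values, protein values) across patients.
    """
    selected = {}
    for gene, (mrna, protein) in expression.items():
        rho = spearman(mrna, protein)
        if rho > THRESHOLD:
            selected[gene] = rho
    return selected


# Hypothetical expression values for two placeholder genes, five patients each.
expression = {
    "GENE_A": ([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 2.9, 3.7, 5.2, 6.0]),
    "GENE_B": ([1.0, 2.0, 3.0, 4.0, 5.0], [3.0, 1.0, 4.0, 2.0, 3.5]),
}
selected = correlated_pairs(expression)
```

Because the statistic is rank-based, it is insensitive to the different measurement scales of the mRNA and protein data, which makes it a natural choice when comparing the two.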

Figure 8: Density distribution of correlation coefficients of mRNA vs protein expression (aqua) and density distribution of correlation coefficients of randomized mRNA vs protein expression (pink). The mean correlation in each distribution is marked by the dashed red lines.

Figure 9: Spearman's rank correlation between mRNA expression and protein expression of 86 genes in 404 IDC patients. The tumor antigens are highlighted in red.

[Figure 9 plot. x-axis: genes; y-axis: Spearman's ρ, 0.0 to 0.8]

We found that all but two TAs were expressed in IDC at the protein level, and a subset of these was aberrantly expressed at the mRNA level. We examined their known and proposed roles in cancer by analyzing the TAs and their closest functional counterparts for overlapping participation in canonical pathways. With this approach, we defined four functional modules of TAs and interactants that overlapped in canonical pathways (Figure 10). The perturbation of these pathways was readily linked to the hallmarks of cancer [67] by querying relevant literature.

Figure 10: Confidence view of expanded protein-protein interactions within the four functional groups of tumor antigens (highlighted in gray), generated using the STRING database. Interacting proteins were added to the tumor antigens using one cycle of expansion. Nodes correspond to the proteins and edges correspond to their functional interactions. Thicker edges signify higher confidence in the interaction. Interactions with a confidence score higher than 0.5 are shown.

[Figure 10 panels: A), B), C), D)]

Currently, 258 TAs are catalogued and annotated in the TANTIGEN database of TAs. Expanding the analysis performed here to the full set of TAs is highly likely to provide additional insights. RNA sequencing is another desirable follow-up study of TAs found to be expressed in a given cancer tissue, as this may reveal known or novel splice isoforms, mutations, or other genetic aberrations. In addition to the diagnostic and prognostic potential of such a study, a catalogue of expressed TAs and TA variants in a given tumor can be further analyzed for their potential as therapeutic targets and directly applied in personalized treatment modalities. Paper VI provides a comprehensive and detailed discussion of these results.
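The network construction described in the Figure 10 caption (one cycle of expansion, confidence cutoff of 0.5) can be sketched on a plain edge list. This is not the STRING interface; the function name, the interpretation of "one cycle of expansion" as adding direct neighbors of the seed proteins, and the protein labels and scores are all assumptions made for illustration.

```python
def expand_once(seeds, edges, min_confidence=0.5):
    """Add direct interaction partners of the seed proteins, then return the
    expanded node set and every retained edge between those nodes.

    `edges` is a list of (protein_a, protein_b, confidence) tuples; only edges
    with confidence strictly above `min_confidence` are considered.
    """
    seeds = set(seeds)
    confident = [(a, b, s) for a, b, s in edges if s > min_confidence]

    # One cycle of expansion: partners connected directly to a seed protein.
    nodes = set(seeds)
    for a, b, _ in confident:
        if a in seeds:
            nodes.add(b)
        if b in seeds:
            nodes.add(a)

    # Show all confident interactions among the expanded set of proteins.
    shown = [(a, b, s) for a, b, s in confident if a in nodes and b in nodes]
    return nodes, shown


# Placeholder proteins and scores; "TA1" and "TA2" stand in for tumor antigens.
edges = [
    ("TA1", "P1", 0.90),
    ("TA1", "P2", 0.40),
    ("P1", "P3", 0.80),
    ("P1", "P4", 0.95),
    ("TA2", "P4", 0.70),
]
nodes, shown = expand_once({"TA1", "TA2"}, edges)
```

Here P2 is dropped because its only interaction falls below the cutoff, and P3 is dropped because it is two steps away from any seed.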


Chapter 6 Conclusion

“Quotation is a serviceable substitute for wit” Oscar Wilde

The focus of this thesis has been to study TAs as therapy targets and biomarkers in cancer. I addressed these objectives in a series of six papers, summarized in the previous chapters. This work has generated new bioinformatics tools, fresh hypotheses, knowledge, and a rich collection of data that can be utilized to take T cell-based immunotherapies closer to successful applications. More specifically, I have provided a review of the current state of T cell-based cancer immunotherapy, a workflow for discovery of therapy targets using principles from reverse vaccinology, a theoretical framework for identification of multiple evolutionarily related antigenic targets suitable for co-targeting, two software tools (and corresponding public webservers) for characterization of co-targets, a database of TAs, a text mining-based method for rapidly updating said database, and a preliminary study of TAs as biomarkers for tumorigenesis in IDC.

Future perspectives

The research presented in this thesis raised as many new questions as it provided answers, which is a hallmark, I would like to believe, of a successful scientific endeavor. As a bioinformatics PhD thesis, much of the desired future work relies on continued synergistic collaboration with wet lab researchers, as validation of hypotheses, models, and predictions generated computationally requires wet lab experimentation. This will in turn generate more data for bioinformatics analyses, and feed back to facilitate refinement of existing tools and development of new ones. It would give me much pleasure to further investigate the topics studied in this thesis and advance the set of tools and the resulting body of knowledge.

Paper II describes block conservation analysis of antigenic variation, which is reliant on the availability of sufficient, non-redundant sequence data. Therefore, the application of the method presented in Paper II was demonstrated on sets of viral sequence data, which are available in much larger quantities than TA sequence data. Although the databases of somatic mutations in cancer and tumor sequence repositories are growing, the body of public TA sequence data is limited at present. However, high-throughput genomics projects will likely provide a significantly larger body of data in the near future. Following a more thorough analysis of these data, all computationally predicted therapy targets must be experimentally validated before further evaluations of their therapeutic potential in pre-clinical trials. The unanswered questions in Paper VI are closely connected to the future studies I have proposed; before TAs can be utilized as cancer biomarkers or targeted with immunotherapies, comprehensive RNA sequencing and quantification of immunotherapy targets from a large number of tumors is desirable. Lastly, the vast majority of theories, methods, and computational tools used for this work can be improved in one way or another, and it is an onerous impulse of the passionate student of science to improve the state of the art whenever possible.


References

1. Olsen LR, Zhang GL, Reinherz EL, Brusic V: FLAVIdB: A data mining system for knowledge discovery in flaviviruses with direct applications in immunology and vaccinology. Immunome Res. 2011, 7:1–9.
2. Olsen LR, Zhang GL, Keskin DB, Reinherz EL, Brusic V: Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy. Front. Immunol. 2011, 2:1–15.
3. Olsen LR, Kudahl UJ, Simon C, Sun J, Schönbach C, Reinherz EL, Zhang GL, Brusic V: BlockLogo: Visualization of peptide and sequence motif conservation. J. Immunol. Methods 2013:8–15.
4. Olsen LR, Kudahl UJ, Winther O, Brusic V: Literature classification for semi-automated updating of biological knowledgebases. BMC Genomics 2013, 14 Suppl 6:S14.
5. Andersen MH, Junker N, Ellebaek E, Svane IM, Thor Straten P: Therapeutic cancer vaccines in combination with conventional therapy. J. Biomed. Biotechnol. 2010, 2010:237623.
6. Kirkwood JM, Butterfield LH, Tarhini AA, Zarour H, Kalinski P, Ferrone S: Immunotherapy of cancer in 2012. CA. Cancer J. Clin. 2012, 62:309–35.
7. Rosenberg SA, Yang JC, Restifo NP: Cancer immunotherapy: moving beyond current vaccines. Nat. Med. 2004, 10:909–15.


8. Kalos M, June CH: Adoptive T cell transfer for cancer immunotherapy in the era of synthetic biology. Immunity 2013, 39:49–60.
9. Boon T, van der Bruggen P: Human tumor antigens recognized by T lymphocytes. J. Exp. Med. 1996, 183:725–9.
10. Van den Eynde BJ, van der Bruggen P: T cell defined tumor antigens. Curr. Opin. Immunol. 1997, 9:684–93.
11. Seremet T, Brasseur F, Coulie PG: Tumor-specific antigens and immunologic adjuvants in cancer immunotherapy. Cancer J. 2011, 17:325–30.
12. Lurquin C, Van Pel A, Mariamé B, De Plaen E, Szikora JP, Janssens C, Reddehase MJ, Lejeune J, Boon T: Structure of the gene of tum- transplantation antigen P91A: the mutated exon encodes a peptide recognized with Ld by cytolytic T cells. Cell 1989, 58:293–303.
13. Spiotto MT, Yu P, Rowley DA, Nishimura MI, Meredith SC, Gajewski TF, Fu Y-X, Schreiber H: Increasing Tumor Antigen Expression Overcomes “Ignorance” to Solid Tumors via Crosspresentation by Bone Marrow-Derived Stromal Cells. Immunity 2002, 17:737–747.
14. Critchley-Thorne RJ, Simons DL, Yan N, Miyahira AK, Dirbas FM, Johnson DL, Swetter SM, Carlson RW, Fisher GA, Koong A, Holmes S, Lee PP: Impaired interferon signaling is a common immune defect in human cancer. Proc. Natl. Acad. Sci. U. S. A. 2009, 106:9010–5.
15. DuPage M, Mazumdar C, Schmidt LM, Cheung AF, Jacks T: Expression of tumour-specific antigens underlies cancer immunoediting. Nature 2012, 482:405–9.
16. Walter S, Weinschenk T, Stenzl A, Zdrojowy R, Pluzanska A, Szczylik C, Staehler M, Brugger W, Dietrich P-Y, Mendrzyk R, Hilf N, Schoor O, Fritsche J, Mahr A, Maurer D, Vass V, Trautwein C, Lewandrowski P, Flohr C, Pohla H, Stanczak JJ, Bronte V, Mandruzzato S, Biedermann T, Pawelec G, Derhovanessian E, Yamagishi H, Miki T, Hongo F, Takaha N, Hirakawa K, Tanaka H, Stevanovic S, Frisch J, Mayer-Mokler A, Kirner A, Rammensee H-G, Reinhardt C, Singh-Jasuja H: Multipeptide immune response to cancer vaccine IMA901 after single-dose cyclophosphamide associates with longer patient survival. Nat. Med. 2012, 18.
17. Rappuoli R: Reverse vaccinology. Curr. Opin. Microbiol. 2000, 3:445–50.
18. Harndahl M, Rasmussen M, Roder G, Dalgaard Pedersen I, Sørensen M, Nielsen M, Buus S: Peptide-MHC class I stability is a better predictor than peptide affinity of CTL immunogenicity. Eur. J. Immunol. 2012, 42:1405–16.
19. Rudolph MG, Stanfield RL, Wilson IA: How TCRs bind MHCs, peptides, and coreceptors. Annu. Rev. Immunol. 2006, 24:419–66.
20. Falk K, Rötzschke O, Stevanović S, Jung G, Rammensee HG: Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. Nature 1991, 351:290–6.
21. Reinherz EL, Tan K, Tang L, Kern P, Liu J, Xiong Y, Hussey RE, Smolyar A, Hare B, Zhang R, Joachimiak A, Chang HC, Wagner G, Wang J: The crystal structure of a T cell receptor in complex with peptide and MHC class II. Science 1999, 286:1913–21.
22. Van der Burg SH, Visseren MJ, Brandt RM, Kast WM, Melief CJ: Immunogenicity of peptides bound to MHC class I molecules depends on the MHC-peptide complex stability. J. Immunol. 1996, 156:3308–14.
23. Montoya M, Del Val M: Intracellular rate-limiting steps in MHC class I antigen processing. J. Immunol. 1999, 163:1914–22.
24. Zhang GL, Ansari HR, Bradley P, Cawley GC, Hertz T, Hu X, Jojic N, Kim Y, Kohlbacher O, Lund O, Lundegaard C, Magaret CA, Nielsen M, Papadopoulos H, Raghava GPS, Tal V-S, Xue LC, Yanover C, Zhu S, Rock MT, Crowe JE, Panayiotou C, Polycarpou MM, Duch W, Brusic V: Machine learning competition in immunology - Prediction of HLA class I binding peptides. J. Immunol. Methods 2011, 374:1–4.
25. Lin HH, Ray S, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. 2008, 9:8.


26. Lin HH, Zhang GL, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics 2008, 9 Suppl 12:S22.
27. Hughes AL, Hughes MK: Self peptides bound by HLA class I molecules are derived from highly conserved regions of a set of evolutionarily conserved proteins. Immunogenetics 1995, 41:257–62.
28. Borghans JAM, Beltman JB, De Boer RJ: MHC polymorphism under host-pathogen coevolution. Immunogenetics 2004, 55:732–9.
29. Prugnolle F, Manica A, Charpentier M, Guégan JF, Guernier V, Balloux F: Pathogen-driven selection and worldwide HLA class I diversity. Curr. Biol. 2005, 15:1022–7.
30. Hertz T, Nolan D, James I, John M, Gaudieri S, Phillips E, Huang JC, Riadi G, Mallal S, Jojic N: Mapping the landscape of host-pathogen coevolution: HLA class I binding and its relationship with evolutionary conservation in human and viral proteins. J. Virol. 2011, 85:1310–21.
31. Da Silva J, Hughes AL: Conservation of cytotoxic T lymphocyte (CTL) epitopes as a host strategy to constrain parasite adaptation: evidence from the nef gene of human immunodeficiency virus 1 (HIV-1). Mol. Biol. Evol. 1998, 15:1259–68.
32. Yeager M, Carrington M, Hughes AL: Class I and class II MHC bind self peptide sets that are strikingly different in their evolutionary characteristics. Immunogenetics 2000, 51:8–15.
33. Hughes AL, Packer B, Welch R, Bergen AW, Chanock SJ, Yeager M: Widespread purifying selection at polymorphic sites in human protein-coding loci. Proc. Natl. Acad. Sci. U. S. A. 2003, 100:15754–7.
34. Lucas M, Karrer U, Lucas A, Klenerman P: Viral escape mechanisms--escapology taught by viruses. Int. J. Exp. Pathol. 2001, 82:269–86.
35. Moore CB, John M, James IR, Christiansen FT, Witt CS, Mallal SA: Evidence of HIV-1 adaptation to HLA-restricted immune responses at a population level. Science 2002, 296:1439–43.
36. Kiepiela P, Leslie AJ, Honeyborne I, Ramduth D, Thobakgale C, Chetty S, Rathnavalu P, Moore C, Pfafferott KJ, Hilton L, Zimbwa P, Moore S, Allen T, Brander C, Addo MM, Altfeld M, James I, Mallal S, Bunce M, Barber LD, Szinger J, Day C, Klenerman P, Mullins J, Korber B, Coovadia HM, Walker BD, Goulder PJR: Dominant influence of HLA-B in mediating the potential co-evolution of HIV and HLA. Nature 2004, 432:769–75.
37. Hughes AL, Hughes MAK: More effective purifying selection on RNA viruses than in DNA viruses. Gene 2007, 404:117–25.
38. Pybus OG, Rambaut A, Belshaw R, Freckleton RP, Drummond AJ, Holmes EC: Phylogenetic evidence for deleterious mutation load in RNA viruses and its contribution to viral evolution. Mol. Biol. Evol. 2007, 24:845–52.
39. Rousseau CM, Daniels MG, Carlson JM, Kadie C, Crawford H, Prendergast A, Matthews P, Payne R, Rolland M, Raugi DN, Maust BS, Learn GH, Nickle DC, Coovadia H, Ndung’u T, Frahm N, Brander C, Walker BD, Goulder PJR, Bhattacharya T, Heckerman DE, Korber BT, Mullins JI: HLA class I-driven evolution of human immunodeficiency virus type 1 subtype c proteome: immune escape and viral load. J. Virol. 2008, 82:6434–46.
40. Hicklin DJ, Marincola FM, Ferrone S: HLA class I antigen downregulation in human cancers: T-cell immunotherapy revives an old story. Mol. Med. Today 1999, 5:178–86.
41. Maeurer MJ, Gollin SM, Martin D, Swaney W, Bryant J, Castelli C, Robbins P, Parmiani G, Storkus WJ, Lotze MT: Tumor escape from immune recognition: lethal recurrent melanoma in a patient associated with downregulation of the peptide transporter protein TAP-1 and loss of expression of the immunodominant MART-1/Melan-A antigen. J. Clin. Invest. 1996, 98:1633–41.
42. Restifo NP, Esquivel F, Kawakami Y, Yewdell JW, Mulé JJ, Rosenberg SA, Bennink JR: Identification of human cancers deficient in antigen processing. J. Exp. Med. 1993, 177:265–72.
43. Huang Y, Shah S, Qiao L: Tumor resistance to CD8+ T cell-based therapeutic vaccination. Arch. Immunol. Ther. Exp. (Warsz). , 55:205–17.
44. Khan AM, Miotto O, Nascimento EJM, Srinivasan KN, Heiny AT, Zhang GL, Marques ET, Tan TW, Brusic V, Salmon J, August JT: Conservation and variability of dengue virus proteins: implications for vaccine design. PLoS Negl. Trop. Dis. 2008, 2:e272.


45. Tan PT, Khan AM, August JT: Highly conserved influenza A sequences as T cell epitopes-based vaccine targets to address the viral variability. Hum. Vaccin. 2011, 7:402–9.
46. Shannon CE: A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27:379–423, 623–656.
47. Sun J, Zhang GL, Olsen LR, Reinherz EL, Brusic V: Landscape of neutralizing monoclonal antibodies against dengue virus. In Proc. ACM-BCB, 2013 Sep 22-25. Washington DC, USA: 2013.
48. Fernández-Suárez XM, Galperin MY: The 2013 Nucleic Acids Research Database Issue and the online molecular biology database collection. Nucleic Acids Res. 2013, 41:D1–7.
49. Van der Bruggen P, Stroobant V, Vigneron N, Van den Eynde B: Peptide database: T cell-defined tumor antigens. Cancer Immun 2013.
50. Novellino L, Castelli C, Parmiani G: A listing of human tumor antigens recognized by T cells: March 2004 update. Cancer Immunol. Immunother. 2005, 54:187–207.
51. Zhang GL, Chitkushev L, Olsen LR, Kudahl UJ, Simon C, Brusic V: Streamlining the development of immunological knowledge bases. In Genomics Drug Discov. edited by Sakharkar M (Submitted); 2014.
52. The UniProt Consortium: Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41:D43–7.
53. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, Futreal PA: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011, 39:D945–50.
54. Zhao J, Miles A, Klyne G, Shotton D: Linked data and provenance in biological data webs. Brief. Bioinform. 2009, 10:139–52.
55. Ramaswamy S, Tamayo P, Rifkin R, Mukherjee S, Yeang CH, Angelo M, Ladd C, Reich M, Latulippe E, Mesirov JP, Poggio T, Gerald W, Loda M, Lander ES, Golub TR: Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. U. S. A. 2001, 98:15149–54.


56. Esteller M: Epigenetics in cancer. N. Engl. J. Med. 2008, 358:1148–59.
57. Lu J, Getz G, Miska EA, Alvarez-Saavedra E, Lamb J, Peck D, Sweet-Cordero A, Ebert BL, Mak RH, Ferrando AA, Downing JR, Jacks T, Horvitz HR, Golub TR: MicroRNA expression profiles classify human cancers. Nature 2005, 435:834–8.
58. Vivona S, Gardy JL, Ramachandran S, Brinkman FSL, Raghava GPS, Flower DR, Filippini F: Computer-aided biotechnology: from immuno-informatics to reverse vaccinology. Trends Biotechnol. 2008, 26:190–200.
59. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 2005, 102:15545–50.
60. Snyder M, Weissman S, Gerstein M: Personal phenotypes to go with personal genomes. Mol. Syst. Biol. 2009, 5:273.
61. Ratain MJ, Glassman RH: Biomarkers in phase I oncology trials: signal, noise, or expensive distraction? Clin. Cancer Res. 2007, 13:6545–8.
62. Haen SP, Rammensee H-G: The repertoire of human tumor-associated epitopes--identification and selection of antigens and their application in clinical trials. Curr. Opin. Immunol. 2013, 25:277–83.
63. Ma X-J, Dahiya S, Richardson E, Erlander M, Sgroi DC: Gene expression profiling of the tumor microenvironment during breast cancer progression. Breast Cancer Res. 2009, 11:R7.
64. Cha S, Imielinski MB, Rejtar T, Richardson EA, Thakur D, Sgroi DC, Karger BL: In situ proteomic analysis of human breast cancer epithelial cells using laser capture microdissection: annotation by protein set enrichment analysis and gene ontology. Mol. Cell. Proteomics 2010, 9:2529–44.
65. The Cancer Genome Atlas (TCGA) Research Network: Comprehensive molecular portraits of human breast tumours. Nature 2012, 490:61–70.


66. Gry M, Rimini R, Strömberg S, Asplund A, Pontén F, Uhlén M, Nilsson P: Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics 2009, 10:365.
67. Hanahan D, Weinberg R: Hallmarks of cancer: the next generation. Cell 2011, 144:646–74.


Paper I Bioinformatics for cancer immunotherapy target discovery

Cancer Immunology, Immunotherapy (Submitted)

Lars Rønn Olsen1,2,3,*, Benito Campos4, Mike Stein Barnkob5, Ole Winther1,6, Vladimir Brusic3 and Mads Hald Andersen7

1Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
2Biotech Research and Innovation Center (BRIC), University of Copenhagen, Copenhagen, Denmark
3Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
4Division of Experimental Neurosurgery, Department of Neurosurgery, Heidelberg University Hospital, Heidelberg, Germany
5Department of Clinical Immunology, Odense University Hospital, University of Southern Denmark, Odense, Denmark
6Cognitive Systems, DTU Compute, Technical University of Denmark, Kgs. Lyngby, Denmark
7Center for Cancer Immune Therapy, Department of Hematology, Herlev Hospital, Herlev, Denmark

*Corresponding author:


Abstract

The mechanisms of immune response to cancer have been studied extensively and great effort has been invested into harnessing the therapeutic potential of the immune system. Immunotherapies have seen significant advances in the past 20 years. However, the full potential of protective and therapeutic cancer immunotherapies has yet to be fulfilled. The insufficient efficacy of existing treatments can be attributed to a number of biological and technical issues. In this review, we detail the current limitations of immunotherapy target selection and design, and review computational methods to streamline therapy target discovery in a bioinformatics analysis pipeline. We describe specialized bioinformatics tools and databases for four main steps in immunotherapy target selection: cataloguing potential antigens, identification of potential T cell epitopes, selection of stable epitopes, and selection of co-targets for multi-epitope strategies. Lastly, we provide examples of application for three well-known tumor antigens (HER2, survivin, and IDO), and suggest bioinformatics methods to ameliorate therapy resistance and ensure efficient and lasting control of tumors.


1 Introduction

Cancer immunotherapy is a major modality of cancer treatment. It is often used in combination with other cancer treatments such as surgery, radiotherapy, and chemotherapy. Immunotherapy is an intervention that modulates immune responses for effective targeting and elimination of tumor cells. Immunotherapy is a broad term that encompasses variety of treatments; some enhance immune response in a very general way, while others direct immune system to specifically target cancer cells. Although immunotherapy has been well documented for more than a century [1], the mechanisms of immune responses and tumor evasion have been poorly understood. Understanding the genetic background of the patient as well as of molecular characteristics of individual cancers at time of diagnosis and during therapy is essential for stratification of patients, assessing their responses, and selection of optimal therapies [2]. Immunotherapies focus on tumor-specific antigens as well as on regulatory mechanisms that drive effective immune responses. Inducing immune responses to cancer is a delicate balancing act where effective immunity against cancer should be enhanced, while the related autoimmune responses against normal cells should be minimized [3]. Innovations in molecular and cellular biology and related instrumentation have enabled study of immune infiltrates and identification of immune cell subsets and molecules that mediate anti-cancer immunity [4]. Immune evasion by cancer involves several mechanisms such as suppression of antigen expression and presentation, recruitment of immunosuppressive paracrine mediators, suppression of cytotoxic T-lymphocyte activity through inhibitor molecules, or recruitment of suppressor cells [5]. Passive immunotherapies involve administration of antibodies and other immune system products that provide immunity but without activation of the host’s immune responses. Active immunotherapies, on the other hand, stimulate host immune responses against cancer. 
Major immunotherapy strategies involve use of antibodies, cytokines, cellular therapies, and vaccines [1]. Cancer is characterized by biological capabilities acquired through multiple mutations during multi-step development of tumors. The ten hallmarks of cancer include sustained proliferation, evasion of growth suppression, resistance to cell death, unlimited replicative potential, angiogenesis, local tissue invasion and metastasis, deregulated cell energetics, genome instability and mutation, avoidance of immune destruction, and tumor-promoting inflammation [6]. Specific mechanisms that support these hallmarks can be studied using, for example, genomics, proteomics, or serology that can reveal both the mechanisms of tumorigenesis and cancer-specific antigens in individual cancers [4, 7–9]. Understanding these mechanisms and knowledge about cancer antigens that can be targeted can lead to personalization of immunotherapies [8, 10] and the ability to optimize them [11]. A number of tumor antigens have been successfully targeted by antibody-based therapies, including EGFR, ERBB2, VEGF, CTLA4, CD20, CD30 and CD52, as well as antigens that are overexpressed in cancers and represent potential targets, including EGFRvIII, MET, and FAP [12]. Studies of anti-tumor vaccines in animal models have shown that immunization against a variety of tumors can be effective [13, 14]. Similar vaccines studied in human clinical trials showed effectiveness mainly against precancerous tumors, while their ability to protect against established cancers declined as tumors progressed to advanced stages. The lack of vaccine efficacy is attributed to the high growth rate of tumors, which can outpace the rate of immune destruction, the immunosuppressive activity of tumors, the down-regulation of immune responses, or inadequate targeting, dosage, or scheduling [15, 16]. Several unique tumor antigens have been proposed as targets for immunotherapies [17]. Unique antigens are molecules derived from somatic mutation of proteins expressed in tumor cells or alternative protein products derived from RNA splicing [18]. These antigens show a strong ability to elicit antigen-specific T cell responses in ex vivo studies. However, vaccination and tumor immunotherapy trials have not yet demonstrated their effectiveness. Instead, oncoantigens, the molecules that support tumorigenesis, have been proposed as preferable candidate targets for cancer vaccines, as it is hoped that targeting them will reduce immune evasion [19]. Identification and detailed characterization of vaccine targets is an essential step in tumor vaccine development.
Technical advances in instrumentation, sample processing, immunological assays, and bioinformatics techniques have generated large amounts of immunological data, including experimentally identified tumor antigens and T cell epitopes, novel tumor biomarkers and differentially expressed genes or proteins identified through genomics, proteomics, or other high-throughput methods. Collection, analysis and management of these data require extensive use of bioinformatics applications. In this review we describe bioinformatics tools used for the study of cancer immunotherapies focusing on epitope-based and cell-based vaccines.

1.1 T cell-based therapies

There are several major challenges when designing an efficient T cell-based therapy aimed at implementing or re-activating an immune response against a given cancer. Untreated tumors almost invariably progress in spite of frequently observed tumor recognition by the patient’s own immune system [20]. This inability of the immune defense to contain the expanding tumor is mainly due to a large repertoire of immunosuppressive and immune-evasive strategies employed by cancer cells. Tumor-suppressor mechanisms include specific cell types, effector molecules, and pathways that function collectively and can be affected by tumor-induced changes [21]. Immune evasion can happen through central tolerance, where effector tumor-reactive cells are either eliminated or transformed into regulatory phenotypes [22]. Immune evasion by peripheral tolerance occurs when effector cells are neutralized in the periphery [23]. Defects in antigen processing and presentation that result in reduction or loss of human leukocyte antigen (HLA) or transport and accessory molecules (e.g. TAP, β2m, LMP2 or others) lead to the loss of recognition of targets on tumor cells [24, 25]. Interferon (IFN-α, IFN-β or IFNγ) insensitivity of tumor cells can arise by mutation or silencing of genes that encode components of interferon signaling. For example, mutations in the components of the IFNγ receptor signaling network, which includes the IFNGR1, IFNGR2, JAK1, JAK2 and STAT1 genes or their partners, may result in aberrant signaling and IFNγ insensitivity. Tumor cells that have such mutations fail to assemble the intracellular machinery that facilitates antigen processing and presentation [26]. Additional mechanisms include neutralization of NK cells through loss of ligands for NKG2D, an NK cell effector molecule [27], impairment of dendritic cell (DC) maturation [28], or upregulation of anti-apoptotic pathways [29].
Specific mechanisms may involve impeded maturation of antigen-presenting cells (APCs) and tumor-infiltrating lymphocytes (TILs) through secretion of various metabolic compounds [30] and cytokines such as TGF-β [31], reduced infiltration of immune cells through blood vessels into the tumor [32], activation of Toll-like receptors [33], as well as exhaustion of the immune response through antagonism of immune cell activity [34]. TILs withstanding these immunosuppressive barricades will further prompt immune-evasive maneuvers by tumor cells, including down-regulation of HLA expression, impairment of antigen processing/presentation [35], and eventually even loss of immunogenic epitopes due to immunoediting [36]. Thus, the design of an effective immunotherapy will depend greatly on the ability to identify or create an immune-permissive environment susceptible to immune responses targeting tumor cells. Eliciting lasting and significant immune responses will further depend on how potential immune targets are administered to patients. Specifically, immunotherapy targets should be combined with a suitable adjuvant and used to load or pulse ex vivo activated immune cells. A main challenge is whether in situ expansion of these cells after re-administration can be achieved [37]. Stringent criteria have to be applied in the selection of immune response targets. Ideally, the design strategies need to identify antigens that are specifically and sufficiently expressed in tumor cells to ensure detection by immune cells, and at the same time not expressed by normal cells, to avoid autoimmunity. In the case of T cell-based therapies, these antigens need to be processed and presented properly by cancer cells and also be sufficiently immunogenic when presented by HLA molecules to elicit significant activation of immune cells [38]. In the context of immunoediting, it has been proposed that a panel of several carefully selected antigens will be superior to a single target, especially if these antigens are overexpressed constituents of pathways that are essential for cancer growth [39]. Finally, when these three major challenges are taken into account, administration of an immunotherapy should be properly combined with conventional therapeutic treatment regimens to achieve potential synergistic effects and avoid immunosuppression [39]. In this review we will focus primarily on current limitations of immunotherapy target selection and design, and describe the use of bioinformatics for T cell target discovery.
1.2 Selecting the appropriate targets for immunotherapy

In the classic mode of CD8+ T cell activation, intracellular proteins are processed in the cytoplasm and cleaved by the proteasome into small peptides [40], which are then delivered into the endoplasmic reticulum by transporter associated with antigen processing (TAP) proteins [41], where they bind major histocompatibility complex (MHC) class I molecules and are subsequently presented on the cell surface as T cell epitopes. The T cell epitopes on the surface of target cells are screened and recognized by CD8+ cytotoxic T lymphocytes (CTLs). Those target cells that are recognized as foreign due to malignant transformations are killed by the cognate CD8+ cells [42]. Initially, the CD8+ T cells are in their naïve state. After recognition of a peptide-MHC complex, they are activated to their effector CTL state and proliferate to clear cells presenting the peptide-MHC complex on their surface. An additional mode of activation, referred to as cross-presentation, involves recognition of antigens from the extracellular environment and subsequent stimulation of both CD4+ and CD8+ T cell responses [43]. The APCs, such as DCs, internalize extracellular proteins from nascent tissues. The internalized proteins are then processed, either in the traditional fashion by the proteasome, or endosomally by cathepsin S protease. Subsequently, CTLs are recruited to kill cells presenting the target antigen in a process termed cross-priming [44, 45].

The adaptive immune system responds to tumor cells by recognition of either tumor-associated antigens (TAAs, antigens that are overexpressed in cancer cells) [46, 47] or tumor-specific antigens (TSAs, antigens that are not expressed in most normal tissues) [18]. The therapeutic potential of a tumor antigen depends on an array of factors. In an effort to define the characteristics of the ideal cancer antigen, researchers at the American National Cancer Institute proposed to rank tumor antigens by therapeutic function, immunogenicity, oncogenicity, specificity, expression level and percentage of positive cells, stem cell expression, the number of patients with antigen-positive cancers, the number of validated epitopes in the antigen, and cellular location of expression [48].
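
A ranking over criteria such as those listed above can be sketched as a weighted sum. This is only an illustrative sketch: the weights, the 0 to 1 scores, the subset of criteria, and the antigen names are hypothetical placeholders and are not the values or the procedure used in [48].

```python
from dataclasses import dataclass

# Hypothetical weights (summing to 1.0) for a subset of the criteria named
# in the text; the weighting actually used in [48] is not reproduced here.
WEIGHTS = {
    "therapeutic_function": 0.30,
    "immunogenicity": 0.20,
    "oncogenicity": 0.15,
    "specificity": 0.15,
    "expression_level": 0.10,
    "validated_epitopes": 0.10,
}


@dataclass
class Antigen:
    name: str
    scores: dict  # criterion name -> score in [0, 1]


def weighted_score(antigen: Antigen) -> float:
    """Sum of weight * score over all criteria; missing criteria count as 0."""
    return sum(w * antigen.scores.get(c, 0.0) for c, w in WEIGHTS.items())


def rank_antigens(antigens):
    """Return antigens sorted from highest to lowest weighted score."""
    return sorted(antigens, key=weighted_score, reverse=True)


if __name__ == "__main__":
    # Made-up antigens and scores, for illustration only.
    candidates = [
        Antigen("antigen_A", {"therapeutic_function": 0.9, "immunogenicity": 0.5}),
        Antigen("antigen_B", {"specificity": 1.0, "immunogenicity": 1.0}),
    ]
    for antigen in rank_antigens(candidates):
        print(f"{antigen.name}\t{weighted_score(antigen):.2f}")
```

A weighted sum keeps the contribution of each criterion explicit, so the effect of changing a weight on the final ordering is easy to inspect.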

1.2.1 Tumor-associated antigens

TAAs are not exclusive to tumor cells. They can, however, in certain instances still elicit a tumor-specific response. TAAs can be divided into two subgroups: differentiation antigens and overexpressed antigens. Common to both types is the inherent risk that eliciting a sufficiently strong T cell response may induce systemic autoimmunity against healthy cells carrying the antigen. A number of overexpressed antigens have been characterized; they are encoded by an array of genes involved in regulating tumor growth, replication, and apoptosis, which can severely affect the health of the cell if dysregulated (TANTIGEN: Tumor T-cell Antigen Database, cvc.dfci.harvard.edu/tadb/). As described in detail by Hanahan and Weinberg [6], a number of molecular events lead to the general cancer traits, namely sustained proliferation, evasion of growth suppression, resistance to cell death, unlimited replicative potential, angiogenesis, local tissue invasion and metastasis, deregulated cell energetics, genome instability and mutation, avoidance of destruction by the immune system, and tumor-promoting inflammation. Proteins or protein patterns consistently responsible for the hallmarks of cancer cells represent ideally suited target structures for therapeutic intervention, including immunotherapy targeting. Importantly, these traits are essential characteristics of all life-threatening cancers, and therapies based on molecular targeting of these characteristics are therefore broadly applicable to most if not all cancers. Thus, this group of antigens is especially interesting as they are characterized in a conceptual framework in which the biology, microenvironment, and conventional treatment of cancer have been taken into consideration. Several proteins responsible for or associated with these cancer traits have been characterized, e.g., cell division (telomerase, survivin), resistance to apoptosis (survivin, ML-IAP, Bcl-2, Bcl-X(L), and Mcl-1), tumor development (Cyp1B1, BRAF), metastatic potential (heparanase, RhoC), and angiogenesis (survivin, Bcl-2, VEGFR) (reviewed in [6]). Importantly, the mentioned proteins have all been characterized as targets for specific T cells. Thus, these proteins represent broadly applicable targets in therapeutic vaccinations against cancer. Moreover, although vaccination against these proteins or groups of proteins is in itself a promising new approach to fight cancer, the combination with additional therapy modalities could create a number of synergistic effects. Precautionary measures should, however, be taken to avoid undesirable immune responses against TAAs expressed by healthy cells. It was recently discovered that in some cases, a genetically modified high avidity TCR recognizing a survivin epitope also reacted against healthy lymphocytes expressing the protein [49]. Similarly, adoptive T cell therapy against HER-2/neu carries a risk of inducing severe toxicity [50].
Certain melanocyte differentiation antigens, such as tyrosinase and tyrosinase-related proteins 1 (gp75) and 2, have been shown to elicit a CTL response against melanoma cells [51–53]. Although these targets are also expressed by mature melanocytes (albeit in smaller amounts than in melanomas), no spontaneous immune responses to these antigens have been observed in healthy subjects [54]. However, due to the presence of differentiation antigens in healthy melanocytes, exaggerated immune activation carries the inherent risk of autoimmunity or toxicity [55]. Melan-A/MART1 [56] and gp100 [57] are also being investigated as possible targets.

1.2.2 Tumor-specific antigens

Mutations in tumor cell genes can lead to changes in the primary or secondary protein structure that may affect immunogenicity of antigens [54]. Specifically, sequence changes in short peptides can change peptide binding affinity for HLA, and thus change subsequent responses by T cells [58]. Similarly, mutations are likely to induce changes in secondary structure that can affect T cell recognition to some extent, but primarily change the affinity and avidity of circulating antibodies for the target [59]. The main drawback of antigens resulting from mutations is that they are mostly patient-specific and therefore not applicable for broadly neutralizing therapies. However, they are likely to be truly tumor-specific and are often found in driver genes, thus making them less susceptible to immunoediting. The first antigen discovered to result from a point mutation, in the cyclin-dependent kinase CDK4, was shown to elicit a T cell response against melanoma cells and metastatic tissue [60]. Since then, a number of targets have been identified, including the GTPase N-ras [61], the kinase proto-oncogene B-raf [62, 63], and the tyrosine kinase BCR-ABL [64], famed for its involvement in the Philadelphia chromosome translocation [65]. Hypomethylation of cancer-germline (CG) gene promoters can lead to overexpression of TSAs encoded by certain CG genes. Germline genes are expressed only in male germ cells and trophoblastic cells, neither of which expresses HLA molecules on the surface. Thus, CG antigens are tumor-specific and have been explored as potential immunotherapy targets in a number of tumor types [54].

2 In silico screening of immunotherapy targets

Identification and selection of antigens is a multifaceted task that depends both on the type and the application of antigens. In 2000, Rino Rappuoli formalized the role of computational analyses in vaccinology in a conceptual framework termed "Reverse Vaccinology" [66]. Originally formulated to facilitate vaccine target discovery in pathogens, the concepts of reverse vaccinology can be expanded for applications in cancer immunology. Reverse vaccinology revolves around sequence analysis, whereby the genomic sequence is used to catalogue all potential molecular antigens. In simple viral and prokaryotic pathogens, essentially all protein products are potential antigens, whereas the majority of tumor tissue proteins are not aberrant, which renders them poor therapy targets. Therefore, cataloguing antigens in tumor cells requires additional pre-screening. Once the catalogue of potential antigens has been established, the reverse vaccinology pipeline calls for in silico prediction of vaccine targets, a process that is characterized as being completely naïve, in that targets predicted from sequence, such as predicted HLA binders, are not characterized in terms of intracellular pre-processing, conservation, or in vivo expression. Additional computational and experimental prescreening of epitope candidates must therefore be performed before they can be included as therapy targets. Potential T cell epitopes must be examined in terms of pre-processing by the proteasome, transport by TAP, HLA binding, stability of the peptide-MHC complex [67], peptide-MHC binding to the T cell receptor (TCR) [68], and in vivo expression. Similarly, potential T cell epitopes should be examined for stability of their expression in tumor cells to ensure lasting immunological control or clearance of targeted tumor cells. However, the heterogeneous nature of cancer means that no given cancer type has a uniform molecular profile, and several cancers are subclassified into a number of characterized or uncharacterized classes with varying prognoses and therapy outcomes [69–72]. Adding further to this complexity, evidence of intra-tumor heterogeneity is beginning to surface for certain tumors [73, 74]. Additionally, given that immunoediting of tumor antigens is based on somatic mutations, it is extremely difficult to predict the antigenic phenotype after clonal selection. It is therefore often observed that tumors develop tolerance to immunotherapy after a limited period of successful treatment [75, 76]. In the following sections we present several examples of computational workflows for antigen cataloguing and immunotherapy target discovery, with examples of computational tools for each task. It is beyond the scope of this review to catalogue all existing tools, and in the interest of brevity, we here focus on selection and combination of T cell antigens for personalized and general vaccine constructs.

2.1 Cataloguing potential antigens

Establishing a catalogue of potential tumor antigens is a non-trivial task.
Identification of potential antigens de novo from genomic sequence using bioinformatics tools is highly challenging, as expression of proteins is regulated by an array of complex regulatory mechanisms, many of which are poorly understood. Traditionally, tumor antigens are identified in vitro from serum by screening cDNA phage libraries using immunoassays [77, 78] or proteomics-based screening [79]. Bioinformatics tools are perfectly suited to aid this process, either by actively identifying novel tumor antigens or by organizing information about known tumor antigens in accessible databases.

2.1.1 In silico screening for novel tumor antigens

Large-scale screening of mRNA expression data from public databases is a proposed method for active identification of novel tumor antigens [80, 81]. Specifically, comparing expression profiles of tumors and healthy tissue can reveal genes that are over-expressed or expressed exclusively in malignant tumors. Databases such as the Gene Expression Omnibus [82] contain MIAME standard compliant [83] expression data. It is important to note that a gene must be translated into protein to be useful for antigen-based cancer immunotherapies, but mRNA expression profiles can serve as useful pre-screens, subject to additional experimental validation, including proteomics verification, since the correlation between RNA expression and protein expression varies between different proteins [84].
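
The comparison of tumor and healthy expression profiles described above can be illustrated with a log2 fold-change screen. This is a minimal sketch under stated assumptions: the gene names and values are invented, the pseudocount and both thresholds are arbitrary, and a single tumor value is compared with a single normal value, whereas a real analysis would use replicates and a statistical test.

```python
import math


def log2_fold_change(tumor: float, normal: float, pseudocount: float = 1.0) -> float:
    """log2((tumor + pseudocount) / (normal + pseudocount)).

    The pseudocount avoids division by zero for genes absent in normal tissue.
    """
    return math.log2((tumor + pseudocount) / (normal + pseudocount))


def screen_candidates(tumor_expr, normal_expr, min_lfc=2.0, max_normal=1.0):
    """Split genes into over-expressed and tumor-exclusive candidates.

    tumor_expr / normal_expr: dict mapping gene name -> expression value.
    min_lfc and max_normal are illustrative thresholds, not recommendations.
    """
    overexpressed, exclusive = [], []
    for gene, tumor_value in tumor_expr.items():
        normal_value = normal_expr.get(gene, 0.0)
        if log2_fold_change(tumor_value, normal_value) < min_lfc:
            continue  # not sufficiently higher in tumor
        if normal_value <= max_normal:
            exclusive.append(gene)      # essentially absent in healthy tissue
        else:
            overexpressed.append(gene)  # present in both, higher in tumor
    return sorted(overexpressed), sorted(exclusive)


if __name__ == "__main__":
    # Invented expression values, for illustration only.
    tumor = {"GENE_A": 63.0, "GENE_B": 10.0, "GENE_C": 31.0}
    normal = {"GENE_A": 7.0, "GENE_B": 9.0, "GENE_C": 0.0}
    print(screen_candidates(tumor, normal))
```

Genes passing such a screen remain candidates only; as noted above, expression at the protein level still has to be verified.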

2.1.2 Cross-referencing expression data with known tumor antigens

A large number of studies presenting potential tumor antigens are published each year. Cross-referencing tumor gene expression and protein expression profiles with previous experimental efforts enables fast cataloguing of potential targets. Data resources for tumor antigens include the Cancer Immunity peptide database of T cell-defined tumor antigens [85], a static listing of tumor T cell antigens provided by Parmiani and colleagues [86], CTdatabase of cancer-testis antigens [87], and the TANTIGEN database of T cell tumor antigens (http://cvc.dfci.harvard.edu/tadb/index.html). Genes or proteins previously identified as TSAs (and expressed in a target sample), or identified as TAAs (and over-expressed compared to normal tissue from the same patient), are subject to further investigation as potential immunotherapy targets if they are expressed at appropriate levels in a given tumor sample. High-throughput genomics methods have enabled large-scale screening of gene expression. These include nucleotide microarray technologies [88] and next-generation RNA sequencing [89]. However, for an antigen to be suitable for immunotherapy, it must be expressed at the protein level as well. Large-scale proteome studies use technologies such as protein microarrays [90], antibody microarrays [91] and mass spectrometry-based proteomics [92]. Recent studies of the proteome of breast cancer have revealed molecular features of tumorigenesis [93], and proteomics studies are gradually approaching a scale where whole proteome screening is feasible [94].

2.2 Identification of potential T cell epitopes

T cell epitopes are short peptide fragments of 8-12 and 13-25 amino acids in length for HLA class I and II, respectively [95, 96]. Proteins are intracellularly processed by proteasomal cleavage in the cytosol, after which the resulting peptides are transported to the ER, where they bind HLA. The peptide-HLA (pMHC) complex is then relocated to the cell surface, where it is recognized by the receptors of circulating T cells [97]. Each of these cellular processes is a limiting step in classical T cell-mediated immunity, and the prediction of immunogenicity is therefore a non-trivial task.
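
Because class I epitopes span 8-12 residues, a first computational step is often to enumerate every peptide of those lengths from an antigen sequence before any prediction is applied. The sketch below does this with a sliding window; the function name and the toy sequence are illustrative assumptions, and the length bounds can be changed (for example to 13 and 25) for class II.

```python
def enumerate_peptides(protein: str, min_len: int = 8, max_len: int = 12):
    """Yield (start, peptide) for every window of min_len..max_len residues.

    start is the 0-based offset of the peptide in the protein sequence.
    Sequences shorter than min_len yield nothing.
    """
    protein = protein.strip().upper()
    for length in range(min_len, max_len + 1):
        for start in range(len(protein) - length + 1):
            yield start, protein[start:start + length]


if __name__ == "__main__":
    # Toy 12-residue sequence, not a real antigen.
    peptides = list(enumerate_peptides("ACDEFGHIKLMN"))
    print(len(peptides), peptides[0])
```

For a protein of length L, each peptide length k contributes L - k + 1 windows, so the candidate list grows roughly linearly with protein length and must be narrowed by the prediction steps that follow.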

2.2.1 T cell epitope prediction

Prediction of peptide processing events, such as proteasomal cleavage [98–100] and TAP transport [101–103], has been explored, but evaluations suggest that these methods are still not optimal [104]. Algorithms for predicting HLA binding affinity are superior in accuracy, and highly accurate for a number of HLA alleles [105]. Prediction of peptide binding affinity to HLA class I and class II can be performed with a host of prediction algorithms (reviewed in [106, 107]). The overall best performing predictor for HLA class I binding is the artificial neural network (ANN) and weight matrix-based prediction tool netMHC 3.2 [108], and the best for class II is the ANN predictor netMHCII 2.2 [109]. Other highly accurate classification algorithms include BIMAS [110], SYFPEITHI [111], novel ensemble methods PM and AvgTanh [112], and various averaging methods [113]. Pre-processing and high affinity binding to HLA are not the only prerequisites for a peptide to be immunogenic. A number of known HLA binders have been shown to be unable to elicit an immune response, a phenomenon referred to as holes in the T cell repertoire [114]. Recently, it has been shown that stability of pMHC is a better predictor for immunogenicity of a peptide than the affinity of the binding, as immunogenic peptides are generally more stably bound to MHC [67]. Similarly, assessment of peptide-MHC complex binding to the TCR has been explored as a predictor of immunogenicity [115]. However, at present, only 21 crystal structures of peptide-MHC-TCR complexes have been solved, which is not a sufficient basis for training a generally applicable classifier. Other approaches to evaluating immunogenicity include prediction of T cell reactivity based on an array of physicochemical properties [116, 117].
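
Binding affinity predictions are typically reported as IC50 values in nM, often rescaled to a 0 to 1 score. The sketch below shows one widely used log transformation, 1 - log(IC50)/log(50000), together with a simple binder classification. The transformation and the 50 nM and 500 nM cutoffs are common conventions that are assumed here rather than taken from the text, and no prediction tool is actually called.

```python
import math

MAX_IC50_NM = 50000.0  # assumed upper bound of the affinity scale, in nM


def ic50_to_score(ic50_nm: float) -> float:
    """Map an IC50 in nM (must be > 0) to a score clipped to [0, 1].

    score = 1 - log(IC50) / log(50000); higher means stronger binding.
    """
    score = 1.0 - math.log(ic50_nm) / math.log(MAX_IC50_NM)
    return min(1.0, max(0.0, score))


def score_to_ic50(score: float) -> float:
    """Inverse of ic50_to_score for scores in [0, 1]."""
    return MAX_IC50_NM ** (1.0 - score)


def classify_binder(ic50_nm: float, strong_nm: float = 50.0, weak_nm: float = 500.0) -> str:
    """Label a peptide by predicted affinity using conventional cutoffs."""
    if ic50_nm <= strong_nm:
        return "strong"
    if ic50_nm <= weak_nm:
        return "weak"
    return "non-binder"


if __name__ == "__main__":
    for ic50 in (20.0, 300.0, 5000.0):
        print(ic50, round(ic50_to_score(ic50), 3), classify_binder(ic50))
```

As the section notes, a favorable affinity is necessary but not sufficient for immunogenicity, so such labels are best treated as a filter rather than a final ranking.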

2.2.2 Selection of candidate epitopes

An important challenge in cancer immunotherapy is to design treatments in such a way that they efficiently target tumor cells without causing autoimmunity or toxicity [118]. It is therefore important to target epitopes that are located in antigens exclusively expressed or overexpressed in tumor cells [18]. These epitopes can be identified by predicting HLA binders from the catalogue of TAAs or by comparing protein expression in healthy tissue with that in tumor tissue for a given individual. Experimental validation of predicted epitope candidates is important. Since peptide pre-processing is still unknown for predicted binders or stable pMHC complexes, appropriate peptide processing and in vivo binding should be confirmed experimentally before epitopes are included in vaccine constructs. Large-scale T cell epitope validation is enabled by mass-spectrometry [119] and flow cytometry-based methods [120].
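
The selection logic in this section (restrict to the patient's HLA alleles, keep predicted binders, require tumor-biased expression, and prefer experimentally confirmed epitopes) can be sketched as a simple filter. Everything in this sketch is a hypothetical illustration: the record fields, the thresholds, and the placeholder peptides are assumptions, and the predicted affinities would come from whichever prediction tool is used.

```python
from dataclasses import dataclass


@dataclass
class EpitopeCandidate:
    peptide: str
    hla_allele: str
    predicted_ic50_nm: float   # output of any HLA binding predictor
    tumor_expression: float    # expression of the source antigen in tumor
    normal_expression: float   # expression of the source antigen in healthy tissue
    experimentally_validated: bool = False


def select_candidates(candidates, patient_alleles, max_ic50_nm=500.0, min_ratio=4.0):
    """Filter candidates and order them for follow-up.

    A candidate is kept if it is restricted by one of the patient's alleles,
    is a predicted binder, and its source antigen is expressed in tumor at
    least min_ratio times higher than in healthy tissue (or is absent there).
    Validated epitopes are listed first, then stronger predicted binders.
    """
    selected = []
    for c in candidates:
        if c.hla_allele not in patient_alleles:
            continue
        if c.predicted_ic50_nm > max_ic50_nm:
            continue
        if c.tumor_expression <= 0:
            continue
        if c.normal_expression > 0 and c.tumor_expression / c.normal_expression < min_ratio:
            continue
        selected.append(c)
    return sorted(selected, key=lambda c: (not c.experimentally_validated, c.predicted_ic50_nm))


if __name__ == "__main__":
    # Placeholder peptides and values, for illustration only.
    pool = [
        EpitopeCandidate("AAAAAAAAA", "HLA-A*02:01", 120.0, 40.0, 2.0),
        EpitopeCandidate("CCCCCCCCC", "HLA-A*02:01", 300.0, 40.0, 0.0, True),
        EpitopeCandidate("DDDDDDDDD", "HLA-B*07:02", 20.0, 40.0, 0.0),
    ]
    for c in select_candidates(pool, {"HLA-A*02:01"}):
        print(c.peptide, c.hla_allele, c.predicted_ic50_nm)
```

Placing validated epitopes first reflects the point made above that predicted binders still require experimental confirmation of processing and presentation.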

2.2.3 Human immune system diversity

The host HLA profile must be identified before predicting potential targets. HLAs are amongst the most polymorphic molecules in the human genome, and represent the most variable factor of human immune recognition. The HLA region comprises more than 200 genes located on chromosome 6, and three different HLA classes are defined [121]. Only class I and class II are involved in adaptive immunity, and thus are the main focus of this review. For each class, several major and minor proteins are defined, which are in turn classified into supertypes [122, 123], and 9310 individual alleles are reported in Release 3.12.0 (April 17, 2013) of the IMGT/HLA database [124]. Specificity of HLA molecules is instrumental in determining resistance and susceptibility to invading pathogens and cancers. For example, HIV-positive individuals with HLA-B57 and HLA-B27 supertypes (known as HIV elite controllers) show much slower progression to AIDS than individuals with other HLA supertypes [125]. Owing to the heritability of HLA loci, specific alleles are often geographically clustered, meaning that some populations are more susceptible to, for example, EBV-related cancers [126]. T cell-mediated immunosurveillance of cancerous cells involves HLA restriction, which further complicates formulation of T cell-based therapies. Even if we ignore the variability of tumor antigens, the diversity of human immune response to T cell epitopes renders the identification of broadly applicable T cell-based immunotherapy targets highly challenging, and increases the search space of useful T cell targets in personalized therapies immensely.

2.3 Ensuring lasting immunity

Conservation and variability analysis for immunotherapy target selection is a multidimensional problem. If one aims to define targets for general immunotherapies applicable to a broad cohort of patients, antigen diversity must be studied across the patient population. It is a complex problem, and when targeting a disease of vast heterogeneity the complexity of the studied system is further increased. Due to high variability, even personalized vaccine targets are likely to be unstable over time, and the somatic process driving the selection is difficult to predict. There are, however, strategies for inducing a lasting immune response against tumors.

2.3.1 Selection of stable epitopes

Comparative analysis of gene and protein sequences can reveal de novo SNPs and other somatic mutations. Peptides found in mutated protein regions unique to the tumor are candidates for epitope prediction. However, peptides found in highly variable regions are likely subject to frequent mutations, which can potentially lead to loss of antigen immunogenicity. Owing to the heterogeneous nature of cancer and complex processes such as immunoediting, the landscape of tumor mutation is far from fully understood. However, potential epitopes can be analyzed for stability in the context of known mutations catalogued in databases such as COSMIC [127]. Similarly, splice variations of proteins and structural variations in the genome can influence the stability of a given epitope. This can be examined using databases such as DECIPHER [128] for chromosomal variations and UniProt [129] for protein isoforms. Identifying regions of limited variability and high stability, and choosing potential epitopes in these regions, may increase the likelihood of a sustained immune response. Variability also depends on the selection pressure exerted by the immune system during therapy, but regions of known high variability can be excluded.
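
Excluding epitopes that overlap positions of known variability can be sketched as an interval check. In this illustration the mutated positions are supplied as a plain set of 1-based residue numbers; in practice they might be derived from a resource such as COSMIC, but the input format, function names, and toy data here are assumptions, and no database is queried.

```python
def overlaps_mutation(start: int, length: int, mutated_positions) -> bool:
    """True if any known mutated position falls inside the epitope.

    start is the 1-based position of the first epitope residue in the protein.
    """
    end = start + length - 1
    return any(start <= pos <= end for pos in mutated_positions)


def stable_epitopes(epitopes, mutated_positions):
    """Keep (start, peptide) pairs that contain no known variable position."""
    return [
        (start, peptide)
        for start, peptide in epitopes
        if not overlaps_mutation(start, len(peptide), mutated_positions)
    ]


if __name__ == "__main__":
    # Toy epitopes and mutation positions, for illustration only.
    epitopes = [(1, "ACDEFGHIK"), (10, "LMNPQRSTV"), (20, "WYACDEFGH")]
    known_mutations = {12, 40}
    print(stable_epitopes(epitopes, known_mutations))
```

The same check could be inverted to collect tumor-unique mutated peptides, which the section identifies as candidates for epitope prediction in their own right.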

2.3.2 Multi-epitope strategies

An antigen lacking stable epitopes conferring protective capacity is not necessarily excluded as a possibility for immunotherapy. Treatments can be composed of multiple epitopes from multiple antigens [130]. If immunogenicity of one epitope is lost, the remaining set of targets can confer protection. In a multi-epitope setting, unstable regions of tumor antigens can be of value, even if just for a limited time. Additional analyses of metabolic pathways of tumor cells may reveal potential targets in multiple antigens complementary to each other, which collectively can afford protection. Theoretically, co-targeting multiple proteins in pathways essential for tumor fitness should increase the probability of a sustained response [131]. The network analysis of signaling pathways and the perturbations by oncogenes was recently shown to successfully identify oncogenic targets. A sequential application of anticancer drugs increased the collective efficiency of the drugs targeting oncogenic signaling pathways [132]. The sequential administration of immunotherapies targeting different antigens may also be advantageous. Multi-epitope approaches carry the inherent risk of raising a dominant response against one, or a few, of the administered epitopes. However, this can, in theory, be avoided by multiple site vaccinations [133]. Additionally, it has been shown that immunotherapies sometimes facilitate immune responses against additional antigens, not included as targets in the therapy, by a process referred to as "epitope spreading" [134] or "provoked immunity" [135].

3 Examples of T cell target selection

Knowledge discovery by mining biological data is becoming increasingly relevant as high-throughput methodologies facilitate generation of data at rapidly increasing rates. Small-scale functional studies remain instrumental in uncovering the mechanisms of complex diseases such as cancer, but large-scale analyses hold extraordinary potential for extraction of higher-level knowledge for clinical applications. Biological databases are increasing both in number and complexity, and so too are the computational tools for analyses of these data. To explore the full potential of data and tools, it is necessary to streamline analyses of newly generated data with application of relevant tools and cross-referencing of existing data. Tools may include machine-learning algorithms to acquire new knowledge, mathematical modeling to capture relevant system properties, inference mechanisms to enhance reasoning and produce explanations, and visualization schemes to summarize large datasets and highlight salient information. It is essential that tools are applied rationally and in a defined order, and it is therefore necessary to streamline the application of tools in analysis pipelines or workflows [136, 137]. Streamlining principles, methods, tools, and data in workflows can increase the breadth, depth, and speed of analysis and thereby the knowledge output. For discovery of cancer T cell immunotherapy targets, we consider the high-throughput methods, analytical tools, and databases, and formalize these components into analytical workflows.

3.1 Selection of potential antigens

For personalized immunotherapies, it is advantageous to have tumor tissue biopsies as well as healthy tissue samples, peripheral blood, or

serum to perform genomics or proteomics analyses. Gene expression can be characterized using RNA sequencing [138] or microarray-based methods [139], and protein expression can be characterized using, for example, ultratrace proteomic analysis [140]. If analyses are performed on both healthy and tumor tissue, these can be compared to elucidate potential targets that are overexpressed, or expressed exclusively, in tumor tissue. Depending on the measurement method, different tools can be used for statistical analysis. A multitude of specialized software packages exist for the R software environment, which is most commonly used. Additionally, an expression profile database, such as UniGene or The Human Protein Atlas, can be consulted to ensure that genes or proteins of interest are not overexpressed in healthy tissues. Proteins expressed in tumor tissue can be filtered by whether they are known tumor antigens by cross-referencing with tumor antigen databases such as TANTIGEN. For a given patient or cohort of patients, protein expression in different tissues can be measured. Cataloguing antigens should include analysis of differential expression between healthy tissue and cancerous tissue. A number of differentially expressed proteins are normally identified by such analyses. Proteins overexpressed in tumor tissue are potential tumor antigens, but not all of them are useful. By cross-referencing with known tumor antigens from specialized databases, such as TANTIGEN, protein expression data can be filtered to encompass only proteins previously shown to be potential antigens. As an example of information collection and antigen assessment, we analyzed three proteins: HER2, survivin, and IDO.

3.2 HER2

Information about ERBB2, also known as HER2, relevant to assessing its role as a tumor antigen can be located and extracted from a number of different biological databases. Table 1 lists information relevant to assessing the suitability of HER2 as an antigen in a number of different cancers. HER2 is an epidermal growth factor receptor that is amplified in about 20-40% of invasive breast cancers [141]. Whereas normal tissue generally has low expression of HER2, breast cancer cells can have up to 50 copies of the encoding ERBB2 gene and up to 100-fold increased protein expression [142], with strong correlation to a poor clinical outcome. These properties make HER2 a good marker for tumor tissue. HER2 has four isoforms produced by alternative splicing and alternative initiation. The isoforms overlap in identity by slightly less than half of the protein sequences, and a number of somatic mutations detectable at the protein level have been characterized in HER2. Since HER2 is present on the cell surface in large numbers, it

Table 1: Annotation of HER2 for assessing potential as tumor antigen.

Information type | Information | Source (ID/accession)
Protein name | HER2 | GeneCards (GC17P037844)
Gene name | ERBB2 | GeneCards (GC17P037844)
Full name | Human Epidermal Growth Factor Receptor 2 | GeneCards (GC17P037844)
Synonyms | CD340, HER-2, HER2, NEU, NGL | GeneCards (GC17P037844)
Function | HER2 is a member of the epidermal growth factor (EGF) receptor family of receptor tyrosine kinases. HER2 cannot bind growth factors itself, since it lacks a ligand binding domain. Rather, it binds to other EGF receptor:ligand complexes, with which it forms a heterodimer, thus stabilizing binding and enhancing activation of downstream signaling pathways. | OMIM (164870)
Role in disease | The ERBB2 gene is amplified and HER2 is overexpressed in 25 to 30% of breast cancers, correlating with increased aggressiveness of the tumor. | OMIM (164870)
Localization | Cell surface, plasma membrane bound. | OMIM (164870)
Isoforms | Isoform 1: Canonical sequence; Isoform 2: 1-610: Missing; Isoform 3: 1-686: Missing; Isoform 4: 1-23: MELAALCRWGLLLALLPPGAAST → MPRGSWKP | UniProt (P04626)
Mutations | Pos: 20, A→T; Pos: 49, L→H; Pos: 49, L→P; Pos: 92, R→G; Pos: 101, I→S | COSMIC (COSG165)
Gene expression, normal tissue | Low expression in most healthy tissue, moderate expression in intestine and mammary glands. | UniGene (241389)
Gene expression, tumor tissue | Overexpressed in 25% to 30% mammary gland tumors (strong correlation with clinical outcome) and in approximately 15% colorectal tumors (no correlation with clinical outcome). | UniGene (241389)
Protein expression, normal tissue | No to low expression in hematopoietic, digestive, and respiratory tissue. No to medium expression in female tissues, placenta, male tissues, and urinary tract tissue. | The Human Protein Atlas
Protein expression, tumor tissue | Strong expression observed in some breast cancer, colorectal cancer, glioma, head and neck cancer, ovarian cancer, and urothelial cancer | The Human Protein Atlas
T cell epitopes | 50 reported CD8+ T cell epitopes; 13 reported HLA class I ligands | TANTIGEN (Ag000001)

is suitable for targeting by both cellular and humoral immunity, and a number of both T cell and B cell HER2 epitopes have been identified [143, 144]. The monoclonal antibody trastuzumab (Herceptin) is a commonly used treatment in HER2-positive breast cancer [145].

3.2.1 Identifying potential T cell epitopes

Once potential antigens are catalogued, prediction of T cell binders can be performed. Ideally, the tumor tissue exome should be sequenced and potential epitopes predicted from the translated sequence. If the tumor tissue sequence is not available, canonical sequences can be extracted from protein sequence databases such as UniProt. Before predicting binders, patient HLA types should be identified. This can be done by DNA sequencing [146], RNA sequencing [147], or microarray-based approaches [148]. Then, class I binders to patient HLA alleles can be predicted using, for example, netMHC [149], and class II binders can be predicted using, for example, netMHCII [150]. Predicted binders can be cross-referenced for experimental validation with a tumor antigen database such as TANTIGEN. Predicting HLA binders for 9-meric peptides using netMHC 3.4 yields potential binders to a number of HLA alleles. A closer look at HLA A*02:01 reveals 52 predicted binders, of which one is an experimentally validated binder, found by cross-referencing with TANTIGEN. Some candidate binders are conserved across all isoforms and mutated forms of HER2, while others are found only in some isoforms. Table 2 shows peptides binding HLA A*02:01 that are either conserved in all isoforms or located at positions where all variant peptides bind HLA A*02:01 (an analysis known as block conservation [151]).
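The peptide enumeration that precedes binding prediction can be sketched as follows. The sketch assumes that predicted affinities (in nM) are obtained separately from a predictor such as netMHC; it does not call netMHC itself. The function names are hypothetical. The 50 nM and 500 nM cut-offs are those stated in the caption of Table 2, and the example sequence is the canonical N-terminal segment (residues 1-23) listed for isoform 4 in Table 1.

```python
from typing import Iterator, Tuple


def nine_mers(sequence: str, k: int = 9) -> Iterator[Tuple[int, str]]:
    """Yield (1-based start position, peptide) for every overlapping k-mer."""
    for i in range(len(sequence) - k + 1):
        yield i + 1, sequence[i:i + k]


def classify_binder(affinity_nm: float,
                    strong_nm: float = 50.0,
                    weak_nm: float = 500.0) -> str:
    """Label a predicted affinity using the thresholds from Table 2:
    below 50 nM is a strong binder, below 500 nM a weak binder."""
    if affinity_nm < strong_nm:
        return "strong"
    if affinity_nm < weak_nm:
        return "weak"
    return "non-binder"


# Residues 1-23 of the canonical HER2 sequence, as given in Table 1.
segment = "MELAALCRWGLLLALLPPGAAST"
peptides = list(nine_mers(segment))
print(len(peptides), peptides[0], peptides[-1])

# Affinities would come from a predictor; these two are taken from Table 2.
print(classify_binder(34), classify_binder(359))
```

A sequence of length n gives n - 8 overlapping 9-mers, so the full-length protein produces a candidate list that is then reduced by the affinity thresholds.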

3.2.2 Selecting therapy targets

Once potential HLA binders have been predicted, their stability can be estimated by examining known mutations and isoforms. Databases of mutations in cancers, such as COSMIC, or databases of common isoforms, such as UniProt, may reveal that potential epitopes are located in variable regions, rendering them unstable and potentially unsuitable for immunotherapy. An example of the impact that epitope stability can have on immunotherapy is resistance to trastuzumab, which targets the extracellular domain of HER2. Approximately 12% to 24% of patients respond to single-agent treatment, but all patients regress within six months [152]. Resistance has been attributed to loss of epitope by splice isoforms [153, 154], and a significant increase in survival time is

Table 2: Peptides from HER2 predicted to bind HLA A*02:01. All peptides in this table are either found in all isoforms and mutated types of HER2, or are peptides from a position on which all peptides are predicted to bind HLA A*02:01. Peptides with a predicted binding affinity of less than 50 nM are strong binders, less than 500 nM are weak binders. All predictions done using netMHC 3.4.

Position | Number of variants in given position | Peptide | Predicted HLA A*02:01 binding affinity (nM) | Experimental status | Reference
689 | 1 | RLLQETELV | 34 | T cell epitope | [155]
767 | 4 | ILDEAYVMA | 34 | N/A | -
767 | 4 | ILDEAYAMA | 33 | N/A | -
767 | 4 | ILHEAYVMA | 73 | N/A | -
767 | 4 | MLDEAYVMA | 15 | N/A | -
823 | 2 | LLNWCMQIA | 359 | N/A | -
823 | 2 | LLNWCMQTA | 89 | N/A | -
949 | 1 | TIDVYMIMV | 86 | N/A | -
953 | 1 | YMIMVKCWM | 101 | N/A | -
954 | 1 | MIMVKCWMI | 22 | N/A | -

observed when combining trastuzumab with other treatments, such as endocrine therapy targeting estrogen receptors with, for example, tamoxifen [156]. Theoretically, co-targeting two or more epitopes on the same protein or on different proteins could have the same effect. The predicted binders shown in Table 2 are filtered by conservation in known isoforms and mutated forms. As can be seen, four of the predicted binders are found in all known forms of HER2 (positions 689, 949, 953, and 954), whereas six are found only in some (positions 767 and 823). The latter six peptides are located in potentially unstable regions, but, as observed at position 767, each of the four variants is predicted to bind HLA A*02:01, making them potentially useful in a multi-epitope setting. Note, however, that only 244 uniquely mutated samples of HER2 have been identified (UniProt, May 28, 2013), which gives only a rough estimate of the variability of HER2. Additionally, the frequency of each mutation is unknown, so some mutations may impact a peptide's suitability as an immunotherapy target more than others. Targeting multiple epitopes within the same protein can be valuable to avoid loss of immunogenicity caused by mutations or splice variation. However, targeting a single protein does not address loss of immunogenicity by downregulated protein expression. It can therefore be valuable to target multiple epitopes in different proteins. Combined targeting of several antigens increases therapy flexibility and may increase the magnitude of the response. This approach is especially valuable when targeting proteins of similar or compensatory function in redundant pathways [157].
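The position-wise filtering described above can be expressed as a short sketch. It is a simplified reading of block conservation [151], assuming that every known variant peptide at a position has already been listed with its predicted affinity; the function name is hypothetical, and the data are transcribed from Table 2.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

WEAK_BINDER_NM = 500.0  # weak-binder threshold from the caption of Table 2

# (position, peptide, predicted HLA A*02:01 affinity in nM), from Table 2.
table2 = [
    (689, "RLLQETELV", 34),
    (767, "ILDEAYVMA", 34),
    (767, "ILDEAYAMA", 33),
    (767, "ILHEAYVMA", 73),
    (767, "MLDEAYVMA", 15),
    (823, "LLNWCMQIA", 359),
    (823, "LLNWCMQTA", 89),
    (949, "TIDVYMIMV", 86),
    (953, "YMIMVKCWM", 101),
    (954, "MIMVKCWMI", 22),
]


def block_conserved_positions(
    rows: Iterable[Tuple[int, str, float]],
    threshold_nm: float = WEAK_BINDER_NM,
) -> Dict[int, List[str]]:
    """Return the positions at which every listed variant peptide is a
    predicted binder, mapped to the variant peptides at that position."""
    by_position = defaultdict(list)
    for position, peptide, affinity in rows:
        by_position[position].append((peptide, affinity))
    return {
        position: [peptide for peptide, _ in variants]
        for position, variants in by_position.items()
        if all(affinity < threshold_nm for _, affinity in variants)
    }


blocks = block_conserved_positions(table2)
invariant = sorted(p for p, peps in blocks.items() if len(peps) == 1)
variable = sorted(p for p, peps in blocks.items() if len(peps) > 1)
print("single-variant positions:", invariant)
print("multi-variant positions where all variants bind:", variable)
```

A position is dropped as soon as one of its variants fails the threshold, which reflects the requirement that a multi-epitope target should remain a binder in every known form of the protein.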
An examination of HER2 interactions recorded in STRING (a database of known and predicted protein-protein interactions (PPIs) [158]) with known tumor antigens (from TANTIGEN) reveals a large number of protein neighbors and interactants to HER2, based on recorded co-expression, co-mentioning in the literature, or recorded interactions in specialized databases. The ten tumor antigens with the highest-confidence relationship to HER2 are shown in Figure 1. One of these is EGFR, which has previously been examined as a co-target with HER2 [157, 159]. Cross-referencing TANTIGEN shows that EGFR harbors T cell epitopes for potential immunotherapy targeting. In a similar fashion, functional homologues can be examined as novel targets. Another strategy is to target multiple epitopes in proteins from different interacting pathways. The HER2 and the estrogen receptor (ER) signaling pathways are the dominant drivers of cell proliferation in 85% of breast cancers, which makes antigens of these

pathways desirable therapy targets [160]. Another multi-epitope strategy could therefore involve targeting several antigens in both pathways to avoid therapy resistance if a single epitope is lost. Examining the HER2 and ER pathways in the Kyoto Encyclopedia of Genes and Genomes (KEGG) and the Molecular Signatures Database (MSigDB) reveals multiple potential antigens in each pathway.
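The co-target selection described in this section, in which interaction partners are restricted to known tumor antigens and ranked by confidence, can be sketched as follows. The sketch assumes that the interaction partners and their confidence scores have already been exported from an interaction database and that a set of known tumor antigens is available. The function name, the placeholder partner names, and all scores are hypothetical and are not actual STRING or TANTIGEN values.

```python
from typing import Iterable, List, Set, Tuple


def rank_co_targets(
    interactions: Iterable[Tuple[str, float]],
    known_antigens: Set[str],
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Keep interaction partners that are known tumor antigens and return
    the top_n partners ordered by descending confidence score."""
    hits = [(partner, score) for partner, score in interactions
            if partner in known_antigens]
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits[:top_n]


# Hypothetical partners of the query protein with made-up confidence scores.
interactions = [
    ("EGFR", 0.99),
    ("PARTNER_A", 0.95),
    ("PARTNER_B", 0.80),
    ("PARTNER_C", 0.40),
]
# Hypothetical set of proteins recorded as tumor antigens.
known_antigens = {"EGFR", "PARTNER_B", "PARTNER_C"}

print(rank_co_targets(interactions, known_antigens, top_n=2))
```

A high interaction score alone does not make a protein a useful co-target; the intersection with known antigens is what restricts the list to proteins for which T cell epitopes may already be characterized.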

Figure 1: Protein-protein interaction (PPI) network for HER2 and interacting tumor antigens (from TANTIGEN). PPI networks may indicate possible compensatory activity such as that for HER2 and EGFR, making interaction networks useful for elucidating additional potential targets. Nodes represent proteins and edges correspond to functional interactions. Thicker edges signify higher confidence in the interaction. Image was generated using the STRING database.

HER2 is also expressed in multiple neuronal and heart tissues in embryos and adults [161]. Special care should be taken when targeting HER2, since complications involving cardiotoxicity have been observed in some cases. HER2 is expressed in cardiac tissue in approximately 6 of 60 examined breast cancer patients [162]. However, trastuzumab is generally safe; in a study of the efficacy and safety of trastuzumab in 111 patients, only 2 patients showed signs of cardiotoxicity during monotreatment with trastuzumab, both having a history of cardiac disease [152].

3.3 Survivin

The BIRC5 protein, also known as survivin, is highly expressed in most human tumors compared with the expression in normal, adult, non-dividing tissues [163]. Survivin is an apoptosis inhibitor and thus facilitates increased tumor growth. It has been the subject of extensive investigation for cancer immunotherapy applications [164]. The BIRC5 gene consists of three introns and four exons, giving rise to seven known splice isoforms that overlap by only approximately 40% identity [165]. Additionally, four somatic mutations detectable at the protein level have been characterized. Survivin is found in several compartments, including mitochondria, cytoplasm, nucleus, and the extracellular space, making it a suitable target for T cell-based therapies [166, 167]. Information for assessing survivin as a potential tumor antigen is listed in Table S1.

3.3.1 Identifying potential T cell epitopes

Predicting HLA binders for 9-meric peptides using netMHC yields 36 predicted binders to HLA A*02:01, A*03:01, A*11:01, A*24:02, B*07:01, B*08:01, and B*15:01. Four of these predicted binders are present in all known variants of survivin, of which three are experimentally validated HLA ligands, found by cross-referencing with TANTIGEN (Table S2).

3.3.2 Selecting therapy targets

In theory, each of the four peptides presented in Table S2 is potentially capable of eliciting an immune response against all known variants of BIRC5. It is highly desirable to target multiple variants, as isoforms 2α and B3 have been associated with shortened survival prognosis in breast cancer patients [168, 169]. Co-targeting multiple proteins involved in apoptosis signaling may confer synergistic effects [39]. For example, BIRC5 expression has been shown to be strongly correlated with expression of BCL2 [170], another known apoptosis suppressor [171]. In addition, a number of T cell epitopes have been characterized for BCL2 [172, 173], making it potentially attractive to co-target BIRC5 and BCL2. Additional targets may be found by querying the STRING database for potentially interacting known tumor antigens (from TANTIGEN). Figure S1 shows predicted interactants of BIRC5 that are also known to harbor tumor T cell antigens. The essential role of survivin during cell development [174], and its potentially homeostatic effect in healthy tissues [175], call for careful consideration of potential adverse effects of targeting this antigen.

However, phase I and II clinical trials of drugs targeting survivin, such as sepantronium bromide (YM155), report little to no adverse effects of YM155 monotreatment [176]. This indicates that immunotherapies targeting survivin may also be safe to use.

3.4 IDO

Indoleamine 2,3-dioxygenase 1 (IDO1 or IDO) is an intracellular enzyme that catabolizes tryptophan. In doing so, it inhibits T cell proliferation and acts as an immunosuppressant. IDO is constitutively expressed in most human tumors and is absent in most healthy tissues, making it an attractive target for cancer therapies [177]. Additionally, IDO has been shown to elicit naturally occurring CD8 [178] and CD4 [179] T cell responses, making it suitable for T cell-based therapies. Table S3 lists information, collected from a number of publicly available data sources, for assessing IDO as a potential tumor antigen.

3.4.1 Identifying potential T cell epitopes

Predicting HLA binders for 9-meric peptides using netMHC yields 51 predicted binders to HLA A*02:01, A*03:01, A*11:01, A*24:02, B*07:01, B*08:01, and B*15:01. All predicted binders are present in all known variants of IDO. Two are experimentally validated T cell epitopes, found by cross-referencing with TANTIGEN (Table S4).

3.4.2 Selecting therapy targets

Table S4 shows that there are many potential HLA binders across the entire span of the protein. No splice variants are known, but nine somatic mutations have been characterized for IDO. At the current level of knowledge, broadly targeting all known variants of the antigen is therefore simple. Furthermore, two positions known to harbor somatic mutations (positions 3 and 5) are likely to remain immunogenic even after mutation, as both peptides resulting from the variable positions are predicted to bind the same HLA with approximately the same affinity. An additional strategy to increase the breadth and depth of the immune response is to select potential HLA class II binders (not predicted here). For example, it was previously shown that the class II epitope DTLLKALLEIASCLE is capable of eliciting a strong response in melanoma patients [179]. This peptide also harbors the class I epitope ALLEIASCL, which also elicits an in vivo response [178]. Querying the STRING database for potential interactions with other known antigens yields one interactant, namely the metabolic enzyme

Cytochrome P450 1B1 (CYP1B1). Members of the cytochrome P450 group have been shown to support IDO activity in vitro [180], and CYP1B1, which is widely expressed in tumor tissue but not in healthy tissues, has been shown to elicit spontaneous T cell responses in cancer patients [181], rendering it a potential co-target. In addition to the results from the STRING database, querying ClinicalTrials.gov (http://www.clinicaltrials.gov/) reveals that IDO peptides are currently being tested as treatment for metastatic melanoma patients in conjunction with survivin peptides and temozolomide therapy (trial record NCT01543464).

3.5 General workflow

A schematized workflow for the bioinformatics process of identification of immunotherapy targets described in the three examples is shown in Figure 2.

3.6 Using epitopes in therapy

The final crucial screening test for candidate peptides is to determine whether T cells recognize the peptide antigens. First, the peptides are examined for their actual ability to bind to HLA molecules, using, for example, iTopia [182], mass spectrometry [119], or flow cytometry-based methods [120]. However, there are many factors that determine a T cell response against a given epitope. These include the expression level of the relevant source protein, processing, HLA level on the surface, TCR repertoire, T cell sensitivity, immune suppression, etc. Thus, T cells specific for the epitope can be identified by examining the potential spontaneous T cell response against the potential epitopes in cancer patients [181, 183–185]. Alternatively, it is possible to generate specific T cells in vitro and establish whether the peptide-specific T cells recognize cells encoding the full-length protein, e.g. cancer cells. Another strategy to identify natural HLA-restricted antigens involves their purification from natural sources, i.e. cells or tissues [186].

Having established that a given peptide is indeed a T cell epitope, a successful vaccination requires an additional component: an adjuvant. The adjuvant amplifies the immune response induced by the antigen. Cancer vaccines are therapeutic vaccines and differ from prophylactic vaccines in that they are given with the purpose of overcoming an already existing disease in the patient. The application of peptides represents one of the simplest ways of targeting a single or a few antigens, but other ways of achieving this are certainly possible, e.g., by using whole protein, RNA, or DNA. Thus, cancer vaccinations target one or more TAAs, and many different vaccine approaches exist. In

Figure 2: The workflow for discovery of potential T cell epitopes and immunotherapy targets in tumor tissue. Green operations are preliminary and final laboratory analyses; red operations are bioinformatics analyses performed using computational tools (described in further detail in section 4.1); blue operations are cross-references with biological databases (described in further detail in section 4.2), and yellow denotes intermediary outputs from the analyses and cross-referencing.

[Figure 2 diagram. Workflow stages: cataloguing potential antigens; identification of potential T cell epitopes; epitope conservation analysis; co-target analysis; validation. Box labels: healthy tissue sample; tumor and healthy tissue samples; gene expression profiling (microarray, RNA seq); protein expression profiling (immunohistochemistry, mass spectrometry); differentially expressed genes/proteins; potential protein targets; mutation and splice variant mapping (RNA seq, DNA seq); potential antigens; HLA profiling (microarray, RNA seq, DNA seq); predict epitopes (netMHC, netMHCII); potential T cell epitopes; epitope stability analysis (sequence alignment); pathway/interactant analysis; potential co-targets; experimental validation. Databases shown: TANTIGEN; UniGene, Human Protein Atlas; COSMIC, UniProt; MSigDB, STRING.]

general, the adjuvant may help induce an antibody response or a cellular response, since the type of response depends, to some extent, on the adjuvant. Since most efforts in the field have addressed induction of cellular (T cell) responses, adjuvants with this focus are most intensively studied in cancer vaccines and immunotherapies. The activation of DCs has attracted much attention, as they are regarded as key mediators of cross talk between the innate and adaptive immune responses and have been described as nature's own adjuvant [187]. However, other approaches, including synthetic adjuvants (e.g. bacterial extracts in oil emulsions), DNA, or viral vectors, are being investigated [188]. Adjuvants can increase the immunogenicity of the vaccine (e.g. Bacillus Calmette-Guérin (BCG) [189], tetanus toxin, interleukin (IL)-2 [190], interferon (IFN) [191], thymalfasin [192], granulocyte macrophage-colony stimulating factor (GM-CSF) [193]) or decrease immune regulatory mechanisms (e.g. CD25 antibody [194] or chemotherapy).

4 Analysis tools and data resources for cross-referencing and validation

4.1 Analytical tools

Software tools for the analysis of tumor antigen data identified from patient samples as well as from public data sources exist in abundance. They are available as stand-alone tools with web-accessible interfaces, some of which can also be downloaded, and some are integrated into relevant biological databases for analysis of public datasets (e.g. TANTIGEN). Table 3 lists a sample of freely available software tools relevant for cancer immunotherapy target discovery.

4.2 Databases

Similarly, a number of databases with data for cross-referencing and co-analysis with private data are publicly available. These databases can be queried directly via their web interfaces, and some are also available for download. Nucleic Acids Research offers a comprehensive catalogue of biological databases [195]. Table 4 lists a sample of databases useful for characterizing potential antigens and epitopes for immunotherapy designs.

Table 3: Sample of analytical tools for discovery of T cell epitopes for cancer immunotherapy.

Tool | Purpose | URL | Ref.
HLA binding predictions (multiple algorithms exist, reviewed in [105])
netMHC | Prediction of HLA class I binders | http://www.cbs.dtu.dk/services/NetMHC/ | [149]
netMHCII | Prediction of HLA class II binders | http://www.cbs.dtu.dk/services/NetMHCII/ | [109]
Conservation analysis
IEDB analysis resource | Epitope Conservancy Analysis | http://tools.immuneepitope.org/tools/conservancy/ | [196]
IEDB analysis resource | Epitope Cluster Analysis | http://tools.immuneepitope.org/tools/cluster/ | -
IEDB analysis resource | Population Coverage Calculation | http://tools.immuneepitope.org/tools/population/ | [197]
MAFFT | Multiple sequence alignments | http://www.ebi.ac.uk/Tools/msa/mafft/ | [198]
BlockLogo | Visualization of immunological motifs | http://research4.dfci.harvard.edu/cvc/blocklogo/ | [199]
Block conservation | Conservation analysis for multi epitope strategies | http://met-hilab.bu.edu/blockcons/ | -

Table 4: Sample of databases containing information useful for discovery of tumor T cell antigens.

Database | Content | URL | Ref.
Gene information
OMIM | Information about human genes and genetic phenotypes | http://omim.org/ | [200]
Ensembl | Annotated gene sequences | http://www.ensembl.org | [201]
GeneCards | Functional information about human genes | http://www.genecards.org/ | [202]
Sequence data
GenBank | Genomic and proteomic sequences | http://www.ncbi.nlm.nih.gov/genbank/ | [203]
UniProt | Proteomic sequences and functional annotations | http://www.uniprot.org/ | [204]
COSMIC | Somatic mutations in cancer | http://cancer.sanger.ac.uk/cancergenome/projects/cosmic/ | [205]
CGhub | Cancer genome sequences | https://cghub.ucsc.edu/ | -
Gene expression
GEO | Gene expression data | http://www.ncbi.nlm.nih.gov/geo/ | [82]
Oncomine | Gene expression data | https://www.oncomine.org/ | [206]
UniGene | Expression data by tissue | http://www.ncbi.nlm.nih.gov/unigene | -
Protein expression
Human Protein Atlas | Protein expression by tissue | http://www.proteinatlas.org/ | [207]
Interaction/pathway data
STRING | Known and predicted protein interactions | http://string-db.org/ | [158]
BioGRID | Biological interaction data | http://thebiogrid.org/ | [208]
KEGG | Information about genomes, enzymatic pathways, and biological chemicals | http://www.genome.jp/kegg/ | [209]
Reactome | Information about biological pathways | http://www.reactome.org/ | [210]
MSigDB | Molecular signatures annotated using gene set enrichment analysis methods | http://www.broadinstitute.org/gsea/msigdb | [211]
Tumor antigen data
TANTIGEN | Tumor T cell antigens | http://cvc.dfci.harvard.edu/tadb/ | -
CTdatabase | Cancer-Testis antigens | http://www.cta.lncc.br/ | [87]
Cancer Immunity peptide database | Tumor T cell antigens | http://cancerimmunity.org/peptide/ | [85]
HLA data resources
IMGT/HLA | HLA sequences and nomenclature | http://www.ebi.ac.uk/ipd/imgt/hla/ | [124]
Clinical trials
ClinicalTrials.gov | Database of active and completed clinical trials | http://www.clinicaltrials.gov | -

5 Future perspectives

At presents, accurate epitope predictions are limited to cellular responses, although prediction of antibody response is highly studied [212]. Computational methods for identification of T cell epitopes also have limitations, in that peptide pre-processing predictions are not yet as accurate as peptide binding prediction algorithms. Additionally, availability of tumor sequences represents a bottleneck in conservation and variability analyses, but this is likely to be remedied in a near future, as high-throughput sequencing becomes cheaper and more efficient. Another issue currently being addressed is that intra-tumor diversity may not be adequately captured by current methods, and may therefore impact the efficacy of immunotherapies and other therapies alike. Lastly, immunotherapy is largely a field of research that must be addressed using proteomics analyses rather than genomics analyses, the latter having been far more prolific in the past decade. Common to all these limitations is that they are currently being addressed in the wet lab to different extents, and all aspects of this progress will increase the need for bioinformatics tools and experts. The assessment of an individual’s susceptibility to immunotherapy can be addressed, both in the wet lab and in silico. In recent years, a number of tumor T cell antigens have been identified against a number of cancers. Successfully induced T cells have been observed in peripheral blood, and yet clinical responses to immunotherapies have been limited. This indicates that barriers to immune response exists within the tumor environment, and as such, play a significant role in planning appropriate treatment modalities [213]. Therefore, personalized immunotherapy treatments are likely to benefit from thorough analysis of genetic and proteomic host factors, or biomarkers, related to susceptibility to a given immunotherapy. 
Biomarkers are measurable biological substances that are consistently observed in conjunction with a given biological state of tissue, cells, or body fluids. Biomarkers can take the form of traceable chemical compounds, expression patterns of genes, non-coding RNAs, or proteins (including antibodies), or epigenetic patterns, and can be useful for disease detection, classification, and prognostic prediction [214]. Examples of tumor immune escape mechanisms include downregulation of HLA expression [215], infiltration of immune suppressive cells (e.g. Tregs [216], MDSC [217]), expression of immune suppressive molecules (e.g. IDO [218] and Arg-1 [219]), lack of chemokine-mediated trafficking [220], poor innate immune cell activation [28], and expression of immune checkpoint ligands such as PD-L1 and PD-L2
[221], all of which may be anticipated by expression characterization if a predictive biomarker set can be defined [213]. In recognition of the immune system’s role both in curbing the expansion of cancer cells and in a tumor’s later progression, it has recently been proposed that the immune system should be included in the traditional histopathological classification of tumors. The classical TNM staging system describes the extent of tumorigenesis based on tumor burden (T), the presence of cancer cells in draining lymph nodes (N), and the status of metastases (M). In addition to these scores, a so-called immunoscore (I) can be determined on the basis of two leukocyte populations, namely cytotoxic CD8+ T cells and memory CD8+ T cells [222, 223]. Comparing the infiltration of these two cell types in the center of the tumor and in its invasive margin determines the “I” score for the tumor. In two independent cohorts, patients with a high I score had significantly fewer relapses and improved overall survival compared to patients with low I scores [224]. The TNM-I classification scheme, in conjunction with the definition of predictive biomarkers for immune response, could provide a reasonable estimate of the suitability of immunotherapy as part of a treatment modality in any given patient. However, the applicability of this approach is limited by the lack of functional studies on the topic. Firstly, not all cancers are resected, and even fewer have significant material left after normal histological assessment has been completed. Markers and leukocyte profiles that can classify patients on the basis of immunological markers in peripheral blood are therefore desirable, although no such markers have been successfully correlated with clinical response to antigen-based immunotherapies [213]. Secondly, the complex interplay between genetic and proteomic elements makes it hard to identify single biomarkers that give accurate predictions.
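To make the immunoscore idea concrete, the following minimal Python sketch counts how many of the four density measurements (two cell populations, each measured in the tumor center and in the invasive margin) exceed a cut-off, which gives a score from I0 to I4. The counting scheme is an assumption based on our reading of the immunoscore proposal [222, 223] and is not spelled out in the text above. The function name, the dictionary layout, and in particular the cut-off values are hypothetical placeholders; real cut-offs would have to be derived from the cohort under study.

```python
# Hypothetical density cut-offs in cells per mm^2. These numbers are
# placeholders for illustration only, not published thresholds.
CUTOFFS = {
    ("cytotoxic", "center"): 100.0,
    ("cytotoxic", "margin"): 150.0,
    ("memory", "center"): 200.0,
    ("memory", "margin"): 250.0,
}


def immunoscore(densities, cutoffs=CUTOFFS):
    """Return the number (0 to 4) of measurements above their cut-off.

    `densities` maps (population, region) pairs to a cell density.
    Assumed scheme: each of the four measurements contributes one
    point when it is classified as "high".
    """
    missing = set(cutoffs) - set(densities)
    if missing:
        raise ValueError(f"missing measurements: {sorted(missing)}")
    return sum(densities[key] > cutoffs[key] for key in cutoffs)


# Example patient: high in three of the four measurements.
patient = {
    ("cytotoxic", "center"): 320.0,
    ("cytotoxic", "margin"): 90.0,
    ("memory", "center"): 410.0,
    ("memory", "margin"): 600.0,
}
print(f"I{immunoscore(patient)}")  # prints I3
```

In practice the dichotomization into high and low densities is the critical step, and a fixed table such as the one above would be replaced by cut-offs estimated from outcome data.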
As a result of these obstacles, no tools or data resources for immunotherapy susceptibility biomarkers exist as yet.

6 Conclusions

In this review, we have discussed the key issues of cancer immunotherapy design, chief among which is the selection of therapy targets for achieving efficient and lasting protection against malignant tumors. Traditionally, mass experimental screening has been the primary tool for identifying cancer immunotherapy targets, a process that could be streamlined by systematic application of bioinformatics to patient antigen expression and sequence profiles, as
well as analyses in conjunction with the ever-growing body of publicly available biological data. The conceptual framework put forth in this review was applied to HER2, BIRC5, and IDO. We addressed the issue of resistance to current therapies by suggesting multi-epitope strategies based on targets selected to encompass epitopes found in all known isoforms and mutated antigen sequences, a method that will increase in accuracy as the expression and mutation landscape of tumor antigens is investigated in greater detail in the wet lab with the increased use of high-throughput technologies. What this increase in data will reveal about tumor cells’ response to immunotherapies is uncertain, but it is clear that as the body of tumor data grows, so will the need for bioinformatics to organize, store, and analyze it.

References

1. Kirkwood JM, Butterfield LH, Tarhini AA, Zarour H, Kalinski P, Ferrone S: Immunotherapy of cancer in 2012. CA. Cancer J. Clin. 2012, 62:309–35.
2. Rees R, Laversin S, Murray C, Ball G: Current Approaches to Identify and Evaluate Cancer Biomarkers for Patient Stratification. In Vaccinol. Princ. Pract. Edited by Morrow WJW, Sheikh NA, Schmidt CS, Davies DH. Oxford, UK: Wiley-Blackwell; 2012:452–63.
3. Wei W-Z, Morris GP, Kong Y-CM: Anti-tumor immunity and autoimmunity: a balancing act of regulatory T cells. Cancer Immunol. Immunother. 2004, 53:73–8.
4. Disis ML: Immunologic biomarkers as correlates of clinical response to cancer immunotherapy. Cancer Immunol. Immunother. 2011, 60:433–42.
5. Gyorki DE, Callahan M, Wolchok JD, Ariyan CE: The delicate balance of melanoma immunotherapy. Clin. Transl. Immunol. 2013, 2:e5.
6. Hanahan D, Weinberg R: Hallmarks of cancer: the next generation. Cell 2011, 144:646–74.
7. Tran B, Dancey JE, Kamel-Reid S, McPherson JD, Bedard PL, Brown AMK, Zhang T, Shaw P, Onetto N, Stein L, Hudson TJ, Neel BG, Siu LL: Cancer genomics: technology, discovery, and translation. J. Clin. Oncol. 2012, 30:647–60.
8. Palucka K, Banchereau J: Cancer immunotherapy via dendritic cells. Nat. Rev. Cancer 2012, 12:265–77.
9. Suzuki A, Iizuka A, Komiyama M, Takikawa M, Kume A, Tai S, Ohshita C, Kurusu A, Nakamura Y, Yamamoto A, Yamazaki N, Yoshikawa S, Kiyohara Y, Akiyama Y: Identification of melanoma antigens using a Serological Proteome Approach (SERPA). Cancer Genomics Proteomics , 7:17–23.
10. Ogino S, Galon J, Fuchs CS, Dranoff G: Cancer immunology--analysis of host and tumor factors for personalized medicine. Nat. Rev. Clin. Oncol. 2011, 8:711–9.
11. Speiser DE, Romero P: Molecularly defined vaccines for cancer immunotherapy, and protective T cell immunity. Semin. Immunol. 2010, 22:144–54.
12. Scott AM, Wolchok JD, Old LJ: Antibody therapy of cancer. Nat. Rev. Cancer 2012, 12:278–87.
13. Ostrand-Rosenberg S: Animal models of tumor immunity, immunotherapy and cancer vaccines. Curr. Opin. Immunol. 2004, 16:143–50.
14. Palladini A, Nicoletti G, Pappalardo F, Murgo A, Grosso V, Stivani V, Ianzano ML, Antognoli A, Croci S, Landuzzi L, De Giovanni C, Nanni P, Motta S, Lollini P-L: In silico modeling and in vivo efficacy of cancer-preventive vaccinations. Cancer Res. 2010, 70:7755–63.
15. Lollini P-L, Cavallo F, Nanni P, Forni G: Vaccines for tumour prevention. Nat. Rev. Cancer 2006, 6:204–16.
16. Lesterhuis WJ, Haanen JBAG, Punt CJA: Cancer immunotherapy--revisited. Nat. Rev. Drug Discov. 2011, 10:591–600.
17. Parmiani G, De Filippo A, Novellino L, Castelli C: Unique human tumor antigens: immunobiology and use in clinical trials. J. Immunol. 2007, 178:1975–9.

18. Van den Eynde BJ, van der Bruggen P: T cell defined tumor antigens. Curr. Opin. Immunol. 1997, 9:684–93.
19. Cavallo F, Calogero RA, Forni G: Are oncoantigens suitable targets for anti-tumour therapy? Nat. Rev. Cancer 2007, 7:707–13.
20. Schirrmacher V, Feuerer M, Beckhove P, Ahlert T, Umansky V: T cell memory, anergy and immunotherapy in breast cancer. J. Mammary Gland Biol. Neoplasia 2002, 7:201–8.
21. Vesely MD, Kershaw MH, Schreiber RD, Smyth MJ: Natural innate and adaptive immunity to cancer. Annu. Rev. Immunol. 2011, 29:235–71.
22. Kyewski B, Klein L: A central role for central tolerance. Annu. Rev. Immunol. 2006, 24:571–606.
23. Schell TD, Knowles BB, Tevethia SS: Sequential loss of cytotoxic T lymphocyte responses to simian virus 40 large T antigen epitopes in T antigen transgenic mice developing osteosarcomas. Cancer Res. 2000, 60:3002–12.
24. Khong HT, Wang QJ, Rosenberg SA: Identification of multiple antigens recognized by tumor-infiltrating lymphocytes from a single patient: tumor escape by antigen loss and loss of MHC expression. J. Immunother. , 27:184–90.
25. Restifo NP, Marincola FM, Kawakami Y, Taubenberger J, Yannelli JR, Rosenberg SA: Loss of functional beta 2-microglobulin in metastatic melanomas from five patients receiving immunotherapy. J. Natl. Cancer Inst. 1996, 88:100–8.
26. Dunn GP, Sheehan KCF, Old LJ, Schreiber RD: IFN unresponsiveness in LNCaP cells due to the lack of JAK1 gene expression. Cancer Res. 2005, 65:3447–53.
27. Stern-Ginossar N, Gur C, Biton M, Horwitz E, Elboim M, Stanietsky N, Mandelboim M, Mandelboim O: Human microRNAs regulate stress-induced immune responses mediated by the receptor NKG2D. Nat. Immunol. 2008, 9:1065–73.
28. Wang T, Niu G, Kortylewski M, Burdelya L, Shain K, Zhang S, Bhattacharya R, Gabrilovich D, Heller R, Coppola D, Dalton W, Jove R, Pardoll D, Yu H: Regulation of the innate and adaptive immune responses by Stat-3 signaling in tumor cells. Nat. Med. 2004, 10:48–54.
29. Kataoka T, Schröter M, Hahne M, Schneider P, Irmler M, Thome M, Froelich CJ, Tschopp J: FLIP prevents apoptosis induced by death receptors but not by perforin/granzyme B, chemotherapeutic drugs, and gamma irradiation. J. Immunol. 1998, 161:3936–42.
30. Wainwright DA, Nigam P, Thaci B, Dey M, Lesniak MS: Recent developments on immunotherapy for brain cancer. Expert Opin. Emerg. Drugs 2012, 17:181–202.
31. Flavell RA, Sanjabi S, Wrzesinski SH, Licona-Limón P: The polarization of immune cells in the tumour environment by TGFbeta. Nat. Rev. Immunol. 2010, 10:554–67.
32. Lohr J, Ratliff T, Huppertz A, Ge Y, Dictus C, Ahmadi R, Grau S, Hiraoka N, Eckstein V, Ecker RC, Korff T, von Deimling A, Unterberg A, Beckhove P, Herold-Mende C: Effector T-cell infiltration positively impacts survival of glioblastoma patients and is impaired by tumor-derived TGF-β. Clin. Cancer Res. 2011, 17:4296–308.
33. Huang B, Zhao J, Li H, He K-L, Chen Y, Chen S-H, Mayer L, Unkeless JC, Xiong H: Toll-like receptors on tumor cells facilitate evasion of immune surveillance. Cancer Res. 2005, 65:5009–14.
34. Hamanishi J, Mandai M, Iwasaki M, Okazaki T, Tanaka Y, Yamaguchi K, Higuchi T, Yagi H, Takakura K, Minato N, Honjo T, Fujii S: Programmed cell death 1 ligand 1 and tumor-infiltrating CD8+ T lymphocytes are prognostic factors of human ovarian cancer. Proc. Natl. Acad. Sci. U. S. A. 2007, 104:3360–5.
35. Campoli M, Ferrone S: HLA antigen changes in malignant cells: epigenetic mechanisms and biologic significance. Oncogene 2008, 27:5869–85.
36. Schreiber RD, Old LJ, Smyth MJ: Cancer immunoediting: integrating immunity’s roles in cancer suppression and promotion. Science 2011, 331:1565–70.
37. Mellman I, Coukos G, Dranoff G: Cancer immunotherapy comes of age. Nature 2011, 480:480–9.

38. Pamer E, Cresswell P: Mechanisms of MHC class I--restricted antigen processing. Annu. Rev. Immunol. 1998, 16:323–58.
39. Andersen MH, Sørensen RB, Schrama D, Svane IM, Becker JC, Thor Straten P: Cancer treatment: the combination of vaccination with other therapies. Cancer Immunol. Immunother. 2008, 57:1735–43.
40. Rock KL, Gramm C, Rothstein L, Clark K, Stein R, Dick L, Hwang D, Goldberg AL: Inhibitors of the proteasome block the degradation of most cell proteins and the generation of peptides presented on MHC class I molecules. Cell 1994, 78:761–71.
41. Androlewicz MJ, Anderson KS, Cresswell P: Evidence that transporters associated with antigen processing translocate a major histocompatibility complex class I-binding peptide into the endoplasmic reticulum in an ATP-dependent manner. Proc. Natl. Acad. Sci. U. S. A. 1993, 90:9130–4.
42. Henney CS, Gaffney J, Bloom BR: On the relation of products of activated lymphocytes to cell-mediated cytolysis. J. Exp. Med. 1974, 140:837–52.
43. Bevan MJ: Minor H antigens introduced on H-2 different stimulating cells cross-react at the cytotoxic T cell level during in vivo priming. J. Immunol. 1976, 117:2233–8.
44. Bevan MJ: Cross-priming for a secondary cytotoxic response to minor H antigens with H-2 congenic cells which do not cross-react in the cytotoxic assay. J. Exp. Med. 1976, 143:1283–8.
45. Rock KL, Shen L: Cross-presentation: underlying mechanisms and role in immune surveillance. Immunol. Rev. 2005, 207:166–83.
46. Van der Bruggen P, Traversari C, Chomez P, Lurquin C, De Plaen E, Van den Eynde B, Knuth A, Boon T: A gene encoding an antigen recognized by cytolytic T lymphocytes on a human melanoma. Science 1991, 254:1643–7.
47. Boon T, van der Bruggen P: Human tumor antigens recognized by T lymphocytes. J. Exp. Med. 1996, 183:725–9.
48. Cheever MA, Allison JP, Ferris AS, Finn OJ, Hastings BM, Hecht TT, Mellman I, Prindiville SA, Viner JL, Weiner LM, Matrisian LM: The prioritization of cancer antigens: a national cancer institute pilot project for the acceleration of translational research. Clin. Cancer Res. 2009, 15:5323–37.
49. Leisegang M, Wilde S, Spranger S, Milosevic S, Frankenberger B, Uckert W, Schendel DJ: MHC-restricted fratricide of human lymphocytes expressing survivin-specific transgenic T cell receptors. J. Clin. Invest. 2010, 120:3869–77.
50. Morgan RA, Yang JC, Kitano M, Dudley ME, Laurencot CM, Rosenberg SA: Case report of a serious adverse event following the administration of T cells transduced with a chimeric antigen receptor recognizing ERBB2. Mol. Ther. 2010, 18:843–51.
51. Brichard V, Van Pel A, Wölfel T, Wölfel C, De Plaen E, Lethé B, Coulie P, Boon T: The tyrosinase gene codes for an antigen recognized by autologous cytolytic T lymphocytes on HLA-A2 melanomas. J. Exp. Med. 1993, 178:489–95.
52. Wang RF, Robbins PF, Kawakami Y, Kang XQ, Rosenberg SA: Identification of a gene encoding a melanoma tumor antigen recognized by HLA-A31-restricted tumor-infiltrating lymphocytes. J. Exp. Med. 1995, 181:799–804.
53. Bloom MB, Perry-Lalley D, Robbins PF, Li Y, El-Gamil M, Rosenberg SA, Yang JC: Identification of tyrosinase-related protein 2 as a tumor rejection antigen for the B16 melanoma. J. Exp. Med. 1997, 185:453–9.
54. Seremet T, Brasseur F, Coulie PG: Tumor-specific antigens and immunologic adjuvants in cancer immunotherapy. Cancer J. 2011, 17:325–30.
55. Dudley ME, Wunderlich JR, Robbins PF, Yang JC, Hwu P, Schwartzentruber DJ, Topalian SL, Sherry R, Restifo NP, Hubicki AM, Robinson MR, Raffeld M, Duray P, Seipp CA, Rogers-Freezer L, Morton KE, Mavroukakis SA, White DE, Rosenberg SA: Therapeutic cancer vaccines in combination with conventional therapy. Science 2002, 298:850–4.
56. Coulie PG, Brichard V, Van Pel A, Wölfel T, Schneider J, Traversari C, Mattei S, De Plaen E, Lurquin C, Szikora JP, Renauld JC, Boon T: A new gene coding for a differentiation antigen recognized by autologous cytolytic T lymphocytes on HLA-A2 melanomas. J. Exp. Med. 1994, 180:35–42.

57. Bakker AB, Schreurs MW, de Boer AJ, Kawakami Y, Rosenberg SA, Adema GJ, Figdor CG: Melanocyte lineage-specific antigen gp100 is recognized by melanoma-derived tumor-infiltrating lymphocytes. J. Exp. Med. 1994, 179:1005–9.
58. Lurquin C, Van Pel A, Mariamé B, De Plaen E, Szikora JP, Janssens C, Reddehase MJ, Lejeune J, Boon T: Structure of the gene of tum- transplantation antigen P91A: the mutated exon encodes a peptide recognized with Ld by cytolytic T cells. Cell 1989, 58:293–303.
59. Mariuzza RA, Phillips SE, Poljak RJ: The structural basis of antigen-antibody recognition. Annu. Rev. Biophys. Biophys. Chem. 1987, 16:139–59.
60. Wölfel T, Hauer M, Schneider J, Serrano M, Wölfel C, Klehmann-Hieb E, De Plaen E, Hankeln T, Meyer zum Büschenfelde KH, Beach D: A p16INK4a-insensitive CDK4 mutant targeted by cytolytic T lymphocytes in a human melanoma. Science 1995, 269:1281–4.
61. Linard B, Bézieau S, Benlalam H, Labarrière N, Guilloux Y, Diez E, Jotereau F: A ras-mutated peptide targeted by CTL infiltrating a human melanoma lesion. J. Immunol. 2002, 168:4802–8.
62. Andersen MH, Fensterle J, Ugurel S, Reker S, Houben R, Guldberg P, Berger TG, Schadendorf D, Trefzer U, Bröcker E-B, Straten P thor, Rapp UR, Becker JC: Therapeutic cancer vaccines in combination with conventional therapy. Cancer Res. 2004, 64:5456–60.
63. Sharkey MS, Lizée G, Gonzales MI, Patel S, Topalian SL: CD4(+) T-cell recognition of mutated B-RAF in melanoma patients harboring the V599E mutation. Cancer Res. 2004, 64:1595–9.
64. Yotnda P, Firat H, Garcia-Pons F, Garcia Z, Gourru G, Vernant JP, Lemonnier FA, Leblond V, Langlade-Demoyen P: Cytotoxic T cell response against the chimeric p210 BCR-ABL protein in patients with chronic myelogenous leukemia. J. Clin. Invest. 1998, 101:2290–6.
65. Nowell PC, Hungerford DA: A minute chromosome in human chronic granulocytic leukemia. Science 1960, 132:1488–1501.
66. Rappuoli R: Reverse vaccinology. Curr. Opin. Microbiol. 2000, 3:445–50.

67. Harndahl M, Rasmussen M, Roder G, Dalgaard Pedersen I, Sørensen M, Nielsen M, Buus S: Peptide-MHC class I stability is a better predictor than peptide affinity of CTL immunogenicity. Eur. J. Immunol. 2012, 42:1405–16.
68. Rudolph MG, Stanfield RL, Wilson IA: How TCRs bind MHCs, peptides, and coreceptors. Annu. Rev. Immunol. 2006, 24:419–66.
69. Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer. Nature 2001, 412:822–6.
70. Spentzos D, Levine DA, Ramoni MF, Joseph M, Gu X, Boyd J, Libermann TA, Cannistra SA: Therapeutic cancer vaccines in combination with conventional therapy. J. Clin. Oncol. 2004, 22:4700–10.
71. Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, D’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Cancer Inst. 2006, 98:1183–92.
72. Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN: Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010, 17:98–110.
73. Snuderl M, Fazlollahi L, Le LP, Nitta M, Zhelyazkova BH, Davidson CJ, Akhavanfard S, Cahill DP, Aldape KD, Betensky RA, Louis DN, Iafrate AJ: Mosaic amplification of multiple receptor tyrosine kinase genes in glioblastoma. Cancer Cell 2011, 20:810–7.
74. Sottoriva A, Spiteri I, Piccirillo SGM, Touloumis A, Collins VP, Marioni JC, Curtis C, Watts C, Tavaré S: Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics. Proc. Natl. Acad. Sci. U. S. A. 2013, 110:4009–14.

75. Campoli M, Chang C-C, Ferrone S: HLA class I antigen loss, tumor immune escape and immune selection. Vaccine 2002, 20 Suppl 4:A40–5.
76. Chang C-C, Campoli M, Restifo NP, Wang X, Ferrone S: Immune selection of hot-spot beta 2-microglobulin gene mutations, HLA-A2 allospecificity loss, and antigen-processing machinery component down-regulation in melanoma cells derived from recurrent metastases following immunotherapy. J. Immunol. 2005, 174:1462–71.
77. Sioud M, Hansen M, Dybwad A: Profiling the immune responses in patient sera with peptide and cDNA display libraries. Int. J. Mol. Med. 2000, 6:123–8.
78. Chen YT: Identification of human tumor antigens by serological expression cloning: an online review on SEREX. Cancer Immun 2004.
79. Hanash S: Disease proteomics. Nature 2003, 422:226–32.
80. Loging WT, Lal A, Siu IM, Loney TL, Wikstrand CJ, Marra MA, Prange C, Bigner DD, Strausberg RL, Riggins GJ: Identifying potential tumor markers and antigens by database mining and rapid expression screening. Genome Res. 2000, 10:1393–402.
81. Scanlan MJ, Gordon CM, Williamson B, Lee S-Y, Chen Y-T, Stockert E, Jungbluth A, Ritter G, Jäger D, Jäger E, Knuth A, Old LJ: Identification of cancer/testis genes by database mining and mRNA expression analysis. Int. J. Cancer 2002, 98:485–92.
82. Barrett T, Wilhite SE, Ledoux P, Evangelista C, Kim IF, Tomashevsky M, Marshall KA, Phillippy KH, Sherman PM, Holko M, Yefanov A, Lee H, Zhang N, Robertson CL, Serova N, Davis S, Soboleva A: NCBI GEO: archive for functional genomics data sets--update. Nucleic Acids Res. 2013, 41:D991–5.
83. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, Gaasterland T, Glenisson P, Holstege FC, Kim IF, Markowitz V, Matese JC, Parkinson H, Robinson A, Sarkans U, Schulze-Kremer S, Stewart J, Taylor R, Vilo J, Vingron M: Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat. Genet. 2001, 29:365–71.

84. Gry M, Rimini R, Strömberg S, Asplund A, Pontén F, Uhlén M, Nilsson P: Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics 2009, 10:365.
85. Van der Bruggen P, Stroobant V, Vigneron N, Van den Eynde B: Peptide database: T cell-defined tumor antigens. Cancer Immun 2013.
86. Novellino L, Castelli C, Parmiani G: A listing of human tumor antigens recognized by T cells: March 2004 update. Cancer Immunol. Immunother. 2005, 54:187–207.
87. Almeida LG, Sakabe NJ, deOliveira AR, Silva MCC, Mundstein AS, Cohen T, Chen Y-T, Chua R, Gurung S, Gnjatic S, Jungbluth AA, Caballero OL, Bairoch A, Kiesler E, White SL, Simpson AJG, Old LJ, Camargo AA, Vasconcelos ATR: CTdatabase: a knowledge-base of high-throughput and curated data on cancer-testis antigens. Nucleic Acids Res. 2009, 37:D816–9.
88. Yoshitake Y, Nakatsura T, Monji M, Senju S, Matsuyoshi H, Tsukamoto H, Hosaka S, Komori H, Fukuma D, Ikuta Y, Katagiri T, Furukawa Y, Ito H, Shinohara M, Nakamura Y, Nishimura Y: Proliferation potential-related protein, an ideal esophageal cancer antigen for immunotherapy, identified using complementary DNA microarray analysis. Clin. Cancer Res. 2004, 10:6437–48.
89. Charoentong P, Angelova M, Efremova M, Gallasch R, Hackl H, Galon J, Trajanoski Z: Bioinformatics for cancer immunology and immunotherapy. Cancer Immunol. Immunother. 2012, 61:1885–903.
90. Chatterjee M, Draghici S, Tainsky MA: Immunotheranostics: breaking tolerance in immunotherapy using tumor autoantigens identified on protein microarrays. Curr. Opin. Drug Discov. Devel. 2006, 9:380–5.
91. Miller JC, Zhou H, Kwekel J, Cavallo R, Burke J, Butler EB, Teh BS, Haab BB: Antibody microarray profiling of human prostate cancer sera: antibody screening and identification of potential biomarkers. Proteomics 2003, 3:56–63.
92. Hanash SM, Pitteri SJ, Faca VM: Mining the plasma proteome for cancer biomarkers. Nature 2008, 452:571–9.
93. Imielinski M, Cha S, Rejtar T, Richardson EA, Karger BL, Sgroi DC: Integrated proteomic, transcriptomic, and biological network analysis of breast carcinoma reveals molecular features of tumorigenesis and clinical relapse. Mol. Cell. Proteomics 2012, 11:M111.014910.
94. Nilsson T, Mann M, Aebersold R, Yates JR, Bairoch A, Bergeron JJM: Mass spectrometry in high-throughput proteomics: ready for the big time. Nat. Methods 2010, 7:681–5.
95. Falk K, Rötzschke O, Stevanović S, Jung G, Rammensee HG: Allele-specific motifs revealed by sequencing of self-peptides eluted from MHC molecules. Nature 1991, 351:290–6.
96. Reinherz EL, Tan K, Tang L, Kern P, Liu J, Xiong Y, Hussey RE, Smolyar A, Hare B, Zhang R, Joachimiak A, Chang HC, Wagner G, Wang J: The crystal structure of a T cell receptor in complex with peptide and MHC class II. Science 1999, 286:1913–21.
97. Van der Burg SH, Visseren MJ, Brandt RM, Kast WM, Melief CJ: Immunogenicity of peptides bound to MHC class I molecules depends on the MHC-peptide complex stability. J. Immunol. 1996, 156:3308–14.
98. Holzhütter HG, Kloetzel PM: A kinetic model of vertebrate 20S proteasome accounting for the generation of major proteolytic fragments from oligomeric peptide substrates. Biophys. J. 2000, 79:1196–205.
99. Kuttler C, Nussbaum AK, Dick TP, Rammensee HG, Schild H, Hadeler KP: An algorithm for the prediction of proteasomal cleavages. J. Mol. Biol. 2000, 298:417–29.
100. Keşmir C, Nussbaum AK, Schild H, Detours V, Brunak S: Therapeutic cancer vaccines in combination with conventional therapy. Protein Eng. 2002, 15:287–96.
101. Peters B, Bulik S, Tampe R, Van Endert PM, Holzhütter H-G: Identifying MHC class I epitopes by predicting the TAP transport efficiency of epitope precursors. J. Immunol. 2003, 171:1741–9.
102. Zhang GL, Petrovsky N, Kwoh CK, August JT, Brusic V: PRED(TAP): a system for prediction of peptide binding to the human transporter associated with antigen processing. Immunome Res. 2006, 2:3.

103. Bhasin M, Lata S, Raghava GPS: Therapeutic cancer vaccines in combination with conventional therapy. Methods Mol. Biol. 2007, 409:381–6.
104. Saxová P, Buus S, Brunak S, Keşmir C: Predicting proteasomal cleavage sites: a comparison of available methods. Int. Immunol. 2003, 15:781–7.
105. Zhang GL, Ansari HR, Bradley P, Cawley GC, Hertz T, Hu X, Jojic N, Kim Y, Kohlbacher O, Lund O, Lundegaard C, Magaret CA, Nielsen M, Papadopoulos H, Raghava GPS, Tal V-S, Xue LC, Yanover C, Zhu S, Rock MT, Crowe JE, Panayiotou C, Polycarpou MM, Duch W, Brusic V: Machine learning competition in immunology - Prediction of HLA class I binding peptides. J. Immunol. Methods 2011, 374:1–4.
106. Lin HH, Ray S, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. 2008, 9:8.
107. Lin HH, Zhang GL, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics 2008, 9 Suppl 12:S22.
108. Lundegaard C, Lund O, Nielsen M: Accurate approximation method for prediction of class I MHC affinities for peptides of length 8, 10 and 11 using prediction tools trained on 9mers. Bioinformatics 2008, 24:1397–8.
109. Nielsen M, Lund O: NN-align. An artificial neural network-based alignment algorithm for MHC class II peptide binding prediction. BMC Bioinformatics 2009, 10:296.
110. Parker KC, Bednarek MA, Coligan JE: Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. J. Immunol. 1994, 152:163–75.
111. Schuler MM, Nastke M-D, Stevanović S: SYFPEITHI: database for searching and T-cell epitope prediction. Methods Mol. Biol. 2007, 409:75–93.
112. Hu X, Mamitsuka H, Zhu S: Ensemble approaches for improving HLA class I-peptide binding prediction. J. Immunol. Methods 2011, 374:47–52.

113. Huang JC, Jojic N: Modeling major histocompatibility complex binding by nonparametric averaging of multiple predictors and sequence encodings. J. Immunol. Methods 2011, 374:35–42.
114. Frankild S, de Boer RJ, Lund O, Nielsen M, Kesmir C: Amino acid similarity accounts for T cell cross-reactivity and for “holes” in the T cell repertoire. PLoS One 2008, 3:e1831.
115. Pierce BG, Weng Z: A flexible docking approach for prediction of T cell receptor-peptide-MHC complexes. Protein Sci. 2013, 22:35–46.
116. Tung C-W, Ziehm M, Kämper A, Kohlbacher O, Ho S-Y: POPISK: T-cell reactivity prediction using support vector machines and string kernels. BMC Bioinformatics 2011, 12:446.
117. Saethang T, Hirose O, Kimkong I, Tran VA, Dang XT, Nguyen LAT, Le TKT, Kubo M, Yamada Y, Satou K: Therapeutic cancer vaccines in combination with conventional therapy. J. Immunol. Methods 2012.
118. Finn OJ: Cancer immunology. N. Engl. J. Med. 2008, 358:2704–15.
119. Reinhold B, Keskin DB, Reinherz EL: Molecular Detection of Targeted Major Histocompatibility Complex I-Bound Peptides Using a Probabilistic Measure and Nanospray MS(3) on a Hybrid Quadrupole-Linear Ion Trap. Anal. Chem. 2010, 82:9090–9099.
120. Andersen RS, Kvistborg P, Frøsig TM, Pedersen NW, Lyngaa R, Bakker AH, Shu CJ, Straten PT, Schumacher TN, Hadrup SR: Parallel detection of antigen-specific T cell responses by combinatorial encoding of MHC multimers. Nat. Protoc. 2012, 7:891–902.
121. Erlich H: HLA DNA typing: past, present, and future. Tissue Antigens 2012, 80:1–11.
122. Sette A, Sidney J: HLA supertypes and supermotifs: a functional perspective on HLA polymorphism. Curr. Opin. Immunol. 1998, 10:478–82.

123. Sette A, Sidney J: Nine major HLA class I supertypes account for the vast preponderance of HLA-A and -B polymorphism. Immunogenetics 1999, 50:201–12.
124. Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SGE: The IMGT/HLA database. Nucleic Acids Res. 2013, 41:D1222–7.
125. Kosmrlj A, Read EL, Qi Y, Allen TM, Altfeld M, Deeks SG, Pereyra F, Carrington M, Walker BD, Chakraborty AK: Effects of thymic selection of the T-cell repertoire on HLA class I-associated control of HIV infection. Nature 2010, 465:350–4.
126. Hildesheim A, Apple RJ, Chen C-J, Wang SS, Cheng Y-J, Klitz W, Mack SJ, Chen I-H, Hsu M-M, Yang C-S, Brinton LA, Levine PH, Erlich HA: Association of HLA class I and II alleles and extended haplotypes with nasopharyngeal carcinoma in Taiwan. J. Natl. Cancer Inst. 2002, 94:1780–9.
127. Forbes SA, Bhamra G, Bamford S, Dawson E, Kok C, Clements J, Menzies A, Teague JW, Futreal PA, Stratton MR: The Catalogue of Somatic Mutations in Cancer (COSMIC). Curr. Protoc. Hum. Genet. 2008, Chapter 10:Unit 10.11.
128. Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, Rajan D, Van Vooren S, Moreau Y, Pettett RM, Carter NP: DECIPHER: Database of Chromosomal Imbalance and Phenotype in Humans Using Ensembl Resources. Am. J. Hum. Genet. 2009, 84:524–33.
129. Magrane M: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford). 2011, 2011:bar009.
130. Kawashima I, Hudson SJ, Tsai V, Southwood S, Takesako K, Appella E, Sette A, Celis E: The multi-epitope approach for immunotherapy for cancer: identification of several CTL epitopes from various tumor-associated antigens expressed on solid epithelial tumors. Hum. Immunol. 1998, 59:1–14.
131. Andersen MH, Becker JC, Straten P thor: Regulators of apoptosis: suitable targets for immune therapy of cancer. Nat. Rev. Drug Discov. 2005, 4:399–409.

Page 107: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

85

132. Lee MJ, Ye AS, Gardino AK, Heijink AM, Sorger PK, MacBeath G, Yaffe MB: Sequential application of anticancer drugs enhances cell death by rewiring apoptotic signaling networks. Cell 2012, 149:780–94. 133. Liu J, Ewald BA, Lynch DM, Nanda A, Sumida SM, Barouch DH: Modulation of DNA vaccine-elicited CD8+ T-lymphocyte epitope immunodominance hierarchies. J Virol. 2006, 80:11991–11997. 134. Ranieri E, Kierstead LS, Zarour H, Kirkwood JM, Lotze MT, Whiteside T, Storkus WJ: Dendritic cell/peptide cancer vaccines: clinical responsiveness and epitope spreading. Immunol. Invest. 2000, 29:121–5. 135. Henderson RA, Finn OJ: Human tumor antigens are ready to fly. Adv. Immunol. 1996, 62:217–56. 136. Söllner J, Heinzel A, Summer G, Fechete R, Stipkovits L, Szathmary S, Mayer B: Concept and application of a computational vaccinology workflow. Immunome Res. 2010, 6 Suppl 2:S7. 137. Olsen LR, Zhang GL, Reinherz EL, Brusic V: FLAVIdB: A data mining system for knowledge discovery in flaviviruses with direct applications in immunology and vaccinology. Immunome Res. 2011, 7:1–9. 138. Wang Z, Gerstein M, Snyder M: RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 2009, 10:57–63. 139. Ramsay G: DNA chips: state-of-the art. Nat. Biotechnol. 1998, 16:40–4. 140. Luo Q, Gu Y, Wu S-L, Rejtar T, Karger BL: Two-dimensional strong cation exchange/porous layer open tubular/mass spectrometry for ultratrace proteomic analysis using a 10 microm id poly(styrene- divinylbenzene) porous layer open tubular column with an on-line triphasic trapping column. Electrophoresis 2008, 29:1604–11. 141. Slamon DJ, Clark GM, Wong SG, Levin WJ, Ullrich A, McGuire WL: Human breast cancer: correlation of relapse and survival with amplification of the HER-2/neu oncogene. Science (80-. ). 1987, 235:177–82.

Page 108: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

86

142. Allred DC, Clark GM, Molina R, Tandon AK, Schnitt SJ, Gilchrist KW, Osborne CK, Tormey DC, McGuire WL: Overexpression of HER-2/neu and its relationship with other prognostic factors change during the progression of in situ to invasive breast cancer. Hum. Pathol. 1992, 23:974–9. 143. Fisk B, Blevins TL, Wharton JT, Ioannides CG: Identification of an immunodominant peptide of HER-2/neu protooncogene recognized by ovarian tumor-specific cytotoxic T lymphocyte lines. J. Exp. Med. 1995, 181:2109–17. 144. Carter P, Presta L, Gorman CM, Ridgway JB, Henner D, Wong WL, Rowland AM, Kotts C, Carver ME, Shepard HM: Humanization of an anti-p185HER2 antibody for human cancer therapy. Proc. Natl. Acad. Sci. U. S. A. 1992, 89:4285–9. 145. Slamon DJ, Leyland-Jones B, Shak S, Fuchs H, Paton V, Bajamonde A, Fleming T, Eiermann W, Wolter J, Pegram M, Baselga J, Norton L: Use of chemotherapy plus a monoclonal antibody against HER2 for metastatic breast cancer that overexpresses HER2. N. Engl. J. Med. 2001, 344:783–92. 146. Wang C, Krishnakumar S, Wilhelmy J, Babrzadeh F, Stepanyan L, Su LF, Levinson D, Fernandez-Viña MA, Davis RW, Davis MM, Mindrinos M: High-throughput, high-fidelity HLA genotyping with deep sequencing. Proc. Natl. Acad. Sci. U. S. A. 2012, 109:8676–81. 147. Boegel S, Löwer M, Schäfer M, Bukur T, de Graaf J, Boisguérin V, Türeci O, Diken M, Castle JC, Sahin U: HLA typing from RNA-Seq sequence reads. Genome Med. 2013, 4:102. 148. Zhang GL, Keskin DB, Reinherz EL, Brusic V: A cDNA microarray for rapid and economical identification of HLA profiles of individuals. In 2011 IEEE Int. Conf. Bioinforma. Biomed. Work. IEEE; 2011:677–679. 149. Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O, Nielsen M: NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11. Nucleic Acids Res. 2008, 36:W509–12. 150. 
Nielsen M, Justesen S, Lund O, Lundegaard C, Buus S: NetMHCIIpan-2.0 - Improved pan-specific HLA-DR predictions

Page 109: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

87

using a novel concurrent alignment and weight optimization training procedure. Immunome Res. 2010, 6:9. 151. Olsen LR, Zhang GL, Keskin DB, Reinherz EL, Brusic V: Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy. Front. Immunol. 2011, 2:1–15. 152. Vogel CL, Cobleigh MA, Tripathy D, Gutheil JC, Harris LN, Fehrenbacher L, Slamon DJ, Murphy M, Novotny WF, Burchmore M, Shak S, Stewart SJ, Press M: Efficacy and safety of trastuzumab as a single agent in first-line treatment of HER2-overexpressing metastatic breast cancer. J. Clin. Oncol. 2002, 20:719–26. 153. Mitra D, Brumlik MJ, Okamgba SU, Zhu Y, Duplessis TT, Parvani JG, Lesko SM, Brogi E, Jones FE: An oncogenic isoform of HER2 associated with locally disseminated breast cancer and trastuzumab resistance. Mol. Cancer Ther. 2009, 8:2152–62. 154. Scaltriti M, Rojo F, Ocaña A, Anido J, Guzman M, Cortes J, Di Cosimo S, Matias-Guiu X, Ramon y Cajal S, Arribas J, Baselga J: Expression of p95HER2, a truncated form of the HER2 receptor, and response to anti-HER2 therapies in breast cancer. J. Natl. Cancer Inst. 2007, 99:628–38. 155. Rongcun Y, Salazar-Onfray F, Charo J, Malmberg KJ, Evrin K, Maes H, Kono K, Hising C, Petersson M, Larsson O, Lan L, Appella E, Sette A, Celis E, Kiessling R: Identification of new HER2/neu-derived peptide epitopes that can elicit specific CTL against autologous and allogeneic carcinomas and melanomas. J. Immunol. 1999, 163:1037–44. 156. Tamoxifen for early breast cancer: an overview of the randomised trials. Early Breast Cancer Trialists’ Collaborative Group. Lancet 1998, 351:1451–67. 157. Reid A, Vidal L, Shaw H, de Bono J: Dual inhibition of ErbB1 (EGFR/HER1) and ErbB2 (HER2/neu). Eur. J. Cancer 2007, 43:481–9. 158. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ: STRING v9.1: protein-protein interaction networks, with increased coverage and integration. 
Nucleic Acids Res. 2013, 41:D808–15.

Page 110: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

88

159. Kawaguchi Y, Kono K, Mimura K, Mitsui F, Sugai H, Akaike H, Fujii H: Targeting EGFR and HER-2 with cetuximab- and trastuzumab-mediated immunotherapy in oesophageal squamous cell carcinoma. Br. J. Cancer 2007, 97:494–501. 160. Gutierrez C, Schiff R: HER2: biology, detection, and clinical implications. Arch. Pathol. Lab. Med. 2011, 135:55–62. 161. Negro A, Brar BK, Lee K-F: Essential roles of Her2/erbB2 in cardiac development and function. Recent Prog. Horm. Res. 2004, 59:1–12. 162. Fuchs IB, Landt S, Bueler H, Kuehl U, Coupland S, Kleine-Tebbe A, Lichtenegger W, Schaller G: Analysis of HER2 and HER4 in human myocardium to clarify the cardiotoxicity of trastuzumab (Herceptin). Breast Cancer Res. Treat. 2003, 82:23–8. 163. Ambrosini G, Adida C, Altieri DC: A novel anti-apoptosis gene, survivin, expressed in cancer and lymphoma. Nat. Med. 1997, 3:917–21. 164. Altieri DC: Targeting survivin in cancer. Cancer Lett. 2013, 332:225–8. 165. Mahotka C, Wenzel M, Springer E, Gabbert HE, Gerharz CD: Survivin-deltaEx3 and survivin-2B: two novel splice variants of the apoptosis inhibitor survivin with different antiapoptotic properties. Cancer Res. 1999, 59:6097–102. 166. Andersen MH, Pedersen LO, Becker JC, Straten PT: Identification of a cytotoxic T lymphocyte response to the apoptosis inhibitor protein survivin in cancer patients. Cancer Res. 2001, 61:869–72. 167. Schmitz M, Diestelkoetter P, Weigle B, Schmachtenberg F, Stevanovic S, Ockert D, Rammensee HG, Rieber EP: Generation of survivin-specific CD8+ T effector cells by dendritic cells pulsed with protein or selected peptides. Cancer Res. 2000, 60:4845–9. 168. Span PN, Tjan-Heijnen VCG, Heuvel JJTM, de Kok JB, Foekens JA, Sweep FCGJ: Do the survivin (BIRC5) splice variants modulate or add to the prognostic value of total survivin in breast cancer? Clin. Chem. 2006, 52:1693–700.

Page 111: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

89

169. Végran F, Boidot R, Bonnetain F, Cadouot M, Chevrier S, Lizard-Nacol S: Apoptosis gene signature of Survivin and its splice variant expression in breast carcinoma. Endocr. Relat. Cancer 2011, 18:783–92. 170. Tanaka K, Iwamoto S, Gon G, Nohara T, Iwamoto M, Tanigawa N: Expression of survivin and its relationship to loss of apoptosis in breast carcinomas. Clin. Cancer Res. 2000, 6:127–34. 171. Garcia I, Martinou I, Tsujimoto Y, Martinou JC: Prevention of programmed cell death of sympathetic neurons by the bcl-2 proto-oncogene. Science 1992, 258:302–4. 172. Andersen MH, Svane IM, Kvistborg P, Nielsen OJ, Balslev E, Reker S, Becker JC, Straten PT: Immunogenicity of Bcl-2 in patients with cancer. Blood 2005, 105:728–34. 173. Wang M, Johansen B, Nissen MH, Thorn M, Kløverpris H, Fomsgaard A, Buus S, Claësson MH: Identification of an HLA-A*0201 restricted Bcl2-derived epitope expressed on tumors. Cancer Lett. 2007, 251:86–95. 174. Li F, Brattain MG: Role of the Survivin gene in pathophysiology. Am. J. Pathol. 2006, 169:1–11. 175. Fukuda S, Pelus LM: Survivin, a cancer target with an emerging role in normal adult tissues. Mol. Cancer Ther. 2006, 5:1087–98. 176. Lewis KD, Samlowski W, Ward J, Catlett J, Cranmer L, Kirkwood J, Lawson D, Whitman E, Gonzalez R: A multi-center phase II evaluation of the small molecule survivin suppressor YM155 in patients with unresectable stage III or IV melanoma. Invest. New Drugs 2011, 29:161–6. 177. Uyttenhove C, Pilotte L, Théate I, Stroobant V, Colau D, Parmentier N, Boon T, Van den Eynde BJ: Evidence for a tumoral immune resistance mechanism based on tryptophan degradation by indoleamine 2,3-dioxygenase. Nat. Med. 2003, 9:1269–74. 178. Sørensen RB, Berge-Hansen L, Junker N, Hansen CA, Hadrup SR, Schumacher TNM, Svane IM, Becker JC, thor Straten P, Andersen MH: The immune system strikes back: cellular immune responses against indoleamine 2,3-dioxygenase. PLoS One 2009, 4:e6910.

Page 112: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

90

179. Munir S, Larsen SK, Iversen TZ, Donia M, Klausen TW, Svane IM, Straten PT, Andersen MH: Natural CD4+ T-cell responses against indoleamine 2,3-dioxygenase. PLoS One 2012, 7:e34568. 180. Pearson JT, Siu S, Meininger DP, Wienkers LC, Rock DA: In vitro modulation of cytochrome P450 reductase supported indoleamine 2,3-dioxygenase activity by allosteric effectors cytochrome b(5) and methylene blue. Biochemistry 2010, 49:2647–56. 181. Kvistborg P, Hadrup SR, Svane IM, Andersen MH, Straten PT: Characterization of a single peptide derived from cytochrome P450 1B1 that elicits spontaneous human leukocyte antigen (HLA)-A1 as well as HLA-B35 restricted CD8 T-cell responses in cancer patients. Hum. Immunol. 2008, 69:266–72. 182. Fridman A, Finnefrock AC, Peruzzi D, Pak I, La Monica N, Bagchi A, Casimiro DR, Ciliberto G, Aurisicchio L: An efficient T-cell epitope discovery strategy using in silico prediction and the iTopia assay platform. Oncoimmunology 2012, 1:1258–1270. 183. Sørensen RB, Hadrup SR, Svane IM, Hjortsø MC, Thor Straten P, Andersen MH: Indoleamine 2,3-dioxygenase specific, cytotoxic T cells as immune regulators. Blood 2011, 117:2200–10. 184. Baek Sørensen R, Faurschou M, Troelsen L, Schrama D, Jacobsen S, Becker JC, Thor Straten P, Andersen MH: Melanoma inhibitor of apoptosis protein (ML-IAP) specific cytotoxic T lymphocytes cross-react with an epitope from the auto-antigen SS56. J. Invest. Dermatol. 2009, 129:1992–9. 185. Sørensen RB, Hadrup SR, Køllgaard T, Svane IM, thor Straten P, Andersen MH: Efficient tumor cell lysis mediated by a Bcl-X(L) specific T cell clone isolated from a breast cancer patient. Cancer Immunol. Immunother. 2007, 56:527–33. 186. Tsai SL, Chen MH, Yeh CT, Chu CM, Lin AN, Chiou FH, Chang TH, Liaw YF: Purification and characterization of a naturally processed hepatitis B virus peptide recognized by CD8+ cytotoxic T lymphocytes. J. Clin. Invest. 1996, 97:577–84. 187. 
Inaba K, Metlay JP, Crowley MT, Steinman RM: Dendritic cells pulsed with protein antigens in vitro can prime antigen-specific, MHC-restricted T cells in situ. J. Exp. Med. 1990, 172:631–40.

Page 113: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

91

188. Collins SA, Guinn B-A, Harrison PT, Scallan MF, O’Sullivan GC, Tangney M: Viral vectors in cancer immunotherapy: which vector for which strategy? Curr. Gene Ther. 2008, 8:66–78. 189. Spencer BA, McBride RB, Hershman DL, Buono D, Herr HW, Benson MC, Gupta-Mohile S, Neugut AI: Adjuvant intravesical bacillus calmette-guérin therapy and survival among elderly patients with non-muscle-invasive bladder cancer. J. Oncol. Pract. 2013, 9:92–8. 190. Nunberg JH, Doyle M V, York SM, York CJ: Interleukin 2 acts as an adjuvant to increase the potency of inactivated rabies virus vaccine. Proc. Natl. Acad. Sci. U. S. A. 1989, 86:4240–3. 191. Mocellin S, Pasquali S, Rossi CR, Nitti D: Interferon alpha adjuvant therapy in patients with high-risk melanoma: a systematic review and meta-analysis. J. Natl. Cancer Inst. 2010, 102:493–501. 192. Sjogren MH: Thymalfasin: an immune system enhancer for the treatment of liver disease. J. Gastroenterol. Hepatol. 2004, 19:S69–72. 193. Clive KS, Tyler JA, Clifton GT, Holmes JP, Mittendorf EA, Ponniah S, Peoples GE: Use of GM-CSF as an adjuvant with cancer vaccines: beneficial or detrimental? Expert Rev. Vaccines 2010, 9:519–25. 194. Rech AJ, Vonderheide RH: Clinical use of anti-CD25 antibody daclizumab to enhance immune responses to tumor antigen vaccination by targeting regulatory T cells. Ann. N. Y. Acad. Sci. 2009, 1174:99–106. 195. Fernández-Suárez XM, Galperin MY: The 2013 Nucleic Acids Research Database Issue and the online molecular biology database collection. Nucleic Acids Res. 2013, 41:D1–7. 196. Bui H-H, Sidney J, Li W, Fusseder N, Sette A: Development of an epitope conservancy analysis tool to facilitate the design of epitope-based diagnostics and vaccines. BMC Bioinformatics 2007, 8:361. 197. Bui H-H, Sidney J, Dinh K, Southwood S, Newman MJ, Sette A: Predicting population coverage of T-cell epitope-based diagnostics and vaccines. BMC Bioinformatics 2006, 7:153.

Page 114: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

92

198. Katoh K, Standley DM: MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 2013. 199. Olsen LR, Kudahl UJ, Simon C, Sun J, Schönbach C, Reinherz EL, Zhang GL, Brusic V: BlockLogo: Visualization of peptide and sequence motif conservation. J. Immunol. Methods 2013:8–15. 200. Amberger J, Bocchini CA, Scott AF, Hamosh A: McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 2009, 37:D793–6. 201. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, Fitzgerald S, Gil L, García-Girón C, Gordon L, Hourlier T, Hunt S, Juettemann T, Kähäri AK, Keenan S, Komorowska M, Kulesha E, Longden I, Maurel T, McLaren WM, Muffato M, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS, Ritchie GRS, Ruffier M, Schuster M, Sheppard D, Sobral D, Taylor K, Thormann A, Trevanion S, White S, Wilder SP, Aken BL, Birney E, Cunningham F, Dunham I, Harrow J, Herrero J, Hubbard TJP, Johnson N, Kinsella R, Parker A, Spudich G, Yates A, Zadissa A, Searle SMJ: Ensembl 2013. Nucleic Acids Res. 2013, 41:D48–55. 202. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, Sirota-Madi A, Olender T, Golan Y, Stelzer G, Harel A, Lancet D: GeneCards Version 3: the human gene integrator. Database (Oxford). 2010, 2010:baq020. 203. Benson D a, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2012, 40:D48–53. 204. The UniProt Consortium: Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41:D43–7. 205. Forbes S a, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, Futreal PA: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011, 39:D945–50. 206. 
Rhodes DR, Kalyana-Sundaram S, Mahavisno V, Varambally R, Yu J, Briggs BB, Barrette TR, Anstet MJ, Kincead-Beal C, Kulkarni P, Varambally S, Ghosh D, Chinnaiyan AM: Oncomine 3.0: genes,

Page 115: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

93

pathways, and networks in a collection of 18,000 cancer gene expression profiles. Neoplasia 2007, 9:166–80. 207. Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, Zwahlen M, Kampf C, Wester K, Hober S, Wernerus H, Björling L, Ponten F: Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28:1248–50. 208. Chatr-Aryamontri A, Breitkreutz B-J, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O’Donnell L, Reguly T, Breitkreutz A, Sellam A, Chen D, Chang C, Rust J, Livstone M, Oughtred R, Dolinski K, Tyers M: The BioGRID interaction database: 2013 update. Nucleic Acids Res. 2013, 41:D816–23. 209. Kanehisa M, Goto S: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000, 28:27–30. 210. Croft D, O’Kelly G, Wu G, Haw R, Gillespie M, Matthews L, Caudy M, Garapati P, Gopinath G, Jassal B, Jupe S, Kalatskaya I, Mahajan S, May B, Ndegwa N, Schmidt E, Shamovsky V, Yung C, Birney E, Hermjakob H, D’Eustachio P, Stein L: Reactome: a database of reactions, pathways and biological processes. Nucleic Acids Res. 2011, 39:D691–7. 211. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 2005, 102:15545–50. 212. Caoili SEC: Benchmarking B-cell epitope prediction for the design of peptide-based vaccines: problems and prospects. J. Biomed. Biotechnol. 2010, 2010:910524. 213. Gajewski TF: Molecular profiling of melanoma and the evolution of patient-specific therapy. Semin. Oncol. 2011, 38:236–42. 214. Brusic V, Marina O, Wu CJ, Reinherz EL: Proteome informatics for cancer research: from molecules to clinic. Proteomics 2007, 7:976–91. 215. Blades RA, Keating PJ, McWilliam LJ, George NJ, Stern PL: Loss of HLA class I expression in prostate cancer: implications for immunotherapy. 
Urology 1995, 46:681–6; discussion 686–7.

Page 116: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

94

216. Curiel TJ, Coukos G, Zou L, Alvarez X, Cheng P, Mottram P, Evdemon-Hogan M, Conejo-Garcia JR, Zhang L, Burow M, Zhu Y, Wei S, Kryczek I, Daniel B, Gordon A, Myers L, Lackner A, Disis ML, Knutson KL, Chen L, Zou W: Specific recruitment of regulatory T cells in ovarian carcinoma fosters immune privilege and predicts reduced survival. Nat. Med. 2004, 10:942–9. 217. Gabrilovich DI, Nagaraj S: Myeloid-derived suppressor cells as regulators of the immune system. Nat. Rev. Immunol. 2009, 9:162–74. 218. Prendergast GC: Immune escape as a fundamental trait of cancer: focus on IDO. Oncogene 2008, 27:3889–900. 219. Bronte V, Serafini P, De Santo C, Marigo I, Tosello V, Mazzoni A, Segal DM, Staib C, Lowel M, Sutter G, Colombo MP, Zanovello P: IL-4-induced arginase 1 suppresses alloreactive T cells in tumor-bearing mice. J. Immunol. 2003, 170:270–8. 220. Ben-Baruch A: Inflammation-associated immune suppression in cancer: The roles played by cytokines, chemokines and additional mediators. Semin. Cancer Biol. 2006, 16:38–52. 221. Latchman Y, Wood CR, Chernova T, Chaudhary D, Borde M, Chernova I, Iwai Y, Long AJ, Brown JA, Nunes R, Greenfield EA, Bourque K, Boussiotis VA, Carter LL, Carreno BM, Malenkovich N, Nishimura H, Okazaki T, Honjo T, Sharpe AH, Freeman GJ: PD-L2 is a second ligand for PD-1 and inhibits T cell activation. Nat. Immunol. 2001, 2:261–8. 222. 
Galon J, Franck P, Marincola FM, Angell HK, Thurin M, Lugli A, Zlobec I, Berger A, Bifulco C, Botti G, Tatangelo F, Britten CM, Kreiter S, Chouchane L, Delrio P, Hartmann A, Asslaber M, Maio M, Masucci G V, Mihm M, Vidal-Vanaclocha F, Allison JP, Gnjatic S, Hakansson L, Huber C, Singh-Jasuja H, Ottensmeier C, Zwierzina H, Laghi L, Grizzi F, Ohashi PS, Shaw P a, Clarke B a, Wouters BG, Kawakami Y, Hazama S, Okuno K, Wang E, O’Donnell-Tormey J, Lagorce C, Pawelec G, Nishimura MI, Hawkins R, Lapointe R, Lundqvist A, Khleif SN, Ogino S, Gibbs P, Waring P, Sato N, Torigoe T, Itoh K, Patel PS, Shukla SN, Palmqvist R, Nagtegaal ID, Wang Y, D’Arrigo C, Kopetz S, Sinicrope F a, Trinchieri G, Gajewski TF, Ascierto P a, Fox B a: Cancer classification using the Immunoscore: a worldwide task force. J. Transl. Med. 2012, 10:205.

Page 117: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

95

223. Galon J, Pagès F, Marincola FM, Thurin M, Trinchieri G, Fox B a, Gajewski TF, Ascierto P a: The immune score as a new possible approach for the classification of cancer. J. Transl. Med. 2012, 10:1. 224. Pagès F, Kirilovsky A, Mlecnik B, Asslaber M, Tosolini M, Bindea G, Lagorce C, Wind P, Marliot F, Bruneval P, Zatloukal K, Trajanoski Z, Berger A, Fridman W-H, Galon J: Therapeutic cancer vaccines in combination with conventional therapy. J. Clin. Oncol. 2009, 27:5944–51.

Page 118: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

96

Page 119: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

97

Paper II

Identification of T cell vaccine targets from multiple sequence alignments

Journal of Immunology (in review)

Lars Rønn Olsen1,2,3, Christian Simon1,4, Ulrich Johan Kudahl1,4, Frederik Otzen Bagger2,3,5, Ole Winther2,6, Ellis Leonard Reinherz1,7,8, Guang Lan Zhang1,8,9, Vladimir Brusic1,8,9,*

1 Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
2 Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
3 Biotech Research and Innovation Center (BRIC), University of Copenhagen, Copenhagen, Denmark
4 Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark
5 The Finsen Laboratory, Rigshospitalet, Faculty of Health Sciences, University of Copenhagen, Copenhagen, Denmark
6 Cognitive Systems, DTU Compute, Technical University of Denmark, Lyngby, Denmark
7 Laboratory of Immunobiology, Dana-Farber Cancer Institute, Boston, MA, USA
8 Department of Medicine, Harvard Medical School, Boston, MA, USA
9 Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA
* Corresponding author

Abstract

Computational methods for T cell-based vaccine target discovery focus on selection of highly conserved peptides identified across pathogen variants, followed by prediction of their binding to HLA molecules. This approach compresses pathogen diversity into smaller sets of antigen targets. However, experimental studies have shown that T cells often target diverse regions in highly variable viral pathogens, and this diversity may need to be addressed through redefinition of suitable peptide targets. We have developed a method for antigen assessment and selection for polyvalent vaccines, which identifies immune epitope candidates from multiple sequence alignments. We applied this method to assess 37 recently discovered dengue virus epitopes, and to predict broadly covering CD8+ T-cell epitope candidates in a multi-species flavivirus data set. We also used this strategy to predict CD4+ T-cell epitopes from a norovirus data set. In both studies we significantly increased the number of potential vaccine targets compared to the number of targets discovered using the traditional approach, in which low-frequency peptides are excluded. Our method includes a novel visualization scheme for summarizing the T cell-based antigenic potential of any given proteome or protein using clear and easy-to-interpret graphics. A web-accessible software implementation is freely available at: http://met-hilab.bu.edu/blockcons.

1 Introduction

Advances in the study of the human immune system are aided by continuous improvements in large-scale measurements of molecular and cellular processes, as well as by computational advances that allow processing, analysis, and modeling of the rapidly growing data sets. Clinical applications of bioinformatics enable computational simulations and predictions that support experimental research and speed up discovery of methods for improved diagnosis, selection and optimization of treatment, and vaccine discovery.

Along with sanitation, vaccines are the most effective and economic public health tools for control of infectious disease. Nearly 50 successful vaccines have been developed and widely used [1]. The eradication of smallpox and the control of polio, measles, tetanus, and diphtheria [2], for example, are testimony to the success of vaccines. However, vaccine development faces formidable challenges, including the limited effectiveness of a number of vaccines, the need for frequent vaccine reformulation, and a complete lack of vaccines for some diseases. Rapid advances in biotechnology, such as successful applications of high-throughput sequencing, nucleotide and protein microarrays, mass spectrometry, high-throughput cellular assays, computational simulation, imaging, and visualization, among others, enable a rapid increase in our understanding of the human immune system and of the mechanisms it uses to recognize pathogens [3].

Improved understanding of the complexity of both the immune system and pathogen-host interactions makes vaccine development a task of high combinatorial complexity. In particular, the variability of both pathogens and the human immune system makes vaccine discovery and design challenging [4], as a central goal of vaccination is to generate long-lasting and broadly protective immunity against target pathogens.

Practical solutions to the variability problem include polyvalent vaccines, such as those being developed for dengue [5], or seasonal vaccine reformulation, as for influenza vaccines [6]. The majority of traditional vaccines provide protection through neutralizing antibodies targeting critical B-cell epitopes present on the pathogen. Antibodies provide protection against infection for the majority of protective immune responses that are induced by vaccines [7]. However, pathogens like influenza, dengue, malaria, or HIV evade humoral immunity through modifications of their surface antigens. Rapid mutation of surface proteins in such pathogens results in immune escape, decreased effectiveness of vaccine formulations over time, or the complete lack of effective vaccines.

CD4+ and CD8+ T cells alone rarely offer protection and prevention of disease. However, they participate in reduction, control, and clearance of intracellular pathogens. They have been linked with protective immunity against malaria [8], hepatitis C [9], listeria [10], and tuberculosis [11]. T-cell epitopes are recognized as continuous targets and are distributed across entire pathogen proteomes, thus providing a larger pool of antigenic targets across pathogen populations. They also have several properties that make them excellent vaccine targets: they allow for easy introduction of chemical modifications, enable design of enhanced stability, enable precise delivery of the vaccine constructs, and can be tailored to individual patients [12].

The reverse vaccinology framework improves vaccine discovery and design by combining bioinformatics and experimentation. In this framework, the entire pathogenic genome is screened for antigens and these antigens are then screened for B-cell and T-cell epitopes, followed by experimental validation and their use as vaccine constructs [13]. A major bottleneck in reverse immunology is that variable viruses mutate and change their B-cell and T-cell epitopes. Here, we address one of the primary bottlenecks in reverse vaccinology, the antigen selection step, and provide a systematic approach to assessing the immunological potential of viral antigens across hundreds of thousands of viral variant sequences. We also show how these targets can then be systematically screened for candidate T-cell epitopes for a selection of common HLA molecules. This analysis utilizes peptide block entropy [14] and provides an order of magnitude more conserved candidate T-cell epitopes than previous analyses of conserved T-cell epitopes [15].
1.1 Analysis of viral diversity and selection of vaccine targets

Antigen conservation and variability analysis for immunological applications is traditionally performed by calculating sequence similarity from local sequence alignments [16], or by calculating the frequency of nucleotides or amino acids at each position in a multiple sequence alignment (MSA) of homologous pathogen genes or proteins [17]. Regions in which several consecutive residues show high conservation are then further analyzed for immunogenic potential by computational predictions, experimental testing, or a combination thereof. A major drawback of vaccine target selection using existing methods is the systematic exclusion of low-frequency variants [18–22], since the frequency of occurrence of any given peptide within pathogen variants
does not impact its immunogenic potential, as both rare and common variants can be immunogenic. Typically, potential T cell-mediated immunogenicity is assessed by the binding affinity to the HLA molecules. A systematic analysis of immune epitope diversity involves the selection of all epitope targets based on their known immunogenic properties (HLA binding or existence of neutralizing antibodies). This is followed by assembling a suitable set of candidates to cover both population and pathogen diversity in a polyvalent vaccine construct [23]. Since HLA recognizes epitopes as peptides rather than as individual amino acids, it is more appropriate to perform conservation analyses of continuous peptides rather than their individual amino acids [24]. Such analysis is performed on the columns of suitably sized sliding windows (from here on termed "blocks") from the rows of sequences in an MSA (Figure 1).
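
The block decomposition can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation behind the web tool: the function names (`extract_blocks`, `block_entropy`, `consensus_frequency`) are hypothetical, windows that contain a gap character are simply skipped, and the toy alignment reuses the example rows shown in Figure 1. For each block, the sketch also computes two simple diversity summaries: the Shannon entropy of the peptide distribution and the frequency of the most common peptide.

```python
import math
from collections import Counter


def extract_blocks(msa, l=9):
    """Return one Counter of peptides per sliding-window start position.

    msa: list of aligned sequences (strings of equal length).
    l:   window (peptide) length.
    """
    if not msa:
        return []
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    blocks = []
    for start in range(length - l + 1):
        peptides = (seq[start:start + l] for seq in msa)
        # Assumption for this sketch: windows containing gaps are skipped.
        blocks.append(Counter(p for p in peptides if "-" not in p))
    return blocks


def block_entropy(block):
    """Shannon entropy (in bits) of the peptide distribution in a block."""
    total = sum(block.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total) for n in block.values())


def consensus_frequency(block):
    """Fraction of sequences that carry the most common peptide."""
    total = sum(block.values())
    if total == 0:
        return 0.0
    return max(block.values()) / total


# Toy alignment based on the rows shown in Figure 1 (truncated to 15 columns).
example_msa = (
    ["AGVLWDVPSPPPMGK"] * 3
    + ["AGVLWDVASPPPMGK"] * 3
    + ["AGVLWDVSSPPPMGK"]
)
blocks = extract_blocks(example_msa, l=9)
for i, block in enumerate(blocks, start=1):
    print(i, dict(block), block_entropy(block), consensus_frequency(block))
```

Storing each block as a `Counter` keeps the peptide frequencies available, so rare variants stay visible rather than being discarded at the extraction step.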

Figure 1: Subdivision of a multiple sequence alignment into blocks of peptides, l amino acids in length. An MSA of N protein sequences of length L gives rise to L-l+1 blocks of peptides of length l. In this example, l = 9. Block 1 is highlighted in blue. Moving the sliding window down the MSA in increments of one position will give the remaining blocks.
[Figure 1 content: seven aligned sequences, AGVLWDVPSPPPMGK... (three rows), AGVLWDVASPPPMGK... (three rows), and AGVLWDVSSPPPMGK... (one row).]

An MSA of homologous protein sequences can be generated using algorithms such as MAFFT [25], MUSCLE [26], or ClustalW [27]. From the MSA, blocks of peptides of a given size (usually 8-11 amino acids long for HLA class I restricted and 13-25 amino acids long for HLA class II restricted T-cell epitopes) are extracted from each position in the alignment. The number of peptides in each block indicates the diversity in the block, for which Shannon entropy and consensus frequency can be calculated as informative metrics [28]. HLA binding affinities are predicted for all peptides in all blocks. Multiple algorithms for prediction of HLA binding are available [29]. Examples of highly accurate classification algorithms include BIMAS [30], SYFPEITHI [31], novel ensemble methods such as PM and AvgTanh [32], averaging methods [33], and NetMHC [34] and
NetMHCII [35]. The latter two methods were shown to outperform other algorithms for a number of HLA alleles [36, 37]. Because blocks are extracted from an MSA of homologous proteins, it is likely that the peptides within a given block display high sequence homology and the majority show similar HLA binding properties even when sequence variations exist. Similarly, the regions surrounding a block will be of high mutual homology, thus increasing the likelihood that peptides from the same block will be processed and presented on the surface of target cells in a similar fashion [38]. Blocks in which all peptides are predicted to bind one or more HLA supertypes with high affinity represent conserved potential vaccine target regions.

1.2 Using epitope predictions from MSA for identification of vaccine targets

The selection of vaccine targets using a traditional approach, where sequences conserved in 90% or more of the viral strains are identified followed by computational prediction of potential T-cell epitopes, produces a limited list of targets. Consequently, only the most frequent peptides are selected and subsequently studied for their ability to bind HLA molecules. However, targeting efficiency in some variable viruses is considered low if protective epitopes are of low frequency. For example, some validated T-cell epitopes are of relatively low frequency in highly variable viral pathogens such as flaviviruses [39], influenza [24], and HIV [40]. Immune escape is common in rapidly mutating RNA viruses [41], such as influenza [42], hepatitis C virus [43], HIV [44], and flaviviruses [45]. However, in some blocks in the MSA of homologous proteins, a higher level of immunological conservation is observed than the sequence variability would suggest. Immunologically conserved blocks have one or more peptides in the block, all of which are predicted to bind to the same HLA alleles or supertypes with similar affinity. For polyvalent vaccine design, such immunologically conserved regions can be useful, as viral diversity can be compressed into a small number of constructs where a small number of peptides cover pathogen diversity. This allows for simultaneous immunization with several epitopes, a necessary tactic against highly mutating viruses, in which mutations introducing drug resistance can occur within a single day [46, 47]. We previously applied the block analysis method for vaccine target discovery in dengue virus (DENV) [14] and reported a 10-fold larger number of potential CD8+ vaccine target candidates as compared to an earlier benchmark study of DENV vaccine target candidates [15]. Recently, 37 experimentally validated T-cell epitopes have been reported for DENV [48]. We here demonstrate the utility of our
method for assessing the potential of these peptides as vaccine targets, by analyzing them in the context of their respective blocks. Although DENV represents a useful case study, the method can, in theory, be applied to any pathogen or cancer where T cell-mediated immunity is studied.

1.3 Using block conservation analysis for vaccine target discovery

We extend the characterization of potentially cross-reactive MHC class I binders present in the five most relevant flavivirus pathogens: DENV, West Nile virus, yellow fever virus, Japanese encephalitis virus, and tick-borne encephalitis virus, here collectively referred to as the “panFive” data set. We have also expanded the analysis to MHC class II epitopes in norovirus (NV) to demonstrate the utility of block conservation in the search for potential CD4+ T-cell vaccine target candidates. The immune response to NV is believed to be predominantly antibody based [49, 50]; however, CD4+ T cells have recently been shown to play a major role in clearance of NV infections [51, 52].

2 Materials and methods

2.1 Multiple sequence alignment

MSAs were performed using MAFFT [25]. When aligning highly variable protein sequences, such as proteins from influenza virus or HIV, MSAs will invariably contain a high proportion of gaps. Gaps, typically denoted by a dash "-", are artifacts of the MSA algorithms and distort the analysis of peptides in blocks derived from MSAs. We applied an algorithm based on short read local alignments to remove gaps from blocks before further analysis where appropriate. Initially, the algorithm removes the gaps and extends the length of the shortened peptides to match the block length. Extensions are made either upstream or downstream, depending on which direction yields the best alignment with the rest of the block.

2.2 Block conservation

Variability metrics are based on information content calculated as Shannon entropy [28] and conservation defined as the frequency of the predominant peptide. In the block conservation analysis we calculate the information content and frequency of all peptides in an MSA of N homologous proteins, as previously described [14]. Briefly, the peptides at starting position x are here collectively referred to as a “block”, Bx. The peptides in each block have user-defined length, l,
and in each given block, a number, Wx, of unique peptides exist. Starting from the first position x = 1 and increasing x in increments of one across the entire MSA of length L results in L-l+1 blocks. The block conservation is assessed using the minimum percentage, yx, of a block at starting position x that must be covered by a subset of peptides, Sw, for a block to be considered conserved. We analyzed the DENV and panFive proteomes for HLA class I and NV proteome for class II binders, respectively. For these analyses, the following parameters were used for panFive: l = 9, yx = 99%, and the following for NV: l = 15, yx = 99%.

2.3 Data

The data used as examples of the utility of the method include protein sequence data from DENV, West Nile virus, Yellow fever virus, Japanese encephalitis virus, and Tick-borne encephalitis virus extracted from FLAVIdB [53] and NV sequences extracted from GenBank [54]. The number of sequences used in this study is shown in Table 1. All proteome sequences were automatically annotated using the alignment of queries with well-annotated GenBank reference sequences of the respective species (GenBank accession numbers are listed in Supplementary materials Table S1).

2.4 T-cell epitope prediction

MHC class I binding affinities of peptides were predicted using NetMHC 3.0 [34]. MHC class II binding affinities of peptides were predicted using NetMHCII 2.0 [55]. Both these methods were shown to offer good accuracy relative to other available methods [36, 37]. The default thresholds for binding level affinity (IC50 < 500 nM for weak binders and IC50 < 50 nM for strong binders) were used for binding prediction. Predictions were performed for the following HLA class I alleles: A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, B*08:01, B*15:01, and the following class II alleles: DRB1*01:01, DRB1*11:01, DRB1*04:01, and DRB1*07:01. For prediction of binding to all known HLA class I alleles, we used NetMHCpan 2.4 [56].
NetMHCpan enables prediction of peptide binding to 886 HLA-A alleles, 1412 HLA-B alleles, and 617 HLA-C alleles.
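The conservation measures of Section 2.2 can be sketched in Python as follows. This is an illustrative reading of the definitions in the text (Shannon entropy of the peptide distribution, frequency of the predominant peptide, and the minimum number of peptides needed to reach the coverage threshold yx), not the published implementation. The function names are hypothetical, and the example block holds the peptide counts of block 1 in Figure 1.

```python
import math
from collections import Counter


def shannon_entropy(block):
    """Shannon entropy (bits) of the peptide distribution in one block."""
    total = sum(block.values())
    # sum of p * log2(1 / p), written with counts to avoid a negative zero
    return sum((n / total) * math.log2(total / n) for n in block.values())


def consensus_frequency(block):
    """Frequency of the predominant peptide in the block."""
    total = sum(block.values())
    return max(block.values()) / total


def min_peptides_for_coverage(block, y=0.99):
    """Smallest number of peptides, taken most frequent first, that
    together cover at least a fraction y of the sequences in the block."""
    total = sum(block.values())
    covered = 0
    for k, (_, n) in enumerate(block.most_common(), start=1):
        covered += n
        if covered >= y * total:
            return k
    return len(block)


# Peptide counts of block 1 in Figure 1 (l = 9, seven sequences).
block_1 = Counter({"AGVLWDVPS": 3, "AGVLWDVAS": 3, "AGVLWDVSS": 1})

print(round(shannon_entropy(block_1), 3))
print(round(consensus_frequency(block_1), 3))
print(min_peptides_for_coverage(block_1, y=0.99))
```

With yx = 99%, all three peptides of this small block are needed, because the rarest variant still accounts for one in seven sequences.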


Table 1: Sequence data. Number of strains included in sequence analysis of the species in this study.

Species Full proteome Partial proteome Total

DENV1 1,209 1,085 2,294

DENV2 835 1,781 2,616

DENV3 565 1,665 2,230

DENV4 103 606 709

WNV 177 1,361 1,538

YFV 26 389 415

JEV 67 1,238 1,305

TBE 34 620 654

NV 1,205 10,711 12,916

2.5 Visualizing block analysis results

Given the large quantity of output from block analysis, visualizations of block conservation and epitope prediction provide a convenient way of summarizing results. The conservation of blocks is visualized using bar plots of the minimum number of peptides required to fulfill the user-defined coverage threshold, yx (Y axis), for each starting position in an MSA of the protein or proteome in question (X axis). The predicted binding affinities of peptide blocks to the user-specified HLA-I or HLA-II alleles are visualized using a heat map displayed below the conservation graph, where each column in the heat map corresponds to the binding affinity at the given position in the MSA, and each row corresponds to an HLA allele. This approach to visualization allows simultaneous display and overview of predictions of binding to multiple HLA alleles. Custom visualizations can be designed using the software implementation of the method found at http://met-hilab.bu.edu/blockcons. Detailed information about peptides and residue frequencies in each block is visualized using our BlockLogo tool (http://research4.dfci.harvard.edu/cvc/blocklogo/ or mirror site http://met-hilab.bu.edu/blocklogo/) together with WebLogo [57].

3 Results

3.1 Block conservation of MHC class I binders in DENV

We previously analyzed DENV for cross-reactive epitope candidates using the block entropy approach [14]. Figure 2 visualizes the results
of the block conservation analysis of the DENV polyprotein sequences. The bar plot displaying the intra-block variability is useful to view for the entire proteome or any subsets thereof, whereas the HLA binding heat map is primarily useful for visualizing blocks in which a high proportion of peptides are predicted to bind the same HLA allele. Figure 2 shows a condensed view of the DENV proteome: the 116 blocks in which more than 99% of peptides in the block were predicted to bind the same HLA. The DENV proteome is highly variable and only 13 single nonamers were found to be conserved across DENV strains. However, all of the regions shown in this figure are found to have conservation in terms of peptide binding predictions. The visualization scheme clearly shows regions of immunological potential and allows for intuitive filtering of data before experimental validation. Here we focus only on HLA alleles for which prediction algorithms are highly accurate. Predictions can be performed for any known HLA molecule and visualized in a similar manner for a broader overview. Visualization of binding predictions for all peptides to all HLA alleles to which binding can be predicted is accessible in supplementary materials (Figure S1).

3.2 Block conservation of MHC class I binders in flavivirus panFive

The block conservation analysis and prediction of epitopes from MSA were performed on the sequences from the panFive data set. Using the conservation parameters specified in the methods section, 29 blocks were predicted to contain epitopes where all peptides in the respective blocks bind the same HLA molecule. None of the 29 blocks consists of fewer than two peptides. Hence, none of these potentially valuable vaccine target regions would be discovered using the traditional approach for conservation and variability analysis. Figure 3 shows the visual summary of immunological block conservation of panFive polyprotein sequences. To examine characteristics of peptides in a given block, a BlockLogo and a WebLogo were generated to visually represent conservation and variability of peptides and individual residues within block 1990 (Figure 4). The BlockLogo shows that five peptides collectively cover the vast majority of the studied strains. The details about predicted HLA binding affinities are available in Table 2. As can be seen, all but two peptides (with a combined frequency of 0.86%) in the block bind HLA A*11:01.


Figure 2: Blocks of conserved HLA binding in DENV MSA. Visualization of the block conservation analysis and MHC class I binding affinity predictions for 116 blocks from the DENV polyprotein in which at least 99% of the peptides were predicted to bind at least one of the eight HLA alleles used in this example (A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, B*08:01, B*15:01). The bars show the minimum number of peptides in a block (Y axis) at a given starting position in the MSA (X axis) required to fulfill the user-defined coverage threshold, yx (in this case 99%). The heat map below the bars shows the percentage of peptides in the block predicted to bind to each of the HLA alleles used in this example. The color of each position in the heat map matrix ranges from blue (0% accumulated conservation by predicted binders in the block for the given allele) to red (99% accumulated conservation in the block by peptides predicted to bind the given allele with an affinity of 500 nM or stronger). The starting positions of the blocks are shown below the heat map.


Figure 3: Blocks of conserved HLA binding in panFive MSA. Visualization of the block conservation analysis and MHC class I binding affinity predictions for DENV, WNV, YFV, JEV, and TBEV for 29 blocks from the panFive polyprotein in which at least 99% of the peptides were predicted to bind at least one of the eight HLA alleles used in this example (A*01:01, A*02:01, A*03:01, A*11:01, A*24:02, B*07:02, B*08:01, B*15:01). The bars show the minimum number of peptides in a block (Y axis) at a given starting position in the MSA (X axis) required to fulfill the user-defined coverage threshold, yx. The heat map below the bars shows the percentage of peptides in the block predicted to bind to each of the HLA alleles used in this example. The color of each position in the heat map matrix ranges from blue (0% accumulated conservation by predicted binders in the block for the given allele) to red (99% accumulated conservation in the block by peptides predicted to bind the given allele with an affinity of 500 nM or stronger). The starting positions of the blocks are shown below the heat map.


Figure 4: Sequence logo visualizations of panFive block 1990. WebLogo (A) and BlockLogo (B) visualization of block 1990 of the panFive polyprotein MSA.


Table 2: Conservation of HLA binding peptide blocks in panFive. Details of conservation and variability and predicted HLA binding affinities (nM) of peptides in block 1990 of the panFive polyprotein MSA.

Peptide Frequency (%) Accumulated frequency (%) A*02:01 A*03:01 A*11:01 A*24:02 B*07:02 B*08:01 B*15:01

KTFDTEYQK 58.49 58.49 20660 137 10 25719 24457 26024 21402

KTFDSEYVK 25.80 84.28 21396 89 8 25793 24397 24158 17647

KSYETEYPK 5.87 90.15 21794 67 8 25795 24662 25909 17355

KTFDTEYPK 3.48 93.63 17585 116 7 26737 23279 25510 17303

KSYDTEYPK 2.22 95.86 22082 97 7 27170 24317 26477 16707

KTFDSEYIK 1.06 96.92 21375 98 8 26534 25226 24787 20850

KTFEKDYSR 0.90 97.81 20515 2453 68 31596 23138 22420 19111

KTFEREYPT 0.73 98.54 543 18060 8824 32433 22137 18456 17094

KTFDTEYTK 0.27 98.81 19064 154 10 21410 23871 25755 19748

KTFDSEYAK 0.27 99.07 19934 78 7 29011 23836 24354 17982

KTFDTEYIK 0.23 99.30 21606 188 10 25313 25207 25483 21319

KTFEKDYTR 0.17 99.47 20552 3430 116 27245 22851 23935 19876

KTFEKEYPT 0.13 99.60 660 20000 11054 33883 22802 21115 18044

RTFDTEYQK 0.10 99.70 22051 142 8 23614 24482 25173 20010

KTFETEYQK 0.10 99.80 20412 88 11 24270 24555 25628 21681

KTFDAEYVK 0.07 99.87 21296 159 11 24822 23982 25028 18503

KTFNTEYQK 0.07 99.93 22173 71 10 24046 25316 24914 21174

KTFDFEYIK 0.03 99.97 18706 113 6 25174 25202 25219 21598

KTFDTEYQR 0.03 100.00 21243 2288 21 29794 24981 25539 21629
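The observation that all but two peptides in block 1990 are predicted to bind A*11:01 can be checked directly from Table 2. The following Python sketch sums the frequencies of peptides that miss the weak-binder threshold (IC50 < 500 nM, Section 2.4). Only the frequency, A*03:01, and A*11:01 columns are copied from the table; the function name is illustrative.

```python
# (peptide, frequency in %, predicted IC50 for A*03:01 in nM,
#  predicted IC50 for A*11:01 in nM), copied from Table 2
BLOCK_1990 = [
    ("KTFDTEYQK", 58.49, 137, 10),
    ("KTFDSEYVK", 25.80, 89, 8),
    ("KSYETEYPK", 5.87, 67, 8),
    ("KTFDTEYPK", 3.48, 116, 7),
    ("KSYDTEYPK", 2.22, 97, 7),
    ("KTFDSEYIK", 1.06, 98, 8),
    ("KTFEKDYSR", 0.90, 2453, 68),
    ("KTFEREYPT", 0.73, 18060, 8824),
    ("KTFDTEYTK", 0.27, 154, 10),
    ("KTFDSEYAK", 0.27, 78, 7),
    ("KTFDTEYIK", 0.23, 188, 10),
    ("KTFEKDYTR", 0.17, 3430, 116),
    ("KTFEKEYPT", 0.13, 20000, 11054),
    ("RTFDTEYQK", 0.10, 142, 8),
    ("KTFETEYQK", 0.10, 88, 11),
    ("KTFDAEYVK", 0.07, 159, 11),
    ("KTFNTEYQK", 0.07, 71, 10),
    ("KTFDFEYIK", 0.03, 113, 6),
    ("KTFDTEYQR", 0.03, 2288, 21),
]

A0301, A1101 = 2, 3  # tuple positions of the two allele columns


def non_binder_frequency(rows, column, threshold_nm=500):
    """Summed frequency (%) of peptides whose predicted IC50 is not
    below the binding threshold for the allele in the given column."""
    return sum(row[1] for row in rows if row[column] >= threshold_nm)


missed_a1101 = non_binder_frequency(BLOCK_1990, A1101)
missed_a0301 = non_binder_frequency(BLOCK_1990, A0301)
print(f"A*11:01 non-binders: {missed_a1101:.2f}%")  # 0.86%
print(f"A*03:01 non-binders: {missed_a0301:.2f}%")  # 1.96%
```

For A*11:01 the predicted non-binders stay below 1% of the block, so the block meets a 99% threshold for that allele. For A*03:01 they do not, even though the most frequent peptides are predicted binders of both alleles.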


3.3 Block conservation of MHC class II binders in NV

We performed block conservation analysis and prediction of epitopes from an MSA of the NV protein sequences. Due to the more homologous nature of this data set (i.e. it consists of only one species, as opposed to panFive), more regions of conserved HLA binding were found. In the NV sequences, we found 184 conserved blocks that met the block conservation threshold (yx = 99%) and in which all peptides were predicted to bind the same HLA (Figure 5). Three of these blocks are defined by a single peptide. This compares favorably to traditional conservation analysis, where fewer than 2% of the potentially immunogenic regions were captured. We found that the average number of peptides in potentially immunogenic blocks (characterized by binding of all peptides in the block to at least one HLA allele) is approximately 10. Block 277 of the NV p48 protein represents an example of one such region. The intra-block diversity is visualized using BlockLogo and WebLogo in Figure 6, and a closer look at peptide HLA binding prediction in this block reveals that all peptides are predicted to bind to both DRB1*01:01 and DRB1*07:01 (Table 3).

3.4 Assessing the immunological potential of vaccine targets in DENV and NV

The analysis of 37 class I DENV T-cell epitopes reported by Weiskopf et al. [48] shows that 21 of these epitopes are located in blocks of conserved HLA binding (at least 99% of the epitope's block binds to the same HLA) (Table 4). The utility of our method is demonstrated by examples such as the known T-cell epitope GTSGSPIIDK [48, 58, 59], which is present in only 11.17% of the DENV strains, but is found in a block comprising 7 peptides, all predicted to bind HLA A*11:01, and collectively covering 99.85% of the total viral population. This epitope and corresponding block represent a valuable target pool for further study of polyvalent vaccine constructs. LoBue et al. [52] showed that NV epitopes have similarly low targeting efficiency. The authors identified eight cross-reactive CD4+ T-cell epitopes in the NV capsid peptide, none of which are conserved by traditional measures in the NV data set used here. However, two epitopes (FYQEAAPAQSDVALL and DVALLRFVNPDTGRV), although only present in 58.33% and 69% of sequenced NV strains respectively, are found in blocks in which more than 99% of the peptides are predicted to bind the same HLA. This indicates that in spite of low conservation of the individual peptides,
the block may well be of immunological importance and potentially useful for polytope vaccine constructs. Assessment of immunological potential of all eight epitopes found by LoBue et al. is summarized in Table 5.

3.5 Software implementation for custom visualizations

We developed a software implementation of the visualization scheme, which is freely available at http://met-hilab.bu.edu/blockcons. To generate a custom visualization, the user must submit an MSA of proteins of interest (pathogen or tumor proteins), and select analysis parameters as described below. First, a block size between 8 and 25 amino acids (corresponding to the length of T-cell epitopes) can be chosen. Then, a conservation threshold (the minimum accumulated frequency of peptides in a block, required for the block to be considered adequately covered) for each block can be selected. Most blocks will contain a large number of very low frequency variants, which can be filtered from the block if desired. For example, a peptide present only in a small fraction of examined viral proteins may be considered evolutionarily unstable (one or a few occurrences isolated in time and geographic location), and may be exempt from the analysis if desired. For this purpose, a conservation threshold of, for example, 99% can be chosen. Similarly, this threshold acts as an immunological conservation threshold, i.e. the minimum accumulated frequency of predicted HLA binders in a block, in order for the block to be considered adequately conserved in terms of potential immunological function. To efficiently summarize the data, the size of the visualization can be reduced based on a user-defined threshold for displaying blocks. The immunological conservation threshold will filter the output by displaying only blocks in which the minimum accumulated frequency of HLA binders is above the threshold.
HLA binding predictions are performed using either NetMHC 3.0 for prediction of MHC class I binders (for blocks of length 8-12) or NetMHCII 2.0 for prediction of MHC class II binders (for blocks of length 13-25). A number of common alleles can be selected; please refer to the materials and methods section for details.
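The parameter choices described above can be summarized in a short Python sketch. The class `BlockAnalysisParams` and its field names are hypothetical and do not reflect the actual interface of the web server; only the value ranges (block length 8-25, class I predictions for lengths 8-12, class II predictions for lengths 13-25) are taken from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockAnalysisParams:
    """Hypothetical container for the user-selected analysis parameters."""

    block_length: int                      # l, peptide length in amino acids
    conservation_threshold: float = 99.0   # yx, minimum accumulated frequency (%)

    def __post_init__(self):
        if not 8 <= self.block_length <= 25:
            raise ValueError("block length must be between 8 and 25 amino acids")
        if not 0.0 < self.conservation_threshold <= 100.0:
            raise ValueError("conservation threshold must be in (0, 100]")

    @property
    def predictor(self):
        """Predictor implied by the block length, following the ranges in the text."""
        return "NetMHC 3.0" if self.block_length <= 12 else "NetMHCII 2.0"


print(BlockAnalysisParams(block_length=9).predictor)
print(BlockAnalysisParams(block_length=15).predictor)
```

Validating the parameters at construction time means that an out-of-range block length is rejected before any prediction is attempted.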


Figure 5: Blocks of conserved HLA binding in NV MSA. Visualization of the block conservation analysis and MHC class II binding affinity predictions for NV for 184 blocks from the panNV polyprotein in which at least 99% of the peptides were predicted to bind at least one of the four HLA II alleles used in this example (DRB1*01:01, DRB1*11:01, DRB1*04:01, and DRB1*07:01). The bars show the minimum number of peptides in a block (Y axis) at a given starting position in the MSA (X axis) required to fulfill the user-defined coverage threshold, yx. The heat map below the bars shows the percentage of peptides in the block predicted to bind to each of the HLA alleles used in this example. The color of each position in the heat map matrix ranges from blue (0% accumulated conservation by predicted binders in the block for the given allele) to red (99% accumulated conservation in the block by peptides predicted to bind the given allele with an affinity of 500 nM or stronger).


Figure 6: Sequence logo visualizations of NV block 277. Sequence logo (A) and BlockLogo (B) visualization of block 277 of NV p48 protein multiple sequence alignment.


Table 3: Conservation of HLA binding peptide blocks in NV. Details of conservation and variability and predicted HLA binding affinities (nM) of peptides in block 277 of the panNV polyprotein MSA.

Peptide Frequency (%) Accumulated frequency (%) DRB1*01:01 DRB1*04:01 DRB1*07:01 DRB1*11:01

LRPLNIINILASCDW 72.27 72.27 60.70 890.70 178.60 215.00

LRPLNILNILASCDW 15.13 87.39 30.30 966.50 180.30 127.30

VRPLNILNILASCDW 3.78 91.18 38.20 988.50 203.20 134.60

LKPLNILNILASCDW 2.94 94.12 33.70 1018.40 171.40 133.50

LKPLNILNILATCDW 2.94 97.06 33.60 1220.90 221.90 231.80

LRPLNVINILASCDW 2.10 99.16 34.80 1153.90 206.00 346.40

VRPLNIINILASCDW 0.42 99.58 90.60 879.30 189.70 223.70

IRPLNILNILASCDW 0.42 100.00 32.30 975.00 187.10 129.30


Table 4: Assessment of immunological potential of DENV epitopes. This table shows a number of assessment parameters for the DENV epitopes discovered by Weiskopf et al. [48].

Epitope | Frequency in block | Predicted HLA binding | Total number of peptides in block | Number of binders in block | Viral population coverage by binders to any HLA | Viral population coverage by binders to predominant HLA
ITEAELTGY | 28.76% | A0101 (55nM) | 14 | 8 | 87.02% | 87.02% (A0101)
RSCTLPPLRY | 51.66% | A0301 (348nM), A1101 (271nM) | 4 | 1 | 51.66% | 51.66% (A1101)
MTDDIGMGV | 23.38% | A0101 (52nM), A0201 (88nM) | 18 | 6 | 50.40% | 44.43% (A0101)
LTDALALGM | 30.05% | A0101 (91nM) | 10 | 5 | 34.43% | 30.71% (A0101)
VIDLDPIPY | 15.38% | A0101 (41nM) | 8 | 6 | 99.75% | 99.75% (A0101)
YTDYMPSMK | 30.38% | A0101 (436nM), A1101 (56nM) | 12 | 11 | 78.46% | 78.46% (A1101)
RLITVNPIV | 30.35% | A0201 (15nM) | 13 | 13 | 99.16% | 99.16% (A0201)
IMAVGMVSI | 26.81% | A0201 (49nM), B1501 (299nM) | 9 | 9 | 99.94% | 99.94% (A0201)
GLLTVCYVL | 30.79% | A0201 (20nM) | 5 | 4 | 96.10% | 96.10% (A0201)
LLVISGLFPV | 9.11% | A0201 (14nM) | 14 | 13 | 95.61% | 74.89% (A0201)
AAAWYLWEV | 30.13% | A0201 (8nM) | 11 | 10 | 99.42% | 64.9% (A1101)
YLPAIVREA | 93.81% | A0201 (51nM) | 7 | 6 | 96.06% | 96.06% (A0201)
DLMRRGDLPV | 30.20% | A0201 (189nM) | 6 | 5 | 99.93% | 99.93% (A0201)
ALSELPETL | 30.64% | A0201 (33nM) | 8 | 4 | 34.59% | 34.59% (A0201)
IILEFFLIV | 30.46% | A0201 (19nM) | 7 | 7 | 99.82% | 99.82% (A0201)
KLAEAIFKL | 29.83% | A0201 (7nM) | 14 | 14 | 99.66% | 99.66% (A0201)
SQIGAGVYK | 30.60% | A1101 (20nM) | 7 | 4 | 54.79% | 51.21% (A1101)
GTSGSPIIDK | 11.17% | A0301 (193nM), A1101 (18nM) | 7 | 7 | 99.85% | 99.85% (A1101)
KTFDSEYVK | 28.69% | A0301 (89nM), A1101 (8nM) | 9 | 9 | 99.48% | 99.48% (A1101)
RIYSDPLALK | 30.05% | A0301 (12nM), A1101 (16nM) | 7 | 7 | 99.63% | 99.63% (A1101)
ATVLMGLGK | 27.69% | A1101 (24nM) | 8 | 8 | 100.00% | 100.00% (A1101)
STYGWNLVR | 30.83% | A0301 (208nM), A1101 (12nM) | 7 | 7 | 99.67% | 99.67% (A1101)
TVMDIISRR | 26.51% | A1101 (15nM) | 5 | 5 | 99.93% | 99.93% (A1101)
RQMEGEGVFK | 13.27% | A0301 (31nM), A1101 (15nM) | 13 | 6 | 44.87% | 30.71% (A1101)
RTTWSIHAK | 27.51% | A0301 (63nM), A1101 (9nM) | 6 | 4 | 97.02% | 97.02% (A1101)
RPTFAAGLLL | 29.61% | B0702 (22nM) | 13 | 13 | 99.45% | 99.45% (B0702)
LPAIVREAI | 93.66% | B0702 (20nM) | 7 | 7 | 99.71% | 99.71% (B0702)
APTRVVAAEM | 55.46% | B0702 (94nM) | 3 | 3 | 99.97% | 99.97% (B0702)
VPNYNLIIM | 49.48% | B0702 (105nM) | 5 | 4 | 99.85% | 99.85% (B0702)
APIMDEEREI | 19.76% | B0702 (352nM) | 12 | 2 | 23.56% | 23.56% (B0702)
TPEGIIPSMF | 30.68% | B0702 (223nM) | 6 | 6 | 99.89% | 99.89% (B0702)
KPRWLDARI | 30.05% | B0702 (42nM) | 7 | 6 | 99.19% | 99.19% (B0702)
RPASAWTLYA | 42.40% | B0702 (467nM) | 3 | 1 | 42.40% | 42.40% (B0702)
TPMLRHSI | 30.79% | B0702 (19nM), B0801 (48nM) | 4 | 4 | 100.00% | 100.00% (B0702)
SPNPTVEAGR | 13.72% | none | 7 | 0 | 0% | 0%
TPRMCTREEF | 29.50% | B0702 (87nM) | 11 | 11 | 99.41% | 99.41% (B0702)
RPTPRGTVM | 30.46% | B0702 (6nM) | 10 | 10 | 98.86% | 98.86% (B0702)


Table 5: Assessment of immunological potential of NV epitopes. This table shows a number of assessment parameters for the NV epitopes discovered by LoBue et al. [52].

Epitope | Frequency in block | Predicted HLA binding | Total number of peptides in block | Number of binders in block | Viral population coverage by binders to any HLA | Viral population coverage by binders to predominant HLA
SPNNTPGDVLFDLSL | 3.64% | none | 128 | 38 | 28.90% | 28.84% (DRB10101)
FDLSLGPHLNPFLLH | 3.72% | DRB10101 (32.2nM), DRB11101 (485.8nM) | 117 | 94 | 66.91% | 66.91% (DRB10101)
PFLLHLSQMYNGWVG | 1.23% | DRB10101 (27.3nM), DRB11101 (130.9nM), DRB10401 (66.3nM) | 76 | 75 | 80.60% | 80.6% (DRB10101)
CSGYPNMNLDCLLPQ | 17.55% | none | 68 | 6 | 1.31% | 0.8% (DRB10101)
CLLPQEWVQHFYQEA | 48.98% | none | 38 | 7 | 2.33% | 2.33% (DRB10101)
FYQEAAPAQSDVALL | 58.33% | DRB10101 (15.4nM), DRB10401 (286nM) | 49 | 49 | 99.43% | 99.23% (DRB10101)
DVALLRFVNPDTGRV | 69.00% | DRB10101 (50.6nM), DRB11101 (125.9nM), DRB10401 (70.1nM) | 41 | 41 | 99.46% | 99.46% (DRB10101)


4 Discussion

The introduction of computational methodologies to vaccinology has enabled a significant step towards rational vaccine design. The current paradigm for antigen selection is based on assembling a number of highly conserved peptides predicted to bind HLA, to cover diversity of pathogen and host with the smallest number of peptides. From a practical point of view, it would be cheaper and faster to experimentally validate fewer candidates, technically easier to include fewer peptides in vaccine designs, and the final epitope pool carries smaller risk of undesired immunodominance. However, from an evolutionary point of view, some viruses susceptible to treatment are likely to escape due to high mutation rates. Indeed, recent experimental efforts have shown that the targeting efficiency of highly variable viruses tends to be rather low [24, 39, 40]. This prompted us to design a computational method with the aim of assembling pools of lower frequency T-cell epitope candidates to use their collective ability to confer neutralization of pathogens or cancers. Specifically, we aim to select epitopes from regions of some variability, as targeted regions are likely to mutate due to selective pressure, but our methods focuses exclusively on regions in which all known variant peptides are predicted to be similarly antigenic. The conservation analysis of peptide blocks (rather than individual residues) in conjunction with prediction of epitopes from MSA enables the identification of pools of epitope candidates with the potential for broad coverage. Furthermore, predicting HLA binders from MSAs of homologous peptides increases the likelihood that all peptides in the pool are processed and presented in a similar manner. The inclusion of multiple peptides in a vaccine construct carries the risk of immunodominance of one or a few peptides in a pool, thus increasing the chance for immune escape by mutation. 
However, even if immunodominance of a particular peptide is observed, this can be obviated, in principle, by multiple epitope vaccination [59]. The analysis of peptides in MSAs using block analysis provides a new method for computational analysis of T-cell epitope vaccine candidates. In addition, our tool provides a compact, informative graphical overview of conservation and variability vs. potential immunogenicity. It allows for pre-screening of potential vaccine targets before engaging in expensive and time-consuming experimental screening.

Assessing the epitopes discovered by Weiskopf et al. [48] using peptide block analysis showed that 21 of the 37 epitopes reported in their study were located in blocks of conserved HLA binding, while 16 epitopes were outside the conserved blocks. For practical reasons, good vaccine targets should provide broad coverage across viral variants and maintain immunogenicity. It would therefore be useful to determine how many epitopes can be efficiently included in a polyvalent vaccine construct. Establishing such a range would aid the assessment of epitopes as presented in Table 4, as these epitopes are found in blocks ranging in size from three to 14 epitope candidates. The number of epitopes in blocks of conserved binding may be higher in reality, since HLA binding affinity was only predicted for a subset of HLA alleles, representative of HLA supertypes.

We analyzed a set of peptides from DENV and panFive flaviviruses for HLA class I binders and peptides of NV for HLA class II binders. This was done for 12 selected HLA alleles, but prediction servers such as NetMHCpan [56] allow for prediction of HLA class I binding to 2914 different alleles, and NetMHCIIpan [60] allows predictions for 655 HLA class II alleles. In our analysis we have provided a proof of principle and covered the most common HLA alleles. Further analyses can therefore be extended to a large number of HLA alleles, including many rare alleles. The application of epitope prediction from MSA to the flavivirus panFive and NV data significantly increased the number of potential targets to be considered: from no targets to 29 blocks (consisting of 196 peptides) in panFive, from 3 targets to 184 blocks (consisting of 1835 peptides) in NV, and a 10-fold increase in DENV targets relative to previous studies [14]. Although our results call for increased experimental validation, advances in mass-spectrometry methods [61] and flow cytometry-based methods [62] make large-scale T-cell epitope identification viable.
The analysis of the panFive data set provides an example showing that high conservation alone is an insufficient criterion for selection of potential immunogens. Block 1347 of the MSA of panFive sequences contains 32 different peptides and the Shannon entropy is 2.85. This position can therefore be regarded as highly variable. However, only two of the 32 peptides, representing 0.92% of the viral population, are predicted not to bind HLA-A*11:01 (see supplementary materials Table S2 and Figure S2). This means that highly variable regions may produce good immunogen candidates. A similar observation was made in the NV block analysis, where the blocks starting at positions 623 and 624 each contain 36 peptides that are all predicted to bind the same HLA alleles. Such blocks are perhaps too large to include in vaccine constructs, and may instead be considered for exclusion from vaccine formulation because they potentially direct immune responses towards variable regions and, as a consequence, may not protect against a large proportion of the viral population.

4.1 Concluding remarks

We have developed a novel method that integrates conservation analysis and prediction of immunogenic potential of peptides within viral antigens for fast identification of potential CTL vaccine targets. The scale of analysis provided an order of magnitude larger number of discovered targets compared with previous approaches. We addressed the problems of scale and integration of multiple results by designing a visual results representation for rapid identification of potential polyvalent vaccine targets. This method is an addition to the available toolset for discovery and design of universal polyvalent CTL vaccines against viral pathogens. The software implementation is freely available at http://met-hilab.bu.edu/blockcons.

5 References

1. André, F. E. Vaccinology: past achievements, present roadblocks and future promises. Vaccine 2003, 21: 593–5.
2. Ehreth, J. The global value of vaccination. Vaccine 2003, 21: 596–600.
3. Kurstak, E. Towards the new global vaccinology era in prevention and control of diseases. Vaccine 2003, 21: 580–1.
4. Brusic, V., and J. T. August. The changing field of vaccine development in the genomics era. Pharmacogenomics 2004, 5: 597–600.
5. Morrison, D., T. J. Legg, C. W. Billings, R. Forrat, S. Yoksan, and J. Lang. A novel tetravalent dengue vaccine is well tolerated and immunogenic against all 4 serotypes in flavivirus-naive adults. The Journal of infectious diseases 2010, 201: 370–7.
6. Treanor, J. J., H. K. Talbot, S. E. Ohmit, L. A. Coleman, M. G. Thompson, P. Y. Cheng, J. G. Petrie, G. Lofthus, J. K. Meece, J. V. Williams, L. Berman, C. Breese Hall, A. S. Monto, M. R. Griffin, E. Belongia, and D. K. Shay. Effectiveness of seasonal influenza vaccines in the United States during a season with circulation of all three vaccine strains. Clinical infectious diseases 2012, 55: 951–9.
7. Plotkin, S. A. Correlates of protection induced by vaccination. Clinical and vaccine immunology 2010, 17: 1055–1065.
8. Sun, P., R. Schwenk, K. White, J. A. Stoute, J. Cohen, W. R. Ballou, G. Voss, K. E. Kester, D. G. Heppner, and U. Krzych. Protective immunity induced with malaria vaccine, RTS,S, is linked to Plasmodium falciparum circumsporozoite protein-specific CD4+ and CD8+ T cells producing IFN-gamma. Journal of Immunology 2003, 171: 6961–7.
9. Bowen, D. G., and C. M. Walker. Adaptive immune responses in acute and chronic hepatitis C virus infection. Nature 2005, 436: 946–52.
10. Olson, J. A., C. McDonald-Hyman, S. C. Jameson, and S. E. Hamilton. Effector-like CD8(+) T Cells in the Memory Population Mediate Potent Protective Immunity. Immunity 2013, 38: 1250–60.
11. Khader, S. A., G. K. Bell, J. E. Pearl, J. J. Fountain, J. Rangel-Moreno, C. E. Cilley, F. Shen, S. M. Eaton, S. L. Gaffen, S. L. Swain, R. M. Locksley, L. Haynes, T. D. Randall, and A. M. Cooper. IL-23 and IL-17 in the establishment of protective pulmonary CD4+ T cell responses after vaccination and during Mycobacterium tuberculosis challenge. Nature Immunology 2007, 8: 369–77.
12. Purcell, A. W., J. McCluskey, and J. Rossjohn. More than one reason to rethink the use of peptides in vaccine design. Nature Reviews. Drug Discovery 2007, 6: 404–14.
13. Seib, K. L., X. Zhao, and R. Rappuoli. Developing vaccines in the era of genomics: a decade of reverse vaccinology. Clinical microbiology and infection 2012, 5: 109–16.
14. Olsen, L. R., G. L. Zhang, D. B. Keskin, E. L. Reinherz, and V. Brusic. Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy. Frontiers in immunology 2011, 2: 1–15.
15. Khan, A. M., O. Miotto, E. J. M. Nascimento, K. N. Srinivasan, A. T. Heiny, G. L. Zhang, E. T. Marques, T. W. Tan, V. Brusic, J. Salmon, and J. T. August. Conservation and variability of dengue virus proteins: implications for vaccine design. PLoS neglected tropical diseases 2008, 2: e272.
16. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. J. Lipman. Basic local alignment search tool. Journal of molecular biology 1990, 215: 403–10.
17. Schneider, T. D., and R. M. Stephens. Sequence logos: a new way to display consensus sequences. Nucleic acids research 1990, 18: 6097–100.
18. Gaschen, B., J. Taylor, K. Yusim, B. Foley, F. Gao, D. Lang, V. Novitsky, B. Haynes, B. H. Hahn, T. Bhattacharya, and B. Korber. Diversity considerations in HIV-1 vaccine selection. Science 2002, 296: 2354–60.
19. De Groot, A. S., L. Marcon, E. A. Bishop, D. Rivera, M. Kutzler, D. B. Weiner, and W. Martin. HIV vaccine development by computer assisted design: the GAIA vaccine. Vaccine 2005, 23: 2136–48.
20. Gao, F., E. A. Weaver, Z. Lu, Y. Li, H. Liao, B. Ma, S. M. Alam, R. M. Scearce, L. L. Sutherland, J. Yu, J. M. Decker, G. M. Shaw, D. C. Montefiori, B. T. Korber, B. H. Hahn, and B. F. Haynes. Antigenicity and immunogenicity of a synthetic human immunodeficiency virus type 1 group M consensus envelope glycoprotein. Journal of virology 2005, 79: 1154–63.
21. Fischer, W., H. X. Liao, B. F. Haynes, N. L. Letvin, and B. Korber. Coping with Viral Diversity in HIV Vaccine Design: A Response to Nickle et al. PLoS Computational Biology 2008, 4.
22. Fischer, W., S. Perkins, J. Theiler, T. Bhattacharya, K. Yusim, R. Funkhouser, C. Kuiken, B. Haynes, N. L. Letvin, B. D. Walker, B. H. Hahn, and B. T. Korber. Polyvalent vaccines for optimal coverage of potential T-cell epitopes in global HIV-1 variants. Nature medicine 2007, 13: 100–6.
23. Santra, S., H. X. Liao, R. Zhang, M. Muldoon, S. Watson, W. Fischer, J. Theiler, J. Szinger, H. Balachandran, A. Buzby, D. Quinn, R. J. Parks, C. Y. Tsao, A. Carville, K. G. Mansfield, G. N. Pavlakis, B. K. Felber, B. F. Haynes, B. T. Korber, and N. L. Letvin. Mosaic vaccines elicit CD8+ T lymphocyte responses that confer enhanced immune coverage of diverse HIV strains in monkeys. Nature Medicine 2010, 16: 324–8.
24. Heiny, A. T., O. Miotto, K. N. Srinivasan, A. M. Khan, G. L. Zhang, V. Brusic, T. W. Tan, and J. T. August. Evolutionarily conserved protein sequences of influenza A viruses, avian and human, as vaccine targets. PloS one 2007, 2: e1190.
25. Katoh, K., and H. Toh. Recent developments in the MAFFT multiple sequence alignment program. Briefings in bioinformatics 2008, 9: 286–98.
26. Edgar, R. C. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC bioinformatics 2004, 5: 113.
27. Thompson, J. D., T. J. Gibson, and D. G. Higgins. Multiple sequence alignment using ClustalW and ClustalX. Current protocols in bioinformatics / editorial board, Andreas D. Baxevanis ... [et al.] 2002, Chapter 2: Unit 2.3.
28. Shannon, C. E. A mathematical theory of communication. Bell System Technical Journal 1948, 27: 379–423, 623–656.
29. Zhang, G. L., H. R. Ansari, P. Bradley, G. C. Cawley, T. Hertz, X. Hu, N. Jojic, Y. Kim, O. Kohlbacher, O. Lund, C. Lundegaard, C. A. Magaret, M. Nielsen, H. Papadopoulos, G. P. S. Raghava, V.-S. Tal, L. C. Xue, C. Yanover, S. Zhu, M. T. Rock, J. E. Crowe, C. Panayiotou, M. M. Polycarpou, W. Duch, and V. Brusic. Machine learning competition in immunology - Prediction of HLA class I binding peptides. Journal of immunological methods 2011, 374: 1–4.
30. Parker, K. C., M. A. Bednarek, and J. E. Coligan. Scheme for ranking potential HLA-A2 binding peptides based on independent binding of individual peptide side-chains. Journal of immunology 1994, 152: 163–75.
31. Schuler, M. M., M.-D. Nastke, and S. Stevanović. SYFPEITHI: database for searching and T-cell epitope prediction. Methods in molecular biology 2007, 409: 75–93.
32. Hu, X., H. Mamitsuka, and S. Zhu. Ensemble approaches for improving HLA class I-peptide binding prediction. Journal of immunological methods 2011, 374: 47–52.
33. Huang, J. C., and N. Jojic. Modeling major histocompatibility complex binding by nonparametric averaging of multiple predictors and sequence encodings. Journal of immunological methods 2011, 374: 35–42.
34. Lundegaard, C., O. Lund, and M. Nielsen. Prediction of epitopes using neural network based methods. Journal of immunological methods 2011, 374: 26–34.
35. Nielsen, M., C. Lundegaard, and O. Lund. Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC bioinformatics 2007, 8: 238.
36. Lin, H. H., S. Ray, S. Tongchusak, E. L. Reinherz, and V. Brusic. Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC immunology 2008, 9: 8.
37. Lin, H. H., G. L. Zhang, S. Tongchusak, E. L. Reinherz, and V. Brusic. Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC bioinformatics 2008, 9 Suppl 12: S22.
38. Martinez, A. N., S. Tenzer, and H. Schild. T-cell epitope processing (the epitope flanking regions matter). Methods in molecular biology (Clifton, N.J.) 2009, 524: 407–15.
39. Hertz, T., D. Nolan, I. James, M. John, S. Gaudieri, E. Phillips, J. C. Huang, G. Riadi, S. Mallal, and N. Jojic. Mapping the landscape of host-pathogen coevolution: HLA class I binding and its relationship with evolutionary conservation in human and viral proteins. Journal of virology 2011, 85: 1310–21.
40. Rolland, M., N. Frahm, D. C. Nickle, N. Jojic, W. Deng, T. M. Allen, C. Brander, D. E. Heckerman, and J. I. Mullins. Increased breadth and depth of cytotoxic T lymphocytes responses against HIV-1-B Nef by inclusion of epitope variant sequences. PloS one 2011, 6: e17969.
41. Domingo, E., and J. J. Holland. RNA virus mutations and fitness for survival. Annual review of microbiology 1997, 51: 151–78.
42. Webster, R. G., W. J. Bean, O. T. Gorman, T. M. Chambers, and Y. Kawaoka. Evolution and ecology of influenza A viruses. Microbiological reviews 1992, 56: 152–79.
43. Forns, X., R. H. Purcell, and J. Bukh. Quasispecies in viral persistence and pathogenesis of hepatitis C virus. Trends in microbiology 1999, 7: 402–10.
44. Davenport, M. P., L. Loh, J. Petravic, and S. J. Kent. Rates of HIV immune escape and reversion: implications for vaccination. Trends in microbiology 2008, 16: 561–6.
45. Ye, J., B. Zhu, Z. F. Fu, H. Chen, and S. Cao. Immune evasion strategies of flaviviruses. Vaccine 2013, 31: 461–71.
46. Perelson, A. S. Modelling viral and immune system dynamics. Nature reviews. Immunology 2002, 2: 28–36.
47. Sanjuán, R., M. R. Nebot, N. Chirico, L. M. Mansky, and R. Belshaw. Viral mutation rates. Journal of virology 2010, 84: 9733–48.
48. Weiskopf, D., L. E. Yauch, M. A. Angelo, D. V. John, J. A. Greenbaum, J. Sidney, R. V. Kolla, A. D. De Silva, A. M. de Silva, H. Grey, B. Peters, S. Shresta, and A. Sette. Insights into HLA-restricted T cell responses in a novel mouse model of dengue virus infection point toward new implications for vaccine design. Journal of immunology 2011, 187: 4268–79.
49. Hale, A. D., D. C. Lewis, X. Jiang, and D. W. Brown. Homotypic and heterotypic IgG and IgM antibody responses in adults infected with small round structured viruses. Journal of medical virology 1998, 54: 305–12.
50. Erdman, D. D., G. W. Gary, and L. J. Anderson. Serum immunoglobulin A response to Norwalk virus infection. Journal of clinical microbiology 1989, 27: 1417–8.
51. Chachu, K. A., A. D. LoBue, D. W. Strong, R. S. Baric, and H. W. Virgin. Immune mechanisms responsible for vaccination against and clearance of mucosal and lymphatic norovirus infection. PLoS pathogens 2008, 4: e1000236.
52. LoBue, A. D., L. C. Lindesmith, and R. S. Baric. Identification of cross-reactive norovirus CD4+ T cell epitopes. Journal of virology 2010, 84: 8530–8.
53. Olsen, L. R., G. L. Zhang, E. L. Reinherz, and V. Brusic. FLAVIdB: A data mining system for knowledge discovery in flaviviruses with direct applications in immunology and vaccinology. Immunome research 2011, 7: 1–9.
54. Benson, D. A., I. Karsch-Mizrachi, D. J. Lipman, J. Ostell, and E. W. Sayers. GenBank. Nucleic acids research 2011, 39: D32–7.
55. Nielsen, M., and O. Lund. NN-align. An artificial neural network-based alignment algorithm for MHC class II peptide binding prediction. BMC bioinformatics 2009, 10: 296.
56. Nielsen, M., C. Lundegaard, T. Blicher, K. Lamberth, M. Harndahl, S. Justesen, G. Røder, B. Peters, A. Sette, O. Lund, and S. Buus. NetMHCpan, a method for quantitative predictions of peptide binding to any HLA-A and -B locus protein of known sequence. PloS one 2007, 2: e796.
57. Crooks, G. E., G. Hon, J.-M. Chandonia, and S. E. Brenner. WebLogo: a sequence logo generator. Genome research 2004, 14: 1188–90.
58. Chotiyarnwong, P., G. B. Stewart-Jones, M. J. Tarry, W. Dejnirattisai, C. Siebold, M. Koch, D. I. Stuart, K. Harlos, P. Malasit, G. Screaton, J. Mongkolsapaya, and E. Y. Jones. Humidity control as a strategy for lattice optimization applied to crystals of HLA-A*1101 complexed with variant peptides from dengue virus. Acta crystallographica. Section F, Structural biology and crystallization communications 2007, 63: 386–92.
59. Liu, J., B. A. Ewald, D. M. Lynch, A. Nanda, S. M. Sumida, and D. H. Barouch. Modulation of DNA vaccine-elicited CD8+ T-lymphocyte epitope immunodominance hierarchies. Journal of virology 2006, 80: 11991–11997.
60. Nielsen, M., S. Justesen, O. Lund, C. Lundegaard, and S. Buus. NetMHCIIpan-2.0 - Improved pan-specific HLA-DR predictions using a novel concurrent alignment and weight optimization training procedure. Immunome research 2010, 6: 9.
61. Reinhold, B., D. B. Keskin, and E. L. Reinherz. Molecular Detection of Targeted Major Histocompatibility Complex I-Bound Peptides Using a Probabilistic Measure and Nanospray MS(3) on a Hybrid Quadrupole-Linear Ion Trap. Analytical chemistry 2010, 81: 9090–9099.
62. Andersen, R. S., P. Kvistborg, T. M. Frøsig, N. W. Pedersen, R. Lyngaa, A. H. Bakker, C. J. Shu, P. T. Straten, T. N. Schumacher, and S. R. Hadrup. Parallel detection of antigen-specific T cell responses by combinatorial encoding of MHC multimers. Nature protocols 2012, 7: 891–902.

Paper III

BlockLogo: visualization of peptide and sequence motif conservation

Journal of Immunological Methods 2013 (In press)

Lars Rønn Olsen1,2, Ulrich Johan Kudahl1,3, Christian Simon1,3, Jing Sun1,4, Christian Schönbach5, Ellis L. Reinherz1,4,6, Guang Lan Zhang1,4,7, Vladimir Brusic1,4,7,*

1 Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
2 Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
3 Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark
4 Department of Medicine, Harvard Medical School, Boston, MA, USA
5 Department of Bioscience and Bioinformatics, Graduate School of Computer Science and Systems Engineering, Kyushu Institute of Technology, Fukuoka, Japan
6 Laboratory of Immunobiology and Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
7 Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA

*Corresponding author

Abstract

BlockLogo is a web-server application for the visualization of protein and nucleotide fragments, continuous protein sequence motifs, and discontinuous sequence motifs using calculation of block entropy from multiple sequence alignments. The user input consists of a multiple sequence alignment, selection of motif positions, type of sequence, and output format definition. The output consists of the BlockLogo along with the sequence logo and a table of motif frequencies. We deployed BlockLogo as an online application and demonstrated its utility through examples that show visualization of T-cell epitopes and B-cell epitopes (both continuous and discontinuous). An additional example shows visualization and analysis of structural motifs that determine the specificity of peptide binding to HLA-DR molecules. The BlockLogo server also employs selected experimentally validated prediction algorithms to enable on-the-fly prediction of MHC binding affinity to 15 common HLA class I and class II alleles, as well as visual analysis of discontinuous epitopes from multiple sequence alignments. It enables visualization and analysis of structural and functional motifs that are usually described as regular expressions, and it provides a compact view of discontinuous motifs composed of distant positions within biological sequences. BlockLogo is available at: http://research4.dfci.harvard.edu/cvc/blocklogo/ and http://met-hilab.bu.edu/blocklogo/.

1 Introduction

Sequence logos are useful tools for visual display of conservation and variability in a multiple sequence alignment (MSA) of DNA, RNA, or protein sequences [1]. Individual nucleotides or residues in each position in an MSA are displayed by stacking the characters, where the height of each character corresponds to its frequency relative to the frequencies of all the characters in that position, and the height of the stack is determined by the total information content [2]. Sequence logos aid the interpretation of sequence data by visualization of conserved motifs representing various functional or structural properties. Examples of motifs that have been visually analyzed using sequence logos are: transcription factors [3], enzyme DNA sequences [4], proteolytic cleavage sites [5], T-cell epitopes [6, 7], and the analysis of targets of neutralizing antibodies in HIV [8], among others. Sequence logos display stacked motifs with the most frequent residues shown at the bottom and the least frequent motif displayed on the top of the stack. Sequence logos visualize biological sequence motifs where the height of the logo element represents its log-transformed frequency displayed in bits of information. Logos often do not display low-frequency motifs because their heights are below useful resolution. The most popular sequence logo web server is WebLogo [9]. It enables users to generate standard sequence logos for DNA, RNA, and protein sequences. In addition to the WebLogo web server, several specialized logo generators have been developed to visualize specific motifs or functional sequence units that are unapparent from the standard sequence logos. 
Examples of extensions to the basic sequence logo are: RNA structure logo [10], which combines the standard sequence logo with information about base pairing and mutual information of base pairs; enoLOGOS [11], which displays energy measurements, probability matrices and alignment matrices in addition to the standard sequence logo; two-sample logo [12], which displays comparative sequence logos for two sets of aligned sequences; CorreLogo [13], which calculates mutual information of nucleotides in different positions to determine correlation and potential base pairing; Phylo-mLogo [14], which creates sequence logos for the comparison of phylogenetically distinct clades within an MSA of DNA sequences; Blogo [15], which displays a sequence logo with statistically significant bias of individual positions; RNAlogo [16], which extends the RNA structure logo with a graphical representation of secondary structure; PoreLogo [17], which uses sequence logos and 3D protein structures to visualize motifs of channels in transmembrane proteins; iceLogo [18], which provides a probability-based visualization by allowing users to define reference sequences of the sample’s origin; Seq2Logo [19], which visualizes amino acid sequence profiles in terms of amino acid enrichment and depletion; RIlogo [20], which visualizes RNA-RNA interactions; and CodonLogo [21], which enables visualization of conserved codon patterns.

The BlockLogo web server (Figure 1) enables visualization of continuous and discontinuous immune epitopes and various sequence motifs. To our knowledge, it is the first logo web server that specifically enables visualization and analysis of immunologically relevant motifs. WebLogo can be used to visualize immunological motifs such as immune epitopes, but a main limitation of the standard sequence logo for this type of application is that it carries no information about the relationship between the residues in the logo and instead treats each position as independent. Such logos therefore often have limited interpretability. For example, the sequence logo of influenza A HA peptide 232-241 (Figure 2A) shows variability that can be encoded by as many as 3,072 different peptides (4×1×1×4×3×2×4×2×2×2, corresponding to the number of different residues in each position). The actual number of different peptides that produced the sequence logo displayed in Figure 2A is seven, as shown in Table 1. The BlockLogo presented in Figure 2B and Table 1 shows, at a glance, that the vast majority of actual sequence diversity is produced by only five peptides that can be read directly from the BlockLogo. The peptides visible in this BlockLogo have frequencies >6%, while each of the two peptides not readable from the BlockLogo has a frequency of approximately 1%. Sequence logos can be useful for visualizing the variability of individual anchor positions of MHC binding peptides; however, since many motifs, such as T-cell epitopes, are recognized as linear peptides rather than individual residues, they should be visualized as continuous sequence blocks or fragments.
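The gap between what a sequence logo could encode and what is actually observed can be illustrated with a short calculation. The sketch below is illustrative only: the function names and the toy peptides are assumptions, and only the per-position residue counts are taken from the text. It requires Python 3.8 or later for `math.prod`.

```python
from math import prod


def encodable_peptides(residues_per_position):
    """Number of peptides implied when every position is treated as independent."""
    return prod(residues_per_position)


def column_counts(peptides):
    """Count the distinct residues in each column of equal-length peptides."""
    return [len(set(column)) for column in zip(*peptides)]


# Per-position residue counts quoted in the text for HA peptide 232-241.
ha_counts = [4, 1, 1, 4, 3, 2, 4, 2, 2, 2]
print(encodable_peptides(ha_counts))  # 3072

# Hypothetical toy block: few distinct peptides, but more encodable combinations.
toy = ["ACD", "ACD", "GCE", "GTD"]
print(len(set(toy)), encodable_peptides(column_counts(toy)))
```

A per-position view overstates diversity whenever residues at different positions co-vary, which is the situation a block-level view is meant to expose.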
Typical MHC class I T-cell epitopes are between 8 and 11 amino acids long. MHC class II epitopes can be longer than 30 amino acids, but they bind MHC through a nine amino acid long binding core [22]. The input to the BlockLogo web server tool is an MSA of nucleotide sequences, an MSA of short peptides of equal length, or a user-defined subset of positions (here termed a “block”) within an MSA of longer protein sequences. The user-defined positions from within an MSA (i.e. positions derived from the continuous or discontinuous motifs) define the blocks. The information content (Shannon entropy) and relative frequency of each block are calculated, and the sequences are printed in the BlockLogo, stacked according to frequency, from the most to the least frequent, from the bottom to the top of the stack. An extension of BlockLogo enables the prediction of the binding affinity of identified peptides for a selection of common HLA molecules using the netMHC prediction algorithms [23, 24] that have been experimentally validated for accuracy.
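The block extraction and frequency ranking described above can be sketched as follows. This is a minimal illustration, not the server's implementation: the function names and the toy alignment are hypothetical, and gap characters are simply kept as part of the block.

```python
from collections import Counter


def extract_blocks(aligned_sequences, start, end):
    """Slice MSA columns start..end (1-based, inclusive) out of every sequence."""
    return [seq[start - 1:end] for seq in aligned_sequences]


def rank_blocks(blocks):
    """Return (block, relative frequency) pairs, most frequent first."""
    counts = Counter(blocks)
    total = sum(counts.values())
    return [(block, count / total) for block, count in counts.most_common()]


# Hypothetical toy alignment; real input would be an MSA, e.g. in FASTA format.
msa = [
    "MKSLYQNADAYVG",
    "MKSLYQNADAYVG",
    "MKFLYAQASGRIG",
    "MKSLYQNADSYVG",
]
ranked = rank_blocks(extract_blocks(msa, 3, 12))
for block, freq in ranked:
    print(f"{block}\t{freq:.2%}")
```

The first element of `ranked` corresponds to the block drawn at the bottom of the stack, and later elements are stacked above it in order of decreasing frequency.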

Figure 1. The front page of BlockLogo with an example of input for the visualization of region 220-229 of MSA of influenza A HA. Numbering is relative to the MSA alignment position and the input in this example is in the FASTA format.

Figure 2. (A) Sequence logo plot of the residues in the 10-residue block starting at position 232 of the Influenza virus HA protein generated using WebLogo. (B) BlockLogo of the peptides in the 232-241 block. The residue position in the MSA is shown on the X-axis, and the information content is shown on the Y-axis. See Table 1 for peptide frequencies and HLA binding affinity predictions.

Table 1: HLA binding predictions of each peptide present in the block of 10-mer peptides starting at position 232 in an MSA of influenza HA proteins. Prediction was performed for HLA A*02:01 allele, but can be done for a number of alleles (see materials and methods). The table is also included in the BlockLogo web server output if HLA binding predictions are selected upon submission.

#   Peptide      Frequency (%)   Accumulated frequency (%)   Predicted binding affinity (nM)
1   SLYQNADAYV   64.65           64.65                       11.45
2   FLYAQASGRI   14.14           78.79                       24.37
3   ALYHTENAYV   7.07            85.86                       9.74
4   TLYRTENAYV   6.06            91.92                       20.51
5   FLYAQAAGRI   6.06            97.98                       22.23
6   FLHAQASGRI   1.01            98.99                       130.48
7   SLYQNADSYV   1.01            100.00                      19.39
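As a rough illustration of how the columns of Table 1 relate to each other, the sketch below recomputes the accumulated frequency and sums the frequency of peptides below a binding-affinity cutoff. The rows are copied from Table 1; the function names and the 500 nM and 50 nM cutoffs are assumptions made for this example and are not stated in the text.

```python
from itertools import accumulate

# (peptide, frequency in %, predicted affinity in nM), copied from Table 1.
TABLE_1 = [
    ("SLYQNADAYV", 64.65, 11.45),
    ("FLYAQASGRI", 14.14, 24.37),
    ("ALYHTENAYV", 7.07, 9.74),
    ("TLYRTENAYV", 6.06, 20.51),
    ("FLYAQAAGRI", 6.06, 22.23),
    ("FLHAQASGRI", 1.01, 130.48),
    ("SLYQNADSYV", 1.01, 19.39),
]


def accumulated_frequencies(rows):
    """Running total of the frequency column, rounded to two decimals."""
    return [round(total, 2) for total in accumulate(freq for _, freq, _ in rows)]


def coverage_below(rows, threshold_nm):
    """Summed frequency (%) of peptides with predicted affinity below threshold_nm."""
    return round(sum(freq for _, freq, nm in rows if nm < threshold_nm), 2)


print(accumulated_frequencies(TABLE_1))
print(coverage_below(TABLE_1, 500.0))  # assumed cutoff, not from the text
print(coverage_below(TABLE_1, 50.0))   # assumed cutoff, not from the text
```

A summary of this kind indicates what share of the observed sequences in a block is covered by peptides predicted to bind at a chosen cutoff.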

2 Materials and Methods

2.1 Variability and conservation metrics

Calculation of information content of individual positions in an MSA of homologous protein sequences is based on Shannon entropy [2]. Similarly, Shannon entropy can be calculated for each motif within a defined block. Each block contains W unique motifs of length l in a dataset of N sequences. The formula used for the calculation of block entropy is [7]:

$$H(B_x) = -\sum_{w=1}^{W} P_w(x)\,\log_2\big(P_w(x)\big) \tag{1}$$
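A minimal sketch of Eq. (1) is given here for illustration. The function name and the example motifs are hypothetical, and the motif frequency is assumed to be the count of a motif divided by the number of sequences in the block.

```python
from collections import Counter
from math import log2


def block_entropy(blocks):
    """Shannon entropy (bits) over the unique motifs observed in one block, Eq. (1)."""
    counts = Counter(blocks)
    n = sum(counts.values())
    return -sum((c / n) * log2(c / n) for c in counts.values())


# Hypothetical block: four sequences, three unique motifs (frequencies 0.5, 0.25, 0.25).
example = ["SLYQ", "SLYQ", "FLYA", "ALYH"]
print(block_entropy(example))  # 1.5
```

A fully conserved block contains a single motif with frequency 1 and therefore has zero entropy; entropy grows as more motifs share the block at comparable frequencies.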

Where H(Bx) is the total entropy of a block of motifs starting at position x, and w is a unique motif in the space of W unique motifs in block Bx. Pw(x) is the frequency of motif w at position x. In standard sequence logos, the theoretical maximum entropy of single position in a protein sequence is log220 ≈ 4.32 bits (corresponding to equal representation of all 20 amino acids), so each amino acid in a position can be represented by its fractional information content of that position (4.32 - H(x)). The theoretical maximum entropy of a block is 34.58 bits for 8-mer motifs (log2208), ~38.90 bits for 9-mers (log2209), and ~43.22 bits for 10-mers (log22010). The maximum bit value on the Y-axis is calculated according to the input alphabet (RNA, DNA, or amino acids) and the selected block length. The height of displayed BlockLogo is scaled to match the height of the sequence logo. The fractional information content of each unique motif, w, in each block, Bx, is calculated using the formula [7]:

$$H(w) = P_w(x)\,H(B_x) \qquad (2)$$

where H(w) is the entropy of peptide w; P_w(x) is the frequency of peptide w; and H(B_x) is the total information content of the block B starting at position x in the MSA. The blocks (peptides, nucleotide fragments, or discontinuous motifs) are displayed in order from the most frequent to the least frequent, starting from the base of the X-axis.

2.2 Prediction of T-cell epitopes

The HLA binding affinity of peptides is predicted using NetMHC 3.0 [23] and NetMHCII 2.2 [24]. These algorithms were chosen based on


their high accuracy, as determined in our previous studies of online HLA binding prediction servers [25–27]. When the HLA binding prediction option is selected, NetMHC is used to predict HLA class I binders if the selected block is 8-11 residues long, and HLA class II binders if the selected block is 13-25 residues long.

2.3 Visualization of continuous peptides

Conservation of continuous immunological motifs, such as T-cell epitopes, can be easily visualized and characterized using BlockLogo. To display this information, the user must submit an MSA of homologous protein sequences and select a continuous range for visualization. The hemagglutinin (HA) sequences used to generate the examples presented in this article were collected from FluKB (http://research4.dfci.harvard.edu/cvc/flukb/). All the sequences were aligned using MAFFT [28]. Example data and their outputs are available at http://research4.dfci.harvard.edu/cvc/blocklogo/HTML/examples.php and at the mirror site http://met-hilab.bu.edu/blocklogo/HTML/examples.php.

2.4 Visualization of discontinuous peptides

In some cases the investigated motifs within an MSA are not linear peptides. For example, the residues forming a B-cell epitope are typically discontinuous positions within a protein sequence. To display a discontinuous motif, the user needs to define the set of MSA positions selected for visualization. Once the epitope positions are indicated in the uploaded MSA, discontinuous epitopes are extracted from the sequences, converted into virtual strings, and then processed by both BlockLogo and WebLogo, enabling cross-comparison. The discontinuous BlockLogo and sequence logo have the MSA positions indicated below the stacked logos. As examples of discontinuous motifs, the epitope of the neutralizing HA antibody F10 and the β chain of HLA-DRB1 binding pocket 1 were visualized. Information on the neutralizing antibody F10 and the validated strains was collected from the literature [29]. B-cell epitopes were defined using two measurements: the accessible surface area (ASA) loss [30] and the minimum distance between antibody and antigen atoms [31]. Residues with more than 20% ASA loss between the HA monomer and the HA/antibody complex, and residues with atoms located within 4 Å of the F10 antibody atoms, were considered to be part of the B-cell epitope. The F10 neutralizing epitope was defined from the F10-HA structure (PDB ID: 3FKU) (Figures 3 and 4). The HA protein sequence in FluKB with the highest similarity to the HA sequence in the F10-HA structure was chosen using a BLAST search [32]. The MAFFT tool [28] was used to generate the MSA of all HA


proteins in FluKB (29,113 complete HA protein sequences). The epitope positions defined by Sui et al. [29] were mapped to the MSA. A motif was then extracted for each sequence from the residues at these positions. Information on the HLA-DRB1 binding pocket was collected from the literature [33], and the HLA-DRB1 sequences were extracted from the IMGT/HLA database [34]. The example data and their output are available on the BlockLogo web site under the example tab.
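The extraction step described above (mapping a set of MSA positions to one virtual string per sequence, then counting the resulting motifs) can be sketched as follows. This is a minimal illustration, not the BlockLogo implementation, which is written in Perl. The function names, the "-" gap character, and the toy alignment are assumptions made for the example; the sequences are invented and are not FluKB data.

```python
from collections import Counter

def extract_motifs(aligned_seqs, positions, skip_gaps=True, gap="-"):
    """Build one virtual string per sequence from 1-based MSA columns.

    Hypothetical helper: `positions` may be contiguous (a block) or
    scattered (a discontinuous motif such as a B-cell epitope).
    """
    motifs = []
    for seq in aligned_seqs:
        motif = "".join(seq[p - 1] for p in positions)
        if skip_gaps and gap in motif:
            continue  # drop motifs with a gap in any selected column
        motifs.append(motif)
    return motifs

def motif_frequencies(motifs):
    """Return (motif, count, percent) tuples, most frequent first."""
    counts = Counter(motifs)
    total = sum(counts.values())
    return [(m, n, 100.0 * n / total) for m, n in counts.most_common()]

# Toy alignment (invented sequences), with columns 2, 3 and 6 selected.
toy_msa = ["MKTHLV", "MKTHLV", "MKSHLV", "MRTHLV", "MK-HLV"]
table = motif_frequencies(extract_motifs(toy_msa, [2, 3, 6]))
for motif, count, percent in table:
    print(motif, count, f"{percent:.2f}%")
```

Passing `skip_gaps=False` keeps the gapped sequence, which changes both the number of motifs and their frequencies, so the gap policy should be fixed before frequencies are compared across blocks.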

Figure 3: (A) The structure of influenza A HA protein with the neutralizing antibody F10 (PDB ID: 3FKU) and its conformational epitope shown in pink, corresponding to residues 12, 32, 34, 36, 292, 293, 294 and 319 in chain A, and 18, 19, 20, 21, 38, 41, 42, 45, 49, 52, 53 and 56 in chain B. (B) The discontinuous epitope on HA protein recognized by F10. In both WebLogos and BlockLogos, the colors of the amino acids correspond to their chemical properties; polar amino acids (G, S, T, Y, C, Q, and N) are shown in green, basic amino acids (K, R, and H) are shown in blue, acidic amino acids (D and E) are shown in red, and hydrophobic amino acids (A, V, L, I, P, W, F, and M) are shown in black.

2.5 Software implementation

BlockLogo is written in Perl and uses Encapsulated PostScript format. The logos are created from open source templates available through the WebLogo web site [9]. The program uses the open source package ImageMagick (www.imagemagick.org) to convert the images to the supported formats.


3 User interface

The user is prompted to copy/paste an MSA, or upload a file containing an MSA, in standard FASTA or ClustalW format. Users can select a block from the MSA by specifying the start and end positions of the subset, or a series of individual positions corresponding to the positions of a discontinuous motif. Motifs that have a gap in any of the positions within the specified range are excluded by default. In the analysis of discontinuous motifs, sequences with gaps in the specified positions are included in the analysis if the user selects this option.

Motifs of low frequency may not be visible when displayed by BlockLogo; this is a property of all logo visualizations. If the image height in pixels multiplied by the percent occurrence of a motif is less than 3, the low resolution of these low-frequency motifs makes it difficult to see them within the logo. We therefore enabled the user to define image options (image format and size in pixels), which alter the appearance and resolution of the logo and the resulting size of the image file. The logo on the results page is scaled to fit the size of the browser window. Clicking the logo on the results page displays the logo in the user-defined format and size to enable generation of publication-quality figures. A standard sequence logo generated using a locally installed WebLogo is printed below the BlockLogo to enable comparison of the two images.

On-the-fly prediction of HLA class I and II binding affinities in the defined block can be performed by selecting the “predict epitopes in block” option together with user input of target HLA alleles. The results are displayed as a listing of HLA alleles including a table with detailed information on the predicted epitopes. The home page of the BlockLogo server is shown in Figure 1.

4 Example applications

4.1 Conservation of influenza A T-cell epitopes

To illustrate the utility of BlockLogo, we analyzed a block of peptides in 29,113 influenza virus HA protein sequences, containing approximately 36.1 bits of information. All peptides in the block of 10-mers starting at position 232 were predicted to bind to HLA-A*02:01 with similar affinities. The relative frequencies of individual peptides within the viral population cannot be determined from the standard sequence logo produced with WebLogo (Figure 2A), but are clear from the BlockLogo (Figure 2B). Table 1 lists seven different


peptides visualized in Figure 2 that represent the complete list of motifs found in the MSA, along with their individual frequencies, cumulative frequencies, and predicted binding affinities to HLA-A*02:01. The most frequent peptide in this block is present in 64.65% of the viral population; this information is not obvious from the standard sequence logo analysis. The combination of MHC binding prediction and BlockLogo visualization reveals this particular region to be highly antigenic, and thus potentially valuable in polyvalent vaccine designs. This peptide is not a known T-cell epitope, and it is of potential interest since it is predicted to bind HLA-A*02:01 with high affinity and is highly conserved among HA sequences.

4.2 Conservation of influenza A cross-neutralizing B-cell epitopes

BlockLogo can also be used to display motifs as virtual peptides composed of a selection of discontinuous sites within a protein. This function can be applied to visualize the conservation of B-cell epitopes, for example for the representation and characterization of the cross-neutralizing viral B-cell epitopes described in [35]. The discontinuous BlockLogo (Figure 3B) displays the diversity of residues forming the conformational epitope recognized by the broadly neutralizing antibody F10 [29], shown in Figure 3A and B. This BlockLogo shows the conservation/variability of F10 B-cell epitopes identified within the alignment of 29,113 full-length influenza HA protein sequences. The length of the block determines the information content of this motif, which is approximately 68 bits. The BlockLogo (Figure 4B) is a better indicator of the diversity of F10 neutralizing epitopes than the traditional sequence logo (Figure 4A), since it includes information about the frequency of actual sequences of naturally occurring epitopes. A description of the ten most frequent epitopes (discontinuous peptides), including the influenza subtype of origin and the status of experimental binding validation, is given in Table 2. The complete list of motifs in this example comprises 112 peptides. The details of this example, with the full list of motifs, can be accessed at the BlockLogo web site.

4.3 Variability of HLA-DRB1 binding pocket P1

The usage of BlockLogo can easily be extended beyond T- and B-cell epitopes to predict and visualize other peptide-protein interactions, and structural and functional motifs. For example, BlockLogo can be used to visualize variation in known structural motifs, such as HLA class II binding pocket 1 (P1) of HLA-DR, defined by the variable β1 chain and the invariant HLA-DR α chain. Pocket P1 accommodates the primary anchor of class II HLA-DR binding peptides. Positions that


Figure 4: (A) Sequence logo of neutralizing epitopes for the broadly neutralizing antibody F10 within 29,113 influenza A virus HA proteins. (B) BlockLogo of the discontinuous residues representing the F10 neutralizing epitope. The numbering in these two figures corresponds to the residue positions in the MSA of the HA proteins. Table 2 shows the motif frequencies along with the corresponding neutralization assay results.

Table 2: Ten most frequent influenza A HA discontinuous peptides on neutralizing epitope region recognized by neutralizing antibody F10 in FluKB (29,113 complete HA protein sequences). The table shows the amino acids of the epitope, HA subtype, frequency within the data set, and validation status - escape variants are those strains not neutralized by the F10.

| # | Discontinuous peptide | Subtype | Frequency (%) | Accumulated frequency (%) | Validation |
|---|---|---|---|---|---|
| 1 | HHVLSLPTVDGWLTQITVNI | H1 | 24.61 | 24.61 | N/A |
| 2 | HNTLDKPTVDGWLTQINLNI | H3 | 15.64 | 40.25 | Escape |
| 3 | HHQISMPTVDGWKTQITVNI | H5 | 13.49 | 53.74 | Neutralized |
| 4 | HHVLSLPTVDGWQTQITVNI | H1 | 7.61 | 61.35 | N/A |
| 5 | QHKLTLPVVAGWRTQITVNV | H9 | 3.93 | 65.28 | Neutralized |
| 6 | HNTLDKPTIDGWLTQINLNI | H3 | 3.51 | 68.79 | N/A |
| 7 | THALSKPNIAGWLTQITLNS | B | 2.95 | 71.74 | N/A |
| 8 | HHVLSLPTIDGWQTQITVNI | H1 | 2.61 | 74.35 | Neutralized |
| 9 | HHVLNKTTIDGWRTQITVNI | H6 | 1.99 | 76.34 | N/A |
| 10 | HTQLTKPTIDGWLTQINLNI | H4 | 1.59 | 77.93 | N/A |
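The frequencies in Table 2 can be regrouped to see how much of the dataset falls under each validation status or HA subtype. The sketch below is only a worked illustration: the tuples are transcribed from Table 2, while the names `TABLE2` and `coverage_by` are hypothetical and not part of BlockLogo. Because the table lists only the ten most frequent of the 112 motifs, the totals cover about 78% of the sequences.

```python
from collections import defaultdict

# (subtype, frequency in %, validation status) for the ten motifs in Table 2
TABLE2 = [
    ("H1", 24.61, "N/A"), ("H3", 15.64, "Escape"), ("H5", 13.49, "Neutralized"),
    ("H1", 7.61, "N/A"), ("H9", 3.93, "Neutralized"), ("H3", 3.51, "N/A"),
    ("B", 2.95, "N/A"), ("H1", 2.61, "Neutralized"), ("H6", 1.99, "N/A"),
    ("H4", 1.59, "N/A"),
]

def coverage_by(rows, key_index):
    """Sum the frequency column of `rows`, grouped by the chosen column."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[1]
    return {key: round(value, 2) for key, value in totals.items()}

by_status = coverage_by(TABLE2, 2)   # grouped by validation status
by_subtype = coverage_by(TABLE2, 0)  # grouped by HA subtype
print(by_status)
print(by_subtype)
```

Grouped this way, the motifs with a "Neutralized" status account for roughly 20% of the sequences and the single "Escape" motif for about 16%, while most of the remaining listed motifs have no validation data.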


define binding pockets for a large number of HLA-DR molecules were described earlier [33]. These sequence motifs can be used to determine preferences for the primary anchor residue of binding peptides and shared specificities. The variability of the HLA-DRB1 P1 pocket sequences among all known HLA-DRB1 alleles [34] is visualized in Figure 5. Table 3 lists these motifs, their frequencies in the population, and the serogroups in which they are found. Six variable positions (positions 81, 82, 85, 86, 89 and 90 in the alignment) constitute the pocket P1. Of 959 HLA-DRB1 protein sequences containing 23.9 bits of information, three motifs (HNVVFT, HNVGFT, and HNAVFT) account for 97% of the HLA-DRB1 sequences, seven motifs are each represented by a set of 2-5 sequences, and seven motifs are represented by a single sequence (Table 3). The vast majority of alleles from a particular serogroup contain a major motif (approximately 90% of the alleles) and a small number (approximately 10%) have a minor motif. Motif HNVVFT is a major signature of the DRB1*03, 13, 14, and 15 serogroups; HNVGFT is a major signature of the DRB1*01, 04, 07, 08, 09, 10, 11, and 16 serogroups; and HNAVFT is a major signature of the DRB1*12 serogroup. In addition, motif HNVVFT is a minor signature of the DRB1*04 and 11 serogroups; HNVGFT is a minor signature of the DRB1*03, 14 and 15 serogroups; and HNAVFT is a minor signature of the DRB1*01 serogroup. Other motifs are observed in HLA alleles that are extremely rare in the general population (less than 1%). These results show that the fine specificity of primary anchor binding is determined by three major structural motifs, and that these motifs are unequally distributed between the serogroups.

Figure 5: Visualization of diversity of binding pocket 1 β chain in DRB1 alleles using sequence logo (A) and BlockLogo (B). See Table 3 for the frequencies and alleles for each motif.


Table 3: Frequency and allele distribution of discontinuous motifs in binding pocket 1 β chain of the DRB1 protein from the MSA of 947 DRB1 sequences. The “NA” stands for rare alleles where major and minor serogroups could not be defined, with their observed serotypes given in the brackets. These rare alleles represent 2.85% of all HLA-DRB1 sequences that are likely to have different binding specificities of peptide repertoires.

| # | Discontinuous peptide | Frequency (%) | Accumulated frequency (%) | Number of sequences | DRB1 serogroup signatures |
|---|---|---|---|---|---|
| 1 | HNVVFT | 45.62 | 45.62 | 432 | 03, 13, 14, 15 (major); 04, 11 (minor) |
| 2 | HNVGFT | 45.51 | 91.13 | 431 | 01, 04, 07, 08, 09, 10, 11, 16 (major); 03, 14, 15 (minor) |
| 3 | HNAVFT | 6.02 | 97.15 | 57 | 12 (major); 01 (minor) |
| 4 | YNVVFT | 0.53 | 97.68 | 5 | NA (04, 14, 15) |
| 5 | YNVGFT | 0.42 | 98.10 | 4 | NA (04, 11, 15) |
| 6 | HNVDFT | 0.32 | 98.42 | 3 | NA (07, 11, 13) |
| 7 | HSVVFT | 0.21 | 98.63 | 2 | NA (03, 13) |
| 8 | HNVSFT | 0.21 | 98.84 | 2 | NA (08, 13) |
| 9 | HNVAFT | 0.21 | 99.05 | 2 | NA (03) |
| 10 | HNVMFT | 0.21 | 99.26 | 2 | NA (13, 14) |
| 11 | RNVVFT | 0.11 | 99.37 | 1 | NA (15) |
| 12 | HNFGFT | 0.11 | 99.47 | 1 | NA (13) |
| 13 | HNLGFT | 0.11 | 99.58 | 1 | NA (11) |
| 14 | QNVGFT | 0.11 | 99.68 | 1 | NA (11) |
| 15 | HNIGFT | 0.11 | 99.79 | 1 | NA (11) |
| 16 | HNIVFT | 0.11 | 99.89 | 1 | NA (03) |
| 17 | DNVGFT | 0.11 | 100.00 | 1 | NA (01) |
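As a worked illustration of the block entropy in equation (1), the sequence counts in Table 3 can be plugged in directly. The sketch below is an assumption-laden example rather than the BlockLogo code: the function names are hypothetical, the counts are transcribed from the table, and a 20-letter amino acid alphabet is assumed for the theoretical maximum. The result need not match the figure quoted in the text, which refers to a slightly different number of sequences.

```python
import math

# Number of sequences for each of the 17 motifs in Table 3, in table order
COUNTS = [432, 431, 57, 5, 4, 3, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1]

def block_entropy(counts):
    """Equation (1): H(B_x) = -sum over motifs of P_w(x) * log2(P_w(x))."""
    total = sum(counts)
    return -sum((n / total) * math.log2(n / total) for n in counts if n > 0)

def max_block_entropy(length, alphabet_size=20):
    """Theoretical maximum entropy, log2(alphabet_size ** length)."""
    return length * math.log2(alphabet_size)

h = block_entropy(COUNTS)                         # observed entropy of the block
h_max = max_block_entropy(6)                      # six positions form pocket P1
top3_share = sum(COUNTS[:3]) / sum(COUNTS)        # share of the three major motifs
print(f"H = {h:.2f} bits, maximum = {h_max:.2f} bits, top 3 = {top3_share:.1%}")
```

With these counts the block entropy comes to about 1.5 bits against a theoretical maximum of about 25.9 bits for six positions, which is what one would expect when three motifs account for roughly 97% of the sequences.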

5 Conclusion and discussion

BlockLogo is a novel sequence logo tool optimized for the visualization of user-defined continuous and discontinuous motifs, fragments, and peptides. Paired with the prediction of HLA binding, BlockLogo is a useful tool for the rapid assessment of the immunological potential of selected regions within an MSA, such as those containing human pathogen sequences or tumor antigen alignments. The BlockLogo tool provides an easily interpretable visual representation of the immunological status and frequency for each predicted epitope. The observed frequencies of epitopes and their corresponding receptors (T or B-cell receptors) are vital for vaccine design, since the selection and combination of targets determine the pathogen coverage and the host population coverage of the vaccine.


Continuous epitopes are recognized as peptides rather than as individual amino acids, yet traditional sequence logos do not show the specific peptides found in the analyzed sequences or their corresponding frequencies. BlockLogo thus provides a more precise and more informative representation of these motifs. Experimental approaches for the identification and validation of sequence motifs useful as vaccine targets involving multiple HLA alleles and pathogen proteomes are laborious and costly. BlockLogo complements wet lab experimental methods by enabling pre-screening of key antigenic regions that are likely to contain vaccine targets. Additionally, BlockLogo can be used to visualize the variability of discontinuous motifs, such as B-cell epitopes, protein-protein interaction sites, and receptor-ligand sites. To our knowledge, BlockLogo is the first logo generator that allows users to create logos based on a custom character set, consisting of either continuous or discontinuous motifs. BlockLogo is available at http://research4.dfci.harvard.edu/cvc/blocklogo and http://met-hilab.bu.edu/blocklogo/.

6 References

1. Schneider TD, Stephens RM: Sequence logos: a new way to display consensus sequences. Nucleic Acids Res. 1990, 18:6097–100.
2. Shannon CE: A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27:379–423, 623–656.
3. Wade JT, Hall DB, Struhl K: The transcription factor Ifh1 is a key regulator of yeast ribosomal protein genes. Nature 2004, 432:1054–8.
4. Goll MG, Bestor TH: Eukaryotic cytosine methyltransferases. Annu. Rev. Biochem. 2005, 74:481–514.
5. Mahrus S, Trinidad JC, Barkan DT, Sali A, Burlingame AL, Wells JA: Global sequencing of proteolytic cleavage sites in apoptosis by specific labeling of protein N termini. Cell 2008, 134:866–76.
6. Bryson S, Julien J-P, Hynes RC, Pai EF: Crystallographic definition of the epitope promiscuity of the broadly neutralizing anti-human immunodeficiency virus type 1 antibody 2F5: vaccine design implications. J. Virol. 2009, 83:11862–75.


7. Olsen LR, Zhang GL, Keskin DB, Reinherz EL, Brusic V: Conservation analysis of dengue virus T-cell epitope-based vaccine candidates using peptide block entropy. Front. Immunol. 2011, 2:1–15.
8. Sun Z-YJ, Oh KJ, Kim M, Yu J, Brusic V, Song L, Qiao Z, Wang J, Wagner G, Reinherz EL: HIV-1 broadly neutralizing antibody extracts its epitope from a kinked gp41 ectodomain region on the viral membrane. Immunity 2008, 28:52–63.
9. Crooks GE, Hon G, Chandonia J-M, Brenner SE: WebLogo: a sequence logo generator. Genome Res. 2004, 14:1188–90.
10. Gorodkin J, Heyer LJ, Brunak S, Stormo GD: Displaying the information contents of structural RNA alignments: the structure logos. Comput. Appl. Biosci. 1997, 13:583–6.
11. Workman CT, Yin Y, Corcoran DL, Ideker T, Stormo GD, Benos PV: enoLOGOS: a versatile web tool for energy normalized sequence logos. Nucleic Acids Res. 2005, 33:W389–92.
12. Vacic V, Iakoucheva LM, Radivojac P: Two Sample Logo: a graphical representation of the differences between two sets of sequence alignments. Bioinformatics 2006, 22:1536–7.
13. Bindewald E, Schneider TD, Shapiro BA: CorreLogo: an online server for 3D sequence logos of RNA and DNA alignments. Nucleic Acids Res. 2006, 34:W405–11.
14. Shih AC-C, Lee DT, Peng C-L, Wu Y-W: Phylo-mLogo: an interactive and hierarchical multiple-logo visualization tool for alignment of many sequences. BMC Bioinformatics 2007, 8:63.
15. Li W, Yang B, Liang S, Wang Y, Whiteley C, Cao Y, Wang X: BLogo: a tool for visualization of bias in biological sequences. Bioinformatics 2008, 24:2254–5.
16. Chang T-H, Horng J-T, Huang H-D: RNALogo: a new approach to display structural RNA alignment. Nucleic Acids Res. 2008, 36:W91–6.
17. Oliva R, Thornton JM, Pellegrini-Calace M: PoreLogo: a new tool to analyse, visualize and compare channels in transmembrane proteins. Bioinformatics 2009, 25:3183–4.


18. Colaert N, Helsens K, Martens L, Vandekerckhove J, Gevaert K: Improved visualization of protein consensus sequences by iceLogo. Nat. Methods 2009, 6:786–7.
19. Thomsen MCF, Nielsen M: Seq2Logo: a method for construction and visualization of amino acid binding motifs and sequence profiles including sequence weighting, pseudo counts and two-sided representation of amino acid enrichment and depletion. Nucleic Acids Res. 2012, 40:W281–7.
20. Menzel P, Seemann SE, Gorodkin J: RILogo: visualizing RNA-RNA interactions. Bioinformatics 2012, 28:2523–6.
21. Sharma V, Murphy DP, Provan G, Baranov PV: CodonLogo: a sequence logo-based viewer for codon patterns. Bioinformatics 2012, 28:1935–6.
22. Reinherz EL, Tan K, Tang L, Kern P, Liu J, Xiong Y, Hussey RE, Smolyar A, Hare B, Zhang R, Joachimiak A, Chang HC, Wagner G, Wang J: The crystal structure of a T cell receptor in complex with peptide and MHC class II. Science 1999, 286:1913–21.
23. Lundegaard C, Lund O, Nielsen M: Prediction of epitopes using neural network based methods. J. Immunol. Methods 2011, 374:26–34.
24. Nielsen M, Lundegaard C, Lund O: Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC Bioinformatics 2007, 8:238.
25. Lin HH, Ray S, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. 2008, 9:8.
26. Lin HH, Zhang GL, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics 2008, 9 Suppl 12:S22.
27. Zhang GL, Ansari HR, Bradley P, Cawley GC, Hertz T, Hu X, Jojic N, Kim Y, Kohlbacher O, Lund O, Lundegaard C, Magaret CA, Nielsen M, Papadopoulos H, Raghava GPS, Tal V-S, Xue LC, Yanover C, Zhu S, Rock MT, Crowe JE, Panayiotou C, Polycarpou MM, Duch W, Brusic V: Machine learning competition in immunology - Prediction of HLA class I binding peptides. J. Immunol. Methods 2011, 374:1–4.


28. Katoh K, Standley DM: MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 2013.
29. Sui J, Hwang WC, Perez S, Wei G, Aird D, Chen L, Santelli E, Stec B, Cadwell G, Ali M, Wan H, Murakami A, Yammanuru A, Han T, Cox NJ, Bankston LA, Donis RO, Liddington RC, Marasco WA: Structural and functional bases for broad-spectrum neutralization of avian and human influenza A viruses. Nat. Struct. Mol. Biol. 2009, 16:265–73.
30. Chothia C: Hydrophobic bonding and accessible surface area in proteins. Nature 1974, 248:338–9.
31. McConkey BJ, Sobolev V, Edelman M: Discrimination of native protein structures using atom-atom contact scoring. Proc. Natl. Acad. Sci. U. S. A. 2003, 100:3215–20.
32. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990, 215:403–10.
33. Chelvanayagam G: A roadmap for HLA-DR peptide binding specificities. Hum. Immunol. 1997, 58:61–9.
34. Robinson J, Halliwell JA, McWilliam H, Lopez R, Parham P, Marsh SGE: The IMGT/HLA database. Nucleic Acids Res. 2013, 41:D1222–7.
35. Xu R, Ekiert DC, Krause JC, Hai R, Crowe JE, Wilson IA: Structural basis of preexisting immunity to the 2009 H1N1 pandemic influenza virus. Science 2010, 328:357–60.


Paper IV

TANTIGEN: a tumor antigens database and analysis platform for vaccine target discovery

Cancer Research (submitted)

Songsak Tongchusak1, Guang Lan Zhang1,2,6*, Lars Rønn Olsen1,3,4, Hong Huang Lin1,5, Ellis L. Reinherz1,6,7, Vladimir Brusic1,2,6

1Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
2Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA
3Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
4Biotech Research and Innovation Center (BRIC), University of Copenhagen, Copenhagen, Denmark
5Department of Medicine, Boston University School of Medicine, Boston, MA, USA
6Department of Medicine, Harvard Medical School, Boston, MA, USA
7Laboratory of Immunobiology, Dana-Farber Cancer Institute, Boston, MA, USA
*Corresponding author


Abstract

Tumor antigens (TAs) represent a major set of both diagnostic and therapeutic targets. We have cataloged 4014 TA entries representing variants of 258 unique protein TAs reported in the literature. KB-builder, an in-house database development framework, was deployed to construct the Tumor T cell Antigen knowledge base, named TANTIGEN. Each record contains information on the TA sequence, variants (splice isoforms and mutation variants), known T-cell epitopes and HLA ligands, and literature references. A set of computational tools for in-depth analysis of TAs has been integrated into TANTIGEN. These tools include TA classification, TA nomenclature, sequence comparison using BLAST search, multiple alignments of antigens, mutation mapping, and T-cell epitope/HLA ligand visualization. Predicted Class I and Class II HLA binding peptides for 15 common HLA alleles are included in this database as putative targets. TANTIGEN provides a rich data source and an advanced analysis platform for cancer vaccine target discovery, accessible at http://cvc.dfci.harvard.edu/tadb/ or at the mirror site http://met-hilab.bu.edu/tantigen/.


1 Introduction

Tumor-derived molecules that interact with cells and products of the immune system are known as tumor antigens (TAs) [1]. Tumor-specific antigens (neoantigens) are found in tumors and not in normal tissue, while tumor-associated antigens show increased expression in tumor cells relative to their healthy counterparts. TAs thus represent markers that are either specific for individual tumors or overexpressed in tumors relative to normal tissues [2], and they constitute a major set of both diagnostic and therapeutic targets. TAs can be processed intracellularly and presented as peptides to T cells of the immune system for recognition, followed by immune responses. This process relies on antigen processing and presentation pathways in which major histocompatibility complex (MHC) molecules bind peptides and present them on the surface of cancer cells [3]. T cells play a significant role in tumor rejection, and they have been tested as therapeutic agents that target TAs in a large number of clinical trials [4, 5]. One of the most important observations of tumor immunology is that the immune system is able to discriminate between normal and tumor cells. Since MAGEA1 was identified as a TA recognized by cytolytic T lymphocytes on human melanoma, a large number of TAs have been identified [6]. TAs have been studied as diagnostic and therapeutic targets in many clinical trials: nearly 1% of all trials reported in clinicaltrials.gov (1,395 of 149,048 trials as of July 2013). Among these, 613 entries, involving 30 cancer types, were vaccine trials, the majority of which were in Phase II. Improved clinical outcomes following immunotherapy have been demonstrated in some patients [7–10], but tumors deploy a range of mechanisms to preclude efficient immune responses, such as impairment of efficient presentation of TAs, negative immune regulation, or induction of tolerance [11]. To address these issues, combinatorial approaches to tumor therapy are needed for clinical efficacy [12]. Peptide-based vaccine strategies involve multivalent long peptide constructs, multi-peptide vaccines that combine cytotoxic and helper epitopes, peptide cocktail constructs, peptide vaccines fused with other epitopes, personalized peptide vaccines, and peptide-pulsed dendritic cell vaccines. Peptide vaccines were shown to induce cytotoxicity, but on their own their efficacy can be limited [13]. The factors responsible for the limited response include damage to immune regulation mechanisms, evasion of immune responses, induction of tolerance, interference with or damage to the mechanism of antigen processing and presentation, and others [11]. Another factor that influences immune regulation of TAs is immunological specificity, which depends on the genetic makeup and the specificity of the immune repertoire of individuals, as well as on the diversity of TAs.


Studies of anti-tumor vaccines in animal models hold promise for vaccines against human tumors. In controlled studies with mice, immunization against a variety of tumors was mostly effective and showed good success rates [14]. In contrast, when similar vaccine targets and similar immunization protocols were studied in human clinical trials, they proved effective against precancerous tumors, while the ability to protect against established cancers decreased as tumors progressed to an advanced stage [15]. Peptide vaccine trials by the NCI with 440 patients with metastatic cancer, including melanoma, renal cancer, ovarian cancer, colorectal cancer, and breast cancer, showed an overall objective response rate of 2.6% [16]. The lack of vaccine efficacy can be attributed to i) the high growth rate of tumors, which overcomes the rate of immune destruction, ii) the immunosuppressive activity of tumors, and iii) the down-regulation of immune responses [15]. Another question that needs to be considered is the specificity of targeting TAs. It is important to ensure that the diversity of TAs is properly characterized within the context of an individual’s immune profile before TAs can be considered for immune targeting.

A number of studies have defined TAs that are candidate vaccine targets. Unique TAs have been proposed as targets for immunotherapies [4]. Unique antigens are molecules derived from somatic mutation of proteins expressed in tumor cells, or alternative protein products derived from RNA splicing [17]. These antigens show a strong ability to elicit antigen-specific T-cell responses in ex vivo studies. Until recently, however, vaccination and tumor immunotherapy had not been effective; the dendritic cell vaccine Provenge has since been approved by the FDA, and a number of phase II-III clinical trials are in progress [18]. Oncoantigens, the molecules that support tumorigenesis, have been proposed as preferable candidate targets for cancer vaccines, and it has been proposed that targeting them reduces immune evasion [19]. Given all these open issues, the identification and detailed characterization of vaccine targets is a main bottleneck in tumor vaccine development. Technical advances in instrumentation, sample processing, immunological assays, and bioinformatics techniques have generated large amounts of immunological data, including experimentally identified TAs and T-cell epitopes, novel tumor biomarkers identified through DNA or protein arrays, and differentially expressed genes that may be involved in tumorigenesis.

Several data sources provide information on tumor T-cell antigens. The Cancer Immunity Peptide Database, developed by the Ludwig Institute for Cancer Research, defines four data tables containing 150 TAs with defined T-cell epitopes: 56 TAs resulting from mutations, 31 shared tumor-specific antigens, 12 differentiation antigens, and 51 antigens overexpressed in tumors [20]. A list of human TAs reported in the literature as of February 2004 has also been collected [21]. The list includes T-cell-defined epitopes, while analogs, artificially modified epitopes, virus-encoded antigens, and antibody-recognized antigens were excluded. These data sources provide valuable data on human tumor T-cell antigens, but they are not up to date and do not provide any bioinformatics tools for data analysis. To bridge the gap between data and knowledge and to facilitate vaccine target discovery, we have developed TANTIGEN, a molecular database of human tumor T-cell antigens. It is a primary data source and an analysis platform for cancer vaccine target discovery, focusing on human TAs that contain HLA ligands and T-cell epitopes. TANTIGEN provides data and analyses that suggest which TAs could be used in clinical studies, and it helps with the assessment of their immunogenic and clinical potential for active and adoptive immunotherapy of human cancer. To our knowledge, it is the first comprehensive database focusing on the analysis of tumor T-cell antigens that is integrated with in silico tools for the analysis of diversity and immunological potential. TANTIGEN is accessible at http://cvc.dfci.harvard.edu/tadb/ or at the mirror site http://met-hilab.bu.edu/tantigen/.

2 Methods

2.1 Data collection

Tumor T-cell antigens were collected from reports of experimentally characterized T-cell epitopes and/or HLA ligands. We used two criteria for the selection of tumor T-cell antigens: first, the antigen must be presented by defined HLA alleles, and second, the antigen must be recognized by T cells. The term “recognized by T cells” means that recognition must have been validated experimentally through in vivo or ex vivo studies. Peptides shown to stimulate T-cell responses are called T-cell epitopes. Peptides that show high binding affinity in peptide binding assays are called HLA binders or HLA ligands. Peptides eluted from HLA in mass spectrometry studies are called naturally processed HLA binders or ligands.

We used two methods for retrieving T-cell antigen data. First, data were collected from the established data sources listed in Table 1. Second, data from PubMed were collected manually or by semi-automated literature mining using text mining techniques for literature classification [22]. The names of TAs and their synonyms were collected from the GeneCards database [23], and the antigen names used in this database were assigned based on the guidelines of the HUGO Gene Nomenclature Committee [24]. If names and synonyms were missing from GeneCards, naming was based on the names reported in the original articles. Gene and protein identification numbers were collected from GenBank [25], GeneCards [23], and UniProt [26]. The corresponding antigen sequences were gathered from UniProt or manually extracted from the literature for all antigens except unique antigens; these proteins normally generate epitopes through substitution mutations, and therefore their normal protein sequences were not included in the database. Substitution mutation information was collected from the Catalogue of Somatic Mutations in Cancer (COSMIC) [27] and from UniProt variants. Sequence variants of TAs were additionally collected by sequence similarity searches against the UniProt database using the BLAST algorithm.

Table 1: List of data sources used for the collection of T-cell epitope and HLA ligand data.

| Data source | URL | Number of T-cell epitopes and HLA ligands |
| --- | --- | --- |
| Peptide database | http://cancerimmunity.org/peptide/ | 332 |
| Human tumor antigens recognized by T cells | http://www.istitutotumori.mi.it/INT/AreaProfessionale/Human_Tumor/ | 294 |
| SYFPEITHI | http://www.syfpeithi.de/home.htm | 140 |
| IEDB | http://www.immuneepitope.org | 62 |
| Charité | http://www.charite.de/ch/derm/ti/eng/index1.html | 9 |
| Literature | Retrieved from PubMed using antigen names | 597 |

2.2 Data annotation and organization

The collected data and information were manually curated for errors, inconsistencies, and ambiguous or conflicting information; these artifacts were identified and corrected. The annotated data were converted into a unified XML format. Three XML files were created for TANTIGEN, containing information on antigens, their T-cell epitopes, and their HLA ligands, respectively (Table 2).

Table 2: Antigen record table

| Field | Description |
| --- | --- |
| AgACC | Unique accession number for an antigen record in TANTIGEN |
| Date | Entry date of the record |
| Last updated | Latest date of information update on the record |
| Antigen Name | Protein name based on HUGO gene nomenclature guidelines |
| Common name | Name of the antigen used in the tumor immunology field |
| Full Name | Complete protein name from GeneCards or literature |
| Synonym | Other protein names or gene names of the antigen |
| Isoform Name | Systematic name of alternative products assigned by TANTIGEN (only available for full length antigen sequence) |
| Isoform Synonym | Other alternative product names of the antigen (only available for full length antigen sequence) |
| UniProt ID | Cross-reference to the corresponding record in the UniProt database, if available |
| NCBI Gene ID | Cross-reference to the corresponding record in the NCBI Gene database, if available |
| GeneCard ID | Cross-reference to the corresponding record in the GeneCards database, if available |
| COSMIC ID | Cross-reference to the corresponding record in the Catalogue of Somatic Mutations in Cancer (only available in some substitution mutation entries) |
| Gene expression profile | Gene expression profile suggested by analysis of EST counts from UniGene |
| Swiss-Prot VARIANT ID | Cross-reference to the corresponding record in the UniProt database, if available (only available in some substitution mutation entries) |
| Comment | Comment on the antigen, if available |
| Annotation | Annotation of the antigen sequence type: full length sequence or fragment |
| Isoforms | Accession numbers of the antigen's isoforms and comparison of the isoform sequences using multiple sequence alignment |
| Mutation entries | Accession numbers of substitution mutations and visualization of single amino acid changes in a mutation map |
| T-cell epitope | Table with the columns Epitope sequence (T-cell epitope sequence), Position (position of the epitope in the antigen sequence), HLA allele (restricted HLA allele), and Reference (PubMed link of the paper containing the T-cell epitope validation result) |
| HLA ligand | Table with the columns Ligand sequence (HLA ligand sequence), Position (position of the ligand in the antigen sequence), HLA allele (restricted HLA allele), and Reference (PubMed link of the paper containing the HLA ligand binding result) |
| Predicted HLA binders | Prediction of 9-mer peptide binding affinities to 15 HLA class I and class II alleles: A*0101, A*0201, A*0301, A*1101, A*2402, B*0702, B*0801, B*1501, DRB1*0101, DRB1*0301, DRB1*0401, DRB1*0701, DRB1*1101, DRB1*1301, and DRB1*1501 |
| Reference sequence | Protein sequence before substitution mutation (only available in substitution mutation entries) |
| Antigen sequence | Protein sequence of the antigen |
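As a minimal sketch of how a record with some of the fields in Table 2 might be serialized to XML, the following Python snippet builds one antigen record with the standard library. The element and attribute names, the helper `build_antigen_record`, and the annotation value are hypothetical; the actual TANTIGEN XML schema is not described in the text. The example values (Ag000039, TYR, and the HLA-A*0201 epitope YMNGTMSQV at positions 369-377) are taken from the Results.

```python
import xml.etree.ElementTree as ET


def build_antigen_record(fields, epitopes):
    """Build an <antigen> element from header fields and a list of epitope dicts.

    Element and attribute names are illustrative assumptions,
    not the schema used by TANTIGEN.
    """
    record = ET.Element("antigen", attrib={"accession": fields["accession"]})
    for tag in ("antigen_name", "common_name", "annotation"):
        ET.SubElement(record, tag).text = fields[tag]
    epitope_list = ET.SubElement(record, "t_cell_epitopes")
    for ep in epitopes:
        node = ET.SubElement(
            epitope_list,
            "epitope",
            attrib={"hla_allele": ep["hla_allele"], "position": ep["position"]},
        )
        node.text = ep["sequence"]
    return record


example = build_antigen_record(
    {
        "accession": "Ag000039",
        "antigen_name": "TYR",
        "common_name": "tyrosinase",
        "annotation": "full length sequence",  # assumed value for illustration
    },
    [{"sequence": "YMNGTMSQV", "position": "369-377", "hla_allele": "HLA-A*0201"}],
)
xml_text = ET.tostring(example, encoding="unicode")
```

Keeping epitopes as child elements of the antigen is only one possible layout; the text states that antigens, T-cell epitopes, and HLA ligands were stored in three separate XML files.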

2.3 Antigen isoform nomenclature and identification

In the TANTIGEN database, only full-length protein sequences received systematically designated isoform names. Naming was based on the human gene nomenclature guidelines of HUGO (see Results). TA splice isoforms among the sequences collected from TrEMBL were identified by sequence comparison against known antigen isoforms collected from UniProt and potential protein-coding sequence isoforms in AceView [28]. The isoform names present in those two datasets were gathered and organized in a table of isoform synonyms.

2.4 Data classification

TAs in this database were classified as described by Van den Eynde and van der Bruggen [17]. Antigens that remain unclassified were marked as "Unclassified". All TAs were arranged in a classification diagram in which each protein name is a link to the corresponding database entries.

2.5 Database construction

TANTIGEN was constructed using KB-builder, our in-house framework that streamlines the development and deployment of web-accessible immunological knowledge bases [29]. The web interface of TANTIGEN uses a set of graphical user interface forms with a combination of Perl, PHP, CGI, and C background scripts.

2.6 Basic analysis tools

To facilitate the analysis of TAs, a selection of basic bioinformatics tools was integrated within TANTIGEN: a basic keyword search that enables users to locate the subset of data or information of interest, BLAST (Basic Local Alignment Search Tool) for sequence homology search, and multiple sequence alignment for comparing multiple sequences [30]. To facilitate sequence similarity search, the collected TA protein sequences were organized in FASTA format and converted into a searchable format for use with the BLAST algorithm. MAFFT, a multiple sequence alignment (MSA) tool, was selected for its outstanding performance in terms of speed and alignment quality and was downloaded and installed locally [31].
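The step of organizing protein sequences in FASTA format can be illustrated with a short Python sketch. The function `to_fasta`, the header layout (accession followed by a free-text description), and the line width are assumptions made for illustration; the text does not specify how the TANTIGEN FASTA files were laid out. The example sequence is the first 30 residues of the TP53 reference sequence shown in Figure 3.

```python
def to_fasta(records, width=60):
    """Format (identifier, description, sequence) tuples as FASTA text.

    Each record becomes a '>' header line followed by the sequence
    wrapped to `width` residues per line.
    """
    lines = []
    for identifier, description, sequence in records:
        lines.append(f">{identifier} {description}".rstrip())
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# N-terminal fragment of the TP53 reference sequence (Ag000037) from Figure 3.
records = [("Ag000037", "TP53 fragment", "MEEPQSDPSVEPPLSQETFSDLWKLLPENN")]
fasta_text = to_fasta(records, width=20)
```

A file written this way could then be converted into a searchable BLAST database, for example with `makeblastdb` from NCBI BLAST+; which tool was actually used is not stated in the text.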
2.7 Specialized analysis tools

On-the-fly HLA binding prediction tools that predict peptide binding to 15 frequent HLA class I and class II alleles (A*0101, A*0201, A*0301, A*1101, A*2402, B*0702, B*0801, B*1501, DRB1*0101, DRB1*0301, DRB1*0401, DRB1*0701, DRB1*1101, DRB1*1301, DRB1*1501) were integrated in TANTIGEN to facilitate efficient antigenicity predictions. Class I predictions use netMHCpan 2.4a [32], and class II predictions use netMHCIIpan 2.0b [33]. These algorithms were chosen based on our previous studies of the accuracy of online HLA binding prediction servers [34–36]. We developed a set of visualization tools specifically for TANTIGEN, which display epitopes and provide a clear picture of published and experimentally verified T-cell epitopes. Additionally, an interactive visualization tool that displays a map of mutations in TA sequences was implemented in TANTIGEN to provide a global view of all mutations reported in a given TA.

3 Results

3.1 Data collection

A total of 954 T-cell epitopes and 480 HLA ligands were collected from six data sources (Table 1). The Peptide database of T-cell-defined TAs [20] and the dataset of human TAs recognized by T cells [21] were the richest organized sources of TA-related peptides. These T-cell epitopes and HLA ligands were categorized into five groups based on how the peptide entries were generated: a) peptides derived from TA sequences, b) peptides derived from mutations, c) peptides derived from alternative open reading frames (ORFs), d) peptides derived from chromosomal translocations (fusion proteins), and e) peptides derived from internal tandem repeats.
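As a quick consistency check, the per-source counts in Table 1 can be summed and compared with the totals reported here. The short Python sketch below copies the counts from Table 1; the dictionary name and the abbreviated label for the literature row are chosen for illustration only.

```python
# Number of T-cell epitopes and HLA ligands per data source, copied from Table 1.
table1_counts = {
    "Peptide database": 332,
    "Human tumor antigens recognized by T cells": 294,
    "SYFPEITHI": 140,
    "IEDB": 62,
    "Charité": 9,
    "Literature (PubMed)": 597,
}

total_peptides = sum(table1_counts.values())

# T-cell epitopes plus HLA ligands, as reported in Section 3.1.
reported_total = 954 + 480
```

Both values come to 1434, so the six rows of Table 1 account for all collected T-cell epitopes and HLA ligands.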

New TAs can be generated by several mechanisms. For example, an acquired somatic mutation at the DNA level can lead to protein product extensions containing T-cell epitopes, such as those found in caspase 8 [37]. Splice isoforms can produce alternative ORFs, which can yield the same or different T-cell epitopes. For example, the L552S splice isoform of XAGE-1 harbors an immunogenic region unique to this variant [38].

Some T-cell epitopes are produced by mutations that have not been recorded in the primary sequence databases. These altered TAs were named using the annotation in the original articles, such as 707-AP, ARTC1, and F4.2 [39–41]. We collected substitution mutation information for 136 TAs from the COSMIC and UniProt databases. Among these proteins, APC, CDKN2A, CTNNB1, EGFR, LDLR, TP53, and TYR had multiple mutation positions, resulting in 103, 239, 111, 169, 135, 1311, and 111 entries, respectively (Table 3). Gene ontology analysis of all TAs in TANTIGEN reveals that TAs are related to tube development, regulation of cell proliferation, blood vessel development, and cell cycle regulation (Figure 1).

Figure 1: Enriched gene ontology terms for genes in TANTIGEN (p-value < 0.01). The node size corresponds to the ratio of TAs in TANTIGEN with a given term, while edges connect ontologies that are directly related. Related ontologies are highlighted in similar colors.

[Figure 1 node labels: tube development; tube morphogenesis; organ morphogenesis; branching morphogenesis of a tube; positive regulation of cell proliferation; regulation of cell proliferation; negative regulation of cell proliferation; regulation of cell cycle; negative regulation of cell cycle; G1/S transition of mitotic cell cycle; regulation of cell cycle process; positive regulation of cell cycle; angiogenesis; blood vessel morphogenesis; blood vessel development; positive regulation of cell communication; pigment metabolic process involved in developmental pigmentation; apoptotic signaling pathway; negative regulation of heart induction by canonical Wnt receptor signaling pathway.]

Table 3: Number of mutation entries derived from COSMIC and UniProt for 136 unique tumor proteins in TANTIGEN.

| Antigen | COSMIC | UniProt | Total | Antigen | COSMIC | UniProt | Total | Antigen | COSMIC | UniProt | Total |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ABCC3 | 0 | 8 | 8 | FGF5 | 0 | 1 | 1 | PAX3 | 0 | 27 | 27 |
| ABL1 | 37 | 5 | 42 | FMOD | 0 | 1 | 1 | PGK1 | 0 | 11 | 11 |
| ACTN4 | 0 | 4 | 4 | FN1 | 0 | 9 | 9 | PRAME | 0 | 1 | 1 |
| AFP | 0 | 2 | 2 | FOLH1 | 0 | 3 | 3 | PRDX5 | 0 | 3 | 3 |
| AKAP13 | 1 | 12 | 13 | FUT1 | 0 | 11 | 11 | PRTN3 | 0 | 2 | 2 |
| ALDH1A1 | 1 | 1 | 2 | GPC3 | 0 | 1 | 1 | PSCA | 0 | 2 | 2 |
| ALK | 2 | 16 | 18 | GPNMB | 0 | 4 | 4 | PTHLH | 0 | 1 | 1 |
| AML1 | 49 | 3 | 52 | GPR143 | 0 | 24 | 24 | PTPRK | 2 | 1 | 3 |
| ANKRD30A | 0 | 3 | 3 | HHAT | 0 | 1 | 1 | PXDNL | 0 | 1 | 1 |
| ANXA2 | 0 | 1 | 1 | HMOX1 | 0 | 2 | 2 | RAB38 | 0 | 1 | 1 |
| APC | 81 | 22 | 103 | HPSE | 0 | 1 | 1 | RAGE | 1 | 5 | 6 |
| ATIC | 0 | 1 | 1 | HSPA1B | 0 | 4 | 4 | RARA | 1 | 0 | 1 |
| BAAT | 0 | 1 | 1 | IL13RA2 | 0 | 1 | 1 | RHAMM | 0 | 3 | 3 |
| BCL-2 | 0 | 4 | 4 | ITGB8 | 1 | 1 | 2 | RNF43 | 0 | 5 | 5 |
| BCR | 0 | 16 | 16 | ITPR2 | 5 | 0 | 5 | RPA1 | 0 | 1 | 1 |
| BIRC5 | 0 | 1 | 1 | KLK3 | 0 | 6 | 6 | RPL10A | 0 | 2 | 2 |
| BIRC7 | 0 | 1 | 1 | KLK4 | 0 | 3 | 3 | RPS2 | 0 | 1 | 1 |
| BRAF | 73 | 12 | 85 | KRAS | 46 | 5 | 51 | SAGE1 | 0 | 2 | 2 |
| BST2 | 0 | 1 | 1 | LCK | 0 | 3 | 3 | SART1 | 0 | 2 | 2 |
| CA9 | 0 | 1 | 1 | LDLR | 0 | 135 | 135 | SART3 | 0 | 2 | 2 |
| CALR3 | 0 | 3 | 3 | LPGAT1 | 0 | 1 | 1 | SCRN1 | 0 | 1 | 1 |
| CAN | 0 | 3 | 3 | LRP1 | 0 | 5 | 5 | SDCBP | 0 | 1 | 1 |
| CASP5 | 0 | 3 | 3 | MAGEA1 | 1 | 2 | 3 | SIRT2 | 0 | 1 | 1 |
| CASP8 | 2 | 3 | 5 | MAGEA10 | 0 | 1 | 1 | SNRPD1 | 0 | 1 | 1 |
| CCNI | 0 | 1 | 1 | MAGEA3 | 0 | 1 | 1 | SOX10 | 0 | 1 | 1 |
| CDC27 | 2 | 1 | 3 | MAGEA4 | 0 | 2 | 2 | STAT1 | 1 | 3 | 4 |
| CDK4 | 0 | 5 | 5 | MAGEB2 | 0 | 2 | 2 | STEAP1 | 0 | 4 | 4 |
| CDKN1A | 0 | 2 | 2 | MAGEC2 | 1 | 0 | 1 | TACSTD1 | 0 | 1 | 1 |
| CDKN2A | 210 | 29 | 239 | MC1R | 0 | 16 | 16 | TERT | 1 | 9 | 10 |
| CEACAM5 | 0 | 4 | 4 | MCL1 | 0 | 2 | 2 | TGFBR2 | 3 | 21 | 24 |
| CEL | 0 | 2 | 2 | ME1 | 0 | 1 | 1 | TOP2A | 1 | 3 | 4 |
| CLCA2 | 0 | 2 | 2 | MET | 0 | 29 | 29 | TOR3A | 0 | 2 | 2 |
| CSF1 | 0 | 4 | 4 | MFGE8 | 0 | 2 | 2 | TP53 | 142 | 1169 | 1311 |
| CSPG4 | 0 | 1 | 1 | MFI2 | 0 | 1 | 1 | TPI1 | 0 | 8 | 8 |
| CTAG | 0 | 1 | 1 | MMP14 | 0 | 8 | 8 | TPM4 | 0 | 1 | 1 |
| CTAG2 | 0 | 3 | 3 | MMP2 | 0 | 8 | 8 | TRAPPC1 | 0 | 1 | 1 |
| CTNNB1 | 109 | 2 | 111 | MSLN | 0 | 3 | 3 | TRPM8 | 0 | 1 | 1 |
| CTSH | 0 | 2 | 2 | MUC1 | 0 | 1 | 1 | TSPYL1 | 0 | 1 | 1 |
| CYP1B1 | 0 | 19 | 19 | MUM1 | 0 | 2 | 2 | TTK | 0 | 2 | 2 |
| DDR1 | 0 | 6 | 6 | MUM3 | 0 | 7 | 7 | TYR | 0 | 111 | 111 |
| EEF2 | 0 | 1 | 1 | MYO1B | 0 | 3 | 3 | TYRP1 | 1 | 2 | 3 |
| EFTUD2 | 0 | 2 | 2 | NFYC | 1 | 1 | 2 | UBXD5 | 0 | 3 | 3 |
| EGFR | 159 | 10 | 169 | NPM1 | 1 | 0 | 1 | WT1 | 18 | 33 | 51 |
| EPHA2 | 1 | 4 | 5 | NRAS | 9 | 1 | 10 | ZUBR1 | 1 | 5 | 6 |
| EPHA3 | 0 | 14 | 14 | OCA2 | 0 | 55 | 55 | | | | |
| ERBB2 | 24 | 6 | 30 | OS9 | 0 | 3 | 3 | Total | 988 | 2079 | 3067 |

3.2 Data annotation and organization

We constructed three tables that define TANTIGEN database entries: the antigen table, the T-cell epitope table, and the HLA ligand table. In the antigen table, the “Header” consists of the TA accession number (AgACC), date of entry, date of last update, antigen name, common antigen name, full name, synonyms, isoform names, isoform synonyms, UniProt ID, NCBI Gene ID, GeneCard ID, COSMIC ID, gene expression profile link, and Swiss-Prot VARIANT ID. The AgACC is the antigen accession number, assigned as “Ag” followed by a unique six-digit number (for example, Ag000039). The antigen name is the name of each TA, displayed as the protein symbol suggested by the HUGO Gene Nomenclature Committee [24]. The common name is the name commonly used in tumor biology, for example survivin (BIRC5), Her2 (ERBB2), TRP-2 (DCT), Pmel17 or gp100 (SILV), and ART-1 (SUPT7L). Full names and synonyms are lists of protein names or gene names associated with a given TA. The isoform name is a systematic name of an alternative product, given for full-length TA sequences. The name is assigned by our strategy, which is based on the guidelines of the HUGO Gene Nomenclature Committee. For example, TERT_i1_id1_v1 is splice isoform one (_i1) of the protein TERT, and this sequence has insertion-deletion (indel) pattern one (_id1) and sequence variation pattern one (_v1). Isoform synonyms are lists of splice isoform names according to UniProt and AceView. Collection of isoform synonyms revealed that the naming of TA isoforms is confusing and lacks a standard nomenclature. Researchers use different naming schemes for protein isoforms, for example numerical (1, 2, 3, etc.), alphabetical (A, a, B, b, C, c, or alpha, beta, gamma, etc.), alphanumeric (1A, Y1, 1FA, beta-1, alpha-2, etc.), sequence length (long and short), exon/intron (DeltaEx3, 2B/3B, int2B/3B, etc.), and molecular weight (p60, p110, p170, etc.).
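The systematic isoform name can be decomposed mechanically. The Python sketch below parses names of the form shown in the TERT_i1_id1_v1 example; the regular expression is inferred from that single example, and the class and function names are hypothetical, so other name forms used in TANTIGEN may not be covered.

```python
import re
from typing import NamedTuple


class IsoformName(NamedTuple):
    antigen: str            # HUGO-based antigen name, e.g. "TERT"
    isoform: int            # splice isoform number (_i)
    indel_pattern: int      # insertion-deletion pattern number (_id)
    variation_pattern: int  # sequence variation pattern number (_v)


# Pattern inferred from the example TERT_i1_id1_v1 (an assumption).
_ISOFORM_RE = re.compile(r"^(?P<antigen>.+)_i(?P<i>\d+)_id(?P<id>\d+)_v(?P<v>\d+)$")


def parse_isoform_name(name: str) -> IsoformName:
    """Split a systematic isoform name into its four components."""
    match = _ISOFORM_RE.match(name)
    if match is None:
        raise ValueError(f"not a systematic isoform name: {name!r}")
    return IsoformName(
        match["antigen"], int(match["i"]), int(match["id"]), int(match["v"])
    )


parsed = parse_isoform_name("TERT_i1_id1_v1")
```

Because each component is numbered independently, two sequences of the same splice isoform that differ only by a substitution would share the `_i` and `_id` numbers and differ in `_v`.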
Swiss-Prot ID, NCBI Gene ID, GeneCards ID, COSMIC ID, and Swiss-Prot Variant ID are the identifiers of the proteins and genes in the Swiss-Prot, NCBI Gene, GeneCards, COSMIC, and Swiss-Prot Variant databases, respectively. Gene expression profile links connect TANTIGEN entries with gene expression information suggested by the analysis of EST counts from UniGene (http://www.ncbi.nlm.nih.gov/unigene). The “Feature” section consists of fields for comments, annotations, isoforms, mutation entries, T-cell epitopes, HLA ligands, and predicted HLA binders. The comment field contains various properties of TA sequences, including protein isoforms, alternative ORFs, intron encoding sequences, chromosomal translocations, internal tandem repeats, and putative or defined TA isoforms. The term “putative” is used to describe antigen entries derived from substitution mutations for which the epitope or HLA ligand sequences could not be found in the sequence databases and were instead extracted from the literature. The annotation field indicates whether the TA is a full-length protein or a fragment. The isoform field links to the accession numbers of all variants of that TA, except for substitution mutation entries. This field also provides a link to an MSA of all antigen variants that are associated with the same base TA protein. The mutation entries field lists the accession numbers of all substitution mutation variants of the same base antigen and provides a link to visualize the mutation map of one or more reference sequences. The T-cell epitope and HLA ligand fields list the peptide epitopes or ligands for HLA molecules. These tables provide the peptide sequences of the T-cell epitopes or HLA ligands, the positions of the peptides, the restricting HLA alleles, and the references. The references are displayed as clickable PubMed IDs. Clicking on an epitope or HLA ligand sequence link generates antigen tables containing multiple sequence alignments in which the T-cell epitopes or ligands are highlighted within the sequences of all isoforms.

The sequence field consists of the reference sequence and the antigen sequence. The reference sequence is a single sequence presented only in putative and substitution mutation entries. It is used as a reference to help users visualize single amino acid changes. The antigen sequence field contains the amino acid sequence of the TA. For T-cell epitopes and HLA ligands derived from mutated unique antigens, the sequence was manually extended by 20 amino acids on both the N-terminal and C-terminal sides of the mutated amino acid residue. The same criterion was applied to substitution mutation entries. In addition, a 41-amino-acid reference sequence is provided with each mutated entry. For the T-cell epitope and HLA ligand tables, the “Header” contains the accession number of the T-cell epitope or HLA ligand entry, assigned as “T” or “L” followed by a unique six-digit number. “Feature” designates the specific HLA allele and PubMed ID for that particular T-cell epitope or HLA ligand. “Sequence” is the TA T-cell epitope or ligand. An MSA with highlighted T-cell epitopes and HLA ligands is shown below the displayed table.

3.3 Basic analysis tools

Using the keyword search function, users can retrieve antigen records by keywords such as an antigen name (or synonym) or attributes of an antigen (such as function, sequence, or database accession numbers). Users can also search T-cell epitopes and HLA ligands by keywords such as an epitope/ligand sequence or a restricting HLA allele. The Boolean operators "AND" and "OR" can be used to refine the search. BLAST search has been integrated into TANTIGEN to enable sequence similarity search; this tool can be used for TA isoform identification. To visualize the full scope of sequence variation associated with a given TA, a quality-controlled MSA of its isoforms is available via a clickable link within each antigen record. To facilitate efficient antigenicity analysis, we integrated on-the-fly HLA binding predictions (netMHCpan 2.4a and netMHCIIpan 2.0b) into TANTIGEN. This tool enables peptide binding prediction for 15 frequent HLA class I and class II alleles (A*0101, A*0201, A*0301, A*1101, A*2402, B*0702, B*0801, B*1501, DRB1*0101, DRB1*0301, DRB1*0401, DRB1*0701, DRB1*1101, DRB1*1301, and DRB1*1501). Table 4 shows the result table for predicted A*0201 9-mer binding peptides within the TA tyrosinase. Binding predictions are cross-referenced with the experimentally verified T-cell epitopes and HLA ligands hosted in TANTIGEN, which are marked in the rightmost column of the result table. As shown in the table, the 9-mer peptide 369-377 (YMNGTMSQV), predicted to be a strong HLA-A*0201 binder, has been experimentally validated as an HLA-A*0201-restricted T-cell epitope [42]. To visualize T-cell epitopes or HLA ligands, users can select from three peptide display formats. In format 1, T-cell epitopes and HLA ligands are highlighted within the antigen sequences. In format 2, peptides are displayed within a multiple sequence alignment of all antigen isoforms. In format 3, each peptide is shown on a separate line defined for each HLA allele. To facilitate fast database querying, the database also provides a diagrammatic classification of TAs (Figure 2).
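The windowing rule described in Section 3.2 for mutated entries (the mutated residue plus 20 residues on either side, 41 residues in total) can be sketched in Python as follows. The function name, the use of 1-based positions, and the truncation of the window at the sequence termini are assumptions; the text does not say how mutations close to a terminus were handled.

```python
def mutation_window(sequence: str, position: int, mutant: str, flank: int = 20):
    """Return (reference_window, mutated_window) centred on a 1-based position.

    The window spans `flank` residues on each side of the mutated residue
    and is truncated at the sequence termini (an assumption).
    """
    if not 1 <= position <= len(sequence):
        raise ValueError("position outside sequence")
    if len(mutant) != 1:
        raise ValueError("mutant must be a single residue")
    index = position - 1
    start = max(0, index - flank)
    end = min(len(sequence), index + flank + 1)
    reference = sequence[start:end]
    mutated = sequence[start:index] + mutant + sequence[index + 1:end]
    return reference, mutated


# Synthetic 60-residue sequence used only to exercise the function.
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 3
ref_window, mut_window = mutation_window(toy_sequence, position=30, mutant="P")
```

For a mutation far enough from both termini, the two windows are 41 residues long and differ only at the central position, which corresponds to the reference sequence and antigen sequence pair stored with each substitution mutation entry.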
TAs in this database were categorized into three major groups: unique antigens, shared antigens, and unclassified antigens. Unique antigens are sub-classified according to how their T-cell epitopes or HLA ligands are generated. Shared antigens are subdivided into three groups based on specificity (shared tumor-specific), location (differentiation), and expression (overexpressed). The antigens in each category are arranged in alphabetical order. Clicking on the name of a TA provides a result table that lists all relevant antigens. Since TANTIGEN uses antigen names based on human gene nomenclature, some TA names might not be recognized by users who are familiar with the common names. To alleviate this problem, TANTIGEN provides lists of common antigen names, antigen nomenclature, and splice isoform nomenclature for all TAs. These lists are provided within the nomenclature tab.

Table 4: Predicted HLA-A*0201 binding peptides of Ag000039, TYR. Predictions were made using netMHCpan. Predicted strong binders are underlined. Peptides are shown in descending order of their affinities. To show the peptides in ascending order of their positions, click on "Position" in the first row of the table.

| Rank | Position | Peptide | IC50 (nM) | Experimental status |
| --- | --- | --- | --- | --- |
| 1 | 214-222 | FLLRWEQEI | 4.77 | T-cell epitope |
| 2 | 369-377 | YMNGTMSQV | 5.98 | T-cell epitope |
| 3 | 490-498 | ALLAGLVSL | 8.51 | T-cell epitope |
| 4 | 1-9 | MLLAVLYCL | 12.84 | T-cell epitope |
| 5 | 9-17 | LLWSFQTSA | 14.61 | HLA ligand |
| 6 | 473-481 | RIWSWLLGA | 17.73 | HLA ligand |
| 7 | 491-499 | LLAGLVSLL | 23.37 | T-cell epitope |
| 8 | 487-495 | VLTALLAGL | 29.67 | T-cell epitope |
| 9 | 207-215 | FLPWHRLFL | 38.86 | HLA ligand |
| 10 | 262-270 | LLSPASFFS | 42.24 | |
| 11 | 460-468 | FQDYIKSYL | 44.66 | |
| 12 | 482-490 | AMVGAVLTA | 48.78 | T-cell epitope |
| 13 | 478-486 | LLGAAMVGA | 59.78 | |
| 14 | 2-10 | LLAVLYCLL | 64.32 | HLA ligand |
| 15 | 483-491 | MVGAVLTAL | 103.39 | |
| 16 | 200-208 | FAHEAPAFL | 160.52 | |
| 17 | 463-471 | YIKSYLEQA | 189.16 | |
| 18 | 5-13 | VLYCLLWSF | 208.87 | |
| 19 | 137-145 | YLTLAKHTI | 214.82 | |
| 20 | 133-141 | KFFAYLTLA | 260.26 | |
| 21 | 175-183 | LFVWMHYYV | 269.96 | |
| 22 | 145-153 | ISSDYVIPI | 384.47 | |
| 23 | 380-388 | SANDPIFLL | 411.83 | |
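The cross-referencing of predictions with experimental status in Table 4 can be illustrated with a few of its rows. The Python sketch below assumes the IC50 cutoffs of 50 nM (strong binder) and 500 nM (weak binder) that are commonly used with netMHC-type predictors; the cutoffs actually applied by TANTIGEN are not stated in the text, and the function and variable names are hypothetical.

```python
# (position, peptide, predicted IC50 in nM, experimental status) copied from Table 4.
# An empty status means that no experimental record is listed for the peptide.
rows = [
    ("214-222", "FLLRWEQEI", 4.77, "T-cell epitope"),
    ("369-377", "YMNGTMSQV", 5.98, "T-cell epitope"),
    ("9-17", "LLWSFQTSA", 14.61, "HLA ligand"),
    ("262-270", "LLSPASFFS", 42.24, ""),
    ("483-491", "MVGAVLTAL", 103.39, ""),
    ("380-388", "SANDPIFLL", 411.83, ""),
]

STRONG_NM = 50.0   # assumed cutoff for strong binders
WEAK_NM = 500.0    # assumed cutoff for weak binders


def binder_class(ic50_nm: float) -> str:
    """Classify a predicted IC50 value using the assumed cutoffs."""
    if ic50_nm < STRONG_NM:
        return "strong"
    if ic50_nm < WEAK_NM:
        return "weak"
    return "non-binder"


summary = [
    (peptide, binder_class(ic50), status or "none recorded")
    for _, peptide, ic50, status in rows
]

# Predicted strong binders without an experimental record are candidates for validation.
unvalidated_strong = [
    peptide for peptide, cls, status in summary
    if cls == "strong" and status == "none recorded"
]
```

Listing predicted strong binders that lack an experimental record is one way such a result table could be used to prioritize peptides for testing.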

Some TAs are highly variable and have many recorded point mutations. To facilitate efficient mutation analysis and give a global picture of all point mutations in a TA, we implemented a TA mutation mapping tool. The mutation map of the TA TP53 (Figure 3) displays an extremely large number of reported point mutations.
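A mutation map of this kind, in which each row beneath the reference sequence shows the substituted residue at a mutated position and a dot at every unchanged position, can be sketched in a few lines of Python. The function name and the example variant are hypothetical; the reference fragment is the first ten residues of the TP53 reference sequence in Figure 3.

```python
def mutation_map_row(reference: str, variant: str) -> str:
    """Return one map row: the variant residue where it differs, '.' elsewhere."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length (substitutions only)")
    return "".join(
        "." if ref_res == var_res else var_res
        for ref_res, var_res in zip(reference, variant)
    )


reference = "MEEPQSDPSV"  # first 10 residues of the TP53 reference (Figure 3)
variant = "MEEPHSDPSV"    # hypothetical variant carrying a Q-to-H change at position 5
row = mutation_map_row(reference, variant)
```

Stacking one such row per variant under the reference sequence gives a compact overview of where substitutions cluster.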

Figure 2: Tumor antigen classification chart in TANTIGEN. Each new addition to TANTIGEN is classified according to the principles of Van den Eynde and van der Bruggen [17].

[Figure 2 chart: tumor antigens are divided into unique antigens (substitution mutation, intron encoding, alternative ORF, chromosomal translocation, internal tandem repeat), shared antigens (shared tumor specific, differentiation, overexpressed), and unclassified antigens, with the individual antigen names listed under each group.]

Figure 3. Screen shot of mutation map of tumor antigen TP53 in TANTIGEN. The highlighted amino acids in the reference sequences are positions where point mutations took place. Clicking on the amino acids below the point mutation positions leads to the mutated sequence data table. 4 Discussion

More than 1,400 T-cell epitopes and HLA ligands have been characterized. TANTIGEN is the first database dedicated to tumor vaccine target discovery that uses a systematic and comprehensive bioinformatics approach with integrated analysis tools. The database contains a large number of human tumor T-cell antigen entries that are fully referenced. Due to the lack of submission standards and insufficient data standardization in many public databases, a TA is often associated with different names. Such inconsistent nomenclature makes it difficult to apply computational approaches for data mining and knowledge discovery. TA names assigned in scientific literature often differ from names used in human gene nomenclature. For example, BING4 (WDR46), CEA (CEACAM5), HER2 (ERBB2), Survivin (BIRC5), COA-1 (UBXD5), DCT (TRP-2), RBAF600 (ZUBR1), KK-LC-1 (CXorf61), KM-HN-1 (CCDC110), TRAG3 (CSAG2), Pmel17/gp100 (SILV), NY-BR-1 (ANKRD30A), and g250 (CA9) use different common and nomenclature names. To reduce errors resulting from inconsistent naming, all TANTIGEN antigens were named according to the HUGO Gene Nomenclature; synonyms for each antigen name were collected and are displayed within each antigen record. Also, an antigen nomenclature table was developed to summarize the formal names, common names, and synonyms for all TAs collected in TANTIGEN. Apart from antigen name, splice isoform name can also be inconsistent. For example isoforms -1, -3 and -5 of UBE2V1 have been reported as isoforms -4, -2, and -3 in alternative references. In TANTIGEN we propose using of systematic isoform nomenclature by extending the guideline of HUGO Gene Nomenclature Committee. The improvement of TA nomenclature was demonstrated in particular with carcinoembryonic

[Figure 3 content: Mutation map of TP53, showing reference sequences Ag000037 and Ag001758 (positions 1 to 100) with the aligned mutated sequences listed below each reference.]


antigen family because the nomenclature of this gene family has been inconsistent [39]. Lacroix [43] reported poor usage of HUGO gene nomenclature for SCGB2A2 (mammaglobin A) in breast cancer studies. This protein has been reported under a variety of names, such as MAM, hMAM, hMAM-A, MG, MGB1, and MMG. Inconsistent isoform naming was also found for NY-ESO-1/LAGE-1. The proteins CTAG (CAMEL), CTAG1A (NY-ESO-1), and CTAG2 (LAGE-1) are not distinct proteins; rather, they are encoded by the NY-ESO-1/LAGE-1 gene. Analysis of coding DNA sequences revealed that these three names actually refer to splice isoforms of the NY-ESO-1/LAGE-1 gene.

Gene expression profiles provided by UniGene have been collected for in silico analysis of differential gene expression. Computational methods of transcriptional profiling, or digital differential display (DDD), have been applied to increase the understanding of cancer biology [44]. Identification of highly expressed genes may provide information for understanding new pathways, novel biomarkers, or prognostic markers of malignancy [45–47]. DDD has been used to investigate gene expression in a wide variety of cancers, including breast, colon, lung, ovarian, pancreatic, and prostate cancers [34, 48, 49].

Systematic discovery of cancer vaccine targets relies heavily on the availability of accurate, up-to-date, and well-organized TA data. TA data are available through publications, technical reports, and databases. The challenge is to collect, clean, annotate, and archive these data and to extract meaningful information and knowledge. We have cataloged 4,245 curated TA entries representing multiple variants of 260 unique protein TAs reported in the literature. KB-builder, an in-house framework that streamlines the development and deployment of web-accessible immunological knowledge bases, was employed to construct TANTIGEN. A set of computational tools for in-depth analysis, including TA classification, TA nomenclature assignment, sequence comparison using BLAST search, multiple alignment of antigens, mutation mapping, and T-cell epitope/HLA ligand visualization, has been integrated in TANTIGEN. Prediction of HLA binding is enabled for 15 common HLA Class I and Class II alleles. TANTIGEN provides a rich data source and an advanced analysis platform for cancer vaccine discovery, designed to speed up rational cancer vaccine design by providing accurate and well-annotated data coupled with tailored computational analysis tools.
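The name normalization described in this discussion can be illustrated with a small lookup sketch. The synonym pairs below are the ones quoted in the text; the dictionary layout and the function names (`build_index`, `normalize`) are assumptions made for illustration and do not describe how TANTIGEN itself stores or resolves names.

```python
# Common name -> HUGO gene symbol, using pairs quoted in the discussion.
# The data structure and helper names are hypothetical, for illustration only.
COMMON_TO_HUGO = {
    "BING4": "WDR46",
    "CEA": "CEACAM5",
    "HER2": "ERBB2",
    "Survivin": "BIRC5",
    "KK-LC-1": "CXorf61",
    "KM-HN-1": "CCDC110",
    "TRAG3": "CSAG2",
    "NY-BR-1": "ANKRD30A",
    "g250": "CA9",
    # Mammaglobin A synonyms reported by Lacroix [43]
    "MAM": "SCGB2A2",
    "hMAM": "SCGB2A2",
    "hMAM-A": "SCGB2A2",
    "MG": "SCGB2A2",
    "MGB1": "SCGB2A2",
    "MMG": "SCGB2A2",
}


def _key(name):
    """Build a case-insensitive lookup key, ignoring surrounding whitespace."""
    return name.strip().casefold()


def build_index(synonyms):
    """Map every synonym, and every official symbol, to the official symbol."""
    index = {}
    for common, symbol in synonyms.items():
        index[_key(common)] = symbol
        index[_key(symbol)] = symbol  # an official symbol resolves to itself
    return index


def normalize(name, index):
    """Return the official symbol for a name, or None if the name is unknown."""
    return index.get(_key(name))


INDEX = build_index(COMMON_TO_HUGO)
```

With such an index, literature names like "HER2" or "hMAM-A" resolve to a single symbol, and unknown names return `None` so that a curator can review them rather than have them guessed.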


5 References

1. Klein G: Tumor antigens. Annu. Rev. Microbiol. 1966, 20:223–52.
2. Haen SP, Rammensee H-G: The repertoire of human tumor-associated epitopes--identification and selection of antigens and their application in clinical trials. Curr. Opin. Immunol. 2013, 25:277–83.
3. Boon T, Cerottini JC, Van den Eynde B, van der Bruggen P, Van Pel A: Tumor antigens recognized by T lymphocytes. Annu. Rev. Immunol. 1994, 12:337–65.
4. Parmiani G, De Filippo A, Novellino L, Castelli C: Unique human tumor antigens: immunobiology and use in clinical trials. J. Immunol. 2007, 178:1975–9.
5. Cao X, Maloney KB, Brusic V: Data mining of cancer vaccine trials: a bird’s-eye view. Immunome Res. 2008, 4:7.
6. Van der Bruggen P, Traversari C, Chomez P, Lurquin C, De Plaen E, Van den Eynde B, Knuth A, Boon T: A gene encoding an antigen recognized by cytolytic T lymphocytes on a human melanoma. Science 1991, 254:1643–7.
7. Klein O, Ebert LM, Nicholaou T, Browning J, Russell SE, Zuber M, Jackson HM, Dimopoulos N, Tan BS, Hoos A, Luescher IF, Davis ID, Chen W, Cebon J: Melan-A-specific cytotoxic T cells are associated with tumor regression and autoimmunity following treatment with anti-CTLA-4. Clin. Cancer Res. 2009, 15:2507–13.
8. Topalian SL, Hodi FS, Brahmer JR, Gettinger SN, Smith DC, McDermott DF, Powderly JD, Carvajal RD, Sosman JA, Atkins MB, Leming PD, Spigel DR, Antonia SJ, Horn L, Drake CG, Pardoll DM, Chen L, Sharfman WH, Anders RA, Taube JM, McMiller TL, Xu H, Korman AJ, Jure-Kunkel M, Agrawal S, McDonald D, Kollia GD, Gupta A, Wigginton JM, Sznol M: Safety, activity, and immune correlates of anti-PD-1 antibody in cancer. N. Engl. J. Med. 2012, 366:2443–54.
9. Schwartzentruber DJ, Lawson DH, Richards JM, Conry RM, Miller DM, Treisman J, Gailani F, Riley L, Conlon K, Pockaj B, Kendra KL, White RL, Gonzalez R, Kuzel TM, Curti B, Leming PD, Whitman ED, Balkissoon J, Reintgen DS, Kaufman H, Marincola FM, Merino MJ, Rosenberg SA, Choyke P, Vena D, Hwu P: gp100 peptide vaccine and interleukin-2 in patients with advanced melanoma. N. Engl. J. Med. 2011, 364:2119–27.
10. Walter S, Weinschenk T, Stenzl A, Zdrojowy R, Pluzanska A, Szczylik C, Staehler M, Brugger W, Dietrich P-Y, Mendrzyk R, Hilf N, Schoor O, Fritsche J, Mahr A, Maurer D, Vass V, Trautwein C, Lewandrowski P, Flohr C, Pohla H, Stanczak JJ, Bronte V, Mandruzzato S, Biedermann T, Pawelec G, Derhovanessian E, Yamagishi H, Miki T, Hongo F, Takaha N, Hirakawa K, Tanaka H, Stevanovic S, Frisch J, Mayer-Mokler A, Kirner A, Rammensee H-G, Reinhardt C, Singh-Jasuja H: Multipeptide immune response to cancer vaccine IMA901 after single-dose cyclophosphamide associates with longer patient survival. Nat. Med. 2012, 18.
11. Snook AE, Magee MS, Waldman SA: GUCY2C-targeted cancer immunotherapy: past, present and future. Immunol. Res. 2011, 51:161–9.
12. Arens R, van Hall T, van der Burg SH, Ossendorp F, Melief CJM: Prospects of combinatorial synthetic peptide vaccine-based immunotherapy against cancer. Semin. Immunol. 2013.
13. Yamada A, Sasada T, Noguchi M, Itoh K: Next-generation peptide vaccines for advanced cancer. Cancer Sci. 2013, 104:15–21.
14. Ostrand-Rosenberg S: Animal models of tumor immunity, immunotherapy and cancer vaccines. Curr. Opin. Immunol. 2004, 16:143–50.
15. Lollini P-L, Cavallo F, Nanni P, Forni G: Vaccines for tumour prevention. Nat. Rev. Cancer 2006, 6:204–16.
16. Rosenberg SA, Yang JC, Restifo NP: Cancer immunotherapy: moving beyond current vaccines. Nat. Med. 2004, 10:909–15.
17. Van den Eynde BJ, van der Bruggen P: T cell defined tumor antigens. Curr. Opin. Immunol. 1997, 9:684–93.
18. Lizée G, Overwijk WW, Radvanyi L, Gao J, Sharma P, Hwu P: Harnessing the power of the immune system to target cancer. Annu. Rev. Med. 2013, 64:71–90.
19. Cavallo F, Calogero RA, Forni G: Are oncoantigens suitable targets for anti-tumour therapy? Nat. Rev. Cancer 2007, 7:707–13.
20. Van der Bruggen P, Stroobant V, Vigneron N, Van den Eynde B: Peptide database: T cell-defined tumor antigens. Cancer Immun 2013.
21. Novellino L, Castelli C, Parmiani G: A listing of human tumor antigens recognized by T cells: March 2004 update. Cancer Immunol. Immunother. 2005, 54:187–207.
22. Olsen LR, Kudahl UJ, Winther O, Brusic V: Literature classification for semi-automated updating of biological knowledgebases. BMC Genomics 2013, 14 Suppl 6:S14.
23. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, Sirota-Madi A, Olender T, Golan Y, Stelzer G, Harel A, Lancet D: GeneCards Version 3: the human gene integrator. Database (Oxford). 2010, 2010:baq020.
24. Gray KA, Daugherty LC, Gordon SM, Seal RL, Wright MW, Bruford EA: Genenames.org: the HGNC resources in 2013. Nucleic Acids Res. 2013, 41:D545–52.
25. Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic Acids Res. 2012, 40:D48–53.
26. Magrane M: UniProt Knowledgebase: a hub of integrated protein data. Database (Oxford). 2011, 2011:bar009.
27. Forbes SA, Bindal N, Bamford S, Cole C, Kok CY, Beare D, Jia M, Shepherd R, Leung K, Menzies A, Teague JW, Campbell PJ, Stratton MR, Futreal PA: COSMIC: mining complete cancer genomes in the Catalogue of Somatic Mutations in Cancer. Nucleic Acids Res. 2011, 39:D945–50.
28. Thierry-Mieg D, Thierry-Mieg J: AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol. 2006, 7 Suppl 1:S12.1–14.
29. Zhang GL, Olsen LR, Kudahl UJ, Chitkushev L, Brusic V: Streamlining the development process of immunological knowledge-based systems. Methods Mol. Med. 2013.
30. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J. Mol. Biol. 1990, 215:403–10.
31. Katoh K, Standley DM: MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 2013.
32. Hoof I, Peters B, Sidney J, Pedersen LE, Sette A, Lund O, Buus S, Nielsen M: NetMHCpan, a method for MHC class I binding prediction beyond humans. Immunogenetics 2009, 61:1–13.
33. Nielsen M, Justesen S, Lund O, Lundegaard C, Buus S: NetMHCIIpan-2.0 - Improved pan-specific HLA-DR predictions using a novel concurrent alignment and weight optimization training procedure. Immunome Res. 2010, 6:9.
34. Lin HH, Ray S, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC class I peptide binding prediction servers: applications for vaccine research. BMC Immunol. 2008, 9:8.
35. Lin HH, Zhang GL, Tongchusak S, Reinherz EL, Brusic V: Evaluation of MHC-II peptide binding prediction servers: applications for vaccine research. BMC Bioinformatics 2008, 9 Suppl 12:S22.
36. Zhang GL, Ansari HR, Bradley P, Cawley GC, Hertz T, Hu X, Jojic N, Kim Y, Kohlbacher O, Lund O, Lundegaard C, Magaret CA, Nielsen M, Papadopoulos H, Raghava GPS, Tal V-S, Xue LC, Yanover C, Zhu S, Rock MT, Crowe JE, Panayiotou C, Polycarpou MM, Duch W, Brusic V: Machine learning competition in immunology - Prediction of HLA class I binding peptides. J. Immunol. Methods 2011, 374:1–4.
37. Mandruzzato S, Brasseur F, Andry G, Boon T, van der Bruggen P: A CASP-8 mutation recognized by cytolytic T lymphocytes on a human head and neck carcinoma. J. Exp. Med. 1997, 186:785–93.
38. Wang T, Fan L, Watanabe Y, McNeill P, Fanger GR, Persing DH, Reed SG: L552S, an alternatively spliced isoform of XAGE-1, is over-expressed in lung adenocarcinoma. Oncogene 2001, 20:7699–709.
39. Morioka N, Kikumoto Y, Hoon DS, Morton DL, Irie RF: Cytotoxic T cell recognition of a human melanoma derived peptide with a carboxyl-terminal alanine-proline sequence. Mol. Immunol. 1995, 32:573–81.
40. Wang HY, Peng G, Guo Z, Shevach EM, Wang R-F: Recognition of a new ARTC1 peptide ligand uniquely expressed in tumor cells by antigen-specific CD4+ regulatory T cells. J. Immunol. 2005, 174:2661–70.
41. Suzuki K, Sahara H, Okada Y, Yasoshima T, Hirohashi Y, Nabeta Y, Hirai I, Torigoe T, Takahashi S, Matsuura A, Takahashi N, Sasaki A, Suzuki M, Hamuro J, Ikeda H, Wada Y, Hirata K, Kikuchi K, Sato N: Identification of natural antigenic peptides of a human gastric signet ring cell carcinoma recognized by HLA-A31-restricted cytotoxic T lymphocytes. J. Immunol. 1999, 163:2783–91.
42. Mosse CA, Meadows L, Luckey CJ, Kittlesen DJ, Huczko EL, Slingluff CL, Shabanowitz J, Hunt DF, Engelhard VH: The class I antigen-processing pathway for the membrane protein tyrosinase involves translation in the endoplasmic reticulum and processing in the cytosol. J. Exp. Med. 1998, 187:37–48.
43. Lacroix M: Poor usage of HUGO standard gene nomenclature in breast cancer studies. Breast Cancer Res. Treat. 2009, 114:385–6.
44. Beauchemin N, Draber P, Dveksler G, Gold P, Gray-Owen S, Grunert F, Hammarström S, Holmes KV, Karlsson A, Kuroki M, Lin SH, Lucka L, Najjar SM, Neumaier M, Obrink B, Shively JE, Skubitz KM, Stanners CP, Thomas P, Thompson JA, Virji M, von Kleist S, Wagener C, Watt S, Zimmermann W: Redefined nomenclature for members of the carcinoembryonic antigen family. Exp. Cell Res. 1999, 252:243–9.
45. Murray D, Doran P, MacMathuna P, Moss AC: In silico gene expression analysis--an overview. Mol. Cancer 2007, 6:50.
46. Dennis JL, Vass JK, Wit EC, Keith WN, Oien KA: Identification from public data of molecular markers of adenocarcinoma characteristic of the site of origin. Cancer Res. 2002, 62:5999–6005.
47. Yousef GM, Yacoub GM, Polymeris M-E, Popalis C, Soosaipillai A, Diamandis EP: Kallikrein gene downregulation in breast cancer. Br. J. Cancer 2004, 90:167–72.
48. Asmann YW, Kosari F, Wang K, Cheville JC, Vasmatzis G: Identification of differentially expressed genes in normal and malignant prostate by electronic profiling of expressed sequence tags. Cancer Res. 2002, 62:3308–14.
49. Scheurle D, DeYoung MP, Binninger DM, Page H, Jahanzeb M, Narayanan R: Cancer gene discovery using digital differential display. Cancer Res. 2000, 60:4037–43.


Paper V

Literature classification for semi-automated updating of biological knowledgebases

BMC Genomics 2013, vol 14, suppl 6: S14

Lars Rønn Olsen1,2, Ulrich Johan Kudahl2,3, Ole Winther1,4, Vladimir Brusic2,5,*

1Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
2Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
3Center for Biological Sequence Analysis, Department of Systems Biology, Technical University of Denmark, Lyngby, Denmark
4Cognitive Systems, DTU Compute, Technical University of Denmark, Lyngby, Denmark
5Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA
*Corresponding author


Abstract

As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of the higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or as free text embedded in the primary scientific literature. We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded a classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. We here propose a conceptual framework for semi-automated extraction of epitope data embedded in the scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.


1 Introduction

Databases are the cornerstone of bioinformatics analyses. As experimental methods keep advancing and high-throughput methods keep increasing in volume, the number of biological data repositories is growing rapidly [1]. Similarly, the quantity and complexity of the data are growing, requiring both refinement of analyses and higher resolution and accuracy of results. In addition to the most commonly used biological data types, such as sequence data (gene and protein), structural data, and quantitative data (gene and protein expression), an increasing amount of high-level functional annotation of biological sequences is needed to enable detailed studies of biological systems. These high-level annotations are also captured in databases, but to a much smaller degree than the essential data types. The literature, however, is a rich source of functional annotation information, and combining these two types of sources provides the body of data, information, and knowledge needed for practical application in bioinformatics and clinical bioinformatics. Extraction of knowledge from these sources is facilitated through emerging knowledgebases (KB) that enable not only data extraction, but also data mining, extraction of patterns hidden in the data, and predictive modeling. Thus, KB bring bioinformatics one step closer to the experimental setting than traditional databases, since they are intended to enable summarization of hundreds of thousands of data points and in silico simulation of experiments, all in one place.

A knowledge-based system (KBS) is a computational system that uses logic, statistics, and artificial intelligence tools to support decision making and the solving of complex problems. KBS include specialist databases designed for data mining tasks and knowledge management databases (knowledgebases). A KBS comprises a KB, a set of analytical tools, a logic unit, and a user interface. The logic unit connects user queries to the analytical tools: using workflows, it determines how the tools are applied to the knowledge base to perform the analysis and produce the results. Primary sources such as UniProt [2] or GenBank [3], as well as specialized databases such as the Influenza Research Database (IRD) [4] and the Los Alamos National Laboratory HIV Databases (http://www.hiv.lanl.gov/), offer a number of integrated tools and annotated data, but their analytical workflows are limited to basic operations. Examples of more advanced KBS include FlavidB, a KBS of flavivirus antigens [5]; FluKB, a KBS of influenza antigens (http://research4.dfci.harvard.edu/cvc/flukb); and TANTIGEN, a KBS of tumor antigens


(http://cvc.dfci.harvard.edu/tadb/index.html). KBS focus on a narrow domain and provide a set of analytical tools for complex analyses and decision support. A KBS must contain sufficient data and annotations to enable data mining for summarization, pattern discovery, and the building of models that simulate the behavior of real systems. For example, FlavidB enables summarization of the diversity of sequences for more than 50 species of flaviviruses. It also enables analysis of the complete set of predicted T cell epitopes for 15 common HLA alleles and has the capacity to display the complete landscape of both predicted and experimentally verified HLA-associated peptides. The extension of antigen analysis functionalities in FluKB enables analysis of the cross-reactivity of all entries for neutralizing antibodies. Both of these examples focus on identification, prediction, variability analysis, and cross-reactivity of immune epitopes. The implementation of workflows in these KBS enables complex analyses to be performed by filling in a single query form, with results presented in a single report.

To obtain high-quality results, we must ensure that KBS are up to date and error-free (to the extent possible). Since the information in a KBS is derived from multiple sources, providing high-quality updates is complex. Manual updating of KBS is impractical, so automation of the updating process is needed. Automated updating of data and annotations by extraction from primary databases such as UniProt, GenBank, or IEDB is relatively simple, since these sources enable export of data in standardized formats, mainly XML. Ideally, functional annotations would be deposited by the discoverers through direct submission to appropriate databases, but a historical lack of submission standards for higher-level biological data has led to the vast majority of this information being recorded only in the primary scientific literature. The use of data embedded in the primary scientific literature, accessible through PubMed or Google Scholar, is markedly more complex. The information stored in abstracts or full texts is, at best, semi-structured, but typically it is provided as free text. Given that as many as tens of thousands of articles may be published each year on a given topic, access to this information and assessment of its relevance require efficient methods for identifying publications of interest and rapidly assessing their suitability for inclusion in the KBS. Such analysis is facilitated by text mining techniques, ranging from simple statistical pattern learning based on term frequencies to complex natural language processing techniques, which support text categorization, document summarization, information retrieval, and, ultimately, data mining [6]. A long-term solution for this issue invariably involves standardizing submission


and storage of complex biological data, but the knowledge currently embedded in the literature remains available for extraction. Text mining operations have previously been applied for specific knowledge extraction for vaccine development [7], as well as for document classification to separate abstracts by topic [8] and for semi-automated extraction of allergen cross-reactivity information [9]. In this article, we define the conceptual framework for semi-automated updating of our tumor antigen knowledgebase, TANTIGEN, using data parsing, basic text mining operations, and a standardized submission system.

2 Results and discussion

2.1 Conceptual framework

Depending on the content of the KBS one wishes to update, several issues pertaining to the complexity of biological data require consideration. In particular, we must address the diversity of data types, the diversity of data formats, the dispersion of data across different sources, and the size of data sets. There are many biological data types; the most common include sequence data (nucleotide or protein), molecular structures, expression data, and functional annotations. Data can be stored and retrieved as structured text, table formats, semantic web formats (such as RDF, OWL, or XML), or non-structured text. Depending on the target data format, retrieval can be performed by direct extraction, parsing, text mining, or manual extraction. Text mining, manual extraction, or a combination of the two is commonly used to extract high-level data, such as functional annotations. Data availability and individual entry size vary between data types, presenting a computational challenge in terms of retrieval, handling, analysis, and storage. Additional factors that affect the complexity of the updating task are data heterogeneity, integration of multiple data types after retrieval, and provenance tracking for quality assessment [7]. To address these issues, we have formalized a number of common tasks pertaining to knowledgebase updating into a conceptual framework for updating biological KBS, shown in Figure 1.
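For data held in a syntax-based format such as XML, retrieval by parsing can be quite direct. The sketch below is a minimal illustration using Python's standard `xml.etree.ElementTree` module. The record layout (`entries`, `entry`, `name`, `sequence`), the accession values, and the placeholder sequences are hypothetical and do not correspond to the export schema of any particular database.

```python
import xml.etree.ElementTree as ET

# Hypothetical export with invented element names and placeholder values.
SAMPLE = """<entries>
  <entry accession="X00001">
    <name>Example antigen A</name>
    <sequence>ACDEFGHIK</sequence>
  </entry>
  <entry accession="X00002">
    <name>Example antigen B</name>
    <sequence>LMNPQRSTV</sequence>
  </entry>
</entries>"""


def parse_entries(xml_text):
    """Extract accession, name, and sequence from each <entry> element."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter("entry"):
        records.append(
            {
                "accession": entry.get("accession"),
                "name": entry.findtext("name"),
                "sequence": entry.findtext("sequence"),
            }
        )
    return records


RECORDS = parse_entries(SAMPLE)
```

Non-structured text offers no such element names to address, which is why the text mining and manual extraction routes are needed for high-level annotations.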


Figure 1: Flow chart of tasks in the conceptual framework for semi-automated updating of knowledgebases.

Step 1: Produce a status report of the current knowledgebase build. This report serves as the filter for the two main updating tasks: updating existing entries and extending the data body with new entries.

Step 2: Automatic download of data from selected sources. Most biological data repositories enable full download of the latest database build, and most allow automated retrieval via GNU Wget or FTP clients. If automatic download is not possible, this step can be performed manually.

Step 3: Automatic data pre-processing. Depending on the data format, pre-processing steps can be automated in various ways. For simple syntax-based formats such as XML, the desired data can be parsed directly, whereas for non-standardized formats, such as raw text, pre-processing involves tasks derived from text mining, such as word stemming, stop word removal, and generation of a document-term matrix (DTM) [8].

Step 4: Text categorization. If the desired information is not available in a standardized format, for example if it is only available in the primary scientific literature, text mining or machine learning methods can be applied to direct and streamline the manual extraction. A text corpus may contain documents that fall into two or more categories, of which only one or a few are of interest for a given task. To maximize the efficiency of manual data extraction, it is helpful to classify documents before embarking on data extraction. Options for classification using machine learning methods include unsupervised methods, such as clustering and blind signal separation, and supervised methods, such as artificial neural networks, support vector machines, nearest neighbor methods, Naive Bayes, and decision trees, among others [6]. For some of these algorithms, feature extraction using matrix factorization methods, such as principal component analysis (singular value decomposition), can be useful to reduce the dimensionality of the DTM, which can become quite large.

Step 5: Manually extract data and information from categorized texts. Some higher-level data types, such as functional annotations, are often found in tables, figures, legends, or supplementary materials of primary scientific articles, making automated extraction of this information highly complex or practically impossible [9]. A manual extraction step may therefore be needed, and it simultaneously allows for quality control.

Step 6: Submission of new or updated entries to the KBS. Submission of extracted data to the KBS should be standardized to the highest degree possible in order to ensure that each entry adheres to the standardized format and quality. The use of a standardized submission form allows non-experts to perform the task of updating. Automated extraction of related data from primary databases can minimize manual data entry and, by flagging mismatches between existing entries and additions, provide automated error detection for issues to be addressed manually.

Step 7: Refining categorization by increasing the training corpus. Each manually inspected document (classified as either relevant or irrelevant) represents a new addition to the training data used for document categorization. In addition to refining the model and improving performance, a feedback loop to the classification module reduces the need for a large initial training corpus.
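Steps 3, 4, and 7 can be sketched together in a few lines of standard-library Python. This is a toy illustration under stated assumptions, not the implementation used for TANTIGEN: the stop word list, the crude suffix-stripping `stem` function (a stand-in for a proper stemmer), the cosine-similarity k-NN vote, and the six training sentences are all invented for the example, and dimensionality reduction by singular value decomposition is omitted.

```python
import math
import re
from collections import Counter

# Illustrative stop word list; a real pipeline would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "by", "for", "from", "in", "is",
              "of", "on", "the", "to", "with"}


def stem(token):
    """Crude suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_vector(text):
    """Step 3: tokenize, drop stop words, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(stem(t) for t in tokens if t not in STOP_WORDS)


def document_term_matrix(texts):
    """Step 3: one row per document, one column per vocabulary term."""
    vectors = [term_vector(t) for t in texts]
    vocabulary = sorted(set().union(*vectors))
    return vocabulary, [[vec[term] for term in vocabulary] for vec in vectors]


def cosine(u, v):
    """Cosine similarity between two term-count vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


class KNNTextClassifier:
    def __init__(self, k=3):
        self.k = k
        self.examples = []  # list of (term vector, label) pairs

    def add_example(self, text, label):
        """Step 7: every manually inspected document extends the corpus."""
        self.examples.append((term_vector(text), label))

    def predict(self, text):
        """Step 4: majority vote among the k most similar documents."""
        query = term_vector(text)
        ranked = sorted(self.examples,
                        key=lambda ex: cosine(query, ex[0]), reverse=True)
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


# Hypothetical training sentences, invented for this example.
TRAINING = [
    ("HLA-A2 restricted T cell epitope identified in melanoma antigen", "relevant"),
    ("Cytotoxic T lymphocytes recognize a peptide epitope from a tumor antigen", "relevant"),
    ("Novel T cell epitopes of a tumor antigen presented by HLA class I", "relevant"),
    ("Serum tumor marker levels predict survival in colorectal cancer patients", "irrelevant"),
    ("Antibody staining of tumor tissue sections for diagnosis", "irrelevant"),
    ("Surgical outcomes in patients with pancreatic cancer", "irrelevant"),
]

clf = KNNTextClassifier(k=3)
for text, label in TRAINING:
    clf.add_example(text, label)
```

Calling `clf.predict(...)` on a new abstract returns the majority label of its nearest neighbors, and passing each curator-confirmed document back through `add_example` implements the feedback loop of Step 7 without retraining from scratch.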


2.2 Case study: TANTIGEN tumor T cell antigen database

Selection of useful tumor T cell antigens represents a major bottleneck to the study and design of cancer immunotherapies. The methods of selecting immunotherapy targets involve the selection of antigens and the analysis of their immune epitopes. This process has been greatly enhanced by the use of computational immunology methods [10]. However, as computational efforts produce vast amounts of potential targets, the bottleneck is shifted to the wet lab, where the vaccine target candidates must be validated for both relevance and immunogenicity before they are included in potential vaccine constructs. Great advances have been made in techniques for high-throughput epitope validation [11, 12], but as computational methods grow ever more powerful, so does the need for post-analysis verification of results. Efficient cataloguing of experimentally validated epitopes for cross-referencing of new predictions with past experimental data is a valuable resource that could reduce the need for and streamline further experimentation. Several specialized resources for this and similar purposes have been established, for example: IRD [4], The HIV databases (http://www.hiv.lanl.gov), Human Papillomavirus T cell Antigen Database for HPV (http://cvc.dfci.harvard.edu/hpv/index.html), as well as general HLA binder repositories such as SYFPEITHI [13] and the Immune Epitope Database (IEDB) [14]. The TANTIGEN database was established in 2007 as a tumor-specific T cell antigen database. It provides the scientific community with a curated repository of experimentally validated tumor T-cell antigens, and matched T-cell epitopes and HLA binders. Each antigen entry contains detailed information about somatic mutations from the Catalogue of Somatic Mutations in Cancer (COSMIC) [15], splice isoforms from UniProt/Swiss-Prot, gene expression profiles from UniGene, and known T-cell epitopes from secondary databases or literature.
Additionally, TANTIGEN is equipped with a number of analysis tools such as BLAST search [16], multiple sequence alignment using MAFFT [17], T-cell epitope/HLA ligand prediction [18, 19] and visualization, and tumor antigen classification [20].

2.2.1 Updating TANTIGEN

Keeping the data up to date represents a major bottleneck in the maintenance of TANTIGEN. In 2012, 7,322 articles matching the keywords "tumor antigen" were indexed in PubMed. Although many of these articles may not contain tumor T cell antigens, the

Page 203: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

181

growing quantities of literature represents a major bottleneck in the maintenance of curated databases [9]. The data types to be updated in TANTIGEN are experimentally characterized T cell epitopes and HLA ligands, and expression and variability information for the proteins that harbor them. In build 1 of TANTIGEN, these data were collected from six different sources: manual collection from the literature, the Peptide database: T cell-defined tumor antigens (http://www.cancerimmunity.org/peptide/), the listing of human tumor antigens recognized by T cells by Parmiani and colleagues [21, 22], and parsing from IEDB, as well as four other public databases that are outdated or unavailable at present. The primary resource for these data remains manual collection from the literature, as no primary database is actively collecting or curating tumor antigen data. IEDB offers some curated cancer data (2.7% of available data curated as of November 2009 [14]), but in their February 2011 newsletter they announced that they will no longer curate cancer tumor epitope data. Table 1: Examples of PubMed results from a selection of keyword searches (publication data from December 1, 2009 - March 29, 2013).

Keyword                                        PubMed hits
cancer OR tumor OR antigen OR epitope          552,309
(tumor OR cancer) AND (antigen OR epitope)     45,517
tumor AND antigen                              40,525
tumor antigen                                  22,264
tumor AND antigen AND epitope                  3,057
tumor AND antigen AND epitope AND T cell       852
"tumor antigen"                                642

2.2.1.1 Preliminary filtering of literature

A simple keyword search for the terms “cancer OR tumor OR antigen OR epitope” in PubMed yielded >552,000 results (publication dates from December 1, 2009 to March 29, 2013). As keyword stringency increased, the number of hits decreased to a workable level (Table 1). For this task, we decided to use the search term "(tumor OR cancer) AND (antigen OR epitope)", which yields 48,130 hits in PubMed. Keyword search terms could be further expanded or refined by iterating either manually or through feature extraction of discriminative terms using machine-learning methods. Manual sorting of these articles is an extremely laborious task. PubMed is currently growing at approximately 4% per year [23], so the problem will only grow. It is therefore advantageous to automate the classification of publication content before manually extracting relevant information. For this task, we employed an adapted version of the conceptual framework to update TANTIGEN.
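A date-restricted keyword search of this kind can also be expressed programmatically. The sketch below is only an illustration: the thesis does not state how the searches were executed, and the use of the NCBI E-utilities `esearch` endpoint, the function name `build_pubmed_query`, and the `retmax` value are assumptions. The code only builds the request URL and does not contact PubMed.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (assumed here; not named in the text).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(term, min_date, max_date, retmax=100):
    """Return an esearch URL restricted to a publication-date window."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",   # filter on publication date
        "mindate": min_date,  # format YYYY/MM/DD
        "maxdate": max_date,
        "retmax": retmax,     # number of PubMed IDs returned per request
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Search term and date window taken from the text above.
url = build_pubmed_query(
    "(tumor OR cancer) AND (antigen OR epitope)",
    "2009/12/01",
    "2013/03/29",
)
print(url)
```

Restricting by date keeps each update limited to articles published since the previous build, which is the filtering idea described above.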

2.2.1.2 Formal approach to updating

Step 1: Status report of TANTIGEN build 1. The status report for TANTIGEN lists 251 unique proteins and corresponding UniProt accession numbers. Many of these proteins have multiple splice isoforms for which UniProt accession numbers are also listed. All UniProt accession numbers are listed as these entries are subject to updating by direct parsing from UniProt data downloads. Similarly, PubMed IDs are listed for all referenced articles. These articles represent relevant literature and corresponding abstracts can be directly parsed from the PubMed abstract download to the training document set. Build 1 of TANTIGEN has 4,006 curated antigen entries.

Step 2: Automatic data download. The latest versions of UniProt and COSMIC are downloadable as XML files from the database web sites. PubMed results can be narrowed down by search term; in this case we used "(cancer OR tumor) AND (antigen OR epitope)", but this can be refined in later iterations if suitable. Due to the very high volume of abstracts in PubMed, query results can also be filtered by date, and we here filtered out articles published before the last TANTIGEN update. Search results are downloadable in XML format.

Step 3: Automatic data pre-processing. The COSMIC and UniProt XML downloads needed no further pre-processing for parsing. The PubMed abstracts were extracted from the XML and parsed into a text corpus format for pre-processing. The following tasks were performed on the corpus: lower case transformation, removal of stop words, removal of general punctuation, word stemming, and white space stripping. Numbers are usually removed in text mining pre-processing, but this was not done here because we needed to preserve the terms defining HLA alleles, CD receptors, and other immunologically relevant descriptors.

Step 4: Abstract categorization. The resulting document-term matrix (DTM) was Tf-Idf transformed, and each abstract was classified using a k-Nearest Neighbor (k-NN) classifier trained on 226 manually pre-classified abstracts. Iterative refinement of the algorithm showed that a model with six nearest neighbors yielded the best results. Each abstract in the corpus was given a probability score based on the ratio of relevant neighbors in the model. The output list was ordered from most probable to least, thus eliminating the need to define a static threshold.

Step 5: Manually extract antigen data from literature. The articles corresponding to each abstract classified as relevant were accessed through PubMed or the publishing journal. Epitopes, HLA ligands and related data, such as HLA restriction and protein of origin, were extracted. For TANTIGEN build 2, we manually searched the top 273 articles out of the 48,130 classified articles. The cutoff of 273 articles was chosen when article relevance started decreasing drastically in the ordered list during manual data extraction.

Step 6: Submission of data. Submission was done by filling out a standardized TANTIGEN submission form for each antigen. Additional information was parsed directly from the downloaded UniProt XML, based on the protein of origin. Similarly, mutation entries and splice variants were automatically linked by cross-referencing with COSMIC XML. Entries in TANTIGEN were automatically linked to each other where applicable (splice isoforms, mutation entries, etc.). Updating of existing entries was performed by automated parsing from UniProt XML, as some entries had been removed, assigned new accession numbers, or updated with more splice isoforms. This step also serves as an error detection step: if an existing entry in TANTIGEN does not match the information entered in the standardized submission form, the user is notified and prompted to determine whether the existing entry, the submission, or both are erroneous. Similarly, if protein information extracted from UniProt does not match that in COSMIC, the user will be prompted to resolve the issue, thus increasing data quality.

Step 7: Refine training set with new entries. The TANTIGEN submission form has an additional field, where the curator performing the manual submission is prompted to classify the article as "relevant" or "irrelevant". This feature was used to feed manually inspected abstracts back into the training corpus, to increase its size and thus performance. The false positives and false negatives were fed back, but only a randomly selected fraction of true positives and true negatives were fed back into the training corpus, as these may further bias a potentially already biased model.
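Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used for TANTIGEN: the stop-word list is a tiny stand-in, stemming is omitted, cosine similarity is swapped in for the distance measure (the thesis does not state one), the function names are hypothetical, and the toy abstracts are invented for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption; real lists are far longer).
STOP_WORDS = {"the", "of", "a", "an", "in", "and", "is", "was", "for", "to", "by"}


def tokenize(text):
    """Lower-case, strip punctuation and stop words; digits are kept so
    that tokens such as 'a2' or 'cd8' survive (stemming is omitted)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def tfidf_vectors(docs):
    """Return one sparse Tf-Idf vector (dict term -> weight) per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    doc_freq = Counter(term for c in counts for term in c)
    vectors = []
    for c in counts:
        total = sum(c.values())
        vectors.append(
            {t: (n / total) * math.log2(n_docs / doc_freq[t]) for t, n in c.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


def relevance_scores(train_vecs, train_labels, query_vecs, k=6):
    """Score each query abstract as the fraction of its k nearest
    training abstracts that are labelled relevant (1)."""
    scores = []
    for q in query_vecs:
        ranked = sorted(
            range(len(train_vecs)),
            key=lambda i: cosine(q, train_vecs[i]),
            reverse=True,
        )
        top = ranked[:k]
        scores.append(sum(train_labels[i] for i in top) / len(top))
    return scores


# Invented toy abstracts: 1 = relevant, 0 = irrelevant.
train_docs = [
    "CTL epitope from MAGE-A3 presented by HLA-A2 in melanoma immunotherapy",
    "T cell epitope of survivin recognized by CD8 CTL",
    "Serum antigen levels and tumor stage in a surgical cohort",
    "Imaging of tumor volume after chemotherapy",
]
train_labels = [1, 1, 0, 0]
new_docs = [
    "Novel HLA-A2 restricted CTL epitope recognized by CD8 T cell",
    "Tumor volume measured by imaging in a cohort",
]

vectors = tfidf_vectors(train_docs + new_docs)
train_vecs, query_vecs = vectors[:4], vectors[4:]
scores = relevance_scores(train_vecs, train_labels, query_vecs, k=2)

# Order the new abstracts from most to least probably relevant.
ranking = sorted(zip(scores, new_docs), reverse=True)
for score, doc in ranking:
    print(f"{score:.2f}  {doc}")
```

Because the output is a ranked list rather than a hard label, a curator can work from the top and stop when relevance drops off, which mirrors the cutoff described in Step 5.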

2.2.2 Results of TANTIGEN update

2.2.2.1 Accuracy of classification

The average accuracy in the five-fold cross-validation training of the k-NN model with 6 nearest neighbors was 0.95, with a sensitivity of 0.96 and a specificity of 0.93. Model performance is likely to increase with training set size, and particularly with the addition of false positives from the manual extraction step. True positives should also be added to the training corpus, but including all true positives may further bias a potentially biased model. Special care should be taken in initial classification rounds to extract and include false negatives, as low sensitivity is highly detrimental to the quality and completeness of the update. Wrongfully discarding relevant literature will not only lead to potentially permanent loss of valuable data, but will also negatively affect classifier performance when misclassified training data are fed back into the model.
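For readers less familiar with the three measures reported above, the following sketch shows how they are derived from true and predicted labels. The function name and the toy labels are illustrative assumptions and do not reproduce the thesis data.

```python
def classification_metrics(true_labels, predicted_labels):
    """Return accuracy, sensitivity and specificity for binary labels
    (1 = relevant, 0 = irrelevant)."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if truth == 1 and predicted == 1:
            tp += 1  # relevant abstract correctly kept
        elif truth == 1:
            fn += 1  # relevant abstract wrongly discarded
        elif predicted == 1:
            fp += 1  # irrelevant abstract wrongly kept
        else:
            tn += 1  # irrelevant abstract correctly discarded
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
    }


# Invented toy labels: one false negative and one false positive.
true_labels = [1, 1, 1, 1, 0, 0, 0, 0]
predicted_labels = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = classification_metrics(true_labels, predicted_labels)
print(metrics)
```

Sensitivity is the measure tied to data completeness, since every false negative is a relevant article that never reaches the curator.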

2.2.2.2 Results of manual extraction of tumor T-cell antigens

Manual extraction of new antigenic proteins and tumor T-cell antigens was performed from the classified literature. Since classification was based on the six nearest neighbors, the body of classified abstracts was divided into seven groups, corresponding to whether an abstract had from zero to six relevant neighbors in the training set. Out of the 48,130 classified abstracts, 117 had six relevant neighbors, 156 had five, 212 had four, 859 had three, 3,489 had two, 12,738 had one, and 30,856 abstracts had zero relevant neighbors. We manually examined the top 273 scoring papers, that is, those with five or six relevant neighbors, in which we found 13 new antigenic proteins harboring 32 new tumor T-cell epitopes. Additionally, we found more than 100 new T-cell epitopes discovered in proteins already recorded as tumor antigens in TANTIGEN.

2.2.2.3 Training set refinement iteratively increases classification accuracy

The performance of the document classification model is expected to gradually increase as the size of the training corpus is increased with each database update. Learning curves for accuracy, sensitivity, and specificity, constructed by gradually increasing the training corpus for a test corpus fixed at 50 abstracts (25 relevant and 25 irrelevant), support this notion (Figure 2). Although the sensitivity and specificity show some fluctuations, accuracy is observed to steadily increase as the training set size is increased in increments of 26 abstracts (13 relevant and 13 irrelevant). The learning curves will likely plateau with the addition of further training abstracts, although any increase in sensitivity will add to data completeness, and increased specificity will reduce the manual workload.

Figure 2: Learning curve for training sets of increasing size. Initial training set consisted of 13 relevant and 13 irrelevant abstracts. Training set was increased to 260 abstracts in increments of 26 additional abstracts. Test set was fixed at 50 abstracts, 25 relevant and 25 irrelevant.

2.2.2.4 Abstract category signatures

The DTM of the training corpus contains more than 5,600 terms. Most are very rare terms present in only one or a few abstracts, and have very little influence on the classification of abstracts as relevant or irrelevant. Rare terms can be removed by setting a sparsity threshold if DTM dimensions become too large. Examining the ten terms that discriminate most strongly between abstracts of relevant and irrelevant articles (determined by t test) shows a distinct signature and reveals particular emphasis on such terms as "immunotherapy", "epitope", "T cell", and "CTL" (Figure 3). These terms are likely the main drivers of classification and may very well be sufficient to support the main task of classification. Notably, all discriminating terms are predominant in relevant abstracts, which may explain why the sensitivity of classification is higher than the specificity. This is most likely due to the highly specific nature of the relevant abstracts, whereas irrelevant abstracts are a much broader class. However, these terms are still represented in the corpus of irrelevant literature, so a machine learning approach to classification is highly likely to outperform a simple keyword search.
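The signature extraction described above can be illustrated with a short sketch. It rests on several assumptions: Welch's unequal-variance form of the t statistic is used (the thesis does not say which form was applied), terms are ranked by the absolute t statistic rather than by p-value (the two orderings agree when degrees of freedom are similar), and the function names and toy term counts are invented for illustration.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples with possibly unequal variances."""
    diff = mean(a) - mean(b)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def top_discriminative_terms(relevant, irrelevant, n=10):
    """Rank terms by |t| between two lists of per-abstract
    term-frequency dicts; positive t means more frequent in relevant."""
    terms = sorted({t for doc in relevant + irrelevant for t in doc})
    scored = []
    for term in terms:
        rel = [doc.get(term, 0) for doc in relevant]
        irr = [doc.get(term, 0) for doc in irrelevant]
        scored.append((term, welch_t(rel, irr)))
    scored.sort(key=lambda pair: abs(pair[1]), reverse=True)
    return scored[:n]


# Invented per-abstract counts of stemmed terms.
relevant = [
    {"epitop": 2, "tumor": 1},
    {"epitop": 3, "tumor": 1},
    {"epitop": 2, "ctl": 1, "tumor": 2},
]
irrelevant = [
    {"tumor": 1},
    {"tumor": 2, "epitop": 1},
    {"tumor": 1},
]

top = top_discriminative_terms(relevant, irrelevant, n=3)
for term, t_value in top:
    print(f"{term}: t = {t_value:.2f}")
```

In this toy example "tumor" occurs equally often in both classes and therefore carries no discriminative weight, which is the same reason a plain keyword search on such a term performs poorly.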

Figure 3: Average frequency of the top ten most discriminative terms between relevant (above x axis) and irrelevant abstracts (below x axis). Significance of difference is based on t test of term frequency between corpora and p-values are listed between bars. Terms are stemmed to ensure completeness in term count.

[Figure 3 plot: y axis, average term frequency; panels, relevant literature and irrelevant literature; p-values in plot order: 4.4E-18, 6.5E-18, 2.3E-16, 4.1E-16, 1.4E-15, 1.1E-10, 3.0E-10, 3.1E-10, 7.3E-10, 3.5E-9.]

3 Conclusion

Specialized biological databases are gradually moving from data repositories towards knowledge-based systems. Enriching basic biological data with higher-level functional annotations and facilitating specialized analyses in organized workflows enables extraction of higher-level knowledge. Currently, however, functional annotations are primarily stored in the literature, rather than in standardized formats of primary biological databases. As the quantity of this information increases, easy access to multiple layers of biological data and information enables improved extraction of knowledge, thus increasing the value to the user. We here present a conceptual framework for automating the process of updating biological databases and knowledgebases with standardized and non-standardized data from both primary and secondary data repositories, as well as literature. We deployed a text mining-based approach to categorize literature, based on defining term signatures of freely available article abstracts, which enables significantly faster manual extraction of relevant data. We have applied this conceptual framework to literature for updating the TANTIGEN KBS of tumor T cell antigens. Training of a k-NN classifier on 260 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature.

4 Methods

4.1 Data sources

Data for updating TANTIGEN were extracted from three primary databases: UniProt/Swiss-Prot for protein data and information, COSMIC for data about somatic mutations, and PubMed for published literature about tumor antigens. All three databases are extensive repositories for their respective data types, and the quantity of data is increasing steadily (Figure 4).

Figure 4: Number of entries in PubMed, UniProt/Swiss-Prot, and COSMIC. Entries in PubMed were filtered by the search term “(tumor OR cancer) AND (antigen OR epitope)”.

[Figure 4 plot: three panels (PubMed, UniProt/Swiss-Prot, COSMIC); x axis, year; y axis label, entries (thousands).]

All three databases offer download in XML format. The desired information was directly parsable from UniProt/Swiss-Prot and COSMIC, but only abstracts were available for PubMed entries, and protein information and epitopes from these entries required manual extraction. To aid the process of KB update, text mining and machine learning tools were employed to filter text entries as either relevant (containing T cell epitopes) or irrelevant (not containing T cell epitopes).

4.2 Classification of literature abstracts

4.2.1 Corpus

A corpus for classification was extracted from PubMed using the search terms “(tumor OR cancer) AND (antigen OR epitope)”. Each entry in the corpus contains the article abstract, the title, and the MeSH terms. Before classification, a number of term transformation steps were taken: lower case transformation, removal of numbers, removal of stop words, removal of punctuation, word stemming, synonym consolidation using the WordNet database [26], and white space removal. Term pre-processing of the text corpus was done using the R package tm [24, 25]. After term counting, term frequency–inverse document frequency (Tf-Idf) transformation was applied for background correction [26].
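For reference, one common form of the Tf-Idf weight is written out below. The thesis does not state which variant was applied; the length-normalized term frequency and the base-2 logarithm are assumptions, chosen because they correspond to the documented default of the weighting function in the tm package. Here n_{t,d} is the number of occurrences of term t in abstract d, and N is the number of abstracts in the corpus.

```latex
\mathrm{tfidf}(t, d) \;=\;
\frac{n_{t,d}}{\sum_{t'} n_{t',d}}
\cdot
\log_2 \frac{N}{\left| \{\, d' : n_{t,d'} > 0 \,\} \right|}
```

A term that occurs in every abstract receives a weight of zero, which is the background correction referred to above.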

4.2.2 Classification

Abstracts were classified using the k-NN algorithm [27] from the R package class. The classifier was trained and performance evaluated for 1-155 nearest neighbors using five-fold cross-validation on a set of 310 abstracts (155 abstracts of irrelevant articles and 155 abstracts of relevant articles). This training set was manually assembled for initial training. Classification was done based on 6 neighbors in the k-NN algorithm, since this number of neighbors proved most accurate.
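The fold construction behind a five-fold cross-validation of this size can be sketched as follows. Whether the folds were stratified by class is not stated in the thesis, so the stratification, the function name, and the label ordering are assumptions made for illustration.

```python
def stratified_folds(labels, n_folds=5):
    """Assign sample indices to folds so that each fold receives an
    (almost) equal share of every class."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


# 155 relevant (1) and 155 irrelevant (0) abstracts, as in the text.
labels = [1] * 155 + [0] * 155
folds = stratified_folds(labels)

for held_out in folds:
    # Train on the other four folds, evaluate on the held-out fold.
    training = [i for fold in folds if fold is not held_out for i in fold]
    n_relevant = sum(labels[i] for i in held_out)
    print(len(training), len(held_out), n_relevant)
```

With 310 abstracts, each held-out fold contains 62 abstracts (31 of each class) and each training split contains 248; repeating the evaluation for every candidate number of neighbors and averaging over the five folds gives the accuracy used to select k.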

4.2.3 Abstract category signatures

A signature of the top ten terms most discriminating between relevant and irrelevant literature was extracted by t-test of differential term occurrence in relevant and irrelevant abstracts. The average term count was calculated for the ten most discriminating terms, i.e. the terms with the lowest p-values.

5 References

1. Fernández-Suárez XM, Galperin MY: The 2013 Nucleic Acids Research Database Issue and the online molecular biology database collection. Nucleic acids research 2013, 41:D1–7.
2. Magrane M: UniProt Knowledgebase: a hub of integrated protein data. Database: the journal of biological databases and curation 2011, 2011:bar009.
3. Benson DA, Karsch-Mizrachi I, Clark K, Lipman DJ, Ostell J, Sayers EW: GenBank. Nucleic acids research 2012, 40:D48–53.
4. Squires RB, Noronha J, Hunt V, García-Sastre A, Macken C, Baumgarth N, Suarez D, Pickett BE, Zhang Y, Larsen CN, Ramsey A, Zhou L, Zaremba S, Kumar S, Deitrich J, Klem E, Scheuermann RH: Influenza Research Database: an integrated bioinformatics resource for influenza research and surveillance. Influenza and other respiratory viruses 2012.
5. Olsen LR, Zhang GL, Reinherz EL, Brusic V: FLAVIdB: A data mining system for knowledge discovery in flaviviruses with direct applications in immunology and vaccinology. Immunome research 2011, 7:1–9.
6. Sebastiani F: Machine learning in automated text categorization. ACM Computing Surveys 2002, 34:1–47.
7. Zhao J, Miles A, Klyne G, Shotton D: Linked data and provenance in biological data webs. Briefings in bioinformatics 2009, 10:139–52.
8. Mierswa I, Wurst M, Klinkenberg R, Scholz M: YALE: Rapid Prototyping for Complex Data Mining Tasks. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD ’06) 2006:935–940.
9. Schönbach C, Nagashima T, Konagaya A: Textmining in support of knowledge discovery for vaccine development. Methods 2004, 34:488–95.
10. Brusic V, August JT, Petrovsky N: Information technologies for vaccine research. Expert review of vaccines 2005, 4:407–17.
11. Wulf M, Hoehn P, Trinder P: Identification of human MHC class I binding peptides using the iTOPIA epitope discovery system. Methods in molecular biology 2009, 524:361–7.
12. Andersen RS, Kvistborg P, Frøsig TM, Pedersen NW, Lyngaa R, Bakker AH, Shu CJ, Straten PT, Schumacher TN, Hadrup SR: Parallel detection of antigen-specific T cell responses by combinatorial encoding of MHC multimers. Nature protocols 2012, 7:891–902.
13. Schuler MM, Nastke M-D, Stevanović S: SYFPEITHI: database for searching and T-cell epitope prediction. Methods in molecular biology 2007, 409:75–93.
14. Vita R, Zarebski L, Greenbaum JA, Emami H, Hoof I, Salimi N, Damle R, Sette A, Peters B: The immune epitope database 2.0. Nucleic acids research 2010, 38:D854–62.
15. Forbes SA, Bhamra G, Bamford S, Dawson E, Kok C, Clements J, Menzies A, Teague JW, Futreal PA, Stratton MR: The Catalogue of Somatic Mutations in Cancer (COSMIC). Current protocols in human genetics 2008, Chapter 10:Unit 10.11.
16. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. Journal of molecular biology 1990, 215:403–10.
17. Katoh K, Toh H: Recent developments in the MAFFT multiple sequence alignment program. Briefings in bioinformatics 2008, 9:286–98.
18. Nielsen M, Lundegaard C, Lund O: Prediction of MHC class II binding affinity using SMM-align, a novel stabilization matrix alignment method. BMC bioinformatics 2007, 8:238.
19. Lundegaard C, Lamberth K, Harndahl M, Buus S, Lund O, Nielsen M: NetMHC-3.0: accurate web accessible predictions of human, mouse and monkey MHC class I affinities for peptides of length 8-11. Nucleic acids research 2008, 36:W509–12.
20. Van den Eynde BJ, van der Bruggen P: T cell defined tumor antigens. Current opinion in immunology 1997, 9:684–93.
21. Renkvist N, Castelli C, Robbins PF, Parmiani G: A listing of human tumor antigens recognized by T cells. Cancer immunology, immunotherapy: CII 2001, 50:3–15.
22. Novellino L, Castelli C, Parmiani G: A listing of human tumor antigens recognized by T cells: March 2004 update. Cancer immunology, immunotherapy: CII 2005, 54:187–207.
23. Lu Z: PubMed and beyond: a survey of web tools for searching biomedical literature. Database: the journal of biological databases and curation 2011, 2011:baq036.
24. Feinerer I: Introduction to the tm Package Text Mining in R. R vignette 2011:1–8.
25. Feinerer I, Hornik K, Meyer D: Text Mining Infrastructure in R. Journal of Statistical Software 2008, 25.
26. Jones KS: A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation 1972, 28:11–21.
27. Cover TM, Hart PE: Nearest neighbor pattern classification. IEEE Transactions on Information Theory 1967, 13:21–27.

Paper VI Tumor antigens as proteogenomic biomarkers in invasive ductal carcinomas

Lars Rønn Olsen1,2,3, Benito Campos4, Ole Winther1,5, and Vladimir Brusic3,6,7

1 Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark
2 Biotech Research and Innovation Center (BRIC), University of Copenhagen, Copenhagen, Denmark
3 Cancer Vaccine Center, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
4 Division of Experimental Neurosurgery, Department of Neurosurgery, Heidelberg University Hospital, Heidelberg, Germany
5 Cognitive Systems, DTU Compute, Technical University of Denmark, Lyngby, Denmark
6 Department of Medicine, Harvard Medical School, Boston, MA, USA
7 Department of Computer Science, Metropolitan College, Boston University, Boston, MA, USA

Abstract

The majority of genetic biomarkers for human cancers are elucidated by statistical screening of high-throughput genomics data. A large number of genetic biomarkers have been proposed for diagnostic and prognostic applications, and a small number of these biomarkers are regularly applied in the clinic. Similarly, the use of proteomics methods for the discovery of cancer biomarkers is increasing. The emerging field of proteogenomics seeks to enrich the value of genomics and proteomics approaches by studying the intersection of the two data types. This task is challenging due to the complex nature of transcriptional and translational regulatory mechanisms and disparities between genomic and proteomic data from the same samples. Limited understanding of the underlying genetic mechanisms and the role of their products in cancer development is a major pitfall of biomarker discovery studies that use large-scale data-driven statistical analyses. In this work, we have studied tumor antigens (proteins that are altered or overexpressed in cancers) as potential biomarkers in breast cancer. We applied a proteogenomic analysis to study the genetic aberrations of 32 tumor antigens. We found that tumor antigens that are aberrantly expressed at the genetic level and expressed at the protein level are likely to be involved in perturbing pathways directly linked to the hallmarks of cancer.

1 Introduction

Cancer cells differ from normal cells by genetic and epigenetic aberrations in a number of cellular functions. The genetic hallmarks of cancer include self-sufficiency in growth signals, insensitivity to growth-inhibition signals, unlimited replicative potential, resistance to apoptosis, sustained angiogenesis, and local tissue invasion and metastasis. In addition, cancer cells display deregulated cell energetics, instability and mutation of the genome, avoidance of immune destruction, and tumor-promoting inflammation [1]. Epigenetic changes include DNA methylation, histone modifications, nucleosome positioning, and miRNA expression [2]. These hallmarks define functional profiles and aberrations that distinguish cancer from normal tissue. The accumulation of critical mutations results in measurable changes from normal cells including gain, loss, or functional alterations of genes and proteins expressed in cancer cells. Molecular aberrations in cancers may include loss of transcripts or proteins that are normally expressed on cells. Immune evasion employed by tumors involves multiple cellular and molecular mechanisms. Examples include blocking of the STAT-3 signaling pathway [3], toll-like receptor (TLR) activation [4], production of immunosuppressive cytokines [5], infiltration of myeloid suppressor cells (MDSCs) or regulatory T cells (TREG) [6, 7], activation of immunosuppressive networks [8], and down-regulation of human leukocyte antigen (HLA) or impairment of antigen processing and presentation [9, 10]. Inflammation promotes multiple hallmark capabilities through a spectrum of bioactive molecules that are supplied to the tumor microenvironment in support of the hallmarks of cancer [11–16]. These include growth factors that support proliferative signaling, and survival factors that reduce cell death. Inflammation also promotes the transformation of epithelial cells that enables cancer cells to invade, to resist apoptosis, and to spread to other tissues. 
Inflammation also promotes factors that stimulate angiogenesis, local tissue invasion, and metastasis.

Tumor antigens (TAs) are tumor proteins that, when expressed in tumors, are recognized by the host immune system. They represent markers that are either specific for individual tumors or are overexpressed in tumors as compared to normal tissues [17]. TAs can be neoantigens (tumor-specific antigens) that arise from mutation or RNA splicing. Neoantigens are expressed only by cancer cells and not by normal tissue [18]. Tumor-associated antigens show increased expression in cancer tissue as compared to normal tissues (e.g. IDO1 [19], HER2 [20], or survivin [21]). Tissue-specific antigens are recruited and expressed by cancers in specific tissues (e.g. Cyclin-A1 [22] or Cancer/Testis Antigen 1B [23]). TAs are targets for cancer diagnostics and therapy that have been studied extensively: more than 1400 clinical trials focusing on TAs had been reported in clinicaltrials.gov as of October 2013 (www.clinicaltrials.gov).

A biomarker is a distinctive measurable molecular or biochemical feature that can be used to evaluate normal and pathological biological processes, biological responses, and the outcomes of such processes [24]. Biomarkers often serve as diagnostic, prognostic or drug targets. They are useful for early disease detection, accurate diagnosis, evaluation of the characteristics and severity of the disease, selection of effective treatments, and post-treatment disease monitoring [25]. Most common biomarkers are based on patterns in expression of genes, RNA, proteins, or epigenetic patterns [26]. An ideal biomarker is highly specific to malignant tissues, with little or no presence in normal tissue or tissues relating to other, non-malignant pathologies. Similarly, an ideal biomarker would be expressed in all patients with a given cancer type, but as our understanding of cancer increases, the heterogeneous nature of the disease is becoming more apparent. Examples of cancer biomarkers routinely used in the clinic include α-fetoprotein (AFP) for diagnostics and management of testicular cancer [27], MUC16 (cancer antigen 125 or CA-125) for ovarian cancer [28], ERBB2 (HER2) protein for breast cancer [29], and prostate specific antigen (PSA) for prostate cancer [30].

The progression of a normal cell towards a neoplastic state comprises a cascade of events that are responsible for inducing tumorigenesis [31].
Each tumor cell is typically characterized by more than one of the hallmarks of cancer – tumors are not homogenous collections of neoplastic cells, but are complex mixes of multiple distinct cell types representing different stages of tumorigenesis [1]. Additionally, conditions of the tumor microenvironment may also contribute to the cellular characteristics of tumor cells, as this microenvironment may induce genetic instability in tumor cells [32]. These properties of tumors render the definition of stable, general biomarkers a highly challenging task, as cancer cells also tend to alter their molecular constitution as they develop. The advent of high-throughput genomics methods has facilitated a dramatic increase in the number of proposed genetic biomarkers [33]. However, very few new genetic biomarkers have been added to the clinical toolbox in recent years [34]. Approximately 95% of human protein coding genes produce splice variant transcripts increasing the number of gene expression biomarker candidates [35]. Individual

Page 219: 20R%F8nn%20Olsen.pdf · The prognostic, diagnostic, and therapeutic potential of tumor antigens This thesis has been submitted to the PhD School of The Faculty of Science, University

197

genetic biomarkers are rarely informative as genetic redundancy provides for high genetic flexibility without necessarily affecting the biological phenotype [36]. The expression of many genes correlate with disease progression in individual patients, but only a small subset of these is consistently observed in larger cohorts. Expression of genes may also prove inconsistent over time, since cells respond to changes in environment and undergo transcriptomic changes in different stages of development. However, although transcriptomic difference are apparent between normal and cancerous tissues, the expression of TAs remains relatively constant during progression through different cancer stages, suggesting that most defining genetic alterations conferring cancerous potential occur at the early stages of tumorigenesis [37]. In contrast, similar studies of the tumor microenvironment reveal extensive gene expression changes in tumor stromal tissue during cancer progression [38]. The number of protein biomarkers has been growing, owing largely to the advances in mass spectrometry techniques [39]. However, protein biomarkers, just like genetic biomarkers, are rarely uniquely expressed or overexpressed in malignant tissues [34]. More than 260,000 protein variants resulting from alternative splicing and have been annotated to date [40]. Additionally, a variety of post-translational modifications (PTMs) change the structure and function of proteins. Common PTMs include phosphorylation, ubiquitination, glycosylation, methylation, and oxidation, among many others [41]. Most of the proposed protein biomarkers are not commonly used in clinical application as the cost of assays outweigh their prognostic value [42]. Other types of biomarkers with clinical potential include somatic mutations. For example, EGFR mutations have been associated with responses to the growth factor inhibitor gefitinib in non-small cell lung cancer patients [43]. 
Epigenetic alterations, such as methylation patterns, are being used as biomarkers for susceptibility to chemotherapy treatments of glioblastoma multiforme patients [44]. Structural variations such as copy number variations of TOP2A are being used as predictive markers of the effect of adjuvant epirubicin treatment in breast cancer patients [45]. Fusions of TMPRSS2 and ETS transcription factors serve as biomarkers for prostate cancer [46]. Similar to gene expression analyses, protein expression profiling has proven useful for stratification of cancer versus normal tissues for invasive ductal carcinoma, where differentially expressed proteins or protein sets can be used as biomarkers for carcinogenesis [47].


The emerging field of proteogenomics seeks to synergistically enrich the value of genomics and proteomics by studying the intersection of the two data types [48]. True integrative analysis of genomics and proteomics data is a non-trivial task, as expression of mRNA does not always correlate with expression of the corresponding protein [49]. The mechanisms of transcriptional and translational regulation have been extensively studied, but are not yet fully understood. However, if expression of individual genes or pathways correlates with protein expression, this may provide insights into the transcriptional effects and functional basis of a statistically derived biomarker [50]. More advanced approaches to proteogenomic profiling have been applied to invasive ductal carcinomas (IDCs) to reveal biological events not previously associated with the cancer type. This approach involves building networks of differentially expressed proteins and identification of networks of differentially expressed genes [51]. TAs serve as direct targets for biomarker discovery in tumors. Autologous antibodies against tumor associated antigens (TAAs) are often observed in cancer patient sera [52]. In addition to their diagnostic and prognostic value as biomarkers, TAAs and tumor specific antigens (TSAs) have direct therapeutic applications as targets of cancer immunotherapies [53]. The analysis of naturally processed TAs as proteogenomic biomarkers may also reveal additional insight into the dynamics of their transcription, which may aid in diagnosis, prognosis, and optimization of treatment. In this study we combined the analysis of protein expression and mRNA expression of a selection of well-described TAs [54, 55] as proteogenomic biomarkers for IDCs. We analyzed the expression of TAs on protein level in normal and IDC samples [47]. The results were compared to the expression levels of TA mRNA in normal, preinvasive, and invasive tissue [38]. 
Aberrantly expressed TAs and functional interactants were analyzed for their involvement in biological pathways, and potentially perturbed pathways were interpreted in the context of known molecular changes in IDC. This approach enabled us to account for a large proportion of the hallmarks of cancer, by the analysis of a very limited set of genes.

2 Materials and methods

2.1 Data

2.1.1 Protein expression data

Nine samples of estrogen receptor positive (ER+) IDC tissue were analyzed for expression of 1623 proteins using liquid chromatography coupled with mass spectrometry [47]. The tissue samples were obtained using laser capture microdissection. In addition to the nine IDC samples, nine (non-paired) normal tissue samples were analyzed for protein expression.

2.1.2 mRNA expression data

The mRNA expression data were extracted from breast cancer biopsies, collected at the Massachusetts General Hospital between 1998 and 2001 [37]. Three different types of samples were collected: normal tissue, preinvasive tissue, and invasive tissue of ER+ IDC. Tissue consists of cells from the epithelial and stromal compartments of the normal terminal ductal lobular unit. Nine samples of normal and preinvasive cancer were paired, nine samples of normal and invasive tissue were paired, and four samples were paired for all three tissue types. Tissue samples were obtained using laser capture microdissection and were analyzed for mRNA expression using the Affymetrix whole genome array U133X3P [38].

2.1.3 Tumor antigen data

A list of known TAs was extracted from the TANTIGEN database of TAs (http://cvc.dfci.harvard.edu/tadb/). TANTIGEN contains 4245 T cell epitopes found in 258 unique protein TAs (November 2013) collected from the literature. Protein expression was measured for 32 of these TAs by Karger and colleagues [47]. Of these 32 proteins, gene expression was measured for 30 TAs [37]. We further analyzed these 30 TAs in a proteogenomic setting.

2.1.4 mRNA and protein expression in IDC tissue

For the analysis of correlation between gene expression and protein expression, we examined 404 IDC tissue samples collected by The Cancer Genome Atlas (TCGA) consortium. Paired mRNA expression and protein expression data were available for 86 gene/protein pairs, of which 13 are known TAs. mRNA expression was measured using Agilent mRNA expression microarrays, and protein expression was measured using Reverse Phase Protein Arrays [56].

2.2 Analyses

2.2.1 Protein expression analysis

The TAs were extracted from the protein expression data. The nine normal samples were averaged and compared with expression in the nine invasive tissue samples. Log2 ratios of expression of each protein compared with the normal tissue average (normalized to 1 if expressed, or kept at 0 if not expressed) were calculated for each patient.
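The ratio calculation described above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function name `log2_ratios`, the example abundances, and the choice to return `None` when either the tumor value or the normal average is zero are all assumptions made for the sketch.

```python
import math
from statistics import mean


def log2_ratios(normal, tumor):
    """Per-patient log2(tumor / mean normal) for a single protein.

    Returns None for a patient when the ratio is undefined (zero tumor
    value or zero normal average). That zero handling is an assumption
    of this sketch, not a rule taken from the text.
    """
    baseline = mean(normal)
    ratios = []
    for value in tumor:
        if baseline <= 0 or value <= 0:
            ratios.append(None)
        else:
            ratios.append(math.log2(value / baseline))
    return ratios


# Hypothetical abundances for one protein (arbitrary units).
normal_samples = [2.0, 4.0, 6.0]   # normal tissue average = 4.0
tumor_samples = [8.0, 4.0, 0.0]    # one value per patient
print(log2_ratios(normal_samples, tumor_samples))
```

A positive value marks a protein that is more abundant in the patient's IDC sample than in the averaged normal tissue, and a negative value marks a less abundant one.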


2.2.2 mRNA expression analysis

Raw probe intensities were background corrected using RMA, quantile normalized [57, 58], and the probe sets were indexed relative to the median. Fold changes were calculated as individual mRNA expression compared to the median mRNA expression and log2 transformed. TA genes were extracted from the expression data. Fold changes were calculated between the three tissue types (normal, preinvasive, and invasive). Genes were examined for consistent differential expression in the patient cohort using a paired t test. Additionally, gene expression dynamics in the three tissue types were examined in each individual patient.
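Two of the steps above, the log2 fold change relative to the median and the paired t test, can be illustrated with a short standard-library sketch. The function names and all expression values are hypothetical. The critical value 2.306 is the two-sided 5% point of the t distribution with 8 degrees of freedom, which applies to nine pairs; a statistics package would normally supply an exact p value instead.

```python
import math
from statistics import mean, median, stdev


def log2_fold_to_median(intensities):
    """log2 of each positive intensity relative to the median intensity."""
    med = median(intensities)
    return [math.log2(v / med) for v in intensities]


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two matched lists with at least two pairs")
    diffs = [a - b for a, b in zip(after, before)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical log2 expression of one gene in nine paired samples.
normal = [7.1, 6.8, 7.4, 7.0, 6.9, 7.3, 7.2, 6.7, 7.0]
invasive = [7.9, 7.5, 8.1, 7.4, 7.8, 8.0, 7.6, 7.5, 7.7]

t_stat, dof = paired_t(normal, invasive)
T_CRIT_DF8 = 2.306  # two-sided 5% critical value for 8 degrees of freedom
print(f"t = {t_stat:.2f}, df = {dof}, significant = {abs(t_stat) > T_CRIT_DF8}")
```

Pairing matters here because each patient serves as their own control, so between-patient variation does not inflate the error term.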

2.2.3 Annotation

Each TA was characterized for its potential functions in tumorigenesis in each patient to gain an insight into the underlying biology of the observed expression profiles. Information about protein function and role in disease was extracted from OMIM [59], UniProt [60], GeneCards [61], the Human Protein Atlas [62] and UniGene (http://www.ncbi.nlm.nih.gov/unigene).

2.2.4 Pathway analysis

Functional neighbors of aberrantly regulated TAs were extracted from the STRING database of protein-protein interactions (version 9.05) [63]. Interactions with a confidence score > 0.5 were considered, and the resulting protein groups were analyzed for their overlapping involvement in canonical pathways using the molecular signatures database, MSigDB [64].
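The neighbor extraction step can be sketched as a simple filter over scored interactions. This is an illustration only: it assumes the interactions have already been exported as (protein, protein, score) tuples with scores on a 0 to 1 scale, and the function name `neighbors_above` and the scores in the example are hypothetical.

```python
def neighbors_above(edges, seeds, threshold=0.5):
    """Map each seed protein to partners whose score exceeds the threshold.

    edges: iterable of (protein_a, protein_b, score) tuples (assumed format).
    seeds: the proteins of interest, here the aberrantly regulated TAs.
    """
    seeds = set(seeds)
    result = {seed: set() for seed in seeds}
    for a, b, score in edges:
        if score <= threshold:
            continue  # keep only interactions above the confidence cut-off
        if a in seeds:
            result[a].add(b)
        if b in seeds:
            result[b].add(a)
    return result


# Hypothetical scores; the protein names are taken from the text.
example_edges = [
    ("PGK1", "TPI1", 0.92),
    ("PGK1", "GAPDH", 0.88),
    ("PGK1", "MUC1", 0.31),
]
print(neighbors_above(example_edges, ["PGK1", "MUC1"]))
```

With these inputs PGK1 keeps TPI1 and GAPDH as neighbors, while MUC1 has none because its only interaction falls below the cut-off.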

2.2.5 Correlating mRNA and protein expression

To determine correlation between mRNA expression and protein expression, we calculated Spearman's rank correlation coefficient. The threshold for correlation was set at ρ > 0.455 to compare our results with a previous study of correlation between the mRNA and protein expression by Gry et al. [65]. The authors used the ρ > 0.455 threshold by determining that the average correlation coefficient of expression in 1000 randomly chosen gene/product pairs from 23 cancer cell lines was 0.001, and that the selected value of ρ of 0.455 provided the 95% confidence interval of the ρ distribution.
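For readers who want to reproduce the correlation step, Spearman's coefficient can be computed as the Pearson correlation of average ranks. The sketch below uses only the standard library; the function names and the expression values are hypothetical, and only the 0.455 threshold comes from the text.

```python
import math
from statistics import mean

RHO_THRESHOLD = 0.455  # threshold adopted from Gry et al. [65]


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mRNA and protein measurements for one gene in six samples.
mrna = [2.1, 3.4, 1.8, 4.0, 2.9, 3.7]
protein = [0.8, 1.1, 0.9, 1.6, 1.0, 1.4]
rho = spearman(mrna, protein)
print(f"rho = {rho:.3f}, correlated = {rho > RHO_THRESHOLD}")
```

Working on ranks makes the measure insensitive to the different scales of microarray and protein array data, which is one reason a rank correlation suits this comparison.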


3 Results and discussion

3.1 Translation of mRNA in IDC tissue

The central dogma of molecular biology states that DNA is transcribed into RNA, which is in turn translated into protein. From this general rule, it is often assumed that a certain amount of DNA makes an equal, or at least proportional, amount of RNA, which in turn makes an equal or proportional amount of protein. However, it is becoming increasingly clear that a plethora of transcriptional and translational regulatory mechanisms affect the dynamics of the central dogma. Examples of transcriptional regulation include: miRNA interference [66]; epigenetic factors, such as methylation [67]; and alternative splicing [68]. Translational regulation affects the rates of degradation for different proteins [69] as well as PTMs [70]. Many of these regulatory mechanisms have been used in efforts to characterize new biomarkers for various cancers [71–74]. Although most of these mechanisms are not yet fully understood, they modulate transcription and translation in all cells. The correlation of expression between 1066 gene/product pairs from 23 human cancer cell lines of various origins and cancer types was examined [65]. This study reported 169 genes for which mRNA and protein expression correlated (at a threshold of ρ > 0.455). The study involved the analysis of biological function ontologies and it was found that within the set of 169 genes, ontologies relating to the cytoskeleton and adherens junctions in the cellular compartment, cellular motility, and other maintenance-related categories were significantly enriched [65]. Dysregulation of cellular processes is a feature of cells undergoing tumorigenesis, and transcriptional and translational mechanisms are likely to be dysregulated as well. Given the heterogeneous nature of cancer, it is unlikely that transcription and translation modulation is homogeneous in different cancers. 
This prompted us to examine the correlation between gene expression and protein expression in the TCGA IDC tissues, to provide an IDC specific context for our analysis of the unpaired sets of mRNA and protein expression data between normal, noninvasive and invasive breast cancer tissue. We calculated Spearman's ρ for the mRNA/protein expression pairs of the 86 genes examined by TCGA. Twenty-nine genes were found to have a ρ > 0.455, six of which are TAs (Figure 1, top panel). Although the number of examined proteins is too small to draw global conclusions (global patterns of translation in IDC tissue still remain to be elucidated in comprehensive analyses), it is noticeable that out of 13 examined TAs, mRNA and protein expressions only correlate in six. This observation is consistent with the results reported by Gry et al. [65] and TCGA [56] and it is therefore insufficient to study only the expression of mRNA when searching for TAs as potential immunotherapy targets. The mean correlation of expression of the 86 mRNA/protein IDC pairs was calculated to be 0.35. Randomly paired mRNA and protein expressions yield a mean correlation of approximately zero, indicating no apparent bias in the data. Distributions of correlation coefficients in the normal and randomized data set are shown in Figure 1 (bottom panel).
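The randomized baseline mentioned above can be approximated by re-pairing each gene's mRNA profile with the protein profile of a randomly chosen gene and averaging the resulting coefficients. The sketch below is one way to do this under stated assumptions: the data layout (one list per gene, same sample order), the function names, the number of shuffling rounds, and the toy values are hypothetical, and the shortcut formula for ρ assumes there are no tied values.

```python
import random
from statistics import mean


def ranks(values):
    """1-based ranks; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        out[idx] = rank
    return out


def spearman_no_ties(x, y):
    """Spearman's rho via 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid without ties."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


def matched_mean_rho(mrna, protein):
    """Mean rho when each gene is paired with its own protein."""
    return mean([spearman_no_ties(mrna[g], protein[g]) for g in sorted(mrna)])


def randomized_mean_rho(mrna, protein, rounds=200, seed=0):
    """Mean rho over repeated random re-pairings of genes and proteins."""
    rng = random.Random(seed)
    genes = sorted(mrna)
    round_means = []
    for _ in range(rounds):
        shuffled = genes[:]
        rng.shuffle(shuffled)
        rhos = [spearman_no_ties(mrna[g], protein[p]) for g, p in zip(genes, shuffled)]
        round_means.append(mean(rhos))
    return mean(round_means)


# Hypothetical profiles for three genes across five samples.
mrna = {"G1": [1.0, 2.0, 3.0, 4.0, 5.0], "G2": [5.0, 3.0, 4.0, 1.0, 2.0], "G3": [2.0, 5.0, 1.0, 3.0, 4.0]}
protein = {"G1": [1.1, 1.9, 3.2, 3.8, 5.5], "G2": [4.0, 3.5, 3.0, 1.5, 1.0], "G3": [2.5, 4.5, 1.2, 4.0, 3.0]}
print("matched:", round(matched_mean_rho(mrna, protein), 3))
print("randomized:", round(randomized_mean_rho(mrna, protein), 3))
```

A randomized mean far from zero would suggest a shared trend across genes, for example a sample-wide abundance effect, rather than gene-specific agreement between mRNA and protein.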

Figure 1: Top panel: Spearman's rank correlation between mRNA expression and protein expression of 86 genes in 404 IDC patients. The TAs are highlighted in red. Bottom panel: density distribution of correlation coefficients of mRNA vs protein expression (aqua) and density distribution of correlation coefficients of randomized mRNA vs protein expression (pink). The dashed red lines mark the mean correlation in each distribution.

[Figure 1, top panel axes: genes (x axis) and Spearman's ρ (y axis, scale 0.0 to 0.8). Individual gene labels are not recoverable from the extracted text.]


Regulation of transcription and translation by various mechanisms provides high expression variability, so it is not surprising that two studies of different cancer cell types do not produce similar results. A proteomic approach is a transcript-centric characterization of cancer rather than gene centric, since mRNA expression rarely correlates with protein expression. However, since many clinically useful diagnostic and prognostic biomarkers have been elucidated from mRNA expression data, ideally the approach to detailed cancer characterization should be proteogenomic, and the integration of regulatory factors such as miRNA and methylation patterns should be included.

3.2 Expression of tumor antigens on protein level

Although not paired like the tissue samples used for the analysis of gene expression, the protein expression data offer valuable information for prescreening and filtering of TA genes for further analysis. With the exception of OAS3, BST2, and SCRN1, all TAs were expressed in at least one of the normal tissue samples. Not all TAs were expressed in the IDC tissue, and only a fraction was consistently expressed in all nine patient samples (Figure 4). These results were not unexpected, since different expression patterns can sometimes be observed in different cells of the same tumor [75]. Within the small cohort examined here, 30 of 32 TAs were expressed as proteins in at least two of nine patients. Furthermore, immunotherapies could be defined by targeting multiple TAs at one time. The two TAs, MYO1B and SART3, found not to be expressed in the IDC patient cohort were removed from the list and the remaining 30 TAs were further analyzed.

3.3 Tumor antigen gene expression patterns of normal, preinvasive and invasive IDC tissues

We examined differential expression of TA mRNAs between normal and preinvasive IDC tissue for the TAs expressed as proteins in the IDC tissue. Of the 30 TAs expressed in IDC tissue (excluding MYO1B and SART3), mRNA expression data were measured for 28 TAs (not measured for RPSA and HSPA1B). Five TAs displayed consistent down regulation between the normal and preinvasive tissues, nine TAs displayed up regulation and 16 showed no significant difference between the two types of tissue (p < 0.05). Fold change to median expression of significantly differentially expressed TAs is shown in Figure 2 (left panel). A comparison of TA expression between normal and invasive tumor tissues revealed ten up regulated TAs and two down regulated TAs (Figure 2, right panel).


Figure 4: Ratio of protein expression in IDC tissue to mean expression in normal tissue of 32 measured TAs.

[Figure 4 heat map, titled "mean normal vs. IDC". Columns: mean normal, followed by nine invasive (inv) samples. Rows: FN1, STAT1, MUC1, LGALS3BP, ATIC, EFTUD2, COTL1, ENAH, TPI1, HSPA1B, SNRPD1, NPM1, PGK1, ANXA2, RPS2, RPA1, RPL10A, EEF2, BCAP31, TPM4, RPSA, ACTN4, PRDX5, JUP, PA2G4, PPIB, SCRN1, BST2, CTNNB1, OAS3, SART3, MYO1B.]

The cytoplasmic ribosomal subunits RPS2 and RPL10A were down regulated in both preinvasive and invasive tissues (Figure 2). These patterns, although not consistent with the increased protein synthesis by tumor cells, have been observed previously in breast cancer and colorectal cancer studies [38, 76]. The gene expression of these TAs is typically up regulated in cancer, but it was down regulated in the IDC tissue. Sgroi and colleagues noted that the mechanisms by which ribosomal proteins contribute to tumorigenesis are relatively poorly understood [38]. The down regulation of these particular ribosomal subunits may be indicative of qualitative rearrangement of the ribosomal proteins as a whole, meaning that one should be careful in interpreting these observation points as potentially causal.

Figure 2: The heat map based on log2 transformed fold change from median expression of TA mRNA. Shown in these heat maps are TAs that are significantly differentially expressed between normal tissue and non-invasive tissue (left) and between normal tissue and invasive tissue (right) (p < 0.05). Red corresponds to upregulated and blue corresponds to downregulated transcripts.

Seven TA genes were up regulated in both preinvasive and invasive tissues as compared to the normal tissue. The genes of the STAT protein family encode a series of signal transducers and transcription activators. STAT1 is up regulated in both preinvasive and invasive IDC tissues as compared with the normal tissue. This pattern has been associated with the faster progression from ductal carcinoma in situ to invasive carcinoma, most likely by inducing immunosuppression in the tumor microenvironment [77]. STAT1, which is typically dormant in normal tissue, is also associated with aggressive growth and chemotherapy resistance [78]. Interestingly, there is very little difference between fold changes of STAT1 expression between preinvasive and invasive tissues compared with normal samples, indicating that the effects of STAT1 are already present in the preinvasive tissue. Likewise, the bone marrow stromal cell antigen 2 (BST2) is up regulated in the IDC. The up regulated BST2 gene has been proposed as a biomarker for bone metastasis of breast cancer [79]. Similar patterns have been observed in tamoxifen resistant breast cancer cells, where up regulation of BST2 gene expression correlated with increased invasiveness and metastasis, regulated and activated by STAT3 [80].

[Figure 2 heat maps. Panel "normal vs noninvasive": columns are nine normal and nine noninvasive (ninv) samples; rows are BST2, MUC1, BCAP31, ANXA2, ATIC, PGK1, TPI1, STAT1, ENAH, NPM1, EEF2, RPL10A, RPS2, CTNNB1. Panel "normal vs invasive": columns are nine normal and nine invasive (inv) samples; rows are RPL10A, RPS2, ATIC, BCAP31, STAT1, SNRPD1, PGK1, PA2G4, OAS3, MUC1, BST2, ENAH.]


Increased expression of phosphoglycerate kinase 1 (PGK1) was also observed in both preinvasive and invasive tissues. PGK1 is a major enzyme in glycolysis and facilitates ATP production under hypoxic conditions [81]. The elevated levels of PGK1 have previously been associated with an increased invasiveness of gastric cancer [82]. PGK1 was reported to facilitate the release of the anti-angiogenic enzyme, angiostatin [83]. Mucin 1 (MUC1) is an oncoprotein that is overexpressed in 90% of breast cancer patients and its gene is amplified in 40% of the patients. Overexpression of this protein has been linked to tamoxifen resistance [84]. The activities of MUC1 leading to tamoxifen resistance are many: it contributes to the activation of the PI3K-AKT pathway known to be involved in apoptosis [85]. It activates the MEK/ERK pathway, a regulator of cellular growth and apoptosis in breast cancer [86]. It also activates the Wnt/β-catenin pathway involved in cell proliferation and migration as well as formation and maintenance of cancer stem cells [87] and it activates STAT pathways associated with cellular growth and inflammation [88]. We also found the cytoskeleton regulatory protein, ENAH, to be overexpressed in both preinvasive and invasive tissues. ENAH is non-detectable in normal tissues, but was found to be weakly expressed in the low risk benign lesions, and was overexpressed in the high risk benign breast lesions [89]. ATIC, the product of the purH gene, is involved in the final steps of de novo synthesis of purine [90]. Imbalances in the biosynthesis and metabolism of purine are linked with progression of a number of cancer types [91]. ATIC inhibition has been explored as a therapeutic strategy in breast cancer patients [92], as has the potential of ATIC as a TA [93]. The function of B-cell receptor-associated protein 31 (BCAP31) is relatively unknown. 
It has been proposed to be involved in CASP8-mediated apoptosis, where its cleavage product is a strong inducer of apoptosis. However, caspase resistant types of BCAP31 have been observed and the lack of cleavage during apoptosis leads to reduced apoptotic potential [94]. BCAP31 has previously been reported to be up regulated in the breast cancer tissue as compared to normal tissue [95].


3.3.1 Individual expression profiles

Although there are similarities between TA genes differentially expressed in preinvasive and invasive tissues, there were also discrepancies between patients, even in this small cohort. The analysis of four patients for whom expression was measured in normal, preinvasive, and invasive tissues revealed that there are only a few clear patterns of expression (Figure 3). Only four TA genes, ANXA2, ENAH, ATIC, and STAT1, were consistently up regulated in both IDC tissue types in all four patients. Two TA genes, CTNNB1 and RPL10A, were consistently down regulated in both IDC tissues in all four patients. No complex patterns, i.e. genes up regulated in one IDC tissue type and down regulated in the other, or vice versa, were observed. However, the expression of some TA genes was consistent across tissue types, but not across patients; for example, LGALS3BP is strongly expressed in patient 1, but down regulated in patient 4.
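The per-patient screening described above amounts to checking the sign of each gene's fold change in every patient and tissue type. A minimal sketch follows, assuming the data are held as log2 fold changes relative to the matched normal tissue; the function name `classify_patterns` and all numbers are hypothetical, and only the gene names come from the text.

```python
def classify_patterns(fold_changes):
    """Label each gene as 'up', 'down', or 'mixed' across patients.

    fold_changes maps a gene to one (preinvasive, invasive) pair of log2
    fold changes per patient, each relative to that patient's normal tissue.
    """
    labels = {}
    for gene, pairs in fold_changes.items():
        values = [value for pair in pairs for value in pair]
        if all(v > 0 for v in values):
            labels[gene] = "up"
        elif all(v < 0 for v in values):
            labels[gene] = "down"
        else:
            labels[gene] = "mixed"
    return labels


# Hypothetical log2 fold changes for four patients.
example = {
    "STAT1": [(1.2, 1.4), (0.8, 0.9), (1.5, 1.1), (0.6, 0.7)],
    "CTNNB1": [(-0.5, -0.7), (-0.9, -0.4), (-0.3, -0.6), (-0.8, -1.0)],
    "LGALS3BP": [(1.8, 2.0), (0.2, 0.1), (0.4, 0.3), (-1.1, -0.9)],
}
print(classify_patterns(example))
```

In practice a minimum absolute fold change would usually be required before calling a gene up or down regulated; the sketch omits this to keep the logic visible.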

Figure 3: Heat map based on log2 transformed fold change of gene expression in noninvasive and invasive IDC tissue compared with expression in normal tissue in four individual patients.

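[Figure 3 heat map. Columns: noninvasive (ninv) and invasive (inv) tissue for each of patient 1, patient 2, patient 3, and patient 4. Rows: EFTUD2, TPM4, TPI1, STAT1, SNRPD1, SCRN1, RPS2, RPL10A, RPA1, ATIC, PRDX5, PPIB, JUP, PGK1, PA2G4, OAS3, NPM1, MUC1, LGALS3BP, FN1, ENAH, EEF2, CTNNB1, COTL1, BST2, BCAP31, ANXA2, ACTN4.]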


3.4 Pathway-based assessment of expression patterns

To evaluate the potential biological impact of aberrantly expressed TAs, we examined their involvement in canonical pathways [96]. From the analyses of differential mRNA expression in the four individual expression profiles, it is clear that heterogeneity of mRNA expression exists even when examining a very limited number of proteins. Although only approximately half of the TA genes were significantly differentially expressed in the full cohort, all 30 TAs were aberrantly expressed in at least one of the four individual profiles, and all 30 were expressed at protein level. The functional relationship between the 28 TAs was examined using the STRING database. An interaction confidence threshold of 0.5 yielded four functionally related groups as well as nine non-related TAs (Figure 5).
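One way to derive functionally related groups from thresholded interactions is to treat the TAs as nodes and take the connected components of the resulting graph. The text does not state that STRING groups were obtained this way, so the sketch below is an assumption about the procedure; the function name and the scores are hypothetical.

```python
def functional_groups(nodes, edges, threshold=0.5):
    """Split nodes into connected groups and unconnected singletons.

    Only edges between listed nodes with a score above the threshold are used.
    """
    adjacency = {node: set() for node in nodes}
    for a, b, score in edges:
        if score > threshold and a in adjacency and b in adjacency:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen, groups, singletons = set(), [], []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        if len(component) > 1:
            groups.append(sorted(component))
        else:
            singletons.append(start)
    return groups, singletons


# Hypothetical scores between TAs named in the text.
tas = ["PGK1", "TPI1", "EFTUD2", "SNRPD1", "MUC1"]
scored = [("PGK1", "TPI1", 0.9), ("EFTUD2", "SNRPD1", 0.7), ("MUC1", "PGK1", 0.3)]
print(functional_groups(tas, scored))
```

Raising the threshold splits groups and produces more singletons, so the number of groups reported depends directly on the chosen cut-off.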

Figure 5: Confidence view of protein-protein interactions within the 28 examined TAs, generated using STRING database. Nodes correspond to TAs and edges correspond to functional interactions. Thicker edges signify higher confidence in the interaction. Only interactions with a confidence score higher than 0.5 were included.

[Figure 5 network panels: A) to E).]


The four functional groups were further examined for their involvement in the canonical pathways related to the hallmarks of cancer. One cycle of expansion to include related proteins was applied to each of the four groups to yield four functional modules (Figure 6). Further examination of each of the four functional modules and their direct interactants using MSigDB [64], reveals dysregulation of canonical pathways related to the hallmarks of cancer.
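The overlap step can be pictured as intersecting each module with curated gene sets and ranking the sets by the number of shared members. The sketch below shows only this counting logic; it does not reproduce the MSigDB tool, it omits any significance testing, and the gene set names and memberships are hypothetical placeholders apart from the glycolysis members named later in the text.

```python
def pathway_overlaps(module, gene_sets, min_overlap=2):
    """List gene sets sharing at least min_overlap members with the module.

    Returns (gene set name, shared members) tuples, largest overlap first.
    """
    module = set(module)
    hits = []
    for name, members in gene_sets.items():
        shared = sorted(module & set(members))
        if len(shared) >= min_overlap:
            hits.append((name, shared))
    hits.sort(key=lambda item: (-len(item[1]), item[0]))
    return hits


# Hypothetical module and gene sets for illustration.
module_1 = ["PGK1", "TPI1", "GAPDH", "GAPDHS", "NPM1"]
gene_sets = {
    "GLYCOLYSIS": ["PGK1", "TPI1", "GAPDH", "GAPDHS", "ENO1"],
    "SPLICEOSOME": ["EFTUD2", "SNRPD1", "NPM1"],
}
for name, shared in pathway_overlaps(module_1, gene_sets):
    print(name, shared)
```

A raw count favors large gene sets, which is why enrichment tools normally accompany the overlap with a statistical test against the background gene universe.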

Figure 6. Confidence view of expanded protein-protein interactions within the four functional groups of TAs (highlighted in gray) generated using STRING database. Interacting proteins were added to the TAs using one cycle of expansion. Nodes correspond to the proteins and edges correspond to their functional interactions. The thicker edges signify higher confidence in the interaction. Interactions with a confidence score higher than 0.5 are shown.

[Figure 6 network panels: A) to D).]


3.4.1 Module 1

Expansion of group 1 (Figure 5A) yielded 10 direct functional neighbors at a confidence score of 0.5 (Figure 6A). The resulting PPI network is heavily interconnected, indicating tight functional homology within the module. The proteins of functional module 1 overlap in two canonical pathways: PGK1, TPI1, GAPDH, and GAPDHS overlap in the glycolysis pathway, and expression of twelve of the eighteen proteins is known to be positively correlated with BRCA1 expression in BRCA1mut tumors [97]. Both pathways have been extensively studied as potential therapeutic targets in cancer. Aerobic glycolysis has profound effects on proliferating cells, conferring the capacity to generate ATP, as opposed to differentiated cells, which rely primarily on mitochondrial oxidative phosphorylation for energy production [98]. Abnormal cellular metabolism is a defining feature of cancer [1], and the phenomenon known as the Warburg effect has both diagnostic and therapeutic potential [99]. For example, tumor cells can be monitored using positron emission tomography (PET) imaging, which utilizes a radioactive glucose analog for tracing cells with higher glucose metabolism [100]. Numerous enzymes are involved in maintaining elevated rates of glycolysis, some of which could, in theory, be used as targets of therapeutic agents [101]. Mutations in the BRCA1 gene are associated with faster progression of breast cancer and other cancers [102]. BRCA1 plays a role in DNA repair and genomic stability maintenance. It is a known tumor suppressor, which, when mutated, is linked with the early onset of breast cancer. Network modeling strategies revolving around BRCA1mut revealed a number of genes associated with centrosome dysfunction and thereby increased cancer aggressiveness [97]. Twelve genes in module 1 (six of these are TAs) were involved in the network functionally associated with BRCA1 mutations and centrosome dysfunctions.

3.4.2 Module 2

Module 2 consists of seven TAs (Figure 5B), which, when expanded by one cycle in the STRING database, has ten interaction partners in a highly connected PPI network (Figure 6B). Seven proteins (CTNNB1, APC, GSK3B, AXIN1, LEF1, TCF7L2, PSEN1, CTNNBIP1) overlap in the Wnt pathway, and a subset of these additionally overlaps in the β-catenin pathway. A smaller subgroup consisting of TAs ACTN4, ENAH, and FN1 overlaps in an actin cytoskeleton regulation pathway. Both these pathways are generally regarded to play a role in cancer, and are known to be connected through shared regulators [103]. The Wnt pathway is involved in a number of cancer-relevant cellular processes such as cell division and migration, and cell fate decisions [104]. Elevated levels of β-catenin in the nucleus or cytoplasm of cells indicate activation of the Wnt pathway, which correlates with poor prognosis in breast cancer patients [105], and detection of β-catenin levels using immunohistochemical staining is an important diagnostic tool [106]. Molecules of the Wnt pathway have been extensively targeted in anticancer treatment modalities, involving small interfering molecules, blocking antibodies, and peptide-based therapies [107].

3.4.3 Module 3

Module 3 (Figure 6C) is centered on proteins relating to the immune response, as ten of the proteins in this module overlap in immune system related pathways. The predominant immune system pathways in this module are cytokine signaling (specifically IL-6), interferon signaling, and JAK/STAT signaling. All three pathways have been associated with cancer development. The JAK/STAT pathway is recognized as a modulator of cytokine signaling, and has therefore been associated with a large number of malignancies [108]. In cancer, the JAKs and STATs (particularly STAT3) are often observed to be constitutively activated, which is thought to induce cell proliferation and prevent apoptosis [109]. Additionally, the STATs are known to induce pro-oncogenic inflammation in the tumor microenvironment by promoting pathways such as NF-κB and IL-6-GP130-JAK pathways [110]. A number of NF-κB encoded inflammatory factors, such as IL-6, are activators of STAT3, in turn creating a positive feedback loop that leads to dysregulated action of immune modulating pathways [111]. Another immune deficiency commonly observed in cancer cells is impaired interferon signaling. Interferons are important modulators of immune response, and defects in interferon signaling are highly detrimental to immune control of cancer cells [112]. Cytokine secretion is critically modulated by JAK/STAT activity [113].


3.4.4 Module 4

Finally, module 4 (Figure 6D) consists of the TAs EFTUD2 and SNRPD1, and their ten closest interaction partners. Seven of these proteins overlap in the spliceosome pathway, which plays a central role in pre-mRNA processing and splicing. Splicing is tightly regulated in different tissues and different stages of development, and dysregulation of the spliceosome function may lead to incorrect assembly of exons and nonfunctional translation. As such, alternative splicing plays a significant role in cancer and other malignancies [114]. The spliceosome has therefore been examined for its potential as a therapy target [115]. Spliceosome-related therapy can be directed towards the products of alternative splicing [116] or the spliceosome modulators [117].

3.4.5 Non-interacting TAs

In addition to the four modules, we also observed 11 TAs that do not connect with any other TAs. These are not necessarily less valuable as targets, but the analyses performed here are much less comprehensive for these TAs. Their interactants and involvement in molecular pathways relating to cancer are summarized in Table S1.

4 Conclusion

Statistical testing for patterns in high-throughput mRNA expression data has long been the primary method for elucidating biomarkers in human cancers. The experimental methods are constantly refined with the inclusion of epigenetic experimentation and measurements of ncRNAs and protein expression, and the analyses are expanded with ontology enrichment analyses, pathway analyses, and co-analyses of different data types. Additionally, as experimental methods increase in efficiency and resolution, the bodies of data examined keep growing. A large number of diagnostic and prognostic biomarkers have been reported, and a small number are regularly utilized in clinics, but some believe that the majority of reported biomarker candidates are the result of stochastic noise within data sets [118]. In order to extract meaningful knowledge about the etiology of cancer from these data, it is generally desirable to contextualize data-derived biomarkers in their respective biological pathways, to provide clues to the role of these biomarkers in disease progression. If the role of genes in the diseases for which they serve as markers can be defined, they have the potential to become vastly more useful, as exceptions and variations may more readily be explained and anticipated in clinical models.

TAs are a group of proteins against which the immune system has been recorded to react. Specifically, they are recognized in cells where they are present in larger than usual amounts, or physicochemically altered to a degree at which they no longer resemble native human proteins. As such, their presence or abundance in cancer cells is often unique and their roles and functions are, in many cases, studied extensively. Proteins that are frequently observed (and autologously recognized by the immune system) in tumor cells can be hypothesized to play a significant role in tumorigenesis. They therefore hold the potential to be highly specific biomarkers for the cancers in which they are observed. The challenges pertaining to the utility of TA biomarkers are similar to those we face when we statistically filter out potential biomarkers from vast amounts of high-throughput genomics data: our understanding of their function and role in cancer must be elevated to a degree where we can account for outliers and exceptions to the general rules we define in clinical models. To achieve this, we analyzed the mRNA and protein expression of 30 TAs in normal tissue versus IDC tissue. We found that all but two TAs were expressed in IDC at the protein level, and a subset of these was aberrantly expressed at the mRNA level. We examined their known and proposed roles in cancer by analyzing the TAs and their closest functional counterparts for overlapping participation in canonical pathways. With this approach, we defined four functional modules of TAs and interactants, which overlapped in canonical pathways. The perturbation of these pathways was readily linked to the hallmarks of cancer by querying relevant literature. A previous study of genetic biomarkers in IDC tissue resulted in the identification of approximately 2,000 differentially expressed genes involved in a large number of biological pathways [37]. Among these pathways were the ones related to the hallmarks of cancer that we elucidated by analyzing only 30 TA genes. Currently, 258 TAs are catalogued and annotated in the TANTIGEN database of TAs. Expanding the analysis performed here to the full set of TAs is highly likely to provide additional insights. RNA sequencing is another desirable follow-up study of TAs found to be expressed in a given cancer tissue, as this may reveal known or novel splice isoforms, mutations, or other genetic aberrations. In addition to the diagnostic and prognostic potential of such a study, a catalogue of expressed TAs and variants in a given tumor can be further analyzed for their potential as therapeutic targets and directly applied in personalized treatment modalities.

5 References

1. Hanahan D, Weinberg R: Hallmarks of cancer: the next generation. Cell 2011, 144:646–74.
2. Sharma S, Kelly TK, Jones PA: Epigenetics in cancer. Carcinogenesis 2010, 31:27–36.
3. Wang T, Niu G, Kortylewski M, Burdelya L, Shain K, Zhang S, Bhattacharya R, Gabrilovich D, Heller R, Coppola D, Dalton W, Jove R, Pardoll D, Yu H: Regulation of the innate and adaptive immune responses by Stat-3 signaling in tumor cells. Nat. Med. 2004, 10:48–54.
4. Huang B, Zhao J, Li H, He K-L, Chen Y, Chen S-H, Mayer L, Unkeless JC, Xiong H: Toll-like receptors on tumor cells facilitate evasion of immune surveillance. Cancer Res. 2005, 65:5009–14.
5. Thomas DA, Massagué J: TGF-beta directly targets cytotoxic T cell functions during tumor evasion of immune surveillance. Cancer Cell 2005, 8:369–80.
6. Zea AH, Rodriguez PC, Atkins MB, Hernandez C, Signoretti S, Zabaleta J, McDermott D, Quiceno D, Youmans A, O’Neill A, Mier J, Ochoa AC: Arginase-producing myeloid suppressor cells in renal cell carcinoma patients: a mechanism of tumor evasion. Cancer Res. 2005, 65:3044–8.
7. Zou W: Regulatory T cells, tumour immunity and immunotherapy. Nat. Rev. Immunol. 2006, 6:295–307.
8. Kim R, Emi M, Tanabe K, Arihiro K: Tumor-driven evolution of immunosuppressive networks during malignant progression. Cancer Res. 2006, 66:5527–36.
9. Cromme FV, Airey J, Heemels MT, Ploegh HL, Keating PJ, Stern PL, Meijer CJ, Walboomers JM: Loss of transporter protein, encoded by the TAP-1 gene, is highly correlated with loss of HLA expression in cervical carcinomas. J. Exp. Med. 1994, 179:335–40.

10. Hicklin DJ, Marincola FM, Ferrone S: HLA class I antigen downregulation in human cancers: T-cell immunotherapy revives an old story. Mol. Med. Today 1999, 5:178–86.
11. Barrallo-Gimeno A, Nieto MA: The Snail genes as inducers of cell movement and survival: implications in development and cancer. Development 2005, 132:3151–61.
12. DeNardo DG, Andreu P, Coussens LM: Interactions between lymphocytes and myeloid cells regulate pro- versus anti-tumor immunity. Cancer Metastasis Rev. 2010, 29:309–16.
13. Grivennikov SI, Greten FR, Karin M: Immunity, inflammation, and cancer. Cell 2010, 140:883–99.
14. Klymkowsky MW, Savagner P: Epithelial-mesenchymal transition: a cancer researcher’s conceptual friend and foe. Am. J. Pathol. 2009, 174:1588–93.
15. Polyak K, Weinberg RA: Transitions between epithelial and mesenchymal states: acquisition of malignant and stem cell traits. Nat. Rev. Cancer 2009, 9:265–73.
16. Yilmaz M, Christofori G: EMT, the cytoskeleton, and cancer cell invasion. Cancer Metastasis Rev. 2009, 28:15–33.
17. Haen SP, Rammensee H-G: The repertoire of human tumor-associated epitopes--identification and selection of antigens and their application in clinical trials. Curr. Opin. Immunol. 2013, 25:277–83.
18. Van den Eynde BJ, van der Bruggen P: T cell defined tumor antigens. Curr. Opin. Immunol. 1997, 9:684–93.
19. Sørensen RB, Berge-Hansen L, Junker N, Hansen CA, Hadrup SR, Schumacher TNM, Svane IM, Becker JC, thor Straten P, Andersen MH: The immune system strikes back: cellular immune responses against indoleamine 2,3-dioxygenase. PLoS One 2009, 4:e6910.
20. Fisk B, Blevins TL, Wharton JT, Ioannides CG: Identification of an immunodominant peptide of HER-2/neu protooncogene recognized by ovarian tumor-specific cytotoxic T lymphocyte lines. J. Exp. Med. 1995, 181:2109–17.

21. Yeung JT, Hamilton RL, Okada H, Jakacki RI, Pollack IF: Increased expression of tumor-associated antigens in pediatric and adult ependymomas: implication for vaccine therapy. J. Neurooncol. 2013, 111:103–11.
22. Ochsenreither S, Majeti R, Schmitt T, Stirewalt D, Keilholz U, Loeb KR, Wood B, Choi YE, Bleakley M, Warren EH, Hudecek M, Akatsuka Y, Weissman IL, Greenberg PD: Cyclin-A1 represents a new immunogenic targetable antigen expressed in acute myeloid leukemia stem cells with characteristics of a cancer-testis antigen. Blood 2012, 119:5492–501.
23. Jäger E, Chen YT, Drijfhout JW, Karbach J, Ringhoffer M, Jäger D, Arand M, Wada H, Noguchi Y, Stockert E, Old LJ, Knuth A: Simultaneous humoral and cellular immune response against cancer-testis antigen NY-ESO-1: definition of human histocompatibility leukocyte antigen (HLA)-A2-binding peptide epitopes. J. Exp. Med. 1998, 187:265–70.
24. Hulka BS: Overview of biological markers. In Biol. markers Epidemiol. Edited by Hulka BS, Griffith JD, Wilcosky TC. New York: Oxford University Press; 1990:3–15.
25. Ludwig JA, Weinstein JN: Biomarkers in cancer staging, prognosis and treatment selection. Nat. Rev. Cancer 2005, 5:845–56.
26. Brusic V, Marina O, Wu CJ, Reinherz EL: Proteome informatics for cancer research: from molecules to clinic. Proteomics 2007, 7:976–91.
27. Lange PH, McIntire KR, Waldmann TA, Hakala TR, Fraley EE: Serum alpha fetoprotein and human chorionic gonadotropin in the diagnosis and management of nonseminomatous germ-cell testicular cancer. N. Engl. J. Med. 1976, 295:1237–40.
28. Bast RC, Klug TL, St John E, Jenison E, Niloff JM, Lazarus H, Berkowitz RS, Leavitt T, Griffiths CT, Parker L, Zurawski VR, Knapp RC: A radioimmunoassay using a monoclonal antibody to monitor the course of epithelial ovarian cancer. N. Engl. J. Med. 1983, 309:883–7.
29. Slamon DJ, Godolphin W, Jones LA, Holt JA, Wong SG, Keith DE, Levin WJ, Stuart SG, Udove J, Ullrich A: Studies of the HER-2/neu proto-oncogene in human breast and ovarian cancer. Science 1989, 244:707–12.

30. Stamey TA, Yang N, Hay AR, McNeal JE, Freiha FS, Redwine E: Prostate-specific antigen as a serum marker for adenocarcinoma of the prostate. N. Engl. J. Med. 1987, 317:909–16.
31. Foulds L: The experimental study of tumor progression: a review. Cancer Res. 1954, 14:327–39.
32. Reynolds TY, Rockwell S, Glazer PM: Genetic instability induced by the tumor microenvironment. Cancer Res. 1996, 56:5754–7.
33. McDermott U, Downing JR, Stratton MR: Genomics and the continuum of cancer care. N. Engl. J. Med. 2011, 364:340–50.
34. Brooks JD: Translational genomics: the challenge of developing cancer biomarkers. Genome Res. 2012, 22:183–7.
35. Nilsen TW, Graveley BR: Expansion of the eukaryotic proteome by alternative splicing. Nature 2010, 463:457–63.
36. Nowak MA, Boerlijst MC, Cooke J, Smith JM: Evolution of genetic redundancy. Nature 1997, 388:167–71.
37. Ma X-J, Salunga R, Tuggle JT, Gaudet J, Enright E, McQuary P, Payette T, Pistone M, Stecker K, Zhang BM, Zhou Y-X, Varnholt H, Smith B, Gadd M, Chatfield E, Kessler J, Baer TM, Erlander MG, Sgroi DC: Gene expression profiles of human breast cancer progression. Proc. Natl. Acad. Sci. U. S. A. 2003, 100:5974–9.
38. Ma X-J, Dahiya S, Richardson E, Erlander M, Sgroi DC: Gene expression profiling of the tumor microenvironment during breast cancer progression. Breast Cancer Res. 2009, 11:R7.
39. Meng Z, Veenstra TD: Targeted mass spectrometry approaches for protein biomarker verification. J. Proteomics 2011, 74:2650–9.
40. Martelli PL, D’Antonio M, Bonizzoni P, Castrignanò T, D’Erchia AM, D’Onorio De Meo P, Fariselli P, Finelli M, Licciulli F, Mangiulli M, Mignone F, Pavesi G, Picardi E, Rizzi R, Rossi I, Valletti A, Zauli A, Zambelli F, Casadio R, Pesole G: ASPicDB: a database of annotated transcript and protein variants generated by alternative splicing. Nucleic Acids Res. 2011, 39:D80–5.

41. Kamath KS, Vasavada MS, Srivastava S: Proteomic databases and tools to decipher post-translational modifications. J. Proteomics 2011, 75:127–44.
42. Diamandis EP: The failure of protein cancer biomarkers to reach the clinic: why, and what can be done to address the problem? BMC Med. 2012, 10:87.
43. Lynch TJ, Bell DW, Sordella R, Gurubhagavatula S, Okimoto RA, Brannigan BW, Harris PL, Haserlat SM, Supko JG, Haluska FG, Louis DN, Christiani DC, Settleman J, Haber DA: Activating mutations in the epidermal growth factor receptor underlying responsiveness of non-small-cell lung cancer to gefitinib. N. Engl. J. Med. 2004, 350:2129–39.
44. Hegi ME, Diserens A-C, Godard S, Dietrich P-Y, Regli L, Ostermann S, Otten P, Van Melle G, de Tribolet N, Stupp R: Clinical trial substantiates the predictive value of O-6-methylguanine-DNA methyltransferase promoter methylation in glioblastoma patients treated with temozolomide. Clin. Cancer Res. 2004, 10:1871–4.
45. Nielsen KV, Ejlertsen B, Møller S, Jørgensen JT, Knoop A, Knudsen H, Mouridsen HT: The value of TOP2A gene copy number variation as a biomarker in breast cancer: Update of DBCG trial 89D. Acta Oncol. 2008, 47:725–34.
46. Tomlins SA, Rhodes DR, Perner S, Dhanasekaran SM, Mehra R, Sun X-W, Varambally S, Cao X, Tchinda J, Kuefer R, Lee C, Montie JE, Shah RB, Pienta KJ, Rubin MA, Chinnaiyan AM: Recurrent fusion of TMPRSS2 and ETS transcription factor genes in prostate cancer. Science 2005, 310:644–8.
47. Cha S, Imielinski MB, Rejtar T, Richardson EA, Thakur D, Sgroi DC, Karger BL: In situ proteomic analysis of human breast cancer epithelial cells using laser capture microdissection: annotation by protein set enrichment analysis and gene ontology. Mol. Cell. Proteomics 2010, 9:2529–44.
48. Renuse S, Chaerkady R, Pandey A: Proteogenomics. Proteomics 2011, 11:620–30.
49. Greenbaum D, Colangelo C, Williams K, Gerstein M: Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 2003, 4:117.

50. Sigdel TK, Sarwal MM: The proteogenomic path towards biomarker discovery. Pediatr. Transplant. 2008, 12:737–47.
51. Imielinski M, Cha S, Rejtar T, Richardson EA, Karger BL, Sgroi DC: Integrated proteomic, transcriptomic, and biological network analysis of breast carcinoma reveals molecular features of tumorigenesis and clinical relapse. Mol. Cell. Proteomics 2012, 11:M111.014910.
52. Old LJ, Chen YT: New paths in human cancer serology. J. Exp. Med. 1998, 187:1163–7.
53. Van der Bruggen P, Traversari C, Chomez P, Lurquin C, De Plaen E, Van den Eynde B, Knuth A, Boon T: A gene encoding an antigen recognized by cytolytic T lymphocytes on a human melanoma. Science 1991, 254:1643–7.
54. Vigneron N, Stroobant V, Van den Eynde BJ, van der Bruggen P: Database of T cell-defined human tumor antigens: the 2013 update. Cancer Immun. 2013, 13:15.
55. Parmiani G, De Filippo A, Novellino L, Castelli C: Unique human tumor antigens: immunobiology and use in clinical trials. J. Immunol. 2007, 178:1975–9.
56. The Cancer Genome Atlas (TCGA) Research Network: Comprehensive molecular portraits of human breast tumours. Nature 2012, 490:61–70.
57. Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B, Speed TP: Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res. 2003, 31:e15.
58. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 2004, 5:R80.
59. Amberger J, Bocchini CA, Scott AF, Hamosh A: McKusick’s Online Mendelian Inheritance in Man (OMIM). Nucleic Acids Res. 2009, 37:D793–6.

60. The UniProt Consortium: Update on activities at the Universal Protein Resource (UniProt) in 2013. Nucleic Acids Res. 2013, 41:D43–7.
61. Safran M, Dalah I, Alexander J, Rosen N, Iny Stein T, Shmoish M, Nativ N, Bahir I, Doniger T, Krug H, Sirota-Madi A, Olender T, Golan Y, Stelzer G, Harel A, Lancet D: GeneCards Version 3: the human gene integrator. Database (Oxford). 2010, 2010:baq020.
62. Uhlen M, Oksvold P, Fagerberg L, Lundberg E, Jonasson K, Forsberg M, Zwahlen M, Kampf C, Wester K, Hober S, Wernerus H, Björling L, Ponten F: Towards a knowledge-based Human Protein Atlas. Nat. Biotechnol. 2010, 28:1248–50.
63. Franceschini A, Szklarczyk D, Frankild S, Kuhn M, Simonovic M, Roth A, Lin J, Minguez P, Bork P, von Mering C, Jensen LJ: STRING v9.1: protein-protein interaction networks, with increased coverage and integration. Nucleic Acids Res. 2013, 41:D808–15.
64. Subramanian A, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U. S. A. 2005, 102:15545–50.
65. Gry M, Rimini R, Strömberg S, Asplund A, Pontén F, Uhlén M, Nilsson P: Correlations between RNA and protein expression profiles in 23 human cell lines. BMC Genomics 2009, 10:365.
66. Salmena L, Poliseno L, Tay Y, Kats L, Pandolfi PP: A ceRNA hypothesis: the Rosetta Stone of a hidden RNA language? Cell 2011, 146:353–8.
67. Robertson KD: DNA methylation and human disease. Nat. Rev. Genet. 2005, 6:597–610.
68. Skotheim RI, Nees M: Alternative splicing in cancer: noise, functional, or systematic? Int. J. Biochem. Cell Biol. 2007, 39:1432–49.
69. Lu P, Vogel C, Wang R, Yao X, Marcotte EM: Absolute protein expression profiling estimates the relative contributions of transcriptional and translational regulation. Nat. Biotechnol. 2007, 25:117–24.

70. Wickner S, Maurizi MR, Gottesman S: Posttranslational quality control: folding, refolding, and degrading proteins. Science 1999, 286:1888–93.
71. Dhanasekaran SM, Barrette TR, Ghosh D, Shah R, Varambally S, Kurachi K, Pienta KJ, Rubin MA, Chinnaiyan AM: Delineation of prognostic biomarkers in prostate cancer. Nature 2001, 412:822–6.
72. Buyse M, Loi S, van’t Veer L, Viale G, Delorenzi M, Glas AM, D’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris A, Bogaerts J, Therasse P, Floore A, Amakrane M, Piette F, Rutgers E, Sotiriou C, Cardoso F, Piccart MJ: Validation and clinical utility of a 70-gene prognostic signature for women with node-negative breast cancer. J. Natl. Cancer Inst. 2006, 98:1183–92.
73. Spentzos D, Levine DA, Ramoni MF, Joseph M, Gu X, Boyd J, Libermann TA, Cannistra SA: Therapeutic cancer vaccines in combination with conventional therapy. J. Clin. Oncol. 2004, 22:4700–10.
74. Verhaak RGW, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, Alexe G, Lawrence M, O’Kelly M, Tamayo P, Weir BA, Gabriel S, Winckler W, Gupta S, Jakkula L, Feiler HS, Hodgson JG, James CD, Sarkaria JN, Brennan C, Kahn A, Spellman PT, Wilson RK, Speed TP, Gray JW, Meyerson M, Getz G, Perou CM, Hayes DN: Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell 2010, 17:98–110.
75. Sottoriva A, Spiteri I, Piccirillo SGM, Touloumis A, Collins VP, Marioni JC, Curtis C, Watts C, Tavaré S: Intratumor heterogeneity in human glioblastoma reflects cancer evolutionary dynamics. Proc. Natl. Acad. Sci. U. S. A. 2013, 110:4009–14.
76. Kasai H, Nadano D, Hidaka E, Higuchi K, Kawakubo M, Sato T-A, Nakayama J: Differential expression of ribosomal proteins in human normal and neoplastic colorectum. J. Histochem. Cytochem. 2003, 51:567–74.
77. Hix LM, Karavitis J, Khan MW, Shi YH, Khazaie K, Zhang M: Tumor STAT1 transcription factor activity enhances breast tumor growth and immune suppression mediated by myeloid-derived suppressor cells. J. Biol. Chem. 2013, 288:11676–88.

78. Khodarev NN, Roizman B, Weichselbaum RR: Molecular pathways: interferon/stat1 pathway: role in the tumor resistance to genotoxic stress and aggressive growth. Clin. Cancer Res. 2012, 18:3015–21.
79. Cai D, Cao J, Li Z, Zheng X, Yao Y, Li W, Yuan Z: Up-regulation of bone marrow stromal protein 2 (BST2) in breast cancer with bone metastasis. BMC Cancer 2009, 9:102.
80. Yi EH, Yoo H, Noh KH, Han S, Lee H, Lee J-K, Won C, Kim B-H, Kim M-H, Cho C-H, Ye S: BST-2 is a potential activator of invasion and migration in tamoxifen-resistant breast cancer cells. Biochem. Biophys. Res. Commun. 2013, 435:685–90.
81. Daly EB, Wind T, Jiang X-M, Sun L, Hogg PJ: Secretion of phosphoglycerate kinase from tumour cells is controlled by oxygen-sensing hydroxylases. Biochim. Biophys. Acta 2004, 1691:17–22.
82. Zieker D, Königsrainer I, Tritschler I, Löffler M, Beckert S, Traub F, Nieselt K, Bühler S, Weller M, Gaedcke J, Taichman RS, Northoff H, Brücher BLDM, Königsrainer A: Phosphoglycerate kinase 1 a promoting enzyme for peritoneal dissemination in gastric cancer. Int. J. Cancer 2010, 126:1513–20.
83. Lay AJ, Jiang XM, Kisker O, Flynn E, Underwood A, Condron R, Hogg PJ: Phosphoglycerate kinase acts in tumour angiogenesis as a disulphide reductase. Nature 2000, 408:869–73.
84. Kharbanda A, Rajabi H, Jin C, Raina D, Kufe D: Oncogenic MUC1-C promotes tamoxifen resistance in human breast cancer. Mol. Cancer Res. 2013, 11:714–23.
85. Osaki M, Oshimura M, Ito H: PI3K-Akt pathway: its functions and alterations in human cancer. Apoptosis 2004, 9:667–76.
86. Saini KS, Loi S, de Azambuja E, Metzger-Filho O, Saini ML, Ignatiadis M, Dancey JE, Piccart-Gebhart MJ: Targeting the PI3K/AKT/mTOR and Raf/MEK/ERK pathways in the treatment of breast cancer. Cancer Treat. Rev. 2013, 39:935–46.
87. Wend P, Holland JD, Ziebold U, Birchmeier W: Wnt signaling in stem and cancer stem cells. Semin. Cell Dev. Biol. 2010, 21:855–63.

88. Khodarev N, Ahmad R, Rajabi H, Pitroda S, Kufe T, McClary C, Joshi MD, MacDermed D, Weichselbaum R, Kufe D: Cooperativity of the MUC1 oncoprotein and STAT1 pathway in poor prognosis human breast cancer. Oncogene 2010, 29:920–9.
89. Di Modugno F, Mottolese M, Di Benedetto A, Conidi A, Novelli F, Perracchio L, Venturo I, Botti C, Jager E, Santoni A, Natali PG, Nisticò P: The cytoskeleton regulatory protein hMena (ENAH) is overexpressed in human benign breast lesions with high risk of transformation and human epidermal growth factor receptor-2-positive/hormonal receptor-negative tumors. Clin. Cancer Res. 2006, 12:1470–8.
90. Rayl EA, Moroson BA, Beardsley GP: The human purH gene product, 5-aminoimidazole-4-carboxamide ribonucleotide formyltransferase/IMP cyclohydrolase. Cloning, sequencing, expression, purification, kinetic analysis, and domain mapping. J. Biol. Chem. 1996, 271:2225–33.
91. Weber G: Enzymes of purine metabolism in cancer. Clin. Biochem. 1983, 16:57–63.
92. Martin M, Spielmann M, Namer M, DuBois A, Unger C, Dodwell D, Vodvarka P, Lind M, Calvert H, Casado A, Zelek L, Lluch A, Carrasco E, Kayitalire L, Zielinski C: Phase II study of pemetrexed in breast cancer patients pretreated with anthracyclines. Ann. Oncol. 2003, 14:1246–52.
93. Shichijo S, Azuma K, Komatsu N, Ito M, Maeda Y, Ishihara Y, Itoh K: Two proliferation-related proteins, TYMS and PGK1, could be new cytotoxic T lymphocyte-directed tumor-associated antigens of HLA-A2+ colon cancer. Clin. Cancer Res. 2004, 10:5828–36.
94. Nguyen M, Breckenridge DG, Ducret A, Shore GC: Caspase-resistant BAP31 inhibits fas-mediated apoptotic membrane fragmentation and release of cytochrome c from mitochondria. Mol. Cell. Biol. 2000, 20:6731–40.
95. Li E, Bestagno M, Burrone O: Molecular Cloning and Characterization of a Transmembrane Surface Antigen in Human Cells. Eur. J. Biochem. 1996, 238:631–8.

96. Lawlor K, Nazarian A, Lacomis L, Tempst P, Villanueva J: Pathway-based biomarker search by high-throughput proteomics profiling of secretomes. J. Proteome Res. 2009, 8:1489–503.
97. Pujana MA, Han J-DJ, Starita LM, Stevens KN, Tewari M, Ahn JS, Rennert G, Moreno V, Kirchhoff T, Gold B, Assmann V, Elshamy WM, Rual J-F, Levine D, Rozek LS, Gelman RS, Gunsalus KC, Greenberg RA, Sobhian B, Bertin N, Venkatesan K, Ayivi-Guedehoussou N, Solé X, Hernández P, Lázaro C, Nathanson KL, Weber BL, Cusick ME, Hill DE, Offit K, Livingston DM, Gruber SB, Parvin JD, Vidal M: Network modeling links breast cancer susceptibility and centrosome dysfunction. Nat. Genet. 2007, 39:1338–49.
98. Vander Heiden MG, Cantley LC, Thompson CB: Understanding the Warburg effect: the metabolic requirements of cell proliferation. Science 2009, 324:1029–33.
99. Scatena R, Bottoni P, Pontoglio A, Giardina B: Revisiting the Warburg effect in cancer cells with proteomics. The emergence of new approaches to diagnosis, prognosis and therapy. Proteomics. Clin. Appl. 2010, 4:143–58.
100. Pauwels EK, Sturm EJ, Bombardieri E, Cleton FJ, Stokkel MP: Positron-emission tomography with [18F]fluorodeoxyglucose. Part I. Biochemical uptake mechanism and its implication for clinical studies. J. Cancer Res. Clin. Oncol. 2000, 126:549–59.
101. Nam SO, Yotsumoto F, Miyata K, Shirasu N, Miyamoto S, Kuroki M: Possible therapeutic targets among the molecules involved in the Warburg effect in tumor cells. Anticancer Res. 2013, 33:2855–60.
102. Futreal PA, Liu Q, Shattuck-Eidens D, Cochran C, Harshman K, Tavtigian S, Bennett LM, Haugen-Strano A, Swensen J, Miki Y: BRCA1 mutations in primary breast and ovarian carcinomas. Science 1994, 266:120–2.
103. Akiyama T, Kawasaki Y: Wnt signalling and the actin cytoskeleton. Oncogene 2006, 25:7538–44.
104. Howe LR, Brown AMC: Wnt signaling and breast cancer. Cancer Biol. Ther. 2004, 3:36–41.

105. Lin SY, Xia W, Wang JC, Kwong KY, Spohn B, Wen Y, Pestell RG, Hung MC: Beta-catenin, a novel prognostic marker for breast cancer: its roles in cyclin D1 expression and cancer progression. Proc. Natl. Acad. Sci. U. S. A. 2000, 97:4262–6.
106. Montgomery E, Folpe AL: The diagnostic value of beta-catenin immunohistochemistry. Adv. Anat. Pathol. 2005, 12:350–6.
107. Anastas JN, Moon RT: WNT signalling pathways as therapeutic targets in cancer. Nat. Rev. Cancer 2013, 13:11–26.
108. Sansone P, Bromberg J: Targeting the interleukin-6/Jak/stat pathway in human malignancies. J. Clin. Oncol. 2012, 30:1005–14.
109. Bowman T, Garcia R, Turkson J, Jove R: STATs in oncogenesis. Oncogene 2000, 19:2474–88.
110. Yu H, Pardoll D, Jove R: STATs in cancer inflammation and immunity: a leading role for STAT3. Nat. Rev. Cancer 2009, 9:798–809.
111. Yu H, Kortylewski M, Pardoll D: Crosstalk between cancer and immune cells: role of STAT3 in the tumour microenvironment. Nat. Rev. Immunol. 2007, 7:41–51.
112. Critchley-Thorne RJ, Simons DL, Yan N, Miyahira AK, Dirbas FM, Johnson DL, Swetter SM, Carlson RW, Fisher GA, Koong A, Holmes S, Lee PP: Impaired interferon signaling is a common immune defect in human cancer. Proc. Natl. Acad. Sci. U. S. A. 2009, 106:9010–5.
113. Darnell JE, Kerr IM, Stark GR: Jak-STAT pathways and transcriptional activation in response to IFNs and other extracellular signaling proteins. Science 1994, 264:1415–21.
114. Wang Z, Burge CB: Splicing regulation: from a parts list of regulatory elements to an integrated splicing code. RNA 2008, 14:802–13.
115. Van Alphen RJ, Wiemer EAC, Burger H, Eskens FALM: The spliceosome as target for anticancer treatment. Br. J. Cancer 2009, 100:228–32.
116. Laverman P, Roosenburg S, Gotthardt M, Park J, Oyen WJG, de Jong M, Hellmich MR, Rutjes FPJT, van Delft FL, Boerman OC: Targeting of a CCK(2) receptor splice variant with (111)In-labelled cholecystokinin-8 (CCK8) and (111)In-labelled minigastrin. Eur. J. Nucl. Med. Mol. Imaging 2008, 35:386–92.
117. Tazi J, Bakkour N, Soret J, Zekri L, Hazra B, Laine W, Baldeyrou B, Lansiaux A, Bailly C: Selective inhibition of topoisomerase I and various steps of spliceosome assembly by diospyrin derivatives. Mol. Pharmacol. 2005, 67:1186–94.
118. Ratain MJ, Glassman RH: Biomarkers in phase I oncology trials: signal, noise, or expensive distraction? Clin. Cancer Res. 2007, 13:6545–8.

Contributions

The following section details the contributions of the authors to each of the six research articles presented here.

Paper I: Bioinformatics in cancer immunotherapy

The study was conceived by LRO and MHA. The cancer-related sections of the manuscript were written by LRO, BC, MSB, MHA, and VB. The workflow, database, and bioinformatics-related sections of the article were written by LRO. Example applications were performed by LRO. The example applications section was written by LRO and VB. The final manuscript was prepared by LRO, BC, MSB, MHA, and VB.

Paper II: Identification of T cell vaccine targets from multiple sequence alignments

The study, concepts, and example applications were conceived by LRO and VB. Example applications were performed by LRO with contributions from FOB. Software integration was performed by LRO, CS, and UJK, supervised by GLZ. The web server was set up by CS and UJK, supervised by OW. The final manuscript was prepared by LRO and VB with contributions from ELR.

Paper III: BlockLogo: visualization of peptide and sequence motif conservation

The study and concepts were conceived by LRO and VB. The example applications were conceived by LRO, JS, CSc, and VB. Example applications were performed by LRO and VB. Software integration was performed by LRO, UJK, CSi, and GLZ. The web server was set up by UJK, CSi, and GLZ. The final manuscript was prepared by LRO and VB with contributions from ELR.

Paper IV: TANTIGEN: a tumor antigens database and analysis platform for vaccine target discovery

The study and concepts were conceived by VB and GLZ. The database was assembled by ST and GLZ, and updated by LRO and GLZ. The manuscript was prepared by LRO, TS, GLZ, HHL, and VB with contributions from ELR.

Paper V: Literature classification for semi-automated updating of biological knowledgebases

The study and concepts were conceived by LRO. Application was performed by LRO with contributions from UJK and OW. The manuscript was prepared by LRO and VB.

Paper VI: Tumor antigens as proteogenomic biomarkers of invasive breast carcinomas

The study and concepts were conceived by VB and LRO. Application of methods was performed by LRO with contributions from BC and OW. The manuscript was prepared by LRO and VB.


Supplementary materials Paper I


Table S1: Annotation of survivin for assessing potential as tumor antigen.

| Information type | Information | Source (ID/accession) |
|---|---|---|
| Protein name | Survivin | GeneCards (GC17P076210) |
| Gene name | BIRC5 | GeneCards (GC17P076210) |
| Full name | Baculoviral IAP Repeat Containing 5 | GeneCards (GC17P076210) |
| Synonyms | API4, EPR-1, IAP4, survivin | GeneCards (GC17P076210) |
| Function | BIRC5 inhibits caspase activity, thereby negatively regulating apoptosis. | OMIM (603352) |
| Role in disease | BIRC5 is selectively overexpressed in many common cancers, supporting proliferation of cancer cells by inhibiting programmed cell death. | OMIM (603352) |
| Localization | Mitochondria, cytoplasm, nucleus, and extracellular space | OMIM (603352) |
| Isoforms | Isoform 1: Canonical sequence; Isoform 2: 74: I → IGPGTVAYACNTSTLGGRGGRITR; Isoform 3: 74-142: IEEHKKHSSG...RAIEQLAAMD → MQRKPTIRRK...SLPVGPLAMS; Isoform 4: 114-142: AKETNNKKKEFEETAKKVRRAIEQLAAMD → ERALLAE; Isoform 5: 105-142: DRERAKNKIAKETNNKKKEFEETAKKVRRAIEQLAAMD → VRETLPPPRSFIR; Isoform 6: 74-142: IEEHKKHSSGCAFLSVKKQFEELTLGEFLKLDRERAKNKIAKETNNKKKEFEETAKKVRRAIEQLAAMD → MRELC; Isoform 7: 74-142: IEEHKKHSSGCAFLSVKKQFEELTLGEFLKLDRERAKNKIAKETNNKKKEFEETAKKVRRAIEQLAAMD → M | UniProt (O15392) |
| Mutations | Pos: 26, P→S; Pos: 70, D→Y; Pos: 82, S→P; Pos: 116, E→K | COSMIC (COSG625) |
| Gene expression, normal tissue | No expression in normal, adult, non-dividing tissues. | UniGene (6280372) |
| Gene expression, tumor tissue | Highly expressed in most human tumors. | UniGene (6280372) |
| Protein expression, normal tissue | Distinct nuclear expression in a fraction of cells in squamous epithelia, lymphoid tissues, gastrointestinal tract, testis and glandular cells of female genitalia. | The Human Protein Atlas |
| Protein expression, tumor tissue | Strong expression observed in some endometrial cancer, liver cancer, lung cancer, skin cancer, stomach cancer, and testis cancer. Moderate expression observed in most cancers, with the exception of prostate cancer and gliomas. | The Human Protein Atlas |
| T cell epitopes | 3 reported CD8+ T cell epitopes; 102 reported HLA class I ligands | TANTIGEN (Ag000002) |


Table S2: Peptides from survivin known or predicted to bind HLA A*11:01, B*15:01, or A*03:01. All peptides in this table are found in all isoforms and mutated types of survivin. Peptides with a predicted binding affinity of less than 50 nM are considered strong binders, and those with less than 500 nM weak binders. All predictions were made using NetMHC 3.4.

| Position | Number of variants on position | Peptide | HLA alleles (predicted binding affinity) | Experimental status | Reference |
|---|---|---|---|---|---|
| 5 | 1 | TLPPAWQPF | B*15:01 (86 nM) | HLA ligand | [1] |
| 15 | 1 | KDHRISTFK | A*03:01 (N/A) | HLA ligand | [1] |
| 37 | 1 | RMAEAGFIH | B*15:01 (462 nM) | HLA ligand | [1] |
| 54 | 1 | LAQCFFCFK | A*11:01 (21 nM) | N/A | - |
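The binder categories used in the caption of Table S2 can be illustrated with a minimal sketch. The 50 nM and 500 nM thresholds come from the caption; the function name `classify_binder`, the "non-binder" label for affinities of 500 nM or more, and the "no prediction" label for N/A entries are assumptions made for illustration and are not part of NetMHC or of the original analysis.

```python
from typing import Optional

STRONG_NM = 50.0   # threshold stated in the Table S2 caption
WEAK_NM = 500.0    # threshold stated in the Table S2 caption


def classify_binder(affinity_nm: Optional[float]) -> str:
    """Label a predicted affinity (nM) using the caption's thresholds.

    The "non-binder" and "no prediction" labels are illustrative assumptions.
    """
    if affinity_nm is None:
        return "no prediction"
    if affinity_nm < STRONG_NM:
        return "strong binder"
    if affinity_nm < WEAK_NM:
        return "weak binder"
    return "non-binder"


# (peptide, allele, predicted affinity in nM) as listed in Table S2
table_s2 = [
    ("TLPPAWQPF", "B*15:01", 86.0),
    ("KDHRISTFK", "A*03:01", None),   # affinity reported as N/A
    ("RMAEAGFIH", "B*15:01", 462.0),
    ("LAQCFFCFK", "A*11:01", 21.0),
]

labels = {pep: classify_binder(aff) for pep, _allele, aff in table_s2}
for pep, allele, _aff in table_s2:
    print(pep, allele, labels[pep])
```

Under these thresholds, only LAQCFFCFK falls in the strong-binder range, while the two B*15:01 ligands are classified as weak binders.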


Figure S1: Protein-protein interaction (PPI) network for BIRC5 and interacting known tumor antigens. Nodes represent proteins and edges correspond to functional interactions. Thicker edges signify higher confidence in the interaction. Image was generated using the STRING database.


Table S3: Annotation of IDO for assessing potential as tumor antigen.

| Information type | Information | Source (ID/accession) |
|---|---|---|
| Protein name | IDO | GeneCards (GC08P039771) |
| Gene name | IDO1 | GeneCards (GC08P039771) |
| Full name | Indoleamine 2,3-dioxygenase 1 | GeneCards (GC08P039771) |
| Synonyms | IDO, INDO, IDO-1 | GeneCards (GC08P039771) |
| Function | Catalyzes degradation of L-tryptophan to N-formylkynurenine. | OMIM (147435) |
| Role in disease | IDO has an antiproliferative effect on a number of cells by tryptophan depletion. Inhibits T cell proliferation and acts as an immunosuppressant. IDO is constitutively expressed in most human tumors. | OMIM (147435) |
| Localization | Intracellular | OMIM (147435) |
| Isoforms | None characterized. | UniProt (P14902) |
| Mutations | Pos: 4, A→T; Pos: 12, S→N; Pos: 58, R→S; Pos: 61, K→E; Pos: 77, R→C; Pos: 109, V→F; Pos: 110, P→T; Pos: 116, K→N; Pos: 155, R→C | COSMIC (COSG5556) |
| Gene expression, normal tissue | Expressed at high levels in the human placenta during fetal development. | UniGene (130856) |
| Gene expression, tumor tissue | Highly expressed in a number of human tumors. | UniGene (130856) |
| Protein expression, normal tissue | High cytoplasmic expression in lymphoid cells. Expressed in a few other cell types but at lower levels. | The Human Protein Atlas |
| Protein expression, tumor tissue | Strong antibody staining in 30% of cancers, including colorectal, ovarian, cervical, endometrial, lung, stomach, and pancreatic cancers. | The Human Protein Atlas |
| T cell epitopes | 3 reported CD8+ T cell epitopes; 1 reported CD4+ T cell epitope | TANTIGEN (Ag004236) |


Table S4: Peptides from IDO predicted to bind HLA A*02:01, A*03:01, A*11:01, A*24:02, B*07:01, B*08:01, or B*15:01. All peptides in this table are either found in all mutated variants of IDO, or are peptides from a position on which all peptides are predicted to bind the same HLA.

| Position | Number of variants on position | Peptide | HLA alleles predicted to bind | Experimental status | Reference |
|---|---|---|---|---|---|
| 3 | 2 | HAMENSWTI | A*02:01 | N/A | - |
| 3 | 2 | HTMENSWTI | A*02:01 | N/A | - |
| 5 | 2 | MENSWTISK | A*03:01, A*11:01 | N/A | - |
| 5 | 2 | MENSWTINK | A*03:01, A*11:01 | N/A | - |
| 34 | 1 | DFYNDWMFI | A*24:02 | N/A | - |
| 35 | 1 | FYNDWMFIA | A*02:01 | N/A | - |
| 38 | 1 | DWMFIAKHL | A*24:02 | N/A | - |
| 41 | 1 | FIAKHLPDL | A*02:01 | N/A | - |
| 63 | 1 | NMLSIDHLT | A*02:01 | N/A | - |
| 66 | 1 | SIDHLTDHK | A*11:01 | N/A | - |
| 81 | 1 | LVLGCITMA | A*02:01 | N/A | - |
| 82 | 1 | VLGCITMAY | A*03:01, B*15:01 | N/A | - |
| 86 | 1 | ITMAYVWGK | A*03:01, A*11:01 | N/A | - |
| 119 | 1 | ELPPILVYA | A*02:01 | N/A | - |
| 124 | 1 | LVYADCVLA | A*02:01 | N/A | - |
| 129 | 1 | CVLANWKKK | A*03:01, A*11:01 | N/A | - |
| 144 | 1 | TYENMDVLF | A*24:02 | N/A | - |
| 164 | 1 | FLVSLLVEI | A*02:01 | T cell epitope | [2] |
| 167 | 1 | SLLVEIAAA | A*02:01 | N/A | - |
| 171 | 1 | EIAAASAIK | A*11:01 | N/A | - |
| 176 | 1 | SAIKVIPTV | A*02:01 | N/A | - |
| 179 | 1 | KVIPTVFKA | A*02:01 | N/A | - |
| 188 | 1 | MQMQERDTL | A*15:01 | N/A | - |
| 189 | 1 | QMQERDTLL | A*02:01 | N/A | - |
| 190 | 1 | MQERDTLLK | A*11:01 | N/A | - |
| 195 | 1 | TLLKALLEI | A*02:01 | N/A | - |
| 199 | 1 | ALLEIASCL | A*02:01 | T cell epitope | [2] |
| 216 | 1 | QIHDHVNPK | A*03:01, A*11:01 | N/A | - |
| 222 | 1 | NPKAFFSVL | B*07:02, B*08:01 | N/A | - |
| 241 | 1 | PQLSDGLVY | B*15:01 | N/A | - |
| 262 | 1 | GSAGQSSVF | B*15:01 | N/A | - |
| 265 | 1 | GQSSVFQCF | B*15:01 | N/A | - |
| 275 | 1 | VLLGIQQTA | A*02:01 | N/A | - |
| 279 | 1 | IQQTAGGGH | B*15:01 | N/A | - |
| 288 | 1 | AAQFLQDMR | A*11:01 | N/A | - |
| 289 | 1 | AQFLQDMRR | A*11:01 | N/A | - |
| 291 | 1 | FLQDMRRYM | A*02:01 | N/A | - |
| 298 | 1 | YMPPAHRNF | B*15:01 | N/A | - |
| 299 | 1 | MPPAHRNFL | B*07:01 | N/A | - |
| 302 | 1 | AHRNFLCSL | B*07:01 | N/A | - |
| 313 | 1 | NPSVREFVL | B*07:01, B*08:01 | N/A | - |
| 315 | 1 | SVREFVLSK | A*03:01, A*11:01 | N/A | - |
| 320 | 1 | VLSKGDAGL | A*02:01 | N/A | - |
| 335 | 1 | CVKALVSLR | A*11:01 | N/A | - |
| 341 | 1 | SLRSYHLQI | A*02:01 | N/A | - |
| 355 | 1 | LIPASQQPK | A*03:01, A*11:01 | N/A | - |
| 358 | 1 | ASQQPKENK | A*11:01 | N/A | - |
| 381 | 1 | GTDLMNFLK | A*03:01, A*11:01 | N/A | - |
| 384 | 1 | LMNFLKTVR | A*03:01 | N/A | - |
| 389 | 1 | KTVRSTTEK | A*03:01, A*11:01 | N/A | - |
| 393 | 1 | STTEKSLLK | A*03:01, A*11:01 | N/A | - |


Supplementary materials Paper II


Figure S1: Visualization of conservation and binding predictions of all DENV blocks to all HLA alleles for which predictions are available. The bars show the minimum number of peptides in a block (Y axis) at a given starting position in the MSA (X axis) required to fulfill the user-defined coverage threshold, yx. The heat map below the bars shows the percentage of peptides in the block predicted to bind to each of the HLA alleles included in these examples. The color of each position in the heat map matrix ranges from blue (predicted binders to the given allele account for 0% of the accumulated conservation in the block) to red (peptides predicted to bind the given allele with an affinity of 500 nM or stronger account for 99% of the conservation in the block). Alleles have been clustered to reflect similarity in binding properties. Results of the clustering are summarized to the right of the heat map.
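The bar heights described in the caption of Figure S1 (the minimum number of peptides in a block needed to reach a coverage threshold) can be sketched as a greedy accumulation over peptide frequencies sorted in descending order. This is a minimal illustration and not the published implementation; the function name `min_peptides_for_coverage` and the example frequencies are hypothetical.

```python
from typing import Iterable, Optional


def min_peptides_for_coverage(frequencies: Iterable[float],
                              threshold: float) -> Optional[int]:
    """Smallest number of peptides whose summed frequency (%) reaches threshold.

    Taking the most frequent peptides first gives the minimum count.
    Returns None when the threshold cannot be reached.
    """
    total = 0.0
    for count, freq in enumerate(sorted(frequencies, reverse=True), start=1):
        total += freq
        if total >= threshold:
            return count
    return None


# Hypothetical peptide frequencies (%) within one block of an MSA
block_frequencies = [15.0, 50.0, 5.0, 30.0]

needed_90 = min_peptides_for_coverage(block_frequencies, 90.0)
needed_99 = min_peptides_for_coverage(block_frequencies, 99.0)
print(needed_90, needed_99)
```

A highly conserved block needs a single peptide to reach the threshold, whereas a variable block needs several, which is what a taller bar in the figure would indicate.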


Figure S2: WebLogo (top) and BlockLogo (bottom) visualizing the conservation and variability in block 1347 of the panFive polyprotein.

[Figure S2 image not reproduced. Top panel (WebLogo): Y axis in bits, 0 to 4. Bottom panel (BlockLogo): Y axis in bits, 0 to 36. X axis in both panels: MSA positions 1347 to 1355.]


Table S1: List of GenBank accession numbers of reference sequences used for annotation of proteome sequences into protein products.

| Species | GenBank accession |
|---|---|
| Dengue virus type 1 | NC_001477 |
| Dengue virus type 2 | NC_001474 |
| Dengue virus type 3 | NC_001475 |
| Dengue virus type 4 | NC_002640 |
| West Nile virus | NC_001563 |
| Yellow fever virus | NC_002031 |
| Japanese encephalitis virus | NC_001437 |
| Tick-borne encephalitis virus | NC_001672 |
| Norovirus | NC_008311 |


Table S2: Details of conservation and variability and predicted HLA binding affinities (nM) of peptides in block 1346 of the panFive polyprotein MSA.

| # | Peptide | Frequency (%) | Accumulated frequency (%) | A*02:01 | A*02:03 | A*11:01 | A*24:02 | B*07:02 | B*08:01 | B*15:01 |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | LAMGIMILK | 30.74 | 30.74 | 19260 | 15331 | 8 | 20356 | 25905 | 23219 | 21395 |
| 2 | LALGMMVLK | 25.76 | 56.50 | 22639 | 20608 | 22 | 22618 | 25827 | 23610 | 23101 |
| 3 | IALGLMALK | 14.03 | 70.52 | 22596 | 21311 | 21 | 21125 | 25178 | 25841 | 23979 |
| 4 | LAMGIMMLK | 8.92 | 79.44 | 20172 | 12521 | 6 | 17585 | 26331 | 21290 | 21155 |
| 5 | LAVAWMILR | 5.17 | 84.62 | 21999 | 16557 | 352 | 27347 | 29339 | 26245 | 21140 |
| 6 | IALGLMTLK | 4.58 | 89.19 | 22986 | 21346 | 25 | 23077 | 26899 | 25806 | 24354 |
| 7 | ISLGLILLK | 3.42 | 92.61 | 22017 | 21513 | 8 | 15330 | 26165 | 26469 | 22417 |
| 8 | AAIAWMIVR | 2.16 | 94.76 | 22426 | 20416 | 100 | 30477 | 25584 | 25379 | 16102 |
| 9 | LAMGALIFR | 0.99 | 95.76 | 18667 | 19613 | 82 | 25768 | 25770 | 22006 | 19711 |
| 10 | LALGMMALK | 0.86 | 96.62 | 23026 | 20368 | 29 | 23398 | 25093 | 23848 | 22845 |
| 11 | VSLCILTIN | 0.76 | 97.38 | 23143 | 18051 | 23641 | 31798 | 27724 | 22837 | 21002 |
| 12 | LSVAWMILR | 0.60 | 97.98 | 22429 | 18085 | 73 | 25033 | 27120 | 27366 | 19713 |
| 13 | IALGIMVLK | 0.53 | 98.51 | 21574 | 20595 | 16 | 18885 | 27541 | 25057 | 23910 |
| 14 | LAIGIMMLK | 0.36 | 98.87 | 21920 | 14536 | 10 | 20384 | 28756 | 24118 | 21275 |
| 15 | LALGMMILK | 0.36 | 99.24 | 22334 | 20595 | 14 | 23596 | 25855 | 24346 | 23577 |
| 16 | VALGLMALK | 0.10 | 99.34 | 23087 | 22825 | 19 | 24167 | 23946 | 26053 | 24075 |
| 17 | VSLCVLTIN | 0.10 | 99.44 | 23274 | 19441 | 22866 | 32167 | 27281 | 23604 | 21209 |
| 18 | LALSMMVLK | 0.07 | 99.50 | 22675 | 20725 | 17 | 22478 | 27612 | 22029 | 22392 |
| 19 | LAMGALILR | 0.07 | 99.57 | 19917 | 20248 | 212 | 24770 | 25451 | 22875 | 21042 |
| 20 | FAMGIMILK | 0.03 | 99.60 | 11135 | 11048 | 10 | 19607 | 25251 | 23027 | 21663 |
| 21 | LSLGLILLK | 0.03 | 99.64 | 22045 | 20448 | 10 | 19111 | 26239 | 25529 | 21129 |
| 22 | AAMAWMIVR | 0.03 | 99.67 | 20985 | 18939 | 26 | 28340 | 23879 | 23127 | 14799 |
| 23 | AAIAWMIIK | 0.03 | 99.70 | 21679 | 17191 | 11 | 27307 | 26989 | 25900 | 19090 |
| 24 | IALGLILLK | 0.03 | 99.73 | 21021 | 20011 | 14 | 17897 | 27141 | 24879 | 23474 |
| 25 | WALGMMVLK | 0.03 | 99.77 | 21509 | 20846 | 129 | 21520 | 24781 | 24553 | 23836 |
| 26 | LALGMMVLR | 0.03 | 99.80 | 23355 | 23431 | 643 | 27203 | 26931 | 23155 | 23472 |
| 27 | IALGLMVLK | 0.03 | 99.83 | 22165 | 21716 | 17 | 20678 | 26110 | 25563 | 24093 |
| 28 | LAAAWMILR | 0.03 | 99.87 | 20535 | 16992 | 205 | 25178 | 28577 | 25247 | 20812 |
| 29 | IALGMMVLK | 0.03 | 99.90 | 22737 | 21372 | 14 | 19170 | 23983 | 24763 | 24008 |
| 30 | LALGIMILK | 0.03 | 99.93 | 20760 | 19716 | 16 | 23575 | 27955 | 24790 | 23533 |
| 31 | LAVAWMVLR | 0.03 | 99.97 | 22666 | 16665 | 748 | 25812 | 29203 | 25733 | 19876 |
| 32 | LAIAWMILR | 0.03 | 100.00 | 18785 | 15159 | 113 | 24687 | 28046 | 25823 | 19598 |
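How the accumulated frequency column and a per-allele summary could be derived from such a table is sketched below for rows 1 to 5 and 11 of Table S2, restricted to the A*02:01 and A*11:01 columns. The 500 nM binder cut-off is an assumption carried over from the figure captions, and the helper name `binder_coverage` is hypothetical. Because the printed frequencies are rounded to two decimals, recomputed running sums can differ from the tabulated values in the last digit.

```python
from itertools import accumulate

BINDER_NM = 500.0  # assumed binder cut-off (nM)

# (peptide, frequency %, A*02:01 nM, A*11:01 nM): rows 1-5 and 11 of Table S2
rows = [
    ("LAMGIMILK", 30.74, 19260, 8),
    ("LALGMMVLK", 25.76, 22639, 22),
    ("IALGLMALK", 14.03, 22596, 21),
    ("LAMGIMMLK", 8.92, 20172, 6),
    ("LAVAWMILR", 5.17, 21999, 352),
    ("VSLCILTIN", 0.76, 23143, 23641),
]


def binder_coverage(table, column):
    """Sum the frequencies (%) of peptides predicted to bind (affinity < cut-off)."""
    return sum(row[1] for row in table if row[column] < BINDER_NM)


# Running sum over the five most frequent peptides
accumulated = list(accumulate(row[1] for row in rows[:5]))

cov_a0201 = binder_coverage(rows, 2)  # no listed peptide is below the cut-off
cov_a1101 = binder_coverage(rows, 3)  # all listed peptides except VSLCILTIN

print([round(x, 2) for x in accumulated])
print(cov_a0201, round(cov_a1101, 2))
```

For the rows shown, the peptides predicted to bind A*11:01 account for about 84.62% of the sequences in the block, while none of the listed peptides is predicted to bind A*02:01.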


Supplementary materials Paper VI


Table S1: Non-interacting tumor antigens, their functional partners, and overlap in canonical pathways.

| Tumor antigen | Interactants | Overlapping pathways | Relation to hallmarks of cancer | Reference |
|---|---|---|---|---|
| PPIB | LEPRE1, CRTAP, SERPINH1, AJAP1, BSG, SDC1, PRL, STAT5A, CAMLG, IRF3 | Collagen formation | Associated with tumor invasiveness in breast cancer | [3] |
| TPM4 | VIM, TNNT2, TMOD1, MYL2, TNNI2, TRIP6, VCL, MYL6, CALD1, MYH11 | Muscle contraction | Pathway not related. TPM4 has been associated with metastasis in breast cancer. | [4] |
| ANXA2 | DYSF, TRPV5, S100A10, PLG, PLAT, CEACAM1, SRC, CDC42, S100A4, S100A6 | No significant overlap in canonical pathways | ANXA2 has been associated with enhanced invasiveness of multidrug-resistant human breast cancer cells | [5] |
| BST2 | ISG15, IFI35, ARHGAP44, RNF115, APOBEC3G, KITLG, CST2, PLVAP, CAMLG, AGTRAP | No significant overlap in canonical pathways | BST2 has been associated with an interferon response pathway believed to be associated with inflammation in breast cancer | [6] |
| COTL1 | ACTA1, LTA4H, ALOX5, MMP2, CXCL2, ALOXE3, FXYD5 | No significant overlap in canonical pathways | Genes up-regulated in invasive ductal carcinoma (IDC) relative to ductal carcinoma in situ (DCIS, non-invasive) | [7] |
| RPA1 | TP53, BLM, RAD52, MCM2, MCM6, ORC6L, PCNA, RPA2, RPA3, RPA4 | Cell cycle, DNA replication | Associated with uncontrolled growth and metastasis of, for example, primary malignant melanoma | [8] |
| PRDX5 | GLRX, MSRA, PRDX6, CAT, GSR, TXN, GPX4, PRDX5, TXN2, SOD2, PRDX1 | Nucleotide metabolism, peroxisome | Proliferation and metastasis of breast cancer | [9] |
| SCRN1 | HGS | - | - | - |
| BCAP31 | CASP8, ABCD1, CANX, TFRC, VAMP3, VCP, SEC61A1, SSR2, SSR3, SSR4 | Metabolism of proteins, translation, apoptosis | | [10] |


References

1. Bachinsky MM, Guillen DE, Patel SR, Singleton J, Chen C, Soltis DA, Tussey LG: Mapping and binding analysis of peptides derived from the tumor-associated antigen survivin for eight HLA alleles. Cancer Immun. 2005, 5:6.

2. Sørensen RB, Berge-Hansen L, Junker N, Hansen CA, Hadrup SR, Schumacher TNM, Svane IM, Becker JC, thor Straten P, Andersen MH: The immune system strikes back: cellular immune responses against indoleamine 2,3-dioxygenase. PLoS One 2009, 4:e6910.

3. Kauppila S, Stenbäck F, Risteli J, Jukkola A, Risteli L: Aberrant type I and type III collagen gene expression in human breast cancer in vivo. J. Pathol. 1998, 186:262–8.

4. Li D-Q, Wang L, Fei F, Hou Y-F, Luo J-M, Zeng R, Wu J, Lu J-S, Di G-H, Ou Z-L, Xia Q-C, Shen Z-Z, Shao Z-M: Identification of breast cancer metastasis-associated proteins in an isogenic tumor metastasis model using two-dimensional gel electrophoresis and liquid chromatography-ion trap-mass spectrometry. Proteomics 2006, 6:3352–68.

5. Zhang F, Zhang L, Zhang B, Wei X, Yang Y, Qi RZ, Ying G, Zhang N, Niu R: Anxa2 plays a critical role in enhanced invasiveness of the multidrug resistant human breast cancer cells. J. Proteome Res. 2009, 8:5041–7.

6. Einav U, Tabach Y, Getz G, Yitzhaky A, Ozbek U, Amariglio N, Izraeli S, Rechavi G, Domany E: Gene expression analysis reveals a strong signature of an interferon-induced pathway in childhood lymphoblastic leukemia as well as in breast and ovarian cancer. Oncogene 2005, 24:6367–75.

7. Schuetz CS, Bonin M, Clare SE, Nieselt K, Sotlar K, Walter M, Fehm T, Solomayer E, Riess O, Wallwiener D, Kurek R, Neubauer HJ: Progression-specific genes identified by expression profiling of matched ductal carcinomas in situ and invasive breast tumors, combining laser capture microdissection and oligonucleotide microarray analysis. Cancer Res. 2006, 66:5278–86.

8. Kauffmann A, Rosselli F, Lazar V, Winnepenninckx V, Mansuet-Lupo A, Dessen P, van den Oord JJ, Spatz A, Sarasin A: High expression of DNA repair pathways is associated with metastasis in melanoma patients. Oncogene 2008, 27:565–73.

9. Chang X-Z, Li D-Q, Hou Y-F, Wu J, Lu J-S, Di G-H, Jin W, Ou Z-L, Shen Z-Z, Shao Z-M: Identification of the functional role of peroxiredoxin 6 in the progression of breast cancer. Breast Cancer Res. 2007, 9:R76.

10. Chandra D, Choy G, Deng X, Bhatia B, Daniel P, Tang DG: Association of active caspase 8 with the mitochondrial membrane during apoptosis: potential roles in cleaving BAP31 and caspase 3 and mediating mitochondrion-endoplasmic reticulum cross talk in etoposide-induced cell death. Mol. Cell. Biol. 2004, 24:6592–607.