ISB Cancer Genomics Cloud Documentation Release 1.0.0 the ISB-CGC team March 03, 2016

Welcome to the ISB-CGC Documentation on Read the Docs

Here you will find information describing the features of the ISB-CGC platform, tips on how to use it, and details about the data that we are hosting on the Google Cloud Platform.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, and to algorithm developers who wish to spin up thousands of virtual machines to analyze hundreds of terabytes of sequence data.

This documentation is a work-in-progress; please let us know how we can improve it: feedback@isb-cgc.org

– the ISB-CGC team

CHAPTER 1

Contents

1.1 About the ISB Cancer Genomics Cloud

The ISB-CGC provides interactive and programmatic access to the TCGA data, leveraging many aspects of the Google Cloud Platform including BigQuery, Compute Engine, App Engine, Cloud Datalab, and Google Genomics. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and genomic reference and platform-annotation sources, is stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data is available to dbGaP-authorized users in the original BAM and FASTQ file formats.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to rapidly analyze hundreds of terabytes of sequence data. The ISB-CGC allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2 Cloud-Hosted Data Sets

The ISB-CGC platform hosts the majority of the TCGA data set, as well as other reference and annotation datasets, each in an appropriate Google Cloud technology.

1.2.1 About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data being hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

• TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.

• CGHub: the Cancer Genomics Hub is NCI's current secure data repository for all TCGA BAM and FASTQ sequence data files.

The ISB-CGC platform is one of NCI's Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level ("Level 1") data has been processed through a set of standardized pipelines, and the resulting high-level ("Level 3") data is the data used in most downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset including:

• DNA sequencing of tumor samples and matched-normals (typically blood samples) in order to detect somatic mutations

• SNP-array-based DNA copy-number and genotyping analysis of tumor samples and matched-normals

• DNA methylation of tumor samples

• messenger RNA (mRNA) expression analysis of the tumor samples to capture the gene expression profile

• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse-phase protein array).

Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data, there are typically three levels of data. Level 1 typically represents raw, un-normalized data. Level 2 typically represents an intermediate level of processing and/or normalization of the data. Level 3 typically represents aggregated, normalized, and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as "Level 4" data. More information about Data Level Classification can be found on the NCI wiki.

Understanding the TCGA Data Platforms

When working with any of the data types, it is important to be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained using first the Illumina HumanMethylation27 platform and later the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.

TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these nice dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis

Understanding Data Access

• Public Data: Sometimes the word "public" is misinterpreted as meaning "open". All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.

Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project

• Submit a data access request (DAR); the DAR requires NIH log-in

Note: Requesters and institutional signing officials (SOs) must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions.) See also: dbGaP Data Access Request Portal.

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution (see the Sample DUC form)

• Protect data confidentiality. (Any data used in a study which was initially listed as "Controlled" or TCGA remains controlled data and MUST be protected accordingly unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example: Project 1 has only users authorized to access controlled data and a DUC in place; Project 2 has only open-access users, so NO controlled-data access is allowed from/by Project 2 member(s).)

Remember – YOU and YOUR Institution are accountable for ensuring the security of this data, not the cloud service provider. Securing and protecting controlled data should be thought of in the same manner as for any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: "The Cloud, dbGaP and the NIH" blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).

1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC, and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our GitHub repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC)

• MAF files: these "protected" mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types)

• DNA-seq BAM files: these Level-1 data files were provided by CGHub

– over 37,000 of these files are available in Google Cloud Storage (GCS)

– roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

– BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub

– over 13,000 mRNA-seq BAM files are available in GCS

– over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more "cloud-aware" approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC there may be multiple revisions of an archive. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.

Note that all "bio" archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file

• the partial GCS object path, and

• the size of the file in bytes

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha. These tables can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
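
The value-level rules in the notes above can be illustrated with a minimal Python sketch. This is not the actual ISB-CGC ETL code: the function name clean_value and the constant MISSING_STRINGS are hypothetical, and treating an empty string as the blank case is an assumption.

```python
# Hypothetical helper; the set of missing-value strings follows the notes above,
# with an empty string standing in for a blank value (an assumption).
MISSING_STRINGS = {"none", "None", "NONE", "null", "Null", "NULL", "NA", "__UNKNOWN__", ""}


def clean_value(raw):
    """Return None for missing-value strings, a number for numeric text, else the cleaned string."""
    # Drop CTRL-M (carriage return) characters and surrounding whitespace.
    text = raw.replace("\r", "").strip()
    # Remove one pair of enclosing quotes; quotation marks inside the string are kept.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    if text in MISSING_STRINGS:
        return None
    # Comma separators such as "1,234" are not preserved.
    candidate = text.replace(",", "")
    try:
        return int(candidate)
    except ValueError:
        pass
    try:
        # float() also accepts scientific notation such as "1.2E+03".
        return float(candidate)
    except ValueError:
        return text
```

For example, clean_value('"1,234"') yields the integer 1234 and clean_value("NULL") yields None, while a value such as it's is returned unchanged because the quotation mark sits inside the string.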

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (a.k.a. patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Clinical Feature Selection

In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (a.k.a. tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing

Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.
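
The merge step described above might look roughly like the following Python sketch. The data layout (a dict of patient elements plus a list of follow-up blocks, each with a sequence number and an elements dict) and the function name are hypothetical, and skipping empty follow-up values is an assumption rather than a documented rule.

```python
def merge_patient_and_followup(patient_elements, followup_blocks):
    """Merge patient elements with the most recent follow-up block (hypothetical layout).

    patient_elements: dict mapping XML tag name to value
    followup_blocks:  list of dicts like {"sequence": 2, "elements": {...}}
    """
    merged = dict(patient_elements)
    if not followup_blocks:
        return merged
    # Keep only the most recent follow-up block, based on its sequence number.
    latest = max(followup_blocks, key=lambda block: block["sequence"])
    for tag, value in latest["elements"].items():
        # Follow-up values win over patient values; empty follow-up values are
        # ignored here (an assumption, the source does not specify this).
        if value is not None:
            merged[tag] = value
    return merged
```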

Transforms

Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive

– days_to_last_known_alive should not be NULL

– days_to_last_known_alive is set to days_to_last_followup

– days_to_death is set to NULL

• for all patients with vital_status=Dead

– days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to "Alive")

– days_to_last_known_alive is set to days_to_death

– days_to_last_followup is set to NULL
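
As an illustration only, the survival rules above can be written as a small Python function operating on one clinical record held in a dict. The function name is hypothetical, and the sketch assumes that a record re-labelled from "Dead" to "Alive" is then handled by the Alive rules, which the text does not state explicitly.

```python
def apply_survival_rules(record):
    """Return a copy of a clinical record (dict) with the survival fields harmonized."""
    rec = dict(record)
    status = rec.get("vital_status")
    if (status == "Dead"
            and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        # No death date but a follow-up date exists: vital_status is set to "Alive".
        status = rec["vital_status"] = "Alive"
    if status == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif status == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```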

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
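
A minimal sketch of the BMI calculation follows. It assumes height is recorded in centimetres and weight in kilograms; the units, the function name, and the handling of non-positive heights are assumptions, since the source only says that BMI was computed when both values were present.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI (kg / m^2), or None when an input is missing or height is not positive."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)
```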

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
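
Because a sample barcode extends the 12-character participant barcode, rows in the Biospecimen and Clinical tables can be related by simple prefix slicing. The helper below is a hypothetical sketch of that idea, not part of the ISB-CGC code; it also accepts longer (e.g. aliquot-level) barcodes, which share the same prefixes.

```python
def participant_and_sample(barcode):
    """Return (participant_barcode, sample_barcode) for a sample-level or longer TCGA barcode."""
    if not barcode.startswith("TCGA-") or len(barcode) < 16:
        raise ValueError(f"expected a TCGA barcode of at least 16 characters: {barcode!r}")
    # First 12 characters identify the participant, first 16 the sample.
    return barcode[:12], barcode[:16]
```

For the example above, participant_and_sample("TCGA-2G-AAM4-10A") returns ("TCGA-2G-AAM4", "TCGA-2G-AAM4-10A").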

XML Parsing

The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; it was then annotated using Oncotator, and then further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
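
The cleaning rules above can be sketched as a simple row filter in Python. This is an illustration under stated assumptions, not the pipeline's code: rows are assumed to be dicts keyed by column label, only the two renames named in the text are included, and an Illumina platform is detected by a case-insensitive substring match, which is a simplification.

```python
ALLOWED_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y"}
# Only the two renames named in the text; the full mapping is not given in the source.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Yield relabelled rows (dicts) that pass the four filters described above."""
    for row in rows:
        row = {RENAMES.get(key, key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in ALLOWED_CHROMS:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        # Assumption: any Sequencer string mentioning "illumina" counts as an Illumina platform.
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        yield row
```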

Oncotator Annotation

Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal

Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring first

– 01, 08, or 14 (all of which refer to broad.mit.edu)

– 09, 21, or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8,435 tumor samples from 8,373 patients.
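
One way to picture the preference rules above is as a ranking over competing calls for the same tumor sample and position. The sketch below is a hedged illustration, not the ISB-CGC implementation: the dict keys (tumor_aliquot_barcode, normal_type, chr, start) are hypothetical, the grouping key is simplified to sample plus position, and the barcode offsets are read as zero-based Python slices of a 28-character aliquot barcode.

```python
# Center codes in order of preference, as listed in the rules above.
CENTER_GROUPS = [
    ("01", "08", "14"),  # broad.mit.edu
    ("09", "21", "30"),  # genome.wustl.edu
    ("10", "12"),        # hgsc.bcm.edu
    ("13", "31"),        # bcgsc.ca
    ("18", "25"),        # ucsc.edu
]
CENTER_RANK = {code: rank for rank, group in enumerate(CENTER_GROUPS) for code in group}


def rank(call):
    """Smaller tuples are preferred: normal type, then analyte, then center."""
    barcode = call["tumor_aliquot_barcode"]
    normal_rank = 0 if call["normal_type"] == "blood derived normal" else 1
    analyte_rank = 0 if barcode[19] == "D" else 1
    center_rank = CENTER_RANK.get(barcode[26:28], len(CENTER_GROUPS))
    return (normal_rank, analyte_rank, center_rank)


def is_preferred(candidate, current):
    """True if candidate should replace current for the same sample and position."""
    if rank(candidate) != rank(current):
        return rank(candidate) < rank(current)
    # Final tie-break: the larger 4-character sequence at barcode[21:25] wins.
    return candidate["tumor_aliquot_barcode"][21:25] > current["tumor_aliquot_barcode"][21:25]


def remove_duplicate_calls(calls):
    """Keep one call per (tumor sample, chromosome, start position)."""
    best = {}
    for call in calls:
        key = (call["tumor_aliquot_barcode"][:16], call["chr"], call["start"])
        if key not in best or is_preferred(call, best[key]):
            best[key] = call
    return list(best.values())
```

The tie-break compares the four characters as strings, which gives the same ordering as a numeric comparison for fixed-width digit sequences and avoids failing on non-numeric values.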

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Platform

DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline

DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential scientific notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
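
The file selection and number handling described under ETL Details could be sketched as follows. The helper names are hypothetical and the file suffix is written with the dots that file names of this kind normally carry; treat both as assumptions rather than the pipeline's actual code.

```python
HG19_NOCNV_SUFFIX = "nocnv_hg19.seg.txt"


def is_selected_segment_file(file_name):
    """True for the hg19 'nocnv' segmentation files that populate the table."""
    return file_name.endswith(HG19_NOCNV_SUFFIX)


def parse_segment_fields(num_probes_text, segment_mean_text):
    """Interpret num_probes as an integer and segment_mean as a float.

    Going through float() first lets values written in scientific notation,
    such as "1.2E+03", be read as integers.
    """
    num_probes = int(float(num_probes_text))
    segment_mean = float(segment_mean_text)
    return num_probes, segment_mean
```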

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform

DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline

DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

The BigQuery table is populated only with the files matching the pattern HumanMethylation*.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
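
A rough Python sketch of the merge described above is shown here. It is illustrative only: rows are assumed to be (aliquot_barcode, probe_id, beta_value) tuples, a sample is identified by the first 16 characters of the aliquot barcode, and the platform labels written to the output are hypothetical.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge 27k and 450k rows, giving 450k precedence for samples run on both platforms.

    Each input row is a tuple (aliquot_barcode, probe_id, beta_value_text).
    Returns tuples (aliquot_barcode, probe_id, beta_value, platform).
    """
    def usable(rows):
        # Probes whose Beta_Value is NA or NULL are left out.
        return [row for row in rows if row[2] not in (None, "NA", "NULL")]

    rows_450k = list(rows_450k)
    # Assumption: a sample is identified by the first 16 barcode characters.
    samples_450k = {barcode[:16] for barcode, _, _ in rows_450k}

    merged = [(b, p, float(v), "HumanMethylation450") for b, p, v in usable(rows_450k)]
    merged += [
        (b, p, float(v), "HumanMethylation27")
        for b, p, v in usable(rows_27k)
        if b[:16] not in samples_450k
    ]
    return merged
```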

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
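
The column split described above can be sketched in a few lines of Python. The function name and the lookup dictionary hgnc_by_gene_id are hypothetical stand-ins; how the HGNC lookup was actually performed is not described in the source.

```python
def split_rsem_gene_id(raw_gene_id, hgnc_by_gene_id):
    """Split a raw value such as 'TP53|7157' into the three stored columns.

    hgnc_by_gene_id is a hypothetical dict mapping gene_id to the current
    HGNC approved symbol; missing entries yield None.
    """
    original_symbol, gene_id = raw_gene_id.split("|")
    return {
        "original_gene_symbol": original_symbol,
        "gene_id": gene_id,
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }
```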

BCGSC RNAseq Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply "?", that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
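
The two- or three-part split can be illustrated with the following sketch. It is an assumption-laden example rather than the pipeline's code: the function name is hypothetical, a "?" or empty component is mapped to None as the placeholder case, and any suffix such as _calculated is left untouched because the text does not say how it was handled.

```python
# Assumption: "?" and empty components are the placeholders mapped to NULL (None).
PLACEHOLDERS = {"?", ""}


def split_bcgsc_gene(raw_gene):
    """Split a gene string with two or three '|'-separated parts into columns."""
    parts = [None if part in PLACEHOLDERS else part for part in raw_gene.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected two or three components: {raw_gene!r}")
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2] if len(parts) == 3 else None,
    }
```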

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL, this string is split into two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
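The split-and-melt step can be illustrated with plain Python. This is a sketch under stated assumptions, not the pipeline itself: the matrix is represented as a nested dictionary, the separator between gene name and accession is assumed to be a ".", and the output column names (mirna_id, mirna_accession, AliquotBarcode, value) are hypothetical.

```python
def melt_mirna_matrix(matrix, separator="."):
    """Melt {row_label: {aliquot: value}} into one record per cell.

    Each row label combines a microRNA gene name and an accession;
    it is split on the last `separator` (assumed to be '.').
    """
    records = []
    for label, values in matrix.items():
        mirna_id, accession = label.rsplit(separator, 1)
        for aliquot, value in values.items():
            records.append({
                "mirna_id": mirna_id,
                "mirna_accession": accession,
                "AliquotBarcode": aliquot,
                "value": value,
            })
    return records


# Aliquot names and values are made up for illustration.
matrix = {"hsa-mir-21.MIMAT0000076": {"aliquot_A": 10.0, "aliquot_B": 12.5}}
tidy = melt_mirna_matrix(matrix)
```

Each record in tidy corresponds to one (microRNA, aliquot) pair, which is the shape a flat table needs.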

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Coreprotein_expressiontxt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene, and protein map is explained in detail below.

Generation of Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns.

• If "protein_name" is missing, generate one from "composite_element_ref".

• Make a map of 'composite_element_ref', 'gene_name', and 'protein_name' values.

• Check any other variant of the gene and protein symbols in the table.

• HGNC validation:

    – If the gene symbol is in the HGNC approved symbols ('Approved'), Gene_symbol = Gene_symbol.

    – If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.

    – If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol.

    – If not, Gene_symbol = Gene_symbol.

• The file generated is manually curated and fed back into the algorithm.
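The HGNC validation cascade listed above can be sketched as a small function. This is one reading of those steps, not the curation code itself: the alias rule is taken literally (the alias symbol is kept as the gene symbol), the function and argument names are hypothetical, and the lookup tables below are tiny hand-made examples rather than real HGNC downloads.

```python
def validate_gene_symbol(symbol, approved, aliases, previous_to_approved):
    """Apply the approved / alias / previous / fallback cascade."""
    if symbol in approved:
        return symbol                        # already an approved symbol
    if symbol in aliases:
        return symbol                        # Gene_symbol = Alias_symbol
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # map to the approved symbol
    return symbol                            # unknown: keep unchanged


# Toy lookup tables for illustration only.
approved = {"TP53", "KMT2A"}
aliases = {"p53"}
previous_to_approved = {"MLL": "KMT2A"}

checked = [
    validate_gene_symbol(s, approved, aliases, previous_to_approved)
    for s in ["TP53", "p53", "MLL", "NOT_A_GENE"]
]
```

Symbols that fall through to the last branch are the natural candidates for the manual curation pass mentioned in the final step.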

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with each gene in a row.

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.

• Data from both Illumina GA and HiSeq platforms are stored in the same table.
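The row-duplication rule in the first formatting step can be sketched as follows. This is an illustration only: the delimiter between concatenated gene names is not certain from this text, so the sketch accepts commas and/or whitespace, and the function name and the extra protein_name field are hypothetical.

```python
import re


def expand_gene_rows(row):
    """Return one copy of `row` per gene named in row['gene_name'].

    Gene names are assumed to be separated by commas and/or whitespace.
    """
    genes = [g for g in re.split(r"[,\s]+", row["gene_name"]) if g]
    # dict(row, gene_name=g) copies the row and overrides one field.
    return [dict(row, gene_name=g) for g in genes]


expanded = expand_gene_rows({"gene_name": "AKT1 AKT2 AKT3", "protein_name": "Akt"})
```

All other fields are copied unchanged into each of the resulting rows.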

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

124 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

125 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

    – isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

    – isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

    – includes Annotations table with information about redacted samples, etc.

    – isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

13 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data, in BigQuery tables and in Google Cloud Storage, is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our GitHub repositories, also described below.

131 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

132 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

133 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
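The relationship between the two barcode types is simple enough to show in code. The sketch below is illustrative: the function name is hypothetical, and the sample barcode used in the example is made up by appending a sample suffix to a patient barcode.

```python
def patient_barcode_from_sample(sample_barcode):
    """Return the 12-character patient prefix of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]


# Illustrative (made-up) sample barcode.
patient = patient_barcode_from_sample("TCGA-B9-7268-01A")
```

This only applies to TCGA-style barcodes; other datasets may need their own mapping between samples and patients.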

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
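The same kind of request can be prepared from Python with the standard library. The sketch below only builds the request object and does not send it. The URL is reconstructed from the example above, and the filter value (a single study name) is an assumption about the accepted format; the function name is hypothetical.

```python
import json
import urllib.request

# URL reconstructed from the curl example; treat it as an assumption.
PREVIEW_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying JSON filters."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})
# To send it: urllib.request.urlopen(request), then json.load() the response.
```

Keeping request construction separate from sending makes it easy to inspect the body and headers before any network call is made.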

Access control

To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

    adenocarcinoma_invasion: string
    age_at_initial_pathologic_diagnosis: string
    anatomic_neoplasm_subdivision: string
    avg_percent_lymphocyte_infiltration: float
    avg_percent_monocyte_infiltration: float
    avg_percent_necrosis: float
    avg_percent_neutrophil_infiltration: float
    avg_percent_normal_cells: float
    avg_percent_stromal_cells: float
    avg_percent_tumor_cells: float
    avg_percent_tumor_nuclei: float
    batch_number: integer
    bcr: string
    clinical_M: string
    clinical_N: string
    clinical_stage: string
    clinical_T: string
    colorectal_cancer: string
    country: string
    country_of_procurement: string
    days_to_birth: integer
    days_to_collection: integer
    days_to_death: integer
    days_to_initial_pathologic_diagnosis: integer
    days_to_last_followup: integer
    days_to_submitted_specimen_dx: integer
    Study: string
    ethnicity: string
    frozen_specimen_anatomic_site: string
    gender: string
    height: integer
    histological_type: string
    history_of_colon_polyps: string
    history_of_neoadjuvant_treatment: string
    history_of_prior_malignancy: string
    hpv_calls: string
    hpv_status: string
    icd_10: string
    icd_o_3_histology: string
    icd_o_3_site: string
    lymph_node_examined_count: integer
    lymphatic_invasion: string
    lymphnodes_examined: string
    lymphovascular_invasion_present: string
    max_percent_lymphocyte_infiltration: integer
    max_percent_monocyte_infiltration: integer
    max_percent_necrosis: integer
    max_percent_neutrophil_infiltration: integer
    max_percent_normal_cells: integer
    max_percent_stromal_cells: integer
    max_percent_tumor_cells: integer
    max_percent_tumor_nuclei: integer
    menopause_status: string
    min_percent_lymphocyte_infiltration: integer
    min_percent_monocyte_infiltration: integer
    min_percent_necrosis: integer
    min_percent_neutrophil_infiltration: integer
    min_percent_normal_cells: integer
    min_percent_stromal_cells: integer
    min_percent_tumor_cells: integer
    min_percent_tumor_nuclei: integer
    mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
    mononucleotide_marker_panel_analysis_status: string
    neoplasm_histologic_grade: string
    new_tumor_event_after_initial_treatment: string
    number_of_lymphnodes_examined: integer
    number_of_lymphnodes_positive_by_he: integer
    ParticipantBarcode: string
    pathologic_M: string
    pathologic_N: string
    pathologic_stage: string
    pathologic_T: string
    person_neoplasm_cancer_status: string
    pregnancies: string
    preservation_method: string
    primary_neoplasm_melanoma_dx: string
    primary_therapy_outcome_success: string
    prior_dx: string
    Project: string
    psa_value: float
    race: string
    residual_tumor: string
    SampleBarcode: string
    tobacco_smoking_history: string
    total_number_of_pregnancies: integer
    tumor_tissue_site: string
    tumor_pathology: string
    tumor_type: string
    weiss_venous_invasion: string
    vital_status: string
    weight: integer
    year_of_initial_pathologic_diagnosis: string
    SampleTypeCode: string
    has_Illumina_DNASeq: string
    has_BCGSC_HiSeq_RNASeq: string
    has_UNC_HiSeq_RNASeq: string
    has_BCGSC_GA_RNASeq: string
    has_UNC_GA_RNASeq: string
    has_HiSeq_miRnaSeq: string
    has_GA_miRNASeq: string
    has_RPPA: string
    has_SNP6: string
    has_27k: string
    has_450k: string

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked, and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d...
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec...
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro...
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate...
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo...
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo...
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg...
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance...
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American...
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample (tumor or normal tissue, cell, or blood sample) provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
patient_count string Number of participants in this cohort
patients[] list List of participant barcodes in this cohort
sample_count string Number of samples in this cohort
samples[] list List of sample barcodes in this cohort

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name Value Description
Path parameters
patient_barcode string Barcode of the patient to get information about. Required.
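As a rough illustration of the request described above, the following Python sketch builds the GET URL for patient_details. It assumes that the barcode is sent as a query-string parameter named patient_barcode (the table lists it under path parameters, so this should be checked against the live API), and the helper name patient_details_url is ours, not part of the API.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the GET URL for the patient_details method.

    Assumption: the barcode is passed as a query-string parameter
    named patient_barcode.
    """
    # Participant barcodes are 12 characters long, e.g. TCGA-B9-7268.
    if len(patient_barcode) != 12:
        raise ValueError("expected a 12-character participant barcode")
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


print(patient_details_url("TCGA-B9-7268"))
```

The resulting URL can then be opened with urllib.request.urlopen and the body decoded with json.load; no access token is needed for this method.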

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
aliquots[] list List of barcodes of aliquots taken from this participant
clinical_data nested object The clinical data about the participant
clinical_data.ParticipantBarcode string Participant barcode
clinical_data.Project string Project name, e.g. "TCGA"
clinical_data.Study string Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision string Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number string Groups samples by the batch they were processed in
clinical_data.bcr string Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N string Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage string Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country string Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth string Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis string Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup string Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender string Text designations that identify gender
clinical_data.histological_type string Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps string Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment string Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy string Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls string Results of HPV tests
clinical_data.hpv_status string Current HPV status
clinical_data.icd_10 string The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology string The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site string The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion string A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined string The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present string The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status string Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade string Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment string Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined string The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he string Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M string Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N string The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage string The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T string Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
clinical_data.person_neoplasm_cancer_status string The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies string Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx string Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success string Measure of Success
clinical_data.prior_dx string Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race string The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor string Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history string Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site string Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type string Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status string The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion string The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis string Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] list List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name Value Description
Path parameters
sample_barcode string Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
aliquots[] list List of barcodes of aliquots taken from this participant
biospecimen_data nested object Biospecimen data about the sample
biospecimen_data.ParticipantBarcode string Participant barcode
biospecimen_data.Project string Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode string Sample barcode
biospecimen_data.Study string Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration integer Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration integer Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis integer Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration integer Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells integer Average percent normal cells
biospecimen_data.avg_percent_stromal_cells integer Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells integer Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei integer Average percent tumor nuclei
biospecimen_data.batch_number string Batch number in which the sample was processed
biospecimen_data.bcr string Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection string Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration string Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration string Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis string Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration string Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells string Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells string Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells string Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei string Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration string Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration string Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis string Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration string Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells string Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells string Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells string Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei string Minimum percent tumor nuclei
data_details[] list List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath string Path to file, if it exists
data_details[].DataCenterName string Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType string Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName string Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey string Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded string Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel string Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype string Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference string Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline string A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform string A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name string The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project string The study for which the data was generated, e.g. "TCGA"
data_details[].Repository string A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName string Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode string Sample barcode
data_details[].SecurityProtocol string An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count string Length of data_details list
patient string Participant barcode
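To show how the data_details list might be used once the response has been decoded from JSON, here is a small Python sketch that groups file keys by platform. The function name files_by_platform and the example response fragment are hypothetical; only the field names come from the table above. Treating a missing data_details field as an empty list is a defensive assumption, not documented behaviour.

```python
from collections import defaultdict


def files_by_platform(sample_details):
    """Group DataFileNameKey values by Platform.

    sample_details is a dict shaped like the sample_details response.
    """
    grouped = defaultdict(list)
    # Defensive assumption: a missing data_details field means "no files".
    for entry in sample_details.get("data_details", []):
        platform = entry.get("Platform", "unknown")
        grouped[platform].append(entry.get("DataFileNameKey"))
    return dict(grouped)


# Hypothetical response fragment, for illustration only.
example = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "2",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileNameKey": "example/path/a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileNameKey": "example/path/b.txt"},
    ],
}

# data_details_count is documented as a string, so convert before comparing.
file_count = int(example["data_details_count"])
print(file_count, files_by_platform(example))
```

Note that count-style fields in these responses are typed as strings, so they need an explicit int() conversion before any arithmetic or comparison.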


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name Value Description
Path parameters
sample_barcode string Required. Barcode of the sample to get file paths for
platform string Optional. Filter file results by platform
pipeline string Optional. Filter file results by pipeline
token string Optional. Access token to authenticate user

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
count string Integer representing the length of the datafilenamekeys list
datafilenamekeys[] list List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field
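Two details of this response are easy to trip over: the datafilenamekeys field can be absent altogether, and some entries are placeholders rather than usable paths. The Python sketch below handles both. The function name usable_paths and the example bucket names are hypothetical, and the exact form of a placeholder entry is an assumption (the code only checks for the documented "file-path-not-yet-available" text somewhere in the entry).

```python
PLACEHOLDER = "file-path-not-yet-available"


def usable_paths(response):
    """Return only the cloud storage paths that are ready to use."""
    # The field is omitted entirely when no file paths are listed,
    # so fall back to an empty list instead of raising KeyError.
    keys = response.get("datafilenamekeys", [])
    # Skip entries that carry the placeholder text instead of a real path.
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical responses, for illustration only.
with_paths = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/example/path/a.txt",
        "example-bucket/file-path-not-yet-available",
    ],
}
without_paths = {"count": "0"}

print(usable_paths(with_paths))
print(usable_paths(without_paths))
```

An empty result therefore does not by itself distinguish "no files exist" from "only controlled-access files exist and the caller lacks dbGaP authorization".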


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name Value Description
Path parameters
sample_barcode string Required. The sample whose dataset id and readgroupset id will be retrieved

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
count string The number of items returned. Count will be either "0" or "1"
items[] list If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode string The sample barcode passed into the request
items[].GG_dataset_id string The dataset id of the sample
items[].GG_readgroupset_id string The readgroupset id of the sample
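Because a sample may have no Google Genomics ids at all, client code should be prepared for an empty or missing items list. A minimal Python sketch follows; the function name genomics_ids and the example id values are hypothetical. It relies on the items list rather than on count, since the documentation leaves some doubt about where count appears in the response.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    # items holds at most one object; it may be empty or absent.
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical responses, for illustration only.
found = {
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ]
}
missing = {"count": "0"}

print(genomics_ids(found))
print(genomics_ids(missing))
```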


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell, which provides you with command-line access to computing resources hosted on GCP, is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
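If you want to check these properties from a script rather than by eye, one option is to parse the text printed by gcloud config list. The Python sketch below assumes that the output uses an INI-style layout with sections such as [core] and [compute]; the sample text, account, and project name are placeholders, and read_property is a hypothetical helper, not part of the SDK.

```python
import configparser

# Hypothetical `gcloud config list` output; the values are placeholders.
SAMPLE_OUTPUT = """\
[compute]
region = us-central1
zone = us-central1-a
[core]
account = user@example.com
project = my-gcp-project
"""


def read_property(config_text, prop):
    """Look up a property such as "compute/zone" or "project"."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    # Assumption: properties without a section prefix live in [core].
    section, _, key = prop.rpartition("/")
    return parser.get(section or "core", key, fallback=None)


print(read_property(SAMPLE_OUTPUT, "compute/zone"))
print(read_property(SAMPLE_OUTPUT, "project"))
print(read_property(SAMPLE_OUTPUT, "container/cluster"))
```

A property that has never been set simply does not appear in the listing, which is why the helper returns None instead of raising an error.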


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page, you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
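The two steps above (setting a default zone, then creating the instance) can be wrapped in a small script. This is only a sketch: the run helper and the launch_vm function are illustrative names, not part of the Cloud SDK, and by default the script only prints the gcloud commands rather than executing them.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of executing it.
# Set DRY_RUN=0 to run for real (requires an installed, authenticated gcloud).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# launch_vm NAME ZONE MACHINE_TYPE
# Sets the default zone, then creates the instance in that zone.
launch_vm() {
    run gcloud config set compute/zone "$2"
    run gcloud compute instances create "$1" --machine-type "$3"
}

launch_vm my-instance us-central1-a g1-small
```

Running the script as-is shows exactly which commands would be issued, which is a convenient way to review them before setting DRY_RUN=0.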

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk costs only $2 per month, and 1 TB costs $40.
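The two price points quoted above both work out to 4 cents per GB per month. The sketch below turns that into a small calculator; the rate is an assumption derived from those figures (with 1 TB counted as 1000 GB), so check current pricing before relying on it. The function names are illustrative.

```shell
#!/bin/sh
# Monthly cost, in US cents, of a standard persistent disk of the given size
# in GB. The rate of 4 cents per GB per month is an assumption inferred from
# the figures in the text, not an official price.
disk_cost_cents() {
    echo $(( $1 * 4 ))
}

# Same cost formatted as dollars with two decimals.
disk_cost_dollars() {
    cents=$(disk_cost_cents "$1")
    printf '$%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

disk_cost_dollars 50      # a 50 GB disk
disk_cost_dollars 1000    # a 1 TB disk, counted as 1000 GB
```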

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
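A minimal sketch of the stop, restart, and delete commands follows. The run helper and the wrapper function names are hypothetical; the script prints the gcloud commands by default instead of executing them. The --quiet flag skips the interactive confirmation that a delete would otherwise ask for.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of executing it.
# Set DRY_RUN=0 to run for real (requires an installed, authenticated gcloud).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Stop the instance but keep it (and its disks) so it can be restarted later.
stop_vm()   { run gcloud compute instances stop   "$1" --zone "$2"; }
# Restart a previously stopped instance.
start_vm()  { run gcloud compute instances start  "$1" --zone "$2"; }
# Delete the instance for good, without a confirmation prompt.
delete_vm() { run gcloud compute instances delete "$1" --zone "$2" --quiet; }

stop_vm   my-instance us-central1-a
start_vm  my-instance us-central1-a
delete_vm my-instance us-central1-a
```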

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g., the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this (the instance name comes first, and --device-name sets the name under which the disk will appear on the instance):

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point. On the instance, the disk appears under /dev/disk/by-id/ as google- followed by its device name (disk-1 in this example); you can list that directory to confirm the exact path.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool.

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes”, the default, when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
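The gcloud side of the disk lifecycle described in this section (create, attach, detach, delete) can be sketched as a single script. The run helper and the disk_lifecycle function are illustrative names, and the script prints the commands by default rather than executing them. Formatting, mounting, and unmounting happen on the instance itself and are only marked by a comment here.

```shell
#!/bin/sh
# Dry-run helper (hypothetical): prints each command instead of executing it.
# Set DRY_RUN=0 to run for real (requires an installed, authenticated gcloud).
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# disk_lifecycle DISK INSTANCE ZONE
disk_lifecycle() {
    disk=$1; instance=$2; zone=$3
    # 1. Create a 500 GB standard persistent disk.
    run gcloud compute disks create "$disk" --size 500GB --type pd-standard --zone "$zone"
    # 2. Attach it read-write (use --mode ro for read-only).
    run gcloud compute instances attach-disk "$instance" --disk "$disk" --mode rw --zone "$zone"
    # ... format and mount on the instance, do the work, then unmount ...
    # 3. Detach the disk once it has been unmounted.
    run gcloud compute instances detach-disk "$instance" --disk "$disk" --zone "$zone"
    # 4. Delete the disk (only possible when no instance is using it).
    run gcloud compute disks delete "$disk" --zone "$zone" --quiet
}

disk_lifecycle disk-1 my-instance us-central1-a
```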

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
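To make the three set operations concrete, here is a small sketch using standard Unix tools. It is not how the web-app implements them; each cohort is simply modelled as a sorted text file with one sample identifier per line, and the identifiers (S1 to S5) and function names are made up for illustration.

```shell
#!/bin/sh
# Use the same collation for sort and comm.
LC_ALL=C; export LC_ALL

# Two toy cohorts, one (made-up) sample identifier per line, sorted.
workdir=$(mktemp -d)
printf '%s\n' S1 S2 S3 S4 | sort > "$workdir/cohort_a"
printf '%s\n' S3 S4 S5    | sort > "$workdir/cohort_b"

# Union: samples in either cohort; argument order does not matter.
cohort_union()      { sort -u "$1" "$2"; }
# Intersect: samples present in both cohorts (comm needs sorted input).
cohort_intersect()  { comm -12 "$1" "$2"; }
# Complement: samples in the base cohort ($1) that are not in the other ($2).
cohort_complement() { comm -23 "$1" "$2"; }

echo "union:";            cohort_union      "$workdir/cohort_a" "$workdir/cohort_b"
echo "intersect:";        cohort_intersect  "$workdir/cohort_a" "$workdir/cohort_b"
echo "complement (a-b):"; cohort_complement "$workdir/cohort_a" "$workdir/cohort_b"
echo "complement (b-a):"; cohort_complement "$workdir/cohort_b" "$workdir/cohort_a"
```

The last two calls give different results, which is why the complement operation needs a designated base cohort while union and intersect do not.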

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves as it does on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
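The difference between the two counts can be illustrated with TCGA-style barcodes, where the first 12 characters of a sample barcode identify the participant. The sketch below uses made-up barcodes, and the function names are illustrative; it is not how the web-app computes its numbers.

```shell
#!/bin/sh
# Made-up sample barcodes in the TCGA style; the first two belong to the
# same participant (TCGA-AA-0001).
barcodes='TCGA-AA-0001-01A
TCGA-AA-0001-11A
TCGA-AA-0002-01A'

# Number of distinct sample barcodes read from standard input.
count_samples()      { sort -u | wc -l | tr -d ' '; }
# Number of distinct participants: keep only the first 12 characters
# (the participant part of the barcode) before de-duplicating.
count_participants() { cut -c1-12 | sort -u | wc -l | tr -d ' '; }

echo "samples:      $(printf '%s\n' "$barcodes" | count_samples)"
echo "participants: $(printf '%s\n' "$barcodes" | count_participants)"
```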

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

View File List

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected platform filters. The CSV contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
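Once downloaded, a file list like this is easy to filter from the command line. The sketch below assumes a comma-separated file with a header row and the five columns in the order listed above; the column order, the sample rows, the platform names, and the gs:// paths are all invented for illustration and may not match the real download.

```shell
#!/bin/sh
# A tiny stand-in for a downloaded file list (all values are made up).
csv=$(mktemp)
cat > "$csv" <<'EOF'
Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location
TCGA-AA-0001-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/a/file1.txt
TCGA-AA-0002-01A,PlatformB,pipeline2,Level 3,gs://example-bucket/b/file2.txt
TCGA-AA-0003-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/a/file3.txt
EOF

# paths_for_platform CSV PLATFORM
# Prints the storage path (column 5) of every row whose Platform (column 2)
# matches, skipping the header row. Assumes no field contains a comma.
paths_for_platform() {
    awk -F',' -v platform="$2" 'NR > 1 && $2 == platform { print $5 }' "$1"
}

paths_for_platform "$csv" PlatformA
```

The resulting list of paths could then be handed to a Cloud Storage tool to copy the files you are interested in.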

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

    – Single autocomplete textbox: this input searches through the names of all the clinical features available.

• Gene Expression

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Platform Filter: filters down the plot-able features by platforms.

    – Center Filter: filters down the plot-able features by processing center.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

    – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

    – Platform Filter: filters down the plot-able features by platforms.

    – Value Filter: filters down the plot-able features by value.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

    – Platform Filter: filters down the plot-able features by platforms.

    – Gene Region Filter: filters down the plot-able features by specific gene regions.

    – CpG Island Region Filter: filters down the plot-able features by CpG island region.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Value Filter: filters down the plot-able features by value.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share your cohort or visualization.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.


1.4.8 Release Notes

• December 23, 2015: v0.2

– A treemap graph in the cohort details and cohort creation pages will not apply its own filters to itself. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download is not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
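
A script that runs for a long time may want to check whether the 24-hour window is still open before it tries to read controlled data. The following is a minimal sketch of that check. The function name and the idea of recording the verification time yourself are assumptions for illustration; the platform enforces the real limit on its side.

```python
from datetime import datetime, timedelta, timezone

# The FAQ states that access lasts 24 hours after dbGaP verification.
ACCESS_WINDOW = timedelta(hours=24)


def controlled_access_is_current(verified_at, now=None):
    """Return True if fewer than 24 hours have passed since verification.

    `verified_at` is a timezone-aware datetime recorded by the caller
    (hypothetical client-side bookkeeping, not an ISB-CGC API value).
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - verified_at < ACCESS_WINDOW


verified = datetime(2016, 3, 1, 9, 0, tzinfo=timezone.utc)
print(controlled_access_is_current(verified, verified + timedelta(hours=23)))
print(controlled_access_is_current(verified, verified + timedelta(hours=25)))
```

If the check fails, the only remedy is to sign in to the web-app again, since NIH authentication cannot be done programmatically.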


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature-requests or bug-reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g., Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For the programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure, using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful, web-based, interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


1.1 About the ISB Cancer Genomics Cloud

The ISB-CGC provides interactive and programmatic access to the TCGA data, leveraging many aspects of the Google Cloud Platform, including BigQuery, Compute Engine, App Engine, Cloud Datalab, and Google Genomics. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and genomic reference and platform-annotation sources, are stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data is available to dbGaP-authorized users in the original BAM and FASTQ file formats.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to rapidly analyze hundreds of terabytes of sequence data. The ISB-CGC allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.


1.2 Cloud-Hosted Data Sets

The ISB-CGC platform hosts the majority of the TCGA data set, as well as other reference and annotation datasets, in different, appropriate Google Cloud technologies.

1.2.1 About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data being hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

• TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.

• CGHub: the Cancer Genomics Hub is NCI’s current secure data repository for all TCGA BAM and FASTQ sequence data files.


The ISB-CGC platform is one of NCI’s Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level (“Level 1”) data has been processed through a set of standardized pipelines, and the resulting high-level (“Level 3”) data is frequently the data that is used in most downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset including:

• DNA sequencing of tumor samples and matched-normals (typically blood samples) in order to detect somatic mutations

• SNP array based DNA copy-number and genotyping analysis of tumor samples and matched-normals

• DNA methylation of tumor samples

• messenger RNA (mRNA) expression analysis of the tumor samples to capture the gene expression profile

• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse phase protein array).


Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data, there are typically three levels of data. Level 1 typically represents raw, un-normalized data. Level 2 typically represents an intermediate level of processing and/or normalization of the data. Level 3 typically represents aggregated, normalized and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as “Level 4” data. More information about Data Level Classification can be found on the NCI wiki.


Understanding the TCGA Data Platforms

When working with any of the data types, it is important to also be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained using first the Illumina HumanMethylation27 platform and later using the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.


TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these nice dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis


Understanding Data Access

• Public Data: Sometimes the word “public” is misinterpreted as meaning “open”. All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.


Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project.

• Submit a data access request (DAR); the DAR requires NIH log-in.

Note: Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions.) dbGaP Data Access Request Portal

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution. Sample DUC form


• Protect data confidentiality. (Any data used in a study which was initially listed as “Controlled” or TCGA remains controlled data and MUST be protected accordingly, unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

Project 1: has only users authorized to access controlled data, and a DUC in place.

Project 2: has only open-access users. NO Controlled Data Access Allowed from/by Project 2 member(s).)

Remember: YOU and YOUR Institution are accountable for ensuring the security of this data, not the cloud service provider. Securing controlled data and protecting it should be thought of in the same manner as any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: The Cloud, dbGaP and the NIH, blog post, 03/27/2015.)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations; and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).


1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs. controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC;

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC;

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC;

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC;

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC; and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager.

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our GitHub repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out.

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22000 files for both tumor and matched-normal samples

• VCF files: these Level-2 data files were provided by the DCC and include over 15000 files produced by several different centers (primarily Broad and BCGSC)

• MAF files: these “protected” mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types)

• DNA-seq BAM files: these Level-1 data files were provided by CGHub

– over 37000 of these files are available in Google Cloud Storage (GCS)

– roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

– BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub

– over 13000 mRNA-seq BAM files are available in GCS

– over 16000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11000 tar files

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user’s dbGaP authorization will be verified. These two steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g., using the GA4GH API). This will open up new, faster, more “cloud-aware” approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).


TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions of an archive. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g., data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.
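
Comparing archive names by hand is error-prone, so a short script can help. The sketch below is a rough illustration only. It assumes archive names of the form `<center>_<disease>.<platform>.<type>.<serial>.<revision>.<series>`; the text above lists the components but not their exact layout, and both the example names and the function names are hypothetical.

```python
from typing import NamedTuple


class ArchiveName(NamedTuple):
    center: str
    disease: str
    platform: str
    archive_type: str
    serial_index: int
    revision: int


def parse_archive_name(name):
    """Split an archive name that follows the ASSUMED layout
    <center>_<disease>.<platform>.<type>.<serial>.<revision>.<series>."""
    center, _, rest = name.partition("_")
    parts = rest.split(".")
    if not center or len(parts) < 6:
        raise ValueError("unexpected archive name: %r" % name)
    return ArchiveName(
        center=center,
        disease=parts[0],
        platform=".".join(parts[1:-4]),
        archive_type=parts[-4],
        serial_index=int(parts[-3]),
        revision=int(parts[-2]),
    )


def is_newer_revision(candidate, hosted):
    """True if `candidate` is a later revision of the same archive as `hosted`."""
    a, b = parse_archive_name(candidate), parse_archive_name(hosted)
    # The first five fields identify the archive; only the revision may differ.
    return a[:5] == b[:5] and a.revision > b.revision


# Hypothetical names, used only to exercise the functions.
hosted = "unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.1.2.0"
latest = "unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.1.3.0"
print(parse_archive_name(hosted))
print(is_newer_revision(latest, hosted))
```

Check the pattern against real names in the latest archive report before relying on it.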

Note that all “bio” archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes
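
A three-column listing like this is easy to summarize with a few lines of Python. The sketch below assumes the listing is tab-separated with no header row, which the text does not state; the ids and paths are made-up placeholders.

```python
import csv
import io

# Two made-up rows in the three-column layout described above
# (tab-separated is an assumption; ids and paths are placeholders).
SAMPLE_LISTING = (
    "00000000-aaaa-bbbb-cccc-000000000001\t/example/path/file1.bam\t1500000000\n"
    "00000000-aaaa-bbbb-cccc-000000000002\t/example/path/file2.bam\t2500000000\n"
)


def read_listing(handle):
    """Yield (cghub_id, gcs_path, size_in_bytes) tuples from a listing."""
    for cghub_id, gcs_path, size in csv.reader(handle, delimiter="\t"):
        yield cghub_id, gcs_path, int(size)


rows = list(read_listing(io.StringIO(SAMPLE_LISTING)))
total_bytes = sum(size for _, _, size in rows)
print("%d files, %.1f GB" % (len(rows), total_bytes / 1e9))
```

To read a real file, pass an open file handle to `read_listing` in place of the `io.StringIO` object.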

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).


1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha, whose tables can be accessed via the BigQuery web interface (by anyone with an active GCP project).
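
As a starting point for exploring these tables from Python, the sketch below only builds a query string; it does not contact BigQuery. It uses the bracketed `[project:dataset.table]` reference form of BigQuery legacy SQL. The table name `Clinical_data` and the helper name `count_by_column` are assumptions for illustration and should be checked against the dataset in the BigQuery web interface.

```python
# Public dataset: project id and dataset name joined with ':'.
DATASET = "isb-cgc:tcga_201510_alpha"


def count_by_column(table, column):
    """Build a legacy-SQL query that counts rows per value of `column`.

    `table` and `column` are supplied by the caller; the names used in the
    example below are assumptions and should be checked against the schema.
    """
    return (
        "SELECT {col}, COUNT(*) AS n "
        "FROM [{dataset}.{table}] "
        "GROUP BY {col} "
        "ORDER BY n DESC"
    ).format(col=column, dataset=DATASET, table=table)


query = count_by_column("Clinical_data", "vital_status")
print(query)
```

The resulting string can be pasted into the BigQuery web interface or passed to whichever BigQuery client you use.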

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
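
Several of the rules above (missing-value strings, numeric formats, surrounding quotes) can be illustrated with one small function. This is a rough sketch written for this page, not the code used in the actual ETL. The function name, the exact set of missing-value strings, and the case-insensitive matching are assumptions, and it does not handle every edge case.

```python
# Lower-cased forms of the missing-value strings listed above
# (the empty string stands in for a blank field).
MISSING_STRINGS = {"none", "null", "na", "__unknown__", ""}


def normalize_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    text = raw.strip()
    # Remove one pair of matching quotes around the whole value only;
    # quotation marks inside the string are left alone.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    if text.lower() in MISSING_STRINGS:
        return None
    # Comma separators are dropped before trying a numeric conversion.
    candidate = text.replace(",", "")
    try:
        return int(candidate)
    except ValueError:
        pass
    try:
        return float(candidate)
    except ValueError:
        return text


for raw in ["NULL", "1,234", "1.5e3", '"Stage IIA"', 'say "hi" now']:
    print(repr(raw), "->", repr(normalize_value(raw)))
```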

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (aka patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g., TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (aka tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.
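
The 20% rule can be sketched as a simple counting exercise. The example below is an illustration under stated assumptions: the input is a hypothetical in-memory list of (study, completed-feature set) pairs rather than the real XML files, and the study and feature names are invented.

```python
from collections import defaultdict


def select_features(participants, min_fraction=0.20):
    """Return features present in >= min_fraction of participants in any study.

    `participants` is a list of (study, set_of_completed_features) pairs,
    a hypothetical stand-in for the parsed XML files.
    """
    totals = defaultdict(int)
    counts = defaultdict(lambda: defaultdict(int))
    for study, features in participants:
        totals[study] += 1
        for feature in features:
            counts[study][feature] += 1

    selected = set()
    for study, per_feature in counts.items():
        for feature, n in per_feature.items():
            if n / totals[study] >= min_fraction:
                selected.add(feature)
    return selected


# feature_x appears in 1 of 5 participants (20%) of STUDY_A: kept.
# feature_y appears in 1 of 6 participants (about 17%) of STUDY_B: dropped.
example = (
    [("STUDY_A", {"feature_x"})] + [("STUDY_A", set())] * 4
    + [("STUDY_B", {"feature_y"})] + [("STUDY_B", set())] * 5
)
print(select_features(example))
```

A feature only has to pass the threshold in one study to be kept for all studies, which matches the "in any one Study" wording above.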

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms

Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive:

– days_to_last_known_alive should not be NULL

– days_to_last_known_alive is set to days_to_last_followup

– days_to_death is set to NULL

• for all patients with vital_status=Dead:

– days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to “Alive”)

– days_to_last_known_alive is set to days_to_death

– days_to_last_followup is set to NULL
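A minimal sketch of these survival transforms, assuming each participant is held as a Python dict keyed by the field names above. The function name is hypothetical, and applying the “Alive” rules to a record that has just been relabelled from “Dead” is an assumption, since the text does not say what happens next.

```python
def apply_survival_transforms(record):
    """Return a copy of a clinical record (a dict) with the survival fields completed."""
    rec = dict(record)
    # A "Dead" record with no days_to_death but a known follow-up is relabelled "Alive".
    if (rec.get("vital_status") == "Dead"
            and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        rec["vital_status"] = "Alive"
    if rec.get("vital_status") == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif rec.get("vital_status") == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```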

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
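The BMI step can be illustrated with the standard formula, weight divided by height squared. The sketch below assumes height in centimeters and weight in kilograms; the weight unit and the function name are assumptions, not taken from the ISB-CGC pipeline.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI in kg/m^2, or None when either value is missing."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)
```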

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

XML Parsing

The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen_data” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g. [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; it was then annotated using Oncotator, and then further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
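The cleaning steps listed above could look roughly like this when each MAF line is held as a dict. It is a sketch under assumptions: only the two renames named in the text are included, the renaming is done before filtering, and an Illumina platform is recognised by the word “Illumina” appearing in the Sequencer value.

```python
ALLOWED_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}  # only the renames named in the text


def clean_maf_rows(rows):
    """Yield renamed copies of the rows (dicts) that pass all four filters."""
    for row in rows:
        renamed = {RENAMES.get(key, key): value for key, value in row.items()}
        if str(renamed.get("build")) != "37":
            continue
        if str(renamed.get("chr")) not in ALLOWED_CHROMOSOMES:
            continue
        if renamed.get("Mutation_Status") != "Somatic":
            continue
        # Assumption: Illumina sequencers carry "Illumina" in the Sequencer field.
        if "illumina" not in str(renamed.get("Sequencer", "")).lower():
            continue
        yield renamed
```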

Oncotator Annotation

Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal

Many tumor types have several “current” MAF files, and deciding which one is the “best” is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. To eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the “blood derived normal” over the “solid tissue normal”

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the “D” analyte over the “W” analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are “D” (or both are “W”) analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring first:

– 01, 08 or 14 (all of which refer to broad.mit.edu)

– 09, 21 or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)
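One way to express these preference rules is as a single sort key. The sketch below is not the ISB-CGC implementation: the dict keys `tumor_aliquot` and `normal_type` are hypothetical, the rules are assumed to apply in the order listed, and any center code not named above is assumed to rank last.

```python
CENTER_GROUPS = [
    ("01", "08", "14"),  # broad.mit.edu
    ("09", "21", "30"),  # genome.wustl.edu
    ("10", "12"),        # hgsc.bcm.edu
    ("13", "31"),        # bcgsc.ca
    ("18", "25"),        # ucsc.edu
]
CENTER_RANK = {code: rank for rank, group in enumerate(CENTER_GROUPS) for code in group}


def preference_key(call):
    """Larger keys are preferred; ranks are negated so that rank 0 wins."""
    barcode = call["tumor_aliquot"]
    normal_rank = 0 if call["normal_type"] == "blood derived normal" else 1
    analyte_rank = 0 if barcode[19] == "D" else 1  # analyte letter, e.g. the D in 11D
    center_rank = CENTER_RANK.get(barcode[-2:], len(CENTER_GROUPS))
    plate = barcode[21:25]  # 4-character sequence, compared as text
    return (-normal_rank, -analyte_rank, -center_rank, plate)


def pick_preferred(calls):
    """Return the one call to keep from a list of duplicate calls."""
    return max(calls, key=preference_key)
```

Given the two aliquots from the example above with the same matched normal type, `pick_preferred` keeps the call from TCGA-B0-5695-01A-11D-1534-10.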

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8435 tumor samples from 8373 patients.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Platform

DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP Array 6.0.

Pipeline

DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data.

ETL Details

Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
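A short sketch of the file selection and numeric interpretation described under ETL Details. Both helper names are hypothetical. Note that Python's `int()` does not accept scientific notation directly, so the text is parsed as a float first.

```python
def is_hg19_nocnv_file(filename):
    """True for Level-3 segment files that would be loaded into the table."""
    return filename.endswith("nocnv_hg19.seg.txt")


def parse_segment_fields(num_probes_text, segment_mean_text):
    """Interpret num_probes as an integer and segment_mean as a float."""
    num_probes = int(float(num_probes_text))  # handles values such as 8.7E+07
    segment_mean = float(segment_mean_text)
    return num_probes, segment_mean
```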

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform

DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline

DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data.

ETL Details

The BigQuery table is populated only with the files matching the pattern HumanMethylationtxt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
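The merge logic can be sketched as follows, assuming the Level-3 values have already been read into a list of dicts. The keys `sample` and `platform`, and the platform labels "27k" and "450k", are hypothetical names chosen for the example.

```python
def merge_platforms(rows):
    """Merge 27k and 450k rows, letting 450k win for samples run on both platforms."""
    samples_with_450k = {row["sample"] for row in rows if row["platform"] == "450k"}
    merged = []
    for row in rows:
        # Probes without a usable beta value are left out.
        if row["Beta_Value"] in (None, "NA", "NULL"):
            continue
        # Drop 27k data when the same sample also has 450k data.
        if row["platform"] == "27k" and row["sample"] in samples_with_450k:
            continue
        merged.append({
            "sample": row["sample"],
            "platform": row["platform"],
            "Probe_Id": row["Probe_Id"],
            "Beta_Value": float(row["Beta_Value"]),
        })
    return merged
```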

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline

A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw “RSEM genes normalized results” files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
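As an illustration of the split and lookup, here is a small sketch. The function name and the `hgnc_by_gene_id` dict (a mapping from gene ID to approved symbol) are hypothetical stand-ins for whatever HGNC resource the real pipeline used.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id):
    """Split 'SYMBOL|ID' and attach the approved symbol found for the ID, if any."""
    symbol, _, gene_id = raw_gene_id.partition("|")
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }
```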

BCGSC RNAseq Pipeline

A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw “gene quantification” files have four columns: gene, raw_counts, median_length_normalized and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply “?”, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
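The two-or-three-part split might be sketched as below. Treating both a lone "?" and an empty component as NULL is an assumption, as is leaving suffixes such as `_calculated` untouched; the function name is hypothetical.

```python
def split_bcgsc_gene(gene):
    """Split a BCGSC gene string into symbol, gene ID and optional addenda."""
    # Assumption: "?" and empty components both stand for a missing value.
    parts = [None if part in ("?", "") else part for part in gene.split("|")]
    parts += [None] * (3 - len(parts))  # pad two-part strings
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2],
    }
```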

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for “mature microRNAs”. This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. “hsa-mir-21.MIMAT0000076”. During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
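“Melting” a matrix means turning each (row, column) cell into its own record. The following sketch shows the idea for a matrix given as a header list plus data rows. The output field names, and the assumption that the gene name and accession are separated by the last "." in the identifier, are illustrative only.

```python
def melt_expression_matrix(header, rows):
    """Turn a wide matrix (one column per aliquot) into one record per value."""
    aliquots = header[1:]  # the first header cell labels the identifier column
    tidy = []
    for row in rows:
        # Assumption: 'name.accession', split on the last dot.
        name, _, accession = row[0].rpartition(".")
        for aliquot, value in zip(aliquots, row[1:]):
            tidy.append({
                "mirna_id": name,
                "mirna_accession": accession,
                "aliquot": aliquot,
                "value": float(value),
            })
    return tidy
```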

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Protein

The raw protein data file contains just two columns: the “Composite Element REF”, which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The “Composite Element REF” was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern “_RPPA_Coreprotein_expressiontxt”.

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene and protein map is explained in detail below.

Generation of the Composite_element_ref, gene and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If “protein_name” is missing, generate one from “composite_element_ref”

• Make a map of ‘composite_element_ref’, ‘gene_name’ and ‘protein_name’ values

• Check for any other variants of the gene and protein symbols in the table

• HGNC Validation

– If the gene symbol is in the HGNC approved symbols: ‘Approved’ Gene_symbol = Gene_symbol

– If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol

– If not, check the Previous symbols. If found: Gene_symbol = “Approved” Gene_symbol

– If not: Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm

Formatting

• Duplicate the rows if there are multiple genes concatenated in the “gene_name” value. For example, a ‘gene_name’ with a value like ‘AKT1 AKT2 AKT3’ is stored as three separate rows, with one gene in each row

• ‘Protein_Name’ is split into ‘Protein_Basename’ and ‘Phospho’, which are stored as separate columns

• ‘Composite element ref’ is parsed to get ‘validationStatus’ and ‘antibodySource’; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
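The row-duplication rule from the Formatting list can be sketched as below, with each annotation row held as a dict. Splitting on whitespace or commas is an assumption about how the gene names are concatenated, and the function name is hypothetical.

```python
import re


def expand_gene_rows(row):
    """Return one copy of the row per gene listed in its 'gene_name' value."""
    genes = [g for g in re.split(r"[,\s]+", row["gene_name"].strip()) if g]
    return [dict(row, gene_name=gene) for gene in genes]
```

A row whose gene_name is 'AKT1 AKT2 AKT3' comes back as three rows that differ only in gene_name.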

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data such as cohort definitions is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users with a convenient web-based interface for creating and visualizing collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to “bring the computation to the data”. There are many ways that this can be done, using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single “monolithic” system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily “torn down” (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm’s resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCAOV"}' -H "Content-Type: application/json"
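The same kind of request can be prepared from Python with the standard library alone. The sketch below only builds the request object; the helper name and the filter value `{"Study": "BRCA"}` are assumptions made for the example, and whether the endpoint still responds at this address is not something this document can confirm.

```python
import json
from urllib.request import Request

PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})  # hypothetical filter value
```

Passing `request` to `urllib.request.urlopen` would send it; the response body would then need to be decoded with `json.loads`.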

Access control

To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion string
age_at_initial_pathologic_diagnosis string
anatomic_neoplasm_subdivision string
avg_percent_lymphocyte_infiltration float
avg_percent_monocyte_infiltration float
avg_percent_necrosis float
avg_percent_neutrophil_infiltration float
avg_percent_normal_cells float
avg_percent_stromal_cells float
avg_percent_tumor_cells float
avg_percent_tumor_nuclei float
batch_number integer
bcr string
clinical_M string
clinical_N string
clinical_stage string
clinical_T string
colorectal_cancer string
country string
country_of_procurement string
days_to_birth integer
days_to_collection integer
days_to_death integer
days_to_initial_pathologic_diagnosis integer
days_to_last_followup integer
days_to_submitted_specimen_dx integer
Study string
ethnicity string
frozen_specimen_anatomic_site string
gender string
height integer
histological_type string
history_of_colon_polyps string
history_of_neoadjuvant_treatment string
history_of_prior_malignancy string
hpv_calls string
hpv_status string
icd_10 string
icd_o_3_histology string
icd_o_3_site string
lymph_node_examined_count integer
lymphatic_invasion string
lymphnodes_examined string
lymphovascular_invasion_present string
max_percent_lymphocyte_infiltration integer
max_percent_monocyte_infiltration integer
max_percent_necrosis integer
max_percent_neutrophil_infiltration integer
max_percent_normal_cells integer
max_percent_stromal_cells integer
max_percent_tumor_cells integer
max_percent_tumor_nuclei integer
menopause_status string
min_percent_lymphocyte_infiltration integer
min_percent_monocyte_infiltration integer
min_percent_necrosis integer
min_percent_neutrophil_infiltration integer
min_percent_normal_cells integer
min_percent_stromal_cells integer
min_percent_tumor_cells integer
min_percent_tumor_nuclei integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status string
mononucleotide_marker_panel_analysis_status string
neoplasm_histologic_grade string
new_tumor_event_after_initial_treatment string
number_of_lymphnodes_examined integer
number_of_lymphnodes_positive_by_he integer
ParticipantBarcode string
pathologic_M string
pathologic_N string
pathologic_stage string
pathologic_T string
person_neoplasm_cancer_status string
pregnancies string
preservation_method string
primary_neoplasm_melanoma_dx string
primary_therapy_outcome_success string
prior_dx string
Project string
psa_value float
race string
residual_tumor string
SampleBarcode string
tobacco_smoking_history string
total_number_of_pregnancies integer
tumor_tissue_site string
tumor_pathology string
tumor_type string
weiss_venous_invasion string
vital_status string
weight integer
year_of_initial_pathologic_diagnosis string
SampleTypeCode string
has_Illumina_DNASeq string
has_BCGSC_HiSeq_RNASeq string
has_UNC_HiSeq_RNASeq string
has_BCGSC_GA_RNASeq string
has_UNC_GA_RNASeq string
has_HiSeq_miRnaSeq string
has_GA_miRNASeq string
has_RPPA string
has_SNP6 string
has_27k string
has_450k string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province or country in which the sample was procured
days_to_birth | integer | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD) published by the World Health Organization in 1992_A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | a yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | the yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | the yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer |
Maximum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmax_percent_monocyte_infiltration integer Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen

Table 11 ndash continued from previous pageParameter name Value Descriptionmax_percent_necrosis integer Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmax_percent_neutrophil_infiltration integer Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmax_percent_normal_cells integer Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenmax_percent_stromal_cells integer Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenmax_percent_tumor_cells integer Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenmax_percent_tumor_nuclei integer Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenmenopause_status string Text term to signify the status of a womanrsquos menopause the permanent cessation of menses usually defined by 6 to 12 months of amenorrheamin_percent_lymphocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmin_percent_monocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenmin_percent_necrosis integer Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmin_percent_neutrophil_infiltration integer Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmin_percent_normal_cells integer Minimum in the series of numeric values to represent the percentage of normal cells in a 
malignant tumor sample or specimenmin_percent_stromal_cells integer Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenmin_percent_tumor_cells integer Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenmin_percent_tumor_nuclei integer Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenmononucleotide_and_dinucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing at using a mononucleotide and dinucleotide microsatellite panelmononucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panelneoplasm_histologic_grade string Numeric value to express the degree of abnormality of cancer cells a measure of differentiation and aggressivenessnew_tumor_event_after_initial_treatment string YesNoUnknown indicator to identify whether a patient has had a new tumor event after initial treatmentnumber_of_lymphnodes_examined integer the total number of lymph nodes removed and pathologically assessed for diseasenumber_of_lymphnodes_positive_by_he integer Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (HampE) staining light microscopyParticipantBarcode string The barcode assigned by TCGA to the Participantpathologic_M string Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the regpathologic_N string The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCCrsquos Cancepathologic_stage string The extent of a cancer especially whether the disease has spread from the original site to other parts of the body based on AJCC 
staging criteriapathologic_T string Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American person_neoplasm_cancer_status string The state or condition of an individualrsquos neoplasm at a particular point in timepregnancies string Value to describe the number of full-term pregnancies that a woman has experiencedpreservation_method stringprimary_neoplasm_melanoma_dx string Text indicator to signify whether a person had a primary diagnosis of melanomaprimary_therapy_outcome_success string Measure of Successprior_dx string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrenceProject string The study for which the data was generatedpsa_value float The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the bloodrace string The text for reporting information about race based on the Office of Management and Budget (OMB) categoriesresidual_tumor string Text terms to describe the status of a tissue margin following surgical resectionSampleBarcode string The barcode assigned by TCGA to a sample from a Participanttobacco_smoking_history string Category describing current smoking status and smoking history as self-reported by a patienttotal_number_of_pregnancies integertumor_tissue_site string Text term that describes the anatomic site of the tumor or diseasetumor_pathology stringtumor_type string Text term to identify the morphologic subtype of papillary renal cell carcinomaweiss_venous_invasion string The result of an assessment using the Weiss histopathologic criteriavital_status string the survival state of the person registered on the protocolweight integer the weight of the patient measured in kilogramsyear_of_initial_pathologic_diagnosis string Numeric value to represent the year of an individualrsquos initial pathologic diagnosis of 
cancerSampleTypeCode string the type of the sample tumor or normal tissue cell or blood sample provided by a participanthas_Illumina_DNASeq string Indicates if a sample has gene sequencing data ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_BCGSC_HiSeq_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquo

Table 1.1 (continued)

Parameter name        Value   Description
has_UNC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None".
has_BCGSC_GA_RNASeq   string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None".
has_UNC_GA_RNASeq     string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None".
has_HiSeq_miRnaSeq    string  Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None".
has_GA_miRNASeq       string  Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None".
has_RPPA              string  Indicates if a sample has protein array data: "True", "False", or "None".
has_SNP6              string  Indicates if a sample has copy number data: "True", "False", or "None".
has_27k               string  Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None".
has_450k              string  Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None".

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "patient_count": string,
      "patients": [string],
      "sample_count": string,
      "samples": [string]
    }

Property name  Value                   Description
kind           cohort_api#cohortsItem  The resource type.
patient_count  string                  Number of participants in this cohort.
patients[]     list                    List of participant barcodes in this cohort.
sample_count   string                  Number of samples in this cohort.
samples[]      list                    List of sample barcodes in this cohort.
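Because the counts in this response are returned as strings, client code usually converts them before doing arithmetic. The following is a minimal sketch of that conversion; the function name summarize_cohort_response and the example barcodes are illustrative assumptions, not part of the ISB-CGC API.

```python
def summarize_cohort_response(body):
    """Return the cohort lists with the string counts converted to integers."""
    patients = list(body.get("patients", []))
    samples = list(body.get("samples", []))
    return {
        # Fall back to the list length if a count field is missing.
        "patient_count": int(body.get("patient_count", len(patients))),
        "sample_count": int(body.get("sample_count", len(samples))),
        "patients": patients,
        "samples": samples,
    }


# Hypothetical response body, shaped like the structure documented above.
example = {
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
}
summary = summarize_cohort_response(example)
```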

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name   Value   Description
Path parameters
patient_barcode  string  Barcode of the patient to get information about. Required.
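As a rough illustration, the request URL for this method could be assembled as follows. This sketch assumes that patient_barcode is sent as a URL query-string parameter, which is not stated explicitly above; the helper name patient_details_url is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint root taken from the HTTP request line for this method.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build a patient_details request URL for a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes have length 12, e.g. TCGA-B9-7268")
    # Assumption: the barcode is passed in the query string.
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


url = patient_details_url("TCGA-B9-7268")
```

The resulting URL can be fetched with any HTTP client; no access token is needed, since this method does not require authentication.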

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "clinical_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "Study": string,
        "age_at_initial_pathologic_diagnosis": string,
        "anatomic_neoplasm_subdivision": string,
        "batch_number": string,
        "bcr": string,
        "clinical_M": string,
        "clinical_N": string,
        "clinical_T": string,
        "clinical_stage": string,
        "colorectal_cancer": string,
        "country": string,
        "days_to_birth": string,
        "days_to_initial_pathologic_diagnosis": string,
        "days_to_last_followup": string,
        "ethnicity": string,
        "frozen_specimen_anatomic_site": string,
        "gender": string,
        "histological_type": string,
        "history_of_colon_polyps": string,
        "history_of_neoadjuvant_treatment": string,
        "history_of_prior_malignancy": string,
        "hpv_calls": string,
        "hpv_status": string,
        "icd_10": string,
        "icd_o_3_histology": string,
        "icd_o_3_site": string,
        "lymphatic_invasion": string,
        "lymphnodes_examined": string,
        "lymphovascular_invasion_present": string,
        "menopause_status": string,
        "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
        "mononucleotide_marker_panel_analysis_status": string,
        "neoplasm_histologic_grade": string,
        "new_tumor_event_after_initial_treatment": string,
        "number_of_lymphnodes_examined": string,
        "number_of_lymphnodes_positive_by_he": string,
        "pathologic_M": string,
        "pathologic_N": string,
        "pathologic_T": string,
        "pathologic_stage": string,
        "person_neoplasm_cancer_status": string,
        "pregnancies": string,
        "primary_neoplasm_melanoma_dx": string,
        "primary_therapy_outcome_success": string,
        "prior_dx": string,
        "race": string,
        "residual_tumor": string,
        "tobacco_smoking_history": string,
        "tumor_tissue_site": string,
        "tumor_type": string,
        "vital_status": string,
        "weiss_venous_invasion": string,
        "year_of_initial_pathologic_diagnosis": string
      },
      "samples": [string]
    }

Property name Value Descriptionkind cohort_apicohortsItem The resource typealiquots[] list List of barcodes of aliquots taken from this participantclinical_data nested object The clinical data about the participantclinical_dataParticipantBarcode string Participant barcodeclinical_dataProject string Project name eg ldquoTCGArdquoclinical_dataStudy string Tumor type abbreviation eg ldquoBRCArdquoclinical_dataage_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed in yearsclinical_dataanatomic_neoplasm_subdivision string Text term to describe the spatial location subdivisions andor anatomic site name of a tumorclinical_databatch_number string Groups samples by the batch they were processed inclinical_databcr string Biospecimen core resource eg ldquoNationwide Childrenrsquos Hospitalrdquo ldquoWashington Universityrdquoclinical_dataclinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_N string Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_stage string Stage group determined from clinical information on the tumor (T) regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancerclinical_datacolorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancerclinical_datacountry string Text to identify the name of the state province or country in which the sample was procuredclinical_datadays_to_birth string Time interval from a personrsquos date of birth to the date of initial pathologic diagnosis represented as a calculated number of 
daysclinical_datadays_to_initial_pathologic_diagnosis string Numeric value to represent the day of an individualrsquos initial pathologic diagnosis of cancerclinical_datadays_to_last_followup string Time interval from the date of last followup to the date of initial pathologic diagnosis represented as a calculated number of daysclinical_dataethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categoriesclinical_datafrozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sampleclinical_datagender string Text designations that identify genderclinical_datahistological_type string Text term for the structural pattern of cancer cells used to define a microscopic diagnosisclinical_datahistory_of_colon_polyps string YesNo indicator to describe if the subject had a previous history of colon polyps as noted in the historyphysical or previous endoscopic report(s)clinical_datahistory_of_neoadjuvant_treatment string Text term to describe the patientrsquos history of neoadjuvant treatment and the kind of treament given prior to resection of the tumorclinical_datahistory_of_prior_malignancy string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrenceclinical_datahpv_calls string Results of HPV testsclinical_datahpv_status string Current HPV statusclinical_dataicd_10 string The tenth version of the International Classification of Disease (ICD)clinical_dataicd_o_3_histology string The third edition of the International Classification of Diseases for Oncologyclinical_dataicd_o_3_site string The third edition of the International Classification of Diseases for Oncologyclinical_datalymphatic_invasion string A yesno indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvementclinical_datalymphnodes_examined 
string The yesnounknown indicator whether a lymph node assessment was performed at the primary presentation of diseaseclinical_datalymphovascular_invasion_present string The yesno indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimenclinical_datamenopause_status string Text term to signify the status of a womanrsquos menopause the permanent cessation of menses usually defined by 6 to 12 months of amenorrheaclinical_datamononucleotide_and_dinucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing at using a mononucleotide and dinucleotide microsatellite panelclinical_datamononucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panelclinical_dataneoplasm_histologic_grade string Numeric value to express the degree of abnormality of cancer cells a measure of differentiation and aggressivenessclinical_datanew_tumor_event_after_initial_treatment string YesNoUnknown indicator to identify whether a patient has had a new tumor event after initial treatmentclinical_datanumber_of_lymphnodes_examined string The total number of lymph nodes removed and pathologically assessed for diseaseclinical_datanumber_of_lymphnodes_positive_by_he string Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (HampE) staining light microscopyclinical_datapathologic_M string Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the regclinical_datapathologic_N string The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCCrsquos Canceclinical_datapathologic_stage string The extent of a cancer especially whether the disease has spread from the original site to other parts of 
the body based on AJCC staging criteriaclinical_datapathologic_T string Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American

Table 1.2 (continued)

Property name                                       Value   Description
clinical_data.person_neoplasm_cancer_status         string  The state or condition of an individual's neoplasm at a particular point in time.
clinical_data.pregnancies                           string  Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx          string  Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success       string  Measure of success.
clinical_data.prior_dx                              string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race                                  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor                        string  Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history               string  Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site                     string  Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type                            string  Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status                          string  The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion                 string  The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
samples[]                                           list    List of barcodes of samples taken from this participant.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name   Value   Description
Path parameters
sample_barcode   string  Barcode of the sample to get information about. Required.
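A sample barcode extends the 12-character participant barcode with a sample suffix, so the participant can be recovered from the sample barcode before any request is made. The sketch below shows one way to validate the input and derive the participant barcode; the function name participant_from_sample is a hypothetical helper, not an API method.

```python
def participant_from_sample(sample_barcode):
    """Return the 12-character participant barcode embedded in a sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("sample barcodes have length 16, e.g. TCGA-B9-7268-01A")
    # The first 12 characters identify the participant.
    return sample_barcode[:12]


participant = participant_from_sample("TCGA-B9-7268-01A")
```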

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Table 1.3

Property name                                         Value          Description
kind                                                  cohort_api#cohortsItem  The resource type.
aliquots[]                                            list           List of barcodes of aliquots taken from this participant.
biospecimen_data                                      nested object  Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode                   string         Participant barcode.
biospecimen_data.Project                              string         Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode                        string         Sample barcode.
biospecimen_data.Study                                string         Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration  integer        Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration    integer        Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis                 integer        Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration  integer        Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells             integer        Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells            integer        Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells              integer        Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei             integer        Average percent tumor nuclei.
biospecimen_data.batch_number                         string         Batch number in which the sample was processed.
biospecimen_data.bcr                                  string         Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection                   string         Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration  string         Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration    string         Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis                 string         Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration  string         Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells             string         Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells            string         Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells              string         Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei             string         Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration  string         Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration    string         Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis                 string         Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration  string         Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells             string         Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells            string         Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells              string         Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei             string         Minimum percent tumor nuclei.
data_details[]                                        list           List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath                       string         Path to file, if it exists.
data_details[].DataCenterName                         string         Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType                         string         Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName                           string         Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey                        string         Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded                       string         Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel                              string         Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype                               string         Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference                        string         Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline                               string         A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform                               string         A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name                     string         The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project                                string         The study for which the data was generated, e.g. "TCGA".
data_details[].Repository                             string         A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName                           string         Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode                          string         Sample barcode.
data_details[].SecurityProtocol                       string         An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count                                    string         Length of data_details list.
patient                                               string         Participant barcode.
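A common next step after calling sample_details is to pick out the open-access files and group them by platform. The sketch below does this over an already-parsed response body. The function name, the example bucket paths, and the case-insensitive match on the word "open" in SecurityProtocol are all assumptions made for illustration; check the values actually returned before relying on them.

```python
from collections import defaultdict


def open_access_paths_by_platform(sample_details):
    """Group the cloud storage paths of open-access files by Platform."""
    grouped = defaultdict(list)
    for detail in sample_details.get("data_details", []):
        protocol = detail.get("SecurityProtocol", "")
        path = detail.get("CloudStoragePath")
        # Skip controlled-access entries and entries without a stored path.
        if "open" in protocol.lower() and path:
            grouped[detail.get("Platform", "unknown")].append(path)
    return dict(grouped)


# Hypothetical response fragment; the bucket and file names are made up.
example = {
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "SecurityProtocol": "DBGap Open Access",
         "CloudStoragePath": "gs://example-bucket/a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "SecurityProtocol": "DBGap Protected Access",
         "CloudStoragePath": "gs://example-bucket/b.bam"},
        {"Platform": "Genome_Wide_SNP_6",
         "SecurityProtocol": "DBGap Open Access",
         "CloudStoragePath": ""},
    ]
}
paths = open_access_paths_by_platform(example)
```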

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name   Value   Description
Path parameters
sample_barcode   string  Required. Barcode of the sample to get file paths for.
platform         string  Optional. Filter file results by platform.
pipeline         string  Optional. Filter file results by pipeline.
token            string  Optional. Access token to authenticate user.
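The optional parameters only need to be sent when they are set. A minimal sketch of building the request URL is shown below; it assumes the parameters are passed in the URL query string, and the helper name file_list_url is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint root taken from the HTTP request line for this method.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def file_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build a datafilenamekey_list_from_sample URL, omitting unset filters."""
    params = {"sample_barcode": sample_barcode}
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    # Assumption: all parameters travel in the query string.
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


url = file_list_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq")
```

Leaving out token yields a request that can only return open-access file paths, as described above.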

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name       Value                   Description
kind                cohort_api#cohortsItem  The resource type.
count               string                  Integer representing the length of the datafilenamekeys list.
datafilenamekeys[]  list                    List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
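Client code has to cope with two special cases described above: the datafilenamekeys field may be absent altogether, and individual entries may carry the "file-path-not-yet-available" placeholder. The following sketch handles both; the function name and the example bucket paths are illustrative assumptions.

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_file_paths(body):
    """Return only the usable paths from a datafilenamekey_list_from_sample response."""
    # The field is omitted entirely when no file paths are listed.
    keys = body.get("datafilenamekeys", [])
    # Drop entries that only name the bucket plus the placeholder text.
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical response bodies; the bucket and file names are made up.
with_paths = {
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/file-path-not-yet-available",
        "gs://example-bucket/sample.txt",
    ],
}
without_paths = {"kind": "cohort_api#cohortsItem"}

usable = available_file_paths(with_paths)
empty = available_file_paths(without_paths)
```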

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name   Value   Description
Path parameters
sample_barcode   string  Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

kind: cohort_api#cohortsItem
count: string
items: [
    SampleBarcode: string
    GG_dataset_id: string
    GG_readgroupset_id: string
]

Property name                Value                    Description
kind                         cohort_api#cohortsItem   The resource type.
count                        string                   The number of items returned. Count will be either "0" or "1".
items[]                      list                     If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode        string                   The sample barcode passed into the request.
items[].GG_dataset_id        string                   The dataset id of the sample.
items[].GG_readgroupset_id   string                   The readgroupset id of the sample.
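The request and response for this method could be handled roughly as in the Python sketch below. It assumes that sample_barcode is passed as a query-string parameter (the documentation above does not show how the parameter is encoded) and it uses a hypothetical barcode; the helper names are illustrative and are not part of the ISB-CGC API.

```python
from typing import Any, Dict, Optional, Tuple
from urllib.parse import urlencode

BASE_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
    "google_genomics_from_sample"
)


def build_request_url(sample_barcode: str) -> str:
    """Build the GET URL for one sample."""
    # Assumption: the barcode is sent as a query-string parameter.
    return BASE_URL + "?" + urlencode({"sample_barcode": sample_barcode})


def parse_ids(response: Dict[str, Any]) -> Optional[Tuple[str, str]]:
    """Return (dataset id, readgroupset id), or None if the sample has none."""
    # count is a string ("0" or "1"); items holds at most one object.
    if response.get("count") == "0" or not response.get("items"):
        return None
    item = response["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Hypothetical barcode, for illustration only.
url = build_request_url("TCGA-XX-0000-01A")
```

The resulting URL can be fetched with any HTTP client and the decoded JSON body passed to parse_ids.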


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says "Enable", click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account
• project
• compute/region
• compute/zone
• container/cluster


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
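If you would rather drive gcloud from a script, the create command can be assembled as an argument list, as in the Python sketch below. It only builds and prints the command and does not run it; the helper name create_instance_command is hypothetical, while --machine-type and --zone are standard gcloud flags.

```python
import shlex
from typing import List, Optional


def create_instance_command(
    name: str, machine_type: str, zone: Optional[str] = None
) -> List[str]:
    """Build the argument list for `gcloud compute instances create`."""
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    # Without --zone, gcloud falls back to the compute/zone property.
    if zone is not None:
        cmd += ["--zone", zone]
    return cmd


cmd = create_instance_command("my-instance", "g1-small", zone="us-central1-a")
# Shell-quoted form, handy for logging before the command is run.
printable = " ".join(shlex.quote(part) for part in cmd)
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list rather than a single string to subprocess.run avoids shell-quoting problems with unusual instance names.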

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk costs only $2 per month, and 1 TB costs $40.
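The figures quoted above work out to roughly $0.04 per GB per month. The Python sketch below uses that rate for a quick estimate; the rate is an assumption derived from this paragraph (treating 1 TB as 1000 GB), so check current Google Cloud pricing before relying on it.

```python
# Assumed rate, derived from the figures quoted above ($2 for 50 GB).
STANDARD_PD_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb: float,
                      rate: float = STANDARD_PD_USD_PER_GB_MONTH) -> float:
    """Estimated monthly charge in USD for a standard persistent disk."""
    return round(size_gb * rate, 2)


cost_50_gb = monthly_disk_cost(50)
cost_1_tb = monthly_disk_cost(1000)
```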

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then attach it to the instance, and finally format it. When you are finished you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is a required argument; the optional --device-name flag sets the name under which the disk appears inside the instance. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point. On the instance, attached persistent disks are listed under /dev/disk/by-id/ with a google- prefix in front of the device name, so the commands below assume the disk was attached with the device name disk-1 (run ls /dev/disk/by-id/ to check):

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool. The device path below assumes the disk was attached with the device name disk-1:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
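The full life cycle of an additional disk can be summarized as an ordered list of gcloud commands. The Python sketch below only builds that list; the helper name disk_lifecycle_commands is hypothetical, and the zone is assumed to come from your compute/zone property.

```python
from typing import List


def disk_lifecycle_commands(instance: str, disk: str,
                            size_gb: int) -> List[List[str]]:
    """Return the gcloud commands for one disk, in the order they are run."""
    return [
        ["gcloud", "compute", "disks", "create", disk,
         "--size", f"{size_gb}GB"],
        ["gcloud", "compute", "instances", "attach-disk", instance,
         "--disk", disk],
        # Formatting, mounting, and unmounting happen on the instance itself
        # (via `gcloud compute ssh`), so they are not listed here.
        ["gcloud", "compute", "instances", "detach-disk", instance,
         "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


steps = disk_lifecycle_commands("my-instance", "disk-1", 500)
```

Each inner list can be passed to subprocess.run one at a time, which makes it easy to stop if any step fails.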


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after you select at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after you select at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue box you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header shows whether that column is being sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
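One plausible reading of "based on all other filters" is that the count shown next to each value of a feature applies every selected filter except the one on that feature. The Python sketch below illustrates that reading on hypothetical data; the field names and the function facet_counts are illustrative and do not describe how the web application is actually implemented.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Set

Sample = Mapping[str, str]


def facet_counts(samples: Iterable[Sample], feature: str,
                 selected: Mapping[str, Set[str]]) -> Dict[str, int]:
    """Count samples per value of `feature`, applying all other filters."""
    counts: Counter = Counter()
    for sample in samples:
        # A sample passes if it matches every selected filter except the
        # one on the feature currently being counted.
        if all(sample.get(name) in allowed
               for name, allowed in selected.items() if name != feature):
            counts[sample.get(feature, "None")] += 1
    return dict(counts)


# Hypothetical samples and filter selection, for illustration only.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
vital_counts = facet_counts(samples, "vital_status", selected)
gender_counts = facet_counts(samples, "gender", selected)
```

Here the Vital Status counts are computed over the two FEMALE samples only, while the Gender counts are computed over the two Alive samples only.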

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
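The number shown when hovering on a swatch can be thought of as a count of samples for one pair of platform segments on two adjacent bars. The Python sketch below computes those pair counts for a hypothetical cohort; the platform names, the record layout, and the function platform_pair_counts are illustrative assumptions rather than the web application's own code.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping, Tuple


def platform_pair_counts(samples: Iterable[Mapping[str, str]],
                         left: str, right: str) -> Dict[Tuple[str, str], int]:
    """Count samples for each (platform, platform) pair of two data types."""
    # Samples with no data for a type fall into the "NA" segment.
    return dict(Counter(
        (sample.get(left, "NA"), sample.get(right, "NA"))
        for sample in samples
    ))


# Hypothetical cohort: each record maps a data type to its platform.
cohort = [
    {"DNA Methylation": "HumanMethylation450", "Protein": "MDA_RPPA_Core"},
    {"DNA Methylation": "HumanMethylation450"},
    {"DNA Methylation": "HumanMethylation27", "Protein": "MDA_RPPA_Core"},
]
pairs = platform_pair_counts(cohort, "DNA Methylation", "Protein")
```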

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.
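The three operations behave like ordinary set operations on the sample barcodes in each cohort. The Python sketch below models a cohort as a set of hypothetical sample identifiers; it illustrates the semantics described above and is not the web application's implementation.

```python
from functools import reduce
from typing import List, Set


def union(cohorts: List[Set[str]]) -> Set[str]:
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(cohorts: List[Set[str]]) -> Set[str]:
    """Samples that appear in every cohort; order does not matter."""
    return set(reduce(lambda a, b: a & b, cohorts)) if cohorts else set()


def complement(base: Set[str], others: List[Set[str]]) -> Set[str]:
    """Samples in the base cohort that appear in none of the others."""
    return base - set().union(*others)


# Hypothetical cohorts, for illustration only.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
only_in_a = complement(cohort_a, [cohort_b])
only_in_b = complement(cohort_b, [cohort_a])
```

Swapping the base cohort changes the result of the complement, which is why that operation needs a designated base while union and intersect do not.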

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.


• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
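Once downloaded, the file list can be processed with a few lines of Python. The sketch below groups Cloud Storage paths by platform; the header names and the example rows are hypothetical, since the exact column headings in the downloaded CSV are not shown here and may differ.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, List, TextIO

# Hypothetical file contents; the real header names may differ.
CSV_TEXT = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-XX-0000-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/a.txt
TCGA-XX-0000-01A,PlatformB,pipeline2,Level 3,gs://example-bucket/b.txt
TCGA-XX-0001-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/c.txt
"""


def paths_by_platform(handle: TextIO) -> Dict[str, List[str]]:
    """Group the Cloud Storage paths in the file list by platform."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


by_platform = paths_by_platform(io.StringIO(CSV_TEXT))
```

For a real download, open the CSV file with open(path, newline="") and pass the file handle in place of the StringIO object.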

Commenting

Any user who owns a cohort or has had a cohort shared with them can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete it from the top right menu option.

Copy

A visualization can only be copied from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization. The default plot is an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  - Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Platform Filter: filters down the plot-able features by platform.
  - Center Filter: filters down the plot-able features by processing center.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA
  - miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  - Platform Filter: filters down the plot-able features by platform.
  - Value Filter: filters down the plot-able features by value.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  - Platform Filter: filters down the plot-able features by platform.
  - Gene Region Filter: filters down the plot-able features by specific gene regions.
  - CpG Island Region Filter: filters down the plot-able features by CpG island region.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Value Filter: filters down the plot-able features by value.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Value Filter: filters down the plot-able features by mutation value.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox overrides any feature selected under Color By Feature. The plot will instead be colored by the cohorts provided, and the cohorts will be used as the legend.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will become active and you can proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the option in the top-right menu.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  - You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and you will be able to make changes.
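
The Owner and Reader rules above can be summarized in a small model. The following Python sketch is only an illustration of the rules as described here, not the web-app's actual implementation: the class and method names (`SharedItem`, `can_save`, `clone`, and so on) are hypothetical, and it assumes that an owner may also comment.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    READER = "reader"


@dataclass
class SharedItem:
    """A cohort or visualization with one owner and zero or more readers."""
    name: str
    owner: str
    readers: set = field(default_factory=set)

    def role_of(self, user):
        """Return the user's role, or None (items are private by default)."""
        if user == self.owner:
            return Role.OWNER
        if user in self.readers:
            return Role.READER
        return None

    def can_save(self, user):
        # Only the owner can save changes; readers have view-only access.
        return self.role_of(user) is Role.OWNER

    def can_comment(self, user):
        # Readers can comment; allowing the owner too is an assumption.
        return self.role_of(user) is not None

    def share_with(self, actor, user):
        # Only the owner can share an item.
        if self.role_of(actor) is not Role.OWNER:
            raise PermissionError("only the owner can share this item")
        self.readers.add(user)

    def clone(self, user):
        """Copy the item; the user who clones becomes owner of the copy."""
        if self.role_of(user) is None:
            raise PermissionError("no access to this item")
        return SharedItem(name=self.name + " (copy)", owner=user)
```

A reader who wants to save changes would first call `clone` and then work on the returned copy, which starts out private to them.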

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2
  - Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  - Cohort file list download is not working.

• December 3, 2015: v0.1
  - First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence data (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
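
Once the “isb-cgc” project is displayed, its tables can also be referenced by a fully qualified name in a query. The following Python sketch only builds the query text and does not contact BigQuery. The project ID comes from the answer above; the dataset and table names passed in are placeholders chosen by the caller, and the backtick-quoted `project.dataset.table` form assumes BigQuery's standard SQL syntax rather than the legacy dialect.

```python
ISB_CGC_PROJECT = "isb-cgc"  # project ID given in the text above


def table_ref(dataset, table, project=ISB_CGC_PROJECT):
    """Return a backtick-quoted, fully qualified table reference."""
    for part in (project, dataset, table):
        if not part or "`" in part:
            raise ValueError("invalid name component: %r" % (part,))
    return "`{}.{}.{}`".format(project, dataset, table)


def preview_query(dataset, table, limit=10):
    """Build a query that previews the first few rows of a table."""
    limit = int(limit)
    if limit <= 0:
        raise ValueError("limit must be positive")
    return "SELECT * FROM {} LIMIT {}".format(table_ref(dataset, table), limit)
```

The resulting string can be pasted into the BigQuery web interface or passed to a BigQuery client library from a script running under your own Google Cloud Project.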

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page and, after you successfully authenticate, you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to their dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in their institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
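
The unlink-before-relink rule in the last answer amounts to keeping at most one Google identity per eRA Commons id. The sketch below illustrates that constraint with an in-memory dictionary; the class `NihLinkRegistry` and its methods are hypothetical names used for illustration and do not describe how the platform actually stores these links.

```python
class NihLinkRegistry:
    """Tracks which Google identity each eRA Commons id is linked to."""

    def __init__(self):
        self._google_by_era = {}

    def link(self, era_id, google_identity):
        """Link an eRA Commons id, refusing if it is linked elsewhere."""
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            raise ValueError(
                "eRA Commons id is already linked; unlink it from the "
                "previous Google identity first"
            )
        self._google_by_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        """Remove a link; only the currently linked identity may do so."""
        if self._google_by_era.get(era_id) != google_identity:
            raise ValueError("this Google identity is not linked to that id")
        del self._google_by_era[era_id]

    def linked_identity(self, era_id):
        """Return the linked Google identity, or None if there is none."""
        return self._google_by_era.get(era_id)
```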

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
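
Because access lapses 24 hours after verification, a long-running script may want to check how much of that window remains before starting work on controlled-access data. The following Python sketch shows the arithmetic only. The function names are hypothetical, and treating the window as ending exactly 24 hours after the verification time is an assumption about the boundary.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the text above


def access_expires_at(verified_at):
    """Return the time at which controlled-data access lapses."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while `now` falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


def time_remaining(verified_at, now):
    """Return the time left in the window (zero once it has lapsed)."""
    return max(access_expires_at(verified_at) - now, timedelta(0))
```

For example, a job expected to run for six hours could be started only if `time_remaining(...)` exceeds `timedelta(hours=6)`; otherwise you would re-authenticate in the web-app first.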

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
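
To keep an eye on your own spending against the initial allocation, a rough calculation like the following can help. It is a sketch under stated assumptions, not the monitoring that the ISB-CGC team performs: the $300 figure comes from the text above, while the function names, the list of daily costs, and the 80% warning threshold are all hypothetical choices made for illustration.

```python
INITIAL_ALLOCATION_USD = 300.0  # initial allocation stated in the text


def remaining_budget(daily_costs, allocation=INITIAL_ALLOCATION_USD):
    """Return the allocation left after the given daily costs (USD)."""
    return allocation - sum(daily_costs)


def approaching_limit(daily_costs, allocation=INITIAL_ALLOCATION_USD,
                      warn_fraction=0.8):
    """True once spending reaches warn_fraction of the allocation.

    The 0.8 default is an arbitrary illustrative threshold.
    """
    return sum(daily_costs) >= warn_fraction * allocation


def days_until_exhausted(daily_costs, allocation=INITIAL_ALLOCATION_USD):
    """Estimate days left at the average daily spend seen so far.

    Returns None when there is no spending history to extrapolate from.
    """
    if not daily_costs:
        return None
    average = sum(daily_costs) / len(daily_costs)
    if average <= 0:
        return None
    return max(remaining_budget(daily_costs, allocation), 0.0) / average
```

Running this on the costs from a few days of prototype analyses gives a first estimate of whether additional funding will be needed.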

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.1 About the ISB Cancer Genomics Cloud

The ISB-CGC provides interactive and programmatic access to the TCGA data, leveraging many aspects of the Google Cloud Platform including BigQuery, Compute Engine, App Engine, Cloud Datalab, and Google Genomics. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and genomic reference and platform-annotation sources, are stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data is available to dbGaP-authorized users in the original BAM and FASTQ file formats.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to rapidly analyze hundreds of terabytes of sequence data. The ISB-CGC allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2 Cloud-Hosted Data Sets

The ISB-CGC platform hosts the majority of the TCGA data set, as well as other reference and annotation datasets, in different appropriate Google Cloud technologies.

1.2.1 About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data being hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

• TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.
• CGHub: the Cancer Genomics Hub is NCI’s current secure data repository for all TCGA BAM and FASTQ sequence data files.

The ISB-CGC platform is one of NCI’s Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level (“Level 1”) data has been processed through a set of standardized pipelines, and the resulting high-level (“Level 3”) data is frequently the data that is used in most downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.
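
To put the proportions above in concrete terms: if over 99% of roughly one petabyte is low-level sequence data, then everything else fits in less than about 10 terabytes. The short Python sketch below does this arithmetic. It assumes 1 petabyte = 1000 terabytes and treats the approximate figures in the text as exact, so the result is only a rough upper bound; the function name is hypothetical.

```python
TOTAL_TB = 1000  # ~1 petabyte, assuming 1 PB = 1000 TB
LOW_LEVEL_MIN_PERCENT = 99  # "over 99%" is low-level sequence data


def upper_bound_other_tb(total_tb, low_level_min_percent):
    """Upper bound (in TB) on the data that is not low-level sequence."""
    return total_tb * (100 - low_level_min_percent) / 100


other_tb = upper_bound_other_tb(TOTAL_TB, LOW_LEVEL_MIN_PERCENT)
print("Non-sequence data: less than about %.0f TB" % other_tb)
```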

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset, including:

• DNA sequencing of tumor samples and matched normals (typically blood samples) in order to detect somatic mutations
• SNP-array-based DNA copy-number and genotyping analysis of tumor samples and matched normals
• DNA methylation of tumor samples
• messenger RNA (mRNA) expression analysis of the tumor samples to capture the gene expression profile
• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse phase protein array).

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data there are typically three levels of data:

• Level 1 typically represents raw, un-normalized data.
• Level 2 typically represents an intermediate level of processing and/or normalization of the data.
• Level 3 typically represents aggregated, normalized, and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as “Level 4” data. More information about Data Level Classification can be found on the NCI wiki.
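
When scripting against file listings or metadata, it can be handy to turn level labels into numbers and look up what they mean. The following Python sketch encodes the descriptions given above. The label formats it accepts (“Level 3”, “Level-3”) are the ones used in this documentation; whether actual metadata fields use the same spelling is an assumption, and the function names are hypothetical.

```python
import re

# Descriptions paraphrased from the text above.
DATA_LEVELS = {
    1: "raw, un-normalized data",
    2: "intermediate processing and/or normalization",
    3: "aggregated, normalized, and/or segmented data",
    4: "results of integrative or pan-cancer analyses",
}


def parse_level(label):
    """Extract the level number from labels like 'Level 3' or 'Level-3'."""
    match = re.fullmatch(r"\s*level[\s-]*([1-4])\s*", label,
                         flags=re.IGNORECASE)
    if match is None:
        raise ValueError("not a recognized data level: %r" % (label,))
    return int(match.group(1))


def describe_level(label):
    """Return the short description for a level label."""
    return DATA_LEVELS[parse_level(label)]
```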

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Understanding the TCGA Data Platforms

When working with any of the data types, it is important to be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained first using the Illumina HumanMethylation27 platform and later using the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.
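
One simple way to respect the platform and pipeline differences described above is to split a cohort's measurements by platform before doing any pooled analysis. The Python sketch below shows that bookkeeping on plain tuples. The record layout `(sample_id, platform, value)` and the function names are hypothetical; they do not reflect the schema of any ISB-CGC table.

```python
from collections import defaultdict


def group_by_platform(records):
    """Group (sample_id, platform, value) records by platform."""
    groups = defaultdict(list)
    for sample_id, platform, value in records:
        groups[platform].append((sample_id, value))
    return dict(groups)


def platforms_used(records):
    """Return the sorted list of distinct platforms in the records."""
    return sorted({platform for _, platform, _ in records})


def is_mixed_platform(records):
    """True when the records come from more than one platform."""
    return len(platforms_used(records)) > 1
```

If `is_mixed_platform` returns True for a cohort, each group can be analyzed separately, or a platform covariate can be included, rather than pooling the raw values directly.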

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage.
• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Understanding Data Access

• Public Data: Sometimes the word “public” is misinterpreted as meaning “open”. All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.
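
The open versus controlled split described above can be written down as a simple lookup. The following Python sketch is an illustration only: the short keys such as `"snp_array_cel"` are hypothetical labels invented here, the table covers just the categories named in the list above (the text notes that a small amount of other data is also controlled), and a single boolean stands in for being both authenticated and dbGaP-authorized.

```python
OPEN = "open"
CONTROLLED = "controlled"

# Categories named in the list above; keys are illustrative labels.
ACCESS_CLASS = {
    "clinical": OPEN,
    "biospecimen": OPEN,
    "gene_expression": OPEN,        # Level-3 molecular data
    "dna_methylation": OPEN,
    "dna_copy_number": OPEN,
    "protein_expression": OPEN,
    "somatic_mutation_calls": OPEN,
    "dna_seq_reads": CONTROLLED,    # low-level sequence data
    "rna_seq_reads": CONTROLLED,
    "snp_array_cel": CONTROLLED,    # raw SNP array CEL files
    "germline_mutation_calls": CONTROLLED,
}


def may_access(kind, has_dbgap_authorization):
    """Return True if a user may access data of the given kind.

    Raises KeyError for kinds not covered by the table, since the
    access class of other data cannot be inferred from this list.
    """
    access = ACCESS_CLASS[kind]
    return access == OPEN or bool(has_dbgap_authorization)
```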

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project.
• Submit a data access request (DAR); the DAR requires NIH log-in.

Note: Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions.) dbGaP Data Access Request Portal.

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution. Sample DUC form.

• Protect data confidentiality. (Any data used in a study which was initially listed as "Controlled" or TCGA remains controlled data and MUST be protected accordingly, unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

Project 1 has only users authorized to access controlled data and a DUC in place.

Project 2 has only open-access users. NO controlled data access is allowed from or by Project 2 member(s).)

Remember: YOU and YOUR institution are accountable for ensuring the security of this data, not the cloud service provider. Securing controlled data and protecting it should be thought of in the same manner as any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: The Cloud, dbGaP and the NIH, blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).

1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs. controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC;

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC;

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC;

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC;

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC; and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager.

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our github repositories.

This introductory tutorial gives a great overview of all of the tables, and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples;

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC);

• MAF files: these "protected" mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types);

• DNA-seq BAM files: these Level-1 data files were provided by CGHub:

– over 37,000 of these files are available in Google Cloud Storage (GCS)

– roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

– BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub:

– over 13,000 mRNA-seq BAM files are available in GCS

– over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files.

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more "cloud-aware" approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives based on our most recent upload.

Note that all "bio" archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• the unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
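The general value-cleaning notes above can be illustrated with a small Python sketch. This is not the actual ISB-CGC ETL code: the function name clean_value, the exact set of missing-value strings, and the regular expressions used to recognize numbers are assumptions made for illustration only.

```python
import re

# Strings treated as missing. The set mirrors the examples listed above;
# the real ETL may recognize additional markers (assumption).
MISSING_STRINGS = {"none", "None", "NONE", "null", "Null", "NULL",
                   "NA", "__UNKNOWN__", "<blank>", ""}

# Hypothetical patterns for numbers as they might appear in the raw files.
THOUSANDS = re.compile(r"^[+-]?\d{1,3}(,\d{3})+(\.\d+)?$")
INTEGER = re.compile(r"^[+-]?\d+$")
DECIMAL = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def clean_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    # strip() also drops trailing CTRL-M (carriage return) characters.
    value = raw.strip()
    # Remove one pair of surrounding quotes; quotes inside the string stay.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    if value in MISSING_STRINGS:
        return None
    # Drop comma separators only when the whole value looks like a number.
    candidate = value.replace(",", "") if THOUSANDS.match(value) else value
    if INTEGER.match(candidate):
        return int(candidate)
    if DECIMAL.match(candidate):
        return float(candidate)  # also handles scientific notation
    return value
```

For example, clean_value("1,234") yields the integer 1234, while a free-text value such as "Stage IIA" is returned unchanged.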

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (aka patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status="Completed" which were found to exist in at least 20% of the participants in any one Study (aka tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status="Alive":

– days_to_last_known_alive should not be NULL

– days_to_last_known_alive is set to days_to_last_followup

– days_to_death is set to NULL

• for all patients with vital_status="Dead":

– days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to "Alive")

– days_to_last_known_alive is set to days_to_death

– days_to_last_followup is set to NULL
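The survival-field rules above can be sketched in Python as follows. This is an illustration only, not the ISB-CGC implementation: the function name is hypothetical, a record is assumed to be a plain dict with None standing in for NULL, and the order in which the rules are applied (re-labelling "Dead" records first) is an assumption.

```python
def apply_survival_transforms(record):
    """Return a copy of a clinical record with survival fields made consistent."""
    rec = dict(record)

    # A "Dead" record without days_to_death, but with a follow-up date,
    # is re-labelled as "Alive" before the other rules are applied.
    if (rec.get("vital_status") == "Dead"
            and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        rec["vital_status"] = "Alive"

    if rec.get("vital_status") == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif rec.get("vital_status") == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```

With this sketch, a record marked "Dead" that has days_to_last_followup=120 and no days_to_death comes back as "Alive" with days_to_last_known_alive=120.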

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
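As an illustration of the BMI step, the sketch below uses the standard definition (weight in kilograms divided by the square of height in metres). The units of the source fields are not stated here, so the assumption that height is recorded in centimetres and weight in kilograms, as well as the function name, are hypothetical.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI, or None when either measurement is missing or height is zero."""
    if height_cm is None or weight_kg is None or height_cm == 0:
        return None
    height_m = height_cm / 100.0  # assumed unit conversion: cm to m
    return weight_kg / (height_m ** 2)
```

Returning None when a measurement is missing mirrors the note that BMI was only calculated when both values were present.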

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
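Because the participant barcode (length 12) is a prefix of the sample barcode (length 16), rows from the different tables can be related by truncating a longer barcode. The helper below is a minimal sketch of that idea; the function name and the returned keys are illustrative and not part of any ISB-CGC API.

```python
def barcode_levels(barcode):
    """Split a TCGA barcode into its participant-level and sample-level prefixes."""
    return {
        "participant": barcode[:12],
        # A sample-level prefix only exists if the barcode is long enough.
        "sample": barcode[:16] if len(barcode) >= 16 else None,
    }
```

For the sample barcode TCGA-2G-AAM4-10A this gives the participant barcode TCGA-2G-AAM4, which is the key used in the Clinical table.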

XML Parsing: The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters:

• Samples for which is_ffpe=True were removed.

• Patients or Samples for which the Project value was not TCGA were removed.

Transforms:

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g. [0, 1, 2, 3, 4+]).

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field.
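The merging of the two pregnancy fields could look roughly like the sketch below. Which of the two source fields takes precedence when both are present is not stated above, so preferring pregnancies is an assumption, as is the function name.

```python
def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Merge two source fields into one value from "0", "1", "2", "3", "4+"."""
    # Assumption: prefer `pregnancies` and fall back to the other field.
    count = pregnancies if pregnancies is not None else total_number_of_pregnancies
    if count is None:
        return None
    count = int(count)
    return "4+" if count >= 4 else str(count)
```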

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning:

• Remove any lines where the build is not 37.

• Remove any lines where the chr is not in [1-22, X, Y].

• Remove any lines where the Mutation_Status is not Somatic.

• Remove any lines where the Sequencer is not an Illumina platform.

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.).
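The row filters in the data-cleaning list above can be expressed as a single predicate. The sketch below assumes each MAF line has already been read into a dict keyed by the renamed column labels (build, chr, Mutation_Status, Sequencer); detecting an Illumina platform by a substring match on the Sequencer value is also an assumption.

```python
VALID_CHROMOSOMES = {str(i) for i in range(1, 23)} | {"X", "Y"}


def keep_maf_row(row):
    """Return True if a MAF row passes all of the data-cleaning filters."""
    return (
        str(row.get("build")) == "37"
        and str(row.get("chr")) in VALID_CHROMOSOMES
        and row.get("Mutation_Status") == "Somatic"
        # Assumed test for "an Illumina platform": substring match.
        and "illumina" in str(row.get("Sequencer", "")).lower()
    )
```

A list of parsed rows could then be filtered with [row for row in rows if keep_maf_row(row)].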

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal";

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10);

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring, in order:

– 01, 08, or 14 (all of which refer to broad.mit.edu)

– 09, 21, or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25).

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8,435 tumor samples from 8,373 patients.
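The duplicate-removal preferences described above amount to ranking competing calls for the same mutation and keeping the best one. The sketch below illustrates this with a sort key; it is not the pipeline's actual code. The function names are hypothetical, each candidate is assumed to be an (aliquot barcode, matched-normal type) pair, and the plate field is compared as a string, which matches numeric order only when it consists of four digits.

```python
# Center codes grouped by preference (lower rank is preferred).
CENTER_RANK = {
    "01": 0, "08": 0, "14": 0,   # broad.mit.edu
    "09": 1, "21": 1, "30": 1,   # genome.wustl.edu
    "10": 2, "12": 2,            # hgsc.bcm.edu
    "13": 3, "31": 3,            # bcgsc.ca
    "18": 4, "25": 4,            # ucsc.edu
}


def preference_key(aliquot_barcode, normal_type):
    """Build a tuple where a larger value means a more preferred call."""
    normal_rank = 0 if "blood" in normal_type.lower() else 1
    analyte_rank = 0 if aliquot_barcode[19] == "D" else 1
    center_rank = CENTER_RANK.get(aliquot_barcode[26:28], len(CENTER_RANK))
    plate = aliquot_barcode[21:25]  # the "positions 21:25" slice (0-based)
    return (-normal_rank, -analyte_rank, -center_rank, plate)


def choose_call(candidates):
    """Pick the preferred (aliquot_barcode, normal_type) pair."""
    return max(candidates, key=lambda c: preference_key(*c))
```

For the two example aliquots given in the rules, choose_call returns the "D" analyte TCGA-B0-5695-01A-11D-1534-10 when both were called against the same type of normal.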

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 87E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
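As a small illustration of the number handling described in the ETL details, the sketch below converts the two fields from their raw text form. The function name is hypothetical, and rounding num_probes to the nearest integer after parsing is an assumption about how a value written in scientific notation would be turned into an integer.

```python
def parse_segment_fields(num_probes, segment_mean):
    """Convert raw text fields to (int, float), accepting scientific notation."""
    # float() accepts both plain and exponential notation, e.g. "8.7E+07".
    probes = int(round(float(num_probes)))
    mean = float(segment_mean)
    return probes, mean
```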

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: The BigQuery table is populated only with the files matching the pattern HumanMethylationtxt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out;

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table.
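The merge of the two methylation platforms can be sketched as below. The row layout (dicts with sample, probe_id, beta_value, and platform keys) and the platform labels "27k" and "450k" are assumptions for illustration; the actual table schema may differ.

```python
def merge_methylation(rows):
    """Merge 27k and 450k rows, preferring 450k and dropping missing betas."""
    rows = list(rows)
    # Samples that have any 450k data: their 27k rows are discarded.
    samples_450k = {r["sample"] for r in rows if r["platform"] == "450k"}
    merged = []
    for r in rows:
        if r["beta_value"] in (None, "NA", "NULL"):
            continue  # probes without a usable Beta_Value are left out
        if r["platform"] == "27k" and r["sample"] in samples_450k:
            continue  # 450k takes precedence for this sample
        merged.append(r)
    return merged
```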

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table;

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table.

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
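Splitting the raw RSEM identifier is a one-line operation; a minimal sketch is shown here. The function name is illustrative, and the HGNC lookup step is not shown because it depends on an external symbol table.

```python
def split_rsem_gene_id(raw_gene_id):
    """Split e.g. "TP53|7157" into (original_gene_symbol, gene_id)."""
    symbol, _, gene_id = raw_gene_id.partition("|")
    return symbol, gene_id
```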

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id and, if there is a third component, a gene_addenda column. If a component consists only of a placeholder character, that component is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
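The two-or-three-part split described above might be sketched as follows. The function name is hypothetical, and treating both an empty component and a single "?" as the placeholder that becomes NULL is an assumption, since the placeholder character is not legible in the text above. Any further normalization of the components (for example of the _calculated suffix) is not attempted here.

```python
def split_bcgsc_gene(raw_gene):
    """Split a BCGSC gene string into (symbol, gene_id, gene_addenda)."""
    parts = raw_gene.split("|")
    if len(parts) == 2:
        symbol, gene_id, addenda = parts[0], parts[1], None
    elif len(parts) == 3:
        symbol, gene_id, addenda = parts
    else:
        raise ValueError(f"expected 2 or 3 parts, got {len(parts)}: {raw_gene!r}")

    def null_if_placeholder(component):
        # Assumed placeholders: empty string or a lone "?".
        return None if component in ("", "?") else component

    return (null_if_placeholder(symbol),
            null_if_placeholder(gene_id),
            addenda)
```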

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
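The "melt" step, which turns the microRNA-by-aliquot matrix into one row per (microRNA, aliquot) pair, can be sketched in plain Python as below. The input layout (a dict of dicts), the output column names, and the use of "." as the separator between the microRNA name and the MIMAT accession are all assumptions for illustration.

```python
def melt_mirna_matrix(matrix, sep="."):
    """Flatten {label: {aliquot: value}} into a list of tidy row dicts."""
    rows = []
    for label, values_by_aliquot in matrix.items():
        # Split on the last separator so names containing it stay intact.
        name, _, accession = label.rpartition(sep)
        for aliquot, value in values_by_aliquot.items():
            rows.append({
                "mirna_id": name,
                "mirna_accession": accession,
                "aliquot_barcode": aliquot,
                "value": value,
            })
    return rows
```

Each row of the result corresponds to one record in the flat table, so a matrix with M microRNAs and N aliquots produces M x N rows.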

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Coreprotein_expressiontxt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map is generated.

Generation of Composite_element_ref, gene, and protein name map (Manual Curation of the gene and protein names)

• Check the antibody annotation files for missing columns.

• If "protein_name" is missing, generate one from "composite_element_ref".

• Make a map of 'composite_element_ref', 'gene_name', and 'protein_name' values.

• Check any other variant of the gene and protein symbols in the table.

• HGNC Validation:

– If the gene symbol is in the HGNC approved symbols, 'Approved' Gene_symbol = Gene_symbol.

– If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.

– If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol.

– If not, Gene_symbol = Gene_symbol.

• The file generated is manually curated and fed back into the algorithm.

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row.

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.

• Data from both Illumina GA and HiSeq platforms are stored in the same table.
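The row-duplication rule in the formatting list above can be illustrated with a short sketch. The function name is hypothetical, a row is assumed to be a dict, and splitting the gene_name value on whitespace or commas is an assumption about how the genes are concatenated in the annotation files.

```python
import re


def expand_gene_rows(row):
    """Return one copy of the row per gene listed in its gene_name value."""
    genes = [g for g in re.split(r"[,\s]+", row["gene_name"].strip()) if g]
    # dict(row, gene_name=gene) copies the row and overrides a single field.
    return [dict(row, gene_name=gene) for gene in genes]
```

For a row whose gene_name is 'AKT1 AKT2 AKT3', this returns three rows that differ only in the gene_name field.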

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table, isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user-data, such as cohort definitions, is provided through the ISB-CGC programmatic API, described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API, or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables;

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects; or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm's resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
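
The relationship between sample and patient barcodes can be sketched in a few lines of Python. This is a minimal illustration that assumes the TCGA barcode lengths given above; the helper name `patient_barcode_from_sample` is hypothetical and is not part of any ISB-CGC API.

```python
def patient_barcode_from_sample(sample_barcode):
    """Return the 12-character patient barcode prefix of a TCGA sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]


sample = "TCGA-B9-7268-01A"  # example sample barcode used elsewhere in this section
patient = patient_barcode_from_sample(sample)
print(patient)
```

Other datasets may not follow this convention, so a check like the length test above is only appropriate for TCGA barcodes.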

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes, and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
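
The same request can be prepared from Python. The following is a minimal sketch using only the standard library. It assumes the endpoint URL from the curl example and assumes that the `Study` filter value is a comma-separated string of study abbreviations; the function name `build_preview_request` is hypothetical. The network call is left commented out so that the snippet runs offline.

```python
import json
import urllib.request

# Endpoint URL taken from the curl example above.
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed filter value: a comma-separated list of study abbreviations.
request = build_preview_request({"Study": "BRCA,OV"})
print(request.get_method(), request.full_url)

# To actually call the service (requires network access):
# with urllib.request.urlopen(request) as response:
#     preview = json.load(response)
```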

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992_A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample: tumor or normal tissue, cell, or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
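
As a hedged illustration of handling this response in Python: the counts are documented as strings, so they need converting before numeric use. The response body below is written by hand to match the documented structure (the barcodes are placeholders, not real query results), and `summarize_preview` is a hypothetical helper name.

```python
import json

# Hypothetical response body, hand-written to follow the documented structure.
EXAMPLE_RESPONSE = """
{
  "patient_count": "1",
  "patients": ["TCGA-B9-7268"],
  "sample_count": "1",
  "samples": ["TCGA-B9-7268-01A"]
}
"""


def summarize_preview(response_text):
    """Parse a preview response and return (patient_count, sample_count) as ints."""
    preview = json.loads(response_text)
    # Counts are documented as strings, so convert them explicitly.
    patient_count = int(preview["patient_count"])
    sample_count = int(preview["sample_count"])
    # Treating missing lists as empty is an assumption about the service.
    patients = preview.get("patients", [])
    samples = preview.get("samples", [])
    if patient_count != len(patients) or sample_count != len(samples):
        raise ValueError("counts do not match list lengths")
    return patient_count, sample_count


counts = summarize_preview(EXAMPLE_RESPONSE)
print(counts)
```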

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
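
A sketch of building this request URL in Python follows. It assumes that `patient_barcode` is passed as a URL query parameter, which this section does not state explicitly; the helper name `patient_details_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint URL from the HTTP request line above.
PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)


def patient_details_url(patient_barcode):
    """Return the GET URL for one participant (assumes a query-string parameter)."""
    if len(patient_barcode) != 12:
        raise ValueError("expected a 12-character participant barcode")
    return PATIENT_DETAILS_URL + "?" + urlencode({"patient_barcode": patient_barcode})


url = patient_details_url("TCGA-B9-7268")
print(url)
```

The resulting URL can be fetched with any HTTP client, since this endpoint does not require authentication.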

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
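
Because every `clinical_data` property is documented as a string, fields that hold numbers need tolerant conversion when read from the nested object. The sketch below uses a hand-written, hypothetical response (the values are placeholders, not real patient data). It assumes that a missing value may arrive as the string "None", and `int_field` is an illustrative helper name rather than part of the API.

```python
# Hypothetical patient_details response, trimmed to a few documented properties.
example_response = {
    "aliquots": [],
    "clinical_data": {
        "ParticipantBarcode": "TCGA-B9-7268",
        "Project": "TCGA",
        "days_to_birth": "-20000",                # placeholder value
        "number_of_lymphnodes_examined": "None",  # assumed missing-value marker
    },
    "samples": ["TCGA-B9-7268-01A"],
}


def int_field(clinical_data, name):
    """Return a clinical field as int, or None when it is absent or not numeric."""
    value = clinical_data.get(name)
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


clinical = example_response["clinical_data"]
days_to_birth = int_field(clinical, "days_to_birth")
nodes_examined = int_field(clinical, "number_of_lymphnodes_examined")
print(days_to_birth, nodes_examined)
```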

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.
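As a rough illustration, the request described above might be prepared from Python as follows. This is a minimal sketch rather than official client code: it assumes that sample_barcode is sent as a URL query-string parameter and that the response body is JSON, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Endpoint URL from the HTTP request documented above.
SAMPLE_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details"
)


def build_sample_details_url(sample_barcode):
    """Return the request URL for one sample barcode.

    Assumption: the barcode is passed as a query-string parameter.
    """
    query = urllib.parse.urlencode({"sample_barcode": sample_barcode})
    return SAMPLE_DETAILS_URL + "?" + query


def fetch_sample_details(sample_barcode):
    """Issue the GET request and decode the JSON body (needs network access)."""
    url = build_sample_details_url(sample_barcode)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


print(build_sample_details_url("TCGA-B9-7268-01A"))
```

Keeping URL construction separate from the network call makes it easy to inspect the request before sending it.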

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit the requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by the TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of the SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of the data_details list.
patient | string | Participant barcode.
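Once a response has been decoded into a Python dictionary, the data_details list can be summarized with a few lines of code. The sketch below counts data files by their Datatype property; the example response is hypothetical and heavily abbreviated, and the function name is an assumption made for illustration.

```python
from collections import Counter


def summarize_data_details(response):
    """Count the data files in a sample_details response by Datatype."""
    details = response.get("data_details", [])
    return Counter(item.get("Datatype", "unknown") for item in details)


# Hypothetical, abbreviated response used only for illustration.
example_response = {
    "kind": "cohort_api#cohortsItem",
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"Datatype": "miRNASeq", "SecurityProtocol": "DBGap Open Access"},
        {"Datatype": "RNASeqV2", "SecurityProtocol": "DBGap Open Access"},
        {"Datatype": "miRNASeq", "SecurityProtocol": "DBGap Protected Access"},
    ],
}

counts = summarize_data_details(example_response)
# data_details_count is documented as a string, so convert it before comparing.
total_files = int(example_response["data_details_count"])
print(dict(counts), total_files)
```

Note that several numeric-looking properties, including data_details_count, are documented as strings, so they need an explicit conversion before arithmetic.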


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
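The optional filters and the possibly absent datafilenamekeys field can both be handled defensively. The following sketch assumes, as a guess about the calling convention, that all parameters are sent in the query string; the function names are hypothetical.

```python
import urllib.parse

FILE_LIST_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
    "datafilenamekey_list_from_sample"
)


def build_file_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, adding only the optional filters that were given."""
    params = {"sample_barcode": sample_barcode}
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return FILE_LIST_URL + "?" + urllib.parse.urlencode(params)


def extract_file_paths(response):
    """Return the list of file paths, or [] when the field is absent."""
    return list(response.get("datafilenamekeys", []))


print(build_file_list_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
print(extract_file_paths({"kind": "cohort_api#cohortsItem", "count": "0"}))
```

Using dict.get with a default avoids a KeyError in the case described above, where the response omits the field entirely.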


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
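Because count is returned as the string "0" or "1", a caller has to convert it before testing whether any ids came back. A minimal sketch, with a hypothetical function name and placeholder id values:

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None if absent."""
    items = response.get("items") or []
    # count is documented as the string "0" or "1", so convert it first.
    if int(response.get("count", "0")) == 0 or not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical responses; the id values are placeholders, not real ids.
found = {
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
missing = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(genomics_ids(found))
print(genomics_ids(missing))
```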


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute, with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
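If you drive the SDK from scripts, the same check can be automated. The sketch below is one possible approach, not part of the SDK itself: it looks for a gcloud executable on the PATH and, if one is found, captures the output of gcloud --version. The function name is hypothetical.

```python
import shutil
import subprocess


def gcloud_version_text():
    """Return the output of `gcloud --version`, or None if gcloud is not on PATH."""
    executable = shutil.which("gcloud")
    if executable is None:
        return None
    result = subprocess.run(
        [executable, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
        check=False,
    )
    return result.stdout


text = gcloud_version_text()
print(text if text is not None else "gcloud was not found on PATH")
```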

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
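When several properties need to be set at once, it can help to generate the gcloud config set commands from a dictionary. This is only a sketch: the property values are placeholders to be replaced with your own, and the function name is hypothetical.

```python
def config_set_commands(properties):
    """Build one `gcloud config set` argument list per property."""
    return [
        ["gcloud", "config", "set", name, value]
        for name, value in properties.items()
    ]


# Placeholder values; substitute your own project ID and preferred location.
desired = {
    "project": "my-gcp-project",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}

for argv in config_set_commands(desired):
    print(" ".join(argv))
```

Each argument list can be passed to subprocess.run unchanged, which avoids shell quoting problems.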


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
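If you launch instances from a script, the command can be assembled programmatically. The sketch below only builds and prints the command line; nothing is executed. The function name is hypothetical, and --zone is passed only when a zone is given, so that the compute/zone default from your configuration applies otherwise.

```python
import shlex


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    argv = [
        "gcloud", "compute", "instances", "create", name,
        "--machine-type", machine_type,
    ]
    if zone is not None:
        # Without --zone, gcloud falls back to the configured compute/zone.
        argv += ["--zone", zone]
    return argv


command = create_instance_command("my-instance", zone="us-central1-a")
print(" ".join(shlex.quote(part) for part in command))
```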

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
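The two prices quoted above imply a rate of about $0.04 per GB per month for a standard persistent disk (treating 1 TB as 1000 GB). The short calculation below uses that implied rate to estimate the monthly cost for other sizes; it reflects only the figures given in this document, so check current pricing before relying on it.

```python
# Rate implied by the figures quoted above (USD per GB per month).
STANDARD_PD_RATE = 0.04


def monthly_disk_cost(size_gb, rate=STANDARD_PD_RATE):
    """Estimate the monthly cost in USD of a standard persistent disk."""
    return round(size_gb * rate, 2)


for size_gb in (50, 500, 1000):
    print(size_gb, "GB:", monthly_disk_cost(size_gb), "USD per month")
```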

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this (the instance name is the positional argument, and the disk is named with --disk):

gcloud compute instances attach-disk my-instance --disk disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point:

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.
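The gcloud side of the disk lifecycle described in this section (create, attach, detach, delete) can be summarized as an ordered list of commands. The sketch below only builds the argument lists and does not run them; the function name is hypothetical, and the instance name is given as the positional argument of attach-disk, as it is for detach-disk.

```python
def disk_lifecycle_commands(disk, instance, size="500GB"):
    """Return the gcloud commands for the disk lifecycle, in order."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance, "--disk", disk],
        # Formatting, mounting and unmounting happen on the instance itself,
        # between the attach and detach steps.
        ["gcloud", "compute", "instances", "detach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


commands = disk_lifecycle_commands("disk-1", "my-instance")
for argv in commands:
    print(" ".join(argv))
```

The delete step comes last because a disk can be deleted only when no VM instance is using it.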

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after you select at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after you select at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
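The three set operations behave like ordinary set algebra. As an illustration only (this is not the web application's code), the sketch below models each cohort as a Python set of sample identifiers; the identifiers and function names are hypothetical.

```python
def union(*cohorts):
    """Samples that appear in at least one of the cohorts."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort."""
    return set(first).intersection(*others)


def complement(base, *subtracted):
    """Samples in the base cohort that are in none of the subtracted cohorts."""
    return set(base).difference(*subtracted)


# Hypothetical cohorts of sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3"}

print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b)))
print(sorted(complement(cohort_b, cohort_a)))
```

Union and intersect give the same result in any order, whereas complement depends on which cohort is chosen as the base, which is why the dialogue asks for a base cohort.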

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the order is ascending or descending.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. Clicking on a feature expands the field and provides additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead” and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
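One way to read the counts shown beside the filter values is as “faceted” counts: for each value of an attribute, count the samples that pass the filters selected on every other attribute. The Python sketch below illustrates that reading under stated assumptions. The attribute names, the sample records and the function name are hypothetical, and the web-app may compute its counts differently.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count samples per value of `attribute`.

    Filters selected on *other* attributes are applied; any filter on
    `attribute` itself is ignored, so all of its values keep a count.
    """
    counts = Counter()
    for sample in samples:
        passes_other_filters = all(
            sample.get(name) in allowed
            for name, allowed in selected.items()
            if name != attribute and allowed
        )
        if passes_other_filters:
            counts[sample.get(attribute)] += 1
    return counts


# Made-up sample records, for illustration only.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": None, "gender": "MALE"},
]
selected = {"gender": {"FEMALE"}}

print(dict(facet_counts(samples, selected, "vital_status")))
print(dict(facet_counts(samples, selected, "gender")))
```

With the gender filter set to FEMALE, the Vital Status counts only cover the two female samples, while the Gender counts still cover all four samples.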

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where the selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears and prompts you to select registered users to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
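To see why the two counts can differ, here is a small Python sketch. It assumes standard TCGA barcodes, in which the first three dash-separated fields of a sample barcode identify the participant. The barcodes and the helper name are illustrative only.

```python
def participant_barcode(sample_barcode):
    """Return the participant part of a TCGA sample barcode.

    Assumes the standard layout, e.g. TCGA-02-0001-01C -> TCGA-02-0001.
    """
    return "-".join(sample_barcode.split("-")[:3])


# Illustrative barcodes: two samples from one participant, one from another.
sample_barcodes = [
    "TCGA-02-0001-01C",
    "TCGA-02-0001-10A",
    "TCGA-02-0003-01A",
]

n_samples = len(set(sample_barcodes))
n_participants = len({participant_barcode(b) for b in sample_barcodes})

print(n_samples, n_participants)
```

Here the cohort holds three samples but only two participants, because one participant contributed two samples.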

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bands that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use the “Previous Page” and “Next Page” buttons to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform. Selecting a filter will update the list accordingly. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will start a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
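Once downloaded, the file list can be processed with any CSV reader. The Python sketch below groups file paths by platform. It assumes the CSV has a header row whose column names match the fields listed above; the exact header text, the platform names and the paths are made up for illustration, so check your own download before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up content standing in for a downloaded file list.
EXAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-02-0001-01C,PlatformA,pipeline_x,Level 3,gs://example-bucket/path/file1.txt
TCGA-02-0001-01C,PlatformB,pipeline_y,Level 3,gs://example-bucket/path/file2.txt
TCGA-02-0003-01A,PlatformA,pipeline_x,Level 3,gs://example-bucket/path/file3.txt
"""


def paths_by_platform(csv_file):
    """Group Cloud Storage paths by platform (column names are assumed)."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


paths = paths_by_platform(io.StringIO(EXAMPLE_CSV))
for platform, file_paths in sorted(paths.items()):
    print(platform, len(file_paths))
```

To read a real download, pass an open file instead of the `io.StringIO` object, for example `open("file_list.csv", newline="")`, where the file name is whatever you saved the download as.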

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots. Each plot is configurable to show relevant data, and a visualization can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with a default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox overrides any feature selected in Color By Feature. The plot will use the provided cohorts for the legend and for coloring.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on the check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to the plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
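The two permission levels can be summarized as a small lookup table. The Python sketch below is only an illustration of the rules described in this section and in the sections on commenting and deleting. The action names and the `can` helper are hypothetical and do not correspond to any ISB-CGC API.

```python
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Hypothetical action names summarizing the rules described in the text.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "save", "share", "delete"},
    Permission.READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if the given permission level allows the action."""
    return action in ALLOWED_ACTIONS[permission]


print(can(Permission.READER, "comment"))
print(can(Permission.READER, "save"))
print(can(Permission.OWNER, "share"))
```

A Reader who needs to edit makes a copy first. The copy is a new object on which that user holds the Owner permission.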

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2
  – Treemap graphs on the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.
• December 3, 2015: v0.1
  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
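The same tables can also be queried from Python. The sketch below assumes the `google-cloud-bigquery` client library is installed and that you have credentials for your own Google Cloud Project, which is the project the query runs under. The dataset, table and column names are placeholders rather than real ISB-CGC names, so substitute the ones you see in the BigQuery web interface.

```python
def build_count_query(table, column):
    """Return standard SQL that counts rows per value of `column`.

    Both arguments are inserted into the SQL text as-is, so they must be
    trusted values and never raw user input.
    """
    return (
        f"SELECT {column}, COUNT(*) AS n "
        f"FROM `{table}` "
        f"GROUP BY {column} "
        "ORDER BY n DESC"
    )


def run_query(sql, billing_project):
    """Run the query under your own project and return rows as dicts."""
    # Imported here so the query builder works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names: replace with a real dataset, table and column.
TABLE = "isb-cgc.example_dataset.example_clinical_table"
sql = build_count_query(TABLE, "vital_status")
print(sql)

# Uncomment to run against BigQuery (requires credentials and a project):
# rows = run_query(sql, billing_project="your-gcp-project-id")
```

The query reads from the `isb-cgc` project, while `billing_project` names the project you are a member of.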

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application, so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
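The “one Google identity per eRA Commons id” rule can be pictured as a simple registry. The Python sketch below is a toy model of that rule only: the class, its methods and the identifiers are all hypothetical, and it does not represent how the ISB-CGC platform is implemented.

```python
class LinkRegistry:
    """Toy model: each eRA Commons id maps to at most one Google identity."""

    def __init__(self):
        self._google_by_era = {}

    def link(self, era_id, google_identity):
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            raise ValueError(f"{era_id} is already linked; unlink it first")
        self._google_by_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        # Only the identity that currently holds the link may remove it.
        if self._google_by_era.get(era_id) != google_identity:
            raise ValueError("not linked to this Google identity")
        del self._google_by_era[era_id]

    def linked_identity(self, era_id):
        return self._google_by_era.get(era_id)


# Made-up identifiers, for illustration only.
registry = LinkRegistry()
registry.link("example_era_id", "old.identity@example.com")

try:
    registry.link("example_era_id", "new.identity@example.com")
    relink_blocked = False
except ValueError:
    relink_blocked = True

# Sign in with the previous identity, unlink, then link the new one.
registry.unlink("example_era_id", "old.identity@example.com")
registry.link("example_era_id", "new.identity@example.com")
print(relink_blocked, registry.linked_identity("example_era_id"))
```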

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
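If you script long-running jobs, it can help to keep track of when the 24-hour window ends. The Python sketch below only does the date arithmetic. The function names are hypothetical, the verification time is something you would record yourself, and treating the window as ending exactly 24 hours after verification is an assumption.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Time at which controlled-data access is assumed to lapse."""
    return verified_at + ACCESS_WINDOW


def has_access(verified_at, now):
    """True while `now` falls inside the assumed 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


# Example verification time (UTC), chosen for illustration.
verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)

print(access_expires_at(verified).isoformat())
print(has_access(verified, verified + timedelta(hours=23)))
print(has_access(verified, verified + timedelta(hours=25)))
```

A job that is expected to run past the expiry time would need you to re-authenticate through the web-app first.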

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring and sharing DNA sequence reads, reference-based alignments and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.1 About the ISB Cancer Genomics Cloud

The ISB-CGC provides interactive and programmatic access to the TCGA data, leveraging many aspects of the Google Cloud Platform including BigQuery, Compute Engine, App Engine, Cloud Datalab, and Google Genomics. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and genomic reference and platform-annotation sources, are stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data is available to dbGaP-authorized users in the original BAM and FASTQ file formats.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to rapidly analyze hundreds of terabytes of sequence data. The ISB-CGC allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.


1.2 Cloud-Hosted Data Sets

The ISB-CGC platform hosts the majority of the TCGA data set, as well as other reference and annotation datasets, in different appropriate Google Cloud technologies.

1.2.1 About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data being hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

• TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.

• CGHub: the Cancer Genomics Hub is NCI’s current secure data repository for all TCGA BAM and FASTQ sequence data files.


The ISB-CGC platform is one of NCI’s Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level (“Level 1”) data has been processed through a set of standardized pipelines, and the resulting high-level (“Level 3”) data is the data used in most downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset including:

• DNA sequencing of tumor samples and matched normals (typically blood samples) in order to detect somatic mutations

• SNP-array-based DNA copy-number and genotyping analysis of tumor samples and matched normals

• DNA methylation of tumor samples

• messenger RNA (mRNA) expression analysis of the tumor samples to capture the gene expression profile

• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse phase protein array).


Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data, there are typically three levels of data. Level 1 typically represents raw, un-normalized data. Level 2 typically represents an intermediate level of processing and/or normalization of the data. Level 3 typically represents aggregated, normalized, and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as “Level 4” data. More information about Data Level Classification can be found on the NCI wiki.


Understanding the TCGA Data Platforms

When working with any of the data types, it is important to be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained first using the Illumina HumanMethylation27 platform and later using the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.


TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis


Understanding Data Access

• Public Data: Sometimes the word “public” is misinterpreted as meaning “open”. All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.


Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project

• Submit a data access request (DAR); the DAR requires NIH log-in

Note: Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions and the dbGaP Data Access Request Portal.)

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution (see the Sample DUC form)

• Protect data confidentiality (any data used in a study which was initially listed as “Controlled” or TCGA remains controlled data and MUST be protected accordingly unless prior release authorization is obtained from the NCI data custodian)

• Ensure that data security measures are in place. Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

  - Project 1 has only users authorized to access controlled data and a DUC in place

  - Project 2 has only open-access users; NO controlled data access is allowed from/by Project 2 member(s)

Remember: YOU and YOUR institution are accountable for ensuring the security of this data, not the cloud service provider. Securing controlled data and protecting it should be thought of in the same manner as any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: The Cloud, dbGaP and the NIH blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).


1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs. controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC


• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC, and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our GitHub repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC)

• MAF files: these “protected” mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types)

• DNA-seq BAM files: these Level-1 data files were provided by CGHub

  - over 37,000 of these files are available in Google Cloud Storage (GCS)

  - roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

  - BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub

  - over 13,000 mRNA-seq BAM files are available in GCS

  - over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user’s dbGaP authorization will be verified. These two steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more “cloud-aware” approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).


TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.

Note that all “bio” archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).


1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly accessible BigQuery dataset called isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank> are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
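As an illustration of the value-level rules above, here is a minimal sketch of how a single raw field might be cleaned. The helper name and the token set are assumptions for illustration only (the token set mirrors the examples listed above and is matched case-insensitively, which is slightly broader than the list); this is not the actual ISB-CGC ETL code.

```python
# Hypothetical helper; token list taken from the examples in the text above.
MISSING_TOKENS = {"none", "null", "na", "__unknown__", "<blank>", ""}


def clean_value(raw):
    """Return a cleaned string, or None when the value denotes missing data."""
    # Drop CTRL-M (carriage return) characters and surrounding whitespace/EOL.
    text = raw.replace("\r", "").strip()
    # Remove one pair of matching quotes wrapped around the whole value;
    # quotation marks inside the string are left untouched.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    # Case-insensitive comparison covers none/None/NONE and null/Null/NULL.
    if text.lower() in MISSING_TOKENS:
        return None
    return text
```

A value that comes back as None would then be loaded as a NULL in the corresponding table column.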

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (aka patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (aka tumor type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.
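The first-pass selection rule can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be (study, participant, features) tuples in which each feature set already contains only the completed features, and the function name and the 0.20 default are hypothetical names for the threshold described above.

```python
from collections import defaultdict


def select_features(records, min_fraction=0.20):
    """Pick features present in at least min_fraction of participants
    in any one study.

    records: iterable of (study, participant_id, set_of_feature_names).
    """
    participants = defaultdict(set)  # study -> participant ids
    carriers = defaultdict(set)      # (study, feature) -> participant ids
    for study, participant, features in records:
        participants[study].add(participant)
        for feature in features:
            carriers[(study, feature)].add(participant)

    selected = set()
    for (study, feature), ids in carriers.items():
        # A feature qualifies as soon as one study reaches the threshold.
        if len(ids) / len(participants[study]) >= min_fraction:
            selected.add(feature)
    return selected
```

Features added during the manual-curation pass would simply be unioned with the returned set.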

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.
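The follow-up handling can be sketched roughly as below. The data shapes and the function name are hypothetical, and the choice to let only non-missing follow-up values override patient values is an assumption; the text above only states that follow-up elements are preferred.

```python
def merge_patient_record(patient_fields, followups):
    """Merge patient-block fields with the most recent follow-up block.

    patient_fields: dict of element name -> value from the patient block.
    followups: list of (sequence_number, fields_dict) tuples.
    """
    merged = dict(patient_fields)
    if followups:
        # Keep only the follow-up with the highest sequence number.
        _, latest = max(followups, key=lambda item: item[0])
        for name, value in latest.items():
            # Assumption: a missing follow-up value does not erase
            # a value already present in the patient block.
            if value is not None:
                merged[name] = value
    return merged
```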

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive

  - days_to_last_known_alive should not be NULL

  - days_to_last_known_alive is set to days_to_last_followup

  - days_to_death is set to NULL

• for all patients with vital_status=Dead

  - days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to “Alive”)

  - days_to_last_known_alive is set to days_to_death

  - days_to_last_followup is set to NULL
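The survival-field rules above can be expressed as a small function. This is a sketch, not the production transform: the record is assumed to be a plain dict keyed by the field names above, and it is an assumption that a record reclassified from “Dead” to “Alive” is then treated under the Alive rules.

```python
def apply_survival_rules(record):
    """Return a copy of record with the survival fields made consistent."""
    rec = dict(record)
    status = rec.get("vital_status")

    # A "Dead" record without days_to_death but with a follow-up time
    # is reclassified as "Alive".
    if (status == "Dead" and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        status = rec["vital_status"] = "Alive"

    if status == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif status == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```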

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
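For reference, the standard BMI formula is weight in kilograms divided by the square of height in meters. The sketch below assumes that height is recorded in centimeters and weight in kilograms; the units are not stated in this section, so treat that (and the function name) as an assumption.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI in kg/m^2, or None when either input is missing."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0  # assumed unit conversion: cm -> m
    return weight_kg / (height_m ** 2)
```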


Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen_data” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters:

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms:

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0,1,2,3,4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field
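The pregnancies transform might look like the sketch below. Two assumptions are made here that the text does not spell out: the pregnancies field takes precedence when both source fields are present, and, following the listed categories [0,1,2,3,4+], a count of exactly four also falls into the 4+ bucket.

```python
def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Merge the two source fields into one categorical value."""
    # Assumed precedence: use pregnancies first, then the fallback field.
    value = pregnancies if pregnancies is not None else total_number_of_pregnancies
    if value is None:
        return None
    count = int(value)
    # Categories are "0", "1", "2", "3" and "4+".
    return "4+" if count >= 4 else str(count)
```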


Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning:

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
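The cleaning steps above can be sketched as a simple row filter. The rows are assumed to be dicts keyed by MAF column name; only the two renames mentioned above are included, and the test for "an Illumina platform" (a case-insensitive substring match) is an assumption rather than the documented rule.

```python
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y"}
# Only the renames given as examples in the text; the full mapping is longer.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Rename columns for Oncotator and drop rows that fail the filters."""
    cleaned = []
    for row in rows:
        row = {RENAMES.get(key.lower(), key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in VALID_CHROMS:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        # Assumed check: the sequencer description mentions Illumina.
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        cleaned.append(row)
    return cleaned
```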

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Because many tumor types have several “current” MAF files, and deciding which one is the “best” is a non-trivial process, and also because some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal, it is possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the “blood derived normal” over the “solid tissue normal”

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the “D” analyte over the “W” analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are “D” (or both are “W”) analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring first:

  - 01, 08 or 14 (all of which refer to broad.mit.edu)

  - 09, 21 or 30 (all of which refer to genome.wustl.edu)

  - 10 or 12 (both of which refer to hgsc.bcm.edu)

  - 13 or 31 (both of which refer to bcgsc.ca)

  - 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)
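The preference rules above amount to a ranking over candidate calls, which can be sketched as a comparison key. The dict keys (tumor_aliquot, normal_type, chr, start) are hypothetical names chosen for this example, the positions are read as Python slice indices into the aliquot barcode, and the plate field is compared as a string; none of this is taken from the actual ETL code.

```python
# Center codes in order of preference, as listed above.
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"),
                 ("10", "12"), ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank
               for rank, group in enumerate(CENTER_GROUPS)
               for code in group}


def preference_key(call):
    """Larger keys are preferred; rules are applied in the order given."""
    barcode = call["tumor_aliquot"]
    return (
        call["normal_type"] == "blood derived normal",  # rule 1
        barcode[19] == "D",                             # rule 2: analyte letter
        -CENTER_RANK.get(barcode[26:28], len(CENTER_GROUPS)),  # rule 3
        barcode[21:25],                                 # rule 4: plate id
    )


def remove_duplicate_calls(calls):
    """Keep one call per (tumor sample, chromosome, position)."""
    best = {}
    for call in calls:
        # The first 16 characters of an aliquot barcode identify the sample.
        key = (call["tumor_aliquot"][:16], call["chr"], call["start"])
        if key not in best or preference_key(call) > preference_key(best[key]):
            best[key] = call
    return list(best.values())
```

Using a tuple as the key keeps the rule order explicit: a later element is only consulted when all earlier elements tie.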

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8435 tumor samples from 8373 patients.


DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential scientific notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
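Reading a field written in scientific notation as an integer needs one extra step, since int() does not accept exponent notation directly. A minimal sketch follows, with hypothetical function names; going through Decimal is one way to avoid floating-point rounding for the integer field, not necessarily what the original ETL did.

```python
from decimal import Decimal


def parse_num_probes(text):
    """Parse an integer that may be written as, e.g., '8.7E+07'."""
    # Decimal parses exponent notation exactly; int() then truncates.
    return int(Decimal(text.strip()))


def parse_segment_mean(text):
    """Parse a floating-point value; float() accepts exponent notation."""
    return float(text.strip())
```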


DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.


ETL Details: The BigQuery table is populated only with the files matching the pattern %HumanMethylation%.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
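The platform merge can be sketched as follows. The row layout, the sample key, and the platform labels are hypothetical; precedence is interpreted here per sample (all 27k rows are dropped for any sample that also has 450k rows), which is one reading of the rule above.

```python
PREFERRED = "HumanMethylation450"  # hypothetical platform label
MISSING = (None, "NA", "NULL")


def merge_platforms(rows):
    """Merge 27k and 450k rows, letting 450k win for shared samples.

    rows: list of dicts with keys sample, platform, Probe_Id, Beta_Value.
    """
    # Samples that were assayed on the 450k platform.
    has_450k = {row["sample"] for row in rows if row["platform"] == PREFERRED}
    merged = []
    for row in rows:
        if row["Beta_Value"] in MISSING:
            continue  # probes without a beta value are left out
        if row["platform"] != PREFERRED and row["sample"] in has_450k:
            continue  # 450k data takes precedence for this sample
        merged.append(row)
    return merged
```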


mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers using several different plat-forms and fundamentally different pipelines Most of the data from each center was produced using the IlluminaHiSeq platform and for that reason the first two BigQuery tables containing gene expression data are based on thosespecific subsets of the TCGA mRNA expression data

bull the majority of the data was produced by the UNC LCCC and the resulting normalized RSEM values are storedin one table

bull and a subset of the data was produced by the BC GSC and the resulting normalized RPKM values are stored inanother table

UNC RNAseqV2 Pipeline A DESCRIPTIONtxt file describing the algorithms methods and protocols used toproduce the Level-1 Level-2 and Level-3 data can be obtained from the TCGA DCC

The BigQuery table was populated using the values in files matching the patternrsemgenesnormalized_results These raw ldquoRSEM genes normalized resultsrdquo files have twocolumns both of which are stored in the BigQuery table The first column contains the gene_id which contains twoparts separated by a | eg TP53|7157 The second column contains the normalized_count representing theexpression value for that gene

The gene_id column is split into two components and stored as separate columnsoriginal_gene_symbol and gene_id Based on the gene_id the current HGNC approved genesymbol is looked up and added as a third column HGNC_gene_symbol

BCGSC RNAseq Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply "?", that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
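The two-or-three-part split can be sketched as follows. This is an illustration under assumptions rather than the pipeline's code: it assumes the placeholder for an unknown component is the single character "?", it leaves suffixes such as _calculated untouched, and the function name and the numeric id in the "?" example are hypothetical.

```python
from typing import Dict, Optional


def split_bcgsc_gene(raw: str) -> Dict[str, Optional[str]]:
    """Split a BCGSC gene string on '|' into two or three named components."""
    parts = raw.split("|")
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three '|'-separated parts: %r" % raw)
    # Assumption: a component that is just '?' stands for 'unknown' and becomes NULL (None).
    cleaned = [None if part == "?" else part for part in parts]
    return {
        "original_gene_symbol": cleaned[0],
        "gene_id": cleaned[1],
        "gene_addenda": cleaned[2] if len(cleaned) == 3 else None,
    }


two_part = split_bcgsc_gene("TP53|7157_calculated")
three_part = split_bcgsc_gene("Mir_1302||3of7_calculated")
unknown_symbol = split_bcgsc_gene("?|12345_calculated")  # hypothetical id
```

Note that the middle component of the three-part example is an empty string; how the actual ETL treats empty components is not described here, so the sketch passes it through unchanged.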

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, each referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21MIMAT0000076". During ETL, this string is split into two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
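To make the "split and melt" step concrete, here is a small sketch. It is not the ETL code: the function names and aliquot labels are hypothetical, and because the separator between gene name and accession is not recoverable from this copy, the sketch simply assumes the accession starts with "MIMAT" and tolerates an optional "." or space before it.

```python
import re
from typing import List, Sequence, Tuple

# Assumption: the accession is 'MIMAT' followed by digits at the end of the string.
_MATURE_NAME = re.compile(r"^(?P<name>.+?)[.\s]?(?P<acc>MIMAT\d+)$")


def split_mature_name(combined: str) -> Tuple[str, str]:
    """Split a combined identifier into (microRNA gene name, accession)."""
    match = _MATURE_NAME.match(combined)
    if match is None:
        raise ValueError("unrecognized mature microRNA identifier: %r" % combined)
    return match.group("name"), match.group("acc")


def melt_mirna_matrix(
    aliquots: Sequence[str],
    rows: Sequence[Tuple[str, Sequence[float]]],
) -> List[Tuple[str, str, str, float]]:
    """Turn a wide matrix (one row per microRNA, one column per aliquot) into tidy rows."""
    tidy = []
    for combined, values in rows:
        name, accession = split_mature_name(combined)
        for aliquot, value in zip(aliquots, values):
            tidy.append((aliquot, name, accession, value))
    return tidy


tidy_rows = melt_mirna_matrix(
    ["ALIQUOT-1", "ALIQUOT-2"],
    [("hsa-mir-21.MIMAT0000076", [10.0, 12.5])],
)
```

Each output row carries one aliquot, one microRNA, and one value, which is the shape a flat BigQuery table expects.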

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Coreprotein_expressiontxt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene, and protein map is explained in detail below.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names):

• Check the antibody annotation files for missing columns

• If "protein_name" is missing, generate one from "composite_element_ref"

• Make a map of 'composite_element_ref', 'gene_name', and 'protein_name' values

• Check any other variant of the gene and protein symbols in the table

• HGNC validation

• If the gene symbol is in the HGNC approved symbols ('Approved'), Gene_symbol = Gene_symbol

• If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol

• If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol

• If not, Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm
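The HGNC validation steps in the list above can be read as a simple cascade of lookups. The sketch below follows that reading literally; it is an interpretation, not the curation code itself. The function name, the three lookup structures, and the placeholder symbols are hypothetical.

```python
from typing import Dict, Set


def validate_gene_symbol(
    symbol: str,
    approved: Set[str],
    aliases: Set[str],
    previous_to_approved: Dict[str, str],
) -> str:
    """Apply the approved / alias / previous-symbol checks in order."""
    if symbol in approved:
        return symbol  # already an approved symbol
    if symbol in aliases:
        return symbol  # per the list above: Gene_symbol = Alias_symbol
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # replace with the approved symbol
    return symbol  # unknown: left unchanged for manual curation


# Hypothetical lookup tables with placeholder symbols.
approved_symbols = {"GENE_A"}
alias_symbols = {"GENE_A_ALIAS"}
previous_symbols = {"GENE_B_OLD": "GENE_B"}

validated = [
    validate_gene_symbol(s, approved_symbols, alias_symbols, previous_symbols)
    for s in ["GENE_A", "GENE_A_ALIAS", "GENE_B_OLD", "GENE_X"]
]
```

Only a previous symbol is rewritten in this reading; everything else passes through and is caught, if needed, by the manual curation step.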

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
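The row-duplication rule from the first bullet can be illustrated with a short sketch. It assumes the concatenated genes are separated by whitespace, as in the 'AKT1 AKT2 AKT3' example; the function name, the dictionary row layout, and the placeholder composite_element_ref value are hypothetical.

```python
from typing import Dict, List


def expand_gene_rows(row: Dict[str, str]) -> List[Dict[str, str]]:
    """Return one copy of the row per gene listed in its 'gene_name' value."""
    genes = row["gene_name"].split()  # assumption: whitespace-separated genes
    if not genes:
        return [dict(row)]  # nothing to split: keep the row as it is
    return [dict(row, gene_name=gene) for gene in genes]


expanded = expand_gene_rows(
    {"composite_element_ref": "EXAMPLE_REF", "gene_name": "AKT1 AKT2 AKT3"}
)
```

All other columns are copied unchanged into each of the resulting rows.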

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This table can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done, using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web app, and then programmatically access these cohorts using this API.
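The relationship between the two barcode lengths can be shown with a tiny sketch. The function name is hypothetical, and the sample barcode used below is made up by appending an illustrative suffix to the participant barcode TCGA-B9-7268 quoted elsewhere in this documentation.

```python
def patient_barcode_from_sample(sample_barcode: str) -> str:
    """Return the 12-character patient barcode prefix of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]


# Hypothetical sample barcode built from the participant barcode TCGA-B9-7268.
patient = patient_barcode_from_sample("TCGA-B9-7268-01A")
```

This prefix rule applies to TCGA barcodes only; as noted above, other datasets may name their samples differently.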

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCAOV"}' -H "Content-Type: application/json"
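For Python users, the same kind of request can be prepared with the standard library. This is a hedged sketch rather than an official client: the URL is reconstructed from the request shown in this section and the service may no longer respond at that address, the helper name is hypothetical, and the single-study filter value "BRCA" is chosen only for illustration.

```python
import json
import urllib.request

# Assumption: endpoint URL as reconstructed from this section of the documentation.
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the filters as a JSON body."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as response:
#     preview = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the payload before contacting the service.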

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion string
age_at_initial_pathologic_diagnosis string
anatomic_neoplasm_subdivision string
avg_percent_lymphocyte_infiltration float
avg_percent_monocyte_infiltration float
avg_percent_necrosis float
avg_percent_neutrophil_infiltration float
avg_percent_normal_cells float
avg_percent_stromal_cells float
avg_percent_tumor_cells float
avg_percent_tumor_nuclei float
batch_number integer
bcr string
clinical_M string
clinical_N string
clinical_stage string
clinical_T string
colorectal_cancer string
country string
country_of_procurement string
days_to_birth integer
days_to_collection integer
days_to_death integer
days_to_initial_pathologic_diagnosis integer
days_to_last_followup integer
days_to_submitted_specimen_dx integer
Study string
ethnicity string
frozen_specimen_anatomic_site string
gender string
height integer
histological_type string
history_of_colon_polyps string
history_of_neoadjuvant_treatment string
history_of_prior_malignancy string
hpv_calls string
hpv_status string
icd_10 string
icd_o_3_histology string
icd_o_3_site string
lymph_node_examined_count integer
lymphatic_invasion string
lymphnodes_examined string
lymphovascular_invasion_present string
max_percent_lymphocyte_infiltration integer
max_percent_monocyte_infiltration integer
max_percent_necrosis integer
max_percent_neutrophil_infiltration integer
max_percent_normal_cells integer
max_percent_stromal_cells integer
max_percent_tumor_cells integer
max_percent_tumor_nuclei integer
menopause_status string
min_percent_lymphocyte_infiltration integer
min_percent_monocyte_infiltration integer
min_percent_necrosis integer
min_percent_neutrophil_infiltration integer
min_percent_normal_cells integer
min_percent_stromal_cells integer
min_percent_tumor_cells integer
min_percent_tumor_nuclei integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status string
mononucleotide_marker_panel_analysis_status string
neoplasm_histologic_grade string
new_tumor_event_after_initial_treatment string
number_of_lymphnodes_examined integer
number_of_lymphnodes_positive_by_he integer
ParticipantBarcode string
pathologic_M string
pathologic_N string
pathologic_stage string
pathologic_T string
person_neoplasm_cancer_status string
pregnancies string
preservation_method string
primary_neoplasm_melanoma_dx string
primary_therapy_outcome_success string
prior_dx string
Project string
psa_value float
race string
residual_tumor string
SampleBarcode string
tobacco_smoking_history string
total_number_of_pregnancies integer
tumor_tissue_site string
tumor_pathology string
tumor_type string
weiss_venous_invasion string
vital_status string
weight integer
year_of_initial_pathologic_diagnosis string
SampleTypeCode string
has_Illumina_DNASeq string
has_BCGSC_HiSeq_RNASeq string
has_UNC_HiSeq_RNASeq string
has_BCGSC_GA_RNASeq string
has_UNC_GA_RNASeq string
has_HiSeq_miRnaSeq string
has_GA_miRNASeq string
has_RPPA string
has_SNP6 string
has_27k string
has_450k string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked, and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD) published by the World Health Organization in 1992_A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | a yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | the yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | the yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | the total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | the survival state of the person registered on the protocol
weight | integer | the weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | the type of the sample (tumor or normal tissue, cell, or blood sample) provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
patient_count  string  Number of participants in this cohort
patients[]  list  List of participant barcodes in this cohort
sample_count  string  Number of samples in this cohort
samples[]  list  List of sample barcodes in this cohort
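
Note that the counts in this response are strings, not numbers. The following Python sketch (the function name summarize_cohort is hypothetical) shows one way a client might convert them and check them against the returned lists, assuming the response has already been decoded from JSON into a dict.

```python
def summarize_cohort(response):
    """Convert the string counts to integers and compare them to the lists."""
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    patient_count = int(response["patient_count"])
    sample_count = int(response["sample_count"])
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        # True when both counts agree with the lengths of the barcode lists.
        "counts_match_lists": (
            patient_count == len(patients) and sample_count == len(samples)
        ),
    }
```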

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name  Value  Description
Path parameters
patient_barcode  string  Barcode of the patient to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
clinical_data  nested object  The clinical data about the participant
clinical_data.ParticipantBarcode  string  Participant barcode
clinical_data.Project  string  Project name, e.g. "TCGA"
clinical_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis  string  Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision  string  Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number  string  Groups samples by the batch they were processed in
clinical_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M  string  Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N  string  Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T  string  Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage  string  Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer  string  Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country  string  Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth  string  Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis  string  Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup  string  Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity  string  The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site  string  Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender  string  Text designations that identify gender
clinical_data.histological_type  string  Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps  string  Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment  string  Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls  string  Results of HPV tests
clinical_data.hpv_status  string  Current HPV status
clinical_data.icd_10  string  The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology  string  The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site  string  The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion  string  A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined  string  The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present  string  The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status  string  Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade  string  Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment  string  Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined  string  The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he  string  Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M  string  Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N  string  The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage  string  The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T  string  Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American


clinical_data.person_neoplasm_cancer_status  string  The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies  string  Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx  string  Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success  string  Measure of Success
clinical_data.prior_dx  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor  string  Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history  string  Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site  string  Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type  string  Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status  string  The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion  string  The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[]  list  List of barcodes of samples taken from this participant
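
As a rough illustration of calling patient_details from Python, the sketch below builds the request URL, fetches the JSON response, and reads one field from clinical_data. It uses only the standard library. Passing patient_barcode as a query-string argument is an assumption of this sketch (the table above lists it under path parameters), the helper names are hypothetical, and the endpoint may not be reachable from your environment.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the HTTP request line documented above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the GET URL (assumes the barcode is sent as a query parameter)."""
    return BASE_URL + "/patient_details?" + urlencode(
        {"patient_barcode": patient_barcode}
    )


def fetch_patient_details(patient_barcode):
    """Perform the request and decode the JSON body into a dict."""
    with urlopen(patient_details_url(patient_barcode)) as resp:
        return json.load(resp)


def clinical_field(details, name):
    """Return one clinical_data value, or None if it is absent."""
    return details.get("clinical_data", {}).get(name)
```

A caller could then write clinical_field(fetch_patient_details("TCGA-B9-7268"), "vital_status"). All clinical values are returned as strings, so numeric fields need explicit conversion.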


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.
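
The two barcode lengths mentioned in this section (12 characters for a participant, 16 for a sample) are related: the participant barcode is the first 12 characters of the sample barcode. The Python sketch below checks a barcode before it is sent and derives the participant barcode from a sample barcode. The regular expressions are an assumption based on the two example barcodes shown here, and the function names are hypothetical.

```python
import re

# Patterns inferred from the examples TCGA-B9-7268 and TCGA-B9-7268-01A.
PARTICIPANT_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}$")
SAMPLE_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-[0-9]{2}[A-Z]$")


def is_participant_barcode(barcode):
    """True for a 12-character participant barcode."""
    return bool(PARTICIPANT_RE.match(barcode))


def is_sample_barcode(barcode):
    """True for a 16-character sample barcode."""
    return bool(SAMPLE_RE.match(barcode))


def participant_from_sample(sample_barcode):
    """Return the participant barcode embedded in a sample barcode."""
    if not is_sample_barcode(sample_barcode):
        raise ValueError("not a sample barcode: %r" % sample_barcode)
    return sample_barcode[:12]
```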

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
biospecimen_data  nested object  Biospecimen data about the sample
biospecimen_data.ParticipantBarcode  string  Participant barcode
biospecimen_data.Project  string  Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode  string  Sample barcode
biospecimen_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration  integer  Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration  integer  Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis  integer  Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration  integer  Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells  integer  Average percent normal cells
biospecimen_data.avg_percent_stromal_cells  integer  Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells  integer  Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei  integer  Average percent tumor nuclei
biospecimen_data.batch_number  string  Batch number in which the sample was processed
biospecimen_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection  string  Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration  string  Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration  string  Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis  string  Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration  string  Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells  string  Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells  string  Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells  string  Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei  string  Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration  string  Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration  string  Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis  string  Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration  string  Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells  string  Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells  string  Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells  string  Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei  string  Minimum percent tumor nuclei
data_details[]  list  List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath  string  Path to file if it exists
data_details[].DataCenterName  string  Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType  string  Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName  string  Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey  string  Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded  string  Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel  string  Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype  string  Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference  string  Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline  string  A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform  string  A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name  string  The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project  string  The study for which the data was generated, e.g. "TCGA"
data_details[].Repository  string  A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName  string  Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode  string  Sample barcode
data_details[].SecurityProtocol  string  An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count  string  Length of data_details list
patient  string  Participant barcode
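
Because data_details is a list with one entry per data file, a common first step is to group the entries. The Python sketch below groups the available cloud storage paths by the Platform field, assuming the response has already been decoded into a dict. The function name files_by_platform is hypothetical, and entries without a CloudStoragePath are skipped since the table describes that field as present only "if it exists".

```python
from collections import defaultdict


def files_by_platform(sample_details):
    """Group CloudStoragePath values from data_details by Platform."""
    grouped = defaultdict(list)
    for entry in sample_details.get("data_details", []):
        path = entry.get("CloudStoragePath")
        if path:  # skip files that have no path in cloud storage
            grouped[entry.get("Platform", "unknown")].append(path)
    return dict(grouped)
```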


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. Barcode of the sample to get file paths for.
platform  string  Optional. Filter file results by platform.
pipeline  string  Optional. Filter file results by pipeline.
token  string  Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
count  string  Integer representing the length of the datafilenamekeys list
datafilenamekeys[]  list  List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
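
Two details of this endpoint are easy to miss in client code: the optional filters should only be sent when set, and the datafilenamekeys field may be missing from the response entirely. The Python sketch below covers both. Sending the parameters in the query string is an assumption of this sketch, and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request line documented above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"
PLACEHOLDER = "file-path-not-yet-available"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the GET URL, including only the optional filters that are set."""
    params = {"sample_barcode": sample_barcode}
    for name, value in (("platform", platform), ("pipeline", pipeline), ("token", token)):
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def available_paths(response):
    """Return usable file paths; tolerate a missing datafilenamekeys field."""
    keys = response.get("datafilenamekeys", [])
    # Drop entries that only carry the not-yet-available placeholder.
    return [key for key in keys if PLACEHOLDER not in key]
```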


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
count  string  The number of items returned. Count will be either "0" or "1"
items[]  list  If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode  string  The sample barcode passed into the request
items[].GG_dataset_id  string  The dataset id of the sample
items[].GG_readgroupset_id  string  The readgroupset id of the sample
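
Since items holds at most one object, client code usually wants either a pair of ids or nothing. A minimal Python sketch follows, assuming the response has been decoded into a dict; the function name genomics_ids is hypothetical, and the ids used to exercise it are made up.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    items = response.get("items", [])
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]
```

Checking the list itself, rather than comparing count to the string "1", avoids depending on the string encoding of the count.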


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
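
If you want to check the installed SDK components from a script rather than by eye, the sketch below runs gcloud --version from Python and parses the output into a dictionary. It assumes gcloud is on your PATH and that the output consists of one component per line with the version as the last word (a component may also appear with no version at all); the function names are hypothetical.

```python
import subprocess


def parse_versions(text):
    """Parse 'name version' lines into a dict; names without a version map to ''."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, version = line.rpartition(" ")
        if name:
            versions[name] = version
        else:
            # A single word on the line: component listed without a version.
            versions[line] = ""
    return versions


def installed_versions():
    """Run `gcloud --version` and return the parsed component versions."""
    result = subprocess.run(
        ["gcloud", "--version"], capture_output=True, text=True, check=True
    )
    return parse_versions(result.stdout)
```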

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS with 5 GB of persistent disk storage per user, and with the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
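
When several of these properties need to be set at once (for example when preparing a new workstation), the gcloud config set calls can be scripted. The Python sketch below builds the command lines and runs them one by one. It assumes gcloud is installed and already authenticated; the helper names and the project id in the usage note are hypothetical.

```python
import subprocess


def config_set_command(prop, value):
    """Command line for `gcloud config set PROPERTY VALUE`."""
    return ["gcloud", "config", "set", prop, value]


def config_list_command():
    """Command line for `gcloud config list`."""
    return ["gcloud", "config", "list"]


def apply_config(settings, run=subprocess.run):
    """Set each property in turn; `run` can be replaced for dry runs."""
    for prop, value in settings.items():
        run(config_set_command(prop, value), check=True)
```

For example, apply_config({"project": "my-project", "compute/zone": "us-central1-a"}) sets the default project and zone, after which the command from config_list_command() can be run to verify the result.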


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list
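As a minimal sketch of the configuration step described above, the following sets a default zone and then shows the active settings and the available zones. The run helper and the DRY_RUN variable are hypothetical conveniences and not part of gcloud: by default the commands are only printed, and they are executed only when DRY_RUN=0 is set.

```shell
# Hypothetical helper (not part of gcloud): print each command, and only
# execute it when DRY_RUN=0 is set in the environment.
run() { echo "+ $*"; if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi; }

run gcloud config set compute/zone us-central1-a   # default zone for later commands
run gcloud config list                             # show the active configuration
run gcloud compute zones list                      # list the zones you can choose from
```

Once a default zone is set, later gcloud compute commands that need a zone can pick it up from the configuration, so the zone does not have to be repeated on every command.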

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
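A slightly fuller variant might spell out the same choices that the Console form offers (zone, machine type, boot disk). This is only a sketch: the instance name, machine type, and disk settings are illustrative assumptions, and the run helper is a hypothetical dry-run wrapper that prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical helper (not part of gcloud): print each command, and only
# execute it when DRY_RUN=0 is set in the environment.
run() { echo "+ $*"; if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi; }

# Illustrative values: adjust the name, zone, machine type, and disk as needed.
# Adding --preemptible would request a preemptible VM instead.
run gcloud compute instances create my-analysis-vm \
    --zone us-central1-a \
    --machine-type n1-standard-4 \
    --boot-disk-size 100GB \
    --boot-disk-type pd-ssd

# Check that the new instance shows up in the list of instances.
run gcloud compute instances list
```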

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances, and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI: simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
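The disk figures quoted above imply a rate of about $0.04 per GB per month for standard persistent disks. Assuming that rate (it is derived from the numbers in this section and may not match current pricing), a small hypothetical helper can estimate the monthly cost of keeping a disk around:

```shell
# Assumed rate: 0.04 USD per GB per month, as implied by "50 GB costs $2".
# Check the current pricing before relying on these numbers.
disk_cost() {
  # $1 = disk size in GB; prints the estimated monthly cost in USD
  awk -v gb="$1" 'BEGIN { printf "%.2f\n", gb * 0.04 }'
}

disk_cost 50     # prints 2.00
disk_cost 500    # prints 20.00
disk_cost 1000   # prints 40.00
```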

If you know that you will never need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
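Put together, the life cycle of a VM from the command line might look like the sketch below. The instance name and zone are the illustrative values used earlier, and the run helper is a hypothetical dry-run wrapper (commands are printed, and executed only when DRY_RUN=0 is set), which seems prudent for commands that stop or delete resources.

```shell
# Hypothetical helper (not part of gcloud): print each command, and only
# execute it when DRY_RUN=0 is set in the environment.
run() { echo "+ $*"; if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi; }

VM=my-instance
ZONE=us-central1-a

run gcloud compute ssh "$VM" --zone "$ZONE" --command uptime   # run a single command on the VM
run gcloud compute instances stop "$VM" --zone "$ZONE"         # stop paying for the VM itself
run gcloud compute instances start "$VM" --zone "$ZONE"        # restart it later
run gcloud compute instances delete "$VM" --zone "$ZONE"       # remove the VM when no longer needed
```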

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).
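To illustrate the --type and --zone options mentioned above, here is a sketch that requests an SSD-backed disk in a specific zone and then lists the disks in the project. The disk name disk-2, its size, and the zone are made-up values, and run is a hypothetical dry-run wrapper that only prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical helper (not part of gcloud): print each command, and only
# execute it when DRY_RUN=0 is set in the environment.
run() { echo "+ $*"; if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi; }

# A faster (and more expensive) SSD persistent disk in an explicit zone.
run gcloud compute disks create disk-2 --size 200GB --type pd-ssd --zone us-central1-a

# List the disks in the project, including boot disks.
run gcloud compute disks list
```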

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).
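As a sketch of the read-only option mentioned above, the following attaches disk-1 in ro mode and then prints the instance description, which includes its attached disks. The zone is an assumed value, and run is a hypothetical dry-run wrapper that only prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical helper (not part of gcloud): print each command, and only
# execute it when DRY_RUN=0 is set in the environment.
run() { echo "+ $*"; if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi; }

# Attach the disk read-only; the device name is what the VM will see.
run gcloud compute instances attach-disk my-instance \
    --disk disk-1 --device-name disk-1 --mode ro --zone us-central1-a

# Inspect the instance, including its list of attached disks.
run gcloud compute instances describe my-instance --zone us-central1-a
```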

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point. The device path below assumes that the disk was attached with --device-name disk-1, in which case it appears on the instance as /dev/disk/by-id/google-disk-1.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.
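A clean-up sequence for a non-boot disk might therefore look like the sketch below: check which disks exist, detach the disk from its instance, and then delete it. The zone is an assumed value, and run is a hypothetical dry-run wrapper that only prints the commands unless DRY_RUN=0 is set.

```shell
# Hypothetical helper (not part of gcloud): print each command, and only
# execute it when DRY_RUN=0 is set in the environment.
run() { echo "+ $*"; if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi; }

run gcloud compute disks list                                                             # see which disks exist
run gcloud compute instances detach-disk my-instance --disk disk-1 --zone us-central1-a   # a disk in use cannot be deleted
run gcloud compute disks delete disk-1 --zone us-central1-a                               # permanently removes the disk and its data
```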

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
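To make the three set operations concrete, the sketch below applies them to two small cohorts represented as lists of sample barcodes. It is only an illustration of the set logic, not of how the web-app implements it; the barcodes are made up, and A is taken as the base cohort for the complement.

```shell
# Two hypothetical cohorts, one made-up sample barcode per line.
# comm requires sorted input, so both lists are sorted up front.
work="$(mktemp -d)"
printf '%s\n' TCGA-01-0001-01A TCGA-01-0002-01A TCGA-01-0003-01A | sort > "$work/a.txt"
printf '%s\n' TCGA-01-0002-01A TCGA-01-0003-01A TCGA-01-0004-01A | sort > "$work/b.txt"

union="$(sort -u "$work/a.txt" "$work/b.txt")"         # samples in A or B
intersect="$(comm -12 "$work/a.txt" "$work/b.txt")"    # samples in both A and B
complement="$(comm -23 "$work/a.txt" "$work/b.txt")"   # samples in base cohort A but not in B

printf 'union:\n%s\n' "$union"
printf 'intersect:\n%s\n' "$intersect"
printf 'complement (A minus B):\n%s\n' "$complement"
rm -r "$work"
```

Swapping A and B leaves the union and the intersection unchanged but gives a different complement, which is why the complement operation needs a designated base cohort.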

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header will indicate whether that column is being sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. By hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing on the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
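The difference between the two counts can be illustrated with a short sketch. It assumes TCGA-style sample barcodes in which the first 12 characters identify the participant; that convention is not stated in this section, and the barcodes themselves are made up.

```shell
# Made-up sample barcodes; the first two come from the same participant.
samples='TCGA-01-0001-01A
TCGA-01-0001-10A
TCGA-01-0002-01A'

n_samples="$(printf '%s\n' "$samples" | wc -l | tr -d ' ')"
# Assumption: characters 1-12 of a sample barcode identify the participant.
n_participants="$(printf '%s\n' "$samples" | cut -c1-12 | sort -u | wc -l | tr -d ' ')"

echo "$n_samples samples from $n_participants participants"
```

Here three samples come from only two participants, which is the kind of discrepancy the Details Panel reports.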

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin downloading a CSV file that lists all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
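Once downloaded, a file list like this can be filtered further on the command line. The sketch below is hypothetical throughout: the header names, column order, platform labels, and gs:// paths are invented for illustration, and the simple comma split assumes that no field contains a quoted comma.

```shell
# Hypothetical file list. The header names, their order, the platform labels,
# and the gs:// paths are all made up for illustration.
list="$(mktemp)"
cat > "$list" <<'EOF'
SampleBarcode,Platform,Pipeline,DataLevel,CloudStoragePath
TCGA-01-0001-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/a/file1.txt
TCGA-01-0002-01A,PlatformB,pipeline-2,Level 3,gs://example-bucket/b/file2.txt
TCGA-01-0003-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/a/file3.txt
EOF

# Skip the header row, keep rows whose second column matches, print the path.
paths="$(awk -F, 'NR > 1 && $2 == "PlatformA" { print $5 }' "$list")"
echo "$paths"
rm "$list"
```

The resulting Cloud Storage paths could then be handed to a tool such as gsutil to copy the files of interest.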

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

On the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Platform Filter: filters down the plot-able features by platforms

  – Center Filter: filters down the plot-able features by processing center

  – Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name

  – Platform Filter: filters down the plot-able features by platforms

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – CpG Probe Filter: filters down the plot-able features by a specific CpG Probe. This is an autocomplete search field for a particular probe

  – Platform Filter: filters down the plot-able features by platforms

  – Gene Region Filter: filters down the plot-able features by specific gene regions

  – CpG Island Region Filter: filters down the plot-able features by CpG Island region

  – Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name

  – Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by mutation value

  – Select Feature: provides the filtered list of plot-able features to select from, based on selected filters

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols

• Cohorts: here you can select one or more cohorts for the visualization

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.

1.4.8 Release Notes

• December 23, 2015: v0.2

– Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download is not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
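
Scripts that run for a long time may want to keep track of the 24-hour window themselves. The following is a minimal sketch, assuming you record the time at which your authorization was verified; the names ACCESS_WINDOW, access_expires_at and has_controlled_access are hypothetical and are not part of any ISB-CGC API. The platform itself enforces access through the ACL, so this is only local bookkeeping.

```python
from datetime import datetime, timedelta, timezone

# Length of the controlled-data access window described in the FAQ answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Return the moment the access window that started at `verified_at` closes."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while `now` falls inside the window that started at `verified_at`."""
    return verified_at <= now < access_expires_at(verified_at)


# Hypothetical verification time, recorded by the user after signing in.
verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
print(access_expires_at(verified).isoformat())
print(has_controlled_access(verified, verified + timedelta(hours=23)))
print(has_controlled_access(verified, verified + timedelta(hours=25)))
```

A job that is expected to run past the expiry time would need the user to re-authenticate through the web-app first, since the answer above states that this step cannot be done programmatically.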

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

1.1 About the ISB Cancer Genomics Cloud

The ISB-CGC provides interactive and programmatic access to the TCGA data, leveraging many aspects of the Google Cloud Platform including BigQuery, Compute Engine, App Engine, Cloud Datalab, and Google Genomics. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and genomic reference and platform-annotation sources, are stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data is available to dbGaP-authorized users in the original BAM and FASTQ file formats.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to rapidly analyze hundreds of terabytes of sequence data. The ISB-CGC allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.

1.2 Cloud-Hosted Data Sets

The ISB-CGC platform hosts the majority of the TCGA data set, as well as other reference and annotation datasets, in the appropriate Google Cloud technologies.

1.2.1 About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data being hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

• TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.

• CGHub: the Cancer Genomics Hub is NCI's current secure data repository for all TCGA BAM and FASTQ sequence data files.

The ISB-CGC platform is one of NCI's Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level ("Level 1") data has been processed through a set of standardized pipelines, and the resulting high-level ("Level 3") data is the data used in most downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset, including:

• DNA sequencing of tumor samples and matched-normals (typically blood samples), in order to detect somatic mutations

• SNP array based DNA copy-number and genotyping analysis of tumor samples and matched-normals

• DNA methylation of tumor samples

• messenger RNA (mRNA) expression analysis of the tumor samples, to capture the gene expression profile

• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse phase protein array).

Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data, there are typically three levels of data. Level 1 typically represents raw, un-normalized data. Level 2 typically represents an intermediate level of processing and/or normalization of the data. Level 3 typically represents aggregated, normalized, and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as "Level 4" data. More information about Data Level Classification can be found on the NCI wiki.

Understanding the TCGA Data Platforms

When working with any of the data types, it is important to be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained first using the Illumina HumanMethylation27 platform and later using the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.

TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage.

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis.

Understanding Data Access

• Public Data: Sometimes the word "public" is misinterpreted as meaning "open". All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.

Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project.

• Submit a data access request (DAR); the DAR requires an NIH log-in.

Note: Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions.) dbGaP Data Access Request Portal

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution. Sample DUC form

• Protect data confidentiality. (Any data used in a study which was initially listed as "Controlled" or TCGA remains controlled data and MUST be protected accordingly, unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example: Project 1 has only users authorized to access controlled data and a DUC in place. Project 2 has only open-access users: NO controlled data access is allowed from/by Project 2 member(s).)

Remember: YOU and YOUR institution are accountable for ensuring the security of this data, not the cloud service provider. Securing controlled data and protecting it should be thought of in the same manner as any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: The Cloud, dbGaP and the NIH blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).

1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC;

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC;

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC;

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC;

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC; and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager.

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our GitHub repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out.

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples.

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC).

• MAF files: these "protected" mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types).

• DNA-seq BAM files: these Level-1 data files were provided by CGHub.

– over 37,000 of these files are available in Google Cloud Storage (GCS)

– roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

– BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub.

– over 13,000 mRNA-seq BAM files are available in GCS

– over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files.

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more "cloud-aware" approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.
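
Comparing archive names by hand is tedious, so the revision check can be scripted. The sketch below assumes the archive name layout `<domain>_<disease>.<platform>.<archive type>.<serial index>.<revision>.<series>`; the domain prefix and trailing series number are assumptions that go beyond the components listed above, and the example names are made up for illustration. ArchiveName, parse_archive_name and latest_revisions are hypothetical helper names.

```python
from typing import NamedTuple


class ArchiveName(NamedTuple):
    domain: str
    disease: str
    platform: str
    archive_type: str
    serial_index: int
    revision: int
    series: int


def parse_archive_name(name):
    """Split an archive name into its components (assumed layout, see lead-in)."""
    # The domain (e.g. "unc.edu") contains dots, so it is split off first.
    domain, rest = name.split("_", 1)
    disease, platform, archive_type, serial, revision, series = rest.split(".")
    return ArchiveName(domain, disease, platform, archive_type,
                       int(serial), int(revision), int(series))


def latest_revisions(names):
    """Keep only the highest-revision name for each archive."""
    latest = {}
    for name in names:
        parsed = parse_archive_name(name)
        key = parsed[:5]  # everything except revision and series
        if key not in latest or parsed.revision > latest[key][0].revision:
            latest[key] = (parsed, name)
    return sorted(name for _, name in latest.values())


# Made-up archive names, for illustration only.
EXAMPLE_NAMES = [
    "unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.1.10.0",
    "unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.1.11.0",
    "unc.edu_BRCA.IlluminaHiSeq_RNASeqV2.Level_3.2.4.0",
]
for name in latest_revisions(EXAMPLE_NAMES):
    print(name)
```

Revisions are compared as integers rather than as strings, so that revision 11 is treated as newer than revision 9.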

Note that all "bio" archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• the unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).
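
A three-column listing like the one described above is easy to load with the standard library. This is a minimal sketch, assuming the listing is tab-separated with no header row; the bucket name, the ids and the object paths below are placeholders, and read_manifest is a hypothetical helper name.

```python
import csv
import io


def read_manifest(text, bucket):
    """Parse rows of (CGHub id, partial GCS object path, size in bytes)."""
    rows = []
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row:
            continue  # skip blank lines
        cghub_id, partial_path, size = row
        rows.append({
            "cghub_id": cghub_id,
            # The bucket is supplied by the caller; it is not part of the listing.
            "gcs_url": "gs://{}/{}".format(bucket, partial_path.lstrip("/")),
            "size_bytes": int(size),
        })
    return rows


# Placeholder content, for illustration only.
MANIFEST_TEXT = (
    "00000000-aaaa-bbbb-cccc-000000000001\tsome/dir/sample_one.bam\t1000\n"
    "00000000-aaaa-bbbb-cccc-000000000002\tsome/dir/sample_two.bam\t2500\n"
)

rows = read_manifest(MANIFEST_TEXT, "example-bucket")
print(rows[0]["gcs_url"])
print(sum(r["size_bytes"] for r in rows))
```

Summing the size column before starting a job gives a quick estimate of how much data the job will read.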

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).
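
The dataset name above uses the `project:dataset` form. BigQuery standard SQL instead expects a backtick-quoted `project.dataset.table` path, so a small helper can build table references for queries. This is only a sketch: to_standard_sql and count_rows_query are hypothetical names, the Biospecimen_data table name is taken from later in this section, and actually running the query requires a BigQuery client and a GCP project, which are not shown.

```python
# Dataset name as given in the text, in "project:dataset" form.
DATASET = "isb-cgc:tcga_201510_alpha"


def to_standard_sql(legacy_ref):
    """Convert 'project:dataset.table' to a backtick-quoted standard SQL path."""
    project, rest = legacy_ref.split(":", 1)
    return "`{}.{}`".format(project, rest)


def count_rows_query(table):
    """Build a query string that counts the rows of one table in the dataset."""
    ref = to_standard_sql("{}.{}".format(DATASET, table))
    return "SELECT COUNT(*) AS n FROM {}".format(ref)


print(count_rows_query("Biospecimen_data"))
# prints: SELECT COUNT(*) AS n FROM `isb-cgc.tcga_201510_alpha.Biospecimen_data`
```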

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
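
Several of the curation rules above (missing-value strings, number parsing, and quote stripping) can be illustrated with one small function. This is a sketch of the described behaviour and not the actual ETL code; clean_value, MISSING_STRINGS and the regular expressions are hypothetical, and treating commas as thousands separators only when they group digits in threes is an assumption.

```python
import re

# Strings treated as missing; the empty string stands in for a blank value.
MISSING_STRINGS = {"none", "None", "NONE", "null", "Null", "NULL",
                   "NA", "__UNKNOWN__", ""}
NUMBER_RE = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)([eE][+-]?\d+)?")
THOUSANDS_RE = re.compile(r"[+-]?\d{1,3}(,\d{3})+(\.\d*)?")


def clean_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    text = raw.strip()  # also drops stray CR/LF characters at the ends
    # Remove one pair of matching quotes around the whole value;
    # quotation marks inside the string are left alone.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1].strip()
    if text in MISSING_STRINGS:
        return None
    candidate = text.replace(",", "") if THOUSANDS_RE.fullmatch(text) else text
    if NUMBER_RE.fullmatch(candidate):
        if re.fullmatch(r"[+-]?\d+", candidate):
            return int(candidate)
        return float(candidate)  # handles decimals and scientific notation
    return text


for raw in ['"1,234"', "1.5e3", "NULL", 'say "hi" now', "1,2,3"]:
    print(repr(raw), "->", repr(clean_value(raw)))
```

Values that merely contain commas, such as "1,2,3", are left as strings rather than being collapsed into a number.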

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (a.k.a. patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (a.k.a. tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.
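
The first-pass selection rule can be sketched as follows. This is an illustration, not the actual ETL code: it assumes each input record already lists the feature names that were marked Completed for one participant, it reads the threshold as 20%, and select_features and the study and feature names are hypothetical.

```python
from collections import defaultdict


def select_features(records, min_fraction=0.20):
    """records: iterable of (study, participant_id, set of completed feature names)."""
    participants = defaultdict(set)                 # study -> participant ids
    seen = defaultdict(lambda: defaultdict(set))    # study -> feature -> participant ids
    for study, participant, features in records:
        participants[study].add(participant)
        for feature in features:
            seen[study][feature].add(participant)

    selected = set()
    for study, by_feature in seen.items():
        total = len(participants[study])
        for feature, who in by_feature.items():
            # A feature qualifies if it reaches the threshold in any one study.
            if len(who) / total >= min_fraction:
                selected.add(feature)
    return selected


# Made-up records: 10 participants in STUDY_A and 1 in STUDY_B.
records = []
for i in range(10):
    features = set()
    if i < 1:
        features.update({"rare_field", "very_rare_field"})  # 1 of 10 (10%)
    if i < 2:
        features.add("common_field")                        # 2 of 10 (20%)
    records.append(("STUDY_A", "A-%02d" % i, features))
records.append(("STUDY_B", "B-00", {"rare_field"}))          # 1 of 1 in STUDY_B

print(sorted(select_features(records)))
```

Here rare_field is selected because it passes the threshold in STUDY_B even though it falls short in STUDY_A, while very_rare_field is not selected in any study.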

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.
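
The merge described above can be sketched with the standard library XML parser. The XML below is a heavily simplified, hypothetical stand-in for a clinical file (the real files use namespaces and many more elements), and the assumption that the sequence number is stored in a sequence attribute is made for illustration only.

```python
import xml.etree.ElementTree as ET

# Simplified, made-up patient block with two follow-up sub-blocks.
SAMPLE_XML = """
<patient>
  <gender>FEMALE</gender>
  <vital_status>Alive</vital_status>
  <days_to_last_followup>100</days_to_last_followup>
  <follow_ups>
    <follow_up sequence="1">
      <vital_status>Alive</vital_status>
      <days_to_last_followup>400</days_to_last_followup>
    </follow_up>
    <follow_up sequence="2">
      <vital_status>Dead</vital_status>
      <days_to_death>650</days_to_death>
    </follow_up>
  </follow_ups>
</patient>
"""


def leaf_values(element):
    """Collect tag -> text for the direct children that hold a plain value."""
    values = {}
    for child in element:
        if len(child) == 0 and child.text and child.text.strip():
            values[child.tag] = child.text.strip()
    return values


def merge_patient(xml_text):
    """Merge patient-level values with the most recent follow-up block."""
    patient = ET.fromstring(xml_text.strip())
    merged = leaf_values(patient)
    follow_ups = list(patient.iter("follow_up"))
    if follow_ups:
        latest = max(follow_ups, key=lambda fu: int(fu.get("sequence")))
        merged.update(leaf_values(latest))  # follow-up values take precedence
    return merged


print(merge_patient(SAMPLE_XML))
```

Only the follow-up with the highest sequence number contributes values, so the days_to_last_followup value of 400 from the first follow-up is not used.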

Transforms

Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive
  - days_to_last_known_alive should not be NULL
  - days_to_last_known_alive is set to days_to_last_followup
  - days_to_death is set to NULL

• for all patients with vital_status=Dead
  - days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to "Alive")
  - days_to_last_known_alive is set to days_to_death
  - days_to_last_followup is set to NULL
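The survival-field rules above can be sketched as a small function. This is only an illustration, not the ETL code itself: it assumes each patient is a plain dictionary with None standing in for NULL, and it assumes the "Dead" to "Alive" reassignment is applied before the other rules.

```python
def apply_survival_transforms(record):
    """Return a copy of a patient record with the survival fields completed."""
    rec = dict(record)
    status = rec.get("vital_status")

    # A "Dead" patient with no days_to_death but a known follow-up is
    # treated as "Alive" (assumed to happen before the other rules).
    if (status == "Dead" and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        status = rec["vital_status"] = "Alive"

    if status == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif status == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```

Records with any other vital_status value are returned unchanged in this sketch.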

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined
• country
• history_of_prior_malignancy
• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls
• hpv_status
• mononucleotide_and_dinucleotide_marker_panel_analysis_status
• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
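As an illustration of the BMI step, the sketch below uses the standard formula (weight in kilograms divided by the square of height in metres). The height unit of centimetres matches the field description in the API documentation; the weight unit of kilograms and the function name are assumptions, and any rounding applied by the actual ETL is not known.

```python
def compute_bmi(height_cm, weight_kg):
    """Body-mass index, or None when either input is missing or unusable."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)
```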

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g., TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

XML Parsing

The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g., [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; it was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g., ncbi_build becomes build, chromosome becomes chr, etc.)
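The filtering rules above might be expressed as a single row predicate. The sketch assumes each MAF line has already been parsed into a dictionary keyed by the relabelled column names (build, chr, Mutation_Status, Sequencer); matching Illumina platforms by substring is an assumption, since the exact Sequencer values are not listed here.

```python
# Chromosomes 1-22 plus X and Y, as strings.
ALLOWED_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y"}


def keep_maf_row(row):
    """True when a MAF row passes all four data-cleaning filters."""
    return (
        str(row.get("build")) == "37"
        and str(row.get("chr")) in ALLOWED_CHROMS
        and row.get("Mutation_Status") == "Somatic"
        # Assumption: Illumina sequencers are recognisable by name.
        and "illumina" in str(row.get("Sequencer", "")).lower()
    )
```

A list of parsed rows can then be cleaned with `[r for r in rows if keep_maf_row(r)]`.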

Oncotator Annotation

Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal

Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. To eliminate over-counting of mutations, we removed these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g., TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring, in order:
  - 01, 08, or 14 (all of which refer to broad.mit.edu)
  - 09, 21, or 30 (all of which refer to genome.wustl.edu)
  - 10 or 12 (both of which refer to hgsc.bcm.edu)
  - 13 or 31 (both of which refer to bcgsc.ca)
  - 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25 in zero-based slice notation, i.e., characters 22 through 25)
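One way to express these preference rules is as a sort key over competing calls for the same mutation. The sketch below is not the pipeline's code: the dictionary fields tumor_aliquot and normal_type are hypothetical names, and it assumes the rules are applied strictly in the order listed (normal type, then analyte, then center, then the 4-character plate field).

```python
# Center codes grouped in order of preference, as listed above.
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"),
                 ("10", "12"), ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank
               for rank, group in enumerate(CENTER_GROUPS)
               for code in group}
NORMAL_RANK = {"blood derived normal": 0, "solid tissue normal": 1}
ANALYTE_RANK = {"D": 0, "W": 1}
WORST = 99  # rank given to any value not mentioned in the rules


def preference_key(call):
    """Larger keys are preferred; ranks are negated so that 0 is best."""
    parts = call["tumor_aliquot"].split("-")
    analyte = parts[4][-1]   # e.g. "D" from "11D"
    plate = parts[5]         # 4-character field, e.g. "1534"
    center = parts[6]        # final two characters, e.g. "10"
    return (
        -NORMAL_RANK.get(call["normal_type"].lower(), WORST),
        -ANALYTE_RANK.get(analyte, WORST),
        -CENTER_RANK.get(center, WORST),
        plate,
    )


def pick_preferred(calls):
    """Return the single call to keep among duplicates of one mutation."""
    return max(calls, key=preference_key)
```

The plate field is compared as a string, which orders equal-length digit strings the same way as their numeric values.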

In addition, any exact duplicates (i.e., all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8435 tumor samples from 8373 patients.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g., TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Platform

DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline

DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential scientific notation (e.g., 87E+07) and were interpreted as integer or floating-point values, respectively.
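A minimal sketch of how such values can be parsed, assuming the raw fields arrive as text. The function names are hypothetical, and the sample values in the usage note are illustrative rather than taken from a data file.

```python
def parse_num_probes(text):
    """Parse a probe count that may be written in scientific notation."""
    # int("8.7E+07") raises ValueError, so go through float first.
    return int(float(text))


def parse_segment_mean(text):
    """Parse a segment mean as a floating-point value."""
    return float(text)
```

For example, `parse_num_probes("8.7E+07")` and `parse_num_probes("1234")` both yield integers, while `parse_segment_mean("-1.5E-01")` yields a float.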

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g., TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform

DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline

DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

The BigQuery table is populated only with the files matching the pattern HumanMethylationtxt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
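The merge described above can be sketched as follows. This is an illustration under stated assumptions: rows are dictionaries, the sample key and the platform labels "27k" and "450k" are hypothetical, and missing values are assumed to appear as None, "NA", or "NULL".

```python
MISSING = (None, "NA", "NULL")


def merge_methylation(rows_27k, rows_450k):
    """Combine both platforms, letting 450k win for samples run on both."""
    samples_450k = {row["sample"] for row in rows_450k}
    merged = [dict(row, platform="450k")
              for row in rows_450k
              if row["Beta_Value"] not in MISSING]
    merged += [dict(row, platform="27k")
               for row in rows_27k
               if row["sample"] not in samples_450k
               and row["Beta_Value"] not in MISSING]
    return merged
```

In this sketch precedence is decided per sample, so all 27k rows for a sample are dropped once any 450k data exists for it.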

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g., TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
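The column split described above can be sketched as below. The hgnc_by_gene_id argument is a hypothetical lookup table (numeric gene ID to approved symbol) standing in for whatever HGNC resource the ETL actually used, and keeping gene_id as a string is an assumption.

```python
def split_rsem_gene_id(raw, hgnc_by_gene_id=None):
    """Split e.g. 'TP53|7157' into the three stored columns."""
    symbol, _, gene_id = raw.partition("|")
    lookup = hgnc_by_gene_id or {}
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None when the gene ID has no entry in the lookup table.
        "HGNC_gene_symbol": lookup.get(gene_id),
    }
```

Usage: `split_rsem_gene_id("TP53|7157", {"7157": "TP53"})` returns a dictionary with all three columns filled in.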

BCGSC RNAseq Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g., TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If a component is simply a ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, each referred to using a combination of the microRNA gene name and the unique accession number, e.g., "hsa-mir-21.MIMAT0000076". During ETL this string is split into two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
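To illustrate the "melting" step, the sketch below turns a small matrix into one row per (microRNA, aliquot) pair. It assumes the matrix is held as a nested dictionary and that the gene name and accession are joined by a "." as in the example label; the output column names are hypothetical.

```python
def split_mirna_label(label):
    """Split e.g. 'hsa-mir-21.MIMAT0000076' at its last '.'."""
    name, _, accession = label.rpartition(".")
    return name, accession


def melt_expression_matrix(matrix):
    """Flatten {label: {aliquot_barcode: value}} into tidy rows."""
    rows = []
    for label, per_aliquot in matrix.items():
        name, accession = split_mirna_label(label)
        for aliquot, value in per_aliquot.items():
            rows.append({
                "mirna_name": name,
                "mirna_accession": accession,
                "aliquot": aliquot,
                "value": value,
            })
    return rows
```

Each tidy row carries exactly one measurement, which is the shape a BigQuery table load expects.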

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Coreprotein_expressiontxt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene, and protein map is explained in detail below.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If "protein_name" is missing, generate one from "composite_element_ref"

• Make a map of 'composite_element_ref', 'gene_name', and 'protein_name' values

• Check for any other variant of the gene and protein symbols in the table

• HGNC Validation
  - If the gene symbol is in the HGNC approved symbols: "Approved" Gene_symbol = Gene_symbol
  - If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol
  - If not, check the Previous symbols. If found: Gene_symbol = "Approved" Gene_symbol
  - If not: Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
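The row-duplication rule in the Formatting list can be sketched as follows, assuming each annotation row is a dictionary and that concatenated gene names are separated by whitespace, as in the 'AKT1 AKT2 AKT3' example. The function name is hypothetical.

```python
def expand_gene_rows(row):
    """Return one copy of the row per gene listed in its gene_name value."""
    genes = row["gene_name"].split()
    return [dict(row, gene_name=gene) for gene in genes]
```

All other columns are copied unchanged into each of the resulting rows.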


Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section provides links to existing sources of information elsewhere on the web, or describes additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.


miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)
  - isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables
  - isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015
  - includes Annotations table with information about redacted samples, etc.
  - isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()


Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples is provided in our github repositories, also described below. These illustrate how you can work with the TCGA data using a variety of programming environments, such as Python and R, or grid computing systems, such as GridEngine.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python repository (https://github.com/isb-cgc/examples-Python).

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets, such as CCLE, may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
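The relationship between the two barcode types can be shown with a short helper. This is a minimal sketch for TCGA barcodes only; the function name is hypothetical, and the length check is an assumption about how strictly inputs should be validated.

```python
def participant_barcode(sample_barcode):
    """Return the 12-character participant prefix of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]
```

For example, the sample barcode TCGA-2G-AAM4-10A belongs to participant TCGA-2G-AAM4.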

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCAOV"}' -H "Content-Type: application/json"
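The same request can be prepared from Python using only the standard library. This is a sketch, not an official client: the URL below is reconstructed from the curl example and may no longer be live, the function names are hypothetical, and the response is assumed to be a JSON document.

```python
import json
import urllib.request

# Assumption: endpoint URL reconstructed from the curl example.
PREVIEW_URL = ("https://api-dot-isb-cgc.appspot.com"
               "/_ah/api/cohort_api/v1/preview_cohort")


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filter object."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preview_cohort(filters):
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_preview_request(filters)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it possible to inspect the request without contacting the server.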

Access control

To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion string
age_at_initial_pathologic_diagnosis string
anatomic_neoplasm_subdivision string
avg_percent_lymphocyte_infiltration float
avg_percent_monocyte_infiltration float
avg_percent_necrosis float
avg_percent_neutrophil_infiltration float
avg_percent_normal_cells float
avg_percent_stromal_cells float
avg_percent_tumor_cells float
avg_percent_tumor_nuclei float
batch_number integer
bcr string
clinical_M string
clinical_N string
clinical_stage string
clinical_T string
colorectal_cancer string
country string
country_of_procurement string
days_to_birth integer
days_to_collection integer
days_to_death integer
days_to_initial_pathologic_diagnosis integer
days_to_last_followup integer
days_to_submitted_specimen_dx integer
Study string
ethnicity string
frozen_specimen_anatomic_site string
gender string
height integer
histological_type string
history_of_colon_polyps string
history_of_neoadjuvant_treatment string
history_of_prior_malignancy string
hpv_calls string
hpv_status string
icd_10 string
icd_o_3_histology string
icd_o_3_site string
lymph_node_examined_count integer
lymphatic_invasion string
lymphnodes_examined string
lymphovascular_invasion_present string
max_percent_lymphocyte_infiltration integer
max_percent_monocyte_infiltration integer
max_percent_necrosis integer
max_percent_neutrophil_infiltration integer
max_percent_normal_cells integer
max_percent_stromal_cells integer
max_percent_tumor_cells integer
max_percent_tumor_nuclei integer
menopause_status string
min_percent_lymphocyte_infiltration integer
min_percent_monocyte_infiltration integer
min_percent_necrosis integer
min_percent_neutrophil_infiltration integer
min_percent_normal_cells integer
min_percent_stromal_cells integer
min_percent_tumor_cells integer
min_percent_tumor_nuclei integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status string
mononucleotide_marker_panel_analysis_status string
neoplasm_histologic_grade string
new_tumor_event_after_initial_treatment string
number_of_lymphnodes_examined integer
number_of_lymphnodes_positive_by_he integer
ParticipantBarcode string
pathologic_M string
pathologic_N string
pathologic_stage string
pathologic_T string
person_neoplasm_cancer_status string
pregnancies string
preservation_method string
primary_neoplasm_melanoma_dx string
primary_therapy_outcome_success string
prior_dx string
Project string
psa_value float
race string
residual_tumor string
SampleBarcode string
tobacco_smoking_history string
total_number_of_pregnancies integer
tumor_tissue_site string
tumor_pathology string
tumor_type string
weiss_venous_invasion string
vital_status string
weight integer
year_of_initial_pathologic_diagnosis string
SampleTypeCode string
has_Illumina_DNASeq string
has_BCGSC_HiSeq_RNASeq string
has_UNC_HiSeq_RNASeq string
has_BCGSC_GA_RNASeq string
has_UNC_GA_RNASeq string
has_HiSeq_miRnaSeq string
has_GA_miRNASeq string
has_RPPA string
has_SNP6 string
has_27k string
has_450k string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked, and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer |
Maximum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmax_percent_monocyte_infiltration integer Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen

max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample: tumor or normal tissue, cell, or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False" or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False" or "None"

has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False" or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False" or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False" or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False" or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False" or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False" or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False" or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False" or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False" or "None"

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "patient_count": string,
      "patients": [string],
      "sample_count": string,
      "samples": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
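
Because the counts are returned as strings, client code usually needs to convert them before doing arithmetic. The following is a minimal sketch of that step; the response body and the barcodes in it are made up for illustration, and only the field names are taken from the table above.

```python
import json

# Hypothetical response body; the barcodes are illustrative only.
EXAMPLE_BODY = """
{
  "patient_count": "1",
  "patients": ["TCGA-B9-7268"],
  "sample_count": "2",
  "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-10A"]
}
"""


def summarize_cohort(body):
    """Return (patient_count, sample_count) as integers from a response dict.

    Falls back to the length of the barcode lists when a count is missing.
    """
    patients = body.get("patients", [])
    samples = body.get("samples", [])
    patient_count = int(body.get("patient_count", len(patients)))
    sample_count = int(body.get("sample_count", len(samples)))
    return patient_count, sample_count


cohort = json.loads(EXAMPLE_BODY)
counts = summarize_cohort(cohort)
print(counts)
```

Falling back to the list lengths is a design choice made for this sketch, not something the API description promises.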

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Required. Barcode of the patient to get information about.
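
A request to this endpoint could be assembled as in the sketch below. It assumes the parameter is sent as a query-string argument and that the base URL is the one in the HTTP request line above with its punctuation restored; `build_url` is a hypothetical helper, not part of any ISB-CGC client library. The network call is left commented out so the snippet runs offline.

```python
from urllib.parse import urlencode

# Base URL reconstructed from the HTTP request line (assumption).
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_url(method, **params):
    """Build a GET URL for an endpoint, skipping parameters that are None."""
    query = urlencode({k: v for k, v in params.items() if v is not None})
    if query:
        return f"{BASE_URL}/{method}?{query}"
    return f"{BASE_URL}/{method}"


url = build_url("patient_details", patient_barcode="TCGA-B9-7268")
print(url)

# To actually call the service (network access required):
# import json, urllib.request
# with urllib.request.urlopen(url) as resp:
#     details = json.load(resp)
```

The same helper can be reused for the other GET endpoints in this section by changing the method name and keyword arguments.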

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "clinical_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "Study": string,
        "age_at_initial_pathologic_diagnosis": string,
        "anatomic_neoplasm_subdivision": string,
        "batch_number": string,
        "bcr": string,
        "clinical_M": string,
        "clinical_N": string,
        "clinical_T": string,
        "clinical_stage": string,
        "colorectal_cancer": string,
        "country": string,
        "days_to_birth": string,
        "days_to_initial_pathologic_diagnosis": string,
        "days_to_last_followup": string,
        "ethnicity": string,
        "frozen_specimen_anatomic_site": string,
        "gender": string,
        "histological_type": string,
        "history_of_colon_polyps": string,
        "history_of_neoadjuvant_treatment": string,
        "history_of_prior_malignancy": string,
        "hpv_calls": string,
        "hpv_status": string,
        "icd_10": string,
        "icd_o_3_histology": string,
        "icd_o_3_site": string,
        "lymphatic_invasion": string,
        "lymphnodes_examined": string,
        "lymphovascular_invasion_present": string,
        "menopause_status": string,
        "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
        "mononucleotide_marker_panel_analysis_status": string,
        "neoplasm_histologic_grade": string,
        "new_tumor_event_after_initial_treatment": string,
        "number_of_lymphnodes_examined": string,
        "number_of_lymphnodes_positive_by_he": string,
        "pathologic_M": string,
        "pathologic_N": string,
        "pathologic_T": string,
        "pathologic_stage": string,
        "person_neoplasm_cancer_status": string,
        "pregnancies": string,
        "primary_neoplasm_melanoma_dx": string,
        "primary_therapy_outcome_success": string,
        "prior_dx": string,
        "race": string,
        "residual_tumor": string,
        "tobacco_smoking_history": string,
        "tumor_tissue_site": string,
        "tumor_type": string,
        "vital_status": string,
        "weiss_venous_invasion": string,
        "year_of_initial_pathologic_diagnosis": string
      },
      "samples": []
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American

clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.
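
Since the sample barcode extends the 12-character participant barcode, a client can sanity-check a barcode and recover the participant part before calling the API. The sketch below is one way to do that; the regular expression is inferred only from the example barcodes in this section, so treat it as a loose check and not a full definition of the TCGA barcode format.

```python
import re

# Pattern inferred from the example TCGA-B9-7268-01A (assumption).
SAMPLE_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-[0-9]{2}[A-Z]$")


def is_sample_barcode(barcode):
    """Return True if the string looks like a 16-character sample barcode."""
    return len(barcode) == 16 and SAMPLE_RE.match(barcode) is not None


def participant_from_sample(sample_barcode):
    """Return the 12-character participant barcode prefix of a sample barcode."""
    if not is_sample_barcode(sample_barcode):
        raise ValueError(f"not a sample barcode: {sample_barcode!r}")
    return sample_barcode[:12]


participant = participant_from_sample("TCGA-B9-7268-01A")
print(participant)
```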

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample

biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"

data_details_count | string | Length of data_details list
patient | string | Participant barcode
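
Once a sample_details response has been parsed, the data_details list lends itself to simple summaries. The sketch below counts files per platform and can optionally keep only open-access entries. The response dictionary is hypothetical and heavily trimmed, and the second platform name is made up for illustration; only the field names and the two SecurityProtocol values come from the table above.

```python
from collections import Counter

# Hypothetical, heavily trimmed sample_details response.
example_response = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "SecurityProtocol": "DBGap Open Access"},
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "SecurityProtocol": "DBGap Protected Access"},
        {"Platform": "ExamplePlatform",  # made-up platform name
         "SecurityProtocol": "DBGap Open Access"},
    ],
}


def files_per_platform(response, open_access_only=False):
    """Count data_details entries per Platform value."""
    details = response.get("data_details", [])
    if open_access_only:
        details = [d for d in details
                   if d.get("SecurityProtocol") == "DBGap Open Access"]
    return Counter(d.get("Platform", "unknown") for d in details)


all_files = files_per_platform(example_response)
open_files = files_per_platform(example_response, open_access_only=True)
print(all_files, open_files)
```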

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. User must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
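
Two details of this response are easy to miss in client code: the datafilenamekeys field may be absent altogether, and some entries may carry the "file-path-not-yet-available" placeholder. A minimal sketch that handles both follows; the bucket names and paths are invented for illustration, and checking for the placeholder as a substring is an assumption about how such entries are formatted.

```python
PLACEHOLDER = "file-path-not-yet-available"


def usable_paths(response):
    """Return the file paths that can be used right away.

    A missing datafilenamekeys field is treated as an empty list, and
    entries that contain the placeholder text are dropped.
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical response; bucket and file names are illustrative only.
example_response = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-open-bucket/path/to/file.txt",
        "gs://example-open-bucket/file-path-not-yet-available",
    ],
}

paths = usable_paths(example_response)
print(paths)
```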

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful this method returns a response body with the following structure

kind cohort_apicohortsItemitems [

count stringSampleBarcode stringGG_dataset_id stringGG_readgroupset_id string

]

Property name                 Value                     Description
kind                          cohort_api#cohortsItem    The resource type.
count                         string                    The number of items returned. Count will be either “0” or “1”.
items[]                       list                      If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode         string                    The sample barcode passed into the request.
items[].GG_dataset_id         string                    The dataset id of the sample.
items[].GG_readgroupset_id    string                    The readgroupset id of the sample.
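As an illustration of how a script might call this method and read the result, here is a minimal Python sketch. It assumes that the endpoint URL is https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample, that sample_barcode is passed as a query-string parameter, and that the response has been decoded from JSON. The function names, the barcode, and the ids are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = ("https://api-dot-isb-cgc.appspot.com/_ah/api/"
            "cohort_api/v1/google_genomics_from_sample")

def build_request_url(sample_barcode):
    """Build the GET URL, assuming sample_barcode is a query-string parameter."""
    return BASE_URL + "?" + urlencode({"sample_barcode": sample_barcode})

def extract_ids(response):
    """Return (dataset id, readgroupset id), or None when nothing was found."""
    items = response.get("items", [])
    if response.get("count") != "1" or not items:
        return None
    return items[0]["GG_dataset_id"], items[0]["GG_readgroupset_id"]

# Hypothetical barcode and decoded response, for illustration only.
url = build_request_url("TCGA-XX-0000-01A")
example = {
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [{
        "SampleBarcode": "TCGA-XX-0000-01A",
        "GG_dataset_id": "example-dataset-id",
        "GG_readgroupset_id": "example-readgroupset-id",
    }],
}
print(url)
print(extract_ids(example))
```

Because count is returned as a string, the sketch compares it against "1" rather than the integer 1.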


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components”, and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
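If you want to check component versions from a script rather than by eye, the version listing can be parsed into a dictionary. This is a minimal sketch, assuming output shaped like the listing above (a “Google Cloud SDK” line followed by one component per line); the function name is hypothetical and the sample text is an illustrative reading of that listing.

```python
def parse_sdk_versions(text):
    """Map component name -> version string (None when no version is shown)."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        if line.startswith("Google Cloud SDK"):
            versions["Google Cloud SDK"] = parts[-1]
        else:
            versions[parts[0]] = parts[1] if len(parts) > 1 else None
    return versions

sample = """Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
"""
versions = parse_sdk_versions(sample)
print(versions["Google Cloud SDK"], versions["gsutil"])
```

To parse live output instead of the sample, the text could come from subprocess.run(["gcloud", "--version"], capture_output=True, text=True).stdout.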

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
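When several of these properties need to be set at once, it can help to generate the gcloud config set commands from a single table of desired values. The sketch below only builds and prints the commands; the function name and the project, region, and zone values are hypothetical placeholders.

```python
import shlex

def config_set_commands(properties):
    """Return one `gcloud config set` argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in properties.items()]

# Hypothetical values; substitute your own project and preferred zone.
desired = {
    "project": "my-gcp-project",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}
commands = config_set_commands(desired)
for argv in commands:
    print(shlex.join(argv))
```

Each argument list can be passed to subprocess.run(argv, check=True) to actually apply the setting on a machine where the Cloud SDK is installed.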


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, Red Hat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property: gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM: gcloud compute instances create my-instance --machine-type g1-small
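If you launch VMs from a script, it can be convenient to assemble the create command programmatically so that the zone is only included when you want to override the configured default. This is a minimal sketch; the function name is hypothetical, and only the --machine-type and --zone options mentioned in this guide are used.

```python
import shlex

def create_instance_argv(name, machine_type, zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # Omit --zone to fall back on the compute/zone config property.
        argv += ["--zone", zone]
    return argv

argv = create_instance_argv("my-instance", "g1-small", zone="us-central1-a")
print(shlex.join(argv))
```

The sketch only prints the command; running it would require the Cloud SDK and would create a billable resource.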

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
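The two persistent-disk figures quoted above ($2 for 50 GB and $40 for 1 TB) both correspond to a rate of $0.04 per GB per month. The following sketch turns that into a small estimator; the rate is inferred from the figures in this guide and is an assumption for illustration, not a current price, and the function name is hypothetical.

```python
# Rate implied by the figures in the text ($2 for 50 GB, $40 for 1 TB);
# an assumption for illustration, not a current price.
USD_PER_GB_MONTH = 0.04

def monthly_disk_cost(size_gb, rate=USD_PER_GB_MONTH):
    """Estimated monthly cost in USD of a standard persistent disk."""
    return size_gb * rate

for size_gb in (50, 1000):
    print(f"{size_gb} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```

An estimate like this can help when deciding whether to keep a stopped VM's disk or delete the VM outright.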


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

The instance name (my-instance) is a positional argument; the optional --device-name flag sets the name under which the guest OS sees the disk. Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. On the instance, an attached disk appears under /dev/disk/by-id/ as google- followed by its device name; the commands below assume the device name disk-1:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
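The disk lifecycle described in this section (create, attach, use, detach, delete) can be summarized as an ordered list of gcloud commands. The sketch below only assembles and prints those commands so the ordering is explicit; the function name is hypothetical, and the format, mount, and unmount steps are deliberately left as a comment because they run on the instance itself rather than through gcloud.

```python
import shlex

def disk_lifecycle(instance, disk, size):
    """Return the gcloud commands for a disk's lifecycle, in order."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance,
         "--disk", disk],
        # Format, mount, use, and unmount the disk on the instance here.
        ["gcloud", "compute", "instances", "detach-disk", instance,
         "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]

steps = disk_lifecycle("my-instance", "disk-1", "500GB")
for argv in steps:
    print(shlex.join(argv))
```

Keeping the detach step before the delete step matters, since a disk can be deleted only when no VM instance is using it.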


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
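To make the semantics of the three set operations concrete, here is a minimal Python sketch that models each cohort as a set of sample identifiers. It illustrates the behaviour described above and is not the web application's implementation; the function names and sample identifiers are hypothetical.

```python
from functools import reduce

def union(cohorts):
    """Samples present in at least one cohort."""
    return reduce(set.union, cohorts, set())

def intersect(cohorts):
    """Samples present in every cohort."""
    return reduce(set.intersection, cohorts) if cohorts else set()

def complement(base, others):
    """Samples in the base cohort after subtracting all other cohorts."""
    return base - union(others)

# Hypothetical sample identifiers.
a = {"s1", "s2", "s3"}
b = {"s2", "s3", "s4"}
c = {"s3", "s5"}
print(sorted(union([a, b, c])))
print(sorted(intersect([a, b, c])))
print(sorted(complement(a, [b, c])))
```

Union and intersect give the same result for any ordering of their inputs, while complement depends on which cohort is chosen as the base.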

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
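The counts shown beside each filter value can be thought of as follows: for a given attribute, count its values among the samples that pass every other selected filter, ignoring any selection on that attribute itself. The sketch below illustrates this idea under that reading of the description; it is not the web application's code, and the function name, attribute names, and sample records are hypothetical.

```python
from collections import Counter

def facet_counts(samples, attribute, selected):
    """Count values of `attribute` among samples matching all other filters."""
    counts = Counter()
    for sample in samples:
        matches = all(sample.get(name) in allowed
                      for name, allowed in selected.items()
                      if name != attribute)
        if matches:
            counts[sample.get(attribute)] += 1
    return counts

# Hypothetical samples and attribute names.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": None, "gender": "FEMALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
counts = facet_counts(samples, "vital_status", selected)
print(dict(counts))
```

Excluding the attribute's own selection is what lets the other values of that attribute keep meaningful counts after one of them has been chosen.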

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
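The number shown when hovering on a swatch in the parallel sets graph is a count of samples per pair of platforms on two adjacent data types. The following sketch shows how such counts could be computed, assuming each sample is represented as a mapping from data type to platform, with “NA” standing in for a missing data type. The function name, platform names, and samples are hypothetical.

```python
from collections import Counter

def platform_pair_counts(samples, type_a, type_b):
    """Count samples per (platform for type_a, platform for type_b) pair."""
    return Counter((s.get(type_a, "NA"), s.get(type_b, "NA")) for s in samples)

# Hypothetical samples: data type -> platform that produced it.
samples = [
    {"DNA Methylation": "platform-A", "Protein": "platform-X"},
    {"DNA Methylation": "platform-A"},
    {"DNA Methylation": "platform-B", "Protein": "platform-X"},
    {"DNA Methylation": "platform-A", "Protein": "platform-X"},
]
counts = platform_pair_counts(samples, "DNA Methylation", "Protein")
print(counts[("platform-A", "platform-X")])
```

Each key in the result corresponds to one swatch between the two vertical bars.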

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest on the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These vary because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
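To see why the sample and participant counts can differ, note that several sample barcodes can belong to the same participant. The sketch below assumes TCGA-style barcodes in which the first three hyphen-separated fields identify the participant; the function name and barcodes are hypothetical.

```python
def participant_barcode(sample_barcode):
    """First three hyphen-separated fields of a TCGA sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])

# Hypothetical barcodes: two samples from one participant, one from another.
cohort = ["TCGA-AA-0001-01A", "TCGA-AA-0001-11A", "TCGA-BB-0002-01A"]
participants = {participant_barcode(b) for b in cohort}
print(len(cohort), "samples from", len(participants), "participants")
```

Collecting the participant barcodes into a set removes the duplicates that arise when one participant contributes several samples.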

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
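Once downloaded, the file list can be processed with any CSV reader. The following sketch selects the cloud storage paths for one platform. It assumes a header row with columns named "Sample Barcode", "Platform", "Pipeline", "Data Level", and "File Path"; the actual header names in the downloaded file may differ, and the function name, platforms, and paths shown are hypothetical.

```python
import csv
import io

def paths_for_platform(csv_text, platform):
    """Return the cloud storage paths of rows matching one platform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["File Path"] for row in reader if row["Platform"] == platform]

# Hypothetical file contents; the real header names may differ.
example_csv = (
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "TCGA-AA-0001-01A,platform-A,pipeline-1,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-AA-0001-01A,platform-B,pipeline-2,Level 3,gs://example-bucket/b.txt\n"
)
print(paths_for_platform(example_csv, "platform-A"))
```

For a file on disk, the same reader can be built from open(path, newline="") instead of io.StringIO.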

Commenting

Any user who owns a cohort or has had a cohort shared with them can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the "All TCGA Data" cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot:

• Clinical

  - Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Platform Filter: filters down the plot-able features by platforms.

  - Center Filter: filters down the plot-able features by processing center.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• miRNA

  - miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  - Platform Filter: filters down the plot-able features by platforms.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Methylation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  - Platform Filter: filters down the plot-able features by platforms.

  - Gene Region Filter: filters down the plot-able features by specific gene regions.

  - CpG Island Region Filter: filters down the plot-able features by CpG Island region.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Copy Number

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Protein

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Mutation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by mutation value.

  - Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualization. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  - You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.
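
The Owner and Reader rules above can be sketched as a small model. This is only an illustration of the behaviour described in this section, not the web-app's implementation; the `Cohort` class, its method names, and the example email addresses are all hypothetical.

```python
from dataclasses import dataclass, field

OWNER = "owner"
READER = "reader"


@dataclass
class Cohort:
    """Hypothetical model of a cohort and its sharing permissions."""
    name: str
    owner: str
    samples: list
    readers: set = field(default_factory=set)
    comments: list = field(default_factory=list)

    def role_of(self, user):
        """Return OWNER, READER, or None if the user has no access."""
        if user == self.owner:
            return OWNER
        if user in self.readers:
            return READER
        return None

    def share(self, by, with_user):
        # Only the owner may share.
        if self.role_of(by) != OWNER:
            raise PermissionError("only the owner can share a cohort")
        self.readers.add(with_user)

    def rename(self, by, new_name):
        # Readers have view-only access, so edits are owner-only.
        if self.role_of(by) != OWNER:
            raise PermissionError("readers have view-only access")
        self.name = new_name

    def comment(self, by, text):
        # Owners and readers may both comment.
        if self.role_of(by) is None:
            raise PermissionError("no access to this cohort")
        self.comments.append((by, text))

    def clone(self, by):
        # The copy carries only the sample list; the person who
        # cloned it becomes the owner of the copy.
        if self.role_of(by) is None:
            raise PermissionError("no access to this cohort")
        return Cohort(name=f"Copy of {self.name}", owner=by,
                      samples=list(self.samples))


cohort = Cohort("Example cohort", "alice@example.org", ["S1", "S2"])
cohort.share("alice@example.org", "bob@example.org")
bobs_copy = cohort.clone("bob@example.org")
bobs_copy.rename("bob@example.org", "Bob's cohort")
```

A Reader who calls `rename` on the shared cohort gets a `PermissionError`, while the same call on their own clone succeeds.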

1.4.8 Release Notes

• December 23, 2015: v0.2

  - Treemap graphs on the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  - Cohort file list download not working.

• December 3, 2015: v0.1

  - First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient, we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
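
The same tables can also be reached from Python. The sketch below only shows the general shape of such a query and makes several assumptions. The dataset, table, and column names (`example_dataset`, `example_table`, `Study`) are placeholders, not real ISB-CGC names. `run_query` assumes the `google-cloud-bigquery` client library is installed and that you pass the ID of your own Google Cloud Project, which is billed for the query. Only the project ID "isb-cgc" comes from the text above.

```python
ISB_CGC_PROJECT = "isb-cgc"  # project ID given in the answer above


def table_ref(dataset, table, project=ISB_CGC_PROJECT):
    """Return a fully qualified standard-SQL table reference."""
    return f"`{project}.{dataset}.{table}`"


def build_count_query(dataset, table, group_by):
    """Build a query that counts rows per value of one column."""
    return (
        f"SELECT {group_by}, COUNT(*) AS n "
        f"FROM {table_ref(dataset, table)} "
        f"GROUP BY {group_by} ORDER BY n DESC"
    )


def run_query(sql, billing_project):
    """Run the query, billed to your own Google Cloud Project.

    The import is done lazily so that the query builder above can be
    used without the client library installed.
    """
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names; substitute a real dataset, table, and column.
sql = build_count_query("example_dataset", "example_table", "Study")
print(sql)
```

Calling `run_query(sql, "your-project-id")` would then execute the query, provided your Google identity is a member of that project.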

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
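
The last two answers describe two rules: an eRA Commons id can be linked to only one Google identity at a time, and verified access lasts 24 hours. A minimal sketch of those rules follows. The `IdentityLinks` class and its methods are hypothetical and do not reflect how the platform stores this information; the sketch also assumes that dbGaP verification succeeds at the moment of linking.

```python
from datetime import datetime, timedelta

ACCESS_WINDOW = timedelta(hours=24)  # access duration stated above


class IdentityLinks:
    """Hypothetical registry of eRA Commons id to Google identity links."""

    def __init__(self):
        self._google_by_era = {}  # eRA Commons id -> Google identity
        self._verified_at = {}    # Google identity -> time of verification

    def link(self, era_id, google_id, now):
        # An eRA Commons id may be linked to one Google identity only.
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_id:
            raise ValueError(f"{era_id} is already linked; unlink it first")
        self._google_by_era[era_id] = google_id
        self._verified_at[google_id] = now

    def unlink(self, era_id):
        # Unlinking also removes the controlled-data access.
        google_id = self._google_by_era.pop(era_id, None)
        if google_id is not None:
            self._verified_at.pop(google_id, None)

    def has_controlled_access(self, google_id, now):
        verified = self._verified_at.get(google_id)
        return verified is not None and now - verified < ACCESS_WINDOW


t0 = datetime(2016, 3, 3, 9, 0)
links = IdentityLinks()
links.link("era_user", "old@example.org", t0)
links.unlink("era_user")                       # required before re-linking
links.link("era_user", "new@example.org", t0)
```

Skipping the `unlink` step makes the second `link` call raise a `ValueError`, which mirrors the requirement to unlink the previous Google identity first.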


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For the programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

CHAPTER 1

Contents

1.1 About the ISB Cancer Genomics Cloud

The ISB-CGC provides interactive and programmatic access to the TCGA data, leveraging many aspects of the Google Cloud Platform including BigQuery, Compute Engine, App Engine, Cloud Datalab, and Google Genomics. Open-access clinical and biospecimen information for all TCGA patients and samples, combined with the Level-3 TCGA data and genomic reference and platform-annotation sources, are stored in BigQuery, enabling fast SQL-like queries against the entire dataset. Controlled-access DNA and RNA sequence data is available to dbGaP-authorized users in the original BAM and FASTQ file formats.

The ISB-CGC aims to serve the needs of a broad range of cancer researchers, ranging from scientists or clinicians who prefer to use an interactive web-based application to access and explore the rich TCGA dataset, to computational scientists who want to write their own custom scripts using languages such as R or Python, accessing the data through APIs, to algorithm developers who want to spin up thousands of virtual machines to rapidly analyze hundreds of terabytes of sequence data. The ISB-CGC allows scientists to interactively define and compare cohorts, examine the underlying molecular data for specific genes or pathways of interest, and share insights with collaborators around the globe.

1.2 Cloud-Hosted Data Sets

The ISB-CGC platform hosts the majority of the TCGA data set, as well as other reference and annotation datasets, in different appropriate Google Cloud technologies.

1.2.1 About the TCGA Data

The ISB-CGC hosts approximately 1 petabyte of TCGA data in Google Cloud Storage (GCS) and in BigQuery.

The data being hosted by the ISB-CGC was obtained from the two main TCGA data repositories:

• TCGA DCC: the TCGA Data Coordinating Center, which provides a Data Portal from which users may download open-access or controlled-access data. This portal provides access to all TCGA data except for the low-level sequence data.

• CGHub: the Cancer Genomics Hub is NCI's current secure data repository for all TCGA BAM and FASTQ sequence data files.

The ISB-CGC platform is one of NCI's Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level ("Level 1") data has been processed through a set of standardized pipelines, and the resulting high-level ("Level 3") data is frequently the data that is used in most downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset including:

• DNA sequencing of tumor samples and matched-normals (typically blood samples) in order to detect somatic mutations

• SNP array based DNA copy-number and genotyping analysis of tumor samples and matched-normals

• DNA methylation of tumor samples

• messenger RNA (mRNA) expression analysis of the tumor samples to capture the gene expression profile

• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse phase protein array).

Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data, there are typically three levels of data. Level 1 typically represents raw, un-normalized data. Level 2 typically represents an intermediate level of processing and/or normalization of the data. Level 3 typically represents aggregated, normalized, and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as "Level 4" data. More information about Data Level Classification can be found on the NCI wiki.

Understanding the TCGA Data Platforms

When working with any of the data types, it is important to also be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained using first the Illumina HumanMethylation27 platform and later using the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.
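
One simple way to take the platform into account is to split a cohort's measurements by platform before combining or comparing them. The sketch below assumes each record carries a `platform` field. The record layout, sample IDs, and values are made up for illustration and do not correspond to an ISB-CGC table schema.

```python
from collections import defaultdict


def split_by_platform(records):
    """Group measurement records by the platform that produced them."""
    groups = defaultdict(list)
    for record in records:
        groups[record["platform"]].append(record)
    return dict(groups)


# Hypothetical records; sample IDs and beta values are invented.
records = [
    {"sample": "S1", "platform": "HumanMethylation27", "beta": 0.12},
    {"sample": "S2", "platform": "HumanMethylation450", "beta": 0.58},
    {"sample": "S3", "platform": "HumanMethylation450", "beta": 0.61},
]

by_platform = split_by_platform(records)
for platform, rows in sorted(by_platform.items()):
    print(platform, len(rows))
```

Each group can then be analyzed on its own, or the platform can be carried along as a covariate, depending on the analysis.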


TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these nice dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis

Understanding Data Access

• Public Data: Sometimes the word "public" is misinterpreted as meaning "open". All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.
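
The distinction between open and controlled data can be summarized as a lookup plus an authorization check. The sketch below is illustrative only. The data-type labels are hypothetical short names for the categories listed above, the "small amount of other data" is not modelled, and the function names are not part of any ISB-CGC API.

```python
# Hypothetical labels for the categories named in the list above.
OPEN_TYPES = {
    "clinical", "biospecimen", "gene_expression", "dna_methylation",
    "dna_copy_number", "protein_expression", "somatic_mutation_calls",
}
CONTROLLED_TYPES = {
    "dna_seq", "rna_seq", "snp_array_cel", "germline_mutation_calls",
}


def access_class(data_type):
    """Return "open" or "controlled" for a known data-type label."""
    if data_type in OPEN_TYPES:
        return "open"
    if data_type in CONTROLLED_TYPES:
        return "controlled"
    raise ValueError(f"unknown data type: {data_type}")


def can_access(data_type, authenticated=False, dbgap_authorized=False):
    """Open data is available to everyone; controlled data needs both checks."""
    if access_class(data_type) == "open":
        return True
    return authenticated and dbgap_authorized


print(can_access("gene_expression"))
print(can_access("rna_seq", authenticated=True, dbgap_authorized=False))
```

Note that every type in both sets is public data in the sense described above; the check only separates open from controlled access.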


Data Use Certification (DUC)

Investigator(s) requesting and receiving Genomic data in accordance with the NIH Genomic Data Sharing (GDS)Policy are expected to

bull Submit a description of the proposed research project

bull Submit a data access request (DAR) the DAR requires NIH log-in

Note Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR Visitelectronic Research Administration (eRA) for more information on registering for a NIH eRA account NIH staff mayutilize their NIH log-in (See additional instructions at Data Access Request Instructions) dbGap Data Access RequestPortal

Additionally they must

bull Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoringinstitution Sample DUC form

12 Cloud-Hosted Data Sets 5

ISB Cancer Genomics Cloud Documentation Release 100

• Protect data confidentiality. (Any data used in a study that was initially listed as "Controlled" or TCGA remains controlled data and MUST be protected accordingly, unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

Project 1 has only users authorized to access Controlled data, and a DUC in place.

Project 2 has only open-access users: NO Controlled Data Access is allowed from or by Project 2 member(s).

Remember: YOU and YOUR institution are accountable for ensuring the security of this data, not the cloud service provider. Securing and protecting controlled data should be thought of in the same manner as for any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs) and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: The Cloud, dbGaP and the NIH, blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations; and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs. controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC;

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC;

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC;

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC;

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC; and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager.

In Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

In BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our github repositories.

This introductory tutorial gives an overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC)

• MAF files: these "protected" mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types)

• DNA-seq BAM files: these Level-1 data files were provided by CGHub

– over 37,000 of these files are available in Google Cloud Storage (GCS)

– roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

– BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub

– over 13,000 mRNA-seq BAM files are available in GCS

– over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files

In Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

In Google Genomics: In the future, BAM and VCF data will also be available in other forms, in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more "cloud-aware" approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions of an archive. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.

Note that all "bio" archives (containing clinical, biospecimen and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly accessible BigQuery dataset isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__ and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
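The formatting rules listed above can be sketched as a small value normalizer. This is only an illustration, not the ISB-CGC ETL code: the function name `normalize_value` is hypothetical, `<blank>` is modelled as an empty string, and the pattern used to recognize comma separators is an assumption.

```python
import re

# Strings treated as missing, taken from the notes above.
# "<blank>" is modelled as the empty string (an assumption).
MISSING_STRINGS = {"none", "None", "NONE", "null", "Null", "NULL",
                   "NA", "__UNKNOWN__", ""}

_THOUSANDS = re.compile(r"^[+-]?\d{1,3}(,\d{3})+(\.\d+)?$")
_INT = re.compile(r"^[+-]?\d+$")
_FLOAT = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def normalize_value(raw):
    """Return None for missing values, int/float for numbers, else the cleaned string."""
    text = raw.replace("\r", "").strip()  # drop CTRL-M and surrounding whitespace
    # Remove one pair of surrounding quotes; quotes inside the string are kept.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    if text in MISSING_STRINGS:
        return None
    # Comma separators are dropped only when they look like thousands separators.
    number = text.replace(",", "") if _THOUSANDS.match(text) else text
    if _INT.match(number):
        return int(number)
    if _FLOAT.match(number):
        return float(number)  # also handles scientific notation
    return text
```

With this sketch, `normalize_value("NULL")` yields `None` and a quoted `"1,234"` is stored as the integer 1234, while a value with embedded quotation marks is left unchanged.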

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (aka patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (aka tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).
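The first-pass selection rule described above (a feature qualifies if it is present in a large enough fraction of the participants of at least one study) can be sketched as follows. The function name, the input layout and the default threshold of 0.20 are assumptions made for illustration; the manual-curation pass is not modelled.

```python
from collections import defaultdict


def select_features(records, min_fraction=0.20):
    """records: iterable of (study, participant_id, set of completed feature names)."""
    participants = defaultdict(set)                    # study -> participant ids
    carriers = defaultdict(lambda: defaultdict(set))   # study -> feature -> participant ids
    for study, participant_id, features in records:
        participants[study].add(participant_id)
        for feature in features:
            carriers[study][feature].add(participant_id)

    selected = set()
    for study, by_feature in carriers.items():
        study_size = len(participants[study])
        for feature, ids in by_feature.items():
            # Keep the feature if it is frequent enough in this one study.
            if len(ids) >= min_fraction * study_size:
                selected.add(feature)
    return selected
```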

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive

– days_to_last_known_alive should not be NULL

– days_to_last_known_alive is set to days_to_last_followup

– days_to_death is set to NULL

• for all patients with vital_status=Dead

– days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to "Alive")

– days_to_last_known_alive is set to days_to_death

– days_to_last_followup is set to NULL
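The survival-field rules above can be written as a short function. This is a sketch under stated assumptions: the function name is hypothetical, patients are plain dictionaries, and a patient who is reclassified from "Dead" to "Alive" is assumed to then follow the "Alive" rules (the text does not say so explicitly).

```python
def apply_survival_transforms(patient):
    """Return a copy of the patient record with the survival fields completed."""
    p = dict(patient)
    # A "Dead" patient without days_to_death but with follow-up is reclassified.
    if (p.get("vital_status") == "Dead"
            and p.get("days_to_death") is None
            and p.get("days_to_last_followup") is not None):
        p["vital_status"] = "Alive"

    if p.get("vital_status") == "Alive":
        p["days_to_last_known_alive"] = p.get("days_to_last_followup")
        p["days_to_death"] = None
    elif p.get("vital_status") == "Dead":
        p["days_to_last_known_alive"] = p.get("days_to_death")
        p["days_to_last_followup"] = None
    return p
```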

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
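As an illustration of the BMI step, the standard formula is weight in kilograms divided by the square of height in meters. The sketch below assumes that weight is recorded in kilograms and height in centimeters; those units, and the function name, are assumptions rather than something stated in this section.

```python
def compute_bmi(weight_kg, height_cm):
    """Return BMI in kg/m^2, or None when either input is missing or invalid."""
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)
```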

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed.

• Patients or Samples for which the Project value was not TCGA were removed.

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0, 1, 2, 3, 4+]).

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field.

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

Each MAF file first went through some simple data-cleaning, was then annotated using Oncotator, and was then further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
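The data-cleaning filters above can be sketched as a single pass over MAF rows held as dictionaries. This is an illustration only: the function name is hypothetical, just the two renames mentioned in the text are included, the renaming is applied before filtering for simplicity, and matching "Illumina" as a case-insensitive substring of the Sequencer value is an assumption.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Only the two renames named in the text; the full mapping is not given there.
COLUMN_RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Rename columns and keep build-37, chr 1-22/X/Y, somatic, Illumina rows."""
    cleaned = []
    for row in rows:
        row = {COLUMN_RENAMES.get(key, key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in VALID_CHROMOSOMES:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        cleaned.append(row)
    return cleaned
```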

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating-center (the final two characters in the aliquot barcode), preferring first

– 01, 08 or 14 (all of which refer to broad.mit.edu)

– 09, 21 or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8435 tumor samples from 8373 patients.
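One way to read the duplicate-removal rules above is as a lexicographic preference: normal type first, then analyte, then center, then the 4-character plate field. The sketch below implements that reading. The dictionary keys (`tumor_aliquot`, `normal_type`, `chr`, `position`) and function names are hypothetical, the strict ordering of the rules is an interpretation, and the plate field is compared as a string.

```python
# Center codes in order of preference, as listed above.
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"),
                 ("10", "12"), ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank for rank, codes in enumerate(CENTER_GROUPS) for code in codes}
NORMAL_RANK = {"blood derived normal": 0, "solid tissue normal": 1}


def preference_key(call):
    """Larger keys are preferred."""
    barcode = call["tumor_aliquot"]
    normal_rank = NORMAL_RANK.get(call["normal_type"].lower(), len(NORMAL_RANK))
    center_rank = CENTER_RANK.get(barcode[-2:], len(CENTER_GROUPS))
    return (
        -normal_rank,                    # blood derived normal first
        1 if barcode[19] == "D" else 0,  # "D" analyte over "W"
        -center_rank,                    # centers in the listed order
        barcode[21:25],                  # larger plate field wins
    )


def remove_duplicate_calls(calls):
    """Keep one call per (tumor sample, chromosome, position)."""
    best = {}
    for call in calls:
        site = (call["tumor_aliquot"][:16], call["chr"], call["position"])
        if site not in best or preference_key(call) > preference_key(best[site]):
            best[site] = call
    return list(best.values())
```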

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
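Because participant, sample and aliquot barcodes are nested prefixes of one another, the coarser identifiers can be recovered from an aliquot barcode by slicing. A minimal sketch, using the prefix lengths 12 and 16 given for participants and samples in the Clinical and Biospecimen sections (the function name is hypothetical):

```python
def barcode_levels(aliquot_barcode):
    """Split an aliquot barcode into participant, sample and aliquot identifiers."""
    return {
        "participant": aliquot_barcode[:12],  # e.g. TCGA-04-1517
        "sample": aliquot_barcode[:16],       # e.g. TCGA-04-1517-01A
        "aliquot": aliquot_barcode,
    }
```

This kind of prefix is what allows aliquot-level tables to be joined against the per-participant Clinical table or the per-sample Biospecimen table.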

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using Exponential Scientific Notation (e.g. 87E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data.

ETL Details: The BigQuery table is populated only with the files matching the pattern *HumanMethylation*.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
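The merge rules above (450k data takes precedence for samples run on both platforms, and probes without a Beta_Value are dropped) can be sketched as follows. The row layout, the `sample` key and the platform labels "27k" and "450k" are assumptions chosen for the example.

```python
def merge_methylation(rows):
    """rows: list of dicts with keys sample, platform, Probe_Id and Beta_Value."""
    rows = list(rows)
    # Samples that have any 450k data; their 27k rows are discarded.
    samples_with_450k = {r["sample"] for r in rows if r["platform"] == "450k"}
    merged = []
    for r in rows:
        if r["Beta_Value"] in (None, "NA", "NULL"):
            continue  # probes without a value are left out
        if r["platform"] == "27k" and r["sample"] in samples_with_450k:
            continue  # 450k takes precedence for this sample
        merged.append({
            "sample": r["sample"],
            "platform": r["platform"],
            "Probe_Id": r["Probe_Id"],
            "Beta_Value": r["Beta_Value"],
        })
    return merged
```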

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern *.rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
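The split described above can be sketched in a few lines. The function name and the `hgnc_lookup` dictionary (mapping a gene id to its current approved symbol) are hypothetical stand-ins for whatever lookup the pipeline actually uses.

```python
def split_rsem_gene_id(raw, hgnc_lookup):
    """Split e.g. 'TP53|7157' into the three stored columns."""
    symbol, gene_id = raw.split("|", 1)
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None when the gene id has no entry in the lookup table.
        "HGNC_gene_symbol": hgnc_lookup.get(gene_id),
    }
```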

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern *.gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id and, if there is a third component, a gene_addenda column. If one component is simply a placeholder string, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
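A minimal sketch of the two-or-three-way split follows. The function name is hypothetical, and treating an empty component as NULL is an assumption made for this example; the text does not say whether suffixes such as `_calculated` are stripped, so they are left in place here.

```python
def split_bcgsc_gene(raw):
    """Split e.g. 'Mir_1302||3of7_calculated' into up to three columns."""
    parts = [part if part else None for part in raw.split("|")]
    parts += [None] * (3 - len(parts))  # pad a two-part string with a NULL addendum
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2],
    }
```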

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern *.isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
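Melting the expression matrix into one row per microRNA and aliquot, while splitting the combined label into its two parts, can be sketched as below. The nested-dictionary input layout, the output column names and the use of a "." between gene name and accession are assumptions made for the example.

```python
def melt_mirna_matrix(matrix):
    """matrix: {"<mirna>.<accession>": {aliquot_barcode: value, ...}, ...}"""
    rows = []
    for label, by_aliquot in matrix.items():
        # Split on the last "." so the gene name and accession are stored separately.
        mirna_id, accession = label.rsplit(".", 1)
        for aliquot, value in by_aliquot.items():
            rows.append({
                "mirna_id": mirna_id,
                "mirna_accession": accession,
                "aliquot": aliquot,
                "value": value,
            })
    return rows
```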

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "*_RPPA_Core.protein_expression*.txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene and protein map is generated.

Generation of Composite_element_ref, gene and protein name map (Manual Curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If "protein_name" is missing, generate one from "composite_element_ref"

• Make a map of 'composite_element_ref', 'gene_name' and 'protein_name' values

• Check for any other variant of the gene and protein symbols in the table

• HGNC Validation

– If the gene symbol is in the HGNC 'Approved' symbols, Gene_symbol = Gene_symbol

– If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol

– If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol

– If not, Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm
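The HGNC validation cascade above can be sketched as a lookup function. The function and parameter names are hypothetical, and the three symbol collections stand in for data that would come from HGNC. As in the text, a symbol found among the aliases is kept unchanged, so only previous symbols are actually rewritten.

```python
def validate_gene_symbol(symbol, approved, aliases, previous_to_approved):
    """Return the gene symbol to store after the HGNC checks."""
    if symbol in approved:
        return symbol                        # already an approved symbol
    if symbol in aliases:
        return symbol                        # Gene_symbol = Alias_symbol
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # replaced by the approved symbol
    return symbol                            # unknown symbols are left as-is
```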

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row.

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.

• Data from both Illumina GA and HiSeq platforms are stored in the same table.
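The row-duplication rule in the first formatting item can be sketched as follows. The function name is hypothetical, and splitting the concatenated value on whitespace, commas or semicolons is an assumption about how the genes are separated.

```python
import re


def expand_gene_rows(row):
    """Return one copy of the row per gene found in its 'gene_name' value."""
    genes = [g for g in re.split(r"[\s,;]+", row["gene_name"]) if g]
    return [dict(row, gene_name=gene) for gene in genes]
```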

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section provides links to existing sources of information elsewhere on the web, or describes additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified, as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user-data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool or REST API, for the data stored in BigQuery tables;

• the Google Cloud Storage (GCS) JSON API or gsutil, for the data stored in GCS objects; or

• the Genomics REST API, for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done, using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm's resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and to add their own examples to enrich this public resource.


1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
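The relationship between sample and patient barcodes can be sketched in a few lines of Python. This is only an illustration: the function name `participant_barcode` is hypothetical (it is not part of the ISB-CGC API), and the length check assumes TCGA-style barcodes only, since other datasets such as CCLE may follow different conventions.

```python
def participant_barcode(sample_barcode: str) -> str:
    """Return the 12-character patient prefix of a 16-character TCGA sample barcode.

    Hypothetical helper for illustration; not part of the ISB-CGC API.
    """
    if len(sample_barcode) != 16 or not sample_barcode.startswith("TCGA-"):
        raise ValueError("expected a 16-character TCGA sample barcode")
    # The patient (participant) barcode is the first 12 characters.
    return sample_barcode[:12]


print(participant_barcode("TCGA-B9-7268-01A"))  # prints TCGA-B9-7268
```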

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

    $ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
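For readers who prefer Python to curl, the same kind of request can be assembled with the standard library. This is a minimal sketch, not the project's own client code: the helper name `build_preview_request` is hypothetical, the single-study filter `{"Study": "BRCA"}` is just an example value, and the request is only constructed here, not sent.

```python
import json
import urllib.request

PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the filters as JSON.

    Hypothetical helper for illustration; not part of the ISB-CGC code base.
    """
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})

# Sending the request needs network access, so it is left commented out:
# with urllib.request.urlopen(request) as response:
#     cohort = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the URL, headers, and body before any network call is made.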

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

    adenocarcinoma_invasion: string
    age_at_initial_pathologic_diagnosis: string
    anatomic_neoplasm_subdivision: string
    avg_percent_lymphocyte_infiltration: float
    avg_percent_monocyte_infiltration: float
    avg_percent_necrosis: float
    avg_percent_neutrophil_infiltration: float
    avg_percent_normal_cells: float
    avg_percent_stromal_cells: float
    avg_percent_tumor_cells: float
    avg_percent_tumor_nuclei: float
    batch_number: integer
    bcr: string
    clinical_M: string
    clinical_N: string
    clinical_stage: string
    clinical_T: string
    colorectal_cancer: string
    country: string
    country_of_procurement: string
    days_to_birth: integer
    days_to_collection: integer
    days_to_death: integer
    days_to_initial_pathologic_diagnosis: integer
    days_to_last_followup: integer
    days_to_submitted_specimen_dx: integer
    Study: string
    ethnicity: string
    frozen_specimen_anatomic_site: string
    gender: string
    height: integer
    histological_type: string
    history_of_colon_polyps: string
    history_of_neoadjuvant_treatment: string
    history_of_prior_malignancy: string
    hpv_calls: string
    hpv_status: string
    icd_10: string
    icd_o_3_histology: string
    icd_o_3_site: string
    lymph_node_examined_count: integer
    lymphatic_invasion: string
    lymphnodes_examined: string
    lymphovascular_invasion_present: string
    max_percent_lymphocyte_infiltration: integer
    max_percent_monocyte_infiltration: integer
    max_percent_necrosis: integer
    max_percent_neutrophil_infiltration: integer
    max_percent_normal_cells: integer
    max_percent_stromal_cells: integer
    max_percent_tumor_cells: integer
    max_percent_tumor_nuclei: integer
    menopause_status: string
    min_percent_lymphocyte_infiltration: integer
    min_percent_monocyte_infiltration: integer
    min_percent_necrosis: integer
    min_percent_neutrophil_infiltration: integer
    min_percent_normal_cells: integer
    min_percent_stromal_cells: integer
    min_percent_tumor_cells: integer
    min_percent_tumor_nuclei: integer
    mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
    mononucleotide_marker_panel_analysis_status: string
    neoplasm_histologic_grade: string
    new_tumor_event_after_initial_treatment: string
    number_of_lymphnodes_examined: integer
    number_of_lymphnodes_positive_by_he: integer
    ParticipantBarcode: string
    pathologic_M: string
    pathologic_N: string
    pathologic_stage: string
    pathologic_T: string
    person_neoplasm_cancer_status: string
    pregnancies: string
    preservation_method: string
    primary_neoplasm_melanoma_dx: string
    primary_therapy_outcome_success: string
    prior_dx: string
    Project: string
    psa_value: float
    race: string
    residual_tumor: string
    SampleBarcode: string
    tobacco_smoking_history: string
    total_number_of_pregnancies: integer
    tumor_tissue_site: string
    tumor_pathology: string
    tumor_type: string
    weiss_venous_invasion: string
    vital_status: string
    weight: integer
    year_of_initial_pathologic_diagnosis: string
    SampleTypeCode: string
    has_Illumina_DNASeq: string
    has_BCGSC_HiSeq_RNASeq: string
    has_UNC_HiSeq_RNASeq: string
    has_BCGSC_GA_RNASeq: string
    has_UNC_GA_RNASeq: string
    has_HiSeq_miRnaSeq: string
    has_GA_miRNASeq: string
    has_RPPA: string
    has_SNP6: string
    has_27k: string
    has_450k: string

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample: tumor or normal tissue, cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"
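Because the filter object mixes string, integer, and float fields, and the has_* flags are strings ("True", "False", or "None") rather than booleans, a small client-side check before sending a request can catch mistakes early. The sketch below is an assumption-laden illustration, not part of the API: `FILTER_TYPES` covers only a handful of the fields listed above, `check_filters` is a hypothetical helper, and whether the server performs equivalent validation is not stated in this documentation.

```python
# Illustrative subset of the parameter table (field name -> expected Python type).
FILTER_TYPES = {
    "Study": str,
    "gender": str,
    "days_to_birth": int,
    "has_450k": str,
}

# The has_* flags are documented as strings with these three values.
FLAG_VALUES = {"True", "False", "None"}


def check_filters(filters: dict) -> list:
    """Return a list of human-readable problems found in a filter dict."""
    problems = []
    for name, value in filters.items():
        expected = FILTER_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: not in the illustrative subset")
        elif isinstance(value, bool) or not isinstance(value, expected):
            # bool is rejected explicitly because it is a subclass of int.
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif name.startswith("has_") and value not in FLAG_VALUES:
            problems.append(f"{name}: expected one of {sorted(FLAG_VALUES)}")
    return problems


print(check_filters({"Study": "BRCA", "has_450k": "True"}))  # prints []
```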

Response

If successful, this method returns a response body with the following structure:

    kind: cohort_api#cohortsItem
    patient_count: string
    patients: [string]
    sample_count: string
    samples: [string]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
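Note that the two counts are returned as strings, so a client will usually want to convert them before doing arithmetic. The following sketch parses a response of this shape. The response text is made up for illustration (real cohorts will contain different barcodes, and the kind field is omitted here), and `summarize_preview` is a hypothetical helper.

```python
import json

# Made-up response for illustration only; real barcodes and counts will differ.
EXAMPLE_RESPONSE = """
{
  "patient_count": "1",
  "patients": ["TCGA-B9-7268"],
  "sample_count": "1",
  "samples": ["TCGA-B9-7268-01A"]
}
"""


def summarize_preview(response_text: str) -> dict:
    """Parse a preview response and convert the string counts to integers."""
    data = json.loads(response_text)
    return {
        "patient_count": int(data["patient_count"]),
        "sample_count": int(data["sample_count"]),
        "patients": data.get("patients", []),
        "samples": data.get("samples", []),
    }


summary = summarize_preview(EXAMPLE_RESPONSE)
print(summary["patient_count"], summary["sample_count"])
```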


patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
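A request URL for this endpoint can be assembled as sketched below. One caveat: this sketch assumes the barcode is passed as a query-string parameter named patient_barcode, whereas the table above lists it under path parameters, so the exact URL form should be confirmed in the Google APIs Explorer. The helper name `patient_details_url` is hypothetical.

```python
from urllib.parse import urlencode

PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)


def patient_details_url(patient_barcode: str) -> str:
    """Build the GET URL for one participant.

    Assumption: the barcode travels in the query string; verify against the
    live API before relying on this.
    """
    if len(patient_barcode) != 12:
        raise ValueError("expected a 12-character participant barcode")
    return PATIENT_DETAILS_URL + "?" + urlencode({"patient_barcode": patient_barcode})


print(patient_details_url("TCGA-B9-7268"))
```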

Response

If successful, this method returns a response body with the following structure:

    kind: cohort_api#cohortsItem
    aliquots: [string]
    clinical_data:
        ParticipantBarcode: string
        Project: string
        Study: string
        age_atinitial_pathologic_diagnosis: string
        anatomic_neoplasm_subdivision: string
        batch_number: string
        bcr: string
        clinical_M: string
        clinical_N: string
        clinical_T: string
        clinical_stage: string
        colorectal_cancer: string
        country: string
        days_to_birth: string
        days_to_initial_pathologic_diagnosis: string
        days_to_last_followup: string
        ethnicity: string
        frozen_specimen_anatomic_site: string
        gender: string
        histological_type: string
        history_of_colon_polyps: string
        history_of_neoadjuvant_treatment: string
        history_of_prior_malignancy: string
        hpv_calls: string
        hpv_status: string
        icd_10: string
        icd_o_3_histology: string
        icd_o_3_site: string
        lymphatic_invasion: string
        lymphnodes_examined: string
        lymphovascular_invasion_present: string
        menopause_status: string
        mononcleotide_and_dinucleotide_marker_panel_analysis_status: string
        mononucleotide_marker_panel_analysis_status: string
        neoplasm_histologic_grade: string
        new_tumor_event_after_initial_treatment: string
        number_of_lymphnodes_examined: string
        number_of_lymphnodes_positive_by_he: string
        pathologic_M: string
        pathologic_N: string
        pathologic_T: string
        pathologic_stage: string
        person_neoplasm_cancer_status: string
        pregnancies: string
        primary_neoplasm_melanoma_dx: string
        primary_therapy_outcome_success: string
        prior_dx: string
        race: string
        residual_tumor: string
        tobacco_smoking_history: string
        tumor_tissue_site: string
        tumor_type: string
        vital_status: string
        weiss_venous_invasion: string
        year_of_initial_pathologic_diagnosis: string
    samples: []

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.
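As a rough sketch, the request URL for this endpoint can be assembled with the Python standard library. The helper name build_request_url is hypothetical, and the sketch assumes that sample_barcode is sent as a URL query parameter, which this section does not state explicitly.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request line above.
API_BASE = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_request_url(endpoint, **params):
    """Return the GET URL for an endpoint, skipping parameters set to None.

    Assumption: parameters travel in the query string.
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    base = f"{API_BASE}/{endpoint}"
    return f"{base}?{query}" if query else base


url = build_request_url("sample_details", sample_barcode="TCGA-B9-7268-01A")
print(url)
```

The resulting URL could then be fetched with any HTTP client (for example urllib.request.urlopen) and the body decoded as JSON.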

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of data_details list.
patient | string | Participant barcode.
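To illustrate how the data_details list might be consumed, here is a small sketch that groups file names by data type. The function name files_by_datatype is hypothetical, and the example response is invented and heavily abbreviated; a real response would carry the full set of fields listed above.

```python
from collections import defaultdict


def files_by_datatype(sample_details):
    """Group DataFileName values from data_details[] by their Datatype."""
    grouped = defaultdict(list)
    for entry in sample_details.get("data_details", []):
        grouped[entry.get("Datatype", "unknown")].append(entry.get("DataFileName"))
    return dict(grouped)


# Hypothetical, abbreviated response; all values are placeholders.
example_response = {
    "data_details_count": "2",  # note: delivered as a string
    "data_details": [
        {"Datatype": "miRNASeq", "DataFileName": "file_a.txt"},
        {"Datatype": "RNASeqV2", "DataFileName": "file_b.txt"},
    ],
}

print(files_by_datatype(example_response))
```

Because data_details_count is typed as a string, it needs an explicit int() conversion before it can be compared with the length of the list.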


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
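The two special cases described for datafilenamekeys (a missing field, and entries marked "file-path-not-yet-available") can be handled as in the sketch below. The function name available_paths and the example bucket paths are hypothetical, and the sketch assumes the response has already been decoded from JSON into a dictionary.

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response):
    """Return the usable cloud storage paths from a decoded response.

    The datafilenamekeys field is absent when no paths are listed, so a
    missing field is treated as an empty list. Entries containing the
    placeholder text are skipped.
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical response; bucket and file names are placeholders.
example_response = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/path/to/file_a.bam",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}

print(available_paths(example_response))
print(available_paths({"count": "0"}))
```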


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
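Since the items list holds at most one object, a caller usually only needs to check whether it is empty. The sketch below does that on an already-decoded response; the function name genomics_ids and the id values are hypothetical placeholders.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None.

    Only the items list is inspected, so the result does not depend on
    where the count field appears in the response.
    """
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical response; the ids below are placeholders, not real ids.
example_response = {
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ]
}

print(genomics_ids(example_response))
print(genomics_ids({"count": "0"}))
```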


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine please follow the link below

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
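If you drive gcloud from a script, the configuration step above can be expressed as a list of command lines. The sketch below only builds and prints the commands; the helper name config_set_command and the project id are hypothetical, and the region and zone are example values.

```python
import shlex


def config_set_command(prop, value):
    """Build the argument list for `gcloud config set PROPERTY VALUE`."""
    return ["gcloud", "config", "set", prop, value]


# Example values only; substitute your own project id and preferred zone.
settings = {
    "project": "my-project-id",  # hypothetical project id
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}

commands = [config_set_command(prop, value) for prop, value in settings.items()]
for command in commands:
    print(shlex.join(command))
```

On a machine where the Cloud SDK is installed, each list could be passed to subprocess.run(command, check=True) to apply the setting.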


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
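For scripted launches, the same command can be built programmatically. This is only a sketch: the helper name create_instance_command is hypothetical, and it covers just the --machine-type and --zone options mentioned in this section.

```python
import shlex


def create_instance_command(name, machine_type=None, zone=None):
    """Build the argument list for `gcloud compute instances create`.

    Options left as None are omitted so that gcloud falls back to its
    defaults or to values stored in your configuration.
    """
    command = ["gcloud", "compute", "instances", "create", name]
    if machine_type:
        command += ["--machine-type", machine_type]
    if zone:
        command += ["--zone", zone]
    return command


command = create_instance_command("my-instance", machine_type="g1-small")
print(shlex.join(command))
```

Building the command as a list (rather than one string) avoids shell-quoting problems if it is later passed to subprocess.run.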

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
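The two disk prices quoted above are consistent with a flat rate of $0.04 per GB per month ($2 / 50 GB), treating 1 TB as 1000 GB. The short calculation below works from that inferred rate; it is derived only from the figures in this section and may not match current pricing.

```python
# Rate inferred from the figures above: $2 per month for 50 GB = 4 cents per GB.
CENTS_PER_GB_MONTH = 4


def monthly_disk_cost_dollars(size_gb):
    """Estimated monthly cost of a standard persistent disk, in dollars."""
    return size_gb * CENTS_PER_GB_MONTH / 100


print(monthly_disk_cost_dollars(50))    # 2.0
print(monthly_disk_cost_dollars(1000))  # 40.0
print(monthly_disk_cost_dollars(500))   # 20.0
```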


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this (the instance name is given first, and --device-name sets the name under which the disk appears inside the VM):

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point. On Compute Engine, an attached disk appears under /dev/disk/by-id/ with a google- prefix followed by its device name; run ls /dev/disk/by-id/ on the instance to confirm the exact path before formatting.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
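The gcloud side of the disk lifecycle described above (create, attach, detach, delete) can be summarized as an ordered list of commands. The sketch below only constructs the commands; the function name disk_lifecycle_commands is hypothetical, and the formatting, mounting, and unmounting steps are not included because they run on the instance itself, between the attach and detach steps.

```python
import shlex


def disk_lifecycle_commands(disk, instance, size="500GB"):
    """Return the gcloud commands for a persistent disk, in the order used."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance, "--disk", disk],
        # ... format, mount, use, and unmount the disk on the instance here ...
        ["gcloud", "compute", "instances", "detach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


steps = disk_lifecycle_commands("disk-1", "my-instance")
for step in steps:
    print(shlex.join(step))
```

Keeping delete as the last step reflects the rule that a disk can be deleted only when no VM instance is using it.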


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
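The three set operations behave like ordinary set algebra on the sample barcodes in each cohort. The sketch below illustrates the semantics with Python sets; it is not the web application's implementation, the function names are hypothetical, and the sample identifiers are placeholders.

```python
def intersect(*cohorts):
    """Samples present in every cohort (order of cohorts does not matter)."""
    return set.intersection(*map(set, cohorts))


def union(*cohorts):
    """Samples present in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base) - set().union(*others)


# Placeholder sample identifiers.
cohort_a = {"sample-1", "sample-2", "sample-3"}
cohort_b = {"sample-2", "sample-3", "sample-4"}

print(sorted(intersect(cohort_a, cohort_b)))
print(sorted(union(cohort_a, cohort_b)))
print(sorted(complement(cohort_a, cohort_b)))
print(sorted(complement(cohort_b, cohort_a)))
```

The last two lines show why the complement needs a designated base cohort: swapping the base changes the result, whereas intersect and union give the same answer in any order.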

Your Cohorts

When the ldquoCohortsrdquo tab is selected the system is displaying all the cohorts that you own and all the cohorts that havebeen shared with you

Clicking on the name of a cohort in the list will take you to the cohort details and editing page

Your Visualizations

When the ldquoVisualizationsrdquo tab is selected the system is displaying all the visualizations that you own and all thevisualizations that have been shared with you

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, use the search bar at the top of the page. This produces a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort is in ascending or descending order.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, that contain only samples for which certain types of data are available, or that focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. Clicking on a feature expands the field and provides additional filtering options. For example, when you click on “Vital Status”, it expands and provides “Alive”, “Dead” and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
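The counting rule in the last sentence can be sketched in a few lines of Python. The sample records, attribute names, and the helper `count_values` are hypothetical and only illustrate the rule as described; this is not the web-app's implementation.

```python
from collections import Counter

# Hypothetical sample records; real attribute names and values may differ.
samples = [
    {"barcode": "S1", "vital_status": "Alive", "gender": "FEMALE"},
    {"barcode": "S2", "vital_status": "Dead", "gender": "FEMALE"},
    {"barcode": "S3", "vital_status": "Alive", "gender": "MALE"},
    {"barcode": "S4", "vital_status": None, "gender": "MALE"},
]


def count_values(samples, attribute, selected_filters):
    """Count samples per value of `attribute`, applying all *other* selected filters."""
    other_filters = {k: v for k, v in selected_filters.items() if k != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample[k] in allowed for k, allowed in other_filters.items()):
            counts[str(sample[attribute])] += 1
    return counts


# Filters currently selected in the (hypothetical) Selected Filters panel.
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
vital_counts = count_values(samples, "vital_status", selected)
gender_counts = count_values(samples, "gender", selected)
```

Because each attribute ignores its own filter when counting, the values you did not pick still show how many samples you would gain by adding them.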

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where the selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
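The number shown when hovering between two vertical bars is a count of samples per pair of platforms. The following is a minimal sketch of that tally, assuming a hypothetical table that records one platform label per data type for each sample; the sample IDs and platform labels are placeholders, not real ISB-CGC values.

```python
from collections import Counter

# Hypothetical availability table: one platform label per data type,
# with "NA" marking samples that lack that data type.
availability = {
    "S1": {"DNA Methylation": "platform A", "Protein": "platform X"},
    "S2": {"DNA Methylation": "platform A", "Protein": "NA"},
    "S3": {"DNA Methylation": "platform B", "Protein": "platform X"},
    "S4": {"DNA Methylation": "platform A", "Protein": "platform X"},
}


def pair_counts(availability, left, right):
    """Count samples for each (left platform, right platform) combination."""
    return Counter((row[left], row[right]) for row in availability.values())


flows = pair_counts(availability, "DNA Methylation", "Protein")
```

Each key of `flows` corresponds to one swatch between the two bars, and its value to the number displayed on hover.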

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
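The three operations behave like ordinary set operations on lists of samples. As an illustration only, the sketch below models each cohort as a Python set of made-up sample IDs; it shows why order is irrelevant for union and intersect but matters for complement.

```python
# Hypothetical cohorts, each modelled as a set of sample IDs.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

# Union and intersect accept any number of cohorts, in any order.
union = set().union(cohort_a, cohort_b, cohort_c)
intersection = set.intersection(cohort_a, cohort_b, cohort_c)

# Complement needs a base cohort; the other cohorts are subtracted from it.
complement_from_a = cohort_a - cohort_b - cohort_c
complement_from_b = cohort_b - cohort_a - cohort_c
```

Choosing a different base cohort gives a different result, which is why the dialogue asks you to designate one.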

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves like sharing from the User Dashboard page. A dialogue box appears and prompts you to select the users registered in the system with whom you wish to share the cohort.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
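The difference between the two counts can be seen by deriving participants from sample barcodes. The sketch below relies on the TCGA barcode convention that the first 12 characters of a sample barcode (TCGA-XX-XXXX) identify the participant; the barcodes themselves are made up for illustration.

```python
# Made-up sample barcodes following the TCGA layout.
sample_barcodes = [
    "TCGA-AA-0001-01A",  # tumor sample
    "TCGA-AA-0001-10A",  # normal sample from the same participant
    "TCGA-AA-0002-01A",
]


def participant_barcode(sample_barcode):
    """Return the participant portion (first 12 characters) of a sample barcode."""
    return sample_barcode[:12]


n_samples = len(set(sample_barcodes))
n_participants = len({participant_barcode(b) for b in sample_barcodes})
```

Here three samples come from two participants, so the Details Panel would show different numbers for the two counts.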

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list accordingly. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected platform filters. The CSV file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
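Once downloaded, the file list can be processed with Python's standard `csv` module. The sketch below is illustrative: the exact header names in the real CSV are an assumption based on the fields listed above, and the rows, platform labels, and `gs://example-bucket` paths are made up.

```python
import csv
import io
from collections import defaultdict

# Stand-in for the downloaded file; header names and rows are assumptions.
csv_text = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,platform A,pipeline 1,Level 3,gs://example-bucket/a.txt
TCGA-AA-0002-01A,platform B,pipeline 2,Level 3,gs://example-bucket/b.txt
TCGA-AA-0001-01A,platform B,pipeline 2,Level 3,gs://example-bucket/c.txt
"""


def paths_by_platform(handle):
    """Group Cloud Storage paths by the platform that produced each file."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


grouped = paths_by_platform(io.StringIO(csv_text))
```

With a real download, the same function can be given a file opened with `open(path, newline="")` in place of the `io.StringIO` stand-in.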

Commenting: Any user who owns a cohort, or with whom a cohort has been shared, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top-right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top-right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platform.

  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platform.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platform.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox overrides any feature set in Color By Feature. The cohorts provided are used for the coloring and the legend instead.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on the check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes to it.
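The two permission levels can be summarized as a small model. The sketch below is hypothetical (the `Cohort` class, its method names, and the user names are invented for illustration and do not describe the web-app's code); it encodes only the rules stated above: owners edit and share, readers view and comment, and a clone belongs to whoever made it.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)

    def permission(self, user):
        """Return "owner", "reader", or None if the cohort is not visible to the user."""
        if user == self.owner:
            return "owner"
        if user in self.readers:
            return "reader"
        return None

    def can_edit(self, user):
        return self.permission(user) == "owner"

    def can_comment(self, user):
        return self.permission(user) in ("owner", "reader")

    def share(self, acting_user, with_user):
        if not self.can_edit(acting_user):
            raise PermissionError("only the owner can share")
        self.readers.add(with_user)

    def clone(self, user):
        """Copy the sample list only; the copy starts out private to its new owner."""
        if self.permission(user) is None:
            raise PermissionError("cohort is not visible to this user")
        return Cohort(name=f"Copy of {self.name}", owner=user, samples=self.samples)


original = Cohort("my cohort", owner="alice", samples=frozenset({"S1", "S2"}))
original.share("alice", "bob")
copy = original.clone("bob")
```

Note that the clone does not inherit the original's list of readers, mirroring the statement that new cohorts are private by default.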

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – A treemap graph on the cohort details and cohort creation pages will not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download not working

• December 3, 2015: v0.1

  – First tagged release of the web-app

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
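For readers who prefer Python to the web interface, the datasets in the “isb-cgc” project can also be listed programmatically. The following is a sketch under assumptions, not an official ISB-CGC recipe: it assumes the `google-cloud-bigquery` client library is installed, that Google application default credentials are configured, and that the billing project you pass in is one you belong to. The helper names and the dataset and table placeholders are hypothetical.

```python
def list_isb_cgc_datasets(billing_project, data_project="isb-cgc"):
    """Return the IDs of the datasets visible in the ISB-CGC project."""
    # Imported lazily so this module loads even without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [item.dataset_id for item in client.list_datasets(project=data_project)]


def fully_qualified(dataset, table, project="isb-cgc"):
    """Build a back-quoted project.dataset.table reference (standard SQL syntax)."""
    return f"`{project}.{dataset}.{table}`"


# Placeholder names only; substitute a dataset and table you can see in BigQuery.
example_reference = fully_qualified("some_dataset", "some_table")
```

Calling `list_isb_cgc_datasets("my-gcp-project")`, where `my-gcp-project` stands for your own project ID, contacts BigQuery and therefore requires network access and valid credentials.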

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page and, after you successfully authenticate, you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
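The one-to-one rule in the last answer can be pictured as a small registry. This is a toy model with invented names (`LinkRegistry`, the example ids and addresses), not the platform's implementation; it only shows why the old link must be removed before a new one can be made.

```python
class LinkRegistry:
    """Toy registry allowing at most one Google identity per eRA Commons id."""

    def __init__(self):
        self._google_by_era = {}

    def link(self, era_id, google_identity):
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            raise ValueError(f"{era_id} is already linked; unlink it from {current} first")
        self._google_by_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        # Unlinking must be done by the identity that currently holds the link.
        if self._google_by_era.get(era_id) != google_identity:
            raise ValueError("sign in with the identity that holds the link")
        del self._google_by_era[era_id]

    def linked_identity(self, era_id):
        return self._google_by_era.get(era_id)


registry = LinkRegistry()
registry.link("era_user", "old@example.com")
try:
    registry.link("era_user", "new@example.com")
    relink_blocked = False
except ValueError:
    relink_blocked = True

# Unlink from the previous identity, then link the new one.
registry.unlink("era_user", "old@example.com")
registry.link("era_user", "new@example.com")
```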

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
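A script that depends on controlled-access data may want to check whether the 24-hour window is still open before starting a long job. The sketch below is an illustration only: the function name is invented, you would need to record the verification time yourself, and treating the window as closing exactly 24 hours after verification is an assumption.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the answer above


def has_controlled_access(verified_at, now):
    """Return True while `now` falls within 24 hours of the verification time."""
    return verified_at <= now < verified_at + ACCESS_WINDOW


# Example verification time, recorded by the user after signing in.
verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
still_valid = has_controlled_access(verified, verified + timedelta(hours=23))
expired = not has_controlled_access(verified, verified + timedelta(hours=25))
```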

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


The ISB-CGC platform is one of NCI's Cancer Genomics Cloud Pilots, and our mission is to host the TCGA data in the cloud so that researchers around the world may work with the data without needing to download and store the data at their own local institutions.

The vast majority (over 99%) of this petabyte of data consists of low-level sequence data, currently stored as files in Google Cloud Storage. Over the course of the TCGA project, this low-level ("Level 1") data has been processed through a set of standardized pipelines, and the resulting high-level ("Level 3") data is the data most frequently used in downstream analyses. The ISB-CGC platform aims to make these different types of data accessible to the widest possible variety of users within the cancer research community, using the most appropriate Google Cloud Platform technologies.

More details about the TCGA data can be found in the sections below.

Understanding the TCGA Data Types

The TCGA dataset is unique in that the tumor samples were assayed using a standard set of platforms and pipelines in order to produce a comprehensive dataset, including:

• DNA sequencing of tumor samples and matched-normals (typically blood samples) in order to detect somatic mutations

• SNP array based DNA copy-number and genotyping analysis of tumor samples and matched-normals

• DNA methylation of tumor samples

• messenger RNA (mRNA) expression analysis of the tumor samples to capture the gene expression profile

• micro-RNA (miRNA) expression profiling of the tumor samples

In addition, protein expression for a significant fraction (~20%) of all tumor samples was obtained using RPPA (reverse phase protein array).


Understanding the TCGA Data Levels

TCGA Data Levels

For each type of data there are typically three levels of data. Level 1 typically represents raw, un-normalized data. Level 2 typically represents an intermediate level of processing and/or normalization of the data. Level 3 typically represents aggregated, normalized, and/or segmented data.

The results of integrative or pan-cancer analyses are sometimes referred to as "Level 4" data. More information about Data Level Classification can be found on the NCI wiki.


Understanding the TCGA Data Platforms

When working with any of the data types, it is important to be aware of both the platform that was used to generate the underlying raw data and the pipeline that was used to process the data. For example, over the course of the TCGA study, DNA methylation data was obtained first using the Illumina HumanMethylation27 platform and later using the HumanMethylation450 platform. Any analysis that combines data from these two platforms across a cohort of samples should take this into consideration. Another example where multiple platforms and/or pipelines were used to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC, and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC, and the normalized gene-expression values are based on RPKM.


TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis


Understanding Data Access

• Public Data: Sometimes the word "public" is misinterpreted as meaning "open". All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.


Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project

• Submit a data access request (DAR); the DAR requires NIH log-in

Note: Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions.) See also the dbGaP Data Access Request Portal.

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution (see the Sample DUC form)


• Protect data confidentiality. (Any data used in a study which was initially listed as "Controlled" or TCGA remains controlled data and MUST be protected accordingly unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

  – Project 1 has only users authorized to access controlled data and a DUC in place.

  – Project 2 has only open-access users; NO controlled data access is allowed from or by Project 2 member(s).)

Remember: YOU and YOUR institution are accountable for ensuring the security of this data, not the cloud service provider. Securing controlled data and protecting it should be thought of in the same manner as any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: "The Cloud, dbGaP and the NIH" blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).


1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC


• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC, and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager

In Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

In BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our github repositories.

This introductory tutorial gives an overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out.

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC)

• MAF files: these "protected" mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types)

• DNA-seq BAM files: these Level-1 data files were provided by CGHub

  – over 37,000 of these files are available in Google Cloud Storage (GCS)

  – roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

  – BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub

  – over 13,000 mRNA-seq BAM files are available in GCS

  – over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files

In Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

In Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more "cloud-aware" approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).


TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC there may be multiple revisions of an archive. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives based on our most recent upload.

Note that all "bio" archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).


1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.


• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
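
The value-level rules in the notes above can be sketched in a few lines of Python. This is only an illustration, not the ISB-CGC ETL code: the function name normalize_value and the exact set of missing-value markers are assumptions based on the list above, and the real pipeline may handle more cases.

```python
# Illustrative sketch only; names and the marker set are assumptions.
MISSING_MARKERS = {"none", "null", "na", "__unknown__", "<blank>", ""}


def normalize_value(raw):
    """Return None for missing markers, a number for numeric text, else the string."""
    text = raw.strip()  # also drops trailing CTRL-M / EOL characters
    # Remove one pair of surrounding quotes; quotes inside the string are kept.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    if text.lower() in MISSING_MARKERS:
        return None
    candidate = text.replace(",", "")  # tolerate comma separators such as 1,234
    try:
        return int(candidate)
    except ValueError:
        pass
    try:
        return float(candidate)  # also accepts scientific notation
    except ValueError:
        return text
```

A value such as "1,234" would come back as the integer 1234, while a free-text value containing a comma is returned unchanged because it fails both numeric conversions.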

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (aka patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (aka tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive

  – days_to_last_known_alive should not be NULL

  – days_to_last_known_alive is set to days_to_last_followup

  – days_to_death is set to NULL

• for all patients with vital_status=Dead

  – days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to "Alive")

  – days_to_last_known_alive is set to days_to_death

  – days_to_last_followup is set to NULL
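
As a rough illustration of the survival-field rules above, the following sketch applies them to a single record held as a Python dict. The function name is hypothetical, and one point is an assumption rather than something stated above: a "Dead" record that is reclassified as "Alive" is then given the "Alive" treatment.

```python
def apply_survival_rules(record):
    """Return a copy of a clinical record with consistent survival fields (sketch)."""
    rec = dict(record)
    # A "Dead" record with no days_to_death but a follow-up value becomes "Alive".
    if (rec.get("vital_status") == "Dead"
            and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        rec["vital_status"] = "Alive"
    if rec.get("vital_status") == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif rec.get("vital_status") == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```

Records with any other vital_status value are returned unchanged in this sketch.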

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined


• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
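
For reference, BMI is conventionally weight in kilograms divided by the square of height in metres. The sketch below assumes, without confirmation from the text above, that height is recorded in centimetres and weight in kilograms; the function name and the handling of missing values are likewise illustrative.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI (kg / m^2), or None when either input is missing or unusable."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0  # assumed unit conversion: cm to m
    return weight_kg / (height_m ** 2)
```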


Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field


Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; it was then annotated using Oncotator, and then further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
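
The data-cleaning steps above amount to a row filter plus a column rename. A minimal sketch, assuming each MAF line has already been parsed into a dict keyed by column label, might look like the following. Only the two renames named above are included, the renaming is done first purely for convenience, and the case-insensitive "illumina" substring test is an assumption about how Sequencer values are written.

```python
ALLOWED_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y"}
# Only the renames mentioned in the text; any further mappings are not shown here.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Yield renamed rows that pass the four filters described above (sketch)."""
    for row in rows:
        row = {RENAMES.get(key, key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in ALLOWED_CHROMS:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        yield row
```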

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring first

  – 01, 08, or 14 (all of which refer to broad.mit.edu)

  – 09, 21, or 30 (all of which refer to genome.wustl.edu)

  – 10 or 12 (both of which refer to hgsc.bcm.edu)

  – 13 or 31 (both of which refer to bcgsc.ca)

  – 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25 in Python slice notation, i.e. the segment just before the center code)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8435 tumor samples from 8373 patients.
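
One way to read the duplicate-removal rules above is as a ranked preference among calls that share a tumor sample and position. The sketch below is an interpretation, not the actual ISB-CGC implementation: the record keys (tumor_aliquot, normal_type, chr, position) are hypothetical, the tumor sample is taken to be the first 16 characters of the aliquot barcode, and the rules are applied strictly in the order listed.

```python
# Center codes grouped in the order of preference given in the text.
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"), ("10", "12"), ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank for rank, group in enumerate(CENTER_GROUPS) for code in group}


def rank_and_plate(call):
    """Return (rank tuple, plate string); a smaller rank tuple is preferred."""
    parts = call["tumor_aliquot"].split("-")  # e.g. TCGA-B0-5695-01A-11D-1534-10
    analyte, plate, center = parts[4][-1], parts[5], parts[6]
    rank = (
        0 if call["normal_type"].lower() == "blood derived normal" else 1,
        0 if analyte == "D" else 1,
        CENTER_RANK.get(center, len(CENTER_GROUPS)),  # unlisted centers rank last
    )
    return rank, plate


def is_preferred(candidate, current):
    cand_rank, cand_plate = rank_and_plate(candidate)
    cur_rank, cur_plate = rank_and_plate(current)
    if cand_rank != cur_rank:
        return cand_rank < cur_rank
    # Final tie-break: larger 4-character sequence (compared as text).
    return cand_plate > cur_plate


def deduplicate(calls):
    """Keep one call per (tumor sample, chromosome, position)."""
    best = {}
    for call in calls:
        key = (call["tumor_aliquot"][:16], call["chr"], call["position"])
        if key not in best or is_preferred(call, best[key]):
            best[key] = call
    return list(best.values())
```

With the example barcodes from the rules, a call on TCGA-B0-5695-01A-11D-1534-10 would be kept in preference to the same call on TCGA-B0-5695-01A-11W-1584-10.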


DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
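
Because the participant, sample, and aliquot barcodes nest inside one another, a short helper can recover the shorter identifiers from a longer one. The function below is a hypothetical sketch based only on the barcode examples shown in this documentation; it is not part of any ISB-CGC API.

```python
def parse_barcode(barcode):
    """Split a TCGA barcode into the identifiers it contains (sketch)."""
    parts = barcode.split("-")
    result = {"participant": "-".join(parts[:3])}  # e.g. TCGA-04-1517
    if len(parts) >= 4:
        result["sample"] = "-".join(parts[:4])     # e.g. TCGA-04-1517-01A
    if len(parts) == 7:
        result["aliquot"] = barcode
        result["analyte"] = parts[4][-1]           # e.g. "D" or "W"
        result["center"] = parts[6]                # final two characters
    return result
```

This makes it straightforward to join aliquot-level tables back to the sample-level and participant-level tables.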

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using Exponential Scientific Notation (e.g. 87E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.


DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.


ETL Details: The BigQuery table is populated only with the files matching the pattern *HumanMethylation*.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
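
The merge described above can be sketched as follows. This is an illustration under stated assumptions: input rows are (sample, Probe_Id, Beta_Value) tuples held in lists, precedence is decided per sample identifier, and the platform labels are placeholders rather than the values actually stored in the table.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge 27k and 450k rows into one list, with 450k taking precedence per sample."""
    samples_450k = {sample for sample, _, _ in rows_450k}
    merged = []
    # Platform labels below are placeholders for illustration.
    for platform, rows in (("HumanMethylation450", rows_450k),
                           ("HumanMethylation27", rows_27k)):
        for sample, probe_id, beta in rows:
            if beta is None or beta in ("NA", "NULL"):
                continue  # probes without a usable beta value are left out
            if platform == "HumanMethylation27" and sample in samples_450k:
                continue  # the sample was also run on 450k, so skip its 27k data
            merged.append({
                "sample": sample,
                "Probe_Id": probe_id,
                "Beta_Value": float(beta),
                "platform": platform,
            })
    return merged
```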


mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern *.rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
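
The split and lookup described above can be illustrated with a small function. The lookup table passed in as hgnc_by_gene_id is hypothetical (a dict from numeric gene id to approved symbol), and storing gene_id as an integer is an assumption made for this sketch.

```python
def split_gene_id(raw, hgnc_by_gene_id):
    """Split a value such as 'TP53|7157' into the three stored columns (sketch)."""
    symbol, gene_id_text = raw.split("|", 1)
    gene_id = int(gene_id_text)  # assumes the second part is always numeric
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None when the id is not present in the (hypothetical) lookup table
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }
```

For example, split_gene_id("TP53|7157", {7157: "TP53"}) yields the original symbol TP53, the gene id 7157, and the looked-up symbol TP53.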

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern *.gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply "?", that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.


microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
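
To make the "melt" step concrete, the sketch below turns a small matrix (one row per mature microRNA, one column per aliquot) into one record per microRNA and aliquot. The separator between gene name and accession is assumed to be a period, and the output field names are hypothetical rather than the actual BigQuery column names.

```python
def melt_expression_matrix(aliquots, rows, sep="."):
    """Flatten a matrix into tidy records.

    aliquots: list of aliquot barcodes (the matrix column labels)
    rows:     list of (row_label, values) pairs, one value per aliquot
    sep:      assumed separator between microRNA name and accession
    """
    records = []
    for label, values in rows:
        mirna_id, accession = label.split(sep, 1)
        for aliquot, value in zip(aliquots, values):
            records.append({
                "mirna_id": mirna_id,
                "mirna_accession": accession,
                "aliquot": aliquot,
                "value": value,
            })
    return records
```

A matrix with M microRNA rows and N aliquot columns therefore becomes M x N records, which is the shape a flat table expects.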

Only the isoform files matching the pattern *.isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.


Protein

The raw protein data file contains just two columns The ldquoComposite Element REFrdquo which corresponds to the thirdcolumn in the antibody annotation file and the estimated expression value for that particular protein The ldquoCompositeElement REFrdquo was parsed to generate additional information(see details in the formatting section) The BigQuery tablewas populated with all TCGA Level-3 RPPA data matching the pattern - ldquo_RPPA_Coreprotein_expressiontxtrdquo

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene, and protein map is explained in detail below.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns.
• If “protein_name” is missing, generate one from “composite_element_ref”.
• Make a map of ‘composite_element_ref’, ‘gene_name’, and ‘protein_name’ values.
• Check for any other variants of the gene and protein symbols in the table.
• HGNC validation:
  - If the gene symbol is in the HGNC approved symbols: ‘Approved’ Gene_symbol = Gene_symbol.
  - If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol.
  - If not, check the Previous symbols. If found: Gene_symbol = “Approved” Gene_symbol.
  - If not: Gene_symbol = Gene_symbol.
• The file generated is manually curated and fed back into the algorithm.
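The HGNC validation cascade can be read as a simple lookup chain. The sketch below follows one literal reading of the steps (a symbol found among the aliases is kept unchanged, a symbol found among the previous symbols is replaced by its approved symbol); the function name, the status labels, and the toy lookup tables are all hypothetical and are not taken from HGNC or from the actual pipeline.

```python
def validate_gene_symbol(symbol, approved, aliases, previous_to_approved):
    """Return (resolved_symbol, status) following the cascade:
    approved -> alias -> previous -> unchanged."""
    if symbol in approved:
        return symbol, "approved"
    if symbol in aliases:
        # One literal reading of the rule: the alias symbol is kept as-is.
        return symbol, "alias"
    if symbol in previous_to_approved:
        # A superseded symbol is replaced by the current approved symbol.
        return previous_to_approved[symbol], "previous"
    # No match: keep the symbol so that manual curation can review it.
    return symbol, "unmatched"


# Toy lookup tables (placeholders, not real HGNC content).
APPROVED = {"GENE_A", "GENE_B"}
ALIASES = {"GENE_A_ALT"}
PREVIOUS = {"GENE_B_OLD": "GENE_B"}

results = {
    s: validate_gene_symbol(s, APPROVED, ALIASES, PREVIOUS)
    for s in ["GENE_A", "GENE_A_ALT", "GENE_B_OLD", "GENE_X"]
}
```

Returning a status alongside the symbol makes it easy to pull out the unmatched entries for the manual curation step.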

Formatting

• Duplicate the rows if there are multiple genes concatenated in the “gene_name” value. For example, a ‘gene_name’ with a value like ‘AKT1 AKT2 AKT3’ is stored as three separate rows, with one gene in each row.
• ‘Protein_Name’ is split into ‘Protein_Basename’ and ‘Phospho’, which are stored as separate columns.
• ‘Composite element ref’ is parsed to get ‘validationStatus’ and ‘antibodySource’; both are stored as separate columns in the BigQuery table.
• Data from both Illumina GA and HiSeq platforms are stored in the same table.
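The row-duplication rule in the first bullet can be sketched as follows. This is a minimal illustration that assumes the concatenated genes are separated by whitespace, as in the ‘AKT1 AKT2 AKT3’ example; the function name, the protein_name value, and the expression value are hypothetical.

```python
def expand_gene_rows(rows, key="gene_name"):
    """Emit one copy of each row per gene listed in row[key]."""
    expanded = []
    for row in rows:
        genes = row[key].split()
        # Rows with an empty gene_name are kept unchanged.
        for gene in genes or [row[key]]:
            new_row = dict(row)  # copy so the input rows are not modified
            new_row[key] = gene
            expanded.append(new_row)
    return expanded


# Made-up input row for illustration.
rows = [{"gene_name": "AKT1 AKT2 AKT3", "protein_name": "Akt", "value": 0.42}]
expanded = expand_gene_rows(rows)
```

All other fields are copied verbatim, so the three output rows differ only in their gene_name.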


Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)
  - isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015
• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data
• November 2, 2015: first public release of TCGA open-access data in BigQuery tables
  - isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015
  - includes Annotations table with information about redacted samples, etc.
  - isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform
• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)
• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table
• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)
• Upload new data from CGHub into GCS (February)
• New set of BigQuery tables based on new data at the TCGA DCC (March)
• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user-data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,
• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or
• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to “bring the computation to the data”. There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single “monolithic” system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily “torn down” (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm’s resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud-computing. This means that researchers will need to learn a new skill-set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints has been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
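The relationship between sample and patient barcodes can be sketched in Python. This is a minimal illustration of the 16-character and 12-character convention described above; the function names are hypothetical, and the two sample barcodes are illustrative values built on a 12-character participant barcode of the form used elsewhere in this documentation.

```python
from collections import defaultdict

SAMPLE_BARCODE_LENGTH = 16
PATIENT_BARCODE_LENGTH = 12


def patient_barcode(sample_barcode):
    """Return the 12-character patient prefix of a 16-character sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError("expected a 16-character sample barcode: %r" % sample_barcode)
    return sample_barcode[:PATIENT_BARCODE_LENGTH]


def group_samples_by_patient(sample_barcodes):
    """Map each patient barcode to the list of its sample barcodes."""
    grouped = defaultdict(list)
    for barcode in sample_barcodes:
        grouped[patient_barcode(barcode)].append(barcode)
    return dict(grouped)


# Illustrative sample barcodes for a single participant.
samples = ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"]
by_patient = group_samples_by_patient(samples)
```

Grouping this way turns the sample list of a cohort into the corresponding patient list without any extra lookup.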

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
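The same kind of request can be prepared from Python using only the standard library. This is a hedged sketch rather than an official client: the helper name build_preview_request is hypothetical, the single-study filter is just an example value, and the endpoint URL is reconstructed from the curl example, so it may not match a currently deployed service. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Base URL reconstructed from the curl example; treat it as an assumption.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_preview_request(filters):
    """Build (but do not send) a POST request for the preview_cohort endpoint."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        BASE_URL + "/preview_cohort",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})

# To actually call the service (requires network access):
# with urllib.request.urlopen(request) as response:
#     preview = json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the JSON body before any network traffic occurs.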

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: “True”, “False”, or “None”
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: “True”, “False”, or “None”
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: “True”, “False”, or “None”
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: “True”, “False”, or “None”
has_RPPA | string | Indicates if a sample has protein array data: “True”, “False”, or “None”
has_SNP6 | string | Indicates if a sample has copy number data: “True”, “False”, or “None”
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: “True”, “False”, or “None”
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: “True”, “False”, or “None”

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
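Because the two counts are returned as strings, a client will usually convert them before use. The sketch below parses a response of the shape described above; the helper name summarize_preview is hypothetical, and the example payload, including its barcodes and kind value, is made up for illustration rather than captured from the service.

```python
import json


def summarize_preview(response_text):
    """Parse a preview response and convert the string counts to integers."""
    payload = json.loads(response_text)
    return {
        "patient_count": int(payload["patient_count"]),
        "sample_count": int(payload["sample_count"]),
        "patients": payload.get("patients", []),
        "samples": payload.get("samples", []),
    }


# Illustrative payload following the documented structure (not real output).
example = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "2",
    "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"],
})
summary = summarize_preview(example)
```

Comparing the converted counts against the lengths of the two lists is a cheap sanity check on a response.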

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
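A request URL for this endpoint might be assembled as shown below. This is a sketch under stated assumptions: the base URL is reconstructed from the HTTP request line above, the helper name patient_details_url is hypothetical, and the barcode is assumed to be passed as a patient_barcode query-string parameter, since the request URL shows no path placeholder. Check the API description before relying on this.

```python
from urllib.parse import urlencode

# Base URL reconstructed from the documented request line; an assumption.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the GET URL for patient_details, assuming a query-string parameter."""
    if len(patient_barcode) != 12:
        raise ValueError("expected a 12-character participant barcode")
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


url = patient_details_url("TCGA-B9-7268")
```

The length check mirrors the documented 12-character barcode format and catches sample barcodes passed by mistake.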

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode
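Once a sample_details response has been parsed into a Python dictionary, the data_details list can be filtered on any of the properties described above. The sketch below selects the blocks for one platform. The example dictionary is hypothetical and heavily trimmed, and is only meant to show the shape of the data (note that data_details_count arrives as a string).

```python
def files_for_platform(sample_details, platform):
    """Return the data_details blocks whose Platform equals `platform`."""
    return [
        block
        for block in sample_details.get("data_details", [])
        if block.get("Platform") == platform
    ]


# Hypothetical, trimmed response used only for illustration.
example = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "2",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "Datatype": "miRNASeq"},
        {"Platform": "HumanMethylation450", "Datatype": "DNA Methylation"},
    ],
}

matches = files_for_platform(example, "IlluminaHiSeq_miRNASeq")
print(len(matches), "of", int(example["data_details_count"]), "files match")
```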


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP-authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.
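The optional parameters only need to be sent when they are set. A minimal Python sketch of building such a request URL follows. It assumes that all four parameters travel in the query string, which this section does not state explicitly, and the function name is illustrative.

```python
from urllib.parse import urlencode

# Base URL as listed under "HTTP request"; query-string encoding of the
# parameters is an assumption of this sketch.
API_ROOT = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, leaving out optional filters that are not set."""
    params = {"sample_barcode": sample_barcode}
    optional = {"platform": platform, "pipeline": pipeline, "token": token}
    params.update({name: value for name, value in optional.items() if value is not None})
    return API_ROOT + "/datafilenamekey_list_from_sample?" + urlencode(params)


print(datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Omitting token corresponds to the unauthenticated case described above, in which only open-access paths would be expected back.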

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
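Client code therefore has two special cases to handle: the datafilenamekeys field may be missing altogether, and individual entries may be placeholders rather than real paths. A minimal Python sketch follows. The bucket and file names in the example are hypothetical, and matching the placeholder as a substring is an assumption about how it appears in the entry.

```python
PLACEHOLDER = "file-path-not-yet-available"


def usable_paths(response):
    """Return only real file paths from a parsed response dictionary.

    A missing "datafilenamekeys" field is treated as an empty list, and
    entries containing the placeholder text are dropped.
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical responses used only for illustration.
with_paths = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-open-bucket/some/file.txt",
        "gs://example-open-bucket/file-path-not-yet-available",
    ],
}
without_field = {"count": "0"}

print(usable_paths(with_paths))
print(usable_paths(without_field))
```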


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "items": [
        {
          "SampleBarcode": string,
          "GG_dataset_id": string,
          "GG_readgroupset_id": string
        }
      ]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
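Because items holds at most one object and may be absent or empty, a caller will usually want to reduce the response to either a pair of ids or nothing. A minimal Python sketch follows. The id values in the example are made up, and the function relies on the items list rather than on count, which is returned as a string.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None if absent."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical responses; the id values are invented for illustration.
found = {
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
not_found = {"count": "0"}

print(genomics_ids(found))
print(genomics_ids(not_found))
```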


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
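If you prefer to script this setup rather than type each command, one possible approach is to generate the gcloud config set invocations from a dictionary of properties. The Python sketch below only builds and prints the commands. The project id and zone are placeholder values, and actually executing each command (for example with subprocess.run) is left to the reader.

```python
import shlex


def config_set_commands(properties):
    """Return one `gcloud config set NAME VALUE` argument list per property."""
    return [
        ["gcloud", "config", "set", name, value]
        for name, value in properties.items()
    ]


# Placeholder values; substitute your own project id and preferred zone.
commands = config_set_commands(
    {"project": "my-project-id", "compute/zone": "us-central1-a"}
)

for argv in commands:
    # Print each command in a form that can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```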


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

    gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

    gcloud compute zones list

Here is a very simple command to create a VM:

    gcloud compute instances create my-instance --machine-type g1-small
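When VMs are created from a script, it can help to assemble the command as an argument list and add the zone only when no default has been configured. The following Python sketch does just that. The function name is illustrative, and running the resulting list (for example with subprocess.run(argv, check=True)) assumes the Cloud SDK is installed and authenticated.

```python
def create_instance_argv(name, machine_type="g1-small", zone=None):
    """Build the argument list for `gcloud compute instances create`.

    The --zone flag is added only when a zone is given; otherwise gcloud
    falls back to the configured compute/zone property, if any.
    """
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        argv += ["--zone", zone]
    return argv


print(" ".join(create_instance_argv("my-instance")))
print(" ".join(create_instance_argv("my-instance", zone="us-central1-a")))
```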

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM Remember that as long as your VM is running whether or not you are actually doinganything with it charges will be incurred It is therefore a good idea to get in the habit of shutting down VMs as soonas you are finished with your work They can easily be restarted an hour day or week later Note that resources thatare attached to a stopped VM (such as persistent disks) will however continue to incur charges Compared to the costof the VM though the cost of a persistent disk is typically negligible a 50 GB standard persistent disk only costs $2per month and 1 TB costs $40
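To get a rough feel for what a disk of another size would cost, the two figures quoted above can be turned into a per-GB rate. The sketch below assumes the rate implied by the text ($2 for 50 GB, i.e. 4 cents per GB per month) and treats 1 TB as 1000 GB; the function name is made up for this example, and actual prices may differ from these figures.

```shell
# Rate implied by the figures in the text: $2 / 50 GB = 4 cents per GB per month.
# This is an assumption taken from the text, not a current price list.
CENTS_PER_GB_MONTH=4

# Print the approximate monthly cost in dollars for a disk of the given size in GB.
pd_monthly_cost() {
  cents=$(( $1 * CENTS_PER_GB_MONTH ))
  printf '%d.%02d\n' $(( cents / 100 )) $(( cents % 100 ))
}

pd_monthly_cost 50     # the 50 GB disk from the text
pd_monthly_cost 1000   # 1 TB, taken as 1000 GB
```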

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
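As an illustration, a stop, restart, and delete cycle could look like the following sketch. The instance name is a placeholder, and the run helper is an invention of this example (not part of gcloud) that prints each command instead of executing it unless RUN=1 is set. The --keep-disks option on the delete command is shown as one way to delete a VM while preserving its boot disk.

```shell
# Dry-run helper invented for this sketch (not part of gcloud):
# it prints each command instead of executing it, unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Stop the VM when you are finished; attached persistent disks are kept.
run gcloud compute instances stop my-instance

# Restart the same VM later.
run gcloud compute instances start my-instance

# Delete the VM but keep its boot disk (the disk continues to incur charges).
run gcloud compute instances delete my-instance --keep-disks boot
```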

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see the gcloud reference documentation for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).
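A variant that sets all three of the common options explicitly might look like the sketch below. The disk name, size, type, and zone are placeholder values chosen for illustration, and the run helper is an invention of this example (not part of gcloud) that prints each command instead of executing it unless RUN=1 is set.

```shell
# Dry-run helper invented for this sketch (not part of gcloud):
# it prints each command instead of executing it, unless RUN=1 is set.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# Create a 200 GB SSD persistent disk in an explicitly chosen zone.
run gcloud compute disks create disk-2 --size 200GB --type pd-ssd --zone us-central1-a

# List the disks in the current project to confirm that the new disk exists.
run gcloud compute disks list
```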

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

The instance name is given as the positional argument, and --device-name sets the name under which the disk will appear on the instance (here, /dev/disk/by-id/google-disk-1). Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).
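The mode is selected with the --mode option of attach-disk. As a minimal sketch, the hypothetical wrapper below (attach_cmd is a name made up for this example) composes the command for a given instance, disk, and mode, and rejects any mode other than rw or ro. It only prints the command, so nothing is attached until you run the printed line yourself.

```shell
# Hypothetical wrapper: print the attach-disk command for the given
# instance, disk, and mode (rw by default). Nothing is executed.
attach_cmd() {
  instance=$1
  disk=$2
  mode=${3:-rw}
  case "$mode" in
    rw|ro) ;;
    *) echo "mode must be rw or ro" >&2; return 1 ;;
  esac
  echo "gcloud compute instances attach-disk $instance --disk $disk --mode $mode"
}

# Compose the command that attaches disk-1 to my-instance read-only.
attach_cmd my-instance disk-1 ro
```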

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point.

# The device path assumes the disk was attached with --device-name disk-1
sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir -p /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button becomes selectable once at least one cohort is selected. Click on the “Trash” button to delete the selected cohorts. The same procedure applies to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The “Share” button becomes selectable once at least one cohort is selected. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users who are already registered in the system. Click the “Share Cohort” button when you are done. The same procedure applies to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
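The web-app performs these operations for you, but their semantics can be illustrated with ordinary command-line tools. The sketch below is not how the web-app is implemented; it treats each cohort as a sorted text file with one sample barcode per line (the barcodes and function names are made up for illustration) and shows why the order of the cohorts matters only for the complement.

```shell
# Two hypothetical cohorts, one made-up sample barcode per line, sorted for comm.
a=$(mktemp)
b=$(mktemp)
printf '%s\n' TCGA-AA-0001-01A TCGA-AA-0002-01A TCGA-AA-0003-01A | sort > "$a"
printf '%s\n' TCGA-AA-0002-01A TCGA-AA-0003-01A TCGA-AA-0004-01A | sort > "$b"

cohort_union()      { sort -u "$1" "$2"; }    # samples in either cohort
cohort_intersect()  { comm -12 "$1" "$2"; }   # samples in both cohorts
cohort_complement() { comm -23 "$1" "$2"; }   # samples in the base cohort ($1) but not in $2

cohort_union "$a" "$b"
cohort_intersect "$a" "$b"
cohort_complement "$a" "$b"   # base cohort is a
cohort_complement "$b" "$a"   # base cohort is b: a different result
```

Swapping the arguments leaves the union and the intersection unchanged, whereas the complement depends on which cohort is chosen as the base.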

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it indicates whether that column is being sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where the selected filters are shown, so that there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you can do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.
• Share with Others: This behaves in the same way as on the User Dashboard page. A dialogue box appears, and you are prompted to select the registered users with whom to share the cohort.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
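The difference between the two counts can be illustrated with a small sketch. It assumes standard TCGA barcodes, in which the first 12 characters of a sample barcode (for example TCGA-AA-0001) identify the participant; the barcodes themselves are made up for this example and do not refer to real samples.

```shell
# Made-up sample barcodes; the first two come from the same participant.
samples='TCGA-AA-0001-01A
TCGA-AA-0001-11A
TCGA-AA-0002-01A'

# Count distinct samples, then distinct participants (first 12 characters).
n_samples=$(printf '%s\n' "$samples" | sort -u | wc -l | tr -d ' ')
n_participants=$(printf '%s\n' "$samples" | cut -c1-12 | sort -u | wc -l | tr -d ' ')

echo "samples: $n_samples, participants: $n_participants"
```

Here three samples map to two participants, which is the kind of discrepancy the Details Panel reports.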

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use the “Previous Page” and “Next Page” buttons to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the list accordingly. The number next to each platform is the number of files available for that platform. The only menu item available is “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
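Once downloaded, a file list of this kind can be filtered with standard tools. The sketch below assumes that the CSV has a header row, that its columns appear in the order listed above, and that no field contains a comma; the rows, platform names, bucket name, and the paths_for_platform function are all made up for illustration.

```shell
# A hypothetical file list in the column order described in the text.
csv=$(mktemp)
cat > "$csv" <<'EOF'
Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/a.txt
TCGA-AA-0002-01A,PlatformB,pipeline-2,Level 3,gs://example-bucket/b.txt
TCGA-AA-0003-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/c.txt
EOF

# Print the Cloud Storage path (column 5) of every file whose platform
# (column 2) matches the second argument, skipping the header row.
paths_for_platform() {
  awk -F, -v p="$2" 'NR > 1 && $2 == p { print $5 }' "$1"
}

paths_for_platform "$csv" PlatformA
```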

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top-right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. By default, the last cohort you created is selected. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top-right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.

1.4.8 Release Notes

• December 23, 2015: v0.2
  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.
• December 3, 2015: v0.1
  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity?
Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC?
Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access?
No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables?
The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data?
In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform?
In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too?
Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id?
Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically?
No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started?
Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that?
You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, providing starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

to produce a single data type is the Level-3 gene expression data: most tumor samples were processed at UNC and the normalized gene-expression values are based on the RSEM method, while some tumor samples were processed at BCGSC and the normalized gene-expression values are based on RPKM.

TCGA Data Reports

A number of useful Data Reports are available directly from TCGA. There are several different reports that you can access from that page, including these nice dashboards:

• Data Statistics: this dashboard provides high-level statistics describing TCGA data content and usage

• Project Case Overview: this dashboard provides a high-level snapshot of TCGA project progress through the multiple phases of sample analysis

Understanding Data Access

• Public Data: Sometimes the word "public" is misinterpreted as meaning "open". All of the TCGA data is public data, and much of it is open, meaning that it is accessible and available to all users, while some low-level TCGA data is controlled and restricted to authorized users.

• Open-Access Data: Depending on how you categorize the data, most of the TCGA data is open-access data. This includes all de-identified clinical and biospecimen data, as well as all Level-3 molecular data, including gene expression data, DNA methylation data, DNA copy-number data, protein expression data, somatic mutation calls, etc.

• Controlled-Access Data: All low-level sequence data (both DNA-seq and RNA-seq), the raw SNP array data (CEL files), germline mutation calls, and a small amount of other data are treated as controlled data and require that a user be properly authenticated and have dbGaP authorization prior to accessing these data.

Data Use Certification (DUC)

Investigator(s) requesting and receiving genomic data in accordance with the NIH Genomic Data Sharing (GDS) Policy are expected to:

• Submit a description of the proposed research project.

• Submit a data access request (DAR); the DAR requires NIH log-in.

Note: Requesters and institutional SOs must have an NIH eRA User ID and password to access the DAR. Visit electronic Research Administration (eRA) for more information on registering for an NIH eRA account. NIH staff may utilize their NIH log-in. (See additional instructions at Data Access Request Instructions.) dbGaP Data Access Request Portal

Additionally, they must:

• Submit a Data Use Certification (DUC) co-signed by the designated Institutional Official(s) at their sponsoring institution. Sample DUC form

• Protect data confidentiality. (Any data used in a study which was initially listed as "Controlled" or TCGA remains as controlled data and MUST be protected accordingly, unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

  - Project 1 has only users authorized to access Controlled data and a DUC in place.

  - Project 2 has only open-access users. NO Controlled Data Access is allowed from/by Project 2 member(s).)

Remember: YOU and YOUR Institution are accountable for ensuring the security of this data, not the cloud service provider. Securing controlled data and protecting it should be thought of in the same manner as any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: The Cloud, dbGaP and the NIH blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).

1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC, and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our github repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22000 files for both tumor and matched-normal samples

• VCF files: these Level-2 data files were provided by the DCC and include over 15000 files produced by several different centers (primarily Broad and BCGSC)

• MAF files: these "protected" mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types)

• DNA-seq BAM files: these Level-1 data files were provided by CGHub

  - over 37000 of these files are available in Google Cloud Storage (GCS)

  - roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

  - BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub

  - over 13000 mRNA-seq BAM files are available in GCS

  - over 16000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11000 tar files

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user's dbGaP authorization will be verified. These two steps are required before the user's Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more "cloud-aware" approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives based on our most recent upload.

Note that all "bio" archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha. These tables can be accessed via the BigQuery web interface (by anyone with an active GCP project).
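As a minimal sketch, the fully qualified name of a table in this dataset can be assembled for use in a query. The bracketed [project:dataset.table] form is BigQuery legacy SQL notation. The helper names below are illustrative only and are not part of any ISB-CGC tooling; the Biospecimen_data table is used simply because it is described later in this section.

```python
# Dataset named in the text above: project "isb-cgc", dataset "tcga_201510_alpha".
DATASET = "isb-cgc:tcga_201510_alpha"


def legacy_table_ref(table_name):
    """Return a legacy-SQL table reference of the form [project:dataset.table]."""
    return "[{}.{}]".format(DATASET, table_name)


def count_rows_query(table_name):
    """Build a query string that counts the rows of one table (illustrative helper)."""
    return "SELECT COUNT(*) AS n FROM {}".format(legacy_table_ref(table_name))


print(count_rows_query("Biospecimen_data"))
```

The resulting string can be pasted into the BigQuery web interface or submitted from a client library of your choice.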

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but quotation marks within a string were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
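The value-level rules in the list above (missing-value strings, numeric formats, CTRL-M characters, and surrounding quotes) can be sketched as a single normalization function. This is an illustration under stated assumptions, not the actual ISB-CGC ETL code: the function name, the exact set of placeholder strings, and the treatment of comma-grouped digits as thousands separators are all assumptions.

```python
import re

# Placeholder strings treated as missing (compared case-insensitively);
# the empty string stands in for a blank field.
MISSING = {"", "none", "null", "na", "__unknown__"}

# Plain or comma-grouped digits, optional decimals, optional exponent.
NUMBER = re.compile(r"[+-]?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?([eE][+-]?\d+)?")


def normalize_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    text = raw.replace("\r", "").strip()  # drop CTRL-M and surrounding whitespace
    # Remove one pair of quotes wrapping the whole value; inner quotes stay.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1].strip()
    if text.lower() in MISSING:
        return None
    if NUMBER.fullmatch(text):
        digits = text.replace(",", "")
        if any(ch in digits for ch in ".eE"):
            return float(digits)
        return int(digits)
    return text
```

Values that are neither missing nor numeric, such as barcodes, pass through unchanged apart from the trimming described above.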

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (a.k.a. patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Clinical Feature Selection

In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (a.k.a. tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.
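The first-pass selection rule can be sketched as follows, assuming the threshold is 20% of the participants within a single study. The input layout (a list of study and feature-set pairs) and the function name are hypothetical; the manual-curation additions are not modeled.

```python
from collections import Counter, defaultdict


def candidate_features(participants, threshold=0.20):
    """Select features present in at least `threshold` of participants in any one study.

    participants: iterable of (study, features) pairs, where features is the
    set of completed XML feature names found for one participant.
    """
    totals = Counter()               # participants per study
    counts = defaultdict(Counter)    # per-study count of participants having each feature
    for study, features in participants:
        totals[study] += 1
        counts[study].update(set(features))
    selected = set()
    for study, feature_counts in counts.items():
        for feature, n in feature_counts.items():
            if n / totals[study] >= threshold:
                selected.add(feature)
    return selected
```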

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing

Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.
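The merge described above can be sketched with plain dictionaries. The function name and input layout are hypothetical, and the choice to let empty follow-up values leave the patient-level value in place is an assumption that the text does not confirm.

```python
def merge_patient_record(patient_fields, follow_ups):
    """Merge patient-level fields with the most recent follow-up block.

    follow_ups is a list of (sequence_number, fields) pairs; only the pair
    with the highest sequence number is used.
    """
    merged = dict(patient_fields)
    if follow_ups:
        _, latest = max(follow_ups, key=lambda pair: pair[0])
        for name, value in latest.items():
            if value is not None:  # assumption: empty follow-up values do not overwrite
                merged[name] = value
    return merged
```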

Transforms

Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive:

  - days_to_last_known_alive should not be NULL

  - days_to_last_known_alive is set to days_to_last_followup

  - days_to_death is set to NULL

• for all patients with vital_status=Dead:

  - days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to "Alive")

  - days_to_last_known_alive is set to days_to_death

  - days_to_last_followup is set to NULL
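These survival-field rules can be sketched as a small function operating on one record, with None standing in for NULL. The function name is hypothetical, and it is an assumption that a patient re-labeled as "Alive" is then processed with the Alive rules.

```python
def apply_survival_rules(record):
    """Return a copy of a clinical record with survival fields completed."""
    rec = dict(record)
    # A "Dead" record without days_to_death but with follow-up is re-labeled "Alive".
    if (rec.get("vital_status") == "Dead"
            and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        rec["vital_status"] = "Alive"
    if rec.get("vital_status") == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif rec.get("vital_status") == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```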

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
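For reference, the standard BMI formula is weight in kilograms divided by the square of height in meters. The sketch below assumes height is recorded in centimeters and weight in kilograms, and rounds to one decimal place; the units, the rounding, and the function name are assumptions, not details confirmed by the text.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI (kg / m^2) rounded to one decimal, or None if inputs are unusable."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return round(weight_kg / height_m ** 2, 1)
```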

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

XML Parsing

The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field
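The pregnancies merge can be sketched as below. Following the listed values [0, 1, 2, 3, 4+], counts of four or more are mapped to "4+". Which of the two source fields wins when both are present is not stated in the text, so preferring the first is an assumption, as is the function name.

```python
def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Combine the two source fields into one categorical value: "0".."3" or "4+"."""
    value = pregnancies if pregnancies is not None else total_number_of_pregnancies
    if value is None:
        return None
    # Accept ints, numeric strings, and values that are already written as "4+".
    count = int(str(value).strip().rstrip("+"))
    return "4+" if count >= 4 else str(count)
```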

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
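The cleaning steps listed above can be sketched over rows represented as dictionaries. This is an illustration, not the pipeline's code: the function name is hypothetical, only the two renames given as examples are included, and the renaming is applied before filtering purely so that the filters can use the short column names.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Only the two renames mentioned in the text; the full mapping is not given.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Rename columns, then keep only build-37, somatic, Illumina rows on chr 1-22, X, Y."""
    cleaned = []
    for row in rows:
        row = {RENAMES.get(name, name): value for name, value in row.items()}
        keep = (
            str(row.get("build")) == "37"
            and str(row.get("chr")) in VALID_CHROMOSOMES
            and row.get("Mutation_Status") == "Somatic"
            and "illumina" in str(row.get("Sequencer", "")).lower()
        )
        if keep:
            cleaned.append(row)
    return cleaned
```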

Oncotator Annotation

Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal

Because many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process, and also because some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal, it is possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring first:

  - 01, 08, or 14 (all of which refer to broad.mit.edu)

  - 09, 21, or 30 (all of which refer to genome.wustl.edu)

  - 10 or 12 (both of which refer to hgsc.bcm.edu)

  - 13 or 31 (both of which refer to bcgsc.ca)

  - 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)
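One way to read these rules is as a lexicographic preference: matched-normal type first, then analyte, then center, then the 4-character sequence. The sketch below encodes that reading; the strict ordering, the field names tumor_aliquot and normal_type, and the function names are assumptions rather than the pipeline's actual implementation. The analyte letter is taken from zero-based index 19 of the aliquot barcode and the 4-character sequence from the slice 21:25.

```python
# Center codes in order of preference, grouped as in the list above.
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"), ("10", "12"), ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank for rank, codes in enumerate(CENTER_GROUPS) for code in codes}
NORMAL_RANK = {"blood derived normal": 0, "solid tissue normal": 1}
ANALYTE_RANK = {"D": 0, "W": 1}


def preference_key(call):
    """Sort key in which a larger tuple means a more preferred call."""
    barcode = call["tumor_aliquot"]
    return (
        -NORMAL_RANK.get(call["normal_type"].lower(), len(NORMAL_RANK)),
        -ANALYTE_RANK.get(barcode[19], len(ANALYTE_RANK)),
        -CENTER_RANK.get(barcode[-2:], len(CENTER_RANK)),
        barcode[21:25],  # compared as text; equal-length digit strings order numerically
    )


def pick_preferred(calls):
    """Choose one call among duplicates of the same mutation in one tumor sample."""
    return max(calls, key=preference_key)
```

Applied to the two aliquots named in the second rule, this selects TCGA-B0-5695-01A-11D-1534-10, matching the stated preference for the "D" analyte.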

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8435 tumor samples from 8373 patients.

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
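Because the participant and sample barcodes are prefixes of the aliquot barcode, the three levels used by the Clinical, Biospecimen, and aliquot-level tables can be recovered by slicing. The sketch below relies on the lengths of the example barcodes in this section (12 and 16 characters for participant and sample; the example aliquot barcode is 28 characters long); the function name is hypothetical.

```python
def barcode_levels(aliquot_barcode):
    """Split an aliquot barcode into participant-, sample-, and aliquot-level barcodes."""
    return {
        "participant": aliquot_barcode[:12],  # e.g. TCGA-04-1517
        "sample": aliquot_barcode[:16],       # e.g. TCGA-04-1517-01A
        "aliquot": aliquot_barcode,
    }
```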

Platform

DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline

DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 87E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
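A minimal sketch of the two ETL details above, selecting files by suffix and converting values that may be written in scientific notation. The function names are hypothetical, and converting num_probes through float before int is one possible way to accept exponent notation.

```python
def is_selected_segment_file(file_name):
    """True for the hg19 files that populate the BigQuery table."""
    return file_name.endswith("nocnv_hg19.seg.txt")


def parse_segment_values(num_probes, segment_mean):
    """Convert raw text fields to (int, float), accepting scientific notation."""
    return int(float(num_probes)), float(segment_mean)
```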

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform

DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline

DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

The BigQuery table is populated only with the files matching the pattern HumanMethylation*.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out
• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
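The merge rules above can be sketched as follows. This is an illustration under stated assumptions rather than the ETL code itself: the function name merge_methylation, the "sample" key, and the platform labels "27k" and "450k" are hypothetical; only Probe_Id, Beta_Value and the existence of a platform column come from the text.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge per-probe rows from both platforms into one flat list.

    Each input row is a dict with the keys "sample", "Probe_Id" and
    "Beta_Value" (a hypothetical layout chosen for this sketch).
    """
    samples_on_450k = {row["sample"] for row in rows_450k}
    merged = []
    for platform, rows in (("450k", rows_450k), ("27k", rows_27k)):
        for row in rows:
            # 450k data takes precedence for samples run on both platforms.
            if platform == "27k" and row["sample"] in samples_on_450k:
                continue
            # Probes without a usable beta value are left out.
            if row["Beta_Value"] in (None, "NA", "NULL"):
                continue
            merged.append({
                "sample": row["sample"],
                "Probe_Id": row["Probe_Id"],
                "Beta_Value": float(row["Beta_Value"]),
                "platform": platform,
            })
    return merged


rows_27k = [
    {"sample": "S1", "Probe_Id": "cg0001", "Beta_Value": "0.10"},
    {"sample": "S2", "Probe_Id": "cg0001", "Beta_Value": "0.20"},
    {"sample": "S2", "Probe_Id": "cg0002", "Beta_Value": "NA"},
]
rows_450k = [{"sample": "S1", "Probe_Id": "cg0001", "Beta_Value": "0.15"}]
merged = merge_methylation(rows_27k, rows_450k)
```

In this made-up input, sample S1 keeps only its 450k value and the NA probe for S2 is dropped, so two rows remain.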

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table
• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline

A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which consists of two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The raw gene_id value is split into its two components, which are stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
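The split described above can be sketched in a few lines of Python. The function name split_unc_gene_id and the lookup dictionary hgnc_by_gene_id are hypothetical stand-ins; in particular, the real HGNC lookup is not described in this document, so a plain dictionary is assumed here.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id):
    """Split a raw 'symbol|id' string into the three stored columns."""
    symbol, gene_id = raw_gene_id.split("|")
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # Hypothetical lookup: None when the id has no known HGNC symbol.
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }


hgnc_by_gene_id = {"7157": "TP53"}  # illustrative one-entry lookup table
record = split_unc_gene_id("TP53|7157", hgnc_by_gene_id)
```

The gene_id is kept as a string in this sketch because the document does not state the column type.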

BCGSC RNAseq Pipeline

A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized and RPKM. Of these, the gene and RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302|?|3of7_calculated.

The gene string is split into two or three components, which are stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If a component is simply ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
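A minimal sketch of this split is shown below. It assumes that the placeholder for an unknown component is the single character ? (an empty component is treated the same way), and it leaves any suffix such as _calculated untouched because the document does not say how it is handled. The function name split_bcgsc_gene is hypothetical.

```python
def split_bcgsc_gene(gene):
    """Split a BCGSC gene string into two or three stored columns."""
    # "?" (or an empty component) stands for an unknown value -> None/NULL.
    parts = [None if part in ("?", "") else part for part in gene.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three '|'-separated parts: %r" % gene)
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2] if len(parts) == 3 else None,
    }


two_part = split_bcgsc_gene("TP53|7157_calculated")
three_part = split_bcgsc_gene("Mir_1302|?|3of7_calculated")
```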

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, each referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL this string is split into two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
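The split-and-melt step described above can be sketched as follows. This is not the pipeline's own code: the function name melt_mirna_matrix, the in-memory layout of the matrix, the aliquot labels and the expression values are all made up for illustration, and the sketch assumes the gene name and accession are joined by a single period.

```python
def melt_mirna_matrix(matrix):
    """Flatten a {row_label: {aliquot: value}} matrix into tidy rows.

    Each output row is (aliquot, mirna_id, mimat_accession, value).
    """
    rows = []
    for label, values_by_aliquot in matrix.items():
        # Split "hsa-mir-21.MIMAT0000076" at the last period.
        mirna_id, mimat_accession = label.rsplit(".", 1)
        for aliquot, value in values_by_aliquot.items():
            rows.append((aliquot, mirna_id, mimat_accession, value))
    return rows


matrix = {
    "hsa-mir-21.MIMAT0000076": {"aliquot_A": 1523.0, "aliquot_B": 987.5},
}
tidy_rows = melt_mirna_matrix(matrix)
```

Each cell of the matrix becomes one row, which is what makes the result easy to load into a single flat table.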

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Core.protein_expression*.txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. The generation of the antibody, gene and protein map is explained in detail below.

Generation of the Composite_element_ref, gene and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns.
• If "protein_name" is missing, generate one from "composite_element_ref".
• Make a map of 'composite_element_ref', 'gene_name' and 'protein_name' values.
• Check any other variant of the gene and protein symbols in the table.
• HGNC validation:
  – If the gene symbol is in the HGNC approved symbols, 'Approved' Gene_symbol = Gene_symbol.
  – If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.
  – If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol.
  – If not, Gene_symbol = Gene_symbol.
• The file generated is manually curated and fed back into the algorithm.
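The HGNC validation cascade above can be read as a simple chain of lookups. The sketch below follows one literal reading of the steps (a symbol found among the aliases is kept as it is; a symbol found among the previous symbols is replaced by the approved symbol) and that reading is an assumption. The function name and the three lookup structures are hypothetical; the real HGNC tables are not described here.

```python
def validate_gene_symbol(symbol, approved, aliases, previous_to_approved):
    """Return the gene symbol to store after the HGNC checks.

    approved: set of HGNC approved symbols
    aliases: set of HGNC alias symbols
    previous_to_approved: dict mapping a previous symbol to its approved one
    """
    if symbol in approved:
        return symbol                        # already an approved symbol
    if symbol in aliases:
        return symbol                        # Gene_symbol = Alias_symbol
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # use the approved symbol
    return symbol                            # unknown: keep unchanged


# Made-up lookup tables, purely for illustration.
approved = {"GENE1"}
aliases = {"G1ALIAS"}
previous_to_approved = {"OLDGENE1": "GENE1"}
```

For example, with these made-up tables, validate_gene_symbol("OLDGENE1", approved, aliases, previous_to_approved) yields "GENE1", while an unrecognized symbol is passed through for manual curation.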

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row.
• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.
• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.
• Data from both Illumina GA and HiSeq platforms are stored in the same table.
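The row-duplication rule in the first bullet can be sketched as below. The function name expand_gene_rows and the row layout are hypothetical, and because the exact separator between concatenated gene names is not certain from this document, the sketch accepts commas and/or whitespace.

```python
import re


def expand_gene_rows(row):
    """Return one copy of `row` per gene listed in row["gene_name"]."""
    # Accept "AKT1 AKT2 AKT3" as well as "AKT1, AKT2, AKT3".
    genes = [g for g in re.split(r"[,\s]+", row["gene_name"]) if g]
    return [dict(row, gene_name=gene) for gene in genes]


row = {"composite_element_ref": "example_antibody", "gene_name": "AKT1 AKT2 AKT3"}
expanded = expand_gene_rows(row)
```

Every other column is copied unchanged into each new row, so downstream queries can filter on a single gene symbol.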

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table, isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)
  – isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls and Level-3 TCGA data available at the TCGA DCC as of July 2015
• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data
• November 2, 2015: first public release of TCGA open-access data in BigQuery tables
  – isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015
  – includes Annotations table with information about redacted samples, etc.
  – isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform
• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)
• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table
• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)
• Upload new data from CGHub into GCS (February)
• New set of BigQuery tables based on new data at the TCGA DCC (March)
• Upload TCGA MC3 VCF files from TCGA DCC into GCS (?)

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users with a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool or REST API for the data stored in BigQuery tables,
• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or
• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done, using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and also to add their own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python (https://github.com/isb-cgc/examples-Python/python) repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web app and then programmatically access these cohorts using this API.
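The relationship between sample and patient barcodes described above can be illustrated with a short sketch. The function name patient_barcode is hypothetical, and the length check simply encodes the 16-character and 12-character lengths stated in the text for TCGA.

```python
def patient_barcode(sample_barcode):
    """Return the 12-character patient prefix of a TCGA sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]


print(patient_barcode("TCGA-04-1517-01A"))  # TCGA-04-1517
```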

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
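The same request can be prepared from Python with only the standard library. This is a sketch under assumptions: the URL punctuation is reconstructed from the curl example, the endpoint may not be reachable, and the helper name build_preview_request is hypothetical. The request is only built here, not sent.

```python
import json
import urllib.request

# Reconstructed from the curl example above; treat as an assumption.
PREVIEW_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the JSON filters."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})
# To actually send it (requires network access and a live endpoint):
#     with urllib.request.urlopen(request) as response:
#         preview = json.load(response)
```

Keeping the request construction separate from the network call makes it easy to inspect the URL, headers and body before anything is sent.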

Access control

To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Parameter name Value Descriptionadenocarcinoma_invasion stringage_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed (in years)anatomic_neoplasm_subdivision string Text term to describe the spatial location subdivisions andor anatomic site name of a tumoravg_percent_lymphocyte_infiltration float Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimenavg_percent_monocyte_infiltration float Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenavg_percent_necrosis float Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenavg_percent_neutrophil_infiltration float Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenavg_percent_normal_cells float Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenavg_percent_stromal_cells float Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenavg_percent_tumor_cells float Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenavg_percent_tumor_nuclei float Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenbatch_number integer groups samples by the batch they were processed inbcr string A TCGA center where samples are carefully catalogued processed quality-checked and stored along with participant clinical informationclinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_N string Extent of the 
regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_stage string Stage group determined from clinical information on the tumor (T) regional node (N) and metastases (M) and by grouping cases with similar prognosis clinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentcolorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancercountry string Text to identify the name of the state province or country in which the sample was procuredcountry_of_procurement string Text to identify the name of the state province or country in which the sample was procureddays_to_birth integer Time interval from a personrsquos date of birth to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_collection integerdays_to_death integer Time interval from a personrsquos date of death to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_initial_pathologic_diagnosis integer Numeric value to represent the day of an individualrsquos initial pathologic diagnosis of cancerdays_to_last_followup integer Time interval from the date of last followup to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_submitted_specimen_dx integer Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis represented as a calculated number of dStudy string A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study Within the projecethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categoriesfrozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the 
frozen biospecimen tumor tissue samplegender string Text designations that identify gender Gender is described as the assemblage of properties that distinguish people on the basis of their societal roheight integer The height of the patient in centimetershistological_type string Text term for the structural pattern of cancer cells used to define a microscopic diagnosishistory_of_colon_polyps string YesNo indicator to describe if the subject had a previous history of colon polyps as noted in the historyphysical or previous endoscopic report(s)history_of_neoadjuvant_treatment string Text term to describe the patientrsquos history of neoadjuvant treatment and the kind of treament given prior to resection of the tumorhistory_of_prior_malignancy string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrencehpv_calls string Results of HPV testshpv_status string Current HPV statusicd_10 string The tenth version of the International Classification of Disease (ICD) published by the World Health Organization in 1992_A system of numbered cateicd_o_3_histology string The third edition of the International Classification of Diseases for Oncology published in 2000 used principally in tumor and cancer registries foicd_o_3_site string The third edition of the International Classification of Diseases for Oncology published in 2000 used principally in tumor and cancer registries folymph_node_examined_count integerlymphatic_invasion string a yesno indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvementlymphnodes_examined string the yesnounknown indicator whether a lymph node assessment was performed at the primary presentation of diseaselymphovascular_invasion_present string the yesno indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimenmax_percent_lymphocyte_infiltration integer 
Maximum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmax_percent_monocyte_infiltration integer Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen

Continued on next page

13 Programmatic Access 21

ISB Cancer Genomics Cloud Documentation Release 100

Table 11 ndash continued from previous pageParameter name Value Descriptionmax_percent_necrosis integer Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmax_percent_neutrophil_infiltration integer Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmax_percent_normal_cells integer Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenmax_percent_stromal_cells integer Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenmax_percent_tumor_cells integer Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenmax_percent_tumor_nuclei integer Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenmenopause_status string Text term to signify the status of a womanrsquos menopause the permanent cessation of menses usually defined by 6 to 12 months of amenorrheamin_percent_lymphocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmin_percent_monocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenmin_percent_necrosis integer Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmin_percent_neutrophil_infiltration integer Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmin_percent_normal_cells integer Minimum in the series of numeric values to represent the percentage of normal cells in a 
malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample: tumor or normal tissue, cell, or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”

has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: “True”, “False”, or “None”
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: “True”, “False”, or “None”
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: “True”, “False”, or “None”
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: “True”, “False”, or “None”
has_RPPA | string | Indicates if a sample has protein array data: “True”, “False”, or “None”
has_SNP6 | string | Indicates if a sample has copy number data: “True”, “False”, or “None”
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: “True”, “False”, or “None”
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: “True”, “False”, or “None”

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_atinitial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. “TCGA”
clinical_data.Study | string | Tumor type abbreviation, e.g. “BRCA”
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
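As an illustration of how the patient_details request and response fit together, here is a minimal Python sketch. It assumes the endpoint lives under the cohort_api v1 base URL shown in the HTTP request above and that patient_barcode is sent as a query-string parameter (the section calls it a path parameter, so treat that as an assumption). The helper names patient_details_url and summarize_patient are hypothetical and not part of the ISB-CGC API.

```python
from urllib.parse import urlencode

# Base URL reconstructed from the HTTP request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode: str) -> str:
    """Build a GET URL for patient_details (hypothetical helper).

    Assumption: the barcode is passed as a query-string parameter.
    """
    # Participant barcodes are 12 characters long, e.g. TCGA-B9-7268.
    if len(patient_barcode) != 12 or len(patient_barcode.split("-")) != 3:
        raise ValueError(f"not a participant barcode: {patient_barcode!r}")
    query = urlencode({"patient_barcode": patient_barcode})
    return f"{BASE_URL}/patient_details?{query}"


def summarize_patient(response: dict) -> dict:
    """Pull a few of the documented properties out of a decoded response."""
    clinical = response.get("clinical_data", {})
    return {
        "participant": clinical.get("ParticipantBarcode"),
        "study": clinical.get("Study"),
        "n_samples": len(response.get("samples", [])),
        "n_aliquots": len(response.get("aliquots", [])),
    }
```

The URL can then be fetched with any HTTP client and the JSON body decoded into a dict before calling summarize_patient. No authentication is needed for this method.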

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available “biospecimen” information about this sample, the associated patient barcode, a list of associated aliquots, and a list of “data_details” blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.
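The relationship between the two barcode lengths mentioned above can be sketched in Python. This assumes the standard TCGA barcode layout, in which the first 12 characters of a 16-character sample barcode are the participant barcode and the two digits after the third hyphen are the sample type code. The function name split_sample_barcode is hypothetical.

```python
def split_sample_barcode(sample_barcode: str) -> dict:
    """Split a 16-character TCGA sample barcode into its parts.

    Hypothetical helper; relies only on the fixed barcode layout.
    """
    if len(sample_barcode) != 16:
        raise ValueError(f"expected 16 characters, got {len(sample_barcode)}")
    return {
        # First 12 characters, e.g. TCGA-B9-7268
        "participant": sample_barcode[:12],
        # Two-digit sample type code, e.g. 01
        "sample_type_code": sample_barcode[13:15],
        # Trailing vial letter, e.g. A
        "vial": sample_barcode[15],
    }
```

For the example barcode TCGA-B9-7268-01A, the participant part is the same TCGA-B9-7268 value accepted by patient_details.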

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. “TCGA”
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. “BRCA”
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. “bcgsc.ca”
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. “cgcc”
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. “Complete Clinical Set”, “CNV (SNP Array)”, “DNA Methylation”, “Expression-Protein”, “Fragment Analysis Results”, “miRNASeq”, “Protected Mutations”, “RNASeq”, “RNASeqV2”, “Somatic Mutations”, “TotalRNASeqV2”
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. “hg19 (GRCh37)”
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. “bcgsc.ca__miRNASeq”
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. “IlluminaHiSeq_miRNASeq”
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. “Illumina HiSeq 2000”, “Ion Torrent PGM”, “AB SOLiD System 2.0”
data_details[].Project | string | The study for which the data was generated, e.g. “TCGA”
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. “DCC”, “CGHub”
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. “bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt”
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. “DBGap Protected Access”, “DBGap Open Access”
data_details_count | string | Length of data_details list
patient | string | Participant barcode
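Because data_details is a list with one block per data file, a common first step is to group the files by platform. The sketch below works on an already-decoded sample_details response and uses only property names documented in the table above. The function name files_by_platform and the sample values in the usage note are made up for illustration.

```python
from collections import defaultdict


def files_by_platform(response: dict) -> dict:
    """Group DataFileName values from data_details by their Platform.

    Hypothetical helper operating on a decoded sample_details response.
    """
    grouped = defaultdict(list)
    for detail in response.get("data_details", []):
        grouped[detail.get("Platform")].append(detail.get("DataFileName"))
    return dict(grouped)


def open_access_paths(response: dict) -> list:
    """Return non-empty CloudStoragePath values of open-access files only."""
    return [
        detail["CloudStoragePath"]
        for detail in response.get("data_details", [])
        if detail.get("SecurityProtocol") == "DBGap Open Access"
        and detail.get("CloudStoragePath")
    ]
```

For example, a response whose data_details holds two files on “IlluminaHiSeq_miRNASeq” and one on another platform would yield a dict with two keys, the first mapping to a two-element list.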

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with “file-path-not-yet-available”. If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
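The optional parameters and the two special cases in the response (a missing datafilenamekeys field, and placeholder entries containing “file-path-not-yet-available”) can be handled as in the following Python sketch. It assumes all parameters are sent in the query string; the helper names datafilenamekey_url and usable_paths are hypothetical.

```python
from urllib.parse import urlencode

# Base URL reconstructed from the HTTP request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, adding only the optional filters that were given."""
    params = {"sample_barcode": sample_barcode}
    for name, value in (("platform", platform), ("pipeline", pipeline), ("token", token)):
        if value is not None:
            params[name] = value
    return f"{BASE_URL}/datafilenamekey_list_from_sample?{urlencode(params)}"


def usable_paths(response: dict) -> list:
    """Return the listed paths, skipping placeholders for unavailable files.

    The datafilenamekeys field is absent when no paths are listed,
    so a missing key is treated as an empty list.
    """
    return [
        path
        for path in response.get("datafilenamekeys", [])
        if "file-path-not-yet-available" not in path
    ]
```

Without a token, only open-access paths are expected in the result; an empty list therefore does not necessarily mean the sample has no files.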

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either “0” or “1”
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
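Since the items list holds at most one object, client code mainly needs to distinguish "ids found" from "nothing registered for this sample". A minimal sketch on a decoded response follows; the function name genomics_ids is hypothetical and the sample values in any test data are invented.

```python
from typing import Optional, Tuple


def genomics_ids(response: dict) -> Optional[Tuple[str, str]]:
    """Return (dataset id, readgroupset id) for the sample, or None.

    Relies on the items list rather than the count string, so it behaves
    the same whether or not the count field is present.
    """
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first.get("GG_dataset_id"), first.get("GG_readgroupset_id")
```

A None result corresponds to the documented case where count is “0”.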

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute, with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs”, or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components” and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
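If you want to check the installed component versions from a script rather than by eye, the output of gcloud --version can be parsed line by line. The sketch below assumes the "name version" layout of the sample output above, where a component may also appear with no version at all; both function names are hypothetical, and read_gcloud_version only works where gcloud is on the PATH.

```python
import subprocess


def parse_gcloud_version(text: str) -> dict:
    """Map each component name in `gcloud --version` output to its version.

    Components listed without a version (such as a bare "gcloud" line)
    map to None.
    """
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last space so "Google Cloud SDK 98.0.0" keeps its full name.
        name, _, version = line.rpartition(" ")
        if name:
            versions[name] = version
        else:
            versions[line] = None
    return versions


def read_gcloud_version() -> dict:
    """Run `gcloud --version` and parse its output (requires the SDK)."""
    result = subprocess.run(
        ["gcloud", "--version"], capture_output=True, text=True, check=True
    )
    return parse_gcloud_version(result.stdout)
```

The exact set of components and the output layout may differ between SDK releases, so treat the parsed keys as informational.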

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
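When several of these properties need to be set together (for example on a new workstation), the gcloud config set invocations can be generated from a dictionary. This is a sketch only: the function name config_set_commands is hypothetical, and the project id and zone in the usage note are placeholder values, not ISB-CGC defaults.

```python
import subprocess


def config_set_commands(properties: dict) -> list:
    """Build one `gcloud config set NAME VALUE` argument list per property."""
    return [
        ["gcloud", "config", "set", name, str(value)]
        for name, value in properties.items()
    ]


def apply_config(properties: dict) -> None:
    """Run the generated commands in order (requires the Cloud SDK)."""
    for command in config_set_commands(properties):
        subprocess.run(command, check=True)
```

For instance, config_set_commands({"project": "my-project-id", "compute/zone": "us-central1-a"}) produces two argument lists that can be inspected before apply_config runs them. Running gcloud config list afterwards shows the resulting configuration.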

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you.

• Zone: choose one of the us-east or us-central zones.

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach). Note that as you change the specifications of the VM, the estimated cost shown on this page will update.

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks. You can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB).

Other options, below the "Management, disk, ..." line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM", so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
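The two figures quoted above imply a rate of $0.04 per GB per month for a standard persistent disk. The short Python sketch below simply reproduces that arithmetic; the rate is derived from the numbers in this section, treats 1 TB as 1000 GB, and may not match current pricing, and the function name monthly_disk_cost is hypothetical.

```python
# Assumed rate implied by the figures above: $2 for 50 GB, i.e. $0.04 per GB per month.
STANDARD_PD_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=STANDARD_PD_USD_PER_GB_MONTH):
    """Estimated monthly cost in USD of keeping a standard persistent disk of size_gb."""
    return round(size_gb * rate, 2)


print(monthly_disk_cost(50))    # a 50 GB disk
print(monthly_disk_cost(1000))  # 1 TB, taken here as 1000 GB
```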

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
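To summarize the VM life cycle described in this section, the following Python sketch lists the create, ssh, stop, and delete commands in order. It only builds the command lines and does not execute anything; the helper name vm_lifecycle_commands is hypothetical, and passing --zone explicitly is one option (the zone can instead come from the compute/zone property).

```python
import shlex


def vm_lifecycle_commands(name, zone, machine_type="g1-small"):
    """Return the create / ssh / stop / delete command lines for one VM, in order."""
    steps = [
        ["gcloud", "compute", "instances", "create", name,
         "--machine-type", machine_type, "--zone", zone],
        ["gcloud", "compute", "ssh", name, "--zone", zone],
        ["gcloud", "compute", "instances", "stop", name, "--zone", zone],
        ["gcloud", "compute", "instances", "delete", name, "--zone", zone],
    ]
    return [shlex.join(step) for step in steps]


for command in vm_lifecycle_commands("my-instance", "us-central1-a"):
    print(command)
```

Stopping keeps the VM and its disks (which continue to be billed) so that it can be restarted later, whereas deleting removes the instance.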


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g., the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Here my-instance is the name of the instance, --disk names the disk to attach, and --device-name sets the name under which the disk will appear inside the VM. Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point. Inside the VM, an attached disk typically appears under /dev/disk/by-id/ with the prefix google- followed by its device name; run ls /dev/disk/by-id/ to confirm the exact name before formatting.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
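The full life cycle of an additional disk can be sketched as an ordered list of steps, marking whether each command runs on your workstation ("local") or inside the instance ("vm"). This Python sketch only builds the command strings. The function name disk_workflow is hypothetical, and the device path is an assumption: it presumes the disk is attached with --device-name equal to the disk name, so that the guest sees it as /dev/disk/by-id/google-<disk>.

```python
def disk_workflow(disk, instance, size="500GB", mount_point="/mnt/pd1"):
    """Return (where, command) pairs covering create, attach, format, mount, detach, delete."""
    # Assumption: attached with --device-name <disk>, so the guest path is google-<disk>.
    device = f"/dev/disk/by-id/google-{disk}"
    return [
        ("local", f"gcloud compute disks create {disk} --size {size}"),
        ("local", f"gcloud compute instances attach-disk {instance} --disk {disk} --device-name {disk}"),
        ("vm", f"sudo mkfs.ext4 -F {device}"),   # erases any existing data on the disk
        ("vm", f"sudo mkdir -p {mount_point}"),
        ("vm", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("vm", f"sudo umount {mount_point}"),    # unmount before detaching
        ("local", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


for where, command in disk_workflow("disk-1", "my-instance"):
    print(f"[{where}] {command}")
```

The ordering matters: a disk must be unmounted and detached before it can be deleted, since disks in use by a VM instance cannot be removed.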


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the "Share Cohort" button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
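The three set operations behave like ordinary set arithmetic on the members of each cohort. The Python sketch below illustrates this with small cohorts of made-up participant barcodes; it is not the web application's implementation, and the function names are chosen here for illustration only.

```python
def union(*cohorts):
    """Members that appear in at least one cohort (order of cohorts does not matter)."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Members that appear in every cohort (order of cohorts does not matter)."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Members of the base cohort that appear in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, identified by made-up participant barcodes.
cohort_a = {"TCGA-02-0001", "TCGA-02-0003", "TCGA-02-0006"}
cohort_b = {"TCGA-02-0003", "TCGA-02-0006", "TCGA-02-0007"}
cohort_c = {"TCGA-02-0006"}

print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b, cohort_c)))
```

Unlike union and intersect, complement is not symmetric: swapping the base cohort with one of the subtracted cohorts generally gives a different result, which is why a base cohort has to be designated.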

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.
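A name-only search of this kind can be pictured as a simple substring match that returns two separate lists. The sketch below is an assumption-laden illustration rather than the web application's code: in particular, whether the real search ignores case is not stated here, and the case-insensitive comparison, the function name, and the example names are all choices made for this example.

```python
def search_by_name(cohort_names, visualization_names, query):
    """Return (matching cohorts, matching visualizations) using substring matching on names."""
    needle = query.lower()  # assumption: matching ignores case
    cohorts = [name for name in cohort_names if needle in name.lower()]
    visualizations = [name for name in visualization_names if needle in name.lower()]
    return cohorts, visualizations


cohorts, visualizations = search_by_name(
    ["BRCA females", "All TCGA Data", "Smokers"],
    ["BRCA expression plot", "Age histogram"],
    "brca",
)
print(cohorts)
print(visualizations)
```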

Sorting

You can sort the listing of cohorts and visualizations using the column headers. After you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
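The counts shown beside each filter value can be understood as "faceted" counts: for a given attribute, count the samples that pass every other selected filter, then tally them by that attribute's value. The following Python sketch illustrates the idea on a few invented sample records; the field names, values, and function name are hypothetical, and this is not the web application's implementation.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count values of `attribute` among samples that pass every *other* selected filter."""
    others = {key: allowed for key, allowed in selected.items() if key != attribute}
    kept = [
        sample for sample in samples
        if all(sample.get(key) in allowed for key, allowed in others.items())
    ]
    return Counter(sample.get(attribute) for sample in kept)


# Invented records for illustration only.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
]
selected = {"gender": {"FEMALE"}}

print(facet_counts(samples, selected, "vital_status"))
print(facet_counts(samples, selected, "gender"))
```

Because an attribute's own filter is excluded when counting its values, the unselected values of that attribute still show how many samples they would contribute if chosen.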

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
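The number shown when hovering over a swatch corresponds to counting the samples that share a particular pair of platforms on two adjacent data types. A minimal Python sketch of that counting, with "NA" standing in for a missing data type, might look like the following; the sample records, platform names, and function name are invented for illustration.

```python
from collections import Counter


def swatch_counts(samples, left_type, right_type):
    """Count samples per (platform for left_type, platform for right_type) pair."""
    return Counter(
        (sample.get(left_type, "NA"), sample.get(right_type, "NA"))
        for sample in samples
    )


# Invented records: each maps a data type to the platform that produced it.
samples = [
    {"DNA-Sequence": "platform-x", "RNA-Sequence": "platform-y"},
    {"DNA-Sequence": "platform-x", "RNA-Sequence": "platform-y"},
    {"DNA-Sequence": "platform-x"},  # no RNA-Sequence data, shown as "NA"
]

for pair, count in swatch_counts(samples, "DNA-Sequence", "RNA-Sequence").items():
    print(pair, count)
```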

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.


• Comments: Selecting "Comments" will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears, and you are prompted to select users registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
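The difference between the two counts can be illustrated with TCGA barcodes, in which the first three hyphen-separated fields of a sample barcode identify the participant. The Python sketch below uses made-up barcodes in that format; the function names are hypothetical.

```python
def participant_barcode(sample_barcode):
    """Return the participant part (first three hyphen-separated fields) of a sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of samples, number of distinct participants)."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(barcode) for barcode in samples}
    return len(samples), len(participants)


# Made-up barcodes: the first participant contributed two samples.
barcodes = ["TCGA-02-0001-01C", "TCGA-02-0001-10A", "TCGA-02-0003-01A"]
print(cohort_counts(barcodes))
```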

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform; selecting a filter will update the list accordingly. The number next to each platform is the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
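Once downloaded, the CSV file list can be processed with a short script, for example to group Cloud Storage paths by platform. The Python sketch below assumes the file has a header row with columns named "Sample Barcode", "Platform", "Pipeline", "Data Level", and "Cloud Storage Location"; the exact header names in the real download may differ, and the rows and bucket paths shown are invented.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text, platform_column="Platform", path_column="Cloud Storage Location"):
    """Group file paths by platform; the column names are assumptions about the CSV header."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row[platform_column]].append(row[path_column])
    return dict(grouped)


# Invented example content; real file lists will contain different values.
EXAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location
TCGA-02-0001-01C,platform-x,pipeline-a,Level 3,gs://example-bucket/file1.txt
TCGA-02-0003-01A,platform-x,pipeline-a,Level 3,gs://example-bucket/file2.txt
TCGA-02-0003-01A,platform-y,pipeline-b,Level 3,gs://example-bucket/file3.txt
"""

for platform, paths in paths_by_platform(EXAMPLE_CSV).items():
    print(platform, len(paths))
```

For a file saved to disk, the same function can be given the result of reading the file as text; if the header names differ, pass the actual names via platform_column and path_column.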

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active, and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort using the top-right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active, and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization using the top-right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top-right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platform.

  – Center Filter: filters down the plot-able features by processing center.


  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platform.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platform.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.


• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it has been saved correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active, and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts requires the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts requires the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and will be able to make changes.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – A treemap graph in the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, providing starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

• Protect data confidentiality. (Any data used in a study which was initially listed as “Controlled” or TCGA remains controlled data and MUST be protected accordingly, unless prior release authorization is obtained from the NCI data custodian.)

• Ensure that data security measures are in place. (Only project members authorized to receive controlled data should be listed in a project using controlled data. For example:

  Project 1 has only users authorized to access controlled data, and a DUC in place.

  Project 2 has only open-access users; NO controlled data access is allowed from/by Project 2 member(s).)

Remember: YOU and YOUR institution are accountable for ensuring the security of this data, not the cloud service provider. Securing and protecting controlled data should be thought of in the same manner as for any legacy system or server you used in the past. Your responsibilities for data protection are the same in a cloud environment. (For more information on this requirement, see NIH Security Best Practices for Controlled-Access Data.)

• The Investigator and their associated institution assume the responsibility for the security of the dbGaP data. As such, NIH has tried to provide as much information as possible for PIs, institutional signing officials (SOs), and the IT staff who will be supporting these projects, to make sure they understand their responsibilities. (Ref: “The Cloud, dbGaP and the NIH” blog post, 03/27/2015)

Finally, they must:

• Notify the appropriate Data Access Committee of policy violations, and

• Submit annual progress reports detailing significant research findings. See more at Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-Wide Association Studies (GWAS).

1.2.2 Hosted TCGA Data

All TCGA metadata is considered open-access. In other words, information about controlled-access data files is open-access. Metadata can be obtained programmatically using the ISB-CGC programmatic API.

An overview of the TCGA data currently hosted on the ISB-CGC platform is provided in the two sections below. The first section breaks the data down by access class (open vs. controlled), and the second section breaks it down by original source repository (DCC and CGHub).

TCGA Data by Access Class

Open-Access TCGA Data

The open-access TCGA data hosted by the ISB-CGC Platform includes:

• Clinical (de-identified) and Biospecimen data: these data were originally provided in XML files (Level-1) by the DCC;

• Somatic mutation data: these data were originally provided in MAF files (Level-2) by the DCC;

• DNA copy-number segments: these data were originally provided as segmentation files (Level-3) by the DCC;

• DNA methylation data: these data were originally provided as TSV files (Level-3) by the DCC;

• Gene (mRNA) expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• microRNA expression data: these data were originally provided as TSV files (Level-3) by the DCC;

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC; and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager.

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our GitHub repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22000 files for both tumor and matched-normal samples.

• VCF files: these Level-2 data files were provided by the DCC and include over 15000 files produced by several different centers (primarily Broad and BCGSC).

• MAF files: these “protected” mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types).

• DNA-seq BAM files: these Level-1 data files were provided by CGHub.

  – Over 37000 of these files are available in Google Cloud Storage (GCS).

  – Roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data.

  – BAM index (BAI) files are also available for all BAM files.

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub.

  – Over 13000 mRNA-seq BAM files are available in GCS.

  – Over 16000 miRNA-seq BAM files are available in GCS.

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11000 tar files.

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user’s dbGaP authorization will be verified. These two steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more “cloud-aware” approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015, into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions of an archive. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.
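The name-based revision check can be sketched in Python. This is a minimal illustration, not the DCC's or ISB-CGC's own tooling. It assumes archive names laid out as center_disease.platform.archive-type.serial-index.revision.series, with a platform name that contains no dots. That layout, the trailing series field, the function names, and the example archive names used in any test are assumptions made for illustration.

```python
from typing import NamedTuple


class ArchiveName(NamedTuple):
    center: str
    disease: str
    platform: str
    archive_type: str
    serial_index: int
    revision: int
    series: int


def parse_archive_name(name: str) -> ArchiveName:
    """Split an archive name into its parts (assumed layout, see lead-in)."""
    # Peel the four trailing dot-separated fields off first, because the
    # center domain at the front may itself contain dots.
    head, archive_type, serial, revision, series = name.rsplit(".", 4)
    center_and_disease, platform = head.rsplit(".", 1)
    center, disease = center_and_disease.rsplit("_", 1)
    return ArchiveName(center, disease, platform, archive_type,
                       int(serial), int(revision), int(series))


def is_newer(candidate: str, current: str) -> bool:
    """True if candidate is a later revision of the same archive as current."""
    a, b = parse_archive_name(candidate), parse_archive_name(current)
    # Only revisions of the same archive (same center, disease, platform,
    # archive type, and serial index) are comparable.
    if a[:5] != b[:5]:
        raise ValueError("not revisions of the same archive")
    return a.revision > b.revision
```

Comparing parsed integer fields rather than raw strings avoids ordering mistakes when revision numbers have different numbers of digits.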

Note that all “bio” archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.
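A listing with these three columns is straightforward to load and summarize. The sketch below assumes a tab-separated text file with no header row and the columns in the order given above; the delimiter, the absence of a header, and the function names are assumptions, since the file layout is not described here.

```python
import csv
import io


def read_listing(text):
    """Parse the three-column listing into a list of dicts."""
    rows = []
    for cghub_id, gcs_path, size in csv.reader(io.StringIO(text), delimiter="\t"):
        rows.append({"id": cghub_id, "gcs_path": gcs_path, "bytes": int(size)})
    return rows


def total_bytes(rows):
    """Total size, in bytes, of all files in the listing."""
    return sum(row["bytes"] for row in rows)
```

Summing the size column gives a quick estimate of how much data a planned analysis would read from GCS.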

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly accessible BigQuery dataset isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but in cases where there were quotation marks within a string, they were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
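The value-level conventions above (missing-value strings, number formats, line-ending and quote removal) can be illustrated with a small Python sketch. This is not the ETL code itself: the function name, the treatment of an empty string as missing, and the exact set of strings are assumptions based on the list above.

```python
import re

# Strings treated as missing. The set follows the notes above; treating the
# empty string as missing is an assumption.
MISSING_STRINGS = {"none", "None", "NONE", "null", "Null", "NULL",
                   "NA", "__UNKNOWN__", "<blank>", ""}

# Plain or scientific-notation number, checked after commas are removed.
_NUMBER = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def clean_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    value = raw.strip()  # also drops stray CTRL-M (\r) and newline characters
    # Remove one pair of matching quotes around the value; quotes inside stay.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    if value in MISSING_STRINGS:
        return None
    candidate = value.replace(",", "")  # drop thousands separators
    if _NUMBER.match(candidate):
        try:
            return int(candidate)
        except ValueError:
            return float(candidate)
    return value
```

Note that a naive comma-stripping rule like this one would also turn a value such as "1,2" into the integer 12, so a real pipeline would need to know which columns are numeric.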

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (a.k.a. patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)
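Because each barcode level extends the previous one, a participant or sample barcode can be recovered from a longer barcode by taking a prefix. The sketch below covers only the three levels used in this documentation. The lengths are simply the character counts of the example barcodes (TCGA-2G-AAM4, TCGA-2G-AAM4-10A, and TCGA-04-1517-01A-01D-0533-01), and the function names are illustrative.

```python
PARTICIPANT_LEN = len("TCGA-2G-AAM4")                 # 12 characters
SAMPLE_LEN = len("TCGA-2G-AAM4-10A")                  # 16 characters
ALIQUOT_LEN = len("TCGA-04-1517-01A-01D-0533-01")     # 28 characters

_LEVELS = {PARTICIPANT_LEN: "participant",
           SAMPLE_LEN: "sample",
           ALIQUOT_LEN: "aliquot"}


def barcode_level(barcode):
    """Classify a barcode as participant, sample, or aliquot by its length."""
    if not barcode.startswith("TCGA-") or len(barcode) not in _LEVELS:
        raise ValueError("unrecognized barcode: %r" % barcode)
    return _LEVELS[len(barcode)]


def participant_of(barcode):
    """Participant barcode for any participant, sample, or aliquot barcode."""
    barcode_level(barcode)  # validates the input
    return barcode[:PARTICIPANT_LEN]


def sample_of(barcode):
    """Sample barcode for a sample or aliquot barcode."""
    if barcode_level(barcode) == "participant":
        raise ValueError("a participant barcode does not identify a sample")
    return barcode[:SAMPLE_LEN]
```

Prefix extraction like this is what makes it possible to join aliquot-level tables to the sample-level and participant-level tables.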

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (a.k.a. tumor type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).
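The first-pass selection rule can be sketched as follows. The sketch assumes the threshold is a fraction of participants within a study (0.20 by default) and that the input has already been reduced to features whose procurement status is Completed. The input shape and the function name are hypothetical, and the manual-curation additions are not modeled.

```python
from collections import defaultdict


def select_features(participants, min_fraction=0.20):
    """Select features present in at least min_fraction of any one study.

    participants: iterable of (study, set_of_feature_names) pairs,
    one pair per participant.
    """
    totals = defaultdict(int)                       # participants per study
    counts = defaultdict(lambda: defaultdict(int))  # study -> feature -> count
    for study, features in participants:
        totals[study] += 1
        for feature in features:
            counts[study][feature] += 1

    selected = set()
    for study, per_feature in counts.items():
        for feature, n in per_feature.items():
            # Reaching the threshold in a single study is enough.
            if n / totals[study] >= min_fraction:
                selected.add(feature)
    return selected
```

Evaluating the threshold per study, rather than across all participants, keeps features that are specific to one tumor type.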

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.
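The merge step can be illustrated with plain dictionaries standing in for the parsed XML elements. This is a sketch under assumptions: the representation of follow-ups as (sequence number, elements) pairs is hypothetical, and skipping empty follow-up values so that they do not overwrite patient-block values is one possible reading of "carefully merged".

```python
def merge_patient_record(patient_elements, followups):
    """Merge patient-block elements with the most recent follow-up block.

    patient_elements: dict of element name -> value from the patient block.
    followups: list of (sequence_number, elements_dict) pairs.
    """
    merged = dict(patient_elements)
    if followups:
        # Keep only the follow-up with the highest sequence number.
        _, latest = max(followups, key=lambda item: item[0])
        for name, value in latest.items():
            # Follow-up values win, but an empty follow-up value does not
            # erase a value already present in the patient block (assumption).
            if value is not None:
                merged[name] = value
    return merged
```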

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive

  – days_to_last_known_alive should not be NULL

  – days_to_last_known_alive is set to days_to_last_followup

  – days_to_death is set to NULL

• for all patients with vital_status=Dead

  – days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to “Alive”)

  – days_to_last_known_alive is set to days_to_death

  – days_to_last_followup is set to NULL
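These rules can be written as a short Python function operating on one patient record. The field names come from the list above; the function name and the dictionary representation are illustrative. One assumption is made explicit: a patient reclassified from Dead to Alive is then processed with the Alive rules.

```python
def apply_survival_transforms(record):
    """Return a copy of a clinical record with survival fields completed."""
    rec = dict(record)
    status = rec.get("vital_status")

    # A "Dead" record without days_to_death but with follow-up information
    # is reclassified as "Alive".
    if (status == "Dead" and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        status = rec["vital_status"] = "Alive"

    if status == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif status == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```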

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
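For reference, BMI is weight in kilograms divided by the square of height in meters. The sketch below assumes height is recorded in centimeters and weight in kilograms; the units, the function name, and the handling of non-positive heights are assumptions, since they are not stated above.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI (kg / m^2), or None when either value is missing."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m * height_m)
```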

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files, which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen_data” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters:

• Samples for which is_ffpe=True were removed.

• Patients or Samples for which the Project value was not TCGA were removed.

Transforms:

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0, 1, 2, 3, 4+]).

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field.

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning:

• Remove any lines where the build is not 37.

• Remove any lines where the chr is not in [1-22, X, Y].

• Remove any lines where the Mutation_Status is not Somatic.

• Remove any lines where the Sequencer is not an Illumina platform.

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.).
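The cleaning steps can be sketched as a filter over rows represented as dictionaries. This is an illustration, not the pipeline's code: the function name is hypothetical, the column renaming is applied before filtering so that the build and chr names can be used, and "is an Illumina platform" is approximated by a case-insensitive substring match.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Renames taken from the examples above; the full mapping is not given.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Rename columns and drop rows that fail any of the cleaning rules."""
    cleaned = []
    for row in rows:
        row = {RENAMES.get(key, key): value for key, value in row.items()}
        if row.get("build") != "37":
            continue
        if row.get("chr") not in VALID_CHROMOSOMES:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        # Substring match is an assumption about how the platform was tested.
        if "illumina" not in row.get("Sequencer", "").lower():
            continue
        cleaned.append(row)
    return cleaned
```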

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several “current” MAF files, and deciding which one is the “best” is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the “blood derived normal” over the “solid tissue normal”

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the “D” analyte over the “W” analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are “D” (or both are “W”) analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring, in order:

  – 01, 08, or 14 (all of which refer to broad.mit.edu)

  – 09, 21, or 30 (all of which refer to genome.wustl.edu)

  – 10 or 12 (both of which refer to hgsc.bcm.edu)

  – 13 or 31 (both of which refer to bcgsc.ca)

  – 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8435 tumor samples from 8373 patients.
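The preference rules amount to a lexicographic ordering over candidate calls, which can be sketched as a sort key. The sketch assumes the calls passed in have already been grouped by tumor sample and position, and that each call carries the tumor aliquot barcode and the type of the matched normal. The dictionary keys and function names are hypothetical. Barcode positions are zero-based: character 19 is the analyte, 21:25 is the 4-character sequence, and 26:28 is the center code.

```python
# Lower rank means more preferred.
NORMAL_RANK = {"blood derived normal": 0, "solid tissue normal": 1}
ANALYTE_RANK = {"D": 0, "W": 1}
CENTER_RANK = {}
for _rank, _codes in enumerate([("01", "08", "14"),   # broad.mit.edu
                                ("09", "21", "30"),   # genome.wustl.edu
                                ("10", "12"),         # hgsc.bcm.edu
                                ("13", "31"),         # bcgsc.ca
                                ("18", "25")]):       # ucsc.edu
    for _code in _codes:
        CENTER_RANK[_code] = _rank


def preference_key(call):
    """Key that is larger for more preferred calls (for use with max)."""
    barcode = call["tumor_aliquot"]
    return (
        -NORMAL_RANK.get(call["normal_type"].lower(), len(NORMAL_RANK)),
        -ANALYTE_RANK.get(barcode[19], len(ANALYTE_RANK)),
        -CENTER_RANK.get(barcode[26:28], len(CENTER_RANK)),
        barcode[21:25],  # same-length strings, so larger value sorts higher
    )


def pick_preferred(calls):
    """Choose one call among duplicates of the same mutation."""
    return max(calls, key=preference_key)
```

Unknown normal types, analytes, or center codes rank after all listed values, so they are kept only when no preferred alternative exists.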

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
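The handling of scientific notation mentioned in the ETL details can be illustrated as follows. This is a minimal sketch rather than the actual ETL code, and the function name is hypothetical. The point it shows is that a value such as 8.7E+07 cannot be passed to `int()` directly as text, so the integer field is parsed as a float first.

```python
def parse_segment_fields(num_probes_text, segment_mean_text):
    """Convert the raw text of num_probes and segment_mean to int and float."""
    # int("8.7E+07") raises ValueError, so parse as float and then convert.
    num_probes = int(float(num_probes_text))
    segment_mean = float(segment_mean_text)
    return num_probes, segment_mean
```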

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: The BigQuery table is populated only with the Level-3 files whose names contain HumanMethylation and end in .txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
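The merge rules above can be sketched as follows. This is an illustration under stated assumptions, not the ISB-CGC ETL code: each input row is taken to be a `(sample, probe_id, beta_value_text)` tuple held in a list, the `sample` key and the `"27k"`/`"450k"` platform labels are hypothetical, and precedence is applied per sample.

```python
MISSING = {"NA", "NULL"}


def merge_methylation(rows_27k, rows_450k):
    """Merge 27k and 450k rows into one list of dicts with a platform column."""
    # Samples assayed on the 450k platform take precedence over 27k data.
    samples_450k = {sample for sample, _, _ in rows_450k}
    merged = []
    for platform, rows in (("450k", rows_450k), ("27k", rows_27k)):
        for sample, probe_id, beta_text in rows:
            if platform == "27k" and sample in samples_450k:
                continue  # the same sample already has 450k data
            if beta_text is None or beta_text.strip() in MISSING:
                continue  # probes without a Beta_Value are left out
            merged.append({
                "sample": sample,
                "Probe_Id": probe_id,
                "Beta_Value": float(beta_text),
                "platform": platform,
            })
    return merged
```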

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw “RSEM genes normalized results” files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which consists of two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components, which are stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
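The split of the raw gene_id value can be sketched like this. It is a minimal illustration, not the pipeline code: the function name and the `hgnc_by_gene_id` lookup table (a dict from gene_id to approved symbol) are hypothetical stand-ins for however the HGNC lookup is actually performed.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id):
    """Split a value such as 'TP53|7157' into the three stored columns."""
    symbol, gene_id = raw_gene_id.split("|")
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None when the hypothetical lookup table has no entry for this gene_id.
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }
```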

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw “gene quantification” files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components, which are stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply “?”, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
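A minimal sketch of this split is shown below. It is not the pipeline code. It assumes that an empty component or a lone "?" placeholder is what becomes NULL, and, because the text does not say how the "_calculated" suffix is treated, it leaves that suffix untouched. The function name is hypothetical.

```python
def split_bcgsc_gene(gene):
    """Split a BCGSC gene string into symbol, gene_id, and optional addenda."""
    # Assumption: empty components and a lone "?" are stored as NULL (None).
    parts = [None if part in ("", "?") else part for part in gene.split("|")]
    parts += [None] * (3 - len(parts))  # pad two-part strings to three slots
    symbol, gene_id, addenda = parts[:3]
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        "gene_addenda": addenda,
    }
```

Applied to the two examples in the text, `TP53|7157_calculated` yields no gene_addenda, while `Mir_1302||3of7_calculated` yields an empty (NULL) gene_id and `3of7_calculated` as the gene_addenda.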

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for “mature microRNAs”. This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. “hsa-mir-21.MIMAT0000076”. During ETL this string is split into two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
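The split-and-melt step described above can be sketched in plain Python. This is only an illustration: the matrix is assumed to be a dict from row label to a list of values (one per aliquot), the separator between gene name and accession is assumed to be the last ".", and the output column names are hypothetical.

```python
def melt_mirna_matrix(matrix, aliquot_barcodes):
    """Turn a wide microRNA-by-aliquot matrix into tidy (one value per row) records."""
    tidy = []
    for label, values in matrix.items():
        # Split "gene-name.accession" on the last "." only.
        mirna_id, accession = label.rsplit(".", 1)
        for barcode, value in zip(aliquot_barcodes, values):
            tidy.append({
                "mirna_id": mirna_id,
                "mirna_accession": accession,
                "AliquotBarcode": barcode,
                "value": value,
            })
    return tidy
```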

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Protein

The raw protein data file contains just two columns: the “Composite Element REF”, which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The “Composite Element REF” was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data in files whose names contain “_RPPA_Core.protein_expression” and end in “.txt”.

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map is generated.

Generation of Composite_element_ref, gene, and protein name map (Manual Curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If “protein_name” is missing, generate one from “composite_element_ref”

• Make a map of ‘composite_element_ref’, ‘gene_name’, ‘protein_name’ values

• Check any other variant of the gene and protein symbols in the table

• HGNC Validation

– If the gene symbol is in the HGNC approved symbols: ‘Approved’ Gene_symbol = Gene_symbol

– If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol

– If not, check the Previous symbols. If found: Gene_symbol = “Approved” Gene_symbol

– If not: Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm

Formatting

• Duplicate the rows if there are multiple genes concatenated in the “gene_name” value. For example, a ‘gene_name’ with a value like ‘AKT1 AKT2 AKT3’ is stored as three separate rows, with one gene in each row

• ‘Protein_Name’ is split into ‘Protein_Basename’ and ‘Phospho’, which are stored as separate columns

• ‘Composite element ref’ is parsed to get ‘validationStatus’ and ‘antibodySource’; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
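The row-duplication rule in the Formatting list can be sketched as follows. This is a minimal illustration, not the ETL code: the function name is hypothetical, and the separator between concatenated gene names is an assumption (whitespace, commas, or semicolons are all accepted here).

```python
import re


def expand_gene_rows(row):
    """Return one copy of the row per gene listed in its 'gene_name' value."""
    # Assumption: genes are separated by whitespace, commas, or semicolons.
    genes = [g for g in re.split(r"[\s,;]+", row["gene_name"]) if g]
    return [dict(row, gene_name=gene) for gene in genes]
```

With a row whose gene_name is ‘AKT1 AKT2 AKT3’, this produces three rows that differ only in gene_name.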

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.
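The idea of combining gene coordinates with copy-number segments amounts to an interval-overlap test, sketched below in plain Python. The field names (`chromosome`, `start`, `end`) are hypothetical and do not necessarily match the column names in the GENCODE_r19 or copy-number tables, and the coordinates in the usage note are made up. In practice the same condition would typically be expressed as a join in a BigQuery query.

```python
def segments_overlapping_gene(gene, segments):
    """Return the segments that overlap the gene's genomic interval."""
    return [
        seg for seg in segments
        if seg["chromosome"] == gene["chromosome"]
        and seg["start"] <= gene["end"]
        and seg["end"] >= gene["start"]
    ]
```

For a gene spanning positions 100 to 200 on a chromosome, a segment covering 50 to 120 on the same chromosome is returned, while a segment covering 201 to 300 is not.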

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to “bring the computation to the data”. There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single “monolithic” system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily “torn down” (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm’s resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python repository (https://github.com/isb-cgc/examples-Python).

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
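The relationship between sample and patient barcodes can be shown with a short sketch. This is only an illustration of the 16-character and 12-character convention described above. The function names and the dict layout of the cohort are hypothetical and are not the structure returned by the Cohort API.

```python
def patient_barcode(sample_barcode):
    """Return the 12-character patient barcode for a 16-character TCGA sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]


def cohort_from_samples(sample_barcodes):
    """Build a cohort as a list of samples plus the de-duplicated list of patients."""
    patients = sorted({patient_barcode(s) for s in sample_barcodes})
    return {"samples": list(sample_barcodes), "patients": patients}
```

For example, the sample barcode TCGA-04-1517-01A corresponds to the patient barcode TCGA-04-1517.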

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
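The same request can be prepared from Python with the standard library. This is a sketch under assumptions: the endpoint URL is taken to be https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort, the filter value "BRCA,OV" is our reading of the example above, and the function name is hypothetical. The request is only built here, not sent, and the endpoint may not be reachable.

```python
import json
import urllib.request

# Assumed endpoint URL, reconstructed from the curl example in the text.
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA,OV"})
# To actually call the service: urllib.request.urlopen(request)
```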

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion: string
age_at_initial_pathologic_diagnosis: string
anatomic_neoplasm_subdivision: string
avg_percent_lymphocyte_infiltration: float
avg_percent_monocyte_infiltration: float
avg_percent_necrosis: float
avg_percent_neutrophil_infiltration: float
avg_percent_normal_cells: float
avg_percent_stromal_cells: float
avg_percent_tumor_cells: float
avg_percent_tumor_nuclei: float
batch_number: integer
bcr: string
clinical_M: string
clinical_N: string
clinical_stage: string
clinical_T: string
colorectal_cancer: string
country: string
country_of_procurement: string
days_to_birth: integer
days_to_collection: integer
days_to_death: integer
days_to_initial_pathologic_diagnosis: integer
days_to_last_followup: integer
days_to_submitted_specimen_dx: integer
Study: string
ethnicity: string
frozen_specimen_anatomic_site: string
gender: string
height: integer
histological_type: string
history_of_colon_polyps: string
history_of_neoadjuvant_treatment: string
history_of_prior_malignancy: string
hpv_calls: string
hpv_status: string
icd_10: string
icd_o_3_histology: string
icd_o_3_site: string
lymph_node_examined_count: integer
lymphatic_invasion: string
lymphnodes_examined: string
lymphovascular_invasion_present: string
max_percent_lymphocyte_infiltration: integer
max_percent_monocyte_infiltration: integer
max_percent_necrosis: integer
max_percent_neutrophil_infiltration: integer
max_percent_normal_cells: integer
max_percent_stromal_cells: integer
max_percent_tumor_cells: integer
max_percent_tumor_nuclei: integer
menopause_status: string
min_percent_lymphocyte_infiltration: integer
min_percent_monocyte_infiltration: integer
min_percent_necrosis: integer
min_percent_neutrophil_infiltration: integer
min_percent_normal_cells: integer
min_percent_stromal_cells: integer
min_percent_tumor_cells: integer
min_percent_tumor_nuclei: integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
mononucleotide_marker_panel_analysis_status: string
neoplasm_histologic_grade: string
new_tumor_event_after_initial_treatment: string
number_of_lymphnodes_examined: integer
number_of_lymphnodes_positive_by_he: integer
ParticipantBarcode: string
pathologic_M: string
pathologic_N: string
pathologic_stage: string
pathologic_T: string
person_neoplasm_cancer_status: string
pregnancies: string
preservation_method: string
primary_neoplasm_melanoma_dx: string
primary_therapy_outcome_success: string
prior_dx: string
Project: string
psa_value: float
race: string
residual_tumor: string
SampleBarcode: string
tobacco_smoking_history: string
total_number_of_pregnancies: integer
tumor_tissue_site: string
tumor_pathology: string
tumor_type: string
weiss_venous_invasion: string
vital_status: string
weight: integer
year_of_initial_pathologic_diagnosis: string
SampleTypeCode: string
has_Illumina_DNASeq: string
has_BCGSC_HiSeq_RNASeq: string
has_UNC_HiSeq_RNASeq: string
has_BCGSC_GA_RNASeq: string
has_UNC_GA_RNASeq: string
has_HiSeq_miRnaSeq: string
has_GA_miRNASeq: string
has_RPPA: string
has_SNP6: string
has_27k: string
has_450k: string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample (tumor or normal tissue, cell, or blood sample) provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the Illumina HiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”

Continued on next page

22 Chapter 1 Contents

ISB Cancer Genomics Cloud Documentation Release 100

Table 11 ndash continued from previous pageParameter name Value Descriptionhas_UNC_HiSeq_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_BCGSC_GA_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_UNC_GA_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_HiSeq_miRnaSeq string Indicates if a sample has microRNA data from the IlluminaHiSeq platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_GA_miRNASeq string Indicates if a sample has microRNA data from the IlluminaGA platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_RPPA string Indicates if a sample has protein array data ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_SNP6 string Indicates if a sample has copy number data ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_27k string Indicates if a sample has methylation data from the Illumina 27k platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_450k string Indicates if a sample has methylation data from the Illumina 450k platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquo

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Required. Barcode of the patient to get information about.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
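
As a rough illustration of calling patient_details from Python, the sketch below builds the request URL and decodes the JSON response using only the standard library. It assumes that patient_barcode is sent as a query-string parameter (the parameter table labels it a path parameter, so this may need adjusting), and the helper names patient_details_url and fetch_patient_details are made up for this example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the HTTP request section for this endpoint.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the patient_details URL for a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes have length 12, e.g. TCGA-B9-7268")
    # Assumption: the barcode is passed in the query string.
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


def fetch_patient_details(patient_barcode):
    """Call the endpoint (no authentication required) and decode the JSON body."""
    with urlopen(patient_details_url(patient_barcode)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling fetch_patient_details("TCGA-B9-7268") should then return a dictionary whose clinical_data, aliquots and samples entries follow the table above, provided the service is reachable.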


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode
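
Once a sample_details response has been decoded into a Python dictionary, the data_details list is usually the part worth summarizing. The sketch below groups file names by platform. The function name files_by_platform and the abbreviated example_response, including its file names, are hypothetical and only illustrate the documented field layout.

```python
from collections import defaultdict


def files_by_platform(sample_details):
    """Group DataFileName values from a sample_details response by Platform."""
    grouped = defaultdict(list)
    for entry in sample_details.get("data_details", []):
        platform = entry.get("Platform", "unknown")
        grouped[platform].append(entry.get("DataFileName"))
    return dict(grouped)


# Hypothetical, heavily abbreviated response; the file names are placeholders.
example_response = {
    "patient": "TCGA-B9-7268",
    "aliquots": [],
    "data_details_count": "2",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "file_a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "file_b.txt"},
    ],
}

summary = files_by_platform(example_response)

# data_details_count is documented as a string, so convert it before comparing.
total_files = sum(len(names) for names in summary.values())
counts_match = int(example_response["data_details_count"]) == total_files
```

Note that counts such as data_details_count arrive as strings, so numeric comparisons need an explicit int() conversion.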


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
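
The optional filters and the possibly absent datafilenamekeys field can be handled as in the sketch below. It assumes the parameters are passed in the query string, and the helper names datafilenamekey_url and extract_paths are invented for this example.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request section for this endpoint.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, adding only the optional parameters that were given."""
    params = {"sample_barcode": sample_barcode}
    optional = {"platform": platform, "pipeline": pipeline, "token": token}
    params.update({name: value for name, value in optional.items() if value is not None})
    # Assumption: all parameters are sent in the query string.
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def extract_paths(response_body):
    """Return the file paths, or an empty list when the field is missing."""
    # The field is omitted entirely when no file paths are visible to the caller.
    return response_body.get("datafilenamekeys", [])
```

Treating a missing datafilenamekeys field as an empty list keeps calling code simple when a sample has only controlled-access files and the caller is not dbGaP authorized.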


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
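
Because count is returned as the string "0" or "1", a small helper can make the "no Google Genomics data" case explicit. The sketch below assumes count sits at the top level of the response, as the property table describes. The function name genomics_ids and the id values in the usage note are placeholders.

```python
def genomics_ids(response_body):
    """Return (dataset_id, readgroupset_id), or None when no ids exist for the sample."""
    # count is documented as a string, either "0" or "1".
    if int(response_body.get("count", "0")) == 0:
        return None
    item = response_body["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]
```

For example, passing {"count": "1", "items": [{"SampleBarcode": "TCGA-B9-7268-01A", "GG_dataset_id": "dataset-placeholder", "GG_readgroupset_id": "readgroupset-placeholder"}]} yields the pair of placeholder ids, while {"count": "0"} yields None.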


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on Read the Docs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15
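
If you want to check the installed component versions from a script rather than by eye, the output can be parsed with a few lines of Python. This is only a sketch: the function name parse_gcloud_version is made up, and it assumes the one-component-per-line layout shown above, which may differ between SDK releases.

```python
def parse_gcloud_version(output):
    """Map each component name to its version string from `gcloud --version` output."""
    versions = {}
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if line.startswith("Google Cloud SDK"):
            versions["Google Cloud SDK"] = parts[-1]
        else:
            # Components listed without a version (e.g. "gcloud") map to None.
            versions[parts[0]] = parts[1] if len(parts) == 2 else None
    return versions
```

The text to parse can be captured with subprocess.run(["gcloud", "--version"], capture_output=True, text=True).stdout on Python 3.7 or later, provided gcloud is on your PATH.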

Google Cloud Shell Google Cloud Shell provides you with command-line access to computing resources hosted onGCP is available from the Console Cloud Shell provides you with a temporary VM running a Debian-based LinuxOS with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed

From the Console you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand cornerbetween your GCP project name and the ldquoSend feedbackrdquo icon If you click on that icon (the hover-card should readldquoActivate Google Cloud Shellrdquo) it will take a minute or two for you VM to be provisioned after which you will see aprompt saying ldquoWelcome to Cloud Shellrdquo in the new window that has appeared at the bottom of your Console pageYou can ldquopoprdquo that window out of your browser page by clicking on the ldquoOpen in new windowrdquo icon in the upperright-hand corner of the shell window

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you.

• Zone: choose one of the us-east or us-central zones.

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach). Note that as you change the specifications of the VM, the estimated cost shown on this page will update.

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The "Change" button opens a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, Red Hat, etc.) or previously created images or disks. You can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB).

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM", so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of your instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
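For scripted workflows, the same command can be assembled from Python. The following is a minimal sketch and not part of the ISB-CGC tooling: the helper names build_create_command and create_instance are hypothetical, and the only gcloud arguments used are the ones shown in this section plus the --zone flag. By default it only prints the command; pass run=True only on a machine where the Cloud SDK is installed and authenticated.

```python
import shlex
import subprocess


def build_create_command(name, machine_type="g1-small", zone=None):
    """Return the argv list for `gcloud compute instances create`.

    Hypothetical helper for illustration; it only assembles arguments.
    """
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    if zone is not None:
        # When --zone is omitted, gcloud can use the compute/zone property.
        cmd += ["--zone", zone]
    return cmd


def create_instance(name, run=False, **kwargs):
    """Print the command, and execute it only when run=True."""
    cmd = build_create_command(name, **kwargs)
    print(" ".join(shlex.quote(part) for part in cmd))
    if run:
        # Requires an installed and authenticated Cloud SDK.
        subprocess.check_call(cmd)
    return cmd


if __name__ == "__main__":
    create_instance("my-instance", zone="us-central1-a")
```

Keeping the command as a list of arguments (rather than one string) avoids shell-quoting problems if an instance name or zone ever comes from user input.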

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources that are attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
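The two figures quoted above imply a rate of roughly $0.04 per GB per month for a standard persistent disk. As a rough sketch, that rate can be used to estimate the monthly cost of other disk sizes. The rate is inferred from this section only (treating 1 TB as 1000 GB) and is an assumption, not a current price list; the function name monthly_disk_cost is hypothetical.

```python
# Rate implied by the figures in this section ($2 for 50 GB, $40 for 1 TB).
# This is an assumption for illustration, not a current price list.
STANDARD_PD_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=STANDARD_PD_USD_PER_GB_MONTH):
    """Estimated monthly cost in USD of a standard persistent disk."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * rate, 2)


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```

Check the current Google Cloud pricing pages before relying on any estimate of this kind.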

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The --device-name option sets the name under which the disk appears on the instance (as /dev/disk/by-id/google-<device-name>). Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes", the default, when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
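The full sequence described in this section (create, attach, detach, delete) can be sketched as a small Python helper that lists the gcloud commands in order. This is only an illustration: the function name disk_lifecycle_commands is hypothetical, it builds argument lists without running anything, and the formatting and mounting steps are left out because they are run on the instance itself rather than through gcloud.

```python
import shlex


def disk_lifecycle_commands(disk, instance, size_gb, zone=None):
    """Return the gcloud commands for one disk's lifecycle, in order.

    Hypothetical helper: it only builds argument lists and runs nothing.
    """
    zone_args = ["--zone", zone] if zone else []
    gcloud = ["gcloud", "compute"]
    return [
        gcloud + ["disks", "create", disk, "--size", f"{size_gb}GB"] + zone_args,
        gcloud + ["instances", "attach-disk", instance, "--disk", disk] + zone_args,
        # Formatting, mounting and unmounting happen on the instance itself,
        # between the attach and detach steps.
        gcloud + ["instances", "detach-disk", instance, "--disk", disk] + zone_args,
        gcloud + ["disks", "delete", disk] + zone_args,
    ]


for cmd in disk_lifecycle_commands("disk-1", "my-instance", 500):
    print(" ".join(shlex.quote(part) for part in cmd))
```

Printing the commands first makes it easy to review them before running each one by hand.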

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after you select at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after you select at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and select one or more users from a list of users that are already registered in the system. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
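The three set operations behave like ordinary set algebra on the samples in each cohort. The sketch below models a cohort as a Python set of sample identifiers to show why order does not matter for union and intersect but does matter for complement. It is an illustration only, not the web application's implementation; the function names and the identifiers S1 to S5 are made up.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort."""
    return set().union(*cohorts)


def intersect(first, *rest):
    """Samples that appear in every cohort."""
    return set(first).intersection(*rest)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the others."""
    return set(base).difference(*others)


# Made-up sample identifiers, for illustration only.
a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
c = {"S3", "S5"}

print(sorted(union(a, b, c)))       # every sample in any cohort
print(sorted(intersect(a, b, c)))   # samples shared by all three cohorts
print(sorted(complement(a, b, c)))  # samples in a but not in b or c
print(sorted(complement(b, a)))     # a different base gives a different result
```

Swapping the arguments to union or intersect leaves the result unchanged, whereas complement depends on which cohort is chosen as the base.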

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort order is ascending or descending.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you can enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears, and you are prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
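To see why the sample and participant counts can differ, consider the sketch below. It assumes the standard TCGA barcode layout, in which the first three hyphen-separated fields of a sample barcode identify the participant. The barcodes are made up for illustration and the function names are hypothetical; this is not how the web application computes its counts.

```python
def participant_barcode(sample_barcode):
    """Return the participant portion (first three fields) of a TCGA barcode."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Count distinct samples and distinct participants in a cohort."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(b) for b in samples}
    return len(samples), len(participants)


# Made-up barcodes for illustration only; they do not refer to real samples.
cohort = [
    "TCGA-ZZ-0001-01A",  # tumor sample from participant TCGA-ZZ-0001
    "TCGA-ZZ-0001-11A",  # normal tissue sample from the same participant
    "TCGA-ZZ-0002-01A",  # tumor sample from a second participant
]
n_samples, n_participants = cohort_counts(cohort)
print(n_samples, "samples from", n_participants, "participants")
```

Here two of the three samples come from the same participant, so the cohort has more samples than participants.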

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
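Once downloaded, the file list can be processed with a short script. The following sketch uses Python's csv module to collect the Cloud Storage paths for one platform. The column header strings, platform names, pipeline names, and gs:// paths in the sample data are all assumptions made for illustration; check the header row of your own download and adjust the names accordingly.

```python
import csv
import io

# Made-up file list for illustration; the header names are assumptions.
SAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-ZZ-0001-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/a.txt
TCGA-ZZ-0002-01A,PlatformB,pipeline_2,Level 3,gs://example-bucket/b.txt
TCGA-ZZ-0003-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/c.txt
"""


def paths_for_platform(csv_text, platform):
    """Return the file paths of all rows whose Platform column matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["File Path"] for row in reader if row["Platform"] == platform]


for path in paths_for_platform(SAMPLE_CSV, "PlatformA"):
    print(path)
```

To read a real download, replace io.StringIO(csv_text) with an opened file, for example open("file_list.csv", newline="").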

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top-right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top-right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top-right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  - Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Platform Filter: filters down the plot-able features by platform.

  - Center Filter: filters down the plot-able features by processing center.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

  - miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  - Platform Filter: filters down the plot-able features by platform.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  - Platform Filter: filters down the plot-able features by platform.

  - Gene Region Filter: filters down the plot-able features by specific gene regions.

  - CpG Island Region Filter: filters down the plot-able features by CpG Island region.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

144 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein-product from a gene Only one proteingene may be viewed at any one time but the user may select multiple cohortsin order to compare the mutation patterns observed in different cohorts

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

The Settings panel provides the following options for making a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will become active and you can then proceed to delete them.

• From SeqPeek Visualization: when viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: as owner of a cohort or visualization, you are able to edit and share it.

• Reader: if a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

– A treemap graph on the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download is not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email: info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

• Protein expression data: these data were originally provided as TSV files (Level-3) by the DCC; and

• TCGA Annotations data: annotations were obtained from the TCGA Annotations Manager.

in Google Cloud Storage (GCS): The data files described above are available to all ISB-CGC users in an open-access GCS bucket (gs://isb-cgc-open).

in BigQuery: The information scattered over tens of thousands of XML and TSV files at the DCC is provided in a much more accessible form in a series of BigQuery tables. For more details, including tutorials and code examples in Python or R, please see our GitHub repositories.

This introductory tutorial gives a great overview of all of the tables and pointers on how to get started exploring them. Be sure to check it out!

Controlled-Access TCGA Data

The controlled-access TCGA data hosted by the ISB-CGC Platform includes:

• SNP array CEL files: these Level-1 data files were provided by the DCC and include over 22,000 files for both tumor and matched-normal samples.

• VCF files: these Level-2 data files were provided by the DCC and include over 15,000 files produced by several different centers (primarily Broad and BCGSC).

• MAF files: these “protected” mutation files (Level-2) were provided by the DCC (note that these files were not generated uniformly for all tumor types).

• DNA-seq BAM files: these Level-1 data files were provided by CGHub.

– over 37,000 of these files are available in Google Cloud Storage (GCS)

– roughly 90% of these BAM files contain exome data; the remaining 10% contain whole-genome data

– BAM index (BAI) files are also available for all BAM files

• mRNA- and microRNA-seq BAM files: these Level-1 data files were provided by CGHub.

– over 13,000 mRNA-seq BAM files are available in GCS

– over 16,000 miRNA-seq BAM files are available in GCS

• mRNA-seq FASTQ files: these Level-1 data files were provided by CGHub and include over 11,000 tar files.

in Google Cloud Storage: At this time, all of these controlled-access data files are stored in GCS in the original form, as obtained from the data repository.

In order to access these controlled data, a user of the ISB-CGC must first be authenticated by NIH (via the ISB-CGC web-app). Upon successful authentication, the user’s dbGaP authorization will be verified. These two steps are required before the user’s Google identity is added to the access control list (ACL) for the controlled data. At this time, this access must be renewed every 24 hours.

in Google Genomics: In the future, BAM and VCF data will also be available in other forms in order to allow other modes of data access (e.g. using the GA4GH API). This will open up new, faster, more “cloud-aware” approaches to working with these data (as illustrated by some of these Google Genomics Cookbooks).

TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC, there may be multiple revisions of an archive. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives, based on our most recent upload.

Note that all “bio” archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD, which is not backwards compatible with the previous XSD. This update took place over the course of the month of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• the unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha. These tables can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, and <blank>, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.

• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but quotation marks within a string were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
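The general formatting notes above can be sketched as a single value-cleaning function. This is only an illustration of the rules as described, not the actual ISB-CGC ETL code: the function name `clean_value`, the use of the empty string to stand for a blank field, and the thousands-separator pattern are all assumptions.

```python
import re

# Placeholder strings treated as missing, taken from the notes above.
# The empty string stands in for a blank field (an assumption).
MISSING_STRINGS = {"none", "None", "NONE", "null", "Null", "NULL",
                   "NA", "__UNKNOWN__", ""}

_THOUSANDS_RE = re.compile(r"^[+-]?\d{1,3}(,\d{3})+(\.\d+)?$")
_INT_RE = re.compile(r"^[+-]?\d+$")
_FLOAT_RE = re.compile(r"^[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?$")


def clean_value(raw):
    """Return None, an int, a float, or a cleaned string for one raw field."""
    # Drop CTRL-M characters and surrounding whitespace.
    value = raw.replace("\r", "").strip()
    # Remove one pair of matching quotes around the whole value only;
    # quotation marks inside the string are left alone.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in "'\"":
        value = value[1:-1]
    if value in MISSING_STRINGS:
        return None
    # "1,234" style comma separators are dropped before numeric parsing.
    if _THOUSANDS_RE.match(value):
        value = value.replace(",", "")
    if _INT_RE.match(value):
        return int(value)
    if _FLOAT_RE.match(value):
        return float(value)      # also handles scientific notation
    return value
```

A value such as `"__UNKNOWN__"` (with or without surrounding quotes) comes back as `None`, which corresponds to a NULL in the table, while identifiers such as TCGA barcodes pass through unchanged as strings.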

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (a.k.a. patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (a.k.a. tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).
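The first-pass selection rule can be sketched as follows. This is a minimal illustration, assuming the threshold is 20% of the participants within a single study and that the "Completed" features have already been collected per participant; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def select_features(records, min_fraction=0.20):
    """Select features present in at least `min_fraction` of one study.

    `records` is an iterable of (study, participant, completed_features),
    where completed_features is the set of XML features whose
    procurement_status is "Completed" for that participant.
    """
    participants = defaultdict(set)   # study -> participants seen
    carriers = defaultdict(set)       # (study, feature) -> participants
    for study, participant, features in records:
        participants[study].add(participant)
        for feature in features:
            carriers[(study, feature)].add(participant)

    selected = set()
    for (study, feature), who in carriers.items():
        # A feature qualifies as soon as any one study reaches the threshold.
        if len(who) / len(participants[study]) >= min_fraction:
            selected.add(feature)
    return selected
```

The manual-curation pass described above would then add further features to the returned set by hand.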

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive:

– days_to_last_known_alive should not be NULL

– days_to_last_known_alive is set to days_to_last_followup

– days_to_death is set to NULL

• for all patients with vital_status=Dead:

– days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to “Alive”)

– days_to_last_known_alive is set to days_to_death

– days_to_last_followup is set to NULL
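The survival-field rules above can be sketched on a single patient record held as a Python dict, with `None` standing for NULL. The function name is hypothetical, and one point is an assumption: a "Dead" record that is reclassified as "Alive" is then handled by the "Alive" rules, which the text does not state explicitly.

```python
def apply_survival_transforms(patient):
    """Return a copy of `patient` with the survival fields normalised."""
    p = dict(patient)
    status = p.get("vital_status")

    # A "Dead" record with no days_to_death but with a follow-up value
    # is reclassified as "Alive".
    if (status == "Dead" and p.get("days_to_death") is None
            and p.get("days_to_last_followup") is not None):
        status = p["vital_status"] = "Alive"

    if status == "Alive":
        p["days_to_last_known_alive"] = p.get("days_to_last_followup")
        p["days_to_death"] = None
    elif status == "Dead":
        p["days_to_last_known_alive"] = p.get("days_to_death")
        p["days_to_last_followup"] = None
    return p
```

Returning a copy rather than modifying the input keeps the original parsed record available for comparison.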

The following fields were extracted from the cqcf block of the XML file:

bull gleason_score_combined

bull country

bull history_of_prior_malignancy

bull frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
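As an illustration of the BMI step, here is a minimal sketch using the standard definition, weight in kilograms divided by the square of height in metres. The units of the source fields (centimetres and kilograms) and the function name are assumptions; the text above does not specify them, nor any rounding.

```python
def compute_bmi(height_cm, weight_kg):
    """Return BMI, or None when either value is missing or unusable."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)
```

For example, a height of 170 and a weight of 65 give a BMI of roughly 22.5 under these assumptions.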

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files, which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen_data” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed.

• Patients or Samples for which the Project value was not TCGA were removed.

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g. [0, 1, 2, 3, 4+]).

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field.
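The pregnancies merge can be sketched as below. Two details are assumptions rather than documented behaviour: the `pregnancies` field is preferred when both source fields are present, and a count of exactly four is mapped to "4+" (the example value list contains no plain 4). The function name is hypothetical.

```python
def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Combine the two source fields into one of "0", "1", "2", "3", "4+"."""
    raw = pregnancies if pregnancies is not None else total_number_of_pregnancies
    if raw is None:
        return None
    # Accept ints as well as strings such as "2" or "4+".
    count = int(str(raw).strip().rstrip("+"))
    return "4+" if count >= 4 else str(count)
```

The lymph-node fields could be merged with the same "first non-missing value" pattern, without the capping step.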

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

Each MAF file underwent some simple data-cleaning, was then annotated using Oncotator, and was then further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37.

• Remove any lines where the chr is not in [1-22, X, Y].

• Remove any lines where the Mutation_Status is not Somatic.

• Remove any lines where the Sequencer is not an Illumina platform.

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.).
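The data-cleaning steps above can be sketched as a generator over MAF rows held as dicts. This is an illustration only: the function name is hypothetical, only the two column renames mentioned above are included, the rename is applied before the filters so that the filters can use the new labels, and the Illumina check is a simple substring test (an assumption about how the Sequencer values look).

```python
ALLOWED_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Only the two renames named in the text; the full mapping is not given.
COLUMN_RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Yield the rows that survive the data-cleaning filters."""
    for row in rows:
        # Relabel columns to the names Oncotator expects.
        row = {COLUMN_RENAMES.get(key, key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in ALLOWED_CHROMOSOMES:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        yield row
```

Rows could be read with `csv.DictReader(handle, delimiter="\t")` and passed straight to this function.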

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several “current” MAF files, and deciding which one is the “best” is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the “blood derived normal” over the “solid tissue normal”

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the “D” analyte over the “W” analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are “D” (or both are “W”) analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), in the following order of preference:

– 01, 08, or 14 (all of which refer to broad.mit.edu)

– 09, 21, or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25 in zero-based slice notation)
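The preference rules above amount to a lexicographic ordering over candidate calls for the same mutation in the same tumor sample. The sketch below expresses that ordering as a sort key; it is not the actual ISB-CGC implementation. The dict keys `tumor_aliquot` and `normal_type`, and the function names, are hypothetical, and grouping calls by sample and position is left to the caller.

```python
NORMAL_RANK = {"blood derived normal": 0, "solid tissue normal": 1}
ANALYTE_RANK = {"D": 0, "W": 1}

# Center codes in order of preference, as listed above.
CENTER_RANK = {}
for rank, codes in enumerate([("01", "08", "14"), ("09", "21", "30"),
                              ("10", "12"), ("13", "31"), ("18", "25")]):
    for code in codes:
        CENTER_RANK[code] = rank


def preference_key(call):
    """Larger keys are preferred; compare with max()."""
    barcode = call["tumor_aliquot"]           # e.g. TCGA-B0-5695-01A-11D-1534-10
    normal = NORMAL_RANK.get(call["normal_type"].lower(), len(NORMAL_RANK))
    analyte = ANALYTE_RANK.get(barcode[19], len(ANALYTE_RANK))
    center = CENTER_RANK.get(barcode[26:28], len(CENTER_RANK))
    # Ranks are negated so that "most preferred" sorts highest; the 4-character
    # field at positions 21:25 breaks remaining ties, larger value first.
    return (-normal, -analyte, -center, barcode[21:25])


def pick_preferred(calls):
    """Return the single call to keep from a group of duplicate calls."""
    return max(calls, key=preference_key)
```

Comparing the 4-character field as a string gives the same order as comparing it numerically when it consists of four digits, and avoids a conversion error if it does not.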

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8435 tumor samples from 8373 patients.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)
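Because participant, sample, and aliquot barcodes are nested prefixes of one another, the identifiers used by the Clinical and Biospecimen tables can be derived from an aliquot barcode. A minimal sketch, assuming hyphen-separated barcodes shaped like the examples in this section; the function name and the returned keys are hypothetical.

```python
def parse_barcode(barcode):
    """Split a TCGA barcode into the participant/sample/aliquot levels it contains."""
    parts = barcode.split("-")
    if parts[0] != "TCGA" or len(parts) < 3:
        raise ValueError("not a TCGA barcode: %r" % barcode)
    levels = {"participant": "-".join(parts[:3])}   # e.g. TCGA-04-1517
    if len(parts) >= 4:
        levels["sample"] = "-".join(parts[:4])      # e.g. TCGA-04-1517-01A
    if len(parts) >= 7:
        levels["aliquot"] = "-".join(parts[:7])     # full aliquot barcode
    return levels
```

Calling `parse_barcode` on the example aliquot barcode above yields the participant and sample barcodes that can be used to join copy-number segments to the Clinical and Biospecimen tables.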

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 8.7E+07), and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: The BigQuery table is populated only with the files matching the pattern HumanMethylation*.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out;

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table.
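The merge rules above can be sketched as follows. This is a minimal illustration, not the actual ETL code: the function name, the output column names, and the reading of "sample" as the 16-character sample prefix of the aliquot barcode are all assumptions.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge (aliquot_barcode, probe_id, beta_value) tuples from both platforms."""
    # Assumption: a "sample" is identified by the first 16 barcode characters.
    samples_on_450k = {barcode[:16] for barcode, _, _ in rows_450k}
    merged = []
    for platform, rows in (("450k", rows_450k), ("27k", rows_27k)):
        for barcode, probe_id, beta in rows:
            if beta in (None, "NA", "NULL"):
                continue  # probes without a Beta_Value are left out
            if platform == "27k" and barcode[:16] in samples_on_450k:
                continue  # the 450k data takes precedence for this sample
            merged.append({
                "AliquotBarcode": barcode,
                "Probe_Id": probe_id,
                "Beta_Value": float(beta),
                "Platform": platform,
            })
    return merged
```

Only the probe identifier and the beta value are carried over, and each output row records which platform it came from.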

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table;

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table.

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which consists of two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns, original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
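As an illustration of the split described above, the following sketch turns a raw UNC gene_id string into the three stored columns. The function name and the tiny lookup table are hypothetical; a real lookup would use the full HGNC data.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id=None):
    """Split e.g. 'TP53|7157' into the columns described in the text."""
    symbol, gene_id = raw_gene_id.split("|")
    # Hypothetical lookup: maps a gene_id to the current HGNC approved symbol.
    hgnc_symbol = (hgnc_by_gene_id or {}).get(gene_id)
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        "HGNC_gene_symbol": hgnc_symbol,
    }


# Toy lookup table containing a single entry, for illustration only.
print(split_unc_gene_id("TP53|7157", {"7157": "TP53"}))
```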

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id and, if there is a third component, a gene_addenda column. If a component is simply "?", that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
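The two-or-three-part split can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, and treating both an empty component and a lone "?" as the placeholder that becomes NULL is a guess about the raw files. The gene_id component is left unmodified.

```python
def split_bcgsc_gene(raw_gene):
    """Split a BCGSC gene string into two or three named components."""
    # Assumption: an empty component or a lone "?" is the placeholder -> NULL.
    parts = [None if part in ("", "?") else part for part in raw_gene.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three '|'-separated parts: %r" % raw_gene)
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2] if len(parts) == 3 else None,
    }


for example in ("TP53|7157_calculated", "Mir_1302||3of7_calculated"):
    print(split_bcgsc_gene(example))
```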

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
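The "melt" step described above can be sketched in plain Python. This is an illustration only: the function name, the output column names and the sample values are hypothetical, and the "." separator between the gene name and the accession is an assumption.

```python
def melt_mirna_matrix(header, rows, sep="."):
    """Turn a wide matrix (one column per aliquot) into tidy records.

    header: ["mirna", barcode_1, barcode_2, ...]
    rows:   [[combined_name, value_1, value_2, ...], ...]
    """
    tidy = []
    for row in rows:
        # Split the combined name at the last separator (assumed to be ".").
        mirna_name, accession = row[0].rsplit(sep, 1)
        for barcode, value in zip(header[1:], row[1:]):
            tidy.append({
                "mirna_id": mirna_name,
                "mirna_accession": accession,
                "AliquotBarcode": barcode,
                "value": float(value),
            })
    return tidy
```

Each cell of the matrix becomes one record, so a matrix with M microRNAs and N aliquots yields M x N rows.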

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "*_RPPA_Core.protein_expression*.txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene and protein map is generated.

Generation of the Composite_element_ref, gene and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns.

• If "protein_name" is missing, generate one from "composite_element_ref".

• Make a map of 'composite_element_ref', 'gene_name' and 'protein_name' values.

• Check for any other variants of the gene and protein symbols in the table.

• HGNC validation:

  – If the gene symbol is in the HGNC approved symbols: 'Approved' Gene_symbol = Gene_symbol

  – If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol

  – If not, check the Previous symbols. If found: Gene_symbol = "Approved" Gene_symbol

  – If not: Gene_symbol = Gene_symbol

• The generated file is manually curated and fed back into the algorithm.
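The HGNC validation cascade in the list above can be sketched as a small function. This is one possible reading, not the curated pipeline itself: the function name and the three lookup structures are hypothetical, and keeping an alias symbol unchanged is an interpretation of "Gene_symbol = Alias_symbol".

```python
def validate_gene_symbol(symbol, approved, aliases, previous_to_approved):
    """Apply the approved -> alias -> previous -> fallback checks in order."""
    if symbol in approved:
        return symbol  # already an approved symbol
    if symbol in aliases:
        return symbol  # assumption: a known alias symbol is kept as-is
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # replaced by the approved symbol
    return symbol  # unknown symbols pass through for manual curation


# Toy lookup tables for illustration; these are not real HGNC records.
approved = {"GENE1"}
aliases = {"G1ALIAS"}
previous_to_approved = {"OLDGENE1": "GENE1"}
print(validate_gene_symbol("OLDGENE1", approved, aliases, previous_to_approved))
```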

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene per row.

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.

• Data from both Illumina GA and HiSeq platforms are stored in the same table.
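The row-duplication rule from the formatting list above can be sketched as follows. The function name and the example row are hypothetical, and splitting on whitespace is an assumption about how the genes are concatenated.

```python
def expand_gene_rows(row):
    """Return one copy of the row per gene listed in its 'gene_name' value."""
    # Assumption: multiple genes are separated by whitespace.
    return [dict(row, gene_name=gene) for gene in row["gene_name"].split()]


# Hypothetical annotation row with three concatenated genes.
row = {"composite_element_ref": "Akt-R-V", "gene_name": "AKT1 AKT2 AKT3"}
for expanded in expand_gene_rows(row):
    print(expanded)
```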

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section provides links to existing sources of information elsewhere on the web, or describes additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This table can be used to map between MIMAT accession IDs, miR names and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table, isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

  – isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

  – isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

  – includes an Annotations table with information about redacted samples, etc.

  – isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples illustrating how you can work with these TCGA data using a variety of programming environments, such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users with a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python repository (https://github.com/isb-cgc/examples-Python).

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets, such as CCLE, may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
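The barcode lengths mentioned above make it straightforward to move from a longer barcode to the sample or patient it belongs to. The helper names below are hypothetical; the example barcode is the aliquot barcode used earlier in this documentation.

```python
def sample_barcode(barcode):
    """Return the 16-character sample barcode prefix."""
    return barcode[:16]


def patient_barcode(barcode):
    """Return the 12-character patient (participant) barcode prefix."""
    return barcode[:12]


aliquot = "TCGA-04-1517-01A-01D-0533-01"
print(sample_barcode(aliquot), patient_barcode(aliquot))
```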

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
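The same request can be prepared from Python using only the standard library. This is a sketch under assumptions: the helper name is hypothetical, the filter value format mirrors the curl example above, and the request is only constructed here, not sent (the endpoint may have changed since this release).

```python
import json
import urllib.request

# URL of the preview endpoint, as given in the curl example above.
PREVIEW_URL = ("https://api-dot-isb-cgc.appspot.com"
               "/_ah/api/cohort_api/v1/preview_cohort")


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the JSON filters."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA,OV"})
# To send it: response_text = urllib.request.urlopen(request).read().decode("utf-8")
```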

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion string
age_at_initial_pathologic_diagnosis string
anatomic_neoplasm_subdivision string
avg_percent_lymphocyte_infiltration float
avg_percent_monocyte_infiltration float
avg_percent_necrosis float
avg_percent_neutrophil_infiltration float
avg_percent_normal_cells float
avg_percent_stromal_cells float
avg_percent_tumor_cells float
avg_percent_tumor_nuclei float
batch_number integer
bcr string
clinical_M string
clinical_N string
clinical_stage string
clinical_T string
colorectal_cancer string
country string
country_of_procurement string
days_to_birth integer
days_to_collection integer
days_to_death integer
days_to_initial_pathologic_diagnosis integer
days_to_last_followup integer
days_to_submitted_specimen_dx integer
Study string
ethnicity string
frozen_specimen_anatomic_site string
gender string
height integer
histological_type string
history_of_colon_polyps string
history_of_neoadjuvant_treatment string
history_of_prior_malignancy string
hpv_calls string
hpv_status string
icd_10 string
icd_o_3_histology string
icd_o_3_site string
lymph_node_examined_count integer
lymphatic_invasion string
lymphnodes_examined string
lymphovascular_invasion_present string
max_percent_lymphocyte_infiltration integer
max_percent_monocyte_infiltration integer
max_percent_necrosis integer
max_percent_neutrophil_infiltration integer
max_percent_normal_cells integer
max_percent_stromal_cells integer
max_percent_tumor_cells integer
max_percent_tumor_nuclei integer
menopause_status string
min_percent_lymphocyte_infiltration integer
min_percent_monocyte_infiltration integer
min_percent_necrosis integer
min_percent_neutrophil_infiltration integer
min_percent_normal_cells integer
min_percent_stromal_cells integer
min_percent_tumor_cells integer
min_percent_tumor_nuclei integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status string
mononucleotide_marker_panel_analysis_status string
neoplasm_histologic_grade string
new_tumor_event_after_initial_treatment string
number_of_lymphnodes_examined integer
number_of_lymphnodes_positive_by_he integer
ParticipantBarcode string
pathologic_M string
pathologic_N string
pathologic_stage string
pathologic_T string
person_neoplasm_cancer_status string
pregnancies string
preservation_method string
primary_neoplasm_melanoma_dx string
primary_therapy_outcome_success string
prior_dx string
Project string
psa_value float
race string
residual_tumor string
SampleBarcode string
tobacco_smoking_history string
total_number_of_pregnancies integer
tumor_tissue_site string
tumor_pathology string
tumor_type string
weiss_venous_invasion string
vital_status string
weight integer
year_of_initial_pathologic_diagnosis string
SampleTypeCode string
has_Illumina_DNASeq string
has_BCGSC_HiSeq_RNASeq string
has_UNC_HiSeq_RNASeq string
has_BCGSC_GA_RNASeq string
has_UNC_GA_RNASeq string
has_HiSeq_miRnaSeq string
has_GA_miRNASeq string
has_RPPA string
has_SNP6 string
has_27k string
has_450k string

Parameter name Value Descriptionadenocarcinoma_invasion stringage_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed (in years)anatomic_neoplasm_subdivision string Text term to describe the spatial location subdivisions andor anatomic site name of a tumoravg_percent_lymphocyte_infiltration float Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimenavg_percent_monocyte_infiltration float Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenavg_percent_necrosis float Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenavg_percent_neutrophil_infiltration float Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenavg_percent_normal_cells float Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenavg_percent_stromal_cells float Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenavg_percent_tumor_cells float Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenavg_percent_tumor_nuclei float Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenbatch_number integer groups samples by the batch they were processed inbcr string A TCGA center where samples are carefully catalogued processed quality-checked and stored along with participant clinical informationclinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_N string Extent of the 
regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_stage string Stage group determined from clinical information on the tumor (T) regional node (N) and metastases (M) and by grouping cases with similar prognosis clinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentcolorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancercountry string Text to identify the name of the state province or country in which the sample was procuredcountry_of_procurement string Text to identify the name of the state province or country in which the sample was procureddays_to_birth integer Time interval from a personrsquos date of birth to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_collection integerdays_to_death integer Time interval from a personrsquos date of death to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_initial_pathologic_diagnosis integer Numeric value to represent the day of an individualrsquos initial pathologic diagnosis of cancerdays_to_last_followup integer Time interval from the date of last followup to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_submitted_specimen_dx integer Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis represented as a calculated number of dStudy string A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study Within the projecethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categoriesfrozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the 
frozen biospecimen tumor tissue samplegender string Text designations that identify gender Gender is described as the assemblage of properties that distinguish people on the basis of their societal roheight integer The height of the patient in centimetershistological_type string Text term for the structural pattern of cancer cells used to define a microscopic diagnosishistory_of_colon_polyps string YesNo indicator to describe if the subject had a previous history of colon polyps as noted in the historyphysical or previous endoscopic report(s)history_of_neoadjuvant_treatment string Text term to describe the patientrsquos history of neoadjuvant treatment and the kind of treament given prior to resection of the tumorhistory_of_prior_malignancy string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrencehpv_calls string Results of HPV testshpv_status string Current HPV statusicd_10 string The tenth version of the International Classification of Disease (ICD) published by the World Health Organization in 1992_A system of numbered cateicd_o_3_histology string The third edition of the International Classification of Diseases for Oncology published in 2000 used principally in tumor and cancer registries foicd_o_3_site string The third edition of the International Classification of Diseases for Oncology published in 2000 used principally in tumor and cancer registries folymph_node_examined_count integerlymphatic_invasion string a yesno indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvementlymphnodes_examined string the yesnounknown indicator whether a lymph node assessment was performed at the primary presentation of diseaselymphovascular_invasion_present string the yesno indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimenmax_percent_lymphocyte_infiltration integer 
Maximum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmax_percent_monocyte_infiltration integer Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen

Continued on next page

13 Programmatic Access 21

ISB Cancer Genomics Cloud Documentation Release 100

Table 11 ndash continued from previous pageParameter name Value Descriptionmax_percent_necrosis integer Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmax_percent_neutrophil_infiltration integer Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmax_percent_normal_cells integer Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenmax_percent_stromal_cells integer Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenmax_percent_tumor_cells integer Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenmax_percent_tumor_nuclei integer Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenmenopause_status string Text term to signify the status of a womanrsquos menopause the permanent cessation of menses usually defined by 6 to 12 months of amenorrheamin_percent_lymphocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmin_percent_monocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenmin_percent_necrosis integer Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmin_percent_neutrophil_infiltration integer Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmin_percent_normal_cells integer Minimum in the series of numeric values to represent the percentage of normal cells in a 
malignant tumor sample or specimen
min_percent_stromal_cells  integer  Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells  integer  Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei  integer  Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade  string  Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment  string  Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined  integer  The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he  integer  Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode  string  The barcode assigned by TCGA to the Participant
pathologic_M  string  Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N  string  The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage  string  The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T  string  Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status  string  The state or condition of an individual's neoplasm at a particular point in time
pregnancies  string  Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method  string
primary_neoplasm_melanoma_dx  string  Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success  string  Measure of Success
prior_dx  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project  string  The study for which the data was generated
psa_value  float  The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor  string  Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode  string  The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history  string  Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies  integer
tumor_tissue_site  string  Text term that describes the anatomic site of the tumor or disease
tumor_pathology  string
tumor_type  string  Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion  string  The result of an assessment using the Weiss histopathologic criteria
vital_status  string  The survival state of the person registered on the protocol
weight  integer  The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode  string  The type of the sample: tumor or normal tissue, cell, or blood sample provided by a participant
has_Illumina_DNASeq  string  Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"


Table 1.1 (continued)
Parameter name  Value  Description
has_UNC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq  string  Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq  string  Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA  string  Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6  string  Indicates if a sample has copy number data: "True", "False", or "None"
has_27k  string  Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k  string  Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
patient_count  string  Number of participants in this cohort
patients[]  list  List of participant barcodes in this cohort
sample_count  string  Number of samples in this cohort
samples[]  list  List of sample barcodes in this cohort
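Because the two counts in this response are typed as strings, a client will usually want to convert them before doing any arithmetic. The following is a minimal sketch in Python of one way to do that. The helper name `cohort_counts` and the example response are made up for illustration and are not part of the API; only the field names come from the table above.

```python
def cohort_counts(response):
    """Return (patient_count, sample_count) as integers.

    `response` is the decoded JSON body of the cohort response. The helper
    name is hypothetical; only the field names come from the documentation.
    """
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    # Counts arrive as strings; fall back to the list lengths if one is missing.
    patient_count = int(response.get("patient_count", len(patients)))
    sample_count = int(response.get("sample_count", len(samples)))
    return patient_count, sample_count


# Illustrative response only; the barcodes reuse the examples from this section.
example = {
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
}
print(cohort_counts(example))
```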

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name  Value  Description
Path parameters
patient_barcode  string  Barcode of the patient to get information about. Required.
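As a rough illustration of the request above, the sketch below builds the GET URL for a participant barcode using only the Python standard library. It assumes that `patient_barcode` is sent as a query-string parameter, which this section does not state explicitly, and the function name `patient_details_url` is hypothetical. The endpoint may no longer be reachable, so the example only prints the URL instead of calling it.

```python
from urllib.parse import urlencode

# Base URL as written in this section.
API_ROOT = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the patient_details URL (query-string encoding is an assumption)."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters, e.g. TCGA-B9-7268")
    query = urlencode({"patient_barcode": patient_barcode})
    return API_ROOT + "/patient_details?" + query


print(patient_details_url("TCGA-B9-7268"))
```

The returned URL could then be fetched with `urllib.request.urlopen` and decoded with `json.load`, since no authentication is required for this method.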

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_atinitial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
clinical_data  nested object  The clinical data about the participant
clinical_data.ParticipantBarcode  string  Participant barcode
clinical_data.Project  string  Project name, e.g. "TCGA"
clinical_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis  string  Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision  string  Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number  string  Groups samples by the batch they were processed in
clinical_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M  string  Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N  string  Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T  string  Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage  string  Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer  string  Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country  string  Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth  string  Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis  string  Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup  string  Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity  string  The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site  string  Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender  string  Text designations that identify gender
clinical_data.histological_type  string  Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps  string  Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment  string  Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls  string  Results of HPV tests
clinical_data.hpv_status  string  Current HPV status
clinical_data.icd_10  string  The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology  string  The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site  string  The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion  string  A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined  string  The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present  string  The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status  string  Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade  string  Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment  string  Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined  string  The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he  string  Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M  string  Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N  string  The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage  string  The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T  string  Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American


Table 1.2 (continued)
Property name  Value  Description
clinical_data.person_neoplasm_cancer_status  string  The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies  string  Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx  string  Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success  string  Measure of Success
clinical_data.prior_dx  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor  string  Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history  string  Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site  string  Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type  string  Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status  string  The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion  string  The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[]  list  List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Barcode of the sample to get information about. Required.
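Since a 16-character sample barcode begins with the 12-character participant barcode, a client can check a barcode and derive the participant before calling either endpoint. The sketch below shows one way to do this in Python. The regular expression is inferred from the two example barcodes in this section (TCGA-B9-7268 and TCGA-B9-7268-01A) and is an assumption, not an official barcode specification; the function name is hypothetical.

```python
import re

# Pattern inferred from the example barcodes in this section (an assumption):
# "TCGA", a 2-character code, a 4-character code, then 2 digits and a letter.
SAMPLE_BARCODE = re.compile(r"^(TCGA-[A-Z0-9]{2}-[A-Z0-9]{4})-\d{2}[A-Z]$")


def participant_from_sample(sample_barcode):
    """Return the 12-character participant barcode embedded in a sample barcode."""
    match = SAMPLE_BARCODE.match(sample_barcode)
    if match is None:
        raise ValueError("not a 16-character sample barcode: %r" % sample_barcode)
    return match.group(1)


print(participant_from_sample("TCGA-B9-7268-01A"))
```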

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
biospecimen_data  nested object  Biospecimen data about the sample


Table 1.3 (continued)
Property name  Value  Description
biospecimen_data.ParticipantBarcode  string  Participant barcode
biospecimen_data.Project  string  Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode  string  Sample barcode
biospecimen_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration  integer  Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration  integer  Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis  integer  Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration  integer  Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells  integer  Average percent normal cells
biospecimen_data.avg_percent_stromal_cells  integer  Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells  integer  Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei  integer  Average percent tumor nuclei
biospecimen_data.batch_number  string  Batch number in which the sample was processed
biospecimen_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection  string  Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration  string  Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration  string  Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis  string  Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration  string  Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells  string  Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells  string  Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells  string  Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei  string  Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration  string  Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration  string  Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis  string  Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration  string  Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells  string  Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells  string  Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells  string  Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei  string  Minimum percent tumor nuclei
data_details[]  list  List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath  string  Path to file, if it exists
data_details[].DataCenterName  string  Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType  string  Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName  string  Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey  string  Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded  string  Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel  string  Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype  string  Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference  string  Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline  string  A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform  string  A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name  string  The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project  string  The study for which the data was generated, e.g. "TCGA"
data_details[].Repository  string  A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName  string  Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode  string  Sample barcode
data_details[].SecurityProtocol  string  An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"


Table 1.3 (continued)
Property name  Value  Description
data_details_count  string  Length of data_details list
patient  string  Participant barcode
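To give a sense of how the data_details list might be used, the sketch below tallies the files attached to a sample by their Datatype value. It is only an illustration in Python: the helper name `files_per_datatype` and the abbreviated example response are invented here, while the field names and the Datatype values come from the table above.

```python
from collections import Counter


def files_per_datatype(response):
    """Count data_details entries by their Datatype field."""
    details = response.get("data_details", [])
    return Counter(entry.get("Datatype", "unknown") for entry in details)


# Abbreviated, made-up response containing only the fields used here.
example = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"Datatype": "miRNASeq", "Platform": "IlluminaHiSeq_miRNASeq"},
        {"Datatype": "miRNASeq", "Platform": "IlluminaHiSeq_miRNASeq"},
        {"Datatype": "DNA Methylation"},
    ],
}
counts = files_per_datatype(example)
print(dict(counts))
```

The same pattern works for the other data_details fields, for example grouping by SecurityProtocol to separate open-access from controlled-access files.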


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. Barcode of the sample to get file paths for
platform  string  Optional. Filter file results by platform
pipeline  string  Optional. Filter file results by pipeline
token  string  Optional. Access token to authenticate user
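The optional parameters combine naturally into a query string. The Python sketch below builds such a URL and leaves out any filter that was not supplied. It assumes that all four parameters are passed as query-string parameters, which this section does not spell out; the function name is hypothetical and the example platform value is taken from the data_details description elsewhere in this documentation.

```python
from urllib.parse import urlencode

# Base URL as written in this section.
API_ROOT = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, skipping optional parameters that are None."""
    params = {
        "sample_barcode": sample_barcode,
        "platform": platform,
        "pipeline": pipeline,
        "token": token,
    }
    query = urlencode({key: value for key, value in params.items() if value is not None})
    return API_ROOT + "/datafilenamekey_list_from_sample?" + query


print(datafilenamekey_list_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Without a token the request is anonymous, so only open-access file paths would be expected in the response.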

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
count  string  Integer representing the length of the datafilenamekeys list
datafilenamekeys[]  list  List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
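Two details of this response are easy to trip over in client code: the datafilenamekeys field can be missing entirely, and some entries may be placeholders. The Python sketch below handles both cases. The helper name and the `gs://example-bucket` paths are invented for illustration, and matching the placeholder by substring is an assumption about how it appears in the path.

```python
# Marker text quoted in the response description above.
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response):
    """Return the usable cloud storage paths from a response body.

    A missing "datafilenamekeys" field is treated as an empty list, and
    placeholder entries are dropped.
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Made-up example; the bucket and file names are hypothetical.
example = {
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/some/path/data_file.txt",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}
print(available_paths(example))
print(available_paths({"kind": "cohort_api#cohortsItem"}))
```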


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. The sample whose dataset id and readgroupset id will be retrieved

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
count  string  The number of items returned. Count will be either "0" or "1"
items[]  list  If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode  string  The sample barcode passed into the request
items[].GG_dataset_id  string  The dataset id of the sample
items[].GG_readgroupset_id  string  The readgroupset id of the sample
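Because the items list holds at most one object, client code can reduce the response to a single pair of ids, or to nothing when the sample has no Google Genomics data. The Python sketch below shows one way to do that; the function name and the id values in the example are placeholders made up for illustration.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None if absent."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Made-up response; the id values are placeholders, not real identifiers.
example = {
    "kind": "cohort_api#cohortsItem",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
print(genomics_ids(example))
print(genomics_ids({"kind": "cohort_api#cohortsItem"}))
```

The sketch relies only on the items list and deliberately ignores the count field.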


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on Read the Docs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15
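If you want to check component versions from a script, the version listing can be parsed into a dictionary. The Python sketch below works on a captured copy of the output; the sample text is a reconstruction of the listing shown above (the version punctuation is assumed), and the function name is hypothetical. In practice the text could come from `subprocess.run(["gcloud", "--version"], capture_output=True, text=True).stdout`.

```python
def parse_sdk_versions(output):
    """Map each component name in `gcloud --version` output to its version."""
    versions = {}
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue
        # The version is the last space-separated token on the line.
        name, _, version = line.rpartition(" ")
        if not name:
            # A bare component name such as "gcloud" has no version column.
            name, version = version, None
        versions[name] = version
    return versions


# Reconstructed sample of the listing above; your versions will differ.
sample = """Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
"""
versions = parse_sdk_versions(sample)
print(versions["Google Cloud SDK"], versions["gsutil"])
```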

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS with 5 GB of persistent disk storage per user, and with the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init: launches an interactive Getting Started workflow for gcloud

• gcloud auth login: obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
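The properties listed above are set one at a time with gcloud config set. As a sketch of how this could be scripted, the Python snippet below builds the argument list for one such command and only runs it when asked to. The helper names are hypothetical and the zone value is just an example; the snippet assumes the gcloud CLI is installed and on your PATH if you call `set_property`.

```python
import subprocess

# The properties named in the list above.
COMMON_PROPERTIES = (
    "account",
    "project",
    "compute/region",
    "compute/zone",
    "container/cluster",
)


def config_set_command(prop, value):
    """Return the argument list for `gcloud config set PROP VALUE`."""
    if prop not in COMMON_PROPERTIES:
        raise ValueError("unexpected property: %r" % prop)
    return ["gcloud", "config", "set", prop, value]


def set_property(prop, value):
    """Run the command; requires the gcloud CLI to be installed."""
    subprocess.run(config_set_command(prop, value), check=True)


# Only build and show the command here; nothing is executed.
command = config_set_command("compute/zone", "us-central1-a")
print(" ".join(command))
```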


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, Red Hat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
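
If you would rather not rely on configured defaults, the same command can be given an explicit zone, machine type, and boot-disk size. The sketch below wraps it in a shell function; the function name create_vm is hypothetical, while --zone, --machine-type, and --boot-disk-size are options of gcloud compute instances create.

```shell
# Hypothetical helper: create a VM with an explicit zone, machine type,
# and boot-disk size instead of relying on configured defaults.
# Usage: create_vm NAME ZONE MACHINE_TYPE BOOT_DISK_SIZE
create_vm() {
  gcloud compute instances create "$1" \
    --zone "$2" \
    --machine-type "$3" \
    --boot-disk-size "$4"
}
```

For example, create_vm my-instance us-central1-a n1-standard-4 50GB requests a 4-vCPU machine with a 50 GB boot disk.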

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.
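
Besides opening an interactive session, gcloud compute ssh can run a single command on the VM and return. A minimal sketch follows, assuming a hypothetical wrapper named run_on_vm; the --zone and --command options belong to gcloud compute ssh.

```shell
# Hypothetical helper: run one command on a VM over SSH, then return
# to the local shell.
# Usage: run_on_vm INSTANCE ZONE COMMAND
run_on_vm() {
  gcloud compute ssh "$1" --zone "$2" --command "$3"
}
```

For example, run_on_vm my-instance us-central1-a "df -h" reports disk usage on the instance without leaving your workstation prompt.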

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month and 1 TB costs $40.

If you know that you will never need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
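
The lifecycle commands mentioned above can be sketched as a few small shell functions. The function names are hypothetical; the underlying sub-commands (list, stop, start, delete) are part of gcloud compute instances.

```shell
# Hypothetical helpers wrapping the instance lifecycle commands.
# Usage: list_vms
#        stop_vm NAME ZONE | start_vm NAME ZONE | delete_vm NAME ZONE
list_vms()  { gcloud compute instances list; }                    # name, zone, status
stop_vm()   { gcloud compute instances stop "$1" --zone "$2"; }   # keeps disks
start_vm()  { gcloud compute instances start "$1" --zone "$2"; }  # restart later
delete_vm() { gcloud compute instances delete "$1" --zone "$2"; } # permanent
```

A typical end-of-day routine might be list_vms followed by stop_vm my-instance us-central1-a for anything still running.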


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).
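
To override those defaults, the --type and --zone options can be given explicitly. A minimal sketch, assuming a hypothetical wrapper named create_disk and example values of your choosing:

```shell
# Hypothetical helper: create a persistent disk with an explicit size,
# disk type (for example pd-standard or pd-ssd), and zone.
# Usage: create_disk NAME SIZE TYPE ZONE
create_disk() {
  gcloud compute disks create "$1" --size "$2" --type "$3" --zone "$4"
}
```

For example, create_disk disk-2 200GB pd-ssd us-central1-a requests a 200 GB solid-state disk. The disk must be in the same zone as the instance you plan to attach it to.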


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

The instance name (my-instance) is a required positional argument, and the --device-name value determines the name under which the disk appears in /dev/disk/by-id/ on the instance (prefixed with google-). Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).
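
As an illustration of the read-only mode mentioned above, the sketch below attaches a disk with --mode ro. The function name attach_disk_ro is hypothetical; --disk, --mode, and --zone are options of gcloud compute instances attach-disk.

```shell
# Hypothetical helper: attach an existing disk to an instance in
# read-only mode.
# Usage: attach_disk_ro INSTANCE DISK ZONE
attach_disk_ro() {
  gcloud compute instances attach-disk "$1" --disk "$2" --mode ro --zone "$3"
}
```

For example, attach_disk_ro my-instance disk-1 us-central1-a attaches disk-1 so that the instance can read but not modify its contents.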

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point. The commands below assume the disk was attached with the device name disk-1, in which case it appears as /dev/disk/by-id/google-disk-1; run ls /dev/disk/by-id/ on the instance to confirm the exact name.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /mnt/pd1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.
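
Because a disk that is still in use cannot be deleted, it can help to check for attached instances first. The sketch below is one way to do that, assuming a hypothetical function name delete_disk_if_unused; it reads the disk's users field with gcloud compute disks describe and only calls delete when that field is empty.

```shell
# Hypothetical helper: delete a disk only if no instance is using it.
# Usage: delete_disk_if_unused DISK ZONE
delete_disk_if_unused() {
  # "users" lists the instances the disk is attached to (empty if none).
  users=$(gcloud compute disks describe "$1" --zone "$2" --format "value(users)")
  if [ -n "$users" ]; then
    echo "disk $1 is still attached to: $users" >&2
    return 1
  fi
  gcloud compute disks delete "$1" --zone "$2"
}
```

If the function reports that the disk is still attached, detach it from the listed instance and try again.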

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove cohorts you no longer want to share, and select one or more users from a list of users that are already registered in the system. Click the “Share Cohort” button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
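
The semantics of the three set operations can be illustrated outside the web app with ordinary shell tools. The sketch below is not how ISB-CGC implements them; it assumes each cohort is a plain text file with one sample barcode per line, and the function names are hypothetical. For brevity, intersect and complement take exactly two files here.

```shell
# Illustrative only: cohorts as text files, one sample barcode per line.

# Union: every barcode that appears in any of the given cohorts.
cohort_union() { sort -u "$@"; }

# Intersect: barcodes present in both cohort files.
cohort_intersect() { grep -Fxf "$2" "$1" | sort -u; }

# Complement: barcodes in the base cohort ($1) that are NOT in the
# other cohort ($2). Order matters here, unlike union and intersect.
cohort_complement() { grep -Fxvf "$2" "$1" | sort -u; }
```

Note how swapping the arguments changes the result of cohort_complement but not of the other two, which is why the dialogue asks you to designate a base cohort only for the complement operation.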

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort is in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.


• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort, and are ordered with the newest on the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
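
Once downloaded, the CSV can be processed with standard command-line tools. The sketch below extracts the Cloud Storage path of each file; it assumes (this is not confirmed by the text above) that the file has a single header row, that the path is the last column, and that no field contains an embedded comma. The function name list_gcs_paths is hypothetical.

```shell
# Hypothetical helper: print the Cloud Storage path for every data row
# of a downloaded file list, skipping the header line.
# Assumes the path is the LAST comma-separated column.
# Usage: list_gcs_paths FILE_LIST.csv
list_gcs_paths() {
  awk -F',' 'NR > 1 && NF > 0 { print $NF }' "$1"
}
```

The resulting list could then be piped to gsutil -m cp -I followed by a destination directory, since the -I option makes gsutil cp read the objects to copy from standard input.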

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

On the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

  – Single autocomplete textbox: this input searches through the names of all the clinical features available

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Platform Filter: filters down the plot-able features by platforms

  – Center Filter: filters down the plot-able features by processing center


  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name

  – Platform Filter: filters down the plot-able features by platforms

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe

  – Platform Filter: filters down the plot-able features by platforms

  – Gene Region Filter: filters down the plot-able features by specific gene regions

  – CpG Island Region Filter: filters down the plot-able features by CpG island region

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by mutation value

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature


• Cohorts: This is where you can select one or more cohorts to plot at one time

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols

• Cohorts: here you can select one or more cohorts for the visualization

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.


1.4.8 Release Notes

• December 23, 2015: v0.2

  – Known issue: a treemap graph on the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

  – Known issue: cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


TCGA Data by Source Repository

TCGA Data at the DCC

Complete sets of open-access and controlled-access data archives were copied from the DCC on October 4th, 2015 into Google Cloud Storage.

Note that for every archive at the DCC there may be multiple revisions. A list of the current latest archives can be obtained from the DCC. The archive naming convention includes the disease code, the platform/pipeline name, the archive type (e.g. data level), the serial index (which is often the batch number), and the revision number. If you want to check whether there is a newer version of a specific archive at the DCC than what we currently have on the ISB-CGC platform, you can check the date column in the latest archive report mentioned above, or you can compare the archive name to these lists of open-access archives and controlled-access archives based on our most recent upload.

Note that all “bio” archives (containing clinical, biospecimen, and other types of XML files) were recently migrated to a new XSD which is not backwards compatible with the previous XSD. This update took place over the course of December 2015, and none of these new archives are included in any of the current ISB-CGC BigQuery tables or files in GCS.

TCGA Data at CGHub

The complete listing of the TCGA data files from CGHub that are currently available in Google Cloud Storage (GCS) contains the following three columns of information:

• the unique CGHub id for the file,

• the partial GCS object path, and

• the size of the file in bytes.

The latest complete CGHub manifest can be downloaded directly from CGHub (67 MB spreadsheet).


1.2.3 ETL for BigQuery Tables

The open-access TCGA data has been uploaded into a set of consistent tables in the publicly-accessible BigQuery dataset called isb-cgc:tcga_201510_alpha, which can be accessed via the BigQuery web interface (by anyone with an active GCP project).

In general, the data in the BigQuery tables is identical to the information that you can also access via the TCGA Data Coordinating Center (DCC) Data Portal, but for users interested in the nitty-gritty details, information is provided here about the ETL (extract, transform, and load) steps that were performed for each of the data types.

Before we go into data-type-specific details, a few general notes on formatting and data curation:

• All data uploaded into ISB-CGC BigQuery tables use a consistent UTF-8 character set. If the encoding of a character from the original file could not be detected, that character was ignored. Character encodings were detected using the Python library Chardet.

• All missing-information value strings, such as none, None, NONE, null, Null, NULL, NA, __UNKNOWN__, <blank>, and ?, are represented as NULL values in the BigQuery tables (or may not appear at all, depending on the table schema).

• Numbers are stored as integer or floating-point values. The original ASCII files sometimes used scientific notation or included comma separators, but these are not preserved in the BigQuery tables.


• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around values were removed, but quotation marks within a string were kept.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).
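The general clean-up rules above can be sketched in a few lines of Python. This is only an illustration, not the ISB-CGC ETL code: the function name `normalize_value` is hypothetical, the set of missing-value strings is limited to spellings listed above, and treating an empty string as missing is an assumption.

```python
# Missing-value spellings taken from the list above; the empty string is an
# assumption standing in for "<blank>".
MISSING_STRINGS = {
    "none", "None", "NONE", "null", "Null", "NULL",
    "NA", "__UNKNOWN__", "<blank>", "",
}


def normalize_value(raw):
    """Return None for missing values, a number for numeric text, else cleaned text."""
    # Drop CTRL-M (carriage return) characters and surrounding whitespace.
    text = raw.replace("\r", "").strip()
    # Remove one pair of enclosing quotes; quotes inside the string are kept.
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "'\"":
        text = text[1:-1]
    if text in MISSING_STRINGS:
        return None
    # Comma separators and scientific notation are not preserved for numbers.
    candidate = text.replace(",", "")
    try:
        return int(candidate)
    except ValueError:
        pass
    try:
        return float(candidate)
    except ValueError:
        return text
```

For example, `normalize_value(' "1,234" ')` yields the integer 1234, while a string that is not numeric is returned with only its enclosing quotes removed.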

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (aka patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)
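Because longer TCGA barcodes begin with the participant barcode, a longer barcode can be reduced to the participant level (12 characters) or the sample level (16 characters) by taking a prefix. A minimal sketch, with hypothetical helper names:

```python
def participant_barcode(barcode):
    """Return the 12-character participant barcode, e.g. TCGA-2G-AAM4."""
    if not barcode.startswith("TCGA-") or len(barcode) < 12:
        raise ValueError("not a TCGA barcode: %r" % barcode)
    return barcode[:12]


def sample_barcode(barcode):
    """Return the 16-character sample barcode, e.g. TCGA-2G-AAM4-10A."""
    if not barcode.startswith("TCGA-") or len(barcode) < 16:
        raise ValueError("barcode has no sample component: %r" % barcode)
    return barcode[:16]
```

This kind of prefix is what makes it possible to relate rows in the sample-level and aliquot-level tables back to a row in the Clinical table.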

Clinical Feature Selection: In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (aka tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing: Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the elements of the most recent sub-block (based on sequence number) were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms: Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive

  – days_to_last_known_alive should not be NULL

  – days_to_last_known_alive is set to days_to_last_followup

  – days_to_death is set to NULL

• for all patients with vital_status=Dead

  – days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to “Alive”)

  – days_to_last_known_alive is set to days_to_death

  – days_to_last_followup is set to NULL
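The survival-field rules above can be written as a small function. This is a sketch under stated assumptions rather than the actual ETL code: the function name is hypothetical, NULL is represented by Python's `None`, and a patient who is reclassified from “Dead” to “Alive” is assumed to then be handled by the “Alive” rules.

```python
def apply_survival_transforms(record):
    """Return a copy of a clinical record with the survival fields completed."""
    rec = dict(record)
    status = rec.get("vital_status")

    # A "Dead" patient without days_to_death but with follow-up information
    # is reclassified as "Alive" (assumed to happen before the other rules).
    if (status == "Dead"
            and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        status = rec["vital_status"] = "Alive"

    if status == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif status == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```

After this step, days_to_last_known_alive can serve as a single time-to-event column regardless of vital status.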

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in the clinical XML file and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
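As an illustration of the BMI step, the sketch below uses the standard definition of body mass index, weight in kilograms divided by the square of height in metres. The units (weight in kg, height in cm) and the function name are assumptions; the text above does not state which units the clinical XML uses.

```python
def compute_bmi(weight_kg, height_cm):
    """Return BMI = weight / height^2, or None when a value is missing."""
    if weight_kg is None or height_cm is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)
```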


Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen_data” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed.

• Patients or samples for which the Project value was not TCGA were removed.

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g. [0, 1, 2, 3, 4+]).

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field.
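The merge of the two pregnancy fields could look like the following sketch. The function name is hypothetical, and two details are assumptions: the pregnancies value is preferred when both fields are present, and the merged value is stored as a string so that “4+” fits in the same column.

```python
def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Merge the two source fields into one value in {0, 1, 2, 3, 4+}."""
    value = pregnancies if pregnancies is not None else total_number_of_pregnancies
    if value is None:
        return None
    if str(value).strip() == "4+":
        return "4+"
    count = int(value)
    # Counts of four or more are collapsed into a single category.
    return "4+" if count >= 4 else str(count)
```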


Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
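The data-cleaning rules above amount to a row filter plus a column rename. The sketch below applies them to rows represented as Python dictionaries. It is illustrative only: the function name is hypothetical, only the two renames named above are included, and the Illumina check (a case-insensitive substring match) is an assumption about how the platform is recorded.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}
# Only the renames mentioned in the text; the real mapping is longer.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Keep build-37, chr 1-22/X/Y, somatic, Illumina rows with renamed columns."""
    cleaned = []
    for row in rows:
        row = {RENAMES.get(key, key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in VALID_CHROMOSOMES:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        cleaned.append(row)
    return cleaned
```

Renaming before filtering, as done here, gives the same result as filtering first; it simply lets the filters use the Oncotator-style column names.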

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several “current” MAF files, and deciding which one is the “best” is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the “blood derived normal” over the “solid tissue normal”

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the “D” analyte over the “W” analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are “D” (or both are “W”) analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), in the following order of preference:

  – 01, 08, or 14 (all of which refer to broad.mit.edu)

  – 09, 21, or 30 (all of which refer to genome.wustl.edu)

  – 10 or 12 (both of which refer to hgsc.bcm.edu)

  – 13 or 31 (both of which refer to bcgsc.ca)

  – 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25 in zero-based slice notation)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 58 million mutations called on 8435 tumor samples from 8373 patients.
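One way to read the duplicate-removal rules is as a lexicographic preference: matched-normal type first, then analyte, then center, then the 4-character plate field. The sketch below implements that reading. It is an interpretation, not the ISB-CGC code: the record field names (`tumor_aliquot`, `normal_type`, `chr`, `pos`) and function names are hypothetical, and the barcode offsets assume a 28-character aliquot barcode such as TCGA-B0-5695-01A-11D-1534-10.

```python
# Center codes in order of preference, as listed above.
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"),
                 ("10", "12"), ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank
               for rank, codes in enumerate(CENTER_GROUPS)
               for code in codes}


def preference_score(call):
    """Return a tuple that compares higher for the preferred call."""
    barcode = call["tumor_aliquot"]
    normal_rank = 0 if call["normal_type"] == "blood derived normal" else 1
    analyte_rank = 0 if barcode[19] == "D" else 1      # "D" preferred over "W"
    center_rank = CENTER_RANK.get(barcode[26:28], len(CENTER_GROUPS))
    plate = barcode[21:25]                             # larger value preferred
    return (-normal_rank, -analyte_rank, -center_rank, plate)


def remove_duplicates(calls):
    """Keep one call per (tumor sample, chromosome, position)."""
    best = {}
    for call in calls:
        site = (call["tumor_aliquot"][:16], call["chr"], call["pos"])
        if site not in best or preference_score(call) > preference_score(best[site]):
            best[site] = call
    return list(best.values())
```

With the two example aliquots from the rules above called at the same position against the same type of normal, the “11D” aliquot is the one that survives.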


DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.


DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.


ETL Details: The BigQuery table is populated only with the files whose names contain HumanMethylation and end in .txt. The data from the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
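The merge described above can be sketched as follows. This is an illustration under assumptions: rows are represented as (aliquot barcode, Probe_Id, Beta_Value) tuples, a sample is identified by the first 16 characters of the aliquot barcode, “takes precedence” is read as dropping all 27k rows for a sample that also has 450k data, and the platform labels are placeholders.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge 27k and 450k rows into one list, preferring 450k per sample."""
    samples_with_450k = {barcode[:16] for barcode, _, _ in rows_450k}
    merged = []
    sources = (("HumanMethylation450", rows_450k),
               ("HumanMethylation27", rows_27k))
    for platform, rows in sources:
        for barcode, probe_id, beta in rows:
            # Probes without a usable Beta_Value are left out.
            if beta in (None, "NA", "NULL"):
                continue
            # 450k data takes precedence for samples run on both platforms.
            if platform == "HumanMethylation27" and barcode[:16] in samples_with_450k:
                continue
            merged.append({"AliquotBarcode": barcode, "Probe_Id": probe_id,
                           "Beta_Value": float(beta), "Platform": platform})
    return merged
```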


mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw “RSEM genes normalized results” files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which consists of two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components which are stored as separate columns, original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
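A minimal sketch of the split described above. The function name and the `hgnc_by_gene_id` lookup table are hypothetical stand-ins for whatever HGNC resource is used during ETL, and storing the numeric part as an integer is an assumption.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id):
    """Split 'SYMBOL|ID' and look up the current HGNC symbol by ID."""
    symbol, gene_id = raw_gene_id.split("|")
    gene_id = int(gene_id)
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None when the lookup table has no entry for this gene_id.
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }
```

Keeping both the original symbol and the looked-up symbol makes it possible to trace a row back to the raw file even when a gene has since been renamed.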

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw “gene quantification” files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302|?|3of7_calculated.

The gene string is split into two or three components which are stored as separate columns: original_gene_symbol, gene_id, and, if there is a third component, gene_addenda. If one component is simply ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
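The two-or-three-part split can be sketched as below. The function name is hypothetical, and the sketch assumes that the placeholder for an unknown component is the single character `?` (an empty component is treated the same way); the HGNC lookup is omitted.

```python
def split_bcgsc_gene(gene):
    """Split a BCGSC gene string into symbol, gene_id and optional addenda."""
    # Assumption: "?" (or an empty component) marks an unknown value.
    parts = [None if part in ("?", "") else part for part in gene.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three '|'-separated parts: %r" % gene)
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2] if len(parts) == 3 else None,
    }
```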


microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for “mature microRNAs”. This output matrix contains a consistent number of mature microRNAs, each referred to using a combination of the microRNA gene name and the unique accession number, e.g. “hsa-mir-21.MIMAT0000076”. During ETL, this string is split into two parts which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
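Splitting the identifier and melting the matrix can be sketched as follows. The sketch assumes the gene name and the accession are joined by a period (as in hsa-mir-21.MIMAT0000076) and that the matrix is held as a dictionary keyed by that identifier; the function name and output column names are illustrative.

```python
def melt_mirna_matrix(matrix):
    """Turn {mature_id: {aliquot: value}} into one flat row per value."""
    rows = []
    for mature_id, values in matrix.items():
        # Assumption: the accession follows the last "." in the identifier.
        mirna_id, accession = mature_id.rsplit(".", 1)
        for aliquot, value in values.items():
            rows.append({"mirna_id": mirna_id,
                         "mirna_accession": accession,
                         "AliquotBarcode": aliquot,
                         "normalized_count": value})
    return rows
```

Each output row holds exactly one measurement, which is the layout that lets the table be filtered and joined by aliquot or by microRNA.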


Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Core.protein_expression.txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map is generated.

Generation of Composite_element_ref, gene and protein name map (Manual Curation of the gene and protein names)

• Check the antibody annotation files for missing columns.
• If "protein_name" is missing, generate one from "composite_element_ref".
• Make a map of 'composite_element_ref', 'gene_name', 'protein_name' values.
• Check any other variant of the gene and protein symbols in the table.
• HGNC Validation:
  - If the gene symbol is in the HGNC approved symbols, 'Approved' Gene_symbol = Gene_symbol.
  - If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.
  - If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol.
  - If not, Gene_symbol = Gene_symbol.
• The file generated is manually curated and fed back into the algorithm.
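The HGNC validation cascade above can be read as a simple ordered lookup. The Python below is a sketch under stated assumptions, not the curation code: the three lookup structures (approved, aliases, previous_to_approved) are hypothetical stand-ins for tables derived from an HGNC download, and the alias branch is read literally from the text, which assigns Alias_symbol, so the symbol is kept as found.

```python
from typing import Dict, Set


def validate_gene_symbol(
    symbol: str,
    approved: Set[str],
    aliases: Set[str],
    previous_to_approved: Dict[str, str],
) -> str:
    """Apply the four-step HGNC check in the order given in the text."""
    if symbol in approved:
        # Step 1: already an approved symbol, keep it.
        return symbol
    if symbol in aliases:
        # Step 2: found among the alias symbols; the text assigns the
        # alias symbol itself, so the value is returned unchanged.
        return symbol
    if symbol in previous_to_approved:
        # Step 3: a previous symbol is replaced by its approved symbol.
        return previous_to_approved[symbol]
    # Step 4: unknown to HGNC, left as is for manual curation.
    return symbol
```

Symbols that fall through to the last step are the natural candidates for the manual curation pass mentioned in the final bullet.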

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row.
• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.
• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.
• Data from both Illumina GA and HiSeq platforms are stored in the same table.
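The row-duplication rule in the first bullet can be sketched as follows. This is an illustrative Python fragment, not the actual formatting code: it assumes the concatenated genes are separated by whitespace, as in the 'AKT1 AKT2 AKT3' example, and the function name and the sample composite_element_ref value are hypothetical.

```python
from typing import Dict, Iterator


def expand_gene_rows(row: Dict[str, str]) -> Iterator[Dict[str, str]]:
    """Yield one copy of the row per gene listed in its 'gene_name' value."""
    # Assumption: genes are whitespace-separated; a value with no genes
    # is passed through unchanged rather than dropped.
    genes = row["gene_name"].split() or [row["gene_name"]]
    for gene in genes:
        yield {**row, "gene_name": gene}


if __name__ == "__main__":
    example = {"composite_element_ref": "Akt-R-V", "gene_name": "AKT1 AKT2 AKT3"}
    for expanded in expand_gene_rows(example):
        print(expanded)
```

All other columns are copied unchanged, so each expanded row still carries the same expression value and antibody information.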

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)
  - isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015
• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data
• November 2, 2015: first public release of TCGA open-access data in BigQuery tables
  - isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015
  - includes Annotations table with information about redacted samples, etc.
  - isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform
• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)
• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table
• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)
• Upload new data from CGHub into GCS (February)
• New set of BigQuery tables based on new data at the TCGA DCC (March)
• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data, in BigQuery tables and in Google Cloud Storage, is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,
• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or
• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
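The relationship between sample and patient barcodes can be expressed in a couple of lines of Python. This is a small illustrative sketch: the function name is hypothetical, and the sample barcode used in the usage note is made up by appending a sample suffix to a patient barcode.

```python
def patient_barcode(sample_barcode: str) -> str:
    """Return the 12-character patient barcode for a 16-character TCGA sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError(f"expected a 16-character sample barcode, got {sample_barcode!r}")
    # The patient barcode is the 12-character prefix of the sample barcode.
    return sample_barcode[:12]
```

For example, patient_barcode("TCGA-B9-7268-01A") returns "TCGA-B9-7268". The length check applies to TCGA only; as noted above, other datasets may follow different conventions.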

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCAOV"}' -H "Content-Type: application/json"
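The same request can be issued from Python using only the standard library. The following is a sketch, not an official client: the URL is reconstructed from the curl example above, the helper names are hypothetical, and the "BRCA" filter value in the usage note is illustrative.

```python
import json
import urllib.request

# Endpoint URL as reconstructed from the curl example (assumption).
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preview_cohort(filters: dict, timeout: float = 30.0) -> dict:
    """Send the preview request and decode the JSON response body."""
    request = build_preview_request(filters)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as preview_cohort({"Study": "BRCA"}) would then return the decoded response as a dictionary; building the request separately makes it easy to inspect the body and headers before anything is sent.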

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years).
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor.
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen.
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen.
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen.
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen.
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen.
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen.
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen.
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen.
batch_number | integer | Groups samples by the batch they were processed in.
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information.
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer.
country | string | Text to identify the name of the state, province, or country in which the sample was procured.
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured.
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days.
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days.
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer.
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of days.
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories.
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample.
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters.
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s).
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor.
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
hpv_calls | string | Results of HPV tests.
hpv_status | string | Current HPV status.
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement.
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease.
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen.
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen.
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen.
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen.
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen.
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen.
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen.
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen.
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen.
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea.
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen.
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen.
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen.
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen.
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen.
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen.
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen.
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen.
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel.
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel.
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease.
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy.
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant.
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria.
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time.
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced.
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma.
primary_therapy_outcome_success | string | Measure of Success.
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
Project | string | The study for which the data was generated.
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood.
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories.
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection.
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant.
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient.
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease.
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma.
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria.
vital_status | string | The survival state of the person registered on the protocol.
weight | integer | The weight of the patient measured in kilograms.
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant.
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None".
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None".
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None".
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None".
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None".
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None".
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None".
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None".
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None".
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None".
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None".

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
patient_count | string | Number of participants in this cohort.
patients[] | list | List of participant barcodes in this cohort.
sample_count | string | Number of samples in this cohort.
samples[] | list | List of sample barcodes in this cohort.
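Because the counts are returned as strings, client code will usually want to convert them. The Python below is a minimal sketch of decoding such a response; the class and function names are hypothetical, and treating a missing list or count as empty or zero is an assumption about how an empty cohort would be serialized, not documented behavior.

```python
import json
from typing import List, NamedTuple


class CohortPreview(NamedTuple):
    patient_count: int
    patients: List[str]
    sample_count: int
    samples: List[str]


def parse_preview(payload: str) -> CohortPreview:
    """Decode a preview response body, converting the string counts to int."""
    data = json.loads(payload)
    return CohortPreview(
        patient_count=int(data.get("patient_count", 0)),
        patients=list(data.get("patients", [])),  # assumption: may be absent
        sample_count=int(data.get("sample_count", 0)),
        samples=list(data.get("samples", [])),  # assumption: may be absent
    )
```

Comparing patient_count with len(patients) is a cheap sanity check after parsing.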

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
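As a rough illustration, the request URL for this endpoint can be assembled in Python as shown below. This is a sketch under assumptions: the base URL is reconstructed from the HTTP request line, the helper name is hypothetical, and passing patient_barcode as a query-string parameter is an assumption, since the exact placement of the parameter in the URL is not shown here.

```python
from urllib.parse import urlencode

# Base URL as reconstructed from the HTTP request line (assumption).
PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)


def patient_details_url(patient_barcode: str) -> str:
    """Return the GET URL for one participant's details."""
    if len(patient_barcode) != 12:
        raise ValueError(f"expected a 12-character barcode, got {patient_barcode!r}")
    # Assumption: the barcode is sent as a query-string parameter.
    return PATIENT_DETAILS_URL + "?" + urlencode({"patient_barcode": patient_barcode})
```

The resulting URL can be fetched with any HTTP client; no authentication header is needed for this endpoint.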

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
clinical_data | nested object | The clinical data about the participant.
clinical_data.ParticipantBarcode | string | Participant barcode.
clinical_data.Project | string | Project name, e.g. "TCGA".
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years.
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor.
clinical_data.batch_number | string | Groups samples by the batch they were processed in.
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M), and by grouping cases with similar prognosis for cancer.
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer.
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured.
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer.
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories.
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample.
clinical_data.gender | string | Text designations that identify gender.
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s).
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor.
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.hpv_calls | string | Results of HPV tests.
clinical_data.hpv_status | string | Current HPV status.
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD).
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement.
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease.
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen.
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea.
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel.
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel.
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease.
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy.
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria.
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time.
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success | string | Measure of Success.
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status | string | The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
samples[] | list | List of barcodes of samples taken from this participant.
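As a rough illustration of how a script might call patient_details and read the response, here is a minimal Python sketch using only the standard library. It rests on assumptions: the barcode is sent as a query-string parameter, and the helper names (patient_details_url, fetch_patient_details, summarize_patient) are hypothetical, not part of the ISB-CGC API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint root taken from the HTTP request shown for this method.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the request URL for one 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("expected a 12-character barcode such as TCGA-B9-7268")
    # Assumption: the barcode is passed as a query-string parameter.
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


def fetch_patient_details(patient_barcode):
    """Call the endpoint and decode the JSON body (requires network access)."""
    with urlopen(patient_details_url(patient_barcode)) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize_patient(details):
    """Pick a few fields from a decoded response; absent fields become None."""
    clinical = details.get("clinical_data", {})
    return {
        "study": clinical.get("Study"),
        "vital_status": clinical.get("vital_status"),
        "n_samples": len(details.get("samples", [])),
        "n_aliquots": len(details.get("aliquots", [])),
    }
```

A typical use would be summarize_patient(fetch_patient_details("TCGA-B9-7268")). Note that the clinical fields are documented as strings, so numeric values such as days_to_birth would need converting before any arithmetic.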

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of data_details list.
patient | string | Participant barcode.
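Once a sample_details response has been decoded from JSON, the data_details list is usually the part worth reorganizing. The sketch below is one possible way to group file names by platform; the function names are hypothetical, and the input is assumed to be a plain dict shaped like the structure documented above.

```python
from collections import defaultdict


def files_by_platform(sample_details):
    """Map each Platform value to the DataFileName entries reported for it."""
    grouped = defaultdict(list)
    for entry in sample_details.get("data_details", []):
        grouped[entry.get("Platform")].append(entry.get("DataFileName"))
    return dict(grouped)


def participant_part(sample_barcode):
    """Return the 12-character participant barcode embedded in a sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character barcode such as TCGA-B9-7268-01A")
    # A TCGA sample barcode begins with the participant barcode.
    return sample_barcode[:12]
```

The value returned by participant_part should agree with the patient field of the response, which offers a cheap consistency check when processing many samples.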

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
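Two details of this method are easy to trip over in client code: the optional filters, and the fact that datafilenamekeys may be missing entirely. The following Python sketch shows one way to handle both. It assumes the parameters are sent as a query string; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Endpoint root taken from the HTTP request shown for this method.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"
# Marker text documented for paths that are not yet available.
PLACEHOLDER = "file-path-not-yet-available"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, including only the optional filters that were given."""
    params = {"sample_barcode": sample_barcode}
    optional = {"platform": platform, "pipeline": pipeline, "token": token}
    params.update({key: value for key, value in optional.items() if value is not None})
    # Assumption: all parameters are passed in the query string.
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def available_paths(response):
    """Return usable file paths; tolerate a response without 'datafilenamekeys'."""
    paths = response.get("datafilenamekeys", [])
    return [path for path in paths if PLACEHOLDER not in path]
```

Treating a missing datafilenamekeys field as an empty list keeps downstream loops simple; a caller who needs to distinguish "no files" from "no authorization" would have to check the user's dbGaP status separately.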

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "items": [
        {
          "count": string,
          "SampleBarcode": string,
          "GG_dataset_id": string,
          "GG_readgroupset_id": string
        }
      ]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
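Because items holds at most one object, client code mainly needs to cope with the empty case. A minimal Python sketch, assuming the response has already been decoded into a dict (genomics_ids is a hypothetical helper name, and the id values in the usage note are made up):

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None when the sample has no entry."""
    items = response.get("items", [])
    if not items:
        return None
    first = items[0]
    return first.get("GG_dataset_id"), first.get("GG_readgroupset_id")
```

For example, genomics_ids({"items": [{"GG_dataset_id": "1234", "GG_readgroupset_id": "abcd"}]}) yields the pair ("1234", "abcd"), while a response with no items yields None.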

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console. If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API. The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK. Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15

Google Cloud Shell. Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user, and with the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google. Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
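If you prefer to drive gcloud from a script, the properties above can be set by assembling gcloud config set commands as argument lists. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, "my-project" is a placeholder project id, and the commands are only built here, not executed.

```python
import shlex


def config_set_commands(properties):
    """Build one 'gcloud config set NAME VALUE' argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in properties.items()]


def as_shell(args):
    """Render an argument list as a single, safely quoted shell command line."""
    return " ".join(shlex.quote(arg) for arg in args)


commands = config_set_commands({
    "project": "my-project",          # placeholder project id
    "compute/zone": "us-central1-a",
})
for command in commands:
    print(as_shell(command))
```

Each argument list can be handed to subprocess.run(command, check=True) on a machine where the Cloud SDK is installed. Keeping commands as lists rather than strings avoids shell-quoting problems.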

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console. After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you.

• Zone: choose one of the us-east or us-central zones.

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach). Note that as you change the specifications of the VM, the estimated cost shown on this page will update.

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks. You can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB).

Other options below the "Management, disk ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI. The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property, for example: gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
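When VMs are created from a script rather than typed by hand, it can help to assemble the command as an argument list first. The sketch below builds the same gcloud compute instances create command shown above; create_instance_command is a hypothetical helper, and only the --machine-type and --zone flags are used.

```python
def create_instance_command(name, machine_type=None, zone=None):
    """Build the argument list for 'gcloud compute instances create NAME'."""
    args = ["gcloud", "compute", "instances", "create", name]
    if machine_type is not None:
        args += ["--machine-type", machine_type]
    if zone is not None:
        # When omitted, gcloud falls back to the compute/zone property if it is set.
        args += ["--zone", zone]
    return args


command = create_instance_command("my-instance", machine_type="g1-small")
print(" ".join(command))
```

To actually launch the VM, pass the list to subprocess.run(command, check=True) on a machine where the Cloud SDK is installed and authenticated. Remember that a running VM incurs charges.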

Accessing your new VM. Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM: Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
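The two prices quoted above imply a flat rate of about $0.04 per GB per month for a standard persistent disk. The following is a minimal sketch of that arithmetic; the rate constant and the function name are assumptions derived from the figures in the text, not an official pricing API, and current prices may differ.

```python
# Assumed rate in USD per GB per month, inferred from the two prices quoted
# in the text ($2 for 50 GB, $40 for 1 TB). Current pricing may differ.
ASSUMED_RATE_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=ASSUMED_RATE_USD_PER_GB_MONTH):
    """Return the estimated monthly cost in USD of a standard persistent disk."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return size_gb * rate


if __name__ == "__main__":
    # 1000 GB stands in for "1 TB" here, which matches the $40 figure.
    for size_gb in (50, 500, 1000):
        print(f"{size_gb:>5} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```

An estimate like this can help decide whether to keep a stopped VM's disk or delete it.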

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
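For scripted workflows, the instance commands named in this section can be assembled programmatically. The sketch below only builds and prints the argument lists and does not call gcloud. The helper name gcloud_instance_cmd is hypothetical, and the start action is included on the assumption that a stopped VM is restarted with gcloud compute instances start.

```python
import shlex

# Actions covered by this sketch; "start" is an assumption (see lead-in).
SUPPORTED_ACTIONS = {"create", "start", "stop", "delete"}


def gcloud_instance_cmd(action, instance, machine_type=None, zone=None):
    """Build the argv list for a `gcloud compute instances <action>` call."""
    if action not in SUPPORTED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    cmd = ["gcloud", "compute", "instances", action, instance]
    if machine_type is not None:
        if action != "create":
            raise ValueError("--machine-type only applies to 'create'")
        cmd += ["--machine-type", machine_type]
    if zone is not None:
        cmd += ["--zone", zone]
    return cmd


if __name__ == "__main__":
    commands = [
        gcloud_instance_cmd("create", "my-instance", machine_type="g1-small"),
        gcloud_instance_cmd("stop", "my-instance"),
        gcloud_instance_cmd("delete", "my-instance"),
    ]
    for cmd in commands:
        # Print only; pass cmd to subprocess.run(cmd, check=True) to execute.
        print(shlex.join(cmd))
```

Building the list first makes it easy to review exactly what would be run before handing it to subprocess.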

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is the positional argument, and --device-name sets the name under which the disk appears inside the instance. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. The device path below follows the disk name used in this example; run ls /dev/disk/by-id/ on the instance to confirm the actual path before formatting.

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

sudo umount /dev/disk/by-id/disk-1
exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes”, the default, when it was created). In all other cases you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
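The whole persistent-disk lifecycle described in this section (create, attach, format, mount, unmount, detach, delete) can be summarized as an ordered checklist. The sketch below is only an illustration: the function name disk_lifecycle is hypothetical, the default device path and mount point are taken from the example commands in this section and may differ on a real instance, and nothing is executed.

```python
def disk_lifecycle(instance, disk, size="500GB",
                   device="/dev/disk/by-id/disk-1", mount_point="/mnt/pd1"):
    """Return (where, command) pairs in the order the section describes.

    "local" commands run on your workstation; "instance" commands run
    inside the VM after connecting with `gcloud compute ssh`.
    """
    return [
        ("local", f"gcloud compute disks create {disk} --size {size}"),
        ("local", f"gcloud compute instances attach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute ssh {instance}"),
        ("instance", f"sudo mkfs.ext4 -F {device}"),  # erases existing data
        ("instance", f"sudo mkdir {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("instance", f"sudo umount {device}"),
        ("instance", "exit"),
        ("local", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


if __name__ == "__main__":
    for where, command in disk_lifecycle("my-instance", "disk-1"):
        print(f"[{where:>8}] {command}")
```

Keeping the steps in one ordered list makes the dependencies visible: the disk must be unmounted and detached before it can be deleted.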

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. The same procedure applies to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. The same procedure applies to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
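The three set operations behave like ordinary set algebra on the samples in each cohort. The sketch below illustrates the semantics described above using Python sets. The function names and the sample identifiers are made up for illustration, and this is not the web-app's actual implementation.

```python
def union(*cohorts):
    """Samples that appear in at least one of the cohorts (order-independent)."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort (order-independent)."""
    if not cohorts:
        return set()
    return set(cohorts[0]).intersection(*cohorts[1:])


def complement(base, *others):
    """Samples in the base cohort after subtracting all other cohorts."""
    return set(base) - set().union(*others)


if __name__ == "__main__":
    # Hypothetical sample identifiers, for illustration only.
    a = {"S1", "S2", "S3"}
    b = {"S2", "S3", "S4"}
    c = {"S3", "S5"}
    print(sorted(union(a, b, c)))
    print(sorted(intersect(a, b, c)))
    print(sorted(complement(a, b, c)))
```

Note that complement is the only operation where the choice of base cohort matters; swapping the base and another cohort generally gives a different result.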

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
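As a rough illustration of name-only string matching, the sketch below returns two lists, one per object type. The function name is hypothetical, and whether the real search ignores case is not stated in this documentation, so case handling is exposed as a parameter rather than assumed.

```python
def search_by_name(query, cohort_names, visualization_names, ignore_case=True):
    """Return names containing the query string, grouped by object type."""
    def matches(name):
        if ignore_case:
            return query.lower() in name.lower()
        return query in name

    return {
        "cohorts": [n for n in cohort_names if matches(n)],
        "visualizations": [n for n in visualization_names if matches(n)],
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    results = search_by_name(
        "brca",
        cohort_names=["BRCA tumors", "All TCGA Data", "brca normals"],
        visualization_names=["Age histogram", "BRCA expression"],
    )
    print(results)
```

Because only names are matched, a search will not find a cohort by its filters or contents.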

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the order is ascending or descending.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.


• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
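The difference between the two counts can be illustrated with a short sketch. It assumes the standard TCGA barcode layout, in which the first three hyphen-separated fields of a sample barcode identify the participant; the function names and the barcodes below are made up for illustration.

```python
def participant_barcode(sample_barcode):
    """Return the participant part of a TCGA-style sample barcode.

    Assumes the first three hyphen-separated fields identify the
    participant, e.g. "TCGA-02-0001-01" -> "TCGA-02-0001".
    """
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(s) for s in samples}
    return len(samples), len(participants)


if __name__ == "__main__":
    # Hypothetical barcodes: two samples from one participant, one from another.
    cohort = ["TCGA-02-0001-01", "TCGA-02-0001-10", "TCGA-02-0002-01"]
    n_samples, n_participants = cohort_counts(cohort)
    print(f"{n_samples} samples from {n_participants} participants")
```

In this example the cohort has three samples but only two participants, because one participant contributed two samples.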

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
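Once downloaded, a file list of this kind can be summarized with a few lines of Python. The sketch below is hedged: the exact column headers of the real CSV are not given in this documentation, so the header names, platform values, and bucket paths used here are assumptions for illustration only.

```python
import csv
import io
from collections import Counter

# Hypothetical file list; header names, values, and paths are assumptions.
EXAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-02-0001-01,PlatformA,pipeline-1,Level 3,gs://example-bucket/a1.txt
TCGA-02-0001-10,PlatformA,pipeline-1,Level 3,gs://example-bucket/a2.txt
TCGA-02-0002-01,PlatformB,pipeline-2,Level 3,gs://example-bucket/b1.txt
"""


def files_per_platform(csv_text, platform_column="Platform"):
    """Count the files listed for each platform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[platform_column] for row in reader)


def paths_for_platform(csv_text, platform,
                       platform_column="Platform", path_column="File Path"):
    """Return the Cloud Storage paths of the files for one platform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[path_column] for row in reader
            if row[platform_column] == platform]


if __name__ == "__main__":
    print(dict(files_per_platform(EXAMPLE_CSV)))
    print(paths_for_platform(EXAMPLE_CSV, "PlatformA"))
```

For a real download, read the file with open(path, newline="") and adjust the column names to match its header row.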

Commenting: Any user who owns a cohort or has had it shared with them can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top-right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top-right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platform.

  – Center Filter: filters down the plot-able features by processing center.


  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platform.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platform.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.


• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on the check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
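The two permission levels can be summarized as a small lookup table. The sketch below is only a model of the rules described in this section; the action names and the helper function can are hypothetical and do not correspond to an actual ISB-CGC API.

```python
# Actions allowed per role, as described in this section. Action names are
# assumptions chosen for illustration.
PERMISSIONS = {
    "owner": {"view", "comment", "copy", "edit", "save", "share", "delete"},
    "reader": {"view", "comment", "copy"},
}


def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS[role.lower()]


def role_after_copy(role):
    """Copying (cloning) always makes the copying user the owner of the copy."""
    if not can(role, "copy"):
        raise PermissionError(f"{role} may not copy")
    return "owner"


if __name__ == "__main__":
    for action in ("view", "comment", "save", "share"):
        print(f"reader can {action}: {can('reader', action)}")
    print("role on the cloned copy:", role_after_copy("reader"))
```

The model makes the cloning rule explicit: a Reader cannot save changes to the shared object, but becomes Owner of any copy and can change that freely.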

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate, you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.


• End of File (EOF) and End of Line (EOL) delimiters, including CTRL-M characters, were all removed when the raw files were originally parsed.

• Single and double quotes around the values were removed, but quotation marks within a string were not removed.

• Whenever necessary, the SDRF file (in the mage-tab archive associated with each data archive) was parsed to find the correct association between the aliquot barcode and the Level-3 data file(s).

Major Data Types

Clinical

The Clinical table contains one row per TCGA participant (a.k.a. patient or donor). Each TCGA participant is uniquely represented by a TCGA barcode of length 12, e.g. TCGA-2G-AAM4. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

Clinical Feature Selection

In the first pass, any XML features with the tag procurement_status=Completed which were found to exist in at least 20% of the participants in any one Study (a.k.a. tumor-type) were considered for selection. A few important features related to smoking, pregnancy, etc. were added to the list during a manual-curation pass.

Selected fields from both the clinical and auxiliary XML files were then extracted and loaded into the BigQuery table.

Additionally, only the most recent follow-up information was included (for patients where multiple follow-up sections existed in the clinical XML file).

XML Parsing

Each clinical XML file is divided into admin and patient blocks, and each of these was processed separately.

While iterating through the patient block of information, all elements (XML tags) and their values were collected. For follow-up blocks, only the most recent (based on sequence number) sub-block elements were kept.

In the final pass, patient elements and follow-up elements were carefully merged, with preference given to follow-up elements.

Transforms

Different survival-related fields are completed based on the value of the vital_status field:

• for all patients with vital_status=Alive:

– days_to_last_known_alive should not be NULL

– days_to_last_known_alive is set to days_to_last_followup

– days_to_death is set to NULL

• for all patients with vital_status=Dead:

– days_to_death should not be NULL (if it is NULL and days_to_last_followup is not NULL, then vital_status is set to “Alive”)

– days_to_last_known_alive is set to days_to_death

– days_to_last_followup is set to NULL
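
The survival-field rules above can be sketched in a few lines of Python. This is only an illustration and not the ISB-CGC ETL code: the function name and the dict-based record are hypothetical, and we assume that a record re-labelled from Dead to Alive is then handled by the Alive rules.

```python
def apply_survival_transforms(record):
    """Return a copy of a clinical record with the survival fields completed.

    `record` is a plain dict keyed by the field names used in this section;
    missing or None values stand in for NULL.
    """
    rec = dict(record)
    status = rec.get("vital_status")

    # A "Dead" patient with no days_to_death but with follow-up information
    # is re-labelled as "Alive" (assumption: the Alive rules then apply).
    if (status == "Dead" and rec.get("days_to_death") is None
            and rec.get("days_to_last_followup") is not None):
        status = rec["vital_status"] = "Alive"

    if status == "Alive":
        rec["days_to_last_known_alive"] = rec.get("days_to_last_followup")
        rec["days_to_death"] = None
    elif status == "Dead":
        rec["days_to_last_known_alive"] = rec.get("days_to_death")
        rec["days_to_last_followup"] = None
    return rec
```

For example, a record with vital_status "Dead", no days_to_death, and days_to_last_followup of 100 comes back as "Alive" with days_to_last_known_alive set to 100.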

The following fields were extracted from the cqcf block of the XML file:

• gleason_score_combined

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant, and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
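
As an illustration of the BMI step, the sketch below uses the standard definition (weight in kilograms divided by the square of height in metres). The units are an assumption on our part, since the text above does not state them, and the function name is hypothetical.

```python
from typing import Optional


def compute_bmi(height_cm: Optional[float], weight_kg: Optional[float]) -> Optional[float]:
    """Return BMI in kg/m^2, or None (NULL) unless both values are present."""
    if height_cm is None or weight_kg is None or height_cm <= 0:
        return None
    height_m = height_cm / 100.0  # assumed unit conversion: cm -> m
    return weight_kg / (height_m ** 2)
```

A participant with only one of the two values recorded simply gets a NULL BMI.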


Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

XML Parsing

The TCGA data at the DCC exists in XML files, which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the “Biospecimen_data” table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts above four are represented as 4+ (e.g. [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field


Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator, and further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
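
The data-cleaning rules above might look as follows when applied to MAF rows that have already been read into dicts. This is a sketch under stated assumptions, not the actual pipeline: the column renaming is done first so that the filters can use the build and chr labels, only the two renames named above are included, and "is an Illumina platform" is approximated by a case-insensitive substring test.

```python
VALID_CHROMS = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Only the two renames mentioned in the text; the full mapping is not shown there.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Rename columns and keep only build-37, chr 1-22/X/Y, somatic, Illumina rows."""
    cleaned = []
    for row in rows:
        row = {RENAMES.get(key, key): value for key, value in row.items()}
        if str(row.get("build")) != "37":
            continue
        if str(row.get("chr")) not in VALID_CHROMS:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        # Assumption: any Sequencer string mentioning "Illumina" counts.
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        cleaned.append(row)
    return cleaned
```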

Oncotator Annotation

Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal

Many tumor types have several “current” MAF files, and deciding which one is the “best” is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the “blood derived normal” over the “solid tissue normal”

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the “D” analyte over the “W” analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are “D” (or both are “W”) analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring, in order:

– 01, 08, or 14 (all of which refer to broad.mit.edu)

– 09, 21, or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25 in zero-based slice notation, i.e. the 22nd through 25th characters)
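
The preference rules above can be expressed as a single ordering over competing calls. The sketch below is one way to do that and is not the code used to build the table: the dict keys tumor_aliquot and normal_type are hypothetical, the normal type is assumed to be spelled "Blood Derived Normal", and the 4-character plate field is compared as a string (equivalent to numeric comparison when it is all digits).

```python
CENTER_GROUPS = [
    ("01", "08", "14"),  # broad.mit.edu
    ("09", "21", "30"),  # genome.wustl.edu
    ("10", "12"),        # hgsc.bcm.edu
    ("13", "31"),        # bcgsc.ca
    ("18", "25"),        # ucsc.edu
]
CENTER_RANK = {code: rank for rank, group in enumerate(CENTER_GROUPS) for code in group}


def preference(call):
    """Sort key for one mutation call; a larger key means a more preferred call."""
    barcode = call["tumor_aliquot"]
    return (
        call["normal_type"].lower() == "blood derived normal",  # blood over tissue
        barcode[19] == "D",                                      # "D" analyte over "W"
        -CENTER_RANK.get(barcode[-2:], len(CENTER_GROUPS)),      # earlier center group wins
        barcode[21:25],                                          # larger plate value wins
    )


def pick_preferred(calls):
    """Among calls of the same mutation in one tumor sample, keep the preferred one."""
    return max(calls, key=preference)
```

Grouping the concatenated calls by tumor sample and genomic position, and then applying pick_preferred to each group, would leave one call per mutation.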

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8435 tumor samples from 8373 patients.


DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)
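
Because participant, sample, and aliquot barcodes are nested prefixes of one another, an aliquot barcode can be reduced to the shorter forms used to join the molecular tables to the Clinical and Biospecimen tables. A minimal sketch (the class and function names are hypothetical):

```python
from typing import NamedTuple


class AliquotBarcode(NamedTuple):
    participant: str  # first 12 characters, e.g. TCGA-04-1517
    sample: str       # first 16 characters, e.g. TCGA-04-1517-01A
    analyte: str      # single letter such as "D" or "W"
    center: str       # final two characters (data-generating center)


def parse_aliquot_barcode(barcode: str) -> AliquotBarcode:
    """Split a hyphen-delimited TCGA aliquot barcode into its main components."""
    parts = barcode.split("-")
    if len(parts) != 7 or parts[0] != "TCGA":
        raise ValueError(f"not a TCGA aliquot barcode: {barcode!r}")
    return AliquotBarcode(
        participant="-".join(parts[:3]),
        sample="-".join(parts[:4]),
        analyte=parts[4][-1],
        center=parts[6],
    )
```

For the example barcode above, this gives the participant barcode TCGA-04-1517 and the sample barcode TCGA-04-1517-01A.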

Platform

DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline

DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details

Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential scientific notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
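
Values written in scientific notation cannot be passed straight to an integer parser, so the num_probes field needs an intermediate step. The sketch below shows one way to handle both fields; the function names are hypothetical and this is not necessarily how the ETL code does it.

```python
from decimal import Decimal


def parse_num_probes(text: str) -> int:
    """Parse an integer field that may be written as plain digits or as e.g. 8.7E+07."""
    value = Decimal(text.strip())
    if value != value.to_integral_value():
        raise ValueError(f"not an integer value: {text!r}")
    return int(value)


def parse_segment_mean(text: str) -> float:
    """Parse a floating-point field; float() already accepts scientific notation."""
    return float(text.strip())
```

Going through Decimal rather than float avoids any rounding surprises for large probe counts.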


DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to “read” a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform

DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline

DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.


ETL Details

The BigQuery table is populated only with the files whose names contain “HumanMethylation” and end in “.txt”. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
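
The merge described above (450k data taking precedence for samples run on both platforms, and missing beta values dropped) can be sketched as follows. The row layout, the "sample" key, and the "27k"/"450k" platform labels are hypothetical choices for this illustration, not the values stored in the actual table.

```python
MISSING = {None, "", "NA", "NULL"}


def merge_methylation(rows_27k, rows_450k):
    """Merge two lists of row dicts with keys "sample", "Probe_Id", "Beta_Value"."""
    samples_450k = {row["sample"] for row in rows_450k}
    merged = []
    for platform, rows in (("450k", rows_450k), ("27k", rows_27k)):
        for row in rows:
            # Samples assayed on both platforms keep only their 450k rows.
            if platform == "27k" and row["sample"] in samples_450k:
                continue
            # Probes without a usable beta value are left out.
            if row["Beta_Value"] in MISSING:
                continue
            merged.append({
                "sample": row["sample"],
                "Probe_Id": row["Probe_Id"],
                "Beta_Value": float(row["Beta_Value"]),
                "platform": platform,
            })
    return merged
```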


mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files whose names end in “rsem.genes.normalized_results”. These raw “RSEM genes normalized results” files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
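
A sketch of the gene_id split described above. The function name is hypothetical, and the HGNC lookup is represented by an ordinary dict passed in by the caller, since the actual lookup source is not described here.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id=None):
    """Split e.g. "TP53|7157" into the three columns described above."""
    symbol, sep, gene_id = raw_gene_id.partition("|")
    if not sep:
        raise ValueError(f"expected 'symbol|id', got {raw_gene_id!r}")
    lookup = hgnc_by_gene_id or {}
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None (NULL) when the id is not present in the lookup table.
        "HGNC_gene_symbol": lookup.get(gene_id),
    }
```

Called as split_unc_gene_id("TP53|7157", {"7157": "TP53"}), this yields the original symbol TP53, the gene_id 7157, and the looked-up symbol TP53.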

BCGSC RNAseq Pipeline

A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files whose names end in “gene.quantification.txt”. These raw “gene quantification” files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302|?|3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.


microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for “mature microRNAs”. This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. “hsa-mir-21.MIMAT0000076”. During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files whose names end in “isoform.quantification.txt” and which contain hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
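
To illustrate what “melting” the expression matrix means, the sketch below turns a small matrix (one entry per mature microRNA, one value per aliquot) into one row per microRNA and aliquot, splitting the combined label on the way. The output column names and the nested-dict input are hypothetical; the label is split at the start of the MIMAT accession so that it does not matter which separator precedes it.

```python
def split_mirna_label(label):
    """Split a combined label into (microRNA gene name, MIMAT accession)."""
    idx = label.rfind("MIMAT")
    if idx <= 0:
        raise ValueError(f"no MIMAT accession found in {label!r}")
    return label[:idx].rstrip("."), label[idx:]


def melt_matrix(matrix):
    """Flatten {label: {aliquot_barcode: value}} into a list of tidy row dicts."""
    rows = []
    for label, values_by_aliquot in matrix.items():
        name, accession = split_mirna_label(label)
        for aliquot, value in values_by_aliquot.items():
            rows.append({
                "mirna_id": name,
                "mirna_accession": accession,
                "AliquotBarcode": aliquot,
                "value": value,
            })
    return rows
```

A matrix with M microRNAs and N aliquots therefore becomes M × N rows, each of which can be loaded as one table record.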


Protein

The raw protein data file contains just two columns: the “Composite Element REF”, which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The “Composite Element REF” was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data files whose names contain “_RPPA_Core.protein_expression” and end in “.txt”.

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody-gene-protein map was generated.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If “protein_name” is missing, generate one from “composite_element_ref”

• Make a map of ‘composite_element_ref’, ‘gene_name’, ‘protein_name’ values

• Check any other variant of the gene and protein symbols in the table

• HGNC Validation

– If the gene symbol is in the HGNC approved symbols, ‘Approved’ Gene_symbol = Gene_symbol

– If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol

– If not, check the Previous symbols. If found, Gene_symbol = “Approved” Gene_symbol

– If not, Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm

Formatting

• Duplicate the rows if there are multiple genes concatenated in the “gene_name” value. For example, a ‘gene_name’ with a value like ‘AKT1 AKT2 AKT3’ is stored as three separate rows, with one gene in each row

• ‘Protein_Name’ is split into ‘Protein_Basename’ and ‘Phospho’, which are stored as separate columns

• ‘Composite element ref’ is parsed to get ‘validationStatus’ and ‘antibodySource’; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
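
The row-duplication rule in the first Formatting bullet can be sketched as below. The function name is hypothetical, and the separator between concatenated gene names is an assumption: the sketch accepts whitespace, commas, or semicolons.

```python
import re


def explode_gene_names(row):
    """Return one copy of `row` per gene listed in its "gene_name" value."""
    genes = [g for g in re.split(r"[\s,;]+", row["gene_name"].strip()) if g]
    return [{**row, "gene_name": gene} for gene in genes]
```

A row whose gene_name is ‘AKT1 AKT2 AKT3’ becomes three rows that are identical apart from the gene_name column.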


1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (a.k.a. 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.
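
Combining gene coordinates with copy-number segments comes down to an interval-overlap test. The sketch below shows that test in plain Python on rows that have already been retrieved; the key names chr, start, and end are hypothetical and are not the column names of the GENCODE_r19 or copy-number tables.

```python
def overlapping_segments(gene, segments):
    """Return the segments on the gene's chromosome that overlap its span.

    Coordinates are treated as closed intervals [start, end].
    """
    return [
        seg for seg in segments
        if seg["chr"] == gene["chr"]
        and seg["start"] <= gene["end"]
        and seg["end"] >= gene["start"]
    ]
```

The same condition can be written as a join predicate when the lookup is done directly in BigQuery rather than in client code.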

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.


1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes an Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of the new isb-cgc:genome_reference dataset with the miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to the isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS (?)


Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics


1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user-data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to “bring the computation to the data”. There are many ways that this can be done, using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single “monolithic” system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily “torn down” (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm’s resource needs.
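As a rough illustration of requesting many identical, parametrized machines, the sketch below stamps out machine descriptions from a single template. Everything in it (the function name, the field names, the machine-type label, and the startup script name) is a hypothetical placeholder chosen for illustration; it is not an ISB-CGC or Google API, and a real deployment would hand equivalent settings to the Compute Engine tooling of your choice.

```python
import copy

# Hypothetical template: field names are illustrative only, not a real API schema.
BASE_TEMPLATE = {
    "machine_type": "example-4cpu-15gb",  # placeholder label for CPU/RAM needs
    "disk_gb": 100,
    "startup_script": "run_analysis.sh",  # every machine boots into the same state
}


def make_worker_configs(count, **overrides):
    """Return `count` identical worker descriptions, parametrized by overrides."""
    template = {**BASE_TEMPLATE, **overrides}
    return [
        {**copy.deepcopy(template), "name": f"worker-{i:04d}"}
        for i in range(count)
    ]


# Three workers that differ only in name; disk size is tuned for this job.
workers = make_worker_configs(3, disk_gb=500)
for worker in workers:
    print(worker["name"], worker["machine_type"], worker["disk_gb"])
```

Because every description is generated from the same template, tearing the machines down and regenerating them later yields the same starting configuration.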

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.


1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.


1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints has been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web app and then programmatically access these cohorts using this API.

The Cohort API currently bundles several different cohort-related endpoints:
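The relationship between sample and patient barcodes can be sketched in a few lines of Python. The helper names below are hypothetical, and the second sample barcode in the usage example is made up for illustration; only the 16-character and 12-character lengths come from the description above.

```python
SAMPLE_BARCODE_LENGTH = 16
PATIENT_BARCODE_LENGTH = 12


def patient_barcode_from_sample(sample_barcode):
    """Return the 12-character patient barcode for a 16-character sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError(
            f"expected {SAMPLE_BARCODE_LENGTH} characters, got {len(sample_barcode)}"
        )
    return sample_barcode[:PATIENT_BARCODE_LENGTH]


def group_samples_by_patient(sample_barcodes):
    """Map each patient barcode to the list of its sample barcodes."""
    groups = {}
    for barcode in sample_barcodes:
        groups.setdefault(patient_barcode_from_sample(barcode), []).append(barcode)
    return groups


print(patient_barcode_from_sample("TCGA-B9-7268-01A"))  # prints TCGA-B9-7268

# The second barcode is an invented example of another sample from the same patient.
print(group_samples_by_patient(["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"]))
```

Grouping samples by their patient prefix is a convenient way to check that the two lists in a cohort agree with each other.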

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
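A minimal Python sketch of the same kind of request, using only the standard library, might look like the following. It builds the POST request without sending it; the function name is hypothetical, the URL is the preview endpoint with its punctuation written out, and the single-study filter is just an illustrative value.

```python
import json
import urllib.request

PREVIEW_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})
print(request.get_method(), request.full_url)

# To actually call the service (requires network access), something like:
# with urllib.request.urlopen(request) as response:
#     preview = json.load(response)
```

Keeping request construction separate from sending makes the filters easy to inspect before any network call is made.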

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

    adenocarcinoma_invasion: string
    age_at_initial_pathologic_diagnosis: string
    anatomic_neoplasm_subdivision: string
    avg_percent_lymphocyte_infiltration: float
    avg_percent_monocyte_infiltration: float
    avg_percent_necrosis: float
    avg_percent_neutrophil_infiltration: float
    avg_percent_normal_cells: float
    avg_percent_stromal_cells: float
    avg_percent_tumor_cells: float
    avg_percent_tumor_nuclei: float
    batch_number: integer
    bcr: string
    clinical_M: string
    clinical_N: string
    clinical_stage: string
    clinical_T: string
    colorectal_cancer: string
    country: string
    country_of_procurement: string
    days_to_birth: integer
    days_to_collection: integer
    days_to_death: integer
    days_to_initial_pathologic_diagnosis: integer
    days_to_last_followup: integer
    days_to_submitted_specimen_dx: integer
    Study: string
    ethnicity: string
    frozen_specimen_anatomic_site: string
    gender: string
    height: integer
    histological_type: string
    history_of_colon_polyps: string
    history_of_neoadjuvant_treatment: string
    history_of_prior_malignancy: string
    hpv_calls: string
    hpv_status: string
    icd_10: string
    icd_o_3_histology: string
    icd_o_3_site: string
    lymph_node_examined_count: integer
    lymphatic_invasion: string
    lymphnodes_examined: string
    lymphovascular_invasion_present: string
    max_percent_lymphocyte_infiltration: integer
    max_percent_monocyte_infiltration: integer
    max_percent_necrosis: integer
    max_percent_neutrophil_infiltration: integer
    max_percent_normal_cells: integer
    max_percent_stromal_cells: integer
    max_percent_tumor_cells: integer
    max_percent_tumor_nuclei: integer
    menopause_status: string
    min_percent_lymphocyte_infiltration: integer
    min_percent_monocyte_infiltration: integer
    min_percent_necrosis: integer
    min_percent_neutrophil_infiltration: integer
    min_percent_normal_cells: integer
    min_percent_stromal_cells: integer
    min_percent_tumor_cells: integer
    min_percent_tumor_nuclei: integer
    mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
    mononucleotide_marker_panel_analysis_status: string
    neoplasm_histologic_grade: string
    new_tumor_event_after_initial_treatment: string
    number_of_lymphnodes_examined: integer
    number_of_lymphnodes_positive_by_he: integer
    ParticipantBarcode: string
    pathologic_M: string
    pathologic_N: string
    pathologic_stage: string
    pathologic_T: string
    person_neoplasm_cancer_status: string
    pregnancies: string
    preservation_method: string
    primary_neoplasm_melanoma_dx: string
    primary_therapy_outcome_success: string
    prior_dx: string
    Project: string
    psa_value: float
    race: string
    residual_tumor: string
    SampleBarcode: string
    tobacco_smoking_history: string
    total_number_of_pregnancies: integer
    tumor_tissue_site: string
    tumor_pathology: string
    tumor_type: string
    weiss_venous_invasion: string
    vital_status: string
    weight: integer
    year_of_initial_pathologic_diagnosis: string
    SampleTypeCode: string
    has_Illumina_DNASeq: string
    has_BCGSC_HiSeq_RNASeq: string
    has_UNC_HiSeq_RNASeq: string
    has_BCGSC_GA_RNASeq: string
    has_UNC_GA_RNASeq: string
    has_HiSeq_miRnaSeq: string
    has_GA_miRNASeq: string
    has_RPPA: string
    has_SNP6: string
    has_27k: string
    has_450k: string

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: “True”, “False”, or “None”
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: “True”, “False”, or “None”
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: “True”, “False”, or “None”
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: “True”, “False”, or “None”
has_RPPA | string | Indicates if a sample has protein array data: “True”, “False”, or “None”
has_SNP6 | string | Indicates if a sample has copy number data: “True”, “False”, or “None”
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: “True”, “False”, or “None”
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: “True”, “False”, or “None”
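Since each filter has a declared type, it can be useful to check a filter object locally before sending it. The sketch below is one possible way to do that, using a small excerpt of the parameter table; the helper names are hypothetical, and whether the service itself rejects mistyped values is not stated here, so treat this purely as a client-side convenience.

```python
# A small excerpt of the parameter table: name -> expected Python type.
FILTER_TYPES = {
    "Study": str,
    "gender": str,
    "vital_status": str,
    "days_to_birth": int,
    "batch_number": int,
    "avg_percent_necrosis": float,
    "has_RPPA": str,  # documented values are the strings "True", "False", "None"
}


def _matches(value, expected):
    """Return True if value is acceptable for the expected type."""
    if isinstance(value, bool):
        return False  # bool is a subclass of int in Python; reject it explicitly
    if expected is float:
        return isinstance(value, (int, float))  # whole numbers are fine for floats
    return isinstance(value, expected)


def check_filters(filters):
    """Return a list of problems found in a filter dict (empty if none)."""
    problems = []
    for name, value in filters.items():
        expected = FILTER_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: not in the excerpted parameter table")
        elif not _matches(value, expected):
            problems.append(f"{name}: expected {expected.__name__}")
    return problems


print(check_filters({"Study": "BRCA", "has_RPPA": "True"}))
print(check_filters({"has_RPPA": True, "batch_number": "12"}))
```

Note that the has_* flags are strings rather than booleans, which is an easy detail to get wrong when building filters by hand.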

Response

If successful, this method returns a response body with the following structure:

    kind: cohort_api#cohortsItem
    patient_count: string
    patients: [string]
    sample_count: string
    samples: [string]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
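Because the two counts are returned as strings, client code will usually want to convert them before doing arithmetic. The following sketch parses a response body of the shape described above; the function name is hypothetical, the example body is hand-written for illustration (it is not real service output, and its second sample barcode is invented), and the consistency check assumes that the counts equal the lengths of the barcode lists.

```python
import json


def summarize_cohort(response_text):
    """Parse a cohort response body and return its counts as integers."""
    cohort = json.loads(response_text)
    patients = cohort.get("patients", [])
    samples = cohort.get("samples", [])
    summary = {
        "patient_count": int(cohort["patient_count"]),
        "sample_count": int(cohort["sample_count"]),
    }
    # Assumption: the reported counts equal the lengths of the barcode lists.
    summary["consistent"] = (
        summary["patient_count"] == len(patients)
        and summary["sample_count"] == len(samples)
    )
    return summary


# Hand-written body for illustration only.
EXAMPLE_BODY = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "2",
    "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"],
})

print(summarize_cohort(EXAMPLE_BODY))
```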


patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
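As a minimal sketch, the request URL for this endpoint could be assembled in Python as shown below. One important assumption: the barcode is appended here as a query-string parameter named patient_barcode, which is a guess about how the parameter is transmitted and should be checked against the APIs Explorer. The function name is hypothetical.

```python
from urllib.parse import urlencode

PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)


def patient_details_url(patient_barcode):
    """Return a GET URL for one participant (query-string form is an assumption)."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are expected to be 12 characters long")
    query = urlencode({"patient_barcode": patient_barcode})
    return f"{PATIENT_DETAILS_URL}?{query}"


print(patient_details_url("TCGA-B9-7268"))
```

Checking the barcode length up front catches the common mistake of passing a 16-character sample barcode to the participant endpoint.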

Response

If successful, this method returns a response body with the following structure:

    kind: cohort_api#cohortsItem
    aliquots: [string]
    clinical_data:
        ParticipantBarcode: string
        Project: string
        Study: string
        age_at_initial_pathologic_diagnosis: string
        anatomic_neoplasm_subdivision: string
        batch_number: string
        bcr: string
        clinical_M: string
        clinical_N: string
        clinical_T: string
        clinical_stage: string
        colorectal_cancer: string
        country: string
        days_to_birth: string
        days_to_initial_pathologic_diagnosis: string
        days_to_last_followup: string
        ethnicity: string
        frozen_specimen_anatomic_site: string
        gender: string
        histological_type: string
        history_of_colon_polyps: string
        history_of_neoadjuvant_treatment: string
        history_of_prior_malignancy: string
        hpv_calls: string
        hpv_status: string
        icd_10: string
        icd_o_3_histology: string
        icd_o_3_site: string
        lymphatic_invasion: string
        lymphnodes_examined: string
        lymphovascular_invasion_present: string
        menopause_status: string
        mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
        mononucleotide_marker_panel_analysis_status: string
        neoplasm_histologic_grade: string
        new_tumor_event_after_initial_treatment: string
        number_of_lymphnodes_examined: string
        number_of_lymphnodes_positive_by_he: string
        pathologic_M: string
        pathologic_N: string
        pathologic_T: string
        pathologic_stage: string
        person_neoplasm_cancer_status: string
        pregnancies: string
        primary_neoplasm_melanoma_dx: string
        primary_therapy_outcome_success: string
        prior_dx: string
        race: string
        residual_tumor: string
        tobacco_smoking_history: string
        tumor_tissue_site: string
        tumor_type: string
        vital_status: string
        weiss_venous_invasion: string
        year_of_initial_pathologic_diagnosis: string
    samples: []

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. “TCGA”
clinical_data.Study | string | Tumor type abbreviation, e.g. “BRCA”
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
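Every clinical_data property is returned as a string, including values that are numeric in nature such as days_to_birth. The sketch below shows one way a client might convert a few of them; the helper names and the example values are hypothetical, and how the service encodes a missing value (for instance as the string "None" or by omitting the key) is an assumption, so both cases are simply mapped to None.

```python
NUMERIC_FIELDS = (
    "days_to_birth",
    "days_to_last_followup",
    "number_of_lymphnodes_examined",
)


def to_int_or_none(value):
    """Convert a string such as "-21915" to int; return None if it is not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def numeric_clinical_fields(patient_details):
    """Extract selected clinical_data fields from a parsed response as integers."""
    clinical = patient_details.get("clinical_data", {})
    return {name: to_int_or_none(clinical.get(name)) for name in NUMERIC_FIELDS}


# Hand-written example for illustration; the values are invented.
example = {
    "clinical_data": {
        "ParticipantBarcode": "TCGA-B9-7268",
        "days_to_birth": "-21915",
        "days_to_last_followup": "None",
    }
}
print(numeric_clinical_fields(example))
```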

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.
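As a rough illustration of how a script might address this endpoint, the sketch below builds the request URL in Python. It rests on two assumptions that the text does not confirm: that the base URL is a correct reading of the request line above, and that sample_barcode is passed as a query-string parameter. The helper name sample_details_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL, read from the HTTP request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def sample_details_url(sample_barcode):
    """Return a GET URL for the sample_details endpoint (hypothetical helper)."""
    # The documentation describes sample barcodes as 16 characters long.
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode")
    # Assumption: the parameter travels in the query string.
    query = urlencode({"sample_barcode": sample_barcode})
    return BASE_URL + "/sample_details?" + query


print(sample_details_url("TCGA-B9-7268-01A"))
```

The resulting URL could then be fetched with any HTTP client, for example urllib.request.urlopen, and the body parsed as JSON.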

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by the TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of the SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of the data_details list.
patient | string | Participant barcode.


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP-authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
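Because the datafilenamekeys field can be missing entirely and individual entries can be placeholders, client code has to read this response defensively. The following is a minimal sketch, assuming the response body is JSON shaped as described above; the function name file_paths, the bucket name, and the file names are hypothetical and used for illustration only.

```python
import json


def file_paths(response_body):
    """Return usable cloud storage paths from a response body (hypothetical helper)."""
    payload = json.loads(response_body)
    # The field is absent when no file paths are visible to the caller.
    keys = payload.get("datafilenamekeys", [])
    # Drop placeholder entries for files whose path is not yet available.
    return [key for key in keys if "file-path-not-yet-available" not in key]


# Illustrative response; the bucket and file names are made up.
example = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/example-file.txt",
        "example-bucket/file-path-not-yet-available",
    ],
})
print(file_paths(example))
```

Note that count arrives as a string, so it would need an explicit int() conversion before any numeric comparison.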


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute, with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
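If you launch VMs from a script rather than by hand, it can help to assemble the command as a list of arguments. The sketch below is one way to do that in Python; the function name create_instance_command is hypothetical, and the code only builds and prints the command rather than running it.

```python
import shlex


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Build the argument list for gcloud compute instances create."""
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    # Leaving out --zone lets gcloud fall back on the compute/zone property.
    if zone is not None:
        cmd += ["--zone", zone]
    return cmd


cmd = create_instance_command("my-instance", zone="us-central1-a")
print(shlex.join(cmd))
```

To actually create the VM (which starts incurring charges), the list could be passed to subprocess.run(cmd, check=True) on a machine where the Cloud SDK is installed and authenticated.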

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI: simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources that are attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
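The two prices quoted above imply a rate of $0.04 per GB per month for standard persistent disks. The short sketch below uses that rate to estimate the monthly cost of other disk sizes. It assumes that pricing is linear in size and that 1 TB is counted as 1000 GB; the rate reflects the figures in this document, not necessarily current Google pricing.

```python
# Rate implied by the figures in the text: $2 per month for 50 GB.
STANDARD_PD_USD_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb):
    """Estimate the monthly cost in USD of a standard persistent disk."""
    # Assumption: cost scales linearly with the provisioned size.
    return size_gb * STANDARD_PD_USD_PER_GB_MONTH


for size_gb in (50, 500, 1000):
    print(f"{size_gb} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```

Under these assumptions a 500 GB disk left attached to a stopped VM would cost about $20 per month.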

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then attach it to the instance, and finally format it. When you are finished you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point. On the instance, the disk appears under /dev/disk/by-id/ as google- followed by its device name:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
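The full life cycle of an additional disk (create, attach, detach, delete) can be summarized as an ordered list of gcloud commands. The sketch below prints them in the order they would be run; the function name disk_lifecycle_commands is hypothetical, and the formatting, mounting, and unmounting steps are left out because they happen inside the VM rather than through gcloud.

```python
import shlex


def disk_lifecycle_commands(disk, instance, size="500GB"):
    """Return the gcloud commands for a data disk, in the order they are run."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance, "--disk", disk],
        # Format, mount, use, and unmount the disk on the instance here.
        ["gcloud", "compute", "instances", "detach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


for cmd in disk_lifecycle_commands("disk-1", "my-instance"):
    print(shlex.join(cmd))
```

Keeping the commands as lists makes it straightforward to hand each one to subprocess.run later, without worrying about shell quoting.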


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
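The three set operations behave like ordinary set arithmetic. The sketch below models cohorts as Python sets of sample identifiers; this is an illustration of the semantics described above, not the web application's implementation, and the sample identifiers are made up.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts for illustration.
a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
c = {"S3", "S5"}

print(sorted(union(a, b, c)))
print(sorted(intersect(a, b, c)))
print(sorted(complement(a, b, c)))
print(sorted(complement(b, a)))
```

Union and intersect give the same result whatever the order of their arguments, whereas complement depends on which cohort is chosen as the base, which is why the dialogue asks for a base cohort.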

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead” and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
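
The counting rule described above, where the number beside each filter value takes into account all other selected filters, can be sketched as follows. This is only an illustration and not the web-app's implementation: the sample records, the attribute names and the function facet_counts are made up for this example.

```python
from collections import Counter

# Hypothetical sample records; attribute names and values are illustrative only.
samples = [
    {"barcode": "S1", "vital_status": "Alive", "gender": "FEMALE"},
    {"barcode": "S2", "vital_status": "Dead", "gender": "FEMALE"},
    {"barcode": "S3", "vital_status": "Alive", "gender": "MALE"},
    {"barcode": "S4", "vital_status": None, "gender": "MALE"},
]


def facet_counts(samples, attribute, selected):
    """Count samples per value of `attribute`, applying every selected
    filter except the one on `attribute` itself."""
    counts = Counter()
    for sample in samples:
        keep = all(
            sample[attr] in allowed
            for attr, allowed in selected.items()
            if attr != attribute
        )
        if keep:
            # A missing value is shown under the label "None".
            counts[str(sample[attribute])] += 1
    return counts


# One filter is selected: Gender = FEMALE.
selected = {"gender": {"FEMALE"}}
vital_counts = facet_counts(samples, "vital_status", selected)
gender_counts = facet_counts(samples, "gender", selected)
```

With the Gender filter selected, the Vital Status counts are restricted to the two female samples, while the Gender counts themselves still cover all four samples, because a filter is not applied to its own attribute in this sketch.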

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
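
The number shown when hovering over a swatch can be thought of as a count of samples per pair of platforms. The following is a minimal sketch of that idea, not the code behind the panel; the platform names, the availability table and the function swatch_counts are hypothetical.

```python
from collections import Counter

# Hypothetical platform assignments per sample; "NA" means the data type
# is not available for that sample. All names here are made up.
availability = {
    "S1": {"DNA Methylation": "PlatformM1", "Protein": "PlatformP"},
    "S2": {"DNA Methylation": "PlatformM1", "Protein": "NA"},
    "S3": {"DNA Methylation": "PlatformM2", "Protein": "PlatformP"},
    "S4": {"DNA Methylation": "PlatformM1", "Protein": "PlatformP"},
}


def swatch_counts(availability, left, right):
    """Count samples for each (left platform, right platform) pair,
    i.e. one count per swatch between two adjacent vertical bars."""
    return Counter(
        (platforms[left], platforms[right])
        for platforms in availability.values()
    )


counts = swatch_counts(availability, "DNA Methylation", "Protein")
```

Reordering the vertical categories in the panel corresponds to calling the function with a different pair of adjacent data types.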

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
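
The three set operations behave like ordinary set operations on the sample lists of the cohorts. A minimal sketch in Python, assuming each cohort is represented as a set of sample barcodes (the barcodes and function names below are made up for illustration):

```python
# Hypothetical cohorts, each represented as a set of sample barcodes.
cohort_a = {"S1", "S2", "S3", "S4"}
cohort_b = {"S2", "S3", "S5"}
cohort_c = {"S3", "S6"}


def union(*cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort; order does not matter."""
    return set.intersection(*cohorts)


def complement(base, *others):
    """Samples of the base cohort that are in none of the other cohorts."""
    return base.difference(*others)


all_samples = union(cohort_a, cohort_b, cohort_c)
shared_samples = intersect(cohort_a, cohort_b, cohort_c)
only_in_a = complement(cohort_a, cohort_b, cohort_c)
only_in_b = complement(cohort_b, cohort_a)
```

Unlike union and intersect, the result of complement depends on which cohort is chosen as the base: here only_in_a and only_in_b differ even though both involve cohort_a and cohort_b.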

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest on the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
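
Once downloaded, the file list can be processed with any CSV reader. The sketch below groups Cloud Storage paths by platform using Python's standard csv module. The exact header strings in the real download are not documented here, so the column names, the example rows and the bucket name are assumptions made for this example.

```python
import csv
import io
from collections import defaultdict

# Made-up stand-in for a downloaded file list. The real header names and
# paths may differ; adjust the keys below to match the actual CSV.
file_list_csv = io.StringIO(
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "SAMPLE-01,PlatformA,pipeline_x,Level 3,gs://example-bucket/a.txt\n"
    "SAMPLE-02,PlatformA,pipeline_x,Level 3,gs://example-bucket/b.txt\n"
    "SAMPLE-01,PlatformB,pipeline_y,Level 3,gs://example-bucket/c.txt\n"
)


def paths_by_platform(handle):
    """Group the Cloud Storage paths in a file list by platform."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


grouped = paths_by_platform(file_list_csv)
```

For a real download, replace the in-memory text with an opened file, for example open("file_list.csv", newline=""), where the file name is whatever you saved the download as.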

Commenting: Any user who owns a cohort or has had it shared with them can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
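
The Owner and Reader rules above can be summarized in a small model. This is a sketch for illustration only, not the web-app's data model; the Cohort class, its fields and the user names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """Hypothetical cohort record with an owner and a set of readers."""
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)
    comments: list = field(default_factory=list)

    def permission(self, user):
        """Return "owner", "reader", or None if the user has no access."""
        if user == self.owner:
            return "owner"
        return "reader" if user in self.readers else None

    def share(self, acting_user, new_reader):
        # Only the owner may share a cohort.
        if self.permission(acting_user) != "owner":
            raise PermissionError("only the owner can share")
        self.readers.add(new_reader)

    def comment(self, user, text):
        # Owners and readers may both comment.
        if self.permission(user) is None:
            raise PermissionError("no access to this cohort")
        self.comments.append((user, text))

    def clone(self, user):
        # The copy keeps only the sample list; the cloning user owns it.
        if self.permission(user) is None:
            raise PermissionError("no access to this cohort")
        return Cohort(f"Copy of {self.name}", owner=user, samples=self.samples)


original = Cohort("Example cohort", owner="alice", samples=frozenset({"S1", "S2"}))
original.share("alice", "bob")
original.comment("bob", "Looks good")
copy = original.clone("bob")
```

In this sketch the clone carries over the sample list only, so the reader becomes the owner of a fresh cohort without the original's comments or sharing settings.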

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2
  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.

• December 3, 2015: v0.1
  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
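
The same tables can also be queried from Python. The following is a minimal sketch that assumes the google-cloud-bigquery client library is installed and that you have credentials for a Google Cloud Project of your own to run (and be billed for) the query. The dataset, table and column names used here are placeholders, not actual ISB-CGC table names; look up the real names in the BigQuery web interface first.

```python
def build_count_query(dataset, table, column):
    """Return a standard-SQL query that counts rows per value of `column`
    in a table of the "isb-cgc" project."""
    return (
        f"SELECT {column}, COUNT(*) AS n\n"
        f"FROM `isb-cgc.{dataset}.{table}`\n"
        f"GROUP BY {column}\n"
        f"ORDER BY n DESC"
    )


def run_query(sql, billing_project):
    """Run the query in your own project and return the rows as dicts."""
    # Imported here so that the query builder above can be used and
    # inspected even where the client library is not installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names: replace them with a real dataset, table and column.
sql = build_count_query("example_dataset", "Example_Clinical", "vital_status")
```

To execute the query, call run_query(sql, "your-project-id") with the ID of the Google Cloud Project you are a member of.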

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
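
If you schedule long-running jobs against controlled-access data, it can help to check whether you are still within the 24-hour window. A minimal sketch, assuming you record the time of your last successful verification yourself; the function name and the exact handling of the window boundary are assumptions, not documented platform behavior.

```python
from datetime import datetime, timedelta, timezone

# Length of the access window stated above.
ACCESS_WINDOW = timedelta(hours=24)


def has_controlled_access(verified_at, now):
    """Return True while `now` falls within 24 hours of verification."""
    return verified_at <= now < verified_at + ACCESS_WINDOW


# Example: verification recorded at 09:00 UTC on March 3, 2016.
verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
still_valid = has_controlled_access(verified, verified + timedelta(hours=23))
expired = not has_controlled_access(verified, verified + timedelta(hours=25))
```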

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

163 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use cases designed to suit the needs of a variety of end users. If you have a particular use case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcases some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

• country

• history_of_prior_malignancy

• frozen_specimen_anatomic_site

When an auxiliary XML file exists for a participant and the batch numbers in both the clinical XML and the auxiliary XML file match, the following fields are extracted from the auxiliary XML file and added to the Clinical table:

• hpv_calls

• hpv_status

• mononucleotide_and_dinucleotide_marker_panel_analysis_status

• mononucleotide_marker_panel_analysis_status

Finally, the patient BMI was calculated based on the height and weight values (when both were present) and was added to the Clinical table.
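As a rough illustration of the BMI step, the sketch below applies the standard formula (weight in kilograms divided by the square of height in metres). It assumes height is recorded in centimetres and weight in kilograms; the function name and the rounding to one decimal place are illustrative choices, not details of the ISB-CGC pipeline.

```python
from typing import Optional


def compute_bmi(height_cm: Optional[float], weight_kg: Optional[float]) -> Optional[float]:
    """Return BMI, or None when either value is missing (or zero)."""
    # Assumption: height is in centimetres and weight is in kilograms.
    if not height_cm or not weight_kg:
        return None
    height_m = height_cm / 100.0
    # Rounding to one decimal place is an illustrative choice.
    return round(weight_kg / (height_m ** 2), 1)


example_bmi = compute_bmi(170, 65)
missing_bmi = compute_bmi(None, 65)
```

Here `missing_bmi` is `None`, mirroring the "when both were present" condition described above.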

Biospecimen

The Biospecimen_data table contains one row per TCGA sample. Each TCGA sample is uniquely represented by a TCGA barcode of length 16, e.g. TCGA-2G-AAM4-10A. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

XML Parsing: The TCGA data at the DCC exists in XML files, which have been uploaded into Google Cloud Storage. Selected fields from these XML files were then extracted and loaded into the "Biospecimen_data" table in BigQuery.

Some of the biospecimen values in the XML files are available on a per-slide and/or per-portion basis, and these have been aggregated and averaged. The number of slides and the number of portions per sample are also included in the table.

Filters

• Samples for which is_ffpe=True were removed

• Patients or Samples for which the Project value was not TCGA were removed

Transforms

• pregnancies and total_number_of_pregnancies were merged into a single pregnancies field. Counts of four or more are represented as 4+ (e.g. [0, 1, 2, 3, 4+])

• number_of_lymphnodes_examined and lymph_node_examined_count were merged into a single number_of_lymphnodes_examined field
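The pregnancies transform can be sketched as follows. The source does not say which of the two fields wins when both are populated, so taking `pregnancies` first is an assumption, as is the function name.

```python
def merge_pregnancies(pregnancies, total_number_of_pregnancies):
    """Merge the two source fields into one value in {0, 1, 2, 3, 4+}."""
    # Assumption: prefer `pregnancies`, fall back to the other field.
    value = pregnancies if pregnancies is not None else total_number_of_pregnancies
    if value is None:
        return None
    count = int(value)
    # Counts of four or more collapse into the single category "4+".
    return "4+" if count >= 4 else str(count)


merged = [merge_pregnancies("2", None), merge_pregnancies(None, 6)]
```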

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed; the file was then annotated using Oncotator and further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37

• Remove any lines where the chr is not in [1-22, X, Y]

• Remove any lines where the Mutation_Status is not Somatic

• Remove any lines where the Sequencer is not an Illumina platform

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.)
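The data-cleaning rules above can be sketched as a simple row filter. This is not the ISB-CGC ETL code: rows are assumed to be dictionaries keyed by the pre-rename labels quoted above, and the Illumina check is a plain substring test chosen for illustration.

```python
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}
# Only the two renames named in the text are included here.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Apply the four filters, then relabel columns for Oncotator."""
    cleaned = []
    for row in rows:
        if str(row.get("ncbi_build")) != "37":
            continue
        if str(row.get("chromosome")) not in VALID_CHROMOSOMES:
            continue
        if row.get("Mutation_Status") != "Somatic":
            continue
        # Assumption: any Sequencer value mentioning "Illumina" qualifies.
        if "illumina" not in str(row.get("Sequencer", "")).lower():
            continue
        cleaned.append({RENAMES.get(key, key): value for key, value in row.items()})
    return cleaned


sample_rows = [
    {"ncbi_build": "37", "chromosome": "17", "Mutation_Status": "Somatic",
     "Sequencer": "Illumina HiSeq"},
    {"ncbi_build": "36", "chromosome": "17", "Mutation_Status": "Somatic",
     "Sequencer": "Illumina HiSeq"},
    {"ncbi_build": "37", "chromosome": "MT", "Mutation_Status": "Somatic",
     "Sequencer": "Illumina HiSeq"},
]
cleaned_rows = clean_maf_rows(sample_rows)
```

Only the first sample row survives; the second fails the build filter and the third fails the chromosome filter.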

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. In order to eliminate over-counting of mutations, we sought to remove these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), preferring first:

– 01, 08, or 14 (all of which refer to broad.mit.edu)

– 09, 21, or 30 (all of which refer to genome.wustl.edu)

– 10 or 12 (both of which refer to hgsc.bcm.edu)

– 13 or 31 (both of which refer to bcgsc.ca)

– 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center with the same type of matched normal and the same type of analyte, then we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8435 tumor samples from 8373 patients.
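The preference rules for duplicate calls can be sketched as a ranking over candidate calls at the same position in the same tumor sample. This is an illustration rather than the pipeline's code. The dictionary keys (`tumor_barcode`, `normal_barcode`, `chr`, `start`) are hypothetical, and the matched-normal type is read from the TCGA sample-type code in the normal barcode (10 for blood derived normal).

```python
CENTER_GROUPS = [("01", "08", "14"), ("09", "21", "30"), ("10", "12"),
                 ("13", "31"), ("18", "25")]
CENTER_RANK = {code: rank for rank, group in enumerate(CENTER_GROUPS) for code in group}


def rank(call):
    """Lower tuples are preferred: normal type, then analyte, then center."""
    tumor, normal = call["tumor_barcode"], call["normal_barcode"]
    normal_rank = 0 if normal[13:15] == "10" else 1   # blood derived normal first
    analyte_rank = 0 if tumor[19] == "D" else 1       # "D" analyte before "W"
    center_rank = CENTER_RANK.get(tumor[-2:], len(CENTER_GROUPS))
    return (normal_rank, analyte_rank, center_rank)


def is_preferred(candidate, current):
    if rank(candidate) != rank(current):
        return rank(candidate) < rank(current)
    # Tie-break on the 4-character plate field, barcode[21:25]; comparing
    # as strings matches numeric order when all four characters are digits.
    return candidate["tumor_barcode"][21:25] > current["tumor_barcode"][21:25]


def remove_duplicate_calls(calls):
    best = {}
    for call in calls:
        # Key: tumor sample (first 16 characters) plus genomic position.
        key = (call["tumor_barcode"][:16], call["chr"], call["start"])
        if key not in best or is_preferred(call, best[key]):
            best[key] = call
    return list(best.values())


calls = [
    {"tumor_barcode": "TCGA-B0-5695-01A-11W-1584-10",
     "normal_barcode": "TCGA-B0-5695-11A-01W-1584-10", "chr": "1", "start": 100},
    {"tumor_barcode": "TCGA-B0-5695-01A-11D-1534-10",
     "normal_barcode": "TCGA-B0-5695-11A-01D-1534-10", "chr": "1", "start": 100},
]
kept = remove_duplicate_calls(calls)
```

With the two aliquots from the example above (the normal barcodes are made up), only the "D" analyte call is kept.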

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP Array 6.0.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.
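A minimal sketch of how values written in exponential notation might be coerced to the two target types is shown below. The function names are illustrative and not taken from the ETL code.

```python
def parse_num_probes(text: str) -> int:
    """Interpret a probe count that may be written in exponential notation."""
    # int("8.7E+07") raises ValueError, so go through float first.
    return int(float(text))


def parse_segment_mean(text: str) -> float:
    """Interpret a segment mean as a floating-point value."""
    return float(text)


probes = parse_num_probes("8.7E+07")
plain_probes = parse_num_probes("152")
```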

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: The BigQuery table is populated only with the files whose names contain HumanMethylation and end in .txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
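The merge and filter rules above might be sketched like this. The record layout (`sample`, `platform`, `Probe_Id`, `Beta_Value`) and the platform labels `"27k"` and `"450k"` are assumptions made for the example, as is deciding precedence at the sample level.

```python
MISSING = (None, "NA", "NULL")


def merge_methylation(records):
    """Drop missing beta values; prefer 450k data when a sample has both."""
    samples_with_450k = {r["sample"] for r in records if r["platform"] == "450k"}
    merged = []
    for r in records:
        if r["Beta_Value"] in MISSING:
            continue
        if r["platform"] == "27k" and r["sample"] in samples_with_450k:
            continue  # the same sample was also run on the 450k array
        merged.append(r)
    return merged


records = [
    {"sample": "S1", "platform": "27k", "Probe_Id": "cg00000292", "Beta_Value": 0.41},
    {"sample": "S1", "platform": "450k", "Probe_Id": "cg00000292", "Beta_Value": 0.44},
    {"sample": "S2", "platform": "27k", "Probe_Id": "cg00000292", "Beta_Value": "NA"},
    {"sample": "S3", "platform": "27k", "Probe_Id": "cg00000292", "Beta_Value": 0.10},
]
merged_records = merge_methylation(records)
```

Sample S1 keeps only its 450k value, S2 is dropped because its value is NA, and S3 keeps its 27k value.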

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files whose names end in .rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
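The split described above can be sketched in a few lines. The `hgnc_by_gene_id` lookup table is a hypothetical stand-in for whatever HGNC resource the pipeline consults, and its single entry exists only to make the example run.

```python
def split_unc_gene_id(raw, hgnc_by_gene_id):
    """Split 'SYMBOL|ID' and attach the looked-up HGNC symbol."""
    original_gene_symbol, gene_id = raw.split("|")
    return {
        "original_gene_symbol": original_gene_symbol,
        "gene_id": gene_id,
        # None when the (hypothetical) lookup table has no entry.
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }


hgnc_by_gene_id = {"7157": "TP53"}  # illustrative lookup table
row = split_unc_gene_id("TP53|7157", hgnc_by_gene_id)
```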

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files whose names end in .gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302|?|3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id, and, if there is a third component, a gene_addenda column. If one component is simply ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
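A sketch of the two-or-three-part split follows. It assumes that the placeholder for an unknown component is the single character `?`, and it leaves suffixes such as `_calculated` untouched because the text does not say whether they are stripped. The function name is illustrative.

```python
def split_bcgsc_gene(raw, placeholder="?"):
    """Split a BCGSC gene string into its two or three components."""
    parts = [None if part == placeholder else part for part in raw.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three '|'-separated parts: %r" % raw)
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2] if len(parts) == 3 else None,
    }


two_part = split_bcgsc_gene("TP53|7157_calculated")
three_part = split_bcgsc_gene("Mir_1302|?|3of7_calculated")
```

In `three_part` the `gene_id` is `None`, which stands in for the NULL value stored in BigQuery.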

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files whose names end in .isoform.quantification.txt and which contain hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
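To make the "split, then melt" step concrete, the sketch below flattens a small expression matrix into one row per (microRNA, aliquot) pair. It assumes the gene name and accession are joined by a final period; the output column names and the aliquot labels are hypothetical.

```python
def melt_mirna_matrix(matrix):
    """Flatten {row_label: {aliquot: value}} into tidy rows."""
    rows = []
    for label, values in matrix.items():
        # Split on the last "." only: the gene-name part of the label
        # is kept intact even though it contains hyphens.
        mirna_id, accession = label.rsplit(".", 1)
        for aliquot, value in values.items():
            rows.append({
                "mirna_id": mirna_id,
                "mirna_accession": accession,
                "AliquotBarcode": aliquot,
                "value": value,
            })
    return rows


matrix = {"hsa-mir-21.MIMAT0000076": {"aliquot_A": 10.0, "aliquot_B": 12.5}}
tidy_rows = melt_mirna_matrix(matrix)
```

Each cell of the matrix becomes its own row, which is the shape a BigQuery table load expects.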

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data files whose names contain _RPPA_Core.protein_expression and end in .txt.

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map was generated.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If "protein_name" is missing, generate one from "composite_element_ref"

• Make a map of 'composite_element_ref', 'gene_name', 'protein_name' values

• Check any other variant of the gene and protein symbols in the table

• HGNC Validation

• If the gene symbol is in the HGNC approved symbols: 'Approved' Gene_symbol = Gene_symbol

• If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol

• If not, check the Previous symbols. If found: Gene_symbol = "Approved" Gene_symbol

• If not: Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm
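The HGNC validation cascade above can be read as the lookup sketched below. The three lookup structures are hypothetical stand-ins for an HGNC download, the symbols in the usage lines are invented placeholders, and the alias branch follows the rule as written (the symbol is kept when it matches an alias).

```python
def validate_gene_symbol(symbol, approved, aliases, previous_to_approved):
    """Resolve a gene symbol following the four HGNC validation rules."""
    if symbol in approved:
        return symbol                        # already an approved symbol
    if symbol in aliases:
        return symbol                        # Gene_symbol = Alias_symbol
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # map to the approved symbol
    return symbol                            # unknown: left unchanged


approved = {"GENEA"}
aliases = {"GENEA_ALIAS"}
previous_to_approved = {"GENEA_OLD": "GENEA"}
resolved = [
    validate_gene_symbol(s, approved, aliases, previous_to_approved)
    for s in ("GENEA", "GENEA_ALIAS", "GENEA_OLD", "UNKNOWN1")
]
```

Symbols that remain unresolved are the ones a manual curation pass would then need to review.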

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
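The row-duplication rule in the first formatting step can be sketched as follows. The example assumes the concatenated genes are whitespace-separated, as in the 'AKT1 AKT2 AKT3' example; the `composite_element_ref` value and the function name are made up.

```python
def expand_gene_rows(row):
    """Return one copy of the row per gene listed in its gene_name value."""
    # Assumption: genes are separated by whitespace.
    return [dict(row, gene_name=gene) for gene in row["gene_name"].split()]


source_row = {"composite_element_ref": "example_antibody", "gene_name": "AKT1 AKT2 AKT3"}
expanded = expand_gene_rows(source_row)
```

All other columns are copied unchanged into each of the three resulting rows.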

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table, isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data such as cohort definitions is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and also to add their own examples to enrich this public resource.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web app and then programmatically access these cohorts using this API.
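The relationship between sample and patient identifiers can be shown with a short sketch. The function name is illustrative, and the length check simply reflects the 16-character convention described above.

```python
def participant_barcode(sample_barcode):
    """Return the 12-character patient prefix of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character TCGA sample barcode")
    return sample_barcode[:12]


patient = participant_barcode("TCGA-2G-AAM4-10A")
```

Two samples from the same patient therefore share the same 12-character prefix.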

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort" -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
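For Python users, an equivalent request could be assembled with the standard library, as in the hedged sketch below. The URL is reconstructed from the curl example and may no longer be live, the single-study filter value is only an illustration, and the request is built but not sent.

```python
import json
import urllib.request

# Assumption: endpoint URL reconstructed from the curl example.
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying JSON filters."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})
# To send it, one would call urllib.request.urlopen(request) and
# decode the JSON response body.
```

Building the request separately from sending it makes the payload easy to inspect before any network call is made.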

Access control: To call this method you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion string
age_at_initial_pathologic_diagnosis string
anatomic_neoplasm_subdivision string
avg_percent_lymphocyte_infiltration float
avg_percent_monocyte_infiltration float
avg_percent_necrosis float
avg_percent_neutrophil_infiltration float
avg_percent_normal_cells float
avg_percent_stromal_cells float
avg_percent_tumor_cells float
avg_percent_tumor_nuclei float
batch_number integer
bcr string
clinical_M string
clinical_N string
clinical_stage string
clinical_T string
colorectal_cancer string
country string
country_of_procurement string
days_to_birth integer
days_to_collection integer
days_to_death integer
days_to_initial_pathologic_diagnosis integer
days_to_last_followup integer
days_to_submitted_specimen_dx integer
Study string
ethnicity string
frozen_specimen_anatomic_site string
gender string
height integer
histological_type string
history_of_colon_polyps string
history_of_neoadjuvant_treatment string
history_of_prior_malignancy string
hpv_calls string
hpv_status string
icd_10 string
icd_o_3_histology string
icd_o_3_site string
lymph_node_examined_count integer
lymphatic_invasion string
lymphnodes_examined string
lymphovascular_invasion_present string
max_percent_lymphocyte_infiltration integer
max_percent_monocyte_infiltration integer
max_percent_necrosis integer
max_percent_neutrophil_infiltration integer
max_percent_normal_cells integer
max_percent_stromal_cells integer
max_percent_tumor_cells integer
max_percent_tumor_nuclei integer
menopause_status string
min_percent_lymphocyte_infiltration integer
min_percent_monocyte_infiltration integer
min_percent_necrosis integer
min_percent_neutrophil_infiltration integer
min_percent_normal_cells integer
min_percent_stromal_cells integer
min_percent_tumor_cells integer
min_percent_tumor_nuclei integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status string
mononucleotide_marker_panel_analysis_status string
neoplasm_histologic_grade string
new_tumor_event_after_initial_treatment string
number_of_lymphnodes_examined integer
number_of_lymphnodes_positive_by_he integer
ParticipantBarcode string
pathologic_M string
pathologic_N string
pathologic_stage string
pathologic_T string
person_neoplasm_cancer_status string
pregnancies string
preservation_method string
primary_neoplasm_melanoma_dx string
primary_therapy_outcome_success string
prior_dx string
Project string
psa_value float
race string
residual_tumor string
SampleBarcode string
tobacco_smoking_history string
total_number_of_pregnancies integer
tumor_tissue_site string
tumor_pathology string
tumor_type string
weiss_venous_invasion string
vital_status string
weight integer
year_of_initial_pathologic_diagnosis string
SampleTypeCode string
has_Illumina_DNASeq string
has_BCGSC_HiSeq_RNASeq string
has_UNC_HiSeq_RNASeq string
has_BCGSC_GA_RNASeq string
has_UNC_GA_RNASeq string
has_HiSeq_miRnaSeq string
has_GA_miRNASeq string
has_RPPA string
has_SNP6 string
has_27k string
has_450k string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | 
Maximum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmax_percent_monocyte_infiltration integer Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen


Table 1.1 (continued)
Parameter name | Value | Description
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg...
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance...
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American ...
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample (tumor or normal tissue, cell, or blood sample) provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"


Table 1.1 (continued)
Parameter name | Value | Description
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
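
Because the counts in this response are typed as strings, client code will usually want to convert them before doing arithmetic. The following is a minimal sketch of that step, not part of the ISB-CGC API itself: the helper name summarize_cohort and the example response body are hypothetical, and the field names are taken from the property table above.

```python
from typing import Any, Dict


def summarize_cohort(response: Dict[str, Any]) -> Dict[str, Any]:
    """Convert the string counts to integers and compare them with the lists."""
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    patient_count = int(response.get("patient_count", "0"))
    sample_count = int(response.get("sample_count", "0"))
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        # True when each reported count matches the length of its list.
        "counts_consistent": (
            patient_count == len(patients) and sample_count == len(samples)
        ),
    }


# Hypothetical response body, shaped like the structure documented above.
example = {
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
}
print(summarize_cohort(example))
```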

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Required. Barcode of the patient to get information about.
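
As an illustration, the request could be issued from Python roughly as follows. This is a sketch under stated assumptions rather than official client code: the base URL is an assumption based on the HTTP request line above, the barcode is assumed to be passed as a query-string parameter, and the helper names patient_details_url and fetch_patient_details are hypothetical. The service may not be reachable from every environment, so the network call is kept in its own function.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, based on the HTTP request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode: str) -> str:
    """Build the request URL for a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long")
    query = urlencode({"patient_barcode": patient_barcode})
    return f"{BASE_URL}/patient_details?{query}"


def fetch_patient_details(patient_barcode: str) -> dict:
    """Send the GET request and decode the JSON response body."""
    with urlopen(patient_details_url(patient_barcode)) as response:
        return json.load(response)


print(patient_details_url("TCGA-B9-7268"))
```

Calling fetch_patient_details("TCGA-B9-7268") would then return the response body as a Python dictionary, provided the endpoint is available.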

Response

If successful, this method returns a response body with the following structure:


{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg...
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance...
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American ...


Table 1.2 (continued)
Property name | Value | Description
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
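
Every clinical_data value in this response is typed as a string, including fields that are numeric in meaning. A client might therefore convert selected fields after decoding the response. The sketch below is one possible way to do that; the helper names and the example values are hypothetical, and the handling of non-numeric values (returning None) is a design assumption rather than documented API behavior.

```python
from typing import Any, Dict, Iterable, Optional


def to_int(value: Any) -> Optional[int]:
    """Return int(value), or None when the value is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def numeric_clinical_fields(
    response: Dict[str, Any],
    fields: Iterable[str] = (
        "days_to_birth",
        "days_to_last_followup",
        "number_of_lymphnodes_examined",
    ),
) -> Dict[str, Optional[int]]:
    """Pull the named fields out of clinical_data and convert them to integers."""
    clinical = response.get("clinical_data", {})
    return {name: to_int(clinical.get(name)) for name in fields}


# Hypothetical values, used only to show the conversion.
example = {
    "clinical_data": {
        "days_to_birth": "-21000",
        "days_to_last_followup": "NA",
    }
}
print(numeric_clinical_fields(example))
```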


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.
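
The two example barcodes in this documentation (TCGA-B9-7268 and TCGA-B9-7268-01A) suggest that a sample barcode begins with the 12-character barcode of its participant. The short sketch below relies on that observation; the function name is hypothetical, and the prefix relationship is an assumption drawn from the examples rather than a rule stated by the API.

```python
def participant_from_sample(sample_barcode: str) -> str:
    """Return the first 12 characters of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("sample barcodes are 16 characters long")
    return sample_barcode[:12]


print(participant_from_sample("TCGA-B9-7268-01A"))  # prints TCGA-B9-7268
```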

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample


Table 1.3 (continued)
Property name | Value | Description
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"


Table 1.3 (continued)
Property name | Value | Description
data_details_count | string | Length of data_details list
patient | string | Participant barcode
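
Since data_details is a list with one entry per data file, a common next step is to collect the storage paths and group them in some useful way. The sketch below groups paths by the Platform field and skips entries whose CloudStoragePath is empty or absent. The helper name, the "unknown" fallback label, and the gs:// paths in the example are hypothetical and are not real ISB-CGC locations.

```python
from collections import defaultdict
from typing import Any, Dict, List


def paths_by_platform(response: Dict[str, Any]) -> Dict[str, List[str]]:
    """Group CloudStoragePath values by Platform, skipping entries with no path."""
    grouped = defaultdict(list)
    for item in response.get("data_details", []):
        path = item.get("CloudStoragePath")
        if path:
            grouped[item.get("Platform", "unknown")].append(path)
    return dict(grouped)


# Hypothetical response fragment; the bucket and file names are placeholders.
example = {
    "data_details": [
        {
            "Platform": "IlluminaHiSeq_miRNASeq",
            "CloudStoragePath": "gs://example-bucket/file1.txt",
        },
        {"Platform": "IlluminaHiSeq_miRNASeq", "CloudStoragePath": ""},
    ],
    "data_details_count": "2",
}
print(paths_by_platform(example))
```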


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.
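
Because three of the four parameters are optional, a client only needs to send the ones that are set. The following sketch shows one way to assemble such a request URL. The base URL is an assumption based on the HTTP request line above, the parameters are assumed to travel in the query string, and the function name is hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

# Assumed base URL, based on the HTTP request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(
    sample_barcode: str,
    platform: Optional[str] = None,
    pipeline: Optional[str] = None,
    token: Optional[str] = None,
) -> str:
    """Build the request URL, leaving out optional parameters that are not set."""
    params = {
        "sample_barcode": sample_barcode,
        "platform": platform,
        "pipeline": pipeline,
        "token": token,
    }
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return f"{BASE_URL}/datafilenamekey_list_from_sample?{query}"


print(datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```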

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
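
Two details of this response matter to client code: the datafilenamekeys field can be missing entirely, and some entries can carry the "file-path-not-yet-available" marker instead of a usable path. The sketch below handles both cases. The function name and the example paths are hypothetical, and matching the marker as a substring is an assumption about how it appears in the returned entries.

```python
from typing import Any, Dict, List

PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response: Dict[str, Any]) -> List[str]:
    """Return only usable paths; an absent field is treated as an empty list."""
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical response; the bucket names are placeholders.
example = {
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/path/to/file.txt",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}
print(available_paths(example))
```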



google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name                 Value                     Description
kind                          cohort_api#cohortsItem    The resource type.
count                         string                    The number of items returned. Count will be either “0” or “1”.
items[]                       list                      If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode         string                    The sample barcode passed into the request.
items[].GG_dataset_id         string                    The dataset id of the sample.
items[].GG_readgroupset_id    string                    The readgroupset id of the sample.
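
The request and response described above could be handled along the following lines. This is only a sketch: the function names are hypothetical, the endpoint is reconstructed from the HTTP request shown above, and passing sample_barcode as a query-string parameter is an assumption. No network call is made here; the URL can be fetched with any HTTP client.

```python
from urllib.parse import urlencode

# Endpoint reconstructed from the HTTP request shown above.
ENDPOINT = ("https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
            "google_genomics_from_sample")


def build_request_url(sample_barcode):
    """Build the GET URL for one sample barcode (query string is assumed)."""
    return ENDPOINT + "?" + urlencode({"sample_barcode": sample_barcode})


def genomics_ids(response):
    """Return (dataset id, readgroupset id), or None when items is absent or empty."""
    items = response.get("items", [])
    if not items:
        return None
    return items[0]["GG_dataset_id"], items[0]["GG_readgroupset_id"]


# Made-up barcode, for illustration only.
print(build_request_url("TCGA-01-0001-01A"))
```

Because count is either “0” or “1”, checking whether items is empty is enough to tell whether the sample has Google Genomics identifiers.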

1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components” and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
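
For scripted setups, the gcloud config set calls for these properties can be generated rather than typed by hand. The Python sketch below only builds and prints the commands; the helper name and the project, region, and zone values are hypothetical placeholders. Note that the compute and container properties are written with a slash, for example compute/zone.

```python
import shlex


def config_set_commands(properties):
    """Return one `gcloud config set` argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in properties.items()]


# Hypothetical values; substitute your own project and preferred zone.
settings = {
    "project": "my-project-id",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}
for argv in config_set_commands(settings):
    # Printing keeps this a dry run; pass argv to subprocess.run() to execute.
    print(shlex.join(argv))
```

Building argument lists instead of shell strings avoids quoting problems if a value ever contains spaces.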

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of each instance, you can set the compute/zone property: gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM: gcloud compute instances create my-instance --machine-type g1-small
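
If you create VMs from a script, the same command can be assembled programmatically. The following Python sketch builds the argument list for gcloud compute instances create with the --machine-type and --zone options; the helper name is hypothetical, and the command is printed rather than executed.

```python
import shlex


def create_instance_command(name, machine_type=None, zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    argv = ["gcloud", "compute", "instances", "create", name]
    if machine_type:
        argv += ["--machine-type", machine_type]
    if zone:
        # When omitted, gcloud uses the compute/zone property if it is set.
        argv += ["--zone", zone]
    return argv


# Dry run: print the command instead of running it.
print(shlex.join(create_instance_command("my-instance", "g1-small")))
```

To actually launch the VM, the returned list could be passed to subprocess.run(argv, check=True) on a machine where the Cloud SDK is installed and authenticated.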

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
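
The two prices quoted above imply a flat rate of $0.04 per GB per month for standard persistent disks (with 1 TB counted as 1000 GB). As a rough sketch, the cost of keeping a disk attached to a stopped VM can be estimated as follows; the rate is derived only from the figures in this section and may not match current pricing.

```python
# Rate implied by the figures above: $2 per month for 50 GB.
# This is an assumption derived from this document, not a current price list.
STANDARD_PD_USD_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb):
    """Estimate the monthly cost in USD of a standard persistent disk."""
    return size_gb * STANDARD_PD_USD_PER_GB_MONTH


for size_gb in (50, 500, 1000):
    print(size_gb, "GB:", round(monthly_disk_cost(size_gb), 2), "USD per month")
```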

If you know that you will never need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. (The exact device path depends on the image; the symlink is often named google-<device-name>, so check with ls /dev/disk/by-id/ before formatting.)

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool.

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes”, the default, when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.
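
The gcloud side of the disk lifecycle described in this section (create, attach, detach, delete) can be summarized as an ordered list of commands. The Python sketch below only assembles and prints them; the helper name is hypothetical. It assumes the instance name is the positional argument of attach-disk and detach-disk, and that compute/zone is already configured so that no --zone flag is needed. Formatting, mounting, and unmounting still happen on the instance itself.

```python
import shlex


def disk_lifecycle_commands(disk, instance, size="500GB"):
    """Return the gcloud commands for a disk's lifecycle, in order."""
    gcloud = ["gcloud", "compute"]
    return [
        gcloud + ["disks", "create", disk, "--size", size],
        gcloud + ["instances", "attach-disk", instance, "--disk", disk],
        # Format, mount, use, and unmount the disk on the instance here.
        gcloud + ["instances", "detach-disk", instance, "--disk", disk],
        gcloud + ["disks", "delete", disk],
    ]


# Dry run: print the commands in the order they would be issued.
for argv in disk_lifecycle_commands("disk-1", "my-instance"):
    print(shlex.join(argv))
```

Keeping detach before delete mirrors the rule above that a disk can be deleted only when no VM instance is using it.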

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
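
The three set operations behave like ordinary set algebra. The Python sketch below illustrates them, under the assumption that a cohort can be modelled as a set of sample barcodes; the function names and the short sample identifiers are hypothetical, and this is not the web application's actual implementation.

```python
def intersect(*cohorts):
    """Samples present in every cohort; the order of cohorts does not matter."""
    return set.intersection(*map(set, cohorts))


def union(*cohorts):
    """Samples present in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


# Made-up sample identifiers, for illustration only.
a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
print(sorted(intersect(a, b)))
print(sorted(union(a, b)))
print(sorted(complement(a, b)))
print(sorted(complement(b, a)))
```

The last two lines show why the complement needs a designated base cohort: swapping the base and the subtracted cohort gives a different result, whereas intersect and union are symmetric.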

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
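
One way to read the counting rule in the last sentence is sketched below in Python: the counts shown for one attribute are computed over the samples that pass the filters selected on every other attribute. This is an interpretation of the text rather than the web application's actual code, and the function name, attribute keys, and sample records are all hypothetical.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count the values of `attribute` among samples passing all other filters.

    `samples` is a list of dicts and `selected` maps an attribute name to the
    set of values chosen for it. Filters on `attribute` itself are ignored.
    """
    others = {k: v for k, v in selected.items() if k != attribute}
    matching = [s for s in samples
                if all(s.get(k) in allowed for k, allowed in others.items())]
    return Counter(s.get(attribute) for s in matching)


# Made-up records, for illustration only.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "MALE"},
    {"vital_status": "Alive", "gender": "MALE"},
]
selected = {"gender": {"MALE"}}
print(dict(facet_counts(samples, selected, "vital_status")))
print(dict(facet_counts(samples, selected, "gender")))
```

Ignoring an attribute's own filter when counting its values keeps the unselected options visible with meaningful numbers, which is a common design for faceted filtering.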

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. By hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These vary because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
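
The difference between the two counts can be illustrated with TCGA barcodes, in which the first three hyphen-separated fields of a sample barcode identify the participant. The Python sketch below relies on that convention; the helper name is hypothetical and the barcodes are made up for illustration.

```python
def participant_barcode(sample_barcode):
    """Return the participant portion of a TCGA-style sample barcode.

    Keeps the first three hyphen-separated fields,
    for example TCGA-02-0001 from TCGA-02-0001-01C.
    """
    return "-".join(sample_barcode.split("-")[:3])


# Made-up barcodes: the first two samples come from the same participant.
cohort_samples = ["TCGA-02-0001-01C", "TCGA-02-0001-10A", "TCGA-02-0002-01A"]
participants = {participant_barcode(s) for s in cohort_samples}

print("samples:", len(cohort_samples))
print("participants:", len(participants))
```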

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
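
Once downloaded, the CSV file list can be processed with a short script. The Python sketch below groups file paths by platform. The column header names and the example rows are assumptions made for illustration, so check them against the header line of your own download.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text):
    """Group cloud storage file paths by platform from the file-list CSV text."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Header names are assumed; adjust them to match the real file.
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


# Made-up rows, for illustration only.
example_csv = (
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "TCGA-02-0001-01C,PlatformA,pipeline1,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-02-0002-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/b.txt\n"
    "TCGA-02-0002-01A,PlatformB,pipeline2,Level 3,gs://example-bucket/c.txt\n"
)
for platform, paths in paths_by_platform(example_csv).items():
    print(platform, len(paths))
```

To read a real download, pass the contents of the file, for example open("file_list.csv").read(), where the file name is whatever you saved it as.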

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard

Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active, and you can then proceed to delete them.

From within a cohort

If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard

Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active, and you can then proceed to delete them.

From Visualization

If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  - Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Platform Filter: filters down the plot-able features by platform.

  - Center Filter: filters down the plot-able features by processing center.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  - miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  - Platform Filter: filters down the plot-able features by platform.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  - Platform Filter: filters down the plot-able features by platform.

  - Gene Region Filter: filters down the plot-able features by specific gene regions.

  - CpG Island Region Filter: filters down the plot-able features by CpG island region.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by mutation value.

  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, you select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  - You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  - Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  - Cohort file list download is not working.

• December 3, 2015: v0.1

  - First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Somatic Mutations

The Somatic Mutations table in BigQuery contains somatic mutation calls collected from the open-access MAF files from 30 tumor types.

For each MAF file, some simple data-cleaning was performed. The file was then annotated using Oncotator, and then further processed to remove duplicates before being merged into a single table.

Data-Cleaning

• Remove any lines where the build is not 37.

• Remove any lines where the chr is not in [1-22, X, Y].

• Remove any lines where the Mutation_Status is not Somatic.

• Remove any lines where the Sequencer is not an Illumina platform.

• Change the column labels to match what Oncotator expects (e.g. ncbi_build becomes build, chromosome becomes chr, etc.).
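The cleaning rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the ISB-CGC pipeline code: the input column names (ncbi_build, chromosome, Mutation_Status, Sequencer) are taken from the wording of the list, the test for "an Illumina platform" is assumed to be a case-insensitive match on the word "Illumina", and only the two renames named in the text are applied.

```python
import csv
import io

# Chromosome labels kept by the cleaning step: 1-22, X and Y.
VALID_CHROMOSOMES = {str(n) for n in range(1, 23)} | {"X", "Y"}

# Only the two renames named in the text; the full list is not given there.
RENAMES = {"ncbi_build": "build", "chromosome": "chr"}


def clean_maf_rows(rows):
    """Yield the rows that pass every filter, with columns renamed."""
    for row in rows:
        keep = (
            row["ncbi_build"] == "37"
            and row["chromosome"] in VALID_CHROMOSOMES
            and row["Mutation_Status"] == "Somatic"
            # Assumption: Illumina platforms carry "Illumina" in their name.
            and "illumina" in row["Sequencer"].lower()
        )
        if keep:
            yield {RENAMES.get(name, name): value for name, value in row.items()}


# A tiny made-up MAF fragment; only the first data line survives cleaning.
SAMPLE_MAF = (
    "ncbi_build\tchromosome\tMutation_Status\tSequencer\n"
    "37\t17\tSomatic\tIllumina HiSeq\n"
    "36\t17\tSomatic\tIllumina HiSeq\n"
    "37\tMT\tSomatic\tIllumina GAIIx\n"
    "37\tX\tGermline\tIllumina HiSeq\n"
    "37\t1\tSomatic\tSOLID\n"
)
reader = csv.DictReader(io.StringIO(SAMPLE_MAF), delimiter="\t")
cleaned = list(clean_maf_rows(reader))
print(cleaned)
```

Writing the filter as a generator keeps memory use flat for large MAF files, since rows are processed one at a time.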

Oncotator Annotation: Each file was then annotated using Oncotator version 1.5.1 with the Jan2015 database and the options --input_format=MAFLITE --output_format=TCGAMAF.

The outputs of Oncotator were lightly processed to change the column labels and to remove certain special characters from strings.

Duplicate Removal: Many tumor types have several "current" MAF files, and deciding which one is the "best" is a non-trivial process. In addition, some tumor samples may have had mutations called relative to a tissue normal and also relative to a blood normal. It is therefore possible that the same mutation has been called multiple times. To avoid over-counting mutations, we removed these duplicate calls from the result of concatenating all of the annotated MAF files, using the following rules:

• if a mutation in the same position is called in a particular tumor sample with respect to multiple matched normals, we prefer the "blood derived normal" over the "solid tissue normal"

• if a mutation in the same position is called in multiple aliquots for one tumor sample, we prefer the "D" analyte over the "W" analyte (e.g. TCGA-B0-5695-01A-11D-1534-10 over TCGA-B0-5695-01A-11W-1584-10)

• if both aliquots are "D" (or both are "W") analytes, then we choose based on the data-generating center (the final two characters in the aliquot barcode), in the following order of preference:

  - 01, 08, or 14 (all of which refer to broad.mit.edu)

  - 09, 21, or 30 (all of which refer to genome.wustl.edu)

  - 10 or 12 (both of which refer to hgsc.bcm.edu)

  - 13 or 31 (both of which refer to bcgsc.ca)

  - 18 or 25 (both of which refer to ucsc.edu)

• finally, in the event that a mutation in the same position was called by the same center, with the same type of matched normal and the same type of analyte, we choose the aliquot with the larger value in the final 4-digit sequence in the barcode (positions 21:25)

In addition, any exact duplicates (i.e. all fields describing a mutation are the same) in the merged file are removed, and the final result is uploaded into BigQuery. The result is a single table containing over 5.8 million mutations called on 8435 tumor samples from 8373 patients.
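As an illustration of how the duplicate-removal rules combine, the following Python sketch ranks competing calls for the same mutation and keeps the preferred one. It is a sketch under assumptions rather than the actual ETL code: the record fields (tumor_aliquot, normal_type, chr, start) are hypothetical names, the tumor sample is assumed to be identified by the first 16 characters of the aliquot barcode, and unknown centers or normal types are assumed to rank last.

```python
# Each list is ordered from most to least preferred.
NORMAL_GROUPS = [{"blood derived normal"}, {"solid tissue normal"}]
ANALYTE_GROUPS = [{"D"}, {"W"}]
CENTER_GROUPS = [
    {"01", "08", "14"},  # broad.mit.edu
    {"09", "21", "30"},  # genome.wustl.edu
    {"10", "12"},        # hgsc.bcm.edu
    {"13", "31"},        # bcgsc.ca
    {"18", "25"},        # ucsc.edu
]


def _rank(value, groups):
    """Index of the group containing value; unknown values rank last."""
    for position, group in enumerate(groups):
        if value in group:
            return position
    return len(groups)


def score(call):
    """Return a tuple where a larger value means a more preferred call."""
    barcode = call["tumor_aliquot"]
    return (
        -_rank(call["normal_type"].lower(), NORMAL_GROUPS),
        -_rank(barcode[19], ANALYTE_GROUPS),      # analyte letter, D or W
        -_rank(barcode[26:28], CENTER_GROUPS),    # final two characters
        barcode[21:25],                           # 4-character plate value
    )


def remove_duplicates(calls):
    """Keep one call per (tumor sample, chromosome, position)."""
    best = {}
    for call in calls:
        # Assumption: the first 16 characters identify the tumor sample.
        key = (call["tumor_aliquot"][:16], call["chr"], call["start"])
        if key not in best or score(call) > score(best[key]):
            best[key] = call
    return list(best.values())


calls = [
    {"tumor_aliquot": "TCGA-B0-5695-01A-11W-1584-10",
     "normal_type": "Blood Derived Normal", "chr": "3", "start": 100},
    {"tumor_aliquot": "TCGA-B0-5695-01A-11D-1534-10",
     "normal_type": "Blood Derived Normal", "chr": "3", "start": 100},
    {"tumor_aliquot": "TCGA-B0-5695-01A-11D-1534-10",
     "normal_type": "Solid Tissue Normal", "chr": "3", "start": 100},
    {"tumor_aliquot": "TCGA-B0-5695-01A-11D-1534-10",
     "normal_type": "Blood Derived Normal", "chr": "7", "start": 55},
]
kept = remove_duplicates(calls)
```

Comparing tuples applies the rules in priority order: a later rule only matters when all earlier ones tie. The plate value is compared as a string, which orders four-digit values the same way as a numeric comparison. Exact duplicates collapse as well, because a second identical call never scores higher than the first.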

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of 28 characters, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
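To make the structure of an aliquot barcode concrete, here is a small Python sketch that splits the example barcode into its dash-separated fields. The field names follow the commonly documented TCGA barcode layout (project, tissue source site, participant, sample, vial, portion, analyte, plate, center); the function and type names are hypothetical and not part of any ISB-CGC API.

```python
from collections import namedtuple

AliquotBarcode = namedtuple(
    "AliquotBarcode",
    "project tss participant sample vial portion analyte plate center",
)


def parse_aliquot_barcode(barcode):
    """Split an aliquot-level TCGA barcode into its named fields."""
    fields = barcode.split("-")
    if len(fields) != 7:
        raise ValueError("expected 7 dash-separated fields: %r" % barcode)
    project, tss, participant, sample_vial, portion_analyte, plate, center = fields
    return AliquotBarcode(
        project=project,
        tss=tss,
        participant=participant,
        sample=sample_vial[:2],        # two-digit sample type code
        vial=sample_vial[2:],          # vial letter
        portion=portion_analyte[:2],   # two-digit portion number
        analyte=portion_analyte[2:],   # analyte letter, e.g. D or W
        plate=plate,
        center=center,
    )


parsed = parse_aliquot_barcode("TCGA-04-1517-01A-01D-0533-01")
print(parsed)
```

Shorter barcodes (for example participant-level or sample-level ones) have fewer fields and are rejected by this sketch rather than guessed at.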

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential scientific notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.
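The handling of scientific notation described in the ETL details can be illustrated with a short Python sketch. The helper names are hypothetical; the only behavior taken from the text is that num_probes is read as an integer and segment_mean as a floating-point value, even when the raw text uses exponent notation. Going through Decimal for the integer case is a choice made here so that large values convert exactly.

```python
from decimal import Decimal


def parse_num_probes(text):
    """Read a probe count as an int, accepting forms such as '8.7E+07'."""
    return int(Decimal(text.strip()))


def parse_segment_mean(text):
    """Read a segment mean as a float, with or without an exponent."""
    return float(text.strip())


examples = {
    "num_probes": [parse_num_probes("120"), parse_num_probes("8.7E+07")],
    "segment_mean": [parse_segment_mean("0.0123"), parse_segment_mean("-1.2E-01")],
}
print(examples)
```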

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of 28 characters, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data.

ETL Details: The BigQuery table is populated only with the files matching the pattern HumanMethylation*.txt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
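The merge logic described above (450k takes precedence for samples run on both platforms, and probes without a beta value are dropped) can be sketched as follows. This is an illustration only: the input is assumed to be lists of dictionaries with hypothetical "sample", "Probe_Id" and "Beta_Value" keys, and precedence is assumed to be decided per sample rather than per probe.

```python
MISSING = {"NA", "NULL", "", None}


def merge_methylation(rows_27k, rows_450k):
    """Merge both platforms into one list, preferring 450k per sample."""
    samples_450k = {row["sample"] for row in rows_450k}
    merged = []
    sources = (
        ("HumanMethylation450", rows_450k),
        ("HumanMethylation27", rows_27k),
    )
    for platform, rows in sources:
        for row in rows:
            # Skip 27k data for any sample that also has 450k data.
            if platform == "HumanMethylation27" and row["sample"] in samples_450k:
                continue
            # Leave out probes whose beta value is NA or NULL.
            if row["Beta_Value"] in MISSING:
                continue
            merged.append({
                "sample": row["sample"],
                "Probe_Id": row["Probe_Id"],
                "Beta_Value": float(row["Beta_Value"]),
                "platform": platform,
            })
    return merged


# Made-up sample and probe identifiers, for illustration only.
rows_27k = [
    {"sample": "S1", "Probe_Id": "cg01", "Beta_Value": "0.25"},
    {"sample": "S2", "Probe_Id": "cg01", "Beta_Value": "NA"},
    {"sample": "S2", "Probe_Id": "cg02", "Beta_Value": "0.5"},
]
rows_450k = [{"sample": "S1", "Probe_Id": "cg01", "Beta_Value": "0.75"}]
merged = merge_methylation(rows_27k, rows_450k)
```

Keeping the platform on every output row mirrors the platform column in the table, so a reader can always tell which array a value came from.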

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
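The column split for the UNC files can be sketched in Python as below. The output keys follow the column names given in the text; the lookup table passed in as hgnc_by_gene_id is a hypothetical stand-in for whatever HGNC resource the pipeline actually uses, and storing gene_id as an integer is an assumption made for this sketch.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id):
    """Split 'SYMBOL|ID' and look up the current HGNC symbol for the ID."""
    symbol, gene_id_text = raw_gene_id.split("|")
    gene_id = int(gene_id_text)  # assumption: the second part is numeric
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        # None when the hypothetical lookup table has no entry for this ID.
        "HGNC_gene_symbol": hgnc_by_gene_id.get(gene_id),
    }


# Hypothetical one-entry lookup table, for illustration only.
lookup = {7157: "TP53"}
row = split_unc_gene_id("TP53|7157", lookup)
print(row)
```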

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302|?|3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id and, if there is a third component, a gene_addenda column. If one component is simply ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
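A minimal Python sketch of the two-or-three-part split is shown below. It assumes that the placeholder for a missing component is a lone question mark (an empty component is treated the same way), and it leaves any "_calculated" suffix untouched because the text does not say how that suffix is handled. The function name is hypothetical.

```python
def split_bcgsc_gene(gene):
    """Split a BCGSC gene string into symbol, gene_id and optional addenda."""
    # Assumption: '?' (or an empty string) marks a missing component.
    parts = [None if part in ("?", "") else part for part in gene.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three parts: %r" % gene)
    if len(parts) == 2:
        parts.append(None)  # no third component, so no gene_addenda
    symbol, gene_id, addenda = parts
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        "gene_addenda": addenda,
    }


two_part = split_bcgsc_gene("TP53|7157_calculated")
three_part = split_bcgsc_gene("Mir_1302|?|3of7_calculated")
```

Python's None stands in for the NULL value that the BigQuery table would hold.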

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
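The "melt" step can be illustrated with plain Python: each cell of the matrix becomes one output row. This sketch assumes a matrix whose first column holds the combined identifier and whose remaining columns are aliquots, and it assumes the gene name and accession are joined by a period; the output field names are hypothetical.

```python
def melt_expression_matrix(header, rows):
    """Turn a wide miRNA-by-aliquot matrix into one record per cell."""
    aliquots = header[1:]
    for row in rows:
        # Assumption: 'name.accession', split at the last period.
        name, accession = row[0].rsplit(".", 1)
        for aliquot, value in zip(aliquots, row[1:]):
            yield {
                "mirna_id": name,
                "mirna_accession": accession,
                "aliquot": aliquot,
                "value": float(value),
            }


# Made-up aliquot labels and values, for illustration only.
header = ["mirna", "aliquot_A", "aliquot_B"]
matrix = [["hsa-mir-21.MIMAT0000076", "10.5", "0"]]
tidy = list(melt_expression_matrix(header, matrix))
```

A matrix with M microRNAs and N aliquots yields M x N records, which is the flat shape a BigQuery table load expects.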

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Core.protein_expression.txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map is generated.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns.

• If "protein_name" is missing, generate one from "composite_element_ref".

• Make a map of 'composite_element_ref', 'gene_name', and 'protein_name' values.

• Check for any other variant of the gene and protein symbols in the table.

• HGNC validation:

• If the gene symbol is in the HGNC 'Approved' symbols, Gene_symbol = Gene_symbol.

• If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.

• If not, check the Previous symbols. If found, Gene_symbol = "Approved" Gene_symbol.

• If not, Gene_symbol = Gene_symbol.

• The file generated is manually curated and fed back into the algorithm.
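The HGNC validation cascade above can be read as a simple chain of lookups. The following sketch follows the steps as written; the function name and the small lookup tables are illustrative assumptions, and the returned status label is an addition meant to show how unresolved symbols could be flagged for the manual curation step.

```python
def validate_gene_symbol(symbol, approved, aliases, previous):
    """Return (gene_symbol, status) following the validation steps.

    approved: set of HGNC approved symbols
    aliases:  set of known alias symbols
    previous: dict mapping a previous symbol to its approved symbol
    """
    if symbol in approved:
        return symbol, "approved"
    if symbol in aliases:
        # As stated in the steps: Gene_symbol = Alias_symbol.
        return symbol, "alias"
    if symbol in previous:
        return previous[symbol], "previous"
    # Not found anywhere: keep the symbol and leave it for manual curation.
    return symbol, "unresolved"


# Illustrative lookup tables only; the real ones come from HGNC.
approved = {"TP53", "ERBB2", "KMT2A"}
aliases = {"HER2"}
previous = {"MLL": "KMT2A"}

results = [validate_gene_symbol(s, approved, aliases, previous)
           for s in ("TP53", "HER2", "MLL", "NOTAGENE")]
```

Rows that come back as "unresolved" are the natural candidates for the hand-curated file that is fed back into the algorithm.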

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row.

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.

• Data from both Illumina GA and HiSeq platforms are stored in the same table.
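The row-duplication rule from the first formatting step might look like this in Python. It is a sketch under stated assumptions: the function name is hypothetical, the other field in the example row is invented, and the delimiter between concatenated genes is assumed to be whitespace or commas.

```python
import re


def expand_gene_rows(row):
    """Return one copy of `row` per gene listed in row['gene_name']."""
    # Assumption: concatenated genes are separated by whitespace and/or commas.
    genes = [g for g in re.split(r"[\s,]+", row["gene_name"]) if g]
    if not genes:
        return [dict(row)]  # nothing to split; keep the row unchanged
    return [{**row, "gene_name": gene} for gene in genes]


# "composite_element_ref" value below is a made-up placeholder.
row = {"composite_element_ref": "EXAMPLE-ANTIBODY", "gene_name": "AKT1 AKT2 AKT3"}
expanded = expand_gene_rows(row)
```

All other fields are copied unchanged, so each resulting row can be joined to other tables by a single gene symbol.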

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

To facilitate working with the TCGA data tables that the ISB-CGC hosts in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section provides links to existing sources of information elsewhere on the web, or describes additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE gene set mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.
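To make the coordinate-based use case concrete, the sketch below shows the interval-overlap test that such a combined query relies on: given a gene's coordinates, keep the copy-number segments that overlap it. The record field names, the gene, and all coordinates are hypothetical placeholders and do not reflect the actual GENCODE_r19 or copy-number table schemas.

```python
def segments_overlapping_gene(gene, segments):
    """Return the segments on the gene's chromosome that overlap its span."""
    return [
        seg for seg in segments
        if seg["chromosome"] == gene["chromosome"]
        and seg["start"] <= gene["end"]      # segment begins before the gene ends
        and seg["end"] >= gene["start"]      # segment ends after the gene begins
    ]


# Made-up coordinates, for illustration only.
gene = {"name": "GENE_A", "chromosome": "chr1", "start": 1000, "end": 2000}
segments = [
    {"chromosome": "chr1", "start": 500, "end": 1200, "segment_mean": 0.8},
    {"chromosome": "chr1", "start": 2500, "end": 3000, "segment_mean": -0.1},
    {"chromosome": "chr2", "start": 900, "end": 2100, "segment_mean": 0.3},
]
hits = segments_overlapping_gene(gene, segments)
```

In BigQuery the same condition would typically appear as the join or filter predicate between the gene annotation table and the segment table.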

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table, isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes an Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of the new isb-cgc:genome_reference dataset with the miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to the isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data, in BigQuery tables and in Google Cloud Storage, is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first is through the ISB-CGC web application, which provides a convenient web-based interface for creating and visualizing collections of data hosted by the ISB-CGC.

The second is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and also to add your own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python repository (https://github.com/isb-cgc/examples-Python/python).

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets, such as CCLE, may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
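Because a patient barcode is simply the 12-character prefix of a 16-character sample barcode, the patient list of a cohort can be derived from its sample list. The sketch below shows this relationship; the function names are hypothetical and the sample barcodes are illustrative values, not entries verified against the TCGA data.

```python
def participant_barcode(sample_barcode):
    """Return the 12-character patient barcode for a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError(f"expected a 16-character sample barcode: {sample_barcode!r}")
    return sample_barcode[:12]


def group_samples_by_participant(sample_barcodes):
    """Map each patient barcode to the list of its sample barcodes."""
    grouped = {}
    for sample in sample_barcodes:
        grouped.setdefault(participant_barcode(sample), []).append(sample)
    return grouped


# Illustrative barcodes only.
samples = ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A", "TCGA-A1-0001-01A"]
by_patient = group_samples_by_participant(samples)
```

This only applies to TCGA-style barcodes; as noted above, other datasets may follow different naming conventions.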

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

  adenocarcinoma_invasion: string
  age_at_initial_pathologic_diagnosis: string
  anatomic_neoplasm_subdivision: string
  avg_percent_lymphocyte_infiltration: float
  avg_percent_monocyte_infiltration: float
  avg_percent_necrosis: float
  avg_percent_neutrophil_infiltration: float
  avg_percent_normal_cells: float
  avg_percent_stromal_cells: float
  avg_percent_tumor_cells: float
  avg_percent_tumor_nuclei: float
  batch_number: integer
  bcr: string
  clinical_M: string
  clinical_N: string
  clinical_stage: string
  clinical_T: string
  colorectal_cancer: string
  country: string
  country_of_procurement: string
  days_to_birth: integer
  days_to_collection: integer
  days_to_death: integer
  days_to_initial_pathologic_diagnosis: integer
  days_to_last_followup: integer
  days_to_submitted_specimen_dx: integer
  Study: string
  ethnicity: string
  frozen_specimen_anatomic_site: string
  gender: string
  height: integer
  histological_type: string
  history_of_colon_polyps: string
  history_of_neoadjuvant_treatment: string
  history_of_prior_malignancy: string
  hpv_calls: string
  hpv_status: string
  icd_10: string
  icd_o_3_histology: string
  icd_o_3_site: string
  lymph_node_examined_count: integer
  lymphatic_invasion: string
  lymphnodes_examined: string
  lymphovascular_invasion_present: string
  max_percent_lymphocyte_infiltration: integer
  max_percent_monocyte_infiltration: integer
  max_percent_necrosis: integer
  max_percent_neutrophil_infiltration: integer
  max_percent_normal_cells: integer
  max_percent_stromal_cells: integer
  max_percent_tumor_cells: integer
  max_percent_tumor_nuclei: integer
  menopause_status: string
  min_percent_lymphocyte_infiltration: integer
  min_percent_monocyte_infiltration: integer
  min_percent_necrosis: integer
  min_percent_neutrophil_infiltration: integer
  min_percent_normal_cells: integer
  min_percent_stromal_cells: integer
  min_percent_tumor_cells: integer
  min_percent_tumor_nuclei: integer
  mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
  mononucleotide_marker_panel_analysis_status: string
  neoplasm_histologic_grade: string
  new_tumor_event_after_initial_treatment: string
  number_of_lymphnodes_examined: integer
  number_of_lymphnodes_positive_by_he: integer
  ParticipantBarcode: string
  pathologic_M: string
  pathologic_N: string
  pathologic_stage: string
  pathologic_T: string
  person_neoplasm_cancer_status: string
  pregnancies: string
  preservation_method: string
  primary_neoplasm_melanoma_dx: string
  primary_therapy_outcome_success: string
  prior_dx: string
  Project: string
  psa_value: float
  race: string
  residual_tumor: string
  SampleBarcode: string
  tobacco_smoking_history: string
  total_number_of_pregnancies: integer
  tumor_tissue_site: string
  tumor_pathology: string
  tumor_type: string
  weiss_venous_invasion: string
  vital_status: string
  weight: integer
  year_of_initial_pathologic_diagnosis: string
  SampleTypeCode: string
  has_Illumina_DNASeq: string
  has_BCGSC_HiSeq_RNASeq: string
  has_UNC_HiSeq_RNASeq: string
  has_BCGSC_GA_RNASeq: string
  has_UNC_GA_RNASeq: string
  has_HiSeq_miRnaSeq: string
  has_GA_miRNASeq: string
  has_RPPA: string
  has_SNP6: string
  has_27k: string
  has_450k: string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked, and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample: tumor or normal tissue, cell, or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

  kind: cohort_api#cohortsItem
  patient_count: string
  patients: [string]
  sample_count: string
  samples: [string]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
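For readers who prefer Python to curl, the following sketch builds the same kind of POST request with the standard library and shows one way to read the response fields listed above. It is an illustration under stated assumptions: the URL is reconstructed from the request line in this section, the helper names are hypothetical, the comma-separated "Study" filter value is assumed, and the sketch does not send the request or assume the service is reachable.

```python
import json
import urllib.request

# Reconstructed from the HTTP request line in this section (an assumption).
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def summarize_preview(response_text):
    """Parse a preview response; the counts arrive as strings."""
    payload = json.loads(response_text)
    return {
        "patient_count": int(payload["patient_count"]),
        "sample_count": int(payload["sample_count"]),
        "patients": payload.get("patients", []),
        "samples": payload.get("samples", []),
    }


request = build_preview_request({"Study": "BRCA,OV"})
# To send it: urllib.request.urlopen(request).read().decode("utf-8")
```

Since patient_count and sample_count are declared as strings in the response structure, converting them to integers before doing arithmetic avoids subtle comparison bugs.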

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Required. Barcode of the patient to get information about
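As a rough illustration of how such a request might be assembled, the sketch below builds the GET URL for an endpoint from its name and parameters. It makes several assumptions: the base URL is the one from the HTTP request line with its punctuation restored, the parameter is sent as a query-string argument, and `build_endpoint_url` is a hypothetical helper name, not part of any ISB-CGC library. No network call is made.

```python
from urllib.parse import urlencode

# Assumed base URL, taken from the HTTP request line of this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_endpoint_url(endpoint, **params):
    """Return the GET URL for a cohort_api endpoint (hypothetical helper).

    Parameters whose value is None are left out of the query string,
    which is convenient for optional parameters.
    """
    query = urlencode({k: v for k, v in params.items() if v is not None})
    base = f"{BASE_URL}/{endpoint}"
    return f"{base}?{query}" if query else base


url = build_endpoint_url("patient_details", patient_barcode="TCGA-B9-7268")
print(url)
```

The resulting URL can be passed to any HTTP client. The same helper would also work for the other endpoints described in this section by changing the endpoint name and parameters.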

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "clinical_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "Study": string,
        "age_at_initial_pathologic_diagnosis": string,
        "anatomic_neoplasm_subdivision": string,
        "batch_number": string,
        "bcr": string,
        "clinical_M": string,
        "clinical_N": string,
        "clinical_T": string,
        "clinical_stage": string,
        "colorectal_cancer": string,
        "country": string,
        "days_to_birth": string,
        "days_to_initial_pathologic_diagnosis": string,
        "days_to_last_followup": string,
        "ethnicity": string,
        "frozen_specimen_anatomic_site": string,
        "gender": string,
        "histological_type": string,
        "history_of_colon_polyps": string,
        "history_of_neoadjuvant_treatment": string,
        "history_of_prior_malignancy": string,
        "hpv_calls": string,
        "hpv_status": string,
        "icd_10": string,
        "icd_o_3_histology": string,
        "icd_o_3_site": string,
        "lymphatic_invasion": string,
        "lymphnodes_examined": string,
        "lymphovascular_invasion_present": string,
        "menopause_status": string,
        "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
        "mononucleotide_marker_panel_analysis_status": string,
        "neoplasm_histologic_grade": string,
        "new_tumor_event_after_initial_treatment": string,
        "number_of_lymphnodes_examined": string,
        "number_of_lymphnodes_positive_by_he": string,
        "pathologic_M": string,
        "pathologic_N": string,
        "pathologic_T": string,
        "pathologic_stage": string,
        "person_neoplasm_cancer_status": string,
        "pregnancies": string,
        "primary_neoplasm_melanoma_dx": string,
        "primary_therapy_outcome_success": string,
        "prior_dx": string,
        "race": string,
        "residual_tumor": string,
        "tobacco_smoking_history": string,
        "tumor_tissue_site": string,
        "tumor_type": string,
        "vital_status": string,
        "weiss_venous_invasion": string,
        "year_of_initial_pathologic_diagnosis": string
      },
      "samples": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode
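Because each data_details entry carries its own Platform and DataFileName, a client will often want to regroup the list. The sketch below shows one way this could be done on an already-parsed response. The helper name `files_by_platform` and the example payload (file names and the second platform name) are made up for illustration and are not real ISB-CGC data. It also assumes that data_details may be absent when a sample has no files.

```python
from collections import defaultdict


def files_by_platform(sample_details):
    """Group DataFileName values by Platform (hypothetical helper)."""
    grouped = defaultdict(list)
    # Assumption: "data_details" may be missing if no files are associated.
    for entry in sample_details.get("data_details", []):
        grouped[entry.get("Platform")].append(entry.get("DataFileName"))
    return dict(grouped)


# Made-up response fragment, for illustration only.
example = {
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "file_a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "file_b.txt"},
        {"Platform": "example_platform", "DataFileName": "file_c.txt"},
    ],
    "data_details_count": "3",
    "patient": "TCGA-B9-7268",
}

grouped = files_by_platform(example)
# data_details_count is documented as a string, so convert it before comparing.
total = int(example["data_details_count"])
print(grouped, total)
```

Note that counts are documented as strings, so numeric comparisons need an explicit conversion as shown.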


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for
platform | string | Optional. Filter file results by platform
pipeline | string | Optional. Filter file results by pipeline
token | string | Optional. Access token to authenticate user

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field
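The two special cases described for datafilenamekeys (a missing field and the "file-path-not-yet-available" placeholder) are easy to overlook in client code. The following is a minimal sketch of how a parsed response might be handled. The helper name `available_paths` and the example paths are hypothetical, and the sketch assumes the placeholder text appears as a substring of the returned entry.

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response):
    """Return only usable cloud storage paths (hypothetical helper).

    Handles a response without a "datafilenamekeys" field and skips
    entries that carry the placeholder text.
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Made-up responses, for illustration only.
with_files = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/example-file.bam",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}
without_files = {"count": "0"}

paths = available_paths(with_files)
empty = available_paths(without_files)
print(paths, empty)
```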



google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "items": [
        {
          "SampleBarcode": string,
          "GG_dataset_id": string,
          "GG_readgroupset_id": string
        }
      ]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either "0" or "1"
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
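Since the items list holds either zero or one object, client code needs to handle the empty case explicitly. The sketch below shows one possible way to do that on a parsed response. `genomics_ids` is a hypothetical helper name, the ids in the example are placeholders, and the sketch assumes that "items" may be empty or missing altogether when count is "0".

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has none.

    Hypothetical helper; assumes "items" is empty or absent when count is "0".
    """
    items = response.get("items", [])
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Made-up responses with placeholder ids, for illustration only.
found = {
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
not_found = {"count": "0"}

ids = genomics_ids(found)
missing = genomics_ids(not_found)
print(ids, missing)
```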


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on Read the Docs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)


Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell!" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of your instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
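If you prefer to script VM creation, the same command can be assembled in Python before it is run. The following is only a minimal sketch: the helper name build_create_command and its parameters are illustrative, not part of gcloud, and it uses only the --machine-type, --zone, --boot-disk-size and --preemptible flags. It prints the command rather than running it.

```python
import shlex


def build_create_command(name, machine_type="g1-small", zone=None,
                         boot_disk_size_gb=None, preemptible=False):
    """Return the argument list for `gcloud compute instances create`.

    This helper is hypothetical; it only assembles the command line.
    """
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # Leave --zone out to fall back on the compute/zone config property.
        argv += ["--zone", zone]
    if boot_disk_size_gb is not None:
        argv += ["--boot-disk-size", f"{boot_disk_size_gb}GB"]
    if preemptible:
        argv.append("--preemptible")
    return argv


command = build_create_command("my-instance", zone="us-central1-a",
                               boot_disk_size_gb=50)
print(shlex.join(command))
```

To actually launch the VM (which will incur charges), the list could be passed to subprocess.run(command, check=True) on a machine where gcloud is installed and configured.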

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances, and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI: use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources that are attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
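The two disk prices quoted above imply a flat rate of $0.04 per GB per month. The sketch below simply applies that rate to other sizes. It assumes 1 TB is counted as 1000 GB and that the rate is the one quoted in this document; current pricing may differ.

```python
# Rate implied by the figures quoted above: $2 per month for 50 GB.
RATE_PER_GB_MONTH = 2.00 / 50  # dollars per GB per month


def monthly_disk_cost(size_gb, rate=RATE_PER_GB_MONTH):
    """Estimated monthly cost, in dollars, of a standard persistent disk."""
    return size_gb * rate


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```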

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g., the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

The instance name is a positional argument, and --device-name sets the name under which the disk appears inside the VM (as /dev/disk/by-id/google-disk-1 here). Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first, you unmount the disk (using the umount command on its mount point, from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

sudo umount /mnt/pd1
exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
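The full life cycle of an additional disk can be summarized as an ordered list of commands. The sketch below only prints that list; it does not run anything. The helper name disk_lifecycle is hypothetical, and it assumes the disk is attached with --device-name equal to the disk name, so that it appears inside the VM as /dev/disk/by-id/google-<disk name>.

```python
def disk_lifecycle(instance, disk, size_gb, mount_point):
    """Return (where, command) pairs covering the life of an extra disk.

    "local" commands run on your workstation; "vm" commands run inside
    the instance after `gcloud compute ssh`.
    """
    device = f"/dev/disk/by-id/google-{disk}"
    return [
        ("local", f"gcloud compute disks create {disk} --size {size_gb}GB"),
        ("local", f"gcloud compute instances attach-disk {instance} "
                  f"--disk {disk} --device-name {disk}"),
        ("vm", f"sudo mkfs.ext4 -F {device}"),  # erases existing data
        ("vm", f"sudo mkdir -p {mount_point}"),
        ("vm", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("vm", f"sudo umount {mount_point}"),
        ("local", f"gcloud compute instances detach-disk {instance} "
                  f"--disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


steps = disk_lifecycle("my-instance", "disk-1", 500, "/mnt/pd1")
for where, command in steps:
    print(f"[{where:>5}] {command}")
```

Keeping the steps in one list makes the ordering explicit: the disk is unmounted before it is detached, and detached before it is deleted.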

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to select from a list of users that are already registered in the system, and you may select one or more users. You will also be able to remove cohorts you no longer want to share. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
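The three set operations behave like ordinary set arithmetic on the samples in each cohort. The sketch below illustrates this with Python sets; the cohorts and sample barcodes are made up for illustration, and the web-app's internal implementation is not shown here.

```python
# Hypothetical cohorts, each modeled as a set of sample barcodes.
cohort_a = {"TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-01"}
cohort_b = {"TCGA-AA-0002-01", "TCGA-AA-0003-01", "TCGA-AA-0004-01"}
cohort_c = {"TCGA-AA-0003-01", "TCGA-AA-0005-01"}

# Union and intersection give the same result for any cohort order.
union = cohort_a | cohort_b | cohort_c
intersection = cohort_a & cohort_b & cohort_c

# Complement: start from a base cohort and subtract the other cohorts.
complement = cohort_a - (cohort_b | cohort_c)

print("union:       ", sorted(union))
print("intersection:", sorted(intersection))
print("complement:  ", sorted(complement))
```

Unlike union and intersection, the complement depends on which cohort is chosen as the base: cohort_b - (cohort_a | cohort_c) would give a different result.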

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
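As a rough illustration of name-only matching, the sketch below filters two lists by substring. The function name, the example names, and the choice of case-insensitive matching are all assumptions; the text above does not say whether the web-app's search is case-sensitive.

```python
def search_by_name(names, query):
    """Return the names that contain `query` (case-insensitive, assumed)."""
    needle = query.lower()
    return [name for name in names if needle in name.lower()]


# Hypothetical cohort and visualization names.
cohorts = ["BRCA females", "LUAD smokers", "All TCGA Data"]
visualizations = ["BRCA age histogram", "TP53 vs survival"]

results = {
    "cohorts": search_by_name(cohorts, "brca"),
    "visualizations": search_by_name(visualizations, "brca"),
}
print(results)
```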

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts the list by that column, and the header indicates whether the sort is in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
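One way to read the counts shown beside each filter value is sketched below: for a given attribute, count the samples that pass the filters selected on every other attribute. This is an interpretation of the description above, not the web-app's actual code, and the sample records and attribute names are made up.

```python
from collections import Counter

# Hypothetical samples; attribute names and values are illustrative only.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
]


def value_counts(samples, attribute, selected):
    """Count values of `attribute` among samples that pass the filters
    selected on every *other* attribute."""
    others = {k: v for k, v in selected.items() if k != attribute}
    passing = [s for s in samples
               if all(s[k] in allowed for k, allowed in others.items())]
    return Counter(s[attribute] for s in passing)


selected = {"gender": {"FEMALE"}}
print(value_counts(samples, "vital_status", selected))
print(value_counts(samples, "gender", selected))
```

With the gender filter set to FEMALE, the Vital Status counts cover only the three female samples, while the Gender counts still cover all four samples because an attribute's own filter is not applied to itself.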

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
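The number shown when hovering over a swatch can be thought of as a count of samples sharing a pair of platform segments on two adjacent bars. The sketch below computes those counts for two data types. The sample names and platform labels are placeholders, and the function name ribbon_counts is hypothetical.

```python
from collections import Counter

# Hypothetical platform assignments per sample; "NA" means the sample has
# no data of that type.
samples = {
    "sample-1": {"DNA-Sequence": "Platform A", "RNA-Sequence": "Platform X"},
    "sample-2": {"DNA-Sequence": "Platform A", "RNA-Sequence": "NA"},
    "sample-3": {"DNA-Sequence": "Platform B", "RNA-Sequence": "Platform X"},
    "sample-4": {"DNA-Sequence": "Platform A", "RNA-Sequence": "Platform X"},
}


def ribbon_counts(samples, left_type, right_type):
    """Count the samples flowing between each pair of platform segments."""
    return Counter((data[left_type], data[right_type])
                   for data in samples.values())


counts = ribbon_counts(samples, "DNA-Sequence", "RNA-Sequence")
for (left, right), n in sorted(counts.items()):
    print(f"{left} -> {right}: {n}")
```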

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or choose to cancel adding new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort, and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same as on the User Dashboard page. A dialogue box appears, and you are prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
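The difference between the two counts can be illustrated with TCGA barcodes, where the first three dash-separated fields of a sample barcode identify the participant. The barcodes below are made up for illustration, and the helper participant_of is hypothetical.

```python
# Hypothetical sample barcodes: two samples come from the same participant.
sample_barcodes = [
    "TCGA-AA-0001-01A",  # sample type code 01: primary solid tumor
    "TCGA-AA-0001-10A",  # sample type code 10: blood-derived normal
    "TCGA-AA-0002-01A",
]


def participant_of(sample_barcode):
    """Keep the first three dash-separated fields of a sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])


participants = {participant_of(barcode) for barcode in sample_barcodes}
print(f"{len(sample_barcodes)} samples, {len(participants)} participants")
```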

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
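Once downloaded, a file list like this can be processed with any CSV reader. The sketch below parses an in-memory example with Python's csv module. The header names are assumed from the fields listed above and may not match a real download exactly, and every value (barcodes, platforms, bucket paths) is a placeholder.

```python
import csv
import io
from collections import Counter

# Hypothetical file-list contents; header names and values are assumptions.
csv_text = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01,Platform A,pipeline-1,Level 3,gs://example-bucket/a.txt
TCGA-AA-0002-01,Platform A,pipeline-1,Level 3,gs://example-bucket/b.txt
TCGA-AA-0002-01,Platform B,pipeline-2,Level 3,gs://example-bucket/c.txt
"""

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Number of files per platform, as shown next to each platform filter.
files_per_platform = Counter(row["Platform"] for row in rows)

# Cloud Storage paths for the files in the list.
paths = [row["File Path"] for row in rows]

print(files_per_platform)
print(paths)
```

For a real download, io.StringIO(csv_text) would be replaced by an open file handle.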

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Platform Filter: filters down the plot-able features by platform

  – Center Filter: filters down the plot-able features by processing center

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name

  – Platform Filter: filters down the plot-able features by platform

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe

  – Platform Filter: filters down the plot-able features by platform

  – Gene Region Filter: filters down the plot-able features by specific gene regions

  – CpG Island Region Filter: filters down the plot-able features by CpG island region

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by mutation value

  – Select Feature: provides the filtered list of plot-able features to select from, based on selected filters

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature

• Cohorts: This is where you can select one or more cohorts to plot at one time

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols

• Cohorts: here you can select one or more cohorts for the visualization

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it has been saved.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.


1.4.8 Release Notes

• December 23, 2015: v0.2

    - Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

    - Cohort file list download is not working.

• December 3, 2015: v0.1

    - First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email: info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring and sharing DNA sequence reads, reference-based alignments and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.


DNA Copy-Number Segments

The Copy_Number_segments table contains one row per copy-number segment per TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)
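
As a rough illustration of how such a barcode can be read, the sketch below slices the example aliquot barcode into its participant and sample prefixes. It assumes the 12-character participant and 16-character sample prefix lengths used for patient and sample barcodes elsewhere in this documentation; the function name is hypothetical and not part of any ISB-CGC library.

```python
def split_aliquot_barcode(aliquot_barcode):
    """Return (participant, sample, aliquot) barcodes for a TCGA aliquot barcode.

    Hypothetical helper: it relies only on fixed prefix lengths
    (12 characters for a participant, 16 for a sample).
    """
    if not aliquot_barcode.startswith("TCGA-"):
        raise ValueError("not a TCGA barcode: %r" % aliquot_barcode)
    participant = aliquot_barcode[:12]
    sample = aliquot_barcode[:16]
    return participant, sample, aliquot_barcode


participant, sample, aliquot = split_aliquot_barcode("TCGA-04-1517-01A-01D-0533-01")
print(participant)  # TCGA-04-1517
print(sample)       # TCGA-04-1517-01A
```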

Platform: DNA Copy-Number data was generated for the TCGA project using the Affymetrix Genome-Wide Human SNP 6.0 Array.

Pipeline: DNA Copy-Number data was generated for the TCGA project at the Broad Genome Characterization Center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data.

ETL Details: Each Level-3 data archive contains 4 output files per sample assayed: two based on the hg18 reference and two based on the hg19 reference. The BigQuery table is populated only with the files ending with nocnv_hg19.seg.txt. The num_probes and segment_mean fields in the raw files are sometimes represented using exponential (scientific) notation (e.g. 8.7E+07) and were interpreted as integer or floating-point values, respectively.
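
The handling of exponential notation can be sketched as follows. This is only an illustration of the idea, not the actual ETL code; the function names and the sample values are made up for the example.

```python
def parse_num_probes(text):
    """Interpret a num_probes field as an integer.

    int("8.7E+07") raises ValueError, so the text goes through float first.
    """
    return int(float(text))


def parse_segment_mean(text):
    """Interpret a segment_mean field as a floating-point value."""
    return float(text)


print(parse_num_probes("8.7E+07"))     # 87000000
print(parse_segment_mean("-1.2E-01"))  # -0.12
```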

The mapping between TCGA aliquot barcodes and Level-3 data files was obtained from the SDRF file.


DNA Methylation

The DNA Methylation table contains one row per CpG probe and TCGA aliquot. Each TCGA aliquot is uniquely represented by a TCGA barcode of length 28, e.g. TCGA-04-1517-01A-01D-0533-01. (For more information on how TCGA barcodes were created and how to "read" a TCGA barcode, click on the preceding link.)

The platform annotation information needed to analyze this data is also available in a BigQuery table. For more information, see the Reference Data section of this documentation.

Platform: DNA Methylation data was generated for the TCGA project using the Illumina HumanMethylation27 BeadChip and its successor, the HumanMethylation450 BeadChip.

Pipeline: DNA Methylation data was generated for the TCGA project at the JHU-USC genome characterization center. A DESCRIPTION.txt file is included with each data archive at the DCC, describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data.


ETL Details: The BigQuery table is populated only with the files whose names contain "HumanMethylation" and end in ".txt". The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
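
The merge rules above can be sketched in a few lines. This is a minimal illustration, not the actual ETL code: the row layout (sample barcode, probe id, beta value) and the function name are assumptions made for the example, as is the choice to decide precedence per sample barcode.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge per-probe rows from the 27k and 450k platforms into one list.

    Each input row is a (sample_barcode, probe_id, beta_value) tuple.
    Rows with a missing beta value are dropped, and 27k rows are skipped
    for any sample that also has 450k data.
    """
    samples_on_450k = {sample for sample, _, _ in rows_450k}
    merged = []
    for platform, rows in (("450k", rows_450k), ("27k", rows_27k)):
        for sample, probe_id, beta in rows:
            if beta in (None, "NA", "NULL"):
                continue  # probes without a Beta_Value are left out
            if platform == "27k" and sample in samples_on_450k:
                continue  # the 450k data takes precedence
            merged.append((sample, probe_id, float(beta), platform))
    return merged


rows_27k = [("S1", "cg01", "0.5"), ("S2", "cg01", "NA"), ("S2", "cg02", "0.25")]
rows_450k = [("S1", "cg01", "0.75")]
print(merge_methylation(rows_27k, rows_450k))
```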


mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsem.genes.normalized_results. These raw "RSEM genes normalized results" files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
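
A minimal sketch of this split is shown below. The function name is hypothetical, and the HGNC lookup is represented by a plain dictionary passed in by the caller, since the actual lookup source used by the ETL is not described here.

```python
def split_unc_gene_id(raw_gene_id, hgnc_by_gene_id=None):
    """Split a raw UNC gene_id such as "TP53|7157" into separate columns.

    hgnc_by_gene_id is an assumed {gene_id: approved_symbol} mapping;
    when no entry is found the HGNC_gene_symbol is left as None.
    """
    symbol, _, gene_id = raw_gene_id.partition("|")
    lookup = hgnc_by_gene_id or {}
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        "HGNC_gene_symbol": lookup.get(gene_id),
    }


print(split_unc_gene_id("TP53|7157", {"7157": "TP53"}))
```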

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods and protocols used to produce the Level-1, Level-2 and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern gene.quantification.txt. These raw "gene quantification" files have four columns: gene, raw_counts, median_length_normalized and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302|?|3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id and, if there is a third component, a gene_addenda column. If one component is simply ?, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
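
The splitting of the BCGSC gene string can be sketched as follows. This is an illustration under stated assumptions rather than the actual ETL code: the function name is hypothetical, the placeholder for a missing component is assumed to be "?", and any further clean-up of suffixes such as "_calculated" is not attempted because it is not described above.

```python
def split_bcgsc_gene(gene):
    """Split a BCGSC gene string into its two or three components.

    A component that is just "?" (assumed placeholder) becomes None,
    standing in for a NULL value in the table.
    """
    parts = [None if part == "?" else part for part in gene.split("|")]
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three parts: %r" % gene)
    return {
        "original_gene_symbol": parts[0],
        "gene_id": parts[1],
        "gene_addenda": parts[2] if len(parts) == 3 else None,
    }


print(split_bcgsc_gene("TP53|7157_calculated"))
print(split_bcgsc_gene("Mir_1302|?|3of7_calculated"))
```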


microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.
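
The split-and-melt step can be sketched as below. This is only an illustration: the function name, the in-memory matrix layout, and the assumption that the gene name and accession are joined by a "." are all choices made for the example.

```python
def melt_mirna_matrix(matrix):
    """Flatten an expression matrix into one row per (aliquot, microRNA).

    matrix maps a row label such as "hsa-mir-21.MIMAT0000076" to a
    {aliquot_barcode: value} dictionary. The label is split at its last
    "." into the microRNA name and the MIMAT accession.
    """
    rows = []
    for label, values in matrix.items():
        mirna_id, _, accession = label.rpartition(".")
        for aliquot, value in values.items():
            rows.append((aliquot, mirna_id, accession, value))
    return rows


matrix = {"hsa-mir-21.MIMAT0000076": {"aliquot_A": 10.0, "aliquot_B": 12.5}}
for row in melt_mirna_matrix(matrix):
    print(row)
```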


Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data from files whose names contain "_RPPA_Core.protein_expression" and end in ".txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene and protein map is generated.

Generation of Composite_element_ref, gene and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If "protein_name" is missing, generate one from "composite_element_ref"

• Make a map of 'composite_element_ref', 'gene_name', 'protein_name' values

• Check any other variant of the gene and protein symbols in the table

• HGNC Validation

    - If the gene symbol is in the HGNC approved symbols: 'Approved' Gene_symbol = Gene_symbol

    - If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol

    - If not, check the Previous symbols. If found: Gene_symbol = "Approved" Gene_symbol

    - If not: Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm
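
The HGNC validation steps can be read as a short chain of lookups, sketched below. This is one possible reading of the list above, not the actual curation code: the function name and the three lookup structures are hypothetical, and the symbols in the example are placeholders rather than real HGNC entries.

```python
def validate_gene_symbol(symbol, approved, aliases, previous):
    """Apply the approved / alias / previous-symbol checks in order.

    approved: set of approved symbols
    aliases:  set of known alias symbols
    previous: dict mapping a previous symbol to its approved symbol
    """
    if symbol in approved:
        return symbol            # already an approved symbol
    if symbol in aliases:
        return symbol            # alias symbol is kept, per the alias step
    if symbol in previous:
        return previous[symbol]  # replaced by the approved symbol
    return symbol                # unknown symbols are left unchanged


# Placeholder lookup tables, made up for the example.
approved = {"GENE_A"}
aliases = {"ALIAS_B"}
previous = {"OLD_C": "GENE_C"}
print(validate_gene_symbol("OLD_C", approved, aliases, previous))  # GENE_C
```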

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with each gene in a row

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
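
The row-duplication rule for concatenated gene names can be sketched as follows. The function name and the extra column in the example row are hypothetical; the separator between gene names is assumed to be whitespace and/or commas.

```python
import re


def explode_gene_names(row):
    """Return one copy of the row per gene listed in its "gene_name" value."""
    genes = [g for g in re.split(r"[,\s]+", row["gene_name"].strip()) if g]
    return [dict(row, gene_name=gene) for gene in genes]


row = {"gene_name": "AKT1 AKT2 AKT3", "protein_name": "Akt"}
for expanded in explode_gene_names(row):
    print(expanded)
```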


1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified, as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.
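
Combining gene coordinates with copy-number segments comes down to an interval-overlap test, sketched below in plain Python. In practice this would more likely be written as a BigQuery join; the function name, the tuple layouts and all coordinates here are made up for the illustration.

```python
def segments_overlapping_gene(gene, segments):
    """Return the copy-number segments that overlap a gene.

    gene:     (chromosome, start, end)
    segments: list of (chromosome, start, end, segment_mean)
    Two intervals overlap when each one starts before the other ends.
    """
    chrom, gene_start, gene_end = gene
    return [seg for seg in segments
            if seg[0] == chrom and seg[1] <= gene_end and seg[2] >= gene_start]


gene = ("chr1", 1000, 2000)
segments = [("chr1", 1, 999, 0.01), ("chr1", 1500, 5000, -0.8), ("chr2", 1000, 2000, 0.3)]
print(segments_overlapping_gene(gene, segments))  # [('chr1', 1500, 5000, -0.8)]
```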

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names and mature miR names. The miR sequence can also be retrieved from this table.


miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.


1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

    - isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

    - isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

    - includes Annotations table with information about redacted samples, etc.

    - isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS (?)


Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics


1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data, in BigQuery tables and in Google Cloud Storage, is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm's resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud-computing. This means that researchers will need to learn a new skill-set.


1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.


1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
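
The same request can be prepared from Python using only the standard library. The sketch below builds the POST request without sending it; the endpoint URL is reconstructed from the curl example above (its punctuation is inferred), the function name is hypothetical, and the single-study filter value is chosen purely for illustration.

```python
import json
import urllib.request

# Assumed endpoint URL, reconstructed from the curl example.
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request for the preview endpoint."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})
print(request.get_method(), request.full_url)

# Sending the request needs network access and a live endpoint:
# with urllib.request.urlopen(request) as response:
#     preview = json.loads(response.read().decode("utf-8"))
```

Keeping the request construction separate from the network call makes it easy to inspect the filters before anything is sent.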

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion: string
age_at_initial_pathologic_diagnosis: string
anatomic_neoplasm_subdivision: string
avg_percent_lymphocyte_infiltration: float
avg_percent_monocyte_infiltration: float
avg_percent_necrosis: float
avg_percent_neutrophil_infiltration: float
avg_percent_normal_cells: float
avg_percent_stromal_cells: float
avg_percent_tumor_cells: float
avg_percent_tumor_nuclei: float
batch_number: integer
bcr: string
clinical_M: string
clinical_N: string
clinical_stage: string
clinical_T: string
colorectal_cancer: string
country: string
country_of_procurement: string
days_to_birth: integer
days_to_collection: integer
days_to_death: integer
days_to_initial_pathologic_diagnosis: integer
days_to_last_followup: integer
days_to_submitted_specimen_dx: integer
Study: string
ethnicity: string
frozen_specimen_anatomic_site: string
gender: string
height: integer
histological_type: string
history_of_colon_polyps: string
history_of_neoadjuvant_treatment: string
history_of_prior_malignancy: string
hpv_calls: string
hpv_status: string
icd_10: string
icd_o_3_histology: string
icd_o_3_site: string
lymph_node_examined_count: integer
lymphatic_invasion: string
lymphnodes_examined: string
lymphovascular_invasion_present: string
max_percent_lymphocyte_infiltration: integer
max_percent_monocyte_infiltration: integer
max_percent_necrosis: integer
max_percent_neutrophil_infiltration: integer
max_percent_normal_cells: integer
max_percent_stromal_cells: integer
max_percent_tumor_cells: integer
max_percent_tumor_nuclei: integer
menopause_status: string
min_percent_lymphocyte_infiltration: integer
min_percent_monocyte_infiltration: integer
min_percent_necrosis: integer
min_percent_neutrophil_infiltration: integer
min_percent_normal_cells: integer
min_percent_stromal_cells: integer
min_percent_tumor_cells: integer
min_percent_tumor_nuclei: integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
mononucleotide_marker_panel_analysis_status: string
neoplasm_histologic_grade: string
new_tumor_event_after_initial_treatment: string
number_of_lymphnodes_examined: integer
number_of_lymphnodes_positive_by_he: integer
ParticipantBarcode: string
pathologic_M: string
pathologic_N: string
pathologic_stage: string
pathologic_T: string
person_neoplasm_cancer_status: string
pregnancies: string
preservation_method: string
primary_neoplasm_melanoma_dx: string
primary_therapy_outcome_success: string
prior_dx: string
Project: string
psa_value: float
race: string
residual_tumor: string
SampleBarcode: string
tobacco_smoking_history: string
total_number_of_pregnancies: integer
tumor_tissue_site: string
tumor_pathology: string
tumor_type: string
weiss_venous_invasion: string
vital_status: string
weight: integer
year_of_initial_pathologic_diagnosis: string
SampleTypeCode: string
has_Illumina_DNASeq: string
has_BCGSC_HiSeq_RNASeq: string
has_UNC_HiSeq_RNASeq: string
has_BCGSC_GA_RNASeq: string
has_UNC_GA_RNASeq: string
has_HiSeq_miRnaSeq: string
has_GA_miRNASeq: string
has_RPPA: string
has_SNP6: string
has_27k: string
has_450k: string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
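Because the counts in the preview response are typed as strings, a client will usually want to convert them before comparing them with the barcode lists. A minimal sketch follows; the response body and its barcodes are made up for illustration, and the helper name `summarize_preview` is hypothetical.

```python
import json

# Hypothetical response body; the barcodes below are invented for illustration.
SAMPLE_RESPONSE = json.dumps({
    "patient_count": "2",
    "patients": ["TCGA-AA-0001", "TCGA-AA-0002"],
    "sample_count": "3",
    "samples": ["TCGA-AA-0001-01A", "TCGA-AA-0001-10A", "TCGA-AA-0002-01A"],
})


def summarize_preview(response_text):
    """Convert the string counts to integers and check them against the lists."""
    data = json.loads(response_text)
    patient_count = int(data["patient_count"])
    sample_count = int(data["sample_count"])
    counts_match = (
        patient_count == len(data.get("patients", []))
        and sample_count == len(data.get("samples", []))
    )
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        "counts_match": counts_match,
    }


summary = summarize_preview(SAMPLE_RESPONSE)
print(summary)
```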

Have feedback or corrections? You can file an issue or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
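As a sketch of how a client might assemble this call, the snippet below validates the barcode length and builds the request URL. It assumes that `patient_barcode` is sent as a query-string parameter, since the HTTP request shown has no path placeholder; that is an assumption, as are the base URL constant and the helper name `patient_details_url`.

```python
from urllib.parse import urlencode

# Assumed base URL (reconstructed from the documentation text).
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Return the patient_details URL for a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long")
    # Assumption: the parameter travels in the query string.
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


url = patient_details_url("TCGA-B9-7268")
print(url)
```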

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_atinitial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
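All `clinical_data` values are typed as strings, including numeric ones such as `days_to_birth`, so a client may want a tolerant conversion step. A minimal sketch under stated assumptions: the sample response and its values are invented, the placeholder string `"None"` for missing values is an assumption, and the helper names are hypothetical.

```python
def to_int(value):
    """Convert a string to int, returning None when it is missing or non-numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


# Fields chosen for illustration; all are documented as strings in the response.
NUMERIC_FIELDS = (
    "age_at_initial_pathologic_diagnosis",
    "days_to_birth",
    "days_to_last_followup",
)


def numeric_clinical_values(response):
    """Pull selected numeric fields out of the nested clinical_data object."""
    clinical = response.get("clinical_data", {})
    return {name: to_int(clinical.get(name)) for name in NUMERIC_FIELDS}


# Hypothetical response fragment with invented values.
sample_patient = {
    "clinical_data": {
        "age_at_initial_pathologic_diagnosis": "54",
        "days_to_birth": "-19800",
        "days_to_last_followup": "None",
    }
}
values = numeric_clinical_values(sample_patient)
print(values)
```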

Have feedback or corrections? You can file an issue or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.
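The two example barcodes in this section (TCGA-B9-7268 and TCGA-B9-7268-01A) suggest how the identifiers nest: the first 12 characters of a sample barcode are the participant barcode. The sketch below relies on that relationship; the helper name `split_sample_barcode` is hypothetical.

```python
def split_sample_barcode(sample_barcode):
    """Split a 16-character sample barcode into (participant barcode, sample suffix)."""
    if len(sample_barcode) != 16:
        raise ValueError("sample barcodes are 16 characters long")
    participant = sample_barcode[:12]   # e.g. "TCGA-B9-7268"
    suffix = sample_barcode[13:]        # the part after the separating hyphen
    return participant, suffix


participant, suffix = split_sample_barcode("TCGA-B9-7268-01A")
print(participant, suffix)
```

This can be handy for checking that the `patient` value returned by sample_details agrees with the barcode that was requested.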

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. To see paths to controlled-access files, the user must be authenticated and have dbGaP authorization. If the user is not dbGaP-authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate the user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
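The request and response described above can be handled with a few lines of Python. The following is a minimal sketch, not an official client: it assumes the endpoint is https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample and that the parameters are passed in the query string. The helper names, the placeholder barcode, and the sample response are hypothetical. The sketch only builds the URL and reads an already-parsed response, so it makes no network call.

```python
from urllib.parse import urlencode

# Endpoint for this method (assumed to be reachable at this address).
BASE_URL = ("https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
            "datafilenamekey_list_from_sample")


def build_request_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Return the GET URL, adding only the optional parameters that were supplied."""
    params = {"sample_barcode": sample_barcode}
    for name, value in (("platform", platform),
                        ("pipeline", pipeline),
                        ("token", token)):
        if value is not None:
            params[name] = value
    return BASE_URL + "?" + urlencode(params)


def extract_paths(response):
    """Return the file paths from a parsed JSON response.

    The datafilenamekeys field is omitted when no paths are listed,
    so fall back to an empty list instead of raising KeyError.
    """
    return response.get("datafilenamekeys", [])


# Placeholder barcode and response, for illustration only.
url = build_request_url("TCGA-XX-XXXX-01A", platform="IlluminaHiSeq_miRNASeq")
example_response = {"kind": "cohort_api#cohortsItem", "count": "0"}
print(url)
print(extract_paths(example_response))
```

Treating a missing datafilenamekeys field as an empty list keeps calling code the same whether the sample has no files or the user simply lacks dbGaP authorization.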


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
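Because count is returned as a string ("0" or "1") rather than a number, client code should convert it before testing it. A minimal sketch of reading this response in Python follows. The function name and the example values are hypothetical, and the response is assumed to have already been parsed from JSON into a dictionary.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None if there is no record."""
    # count is documented as a string, so convert it before comparing.
    if int(response.get("count", "0")) == 0:
        return None
    item = response["items"][0]  # at most one item is returned
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Made-up responses, for illustration only.
found = {
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [{"SampleBarcode": "TCGA-XX-XXXX-01A",
               "GG_dataset_id": "dataset-123",
               "GG_readgroupset_id": "rgs-456"}],
}
not_found = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(genomics_ids(found))
print(genomics_ids(not_found))
```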


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs", or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. Whenever you use one of the SDK tools while updates are available, the command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you; they may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will open a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
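If you need to script VM creation, for example to launch several instances, the same commands can be assembled in Python. The sketch below only builds and prints the command lines described above; it does not call gcloud. The helper names are hypothetical, and the instance name, machine type, and zone are the example values used in this section.

```python
import shlex


def set_default_zone(zone):
    """Build the command that stores a default zone in the gcloud configuration."""
    return ["gcloud", "config", "set", "compute/zone", zone]


def create_instance(name, machine_type, zone=None):
    """Build the command that creates a VM; --zone is added only when given explicitly."""
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    if zone is not None:
        cmd += ["--zone", zone]
    return cmd


commands = [
    set_default_zone("us-central1-a"),
    create_instance("my-instance", "g1-small"),
]
for cmd in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

To actually run one of these commands, pass the list to subprocess.run(cmd, check=True) on a machine where the Cloud SDK is installed and authenticated. Keeping the arguments in a list avoids shell quoting problems.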

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources that are attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
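The two prices quoted above ($2 per month for 50 GB, $40 per month for 1 TB) imply a rate of about $0.04 per GB per month for standard persistent disks, taking 1 TB as 1000 GB. The short sketch below uses that rate to estimate what the disk of a stopped VM keeps costing. The rate is derived only from the figures in this section and may not match current pricing, and the function name is illustrative.

```python
# Rate implied by the figures above: $2 per month for 50 GB. This is an
# assumption; check the current Google Cloud pricing before relying on it.
STANDARD_PD_USD_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb, rate=STANDARD_PD_USD_PER_GB_MONTH):
    """Return the estimated monthly charge in USD for a persistent disk of size_gb."""
    return size_gb * rate


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```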


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

The instance name is a required positional argument. The optional --device-name flag sets the name under which the guest operating system sees the disk; on the instance, the disk then appears under /dev/disk/by-id/ with a google- prefix. Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
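The full life cycle described in this section can be summarized as an ordered checklist that records where each command runs. This is a sketch only: it prints the commands rather than running them, and the tuple layout is illustrative. It also makes two assumptions: that a default zone has been configured, and that the disk was attached with the device name disk-1, so that it appears on the instance as /dev/disk/by-id/google-disk-1.

```python
DISK = "disk-1"
INSTANCE = "my-instance"
DEVICE = f"/dev/disk/by-id/google-{DISK}"  # assumes --device-name matches the disk name
MOUNT_POINT = "/mnt/pd1"

# (where the command runs, command line), in the order the steps must happen.
STEPS = [
    ("workstation", f"gcloud compute disks create {DISK} --size 500GB"),
    ("workstation", f"gcloud compute instances attach-disk {INSTANCE} "
                    f"--disk {DISK} --device-name {DISK}"),
    ("instance", f"sudo mkfs.ext4 -F {DEVICE}"),  # erases any existing data
    ("instance", f"sudo mkdir {MOUNT_POINT}"),
    ("instance", f"sudo mount -o discard,defaults {DEVICE} {MOUNT_POINT}"),
    ("instance", f"sudo umount {DEVICE}"),
    ("workstation", f"gcloud compute instances detach-disk {INSTANCE} --disk {DISK}"),
    ("workstation", f"gcloud compute disks delete {DISK}"),
]

for number, (where, command) in enumerate(STEPS, start=1):
    print(f"{number}. [{where}] {command}")
```

The "instance" steps are run after connecting with gcloud compute ssh my-instance. The last two steps are only valid once the disk has been unmounted.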


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
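The three operations behave like ordinary set operations. As an illustration only (the web application's internal representation is not described here), the sketch below models each cohort as a Python set of made-up sample identifiers.

```python
# Hypothetical cohorts, each represented as a set of sample identifiers.
cohort_a = {"S1", "S2", "S3", "S4"}
cohort_b = {"S3", "S4", "S5"}
cohort_c = {"S4", "S6"}

# Union and intersection accept any number of cohorts, in any order.
union = set().union(cohort_a, cohort_b, cohort_c)
intersection = set.intersection(cohort_a, cohort_b, cohort_c)

# Complement: start from the base cohort and subtract the other cohorts.
complement = cohort_a.difference(cohort_b, cohort_c)

print(sorted(union))
print(sorted(intersection))
print(sorted(complement))
```

Unlike union and intersection, the complement depends on which cohort is chosen as the base: subtracting cohort_a from cohort_b gives a different result than subtracting cohort_b from cohort_a.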

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only does string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
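One way to read "based on all other filters" is that the count shown next to each value of an attribute is computed after applying every selected filter except the one on that attribute itself. The sketch below illustrates that interpretation with made-up samples. The attribute names, values, and function names are illustrative, and the web application's actual implementation may differ.

```python
from collections import Counter

# Hypothetical samples; attribute names and values are for illustration only.
samples = [
    {"barcode": "S1", "Study": "BRCA", "Vital Status": "Alive"},
    {"barcode": "S2", "Study": "BRCA", "Vital Status": "Dead"},
    {"barcode": "S3", "Study": "KIRC", "Vital Status": "Alive"},
    {"barcode": "S4", "Study": "KIRC", "Vital Status": "Alive"},
]


def matches(sample, filters, skip=None):
    """True if the sample passes every selected filter except the attribute named by skip."""
    return all(sample[attr] in values
               for attr, values in filters.items() if attr != skip)


def value_counts(samples, filters, attribute):
    """Count samples per value of attribute, applying all other selected filters."""
    return Counter(s[attribute] for s in samples
                   if matches(s, filters, skip=attribute))


selected = {"Study": {"KIRC"}}
cohort = [s["barcode"] for s in samples if matches(s, selected)]
print(cohort)
print(value_counts(samples, selected, "Vital Status"))
print(value_counts(samples, selected, "Study"))
```

Under this reading, selecting "KIRC" narrows the Vital Status counts to KIRC samples, while the Study counts still show every study, so you can see what adding another study would contribute.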

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between the first two bars, you will see the number of samples that have both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
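
Once downloaded, the file list can be processed with any CSV reader. The sketch below groups Cloud Storage paths by platform using Python's standard csv module. The header names are assumed to match the five fields listed above, and the rows, platform names, and bucket paths are invented for illustration, so check them against an actual download.

```python
import csv
import io
from collections import defaultdict

# Made-up rows for illustration only; real header names, platforms,
# and Cloud Storage paths in the downloaded file may differ.
EXAMPLE_FILE_LIST = """\
Sample Barcode,Platform,Pipeline,Data Level,File Path
SAMPLE-0001,platform_a,pipeline_x,Level 3,gs://example-bucket/file1.txt
SAMPLE-0001,platform_b,pipeline_y,Level 3,gs://example-bucket/file2.txt
SAMPLE-0002,platform_a,pipeline_x,Level 3,gs://example-bucket/file3.txt
"""


def paths_by_platform(csv_text):
    """Group Cloud Storage paths by the value in the Platform column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


for platform, paths in sorted(paths_by_platform(EXAMPLE_FILE_LIST).items()):
    print(platform, len(paths))
```

To read a real download, pass the contents of the saved file instead of EXAMPLE_FILE_LIST.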

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization with the default plot, an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

– Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Platform Filter: filters down the plot-able features by platform.

– Center Filter: filters down the plot-able features by processing center.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

– miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

– Platform Filter: filters down the plot-able features by platform.

– Value Filter: filters down the plot-able features by value.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

– Platform Filter: filters down the plot-able features by platform.

– Gene Region Filter: filters down the plot-able features by specific gene regions.

– CpG Island Region Filter: filters down the plot-able features by CpG island region.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Value Filter: filters down the plot-able features by value.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature setting. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.

1.4.8 Release Notes

• December 23, 2015: v0.2

– Treemap graphs in the cohort details and cohort creation pages will not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, providing starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

ETL Details: The BigQuery table is populated only with the files matching the pattern HumanMethylationtxt. The data from both the 27k and 450k platforms have been merged together into a single table. A few samples were run on both platforms, and for those samples the 450k data takes precedence. The table includes a platform column indicating the source of each data value.

In addition:

• any CpG probes for which the Level-3 Beta_Value is NA or NULL are left out

• only the Probe_Id and Beta_Value fields from the Level-3 data files are stored in the BigQuery table
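
The merge rules above can be sketched in a few lines of Python. This is only an illustration of the described behavior, not the actual ETL code: the tuple layout of the input rows, the "Sample" column name, and the "27k"/"450k" label values are assumptions, and precedence is applied per sample as the text describes.

```python
def merge_methylation(rows_27k, rows_450k):
    """Merge per-platform rows of (sample, probe_id, beta_value).

    Both arguments should be lists, since rows_450k is scanned twice.
    """
    samples_on_450k = {sample for sample, _, _ in rows_450k}
    merged = []
    for platform, rows in (("450k", rows_450k), ("27k", rows_27k)):
        for sample, probe_id, beta in rows:
            # For samples run on both platforms, keep only the 450k data.
            if platform == "27k" and sample in samples_on_450k:
                continue
            # Leave out probes whose beta value is missing.
            if beta is None or beta == "NA":
                continue
            merged.append({
                "Sample": sample,        # hypothetical column name
                "Probe_Id": probe_id,
                "Beta_Value": float(beta),
                "Platform": platform,    # hypothetical label values
            })
    return merged


# Made-up example rows: sample s1 was run on both platforms.
example_27k = [("s1", "cg01", 0.2), ("s2", "cg01", "NA"), ("s2", "cg02", 0.7)]
example_450k = [("s1", "cg01", 0.25), ("s1", "cg03", None)]
merged_rows = merge_methylation(example_27k, example_450k)
```

In this example only two rows survive: the 450k value for s1 and the single non-missing 27k value for s2.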

mRNA Expression

Gene expression data for the TCGA project has been produced by two different centers, using several different platforms and fundamentally different pipelines. Most of the data from each center was produced using the Illumina HiSeq platform, and for that reason the first two BigQuery tables containing gene expression data are based on those specific subsets of the TCGA mRNA expression data:

• the majority of the data was produced by the UNC LCCC, and the resulting normalized RSEM values are stored in one table

• a subset of the data was produced by the BC GSC, and the resulting normalized RPKM values are stored in another table

UNC RNAseqV2 Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern rsemgenesnormalized_results. These raw “RSEM genes normalized results” files have two columns, both of which are stored in the BigQuery table. The first column contains the gene_id, which contains two parts separated by a |, e.g. TP53|7157. The second column contains the normalized_count, representing the expression value for that gene.

The gene_id column is split into two components and stored as separate columns: original_gene_symbol and gene_id. Based on the gene_id, the current HGNC approved gene symbol is looked up and added as a third column, HGNC_gene_symbol.
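
As a rough illustration of this step, the sketch below splits a raw gene_id value and attaches an HGNC symbol. The HGNC_BY_GENE_ID dictionary and the function name are hypothetical stand-ins; how the actual pipeline performs the HGNC lookup is not described here.

```python
# Hypothetical lookup from gene_id to the current HGNC approved symbol;
# a real pipeline would build this from an HGNC download.
HGNC_BY_GENE_ID = {"7157": "TP53"}


def parse_unc_row(raw_gene_id, normalized_count):
    """Split 'SYMBOL|ID' and attach the HGNC symbol when one is known."""
    symbol, gene_id = raw_gene_id.split("|")
    return {
        "original_gene_symbol": symbol,
        "gene_id": gene_id,
        "HGNC_gene_symbol": HGNC_BY_GENE_ID.get(gene_id),
        "normalized_count": float(normalized_count),
    }


# The count here is a made-up value for illustration.
row = parse_unc_row("TP53|7157", "1234.5")
```

A gene_id with no entry in the lookup simply gets None for HGNC_gene_symbol in this sketch.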

BCGSC RNAseq Pipeline: A DESCRIPTION.txt file describing the algorithms, methods, and protocols used to produce the Level-1, Level-2, and Level-3 data can be obtained from the TCGA DCC.

The BigQuery table was populated using the values in files matching the pattern genequantificationtxt. These raw “gene quantification” files have four columns: gene, raw_counts, median_length_normalized, and RPKM. From these, the gene and the RPKM values are stored in the BigQuery table. The gene string contains either two or three parts, similarly separated by a |, e.g. TP53|7157_calculated or Mir_1302||3of7_calculated.

The gene string is split into two or three components and stored as separate columns: original_gene_symbol and gene_id and, if there is a third component, a gene_addenda column. If one component is simply “?”, that character string is replaced by a NULL value. Finally, the current HGNC approved gene symbol is looked up and added as an additional column, HGNC_gene_symbol.
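
A minimal sketch of this split follows. It assumes that a missing component appears either as a lone "?" or as an empty string (as in the second example string above), and it leaves any "_calculated" suffix untouched, since how the pipeline treats that suffix is not stated. The function name is hypothetical.

```python
def parse_bcgsc_gene(gene):
    """Split a BCGSC gene string into two or three components."""
    parts = gene.split("|")
    if len(parts) not in (2, 3):
        raise ValueError("expected two or three components: %r" % gene)
    # Assumption: '?' or an empty string marks a missing component.
    cleaned = [None if part in ("?", "") else part for part in parts]
    return {
        "original_gene_symbol": cleaned[0],
        "gene_id": cleaned[1],
        "gene_addenda": cleaned[2] if len(cleaned) == 3 else None,
    }


two_part = parse_bcgsc_gene("TP53|7157_calculated")
three_part = parse_bcgsc_gene("Mir_1302||3of7_calculated")
```

Here None plays the role of the NULL value stored in BigQuery.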

microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for “mature microRNAs”. This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. “hsa-mir-21.MIMAT0000076”. During ETL, this string is split into two parts and stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
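
To make the "melt" step concrete, here is a small sketch in plain Python. It assumes the row label joins the microRNA name and accession with a period, and the output column names, aliquot labels, and expression values are made up; the actual table schema may differ.

```python
def melt_mirna_matrix(aliquots, matrix_rows):
    """Turn a wide matrix into one record per (microRNA, aliquot) pair.

    aliquots: column labels; matrix_rows: (row_label, values) pairs, where
    row_label is assumed to look like 'name.accession'.
    """
    tidy = []
    for label, values in matrix_rows:
        # Split on the last period so the accession is always the final part.
        name, accession = label.rsplit(".", 1)
        for aliquot, value in zip(aliquots, values):
            tidy.append({
                "mirna_id": name,               # hypothetical column names
                "mirna_accession": accession,
                "aliquot": aliquot,
                "expression": value,
            })
    return tidy


tidy = melt_mirna_matrix(
    ["aliquot_1", "aliquot_2"],
    [("hsa-mir-21.MIMAT0000076", [10.0, 12.5])],
)
```

Each row of the wide matrix becomes one record per aliquot, which is the shape a flat BigQuery table expects.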

Only the isoform files matching the pattern isoformquantificationtxt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.

Protein

The raw protein data file contains just two columns: the “Composite Element REF”, which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The “Composite Element REF” was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern “_RPPA_Coreprotein_expressiontxt”.

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map is generated.

Generation of the Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns.

• If “protein_name” is missing, generate one from “composite_element_ref”.

• Make a map of ‘composite_element_ref’, ‘gene_name’, ‘protein_name’ values.

• Check for any other variant of the gene and protein symbols in the table.

• HGNC Validation:

• If the gene symbol is in the HGNC approved symbols (‘Approved’), Gene_symbol = Gene_symbol.

• If not, check the Alias symbols. If found, Gene_symbol = Alias_symbol.

• If not, check the Previous symbols. If found, Gene_symbol = “Approved” Gene_symbol.

• If not, Gene_symbol = Gene_symbol.

• The file generated is manually curated and fed back into the algorithm.

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with one gene in each row.

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns.

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table.

• Data from both Illumina GA and HiSeq platforms are stored in the same table.
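The row-duplication rule in the first formatting step can be illustrated with a short Python sketch. It is not the pipeline's own code: the separator between concatenated gene names is assumed to be whitespace and/or commas, and the 'composite_element_ref' value in the sample row is a hypothetical placeholder.

```python
import re


def expand_gene_rows(rows):
    """Return a new list with one row per gene for multi-gene 'gene_name' values."""
    expanded = []
    for row in rows:
        # Assumed separators: whitespace and/or commas.
        genes = [g for g in re.split(r"[,\s]+", row["gene_name"]) if g]
        if not genes:
            # Keep rows with an empty gene_name unchanged rather than dropping them.
            genes = [row["gene_name"]]
        for gene in genes:
            new_row = dict(row)          # copy so the input row is not mutated
            new_row["gene_name"] = gene
            expanded.append(new_row)
    return expanded


sample_rows = [
    {"composite_element_ref": "example_antibody", "gene_name": "AKT1 AKT2 AKT3"},
]
expanded_rows = expand_gene_rows(sample_rows)
```

All other columns are copied unchanged into each duplicated row, so a join on the gene name sees every gene the antibody is associated with.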

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section provides links to existing sources of information elsewhere on the web, or describes additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform

Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array

The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc:genome_reference dataset.

GENCODE

Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase

The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase

The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, is provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides a convenient web-based interface for creating and visualizing collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API, or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and to add their own examples to enrich this public resource.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints has been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets, such as CCLE, may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web app, and then programmatically access these cohorts using this API.
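The relationship between sample and patient barcodes can be expressed directly in code. The following Python sketch relies only on the 16-character and 12-character lengths stated above; the function names are hypothetical, and the sample suffixes in the example are made up for illustration.

```python
from collections import defaultdict

SAMPLE_BARCODE_LENGTH = 16   # length of a TCGA sample barcode
PATIENT_BARCODE_LENGTH = 12  # length of the patient prefix


def patient_from_sample(sample_barcode):
    """Return the 12-character patient barcode for a 16-character sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError("expected a 16-character sample barcode: %r" % sample_barcode)
    return sample_barcode[:PATIENT_BARCODE_LENGTH]


def group_samples_by_patient(sample_barcodes):
    """Map each patient barcode to the list of its sample barcodes."""
    grouped = defaultdict(list)
    for barcode in sample_barcodes:
        grouped[patient_from_sample(barcode)].append(barcode)
    return dict(grouped)


# Illustrative sample barcodes built on a participant barcode of the right length.
cohort_samples = ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"]
samples_by_patient = group_samples_by_patient(cohort_samples)
```

This prefix rule applies to TCGA barcodes only; other datasets may need their own mapping.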

The Cohort API currently bundles several different cohort-related endpoints

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCAOV"}' -H "Content-Type: application/json"
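A comparable request can be prepared from Python using only the standard library. This is a minimal sketch, not an official client: the endpoint URL is reconstructed from the example above on the assumption that it follows the usual Cloud Endpoints path layout, the helper name is hypothetical, and the single-study filter value is an illustrative assumption. The request is built but not sent.

```python
import json
import urllib.request

# Reconstructed endpoint URL (assumed path layout: /_ah/api/<api>/<version>/<method>).
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical filter: one study abbreviation.
request = build_preview_request({"Study": "BRCA"})

# Sending is left to the caller, for example:
#   with urllib.request.urlopen(request) as response:
#       preview = json.load(response)
```

Keeping request construction separate from sending makes the payload easy to inspect before any network call is made.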

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992_A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
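Because the counts in this response are typed as strings, client code will usually want to convert them before use. The sketch below shows one way to do that and to cross-check the counts against the barcode lists; the function name, the "counts_match" key, and the example response values are hypothetical, and only the field names come from the table above.

```python
def summarize_preview(response):
    """Convert the string counts to integers and check them against the lists."""
    patients = list(response.get("patients", []))
    samples = list(response.get("samples", []))
    patient_count = int(response["patient_count"])
    sample_count = int(response["sample_count"])
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        # True only when both counts agree with the lengths of their lists.
        "counts_match": (patient_count == len(patients)
                         and sample_count == len(samples)),
    }


# Hypothetical response body, already decoded from JSON.
example_response = {
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "2",
    "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"],
}
summary = summarize_preview(example_response)
```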

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_atinitial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Table 1.2

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
aliquots[]    list    List of barcodes of aliquots taken from this participant.
clinical_data    nested object    The clinical data about the participant.
clinical_data.ParticipantBarcode    string    Participant barcode.
clinical_data.Project    string    Project name, e.g. “TCGA”.
clinical_data.Study    string    Tumor type abbreviation, e.g. “BRCA”.
clinical_data.age_at_initial_pathologic_diagnosis    string    Age at which a condition or disease was first diagnosed, in years.
clinical_data.anatomic_neoplasm_subdivision    string    Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor.
clinical_data.batch_number    string    Groups samples by the batch they were processed in.
clinical_data.bcr    string    Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”.
clinical_data.clinical_M    string    Extent of the distant metastasis for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_N    string    Extent of the regional lymph node involvement for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_T    string    Extent of the primary cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_stage    string    Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M), and by grouping cases with similar prognosis for cancer.
clinical_data.colorectal_cancer    string    Text term to signify whether a patient has been diagnosed with colorectal cancer.
clinical_data.country    string    Text to identify the name of the state, province, or country in which the sample was procured.
clinical_data.days_to_birth    string    Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.days_to_initial_pathologic_diagnosis    string    Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer.
clinical_data.days_to_last_followup    string    Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.ethnicity    string    The text for reporting information about ethnicity, based on the Office of Management and Budget (OMB) categories.
clinical_data.frozen_specimen_anatomic_site    string    Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample.
clinical_data.gender    string    Text designations that identify gender.
clinical_data.histological_type    string    Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
clinical_data.history_of_colon_polyps    string    Yes/No indicator to describe if the subject had a previous history of colon polyps, as noted in the history/physical or previous endoscopic report(s).
clinical_data.history_of_neoadjuvant_treatment    string    Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor.
clinical_data.history_of_prior_malignancy    string    Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.hpv_calls    string    Results of HPV tests.
clinical_data.hpv_status    string    Current HPV status.
clinical_data.icd_10    string    The tenth version of the International Classification of Disease (ICD).
clinical_data.icd_o_3_histology    string    The third edition of the International Classification of Diseases for Oncology.
clinical_data.icd_o_3_site    string    The third edition of the International Classification of Diseases for Oncology.
clinical_data.lymphatic_invasion    string    A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement.
clinical_data.lymphnodes_examined    string    The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease.
clinical_data.lymphovascular_invasion_present    string    The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen.
clinical_data.menopause_status    string    Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea.
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel.
clinical_data.mononucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel.
clinical_data.neoplasm_histologic_grade    string    Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
clinical_data.new_tumor_event_after_initial_treatment    string    Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
clinical_data.number_of_lymphnodes_examined    string    The total number of lymph nodes removed and pathologically assessed for disease.
clinical_data.number_of_lymphnodes_positive_by_he    string    Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy.
clinical_data.pathologic_M    string    Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N    string    The codes that represent the stage of cancer based on the nodes present (N stage), according to criteria based on multiple editions of the AJCC’s Cance
clinical_data.pathologic_stage    string    The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria.
clinical_data.pathologic_T    string    Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status    string    The state or condition of an individual’s neoplasm at a particular point in time.
clinical_data.pregnancies    string    Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx    string    Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success    string    Measure of Success.
clinical_data.prior_dx    string    Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race    string    The text for reporting information about race, based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor    string    Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history    string    Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site    string    Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type    string    Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status    string    The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion    string    The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis    string    Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer.
samples[]    list    List of barcodes of samples taken from this participant.
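As a minimal sketch of calling this endpoint from Python, the snippet below builds the request URL from the base address given in the HTTP request above and, optionally, fetches and decodes the JSON response. It assumes that patient_barcode is sent as a query-string parameter and that the service is reachable from your machine; the helper names patient_details_url and fetch_patient_details are illustrative only and are not part of the ISB-CGC API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the HTTP request documented above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the patient_details request URL (hypothetical helper)."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes have length 12, e.g. TCGA-B9-7268")
    # Assumption: the barcode is passed as a query-string parameter.
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


def fetch_patient_details(patient_barcode):
    """Fetch and decode the JSON response; requires network access."""
    with urlopen(patient_details_url(patient_barcode)) as response:
        return json.loads(response.read().decode("utf-8"))


print(patient_details_url("TCGA-B9-7268"))
```

Only the URL construction runs here; calling fetch_patient_details("TCGA-B9-7268") would return a dictionary whose clinical_data entry holds the fields listed in the table above.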

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available “biospecimen” information about this sample, the associated patient barcode, a list of associated aliquots, and a list of “data_details” blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.
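The two example barcodes in this section differ only in their suffix: the 16-character sample barcode TCGA-B9-7268-01A begins with the 12-character participant barcode TCGA-B9-7268. A short sketch of that relationship follows. The regular expression is an assumption about the barcode layout based on these two examples, and participant_from_sample is a hypothetical helper, not an API method.

```python
import re

# Assumed layout: TCGA-<2 characters>-<4 characters>-<2 digits><1 letter>.
SAMPLE_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-[0-9]{2}[A-Z]$")


def participant_from_sample(sample_barcode):
    """Return the 12-character participant barcode embedded in a sample barcode."""
    if not SAMPLE_RE.match(sample_barcode):
        raise ValueError("unexpected sample barcode: %r" % sample_barcode)
    return sample_barcode[:12]


print(participant_from_sample("TCGA-B9-7268-01A"))
```

Checking the barcode shape before sending a request is a cheap way to catch a participant barcode passed where a sample barcode is expected.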

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get information about.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Table 1.3

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
aliquots[]    list    List of barcodes of aliquots taken from this participant.
biospecimen_data    nested object    Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode    string    Participant barcode.
biospecimen_data.Project    string    Project name, e.g. “TCGA”.
biospecimen_data.SampleBarcode    string    Sample barcode.
biospecimen_data.Study    string    Tumor type abbreviation, e.g. “BRCA”.
biospecimen_data.avg_percent_lymphocyte_infiltration    integer    Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration    integer    Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis    integer    Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration    integer    Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells    integer    Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells    integer    Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells    integer    Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei    integer    Average percent tumor nuclei.
biospecimen_data.batch_number    string    Batch number in which the sample was processed.
biospecimen_data.bcr    string    Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”.
biospecimen_data.days_to_collection    string    Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration    string    Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration    string    Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis    string    Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration    string    Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells    string    Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells    string    Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells    string    Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei    string    Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration    string    Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration    string    Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis    string    Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration    string    Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells    string    Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells    string    Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells    string    Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei    string    Minimum percent tumor nuclei.
data_details[]    list    List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath    string    Path to file, if it exists.
data_details[].DataCenterName    string    Short name of the contributing data center, e.g. “bcgsc.ca”.
data_details[].DataCenterType    string    Abbreviation of the type of contributing data center, e.g. “cgcc”.
data_details[].DataFileName    string    Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey    string    Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded    string    Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel    string    Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype    string    Data type, e.g. “Complete Clinical Set CNV (SNP Array)”, “DNA Methylation”, “Expression-Protein”, “Fragment Analysis Results”, “miRNASeq”, “Protected Mutations”, “RNASeq”, “RNASeqV2”, “Somatic Mutations”, “TotalRNASeqV2”.
data_details[].GenomeReference    string    Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. “hg19 (GRCh37)”.
data_details[].Pipeline    string    A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. “bcgsc.ca__miRNASeq”.
data_details[].Platform    string    A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. “IlluminaHiSeq_miRNASeq”.
data_details[].platform_full_name    string    The full name of the sequencing platform used, e.g. “Illumina HiSeq 2000”, “Ion Torrent PGM”, “AB SOLiD System 2.0”.
data_details[].Project    string    The study for which the data was generated, e.g. “TCGA”.
data_details[].Repository    string    A storage location where files are deposited and made available, e.g. “DCC”, “CGHub”.
data_details[].SDRFFileName    string    Name of SDRF file stored on the DCC file system, e.g. “bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt”.
data_details[].SampleBarcode    string    Sample barcode.
data_details[].SecurityProtocol    string    An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. “DBGap Protected Access”, “DBGap Open Access”.
data_details_count    string    Length of data_details list.
patient    string    Participant barcode.
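Once a sample_details response has been decoded into a Python dictionary, the data_details list can be filtered like any other list of records. The sketch below collects the non-empty CloudStoragePath values, optionally restricted to one platform. The helper name cloud_paths and the abbreviated example response (including the bucket name) are made up for illustration and are not real ISB-CGC output.

```python
def cloud_paths(sample_details, platform=None):
    """Return the non-empty CloudStoragePath values from a decoded response."""
    paths = []
    for entry in sample_details.get("data_details", []):
        # Skip files from other platforms when a platform filter is given.
        if platform is not None and entry.get("Platform") != platform:
            continue
        path = entry.get("CloudStoragePath")
        if path:  # the path is only present "if it exists"
            paths.append(path)
    return paths


# Hypothetical, heavily abbreviated response used only for illustration.
example = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "CloudStoragePath": "gs://example-bucket/a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "CloudStoragePath": ""},
        {"Platform": "OtherPlatform",
         "CloudStoragePath": "gs://example-bucket/b.txt"},
    ],
}

print(cloud_paths(example, platform="IlluminaHiSeq_miRNASeq"))
```

Using dict.get throughout keeps the helper tolerant of fields that are missing from a particular response.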

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP-authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get file paths for.
platform    string    Optional. Filter file results by platform.
pipeline    string    Optional. Filter file results by pipeline.
token    string    Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
count    string    Integer representing the length of the datafilenamekeys list.
datafilenamekeys[]    list    List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with “file-path-not-yet-available”. If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
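Two details of this endpoint are easy to get wrong in client code: the optional parameters should be sent only when they are set, and the datafilenamekeys field may be absent from the response altogether. The following sketch handles both. It assumes that the parameters travel in the query string and that a placeholder entry can be recognized by the substring “file-path-not-yet-available”; the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request documented above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, including only the optional parameters that are set."""
    params = {"sample_barcode": sample_barcode}
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def available_paths(response):
    """Return usable file paths from a decoded response dictionary."""
    # The field is missing entirely when no file paths are listed.
    keys = response.get("datafilenamekeys", [])
    # Assumption: placeholder entries contain this marker string.
    return [k for k in keys if "file-path-not-yet-available" not in k]


print(datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

A caller without dbGaP authorization would simply omit token and should expect available_paths to return only open-access paths, or an empty list.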

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
count    string    The number of items returned. Count will be either “0” or “1”.
items[]    list    If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode    string    The sample barcode passed into the request.
items[].GG_dataset_id    string    The dataset id of the sample.
items[].GG_readgroupset_id    string    The readgroupset id of the sample.
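Because the items list holds at most one object, client code usually wants either a pair of ids or nothing. A small sketch of that unpacking is shown below. The helper name genomics_ids and the id values in the example are hypothetical, and the helper relies only on the items list rather than on count.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    items = response.get("items") or []
    if not items:
        return None
    item = items[0]  # the list contains at most one object
    return item.get("GG_dataset_id"), item.get("GG_readgroupset_id")


# Hypothetical response; the id values are placeholders, not real identifiers.
example = {
    "kind": "cohort_api#cohortsItem",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}

print(genomics_ids(example))
print(genomics_ids({"kind": "cohort_api#cohortsItem"}))
```

Returning None for the empty case lets the caller distinguish "no Google Genomics data for this sample" from a malformed response, which would raise instead.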

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time: you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components” and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell, which provides you with command-line access to computing resources hosted on GCP, is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
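If you script your setup, the gcloud config set calls for several properties can be generated from a single dictionary. This is only a sketch: the project name my-gcp-project is a placeholder, config_set_commands is a hypothetical helper, and the commands are printed rather than executed.

```python
import shlex


def config_set_commands(settings):
    """Return one `gcloud config set` argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in settings.items()]


# Placeholder values; substitute your own project and preferred zone.
settings = {
    "project": "my-gcp-project",
    "compute/zone": "us-central1-a",
}

for argv in config_set_commands(settings):
    print(shlex.join(argv))
```

Each argument list can be passed to subprocess.run(argv, check=True) on a machine where the Cloud SDK is installed and you have already authenticated.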


Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, Red Hat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
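The same create command can be assembled programmatically, which is convenient when launching several similarly configured VMs. The sketch below only builds and prints the argument list. create_instance_command is a hypothetical helper, and the optional --zone and --preemptible flags are included as illustrations of how further settings could be added.

```python
import shlex


def create_instance_command(name, machine_type="g1-small", zone=None,
                            preemptible=False):
    """Build the argument list for `gcloud compute instances create`."""
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # Without --zone, gcloud falls back to the compute/zone property.
        argv += ["--zone", zone]
    if preemptible:
        argv.append("--preemptible")
    return argv


print(shlex.join(create_instance_command("my-instance")))
print(shlex.join(create_instance_command("my-instance", zone="us-central1-a")))
```

To launch the VM for real, pass the list to subprocess.run(argv, check=True); keeping the command as a list avoids shell-quoting problems with instance names.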

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
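Both figures quoted above work out to the same rate of $0.04 per GB per month, which makes it easy to estimate what a stopped VM's disk will keep costing. The sketch below simply encodes that arithmetic; the rate is inferred from the two examples in this section and is an assumption, so check current pricing before budgeting.

```python
# Rate implied by the examples above ($2 for 50 GB, $40 for 1 TB).
# Assumption for illustration only; actual pricing may differ.
USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, months=1):
    """Estimated cost in USD of keeping a standard persistent disk."""
    return round(size_gb * USD_PER_GB_MONTH * months, 2)


if __name__ == "__main__":
    for size_gb in (50, 500, 1000):
        print(size_gb, "GB:", monthly_disk_cost(size_gb), "USD per month")
```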

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. Device names can vary, so run ls /dev/disk/by-id/ on the instance to confirm the path before formatting (the name may carry a google- prefix).

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
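The full life cycle described in this section (create, attach, format, mount, unmount, detach, delete) can be summarized as an ordered checklist. The sketch below only builds the command strings and marks where each one runs; the function name disk_lifecycle is hypothetical, and the device path follows the form used in this section, which you should verify on your own instance.

```python
def disk_lifecycle(instance, disk, size="500GB", mount_point="/mnt/pd1"):
    """Return (where, command) pairs for the life cycle of an extra disk.

    "local" commands run on your workstation; "vm" commands run on the
    instance after `gcloud compute ssh`.
    """
    # Device path as written in this section; confirm with `ls /dev/disk/by-id/`.
    device = f"/dev/disk/by-id/{disk}"
    return [
        ("local", f"gcloud compute disks create {disk} --size {size}"),
        ("local", f"gcloud compute instances attach-disk {instance} --disk {disk}"),
        ("vm", f"sudo mkfs.ext4 -F {device}"),  # erases any existing data
        ("vm", f"sudo mkdir {mount_point}"),
        ("vm", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("vm", f"sudo umount {device}"),
        ("local", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


if __name__ == "__main__":
    for where, command in disk_lifecycle("my-instance", "disk-1"):
        print(f"[{where}] {command}")
```

Keeping the steps as data rather than executing them makes the ordering explicit: the disk is unmounted before it is detached, and detached before it is deleted.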

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
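The three operations correspond to ordinary set operations on the sample identifiers in each cohort. The sketch below illustrates the semantics with Python sets; it is an illustration only, not the web-app's implementation, and the function names and sample identifiers are made up.

```python
def union(*cohorts):
    """Samples that appear in any of the cohorts (order does not matter)."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort (order does not matter)."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Samples in the base cohort after subtracting all other cohorts.

    Unlike union and intersect, the result depends on which cohort is the base.
    """
    return set(base).difference(*others)


if __name__ == "__main__":
    a = {"S1", "S2", "S3"}  # hypothetical sample identifiers
    b = {"S2", "S3", "S4"}
    print(sorted(union(a, b)))
    print(sorted(intersect(a, b)))
    print(sorted(complement(a, b)))
```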

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the order is ascending or descending.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span across multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
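The counts shown next to each filter value can be understood as "faceted" counts: for a given attribute, count the samples that pass every selected filter except the ones on that attribute itself. The sketch below is one way to express that rule; it is an assumption about the behavior described above, not the web-app's code, and the field names and records are hypothetical.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count values of `attribute` among samples passing all OTHER filters.

    `samples` is a list of dicts; `selected` maps an attribute name to the
    set of values currently selected for it.
    """
    others = {k: v for k, v in selected.items() if k != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample.get(k) in allowed for k, allowed in others.items()):
            counts[sample.get(attribute)] += 1
    return counts


if __name__ == "__main__":
    samples = [  # hypothetical records
        {"vital_status": "Alive", "gender": "FEMALE"},
        {"vital_status": "Dead", "gender": "FEMALE"},
        {"vital_status": "Alive", "gender": "MALE"},
    ]
    selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
    print(dict(facet_counts(samples, selected, "vital_status")))
```

Excluding an attribute's own filter from its counts is what lets you see how many samples you would gain by selecting an additional value for that attribute.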

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears and you are prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
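To see why the two numbers differ, recall that TCGA barcodes are hyphen-delimited and that the first three fields identify the participant, so several sample barcodes can share one participant prefix. The sketch below counts both; the function names are hypothetical, and it assumes the cohort is available as a list of sample barcodes.

```python
def participant_barcode(sample_barcode):
    """Return the participant portion (first three fields) of a TCGA barcode."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of distinct samples, number of distinct participants)."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(barcode) for barcode in samples}
    return len(samples), len(participants)


if __name__ == "__main__":
    # Illustrative barcodes: two samples from one participant, one from another.
    cohort = ["TCGA-02-0001-01C", "TCGA-02-0001-10A", "TCGA-02-0003-01A"]
    print(cohort_counts(cohort))
```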

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
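Once downloaded, the CSV file list can be processed with a few lines of Python, for example to group Cloud Storage paths by platform. This is a hedged sketch: the exact column headers in the downloaded file are an assumption based on the field names listed above (the path column is configurable for that reason), and the rows shown are made-up placeholders.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text, path_column="File Path"):
    """Group file paths by the value in the "Platform" column.

    The header names are assumed; adjust `path_column` to match your file.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row[path_column])
    return dict(grouped)


if __name__ == "__main__":
    # Placeholder content; a real file list comes from "Download File List as CSV".
    example = (
        "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
        "SAMPLE-1,PLATFORM-A,pipeline-x,Level 3,gs://example-bucket/a.txt\n"
        "SAMPLE-2,PLATFORM-B,pipeline-y,Level 3,gs://example-bucket/b.txt\n"
    )
    for platform, paths in paths_by_platform(example).items():
        print(platform, paths)
```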

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platform.

  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platform.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platform.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

  – Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox overrides any feature set in Color By Feature. The selected cohorts are used for the legend and for coloring.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.
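The Owner and Reader rules above can be summarized in a small model. This is a sketch of the rules as described in this section, not the web-app's actual data model; the class name Shareable, its fields, and the user names are hypothetical.

```python
from dataclasses import dataclass, field

OWNER, READER = "owner", "reader"


@dataclass
class Shareable:
    """A cohort or visualization with an owner and zero or more readers."""
    name: str
    owner: str
    readers: set = field(default_factory=set)

    def permission(self, user):
        if user == self.owner:
            return OWNER
        if user in self.readers:
            return READER
        return None  # private by default

    def can_view(self, user):
        return self.permission(user) is not None

    def can_comment(self, user):
        return self.can_view(user)

    def can_edit(self, user):
        return self.permission(user) == OWNER

    def can_share(self, user):
        return self.permission(user) == OWNER

    def clone(self, user):
        """Copy the item; the user making the copy becomes its owner."""
        if not self.can_view(user):
            raise PermissionError(f"{user} cannot view {self.name}")
        return Shareable(name=f"Copy of {self.name}", owner=user)


if __name__ == "__main__":
    cohort = Shareable("my cohort", owner="alice", readers={"bob"})
    copy = cohort.clone("bob")
    print(cohort.can_edit("bob"), copy.can_edit("bob"))
```

Note that the clone starts with no readers, which mirrors the rule that new cohorts and visualizations are private to the user who created them.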

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


microRNA Expression

The current ISB TCGA data pipeline uses a Perl script, expression_matrix_mimat.pl, provided by BCGSC, which reads the isoform data files and outputs expression values for "mature microRNAs". This output matrix contains a consistent number of mature microRNAs, referred to using a combination of the microRNA gene name and the unique accession number, e.g. "hsa-mir-21.MIMAT0000076". During ETL, this string is split into two parts, which are stored as separate columns in the BigQuery table. The entire matrix is then melted into a flat structure (known as the tidy data format) and loaded into the table.
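The split-and-melt step described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline's actual ETL code: it assumes the gene name and accession are joined by a period, and the record keys (mirna_id, mirna_accession, AliquotBarcode, expression) and the aliquot labels are hypothetical names chosen for the example.

```python
def split_mirna_id(combined_id):
    """Split a combined ID such as 'hsa-mir-21.MIMAT0000076' into
    (gene name, accession), assuming a period separates the two parts."""
    name, accession = combined_id.rsplit(".", 1)
    return name, accession


def melt_matrix(row_ids, aliquot_labels, matrix):
    """Flatten a (mature microRNA x aliquot) matrix into one record per cell.

    The dictionary keys below are hypothetical column names, not the
    actual BigQuery schema.
    """
    records = []
    for row_id, row in zip(row_ids, matrix):
        name, accession = split_mirna_id(row_id)
        for label, value in zip(aliquot_labels, row):
            records.append({
                "mirna_id": name,
                "mirna_accession": accession,
                "AliquotBarcode": label,
                "expression": value,
            })
    return records


# One microRNA measured in two (made-up) aliquots yields two tidy records.
tidy = melt_matrix(
    ["hsa-mir-21.MIMAT0000076"],
    ["aliquot-A", "aliquot-B"],
    [[10.5, 3.0]],
)
```

Each cell of the matrix becomes its own row, which is what makes the result convenient to load into a single flat table.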

Only the isoform files matching the pattern isoform.quantification.txt and containing hg19 data were used. The aliquot barcode information was obtained from the SDRF file associated with the Level-3 isoform data file.


Protein

The raw protein data file contains just two columns: the "Composite Element REF", which corresponds to the third column in the antibody annotation file, and the estimated expression value for that particular protein. The "Composite Element REF" was parsed to generate additional information (see details in the Formatting section). The BigQuery table was populated with all TCGA Level-3 RPPA data matching the pattern "_RPPA_Core.protein_expression.txt".

The antibody annotation files are parsed to get the relationship between the antibody name and the associated proteins and genes. Below is a detailed explanation of how the antibody, gene, and protein map is generated.

Generation of Composite_element_ref, gene, and protein name map (manual curation of the gene and protein names)

• Check the antibody annotation files for missing columns

• If "protein_name" is missing, generate one from "composite_element_ref"

• Make a map of 'composite_element_ref', 'gene_name', 'protein_name' values

• Check any other variant of the gene and protein symbols in the table

• HGNC Validation

– If the gene symbol is in the HGNC approved symbols: 'Approved' Gene_symbol = Gene_symbol

– If not, check the Alias symbols. If found: Gene_symbol = Alias_symbol

– If not, check the Previous symbols. If found: Gene_symbol = "Approved" Gene_symbol

– If not: Gene_symbol = Gene_symbol

• The file generated is manually curated and fed back into the algorithm
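The HGNC validation cascade in the list above can be read as a short lookup function. The following is a minimal sketch under one reading of that list, not the curation code itself: the three lookup structures and the toy symbols (GENE1, G1ALT, OLDGENE2, GENE2) are invented for illustration, and the alias branch keeps the alias symbol because that is what the list literally states.

```python
def validate_gene_symbol(symbol, approved, alias_symbols, previous_to_approved):
    """Return the gene symbol to store, following the cascade described above.

    approved             -- set of HGNC approved symbols
    alias_symbols        -- set of known alias symbols
    previous_to_approved -- dict mapping a previous symbol to its approved symbol
    """
    if symbol in approved:
        return symbol                        # already an approved symbol
    if symbol in alias_symbols:
        return symbol                        # list says: Gene_symbol = Alias_symbol
    if symbol in previous_to_approved:
        return previous_to_approved[symbol]  # replace with the approved symbol
    return symbol                            # unknown: leave unchanged


# Toy lookup tables (made-up symbols, not real HGNC content).
APPROVED = {"GENE1", "GENE2"}
ALIASES = {"G1ALT"}
PREVIOUS = {"OLDGENE2": "GENE2"}

resolved = [
    validate_gene_symbol(s, APPROVED, ALIASES, PREVIOUS)
    for s in ["GENE1", "G1ALT", "OLDGENE2", "MYSTERY"]
]
```

Symbols that fall through every check are returned unchanged, which is where the manual curation step mentioned in the last bullet would come in.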

Formatting

• Duplicate the rows if there are multiple genes concatenated in the "gene_name" value. For example, a 'gene_name' with a value like 'AKT1 AKT2 AKT3' is stored as three separate rows, with each gene in its own row

• 'Protein_Name' is split into 'Protein_Basename' and 'Phospho', which are stored as separate columns

• 'Composite element ref' is parsed to get 'validationStatus' and 'antibodySource'; both are stored as separate columns in the BigQuery table

• Data from both Illumina GA and HiSeq platforms are stored in the same table
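The row-duplication rule in the first bullet can be sketched as follows. This is an illustration only: it assumes the concatenated gene names are separated by whitespace and/or commas (the exact delimiter is not stated here), and the protein_name value in the example is a placeholder.

```python
import re


def expand_gene_rows(row):
    """Return one copy of `row` per gene listed in row['gene_name'].

    Assumes genes are separated by whitespace and/or commas.
    """
    genes = [g for g in re.split(r"[,\s]+", row["gene_name"]) if g]
    return [dict(row, gene_name=gene) for gene in genes]


# A single annotation row naming three genes becomes three rows.
expanded = expand_gene_rows({"gene_name": "AKT1 AKT2 AKT3", "protein_name": "Akt"})
```

All other columns are copied unchanged into each duplicated row, so downstream queries can filter on a single gene symbol.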


1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms, such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.
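The cross-referencing idea can be illustrated with a small in-memory join keyed on the CpG identifier. This is a sketch under stated assumptions rather than a query against the real tables: the key name probe_id, the other field names, and all values are placeholders invented for the example.

```python
def annotate_methylation(data_rows, annotation_rows, key="probe_id"):
    """Attach platform annotation to each methylation row that shares
    the same CpG identifier. `key` is a hypothetical column name."""
    annotation_by_probe = {a[key]: a for a in annotation_rows}
    joined = []
    for row in data_rows:
        annotation = annotation_by_probe.get(row[key])
        if annotation is not None:
            merged = dict(annotation)  # start from the annotation fields
            merged.update(row)         # then add the measured values
            joined.append(merged)
    return joined


# Placeholder rows standing in for the data table and the annotation table.
methylation = [
    {"probe_id": "cg_example_1", "beta_value": 0.42},
    {"probe_id": "cg_example_2", "beta_value": 0.10},
]
annotation = [{"probe_id": "cg_example_1", "gene_symbol": "GENE1"}]

annotated = annotate_methylation(methylation, annotation)
```

In BigQuery the same lookup would normally be written as a join on the identifier column; the dictionary lookup here just shows the logic on a small scale.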

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables are available in the isb-cgc:genome_reference dataset.

GENCODE Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.


miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.


1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()


Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics


1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user-data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples illustrating how you can work with these TCGA data, using a variety of programming environments such as Python and R or grid computing systems such as GridEngine, are provided in our GitHub repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm's resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud-computing. This means that researchers will need to learn a new skill-set.


1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and also to add your own examples to enrich this public resource.


1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
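The relationship between sample and patient barcodes described above is simple enough to sketch directly. The barcodes in this example are made up to match the stated lengths, and the dictionary keys are hypothetical; this is not the Cohort API's response format.

```python
def patient_barcode(sample_barcode):
    """Return the 12-character patient prefix of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode")
    return sample_barcode[:12]


def cohort_lists(sample_barcodes):
    """Build the two lists that make up a cohort: samples and patients."""
    samples = sorted(set(sample_barcodes))
    patients = sorted({patient_barcode(s) for s in samples})
    return {"samples": samples, "patients": patients}


# Two made-up samples from the same patient give one patient barcode.
cohort = cohort_lists(["TCGA-AA-0001-01A", "TCGA-AA-0001-10A"])
```

Because several samples can share one patient prefix, the patient list is generally shorter than the sample list.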

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
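The same request can be prepared from Python with the standard library. The sketch below only builds the request object and does not send it. The URL is reconstructed on the assumption that the endpoint follows the usual Cloud Endpoints path layout (/_ah/api/<api>/<version>/<method>), and the filter value passed in the example is illustrative rather than a confirmed payload format.

```python
import json
import urllib.request

# Assumed endpoint URL, reconstructed from the curl example above.
PREVIEW_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_request(filters):
    """Build (but do not send) a JSON POST request for the preview endpoint."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative filter; the accepted filter syntax is an assumption here.
request = build_preview_request({"Study": "BRCA"})
```

To actually call the service you would pass the request to urllib.request.urlopen and decode the JSON response, provided the endpoint is reachable from your environment.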

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}
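Before sending a filter object, it can be handy to check it against the field types listed in the resource above. The following is a minimal client-side sketch, not part of the ISB-CGC API: FIELD_TYPES copies only a handful of the listed fields, and treating the listed types as Python str/int/float is an assumption made for illustration.

```python
# A small subset of the fields listed above, mapped to Python types.
FIELD_TYPES = {
    "Study": str,
    "gender": str,
    "batch_number": int,
    "height": int,
    "psa_value": float,
    "avg_percent_necrosis": float,
}


def check_filters(filters, field_types=FIELD_TYPES):
    """Return a list of problems found in a filter dictionary (empty if none)."""
    problems = []
    for name, value in filters.items():
        expected = field_types.get(name)
        if expected is None:
            problems.append("unknown field: " + name)
            continue
        # Accept whole numbers where a float is expected; reject booleans.
        accepted = (int, float) if expected is float else (expected,)
        if isinstance(value, bool) or not isinstance(value, accepted):
            problems.append("%s should be %s" % (name, expected.__name__))
    return problems


ok = check_filters({"Study": "BRCA", "height": 170})
bad = check_filters({"height": "tall", "shoe_size": 9})
```

A check like this catches misspelled field names early, before a request is made.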

Parameter name Value Descriptionadenocarcinoma_invasion stringage_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed (in years)anatomic_neoplasm_subdivision string Text term to describe the spatial location subdivisions andor anatomic site name of a tumoravg_percent_lymphocyte_infiltration float Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimenavg_percent_monocyte_infiltration float Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenavg_percent_necrosis float Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenavg_percent_neutrophil_infiltration float Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenavg_percent_normal_cells float Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenavg_percent_stromal_cells float Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenavg_percent_tumor_cells float Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenavg_percent_tumor_nuclei float Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenbatch_number integer groups samples by the batch they were processed inbcr string A TCGA center where samples are carefully catalogued processed quality-checked and stored along with participant clinical informationclinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_N string Extent of the 
regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_stage string Stage group determined from clinical information on the tumor (T) regional node (N) and metastases (M) and by grouping cases with similar prognosis clinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentcolorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancercountry string Text to identify the name of the state province or country in which the sample was procuredcountry_of_procurement string Text to identify the name of the state province or country in which the sample was procureddays_to_birth integer Time interval from a personrsquos date of birth to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_collection integerdays_to_death integer Time interval from a personrsquos date of death to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_initial_pathologic_diagnosis integer Numeric value to represent the day of an individualrsquos initial pathologic diagnosis of cancerdays_to_last_followup integer Time interval from the date of last followup to the date of initial pathologic diagnosis represented as a calculated number of daysdays_to_submitted_specimen_dx integer Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis represented as a calculated number of dStudy string A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study Within the projecethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categoriesfrozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the 
frozen biospecimen tumor tissue sample
gender    string    Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height    integer    The height of the patient in centimeters
histological_type    string    Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps    string    Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment    string    Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy    string    Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls    string    Results of HPV tests
hpv_status    string    Current HPV status
icd_10    string    The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology    string    The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site    string    The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count    integer
lymphatic_invasion    string    A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined    string    The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present    string    The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration    integer    Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration    integer    Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen


Table 1.1 (continued)
Parameter name    Value    Description
max_percent_necrosis    integer    Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration    integer    Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells    integer    Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells    integer    Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells    integer    Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei    integer    Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status    string    Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration    integer    Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration    integer    Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis    integer    Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration    integer    Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells    integer    Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells    integer    Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells    integer    Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei    integer    Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade    string    Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment    string    Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined    integer    The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he    integer    Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode    string    The barcode assigned by TCGA to the Participant
pathologic_M    string    Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N    string    The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage    string    The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T    string    Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status    string    The state or condition of an individual's neoplasm at a particular point in time
pregnancies    string    Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method    string
primary_neoplasm_melanoma_dx    string    Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success    string    Measure of Success
prior_dx    string    Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project    string    The study for which the data was generated
psa_value    float    The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race    string    The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor    string    Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode    string    The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history    string    Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies    integer
tumor_tissue_site    string    Text term that describes the anatomic site of the tumor or disease
tumor_pathology    string
tumor_type    string    Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion    string    The result of an assessment using the Weiss histopathologic criteria
vital_status    string    The survival state of the person registered on the protocol
weight    integer    The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis    string    Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode    string    The type of the sample: tumor or normal tissue, cell, or blood sample provided by a participant
has_Illumina_DNASeq    string    Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq    string    Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"


Table 1.1 (continued)
Parameter name    Value    Description
has_UNC_HiSeq_RNASeq    string    Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq    string    Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq    string    Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq    string    Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq    string    Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA    string    Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6    string    Indicates if a sample has copy number data: "True", "False", or "None"
has_27k    string    Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k    string    Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
patient_count    string    Number of participants in this cohort
patients[]    list    List of participant barcodes in this cohort
sample_count    string    Number of samples in this cohort
samples[]    list    List of sample barcodes in this cohort
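
Because patient_count and sample_count are typed as strings in the response described above, a client will usually want to convert them before doing arithmetic. The following is a minimal sketch under stated assumptions: the payload is hand-written for illustration (the barcodes are not real query results) and summarize_cohort is a hypothetical helper name, not part of the API.

```python
import json

# Hand-written payload shaped like the response body described above.
# The barcodes are illustrative placeholders, not real query results.
SAMPLE_BODY = """
{
  "patient_count": "2",
  "patients": ["TCGA-B9-7268", "TCGA-XX-0000"],
  "sample_count": "3",
  "samples": ["TCGA-B9-7268-01A", "TCGA-XX-0000-01A", "TCGA-XX-0000-10A"]
}
"""


def summarize_cohort(body):
    """Hypothetical helper: parse a response body and convert string counts to int."""
    data = json.loads(body)
    return {
        "patient_count": int(data.get("patient_count", "0")),
        "sample_count": int(data.get("sample_count", "0")),
        "patients": data.get("patients", []),
        "samples": data.get("samples", []),
    }


summary = summarize_cohort(SAMPLE_BODY)
print(summary["patient_count"], summary["sample_count"])
```

Using .get() with defaults keeps the helper tolerant of a response in which a list or count is absent.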

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name    Value    Description
Path parameters
patient_barcode    string    Barcode of the patient to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
aliquots[]    list    List of barcodes of aliquots taken from this participant
clinical_data    nested object    The clinical data about the participant
clinical_data.ParticipantBarcode    string    Participant barcode
clinical_data.Project    string    Project name, e.g. "TCGA"
clinical_data.Study    string    Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis    string    Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision    string    Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number    string    Groups samples by the batch they were processed in
clinical_data.bcr    string    Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M    string    Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N    string    Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T    string    Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage    string    Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer    string    Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country    string    Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth    string    Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis    string    Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup    string    Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity    string    The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site    string    Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender    string    Text designations that identify gender
clinical_data.histological_type    string    Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps    string    Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment    string    Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy    string    Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls    string    Results of HPV tests
clinical_data.hpv_status    string    Current HPV status
clinical_data.icd_10    string    The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology    string    The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site    string    The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion    string    A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined    string    The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present    string    The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status    string    Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade    string    Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment    string    Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined    string    The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he    string    Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M    string    Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N    string    The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage    string    The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T    string    Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American


Table 1.2 (continued)
Property name    Value    Description
clinical_data.person_neoplasm_cancer_status    string    The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies    string    Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx    string    Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success    string    Measure of Success
clinical_data.prior_dx    string    Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race    string    The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor    string    Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history    string    Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site    string    Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type    string    Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status    string    The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion    string    The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis    string    Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[]    list    List of barcodes of samples taken from this participant
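
A request to patient_details might be assembled as in the sketch below, which uses only the Python standard library. Several things here are assumptions rather than documented behavior: the base URL is read from the HTTP request line above with its punctuation restored, the barcode is passed as a query-string parameter named patient_barcode, and the helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL, read from the HTTP request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the request URL (hypothetical helper).

    Passing the barcode as a query-string parameter is an assumption.
    """
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes have 12 characters, e.g. TCGA-B9-7268")
    query = urllib.parse.urlencode({"patient_barcode": patient_barcode})
    return f"{BASE_URL}/patient_details?{query}"


def fetch_patient_details(patient_barcode, timeout=30):
    """Send the GET request and decode the JSON body (needs network access)."""
    url = patient_details_url(patient_barcode)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Building the URL does not touch the network.
print(patient_details_url("TCGA-B9-7268"))
```

No credentials are attached because the method description states that the user does not need to be authenticated.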


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.
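
The two example barcodes in this documentation (TCGA-B9-7268 for a participant, TCGA-B9-7268-01A for a sample) suggest that a sample barcode begins with its participant barcode. The sketch below relies on that assumption to derive one from the other; participant_from_sample is a hypothetical helper, not an API call.

```python
def participant_from_sample(sample_barcode):
    """Return the 12-character participant prefix of a 16-character sample barcode.

    Assumes the sample barcode starts with the participant barcode, as in
    the examples TCGA-B9-7268 and TCGA-B9-7268-01A.
    """
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode, e.g. TCGA-B9-7268-01A")
    return sample_barcode[:12]


print(participant_from_sample("TCGA-B9-7268-01A"))  # prints TCGA-B9-7268
```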

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
aliquots[]    list    List of barcodes of aliquots taken from this participant
biospecimen_data    nested object    Biospecimen data about the sample


Table 1.3 (continued)
Property name    Value    Description
biospecimen_data.ParticipantBarcode    string    Participant barcode
biospecimen_data.Project    string    Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode    string    Sample barcode
biospecimen_data.Study    string    Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration    integer    Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration    integer    Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis    integer    Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration    integer    Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells    integer    Average percent normal cells
biospecimen_data.avg_percent_stromal_cells    integer    Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells    integer    Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei    integer    Average percent tumor nuclei
biospecimen_data.batch_number    string    Batch number in which the sample was processed
biospecimen_data.bcr    string    Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection    string    Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration    string    Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration    string    Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis    string    Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration    string    Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells    string    Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells    string    Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells    string    Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei    string    Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration    string    Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration    string    Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis    string    Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration    string    Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells    string    Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells    string    Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells    string    Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei    string    Minimum percent tumor nuclei
data_details[]    list    List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath    string    Path to file, if it exists
data_details[].DataCenterName    string    Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType    string    Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName    string    Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey    string    Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded    string    Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel    string    Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype    string    Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference    string    Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline    string    A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform    string    A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name    string    The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project    string    The study for which the data was generated, e.g. "TCGA"
data_details[].Repository    string    A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName    string    Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode    string    Sample barcode
data_details[].SecurityProtocol    string    An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"


Table 1.3 (continued)
Property name    Value    Description
data_details_count    string    Length of data_details list
patient    string    Participant barcode
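
Once a sample_details response has been decoded, the data_details list can be grouped in whatever way suits the analysis. The sketch below groups file names by Platform and checks the list against data_details_count. The response excerpt is hand-written for illustration (the file names and the second platform value are placeholders), and files_by_platform is a hypothetical helper.

```python
from collections import defaultdict

# Hand-written excerpt shaped like a decoded sample_details response.
# File names and the second platform value are illustrative placeholders.
response = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "file_a.txt",
         "SecurityProtocol": "DBGap Open Access"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "file_b.txt",
         "SecurityProtocol": "DBGap Open Access"},
        {"Platform": "IlluminaHiSeq_RNASeqV2", "DataFileName": "file_c.txt",
         "SecurityProtocol": "DBGap Protected Access"},
    ],
}


def files_by_platform(resp):
    """Hypothetical helper: map each Platform value to its data file names."""
    grouped = defaultdict(list)
    for detail in resp.get("data_details", []):
        grouped[detail.get("Platform", "unknown")].append(detail.get("DataFileName"))
    return dict(grouped)


grouped = files_by_platform(response)
# data_details_count is typed as a string, so convert it before comparing.
count_matches = int(response["data_details_count"]) == len(response["data_details"])
print(grouped, count_matches)
```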


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. User must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get file paths for.
platform    string    Optional. Filter file results by platform.
pipeline    string    Optional. Filter file results by pipeline.
token    string    Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
count    string    Integer representing the length of the datafilenamekeys list
datafilenamekeys[]    list    List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
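
Two details of this response call for defensive handling on the client side: the datafilenamekeys field may be absent altogether, and some entries may carry the "file-path-not-yet-available" marker instead of a usable path. The sketch below handles both. The exact form of the marker entries and the bucket paths shown are assumptions for illustration, and available_paths is a hypothetical helper.

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(resp):
    """Hypothetical helper: return only the usable file paths.

    Treats a missing datafilenamekeys field as an empty list and drops any
    entry containing the placeholder marker (substring match is an assumption).
    """
    keys = resp.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hand-written examples; bucket names and paths are illustrative only.
with_files = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/path/to/file_a.txt",
        "example-bucket/file-path-not-yet-available",
    ],
}
without_field = {}

print(available_paths(with_files))
print(available_paths(without_field))
```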


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
count    string    The number of items returned. Count will be either "0" or "1".
items[]    list    If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode    string    The sample barcode passed into the request
items[].GG_dataset_id    string    The dataset id of the sample
items[].GG_readgroupset_id    string    The readgroupset id of the sample
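
Since count is either "0" or "1", a client can branch on it before reading items. A minimal sketch, assuming a decoded response in which count sits at the top level as the property table describes; the id values are placeholders and genomics_ids is a hypothetical helper.

```python
def genomics_ids(resp):
    """Hypothetical helper: return (dataset_id, readgroupset_id), or None if there is no item."""
    if int(resp.get("count", "0")) == 0:
        return None
    item = resp["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Hand-written examples; the id values are placeholders, not real ids.
found = {
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
not_found = {"count": "0"}

print(genomics_ids(found))
print(genomics_ids(not_found))
```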


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud
• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account
• project
• compute/region
• compute/zone
• container/cluster
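As an illustration of setting these properties, here is a minimal sketch of a helper that sets the project, region, and zone, then prints the resulting configuration. The function name configure_gcloud and the project ID and zone in the usage line are placeholders, not values from this guide; the region is simply derived by dropping the zone's final suffix.

```shell
# Sketch: set the most commonly used gcloud properties, then list them.
# configure_gcloud is a hypothetical helper name; adapt it to your setup.
configure_gcloud() {
    project="$1"            # your GCP project ID
    zone="$2"               # default Compute Engine zone, e.g. us-central1-a
    region="${zone%-*}"     # drop the trailing "-a" to get us-central1

    gcloud config set project "$project"
    gcloud config set compute/region "$region"
    gcloud config set compute/zone "$zone"
    gcloud config list      # print the resulting configuration
}

# Usage (placeholder values):
#   configure_gcloud my-project-id us-central1-a
```

Once the zone is set this way, later gcloud compute commands no longer need an explicit --zone option.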

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you
• Zone: choose one of the us-east or us-central zones
• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update
• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options, below the "Management, disk, ..." line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
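If you want to control more than the machine type, a slightly fuller invocation might look like the sketch below. It wraps gcloud compute instances create in a hypothetical helper (create_vm is our name, not part of the SDK) and sets the zone, machine type, and boot disk size and type explicitly. The values in the usage line are illustrative only.

```shell
# Sketch: create a VM with an explicit zone, machine type, and boot disk.
# create_vm is a hypothetical wrapper; all arguments are placeholders.
create_vm() {
    name="$1"           # instance name, e.g. my-instance
    zone="$2"           # e.g. us-central1-a
    machine_type="$3"   # e.g. n1-standard-4
    disk_size="$4"      # boot disk size, e.g. 50GB

    gcloud compute instances create "$name" \
        --zone "$zone" \
        --machine-type "$machine_type" \
        --boot-disk-size "$disk_size" \
        --boot-disk-type pd-standard
}

# Usage (placeholder values):
#   create_vm my-instance us-central1-a n1-standard-4 50GB
```

Run gcloud compute instances create --help to see the full set of options available in your installed SDK version.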

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.
• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
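To get a feel for what a stopped VM's disk will cost, the figures quoted above can be turned into a rough estimate. The sketch below assumes a rate of 4 cents per GB per month, which is inferred from the $2 for 50 GB figure and treats 1 TB as 1000 GB. It is not an official price, and current pricing may differ.

```shell
# Rough monthly cost of a standard persistent disk.
# ASSUMPTION: 4 cents per GB per month, inferred from "$2 for 50 GB" above.
PD_STANDARD_CENTS_PER_GB=4

pd_monthly_cost() {
    gb="$1"                                     # disk size in GB
    cents=$((gb * PD_STANDARD_CENTS_PER_GB))    # integer arithmetic in cents
    printf '$%d.%02d\n' $((cents / 100)) $((cents % 100))
}

# Usage:
#   pd_monthly_cost 50      # 50 GB disk
#   pd_monthly_cost 1000    # 1 TB disk
```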

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
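The stop, restart, and delete steps can be wrapped in small helpers, as in the sketch below. The function names are hypothetical; the --quiet option on the delete command suppresses the confirmation prompt, so use it with care.

```shell
# Sketch: helpers for the VM lifecycle commands described above.
# Function names are hypothetical; instance and zone are placeholders.
stop_vm()   { gcloud compute instances stop   "$1" --zone "$2"; }
start_vm()  { gcloud compute instances start  "$1" --zone "$2"; }
delete_vm() { gcloud compute instances delete "$1" --zone "$2" --quiet; }

# Usage (placeholder values):
#   stop_vm my-instance us-central1-a     # stop paying for the VM itself
#   start_vm my-instance us-central1-a    # pick up where you left off
#   delete_vm my-instance us-central1-a   # remove the VM permanently
#   gcloud compute instances list         # check what is still running
```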

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

The --device-name option sets the name under which the disk appears on the instance (as /dev/disk/by-id/google-disk-1 in this example). Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point. If the device path below does not exist on your instance, list /dev/disk/by-id/ to find the name of the attached disk.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir -p /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes", the default, when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
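Putting the steps in this section together, the whole lifecycle of an additional disk could be scripted roughly as follows. This is a sketch under stated assumptions: add_disk and remove_disk are hypothetical helper names, the disk is attached with a device name equal to the disk name (so it appears on the instance as /dev/disk/by-id/google-DISKNAME), and it is mounted at /mnt/DISKNAME. The format and unmount steps run on the instance through the --command option of gcloud compute ssh.

```shell
# Sketch: create, attach, format, and mount an extra persistent disk.
# WARNING: mkfs erases any existing data on the disk.
add_disk() {
    instance="$1"; disk="$2"; size="$3"; zone="$4"
    dev="/dev/disk/by-id/google-$disk"   # assumes --device-name equals the disk name
    mnt="/mnt/$disk"                     # hypothetical mount point

    gcloud compute disks create "$disk" --size "$size" --type pd-standard --zone "$zone"
    gcloud compute instances attach-disk "$instance" --disk "$disk" --device-name "$disk" --zone "$zone"
    gcloud compute ssh "$instance" --zone "$zone" --command \
        "sudo mkfs.ext4 -F $dev && sudo mkdir -p $mnt && sudo mount -o discard,defaults $dev $mnt"
}

# Sketch: unmount, detach, and delete the disk again.
remove_disk() {
    instance="$1"; disk="$2"; zone="$3"

    gcloud compute ssh "$instance" --zone "$zone" --command "sudo umount /mnt/$disk"
    gcloud compute instances detach-disk "$instance" --disk "$disk" --zone "$zone"
    gcloud compute disks delete "$disk" --zone "$zone" --quiet
}

# Usage (placeholder values):
#   add_disk my-instance disk-1 500GB us-central1-a
#   remove_disk my-instance disk-1 us-central1-a
```

Keeping the detach and delete steps separate from creation makes it easy to stop after detaching if you want to keep the data for later.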

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button becomes selectable once at least one cohort is selected. Click on the "Trash" button to delete the selected cohorts. Visualizations are deleted the same way.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button becomes selectable once at least one cohort is selected. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove cohorts you no longer want to share and select one or more users from a list of users already registered in the system. Click the "Share Cohort" button when you are done. Visualizations are shared the same way.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the "Set Operations" button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
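To make the three set operations concrete, the sketch below applies them to plain text files that each hold one sample barcode per line. This only illustrates the semantics; it is not how the web application implements them. The function names and file layout are assumptions, and for brevity the intersect and complement helpers take exactly two files.

```shell
# Sketch: set operations on cohorts stored as text files, one barcode per line.
# Function names and file names are hypothetical.

# Union: every barcode that appears in any of the given cohorts.
cohort_union() { sort -u "$@"; }

# Intersect: barcodes present in both cohorts (order of arguments is irrelevant).
cohort_intersect() { grep -Fxf "$2" "$1" | sort -u; }

# Complement: barcodes in the base cohort ($1) that are NOT in the other ($2).
cohort_complement() { grep -Fxvf "$2" "$1" | sort -u; }

# Usage (placeholder file names):
#   cohort_union cohort_a.txt cohort_b.txt > union.txt
#   cohort_intersect cohort_a.txt cohort_b.txt > both.txt
#   cohort_complement cohort_a.txt cohort_b.txt > only_a.txt
```

Note that the complement is the only operation where the order of the inputs matters, which is why the dialogue asks for a base cohort.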

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering over a swatch between two vertical bars shows the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the "Set Operations" button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you can enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to any filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.
• Comments: Selecting "Comments" will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.
• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears and prompts you to select users registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will download a list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
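Once you have downloaded the file list, standard command-line tools are enough to summarize it. The sketch below assumes a simple comma-separated file with one header row and the five columns in the order listed above, with no commas inside field values. The actual header names and layout of the downloaded file may differ, and the function names are hypothetical.

```shell
# Sketch: summarize a downloaded cohort file list.
# ASSUMED columns: 1=Sample Barcode, 2=Platform, 3=Pipeline,
#                  4=Data Level, 5=Cloud Storage path. One header row.

# Print "platform,count" for every platform in the file list.
count_files_per_platform() {
    awk -F, 'NR > 1 { n[$2]++ } END { for (p in n) print p "," n[p] }' "$1" | sort
}

# Print the Cloud Storage paths of all files from one platform.
list_paths_for_platform() {
    awk -F, -v platform="$2" 'NR > 1 && $2 == platform { print $5 }' "$1"
}

# Usage (placeholder file and platform names):
#   count_files_per_platform file_list.csv
#   list_paths_for_platform file_list.csv SomePlatform > paths.txt
```

The resulting list of paths can then be passed to a tool such as gsutil to copy the files to a VM or bucket of your own.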

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots; each plot is configurable to show relevant data and can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization with the default plot: an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: this input searches through the names of all the available clinical features.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: this button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: this checkbox overrides any feature that is set in Color By Feature. The cohorts provided are then used as the legend and as the Color By Feature.

• Cohorts: this is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button becomes active and you can proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
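The two permission levels can be summarized as a small lookup table. The following is only an illustrative sketch of the rules listed above, not the web-app's actual implementation; the action names, the `Cohort` class and the `clone` helper are hypothetical, and the Owner entry lists only what the text states plus viewing and saving, which are assumed.

```python
from dataclasses import dataclass, field
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    READER = "reader"


# Hypothetical action labels derived from the permission list above.
ALLOWED = {
    Role.OWNER: {"view", "edit", "save", "share"},
    Role.READER: {"view", "comment", "clone"},
}


def can(role: Role, action: str) -> bool:
    """Return True if the given role may perform the action."""
    return action in ALLOWED[role]


@dataclass
class Cohort:
    name: str
    owner: str
    samples: list = field(default_factory=list)
    participants: list = field(default_factory=list)

    def clone(self, new_owner: str) -> "Cohort":
        """Copy only the sample and participant lists; the caller owns the copy."""
        return Cohort(
            name=f"Copy of {self.name}",
            owner=new_owner,
            samples=list(self.samples),
            participants=list(self.participants),
        )
```

A Reader who wants to edit a shared cohort would therefore clone it first and work on the copy, for example `shared.clone("reader@example.org")`.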

1.4.8 Release Notes

• December 23, 2015: v0.2

  – A treemap graph on the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page and, after you successfully authenticate, you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
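If you schedule long-running jobs against controlled-access data, it can help to track when the 24-hour window closes. The sketch below only illustrates that arithmetic; how the platform actually records and enforces the window is not described here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Duration of controlled-data access stated in the answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(authenticated_at: datetime) -> datetime:
    """Return the moment at which controlled-data access would lapse."""
    return authenticated_at + ACCESS_WINDOW


def has_controlled_access(authenticated_at: datetime, now: datetime) -> bool:
    """True while `now` falls inside the 24-hour window after authentication."""
    return authenticated_at <= now < access_expires_at(authenticated_at)


if __name__ == "__main__":
    signed_in = datetime.now(timezone.utc)
    print("Re-authenticate before:", access_expires_at(signed_in).isoformat())
```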

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally, or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

1.2.4 Reference Data

ISB-CGC Hosted Reference Data

In order to facilitate working with the TCGA data tables that the ISB-CGC is hosting in BigQuery, additional reference data tables have also been created. Others are hosted by Google Genomics, and suggestions for more are welcome at feedback@isb-cgc.org.

Platform Reference Data

Some reference data is necessary in order to work with data generated by specific platforms such as the Illumina DNA Methylation array or the Affymetrix Genome-Wide Human SNP Array 6.0. This section will provide links to existing sources of information elsewhere on the web, or will describe additional resources that are hosted by the ISB-CGC. If there are additional platform reference sources that you would like to see hosted in BigQuery tables, please let us know at feedback@isb-cgc.org.

DNA Methylation Platform: Most of the DNA Methylation data produced by the TCGA project was obtained using the Illumina Infinium HumanMethylation450 (aka 450k) BeadChip array. Some of the earlier tumor types were assayed on the older 27k array.

Although additional details can be found at the Illumina webpage, we have uploaded the platform annotation information into the BigQuery table isb-cgc:platform_reference.methylation_annotation.

Each CpG locus is uniquely identified as described in this technical note, and this unique identifier can be used to look up and cross-reference data between the TCGA DNA methylation data table and the platform annotation table.

Genome-Wide SNP Array: The technical documentation for the Affymetrix Genome-Wide Human SNP Array 6.0 can be found here.

Genome Reference Data

Reference data that describes or annotates the human (or other) genome(s) is described in this section. Reference data hosted by the ISB-CGC in BigQuery tables is available in the isb-cgc:genome_reference dataset.

GENCODE: Release 19, the final build of the GENCODE geneset mapped to GRCh37, has been uploaded as a BigQuery table called GENCODE_r19. This table can be used to find the genomic coordinates for a gene of interest, in combination with queries against molecular tables such as the TCGA copy-number data.

miRBase: The human portion of version 20 of the miRBase database has been uploaded as a BigQuery table called miRBase_v20. This database can be used to map between MIMAT accession IDs, miR names, and mature miR names. The miR sequence can also be retrieved from this table.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase.
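When writing queries against these reference tables, the table reference is written differently in BigQuery legacy SQL and in standard SQL. The sketch below builds both forms for the three tables named above, assuming they live in project isb-cgc and dataset genome_reference; the helper names are hypothetical, and no column names are assumed since the table schemas are not listed here.

```python
PROJECT = "isb-cgc"
DATASET = "genome_reference"
TABLES = ("GENCODE_r19", "miRBase_v20", "miRTarBase")


def _check(table: str) -> str:
    """Reject table names that are not mentioned in this section."""
    if table not in TABLES:
        raise ValueError(f"unknown reference table: {table}")
    return table


def legacy_ref(table: str) -> str:
    """Legacy SQL form: [project:dataset.table]."""
    return f"[{PROJECT}:{DATASET}.{_check(table)}]"


def standard_ref(table: str) -> str:
    """Standard SQL form: `project.dataset.table` (backticks are needed for the hyphen)."""
    return f"`{PROJECT}.{DATASET}.{_check(table)}`"


def preview_query(table: str, limit: int = 10) -> str:
    """Build a small standard-SQL query that returns the first few rows."""
    return f"SELECT * FROM {standard_ref(table)} LIMIT {int(limit)}"


if __name__ == "__main__":
    for name in TABLES:
        print(preview_query(name))
```

The resulting query text can be pasted into the BigQuery web interface or passed to whichever BigQuery client you use from your own Google Cloud Project.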

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

  – isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

  – isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015
  – includes Annotations table with information about redacted samples, etc.
  – isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data, in BigQuery tables and in Google Cloud Storage, is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables;

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects; or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to “bring the computation to the data”. There are many ways that this can be done, using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single “monolithic” system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily “torn down” (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm’s resource needs.

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill-set.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python repository (https://github.com/isb-cgc/examples-Python/python).

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app, and then programmatically access these cohorts using this API.
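The relationship between the two barcode types can be sketched in a few lines. This is a minimal illustration of the 16-character / 12-character rule stated above; the function names are hypothetical and the barcodes shown are illustrative values only, not records taken from the data.

```python
from collections import defaultdict

SAMPLE_BARCODE_LENGTH = 16
PARTICIPANT_BARCODE_LENGTH = 12


def participant_barcode(sample_barcode: str) -> str:
    """Return the 12-character participant prefix of a 16-character sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError(f"expected a 16-character sample barcode: {sample_barcode!r}")
    return sample_barcode[:PARTICIPANT_BARCODE_LENGTH]


def group_by_participant(sample_barcodes):
    """Map each participant barcode to the list of its sample barcodes."""
    groups = defaultdict(list)
    for barcode in sample_barcodes:
        groups[participant_barcode(barcode)].append(barcode)
    return dict(groups)


if __name__ == "__main__":
    # Illustrative barcodes only.
    samples = ["TCGA-02-0001-01C", "TCGA-02-0001-10A", "TCGA-06-0125-01A"]
    for participant, its_samples in group_by_participant(samples).items():
        print(participant, its_samples)
```

Grouping this way is handy when a cohort's sample list needs to be reconciled with its patient list.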

The Cohort API currently bundles several different cohort-related endpoints.

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
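The same request can be prepared from Python using only the standard library. This is a hedged sketch rather than an official client: it assumes the endpoint URL reads as reconstructed below, the helper name is hypothetical, and the filter value is illustrative. The request is only built here, not sent, since sending it needs network access and a live service.

```python
import json
import urllib.request

# Assumed endpoint URL for the preview endpoint.
PREVIEW_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_request(filters: dict) -> urllib.request.Request:
    """Build a POST request whose JSON body carries the cohort filters."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_preview_request({"Study": "BRCA"})  # illustrative filter
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # To actually send it:
    # with urllib.request.urlopen(request) as response:
    #     print(json.load(response))
```

Filter keys are expected to be names from the metadata resource listed under “Request body”.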

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

adenocarcinoma_invasion: string
age_at_initial_pathologic_diagnosis: string
anatomic_neoplasm_subdivision: string
avg_percent_lymphocyte_infiltration: float
avg_percent_monocyte_infiltration: float
avg_percent_necrosis: float
avg_percent_neutrophil_infiltration: float
avg_percent_normal_cells: float
avg_percent_stromal_cells: float
avg_percent_tumor_cells: float
avg_percent_tumor_nuclei: float
batch_number: integer
bcr: string
clinical_M: string
clinical_N: string
clinical_stage: string
clinical_T: string
colorectal_cancer: string
country: string
country_of_procurement: string
days_to_birth: integer
days_to_collection: integer
days_to_death: integer
days_to_initial_pathologic_diagnosis: integer
days_to_last_followup: integer
days_to_submitted_specimen_dx: integer
Study: string
ethnicity: string
frozen_specimen_anatomic_site: string
gender: string
height: integer
histological_type: string
history_of_colon_polyps: string
history_of_neoadjuvant_treatment: string
history_of_prior_malignancy: string
hpv_calls: string
hpv_status: string
icd_10: string
icd_o_3_histology: string
icd_o_3_site: string
lymph_node_examined_count: integer
lymphatic_invasion: string
lymphnodes_examined: string
lymphovascular_invasion_present: string
max_percent_lymphocyte_infiltration: integer
max_percent_monocyte_infiltration: integer
max_percent_necrosis: integer
max_percent_neutrophil_infiltration: integer
max_percent_normal_cells: integer
max_percent_stromal_cells: integer
max_percent_tumor_cells: integer
max_percent_tumor_nuclei: integer
menopause_status: string
min_percent_lymphocyte_infiltration: integer
min_percent_monocyte_infiltration: integer
min_percent_necrosis: integer
min_percent_neutrophil_infiltration: integer
min_percent_normal_cells: integer
min_percent_stromal_cells: integer
min_percent_tumor_cells: integer
min_percent_tumor_nuclei: integer
mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
mononucleotide_marker_panel_analysis_status: string
neoplasm_histologic_grade: string
new_tumor_event_after_initial_treatment: string
number_of_lymphnodes_examined: integer
number_of_lymphnodes_positive_by_he: integer
ParticipantBarcode: string
pathologic_M: string
pathologic_N: string
pathologic_stage: string
pathologic_T: string
person_neoplasm_cancer_status: string
pregnancies: string
preservation_method: string
primary_neoplasm_melanoma_dx: string
primary_therapy_outcome_success: string
prior_dx: string
Project: string
psa_value: float
race: string
residual_tumor: string
SampleBarcode: string
tobacco_smoking_history: string
total_number_of_pregnancies: integer
tumor_tissue_site: string
tumor_pathology: string
tumor_type: string
weiss_venous_invasion: string
vital_status: string
weight: integer
year_of_initial_pathologic_diagnosis: string
SampleTypeCode: string
has_Illumina_DNASeq: string
has_BCGSC_HiSeq_RNASeq: string
has_UNC_HiSeq_RNASeq: string
has_BCGSC_GA_RNASeq: string
has_UNC_GA_RNASeq: string
has_HiSeq_miRnaSeq: string
has_GA_miRNASeq: string
has_RPPA: string
has_SNP6: string
has_27k: string
has_450k: string

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD) published by the World Health Organization in 1992_A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology published in 2000 used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology published in 2000 used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
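Because the two counts are documented as strings, a caller will usually want to convert them before doing arithmetic. The following is a minimal sketch of that step; the function name `summarize_preview` and the sample response body are hypothetical and only mirror the property names listed above.

```python
import json


def summarize_preview(response_text):
    """Parse a preview_cohort response and convert the counts to integers."""
    payload = json.loads(response_text)
    return {
        "patient_count": int(payload.get("patient_count", 0)),
        "sample_count": int(payload.get("sample_count", 0)),
        "patients": payload.get("patients", []),
        "samples": payload.get("samples", []),
    }


# Hypothetical response body, for illustration only.
example = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
})
summary = summarize_preview(example)
```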

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
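As a rough sketch, the request URL for this endpoint could be assembled as follows. Two things here are assumptions rather than documented behavior: that `patient_barcode` is sent as a query-string parameter, and the helper name `patient_details_url`. The length check reflects the 12-character participant barcode described above.

```python
from urllib.parse import urlencode

# patient_details endpoint (punctuation restored; verify before use).
PATIENT_DETAILS_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"


def patient_details_url(patient_barcode):
    """Return the GET URL for one participant, rejecting malformed barcodes."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long")
    # Assumption: the barcode travels in the query string.
    return PATIENT_DETAILS_URL + "?" + urlencode({"patient_barcode": patient_barcode})


url = patient_details_url("TCGA-B9-7268")
```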

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
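Every clinical_data property in this response is typed as a string, including fields that hold counts or day offsets. A small conversion helper, sketched below, can make such fields usable in calculations. The helper names, the choice of fields, and the sample values (including the literal "None" placeholder) are hypothetical; how the service actually encodes missing values is not stated here.

```python
def to_int_or_none(value):
    """Return int(value), or None when the value is missing or not an integer."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return None


def numeric_view(clinical_data, fields=("days_to_birth",
                                        "days_to_last_followup",
                                        "number_of_lymphnodes_examined")):
    """Convert a chosen subset of string-typed fields to integers."""
    return {name: to_int_or_none(clinical_data.get(name)) for name in fields}


# Hypothetical clinical_data block, for illustration only.
example = {"days_to_birth": "-21000", "days_to_last_followup": "None"}
converted = numeric_view(example)
```

Returning `None` instead of raising keeps one unparseable field from discarding the rest of the record.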

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.
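The two example barcodes in this section (TCGA-B9-7268 for a participant, TCGA-B9-7268-01A for a sample) suggest that a sample barcode begins with its participant barcode. The sketch below relies on that observation; the function name is hypothetical, and the prefix relationship should be treated as an assumption drawn from the examples rather than a documented guarantee.

```python
PARTICIPANT_BARCODE_LENGTH = 12
SAMPLE_BARCODE_LENGTH = 16


def participant_from_sample(sample_barcode):
    """Return the leading 12 characters of a 16-character sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError("sample barcodes are 16 characters long")
    return sample_barcode[:PARTICIPANT_BARCODE_LENGTH]


participant = participant_from_sample("TCGA-B9-7268-01A")
```

The `patient` property of the response carries the participant barcode directly, so this helper is mainly useful for validating input before a request is made.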

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control

To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
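As a rough illustration of the request and response described above, the following Python sketch builds the request URL and extracts the file paths from a decoded response. The helper names (build_request_url, extract_paths, fetch_paths) are hypothetical and not part of any ISB-CGC client library, the sample barcode in the usage note is a made-up placeholder, and the base URL is reconstructed from the endpoint shown on this page, so please verify it before relying on it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint as documented on this page (assumed spelling; verify before use).
BASE_URL = ("https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
            "datafilenamekey_list_from_sample")


def build_request_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Assemble the GET URL; optional parameters are left out when not given."""
    params = {"sample_barcode": sample_barcode}
    for name, value in (("platform", platform),
                        ("pipeline", pipeline),
                        ("token", token)):
        if value is not None:
            params[name] = value
    return BASE_URL + "?" + urlencode(params)


def extract_paths(response):
    """Return the file paths; the field is absent when no paths are listed."""
    return response.get("datafilenamekeys", [])


def fetch_paths(sample_barcode, **filters):
    """Send the request (requires network access) and return the file paths."""
    with urlopen(build_request_url(sample_barcode, **filters)) as reply:
        return extract_paths(json.load(reply))
```

For example, build_request_url("TCGA-XX-0000-01A", platform="IlluminaHiSeq_miRNASeq") yields a URL with both values as query parameters. Treating a missing datafilenamekeys field as an empty list mirrors the note above that the field is omitted when no paths are available.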


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control

To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
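A minimal sketch of how a caller might read this response in Python, assuming the body has already been decoded into a dictionary. The function name genomics_ids is hypothetical, and the ids in the usage note are placeholders. The sketch relies only on the items list rather than on count, since items is described as holding at most one object.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None if absent."""
    items = response.get("items") or []
    if not items:
        # No Google Genomics dataset is associated with this sample.
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]
```

Given a response whose items list holds one object with GG_dataset_id "d1" and GG_readgroupset_id "r1", the function returns the pair ("d1", "r1"); for a response with no items it returns None.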


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says "Enable", click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time: you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell, which provides you with command-line access to computing resources hosted on GCP, is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
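If you want to check these properties from a script, one possible approach is to parse the text printed by gcloud config list. The sketch below assumes that output is laid out in INI style, with bracketed section names followed by name = value lines. The sample text, its values, and the helper name parse_config_list are all hypothetical; please compare against the output of your own SDK version.

```python
import configparser

# Hypothetical sample of what `gcloud config list` might print; the values
# are placeholders and the INI-style layout is an assumption.
SAMPLE_OUTPUT = """\
[compute]
region = us-central1
zone = us-central1-a
[core]
account = user@example.com
project = my-project
"""


def parse_config_list(text):
    """Map property names (e.g. 'compute/zone', 'project') to their values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    properties = {}
    for section in parser.sections():
        for name, value in parser.items(section):
            # Properties in the [core] section are referred to without a prefix.
            key = name if section == "core" else f"{section}/{name}"
            properties[key] = value
    return properties


properties = parse_config_list(SAMPLE_OUTPUT)
for wanted in ("account", "project", "compute/region", "compute/zone"):
    print(wanted, "=", properties.get(wanted, "(unset)"))
```

Any property that is missing from the parsed output is reported as unset, which is a convenient prompt to run gcloud config set for it.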


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
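If you prefer to launch VMs from a script, a minimal Python sketch along these lines assembles the same command as an argument list. The helper name create_instance_command is hypothetical. Only the --machine-type and --zone options mentioned in this guide are used, and nothing is executed until you pass the list to something like subprocess.run.

```python
import shlex


def create_instance_command(name, machine_type=None, zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    argv = ["gcloud", "compute", "instances", "create", name]
    if machine_type:
        argv += ["--machine-type", machine_type]
    if zone:
        # Omit --zone to fall back on the compute/zone configuration property.
        argv += ["--zone", zone]
    return argv


# Show the command line that would be run for the example above.
print(shlex.join(create_instance_command("my-instance", machine_type="g1-small")))
```

Building a list rather than a single string avoids shell-quoting problems if the list is later handed to subprocess.run(argv, check=True).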

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
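The two prices quoted above imply a flat rate of $2 / 50 GB = $0.04 per GB per month, which also gives $40 for 1 TB when 1 TB is counted as 1000 GB. A small sketch of that arithmetic follows; the function name is hypothetical, and the rate is only the one implied by this page, so check current Google pricing before budgeting.

```python
# Rate implied by "$2 per month for a 50 GB standard persistent disk".
PRICE_PER_GB_MONTH = 2.00 / 50


def monthly_disk_cost(size_gb):
    """Estimate the monthly cost in dollars of a standard persistent disk."""
    return size_gb * PRICE_PER_GB_MONTH


for size_gb in (50, 500, 1000):
    print(f"{size_gb} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```

Under the same assumed rate, the 500 GB disk used as an example elsewhere in this guide would come to about $20 per month.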

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point:

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

The exact device name under /dev/disk/by-id/ may differ on your instance (it is typically prefixed with google-), so it is worth listing that directory first to confirm it.

Detach a Persistent Disk

Detaching a disk is a two-step process: first, you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes", the default, when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
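To keep the order of the steps in this section straight, the sketch below lists the whole disk lifecycle as (where to run it, command) pairs. It is only an illustration: the function name disk_lifecycle is hypothetical, the attach-disk argument order mirrors the detach-disk command shown above, and the device path under /dev/disk/by-id/ is an assumption that should be checked on your own instance.

```python
def disk_lifecycle(instance, disk, size="500GB", mount_point="/mnt/pd1"):
    """Return the ordered (location, command) steps for one persistent disk."""
    device = f"/dev/disk/by-id/{disk}"  # assumed device path; verify on the VM
    return [
        ("workstation", f"gcloud compute disks create {disk} --size {size}"),
        ("workstation", f"gcloud compute instances attach-disk {instance} --disk {disk}"),
        ("instance", f"sudo mkfs.ext4 -F {device}"),  # erases existing data
        ("instance", f"sudo mkdir {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("instance", f"sudo umount {device}"),
        ("workstation", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("workstation", f"gcloud compute disks delete {disk}"),
    ]


for location, command in disk_lifecycle("my-instance", "disk-1"):
    print(f"[{location}] {command}")
```

The commands marked "instance" are meant to be run after logging on with gcloud compute ssh; the others are run from your workstation or the Cloud Shell.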


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
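The three operations behave like ordinary set operations on the samples in each cohort. The following Python sketch illustrates the semantics described above using made-up sample labels; it is not the web application's implementation, and the function names are hypothetical.

```python
def intersect(*cohorts):
    """Samples present in every cohort; the order of cohorts does not matter."""
    return set.intersection(*cohorts)


def union(*cohorts):
    """Samples present in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def complement(base, *others):
    """Samples in the base cohort after subtracting all other cohorts."""
    return base.difference(*others)


# Made-up sample labels, for illustration only.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b, cohort_c)))
```

Note that complement is the only operation where the choice of base cohort changes the result, which is why the dialogue box asks for one.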

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only matches the search string against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing on the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel This panel displays any filters that have been used on the cohort or any of its ancestors Thesecannot be modified and any additional filters applied to this cohort will be appended to the list

Details Panel This panel displays the number of samples and participant in this cohort These vary because someparticipants may have provided multiple samples This panel also displays ldquoYour Permissionsrdquo which can be eitherowner or reader

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between the first two bars, you will see the number of samples that have both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available, "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
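Once downloaded, the CSV can be processed with a few lines of Python. The sketch below groups file paths by platform and splits a Cloud Storage path into bucket and object name. The exact column headers in the downloaded file are an assumption here (they are passed in as parameters so they can be adjusted), and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text, platform_col="Platform",
                      path_col="Cloud Storage Location"):
    """Group Cloud Storage paths by platform.

    The default column names are assumptions; check the header row of
    the file you actually downloaded.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row[platform_col]].append(row[path_col])
    return dict(grouped)


def split_gcs_path(path):
    """Split 'gs://bucket/some/object' into ('bucket', 'some/object')."""
    prefix = "gs://"
    if not path.startswith(prefix):
        raise ValueError("not a Cloud Storage path: " + path)
    bucket, _, object_name = path[len(prefix):].partition("/")
    return bucket, object_name


# Invented example content
sample_csv = (
    "Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location\n"
    "TCGA-AA-0001-01,PlatformA,pipeline1,Level 3,gs://example-bucket/a/file1.txt\n"
    "TCGA-AA-0002-01,PlatformA,pipeline1,Level 3,gs://example-bucket/a/file2.txt\n"
    "TCGA-AA-0002-01,PlatformB,pipeline2,Level 3,gs://example-bucket/b/file3.txt\n"
)
grouped = paths_by_platform(sample_csv)
```

The bucket and object name returned by `split_gcs_path` are the two pieces that Cloud Storage tools generally need to locate a file.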

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete it using the top right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with; the last cohort you created is chosen by default. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with a default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot:

• Clinical

– Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Platform Filter: filters down the plot-able features by platform.

– Center Filter: filters down the plot-able features by processing center.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

– miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

– Platform Filter: filters down the plot-able features by platform.

– Value Filter: filters down the plot-able features by value.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

– Platform Filter: filters down the plot-able features by platform.

– Gene Region Filter: filters down the plot-able features by specific gene regions.

– CpG Island Region Filter: filters down the plot-able features by CpG island region.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Value Filter: filters down the plot-able features by value.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy ("clone") of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes to it.
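The two permission levels can be summarized as a small lookup table. The Python sketch below is only a reading aid for the rules listed here; the action names are invented for the example and do not correspond to any ISB-CGC API.

```python
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Action names are hypothetical labels for the behaviors described in the text.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "save", "share", "delete"},
    Permission.READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if the given permission level allows the action."""
    return action in ALLOWED_ACTIONS[permission]


def permission_after_copy(_original_permission):
    """Copying always makes the user the owner of the new copy."""
    return Permission.OWNER
```

For instance, a Reader cannot save changes directly, but can do so on a copy because the copy belongs to them.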

1.4.8 Release Notes

• December 23, 2015: v0.2

– A treemap graph in the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download is not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
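Once the project is visible, a table can be previewed with a short query. The Python sketch below only builds the query text, using BigQuery's bracketed legacy SQL table reference of the form [project:dataset.table]; the helper names are invented for this example, and the table used is the isb-cgc:genome_reference.miRTarBase table listed in the data release notes.

```python
def legacy_table_ref(project, dataset, table):
    """Format a BigQuery legacy SQL table reference: [project:dataset.table]."""
    return "[{}:{}.{}]".format(project, dataset, table)


def preview_query(project, dataset, table, limit=10):
    """Build a query that returns the first few rows of a table."""
    if not isinstance(limit, int) or limit < 1:
        raise ValueError("limit must be a positive integer")
    return "SELECT * FROM {} LIMIT {}".format(
        legacy_table_ref(project, dataset, table), limit)


query = preview_query("isb-cgc", "genome_reference", "miRTarBase")
print(query)
# prints: SELECT * FROM [isb-cgc:genome_reference.miRTarBase] LIMIT 10
```

The resulting text can be pasted into the query box of the BigQuery web interface. Keeping the LIMIT small is a sensible habit while exploring an unfamiliar table.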

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
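For scripts that run for a long time, it can help to keep track of when the 24-hour window closes. The following Python sketch shows the arithmetic only; the function names are invented, the platform does not provide this helper, and treating the exact 24-hour mark as already expired is an assumption.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(authenticated_at):
    """Time at which controlled-access authorization lapses."""
    return authenticated_at + ACCESS_WINDOW


def has_access(authenticated_at, now):
    """True while 'now' is still inside the 24-hour window (assumed exclusive)."""
    return authenticated_at <= now < access_expires_at(authenticated_at)


# Example with a fixed, made-up authentication time
signed_in = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
expires = access_expires_at(signed_in)
```

After the window closes, you would need to authenticate again through the web-app before starting further controlled-access work.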

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, providing starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

miRTarBase: The recently updated miRTarBase database (release 6.1) is available as a BigQuery table: isb-cgc:genome_reference.miRTarBase

Other Reference Data Sources

Google Genomics maintains a list of publicly available datasets, including Reference Genomes, the Illumina Platinum Genomes, information about the Tute Genomics Annotation table, etc.

1.2.5 Data Releases and Future Plans

Release Notes

• September 21, 2015: first set of BigQuery tables (not publicly released)

– isb-cgc:tcga_201507_alpha dataset containing clinical, biospecimen, somatic mutation calls, and Level-3 TCGA data available at the TCGA DCC as of July 2015

• October 4, 2015: complete data upload from TCGA DCC, including controlled-access data

• November 2, 2015: first public release of TCGA open-access data in BigQuery tables

– isb-cgc:tcga_201510_alpha dataset contains an updated set of BigQuery tables based on data available at the TCGA DCC as of October 2015

– includes Annotations table with information about redacted samples, etc.

– isb-cgc:platform_reference contains annotation information for the Illumina DNA Methylation platform

• November 16, 2015: initial upload of data from CGHub into Google Cloud Storage complete (not publicly released)

• December 26, 2015: public release of new isb-cgc:genome_reference dataset with miRTarBase table

• January 10, 2016: GENCODE_r19 and miRBase_v20 tables added to isb-cgc:genome_reference dataset

Future Plans

We expect that our future plans will continually evolve based on user feedback, research priorities, and the dynamic nature of the Google Cloud Platform. Tell us what is important to you at feedback@isb-cgc.org.

Near-Term

• Enable access to controlled data in GCS by authorized users (January)

• Upload new data from CGHub into GCS (February)

• New set of BigQuery tables based on new data at the TCGA DCC (March)

• Upload TCGA MC3 VCF files from TCGA DCC into GCS ()

Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics

1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users with a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.
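As a rough aid to choosing the right tool, the Python sketch below maps a data reference to the access tools listed above, based only on the shape of the reference. The classification rules and the function name are assumptions made for this example; Google Genomics data is omitted because it cannot be recognized from a reference string alone.

```python
import re

# project:dataset.table, e.g. isb-cgc:genome_reference.miRTarBase
_BIGQUERY_TABLE = re.compile(r"^[\w-]+:\w+\.\w+$")

ACCESS_TOOLS = {
    "bigquery": ["BigQuery Web UI", "BigQuery Command-Line Tool", "BigQuery REST API"],
    "gcs": ["GCS JSON API", "gsutil"],
}


def access_tools(reference):
    """Return the tools suited to a gs:// path or a project:dataset.table name."""
    if reference.startswith("gs://"):
        return ACCESS_TOOLS["gcs"]
    if _BIGQUERY_TABLE.match(reference):
        return ACCESS_TOOLS["bigquery"]
    raise ValueError("unrecognized data reference: " + reference)
```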

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to "bring the computation to the data". There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single "monolithic" system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily "torn down" (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup, and can be parametrized according to your algorithm's resource needs.
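The idea of requesting many identical, parametrized machines can be sketched as plain data. The Python example below does not call any Google API; the `MachineSpec` class, its fields, and the `make_fleet` helper are all hypothetical and serve only to show that every machine shares one configuration and differs only by name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MachineSpec:
    """Hypothetical description of one virtual machine."""
    name: str
    cpus: int
    memory_gb: int
    disk_gb: int
    startup_script: str


def make_fleet(count, cpus, memory_gb, disk_gb, startup_script, prefix="worker"):
    """Describe `count` identical machines sized for one algorithm's needs."""
    return [
        MachineSpec("{}-{:04d}".format(prefix, index), cpus, memory_gb,
                    disk_gb, startup_script)
        for index in range(count)
    ]


# Tearing down and regenerating the fleet yields the same descriptions again
fleet = make_fleet(count=3, cpus=4, memory_gb=16, disk_gb=200,
                   startup_script="setup.sh")
```

Because the description is deterministic, a fleet that has been torn down can be regenerated in exactly the same state, which is the property the paragraph above relies on.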

One important implication of this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public GitHub repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.


1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on GitHub.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints has been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets, such as CCLE, may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
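The relationship between sample and patient barcodes can be sketched in a few lines of Python. This is only an illustration of the 16-character / 12-character rule described above; the function names are hypothetical, and the barcodes other than TCGA-B9-7268-01A are made up for the example.

```python
from collections import defaultdict


def participant_barcode(sample_barcode):
    """Return the 12-character patient prefix of a 16-character TCGA sample barcode."""
    if len(sample_barcode) != 16 or not sample_barcode.startswith("TCGA-"):
        raise ValueError("not a 16-character TCGA sample barcode: %r" % sample_barcode)
    return sample_barcode[:12]


def group_samples_by_participant(sample_barcodes):
    """Map each patient barcode to the list of its sample barcodes."""
    grouped = defaultdict(list)
    for barcode in sample_barcodes:
        grouped[participant_barcode(barcode)].append(barcode)
    return dict(grouped)


# Illustrative input; only the first barcode appears in this documentation.
samples = ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A", "TCGA-A1-A0SB-01A"]
by_patient = group_samples_by_participant(samples)
```

Barcodes from other datasets (such as CCLE) would be rejected by this sketch, since they do not follow the TCGA convention.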

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes, and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
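For Python users, a roughly equivalent request can be assembled with the standard library. This is a minimal sketch rather than the examples-Python implementation: the helper name is hypothetical, the URL is the endpoint address with its punctuation restored, and the single-study filter is chosen only for illustration. The request is built but not sent, so the snippet runs without network access.

```python
import json
import urllib.request

# Endpoint address as documented for preview_cohort (punctuation restored).
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters, url=PREVIEW_URL):
    """Build (but do not send) a POST request carrying the filters as a JSON body."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_preview_request({"Study": "BRCA"})

# To actually call the service (assuming it is reachable), one could do:
#     with urllib.request.urlopen(request) as resp:
#         preview = json.load(resp)
```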

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  adenocarcinoma_invasion: string
  age_at_initial_pathologic_diagnosis: string
  anatomic_neoplasm_subdivision: string
  avg_percent_lymphocyte_infiltration: float
  avg_percent_monocyte_infiltration: float
  avg_percent_necrosis: float
  avg_percent_neutrophil_infiltration: float
  avg_percent_normal_cells: float
  avg_percent_stromal_cells: float
  avg_percent_tumor_cells: float
  avg_percent_tumor_nuclei: float
  batch_number: integer
  bcr: string
  clinical_M: string
  clinical_N: string
  clinical_stage: string
  clinical_T: string
  colorectal_cancer: string
  country: string
  country_of_procurement: string
  days_to_birth: integer
  days_to_collection: integer
  days_to_death: integer
  days_to_initial_pathologic_diagnosis: integer
  days_to_last_followup: integer
  days_to_submitted_specimen_dx: integer
  Study: string
  ethnicity: string
  frozen_specimen_anatomic_site: string
  gender: string
  height: integer
  histological_type: string
  history_of_colon_polyps: string
  history_of_neoadjuvant_treatment: string
  history_of_prior_malignancy: string
  hpv_calls: string
  hpv_status: string
  icd_10: string
  icd_o_3_histology: string
  icd_o_3_site: string
  lymph_node_examined_count: integer
  lymphatic_invasion: string
  lymphnodes_examined: string
  lymphovascular_invasion_present: string
  max_percent_lymphocyte_infiltration: integer
  max_percent_monocyte_infiltration: integer
  max_percent_necrosis: integer
  max_percent_neutrophil_infiltration: integer
  max_percent_normal_cells: integer
  max_percent_stromal_cells: integer
  max_percent_tumor_cells: integer
  max_percent_tumor_nuclei: integer
  menopause_status: string
  min_percent_lymphocyte_infiltration: integer
  min_percent_monocyte_infiltration: integer
  min_percent_necrosis: integer
  min_percent_neutrophil_infiltration: integer
  min_percent_normal_cells: integer
  min_percent_stromal_cells: integer
  min_percent_tumor_cells: integer
  min_percent_tumor_nuclei: integer
  mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
  mononucleotide_marker_panel_analysis_status: string
  neoplasm_histologic_grade: string
  new_tumor_event_after_initial_treatment: string
  number_of_lymphnodes_examined: integer
  number_of_lymphnodes_positive_by_he: integer
  ParticipantBarcode: string
  pathologic_M: string
  pathologic_N: string
  pathologic_stage: string
  pathologic_T: string
  person_neoplasm_cancer_status: string
  pregnancies: string
  preservation_method: string
  primary_neoplasm_melanoma_dx: string
  primary_therapy_outcome_success: string
  prior_dx: string
  Project: string
  psa_value: float
  race: string
  residual_tumor: string
  SampleBarcode: string
  tobacco_smoking_history: string
  total_number_of_pregnancies: integer
  tumor_tissue_site: string
  tumor_pathology: string
  tumor_type: string
  weiss_venous_invasion: string
  vital_status: string
  weight: integer
  year_of_initial_pathologic_diagnosis: string
  SampleTypeCode: string
  has_Illumina_DNASeq: string
  has_BCGSC_HiSeq_RNASeq: string
  has_UNC_HiSeq_RNASeq: string
  has_BCGSC_GA_RNASeq: string
  has_UNC_GA_RNASeq: string
  has_HiSeq_miRnaSeq: string
  has_GA_miRNASeq: string
  has_RPPA: string
  has_SNP6: string
  has_27k: string
  has_450k: string
}

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked, and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: “True”, “False”, or “None”
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: “True”, “False”, or “None”
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: “True”, “False”, or “None”
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: “True”, “False”, or “None”
has_RPPA | string | Indicates if a sample has protein array data: “True”, “False”, or “None”
has_SNP6 | string | Indicates if a sample has copy number data: “True”, “False”, or “None”
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: “True”, “False”, or “None”
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: “True”, “False”, or “None”

Response

If successful, this method returns a response body with the following structure:

{
  kind: cohort_api#cohortsItem
  patient_count: string
  patients: [string]
  sample_count: string
  samples: [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
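Because the counts in this response are documented as strings, client code will usually want to convert them before use. The following is a minimal sketch of that step; the function name is hypothetical and the example response is hand-written to match the documented structure, not captured from the live service.

```python
def summarize_preview(response):
    """Convert the string counts to integers and compare them with the barcode lists."""
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    # Fall back to the list lengths if a count field is absent.
    patient_count = int(response.get("patient_count", len(patients)))
    sample_count = int(response.get("sample_count", len(samples)))
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        "counts_match_lists": (
            patient_count == len(patients) and sample_count == len(samples)
        ),
    }


# Hand-written example shaped like the documented response body.
example_response = {
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "2",
    "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"],
}
summary = summarize_preview(example_response)
```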


patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
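A request URL for this endpoint might be assembled as follows. This sketch assumes the barcode is passed in the query string (the table labels it a path parameter, but the documented URL contains no placeholder for it), and the helper names are hypothetical.

```python
from urllib.parse import urlencode

# Base address of the Cohort API, with punctuation restored.
API_BASE = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def details_url(endpoint, **params):
    """Return the GET URL for an endpoint, with params encoded in the query string."""
    return "%s/%s?%s" % (API_BASE, endpoint, urlencode(params))


def patient_details_url(patient_barcode):
    """URL for patient_details; expects a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long")
    return details_url("patient_details", patient_barcode=patient_barcode)


url = patient_details_url("TCGA-B9-7268")
```

The resulting URL can be opened with any HTTP client; no authentication header is needed for this endpoint according to the description above.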

Response

If successful, this method returns a response body with the following structure:

{
  kind: cohort_api#cohortsItem
  aliquots: [string]
  clinical_data: {
    ParticipantBarcode: string
    Project: string
    Study: string
    age_at_initial_pathologic_diagnosis: string
    anatomic_neoplasm_subdivision: string
    batch_number: string
    bcr: string
    clinical_M: string
    clinical_N: string
    clinical_T: string
    clinical_stage: string
    colorectal_cancer: string
    country: string
    days_to_birth: string
    days_to_initial_pathologic_diagnosis: string
    days_to_last_followup: string
    ethnicity: string
    frozen_specimen_anatomic_site: string
    gender: string
    histological_type: string
    history_of_colon_polyps: string
    history_of_neoadjuvant_treatment: string
    history_of_prior_malignancy: string
    hpv_calls: string
    hpv_status: string
    icd_10: string
    icd_o_3_histology: string
    icd_o_3_site: string
    lymphatic_invasion: string
    lymphnodes_examined: string
    lymphovascular_invasion_present: string
    menopause_status: string
    mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
    mononucleotide_marker_panel_analysis_status: string
    neoplasm_histologic_grade: string
    new_tumor_event_after_initial_treatment: string
    number_of_lymphnodes_examined: string
    number_of_lymphnodes_positive_by_he: string
    pathologic_M: string
    pathologic_N: string
    pathologic_T: string
    pathologic_stage: string
    person_neoplasm_cancer_status: string
    pregnancies: string
    primary_neoplasm_melanoma_dx: string
    primary_therapy_outcome_success: string
    prior_dx: string
    race: string
    residual_tumor: string
    tobacco_smoking_history: string
    tumor_tissue_site: string
    tumor_type: string
    vital_status: string
    weiss_venous_invasion: string
    year_of_initial_pathologic_diagnosis: string
  }
  samples: []
}

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. “TCGA”
clinical_data.Study | string | Tumor type abbreviation, e.g. “BRCA”
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available “biospecimen” information about this sample, the associated patient barcode, a list of associated aliquots, and a list of “data_details” blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get information about.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
aliquots[]    list    List of barcodes of aliquots taken from this participant.
biospecimen_data    nested object    Biospecimen data about the sample.


Table 1.3 - continued from previous page
Property name    Value    Description
biospecimen_data.ParticipantBarcode    string    Participant barcode.
biospecimen_data.Project    string    Project name, e.g. “TCGA”.
biospecimen_data.SampleBarcode    string    Sample barcode.
biospecimen_data.Study    string    Tumor type abbreviation, e.g. “BRCA”.
biospecimen_data.avg_percent_lymphocyte_infiltration    integer    Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration    integer    Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis    integer    Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration    integer    Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells    integer    Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells    integer    Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells    integer    Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei    integer    Average percent tumor nuclei.
biospecimen_data.batch_number    string    Batch number in which the sample was processed.
biospecimen_data.bcr    string    Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”.
biospecimen_data.days_to_collection    string    Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration    string    Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration    string    Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis    string    Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration    string    Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells    string    Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells    string    Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells    string    Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei    string    Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration    string    Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration    string    Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis    string    Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration    string    Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells    string    Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells    string    Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells    string    Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei    string    Minimum percent tumor nuclei.
data_details[]    list    List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath    string    Path to file, if it exists.
data_details[].DataCenterName    string    Short name of the contributing data center, e.g. “bcgsc.ca”.
data_details[].DataCenterType    string    Abbreviation of the type of contributing data center, e.g. “cgcc”.
data_details[].DataFileName    string    Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey    string    Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded    string    Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel    string    Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype    string    Data type, e.g. “Complete Clinical Set CNV (SNP Array)”, “DNA Methylation”, “Expression-Protein”, “Fragment Analysis Results”, “miRNASeq”, “Protected Mutations”, “RNASeq”, “RNASeqV2”, “Somatic Mutations”, “TotalRNASeqV2”.
data_details[].GenomeReference    string    Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. “hg19 (GRCh37)”.
data_details[].Pipeline    string    A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. “bcgsc.ca__miRNASeq”.
data_details[].Platform    string    A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. “IlluminaHiSeq_miRNASeq”.
data_details[].platform_full_name    string    The full name of the sequencing platform used, e.g. “Illumina HiSeq 2000”, “Ion Torrent PGM”, “AB SOLiD System 2.0”.
data_details[].Project    string    The study for which the data was generated, e.g. “TCGA”.
data_details[].Repository    string    A storage location where files are deposited and made available, e.g. “DCC”, “CGHub”.
data_details[].SDRFFileName    string    Name of SDRF file stored on the DCC file system, e.g. “bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt”.
data_details[].SampleBarcode    string    Sample barcode.
data_details[].SecurityProtocol    string    An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. “DBGap Protected Access”, “DBGap Open Access”.


Table 1.3 - continued from previous page
Property name    Value    Description
data_details_count    string    Length of data_details list.
patient    string    Participant barcode.
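As an illustration of how a client might call sample_details, here is a minimal Python sketch. It only builds the request URL and summarizes an already-decoded response; it does not contact the service. It assumes that sample_barcode is sent as a query-string parameter (the request line above does not show where it goes) and the helper names are hypothetical, not part of the ISB-CGC API.

```python
from urllib.parse import urlencode

# Base URL as given in this section; treat it as subject to change.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def sample_details_url(sample_barcode):
    """Build the GET URL for sample_details (hypothetical helper).

    Assumption: the barcode is passed in the query string.
    """
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode")
    query = urlencode({"sample_barcode": sample_barcode})
    return BASE_URL + "/sample_details?" + query


def summarize_sample(response):
    """Pull a few documented fields out of a decoded JSON response."""
    return {
        "patient": response.get("patient"),
        "n_aliquots": len(response.get("aliquots", [])),
        # data_details_count is documented as a string, so convert it.
        "n_files": int(response.get("data_details_count", "0")),
    }


print(sample_details_url("TCGA-B9-7268-01A"))
```

The URL can then be fetched with any HTTP client and the decoded JSON body passed to summarize_sample.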


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths. To see paths to controlled-access files, the user must be authenticated and have dbGaP authorization. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get file paths for.
platform    string    Optional. Filter file results by platform.
pipeline    string    Optional. Filter file results by pipeline.
token    string    Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
count    string    Integer representing the length of the datafilenamekeys list.
datafilenamekeys[]    list    List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with “file-path-not-yet-available”. If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
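The optional filters and the two special cases in the response (a missing datafilenamekeys field, and “file-path-not-yet-available” placeholders) can be handled along the following lines. This is a sketch only: it assumes all parameters are passed in the query string, the helper names are hypothetical, and the example paths in the usage note are made up.

```python
from urllib.parse import urlencode

# Base URL as given in this section; treat it as subject to change.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekeys_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, adding only the optional filters that were given."""
    params = {"sample_barcode": sample_barcode}
    for name, value in (("platform", platform), ("pipeline", pipeline), ("token", token)):
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def usable_paths(response):
    """Return real file paths from a decoded response.

    The datafilenamekeys field may be absent entirely, and entries for
    files without a path contain "file-path-not-yet-available".
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if "file-path-not-yet-available" not in key]
```

For example, usable_paths({"datafilenamekeys": ["gs://some-bucket/a.txt", "some-bucket/file-path-not-yet-available"]}) keeps only the first entry, and usable_paths({}) returns an empty list.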


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type.
count    string    The number of items returned. Count will be either “0” or “1”.
items[]    list    If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode    string    The sample barcode passed into the request.
items[].GG_dataset_id    string    The dataset id of the sample.
items[].GG_readgroupset_id    string    The readgroupset id of the sample.
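Because a sample may have no Google Genomics ids at all, client code should treat an empty or missing items list as a normal outcome. A minimal sketch, using a hypothetical helper name and made-up id values in the usage note:

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) from a decoded response, or None.

    The section above says items[] holds at most one object, so only the
    first entry is read. A missing or empty list means no ids exist.
    """
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]
```

For example, genomics_ids({"count": "0"}) returns None, while a response with one item returns a pair of id strings.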


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components” and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud
• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account
• project
• compute/region
• compute/zone
• container/cluster
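If you set the same properties on several machines, it can help to generate the gcloud config set commands from one place. The following is a small sketch that only builds the command strings (it does not run them); the helper name and the project id “my-project” are hypothetical placeholders.

```python
def config_set_commands(properties):
    """Turn a {property: value} mapping into gcloud config set command lines."""
    return [f"gcloud config set {name} {value}" for name, value in properties.items()]


# Example values; replace them with your own project id and preferred zone.
for line in config_set_commands({"project": "my-project", "compute/zone": "us-central1-a"}):
    print(line)
```

After running the printed commands, gcloud config list should show the new values.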


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you
• Zone: choose one of the us-east or us-central zones
• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update
• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs) and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property, for example:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
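If you want to launch VMs from a script rather than by hand, one option is to assemble the same command as an argument list and hand it to a process runner. The sketch below only builds and prints the command; the helper name is hypothetical, and actually running it (for example with subprocess.run) assumes the Cloud SDK is installed and you are authenticated.

```python
import shlex


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Build the argument list for gcloud compute instances create."""
    command = ["gcloud", "compute", "instances", "create", name,
               "--machine-type", machine_type]
    if zone is not None:
        # Without --zone, gcloud falls back to the compute/zone property.
        command += ["--zone", zone]
    return command


# Print the command as a single shell-quoted line for inspection.
print(shlex.join(create_instance_command("my-instance", zone="us-central1-a")))
```

Building an argument list rather than a single string avoids quoting problems if an instance name or zone ever contains unusual characters.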

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.
• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
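Both price points quoted above work out to the same rate, $2 / 50 GB = $0.04 per GB per month (and $40 / 1000 GB = $0.04, taking 1 TB as 1000 GB). A short sketch for estimating the monthly cost of other disk sizes follows; the rate is derived from the figures in this section, which may be out of date, and the function name is hypothetical.

```python
# USD per GB per month, derived from "$2 for 50 GB" in the text above.
# Assumption: pricing is linear in size and 1 TB is counted as 1000 GB.
PRICE_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb):
    """Estimated monthly cost in USD of a standard persistent disk."""
    return round(size_gb * PRICE_PER_GB_MONTH, 2)


for size in (50, 500, 1000):
    print(f"{size} GB: ${monthly_disk_cost(size):.2f} per month")
```

Check the current Google Cloud pricing pages before relying on these numbers.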

If you know that you will never need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point:

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
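To keep the order of operations straight, the whole disk lifecycle described in this section can be written down as one ordered list of commands, each tagged with where it runs. The sketch below only prints the commands; the function name is hypothetical, the device path follows the examples in this section, and the actual device name on your instance may differ.

```python
def disk_lifecycle(instance, disk, size="500GB", mount_point="/mnt/pd1"):
    """Return (where, command) pairs in the order described in this section."""
    device = f"/dev/disk/by-id/{disk}"  # path as written in the examples above
    return [
        ("workstation", f"gcloud compute disks create {disk} --size {size}"),
        ("workstation", f"gcloud compute instances attach-disk {instance} --disk {disk}"),
        ("instance", f"sudo mkfs.ext4 -F {device}"),  # erases existing data
        ("instance", f"sudo mkdir {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("instance", f"sudo umount {device}"),
        ("workstation", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("workstation", f"gcloud compute disks delete {disk}"),
    ]


for where, command in disk_lifecycle("my-instance", "disk-1"):
    print(f"[{where}] {command}")
```

Commands tagged “instance” are run after gcloud compute ssh into the VM; the rest are run from your workstation or the Cloud Shell.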


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the “Share Cohort” button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
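The three operations behave like ordinary set operations on the barcodes in each cohort. The following Python sketch illustrates the semantics only; it is not the web application’s implementation, the function name is hypothetical, and the single-letter barcodes are placeholders.

```python
def combine_cohorts(operation, cohorts):
    """Combine cohorts (iterables of barcodes) with a set operation.

    For "complement", the first cohort is the base and every other
    cohort is subtracted from it, so the order matters only there.
    """
    sets = [set(cohort) for cohort in cohorts]
    if not sets:
        raise ValueError("at least one cohort is required")
    if operation == "union":
        return set().union(*sets)
    if operation == "intersect":
        return set.intersection(*sets)
    if operation == "complement":
        base, *others = sets
        return base.difference(*others)
    raise ValueError(f"unknown operation: {operation}")


a, b, c = ["A", "B", "C"], ["B", "C", "D"], ["C"]
print(sorted(combine_cohorts("union", [a, b, c])))
print(sorted(combine_cohorts("intersect", [a, b, c])))
print(sorted(combine_cohorts("complement", [a, b, c])))
```

Swapping the order of the inputs changes the result of the complement but not of the union or intersection.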

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort is in ascending or descending order.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead” and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
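The three operations correspond to ordinary set algebra. The following is a minimal sketch, assuming each cohort can be modelled as a Python set of sample barcodes; the barcodes and function names are made up for illustration and this is not the web-app's own implementation.

```python
from functools import reduce

# Hypothetical cohorts, each modelled as a set of (made-up) sample barcodes.
cohort_a = {"TCGA-AA-0001-01", "TCGA-AA-0002-01", "TCGA-AA-0003-01"}
cohort_b = {"TCGA-AA-0002-01", "TCGA-AA-0003-01", "TCGA-AA-0004-01"}
cohort_c = {"TCGA-AA-0003-01", "TCGA-AA-0005-01"}


def union(*cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort; order does not matter."""
    return reduce(set.intersection, cohorts)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the other cohorts."""
    return base.difference(*others)


print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b)))
```

Unlike union and intersect, complement depends on which cohort is chosen as the base, which is why the dialogue asks for a base cohort.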

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
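To see why the two counts can differ, the sketch below groups a few made-up TCGA-style sample barcodes by participant. It assumes the standard TCGA barcode layout, in which the first three dash-separated fields identify the participant; the helper name `participant_of` is hypothetical.

```python
from collections import defaultdict

# Made-up sample barcodes. The first three dash-separated fields
# (project, tissue source site, participant) identify the participant.
sample_barcodes = [
    "TCGA-AA-0001-01A",  # tumor sample
    "TCGA-AA-0001-10A",  # normal sample from the same participant
    "TCGA-AA-0002-01A",
]


def participant_of(sample_barcode):
    """Return the participant portion of a TCGA-style sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])


samples_by_participant = defaultdict(list)
for barcode in sample_barcodes:
    samples_by_participant[participant_of(barcode)].append(barcode)

n_samples = len(sample_barcodes)
n_participants = len(samples_by_participant)
print(n_samples, "samples from", n_participants, "participants")
```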

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform; selecting a filter will update the list accordingly. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will download a list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, File Path to the Cloud Storage Location.
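Once downloaded, the file list can be processed with any CSV reader. The following is a minimal sketch using Python's standard `csv` module. The exact header names in the real download are not given here, so the header row, the platform and pipeline values, and the `gs://example-bucket/...` paths below are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter

# Stand-in for the downloaded file: the header names and every row are assumed.
example_csv = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,PlatformA,pipeline_x,Level 3,gs://example-bucket/a.txt
TCGA-AA-0001-01A,PlatformB,pipeline_y,Level 3,gs://example-bucket/b.txt
TCGA-AA-0002-01A,PlatformA,pipeline_x,Level 3,gs://example-bucket/c.txt
"""


def read_file_list(handle):
    """Read the file list into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle))


rows = read_file_list(io.StringIO(example_csv))

# Count files per platform, mirroring the numbers shown next to each platform.
files_per_platform = Counter(row["Platform"] for row in rows)

# Collect the Cloud Storage paths for a single platform.
paths_for_platform_a = [
    row["File Path"] for row in rows if row["Platform"] == "PlatformA"
]

print(dict(files_per_platform))
print(paths_for_platform_a)
```

For a real download, replace the `io.StringIO(...)` stand-in with `open(path, newline="")` and adjust the column names to match the header of the file you receive.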

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platforms.

  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platforms.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG Probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platforms.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG Island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
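The two permission levels can be summarized as a small access-control model. The sketch below is illustrative only, assuming the rules listed above; the class `SharedItem`, the action names and the example email addresses are hypothetical and do not reflect how the web-app is actually implemented.

```python
from dataclasses import dataclass, field

OWNER = "owner"
READER = "reader"

# Actions allowed at each permission level, following the rules listed above.
ALLOWED_ACTIONS = {
    OWNER: {"view", "comment", "copy", "edit", "share"},
    READER: {"view", "comment", "copy"},
}


@dataclass
class SharedItem:
    """A cohort or visualization with one owner and any number of readers."""

    name: str
    owner: str
    readers: set = field(default_factory=set)

    def permission_for(self, user):
        """Return OWNER, READER, or None if the user has no access."""
        if user == self.owner:
            return OWNER
        if user in self.readers:
            return READER
        return None

    def can(self, user, action):
        level = self.permission_for(user)
        return level is not None and action in ALLOWED_ACTIONS[level]

    def share_with(self, acting_user, new_reader):
        """Only the owner may share; the new user becomes a Reader."""
        if not self.can(acting_user, "share"):
            raise PermissionError("only the owner can share this item")
        self.readers.add(new_reader)

    def copy_for(self, user):
        """Cloning gives the user a private copy that they own."""
        if not self.can(user, "copy"):
            raise PermissionError("no access to this item")
        return SharedItem(name=f"Copy of {self.name}", owner=user)


cohort = SharedItem("my cohort", owner="alice@example.com")
cohort.share_with("alice@example.com", "bob@example.com")
clone = cohort.copy_for("bob@example.com")
print(cohort.permission_for("bob@example.com"), clone.permission_for("bob@example.com"))
```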

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
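The same datasets can also be listed from Python. The following is a sketch only, assuming the `google-cloud-bigquery` client library is installed and that you are authenticated as a member of a Google Cloud Project; `your-gcp-project-id` is a placeholder, the helper names are hypothetical, and the datasets actually visible depend on what the ISB-CGC currently publishes.

```python
ISB_CGC_PROJECT = "isb-cgc"  # the Project ID quoted in the answer above


def list_public_tables(client, project_id=ISB_CGC_PROJECT):
    """Map each dataset ID in the given project to a list of its table IDs.

    `client` is expected to behave like google.cloud.bigquery.Client,
    i.e. to provide list_datasets() and list_tables().
    """
    tables_by_dataset = {}
    for dataset in client.list_datasets(project=project_id):
        full_dataset_id = f"{project_id}.{dataset.dataset_id}"
        tables_by_dataset[dataset.dataset_id] = [
            table.table_id for table in client.list_tables(full_dataset_id)
        ]
    return tables_by_dataset


def main():
    # Imported here so that the helper above does not require the library.
    from google.cloud import bigquery

    # Placeholder: use the ID of a Google Cloud Project you are a member of.
    client = bigquery.Client(project="your-gcp-project-id")
    for dataset_id, table_ids in list_public_tables(client).items():
        print(dataset_id, table_ids)
```

Call `main()` from an environment with valid Google credentials; the client is created against your own project while the datasets are read from the `isb-cgc` project.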

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
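The unlink-then-relink rule can be pictured as a registry that allows each eRA Commons id to point at no more than one Google identity. The sketch below is a toy model under that assumption; the class `IdentityLinks` and the example identities are hypothetical and are not part of the ISB-CGC platform.

```python
class IdentityLinks:
    """Toy registry: each eRA Commons id maps to at most one Google identity."""

    def __init__(self):
        self._google_for_era = {}

    def link(self, google_identity, era_commons_id):
        current = self._google_for_era.get(era_commons_id)
        if current is not None and current != google_identity:
            raise ValueError(
                f"{era_commons_id} is already linked to {current}; unlink it first"
            )
        self._google_for_era[era_commons_id] = google_identity

    def unlink(self, google_identity, era_commons_id):
        # Only the Google identity that currently holds the link can remove it.
        if self._google_for_era.get(era_commons_id) != google_identity:
            raise ValueError("this Google identity does not hold the link")
        del self._google_for_era[era_commons_id]

    def linked_identity(self, era_commons_id):
        return self._google_for_era.get(era_commons_id)


links = IdentityLinks()
links.link("old.identity@example.com", "ERA_USER")
links.unlink("old.identity@example.com", "ERA_USER")
links.link("new.identity@example.com", "ERA_USER")
print(links.linked_identity("ERA_USER"))
```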

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
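If you run long jobs against controlled-access data, it can help to track when the 24-hour window closes. The following is a minimal sketch of that arithmetic, assuming the window starts at the moment of verification; the function names and the timestamps are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the answer above


def access_expires_at(verified_at):
    """Time at which access lapses, given the time of verification."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while `now` falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
print(access_expires_at(verified).isoformat())
print(has_controlled_access(verified, verified + timedelta(hours=23)))
print(has_controlled_access(verified, verified + timedelta(hours=25)))
```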

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature-requests or bug-reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

163 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA dataas accessible as possible to a wide range of users For the programmatic users this includes complete access to thetools that Google is pioneering to allow users to scale-up their analyses on the Google infrastructure using a variety ofmeans

The ISB-CGC documentation and the example code on github will continue to grown to provide starting-points anduse-cases designed to suit the needs of a variety of end-users If you have a particular use-case that has not yet beenaddressed please contact us (email infoisb-cgcorg) and we will work with you to determine the best approach torun the analysis you have in mind

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


Longer-Term

• Import a subset of VCF files and sequence-level data into Google Genomics


1.3 Programmatic Access

Programmatic access to the data and metadata is provided through a combination of ISB-CGC APIs and Google APIs. The majority of the TCGA data in BigQuery tables and in Google Cloud Storage is accessed directly via Google Cloud tools and interfaces. Access to ISB-CGC metadata and user data, such as cohort definitions, is provided through the ISB-CGC programmatic API described below.

A growing set of tutorials and programming examples, illustrating how you can work with these TCGA data using a variety of programming environments such as Python and R, or grid computing systems such as GridEngine, are provided in our github repositories, also described below.

1.3.1 Computational System Model

There are two primary ways in which users can interact with ISB-CGC data. The first method is through the ISB-CGC web application, which provides users a convenient web-based interface from which it is easy to create and visualize collections of data hosted by the ISB-CGC.

The second method is through the ISB-CGC programmatic API or through other Google Cloud APIs. The ISB-CGC API provides access to much of the same computational functionality as the web application, and the other Google APIs can be used to access the hosted data sets, depending on which technology is used to host them:

• the BigQuery Web UI, Command-Line Tool, or REST API for the data stored in BigQuery tables,

• the Google Cloud Storage (GCS) JSON API or gsutil for the data stored in GCS objects, or

• the Genomics REST API for data stored in Google Genomics.

For users interested in performing custom analyses, accessing the data directly using these APIs will provide greater flexibility.

The Cloud Paradigm

In addition to hosting the TCGA data in the cloud, one of the main goals of the ISB-CGC is to “bring the computation to the data”. There are many ways that this can be done: using legacy tools, cloud-native tools, or a combination of the two. Regardless of the details of the particular solution, the single most important difference between the ISB-CGC computational system model and traditional HPC models is that there is no single “monolithic” system that is doing the computational work. Cloud-native solutions instead abstract the configuration management process from the allocation of physical hardware, making it very easy to programmatically request an arbitrary number of identical machines, which can then be easily “torn down” (and regenerated) whenever necessary. The configuration state of these machines will always be identical on startup and can be parametrized according to your algorithm’s resource needs.

One important implication to understand about this new computational paradigm is that the burden of system administration is partially shifted to the users of the cloud: researchers and developers. While numerous tools exist to help simplify these tasks, there is no IT department managing your cloud computing. This means that researchers will need to learn a new skill set.


1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials and also to add your own examples to enrich this public resource.


1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python <https://github.com/isb-cgc/examples-Python/python> repository.

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character “barcode”, while patients are identified using the 12-character prefix of the sample barcode. Other datasets such as CCLE may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
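The relationship between the two TCGA barcode types can be illustrated with a short Python sketch. The function names and the example sample barcodes are illustrative assumptions and are not part of the ISB-CGC API; the only facts taken from the text are the two lengths (16 and 12 characters). The sketch applies to TCGA barcodes only, since other datasets may follow different conventions.

```python
SAMPLE_BARCODE_LENGTH = 16       # e.g. TCGA-B9-7268-01A (illustrative value)
PARTICIPANT_BARCODE_LENGTH = 12  # e.g. TCGA-B9-7268


def participant_barcode(sample_barcode: str) -> str:
    """Return the 12-character participant prefix of a TCGA sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError(
            f"expected a {SAMPLE_BARCODE_LENGTH}-character TCGA sample barcode, "
            f"got {sample_barcode!r}"
        )
    return sample_barcode[:PARTICIPANT_BARCODE_LENGTH]


def group_samples_by_participant(sample_barcodes):
    """Map each participant barcode to the list of its sample barcodes."""
    groups = {}
    for barcode in sample_barcodes:
        groups.setdefault(participant_barcode(barcode), []).append(barcode)
    return groups
```

Grouping a cohort's sample list this way gives a quick consistency check against its patient list: every key should appear among the cohort's participant barcodes.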

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a “preview” of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
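For users who prefer Python to curl, the same request might be built with the standard library as sketched below. This is a minimal sketch under assumptions: the endpoint URL is reconstructed from the curl example, the helper names `build_preview_request` and `preview_cohort` are hypothetical, and the single-study filter is only an illustration.

```python
import json
import urllib.request

# Assumed endpoint URL, reconstructed from the curl example.
PREVIEW_COHORT_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_request(filters: dict) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the filters as JSON."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_COHORT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preview_cohort(filters: dict) -> dict:
    """Send the request and decode the JSON response (needs network access)."""
    with urllib.request.urlopen(build_preview_request(filters)) as response:
        return json.loads(response.read().decode("utf-8"))
```

Separating request construction from sending makes it possible to inspect the exact body and headers before anything goes over the network, for example `build_preview_request({"Study": "BRCA"}).data`.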

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: “True”, “False”, or “None”
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: “True”, “False”, or “None”
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: “True”, “False”, or “None”
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: “True”, “False”, or “None”
has_RPPA | string | Indicates if a sample has protein array data: “True”, “False”, or “None”
has_SNP6 | string | Indicates if a sample has copy number data: “True”, “False”, or “None”
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: “True”, “False”, or “None”
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: “True”, “False”, or “None”
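Because the parameter table mixes string, integer, and float fields, it can be handy to check a filter object on the client side before sending it. The following is a minimal sketch, not part of the ISB-CGC tooling: `FILTER_TYPES` copies only a small subset of the table for illustration, `validate_filters` is a hypothetical helper, and whether the service itself enforces these types is not stated in this documentation.

```python
# Illustrative subset of the parameter table; extend as needed.
FILTER_TYPES = {
    "Study": str,
    "gender": str,
    "vital_status": str,
    "batch_number": int,
    "days_to_death": int,
    "avg_percent_necrosis": float,
    "psa_value": float,
}

# Names of the types as they appear in the "Value" column of the table.
TYPE_NAMES = {str: "string", int: "integer", float: "float"}


def _matches(value, expected) -> bool:
    if isinstance(value, bool):  # bool is a subclass of int; reject it
        return False
    if expected is float:        # accept whole numbers for float fields
        return isinstance(value, (int, float))
    return isinstance(value, expected)


def validate_filters(filters: dict) -> list:
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name, value in filters.items():
        expected = FILTER_TYPES.get(name)
        if expected is None:
            problems.append(f"{name}: unknown filter")
        elif not _matches(value, expected):
            problems.append(
                f"{name}: expected {TYPE_NAMES[expected]}, "
                f"got {type(value).__name__}"
            )
    return problems
```

Returning a list of messages rather than raising on the first problem lets a script report every issue in a filter object at once.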

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
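Note that the two counts are documented as strings rather than numbers. A client may therefore want to convert them while decoding the response, as in this minimal sketch. The `CohortPreview` class, the `parse_preview` helper, and the example payload are hypothetical and written only to match the property list above; they are not output captured from the service.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class CohortPreview:
    patient_count: int
    patients: List[str]
    sample_count: int
    samples: List[str]


def parse_preview(response_text: str) -> CohortPreview:
    """Decode a preview response, converting the string counts to integers."""
    data = json.loads(response_text)
    return CohortPreview(
        patient_count=int(data["patient_count"]),
        patients=list(data.get("patients", [])),
        sample_count=int(data["sample_count"]),
        samples=list(data.get("samples", [])),
    )


# Hand-written example payload (hypothetical values).
EXAMPLE_RESPONSE = """
{
  "patient_count": "1",
  "patients": ["TCGA-B9-7268"],
  "sample_count": "2",
  "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"]
}
"""

preview = parse_preview(EXAMPLE_RESPONSE)
```

After parsing, `preview.sample_count == len(preview.samples)` is a cheap sanity check on the decoded result.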


patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
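A request URL for this endpoint might be assembled as in the sketch below. Two assumptions are made and should be verified against the live API: the base URL is reconstructed from the HTTP request line above, and the barcode is sent as a query-string parameter, since the documented URL contains no path placeholder even though the table labels it a path parameter. The helper name `patient_details_url` is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL, reconstructed from the HTTP request line.
PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)

PARTICIPANT_BARCODE_LENGTH = 12  # e.g. TCGA-B9-7268


def patient_details_url(patient_barcode: str) -> str:
    """Return the GET URL for one participant (query-parameter form assumed)."""
    if len(patient_barcode) != PARTICIPANT_BARCODE_LENGTH:
        raise ValueError(
            f"expected a {PARTICIPANT_BARCODE_LENGTH}-character participant "
            f"barcode, got {patient_barcode!r}"
        )
    query = urlencode({"patient_barcode": patient_barcode})
    return f"{PATIENT_DETAILS_URL}?{query}"
```

Checking the barcode length up front catches the common mistake of passing a 16-character sample barcode where a participant barcode is required.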

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name Value Descriptionkind cohort_apicohortsItem The resource typealiquots[] list List of barcodes of aliquots taken from this participantclinical_data nested object The clinical data about the participantclinical_dataParticipantBarcode string Participant barcodeclinical_dataProject string Project name eg ldquoTCGArdquoclinical_dataStudy string Tumor type abbreviation eg ldquoBRCArdquoclinical_dataage_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed in yearsclinical_dataanatomic_neoplasm_subdivision string Text term to describe the spatial location subdivisions andor anatomic site name of a tumorclinical_databatch_number string Groups samples by the batch they were processed inclinical_databcr string Biospecimen core resource eg ldquoNationwide Childrenrsquos Hospitalrdquo ldquoWashington Universityrdquoclinical_dataclinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_N string Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_stage string Stage group determined from clinical information on the tumor (T) regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancerclinical_datacolorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancerclinical_datacountry string Text to identify the name of the state province or country in which the sample was procuredclinical_datadays_to_birth string Time interval from a personrsquos date of birth to the date of initial pathologic diagnosis represented as a calculated number of 
daysclinical_datadays_to_initial_pathologic_diagnosis string Numeric value to represent the day of an individualrsquos initial pathologic diagnosis of cancerclinical_datadays_to_last_followup string Time interval from the date of last followup to the date of initial pathologic diagnosis represented as a calculated number of daysclinical_dataethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categoriesclinical_datafrozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sampleclinical_datagender string Text designations that identify genderclinical_datahistological_type string Text term for the structural pattern of cancer cells used to define a microscopic diagnosisclinical_datahistory_of_colon_polyps string YesNo indicator to describe if the subject had a previous history of colon polyps as noted in the historyphysical or previous endoscopic report(s)clinical_datahistory_of_neoadjuvant_treatment string Text term to describe the patientrsquos history of neoadjuvant treatment and the kind of treament given prior to resection of the tumorclinical_datahistory_of_prior_malignancy string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrenceclinical_datahpv_calls string Results of HPV testsclinical_datahpv_status string Current HPV statusclinical_dataicd_10 string The tenth version of the International Classification of Disease (ICD)clinical_dataicd_o_3_histology string The third edition of the International Classification of Diseases for Oncologyclinical_dataicd_o_3_site string The third edition of the International Classification of Diseases for Oncologyclinical_datalymphatic_invasion string A yesno indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvementclinical_datalymphnodes_examined 
string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American


clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.
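As a rough illustration, the request URL for this endpoint could be assembled in Python as sketched below. The helper name sample_details_url and the barcode pattern are assumptions made for this example (the pattern only encodes the 16-character shape of the example barcode TCGA-B9-7268-01A), and the sketch assumes that sample_barcode is sent as a query-string parameter.

```python
import re
from urllib.parse import urlencode

# Base URL of the cohort API, taken from the HTTP request shown above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"

# Assumed shape of a 16-character sample barcode such as TCGA-B9-7268-01A.
SAMPLE_BARCODE_RE = re.compile(r"^TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-[0-9]{2}[A-Z]$")


def sample_details_url(sample_barcode):
    """Return the GET URL for sample_details (hypothetical helper)."""
    if not SAMPLE_BARCODE_RE.match(sample_barcode):
        raise ValueError("unexpected sample barcode: %r" % sample_barcode)
    # Assumption: the required parameter is passed in the query string.
    query = urlencode({"sample_barcode": sample_barcode})
    return BASE_URL + "/sample_details?" + query


print(sample_details_url("TCGA-B9-7268-01A"))
```

The resulting URL can be opened in a browser or fetched with any HTTP client, since this method does not require authentication.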

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode
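To show how the fields described above might be consumed, here is a minimal Python sketch that counts data files per platform in a parsed sample_details response. The response dictionary is a hypothetical, heavily trimmed stand-in (a real response carries many more fields), and files_per_platform is a name invented for this example.

```python
from collections import Counter

# Hypothetical, heavily trimmed response; values are for illustration only.
response = {
    "kind": "cohort_api#cohortsItem",
    "patient": "TCGA-B9-7268",
    "aliquots": [],
    "data_details_count": "2",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "Datatype": "miRNASeq",
         "SecurityProtocol": "DBGap Open Access"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "Datatype": "miRNASeq",
         "SecurityProtocol": "DBGap Protected Access"},
    ],
}


def files_per_platform(resp):
    """Count data files per platform in a sample_details response."""
    details = resp.get("data_details", [])
    # data_details_count is documented as a string, so convert it first.
    if int(resp.get("data_details_count", "0")) != len(details):
        raise ValueError("data_details_count does not match data_details")
    return Counter(d["Platform"] for d in details)


print(files_per_platform(response))
```

Note that the count fields are typed as strings in the table, so numeric comparisons need an explicit conversion.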


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
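Because the datafilenamekeys field can be missing and can contain placeholder entries, client code may want to handle both cases. The following Python sketch shows one way to do so. The bucket and file names are made up, the helper name available_paths is an assumption, and the sketch assumes that a placeholder entry simply contains the text "file-path-not-yet-available".

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(resp):
    """Return only the usable cloud storage paths from a response dict."""
    # The datafilenamekeys field is omitted when no file paths are listed.
    keys = resp.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical responses; the bucket and file names are invented.
with_files = {
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/tcga/example_file.txt",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}
without_files = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(available_paths(with_files))
print(available_paths(without_files))
```

Using resp.get with a default avoids a KeyError for samples that only have controlled-access files when the caller lacks dbGaP authorization.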


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
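Since count is either "0" or "1", a caller has to handle the case where no Google Genomics ids exist for the sample. A minimal Python sketch of that check follows. The id values are invented placeholders and genomics_ids is a hypothetical helper, not part of the API.

```python
def genomics_ids(resp):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    # count is documented as the string "0" or "1".
    if resp.get("count", "0") == "0":
        return None
    item = resp["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Hypothetical responses; the id values are placeholders.
found = {
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [{
        "SampleBarcode": "TCGA-B9-7268-01A",
        "GG_dataset_id": "example-dataset-id",
        "GG_readgroupset_id": "example-readgroupset-id",
    }],
}
missing = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(genomics_ids(found))
print(genomics_ids(missing))
```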


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute, with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project, with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs", or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should either see a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time: you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation, or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
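If you set the same properties on several machines, it can help to generate the gcloud config set command lines from a single table of values. The Python sketch below only builds and prints the commands. The project name and zone are example values, and config_set_commands is a hypothetical helper.

```python
import shlex

# Example values only; substitute your own project, region, and zone.
properties = {
    "project": "my-gcp-project",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}


def config_set_commands(props):
    """Build one `gcloud config set` command line per property."""
    commands = []
    for name, value in sorted(props.items()):
        parts = ("gcloud", "config", "set", name, value)
        # shlex.quote protects values containing shell metacharacters.
        commands.append(" ".join(shlex.quote(part) for part in parts))
    return commands


for command in config_set_commands(properties):
    print(command)
```

Running gcloud config list afterwards is a quick way to confirm that the values took effect.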


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console, or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner, and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page, with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page, you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property, for example:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
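If you want to launch VMs from a script rather than by typing the command, one possible approach is to build the gcloud argument list in Python and pass it to subprocess. The sketch below only prints the command; create_instance_args is a hypothetical helper, and the instance name, machine type, and zone are the example values used in this section.

```python
import subprocess


def create_instance_args(name, machine_type="g1-small", zone=None):
    """Return the argument list for `gcloud compute instances create`."""
    args = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # When --zone is omitted, gcloud can use the compute/zone property
        # from your configuration, if it has been set.
        args += ["--zone", zone]
    return args


args = create_instance_args("my-instance", zone="us-central1-a")
print(" ".join(args))

# To actually launch the VM (this incurs charges), uncomment the next line:
# subprocess.check_call(args)
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems with instance names or option values.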

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances, and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
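The two prices quoted above imply a rate of $2 / 50 GB = $0.04 per GB per month (and $40 / $0.04 = 1000 GB for the 1 TB figure). As a small worked example, the Python sketch below applies that rate to a few disk sizes. The rate is derived only from the figures in this paragraph and may not match current pricing; monthly_disk_cost is a name chosen for this example.

```python
# Rate implied by the figures above: $2 / 50 GB = $0.04 per GB per month.
STANDARD_PD_USD_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb):
    """Estimated monthly cost in USD of a standard persistent disk."""
    return size_gb * STANDARD_PD_USD_PER_GB_MONTH


for size_gb in (50, 500, 1000):
    cost = round(monthly_disk_cost(size_gb), 2)
    print(size_gb, "GB:", cost, "USD per month")
```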

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point:

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
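To summarize the order of operations described in this section, the Python sketch below lists the gcloud commands that are run from your workstation over the lifetime of one additional disk. The helper name disk_lifecycle_commands is an assumption; the disk and instance names are the examples used above, and the formatting, mounting, and unmounting steps that run on the instance itself are only indicated by a comment.

```python
def disk_lifecycle_commands(disk, instance, size_gb):
    """Return the workstation-side gcloud commands for one disk, in order."""
    return [
        "gcloud compute disks create %s --size %dGB" % (disk, size_gb),
        "gcloud compute instances attach-disk %s --disk %s" % (instance, disk),
        # ... log on to the instance, then format, mount, use, and
        # unmount the disk before detaching it ...
        "gcloud compute instances detach-disk %s --disk %s" % (instance, disk),
        "gcloud compute disks delete %s" % disk,
    ]


for command in disk_lifecycle_commands("disk-1", "my-instance", 500):
    print(command)
```

The ordering matters: a disk can only be deleted once it is no longer in use by any VM instance, so the detach step has to come before the delete step.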


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the “Set Operations” button, you must have at least one cohort selected. Clicking the “Set Operations” button opens a dialogue box in which you can do the following:

• Enter a name for the resulting cohort
• Select a set operation
• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
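The three set operations can be illustrated with ordinary Python sets. This is only a sketch of the semantics described above, not the web-app's implementation: the function names and the sample identifiers are made up for illustration, and each cohort is assumed to be representable as a set of sample identifiers.

```python
def union(cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(cohorts):
    """Samples that appear in every cohort; order does not matter."""
    cohorts = [set(c) for c in cohorts]
    if not cohorts:
        return set()
    return set.intersection(*cohorts)


def complement(base, others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, each a set of made-up sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

print(sorted(union([cohort_a, cohort_b, cohort_c])))
print(sorted(intersect([cohort_a, cohort_b, cohort_c])))
print(sorted(complement(cohort_a, [cohort_b, cohort_c])))
```

Union and intersect give the same result whatever the order of the cohorts, whereas complement depends on which cohort is chosen as the base.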

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
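As a rough illustration of name-only matching that yields two result lists, consider the sketch below. It is not the web-app's code: the function name and the example records are hypothetical, and treating the match as a case-insensitive substring test is an assumption, since the text above only says that string matching on names is used.

```python
def search_by_name(query, cohorts, visualizations):
    """Return (matching cohorts, matching visualizations), matching on name only."""
    needle = query.lower()  # assumption: matching ignores case
    cohort_hits = [c for c in cohorts if needle in c["name"].lower()]
    viz_hits = [v for v in visualizations if needle in v["name"].lower()]
    return cohort_hits, viz_hits


# Hypothetical records; only the "name" field is searched.
cohorts = [{"name": "Breast tumors"}, {"name": "Lung normals"}]
visualizations = [{"name": "Breast expression plot"}, {"name": "Age histogram"}]

cohort_hits, viz_hits = search_by_name("breast", cohorts, visualizations)
print(cohort_hits)
print(viz_hits)
```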

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column, and the header indicates whether the column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. Clicking on a feature expands the field and provides additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead” and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
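The way the counts beside each filter value depend on the other selected filters can be sketched as follows. This is an illustration of the counting rule only, under the assumption that each sample is a simple record of attribute values; the function names, attribute names and sample records are hypothetical.

```python
from collections import Counter


def matches(sample, filters, skip=None):
    """True if the sample satisfies every filter except the skipped attribute."""
    return all(
        sample.get(attr) in allowed
        for attr, allowed in filters.items()
        if attr != skip and allowed
    )


def facet_counts(samples, filters, attribute):
    """Count samples per value of `attribute`, applying all *other* filters."""
    return Counter(
        sample.get(attribute)
        for sample in samples
        if matches(sample, filters, skip=attribute)
    )


# Made-up sample records and filter selections.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": None, "gender": "FEMALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}

vital_counts = facet_counts(samples, selected, "vital_status")
gender_counts = facet_counts(samples, selected, "gender")
print(vital_counts)
print(gender_counts)
```

In this sketch the counts shown for Vital Status take the Gender filter into account but not the Vital Status selection itself, so the values that are not currently selected still show how many samples they would contribute.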

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Clicking the “Set Operations” button opens a dialogue box. Here you can enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.
• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort, and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.
• Share with Others: This behaves in the same way as sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
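The difference between the sample count and the participant count can be seen from the barcodes themselves. The sketch below relies on the standard TCGA barcode layout, in which the first three hyphen-separated fields of a sample barcode identify the participant. The barcodes are made up for illustration and the function names are hypothetical.

```python
def participant_barcode(sample_barcode):
    """Participant part of a TCGA sample barcode (first three fields)."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of samples, number of distinct participants)."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(b) for b in samples}
    return len(samples), len(participants)


# Made-up barcodes: the first participant contributed two samples.
barcodes = ["TCGA-AA-0001-01A", "TCGA-AA-0001-10A", "TCGA-AA-0002-01A"]

n_samples, n_participants = cohort_counts(barcodes)
print(n_samples, n_participants)
```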

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see the two additional treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
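Once downloaded, a file list of this kind can be processed with a few lines of Python. The sketch below groups Cloud Storage paths by platform. It assumes that the CSV has a header row whose column names match the fields listed above, which may not be exactly what the web-app produces. The rows, the bucket name and the function name are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up rows standing in for a downloaded file list; header names are assumed.
EXAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,platform_x,pipeline_1,Level 3,gs://example-bucket/a.txt
TCGA-AA-0001-01A,platform_y,pipeline_2,Level 3,gs://example-bucket/b.txt
TCGA-AA-0002-01A,platform_x,pipeline_1,Level 3,gs://example-bucket/c.txt
"""


def paths_by_platform(handle):
    """Group the file paths in a file-list CSV by platform."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


result = paths_by_platform(io.StringIO(EXAMPLE_CSV))
for platform, paths in sorted(result.items()):
    print(platform, len(paths))
```

For a real download, the `io.StringIO` object would be replaced by the opened file, for example `open("file_list.csv", newline="")`, where the file name is again only a placeholder.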

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot:

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.
• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, set the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
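The Owner and Reader rules can be summarized in a small model. This is a sketch of the behaviour described in this section and in the sharing section above, not the web-app's actual data model: the class, its fields and the user names are all hypothetical.

```python
from dataclasses import dataclass, field

OWNER = "owner"
READER = "reader"


@dataclass
class Cohort:
    name: str
    samples: frozenset
    owner: str
    readers: set = field(default_factory=set)
    comments: list = field(default_factory=list)

    def permission(self, user):
        """Return "owner", "reader", or None if the user has no access."""
        if user == self.owner:
            return OWNER
        if user in self.readers:
            return READER
        return None

    def share(self, actor, user):
        """Only the owner may share; the recipient becomes a reader."""
        if self.permission(actor) != OWNER:
            raise PermissionError("only the owner can share this cohort")
        self.readers.add(user)

    def comment(self, actor, text):
        """Owners and readers may both comment."""
        if self.permission(actor) is None:
            raise PermissionError("no access to this cohort")
        self.comments.append((actor, text))

    def clone(self, actor):
        """Copy only the sample list; the actor owns the copy."""
        if self.permission(actor) is None:
            raise PermissionError("no access to this cohort")
        return Cohort(name=f"Copy of {self.name}", samples=self.samples, owner=actor)


cohort = Cohort("example cohort", frozenset({"S1", "S2"}), owner="alice")
cohort.share("alice", "bob")
cohort.comment("bob", "Looks good")
bobs_copy = cohort.clone("bob")
print(cohort.permission("bob"), bobs_copy.permission("bob"))
```

In this sketch the clone carries over the sample list but neither the comments nor the list of readers, mirroring the statement that a copy does not include information private to the original owner.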


1.4.8 Release Notes

• December 23, 2015: v0.2
  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.
• December 3, 2015: v0.1
  – First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
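The same tables can also be reached from Python. The sketch below is one possible approach rather than an official recipe. It assumes the `google-cloud-bigquery` client library is installed and that you have credentials for your own Google Cloud Project, which is billed for the query. Only the project ID `isb-cgc` comes from the text above: the dataset and table names, the billing project name and the helper function names are placeholders that must be replaced with real values.

```python
def table_ref(dataset, table, project="isb-cgc"):
    """Fully qualified, backtick-quoted table reference in standard SQL."""
    return f"`{project}.{dataset}.{table}`"


def build_row_count_query(dataset, table):
    """SQL that counts the rows of one table in the isb-cgc project."""
    return f"SELECT COUNT(*) AS n_rows FROM {table_ref(dataset, table)}"


def run_query(sql, billing_project):
    """Run the query, billing it to your own Google Cloud Project."""
    # Imported lazily so the helpers above work without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# "example_dataset" and "example_table" are placeholders, not real table names.
sql = build_row_count_query("example_dataset", "example_table")
print(sql)
# rows = run_query(sql, billing_project="your-gcp-project")  # needs credentials
```

The data lives in the `isb-cgc` project, while the query itself runs in, and is billed to, the project passed as `billing_project`. This is why membership of a Google Cloud Project is needed even for open-access tables.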

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page and, after you successfully authenticate, you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
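A script that depends on controlled-access data may want to check whether the 24-hour window is still open before starting a long job. The sketch below shows the arithmetic only. It assumes that the window starts at the moment the dbGaP authorization is verified and that you have recorded that time yourself. The function names are hypothetical and nothing here talks to the ISB-CGC platform.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Time at which the controlled-access window closes."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now=None):
    """True while `now` falls inside the 24-hour window."""
    now = now or datetime.now(timezone.utc)
    return verified_at <= now < access_expires_at(verified_at)


# Example with fixed, made-up timestamps.
verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
print(has_controlled_access(verified, now=verified + timedelta(hours=23)))
print(has_controlled_access(verified, now=verified + timedelta(hours=25)))
```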


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases, and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform, and has been designed to make the TCGA data as accessible as possible to a wide range of users. For the programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

simplify these tasks, there is no IT department managing your cloud-computing. This means that researchers will need to learn a new skill-set.

1.3.2 R and Python Tutorials

For ISB-CGC users who want to perform custom analyses by writing R or Python scripts, we have begun to assemble a set of examples in two public github repositories: examples-Python and examples-R. R users can work from the familiar environment of RStudio, and Python programmers can enjoy the richness available in IPython notebooks by taking advantage of the newly released Cloud Datalab (note that Cloud Datalab is a beta release).

These repositories contain numerous examples that will help you learn to access and analyze the TCGA data in BigQuery, as well as examples showing how to use our APIs to query the metadata and discover where to find the data that you are looking for in Google Cloud Storage.

We encourage the community to provide feedback on these tutorials, and also to add your own examples to enrich this public resource.

1.3.3 Programmatic Interfaces

Programmatic access to molecular data in BigQuery, Google Cloud Storage, or Google Genomics is based directly on the interfaces provided by the Google Cloud Platform, as illustrated throughout the ISB-CGC code repositories on github.

In order to query the ISB-CGC metadata, or to get information such as details regarding a cohort that a user may have saved during an interactive session, a series of APIs based on Google Cloud Endpoints have been defined. Details about these APIs can be found here, and examples illustrating how to use these endpoints from Python can be found in our examples-Python repository (https://github.com/isb-cgc/examples-Python/python).

The Google APIs Explorer can be used to see each API and try it out through your web browser. Each API may bundle several endpoints that are functionally related.

Cohorts are the primary organizing principle for subsetting and working with the TCGA data. A cohort is a list of samples and a list of patients. (TCGA samples are identified using a 16-character "barcode", while patients are identified using the 12-character prefix of the sample barcode. Other datasets, such as CCLE, may use other, less standardized naming conventions.) Users may create and share cohorts using the ISB-CGC web-app and then programmatically access these cohorts using this API.
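The relationship between the two barcode lengths can be sketched in a few lines of Python. This is only an illustration of the prefix rule described above: the helper names are hypothetical and not part of any ISB-CGC library, and the sample barcode suffix (-01A) is an assumed example value appended to the participant barcode TCGA-B9-7268 used elsewhere in this documentation.

```python
def participant_barcode(sample_barcode):
    """Return the 12-character participant prefix of a 16-character TCGA sample barcode."""
    if len(sample_barcode) != 16:
        # Other datasets (e.g. CCLE) may not follow this convention.
        raise ValueError("expected a 16-character TCGA sample barcode: %r" % sample_barcode)
    return sample_barcode[:12]


def group_samples_by_participant(sample_barcodes):
    """Map each participant barcode to the list of its sample barcodes."""
    groups = {}
    for barcode in sample_barcodes:
        groups.setdefault(participant_barcode(barcode), []).append(barcode)
    return groups


if __name__ == "__main__":
    # Hypothetical sample barcodes, for illustration only.
    samples = ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"]
    print(group_samples_by_participant(samples))
```

Grouping by the prefix is a convenient way to check that the patient list and the sample list of a cohort agree with each other.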

The Cohort API currently bundles several different cohort-related endpoints:

preview

Takes a JSON object of filters in the request body and returns a "preview" of the cohort that would result from passing a similar request to the cohort save endpoint. This preview consists of two lists: the list of participant (aka patient) barcodes, and the list of sample barcodes. Authentication is not required. Example:

$ curl https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort -d '{"Study": "BRCA,OV"}' -H "Content-Type: application/json"
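For Python users, an equivalent request can be sketched with the standard library alone. This is a minimal sketch, not the method used in the examples-Python repository: the function names are hypothetical, the endpoint URL is reconstructed from the curl example above, and the single-study filter value is an assumed illustration. The sketch does not check whether the endpoint is currently reachable.

```python
import json
import urllib.request

# Endpoint URL as given in the curl example above (assumed to be correct).
PREVIEW_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"


def build_preview_request(filters):
    """Build (but do not send) a POST request carrying the filters as a JSON body."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def preview_cohort(filters, timeout=30):
    """Send the preview request and return the decoded JSON response."""
    request = build_preview_request(filters)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires network access; the filter value is illustrative.
    print(preview_cohort({"Study": "BRCA"}))
```

Separating request construction from sending makes it possible to inspect the URL, headers, and body before any network call is made.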

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Parameter name  Value  Description
adenocarcinoma_invasion  string
age_at_initial_pathologic_diagnosis  string  Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision  string  Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration  float  Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration  float  Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis  float  Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration  float  Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells  float  Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells  float  Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells  float  Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei  float  Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number  integer  Groups samples by the batch they were processed in
bcr  string  A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M  string  Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N  string  Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage  string  Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T  string  Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer  string  Text term to signify whether a patient has been diagnosed with colorectal cancer
country  string  Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement  string  Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth  integer  Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection  integer
days_to_death  integer  Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis  integer  Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup  integer  Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx  integer  Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study  string  A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity  string  The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site  string  Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender  string  Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height  integer  The height of the patient in centimeters
histological_type  string  Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps  string  Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment  string  Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls  string  Results of HPV tests
hpv_status  string  Current HPV status
icd_10  string  The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology  string  The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site  string  The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count  integer
lymphatic_invasion  string  A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined  string  The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present  string  The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration  integer  Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration  integer  Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis  integer  Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration  integer  Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells  integer  Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells  integer  Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells  integer  Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei  integer  Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status  string  Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration  integer  Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration  integer  Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis  integer  Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration  integer  Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells  integer  Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells  integer  Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells  integer  Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei  integer  Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade  string  Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment  string  Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined  integer  The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he  integer  Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode  string  The barcode assigned by TCGA to the Participant
pathologic_M  string  Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N  string  The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage  string  The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T  string  Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status  string  The state or condition of an individual's neoplasm at a particular point in time
pregnancies  string  Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method  string
primary_neoplasm_melanoma_dx  string  Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success  string  Measure of Success
prior_dx  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project  string  The study for which the data was generated
psa_value  float  The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor  string  Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode  string  The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history  string  Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies  integer
tumor_tissue_site  string  Text term that describes the anatomic site of the tumor or disease
tumor_pathology  string
tumor_type  string  Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion  string  The result of an assessment using the Weiss histopathologic criteria
vital_status  string  The survival state of the person registered on the protocol
weight  integer  The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode  string  The type of the sample: tumor or normal tissue, cell or blood sample provided by a participant
has_Illumina_DNASeq  string  Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq  string  Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq  string  Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA  string  Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6  string  Indicates if a sample has copy number data: "True", "False", or "None"
has_27k  string  Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k  string  Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
patient_count  string  Number of participants in this cohort
patients[]  list  List of participant barcodes in this cohort
sample_count  string  Number of samples in this cohort
samples[]  list  List of sample barcodes in this cohort
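Because the counts in this response are documented as strings, a caller will usually want to convert them before use. The following is a minimal sketch of one way to do that, assuming the response has already been decoded from JSON into a Python dict; the function name and the sample response values are hypothetical.

```python
def summarize_preview(response):
    """Convert the string counts of a preview response to integers.

    Also reports whether each count agrees with the length of its list.
    """
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    patient_count = int(response.get("patient_count", len(patients)))
    sample_count = int(response.get("sample_count", len(samples)))
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        "consistent": patient_count == len(patients) and sample_count == len(samples),
    }


if __name__ == "__main__":
    # Hypothetical response, shaped like the structure documented above.
    example = {
        "kind": "cohort_api#cohortsItem",
        "patient_count": "1",
        "patients": ["TCGA-B9-7268"],
        "sample_count": "2",
        "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"],
    }
    print(summarize_preview(example))
```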

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name  Value  Description
Path parameters
patient_barcode  string  Barcode of the patient to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_atinitial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononcleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Property name Value Descriptionkind cohort_apicohortsItem The resource typealiquots[] list List of barcodes of aliquots taken from this participantclinical_data nested object The clinical data about the participantclinical_dataParticipantBarcode string Participant barcodeclinical_dataProject string Project name eg ldquoTCGArdquoclinical_dataStudy string Tumor type abbreviation eg ldquoBRCArdquoclinical_dataage_at_initial_pathologic_diagnosis string Age at which a condition or disease was first diagnosed in yearsclinical_dataanatomic_neoplasm_subdivision string Text term to describe the spatial location subdivisions andor anatomic site name of a tumorclinical_databatch_number string Groups samples by the batch they were processed inclinical_databcr string Biospecimen core resource eg ldquoNationwide Childrenrsquos Hospitalrdquo ldquoWashington Universityrdquoclinical_dataclinical_M string Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_N string Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_T string Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatmentclinical_dataclinical_stage string Stage group determined from clinical information on the tumor (T) regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancerclinical_datacolorectal_cancer string Text term to signify whether a patient has been diagnosed with colorectal cancerclinical_datacountry string Text to identify the name of the state province or country in which the sample was procuredclinical_datadays_to_birth string Time interval from a personrsquos date of birth to the date of initial pathologic diagnosis represented as a calculated number of 
daysclinical_datadays_to_initial_pathologic_diagnosis string Numeric value to represent the day of an individualrsquos initial pathologic diagnosis of cancerclinical_datadays_to_last_followup string Time interval from the date of last followup to the date of initial pathologic diagnosis represented as a calculated number of daysclinical_dataethnicity string The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categoriesclinical_datafrozen_specimen_anatomic_site string Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sampleclinical_datagender string Text designations that identify genderclinical_datahistological_type string Text term for the structural pattern of cancer cells used to define a microscopic diagnosisclinical_datahistory_of_colon_polyps string YesNo indicator to describe if the subject had a previous history of colon polyps as noted in the historyphysical or previous endoscopic report(s)clinical_datahistory_of_neoadjuvant_treatment string Text term to describe the patientrsquos history of neoadjuvant treatment and the kind of treament given prior to resection of the tumorclinical_datahistory_of_prior_malignancy string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrenceclinical_datahpv_calls string Results of HPV testsclinical_datahpv_status string Current HPV statusclinical_dataicd_10 string The tenth version of the International Classification of Disease (ICD)clinical_dataicd_o_3_histology string The third edition of the International Classification of Diseases for Oncologyclinical_dataicd_o_3_site string The third edition of the International Classification of Diseases for Oncologyclinical_datalymphatic_invasion string A yesno indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvementclinical_datalymphnodes_examined 
string The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present string The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status string Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade string Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment string Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined string The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he string Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M string Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N string The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage string The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T string Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American


Table 1.2 (continued)
Property name Value Description
clinical_data.person_neoplasm_cancer_status string The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies string Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx string Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success string Measure of Success
clinical_data.prior_dx string Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race string The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor string Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history string Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site string Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type string Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status string The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion string The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis string Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] list List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name Value Description
Path parameters
sample_barcode string Required. Barcode of the sample to get information about.
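
As a minimal sketch, the request above could be assembled from Python using only the standard library. The base URL follows the HTTP request line above. Sending sample_barcode as a URL query parameter is an assumption (the table lists it under "Path parameters"), and the helper names build_sample_details_url and fetch_json are hypothetical, not part of any ISB-CGC client library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the cohort API, following the HTTP request line above.
API_BASE = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_sample_details_url(sample_barcode):
    """Return the sample_details URL for one 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode")
    # Assumption: the barcode is passed as a query parameter.
    query = urlencode({"sample_barcode": sample_barcode})
    return API_BASE + "/sample_details?" + query


def fetch_json(url):
    """Issue a GET request and decode the JSON response body."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the URL needs no network access; fetch_json(url) would call the service.
url = build_sample_details_url("TCGA-B9-7268-01A")
print(url)
```

No authentication is needed for this endpoint, so fetch_json(url) can be called without any credentials.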

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Table 1.3
Property name Value Description
kind cohort_api#cohortsItem The resource type
aliquots[] list List of barcodes of aliquots taken from this participant
biospecimen_data nested object Biospecimen data about the sample
biospecimen_data.ParticipantBarcode string Participant barcode
biospecimen_data.Project string Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode string Sample barcode
biospecimen_data.Study string Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration integer Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration integer Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis integer Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration integer Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells integer Average percent normal cells
biospecimen_data.avg_percent_stromal_cells integer Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells integer Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei integer Average percent tumor nuclei
biospecimen_data.batch_number string Batch number in which the sample was processed
biospecimen_data.bcr string Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection string Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration string Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration string Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis string Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration string Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells string Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells string Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells string Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei string Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration string Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration string Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis string Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration string Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells string Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells string Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells string Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei string Minimum percent tumor nuclei
data_details[] list List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath string Path to file, if it exists
data_details[].DataCenterName string Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType string Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName string Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey string Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded string Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel string Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype string Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference string Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline string A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform string A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name string The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project string The study for which the data was generated, e.g. "TCGA"
data_details[].Repository string A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName string Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode string Sample barcode
data_details[].SecurityProtocol string An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count string Length of data_details list
patient string Participant barcode
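
The following is a minimal sketch of how a decoded sample_details response might be summarized once it has been parsed into a Python dictionary. The function name files_by_platform is hypothetical, and the example response is a hand-written, heavily abbreviated stand-in (the bucket name and file name are invented), not real service output.

```python
from collections import defaultdict


def files_by_platform(sample_details):
    """Group the CloudStoragePath of each data file by its Platform."""
    grouped = defaultdict(list)
    for block in sample_details.get("data_details", []):
        path = block.get("CloudStoragePath")
        if path:  # skip blocks that carry no cloud storage path
            grouped[block.get("Platform", "unknown")].append(path)
    return dict(grouped)


# Hypothetical, abbreviated response used only to exercise the helper.
example_response = {
    "patient": "TCGA-B9-7268",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "CloudStoragePath": "gs://example-bucket/example-file.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "CloudStoragePath": ""},
    ],
    "data_details_count": "2",
}

# data_details_count is documented as a string, so convert before comparing.
file_count = int(example_response["data_details_count"])
print(file_count, files_by_platform(example_response))
```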


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths. To see paths to controlled-access files, the user must be authenticated and have dbGaP authorization. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name Value Description
Path parameters
sample_barcode string Required. Barcode of the sample to get file paths for.
platform string Optional. Filter file results by platform.
pipeline string Optional. Filter file results by pipeline.
token string Optional. Access token to authenticate user.
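
As a sketch of how the optional filters combine, the helper below builds the request URL and leaves out any parameter that was not supplied. It assumes all parameters are sent as URL query parameters; the function name build_datafilenamekey_url is hypothetical.

```python
from urllib.parse import urlencode

# Base URL of the cohort API, following the HTTP request line above.
API_BASE = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, including only the optional parameters given."""
    params = {"sample_barcode": sample_barcode}
    optional = {"platform": platform, "pipeline": pipeline, "token": token}
    # Drop unset filters so they are not sent as empty values.
    params.update({name: value for name, value in optional.items() if value is not None})
    return API_BASE + "/datafilenamekey_list_from_sample?" + urlencode(params)


print(build_datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Omitting token requests open-access file paths only, as described above.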

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
count string Integer representing the length of the datafilenamekeys list
datafilenamekeys[] list List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
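
Two details of this response are easy to trip over: the datafilenamekeys field may be absent altogether, and some entries may carry the "file-path-not-yet-available" placeholder instead of a usable path. A minimal sketch that handles both cases follows. The function name available_paths and the example values are hypothetical; the exact form of a placeholder entry is assumed to be a bucket name followed by the placeholder text.

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response):
    """Return only usable cloud storage paths from a decoded response."""
    # The field is missing entirely when no file paths are listed.
    keys = response.get("datafilenamekeys", [])
    # Drop entries that only carry the placeholder text.
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical responses used only to exercise the helper.
with_files = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/example-file.txt",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}
without_files = {"count": "0"}

print(available_paths(with_files))
print(available_paths(without_files))
```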


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name Value Description
Path parameters
sample_barcode string Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name Value Description
kind cohort_api#cohortsItem The resource type
count string The number of items returned. Count will be either "0" or "1".
items[] list If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode string The sample barcode passed into the request
items[].GG_dataset_id string The dataset id of the sample
items[].GG_readgroupset_id string The readgroupset id of the sample
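
Because the items list holds either one object or nothing, client code usually needs a single lookup with a "not found" case. The sketch below shows one way to write it. The function name genomics_ids and the id values are hypothetical, and it assumes the items field may be missing or empty when count is "0".

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]  # the list holds at most one object
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical responses used only to exercise the helper.
found = {
    "count": "1",
    "items": [{
        "SampleBarcode": "TCGA-B9-7268-01A",
        "GG_dataset_id": "example-dataset-id",
        "GG_readgroupset_id": "example-readgroupset-id",
    }],
}
missing = {"count": "0"}

print(genomics_ids(found))
print(genomics_ids(missing))
```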


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
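
If you script your setup from Python, one option is to assemble the gcloud config set commands as argument lists, review them, and only then run them. The sketch below only builds and prints the commands. The function name config_set_commands and the project id "my-gcp-project" are hypothetical; the zone value is the one used as an example elsewhere in this guide.

```python
import shlex


def config_set_commands(properties):
    """Return one 'gcloud config set' argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in properties.items()]


commands = config_set_commands({
    "project": "my-gcp-project",      # hypothetical project id
    "compute/zone": "us-central1-a",
})

for argv in commands:
    # Each list could be passed to subprocess.run(argv, check=True)
    # on a machine where the Cloud SDK is installed.
    print(shlex.join(argv))
```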


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will open a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
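
When many VMs need to be created from a script, it can help to build the command as an argument list first. The sketch below assembles the same gcloud compute instances create command from Python and adds --zone only when a zone is given (otherwise the compute/zone property from your configuration applies). The function name create_instance_command is hypothetical, and the sketch does not run gcloud itself.

```python
import shlex


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Return the argument list for 'gcloud compute instances create'."""
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # An explicit zone overrides the configured compute/zone default.
        argv += ["--zone", zone]
    return argv


# Print the commands; pass a list to subprocess.run(argv, check=True) to execute.
print(shlex.join(create_instance_command("my-instance")))
print(shlex.join(create_instance_command("my-instance", zone="us-central1-a")))
```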

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
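
The two disk prices quoted above both work out to $0.04 per GB per month, if 1 TB is counted as 1000 GB. The sketch below uses that implied rate to estimate the monthly cost of other disk sizes. The rate is derived only from the figures in this guide and may not match current pricing; the function name monthly_disk_cost is hypothetical.

```python
# Rate implied by "50 GB costs $2 per month": 200 cents / 50 GB = 4 cents per GB.
PRICE_CENTS_PER_GB_MONTH = 4


def monthly_disk_cost(size_gb):
    """Estimated monthly cost in dollars of a standard persistent disk."""
    return size_gb * PRICE_CENTS_PER_GB_MONTH / 100


for size_gb in (50, 500, 1000):
    print(size_gb, "GB:", monthly_disk_cost(size_gb), "dollars per month")
```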


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point. Inside the instance, the disk appears under /dev/disk/by-id/ with its device name prefixed by google-:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
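
To summarize the order of the steps described in this section, the sketch below lists the gcloud commands that are run from your workstation over the life of one additional disk. Formatting, mounting, and unmounting happen inside the VM and are only marked by a comment. The function name disk_lifecycle_commands is hypothetical, and the commands assume that a default zone has already been configured.

```python
import shlex


def disk_lifecycle_commands(instance, disk, size="500GB"):
    """Return the workstation-side gcloud commands for one disk, in order."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance, "--disk", disk],
        # Between these steps: ssh to the VM, then format, mount, use, and unmount the disk.
        ["gcloud", "compute", "instances", "detach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


plan = disk_lifecycle_commands("my-instance", "disk-1")
for argv in plan:
    print(shlex.join(argv))
```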


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button becomes selectable once at least one cohort is selected. Click on the "Trash" button to delete the selected cohorts. Visualizations are deleted in the same way.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button becomes selectable once at least one cohort is selected. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue box you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users already registered in the system. Click the "Share Cohort" button when you are done. Visualizations are shared in the same way.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. In this dialogue box you can do the following:

• Enter a name for the resulting cohort
• Select a set operation
• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
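
The three operations described above behave like ordinary set operations on sample identifiers. The following is a minimal sketch, assuming each cohort can be represented as a Python set of sample barcodes; the function name and the barcodes are illustrative and are not part of the web-app.

```python
def combine_cohorts(operation, cohorts):
    """Combine cohorts (each an iterable of sample barcodes).

    For "complement", the first cohort is the base cohort and every
    remaining cohort is subtracted from it.
    """
    if not cohorts:
        raise ValueError("at least one cohort is required")
    sets = [set(cohort) for cohort in cohorts]
    if operation == "union":
        return set().union(*sets)
    if operation == "intersect":
        return set.intersection(*sets)
    if operation == "complement":
        base, others = sets[0], sets[1:]
        return base.difference(*others)
    raise ValueError("unknown operation: %r" % operation)


# Illustrative barcodes only.
a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
c = {"S3"}
print(sorted(combine_cohorts("union", [a, b, c])))
print(sorted(combine_cohorts("intersect", [a, b, c])))
print(sorted(combine_cohorts("complement", [a, b, c])))
```

Union and intersect give the same result for any ordering of the inputs, whereas complement depends on which cohort is listed first, which is why the dialogue box asks for a base cohort.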

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, that only contain samples for which certain types of data are available, or that focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field expands and provides additional filtering options. For example, when you click on "Vital Status", it expands and provides "Alive", "Dead" and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
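
One way to read "based on all other filters that have been selected" is that the count shown next to a value ignores the filter on its own attribute and applies every other selected filter. The sketch below illustrates that reading only; it is an assumption about the behaviour, not the web-app's actual implementation, and the attribute names and sample records are made up.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count values of `attribute` among samples passing all OTHER filters.

    samples:  list of dicts mapping attribute name -> value
    selected: dict mapping attribute name -> set of accepted values
    """
    # Drop the filter on the attribute being counted, keep the rest.
    others = {name: values for name, values in selected.items()
              if name != attribute and values}
    counts = Counter()
    for sample in samples:
        if all(sample.get(name) in values for name, values in others.items()):
            counts[sample.get(attribute)] += 1
    return counts


# Made-up records for illustration.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": None, "gender": "MALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
print(facet_counts(samples, selected, "vital_status"))
print(facet_counts(samples, selected, "gender"))
```

Under this reading, selecting "Alive" does not make the "Dead" count drop to zero, so you can still see how many samples you would gain by adding another value of the same attribute.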

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so that it is easy to see which filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
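
The number shown when hovering over a swatch can be thought of as a count of samples per pair of platforms on two adjacent vertical bars. The following is a minimal sketch of that counting, assuming each sample is described by a mapping from data type to platform; the function name, sample identifiers and platform labels are illustrative placeholders rather than values taken from the web-app.

```python
from collections import Counter


def swatch_counts(samples, left_type, right_type):
    """Count samples for each (left platform, right platform) pair.

    samples maps sample barcode -> {data type: platform}. A missing
    data type is reported as "NA", mirroring the panel's "NA" segment.
    """
    counts = Counter()
    for platforms in samples.values():
        pair = (platforms.get(left_type, "NA"),
                platforms.get(right_type, "NA"))
        counts[pair] += 1
    return counts


# Placeholder samples and platform labels.
samples = {
    "sample-1": {"RNA-Sequence": "platform-A", "DNA Methylation": "platform-X"},
    "sample-2": {"RNA-Sequence": "platform-A", "DNA Methylation": "platform-Y"},
    "sample-3": {"RNA-Sequence": "platform-B"},
    "sample-4": {"RNA-Sequence": "platform-A", "DNA Methylation": "platform-X"},
}
for pair, n in sorted(swatch_counts(samples, "RNA-Sequence",
                                    "DNA Methylation").items()):
    print(pair, n)
```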

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you can enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears and you are prompted to select users registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will download a list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
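
Once downloaded, the file list can be processed with any CSV reader. The sketch below groups Cloud Storage paths by platform using Python's standard csv module. The exact column headers in the downloaded file are not stated in this documentation, so the header names used here ("Platform", "File Path") are assumptions, and the rows are made-up stand-ins; check the first line of your own file and pass the real header names if they differ.

```python
import csv
import io
from collections import defaultdict


def files_by_platform(handle, platform_column="Platform",
                      path_column="File Path"):
    """Group Cloud Storage paths by platform from an open file-list CSV.

    The default column names are assumptions, not confirmed headers.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row[platform_column]].append(row[path_column])
    return dict(grouped)


# Made-up rows standing in for a downloaded file list.
example_csv = io.StringIO(
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "sample-1,platform-A,pipeline-1,Level 3,gs://example-bucket/a.txt\n"
    "sample-2,platform-A,pipeline-1,Level 3,gs://example-bucket/b.txt\n"
    "sample-3,platform-B,pipeline-2,Level 3,gs://example-bucket/c.txt\n"
)
grouped = files_by_platform(example_csv)
for platform, paths in sorted(grouped.items()):
    print(platform, len(paths))
```

For a real download, open the file with `open(path, newline="")` and pass the resulting handle to the same function.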

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes to it.
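
The two permission levels can be summarised as a small lookup table. The sketch below is only an illustration of the rules listed above; the role names come from this section, while the action names, the data layout and the helper functions are hypothetical and do not describe how the web-app stores permissions.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    READER = "reader"


# Actions each role may perform, following the rules in this section.
ALLOWED = {
    Role.OWNER: {"view", "comment", "copy", "edit", "save", "share", "delete"},
    Role.READER: {"view", "comment", "copy"},
}


def can(role, action):
    """Return True if `role` is allowed to perform `action`."""
    return action in ALLOWED[role]


def copy_item(item, new_owner):
    """Copying yields a separate item owned by whoever made the copy."""
    return {"name": item["name"],
            "samples": list(item["samples"]),
            "owner": new_owner}


shared = {"name": "my cohort", "samples": ["S1", "S2"], "owner": "alice"}
print(can(Role.READER, "edit"))   # a Reader cannot edit the shared item
clone = copy_item(shared, "bob")  # but can copy it and then owns the copy
print(clone["owner"])
```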

1.4.8 Release Notes

• December 23, 2015: v0.2
  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.
• December 3, 2015: v0.1
  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence data (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application, so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.
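
As a rough illustration of what such a script can look like, the sketch below builds a simple counting query and runs it with the google-cloud-bigquery client library. Several things here are assumptions rather than facts from this documentation: the client library is not the one referred to above and must be installed separately, the table name `isb-cgc.example_dataset.Clinical_data` is a placeholder that must be replaced with a real table visible under the "isb-cgc" project, the column name `Study` is illustrative, and `your-gcp-project-id` stands for your own project, which is billed for the query. The examples-Python repository remains the authoritative starting point.

```python
def build_count_query(table, column):
    """Return a standard-SQL query counting rows per distinct value."""
    return (
        "SELECT {col}, COUNT(*) AS n\n"
        "FROM `{table}`\n"
        "GROUP BY {col}\n"
        "ORDER BY n DESC"
    ).format(col=column, table=table)


def run_query(sql, project):
    """Run `sql` in BigQuery, billing the given Google Cloud project."""
    # Imported here so the query builder works without the library installed.
    from google.cloud import bigquery
    client = bigquery.Client(project=project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder table and column names; replace them with real ones.
sql = build_count_query("isb-cgc.example_dataset.Clinical_data", "Study")
print(sql)
# rows = run_query(sql, project="your-gcp-project-id")
```

Keeping the query text separate from the call that executes it makes it easy to paste the same SQL into the BigQuery web interface to check it first.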

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally, or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries against the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcases some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.



• None

Request

HTTP request

POST https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort

Parameters

None

Request body

In the request body, supply a metadata resource:

{
  "adenocarcinoma_invasion": string,
  "age_at_initial_pathologic_diagnosis": string,
  "anatomic_neoplasm_subdivision": string,
  "avg_percent_lymphocyte_infiltration": float,
  "avg_percent_monocyte_infiltration": float,
  "avg_percent_necrosis": float,
  "avg_percent_neutrophil_infiltration": float,
  "avg_percent_normal_cells": float,
  "avg_percent_stromal_cells": float,
  "avg_percent_tumor_cells": float,
  "avg_percent_tumor_nuclei": float,
  "batch_number": integer,
  "bcr": string,
  "clinical_M": string,
  "clinical_N": string,
  "clinical_stage": string,
  "clinical_T": string,
  "colorectal_cancer": string,
  "country": string,
  "country_of_procurement": string,
  "days_to_birth": integer,
  "days_to_collection": integer,
  "days_to_death": integer,
  "days_to_initial_pathologic_diagnosis": integer,
  "days_to_last_followup": integer,
  "days_to_submitted_specimen_dx": integer,
  "Study": string,
  "ethnicity": string,
  "frozen_specimen_anatomic_site": string,
  "gender": string,
  "height": integer,
  "histological_type": string,
  "history_of_colon_polyps": string,
  "history_of_neoadjuvant_treatment": string,
  "history_of_prior_malignancy": string,
  "hpv_calls": string,
  "hpv_status": string,
  "icd_10": string,
  "icd_o_3_histology": string,
  "icd_o_3_site": string,
  "lymph_node_examined_count": integer,
  "lymphatic_invasion": string,
  "lymphnodes_examined": string,
  "lymphovascular_invasion_present": string,
  "max_percent_lymphocyte_infiltration": integer,
  "max_percent_monocyte_infiltration": integer,
  "max_percent_necrosis": integer,
  "max_percent_neutrophil_infiltration": integer,
  "max_percent_normal_cells": integer,
  "max_percent_stromal_cells": integer,
  "max_percent_tumor_cells": integer,
  "max_percent_tumor_nuclei": integer,
  "menopause_status": string,
  "min_percent_lymphocyte_infiltration": integer,
  "min_percent_monocyte_infiltration": integer,
  "min_percent_necrosis": integer,
  "min_percent_neutrophil_infiltration": integer,
  "min_percent_normal_cells": integer,
  "min_percent_stromal_cells": integer,
  "min_percent_tumor_cells": integer,
  "min_percent_tumor_nuclei": integer,
  "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
  "mononucleotide_marker_panel_analysis_status": string,
  "neoplasm_histologic_grade": string,
  "new_tumor_event_after_initial_treatment": string,
  "number_of_lymphnodes_examined": integer,
  "number_of_lymphnodes_positive_by_he": integer,
  "ParticipantBarcode": string,
  "pathologic_M": string,
  "pathologic_N": string,
  "pathologic_stage": string,
  "pathologic_T": string,
  "person_neoplasm_cancer_status": string,
  "pregnancies": string,
  "preservation_method": string,
  "primary_neoplasm_melanoma_dx": string,
  "primary_therapy_outcome_success": string,
  "prior_dx": string,
  "Project": string,
  "psa_value": float,
  "race": string,
  "residual_tumor": string,
  "SampleBarcode": string,
  "tobacco_smoking_history": string,
  "total_number_of_pregnancies": integer,
  "tumor_tissue_site": string,
  "tumor_pathology": string,
  "tumor_type": string,
  "weiss_venous_invasion": string,
  "vital_status": string,
  "weight": integer,
  "year_of_initial_pathologic_diagnosis": string,
  "SampleTypeCode": string,
  "has_Illumina_DNASeq": string,
  "has_BCGSC_HiSeq_RNASeq": string,
  "has_UNC_HiSeq_RNASeq": string,
  "has_BCGSC_GA_RNASeq": string,
  "has_UNC_GA_RNASeq": string,
  "has_HiSeq_miRnaSeq": string,
  "has_GA_miRNASeq": string,
  "has_RPPA": string,
  "has_SNP6": string,
  "has_27k": string,
  "has_450k": string
}

Table 1.1

Parameter name | Value | Description
adenocarcinoma_invasion | string |
age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision | string | Text term to describe the spatial location subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration | float | Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration | float | Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis | float | Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration | float | Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells | float | Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells | float | Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells | float | Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei | float | Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number | integer | Groups samples by the batch they were processed in
bcr | string | A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
country | string | Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement | string | Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth | integer | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection | integer |
days_to_death | integer | Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis | integer | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup | integer | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx | integer | Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of days
Study | string | A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender | string | Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height | integer | The height of the patient in centimeters
histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls | string | Results of HPV tests
hpv_status | string | Current HPV status
icd_10 | string | The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count | integer |
lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration | integer | Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None"
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"
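As an illustration of the request described above, the following Python sketch builds (but does not send) a preview_cohort POST request using only the standard library. The endpoint URL is written with the punctuation that the extracted text has lost, and the helper name `build_preview_cohort_request` is hypothetical. Whether a filter field accepts a single value or a list of values is not stated here, so single string values are an assumption.

```python
import json
import urllib.request

# Endpoint from the "HTTP request" section, with "://", "." and "/" restored.
PREVIEW_COHORT_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/preview_cohort"
)


def build_preview_cohort_request(filters):
    """Return an unsent POST request whose JSON body carries the metadata filters."""
    body = json.dumps(filters).encode("utf-8")
    return urllib.request.Request(
        PREVIEW_COHORT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Field names come from the parameter table; the chosen values are illustrative.
request = build_preview_cohort_request({"Study": "BRCA", "has_RPPA": "True"})
print(request.get_method(), request.full_url)
# To actually call the service: urllib.request.urlopen(request)
```

Keeping request construction separate from sending makes it easy to inspect the body before any network call is made.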

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
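Because the response reports patient_count and sample_count as strings, a client will usually want to convert them before doing arithmetic. The sketch below shows one way to do that; the function name is hypothetical and the example payload is hand-written for illustration (it reuses the example barcodes from this documentation), not captured from the live service.

```python
import json


def summarize_cohort_preview(response_text):
    """Parse a preview_cohort response body and convert the counts to integers."""
    payload = json.loads(response_text)
    return {
        "patient_count": int(payload.get("patient_count", 0)),
        "sample_count": int(payload.get("sample_count", 0)),
        "patients": list(payload.get("patients", [])),
        "samples": list(payload.get("samples", [])),
    }


# Hand-written example body following the documented structure.
example_body = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
})

summary = summarize_cohort_preview(example_body)
print(summary["patient_count"], summary["sample_count"])
```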

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
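A small helper can check the barcode length before a patient_details URL is built. This is a sketch under two assumptions: the endpoint URL is written with its punctuation restored, and the barcode is passed as a query-string parameter named patient_barcode (the table lists it under path parameters, but the URL shown has no placeholder for it). The function name is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint from the "HTTP request" section, with punctuation restored.
PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)


def patient_details_url(patient_barcode):
    """Build the GET URL for one participant, rejecting barcodes of the wrong length."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long, e.g. TCGA-B9-7268")
    # Assumption: the barcode travels in the query string.
    return PATIENT_DETAILS_URL + "?" + urlencode({"patient_barcode": patient_barcode})


print(patient_details_url("TCGA-B9-7268"))
```

The resulting URL can be opened with `urllib.request.urlopen`; no authentication header is needed according to the description above.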

Response

If successful, this method returns a response body with the following structure:


{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": []
}

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g. "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
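Every clinical_data property in the table above is typed as a string, including counts and day offsets, so client code typically has to cast values and tolerate missing fields. The following sketch shows one possible accessor; the helper name `clinical_value` is hypothetical, and the example payload is hand-written with made-up values rather than real patient data.

```python
def clinical_value(payload, field, cast=str):
    """Read one clinical_data field, returning None when it is absent or empty."""
    raw = payload.get("clinical_data", {}).get(field)
    if raw is None or raw == "":
        return None
    return cast(raw)


# Hand-written example following the documented structure; values are made up.
example_response = {
    "kind": "cohort_api#cohortsItem",
    "aliquots": [],
    "clinical_data": {
        "ParticipantBarcode": "TCGA-B9-7268",
        "Project": "TCGA",
        "number_of_lymphnodes_examined": "3",
    },
    "samples": ["TCGA-B9-7268-01A"],
}

print(clinical_value(example_response, "Project"))
print(clinical_value(example_response, "number_of_lymphnodes_examined", int))
print(clinical_value(example_response, "hpv_status"))
```

Returning None for absent fields keeps the caller from confusing "not reported" with a real value such as 0.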

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Barcode of the sample to get information about. Required.
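As a rough illustration of the request described above, the following Python sketch builds the GET URL for a sample barcode and derives the participant barcode from it. It assumes that sample_barcode is passed as a query-string parameter and that the endpoint URL reads as shown once its punctuation is restored; the helper names patient_barcode and sample_details_url are hypothetical and not part of the ISB-CGC API. No network call is made.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request line above (punctuation restored).
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_barcode(sample_barcode: str) -> str:
    """Return the participant part of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError(f"expected a 16-character sample barcode: {sample_barcode!r}")
    # The first three dash-separated fields (12 characters) identify the participant.
    return sample_barcode[:12]


def sample_details_url(sample_barcode: str) -> str:
    """Build the sample_details URL (assumption: query-string parameter)."""
    patient_barcode(sample_barcode)  # reuse the length check above
    query = urlencode({"sample_barcode": sample_barcode})
    return f"{BASE_URL}/sample_details?{query}"


print(patient_barcode("TCGA-B9-7268-01A"))
print(sample_details_url("TCGA-B9-7268-01A"))
```

The resulting URL can be fetched with any HTTP client; since this endpoint does not require authentication, no token is added.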

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Table 1.3

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
biospecimen_data  nested object  Biospecimen data about the sample
biospecimen_data.ParticipantBarcode  string  Participant barcode
biospecimen_data.Project  string  Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode  string  Sample barcode
biospecimen_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration  integer  Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration  integer  Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis  integer  Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration  integer  Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells  integer  Average percent normal cells
biospecimen_data.avg_percent_stromal_cells  integer  Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells  integer  Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei  integer  Average percent tumor nuclei
biospecimen_data.batch_number  string  Batch number in which the sample was processed
biospecimen_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection  string  Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration  string  Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration  string  Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis  string  Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration  string  Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells  string  Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells  string  Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells  string  Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei  string  Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration  string  Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration  string  Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis  string  Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration  string  Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells  string  Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells  string  Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells  string  Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei  string  Minimum percent tumor nuclei
data_details[]  list  List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath  string  Path to file if it exists
data_details[].DataCenterName  string  Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType  string  Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName  string  Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey  string  Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded  string  Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel  string  Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype  string  Data type, e.g. "Complete Clinical Set, CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference  string  Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline  string  A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform  string  A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name  string  The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project  string  The study for which the data was generated, e.g. "TCGA"
data_details[].Repository  string  A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName  string  Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode  string  Sample barcode
data_details[].SecurityProtocol  string  An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count  string  Length of data_details list
patient  string  Participant barcode
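To make the shape of this response more concrete, here is a small Python sketch that summarizes an already-parsed sample_details response. The response dictionary below is a hypothetical, heavily trimmed example (the file names are invented placeholders), and summarize_sample_details is an illustrative helper, not part of the API. It relies only on properties listed in the table: data_details_count arrives as a string, and each data_details entry carries Platform and SecurityProtocol.

```python
from collections import Counter

# Hypothetical, trimmed response for illustration only.
response = {
    "kind": "cohort_api#cohortsItem",
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"DataFileName": "file_a.txt", "Platform": "IlluminaHiSeq_miRNASeq",
         "SecurityProtocol": "DBGap Open Access"},
        {"DataFileName": "file_b.bam", "Platform": "IlluminaHiSeq_miRNASeq",
         "SecurityProtocol": "DBGap Protected Access"},
        {"DataFileName": "file_c.txt", "Platform": "example_platform",
         "SecurityProtocol": "DBGap Open Access"},
    ],
}


def summarize_sample_details(resp: dict) -> dict:
    """Collect a few summary values from a parsed sample_details response."""
    details = resp.get("data_details", [])
    return {
        "patient": resp.get("patient"),
        # The count is returned as a string, so convert it before comparing.
        "declared_count": int(resp.get("data_details_count", "0")),
        "files_per_platform": dict(Counter(d.get("Platform") for d in details)),
        "open_access_files": [
            d["DataFileName"]
            for d in details
            if d.get("SecurityProtocol") == "DBGap Open Access"
        ],
    }


summary = summarize_sample_details(response)
print(summary)
```

Comparing declared_count with the length of data_details is a cheap sanity check after parsing.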


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. Barcode of the sample to get file paths for.
platform  string  Optional. Filter file results by platform.
pipeline  string  Optional. Filter file results by pipeline.
token  string  Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
count  string  Integer representing the length of the datafilenamekeys list
datafilenamekeys[]  list  List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field
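The two special cases described for datafilenamekeys (the field may be missing entirely, and entries may carry the "file-path-not-yet-available" marker) are easy to overlook in client code. The following Python sketch shows one way to handle both on an already-parsed response. The bucket and object names are hypothetical placeholders, and usable_paths is an illustrative helper rather than anything provided by the API.

```python
PLACEHOLDER = "file-path-not-yet-available"


def usable_paths(resp: dict) -> list:
    """Return only the cloud storage paths that point at an actual file."""
    # The field is absent when no paths are listed, so default to an empty list.
    keys = resp.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical responses for illustration only.
with_paths = {
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/example/path/file_a.txt",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}
without_paths = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(usable_paths(with_paths))
print(usable_paths(without_paths))
```

Using dict.get with a default keeps the unauthorized or empty case from raising a KeyError.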


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
count  string  The number of items returned. Count will be either "0" or "1"
items[]  list  If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode  string  The sample barcode passed into the request
items[].GG_dataset_id  string  The dataset id of the sample
items[].GG_readgroupset_id  string  The readgroupset id of the sample
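Because count is either "0" or "1" and items may be absent, client code usually wants a single optional result rather than a list. A minimal Python sketch of that unpacking step follows; the id values are invented placeholders and genomics_ids is a hypothetical helper name.

```python
from typing import Optional, Tuple


def genomics_ids(resp: dict) -> Optional[Tuple[str, str]]:
    """Return (dataset id, readgroupset id) or None when the sample has none."""
    # count is a string ("0" or "1"), so convert it before testing.
    if int(resp.get("count", "0")) == 0:
        return None
    item = resp["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Hypothetical responses for illustration only.
found = {
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
not_found = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(genomics_ids(found))
print(genomics_ids(not_found))
```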


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell, which provides you with command-line access to computing resources hosted on GCP, is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks. You can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property, for example: gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
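If you prefer to drive gcloud from a script, the same command can be assembled in Python. The sketch below only builds and prints the command by default; the helper names create_instance_command and run are hypothetical, and actually executing the command assumes that the Cloud SDK is installed and that you have authenticated. The --machine-type and --zone flags are the ones mentioned in this section.

```python
import shlex
import subprocess


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    if zone is not None:
        # Without --zone, gcloud falls back to the compute/zone property.
        cmd += ["--zone", zone]
    return cmd


def run(cmd, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


run(create_instance_command("my-instance"))
run(create_instance_command("my-instance", zone="us-central1-a"))
```

Keeping dry_run=True as the default makes it harder to start a billable VM by accident while experimenting.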

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources that are attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
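The two prices quoted above imply a rate of $0.04 per GB per month for a standard persistent disk ($2 / 50 GB). The short Python sketch below applies that implied rate to other disk sizes. It is a back-of-the-envelope estimate based only on the figures in this section, treats 1 TB as 1000 GB, and uses a hypothetical helper name; check current Google Cloud pricing before relying on it.

```python
# Rate implied by the figures above: $2 per month for 50 GB.
STANDARD_PD_USD_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb: float) -> float:
    """Estimate the monthly cost in USD of a standard persistent disk."""
    return round(size_gb * STANDARD_PD_USD_PER_GB_MONTH, 2)


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB -> ${monthly_disk_cost(size_gb):.2f} per month")
```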

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Here my-instance is the name of the instance, and --device-name sets the name under which the disk appears inside the VM. Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. Inside the VM, the disk is listed under /dev/disk/by-id/ as google- followed by its device name (assumed here to be disk-1):

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool. The device path below assumes that the disk was attached with the device name disk-1:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.
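The disk life cycle described in this section (create, attach, detach, delete) can be summarized as an ordered list of gcloud invocations. The Python sketch below only assembles and prints those commands; disk_lifecycle_commands is a hypothetical helper, the instance and disk names are the examples used in this section, and formatting, mounting, and unmounting are left out because they happen on the instance itself.

```python
import shlex


def disk_lifecycle_commands(instance, disk, size="500GB", zone=None):
    """Return the gcloud commands for a disk's life cycle, in order."""
    zone_args = ["--zone", zone] if zone else []
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size] + zone_args,
        ["gcloud", "compute", "instances", "attach-disk", instance,
         "--disk", disk, "--device-name", disk] + zone_args,
        # Format, mount and later unmount the disk from within the instance
        # before running the detach command.
        ["gcloud", "compute", "instances", "detach-disk", instance,
         "--disk", disk] + zone_args,
        ["gcloud", "compute", "disks", "delete", disk] + zone_args,
    ]


commands = disk_lifecycle_commands("my-instance", "disk-1")
for cmd in commands:
    print(shlex.join(cmd))
```

Detaching before deleting matters because a disk that is still in use by a VM instance cannot be deleted.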

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button becomes selectable once at least one cohort is selected. Click on the "Trash" button to delete the selected cohorts. Visualizations are deleted in the same way.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button becomes selectable once at least one cohort is selected. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the "Share Cohort" button when you are done. Visualizations are shared in the same way.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
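The three set operations behave like ordinary set algebra on the sample identifiers in each cohort. The Python sketch below illustrates the semantics described above with invented placeholder identifiers; it is not the web application's implementation, and the function names are hypothetical.

```python
def intersect(*cohorts):
    """Samples present in every cohort (order of cohorts does not matter)."""
    return set.intersection(*map(set, cohorts))


def union(*cohorts):
    """Samples present in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base) - union(*others)


# Placeholder sample identifiers for illustration only.
cohort_a = {"s1", "s2", "s3", "s4"}
cohort_b = {"s3", "s4", "s5"}
cohort_c = {"s4", "s6"}

print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b, cohort_c)))
```

Unlike intersect and union, complement is not symmetric: swapping the base cohort with one of the others changes the result.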

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort is in ascending or descending order.

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, that contain only samples for which certain types of data are available, or that focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the list of filters on the left-hand side, you can select the attributes and features that you are interested in. Clicking on a feature expands the field and provides additional filtering options. For example, when you click on “Vital Status”, it expands and provides “Alive”, “Dead” and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
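The counting rule in the last sentence can be sketched in a few lines of Python. This is only an illustration of the idea, not the web-app's implementation: the sample records, the attribute names, and the helper functions `matches` and `facet_counts` are all hypothetical.

```python
from collections import Counter

# Hypothetical sample records; attribute names and values are illustrative only.
SAMPLES = [
    {"barcode": "S1", "vital_status": "Alive", "gender": "FEMALE"},
    {"barcode": "S2", "vital_status": "Dead", "gender": "FEMALE"},
    {"barcode": "S3", "vital_status": "Alive", "gender": "MALE"},
    {"barcode": "S4", "vital_status": None, "gender": "MALE"},
]


def matches(sample, filters):
    """Return True if the sample satisfies every filter (attribute -> allowed values)."""
    return all(sample.get(attr) in allowed for attr, allowed in filters.items())


def facet_counts(samples, filters, attribute):
    """Count samples per value of `attribute`, applying all filters except its own."""
    others = {a: v for a, v in filters.items() if a != attribute}
    return Counter(s.get(attribute) for s in samples if matches(s, others))


selected = {"vital_status": {"Alive"}}
gender_counts = facet_counts(SAMPLES, selected, "gender")
vital_counts = facet_counts(SAMPLES, selected, "vital_status")
```

With the “Alive” filter selected, the Gender counts cover only the two living samples, while the Vital Status counts still cover all four samples, because a filter is not applied to its own list of values.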

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering on a swatch between two vertical bars shows the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
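The number shown when hovering on a swatch can be thought of as a count of samples for one pair of platforms. The following is a minimal sketch of that count, assuming a hypothetical mapping from each sample to the platform used for each data type; the platform names and the `band_counts` helper are illustrative and are not part of the web-app.

```python
from collections import Counter

# Hypothetical samples: data type -> platform ("NA" means no data of that type).
SAMPLES = {
    "S1": {"DNA-Sequence": "Platform A", "Protein": "Platform P"},
    "S2": {"DNA-Sequence": "Platform A", "Protein": "NA"},
    "S3": {"DNA-Sequence": "NA", "Protein": "Platform P"},
    "S4": {"DNA-Sequence": "Platform A", "Protein": "Platform P"},
}


def band_counts(samples, left_type, right_type):
    """Count samples for each (left platform, right platform) pair of two data types."""
    return Counter(
        (platforms.get(left_type, "NA"), platforms.get(right_type, "NA"))
        for platforms in samples.values()
    )


bands = band_counts(SAMPLES, "DNA-Sequence", "Protein")
```

Each key of `bands` corresponds to one swatch between the two vertical bars, and its value is the number of samples flowing through that swatch.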

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Clicking the “Set Operations” button opens a dialogue box, in which you can enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
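The three operations behave like ordinary set operations on the sample lists of the cohorts. As a rough sketch, assuming each cohort is represented as a Python set of sample identifiers (the identifiers and function names below are hypothetical):

```python
def intersect(first, *rest):
    """Samples present in every cohort; the order of the cohorts does not matter."""
    return set(first).intersection(*rest)


def union(first, *rest):
    """Samples present in at least one cohort; the order of the cohorts does not matter."""
    return set(first).union(*rest)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts of sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

in_all = intersect(cohort_a, cohort_b, cohort_c)
in_any = union(cohort_a, cohort_b, cohort_c)
only_a = complement(cohort_a, cohort_b, cohort_c)
```

Unlike intersect and union, complement depends on which cohort is chosen as the base: `complement(cohort_a, cohort_b)` and `complement(cohort_b, cohort_a)` give different results.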

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view you must either save the selected filters or cancel adding new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves in the same way as sharing from the User Dashboard page. A dialogue box appears and prompts you to select registered users to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
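The difference between the two counts can be illustrated with TCGA barcodes, in which the first 12 characters of a sample barcode identify the participant. The barcodes below are made up for the example, and `participant_of` is a hypothetical helper.

```python
def participant_of(sample_barcode):
    """Return the participant part (first 12 characters) of a TCGA sample barcode."""
    return sample_barcode[:12]


# Made-up barcodes: the first participant contributed two samples.
sample_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-10A",
    "TCGA-BB-0002-01A",
]

participant_barcodes = {participant_of(b) for b in sample_barcodes}
n_samples = len(sample_barcodes)
n_participants = len(participant_barcodes)
```

Here the cohort contains three samples but only two participants.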

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bands that flow horizontally indicate the number of samples that have that data. Hovering on a horizontal segment between two vertical bars shows the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter updates the list accordingly. The number next to each platform is the number of files available for that platform. The only menu item available is “Download File List as CSV”. Selecting this item starts a download of the list of all files available for the cohort, taking into account the selected platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
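Once downloaded, the file list can be processed with any CSV reader. The sketch below uses Python's standard `csv` module and groups file paths by platform. The exact column headers in the downloaded file are not specified here, so the header names, the sample rows and the `gs://example-bucket` paths are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict

# Assumed layout of the downloaded file; headers and rows are illustrative only.
CSV_TEXT = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,Platform A,pipeline_x,Level 3,gs://example-bucket/a.txt
TCGA-AA-0001-01A,Platform B,pipeline_y,Level 3,gs://example-bucket/b.txt
TCGA-BB-0002-01A,Platform A,pipeline_x,Level 3,gs://example-bucket/c.txt
"""


def files_by_platform(csv_file):
    """Group Cloud Storage file paths by platform from an open CSV file object."""
    grouped = defaultdict(list)
    for row in csv.DictReader(csv_file):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


paths = files_by_platform(io.StringIO(CSV_TEXT))
```

To read a real download, pass an open file instead of the in-memory string, for example `files_by_platform(open("file_list.csv", newline=""))`, after checking the header names in your file.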

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with; the last cohort you created is chosen automatically. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with a default plot showing an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (e.g. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

    – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Platform Filter: filters down the plot-able features by platform.

    – Center Filter: filters down the plot-able features by processing center.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

    – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

    – Platform Filter: filters down the plot-able features by platform.

    – Value Filter: filters down the plot-able features by value.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

    – Platform Filter: filters down the plot-able features by platform.

    – Gene Region Filter: filters down the plot-able features by specific gene regions.

    – CpG Island Region Filter: filters down the plot-able features by CpG island region.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Value Filter: filters down the plot-able features by value.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox overrides any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

    – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
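The two permission levels can be summarized as a small lookup table. This is a sketch of the rules described in this documentation, not the web-app's actual access-control code; the action names and the `can` helper are hypothetical.

```python
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Hypothetical action names, derived from the rules described in the text.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "share", "delete"},
    Permission.READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if a user with `permission` may perform `action`."""
    return action in ALLOWED_ACTIONS[permission]


reader_can_edit = can(Permission.READER, "edit")
reader_can_copy = can(Permission.READER, "copy")
owner_can_share = can(Permission.OWNER, "share")
```

A Reader who needs to edit therefore first makes a copy, and then holds the Owner permission on that copy.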

1.4.8 Release Notes

• December 23, 2015: v0.2

    – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

    – Cohort file list download not working.

• December 3, 2015: v0.1

    – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
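The two rules above (an eRA Commons id is linked to at most one Google identity at a time, and controlled-data access lasts 24 hours after verification) can be sketched as follows. This is a toy model to make the rules concrete, not the platform's implementation; the `LinkRegistry` class, its methods and the example identities are hypothetical, and the treatment of the exact 24-hour boundary is an assumption.

```python
from datetime import datetime, timedelta

ACCESS_WINDOW = timedelta(hours=24)


class LinkRegistry:
    """Toy model of eRA Commons id <-> Google identity links."""

    def __init__(self):
        self._google_by_era = {}   # eRA Commons id -> linked Google identity
        self._verified_at = {}     # Google identity -> time of last verification

    def link(self, era_id, google_id):
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_id:
            raise ValueError("unlink this eRA Commons id from its current identity first")
        self._google_by_era[era_id] = google_id

    def unlink(self, era_id):
        google_id = self._google_by_era.pop(era_id, None)
        self._verified_at.pop(google_id, None)

    def record_verification(self, google_id, when):
        self._verified_at[google_id] = when

    def has_controlled_access(self, google_id, now):
        # Assumption: access ends once a full 24 hours have elapsed.
        verified = self._verified_at.get(google_id)
        return verified is not None and now - verified < ACCESS_WINDOW


registry = LinkRegistry()
t0 = datetime(2016, 3, 3, 9, 0)
registry.link("era_user", "first@example.com")
registry.record_verification("first@example.com", t0)

try:
    registry.link("era_user", "second@example.com")
    relink_rejected = False
except ValueError:
    relink_rejected = True

registry.unlink("era_user")
registry.link("era_user", "second@example.com")
```

In this model, linking the same eRA Commons id to a second Google identity only succeeds after it has been unlinked from the first one, mirroring the procedure described above.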

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.
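As a rough idea of what such a script can look like outside of Cloud Datalab, the sketch below builds a query and runs it with the `google-cloud-bigquery` client library. Several things here are assumptions rather than ISB-CGC conventions: the table name is a placeholder that must be replaced with a real ISB-CGC table, the `vital_status` and `ParticipantBarcode` columns are assumed to exist in that table, and the query uses standard SQL, which may differ from the examples in the repository.

```python
def build_vital_status_query(table):
    """Return SQL that counts distinct participants per vital_status in `table`."""
    return (
        "SELECT vital_status, COUNT(DISTINCT ParticipantBarcode) AS n_participants "
        f"FROM `{table}` "
        "GROUP BY vital_status "
        "ORDER BY n_participants DESC"
    )


def run_query(sql, billing_project):
    """Run `sql` in BigQuery, billed to your own Google Cloud Project."""
    # Imported here so the query-building part works without the library installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder table name; substitute a real ISB-CGC clinical table.
CLINICAL_TABLE = "isb-cgc.example_dataset.example_clinical_table"
sql = build_vital_status_query(CLINICAL_TABLE)
```

Calling `run_query(sql, "your-project-id")` requires the library to be installed and your Google identity to be authenticated; query costs are charged to the project you name.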

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

163 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA dataas accessible as possible to a wide range of users For the programmatic users this includes complete access to thetools that Google is pioneering to allow users to scale-up their analyses on the Google infrastructure using a variety ofmeans

The ISB-CGC documentation and the example code on github will continue to grown to provide starting-points anduse-cases designed to suit the needs of a variety of end-users If you have a particular use-case that has not yet beenaddressed please contact us (email infoisb-cgcorg) and we will work with you to determine the best approach torun the analysis you have in mind

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (nowknown as Jupyter) environment running on a Google VM in your own Google Cloud Project Cloud Datalab allowsyou to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandasand Matplotlib See our examples-Python repository on github

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcases some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


"max_percent_lymphocyte_infiltration": integer,
"max_percent_monocyte_infiltration": integer,
"max_percent_necrosis": integer,
"max_percent_neutrophil_infiltration": integer,
"max_percent_normal_cells": integer,
"max_percent_stromal_cells": integer,
"max_percent_tumor_cells": integer,
"max_percent_tumor_nuclei": integer,
"menopause_status": string,
"min_percent_lymphocyte_infiltration": integer,
"min_percent_monocyte_infiltration": integer,
"min_percent_necrosis": integer,
"min_percent_neutrophil_infiltration": integer,
"min_percent_normal_cells": integer,
"min_percent_stromal_cells": integer,
"min_percent_tumor_cells": integer,
"min_percent_tumor_nuclei": integer,
"mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
"mononucleotide_marker_panel_analysis_status": string,
"neoplasm_histologic_grade": string,
"new_tumor_event_after_initial_treatment": string,
"number_of_lymphnodes_examined": integer,
"number_of_lymphnodes_positive_by_he": integer,
"ParticipantBarcode": string,
"pathologic_M": string,
"pathologic_N": string,
"pathologic_stage": string,
"pathologic_T": string,
"person_neoplasm_cancer_status": string,
"pregnancies": string,
"preservation_method": string,
"primary_neoplasm_melanoma_dx": string,
"primary_therapy_outcome_success": string,
"prior_dx": string,
"Project": string,
"psa_value": float,
"race": string,
"residual_tumor": string,
"SampleBarcode": string,
"tobacco_smoking_history": string,
"total_number_of_pregnancies": integer,
"tumor_tissue_site": string,
"tumor_pathology": string,
"tumor_type": string,
"weiss_venous_invasion": string,
"vital_status": string,
"weight": integer,
"year_of_initial_pathologic_diagnosis": string,
"SampleTypeCode": string,
"has_Illumina_DNASeq": string,
"has_BCGSC_HiSeq_RNASeq": string,
"has_UNC_HiSeq_RNASeq": string,
"has_BCGSC_GA_RNASeq": string,
"has_UNC_GA_RNASeq": string,
"has_HiSeq_miRnaSeq": string,
"has_GA_miRNASeq": string,
"has_RPPA": string,
"has_SNP6": string,
"has_27k": string,
"has_450k": string

Parameter name  Value  Description
adenocarcinoma_invasion  string
age_at_initial_pathologic_diagnosis  string  Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision  string  Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration  float  Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration  float  Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis  float  Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration  float  Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells  float  Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells  float  Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells  float  Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei  float  Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number  integer  Groups samples by the batch they were processed in
bcr  string  A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M  string  Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N  string  Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage  string  Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T  string  Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer  string  Text term to signify whether a patient has been diagnosed with colorectal cancer
country  string  Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement  string  Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth  integer  Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection  integer
days_to_death  integer  Time interval from a person's date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis  integer  Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
days_to_last_followup  integer  Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx  integer  Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study  string  A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity  string  The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site  string  Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender  string  Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height  integer  The height of the patient in centimeters
histological_type  string  Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps  string  Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment  string  Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls  string  Results of HPV tests
hpv_status  string  Current HPV status
icd_10  string  The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology  string  The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site  string  The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count  integer
lymphatic_invasion  string  A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined  string  The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present  string  The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration  integer  Maximum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
max_percent_monocyte_infiltration  integer  Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen


Table 1.1 (continued)

Parameter name  Value  Description
max_percent_necrosis  integer  Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration  integer  Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells  integer  Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells  integer  Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells  integer  Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei  integer  Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status  string  Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration  integer  Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration  integer  Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis  integer  Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration  integer  Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells  integer  Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells  integer  Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells  integer  Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei  integer  Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade  string  Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment  string  Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined  integer  The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he  integer  Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode  string  The barcode assigned by TCGA to the Participant
pathologic_M  string  Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N  string  The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
pathologic_stage  string  The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T  string  Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status  string  The state or condition of an individual's neoplasm at a particular point in time
pregnancies  string  Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method  string
primary_neoplasm_melanoma_dx  string  Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success  string  Measure of Success
prior_dx  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project  string  The study for which the data was generated
psa_value  float  The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor  string  Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode  string  The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history  string  Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies  integer
tumor_tissue_site  string  Text term that describes the anatomic site of the tumor or disease
tumor_pathology  string
tumor_type  string  Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion  string  The result of an assessment using the Weiss histopathologic criteria
vital_status  string  The survival state of the person registered on the protocol
weight  integer  The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
SampleTypeCode  string  The type of the sample tumor or normal tissue cell or blood sample provided by a participant
has_Illumina_DNASeq  string  Indicates if a sample has gene sequencing data: "True", "False", or "None"
has_BCGSC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: "True", "False", or "None"


Table 1.1 (continued)

Parameter name  Value  Description
has_UNC_HiSeq_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None"
has_BCGSC_GA_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None"
has_UNC_GA_RNASeq  string  Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None"
has_HiSeq_miRnaSeq  string  Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None"
has_GA_miRNASeq  string  Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None"
has_RPPA  string  Indicates if a sample has protein array data: "True", "False", or "None"
has_SNP6  string  Indicates if a sample has copy number data: "True", "False", or "None"
has_27k  string  Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None"
has_450k  string  Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None"

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
patient_count  string  Number of participants in this cohort
patients[]  list  List of participant barcodes in this cohort
sample_count  string  Number of samples in this cohort
samples[]  list  List of sample barcodes in this cohort
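As a rough illustration of consuming this response, the sketch below parses a body shaped like the structure above and converts the string-valued counts to integers. The function name summarize_cohort and the sample payload are hypothetical and only serve the example; the field names are the ones listed in the table.

```python
import json


def summarize_cohort(body):
    """Parse a cohort response body (JSON text) into a small summary dict."""
    data = json.loads(body)
    patients = data.get("patients", [])
    samples = data.get("samples", [])
    # The counts are documented as strings, so convert them explicitly.
    patient_count = int(data["patient_count"])
    sample_count = int(data["sample_count"])
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        # True when the reported counts agree with the list lengths.
        "consistent": (patient_count == len(patients)
                       and sample_count == len(samples)),
    }


# Hypothetical payload, for illustration only.
example_body = json.dumps({
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
})
summary = summarize_cohort(example_body)
print(summary)
```

Converting the counts up front avoids string comparisons later on, and the consistency flag is a cheap sanity check before using the barcode lists.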


patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name  Value  Description
Path parameters
patient_barcode  string  Barcode of the patient to get information about. Required.
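A minimal sketch of building the request URL for this method is shown below. It assumes the base URL reads https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1 (a reconstruction of the request line above) and that the barcode is passed as a query-string parameter; both are assumptions to confirm against the live API. The helper name patient_details_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL, reconstructed from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Return the GET URL for one participant barcode (length 12)."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcode must be 12 characters long")
    # Assumption: the parameter travels in the query string.
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


url = patient_details_url("TCGA-B9-7268")
print(url)
```

The resulting URL can be fetched with any HTTP client; since no roles are required for this method, no authentication header should be needed.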

Response

If successful, this method returns a response body with the following structure:


{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
clinical_data  nested object  The clinical data about the participant
clinical_data.ParticipantBarcode  string  Participant barcode
clinical_data.Project  string  Project name, e.g. "TCGA"
clinical_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis  string  Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision  string  Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
clinical_data.batch_number  string  Groups samples by the batch they were processed in
clinical_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M  string  Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N  string  Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T  string  Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage  string  Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer  string  Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country  string  Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth  string  Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis  string  Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup  string  Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity  string  The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site  string  Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender  string  Text designations that identify gender
clinical_data.histological_type  string  Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps  string  Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment  string  Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls  string  Results of HPV tests
clinical_data.hpv_status  string  Current HPV status
clinical_data.icd_10  string  The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology  string  The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site  string  The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion  string  A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined  string  The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present  string  The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status  string  Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status  string  Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade  string  Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment  string  Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined  string  The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he  string  Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M  string  Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N  string  The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage  string  The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T  string  Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American


Table 1.2 (continued)

Property name  Value  Description
clinical_data.person_neoplasm_cancer_status  string  The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies  string  Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx  string  Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success  string  Measure of Success
clinical_data.prior_dx  string  Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race  string  The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor  string  Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history  string  Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site  string  Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type  string  Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status  string  The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion  string  The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis  string  Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[]  list  List of barcodes of samples taken from this participant


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Barcode of the sample to get information about. Required.
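Before calling this method it can help to validate the barcode locally. The sketch below checks the documented length of 16 and splits off the participant barcode; treating the first 12 characters as the participant barcode and characters 14-15 as a two-digit sample type code is an assumption drawn from the two example barcodes in this documentation (TCGA-B9-7268 and TCGA-B9-7268-01A). The function name split_sample_barcode is hypothetical.

```python
def split_sample_barcode(sample_barcode):
    """Return (participant_barcode, sample_type_code) for a sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("sample barcode must be 16 characters long")
    # Assumption: the participant barcode is the 12-character prefix.
    participant_barcode = sample_barcode[:12]
    # Assumption: the two digits after the third hyphen are the type code.
    sample_type_code = sample_barcode[13:15]
    return participant_barcode, sample_type_code


participant, type_code = split_sample_barcode("TCGA-B9-7268-01A")
print(participant, type_code)
```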

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}
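As an informal example of working with the data_details list in this response, the sketch below groups the cloud storage paths of the available files by platform and skips entries without a path. The function name files_by_platform and the sample values (including the bucket name) are hypothetical; only the field names CloudStoragePath and Platform come from the structure above.

```python
def files_by_platform(response):
    """Group non-empty CloudStoragePath values by their Platform field."""
    grouped = {}
    for detail in response.get("data_details", []):
        path = detail.get("CloudStoragePath")
        if not path:
            # No path means there is no file to fetch for this entry.
            continue
        platform = detail.get("Platform", "unknown")
        grouped.setdefault(platform, []).append(path)
    return grouped


# Hypothetical, already-decoded response for illustration only.
example_response = {
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq",
         "CloudStoragePath": "gs://example-bucket/a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "CloudStoragePath": ""},
        {"Platform": "IlluminaGA_miRNASeq",
         "CloudStoragePath": "gs://example-bucket/b.txt"},
    ],
    "data_details_count": "3",
}
grouped = files_by_platform(example_response)
print(grouped)
```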

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type
aliquots[]  list  List of barcodes of aliquots taken from this participant
biospecimen_data  nested object  Biospecimen data about the sample


Table 1.3 - continued from previous page

Property name  Value  Description
biospecimen_data.ParticipantBarcode  string  Participant barcode
biospecimen_data.Project  string  Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode  string  Sample barcode
biospecimen_data.Study  string  Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration  integer  Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration  integer  Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis  integer  Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration  integer  Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells  integer  Average percent normal cells
biospecimen_data.avg_percent_stromal_cells  integer  Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells  integer  Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei  integer  Average percent tumor nuclei
biospecimen_data.batch_number  string  Batch number in which the sample was processed
biospecimen_data.bcr  string  Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection  string  Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration  string  Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration  string  Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis  string  Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration  string  Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells  string  Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells  string  Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells  string  Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei  string  Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration  string  Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration  string  Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis  string  Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration  string  Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells  string  Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells  string  Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells  string  Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei  string  Minimum percent tumor nuclei
data_details[]  list  List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath  string  Path to file if it exists
data_details[].DataCenterName  string  Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType  string  Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName  string  Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey  string  Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded  string  Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel  string  Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype  string  Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference  string  Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline  string  A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform  string  A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name  string  The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project  string  The study for which the data was generated, e.g. "TCGA"
data_details[].Repository  string  A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName  string  Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode  string  Sample barcode
data_details[].SecurityProtocol  string  An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count  string  Length of data_details list
patient  string  Participant barcode

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. Barcode of the sample to get file paths for.
platform  string  Optional. Filter file results by platform.
pipeline  string  Optional. Filter file results by pipeline.
token  string  Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type.
count  string  Integer representing the length of the datafilenamekeys list.
datafilenamekeys[]  list  List of cloud storage file paths associated with the sample barcode. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
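
As a rough illustration of the request and response described above, the following Python sketch builds the GET URL and reads the file paths out of an already-decoded response body. The helper names (build_request_url, extract_paths) and the sample barcode are hypothetical. The endpoint and parameter names come from this section, and the sketch assumes the parameters are sent in the query string. No network call is made.

```python
from urllib.parse import urlencode

BASE_URL = ("https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
            "datafilenamekey_list_from_sample")


def build_request_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Return the GET URL for one sample (assumes query-string parameters)."""
    params = {"sample_barcode": sample_barcode}
    optional = {"platform": platform, "pipeline": pipeline, "token": token}
    # Only include the optional parameters that were actually supplied.
    params.update({k: v for k, v in optional.items() if v is not None})
    return BASE_URL + "?" + urlencode(params)


def extract_paths(response_body):
    """Return the cloud storage paths from a decoded JSON response.

    The datafilenamekeys field is absent when no paths are visible to the
    caller, so a missing field is treated as an empty list.
    """
    return list(response_body.get("datafilenamekeys", []))


# Hypothetical sample barcode and response body, for illustration only.
url = build_request_url("TCGA-XX-0000-01A", platform="IlluminaHiSeq_miRNASeq")
paths = extract_paths({"count": "0"})
```

Treating a missing datafilenamekeys field as an empty list keeps calling code from having to special-case users without dbGaP authorization.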


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name  Value  Description
Path parameters
sample_barcode  string  Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name  Value  Description
kind  cohort_api#cohortsItem  The resource type.
count  string  The number of items returned. Count will be either "0" or "1".
items[]  list  If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode  string  The sample barcode passed into the request.
items[].GG_dataset_id  string  The dataset id of the sample.
items[].GG_readgroupset_id  string  The readgroupset id of the sample.
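
Because count is returned as a string and items may be missing, client code has to check before indexing. The Python sketch below shows one way to do that on an already-decoded response body. The function name genomics_ids and the identifier values are made up for illustration.

```python
def genomics_ids(response_body):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    # count is documented as a string that is either "0" or "1".
    if response_body.get("count", "0") == "0":
        return None
    item = response_body["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Hypothetical response bodies with made-up identifiers.
found = genomics_ids({
    "count": "1",
    "items": [{"SampleBarcode": "TCGA-XX-0000-01A",
               "GG_dataset_id": "dataset-123",
               "GG_readgroupset_id": "rgs-456"}],
})
missing = genomics_ids({"count": "0"})
```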


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console: If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API: The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK: Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell: Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google: Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init: launches an interactive Getting Started workflow for gcloud
• gcloud auth login: obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account
• project
• compute/region
• compute/zone
• container/cluster
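
If you set the same properties on several machines, it can help to script them. The Python sketch below only builds the gcloud config set argument lists and does not run anything. The function name and the project id are hypothetical, and the property names are written in gcloud's section/property form (compute/region, compute/zone).

```python
import shlex


def config_set_commands(properties):
    """Return one 'gcloud config set' argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in properties.items()]


# Hypothetical project id; pick the region and zone that suit your work.
commands = config_set_commands({
    "project": "my-project-id",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
})

# Each entry could be passed to subprocess.run(cmd, check=True);
# here the commands are only rendered as shell text for review.
rendered = [shlex.join(cmd) for cmd in commands]
```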


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console: After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI: The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small

Accessing your new VM: Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM: Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
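
The two prices quoted above imply a rate of about $0.04 per GB per month for a standard persistent disk. The short Python sketch below makes that arithmetic explicit. The rate is derived only from the figures in this guide and may not match current Google pricing, and the function name is hypothetical.

```python
# USD per GB per month, implied by "$2 per month for 50 GB" (an assumption
# taken from the figures in this guide, not from a current price list).
PRICE_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return the approximate monthly cost in USD, rounded to cents."""
    return round(size_gb * price_per_gb_month, 2)


small_disk = monthly_disk_cost(50)     # the 50 GB example
large_disk = monthly_disk_cost(1000)   # 1 TB, counted here as 1000 GB
```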

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point.

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool.

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to "yes" (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
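
The persistent disk steps in this section have to run in a fixed order, some from your workstation and some on the instance itself. The Python sketch below lists them in that order without executing anything. The function name is hypothetical, the disk and instance names are the examples used in this section, and the device path simply follows this guide's examples (the path on a real VM may differ, so check it before formatting).

```python
def disk_lifecycle(disk, instance, size="500GB", mount_point="/mnt/pd1"):
    """Return (where, command) pairs in the order the steps must run."""
    device = "/dev/disk/by-id/" + disk  # path as written in this guide
    return [
        ("workstation", f"gcloud compute disks create {disk} --size {size}"),
        ("workstation",
         f"gcloud compute instances attach-disk {instance} --disk {disk}"),
        # Formatting erases any existing data on the disk.
        ("instance", f"sudo mkfs.ext4 -F {device}"),
        ("instance", f"sudo mkdir {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        # Unmount on the instance before detaching from the workstation.
        ("instance", f"sudo umount {device}"),
        ("workstation",
         f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        # A disk can only be deleted once no instance is using it.
        ("workstation", f"gcloud compute disks delete {disk}"),
    ]


steps = disk_lifecycle("disk-1", "my-instance")
```

Printing the steps as a checklist before running them by hand is one low-risk way to use this.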


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after you select at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after you select at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
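
The three set operations behave like ordinary set arithmetic on sample barcodes. The Python sketch below is only an illustration of those semantics, not the web application's implementation. The function names and the short sample identifiers are made up.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort (order does not matter)."""
    return set.intersection(*map(set, cohorts))


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base) - set().union(*others)


# Made-up cohorts of sample identifiers.
a = {"s1", "s2", "s3"}
b = {"s2", "s3", "s4"}
c = {"s3", "s5"}

everything = union(a, b, c)      # s1 through s5
shared = intersect(a, b, c)      # only s3 is in all three
only_in_a = complement(a, b, c)  # a minus (b union c)
```

Unlike union and intersect, complement depends on which cohort is chosen as the base: complement(a, b) and complement(b, a) generally differ.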

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only does string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
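
The number shown when you hover over a swatch can be thought of as a count of samples for one pair of platforms on two neighbouring data types. The Python sketch below computes such counts for a tiny made-up cohort. The function name, sample names, and platform labels are illustrative only, and this is not the code behind the panel.

```python
from collections import Counter


def platform_pair_counts(samples, type_a, type_b):
    """Count samples for each (platform of type_a, platform of type_b) pair.

    A sample with no data for a data type is counted under "NA".
    """
    return Counter(
        (platforms.get(type_a, "NA"), platforms.get(type_b, "NA"))
        for platforms in samples.values()
    )


# Made-up cohort: sample name -> {data type: platform}.
cohort = {
    "sample-1": {"DNA Methylation": "platform-M1", "Protein": "platform-P1"},
    "sample-2": {"DNA Methylation": "platform-M1"},
    "sample-3": {"DNA Methylation": "platform-M2", "Protein": "platform-P1"},
}
counts = platform_pair_counts(cohort, "DNA Methylation", "Protein")
```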

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
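
The three set operations can be illustrated with ordinary Python sets. This is only a sketch of the semantics described above, not the web-app's implementation; the function names and the sample identifiers ("S1", "S2", ...) are made up for illustration.

```python
def union(cohorts):
    """Samples that appear in at least one of the cohorts (order does not matter)."""
    return set().union(*cohorts)


def intersect(cohorts):
    """Samples that appear in every one of the cohorts (order does not matter)."""
    cohorts = [set(c) for c in cohorts]
    return set.intersection(*cohorts) if cohorts else set()


def complement(base, others):
    """Samples in the base cohort after subtracting all of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, each represented as a set of sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

print(sorted(union([cohort_a, cohort_b, cohort_c])))
print(sorted(intersect([cohort_a, cohort_b, cohort_c])))
print(sorted(complement(cohort_a, [cohort_b, cohort_c])))
```

Note that only the complement is sensitive to which cohort is chosen as the base; swapping the base and one of the other cohorts generally gives a different result.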

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
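
The difference between the two counts can be reproduced from a list of sample barcodes. The sketch below assumes the standard TCGA barcode convention, in which the first 12 characters (for example TCGA-XX-XXXX) identify the participant; the barcodes themselves are invented and the function names are illustrative only.

```python
def participant_barcode(sample_barcode):
    """Return the participant portion (first 12 characters) of a TCGA sample barcode."""
    return sample_barcode[:12]


def count_samples_and_participants(sample_barcodes):
    """Return (number of distinct samples, number of distinct participants)."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(s) for s in samples}
    return len(samples), len(participants)


# Invented barcodes: the first participant contributed two samples.
example_samples = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-11A",
    "TCGA-BB-0002-01A",
]

n_samples, n_participants = count_samples_and_participants(example_samples)
print(n_samples, n_participants)
```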

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. The only menu item available is “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected platform filters. The CSV file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
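
Once downloaded, the file list can be summarized with a few lines of Python. This is a minimal sketch that assumes the CSV has a header row whose column names match the fields listed above ("Sample Barcode", "Platform", "Pipeline", "Data Level", "File Path"); the exact header spelling, the platform names, and the bucket paths in the example are assumptions, not taken from an actual download.

```python
import csv
import io
from collections import Counter


def files_per_platform(csv_text):
    """Count the files listed for each platform in a cohort file list CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Platform"] for row in reader)


# Invented example content standing in for a downloaded file list.
example_csv = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/a1.txt
TCGA-AA-0001-11A,PlatformA,pipeline_1,Level 3,gs://example-bucket/a2.txt
TCGA-BB-0002-01A,PlatformB,pipeline_2,Level 3,gs://example-bucket/b1.txt
"""

for platform, count in sorted(files_per_platform(example_csv).items()):
    print(platform, count)
```

For a real download, the same function can be given the contents of the file read with `open(path, newline="").read()`.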

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, you can delete it using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  - Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Platform Filter: filters down the plot-able features by platforms.

  - Center Filter: filters down the plot-able features by processing center.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• miRNA

  - miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  - Platform Filter: filters down the plot-able features by platforms.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Methylation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  - Platform Filter: filters down the plot-able features by platforms.

  - Gene Region Filter: filters down the plot-able features by specific gene regions.

  - CpG Island Region Filter: filters down the plot-able features by CpG island region.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Copy Number

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by value.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Protein

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  - Select Feature: provides the filtered down list of plot-able features to select from, based on selected filters.

• Mutation

  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  - Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.
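
The documentation does not say which statistical test the web-app applies to a given pair of features. Purely as an illustration of what a pairwise measure between two numeric features looks like, the sketch below computes a Pearson correlation coefficient in plain Python; the choice of Pearson correlation, the function name, and the example values are all assumptions.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length, with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a feature is constant")
    return cov / math.sqrt(var_x * var_y)


# Invented values for two numeric features measured on the same samples.
feature_x = [1.0, 2.0, 3.0, 4.0]
feature_y = [2.0, 4.0, 6.0, 8.0]
print(pearson_r(feature_x, feature_y))
```

A categorical feature paired with a numeric one, or two categorical features, would call for a different kind of test; this sketch covers only the numeric-numeric case.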

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  - You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
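
The two permission levels can be summarized as a small lookup table. This is a sketch of the rules described above and not the web-app's actual code; the action names ("view", "comment", "copy", "edit", "share", "save", "delete") are hypothetical labels chosen for this example.

```python
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Actions each permission level allows on a cohort or visualization.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "share", "save", "delete"},
    Permission.READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if the given permission level allows the given action."""
    return action in ALLOWED_ACTIONS[permission]


def permission_after_copy(_original_permission):
    """Whoever copies (clones) an item becomes the owner of the copy."""
    return Permission.OWNER


print(can(Permission.READER, "save"))
print(can(permission_after_copy(Permission.READER), "save"))
```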

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  - Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  - Cohort file list download is not working.

• December 3, 2015: v0.1

  - First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
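
Once the “isb-cgc” project is visible, its tables can be queried with SQL. The sketch below only builds a query string, using BigQuery standard SQL table references of the form `project.dataset.table`; the dataset name, table name, and column name are placeholders invented for this example, so substitute the names you actually see in the left side-bar.

```python
import re

# Conservative pattern for dataset, table, and column names used in the query.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_count_query(project, dataset, table, column):
    """Build a standard SQL query counting rows per distinct value of one column."""
    for name in (dataset, table, column):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    return (
        f"SELECT {column}, COUNT(*) AS n\n"
        f"FROM `{project}.{dataset}.{table}`\n"
        f"GROUP BY {column}\n"
        f"ORDER BY n DESC"
    )


# "isb-cgc" is the project ID given above; the other three names are hypothetical.
query = build_count_query("isb-cgc", "example_dataset", "example_clinical_table", "gender")
print(query)
```

The resulting text can be pasted into the BigQuery web interface, or passed to whichever BigQuery client you use from Python or R.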

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
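
For long-running scripts it can help to keep track of when that 24-hour window closes. The following is a minimal sketch that assumes the window starts at the moment your authorization was verified; the function names are hypothetical and the timestamps are examples only.

```python
from datetime import datetime, timedelta, timezone

# Length of the controlled-access window described above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Return the time at which controlled-data access lapses."""
    return verified_at + ACCESS_WINDOW


def has_access(verified_at, now):
    """Return True while `now` falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
print(access_expires_at(verified).isoformat())
print(has_access(verified, verified + timedelta(hours=23)))
print(has_access(verified, verified + timedelta(hours=25)))
```

When the window has closed, you would need to authenticate again through the web-app before continuing.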


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
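
You may also want to keep a rough eye on your own spending. The sketch below estimates how many days remain before an allocation is used up at a constant daily spend; only the $300 initial allocation comes from the text above, while the spent amount, the daily rate, and the function name are made-up illustrations.

```python
def days_until_exhausted(allocation, spent, daily_rate):
    """Estimate days left before the allocation runs out at a constant daily spend.

    Returns None when nothing is being spent, since the allocation then never runs out.
    """
    if daily_rate <= 0:
        return None
    remaining = max(allocation - spent, 0.0)
    return remaining / daily_rate


INITIAL_ALLOCATION = 300.0  # dollars, the initial allocation mentioned above

# Hypothetical usage: $120 spent so far, currently spending $15 per day.
print(days_until_exhausted(INITIAL_ALLOCATION, spent=120.0, daily_rate=15.0))
```

Actual charges depend on the storage and compute resources in use, so an estimate like this is only a planning aid for deciding when to request additional funding.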


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


has_27k    string
has_450k    string

Parameter name    Value    Description
adenocarcinoma_invasion    string
age_at_initial_pathologic_diagnosis    string    Age at which a condition or disease was first diagnosed (in years)
anatomic_neoplasm_subdivision    string    Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor
avg_percent_lymphocyte_infiltration    float    Average in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
avg_percent_monocyte_infiltration    float    Average in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
avg_percent_necrosis    float    Average in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
avg_percent_neutrophil_infiltration    float    Average in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
avg_percent_normal_cells    float    Average in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
avg_percent_stromal_cells    float    Average in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
avg_percent_tumor_cells    float    Average in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
avg_percent_tumor_nuclei    float    Average in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
batch_number    integer    Groups samples by the batch they were processed in
bcr    string    A TCGA center where samples are carefully catalogued, processed, quality-checked and stored along with participant clinical information
clinical_M    string    Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_N    string    Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_stage    string    Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis
clinical_T    string    Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
colorectal_cancer    string    Text term to signify whether a patient has been diagnosed with colorectal cancer
country    string    Text to identify the name of the state, province, or country in which the sample was procured
country_of_procurement    string    Text to identify the name of the state, province, or country in which the sample was procured
days_to_birth    integer    Time interval from a person’s date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_collection    integer
days_to_death    integer    Time interval from a person’s date of death to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_initial_pathologic_diagnosis    integer    Numeric value to represent the day of an individual’s initial pathologic diagnosis of cancer
days_to_last_followup    integer    Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
days_to_submitted_specimen_dx    integer    Time interval from the date of diagnosis of the submitted sample to the date of initial pathologic diagnosis, represented as a calculated number of d
Study    string    A disease study is the sum of results from all experiments for a specific cancer type (or tumor type) that TCGA is tasked to study. Within the projec
ethnicity    string    The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
frozen_specimen_anatomic_site    string    Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
gender    string    Text designations that identify gender. Gender is described as the assemblage of properties that distinguish people on the basis of their societal ro
height    integer    The height of the patient in centimeters
histological_type    string    Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
history_of_colon_polyps    string    Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
history_of_neoadjuvant_treatment    string    Text term to describe the patient’s history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
history_of_prior_malignancy    string    Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
hpv_calls    string    Results of HPV tests
hpv_status    string    Current HPV status
icd_10    string    The tenth version of the International Classification of Disease (ICD), published by the World Health Organization in 1992. A system of numbered cate
icd_o_3_histology    string    The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
icd_o_3_site    string    The third edition of the International Classification of Diseases for Oncology, published in 2000, used principally in tumor and cancer registries fo
lymph_node_examined_count    integer
lymphatic_invasion    string    A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
lymphnodes_examined    string    The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
lymphovascular_invasion_present    string    The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
max_percent_lymphocyte_infiltration    integer
Maximum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmax_percent_monocyte_infiltration integer Maximum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen

Continued on next page

13 Programmatic Access 21

ISB Cancer Genomics Cloud Documentation Release 100

Table 11 ndash continued from previous pageParameter name Value Descriptionmax_percent_necrosis integer Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmax_percent_neutrophil_infiltration integer Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmax_percent_normal_cells integer Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimenmax_percent_stromal_cells integer Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenmax_percent_tumor_cells integer Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenmax_percent_tumor_nuclei integer Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenmenopause_status string Text term to signify the status of a womanrsquos menopause the permanent cessation of menses usually defined by 6 to 12 months of amenorrheamin_percent_lymphocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of lymphcyte infiltration in a malignant tumor sample or specimenmin_percent_monocyte_infiltration integer Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimenmin_percent_necrosis integer Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimenmin_percent_neutrophil_infiltration integer Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimenmin_percent_normal_cells integer Minimum in the series of numeric values to represent the percentage of normal cells in a 
malignant tumor sample or specimenmin_percent_stromal_cells integer Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimenmin_percent_tumor_cells integer Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimenmin_percent_tumor_nuclei integer Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimenmononucleotide_and_dinucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing at using a mononucleotide and dinucleotide microsatellite panelmononucleotide_marker_panel_analysis_status string Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panelneoplasm_histologic_grade string Numeric value to express the degree of abnormality of cancer cells a measure of differentiation and aggressivenessnew_tumor_event_after_initial_treatment string YesNoUnknown indicator to identify whether a patient has had a new tumor event after initial treatmentnumber_of_lymphnodes_examined integer the total number of lymph nodes removed and pathologically assessed for diseasenumber_of_lymphnodes_positive_by_he integer Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (HampE) staining light microscopyParticipantBarcode string The barcode assigned by TCGA to the Participantpathologic_M string Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the regpathologic_N string The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCCrsquos Cancepathologic_stage string The extent of a cancer especially whether the disease has spread from the original site to other parts of the body based on AJCC 
staging criteriapathologic_T string Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American person_neoplasm_cancer_status string The state or condition of an individualrsquos neoplasm at a particular point in timepregnancies string Value to describe the number of full-term pregnancies that a woman has experiencedpreservation_method stringprimary_neoplasm_melanoma_dx string Text indicator to signify whether a person had a primary diagnosis of melanomaprimary_therapy_outcome_success string Measure of Successprior_dx string Text term to describe the patientrsquos history of prior cancer diagnosis and the spatial location of any previous cancer occurrenceProject string The study for which the data was generatedpsa_value float The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the bloodrace string The text for reporting information about race based on the Office of Management and Budget (OMB) categoriesresidual_tumor string Text terms to describe the status of a tissue margin following surgical resectionSampleBarcode string The barcode assigned by TCGA to a sample from a Participanttobacco_smoking_history string Category describing current smoking status and smoking history as self-reported by a patienttotal_number_of_pregnancies integertumor_tissue_site string Text term that describes the anatomic site of the tumor or diseasetumor_pathology stringtumor_type string Text term to identify the morphologic subtype of papillary renal cell carcinomaweiss_venous_invasion string The result of an assessment using the Weiss histopathologic criteriavital_status string the survival state of the person registered on the protocolweight integer the weight of the patient measured in kilogramsyear_of_initial_pathologic_diagnosis string Numeric value to represent the year of an individualrsquos initial pathologic diagnosis of 
cancerSampleTypeCode string the type of the sample tumor or normal tissue cell or blood sample provided by a participanthas_Illumina_DNASeq string Indicates if a sample has gene sequencing data ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_BCGSC_HiSeq_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquo

Continued on next page

22 Chapter 1 Contents

ISB Cancer Genomics Cloud Documentation Release 100

Table 11 ndash continued from previous pageParameter name Value Descriptionhas_UNC_HiSeq_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_BCGSC_GA_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_UNC_GA_RNASeq string Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_HiSeq_miRnaSeq string Indicates if a sample has microRNA data from the IlluminaHiSeq platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_GA_miRNASeq string Indicates if a sample has microRNA data from the IlluminaGA platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_RPPA string Indicates if a sample has protein array data ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_SNP6 string Indicates if a sample has copy number data ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_27k string Indicates if a sample has methylation data from the Illumina 27k platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquohas_450k string Indicates if a sample has methylation data from the Illumina 450k platform ldquoTruerdquo ldquoFalserdquo or ldquoNonerdquo

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "patient_count": string,
  "patients": [string],
  "sample_count": string,
  "samples": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
patient_count    string    Number of participants in this cohort
patients[]    list    List of participant barcodes in this cohort
sample_count    string    Number of samples in this cohort
samples[]    list    List of sample barcodes in this cohort
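Because the counts in this response are strings rather than numbers, client code usually has to convert them before doing arithmetic. The following is a minimal sketch, assuming the response body has already been parsed into a Python dict; the helper name `summarize_cohort` and the example barcodes are hypothetical and only illustrate the shape described above.

```python
def summarize_cohort(response):
    """Return (patient_count, sample_count) as integers.

    `response` is assumed to be the parsed JSON body described above.
    """
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    # The counts arrive as strings; fall back to the list lengths
    # if a count field is missing from the response.
    patient_count = int(response.get("patient_count", len(patients)))
    sample_count = int(response.get("sample_count", len(samples)))
    return patient_count, sample_count


# Hypothetical example body, for illustration only.
example_cohort = {
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "2",
    "samples": ["TCGA-B9-7268-01A", "TCGA-B9-7268-11A"],
}

print(summarize_cohort(example_cohort))
```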

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name    Value    Description
Path parameters
patient_barcode    string    Barcode of the patient to get information about. Required.
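As a rough illustration of the request above, the sketch below builds the request URL for a given participant barcode. It assumes that `patient_barcode` is sent as a query-string parameter (the table labels it a path parameter, so adjust if the service expects otherwise); the names `BASE_URL`, `patient_details_url`, and `fetch_patient_details` are hypothetical helpers, not part of the API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL of the cohort API, taken from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the patient_details URL for a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are expected to have length 12")
    # Assumption: the barcode is passed in the query string.
    query = urlencode({"patient_barcode": patient_barcode})
    return f"{BASE_URL}/patient_details?{query}"


def fetch_patient_details(patient_barcode):
    """Send the GET request and return the parsed JSON body (needs network access)."""
    with urlopen(patient_details_url(patient_barcode)) as response:
        return json.load(response)


print(patient_details_url("TCGA-B9-7268"))
```

No authentication header is set here, since the method does not require the user to be authenticated.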

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "clinical_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "Study": string,
    "age_at_initial_pathologic_diagnosis": string,
    "anatomic_neoplasm_subdivision": string,
    "batch_number": string,
    "bcr": string,
    "clinical_M": string,
    "clinical_N": string,
    "clinical_T": string,
    "clinical_stage": string,
    "colorectal_cancer": string,
    "country": string,
    "days_to_birth": string,
    "days_to_initial_pathologic_diagnosis": string,
    "days_to_last_followup": string,
    "ethnicity": string,
    "frozen_specimen_anatomic_site": string,
    "gender": string,
    "histological_type": string,
    "history_of_colon_polyps": string,
    "history_of_neoadjuvant_treatment": string,
    "history_of_prior_malignancy": string,
    "hpv_calls": string,
    "hpv_status": string,
    "icd_10": string,
    "icd_o_3_histology": string,
    "icd_o_3_site": string,
    "lymphatic_invasion": string,
    "lymphnodes_examined": string,
    "lymphovascular_invasion_present": string,
    "menopause_status": string,
    "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
    "mononucleotide_marker_panel_analysis_status": string,
    "neoplasm_histologic_grade": string,
    "new_tumor_event_after_initial_treatment": string,
    "number_of_lymphnodes_examined": string,
    "number_of_lymphnodes_positive_by_he": string,
    "pathologic_M": string,
    "pathologic_N": string,
    "pathologic_T": string,
    "pathologic_stage": string,
    "person_neoplasm_cancer_status": string,
    "pregnancies": string,
    "primary_neoplasm_melanoma_dx": string,
    "primary_therapy_outcome_success": string,
    "prior_dx": string,
    "race": string,
    "residual_tumor": string,
    "tobacco_smoking_history": string,
    "tumor_tissue_site": string,
    "tumor_type": string,
    "vital_status": string,
    "weiss_venous_invasion": string,
    "year_of_initial_pathologic_diagnosis": string
  },
  "samples": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
aliquots[]    list    List of barcodes of aliquots taken from this participant
clinical_data    nested object    The clinical data about the participant
clinical_data.ParticipantBarcode    string    Participant barcode
clinical_data.Project    string    Project name, e.g. "TCGA"
clinical_data.Study    string    Tumor type abbreviation, e.g. "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis    string    Age, in years, at which a condition or disease was first diagnosed
clinical_data.anatomic_neoplasm_subdivision    string    Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
clinical_data.batch_number    string    Groups samples by the batch they were processed in
clinical_data.bcr    string    Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M    string    Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N    string    Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T    string    Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage    string    Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer    string    Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country    string    Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth    string    Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis    string    Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup    string    Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity    string    The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site    string    Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender    string    Text designations that identify gender
clinical_data.histological_type    string    Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps    string    Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment    string    Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy    string    Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls    string    Results of HPV tests
clinical_data.hpv_status    string    Current HPV status
clinical_data.icd_10    string    The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology    string    The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site    string    The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion    string    A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined    string    The yes/no/unknown indicator of whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present    string    The yes/no indicator to ask if large vessel (vascular) invasion or small thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status    string    Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status    string    Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade    string    Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment    string    Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined    string    The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he    string    Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M    string    Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N    string    The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage    string    The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T    string    Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
clinical_data.person_neoplasm_cancer_status    string    The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies    string    Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx    string    Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success    string    Measure of Success
clinical_data.prior_dx    string    Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race    string    The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor    string    Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history    string    Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site    string    Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type    string    Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status    string    The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion    string    The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis    string    Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[]    list    List of barcodes of samples taken from this participant

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.
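The two barcode lengths mentioned in this section (12 characters for a participant, 16 for a sample) can be checked before a request is sent. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the participant barcode is the first 12 characters of the sample barcode, which is consistent with the examples TCGA-B9-7268 and TCGA-B9-7268-01A.

```python
PARTICIPANT_BARCODE_LENGTH = 12
SAMPLE_BARCODE_LENGTH = 16


def classify_barcode(barcode):
    """Return "participant" or "sample" based on the barcode length."""
    if len(barcode) == PARTICIPANT_BARCODE_LENGTH:
        return "participant"
    if len(barcode) == SAMPLE_BARCODE_LENGTH:
        return "sample"
    raise ValueError(f"unexpected barcode length: {len(barcode)}")


def participant_from_sample(sample_barcode):
    """Derive the participant barcode from a sample barcode.

    Assumption: the participant barcode is the 12-character prefix
    of the 16-character sample barcode.
    """
    if classify_barcode(sample_barcode) != "sample":
        raise ValueError("expected a 16-character sample barcode")
    return sample_barcode[:PARTICIPANT_BARCODE_LENGTH]


print(classify_barcode("TCGA-B9-7268"))
print(participant_from_sample("TCGA-B9-7268-01A"))
```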

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
aliquots[]    list    List of barcodes of aliquots taken from this participant
biospecimen_data    nested object    Biospecimen data about the sample
biospecimen_data.ParticipantBarcode    string    Participant barcode
biospecimen_data.Project    string    Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode    string    Sample barcode
biospecimen_data.Study    string    Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration    integer    Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration    integer    Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis    integer    Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration    integer    Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells    integer    Average percent normal cells
biospecimen_data.avg_percent_stromal_cells    integer    Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells    integer    Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei    integer    Average percent tumor nuclei
biospecimen_data.batch_number    string    Batch number in which the sample was processed
biospecimen_data.bcr    string    Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection    string    Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration    string    Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration    string    Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis    string    Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration    string    Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells    string    Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells    string    Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells    string    Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei    string    Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration    string    Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration    string    Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis    string    Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration    string    Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells    string    Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells    string    Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells    string    Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei    string    Minimum percent tumor nuclei
data_details[]    list    List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath    string    Path to file, if it exists
data_details[].DataCenterName    string    Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType    string    Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName    string    Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey    string    Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded    string    Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel    string    Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype    string    Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference    string    Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline    string    A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform    string    A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name    string    The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project    string    The study for which the data was generated, e.g. "TCGA"
data_details[].Repository    string    A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName    string    Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode    string    Sample barcode
data_details[].SecurityProtocol    string    An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count    string    Length of data_details list
patient    string    Participant barcode
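To make the shape of the `data_details` list more concrete, the sketch below groups the cloud storage paths in a parsed `sample_details` response by platform. It is only an illustration: `files_by_platform` is a hypothetical helper, the bucket and file names in the example are invented, and entries without a `CloudStoragePath` are assumed to be omitted or empty.

```python
from collections import defaultdict


def files_by_platform(response):
    """Map each Platform value to the list of CloudStoragePath values found for it."""
    grouped = defaultdict(list)
    for detail in response.get("data_details", []):
        path = detail.get("CloudStoragePath")
        # Skip entries that have no storage path yet.
        if path:
            grouped[detail.get("Platform", "unknown")].append(path)
    return dict(grouped)


# Hypothetical, heavily abbreviated response body for illustration only.
example_sample = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "CloudStoragePath": "gs://example-bucket/a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "CloudStoragePath": "gs://example-bucket/b.txt"},
        {"Platform": "Genome_Wide_SNP_6", "CloudStoragePath": ""},
    ],
}

print(files_by_platform(example_sample))
```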

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get file paths for.
platform    string    Optional. Filter file results by platform.
pipeline    string    Optional. Filter file results by pipeline.
token    string    Optional. Access token to authenticate user.
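The optional parameters above can simply be left out of the request when they are not needed. The following sketch builds the request URL under the assumption that all four parameters are sent in the query string; `BASE_URL` and `datafilenamekey_url` are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

# Base URL of the cohort API, taken from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the datafilenamekey_list_from_sample URL, omitting unset optional filters."""
    params = {"sample_barcode": sample_barcode}
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    # Assumption: parameters are passed as a query string.
    return f"{BASE_URL}/datafilenamekey_list_from_sample?{urlencode(params)}"


print(datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Without a `token`, only open-access file paths should be expected in the result, as described above.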

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name    Value    Description
kind    cohort_api#cohortsItem    The resource type
count    string    Integer representing the length of the datafilenamekeys list
datafilenamekeys[]    list    List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
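Since the `datafilenamekeys` field may be absent and some entries may be placeholders, a client should not index into the response blindly. The sketch below shows one defensive way to read it; `available_paths` is a hypothetical helper, the example paths are invented, and the exact form of a placeholder entry is an assumption (the code only checks that the entry contains the text "file-path-not-yet-available").

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response):
    """Return the usable cloud storage paths from a parsed response body."""
    # The field is missing entirely when no file paths are listed.
    keys = response.get("datafilenamekeys", [])
    # Drop entries that only mark a file whose path is not yet available.
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical response bodies for illustration only.
with_paths = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/tcga/example-file.txt",
        "example-bucket/file-path-not-yet-available",
    ],
}
without_paths = {"kind": "cohort_api#cohortsItem"}

print(available_paths(with_paths))
print(available_paths(without_paths))
```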

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value    Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name                Value                    Description
kind                         cohort_api#cohortsItem   The resource type.
count                        string                   The number of items returned. Count will be either "0" or "1".
items[]                      list                     If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode        string                   The sample barcode passed into the request.
items[].GG_dataset_id        string                   The dataset id of the sample.
items[].GG_readgroupset_id   string                   The readgroupset id of the sample.
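The request and response described above could be handled along the following lines. This is a sketch under stated assumptions, not the official client: the endpoint URL is written out with its punctuation restored, sample_barcode is assumed to be passed as a query parameter, and the barcode and ids in the example are invented for illustration.

```python
import json
from urllib.parse import urlencode

# Endpoint from the HTTP request section above, punctuation restored (assumption).
BASE_URL = ("https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/"
            "google_genomics_from_sample")


def build_request_url(sample_barcode):
    """Build the GET URL, passing sample_barcode as a query parameter (assumption)."""
    return BASE_URL + "?" + urlencode({"sample_barcode": sample_barcode})


def extract_ids(response_body):
    """Return (dataset_id, readgroupset_id), or None if the sample has no entry."""
    payload = json.loads(response_body)
    items = payload.get("items", [])
    # count is documented as the string "0" or "1".
    if int(payload.get("count", "0")) == 0 or not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical response bodies, for illustration only.
found = json.dumps({
    "count": "1",
    "items": [{
        "SampleBarcode": "TCGA-01-0001-01A",
        "GG_dataset_id": "example-dataset",
        "GG_readgroupset_id": "example-readgroupset",
    }],
})
not_found = json.dumps({"count": "0"})

print(build_request_url("TCGA-01-0001-01A"))
print(extract_ids(found))
print(extract_ids(not_found))
```

Treating a count of "0" and a missing items list the same way keeps the caller's logic to a single None check.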


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute, with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says "Enable", click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
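If you set these properties often, the gcloud config set invocations can be generated from a small dictionary. The sketch below only builds the argument lists; the helper name and the example values (project id, region, zone) are hypothetical, and the property names are written in gcloud's section/property form (for example compute/zone).

```python
def config_set_commands(settings):
    """Return one `gcloud config set` argument list per property, sorted by name."""
    return [["gcloud", "config", "set", name, value]
            for name, value in sorted(settings.items())]


# Hypothetical values, for illustration only.
settings = {
    "project": "my-project-id",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}

for argv in config_set_commands(settings):
    print(" ".join(argv))
```

Each argument list can be passed to subprocess.run(argv, check=True) on a machine where the Cloud SDK is installed; building lists rather than shell strings avoids quoting problems.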


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options, below the "Management, disk, ..." line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
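When launching VMs from a script, it can help to assemble the create command programmatically. The following is a minimal sketch that only builds and prints the command; the helper name is hypothetical, and the flags used (--machine-type and --zone) are the ones discussed in this section.

```python
import shlex


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # Only needed when compute/zone is not set in the gcloud configuration.
        argv += ["--zone", zone]
    return argv


print(shlex.join(create_instance_command("my-instance")))
print(shlex.join(create_instance_command("my-instance", zone="us-central1-a")))
```

The first line printed matches the simple command shown above; passing the list to subprocess.run would execute it, and would start incurring charges.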

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
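The disk prices quoted above ($2 per month for 50 GB, $40 per month for 1 TB) both correspond to $0.04 per GB per month. The short sketch below applies that rate; the rate is derived from the figures in this section rather than from a current price list, and it treats 1 TB as 1000 GB, so check the official pricing page before budgeting.

```python
# Rate implied by the examples in the text: $2 / 50 GB = $0.04 per GB per month.
PD_STANDARD_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb):
    """Estimate the monthly cost in USD of a standard persistent disk."""
    return round(size_gb * PD_STANDARD_USD_PER_GB_MONTH, 2)


for size_gb in (50, 500, 1000):
    print(size_gb, "GB:", monthly_disk_cost(size_gb), "USD per month")
```

This is why stopping a VM while keeping its disk is usually cheap: the disk charge continues, but the per-hour VM charge does not.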


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point. On the instance, the disk appears under /dev/disk/by-id/ as google- followed by the device name given when it was attached.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool.

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
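The whole persistent-disk lifecycle described in this section (create, attach, format, mount, unmount, detach, delete) can be summarized as an ordered list of commands, each tagged with where it runs. This is a sketch only: the helper name is hypothetical, nothing is executed, and it assumes the disk is attached with a device name equal to the disk name, so that it appears on the instance as /dev/disk/by-id/google-<disk name>.

```python
def disk_lifecycle(instance, disk, size="500GB", mount_point="/mnt/pd1"):
    """Return (where, command) pairs for the persistent-disk lifecycle."""
    # Assumption: --device-name equals the disk name (see lead-in).
    device = "/dev/disk/by-id/google-" + disk
    return [
        ("workstation", f"gcloud compute disks create {disk} --size {size}"),
        ("workstation", f"gcloud compute instances attach-disk {instance} "
                        f"--disk {disk} --device-name {disk}"),
        # Formatting erases any existing data on the disk.
        ("instance", f"sudo mkfs.ext4 -F {device}"),
        ("instance", f"sudo mkdir -p {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("instance", f"sudo umount {device}"),
        ("workstation", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("workstation", f"gcloud compute disks delete {disk}"),
    ]


for where, command in disk_lifecycle("my-instance", "disk-1"):
    print(f"[{where}] {command}")
```

Commands tagged "instance" are meant to be run after gcloud compute ssh; the rest run from your workstation or the Cloud Shell.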


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
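To make the three set operations concrete, here is a small Python sketch that treats each cohort as a set of sample identifiers. It illustrates the semantics described above, not the web application's implementation; the function names and the sample identifiers are hypothetical.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort; order does not matter."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, for illustration only.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3"}

print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b)))
```

Unlike union and intersect, complement is not symmetric: swapping the base cohort with one of the others generally gives a different result, which is why the dialogue asks you to pick a base.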

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
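The counts shown beside filter values can be thought of as follows. This Python sketch is one possible reading of "based on all other filters", namely that the filter on the attribute being counted is ignored while filters on every other attribute are applied; that reading, the function name, and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def value_counts(samples, attribute, selected_filters):
    """Count samples per value of `attribute`, applying filters on other attributes only."""
    others = {name: allowed for name, allowed in selected_filters.items()
              if name != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample.get(name) in allowed for name, allowed in others.items()):
            counts[sample.get(attribute)] += 1
    return counts


# Hypothetical sample records, for illustration only.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
]
selected = {"vital_status": {"Alive"}, "gender": {"FEMALE"}}

print(dict(value_counts(samples, "vital_status", selected)))
print(dict(value_counts(samples, "gender", selected)))
```

Under this reading, choosing "Alive" does not make the "Dead" count drop to zero, so you can still see how many samples each alternative value would contribute.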

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. By hovering over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
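The difference between the sample count and the participant count can be illustrated with TCGA barcodes, in which the first three dash-separated fields of a sample barcode identify the participant. The barcodes below are made up for illustration, and the helper name is hypothetical.

```python
def participant_barcode(sample_barcode):
    """Return the participant portion (first three fields) of a TCGA sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])


# Hypothetical sample barcodes: two samples from one participant, one from another.
sample_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-11A",
    "TCGA-BB-0002-01A",
]

participants = {participant_barcode(b) for b in sample_barcodes}
print(len(sample_barcodes), "samples from", len(participants), "participants")
```

Here a participant who contributed two samples is counted once as a participant but twice in the sample total.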

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
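Once downloaded, the file list can be processed with a few lines of Python. The sketch below assumes that the CSV has a header row and that the columns are named "Sample Barcode", "Platform", "Pipeline", "Data Level", and "File Path"; the exact header names are not given here, so check them against a real download. The rows shown are invented for illustration.

```python
import csv
import io


def paths_for_platform(csv_text, platform):
    """Return the cloud storage paths of all files generated on the given platform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["File Path"] for row in reader if row["Platform"] == platform]


# Hypothetical file list contents; header names are assumptions.
CSV_TEXT = (
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "TCGA-AA-0001-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-BB-0002-01A,PlatformB,pipeline-2,Level 3,gs://example-bucket/b.txt\n"
)

print(paths_for_platform(CSV_TEXT, "PlatformA"))
```

For a file on disk, pass the result of open(path, newline="").read() instead of CSV_TEXT, or adapt the function to accept a file object.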

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.


143 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined Avisualization is a collection of one or more plots and each plot is configurable to show relevant data that can be sharedwith others

Create

You can create a new visualization from the user dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
  – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
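The two permission levels can be pictured as a small lookup table. The Python sketch below is purely illustrative: the `Role` enum, the action names, and the `can` and `copy_item` helpers are hypothetical and are not part of the ISB-CGC web-app or its API. The Owner's rights beyond editing and sharing (viewing, saving, commenting, copying) are assumed here rather than stated above.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    READER = "reader"


# Hypothetical action names summarizing the rules described above.
# Owner actions other than "edit" and "share" are an assumption.
ALLOWED_ACTIONS = {
    Role.OWNER: {"view", "edit", "save", "share", "comment", "copy"},
    Role.READER: {"view", "comment", "copy"},
}


def can(role, action):
    """Return True if the given role may perform the given action."""
    return action in ALLOWED_ACTIONS[role]


def copy_item(role):
    """Copying ("cloning") yields a new item whose owner is the copier."""
    if not can(role, "copy"):
        raise PermissionError("copy not permitted for role %s" % role.value)
    return Role.OWNER


print(can(Role.READER, "save"))   # a Reader cannot save changes
print(copy_item(Role.READER))     # but becomes Owner of their own copy
```

In this picture a Reader who wants to save plot changes first makes a copy and then works on that copy as its Owner.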

1.4.8 Release Notes

• December 23, 2015: v0.2

  – A treemap graph on the cohort details and cohort creation pages will not apply its own filters to itself. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
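Once the `isb-cgc` project is visible, its tables can also be referenced by project ID in a query. The Python sketch below only builds a query string and does not contact BigQuery. The dataset and table names passed in are hypothetical placeholders (browse the left side-bar for the real names), and the `build_count_query` helper is an illustration, not part of any ISB-CGC or Google library. The backtick-quoted `project.dataset.table` form assumes BigQuery standard SQL.

```python
import re

ISB_CGC_PROJECT = "isb-cgc"  # project ID given in the answer above

# Restrict names to letters, digits and underscores before
# interpolating them into the query text.
_NAME = re.compile(r"^[A-Za-z0-9_]+$")


def build_count_query(dataset, table, project=ISB_CGC_PROJECT):
    """Build a row-count query against a fully qualified BigQuery table."""
    for name in (dataset, table):
        if not _NAME.match(name):
            raise ValueError("unexpected characters in name: %r" % name)
    return "SELECT COUNT(*) AS n FROM `%s.%s.%s`" % (project, dataset, table)


# "some_dataset" and "some_table" are placeholders, not real ISB-CGC names.
print(build_count_query("some_dataset", "some_table"))
```

The resulting string can be pasted into the BigQuery web interface or passed to whichever BigQuery client you use from your own Google Cloud Project.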

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
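Scripts that depend on controlled-access data may want to check how much of the 24-hour window remains before starting a long job. The following Python sketch shows the arithmetic only. It assumes the window starts at the moment your authorization was verified and that you record that time yourself; the function names are hypothetical and nothing here queries the ISB-CGC platform.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the answer above


def access_expires_at(verified_at):
    """Return the time at which access lapses (assumed start: verification)."""
    return verified_at + ACCESS_WINDOW


def has_access(verified_at, now=None):
    """Return True while `now` falls inside the assumed 24-hour window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return verified_at <= now < access_expires_at(verified_at)


verified = datetime(2016, 3, 3, 12, 0, tzinfo=timezone.utc)  # example value
print(access_expires_at(verified).isoformat())
```

If a job is expected to run past the expiry time, re-authenticating through the web-app beforehand avoids losing access partway through.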

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
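You may find it useful to track your own spending against the allocation rather than waiting for an alert. The sketch below is a minimal illustration of that arithmetic in Python. The $300 figure comes from the text above, while the 80% warning threshold and the `allocation_status` helper are arbitrary assumptions and do not reflect how the ISB-CGC team actually monitors usage.

```python
INITIAL_ALLOCATION_USD = 300.0  # initial allocation stated above


def allocation_status(spent_usd, allocation_usd=INITIAL_ALLOCATION_USD,
                      warn_fraction=0.8):
    """Classify spending against an allocation.

    The default warning threshold (80%) is an assumption chosen for
    illustration only.
    """
    if spent_usd < 0 or allocation_usd <= 0:
        raise ValueError("spent_usd must be >= 0 and allocation_usd > 0")
    fraction = spent_usd / allocation_usd
    if fraction >= 1.0:
        return "exceeded"
    if fraction >= warn_fraction:
        return "approaching limit"
    return "ok"


for spent in (100.0, 250.0, 300.0):
    print(spent, allocation_status(spent))
```

The spending figure itself would come from the billing information for your Google Cloud Project.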

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Table 1.1 – continued

Parameter name | Value | Description
max_percent_necrosis | integer | Maximum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
max_percent_neutrophil_infiltration | integer | Maximum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
max_percent_normal_cells | integer | Maximum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
max_percent_stromal_cells | integer | Maximum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
max_percent_tumor_cells | integer | Maximum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
max_percent_tumor_nuclei | integer | Maximum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
menopause_status | string | Text term to signify the status of a woman’s menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
min_percent_lymphocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of lymphocyte infiltration in a malignant tumor sample or specimen
min_percent_monocyte_infiltration | integer | Minimum in the series of numeric values to represent the percentage of monocyte infiltration in a malignant tumor sample or specimen
min_percent_necrosis | integer | Minimum in the series of numeric values to represent the percentage of cell death in a malignant tumor sample or specimen
min_percent_neutrophil_infiltration | integer | Minimum in the series of numeric values to represent the percentage of neutrophil infiltration in a malignant tumor sample or specimen
min_percent_normal_cells | integer | Minimum in the series of numeric values to represent the percentage of normal cells in a malignant tumor sample or specimen
min_percent_stromal_cells | integer | Minimum in the series of numeric values to represent the percentage of stromal cells in a malignant tumor sample or specimen
min_percent_tumor_cells | integer | Minimum in the series of numeric values to represent the percentage of tumor cells in a malignant tumor sample or specimen
min_percent_tumor_nuclei | integer | Minimum in the series of numeric values to represent the percentage of tumor nuclei in a malignant tumor sample or specimen
mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
number_of_lymphnodes_examined | integer | The total number of lymph nodes removed and pathologically assessed for disease
number_of_lymphnodes_positive_by_he | integer | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
ParticipantBarcode | string | The barcode assigned by TCGA to the Participant
pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC’s Cance
pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
person_neoplasm_cancer_status | string | The state or condition of an individual’s neoplasm at a particular point in time
pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
preservation_method | string |
primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
primary_therapy_outcome_success | string | Measure of Success
prior_dx | string | Text term to describe the patient’s history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
Project | string | The study for which the data was generated
psa_value | float | The lab value that represents the results of the most recent (post-operative) prostatic-specific antigen (PSA) in the blood
race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
SampleBarcode | string | The barcode assigned by TCGA to a sample from a Participant
tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
total_number_of_pregnancies | integer |
tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
tumor_pathology | string |
tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
vital_status | string | The survival state of the person registered on the protocol
weight | integer | The weight of the patient measured in kilograms
year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual’s initial pathologic diagnosis of cancer
SampleTypeCode | string | The type of the sample (tumor or normal tissue, cell, or blood sample) provided by a participant
has_Illumina_DNASeq | string | Indicates if a sample has gene sequencing data: “True”, “False”, or “None”
has_BCGSC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: “True”, “False”, or “None”
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: “True”, “False”, or “None”
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: “True”, “False”, or “None”
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: “True”, “False”, or “None”
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: “True”, “False”, or “None”
has_RPPA | string | Indicates if a sample has protein array data: “True”, “False”, or “None”
has_SNP6 | string | Indicates if a sample has copy number data: “True”, “False”, or “None”
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: “True”, “False”, or “None”
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: “True”, “False”, or “None”

Response

If successful, this method returns a response body with the following structure:

kind: cohort_apicohortsItem
patient_count: string
patients: [string]
sample_count: string
samples: [string]

Property name | Value | Description
kind | cohort_apicohortsItem | The resource type
patient_count | string | Number of participants in this cohort
patients[] | list | List of participant barcodes in this cohort
sample_count | string | Number of samples in this cohort
samples[] | list | List of sample barcodes in this cohort
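Because the two counts are documented as strings, client code will usually want to convert them before doing arithmetic. The Python sketch below parses a response body of the shape described above. The example payload is made up for illustration (the sample barcode in particular is hypothetical), and the `summarize_cohort` helper is not part of the ISB-CGC API.

```python
import json


def summarize_cohort(response_text):
    """Parse a cohort response body and convert the string counts to int."""
    body = json.loads(response_text)
    return {
        "patient_count": int(body["patient_count"]),
        "sample_count": int(body["sample_count"]),
        # The barcode lists are treated as optional and default to empty.
        "patients": list(body.get("patients", [])),
        "samples": list(body.get("samples", [])),
    }


# Illustrative payload only; real responses come from the API.
example = json.dumps({
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
})

summary = summarize_cohort(example)
print(summary["patient_count"], summary["patients"])
```

Treating the barcode lists as optional is a defensive choice in this sketch; the documentation above does not say whether they can be absent.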

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
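A request for this method can be assembled with the Python standard library. The sketch below only builds the request URL and does not contact the service. The base URL is written out with full punctuation and should be checked against the request line above. Passing `patient_barcode` as a query-string parameter is an assumption, and the `patient_details_url` helper is hypothetical.

```python
from urllib.parse import urlencode

# Assumed fully punctuated form of the endpoint in the request line above.
PATIENT_DETAILS_URL = (
    "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details"
)


def patient_details_url(patient_barcode):
    """Build the GET URL for one participant barcode (12 characters)."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long")
    # Assumption: the barcode is sent as a query-string parameter.
    return PATIENT_DETAILS_URL + "?" + urlencode(
        {"patient_barcode": patient_barcode}
    )


print(patient_details_url("TCGA-B9-7268"))
```

Since this method requires no authentication, the resulting URL can be fetched with any HTTP client.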

Response

If successful this method returns a response body with the following structure

kind: cohort_apicohortsItem
aliquots: [string]
clinical_data:
  ParticipantBarcode: string
  Project: string
  Study: string
  age_at_initial_pathologic_diagnosis: string
  anatomic_neoplasm_subdivision: string
  batch_number: string
  bcr: string
  clinical_M: string
  clinical_N: string
  clinical_T: string
  clinical_stage: string
  colorectal_cancer: string
  country: string
  days_to_birth: string
  days_to_initial_pathologic_diagnosis: string
  days_to_last_followup: string
  ethnicity: string
  frozen_specimen_anatomic_site: string
  gender: string
  histological_type: string
  history_of_colon_polyps: string
  history_of_neoadjuvant_treatment: string
  history_of_prior_malignancy: string
  hpv_calls: string
  hpv_status: string
  icd_10: string
  icd_o_3_histology: string
  icd_o_3_site: string
  lymphatic_invasion: string
  lymphnodes_examined: string
  lymphovascular_invasion_present: string
  menopause_status: string
  mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
  mononucleotide_marker_panel_analysis_status: string
  neoplasm_histologic_grade: string
  new_tumor_event_after_initial_treatment: string
  number_of_lymphnodes_examined: string
  number_of_lymphnodes_positive_by_he: string
  pathologic_M: string
  pathologic_N: string
  pathologic_T: string
  pathologic_stage: string
  person_neoplasm_cancer_status: string
  pregnancies: string
  primary_neoplasm_melanoma_dx: string
  primary_therapy_outcome_success: string
  prior_dx: string
  race: string
  residual_tumor: string
  tobacco_smoking_history: string
  tumor_tissue_site: string
  tumor_type: string
  vital_status: string
  weiss_venous_invasion: string
  year_of_initial_pathologic_diagnosis: string
samples: []

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
clinical_data | nested object | The clinical data about the participant
clinical_data.ParticipantBarcode | string | Participant barcode
clinical_data.Project | string | Project name, e.g., "TCGA"
clinical_data.Study | string | Tumor type abbreviation, e.g., "BRCA"
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions, and/or anatomic site name of a tumor
clinical_data.batch_number | string | Groups samples by the batch they were processed in
clinical_data.bcr | string | Biospecimen core resource, e.g., "Nationwide Children's Hospital", "Washington University"
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_T | string | Extent of the primary cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N), and metastases (M), and by grouping cases with similar prognosis for cancer
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample
clinical_data.gender | string | Text designations that identify gender
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s)
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.hpv_calls | string | Results of HPV tests
clinical_data.hpv_status | string | Current HPV status
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD)
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma
clinical_data.primary_therapy_outcome_success | string | Measure of Success
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma
clinical_data.vital_status | string | The survival state of the person registered on the protocol
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer
samples[] | list | List of barcodes of samples taken from this participant
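As a rough illustration of calling this endpoint from Python, the sketch below builds the request URL and decodes the JSON response using only the standard library. The base URL is reconstructed from the HTTP request shown above, the helper names (build_patient_details_url, get_patient_details) are hypothetical, and sending patient_barcode as a query-string parameter is an assumption rather than something this section confirms.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL, reconstructed from the HTTP request documented above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_patient_details_url(patient_barcode):
    """Return the patient_details URL for a 12-character participant barcode."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long, e.g. TCGA-B9-7268")
    # Assumption: the barcode is passed as a query-string parameter.
    query = urllib.parse.urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


def get_patient_details(patient_barcode):
    """Fetch and decode the response body; no authentication is needed."""
    url = build_patient_details_url(patient_barcode)
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

If the call succeeds, get_patient_details("TCGA-B9-7268") should return a dictionary whose clinical_data, samples, and aliquots entries follow the structure described above.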

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g., TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "aliquots": [string],
  "biospecimen_data": {
    "ParticipantBarcode": string,
    "Project": string,
    "SampleBarcode": string,
    "Study": string,
    "avg_percent_lymphocyte_infiltration": integer,
    "avg_percent_monocyte_infiltration": integer,
    "avg_percent_necrosis": integer,
    "avg_percent_neutrophil_infiltration": integer,
    "avg_percent_normal_cells": integer,
    "avg_percent_stromal_cells": integer,
    "avg_percent_tumor_cells": integer,
    "avg_percent_tumor_nuclei": integer,
    "batch_number": string,
    "bcr": string,
    "days_to_collection": string,
    "max_percent_lymphocyte_infiltration": string,
    "max_percent_monocyte_infiltration": string,
    "max_percent_necrosis": string,
    "max_percent_neutrophil_infiltration": string,
    "max_percent_normal_cells": string,
    "max_percent_stromal_cells": string,
    "max_percent_tumor_cells": string,
    "max_percent_tumor_nuclei": string,
    "min_percent_lymphocyte_infiltration": string,
    "min_percent_monocyte_infiltration": string,
    "min_percent_necrosis": string,
    "min_percent_neutrophil_infiltration": string,
    "min_percent_normal_cells": string,
    "min_percent_stromal_cells": string,
    "min_percent_tumor_cells": string,
    "min_percent_tumor_nuclei": string
  },
  "data_details": [
    {
      "CloudStoragePath": string,
      "DataCenterName": string,
      "DataCenterType": string,
      "DataFileName": string,
      "DataFileNameKey": string,
      "DataLevel": string,
      "DatafileUploaded": string,
      "Datatype": string,
      "GenomeReference": string,
      "Pipeline": string,
      "Platform": string,
      "Project": string,
      "Repository": string,
      "SDRFFileName": string,
      "SampleBarcode": string,
      "SecurityProtocol": string,
      "platform_full_name": string
    }
  ],
  "data_details_count": string,
  "patient": string
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g., "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g., "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g., "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g., "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g., "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g., "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g., "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g., "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g., "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g., "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g., "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g., "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g., "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g., "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode
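The example barcodes in this section (TCGA-B9-7268 for a participant, TCGA-B9-7268-01A for a sample) suggest that a sample barcode begins with its participant barcode. The sketch below relies on that assumption to derive the participant barcode, and also shows one way to summarize the data_details list by Platform. The function names and the response fragment are made up for illustration; they are not output captured from the API.

```python
from collections import Counter


def participant_from_sample(sample_barcode):
    """Return the 12-character participant prefix of a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("sample barcodes are 16 characters long, e.g. TCGA-B9-7268-01A")
    # Assumption: the first 12 characters are the participant barcode.
    return sample_barcode[:12]


def count_files_by_platform(response):
    """Count data_details entries per Platform value in a sample_details response."""
    details = response.get("data_details", [])
    return Counter(entry.get("Platform", "unknown") for entry in details)


# Hypothetical response fragment, trimmed to the fields used here.
example_response = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "2",
    "data_details": [
        {"SampleBarcode": "TCGA-B9-7268-01A", "Platform": "IlluminaHiSeq_miRNASeq"},
        {"SampleBarcode": "TCGA-B9-7268-01A", "Platform": "IlluminaHiSeq_miRNASeq"},
    ],
}

platform_counts = count_files_by_platform(example_response)
```

Note that data_details_count is documented as a string, so it needs an int() conversion before it can be compared with the length of the data_details list.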

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
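Two details of this endpoint are easy to get wrong in a script: the optional filters should only be sent when they are set, and the datafilenamekeys field may be missing from the response altogether. The following sketch handles both. The base URL is reconstructed from the HTTP request above, the function names are hypothetical, and passing the parameters in the query string is an assumption.

```python
import urllib.parse

# Assumed base URL, reconstructed from the HTTP request documented above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_file_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, including only the optional parameters that are set."""
    params = [("sample_barcode", sample_barcode)]
    for name, value in (("platform", platform), ("pipeline", pipeline), ("token", token)):
        if value is not None:
            params.append((name, value))
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urllib.parse.urlencode(params)


def file_paths(response):
    """Return the cloud storage paths, or an empty list when the field is absent."""
    return response.get("datafilenamekeys", [])
```

Treating a missing datafilenamekeys field as an empty list lets the calling code loop over the result without a separate check for the "no visible files" case.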

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either "0" or "1"
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
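Since count is returned as the string "0" or "1", a caller has to convert it before testing it. A minimal sketch of unpacking the response, with a hypothetical function name and placeholder id values, might look like this:

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None when no ids exist for the sample."""
    # count is documented as a string, so convert it before comparing.
    if int(response.get("count", "0")) == 0:
        return None
    item = response["items"][0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Hypothetical responses; the id values are placeholders, not real ids.
found = {
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
not_found = {"count": "0"}
```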

1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
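If you script your setup, the properties listed above can be applied in one pass. The sketch below only assembles the gcloud config set argument lists and prints them; it does not run anything. The helper name and the project id are placeholders, and the region and zone values are merely examples.

```python
import shlex


def config_set_commands(properties):
    """Return one `gcloud config set NAME VALUE` argument list per property."""
    return [["gcloud", "config", "set", name, value] for name, value in properties.items()]


# Placeholder values; substitute your own project id and preferred zone.
settings = {
    "project": "my-project-id",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}

for command in config_set_commands(settings):
    # Print a copy-and-paste friendly form of each command.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Each argument list can be handed to subprocess.check_call on a machine where the Cloud SDK is installed and you have already authenticated.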

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page, you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property, for example:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
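When VMs are launched from a script rather than by hand, it can help to assemble the command in one place. The sketch below builds the argument list for gcloud compute instances create, mirroring the simple command above; the function name is hypothetical, and only the --machine-type, --zone, and --preemptible flags are used.

```python
def create_instance_command(name, machine_type="g1-small", zone=None, preemptible=False):
    """Assemble the argument list for `gcloud compute instances create`."""
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    if zone is not None:
        # Without --zone, gcloud falls back to the compute/zone property if it is set.
        cmd += ["--zone", zone]
    if preemptible:
        cmd.append("--preemptible")
    return cmd


simple = create_instance_command("my-instance")
zoned = create_instance_command("my-instance", zone="us-central1-a", preemptible=True)
```

Passing the list to subprocess.check_call(simple) would run the command on a workstation with the Cloud SDK installed; building a list rather than a single string avoids shell-quoting problems with instance names.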

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM Remember that as long as your VM is running whether or not you are actually doinganything with it charges will be incurred It is therefore a good idea to get in the habit of shutting down VMs as soonas you are finished with your work They can easily be restarted an hour day or week later Note that resources thatare attached to a stopped VM (such as persistent disks) will however continue to incur charges Compared to the costof the VM though the cost of a persistent disk is typically negligible a 50 GB standard persistent disk only costs $2per month and 1 TB costs $40
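The two prices quoted above imply a flat rate of about $0.04 per GB per month for a standard persistent disk. The sketch below only reproduces that arithmetic; the rate is an assumption derived from the figures in this section (treating 1 TB as 1000 GB) and may not match current Google pricing.

```python
# Assumption: rate implied by "$2 for 50 GB" and "$40 for 1 TB" in the text above.
RATE_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=RATE_PER_GB_MONTH):
    """Estimated monthly cost in dollars of a standard persistent disk."""
    return round(size_gb * rate, 2)


if __name__ == "__main__":
    for size_gb in (50, 500, 1000):
        print(size_gb, "GB:", monthly_disk_cost(size_gb), "dollars per month")
```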

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is given as a positional argument, and --device-name sets the name under which the disk appears inside the instance (as /dev/disk/by-id/google-DEVICE_NAME). Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
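The full life cycle described in this section (create, attach, format, mount, unmount, detach, delete) can be summarized as an ordered plan. This is only an illustrative sketch: the function name disk_lifecycle is hypothetical, nothing is executed, and the device path assumes the disk is attached with a device name equal to the disk name, which Compute Engine exposes under /dev/disk/by-id/ with a "google-" prefix.

```python
def disk_lifecycle(disk, instance, size_gb, mount_point):
    """Return (where, command) pairs in the order they should be run.

    "local" commands run on your workstation; "vm" commands run inside
    the instance after `gcloud compute ssh`.
    """
    # Assumption: the device name equals the disk name (see --device-name below).
    device = f"/dev/disk/by-id/google-{disk}"
    return [
        ("local", f"gcloud compute disks create {disk} --size {size_gb}GB"),
        ("local", f"gcloud compute instances attach-disk {instance} "
                  f"--disk {disk} --device-name {disk}"),
        ("vm", f"sudo mkfs.ext4 -F {device}"),  # destroys existing data on the disk
        ("vm", f"sudo mkdir {mount_point}"),
        ("vm", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("vm", f"sudo umount {device}"),
        ("local", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


if __name__ == "__main__":
    for where, command in disk_lifecycle("disk-1", "my-instance", 500, "/mnt/pd1"):
        print(f"[{where}] {command}")
```

Keeping the steps as data makes the ordering constraint explicit: the disk must be unmounted and detached before it can be deleted.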

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables, and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button becomes selectable once at least one cohort is selected. Click on the “Trash” button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The “Share” button becomes selectable once at least one cohort is selected. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
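The three set operations behave like ordinary set algebra on the samples in each cohort. The sketch below uses Python sets of made-up sample identifiers to illustrate this; it is not the web-app's implementation, and the function names are hypothetical.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def intersect(first, *rest):
    """Samples that appear in every cohort (order does not matter)."""
    return set(first).intersection(*rest)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


if __name__ == "__main__":
    # Illustrative sample identifiers only.
    cohort_a = {"S1", "S2", "S3"}
    cohort_b = {"S2", "S3", "S4"}
    print(sorted(union(cohort_a, cohort_b)))
    print(sorted(intersect(cohort_a, cohort_b)))
    print(sorted(complement(cohort_a, cohort_b)))
```

Unlike union and intersect, complement is not symmetric: swapping the base cohort with one of the subtracted cohorts generally gives a different result, which is why the dialogue asks for a base cohort.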

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it indicates whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. Clicking on a feature expands the field and provides additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
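The counts shown beside each filter value can be thought of as "faceted" counts: for a given attribute, count the samples that pass every other selected filter, ignoring the selection on that attribute itself. The following sketch illustrates the idea with made-up records; the field names and the function facet_counts are hypothetical, and the web-app may compute its counts differently.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count values of `attribute` among samples passing all *other* filters.

    `selected` maps an attribute name to the set of allowed values.
    """
    others = {k: v for k, v in selected.items() if k != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample.get(k) in allowed for k, allowed in others.items()):
            counts[sample.get(attribute)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative records only.
    samples = [
        {"vital_status": "Alive", "gender": "FEMALE"},
        {"vital_status": "Dead", "gender": "FEMALE"},
        {"vital_status": "Alive", "gender": "MALE"},
        {"vital_status": None, "gender": "MALE"},
    ]
    selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
    print(dict(facet_counts(samples, selected, "vital_status")))
    print(dict(facet_counts(samples, selected, "gender")))
```

Excluding an attribute's own selection from its counts is what lets you see how many samples you would gain by ticking an additional value of the same filter.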

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
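The number shown when hovering on a swatch is simply a count of samples per pair of segments on two neighbouring vertical bars. A minimal sketch of that counting, with hypothetical platform names and a hypothetical helper ribbon_counts, might look like this:

```python
from collections import Counter


def ribbon_counts(samples, left, right):
    """Count samples for each (segment on left bar, segment on right bar) pair.

    Each sample maps a data type to the platform that produced it; a missing
    data type is reported as "NA", as in the panel.
    """
    return Counter((s.get(left, "NA"), s.get(right, "NA")) for s in samples)


if __name__ == "__main__":
    # Illustrative records only; platform names are placeholders.
    samples = [
        {"DNA Methylation": "PlatformM", "Protein": "PlatformP"},
        {"DNA Methylation": "PlatformM"},
        {"Protein": "PlatformP"},
    ]
    for pair, n in sorted(ribbon_counts(samples, "DNA Methylation", "Protein").items()):
        print(pair, n)
```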

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.
• Share with Others: This behaves similarly to sharing on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
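The difference between the sample count and the participant count can be illustrated with a few lines of Python. This sketch assumes TCGA-style barcodes, in which the first 12 characters of a sample barcode identify the participant; the barcodes below are made up and the helper cohort_counts is hypothetical.

```python
def cohort_counts(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort."""
    samples = set(sample_barcodes)
    # Assumption: the participant barcode is the first 12 characters,
    # e.g. "TCGA-AA-0001" for the sample "TCGA-AA-0001-01A".
    participants = {barcode[:12] for barcode in samples}
    return len(samples), len(participants)


if __name__ == "__main__":
    cohort = ["TCGA-AA-0001-01A", "TCGA-AA-0001-10A", "TCGA-AA-0002-01A"]
    n_samples, n_participants = cohort_counts(cohort)
    print(n_samples, "samples from", n_participants, "participants")
```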

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use the “Previous Page” and “Next Page” buttons to show more entries in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The CSV contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
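Once downloaded, the file list can be processed with any CSV reader. The sketch below groups Cloud Storage paths by platform using Python's csv module. The column headers, platform names, and gs:// paths are assumptions made for illustration; check the header row of your own download before relying on them.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file-list contents; real header names and paths may differ.
EXAMPLE_CSV = "\n".join([
    "Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location",
    "TCGA-AA-0001-01A,PlatformA,pipeline-x,Level 3,gs://example-bucket/a.txt",
    "TCGA-AA-0002-01A,PlatformA,pipeline-x,Level 3,gs://example-bucket/b.txt",
    "TCGA-AA-0002-01A,PlatformB,pipeline-y,Level 3,gs://example-bucket/c.txt",
])


def paths_by_platform(csv_text):
    """Group Cloud Storage paths by the Platform column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["Cloud Storage Location"])
    return dict(grouped)


if __name__ == "__main__":
    for platform, paths in paths_by_platform(EXAMPLE_CSV).items():
        print(platform, len(paths), "file(s)")
```

For a real download, open the file with open(path, newline="") and pass the file object to csv.DictReader directly instead of using the in-memory string.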

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top-right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top-right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top-right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  - Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Platform Filter: filters down the plot-able features by platform.
  - Center Filter: filters down the plot-able features by processing center.

  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  - miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  - Platform Filter: filters down the plot-able features by platform.
  - Value Filter: filters down the plot-able features by value.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  - Platform Filter: filters down the plot-able features by platform.
  - Gene Region Filter: filters down the plot-able features by specific gene regions.
  - CpG Island Region Filter: filters down the plot-able features by CpG island region.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Value Filter: filters down the plot-able features by value.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  - Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  - Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Value Filter: filters down the plot-able features by mutation value.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox will override any feature that is in the Color By Feature setting. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein-product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  - You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
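The Owner/Reader rules above can be summarized in a small model. This is an illustrative sketch of the rules as described in this section, not the web-app's actual data model; the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)

    def permission(self, user):
        """Return "owner", "reader", or None if the user has no access."""
        if user == self.owner:
            return "owner"
        if user in self.readers:
            return "reader"
        return None

    def can_edit(self, user):
        return self.permission(user) == "owner"

    def can_comment(self, user):
        # Both owners and readers may comment.
        return self.permission(user) is not None

    def share(self, by, with_user):
        """Only the owner can share; the recipient becomes a reader."""
        if not self.can_edit(by):
            raise PermissionError("only the owner can share a cohort")
        self.readers.add(with_user)

    def clone(self, user):
        """Anyone with access can clone; the clone is private to its new owner."""
        if self.permission(user) is None:
            raise PermissionError("no access to this cohort")
        return Cohort(name=f"Copy of {self.name}", owner=user, samples=self.samples)


if __name__ == "__main__":
    original = Cohort("my cohort", "alice", frozenset({"S1", "S2"}))
    original.share(by="alice", with_user="bob")
    copy = original.clone("bob")
    print(original.permission("bob"), copy.permission("bob"))
```

Note that the clone starts with an empty reader list: sharing is not inherited, so a copied cohort is private until its new owner shares it.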

1.4.8 Release Notes

• December 23, 2015: v0.2
  - Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  - Cohort file list download is not working.
• December 3, 2015: v0.1
  - First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application, so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
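The 24-hour access window mentioned in the last answer can be tracked with a few lines of Python. This is only an illustrative sketch: the function names and the idea of recording the authentication time yourself are assumptions made for the example, not features of the ISB-CGC platform.

```python
from datetime import datetime, timedelta, timezone

# Access window stated in the FAQ answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(authenticated_at):
    """Return the time at which controlled-data access would lapse."""
    return authenticated_at + ACCESS_WINDOW


def has_access(authenticated_at, now):
    """True while `now` still falls inside the 24-hour window."""
    return authenticated_at <= now < access_expires_at(authenticated_at)


# Hypothetical authentication time, chosen only for illustration.
signed_in = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
print(access_expires_at(signed_in).isoformat())
```

A long-running script could call `has_access` before each controlled-access request and prompt for re-authentication through the web-app once it returns False.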

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature-requests or bug-reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Table 1.1 – continued from previous page

Parameter name | Value | Description
has_UNC_HiSeq_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaHiSeq platform and the UNC pipeline: "True", "False", or "None".
has_BCGSC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the BCGSC pipeline: "True", "False", or "None".
has_UNC_GA_RNASeq | string | Indicates if a sample has RNA sequencing data from the IlluminaGA platform and the UNC pipeline: "True", "False", or "None".
has_HiSeq_miRnaSeq | string | Indicates if a sample has microRNA data from the IlluminaHiSeq platform: "True", "False", or "None".
has_GA_miRNASeq | string | Indicates if a sample has microRNA data from the IlluminaGA platform: "True", "False", or "None".
has_RPPA | string | Indicates if a sample has protein array data: "True", "False", or "None".
has_SNP6 | string | Indicates if a sample has copy number data: "True", "False", or "None".
has_27k | string | Indicates if a sample has methylation data from the Illumina 27k platform: "True", "False", or "None".
has_450k | string | Indicates if a sample has methylation data from the Illumina 450k platform: "True", "False", or "None".

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "patient_count": string,
      "patients": [string],
      "sample_count": string,
      "samples": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
patient_count | string | Number of participants in this cohort.
patients[] | list | List of participant barcodes in this cohort.
sample_count | string | Number of samples in this cohort.
samples[] | list | List of sample barcodes in this cohort.
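Because the counts in this response are typed as strings, client code usually needs to convert them before doing arithmetic. The sketch below shows one way to do that in Python. The function name `summarize_cohort` and the sample response are hypothetical, and the check that each count equals the length of its list is an assumption based on the property descriptions above.

```python
def summarize_cohort(response):
    """Convert the string counts to integers and compare them with the lists."""
    patients = response.get("patients", [])
    samples = response.get("samples", [])
    patient_count = int(response.get("patient_count", "0"))
    sample_count = int(response.get("sample_count", "0"))
    return {
        "patient_count": patient_count,
        "sample_count": sample_count,
        # Assumption: each count reports the length of the matching list.
        "counts_match": (patient_count == len(patients)
                         and sample_count == len(samples)),
    }


# Hypothetical response body, built from the example barcodes in this document.
example_response = {
    "kind": "cohort_api#cohortsItem",
    "patient_count": "1",
    "patients": ["TCGA-B9-7268"],
    "sample_count": "1",
    "samples": ["TCGA-B9-7268-01A"],
}
print(summarize_cohort(example_response))
```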

patient_details

Returns information about a specific participant, including a list of samples and aliquots derived from this patient. Takes a participant barcode (of length 12, e.g. TCGA-B9-7268) as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/patient_details

Parameters

Parameter name | Value | Description
Path parameters
patient_barcode | string | Barcode of the patient to get information about. Required.
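As a rough illustration of the request described above, the following Python sketch builds the request URL using only the standard library. It assumes that `patient_barcode` is sent as a query-string parameter, which the table above does not state explicitly. The helper name `patient_details_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint root taken from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def patient_details_url(patient_barcode):
    """Build the GET URL for patient_details (query-string form assumed)."""
    if len(patient_barcode) != 12:
        raise ValueError("participant barcodes are 12 characters long")
    query = urlencode({"patient_barcode": patient_barcode})
    return BASE_URL + "/patient_details?" + query


print(patient_details_url("TCGA-B9-7268"))
```

The resulting URL can be passed to any HTTP client. Since the method requires no authentication, no token is attached.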

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "clinical_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "Study": string,
        "age_at_initial_pathologic_diagnosis": string,
        "anatomic_neoplasm_subdivision": string,
        "batch_number": string,
        "bcr": string,
        "clinical_M": string,
        "clinical_N": string,
        "clinical_T": string,
        "clinical_stage": string,
        "colorectal_cancer": string,
        "country": string,
        "days_to_birth": string,
        "days_to_initial_pathologic_diagnosis": string,
        "days_to_last_followup": string,
        "ethnicity": string,
        "frozen_specimen_anatomic_site": string,
        "gender": string,
        "histological_type": string,
        "history_of_colon_polyps": string,
        "history_of_neoadjuvant_treatment": string,
        "history_of_prior_malignancy": string,
        "hpv_calls": string,
        "hpv_status": string,
        "icd_10": string,
        "icd_o_3_histology": string,
        "icd_o_3_site": string,
        "lymphatic_invasion": string,
        "lymphnodes_examined": string,
        "lymphovascular_invasion_present": string,
        "menopause_status": string,
        "mononucleotide_and_dinucleotide_marker_panel_analysis_status": string,
        "mononucleotide_marker_panel_analysis_status": string,
        "neoplasm_histologic_grade": string,
        "new_tumor_event_after_initial_treatment": string,
        "number_of_lymphnodes_examined": string,
        "number_of_lymphnodes_positive_by_he": string,
        "pathologic_M": string,
        "pathologic_N": string,
        "pathologic_T": string,
        "pathologic_stage": string,
        "person_neoplasm_cancer_status": string,
        "pregnancies": string,
        "primary_neoplasm_melanoma_dx": string,
        "primary_therapy_outcome_success": string,
        "prior_dx": string,
        "race": string,
        "residual_tumor": string,
        "tobacco_smoking_history": string,
        "tumor_tissue_site": string,
        "tumor_type": string,
        "vital_status": string,
        "weiss_venous_invasion": string,
        "year_of_initial_pathologic_diagnosis": string
      },
      "samples": []
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
clinical_data | nested object | The clinical data about the participant.
clinical_data.ParticipantBarcode | string | Participant barcode.
clinical_data.Project | string | Project name, e.g. "TCGA".
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years.
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor.
clinical_data.batch_number | string | Groups samples by the batch they were processed in.
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_T | string | Extent of the primary cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M), and by grouping cases with similar prognosis for cancer.
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer.
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured.
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer.
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.ethnicity | string | The text for reporting information about ethnicity, based on the Office of Management and Budget (OMB) categories.
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample.
clinical_data.gender | string | Text designations that identify gender.
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps, as noted in the history/physical or previous endoscopic report(s).
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor.
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.hpv_calls | string | Results of HPV tests.
clinical_data.hpv_status | string | Current HPV status.
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD).
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement.
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease.
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen.
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea.
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel.
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel.
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease.
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy.
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria.
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time.
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success | string | Measure of Success.
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race | string | The text for reporting information about race, based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status | string | The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
samples[] | list | List of barcodes of samples taken from this participant.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.
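The example barcodes in this documentation (TCGA-B9-7268 for a participant, TCGA-B9-7268-01A for a sample) suggest that the first 12 characters of a sample barcode are the participant barcode. The short Python sketch below relies on that prefix relationship. The function name `participant_from_sample` is hypothetical, and the authoritative patient barcode is the one returned by the endpoint itself.

```python
PARTICIPANT_BARCODE_LENGTH = 12  # e.g. TCGA-B9-7268
SAMPLE_BARCODE_LENGTH = 16       # e.g. TCGA-B9-7268-01A


def participant_from_sample(sample_barcode):
    """Return the 12-character participant prefix of a 16-character sample barcode."""
    if len(sample_barcode) != SAMPLE_BARCODE_LENGTH:
        raise ValueError("sample barcodes are 16 characters long")
    return sample_barcode[:PARTICIPANT_BARCODE_LENGTH]


print(participant_from_sample("TCGA-B9-7268-01A"))
```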

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of data_details list.
patient | string | Participant barcode.
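A sample_details response can hold many data_details entries, so a small summary is often the first thing a script computes. The Python sketch below counts files per Datatype and collects the names of files that have a CloudStoragePath. The function name `summarize_data_details` and the example response (including its file names and bucket path) are invented for illustration. Only the field names come from the table above.

```python
from collections import Counter


def summarize_data_details(sample_details):
    """Count data files per Datatype and list file names that have a storage path."""
    details = sample_details.get("data_details", [])
    per_datatype = Counter(entry.get("Datatype", "unknown") for entry in details)
    with_path = [entry.get("DataFileName")
                 for entry in details if entry.get("CloudStoragePath")]
    return per_datatype, with_path


# Hypothetical, heavily trimmed response; all values are placeholders.
example_details = {
    "kind": "cohort_api#cohortsItem",
    "patient": "TCGA-B9-7268",
    "data_details_count": "3",
    "data_details": [
        {"DataFileName": "file_a.txt", "Datatype": "RNASeqV2",
         "CloudStoragePath": "gs://example-bucket/file_a.txt"},
        {"DataFileName": "file_b.txt", "Datatype": "RNASeqV2",
         "CloudStoragePath": ""},
        {"DataFileName": "file_c.txt", "Datatype": "miRNASeq",
         "CloudStoragePath": "gs://example-bucket/file_c.txt"},
    ],
}

counts, available = summarize_data_details(example_details)
print(dict(counts), available)
```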

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. User must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.
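The optional parameters above can be added to a request only when they are needed. The Python sketch below builds such a URL with the standard library. It assumes that all four parameters are passed in the query string, and the helper name `datafilenamekey_url` is hypothetical.

```python
from urllib.parse import urlencode

# Endpoint root taken from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the GET URL, leaving out any optional parameter that is None."""
    params = {
        "sample_barcode": sample_barcode,
        "platform": platform,
        "pipeline": pipeline,
        "token": token,
    }
    # Keep only the parameters the caller actually supplied.
    query = urlencode({k: v for k, v in params.items() if v is not None})
    return BASE_URL + "/datafilenamekey_list_from_sample?" + query


print(datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Without a token, only open-access paths would be expected in the response, as described above.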

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
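Two details of this response need explicit handling in client code: the datafilenamekeys field may be missing entirely, and some entries may carry the "file-path-not-yet-available" marker. The Python sketch below covers both cases. The function name `usable_paths` and the example bucket names are hypothetical, and the exact form of a placeholder entry is an assumption based on the description above.

```python
PLACEHOLDER = "file-path-not-yet-available"


def usable_paths(response):
    """Return only the file paths that are present and not placeholders."""
    # The field is absent when no file paths are listed for the sample.
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical responses; bucket and file names are placeholders.
with_files = {
    "kind": "cohort_api#cohortsItem",
    "count": "2",
    "datafilenamekeys": [
        "gs://example-open-bucket/path/to/file.txt",
        "gs://example-open-bucket/file-path-not-yet-available",
    ],
}
without_files = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(usable_paths(with_files))
print(usable_paths(without_files))
```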

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value     Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name                 Value                     Description
kind                          cohort_api#cohortsItem    The resource type.
count                         string                    The number of items returned. Count will be either “0” or “1”.
items[]                       list                      If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode         string                    The sample barcode passed into the request.
items[].GG_dataset_id         string                    The dataset id of the sample.
items[].GG_readgroupset_id    string                    The readgroupset id of the sample.
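The response described above might be handled along the following lines. This is only a sketch: the helper name genomics_ids is hypothetical, the example response bodies are invented, and the code checks the items list rather than the count field so that it does not depend on where count appears in the body.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    # items is expected to hold at most one object, per the table above.
    items = response.get("items") or []
    if not items:
        return None
    item = items[0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]


# Invented response bodies, shaped like the structure documented above.
found = {
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-02-0001-01C",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
missing = {"kind": "cohort_api#cohortsItem", "count": "0"}

print(genomics_ids(found))
print(genomics_ids(missing))
```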


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console: If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API: The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK: Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components”, and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell: Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google: Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account
• project
• compute/region
• compute/zone
• container/cluster
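If you find yourself setting the same properties on several machines, the gcloud config set calls described above can be generated from a small table of values. The following is a minimal Python sketch under stated assumptions: the helper name config_set_commands is hypothetical, the project id is a placeholder, and the code only prints the command lines rather than running them.

```python
import shlex


def config_set_commands(properties):
    """Return one `gcloud config set` command line per property."""
    return [
        shlex.join(["gcloud", "config", "set", name, value])
        for name, value in properties.items()
    ]


commands = config_set_commands({
    "project": "my-project",          # placeholder project id
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
})
for command in commands:
    print(command)
```

Printing the commands first makes it easy to review them before pasting them into a shell; running gcloud config list afterwards shows the resulting configuration.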


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console: After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options, below the “Management, disk, ...” line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI: The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property, for example: gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM: gcloud compute instances create my-instance --machine-type g1-small

Accessing your new VM: Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM: Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
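The two disk prices quoted above both work out to the same per-gigabyte rate ($2 / 50 GB = $40 / 1000 GB = $0.04 per GB per month), so the cost of any standard persistent disk can be estimated with a one-line calculation. The sketch below assumes that rate, which is derived from the figures in this section and may not match current pricing, and treats 1 TB as 1000 GB; the function name is hypothetical.

```python
# Rate implied by the figures above; an assumption, not a quoted price list.
PD_STANDARD_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=PD_STANDARD_USD_PER_GB_MONTH):
    """Estimated monthly cost in USD of a standard persistent disk."""
    return round(size_gb * rate, 2)


for size_gb in (50, 500, 1000):
    print(size_gb, "GB:", monthly_disk_cost(size_gb), "USD per month")
```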

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
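Putting the commands from this section together, a typical VM session runs create, ssh, stop, and (optionally) delete in that order. The following Python sketch simply assembles those command lines so the sequence can be reviewed or pasted into a shell; the helper name vm_lifecycle_commands is hypothetical, and the instance name, machine type, and zone are the example values used in this section.

```python
import shlex


def vm_lifecycle_commands(name, machine_type="g1-small", zone=None):
    """Return the create/ssh/stop/delete command lines for one VM, in order."""
    # --zone is optional; without it gcloud falls back to the compute/zone property.
    zone_args = ["--zone", zone] if zone else []
    steps = [
        ["gcloud", "compute", "instances", "create", name,
         "--machine-type", machine_type],
        ["gcloud", "compute", "ssh", name],
        ["gcloud", "compute", "instances", "stop", name],
        ["gcloud", "compute", "instances", "delete", name],
    ]
    return [shlex.join(step + zone_args) for step in steps]


for command in vm_lifecycle_commands("my-instance", zone="us-central1-a"):
    print(command)
```

Stopping comes before deleting because a stopped VM can still be restarted, whereas a deleted one cannot.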


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance takes the instance name as its first argument and looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point. (The exact device path can vary; on GCE instances the entries under /dev/disk/by-id/ typically carry a google- prefix, so list that directory to confirm the name before formatting.)

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first, you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
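The ordering rules in this section (attach before mounting, unmount before detaching, detach before deleting) can be summarized as a small state model. The Python class below is a toy illustration only: it is not a GCE client, the class and method names are hypothetical, and it simplifies the real service by modelling a disk attached to at most one instance.

```python
class PersistentDisk:
    """Toy model of the disk lifecycle described above (not a GCE client)."""

    def __init__(self, name):
        self.name = name
        self.attached_to = None   # instance name, or None when detached
        self.mounted = False

    def attach(self, instance):
        self.attached_to = instance

    def mount(self):
        if self.attached_to is None:
            raise RuntimeError("attach the disk before mounting it")
        self.mounted = True

    def unmount(self):
        self.mounted = False

    def detach(self):
        if self.mounted:
            raise RuntimeError("unmount the disk before detaching it")
        self.attached_to = None

    def can_delete(self):
        # A disk can be deleted only if no VM instance is using it.
        return self.attached_to is None


disk = PersistentDisk("disk-1")
disk.attach("my-instance")
disk.mount()
print("deletable while attached:", disk.can_delete())
disk.unmount()
disk.detach()
print("deletable after detach:", disk.can_delete())
```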


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
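The three set operations described above correspond directly to ordinary set arithmetic on the samples in each cohort. The Python sketch below illustrates this with cohorts modelled as sets of sample barcodes; the function names and the barcodes are hypothetical and are not taken from the web application.

```python
def union(*cohorts):
    """Samples that appear in any of the cohorts (order does not matter)."""
    return set().union(*cohorts)


def intersect(first, *rest):
    """Samples that appear in every cohort (order does not matter)."""
    return set(first).intersection(*rest)


def complement(base, *others):
    """Samples in the base cohort after subtracting all other cohorts."""
    return set(base).difference(*others)


# Made-up cohorts, each a set of sample barcodes.
cohort_a = {"TCGA-02-0001-01C", "TCGA-02-0003-01A", "TCGA-02-0006-01B"}
cohort_b = {"TCGA-02-0003-01A", "TCGA-02-0006-01B"}
cohort_c = {"TCGA-02-0006-01B", "TCGA-02-0007-01A"}

print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b, cohort_c)))
```

Union and intersect give the same result whatever the order of their arguments, whereas complement depends on which cohort is chosen as the base.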

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.


• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
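To see why the sample and participant counts can differ, recall that a TCGA sample barcode begins with the participant barcode (the first three dash-separated fields). The Python sketch below counts both for a small list of barcodes; the barcodes are made up and the helper names are hypothetical.

```python
def participant_barcode(sample_barcode):
    """Return the first three dash-separated fields, e.g. TCGA-02-0001."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(s) for s in samples}
    return len(samples), len(participants)


# Made-up cohort: one participant contributed two samples.
cohort = ["TCGA-02-0001-01C", "TCGA-02-0001-10A", "TCGA-02-0003-01A"]
print(cohort_counts(cohort))
```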

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data from both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
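Once downloaded, the file list can be processed with any CSV reader. The following Python sketch pulls out the cloud storage paths for one platform. The exact column headers in the downloaded file are not given in this documentation, so the header names, the platform and pipeline values, and the gs:// paths below are all assumptions made up for illustration.

```python
import csv
import io

# Hypothetical file-list contents; the header names are assumed, not confirmed.
SAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-02-0001-01C,platform_a,pipeline_x,Level 3,gs://example-bucket/a.txt
TCGA-02-0001-01C,platform_b,pipeline_y,Level 3,gs://example-bucket/b.txt
TCGA-02-0003-01A,platform_a,pipeline_x,Level 3,gs://example-bucket/c.txt
"""


def paths_for_platform(csv_text, platform):
    """Return the file paths of every row whose Platform column matches."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["File Path"] for row in reader if row["Platform"] == platform]


print(paths_for_platform(SAMPLE_CSV, "platform_a"))
```

For a real download, the same function could be applied to the text read from the saved CSV file, after adjusting the column names to match its header row.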

Commenting: Any user who owns a cohort or has had a cohort shared with them can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the "Visualization" option from the "+ Create" menu. You will be prompted to name the visualization and to provide at least one cohort to start with; the last cohort you created is chosen automatically. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. This saves the visualization and all of its plots in their current configuration. It does not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

    – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Platform Filter: filters down the plot-able features by platform.

    – Center Filter: filters down the plot-able features by processing center.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

    – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

    – Platform Filter: filters down the plot-able features by platform.

    – Value Filter: filters down the plot-able features by value.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

    – Platform Filter: filters down the plot-able features by platform.

    – Gene Region Filter: filters down the plot-able features by specific gene regions.

    – CpG Island Region Filter: filters down the plot-able features by CpG island region.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Value Filter: filters down the plot-able features by value.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

    – Value Filter: filters down the plot-able features by mutation value.

    – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. You will be prompted to name your SeqPeek visualization and save it. You will be notified once it has been saved.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button becomes active and you can proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

    – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
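The two permission levels can be summarized as a small lookup table. The sketch below is purely illustrative and is not the web-app's implementation; the action names ("view", "comment", "copy", "edit", "share", "save", "delete") and the function names are labels invented here to mirror the rules described in this section.

```python
from enum import Enum


class Role(Enum):
    OWNER = "owner"
    READER = "reader"


# Hypothetical action names summarizing the rules described above.
ALLOWED_ACTIONS = {
    Role.OWNER: {"view", "comment", "copy", "edit", "share", "save", "delete"},
    Role.READER: {"view", "comment", "copy"},
}


def can(role, action):
    """Return True if the given role may perform the given action."""
    return action in ALLOWED_ACTIONS[role]


def role_after_copy(role):
    """Anyone who copies ("clones") an item becomes the owner of the copy."""
    return Role.OWNER


if __name__ == "__main__":
    print(can(Role.READER, "save"))                      # Reader cannot save
    print(can(role_after_copy(Role.READER), "save"))     # owner of the copy can
```

A Reader who needs to keep changes therefore follows the path "copy, then edit the copy", which is what the last call in the example expresses.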

1.4.8 Release Notes

• December 23, 2015: v0.2

    – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

    – Cohort file list download not working.

• December 3, 2015: v0.1

    – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No. Generally, only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons ID). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes. Your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons ID. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons ID, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons ID? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons ID from that one before you can link it with your new Google identity. An eRA Commons ID cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No. The current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
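Because controlled-data access lapses 24 hours after verification, a long-running script may want to check how much of that window remains before starting work. The following is a minimal sketch of that arithmetic only; it does not call any ISB-CGC or NIH service, the function names are hypothetical, and the treatment of the exact 24-hour boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the FAQ answer above


def access_expires_at(verified_at):
    """Time at which access granted at `verified_at` is assumed to lapse."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now=None):
    """True while `now` falls inside the assumed 24-hour window."""
    if now is None:
        now = datetime.now(timezone.utc)
    return verified_at <= now < access_expires_at(verified_at)


if __name__ == "__main__":
    verified = datetime(2016, 3, 3, 12, 0, tzinfo=timezone.utc)
    print(access_expires_at(verified).isoformat())
    print(has_controlled_access(verified, verified + timedelta(hours=23)))
```

If the window has lapsed, re-authenticating through the web-app is the only option, since the NIH flow cannot be completed programmatically.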

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at Bioconductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
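You may find it useful to track your own spending against the allocation rather than rely on alerts alone. The sketch below is a hypothetical helper, not an ISB-CGC tool: the $300 figure comes from the text above, while the 80% warning threshold, the function name and the example daily costs are assumptions chosen only for illustration.

```python
INITIAL_ALLOCATION_USD = 300.0  # initial allocation stated above
WARNING_FRACTION = 0.8          # assumed threshold, not an ISB-CGC policy


def budget_status(daily_costs_usd, allocation_usd=INITIAL_ALLOCATION_USD):
    """Summarize spending so far against an allocation.

    `daily_costs_usd` is a list of per-day costs taken from your own
    billing records.
    """
    spent = sum(daily_costs_usd)
    if spent > allocation_usd:
        state = "exceeded"
    elif spent >= WARNING_FRACTION * allocation_usd:
        state = "approaching limit"
    else:
        state = "ok"
    return {"spent": spent, "remaining": allocation_usd - spent, "state": state}


if __name__ == "__main__":
    # Made-up daily costs for illustration.
    print(budget_status([12.50, 40.00, 7.25]))
```

Running a check like this after prototype analyses also gives a concrete basis for the cost estimate requested when applying for additional funding.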

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

kind: cohort_api#cohortsItem
aliquots: [string]
clinical_data:
    ParticipantBarcode: string
    Project: string
    Study: string
    age_at_initial_pathologic_diagnosis: string
    anatomic_neoplasm_subdivision: string
    batch_number: string
    bcr: string
    clinical_M: string
    clinical_N: string
    clinical_T: string
    clinical_stage: string
    colorectal_cancer: string
    country: string
    days_to_birth: string
    days_to_initial_pathologic_diagnosis: string
    days_to_last_followup: string
    ethnicity: string
    frozen_specimen_anatomic_site: string
    gender: string
    histological_type: string
    history_of_colon_polyps: string
    history_of_neoadjuvant_treatment: string
    history_of_prior_malignancy: string
    hpv_calls: string
    hpv_status: string
    icd_10: string
    icd_o_3_histology: string
    icd_o_3_site: string
    lymphatic_invasion: string
    lymphnodes_examined: string
    lymphovascular_invasion_present: string
    menopause_status: string
    mononucleotide_and_dinucleotide_marker_panel_analysis_status: string
    mononucleotide_marker_panel_analysis_status: string
    neoplasm_histologic_grade: string
    new_tumor_event_after_initial_treatment: string
    number_of_lymphnodes_examined: string
    number_of_lymphnodes_positive_by_he: string
    pathologic_M: string
    pathologic_N: string
    pathologic_T: string
    pathologic_stage: string
    person_neoplasm_cancer_status: string
    pregnancies: string
    primary_neoplasm_melanoma_dx: string
    primary_therapy_outcome_success: string
    prior_dx: string
    race: string
    residual_tumor: string
    tobacco_smoking_history: string
    tumor_tissue_site: string
    tumor_type: string
    vital_status: string
    weiss_venous_invasion: string
    year_of_initial_pathologic_diagnosis: string
samples: []

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
clinical_data | nested object | The clinical data about the participant.
clinical_data.ParticipantBarcode | string | Participant barcode.
clinical_data.Project | string | Project name, e.g. "TCGA".
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years.
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor.
clinical_data.batch_number | string | Groups samples by the batch they were processed in.
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_T | string | Extent of the primary cancer, based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M), and by grouping cases with similar prognosis for cancer.
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer.
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured.
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer.
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.ethnicity | string | The text for reporting information about ethnicity, based on the Office of Management and Budget (OMB) categories.
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample.
clinical_data.gender | string | Text designations that identify gender.
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps, as noted in the history/physical or previous endoscopic report(s).
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor.
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.hpv_calls | string | Results of HPV tests.
clinical_data.hpv_status | string | Current HPV status.
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD).
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement.
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease.
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen.
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea.
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel.
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel.
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease.
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy.
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg...
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance...
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria.
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T), using staging criteria from the American...
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time.
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success | string | Measure of Success.
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race | string | The text for reporting information about race, based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status | string | The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
samples[] | list | List of barcodes of samples taken from this participant.
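A response with the structure described above can be unpacked with the standard json module. The sketch below is illustrative only: the function name is hypothetical, the example response is made up (it is not real TCGA data), and the age conversion assumes the string holds a plain integer, which is not stated in the table.

```python
import json


def summarize_participant(response_text):
    """Pull a few fields out of a response shaped like the table above."""
    item = json.loads(response_text)
    clinical = item.get("clinical_data", {})
    age_text = clinical.get("age_at_initial_pathologic_diagnosis")
    return {
        "participant": clinical.get("ParticipantBarcode"),
        "study": clinical.get("Study"),
        # All clinical values arrive as strings; convert only if numeric.
        "age": int(age_text) if age_text and age_text.isdigit() else None,
        "n_samples": len(item.get("samples", [])),
        "n_aliquots": len(item.get("aliquots", [])),
    }


# Made-up response for illustration; barcodes and values are placeholders.
EXAMPLE_RESPONSE = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "aliquots": ["ALIQUOT-1", "ALIQUOT-2", "ALIQUOT-3"],
    "clinical_data": {
        "ParticipantBarcode": "TCGA-XX-0000",
        "Project": "TCGA",
        "Study": "BRCA",
        "age_at_initial_pathologic_diagnosis": "61",
    },
    "samples": ["TCGA-XX-0000-01A", "TCGA-XX-0000-10A"],
})

if __name__ == "__main__":
    print(summarize_participant(EXAMPLE_RESPONSE))
```

Using `.get()` throughout keeps the sketch tolerant of fields that are absent for a given participant.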

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about the sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with the sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. The user does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.

Response

If successful this method returns a response body with the following structure

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of data_details list.
patient | string | Participant barcode.
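As a rough illustration of how this endpoint might be called from a script, the Python sketch below builds the request URL and, optionally, fetches and decodes the response using only the standard library. It assumes that sample_barcode is sent as a URL query parameter and that the response body is JSON. The helper names (build_sample_details_url, fetch_sample_details, summarize) are hypothetical and are not part of the ISB-CGC API.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the HTTP request shown in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_sample_details_url(sample_barcode):
    """Return the sample_details request URL for a 16-character sample barcode."""
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode")
    # Assumption: the parameter is passed in the query string.
    query = urlencode({"sample_barcode": sample_barcode})
    return BASE_URL + "/sample_details?" + query


def fetch_sample_details(sample_barcode):
    """Call the endpoint and decode the JSON body (requires network access)."""
    with urlopen(build_sample_details_url(sample_barcode)) as response:
        return json.loads(response.read().decode("utf-8"))


def summarize(details):
    """Pull a few fields out of a decoded sample_details response."""
    return {
        "patient": details.get("patient"),
        "n_aliquots": len(details.get("aliquots", [])),
        # data_details_count is documented as a string, so convert it.
        "n_files": int(details.get("data_details_count", "0")),
    }
```

For example, summarize(fetch_sample_details("TCGA-B9-7268-01A")) would report the patient barcode together with the number of aliquots and data files, provided the service is reachable.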

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths. To see paths to controlled-access files, the user must be authenticated and have dbGaP authorization. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.

Response

If successful this method returns a response body with the following structure

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
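The optional parameters and the "field may be absent" behaviour described above can be sketched in Python as follows. This is only an illustration: it assumes all parameters are passed in the query string, and the function names and the example bucket path used in the usage note are hypothetical.

```python
from urllib.parse import urlencode

# Base URL taken from the HTTP request shown in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"

# Marker the documentation says is used when a path is not yet known.
PLACEHOLDER = "file-path-not-yet-available"


def build_file_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the datafilenamekey_list_from_sample URL, skipping unset options."""
    params = {"sample_barcode": sample_barcode}
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def usable_paths(response):
    """Return real file paths from a decoded response.

    The datafilenamekeys field is omitted when no paths are visible to the
    user, so a missing field is treated as an empty list. Entries that only
    carry the not-yet-available placeholder are dropped.
    """
    paths = response.get("datafilenamekeys", [])
    return [p for p in paths if PLACEHOLDER not in p]
```

Treating a missing datafilenamekeys field as an empty list avoids a KeyError for unauthenticated users who request a sample that has only controlled-access files.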

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful this method returns a response body with the following structure

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "items": [
        {
          "SampleBarcode": string,
          "GG_dataset_id": string,
          "GG_readgroupset_id": string
        }
      ]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
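Because the response holds either zero or one item, client code has to handle the empty case explicitly. The following Python sketch shows one way to do that on an already decoded response; the function name genomics_ids is hypothetical, and the ids in the usage note are made-up placeholders.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) for the sample, or None if absent.

    Relies on the items list rather than on count, which is documented as
    the string "0" or "1" and may be missing from a sparse response.
    """
    items = response.get("items") or []
    if not items:
        return None
    item = items[0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]
```

For instance, genomics_ids({"count": "1", "items": [{"SampleBarcode": "TCGA-B9-7268-01A", "GG_dataset_id": "dataset-123", "GG_readgroupset_id": "rgs-456"}]}) yields the pair of ids, while a response with count "0" yields None.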

1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console: If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API: The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it, you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful

Google Cloud SDK: Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15

Google Cloud Shell: Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google: Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud.

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow.

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
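If you prefer to script your configuration rather than type each command, something along the following lines may help. It is a minimal Python sketch that only assembles gcloud config set commands and, by default, does not run them; the function names are hypothetical, and the project id in the usage note is a placeholder.

```python
import subprocess


def config_set_command(prop, value):
    """Return the argument list for `gcloud config set PROP VALUE`."""
    return ["gcloud", "config", "set", prop, value]


def apply_settings(settings, dry_run=True):
    """Build one `gcloud config set` command per setting.

    With dry_run=True (the default) the commands are only returned.
    With dry_run=False they are executed, which requires the Cloud SDK.
    """
    commands = [config_set_command(prop, value) for prop, value in settings.items()]
    if not dry_run:
        for command in commands:
            subprocess.run(command, check=True)
    return commands
```

Calling apply_settings({"project": "my-project", "compute/zone": "us-central1-a"}) returns the two commands so that you can inspect them before running them with dry_run=False.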

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console: After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you.

• Zone: choose one of the us-east or us-central zones.

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach). Note that as you change the specifications of the VM, the estimated cost shown on this page will update.

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The "Change" button will open a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks. You can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB).

Other options, below the "Management, disk, ..." line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI: The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

    gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

    gcloud compute zones list

Here is a very simple command to create a VM:

    gcloud compute instances create my-instance --machine-type g1-small

Accessing your new VM: Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI: simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM: Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
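The two prices quoted above imply a flat rate of $0.04 per GB per month for standard persistent disks ($2 / 50 GB, and $40 / 1000 GB if 1 TB is taken as 1000 GB). The short Python sketch below uses that implied rate to estimate the monthly cost of other disk sizes. The rate is inferred from the figures in this guide and reflects pricing at the time of writing, so treat it as an assumption and check current Google pricing before budgeting.

```python
# Rate implied by the figures in this guide: $2 for 50 GB per month.
STANDARD_PD_USD_PER_GB_MONTH = 2.0 / 50


def monthly_disk_cost(size_gb, rate=STANDARD_PD_USD_PER_GB_MONTH):
    """Estimate the monthly cost in USD of a standard persistent disk."""
    if size_gb < 0:
        raise ValueError("disk size must not be negative")
    return size_gb * rate


if __name__ == "__main__":
    for size in (50, 500, 1000):
        print("%5d GB: $%.2f per month" % (size, monthly_disk_cost(size)))
```

Under this assumption, the 500 GB disk created later in this guide would cost about $20 per month for as long as it exists, whether or not it is attached to a running VM.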

If you know that you will never need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
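To see the whole life cycle of a VM in one place, the Python sketch below assembles the gcloud commands mentioned in this section in the order you would normally run them. It only builds the commands and does not execute anything. The function name is hypothetical, and the instance name, machine type, and zone are the example values used in this guide.

```python
def vm_lifecycle_commands(name, machine_type="g1-small", zone="us-central1-a"):
    """Return the gcloud commands to create, use, stop, and delete a VM."""
    return [
        ["gcloud", "compute", "instances", "create", name,
         "--machine-type", machine_type, "--zone", zone],
        ["gcloud", "compute", "ssh", name, "--zone", zone],
        ["gcloud", "compute", "instances", "stop", name, "--zone", zone],
        ["gcloud", "compute", "instances", "delete", name, "--zone", zone],
    ]


if __name__ == "__main__":
    # Print the commands so they can be reviewed or pasted into a shell.
    for command in vm_lifecycle_commands("my-instance"):
        print(" ".join(command))
```

Passing --zone explicitly on every command is a design choice that keeps the script independent of whatever compute/zone default happens to be configured on the machine where it runs.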

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

    gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

    gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The --device-name value determines the name under which the disk appears inside the VM, which is /dev/disk/by-id/google-disk-1 in this example. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you've attached to an instance, you need to first log on to that instance:

    gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point:

    sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
    sudo mkdir /mnt/pd1
    sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

    $ sudo umount /dev/disk/by-id/google-disk-1
    $ exit

    gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
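One detail that is easy to lose track of in the persistent disk workflow is where each command runs: some are issued from your workstation (or Cloud Shell) with gcloud, others inside the VM after you ssh to it. The Python sketch below lays out the full sequence with that distinction made explicit. It assumes the disk is attached with --device-name set to the disk name, so that it appears in the VM as /dev/disk/by-id/google-<name>; the function name and the mount point are illustrative.

```python
def persistent_disk_steps(instance="my-instance", disk="disk-1", size="500GB"):
    """Return (where, command) pairs in the order the steps must be run."""
    # Assumes --device-name equals the disk name when attaching.
    device = f"/dev/disk/by-id/google-{disk}"
    mount_point = "/mnt/pd1"
    return [
        ("workstation", f"gcloud compute disks create {disk} --size {size}"),
        ("workstation", f"gcloud compute instances attach-disk {instance} "
                        f"--disk {disk} --device-name {disk}"),
        ("instance", f"sudo mkfs.ext4 -F {device}"),  # erases existing data
        ("instance", f"sudo mkdir -p {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("instance", f"sudo umount {mount_point}"),
        ("workstation", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("workstation", f"gcloud compute disks delete {disk}"),
    ]


if __name__ == "__main__":
    for where, command in persistent_disk_steps():
        print(f"[{where}] {command}")
```

The ordering matters: the disk must be unmounted before it is detached, and detached before it can be deleted.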

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables, and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after you select at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button should become selectable after you select at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the "Set Operations" button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can:

• Enter a name for the resulting cohort

• Select a set operation

• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
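The three set operations behave like ordinary set algebra on the samples in each cohort. The Python sketch below illustrates the semantics described above using built-in sets. It is not the web application's implementation, and the function names and short sample ids are made up for the example.

```python
from functools import reduce


def union(*cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort; order does not matter."""
    return reduce(lambda a, b: a & b, (set(c) for c in cohorts))


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base) - union(*others)


if __name__ == "__main__":
    a = {"S1", "S2", "S3"}
    b = {"S2", "S3", "S4"}
    c = {"S3", "S5"}
    print(sorted(union(a, b, c)))
    print(sorted(intersect(a, b, c)))
    print(sorted(complement(a, b, c)))
```

Unlike union and intersect, complement is not symmetric: complement(a, b) and complement(b, a) generally give different cohorts, which is why a base cohort has to be chosen.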

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, use the search bar at the top of the page. This produces a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
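The name-only search described above can be pictured with a short sketch. This is not the web-app's implementation: the function name is hypothetical, and treating the match as a case-insensitive substring test is an assumption, since the text only says that the search does string matching on names.

```python
def search_by_name(query, cohort_names, visualization_names):
    """Return the two result lists, matching on names only.

    Assumption: a case-insensitive substring match. The documentation
    only states that the search is a string match against names.
    """
    needle = query.lower()
    return {
        "cohorts": [n for n in cohort_names if needle in n.lower()],
        "visualizations": [n for n in visualization_names if needle in n.lower()],
    }


# Made-up names, for illustration only.
results = search_by_name(
    "brca",
    cohort_names=["BRCA alive", "LUAD smokers"],
    visualization_names=["BRCA expression plots", "Methylation overview"],
)
print(results)
```

Because only names are matched, a cohort whose filters mention a term but whose name does not will not appear in the results.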

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column, and the header indicates whether the sort is in ascending or descending order.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the list of filters on the left-hand side, you can select the attributes and features that you are interested in. Clicking on a feature expands the field and provides additional filtering options. For example, when you click on "Vital Status", it expands and provides "Alive", "Dead" and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will update to reflect that the current cohort has been filtered by Vital Status. The number beside each selectable filter value is the number of samples that have that attribute, given all other filters that have been selected.
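The counts shown beside filter values can be illustrated with a small sketch. It is only an illustration of the counting rule stated above ("based on all other filters"), not the web-app's code; the function name, field names and sample records are hypothetical.

```python
from collections import Counter

# Hypothetical sample records; field names and values are made up.
samples = [
    {"barcode": "S1", "vital_status": "Alive", "gender": "FEMALE"},
    {"barcode": "S2", "vital_status": "Dead", "gender": "FEMALE"},
    {"barcode": "S3", "vital_status": "Alive", "gender": "MALE"},
    {"barcode": "S4", "vital_status": "Dead", "gender": "MALE"},
    {"barcode": "S5", "vital_status": "Alive", "gender": "MALE"},
]


def facet_counts(samples, selected, attribute):
    """Count samples per value of `attribute`.

    Every selected filter is applied except the one on `attribute`
    itself, so the counts reflect "all other filters".
    """
    others = {k: v for k, v in selected.items() if k != attribute}

    def passes(sample):
        return all(sample[k] in allowed for k, allowed in others.items())

    return Counter(s[attribute] for s in samples if passes(s))


selected = {"gender": {"MALE"}}
vital_counts = facet_counts(samples, selected, "vital_status")
gender_counts = facet_counts(samples, selected, "gender")
print(vital_counts, gender_counts)
```

With the "MALE" filter selected, the Vital Status counts only consider male samples, while the Gender counts still cover every sample because the gender filter is not applied to its own counts.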

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so that it is easy to see which filters have been selected. Clicking "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. Using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of the data available for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering over a swatch between two vertical bars shows the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
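As a rough illustration of what a swatch in the parallel sets graph counts, the sketch below tallies samples by the pair of platforms they fall into for two adjacent data types. The data, the platform names and the function name are all hypothetical; only the "NA" convention comes from the description above.

```python
from collections import Counter

# Hypothetical availability table: sample -> platform used per data type.
# "NA" marks a data type that is not available for the sample.
availability = {
    "S1": {"DNA Methylation": "PlatformM1", "Protein": "PlatformP1"},
    "S2": {"DNA Methylation": "PlatformM1", "Protein": "NA"},
    "S3": {"DNA Methylation": "PlatformM2", "Protein": "PlatformP1"},
    "S4": {"DNA Methylation": "PlatformM1", "Protein": "PlatformP1"},
}


def swatch_counts(availability, left_type, right_type):
    """Number of samples flowing between each pair of segments."""
    return Counter(
        (platforms[left_type], platforms[right_type])
        for platforms in availability.values()
    )


counts = swatch_counts(availability, "DNA Methylation", "Protein")
print(counts)
```

Each key is one swatch between two vertical bars, and its value is the number hovering would be expected to show for that swatch.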

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the "Set Operations" button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you can enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.
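The three set operations behave like ordinary set algebra on sample identifiers. The sketch below uses Python's built-in sets to show the idea; the cohort contents are made up, and the helper names are hypothetical rather than part of the ISB-CGC web-app or API.

```python
# Hypothetical cohorts, each represented as a set of sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}


def union(first, *rest):
    """Samples that appear in at least one cohort; order does not matter."""
    return first.union(*rest)


def intersect(first, *rest):
    """Samples that appear in every cohort; order does not matter."""
    return first.intersection(*rest)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the others."""
    return base.difference(*others)


print(union(cohort_a, cohort_b, cohort_c))
print(intersect(cohort_a, cohort_b, cohort_c))
print(complement(cohort_a, cohort_b, cohort_c))
```

Union and intersection give the same result for any ordering of the inputs, which is why the dialogue box does not ask for an order. Complement is not symmetric, so the choice of base cohort changes the outcome.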

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears and prompts you to select the registered users you want to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
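The difference between the sample count and the participant count can be seen in a few lines. The sketch relies on the TCGA barcode convention in which the first 12 characters of a sample barcode identify the participant; the barcodes themselves are made-up placeholders and the helper name is hypothetical.

```python
# Illustrative (made-up) sample barcodes. Two of them belong to the
# same participant, for example a tumor sample and a matched normal.
sample_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-10A",
    "TCGA-AA-0002-01A",
]


def participant_of(sample_barcode):
    """The participant barcode is the first 12 characters of a TCGA sample barcode."""
    return sample_barcode[:12]


participants = {participant_of(barcode) for barcode in sample_barcodes}
print(len(sample_barcodes), "samples from", len(participants), "participants")
```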

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

Using the "Show More" button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of the data available for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. Hovering over a horizontal segment between two bars shows the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform. Selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available, "Download File List as CSV". Selecting this item will download a list of all the files available for the cohort, taking into account the selected platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
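Once downloaded, the file list can be processed with any CSV reader. The sketch below uses Python's standard csv module on an in-memory example. The exact header strings, the platform and pipeline values, and the gs:// paths are assumptions made for illustration; check them against the file you actually download.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded file list. Header names and all values are
# hypothetical; only the five kinds of field come from the documentation.
example_csv = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,PlatformA,pipeline_x,Level 3,gs://example-bucket/a.txt
TCGA-AA-0001-10A,PlatformA,pipeline_x,Level 3,gs://example-bucket/b.txt
TCGA-AA-0002-01A,PlatformB,pipeline_y,Level 3,gs://example-bucket/c.txt
"""


def load_file_list(text):
    """Parse the CSV text into a list of dictionaries keyed by header."""
    return list(csv.DictReader(io.StringIO(text)))


rows = load_file_list(example_csv)
files_per_platform = Counter(row["Platform"] for row in rows)
paths = [row["File Path"] for row in rows]
print(files_per_platform)
print(paths)
```

For a real download, replace the in-memory string with the contents of the saved file, for example by reading it with open() and passing the text to load_file_list.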

Commenting: Any user who owns a cohort, or who has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, showing any previously created comments.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top-right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel is blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots. Each plot can be configured to show relevant data, and a visualization can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. The last cohort you created is chosen automatically. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top-right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top-right menu.

This will append a new plot section to your visualization, with a default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top-right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters the plottable features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters the plottable features by platform.

  – Center Filter: filters the plottable features by processing center.

  – Select Feature: provides the filtered list of plottable features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters the plottable features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters the plottable features by platform.

  – Value Filter: filters the plottable features by value.

  – Select Feature: provides the filtered list of plottable features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters the plottable features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters the plottable features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters the plottable features by platform.

  – Gene Region Filter: filters the plottable features by specific gene regions.

  – CpG Island Region Filter: filters the plottable features by CpG island region.

  – Select Feature: provides the filtered list of plottable features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters the plottable features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters the plottable features by value.

  – Select Feature: provides the filtered list of plottable features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters the plottable features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters the plottable features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered list of plottable features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters the plottable features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters the plottable features by mutation value.

• Select Feature: provides the filtered list of plottable features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button becomes active and you can proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
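The Owner and Reader rules above can be summarized in a small model. This is an illustrative sketch only, not the web-app's data model: the class, its fields and its method names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    """Hypothetical model of a shareable cohort."""
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)

    def permission(self, user):
        if user == self.owner:
            return "owner"
        if user in self.readers:
            return "reader"
        return None  # private by default

    def can_edit(self, user):
        return self.permission(user) == "owner"

    def can_comment(self, user):
        return self.permission(user) is not None

    def share(self, by, with_user):
        # Only the owner may share.
        if by != self.owner:
            raise PermissionError("only the owner can share a cohort")
        self.readers.add(with_user)

    def clone(self, user):
        # A reader (or the owner) gets an independent copy that they own.
        if self.permission(user) is None:
            raise PermissionError("cohort is not visible to this user")
        return Cohort(name="Copy of " + self.name, owner=user, samples=self.samples)


original = Cohort("BRCA alive", owner="alice", samples=frozenset({"S1", "S2"}))
original.share(by="alice", with_user="bob")
copy_for_bob = original.clone("bob")
print(original.permission("bob"), copy_for_bob.permission("bob"))
```

The clone carries over only the sample list, which mirrors the statement that a copy does not include information private to the original owner.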

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – A treemap graph on the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
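For readers who prefer Python to the BigQuery web interface, the following is a minimal sketch of querying a public table with the third-party google-cloud-bigquery client library. Several things here are assumptions rather than documented facts: the dataset and table name are placeholders (only the "isb-cgc" project ID comes from the text above), the column names Study and ParticipantBarcode are borrowed from the clinical data fields of the cohort API and may not match the BigQuery schema, and the helper function names are hypothetical. Running the query requires your own Google Cloud Project and credentials.

```python
def build_study_query(table, limit=10):
    """Build a standard-SQL query with a named @study parameter.

    `table` is a fully qualified name such as
    "isb-cgc.example_dataset.Clinical_data" (a placeholder, not a
    documented table name).
    """
    return (
        "SELECT ParticipantBarcode, Study "
        f"FROM `{table}` "
        "WHERE Study = @study "
        f"LIMIT {int(limit)}"
    )


def run_study_query(table, study, billing_project):
    """Run the query, billed to your own Google Cloud Project."""
    # Imported here so the query builder can be used without the library.
    from google.cloud import bigquery  # third-party: google-cloud-bigquery

    client = bigquery.Client(project=billing_project)
    config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("study", "STRING", study)]
    )
    job = client.query(build_study_query(table), job_config=config)
    return list(job.result())


example_sql = build_study_query("isb-cgc.example_dataset.Clinical_data", limit=5)
print(example_sql)
```

Passing the study as a query parameter, rather than formatting it into the SQL string, avoids quoting mistakes when the value comes from user input.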

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
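Two of the rules above, that an eRA Commons id can be linked to only one Google identity at a time and that controlled-data access lasts 24 hours after verification, can be captured in a toy model. This is a sketch for intuition only: the class and method names are hypothetical, and the real platform's bookkeeping is not described in this documentation.

```python
from datetime import datetime, timedelta

ACCESS_WINDOW = timedelta(hours=24)  # from the FAQ answer above


class NihLinks:
    """Toy registry: eRA Commons id -> (Google identity, verification time)."""

    def __init__(self):
        self._by_era_id = {}

    def link(self, era_id, google_identity, now):
        current = self._by_era_id.get(era_id)
        if current is not None and current[0] != google_identity:
            raise ValueError("unlink the eRA Commons id from the previous identity first")
        self._by_era_id[era_id] = (google_identity, now)

    def unlink(self, era_id):
        self._by_era_id.pop(era_id, None)

    def has_controlled_access(self, google_identity, now):
        return any(
            identity == google_identity and now - verified_at < ACCESS_WINDOW
            for identity, verified_at in self._by_era_id.values()
        )


links = NihLinks()
t0 = datetime(2016, 3, 3, 12, 0)
links.link("era_user_1", "old.identity@example.com", t0)
print(links.has_controlled_access("old.identity@example.com", t0 + timedelta(hours=23)))
print(links.has_controlled_access("old.identity@example.com", t0 + timedelta(hours=25)))
```

After the window expires, the user would authenticate again through the web-app, which corresponds to calling link again with a new timestamp in this model.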

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, providing starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment and running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring and sharing DNA sequence reads, reference-based alignments and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcases some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.


year_of_initial_pathologic_diagnosis: string
samples: []

Table 1.2

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
clinical_data | nested object | The clinical data about the participant.
clinical_data.ParticipantBarcode | string | Participant barcode.
clinical_data.Project | string | Project name, e.g. "TCGA".
clinical_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
clinical_data.age_at_initial_pathologic_diagnosis | string | Age at which a condition or disease was first diagnosed, in years.
clinical_data.anatomic_neoplasm_subdivision | string | Text term to describe the spatial location, subdivisions and/or anatomic site name of a tumor.
clinical_data.batch_number | string | Groups samples by the batch they were processed in.
clinical_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
clinical_data.clinical_M | string | Extent of the distant metastasis for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_N | string | Extent of the regional lymph node involvement for the cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_T | string | Extent of the primary cancer based on evidence obtained from clinical assessment parameters determined prior to treatment.
clinical_data.clinical_stage | string | Stage group determined from clinical information on the tumor (T), regional node (N) and metastases (M) and by grouping cases with similar prognosis for cancer.
clinical_data.colorectal_cancer | string | Text term to signify whether a patient has been diagnosed with colorectal cancer.
clinical_data.country | string | Text to identify the name of the state, province, or country in which the sample was procured.
clinical_data.days_to_birth | string | Time interval from a person's date of birth to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.days_to_initial_pathologic_diagnosis | string | Numeric value to represent the day of an individual's initial pathologic diagnosis of cancer.
clinical_data.days_to_last_followup | string | Time interval from the date of last followup to the date of initial pathologic diagnosis, represented as a calculated number of days.
clinical_data.ethnicity | string | The text for reporting information about ethnicity based on the Office of Management and Budget (OMB) categories.
clinical_data.frozen_specimen_anatomic_site | string | Text description of the origin and the anatomic site regarding the frozen biospecimen tumor tissue sample.
clinical_data.gender | string | Text designations that identify gender.
clinical_data.histological_type | string | Text term for the structural pattern of cancer cells used to define a microscopic diagnosis.
clinical_data.history_of_colon_polyps | string | Yes/No indicator to describe if the subject had a previous history of colon polyps as noted in the history/physical or previous endoscopic report(s).
clinical_data.history_of_neoadjuvant_treatment | string | Text term to describe the patient's history of neoadjuvant treatment and the kind of treatment given prior to resection of the tumor.
clinical_data.history_of_prior_malignancy | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.hpv_calls | string | Results of HPV tests.
clinical_data.hpv_status | string | Current HPV status.
clinical_data.icd_10 | string | The tenth version of the International Classification of Disease (ICD).
clinical_data.icd_o_3_histology | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.icd_o_3_site | string | The third edition of the International Classification of Diseases for Oncology.
clinical_data.lymphatic_invasion | string | A yes/no indicator to ask if malignant cells are present in small or thin-walled vessels, suggesting lymphatic involvement.
clinical_data.lymphnodes_examined | string | The yes/no/unknown indicator whether a lymph node assessment was performed at the primary presentation of disease.
clinical_data.lymphovascular_invasion_present | string | The yes/no indicator to ask if large vessel (vascular) invasion or small, thin-walled (lymphatic) invasion was detected in a tumor specimen.
clinical_data.menopause_status | string | Text term to signify the status of a woman's menopause, the permanent cessation of menses, usually defined by 6 to 12 months of amenorrhea.
clinical_data.mononucleotide_and_dinucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide and dinucleotide microsatellite panel.
clinical_data.mononucleotide_marker_panel_analysis_status | string | Text result of microsatellite instability (MSI) testing using a mononucleotide microsatellite panel.
clinical_data.neoplasm_histologic_grade | string | Numeric value to express the degree of abnormality of cancer cells, a measure of differentiation and aggressiveness.
clinical_data.new_tumor_event_after_initial_treatment | string | Yes/No/Unknown indicator to identify whether a patient has had a new tumor event after initial treatment.
clinical_data.number_of_lymphnodes_examined | string | The total number of lymph nodes removed and pathologically assessed for disease.
clinical_data.number_of_lymphnodes_positive_by_he | string | Numeric value to signify the count of positive lymph nodes identified through hematoxylin and eosin (H&E) staining light microscopy.
clinical_data.pathologic_M | string | Code to represent the defined absence or presence of distant spread or metastases (M) to locations via vascular channels or lymphatics beyond the reg
clinical_data.pathologic_N | string | The codes that represent the stage of cancer based on the nodes present (N stage) according to criteria based on multiple editions of the AJCC's Cance
clinical_data.pathologic_stage | string | The extent of a cancer, especially whether the disease has spread from the original site to other parts of the body, based on AJCC staging criteria.
clinical_data.pathologic_T | string | Code of pathological T (primary tumor) to define the size or contiguous extension of the primary tumor (T) using staging criteria from the American
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time.
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success | string | Measure of Success.
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status | string | The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
samples[] | list | List of barcodes of samples taken from this participant.


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Barcode of the sample to get information about. Required.
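As a rough illustration of the request described above, the following Python sketch checks a sample barcode and builds the request URL. Several things here are assumptions rather than documented behavior: the base URL is reconstructed from the request line in this section, the barcode pattern is inferred only from the example TCGA-B9-7268-01A, and passing sample_barcode in the query string is a guess about how the parameter is supplied. The helper names are hypothetical.

```python
import re
from urllib.parse import urlencode

# Assumption: base URL reconstructed from the request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"

# Assumption: pattern inferred from the single example TCGA-B9-7268-01A.
SAMPLE_BARCODE = re.compile(r"TCGA-[A-Z0-9]{2}-[A-Z0-9]{4}-[0-9]{2}[A-Z]")


def is_sample_barcode(barcode):
    """Return True if the string looks like a 16-character sample barcode."""
    return SAMPLE_BARCODE.fullmatch(barcode) is not None


def sample_details_url(sample_barcode):
    """Build the GET URL for sample_details (query-string passing is assumed)."""
    if not is_sample_barcode(sample_barcode):
        raise ValueError("not a 16-character sample barcode: %r" % sample_barcode)
    query = urlencode({"sample_barcode": sample_barcode})
    return BASE_URL + "/sample_details?" + query


print(sample_details_url("TCGA-B9-7268-01A"))
```

The resulting URL can be passed to any HTTP client. Because this endpoint does not require authentication, no token is attached.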

Response

If successful, this method returns a response body with the following structure:

kind: cohort_api#cohortsItem
aliquots: [string]
biospecimen_data:
    ParticipantBarcode: string
    Project: string
    SampleBarcode: string
    Study: string
    avg_percent_lymphocyte_infiltration: integer
    avg_percent_monocyte_infiltration: integer
    avg_percent_necrosis: integer
    avg_percent_neutrophil_infiltration: integer
    avg_percent_normal_cells: integer
    avg_percent_stromal_cells: integer
    avg_percent_tumor_cells: integer
    avg_percent_tumor_nuclei: integer
    batch_number: string
    bcr: string
    days_to_collection: string
    max_percent_lymphocyte_infiltration: string
    max_percent_monocyte_infiltration: string
    max_percent_necrosis: string
    max_percent_neutrophil_infiltration: string
    max_percent_normal_cells: string
    max_percent_stromal_cells: string
    max_percent_tumor_cells: string
    max_percent_tumor_nuclei: string
    min_percent_lymphocyte_infiltration: string
    min_percent_monocyte_infiltration: string
    min_percent_necrosis: string
    min_percent_neutrophil_infiltration: string
    min_percent_normal_cells: string
    min_percent_stromal_cells: string
    min_percent_tumor_cells: string
    min_percent_tumor_nuclei: string
data_details: [
    CloudStoragePath: string
    DataCenterName: string
    DataCenterType: string
    DataFileName: string
    DataFileNameKey: string
    DataLevel: string
    DatafileUploaded: string
    Datatype: string
    GenomeReference: string
    Pipeline: string
    Platform: string
    Project: string
    Repository: string
    SDRFFileName: string
    SampleBarcode: string
    SecurityProtocol: string
    platform_full_name: string
]
data_details_count: string
patient: string

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of data_details list.
patient | string | Participant barcode.
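To show how the data_details list might be used once a response has been parsed into a Python dictionary, here is a small sketch that counts the files per Datatype. The example response is a hypothetical fragment made up for illustration (it is not real output), and the function name is our own.

```python
from collections import Counter


def files_per_datatype(response):
    """Count data_details entries by their Datatype field."""
    details = response.get("data_details", [])
    return Counter(item.get("Datatype", "unknown") for item in details)


# Hypothetical response fragment, not real API output.
example = {
    "data_details_count": "3",
    "data_details": [
        {"Datatype": "miRNASeq", "Platform": "IlluminaHiSeq_miRNASeq"},
        {"Datatype": "miRNASeq", "Platform": "IlluminaHiSeq_miRNASeq"},
        {"Datatype": "DNA Methylation"},
    ],
}

counts = files_per_datatype(example)
for datatype, n in sorted(counts.items()):
    print(datatype, n)
```

Note that data_details_count is documented as a string, so it needs to be converted with int() before it is compared with the length of the list.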


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. User must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.
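The optional parameters above can be thought of as filters added to the request only when they are set. The sketch below builds such a URL in Python. It assumes that all parameters travel in the query string and that the base URL is as reconstructed from the request line in this section; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumption: base URL reconstructed from the request line in this section.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekeys_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the request URL, adding only the optional filters that were given."""
    params = {"sample_barcode": sample_barcode}
    optional = {"platform": platform, "pipeline": pipeline, "token": token}
    params.update({key: value for key, value in optional.items() if value is not None})
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


print(datafilenamekeys_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Leaving token unset corresponds to an unauthenticated request, which according to the description above returns open-access file paths only.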

Response

If successful, this method returns a response body with the following structure:

kind: cohort_api#cohortsItem
count: string
datafilenamekeys: [string]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
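Because the datafilenamekeys field may be missing entirely, and some entries may be placeholders, client code has to handle both cases. The following sketch shows one way to do this in Python. The example response and bucket name are hypothetical, and the exact form of a placeholder entry (bucket name followed by the marker text) is an assumption based on the description above.

```python
PLACEHOLDER = "file-path-not-yet-available"


def available_paths(response):
    """Return the usable cloud storage paths from a parsed response.

    Handles a response without the datafilenamekeys field and skips
    entries that contain the placeholder marker.
    """
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


# Hypothetical response, not real API output.
example = {
    "count": "2",
    "datafilenamekeys": [
        "gs://example-bucket/some/file.txt",
        "gs://example-bucket/file-path-not-yet-available",
    ],
}

print(available_paths(example))
```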


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

kind: cohort_api#cohortsItem
items: [
    count: string
    SampleBarcode: string
    GG_dataset_id: string
    GG_readgroupset_id: string
]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
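Since a sample may have no Google Genomics ids, the items list can be empty or absent. The sketch below extracts the two ids from a parsed response in Python and returns None when there are none. The id values in the example are invented placeholders, and the function name is our own.

```python
def genomics_ids(response):
    """Return (dataset id, readgroupset id) for the sample, or None if absent."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical responses, not real API output.
found = {
    "count": "1",
    "items": [
        {
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }
    ],
}
missing = {"count": "0"}

print(genomics_ids(found))
print(genomics_ids(missing))
```

The sketch inspects the items list rather than the count field, so it does not depend on count being converted from its string form.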


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a github repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below:

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you, and may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
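For readers who script their setup, the configuration commands mentioned above can be assembled as argument lists in Python. This is only a sketch: it prints the commands rather than running them, the region value us-central1 is an illustrative choice, and the property names are written in the section/property form (for example compute/zone) that gcloud config set expects. The helper name is hypothetical.

```python
import shlex


def config_set_command(prop, value):
    """Return the argument list for setting one gcloud configuration property."""
    return ["gcloud", "config", "set", prop, value]


commands = [
    ["gcloud", "config", "list"],
    config_set_command("compute/region", "us-central1"),
    config_set_command("compute/zone", "us-central1-a"),
]

# Print the commands for review instead of executing them.
for argv in commands:
    print(shlex.join(argv))
```

Each list could be handed to subprocess.run(argv, check=True) on a machine where the Cloud SDK is installed and authenticated.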


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs) and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of your instances each time, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
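If you script VM creation from Python, the same command can be assembled as an argument list before it is run. The following is only a minimal sketch: the helper name build_create_instance_command is hypothetical, and it covers just the --machine-type and --zone options of gcloud compute instances create.

```python
import shlex


def build_create_instance_command(name, machine_type=None, zone=None):
    """Return the argument list for `gcloud compute instances create`.

    Hypothetical helper for illustration; it only assembles the command.
    """
    argv = ["gcloud", "compute", "instances", "create", name]
    if machine_type is not None:
        argv += ["--machine-type", machine_type]
    if zone is not None:
        # Leave zone unset to fall back on the compute/zone property.
        argv += ["--zone", zone]
    return argv


command = build_create_instance_command("my-instance", machine_type="g1-small")
print(shlex.join(command))
# To actually launch the VM, pass the list to subprocess.run(command, check=True).
```

Building a list rather than a single string avoids shell-quoting problems when instance names or options come from variables.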

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances, and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI: simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources that are attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
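The two prices quoted above imply a rate of about $0.04 per GB per month for a standard persistent disk. The sketch below simply applies that implied rate; it assumes 1 TB is counted as 1000 GB, and the constant and function names are hypothetical. Check current pricing before relying on the numbers.

```python
# Rate implied by the figures above ($2 for 50 GB, $40 for 1 TB), in cents.
# Assumption: 1 TB is treated as 1000 GB; actual prices may have changed.
STANDARD_PD_CENTS_PER_GB_MONTH = 4


def monthly_disk_cost_usd(size_gb):
    """Estimated monthly cost in US dollars of a standard persistent disk."""
    return size_gb * STANDARD_PD_CENTS_PER_GB_MONTH / 100


for size_gb in (50, 500, 1000):
    print(f"{size_gb} GB: ${monthly_disk_cost_usd(size_gb):.2f} per month")
```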

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console or from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you first need to log on to that instance:

gcloud compute ssh my-instance

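For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point.

$ sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
$ sudo mkdir /mnt/pd1
$ sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

The exact device path depends on the device name assigned when the disk was attached. On Compute Engine the entries under /dev/disk/by-id/ usually carry a google- prefix, so list that directory to confirm the name before formatting.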

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes", the default, when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
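The create, attach, format, detach, and delete steps described in this section can be collected into a single ordered plan. The sketch below only builds the command strings and marks where each one runs; it does not execute anything. The function name disk_lifecycle_commands is hypothetical, and the device path passed in the example is an assumption (a google- prefixed entry under /dev/disk/by-id/) that you should confirm on your own VM.

```python
def disk_lifecycle_commands(instance, disk, device, size="500GB", mount_point="/mnt/pd1"):
    """Return (where, command) pairs covering the full persistent-disk lifecycle.

    `where` is "local" for commands run on your workstation and "vm" for
    commands run on the instance after `gcloud compute ssh`.
    """
    return [
        ("local", f"gcloud compute disks create {disk} --size {size}"),
        ("local", f"gcloud compute instances attach-disk {instance} --disk {disk} --device-name {disk}"),
        ("vm", f"sudo mkfs.ext4 -F {device}"),  # erases any existing data on the disk
        ("vm", f"sudo mkdir -p {mount_point}"),
        ("vm", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("vm", f"sudo umount {mount_point}"),
        ("local", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


# Assumed device path; verify with `ls -l /dev/disk/by-id/` on the instance.
steps = disk_lifecycle_commands("my-instance", "disk-1", "/dev/disk/by-id/google-disk-1")
for where, command in steps:
    print(f"[{where}] {command}")
```

Keeping the plan as data makes it easy to review the order of operations before running any of them, which matters because the mkfs step is destructive.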

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables, and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same procedure applies to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same procedure applies to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
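The behaviour of the three operations can be illustrated with ordinary Python sets. This is only a sketch of the set logic, not the web-app's implementation; the sample identifiers are made up and the function names are hypothetical.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort (order does not matter)."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the other cohorts."""
    return set(base).difference(*others)


# Made-up sample identifiers, for illustration only.
cohort_a = {"sample-01", "sample-02", "sample-03"}
cohort_b = {"sample-02", "sample-03", "sample-04"}

print(sorted(union(cohort_a, cohort_b)))
print(sorted(intersect(cohort_a, cohort_b)))
print(sorted(complement(cohort_a, cohort_b)))
print(sorted(complement(cohort_b, cohort_a)))
```

The last two lines show why complement needs a designated base cohort: swapping the base and the subtracted cohort gives a different result, whereas union and intersect are unaffected by order.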

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort is in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
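One way to read the counting rule in the last sentence is that the count shown for each value of an attribute applies every selected filter except the filter on that attribute itself. The sketch below illustrates that reading with made-up records; the attribute values, the record layout, and the function names are all assumptions, and the web-app may compute its counts differently.

```python
from collections import Counter

# Made-up sample records, for illustration only.
SAMPLES = [
    {"barcode": "s1", "Vital Status": "Alive", "Gender": "FEMALE"},
    {"barcode": "s2", "Vital Status": "Dead", "Gender": "FEMALE"},
    {"barcode": "s3", "Vital Status": "Alive", "Gender": "MALE"},
    {"barcode": "s4", "Vital Status": "None", "Gender": "MALE"},
]


def matches(sample, filters):
    """True if the sample satisfies every filter (values within one filter are OR-ed)."""
    return all(sample.get(attr) in allowed for attr, allowed in filters.items())


def value_counts(samples, attribute, filters):
    """Count samples per value of `attribute`, applying all *other* selected filters."""
    other_filters = {a: v for a, v in filters.items() if a != attribute}
    return Counter(s.get(attribute) for s in samples if matches(s, other_filters))


selected = {"Gender": {"FEMALE"}}
print(value_counts(SAMPLES, "Vital Status", selected))  # restricted to FEMALE samples
print(value_counts(SAMPLES, "Gender", selected))        # Gender filter not applied to itself
```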

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort, and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears, and you are prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
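Once downloaded, a file list in this form can be read with Python's csv module. The sketch below is illustrative only: the exact column headers in the real CSV are not given here, so the header names, bucket paths, and values in the example data are assumptions.

```python
import csv
import io
from collections import Counter

# Made-up file contents; the real header names and values may differ.
EXAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
sample-01,platform-A,pipeline-1,Level 3,gs://example-bucket/a.txt
sample-01,platform-B,pipeline-2,Level 3,gs://example-bucket/b.txt
sample-02,platform-A,pipeline-1,Level 3,gs://example-bucket/c.txt
"""


def read_file_list(handle):
    """Parse a cohort file list into a list of dicts keyed by column header."""
    return list(csv.DictReader(handle))


rows = read_file_list(io.StringIO(EXAMPLE_CSV))

# Number of files per platform, like the counts shown next to each platform.
files_per_platform = Counter(row["Platform"] for row in rows)

# Cloud Storage paths for one platform of interest.
paths_for_a = [row["File Path"] for row in rows if row["Platform"] == "platform-A"]

print(files_per_platform)
print(paths_for_a)
```

With a real download, replace io.StringIO(EXAMPLE_CSV) with an open file handle, for example open(path, newline="").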

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
    – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Platform Filter: filters down the plot-able features by platform.
    – Center Filter: filters down the plot-able features by processing center.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA
    – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
    – Platform Filter: filters down the plot-able features by platform.
    – Value Filter: filters down the plot-able features by value.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
    – Platform Filter: filters down the plot-able features by platform.
    – Gene Region Filter: filters down the plot-able features by specific gene regions.
    – CpG Island Region Filter: filters down the plot-able features by CpG island region.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters down the plot-able features by value.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.
    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
    – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.
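The two permission levels can be summarized as a small lookup table. The sketch below is one reading of the rules described in this section, not the web-app's actual implementation; the action names and the function names can and clone are hypothetical.

```python
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Actions allowed at each level, as read from the description above.
# The action names are illustrative labels, not identifiers from the web-app.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "save", "share", "delete"},
    Permission.READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if the given permission level allows the action."""
    return action in ALLOWED_ACTIONS[permission]


def clone(item, new_owner):
    """Copying an item makes the copying user the owner of the copy."""
    return {**item, "owner": new_owner}


shared = {"name": "my cohort", "owner": "alice"}
copy_for_bob = clone(shared, "bob")
print(can(Permission.READER, "save"), can(Permission.OWNER, "share"))
print(copy_for_bob["owner"])
```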

1.4.8 Release Notes

• December 23, 2015: v0.2
    – Treemap graphs in the cohort details and cohort creation pages will not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
    – Cohort file list download not working

• December 3, 2015: v0.1
    – First tagged release of the web-app


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
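Because controlled-data access lasts 24 hours after verification, a long-running script may want to check how much of that window remains before it starts a large job. The following is a minimal client-side sketch; the function names are hypothetical, and it assumes you record the verification time yourself, since the text above does not describe an API for querying it.

```python
from datetime import datetime, timedelta, timezone

# Duration stated in the FAQ answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Return the moment at which the 24-hour access window ends."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while 'now' falls inside the window that started at 'verified_at'."""
    return verified_at <= now < access_expires_at(verified_at)


def time_remaining(verified_at, now):
    """Time left in the window, never negative."""
    return max(access_expires_at(verified_at) - now, timedelta(0))
```

If `time_remaining` is shorter than the expected run time of a job, it is probably simpler to re-authenticate through the web-app first.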


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.
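When preparing the cost estimate for a request, a back-of-the-envelope calculation is often enough. The sketch below is one hypothetical way to do it. The function names are made up, and the unit prices are parameters that you must fill in from current Google Cloud pricing, because no rates are given here.

```python
# Initial allocation stated above, in US dollars.
INITIAL_ALLOCATION_USD = 300.0


def estimate_cost(storage_gb, months, vm_hours, usd_per_gb_month, usd_per_vm_hour):
    """Rough total: storage held for some months plus VM time.

    Both unit prices are inputs; look them up on the official pricing pages.
    """
    storage_cost = storage_gb * months * usd_per_gb_month
    compute_cost = vm_hours * usd_per_vm_hour
    return storage_cost + compute_cost


def fits_initial_allocation(cost_usd):
    """True if the estimate stays within the initial $300 allocation."""
    return cost_usd <= INITIAL_ALLOCATION_USD
```

For example, with placeholder rates of 0.02 USD per GB-month and 0.05 USD per VM-hour (illustrative values only, not real prices), `estimate_cost(500, 2, 1000, 0.02, 0.05)` comes to about 70 USD, which would fit within the initial allocation.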

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


Table 1.2 - continued from previous page

Property name | Value | Description
clinical_data.person_neoplasm_cancer_status | string | The state or condition of an individual's neoplasm at a particular point in time.
clinical_data.pregnancies | string | Value to describe the number of full-term pregnancies that a woman has experienced.
clinical_data.primary_neoplasm_melanoma_dx | string | Text indicator to signify whether a person had a primary diagnosis of melanoma.
clinical_data.primary_therapy_outcome_success | string | Measure of Success.
clinical_data.prior_dx | string | Text term to describe the patient's history of prior cancer diagnosis and the spatial location of any previous cancer occurrence.
clinical_data.race | string | The text for reporting information about race based on the Office of Management and Budget (OMB) categories.
clinical_data.residual_tumor | string | Text terms to describe the status of a tissue margin following surgical resection.
clinical_data.tobacco_smoking_history | string | Category describing current smoking status and smoking history as self-reported by a patient.
clinical_data.tumor_tissue_site | string | Text term that describes the anatomic site of the tumor or disease.
clinical_data.tumor_type | string | Text term to identify the morphologic subtype of papillary renal cell carcinoma.
clinical_data.vital_status | string | The survival state of the person registered on the protocol.
clinical_data.weiss_venous_invasion | string | The result of an assessment using the Weiss histopathologic criteria.
clinical_data.year_of_initial_pathologic_diagnosis | string | Numeric value to represent the year of an individual's initial pathologic diagnosis of cancer.
samples[] | list | List of barcodes of samples taken from this participant.


sample_details

Given a sample barcode (of length 16, e.g. TCGA-B9-7268-01A), this endpoint returns all available "biospecimen" information about this sample, the associated patient barcode, a list of associated aliquots, and a list of "data_details" blocks describing each of the data files associated with this sample.

Returns information about a specific sample. Takes a sample barcode as a required parameter. User does not need to be authenticated.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get information about.
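The request above can be assembled with the Python standard library. This is a sketch under two assumptions: that the endpoint URL reads https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/sample_details once its punctuation is restored, and that `sample_barcode` is passed in the query string. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed base URL, reconstructed from the "HTTP request" line above.
API_BASE = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def sample_details_url(sample_barcode):
    """Build the GET URL for the sample_details endpoint."""
    # The text above describes sample barcodes as 16 characters long.
    if len(sample_barcode) != 16:
        raise ValueError("expected a 16-character sample barcode")
    query = urlencode({"sample_barcode": sample_barcode})
    return API_BASE + "/sample_details?" + query


def fetch_sample_details(sample_barcode):
    """Call the endpoint and decode the JSON body (no authentication is required)."""
    with urlopen(sample_details_url(sample_barcode)) as response:
        return json.load(response)
```

Calling `fetch_sample_details("TCGA-B9-7268-01A")` performs a live request, so it needs network access and a service that still answers at that address.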

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "aliquots": [string],
      "biospecimen_data": {
        "ParticipantBarcode": string,
        "Project": string,
        "SampleBarcode": string,
        "Study": string,
        "avg_percent_lymphocyte_infiltration": integer,
        "avg_percent_monocyte_infiltration": integer,
        "avg_percent_necrosis": integer,
        "avg_percent_neutrophil_infiltration": integer,
        "avg_percent_normal_cells": integer,
        "avg_percent_stromal_cells": integer,
        "avg_percent_tumor_cells": integer,
        "avg_percent_tumor_nuclei": integer,
        "batch_number": string,
        "bcr": string,
        "days_to_collection": string,
        "max_percent_lymphocyte_infiltration": string,
        "max_percent_monocyte_infiltration": string,
        "max_percent_necrosis": string,
        "max_percent_neutrophil_infiltration": string,
        "max_percent_normal_cells": string,
        "max_percent_stromal_cells": string,
        "max_percent_tumor_cells": string,
        "max_percent_tumor_nuclei": string,
        "min_percent_lymphocyte_infiltration": string,
        "min_percent_monocyte_infiltration": string,
        "min_percent_necrosis": string,
        "min_percent_neutrophil_infiltration": string,
        "min_percent_normal_cells": string,
        "min_percent_stromal_cells": string,
        "min_percent_tumor_cells": string,
        "min_percent_tumor_nuclei": string
      },
      "data_details": [
        {
          "CloudStoragePath": string,
          "DataCenterName": string,
          "DataCenterType": string,
          "DataFileName": string,
          "DataFileNameKey": string,
          "DataLevel": string,
          "DatafileUploaded": string,
          "Datatype": string,
          "GenomeReference": string,
          "Pipeline": string,
          "Platform": string,
          "Project": string,
          "Repository": string,
          "SDRFFileName": string,
          "SampleBarcode": string,
          "SecurityProtocol": string,
          "platform_full_name": string
        }
      ],
      "data_details_count": string,
      "patient": string
    }

Table 1.3

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
aliquots[] | list | List of barcodes of aliquots taken from this participant.
biospecimen_data | nested object | Biospecimen data about the sample.
biospecimen_data.ParticipantBarcode | string | Participant barcode.
biospecimen_data.Project | string | Project name, e.g. "TCGA".
biospecimen_data.SampleBarcode | string | Sample barcode.
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA".
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration.
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration.
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis.
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration.
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells.
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells.
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells.
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei.
biospecimen_data.batch_number | string | Batch number in which the sample was processed.
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University".
biospecimen_data.days_to_collection | string | Days to collection.
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration.
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration.
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis.
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration.
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells.
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells.
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells.
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei.
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration.
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration.
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis.
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration.
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells.
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells.
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells.
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei.
data_details[] | list | List of information about each data file associated with the sample barcode.
data_details[].CloudStoragePath | string | Path to file, if it exists.
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca".
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc".
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system.
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file.
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project.
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2".
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)".
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq".
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq".
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0".
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA".
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub".
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt".
data_details[].SampleBarcode | string | Sample barcode.
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access".
data_details_count | string | Length of data_details list.
patient | string | Participant barcode.
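Once a sample_details response has been decoded into a Python dictionary, the `data_details` list is usually the part of interest. The helpers below are a sketch that relies only on the field names from the table above. The function names are hypothetical, and `EXAMPLE_RESPONSE` is invented data for illustration, not real API output.

```python
def available_file_paths(response):
    """Return the CloudStoragePath of every data file that has one."""
    paths = []
    for detail in response.get("data_details", []):
        path = detail.get("CloudStoragePath")
        if path:  # the path is only present "if it exists"
            paths.append(path)
    return paths


def files_by_platform(response):
    """Group DataFileName values by their Platform value."""
    grouped = {}
    for detail in response.get("data_details", []):
        platform = detail.get("Platform", "")
        grouped.setdefault(platform, []).append(detail.get("DataFileName", ""))
    return grouped


def declared_file_count(response):
    """data_details_count is documented as a string, so convert it."""
    return int(response.get("data_details_count", "0"))


# Invented example, shaped like the documented response body.
EXAMPLE_RESPONSE = {
    "patient": "TCGA-B9-7268",
    "data_details_count": "2",
    "data_details": [
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "a.txt",
         "CloudStoragePath": "gs://example-bucket/a.txt"},
        {"Platform": "IlluminaHiSeq_miRNASeq", "DataFileName": "b.txt",
         "CloudStoragePath": ""},
    ],
}
```

Note that the count fields are returned as strings, so comparisons such as `declared_file_count(r) == len(r["data_details"])` need the conversion shown here.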


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.
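The optional parameters only need to be sent when they are set. The sketch below builds the request URL accordingly. It assumes that the endpoint URL is https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample once punctuation is restored and that all parameters travel in the query string. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed base URL, reconstructed from the "HTTP request" line above.
API_BASE = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the GET URL, including only the optional filters that were supplied."""
    params = {"sample_barcode": sample_barcode}
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return API_BASE + "/datafilenamekey_list_from_sample?" + urlencode(params)
```

Omitting `token` yields the open-access listing only; supplying a token for a dbGaP-authorized identity is what makes controlled-access paths appear, as described above.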

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
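Two details in the description above are easy to miss in client code: the `datafilenamekeys` field may be absent altogether, and some entries may carry the "file-path-not-yet-available" marker instead of a usable path. A defensive sketch follows. The function names and the example paths in the usage note are hypothetical.

```python
PLACEHOLDER = "file-path-not-yet-available"


def usable_paths(response):
    """Return only the entries that look like real file paths."""
    # The field is omitted when no file paths are listed, so default to an empty list.
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER not in key]


def pending_paths(response):
    """Return the entries that are still marked as not yet available."""
    keys = response.get("datafilenamekeys", [])
    return [key for key in keys if PLACEHOLDER in key]
```

For instance, a response whose list holds `"gs://example-bucket/a.bam"` and `"gs://example-bucket/file-path-not-yet-available"` would yield one usable path and one pending path, while a response without the field yields two empty lists.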


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "items": [
        {
          "count": string,
          "SampleBarcode": string,
          "GG_dataset_id": string,
          "GG_readgroupset_id": string
        }
      ]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either "0" or "1".
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
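Because the response holds at most one item, client code can reduce it to either a pair of ids or nothing. The sketch below checks the `items` list directly rather than the `count` string, so it does not depend on where `count` sits in the body. The function name and the ids in the usage note are hypothetical.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id), or None if the sample has no entry."""
    items = response.get("items") or []
    if not items:
        return None
    first = items[0]  # the list holds a single object when ids exist
    return first["GG_dataset_id"], first["GG_readgroupset_id"]
```

A response with an empty or missing `items` list returns `None`, while a response with one item such as `{"GG_dataset_id": "123", "GG_readgroupset_id": "abc"}` returns the tuple `("123", "abc")`.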


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on Read the Docs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console: If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API: The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console. Click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time, since you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK: Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15
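A script that depends on the SDK can perform the same check before doing any work. The sketch below looks for the `gcloud` executable and turns the `gcloud --version` output into a dictionary. The function names are hypothetical, and the parser assumes each line has the form "name version", with the dotted version numbers in the usage note being a reading of the sample output above.

```python
import shutil
import subprocess


def gcloud_is_installed():
    """True if a gcloud executable can be found on the PATH."""
    return shutil.which("gcloud") is not None


def parse_gcloud_version(output):
    """Map each component name to its version string.

    Lines without a version (such as a bare "gcloud") are skipped.
    """
    versions = {}
    for line in output.splitlines():
        parts = line.rsplit(None, 1)  # split off the last whitespace-separated token
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def installed_versions():
    """Run 'gcloud --version' and parse what it prints."""
    result = subprocess.run(
        ["gcloud", "--version"], check=True, capture_output=True, text=True
    )
    return parse_gcloud_version(result.stdout)
```

Parsing the text "Google Cloud SDK 98.0.0" followed by "bq 2.0.18" gives `{"Google Cloud SDK": "98.0.0", "bq": "2.0.18"}`. `installed_versions()` only works where the SDK is actually installed.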

Google Cloud Shell: Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google: Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
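If you drive gcloud from a script rather than typing the commands, the configuration steps above can be wrapped as follows. This is a sketch: the helper names are hypothetical, the property names are written in gcloud's section/property form (compute/zone and so on), and the zone in the usage note is only an example value.

```python
import subprocess


def config_list_command():
    """Argument list equivalent to typing: gcloud config list"""
    return ["gcloud", "config", "list"]


def config_set_command(prop, value):
    """Argument list equivalent to typing: gcloud config set PROP VALUE"""
    return ["gcloud", "config", "set", prop, value]


def run(command):
    """Execute a gcloud command and return its standard output.

    Requires the Cloud SDK to be installed and authenticated.
    """
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout
```

For example, `run(config_set_command("compute/zone", "us-central1-b"))` sets the default zone, and `run(config_list_command())` prints the resulting configuration. Building the argument list separately from running it keeps the commands easy to log and to test.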


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console: go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI: simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
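Both prices quoted above work out to $0.04 per GB per month. The sketch below reproduces that arithmetic, assuming this rate (prices change over time, so check current pricing) and taking 1 TB as 1000 GB, as the quoted figures imply. The monthly_disk_cost helper is hypothetical.

```shell
#!/bin/sh
# Assumed rate implied by the figures above: 4 cents per GB per month.
CENTS_PER_GB_MONTH=4

# Print the estimated monthly cost, in whole dollars (rounded down),
# for a standard persistent disk of the given size in GB.
monthly_disk_cost() {
  size_gb="$1"
  echo $(( size_gb * CENTS_PER_GB_MONTH / 100 ))
}

echo "50 GB:   \$$(monthly_disk_cost 50) per month"
echo "1000 GB: \$$(monthly_disk_cost 1000) per month"
```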

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
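Putting the commands from this section together, a VM's life cycle might be sketched as follows. The instance name and zone are the example values used above, and the run helper is an illustrative addition (not part of gcloud) that only prints each command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (illustrative): print each command, run it only if DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

INSTANCE="my-instance"   # example instance name from this section
ZONE="us-central1-a"     # example zone from this section

# Create the VM, then confirm that it is listed.
run gcloud compute instances create "$INSTANCE" --machine-type g1-small --zone "$ZONE"
run gcloud compute instances list

# Log in (this opens an interactive session when actually executed).
run gcloud compute ssh "$INSTANCE" --zone "$ZONE"

# Stop the VM when you are finished for now; start it again later.
run gcloud compute instances stop "$INSTANCE" --zone "$ZONE"
run gcloud compute instances start "$INSTANCE" --zone "$ZONE"

# Delete the VM once it is no longer needed.
run gcloud compute instances delete "$INSTANCE" --zone "$ZONE"
```

Passing --zone explicitly avoids relying on the compute/zone property being set in your configuration.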

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).
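As a sketch of the read-only case, the attach command takes a --mode flag. The instance, disk, and zone names are the example values from this section, and the run helper is an illustrative addition (not part of gcloud) that only prints the command unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Dry-run helper (illustrative): print each command, run it only if DRY_RUN=0.
DRY_RUN="${DRY_RUN:-1}"
run() {
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

INSTANCE="my-instance"   # example instance name from this section
DISK="disk-1"            # example disk name from this section
ZONE="us-central1-a"     # example zone from this section

# Attach the disk read-only; omit --mode (or use --mode rw) for read-write.
run gcloud compute instances attach-disk "$INSTANCE" \
  --disk "$DISK" --device-name "$DISK" --mode ro --zone "$ZONE"
```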

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point. On the instance, an attached disk appears under /dev/disk/by-id/ as google- followed by its device name:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables, and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
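The three set operations can be sketched on plain lists of sample barcodes. This only illustrates the set logic and is not the web-app's implementation. The barcodes and file names are made up, and the sketch relies on the standard sort and comm utilities.

```shell
#!/bin/sh
# Two hypothetical cohorts, one (made-up) sample barcode per line.
workdir=$(mktemp -d)
printf '%s\n' TCGA-XX-0001-01A TCGA-XX-0002-01A TCGA-XX-0003-01A | sort > "$workdir/cohort_a.txt"
printf '%s\n' TCGA-XX-0002-01A TCGA-XX-0003-01A TCGA-XX-0004-01A | sort > "$workdir/cohort_b.txt"

# Union: every sample that appears in either cohort (order does not matter).
union=$(sort -u "$workdir/cohort_a.txt" "$workdir/cohort_b.txt")

# Intersection: samples common to both cohorts (comm needs sorted input).
intersection=$(comm -12 "$workdir/cohort_a.txt" "$workdir/cohort_b.txt")

# Complement: base cohort A minus the samples that are also in cohort B.
# Swapping the two files changes the result, which is why a base cohort
# must be chosen.
complement=$(comm -23 "$workdir/cohort_a.txt" "$workdir/cohort_b.txt")

echo "union:";        echo "$union"
echo "intersection:"; echo "$intersection"
echo "complement:";   echo "$complement"

rm -r "$workdir"
```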

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the sort is in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. By hovering over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
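To see why the two counts can differ, note that a TCGA sample barcode (such as the made-up TCGA-XX-0001-01A) begins with the 12-character participant barcode (TCGA-XX-0001), so several samples can map to one participant. A small sketch with hypothetical barcodes:

```shell
#!/bin/sh
# Hypothetical sample barcodes; the first two come from the same participant.
samples="TCGA-XX-0001-01A
TCGA-XX-0001-10A
TCGA-XX-0002-01A"

# Count the distinct samples.
n_samples=$(echo "$samples" | sort -u | wc -l | tr -d ' ')

# The participant barcode is the first 12 characters of the sample barcode.
n_participants=$(echo "$samples" | cut -c1-12 | sort -u | wc -l | tr -d ' ')

echo "samples: $n_samples, participants: $n_participants"
```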

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
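Once downloaded, the CSV can be filtered with standard command-line tools. The sketch below assumes a header row and the column order listed above. The exact header names, the platform values, and every row are made up for illustration, so check them against your own download.

```shell
#!/bin/sh
# Stand-in for a downloaded file list (all values below are hypothetical).
csv=$(mktemp)
printf '%s\n' \
  'Sample Barcode,Platform,Pipeline,Data Level,File Path' \
  'TCGA-XX-0001-01A,Platform-A,pipeline-1,Level 3,gs://example-bucket/a/file1.txt' \
  'TCGA-XX-0002-01A,Platform-B,pipeline-2,Level 3,gs://example-bucket/b/file2.txt' \
  'TCGA-XX-0003-01A,Platform-A,pipeline-1,Level 3,gs://example-bucket/a/file3.txt' \
  > "$csv"

# Print the Cloud Storage path (column 5) of each row whose Platform
# (column 2) matches; NR > 1 skips the header row.
# This simple split assumes that no field contains a comma.
paths=$(awk -F',' -v platform="Platform-A" 'NR > 1 && $2 == platform { print $5 }' "$csv")
echo "$paths"

rm "$csv"
```

Each resulting path could then be passed to a tool such as gsutil cp to copy the file locally.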

Commenting

Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platforms

  – Center Filter: filters down the plot-able features by processing center

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platforms

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platforms

  – Gene Region Filter: filters down the plot-able features by specific gene regions

  – CpG Island Region Filter: filters down the plot-able features by CpG Island region

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols

• Cohorts: here you can select one or more cohorts for the visualization

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. Copying only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  - You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.


1.4.8 Release Notes

• December 23, 2015: v0.2

  - A treemap graph on the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

  - Cohort file list download not working.

• December 3, 2015: v0.1

  - First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. Once you have authenticated, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.


    avg_percent_monocyte_infiltration: integer
    avg_percent_necrosis: integer
    avg_percent_neutrophil_infiltration: integer
    avg_percent_normal_cells: integer
    avg_percent_stromal_cells: integer
    avg_percent_tumor_cells: integer
    avg_percent_tumor_nuclei: integer
    batch_number: string
    bcr: string
    days_to_collection: string
    max_percent_lymphocyte_infiltration: string
    max_percent_monocyte_infiltration: string
    max_percent_necrosis: string
    max_percent_neutrophil_infiltration: string
    max_percent_normal_cells: string
    max_percent_stromal_cells: string
    max_percent_tumor_cells: string
    max_percent_tumor_nuclei: string
    min_percent_lymphocyte_infiltration: string
    min_percent_monocyte_infiltration: string
    min_percent_necrosis: string
    min_percent_neutrophil_infiltration: string
    min_percent_normal_cells: string
    min_percent_stromal_cells: string
    min_percent_tumor_cells: string
    min_percent_tumor_nuclei: string

  data_details: [
    CloudStoragePath: string
    DataCenterName: string
    DataCenterType: string
    DataFileName: string
    DataFileNameKey: string
    DataLevel: string
    DatafileUploaded: string
    Datatype: string
    GenomeReference: string
    Pipeline: string
    Platform: string
    Project: string
    Repository: string
    SDRFFileName: string
    SampleBarcode: string
    SecurityProtocol: string
    platform_full_name: string
  ]
  data_details_count: string
  patient: string

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
aliquots[] | list | List of barcodes of aliquots taken from this participant
biospecimen_data | nested object | Biospecimen data about the sample
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. "TCGA"
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. "BRCA"
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. "Nationwide Children's Hospital", "Washington University"
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. "bcgsc.ca"
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. "cgcc"
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC
data_details[].Datatype | string | Data type, e.g. "Complete Clinical Set", "CNV (SNP Array)", "DNA Methylation", "Expression-Protein", "Fragment Analysis Results", "miRNASeq", "Protected Mutations", "RNASeq", "RNASeqV2", "Somatic Mutations", "TotalRNASeqV2"
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. "hg19 (GRCh37)"
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. "bcgsc.ca__miRNASeq"
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. "IlluminaHiSeq_miRNASeq"
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. "Illumina HiSeq 2000", "Ion Torrent PGM", "AB SOLiD System 2.0"
data_details[].Project | string | The study for which the data was generated, e.g. "TCGA"
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. "DCC", "CGHub"
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. "bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt"
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. "DBGap Protected Access", "DBGap Open Access"
data_details_count | string | Length of data_details list
patient | string | Participant barcode
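As a rough illustration of how the data_details list described above might be consumed, the following Python sketch collects the cloud storage paths of open-access files from one response item. It assumes the response has already been decoded from JSON into a dictionary, and that open-access entries can be recognised by the word "Open" in SecurityProtocol (based only on the example values in the table). The function name and the sample data are hypothetical.

```python
def open_access_paths(item):
    """Return CloudStoragePath values for entries that look open-access."""
    paths = []
    for detail in item.get("data_details", []):
        path = detail.get("CloudStoragePath")
        protocol = detail.get("SecurityProtocol", "")
        # Skip entries without a path and entries not marked as open access.
        # Matching on the word "Open" is an assumption drawn from the
        # example values "DBGap Open Access" / "DBGap Protected Access".
        if path and "Open" in protocol:
            paths.append(path)
    return paths


# Hypothetical response fragment, invented purely for illustration.
example_item = {
    "data_details": [
        {"CloudStoragePath": "gs://example-bucket/a.txt",
         "SecurityProtocol": "DBGap Open Access"},
        {"CloudStoragePath": "gs://example-bucket/b.bam",
         "SecurityProtocol": "DBGap Protected Access"},
        {"SecurityProtocol": "DBGap Open Access"},  # no path available
    ],
    "data_details_count": "3",
}

print(open_access_paths(example_item))
```

Using .get() throughout keeps the sketch tolerant of entries where CloudStoragePath is missing, which the table suggests can happen ("Path to file, if it exists").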


datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for
platform | string | Optional. Filter file results by platform
pipeline | string | Optional. Filter file results by pipeline
token | string | Optional. Access token to authenticate user

Response

If successful, this method returns a response body with the following structure:

  kind: cohort_api#cohortsItem
  count: string
  datafilenamekeys: [string]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | Integer representing the length of the datafilenamekeys list
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with "file-path-not-yet-available". If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field
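The request described above could be assembled along the following lines. This is only a sketch: it assumes the endpoint URL reads https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample once its punctuation is restored, and that the listed parameters are passed in the query string. The helper names and the placeholder barcode are hypothetical, and no network call is made.

```python
from urllib.parse import urlencode

# Assumed base URL of the cohort API (reconstructed from the HTTP request line).
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_file_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the GET URL for datafilenamekey_list_from_sample."""
    params = {"sample_barcode": sample_barcode}  # the only required parameter
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def file_paths(response_body):
    """Return the listed paths, or an empty list when the field is absent."""
    # The datafilenamekeys field is documented as missing when no paths are listed.
    return response_body.get("datafilenamekeys", [])


# Placeholder barcode, not a real sample.
url = build_file_list_url("TCGA-XX-XXXX-01A", platform="IlluminaHiSeq_miRNASeq")
print(url)
```

The resulting URL can be fetched with any HTTP client. Because the response omits datafilenamekeys when nothing is listed, reading it with a default avoids a KeyError for samples that only have controlled-access files.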


google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved

Response

If successful, this method returns a response body with the following structure:

  kind: cohort_api#cohortsItem
  items: [
    count: string
    SampleBarcode: string
    GG_dataset_id: string
    GG_readgroupset_id: string
  ]

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type
count | string | The number of items returned. Count will be either "0" or "1"
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object
items[].SampleBarcode | string | The sample barcode passed into the request
items[].GG_dataset_id | string | The dataset id of the sample
items[].GG_readgroupset_id | string | The readgroupset id of the sample
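A decoded response from this method might be handled as in the sketch below. It assumes the body has already been parsed from JSON into a dictionary and relies only on the items list, since a sample without Google Genomics data is described as returning no item. The function name and the ids in the example are made up.

```python
def genomics_ids(response_body):
    """Return (dataset_id, readgroupset_id), or None if the sample has none."""
    items = response_body.get("items") or []
    if not items:
        # No dataset id / readgroupset id exists for this sample.
        return None
    first = items[0]  # documented as a list with at most one object
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical response bodies, for illustration only.
with_data = {
    "count": "1",
    "items": [{
        "SampleBarcode": "TCGA-XX-XXXX-01A",
        "GG_dataset_id": "example-dataset-id",
        "GG_readgroupset_id": "example-readgroupset-id",
    }],
}
without_data = {"count": "0"}

print(genomics_ids(with_data))
print(genomics_ids(without_data))
```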


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a github repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console: If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API: The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time: you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK: Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15
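If you want to check the installed components from a script rather than by eye, the output of gcloud --version can be parsed. The sketch below assumes each component is reported on its own line as a name followed by a version, as in the listing above. The helper names are hypothetical, the sample text is invented, and installed_components() only works where the gcloud command is on your PATH.

```python
import subprocess


def parse_sdk_components(text):
    """Map component names to versions from `gcloud --version` style output."""
    versions = {}
    for line in text.splitlines():
        parts = line.split()
        # Component lines look like "<name> <version>"; the banner line,
        # blank lines, and bare names without a version are skipped.
        if len(parts) == 2:
            versions[parts[0]] = parts[1]
    return versions


def installed_components():
    """Run `gcloud --version` and parse its output (requires the Cloud SDK)."""
    result = subprocess.run(["gcloud", "--version"],
                            capture_output=True, text=True, check=True)
    return parse_sdk_components(result.stdout)


# Invented sample output, used so the parser can be tried without the SDK.
sample_output = "Google Cloud SDK 1.2.3\n\nbq 2.0.0\ngcloud\ngsutil 4.0\n"
print(parse_sdk_components(sample_output))
```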

Google Cloud Shell: Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google: Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
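The properties listed above are set one at a time with gcloud config set. As a small illustration, the Python sketch below assembles those command lines for a handful of settings. The property names are written with their slashes (compute/zone), the project id and zone are placeholder values, and the helper name is hypothetical. The commands are only printed here, not executed.

```python
import shlex


def config_set_command(prop, value):
    """Return the argument list for `gcloud config set <prop> <value>`."""
    return ["gcloud", "config", "set", prop, value]


# Placeholder values; substitute your own project id and preferred zone.
settings = {
    "project": "my-project-id",
    "compute/zone": "us-central1-a",
}

for prop, value in settings.items():
    # shlex.join quotes arguments where needed, so the line can be pasted
    # into a shell. To run it directly from Python instead, pass the list
    # to subprocess.run(...).
    print(shlex.join(config_set_command(prop, value)))
```

Afterwards, gcloud config list should show the new values.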

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property with gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI:

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
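The two prices quoted above work out to the same rate of $0.04 per GB per month ($2 / 50 GB, and $40 / 1000 GB). The short sketch below applies that rate to other disk sizes. The rate is derived only from the figures in this section and may not match current pricing; the function name monthly_disk_cost is illustrative.

```python
# Rate derived from the figures quoted in the text ($2 per month for 50 GB).
# This is an assumption for illustration, not a current price list.
PRICE_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb):
    """Estimated monthly cost in dollars of a standard persistent disk."""
    return size_gb * PRICE_PER_GB_MONTH


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```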

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
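The commands mentioned in this section can be collected into a single sequence. The following is a minimal sketch that only assembles and prints them; the helper name vm_lifecycle is hypothetical, and my-instance and g1-small are the example values used above.

```python
import shlex


def vm_lifecycle(instance, machine_type):
    """Return the gcloud argument lists for create, ssh, stop, and delete."""
    base = ["gcloud", "compute"]
    return {
        "create": base + ["instances", "create", instance,
                          "--machine-type", machine_type],
        "ssh": base + ["ssh", instance],
        "stop": base + ["instances", "stop", instance],
        "delete": base + ["instances", "delete", instance],
    }


for step, argv in vm_lifecycle("my-instance", "g1-small").items():
    # Printed only; nothing is executed and no charges are incurred.
    print(f"{step:>6}: {shlex.join(argv)}")
```

Stopping keeps the persistent disk (and its storage charge), while deleting removes the instance itself, so the last two steps are alternatives rather than a required pair.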

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g., the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is given as the positional argument, and --device-name sets the name under which the disk appears on the instance. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. On the instance, the disk is found under /dev/disk/by-id/ with the prefix google- followed by its device name:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

sudo umount /dev/disk/by-id/google-disk-1
exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
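To see the whole persistent-disk lifecycle in one place, the sketch below lists each command in order together with where it is run (your workstation or the instance). It is an illustration under stated assumptions: the helper name disk_lifecycle is hypothetical, and it assumes the disk is attached with --device-name equal to the disk name, so that it appears on the instance as /dev/disk/by-id/google-<disk name>.

```python
def disk_lifecycle(instance, disk, size, mount_point):
    """Return (where, command) pairs in the order they should be run."""
    # Assumes --device-name equals the disk name (see the attach step).
    device = f"/dev/disk/by-id/google-{disk}"
    return [
        ("workstation", f"gcloud compute disks create {disk} --size {size}"),
        ("workstation", f"gcloud compute instances attach-disk {instance} "
                        f"--disk {disk} --device-name {disk}"),
        ("workstation", f"gcloud compute ssh {instance}"),
        ("instance", f"sudo mkfs.ext4 -F {device}"),  # erases existing data
        ("instance", f"sudo mkdir {mount_point}"),
        ("instance", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("instance", f"sudo umount {device}"),
        ("instance", "exit"),
        ("workstation", f"gcloud compute instances detach-disk {instance} "
                        f"--disk {disk}"),
        ("workstation", f"gcloud compute disks delete {disk}"),
    ]


for where, command in disk_lifecycle("my-instance", "disk-1", "500GB", "/mnt/pd1"):
    print(f"[{where:<11}] {command}")
```

The ordering matters: a disk must be unmounted before it is detached, and detached from every instance before it can be deleted.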

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.
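The sharing rules described above can be pictured with a small model. This is only an illustrative sketch, not the web application’s implementation: the Cohort class, its fields, and the user names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    owner: str
    samples: frozenset
    shared_with: set = field(default_factory=set)

    def permission(self, user):
        """Return 'owner', 'reader', or None if the user has no access."""
        if user == self.owner:
            return "owner"
        return "reader" if user in self.shared_with else None

    def share(self, acting_user, other_user):
        """Only the owner may share; the recipient becomes a reader."""
        if self.permission(acting_user) != "owner":
            raise PermissionError("only the owner can share a cohort")
        self.shared_with.add(other_user)

    def clone(self, user):
        """Any user who can see the cohort may clone it and own the copy."""
        if self.permission(user) is None:
            raise PermissionError("cohort is not visible to this user")
        return Cohort(f"Copy of {self.name}", user, self.samples)


original = Cohort("My cohort", "alice", frozenset({"S1", "S2"}))
original.share("alice", "bob")
copy = original.clone("bob")
print(original.permission("bob"), copy.permission("bob"))
```

A reader who wants to change a shared cohort works on the clone, which leaves the owner’s original untouched.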

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
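The three set operations behave like ordinary set arithmetic on the samples in each cohort. The sketch below illustrates this with Python sets; the function names and the sample identifiers S1 to S5 are hypothetical, and it says nothing about how the web application implements the operations internally.

```python
def union(cohorts):
    """Samples that appear in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def intersect(cohorts):
    """Samples that appear in every cohort (order does not matter)."""
    return set.intersection(*cohorts)


def complement(base, others):
    """Samples in the base cohort that appear in none of the other cohorts."""
    return base - set().union(*others)


a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
c = {"S3", "S5"}

print(sorted(union([a, b, c])))       # all five samples
print(sorted(intersect([a, b, c])))   # only the sample shared by all three
print(sorted(complement(a, [b, c])))  # what is left of a after removing b and c
```

Union and intersect give the same result for any ordering of their inputs, whereas complement depends on which cohort is chosen as the base.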

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
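A name-only search of this kind can be sketched as follows. The text does not say whether matching is case-sensitive or whether partial names match, so this example assumes a case-insensitive substring match; the function name and the example names are hypothetical.

```python
def search_by_name(query, cohort_names, visualization_names):
    """Return two result lists, matching the query against names only."""
    needle = query.lower()  # assumption: matching ignores case

    def matches(names):
        return [name for name in names if needle in name.lower()]

    return {
        "cohorts": matches(cohort_names),
        "visualizations": matches(visualization_names),
    }


results = search_by_name(
    "brca",
    cohort_names=["BRCA alive", "Lung cohort"],
    visualization_names=["Age histogram", "brca expression"],
)
print(results)
```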

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column, and the header indicates whether the column is sorted in ascending or descending order.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span across multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
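The way the counts beside each filter value depend on the other selected filters can be sketched as follows. This is an illustration only: the function facet_counts, the attribute names, and the four sample records are hypothetical and not drawn from the TCGA data.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count the values of one attribute among samples that pass
    every selected filter on the *other* attributes."""
    def passes(sample):
        return all(sample[attr] in allowed
                   for attr, allowed in selected.items()
                   if attr != attribute)

    return Counter(s[attribute] for s in samples if passes(s))


# Hypothetical sample records.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": "Alive", "gender": "FEMALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}

print(facet_counts(samples, selected, "vital_status"))
print(facet_counts(samples, selected, "gender"))
```

Because a filter’s own selection is ignored when computing its counts, the “Dead” value still shows a count even though only “Alive” is currently selected.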

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
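The number shown when hovering on a swatch can be thought of as a count of samples for one pair of segments on two neighbouring bars. The sketch below computes those pair counts; the function swatch_counts, the platform names, and the sample records are hypothetical placeholders rather than real TCGA platforms.

```python
from collections import Counter


def swatch_counts(samples, left, right):
    """Count samples for each (left segment, right segment) pair.
    A missing data type is counted in the "NA" segment."""
    return Counter((s.get(left) or "NA", s.get(right) or "NA")
                   for s in samples)


# Hypothetical records: data type -> platform that produced it (or None).
samples = [
    {"DNA Methylation": "Platform-A", "Protein": "Platform-P"},
    {"DNA Methylation": "Platform-A", "Protein": None},
    {"DNA Methylation": "Platform-B", "Protein": "Platform-P"},
    {"DNA Methylation": "Platform-A", "Protein": "Platform-P"},
]

counts = swatch_counts(samples, "DNA Methylation", "Protein")
for (left_segment, right_segment), n in sorted(counts.items()):
    print(f"{left_segment} -> {right_segment}: {n}")
```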

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data for both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
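Once downloaded, the CSV file list can be processed with a few lines of Python. The sketch below groups Cloud Storage paths by platform. The exact column headers in the real export are not given here, so the header names, barcodes, platforms, and gs:// paths in this example are all hypothetical placeholders.

```python
import csv
import io
from collections import defaultdict

# Hypothetical export; real header names, barcodes, and paths may differ.
FILE_LIST_CSV = """\
Sample Barcode,Platform,Pipeline,Data Level,File Path
SAMPLE-0001,Platform-A,pipeline-x,Level 3,gs://example-bucket/a/file1.txt
SAMPLE-0001,Platform-B,pipeline-y,Level 3,gs://example-bucket/b/file2.txt
SAMPLE-0002,Platform-A,pipeline-x,Level 3,gs://example-bucket/a/file3.txt
"""


def paths_by_platform(csv_text):
    """Group Cloud Storage file paths by the platform that produced them."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


for platform, paths in paths_by_platform(FILE_LIST_CSV).items():
    print(platform, len(paths), "file(s)")
```

For a real download, replace io.StringIO(csv_text) with an opened file handle and adjust the column names to match the header row of your file.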

Commenting: Any user who owns a cohort or has had it shared with them can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

On the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Platform Filter: filters down the plot-able features by platform

  – Center Filter: filters down the plot-able features by processing center

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name

  – Platform Filter: filters down the plot-able features by platform

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – CpG Probe Filter: filters down the plot-able features by a specific CpG Probe. This is an autocomplete search field for a particular probe

  – Platform Filter: filters down the plot-able features by platform

  – Gene Region Filter: filters down the plot-able features by specific gene regions

  – CpG Island Region Filter: filters down the plot-able features by CpG Island region

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by value

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name

  – Value Filter: filters down the plot-able features by mutation value

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature

• Cohorts: This is where you can select one or more cohorts to plot at one time

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols

• Cohorts: here you can select one or more cohorts for the visualization

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

44 Chapter 1 Contents

ISB Cancer Genomics Cloud Documentation Release 100

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
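The Owner and Reader rules above can be summarized in a small model. This is only an illustrative sketch, not the web-app's implementation: the class name `SharedItem` and its methods are hypothetical, and commenting is not modeled.

```python
from dataclasses import dataclass, field


@dataclass
class SharedItem:
    """Hypothetical stand-in for a cohort or visualization."""
    name: str
    owner: str
    readers: set = field(default_factory=set)

    def can_view(self, user):
        # Owners and Readers can both view the item.
        return user == self.owner or user in self.readers

    def can_edit(self, user):
        # Only the Owner can save changes or share the item.
        return user == self.owner

    def share(self, acting_user, new_reader):
        if not self.can_edit(acting_user):
            raise PermissionError("only the owner can share this item")
        self.readers.add(new_reader)

    def clone(self, user):
        # A clone is a new, private item owned by the user who made it.
        if not self.can_view(user):
            raise PermissionError("user has no access to this item")
        return SharedItem(name=self.name + " (copy)", owner=user)


if __name__ == "__main__":
    cohort = SharedItem("my cohort", owner="alice")
    cohort.share("alice", "bob")
    bobs_copy = cohort.clone("bob")
    print(cohort.can_edit("bob"), bobs_copy.can_edit("bob"))
```

In this sketch a Reader gains edit rights only on the clone, never on the original item.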

1.4.8 Release Notes

• December 23, 2015: v0.2

– Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download is not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
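The one-identity-at-a-time rule can be pictured as a simple lookup table. The sketch below is purely illustrative; the class `LinkRegistry` and the example ids are hypothetical and do not describe how the platform stores these links.

```python
class LinkRegistry:
    """Hypothetical table mapping each eRA Commons id to at most one Google identity."""

    def __init__(self):
        self._google_by_era = {}

    def link(self, era_id, google_identity):
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            # The id must be unlinked from the previous identity first.
            raise ValueError("eRA Commons id is already linked to another Google identity")
        self._google_by_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        # Only the identity that currently holds the link can release it.
        if self._google_by_era.get(era_id) == google_identity:
            del self._google_by_era[era_id]

    def linked_identity(self, era_id):
        return self._google_by_era.get(era_id)


if __name__ == "__main__":
    registry = LinkRegistry()
    registry.link("era_user", "old@example.com")
    registry.unlink("era_user", "old@example.com")
    registry.link("era_user", "new@example.com")
    print(registry.linked_identity("era_user"))
```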

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
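If you script against controlled-access data, it can help to track when the 24-hour window closes. The following is a minimal sketch that assumes you record the verification time yourself; the function names are hypothetical and the platform does not expose them.

```python
from datetime import datetime, timedelta, timezone

# Length of the access window stated above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Return the moment at which controlled-access data access lapses."""
    return verified_at + ACCESS_WINDOW


def has_access(verified_at, now):
    """True while `now` falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


if __name__ == "__main__":
    verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
    print(access_expires_at(verified).isoformat())
    print(has_access(verified, datetime.now(timezone.utc)))
```

Once the window has closed, you would need to re-authenticate through the web-app as described above.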

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Table 1.3 (continued)

Property name | Value | Description
biospecimen_data.ParticipantBarcode | string | Participant barcode
biospecimen_data.Project | string | Project name, e.g. “TCGA”
biospecimen_data.SampleBarcode | string | Sample barcode
biospecimen_data.Study | string | Tumor type abbreviation, e.g. “BRCA”
biospecimen_data.avg_percent_lymphocyte_infiltration | integer | Average percent lymphocyte infiltration
biospecimen_data.avg_percent_monocyte_infiltration | integer | Average percent monocyte infiltration
biospecimen_data.avg_percent_necrosis | integer | Average percent necrosis
biospecimen_data.avg_percent_neutrophil_infiltration | integer | Average percent neutrophil infiltration
biospecimen_data.avg_percent_normal_cells | integer | Average percent normal cells
biospecimen_data.avg_percent_stromal_cells | integer | Average percent stromal cells
biospecimen_data.avg_percent_tumor_cells | integer | Average percent tumor cells
biospecimen_data.avg_percent_tumor_nuclei | integer | Average percent tumor nuclei
biospecimen_data.batch_number | string | Batch number in which the sample was processed
biospecimen_data.bcr | string | Biospecimen core resource, e.g. “Nationwide Children’s Hospital”, “Washington University”
biospecimen_data.days_to_collection | string | Days to collection
biospecimen_data.max_percent_lymphocyte_infiltration | string | Maximum percent lymphocyte infiltration
biospecimen_data.max_percent_monocyte_infiltration | string | Maximum percent monocyte infiltration
biospecimen_data.max_percent_necrosis | string | Maximum percent necrosis
biospecimen_data.max_percent_neutrophil_infiltration | string | Maximum percent neutrophil infiltration
biospecimen_data.max_percent_normal_cells | string | Maximum percent normal cells
biospecimen_data.max_percent_stromal_cells | string | Maximum percent stromal cells
biospecimen_data.max_percent_tumor_cells | string | Maximum percent tumor cells
biospecimen_data.max_percent_tumor_nuclei | string | Maximum percent tumor nuclei
biospecimen_data.min_percent_lymphocyte_infiltration | string | Minimum percent lymphocyte infiltration
biospecimen_data.min_percent_monocyte_infiltration | string | Minimum percent monocyte infiltration
biospecimen_data.min_percent_necrosis | string | Minimum percent necrosis
biospecimen_data.min_percent_neutrophil_infiltration | string | Minimum percent neutrophil infiltration
biospecimen_data.min_percent_normal_cells | string | Minimum percent normal cells
biospecimen_data.min_percent_stromal_cells | string | Minimum percent stromal cells
biospecimen_data.min_percent_tumor_cells | string | Minimum percent tumor cells
biospecimen_data.min_percent_tumor_nuclei | string | Minimum percent tumor nuclei
data_details[] | list | List of information about each data file associated with the sample barcode
data_details[].CloudStoragePath | string | Path to file, if it exists
data_details[].DataCenterName | string | Short name of the contributing data center, e.g. “bcgsc.ca”
data_details[].DataCenterType | string | Abbreviation of the type of contributing data center, e.g. “cgcc”
data_details[].DataFileName | string | Name of the datafile stored on the DCC file system
data_details[].DataFileNameKey | string | Key into the ISB-CGC GCS bucket for this file
data_details[].DatafileUploaded | string | Whether the file fit requirements to be uploaded into the project
data_details[].DataLevel | string | Level of the type of data, depending on where it is stored in the DCC directory structure. Data levels are defined by TCGA DCC.
data_details[].Datatype | string | Data type, e.g. “Complete Clinical Set”, “CNV (SNP Array)”, “DNA Methylation”, “Expression-Protein”, “Fragment Analysis Results”, “miRNASeq”, “Protected Mutations”, “RNASeq”, “RNASeqV2”, “Somatic Mutations”, “TotalRNASeqV2”
data_details[].GenomeReference | string | Allows a center to associate results with a specific genome build that was used as the basis for analysis, e.g. “hg19 (GRCh37)”
data_details[].Pipeline | string | A combination of the center and the platform that can distinguish between two ways of performing the sequencing or assay for the same platform, e.g. “bcgsc.ca__miRNASeq”
data_details[].Platform | string | A platform (within the scope of TCGA) is a vendor-specific technology for assaying or sequencing that could possibly be customized by a GSC or CGCC, e.g. “IlluminaHiSeq_miRNASeq”
data_details[].platform_full_name | string | The full name of the sequencing platform used, e.g. “Illumina HiSeq 2000”, “Ion Torrent PGM”, “AB SOLiD System 2.0”
data_details[].Project | string | The study for which the data was generated, e.g. “TCGA”
data_details[].Repository | string | A storage location where files are deposited and made available, e.g. “DCC”, “CGHub”
data_details[].SDRFFileName | string | Name of SDRF file stored on the DCC file system, e.g. “bcgsc.ca_KIRC.IlluminaHiSeq_miRNASeq.sdrf.txt”
data_details[].SampleBarcode | string | Sample barcode
data_details[].SecurityProtocol | string | An indication of the security protocol necessary to fulfill in order to access the data from the file, e.g. “DBGap Protected Access”, “DBGap Open Access”
data_details_count | string | Length of data_details list
patient | string | Participant barcode

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. Barcode of the sample to get file paths for.
platform | string | Optional. Filter file results by platform.
pipeline | string | Optional. Filter file results by pipeline.
token | string | Optional. Access token to authenticate user.
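As an illustration of how these parameters combine into a request, here is a minimal sketch that only builds the request URL. It assumes the parameters are passed as a URL query string and that the endpoint is the one in the HTTP request line with its punctuation restored; the helper name and the sample barcode are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed base URL of the cohort API (punctuation restored from the request line).
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def datafilenamekey_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the GET URL for datafilenamekey_list_from_sample."""
    params = {"sample_barcode": sample_barcode}  # the only required parameter
    optional = (("platform", platform), ("pipeline", pipeline), ("token", token))
    for name, value in optional:
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


if __name__ == "__main__":
    # "TCGA-B9-7268-01A" is a placeholder barcode used only for illustration.
    print(datafilenamekey_url("TCGA-B9-7268-01A", platform="IlluminaHiSeq_miRNASeq"))
```

Leaving `token` unset corresponds to the unauthenticated case, in which only open-access file paths are returned.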

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "datafilenamekeys": [string]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | Integer representing the length of the datafilenamekeys list.
datafilenamekeys[] | list | List of cloud storage file paths associated with each sample within the cohort. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with “file-path-not-yet-available”. If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
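Because `count` arrives as a string and the `datafilenamekeys` field may be missing altogether, client code should handle both cases. The sketch below parses an already-retrieved response body; the function name and the example paths are hypothetical.

```python
import json


def file_paths_from_response(body):
    """Return (count, paths) from a datafilenamekey_list_from_sample response body."""
    data = json.loads(body)
    # The field is omitted when no file paths are listed, so default to [].
    paths = data.get("datafilenamekeys", [])
    # "count" is documented as a string holding an integer.
    count = int(data.get("count", len(paths)))
    return count, paths


if __name__ == "__main__":
    # Placeholder body for illustration; the paths are not real.
    example = json.dumps({
        "kind": "cohort_api#cohortsItem",
        "count": "1",
        "datafilenamekeys": ["gs://example-bucket/example-file.txt"],
    })
    print(file_paths_from_response(example))
```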

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name | Value | Description
Path parameters
sample_barcode | string | Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

    {
      "kind": "cohort_api#cohortsItem",
      "count": string,
      "items": [
        {
          "SampleBarcode": string,
          "GG_dataset_id": string,
          "GG_readgroupset_id": string
        }
      ]
    }

Property name | Value | Description
kind | cohort_api#cohortsItem | The resource type.
count | string | The number of items returned. Count will be either “0” or “1”.
items[] | list | If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode | string | The sample barcode passed into the request.
items[].GG_dataset_id | string | The dataset id of the sample.
items[].GG_readgroupset_id | string | The readgroupset id of the sample.
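A caller usually wants either the pair of ids or a clear signal that none exist. This minimal sketch reads an already-retrieved response body and relies only on the `items` list; the function name and the example ids are hypothetical.

```python
import json


def genomics_ids_from_response(body):
    """Return (dataset_id, readgroupset_id), or None when the sample has no ids."""
    data = json.loads(body)
    items = data.get("items", [])
    if not items:
        # No dataset id / readgroupset id exists for this sample.
        return None
    first = items[0]  # the list holds at most one object
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


if __name__ == "__main__":
    # Placeholder body for illustration; the ids are not real.
    example = json.dumps({
        "kind": "cohort_api#cohortsItem",
        "count": "1",
        "items": [{
            "SampleBarcode": "TCGA-B9-7268-01A",
            "GG_dataset_id": "example-dataset-id",
            "GG_readgroupset_id": "example-readgroupset-id",
        }],
    })
    print(genomics_ids_from_response(example))
```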

1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a GitHub repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on Read the Docs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components”, and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15
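If you want to perform this check from a script, the following sketch looks for gcloud on your PATH and turns the version listing into a dictionary. It assumes the one-component-per-line layout shown above; both function names are hypothetical.

```python
import shutil


def gcloud_on_path():
    """True if a gcloud executable can be found on the PATH."""
    return shutil.which("gcloud") is not None


def parse_gcloud_version(text):
    """Map each component name to its version string (None if no version is listed)."""
    versions = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        parts = line.rsplit(None, 1)
        if len(parts) == 2 and parts[1][0].isdigit():
            versions[parts[0]] = parts[1]
        else:
            versions[line] = None
    return versions


if __name__ == "__main__":
    sample = "Google Cloud SDK 98.0.0\n\nbq 2.0.18\ngcloud\ngsutil 4.16\n"
    print(gcloud_on_path())
    print(parse_gcloud_version(sample))
```

The text passed to `parse_gcloud_version` could come from running `gcloud --version` with `subprocess.run(..., capture_output=True, text=True)`.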

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user, and with the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
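To set several of these properties in one go, a small script can generate the corresponding gcloud config set commands. This is a sketch only: the helper name, the project id, and the zone value are hypothetical placeholders, and nothing is executed unless you pass each command to `subprocess.run` yourself.

```python
def config_set_commands(properties):
    """Return one `gcloud config set NAME VALUE` argument list per property."""
    return [["gcloud", "config", "set", name, value]
            for name, value in properties.items()]


if __name__ == "__main__":
    # Placeholder values; substitute your own project id and preferred zone.
    commands = config_set_commands({
        "project": "my-gcp-project",
        "compute/zone": "us-central1-b",
    })
    for command in commands:
        # To apply a setting: subprocess.run(command, check=True)
        print(" ".join(command))
```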

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the commandline using the Google Cloud SDK We will describe both of these approaches here

You should already be somewhat familiar with the Console and hopefully you have tried invoking the gcloud com-mand from your command-line The gcloud command-line tool can be used to manage both your development work-flow and your GCP resources (For more details please look at the official gcloud Tool Guide)

Bundled into the gcloud CLI are several commands and groups of sub-commands The group of sub-commands thatallows you to read and manipulate GCE resources is gcloud compute

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs with a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you.
• Zone: choose one of the us-east or us-central zones.
• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach). Note that as you change the specifications of the VM, the estimated cost shown on this page will update.
• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The “Change” button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks. You can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB).

Other options below the “Management, disk ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
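If you launch VMs from a script rather than by hand, it can help to assemble the command as an argument list first. The following is a minimal Python sketch of that idea, not part of the gcloud tool itself: the helper name build_create_command is hypothetical, and it only builds and prints the command shown above without running it. The --machine-type and --zone flags are the standard gcloud flags.

```python
import shlex


def build_create_command(name, machine_type="g1-small", zone=None):
    """Return the argument list for `gcloud compute instances create`.

    Hypothetical helper for illustration; it does not call gcloud.
    """
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    # --zone is optional here: when it is left out, gcloud can use the
    # compute/zone property from your configuration, if one is set.
    if zone is not None:
        argv += ["--zone", zone]
    return argv


cmd = build_create_command("my-instance", zone="us-central1-a")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run once you are happy with the printed command.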

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.
• Using the CLI: simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you will never need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
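The two disk prices quoted above imply a flat rate of $0.04 per GB per month. As a rough sketch, assuming that rate stays constant across sizes and that 1 TB is counted as 1000 GB (both are assumptions drawn from the figures above, not from a current price list), you can estimate what a stopped VM's disk will keep costing:

```python
# Rate implied by "$2 per month for 50 GB": 4 cents per GB per month.
# This is an assumption based on the figures in the text; check current pricing.
RATE_CENTS_PER_GB_MONTH = 4


def monthly_disk_cost(size_gb):
    """Estimated monthly cost in dollars of a standard persistent disk."""
    return size_gb * RATE_CENTS_PER_GB_MONTH / 100


for size_gb in (50, 500, 1000):
    print(f"{size_gb} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```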

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only). The --device-name value determines the name under which the disk appears inside the instance, as /dev/disk/by-id/google-<device-name>.

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

sudo umount /dev/disk/by-id/google-disk-1
exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
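The three operations behave like ordinary set operations on the sample identifiers in each cohort. The sketch below illustrates the semantics with Python sets; it is not the web-app's implementation, and the function names and sample identifiers (s1, s2, ...) are made up for the example.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(first, *rest):
    """Samples that appear in every cohort; order does not matter."""
    return set(first).intersection(*rest)


def complement(base, *others):
    """Samples in the base cohort that are in none of the others.

    Unlike union and intersect, the result depends on which cohort is the base.
    """
    return set(base).difference(*others)


# Hypothetical cohorts, each a set of sample identifiers.
a = {"s1", "s2", "s3"}
b = {"s2", "s3", "s4"}
c = {"s3", "s5"}

print(sorted(union(a, b, c)))       # ['s1', 's2', 's3', 's4', 's5']
print(sorted(intersect(a, b, c)))   # ['s3']
print(sorted(complement(a, b, c)))  # ['s1']
print(sorted(complement(b, a)))     # ['s4']
```

The last two lines show why the complement needs a designated base cohort: swapping the base changes the result.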

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
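As a rough illustration of name-only matching, the sketch below returns the two result lists for a query. It assumes a case-insensitive substring match, which is a guess at the behaviour rather than something the documentation specifies; the function name and the example names are hypothetical.

```python
def search_by_name(query, cohort_names, visualization_names):
    """Return (matching cohorts, matching visualizations) for a query.

    Only the names are compared; filters, samples and comments are ignored.
    Case-insensitive substring matching is an assumption of this sketch.
    """
    needle = query.lower()
    cohorts = [name for name in cohort_names if needle in name.lower()]
    visualizations = [name for name in visualization_names
                      if needle in name.lower()]
    return cohorts, visualizations


cohorts, visualizations = search_by_name(
    "brca",
    ["BRCA females", "Lung cohort", "brca stage II"],
    ["BRCA age histogram", "Expression scatter"],
)
print(cohorts)
print(visualizations)
```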

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
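The counts shown beside each filter value can be thought of as "faceted" counts: for a given attribute, count the samples that pass every other selected filter, ignoring any selection on that attribute itself. The following is a minimal sketch of that reading, with hypothetical field names and data; it is not the web-app's actual query logic.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count samples per value of `attribute`, applying all *other* filters.

    samples:  list of dicts, one per sample (hypothetical structure)
    selected: dict mapping attribute name -> set of allowed values
    """
    others = {k: v for k, v in selected.items() if k != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample.get(k) in allowed for k, allowed in others.items()):
            counts[sample.get(attribute)] += 1
    return counts


# Hypothetical samples with two filterable attributes.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": "Dead", "gender": "MALE"},
    {"vital_status": "Dead", "gender": "MALE"},
]

selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
print(facet_counts(samples, selected, "vital_status"))
print(facet_counts(samples, selected, "gender"))
```

With these filters, the Vital Status counts are computed over the female samples only, and the Gender counts over the “Alive” samples only.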

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel

This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.
• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.
• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel

This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel

This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
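To see why the two counts can differ, consider how TCGA sample barcodes relate to participants. The sketch below assumes the standard TCGA barcode layout, in which the first three hyphen-separated fields identify the participant; the barcodes themselves are made up for illustration.

```python
def participant_of(sample_barcode):
    """Return the participant part of a TCGA-style sample barcode.

    Assumes the first three hyphen-separated fields identify the participant.
    """
    return "-".join(sample_barcode.split("-")[:3])


# Hypothetical cohort: two samples from one participant, one from another.
sample_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-10A",
    "TCGA-BB-0002-01A",
]

participants = {participant_of(barcode) for barcode in sample_barcodes}
print(len(sample_barcodes), "samples")     # 3 samples
print(len(participants), "participants")   # 2 participants
```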

Clinical Features Panel

This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel

This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
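Once downloaded, the file list can be processed with any CSV reader. The sketch below counts files per platform using Python's csv module. The exact column header spellings, the platform and pipeline names, and the gs:// paths are assumptions made for this example; check the header row of your own download before relying on them.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded file list; all values here are hypothetical.
FILE_LIST_CSV = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,PlatformX,pipeline_a,Level 3,gs://example-bucket/a/file1.txt
TCGA-AA-0001-01A,PlatformY,pipeline_b,Level 3,gs://example-bucket/b/file2.txt
TCGA-BB-0002-01A,PlatformX,pipeline_a,Level 3,gs://example-bucket/a/file3.txt
"""


def count_files_by_platform(csv_text):
    """Return a Counter mapping each platform to its number of files."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row["Platform"] for row in reader)


for platform, n_files in sorted(count_files_by_platform(FILE_LIST_CSV).items()):
    print(platform, n_files)
```

For a real download, replace the io.StringIO wrapper with an open file handle.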

Commenting

Any user who owns a cohort or has had a cohort shared with them can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top-right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top-right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization with the default plot: an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
  – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.
• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

    – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and will be able to make changes.

1.4.8 Release Notes

• December 23, 2015: v0.2

    – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

    – Cohort file list download is not working.

• December 3, 2015: v0.1

    – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g., Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

Table 1.3 (continued from previous page)

Property name         Value     Description
data_details_count    string    Length of data_details list
patient               string    Participant barcode

datafilenamekey_list_from_sample

Takes a sample barcode as a required parameter and returns cloud storage paths to files associated with that sample. The user does not need to be authenticated to retrieve a list of open-access file paths only. The user must be authenticated and have dbGaP authorization in order to see paths to controlled-access files. If the user is not dbGaP authorized, controlled-access files will not appear.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/datafilenamekey_list_from_sample

Parameters

Parameter name    Value     Description
Path parameters
sample_barcode    string    Required. Barcode of the sample to get file paths for.
platform          string    Optional. Filter file results by platform.
pipeline          string    Optional. Filter file results by pipeline.
token             string    Optional. Access token to authenticate user.

Response

If successful this method returns a response body with the following structure

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "datafilenamekeys": [string]
}

Property name         Value                     Description
kind                  cohort_api#cohortsItem    The resource type.
count                 string                    Integer representing the length of the datafilenamekeys list.
datafilenamekeys[]    list                      List of cloud storage file paths associated with the sample. If a file path is not yet available in the metadata_data table, the cloud storage bucket name is listed with “file-path-not-yet-available”. If no file paths are listed (for example, if only controlled-access files are listed for that sample barcode and the user does not have dbGaP authorization), the response will not contain this field.
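As a rough illustration of the request and response described above, the following Python sketch builds the request URL and reads the file paths out of an already-decoded response. The base URL is reconstructed from the HTTP request line in this section, and sending the parameters as a query string is an assumption. The helper names are hypothetical and no network call is made.

```python
from urllib.parse import urlencode

# Assumed endpoint prefix, reconstructed from the HTTP request line above.
BASE_URL = "https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1"


def build_file_list_url(sample_barcode, platform=None, pipeline=None, token=None):
    """Build the GET URL for datafilenamekey_list_from_sample (hypothetical helper)."""
    params = {"sample_barcode": sample_barcode}
    # Optional parameters are only sent when the caller supplies them.
    for name, value in (("platform", platform), ("pipeline", pipeline), ("token", token)):
        if value is not None:
            params[name] = value
    return BASE_URL + "/datafilenamekey_list_from_sample?" + urlencode(params)


def extract_file_paths(response):
    """Return the file paths from a decoded response, or [] if the field is absent."""
    # The datafilenamekeys field is omitted entirely when no paths are listed.
    return list(response.get("datafilenamekeys", []))
```

Treating a missing datafilenamekeys field as an empty list means a caller without dbGaP authorization gets an empty result rather than a KeyError.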

google_genomics_from_sample

Takes a sample barcode as a required parameter and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value     Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful this method returns a response body with the following structure

{
  "kind": "cohort_api#cohortsItem",
  "items": [
    {
      "count": string,
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name                 Value                     Description
kind                          cohort_api#cohortsItem    The resource type.
count                         string                    The number of items returned. Count will be either “0” or “1”.
items[]                       list                      If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode         string                    The sample barcode passed into the request.
items[].GG_dataset_id         string                    The dataset id of the sample.
items[].GG_readgroupset_id    string                    The readgroupset id of the sample.
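One possible way to read this response in Python is sketched below. It relies only on the items list and the two id fields named in the table above. The function name and the example values in the usage note are hypothetical.

```python
def genomics_ids(response):
    """Return (dataset_id, readgroupset_id) from a decoded response, or None.

    Hypothetical helper: it looks only at the documented items list, which
    holds one object when ids exist for the sample.
    """
    items = response.get("items") or []
    if not items:
        return None
    item = items[0]
    return item["GG_dataset_id"], item["GG_readgroupset_id"]
```

For example, a decoded response whose items list holds one object with GG_dataset_id "d1" and GG_readgroupset_id "r1" yields the pair ("d1", "r1"), while a response with no items yields None.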

1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.

Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a github repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that the wealth of available information can sometimes result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we’ll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either “Owner” or “Editor” rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The “Home” page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see “Products and services”) and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all “Google APIs” and a list of the “Enabled APIs”.

You can check your list of “Enabled APIs” or simply select the “Compute Engine API” link, which should be at the very top of the list of “Popular APIs”. Once you are on the “Google Compute Engine” page, you should see either a blue button with the word “Enable” or a white “Disable” button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to “Go to Credentials”. You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components” and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
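If you prefer to script these settings, a minimal Python sketch along the following lines could assemble the gcloud config set invocations. The helper names are hypothetical, and nothing is executed here: the functions only build argument lists and printable command lines.

```python
import shlex


def config_set_argv(name, value):
    """Return the argument list for `gcloud config set NAME VALUE` (hypothetical helper)."""
    return ["gcloud", "config", "set", name, value]


def to_command_line(argv):
    """Render an argument list as a shell-quoted command line for display."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

For example, to_command_line(config_set_argv("compute/zone", "us-central1-a")) gives gcloud config set compute/zone us-central1-a. To actually apply a setting, the argument list could be passed to subprocess.run on a machine where the Cloud SDK is installed.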

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the “Change” button will open a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options, below the “Management, disk, ...” line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
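When VMs are launched from a script rather than by hand, the same command can be assembled programmatically. The following Python sketch is one way to do that. The function name is hypothetical, and it only builds the argument list for gcloud compute instances create using the --machine-type and --zone options; it does not run anything.

```python
def create_instance_argv(name, machine_type=None, zone=None):
    """Build the argument list for `gcloud compute instances create` (hypothetical helper)."""
    argv = ["gcloud", "compute", "instances", "create", name]
    # Options left as None fall back to gcloud's defaults or your configured properties.
    if machine_type is not None:
        argv += ["--machine-type", machine_type]
    if zone is not None:
        argv += ["--zone", zone]
    return argv
```

Calling create_instance_argv("my-instance", machine_type="g1-small") reproduces the command shown above. The resulting list could be handed to subprocess.run(argv, check=True) on a workstation with the Cloud SDK installed and authenticated.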

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console: go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI: simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk costs only $2 per month, and 1 TB costs $40.
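The two prices quoted above imply a rate of roughly 4 cents per GB per month for a standard persistent disk. A short Python sketch of that arithmetic follows. The rate is derived from the figures in this section only (treating 1 TB as 1000 GB), so check current Google pricing before relying on it.

```python
# Rate implied by the figures above ($2 per month for 50 GB): 4 cents per GB per month.
# This is an assumption derived from this section, not an official price.
CENTS_PER_GB_MONTH = 4


def monthly_disk_cost_usd(size_gb):
    """Estimate the monthly cost in dollars of a standard persistent disk."""
    return size_gb * CENTS_PER_GB_MONTH / 100
```

With this rate, a 50 GB disk comes to $2 per month and a 1000 GB disk to $40 per month, matching the figures above. By the same arithmetic, a 500 GB disk would cost about $20 per month.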

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console or from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g., the type will be pd-standard).

Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is a positional argument, and --device-name sets the name under which the disk appears inside the instance. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount point. Compute Engine lists an attached disk under /dev/disk/by-id/ as google- followed by its device name:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command, given here the mount point, from the instance to which the disk is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /mnt/pd1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
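The create, attach, detach, and delete steps in this section can be collected into a small script. The following is only a minimal sketch, not an official ISB-CGC tool: the helper name disk_lifecycle_commands and its parameters are hypothetical, and it merely assembles and prints the gcloud commands named above without running them. The format and mount steps are left out because they are run on the instance itself.

```python
import shlex


def disk_lifecycle_commands(instance, disk, size="500GB", zone=None):
    """Return the gcloud commands for one persistent-disk life cycle.

    Hypothetical helper for illustration only: it builds argument lists
    and never calls gcloud itself.
    """
    # --zone is optional; when omitted, gcloud falls back to its configured default.
    zone_args = ["--zone", zone] if zone else []
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size] + zone_args,
        ["gcloud", "compute", "instances", "attach-disk", instance,
         "--disk", disk, "--device-name", disk] + zone_args,
        ["gcloud", "compute", "instances", "detach-disk", instance,
         "--disk", disk] + zone_args,
        ["gcloud", "compute", "disks", "delete", disk] + zone_args,
    ]


if __name__ == "__main__":
    for command in disk_lifecycle_commands("my-instance", "disk-1"):
        # Dry run: print each command instead of executing it.
        print(" ".join(shlex.quote(part) for part in command))
```

To execute a command rather than print it, each list could be passed to subprocess.run(command, check=True) on a machine where the Cloud SDK is installed and authenticated.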

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button becomes selectable once at least one cohort is selected. Click on the “Trash” button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The “Share” button becomes selectable once at least one cohort is selected. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue box you can remove any cohorts you no longer want to share, and you can select one or more users from the list of users already registered in the system. Click the “Share Cohort” button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the “Set Operations” button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
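The three set operations behave like ordinary set algebra on the samples in each cohort. The sketch below illustrates this with Python sets; it is an assumption-laden illustration rather than the web-app’s actual implementation, and the function name combine_cohorts and the sample identifiers are made up.

```python
def combine_cohorts(operation, cohorts):
    """Combine cohorts (collections of sample identifiers) with a set operation.

    For "complement", the first cohort is treated as the base cohort and
    every other cohort is subtracted from it. Illustrative helper only.
    """
    if not cohorts:
        raise ValueError("at least one cohort is required")
    sets = [set(cohort) for cohort in cohorts]
    if operation == "union":
        return set().union(*sets)
    if operation == "intersect":
        return set.intersection(*sets)
    if operation == "complement":
        return sets[0].difference(*sets[1:])
    raise ValueError("unknown operation: %s" % operation)


# Placeholder sample identifiers, not real TCGA barcodes.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}

for operation in ("union", "intersect", "complement"):
    print(operation, sorted(combine_cohorts(operation, [cohort_a, cohort_b])))
```

Note that union and intersect give the same result whatever the order of the cohorts, whereas complement depends on which cohort is chosen as the base.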

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you can enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves in the same way as on the User Dashboard page. A dialogue box appears and prompts you to select registered users to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
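The difference between the two counts in the Details Panel can be illustrated with a few lines of Python. This sketch assumes the usual TCGA barcode convention, in which the first 12 characters of a sample barcode identify the participant; the barcodes below are made-up placeholders in that format, and the function name is hypothetical.

```python
def count_samples_and_participants(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort.

    Assumes TCGA-style barcodes in which the first 12 characters
    (for example "TCGA-AA-0001") identify the participant.
    """
    samples = set(sample_barcodes)
    participants = {barcode[:12] for barcode in samples}
    return len(samples), len(participants)


# Placeholder barcodes: two samples from one participant, one from another.
barcodes = ["TCGA-AA-0001-01A", "TCGA-AA-0001-10A", "TCGA-AA-0002-01A"]
print(count_samples_and_participants(barcodes))
```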

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two vertical bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use the “Previous Page” and “Next Page” buttons to show more entries in the list. You can filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. The only menu item available is “Download File List as CSV”. Selecting this item will download a list of all the files available for the cohort, taking into account the selected platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
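Once downloaded, the CSV file list can be processed with a short script, for example to group Cloud Storage paths by platform. The sketch below is illustrative only: the exact header names in the exported file are assumptions based on the fields listed above, the function name is hypothetical, and the rows are made-up placeholders.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text):
    """Group Cloud Storage paths from a cohort file list by platform.

    The column names "Platform" and "Cloud Storage Location" are assumed;
    adjust them to match the header row of the actual download.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["Cloud Storage Location"])
    return dict(grouped)


# Made-up rows; not real ISB-CGC files or buckets.
example = (
    "Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location\n"
    "TCGA-AA-0001-01A,PlatformA,PipelineX,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-AA-0002-01A,PlatformA,PipelineX,Level 3,gs://example-bucket/b.txt\n"
    "TCGA-AA-0002-01A,PlatformB,PipelineY,Level 3,gs://example-bucket/c.txt\n"
)

for platform, paths in sorted(paths_by_platform(example).items()):
    print(platform, paths)
```

For a file on disk, the same function can be given the text returned by reading the downloaded file.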

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top-right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top-right menu.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top-right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top-right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
    – Single autocomplete textbox: This input searches through the names of all the available clinical features.
• Gene Expression
    – Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Platform Filter: filters the plot-able features by platform.
    – Center Filter: filters the plot-able features by processing center.
    – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• miRNA
    – miRNA Name Filter: filters the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
    – Platform Filter: filters the plot-able features by platform.
    – Value Filter: filters the plot-able features by value.
    – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Methylation
    – Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – CpG Probe Filter: filters the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
    – Platform Filter: filters the plot-able features by platform.
    – Gene Region Filter: filters the plot-able features by specific gene regions.
    – CpG Island Region Filter: filters the plot-able features by CpG island region.
    – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Copy Number
    – Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters the plot-able features by value.
    – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Protein
    – Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Protein Filter: filters the plot-able features by protein. This is an autocomplete search field for a protein name.
    – Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Mutation
    – Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox overrides any feature selected in Color By Feature. The cohorts provided are used for the coloring and for the legend instead.
• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been chosen, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, you select the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols
• Cohorts: here you can select one or more cohorts for the visualization

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been chosen, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on the check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
    – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
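The two permission levels can be summarized as a small lookup table. The following sketch is only a reading of the rules described in this section and is not taken from the web-app’s code; the action names and the helper can are hypothetical.

```python
OWNER = "owner"
READER = "reader"

# Actions permitted at each level, as described in this section.
# The action names themselves are illustrative, not part of the web-app.
ALLOWED_ACTIONS = {
    OWNER: {"view", "comment", "copy", "edit", "share", "delete"},
    READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if a user with this permission level may perform the action."""
    return action in ALLOWED_ACTIONS.get(permission, set())


print(can(READER, "comment"))  # a Reader may comment
print(can(READER, "edit"))     # a Reader must clone before editing
print(can(OWNER, "share"))     # only the Owner may share
```

After a Reader makes a copy, the Reader is the Owner of that copy, so every Owner action then applies to the copy.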

1.4.8 Release Notes

• December 23, 2015: v0.2
    – Treemap graphs on the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
    – Cohort file list download not working
• December 3, 2015: v0.1
    – First tagged release of the web-app

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate, you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started?

Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that?

You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.


google_genomics_from_sample

Takes a sample barcode as a required parameter, and returns the Google Genomics dataset id and readgroupset id associated with the sample, if any.

Access control: To call this method, you must have the following roles:

• None

Request

HTTP request

GET https://api-dot-isb-cgc.appspot.com/_ah/api/cohort_api/v1/google_genomics_from_sample

Parameters

Parameter name    Value     Description
Path parameters
sample_barcode    string    Required. The sample whose dataset id and readgroupset id will be retrieved.

Response

If successful, this method returns a response body with the following structure:

{
  "kind": "cohort_api#cohortsItem",
  "count": string,
  "items": [
    {
      "SampleBarcode": string,
      "GG_dataset_id": string,
      "GG_readgroupset_id": string
    }
  ]
}

Property name                 Value                     Description
kind                          cohort_api#cohortsItem    The resource type.
count                         string                    The number of items returned. Count will be either "0" or "1".
items[]                       list                      If a dataset id and readgroupset id exist for the sample, this will be a list with one object.
items[].SampleBarcode         string                    The sample barcode passed into the request.
items[].GG_dataset_id         string                    The dataset id of the sample.
items[].GG_readgroupset_id    string                    The readgroupset id of the sample.
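As a minimal sketch of how a Python client might consume this response, the helper below pulls the two ids out of a response body, or returns None when no item is present. The field names come from the property table above; the function name, the sample barcode, and the id values are hypothetical placeholders, and no network request is made.

```python
import json


def genomics_ids_from_response(body):
    """Return (dataset_id, readgroupset_id) from a response body, or None.

    `body` is the JSON text of a google_genomics_from_sample response.
    The helper name is illustrative and not part of the ISB-CGC API.
    """
    payload = json.loads(body)
    items = payload.get("items") or []
    if not items:
        # No dataset id / readgroupset id exists for this sample.
        return None
    first = items[0]
    return first["GG_dataset_id"], first["GG_readgroupset_id"]


# Hypothetical response body; the barcode and ids are made-up placeholders.
example_body = json.dumps({
    "kind": "cohort_api#cohortsItem",
    "count": "1",
    "items": [{
        "SampleBarcode": "TCGA-XX-XXXX-01A",
        "GG_dataset_id": "example-dataset-id",
        "GG_readgroupset_id": "example-readgroupset-id",
    }],
})

print(genomics_ids_from_response(example_body))
```

Checking the items list rather than the count string keeps the helper independent of how the count is encoded.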


1.3.4 Using Google Compute Engine

For those ISB-CGC users whose research goals require the ability to run large compute jobs, all of the power and infrastructure behind Google Compute (Compute Engine, Container Engine, Dataproc, and Dataflow) and Google Genomics are at your disposal.


Our goal is to help you assemble the tools and data (TCGA data, your data, reference data, etc.) that you need to answer your research questions in the most efficient and cost-effective way possible.

Towards that end, we have created a github repository called examples-Compute with examples to get you started. This repository will continue to grow, and we welcome your contributions and suggestions. You can also find a number of useful recipes in the Google Genomics Cookbook, also here on readthedocs.

For an introduction to using Google Compute Engine, please follow the link below.

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP). GCE offers scale, performance, and value, letting you easily create and run virtual machines (VMs) on Google infrastructure.

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platform, but your main source of information should generally be the official Google Cloud Platform documentation. We have found that sometimes the wealth of available information can result in information overload, so we hope that this brief introduction will be useful to you. If you are still feeling lost, please let us know and we'll do our best to get you pointed in the right direction.

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project, with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console

If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API

The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page, you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs", or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK

Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components" and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user, and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell!" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you. They may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
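If you script your setup from Python, one possible approach is to assemble the gcloud config set commands as argument lists and review them before running anything. This is only a sketch: the helper name is hypothetical, and the project id, region, and zone values are placeholders to replace with your own.

```python
import shlex


def config_set_command(prop, value):
    """Build the argument list for `gcloud config set PROPERTY VALUE`."""
    return ["gcloud", "config", "set", prop, value]


# Hypothetical values; substitute your own project id and preferred zone.
settings = {
    "project": "my-project-id",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}

commands = [config_set_command(prop, value) for prop, value in settings.items()]
for argv in commands:
    # Only print the command; pass argv to subprocess.run to execute it.
    print(shlex.join(argv))
```

Building argument lists instead of shell strings avoids quoting problems if a value ever contains spaces.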


Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page, you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (a.k.a. vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options, below the "Management, disk, ..." line, include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.


Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property, for example: gcloud config set compute/zone us-central1-a. A list of zones can be fetched by running gcloud compute zones list.

Here is a very simple command to create a VM: gcloud compute instances create my-instance --machine-type g1-small
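When several VMs need to be launched from a script, the same command can be assembled in Python. The sketch below only builds and prints the command line; the helper name is hypothetical, and the instance name, machine type, and zone are the example values used in this section.

```python
import shlex


def create_instance_command(name, machine_type, zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    argv = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # Leave --zone out to fall back on the compute/zone property.
        argv += ["--zone", zone]
    return argv


print(shlex.join(create_instance_command("my-instance", "g1-small")))
print(shlex.join(create_instance_command("my-instance", "g1-small",
                                         zone="us-central1-a")))
```

To actually launch the VM, the returned list could be handed to subprocess.run on a workstation where gcloud is installed and authenticated.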

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
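The two prices quoted above imply a flat rate of 4 cents per GB per month for standard persistent disks. As a worked example, the sketch below applies that rate to a few disk sizes. The rate is inferred from the figures in this section and treats 1 TB as 1000 GB; actual pricing may differ, so treat the numbers as an estimate only.

```python
# Rate implied by the figures above ($2 per month for 50 GB). This is an
# assumption derived from this section, not an official price list.
CENTS_PER_GB_MONTH = 4


def monthly_disk_cost(size_gb):
    """Estimated monthly cost in dollars of a standard persistent disk."""
    return size_gb * CENTS_PER_GB_MONTH / 100


for size_gb in (50, 500, 1000):
    print(f"{size_gb} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```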

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.


Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it, you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).


Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is a required argument, and --device-name sets the name under which the disk appears on the instance (here /dev/disk/by-id/google-disk-1). Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
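To keep the order of the steps in this section straight, the gcloud side of a disk's life cycle can be written down as a list of commands. The sketch below builds that list in Python and prints it; the helper name is hypothetical, the disk and instance names are the example values used above, and the formatting, mounting, and unmounting steps are left out because they run on the instance itself.

```python
import shlex


def disk_lifecycle_commands(disk, instance, size="500GB"):
    """Return the gcloud commands for one disk, in the order they are run."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance,
         "--disk", disk],
        # Format, mount and later unmount the disk on the instance here.
        ["gcloud", "compute", "instances", "detach-disk", instance,
         "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


for argv in disk_lifecycle_commands("disk-1", "my-instance"):
    print(shlex.join(argv))
```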


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort, but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you can do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
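The three set operations behave like the corresponding operations on Python sets. The sketch below illustrates this with made-up sample identifiers; it is not the web-app's implementation, and the function names and cohort contents are hypothetical.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort."""
    return set(first).intersection(*others)


def complement(base, *subtracted):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*subtracted)


# Hypothetical cohorts of sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

print(sorted(union(cohort_a, cohort_b, cohort_c)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_b, cohort_c)))
```

Union and intersect give the same result whatever order the cohorts are passed in, while complement depends on which cohort is chosen as the base: complement(cohort_b, cohort_a) keeps only "S4".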

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the order is ascending or descending.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
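One way to read the counting rule in the last sentence is that the count for each value of an attribute is computed after applying the filters on every other attribute, but not the filter on that attribute itself. The sketch below illustrates that reading with a handful of made-up samples. It is an interpretation for illustration, not the web-app's code, and the function name, attribute names, and values are hypothetical.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count samples per value of `attribute`, applying every other filter.

    `selected` maps an attribute name to the set of values chosen for it.
    The filter on `attribute` itself is ignored, so values that are not
    currently selected still show how many samples they would contribute.
    """
    def passes(sample):
        return all(sample.get(name) in values
                   for name, values in selected.items()
                   if name != attribute and values)

    return Counter(s.get(attribute) for s in samples if passes(s))


# Hypothetical samples and filter selections.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": None, "gender": "MALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}

print(dict(facet_counts(samples, selected, "vital_status")))
print(dict(facet_counts(samples, selected, "gender")))
```

With these selections, the Vital Status counts are taken over the two FEMALE samples only, and the Gender counts are taken over the two Alive samples only.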

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site


Data Type Filters List

bull DNA-Sequence

bull RNA-Sequence

bull miRNA-Sequence

bull Protein

bull SNP Copy-Number

bull DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering over a swatch between two vertical bars shows the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
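The three operations behave like ordinary set operations on the lists of samples in each cohort. The Python sketch below treats each cohort as a set of sample identifiers to show why order does not matter for union and intersect but does matter for complement. The identifiers and function names are hypothetical, and this is not the web-app's own code.

```python
def union(first, *rest):
    """Samples that appear in at least one of the cohorts."""
    return set(first).union(*rest)


def intersect(first, *rest):
    """Samples that appear in every one of the cohorts."""
    return set(first).intersection(*rest)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, each a set of made-up sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

print(sorted(union(cohort_a, cohort_b, cohort_c)))       # ['S1', 'S2', 'S3', 'S4', 'S5']
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))   # ['S3']
print(sorted(complement(cohort_a, cohort_b, cohort_c)))  # ['S1']
print(sorted(complement(cohort_b, cohort_a, cohort_c)))  # ['S4']
```

The last two lines differ only in which cohort is used as the base, which is why the complement operation asks you to choose one.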

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.
• Share with Others: This behaves in the same way as on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
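The difference between the two counts can be illustrated with a short Python sketch. It assumes the usual TCGA barcode layout, in which the first three dash-separated fields of a sample barcode identify the participant. The barcodes below are made up for illustration and the `participant_of` helper is hypothetical.

```python
def participant_of(sample_barcode):
    """Return the participant part of a sample barcode, assumed to be
    the first three dash-separated fields (e.g. TCGA-AA-0001)."""
    return "-".join(sample_barcode.split("-")[:3])


# Made-up sample barcodes: the first participant contributed two samples.
sample_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-11A",
    "TCGA-BB-0002-01A",
]

participants = {participant_of(barcode) for barcode in sample_barcodes}
print(len(sample_barcodes), "samples from", len(participants), "participants")
```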

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The CSV file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
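Once downloaded, a file list like this can be processed with a few lines of Python. The sketch below is only an illustration: the exact column headers in the real CSV are not given in this documentation, so the header names, platform names, and `gs://` paths used here are assumptions.

```python
import csv
import io
from collections import Counter

# Stand-in for a downloaded file list. The header names and all values
# are assumptions made for this example.
EXAMPLE_CSV = """Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location
TCGA-AA-0001-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/a.txt
TCGA-AA-0001-01A,PlatformB,pipeline_2,Level 3,gs://example-bucket/b.txt
TCGA-BB-0002-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/c.txt
"""


def read_file_list(handle):
    """Read the CSV into a list of dictionaries keyed by column header."""
    return list(csv.DictReader(handle))


rows = read_file_list(io.StringIO(EXAMPLE_CSV))

# Number of files per platform, similar to the counts shown next to the filters.
files_per_platform = Counter(row["Platform"] for row in rows)

# Cloud Storage paths for a single platform.
platform_a_paths = [
    row["Cloud Storage Location"] for row in rows if row["Platform"] == "PlatformA"
]
print(files_per_platform, platform_a_paths)
```

To use a real download, replace `io.StringIO(EXAMPLE_CSV)` with an opened file, for example `open("file_list.csv", newline="")`, and adjust the column names to match the actual header row.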

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with; the last cohort you created is chosen automatically. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization with the default plot, an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox overrides any feature selected under Color By Feature. The plot will instead be colored by cohort, with the selected cohorts shown in the legend.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes to it.
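The two permission levels can be summarized as a small model in Python. This is an illustrative sketch only, not the web-app's implementation: the class names, the action names, and the user names are all hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Actions allowed at each permission level, following the list above.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "share"},
    Permission.READER: {"view", "comment", "copy"},
}


@dataclass
class SharedItem:
    """A cohort or visualization with one owner and any number of readers."""
    name: str
    owner: str
    readers: set = field(default_factory=set)

    def permission_for(self, user):
        if user == self.owner:
            return Permission.OWNER
        if user in self.readers:
            return Permission.READER
        return None  # private by default: no access for anyone else

    def can(self, user, action):
        permission = self.permission_for(user)
        return permission is not None and action in ALLOWED_ACTIONS[permission]

    def copy_for(self, user):
        """Clone the item; the user who makes the copy becomes its owner."""
        if not self.can(user, "copy"):
            raise PermissionError(f"{user} cannot copy {self.name}")
        return SharedItem(name=f"Copy of {self.name}", owner=user)


cohort = SharedItem(name="My cohort", owner="alice", readers={"bob"})
bobs_copy = cohort.copy_for("bob")
print(cohort.can("bob", "edit"), bobs_copy.can("bob", "edit"))  # False True
```

In this model the copy starts with no readers, reflecting that a copy belongs only to the user who made it until that user shares it.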

1.4.8 Release Notes

• December 23, 2015: v0.2
  – Treemap graphs on the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.
• December 3, 2015: v0.1
  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
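The same tables can also be queried from Python. The sketch below assumes the present-day `google-cloud-bigquery` client library and standard SQL, neither of which is named in this documentation. The dataset and table names are hypothetical placeholders, so substitute names that you see under the “isb-cgc” project in the BigQuery web interface. The billing project must be a Google Cloud Project that you are a member of.

```python
PROJECT_ID = "isb-cgc"       # project that hosts the public tables (from the text above)
DATASET = "example_dataset"  # hypothetical placeholder, not a real dataset name
TABLE = "example_table"      # hypothetical placeholder, not a real table name


def build_query(dataset, table, limit=10):
    """Build a standard-SQL query that previews a few rows of a table."""
    return f"SELECT * FROM `{PROJECT_ID}.{dataset}.{table}` LIMIT {int(limit)}"


def run_query(sql, billing_project):
    """Run the query and return the rows as a list of dictionaries.

    The import is done here so that the query can be built and inspected
    even when the google-cloud-bigquery package is not installed.
    """
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


sql = build_query(DATASET, TABLE, limit=5)
print(sql)
# rows = run_query(sql, billing_project="your-own-gcp-project")  # needs credentials
```

Queries are billed to the project passed as `billing_project`, not to the “isb-cgc” project that hosts the data.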

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, providing starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

Our goal is to help you assemble the tools and data (TCGA data your data reference data etc) that you need to answeryour research questions in the most efficient and cost-effective way possible

Towards that end we have created a github repository called examples-Compute with examples to get you started Thisrepository will continue to grow and we welcome your contributions and suggestions You can also find a number ofuseful recipes in the Google Genomics Cookbook also here on readthedocs

For an introduction to using Google Compute Engine please follow the link below

Introduction to Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component of Google Cloud Platform (GCP)GCE offers scale performance and value letting you easily create and run virtual machines (VMs) on Google infras-tructure

We have tried to put together some basic documentation for ISB-CGC users who are new to the Google Cloud Platformbut your main source of information should generally be the official Google Cloud Platform documentation We havefound that sometimes the wealth of available information can result in information overload so we hope that this briefintroduction will be useful to you If you are still feeling lost please let us know and wersquoll do our best to get youpointed in the right direction

Setting up your GCP project

This setup guide assumes that you are already a member of a GCP project with either "Owner" or "Editor" rights. If you need a GCP project, you may request one as part of the ISB-CGC community evaluation phase going on now.

Google Developers Console: If you are new to the Google Cloud, it is a good idea to become familiar with the Developers Console (which we will generally refer to simply as the Console). You can get help from within the Console by clicking on the Help (question mark) icon near the upper right-hand corner. The Console provides a convenient web UI for managing resources within your cloud project and can be useful for obtaining a quick, high-level snapshot of the state of your project. The "Home" page will list, for example, the number of buckets you have created in Cloud Storage, the number of datasets in BigQuery, and the number of VMs you have running under App Engine or Compute Engine. It also shows the charges incurred by this project so far this month.

Enable the Compute Engine API: The Compute Engine API is probably enabled by default on your GCP project, but you can verify this through the Console: click on the menu icon in the upper left-hand corner (when you hover over it you will see "Products and services") and then select the API Manager. The API Manager page has two sections: Overview and Credentials. Within the Overview page you can see a list of all "Google APIs" and a list of the "Enabled APIs".

You can check your list of "Enabled APIs" or simply select the "Compute Engine API" link, which should be at the very top of the list of "Popular APIs". Once you are on the "Google Compute Engine" page, you should see either a blue button with the word "Enable" or a white "Disable" button. If the button says Enable, click on it. This process will take a minute or two, after which you will be prompted to "Go to Credentials". You should not need to create new credentials at this time; you will typically be using Application Default Credentials. (This blog post introducing Application Default Credentials may also be helpful.) The proper use of credentials is frequently one of the most complicated aspects of interacting with the Google Cloud Platform. If you are having problems, please let us know.

You may also find the official Compute Engine Getting Started Guide helpful.

Google Cloud SDK: Depending on how you choose to interact with the Google Cloud Platform, you may want to install the Google Cloud SDK on your local workstation. The Google Cloud SDK is a set of command-line interface (CLI) tools that you can use to manage resources and applications hosted on GCP. (Note that components of the SDK are updated quite frequently. You will be notified when updates are available anytime you use one of the SDK tools. The command will still run, but you will be notified that "Updates are available for some Cloud SDK components", and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

    Google Cloud SDK 98.0.0

    bq 2.0.18
    bq-nix 2.0.18
    core 2016.02.22
    core-nix 2016.02.05
    gcloud
    gsutil 4.16
    gsutil-nix 4.15

Google Cloud Shell: Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user, and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar, near the right-hand corner, between your GCP project name and the "Send feedback" icon. If you click on that icon (the hover-card should read "Activate Google Cloud Shell"), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying "Welcome to Cloud Shell" in the new window that has appeared at the bottom of your Console page. You can "pop" that window out of your browser page by clicking on the "Open in new window" icon in the upper right-hand corner of the shell window.

Authenticate with Google: Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on "where" you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init launches an interactive Getting Started workflow for gcloud

• gcloud auth login obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click "Allow" to allow Google to access certain information about you; they may also ask that you cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
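If you prefer to script your configuration, the gcloud config set calls for the properties listed above can be assembled programmatically. The following is only a minimal Python sketch: the helper name and the project and zone values are hypothetical placeholders, and the commands are printed rather than executed.

```python
import shlex


def config_set_command(prop, value):
    """Return the argument list for `gcloud config set PROP VALUE`."""
    return ["gcloud", "config", "set", prop, value]


# Hypothetical values: substitute your own project ID and preferred region/zone.
settings = {
    "project": "my-gcp-project",
    "compute/region": "us-central1",
    "compute/zone": "us-central1-a",
}

commands = [config_set_command(prop, value) for prop, value in settings.items()]

# Print the commands for review; nothing is executed here.
for cmd in commands:
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be handed to subprocess.run once you are happy with the values.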

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console: After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP", and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you

• Zone: choose one of the us-east or us-central zones

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page, with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI: The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
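If you launch VMs from a script rather than by hand, the same command can be built as an argument list. This is a hedged Python sketch, not part of the ISB-CGC tooling: the function name is hypothetical, and only the --machine-type and --zone options mentioned in this section are handled.

```python
import shlex


def create_instance_command(name, machine_type=None, zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    cmd = ["gcloud", "compute", "instances", "create", name]
    if machine_type:
        cmd += ["--machine-type", machine_type]
    if zone:
        # Omit the zone to fall back on your compute/zone configuration property.
        cmd += ["--zone", zone]
    return cmd


cmd = create_instance_command(
    "my-instance", machine_type="g1-small", zone="us-central1-a"
)

# Printed for inspection only; running it would create a billable VM.
print(shlex.join(cmd))
```

Passing cmd to subprocess.run(cmd, check=True) would actually create the instance, so review the printed command first.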

Accessing your new VM: Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name

Shutting down your VM: Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
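The two disk prices quoted above are consistent with a flat rate of 4 cents per GB per month. The short Python sketch below makes that arithmetic explicit; the rate is inferred from the figures in this section and is an assumption, not an authoritative or current price.

```python
# Assumption: 4 cents per GB per month, inferred from "$2 for 50 GB" above.
# Check the official pricing page before relying on this number.
STANDARD_PD_CENTS_PER_GB_MONTH = 4


def monthly_disk_cost_usd(size_gb, cents_per_gb=STANDARD_PD_CENTS_PER_GB_MONTH):
    """Estimate the monthly cost in dollars of a persistent disk of size_gb."""
    return size_gb * cents_per_gb / 100


small = monthly_disk_cost_usd(50)
large = monthly_disk_cost_usd(1000)  # treating 1 TB as 1000 GB

print(small)  # 2.0
print(large)  # 40.0
```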

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g., the type will be pd-standard).

Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. The instance name is given as the positional argument, and --device-name sets the name under which the disk appears inside the instance. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount point. On the instance, an attached disk is listed under /dev/disk/by-id/ as google- followed by its device name; the commands below assume the device name is disk-1.

    sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
    sudo mkdir /mnt/pd1
    sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

    $ sudo umount /dev/disk/by-id/google-disk-1
    $ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
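To keep the order of the steps in this section straight, it can help to lay out the gcloud commands that are run from your workstation as a single sequence. The Python sketch below only builds and prints them; the helper name is hypothetical, and the formatting, mounting, and unmounting steps that happen on the instance itself are represented by a comment.

```python
import shlex


def disk_lifecycle_commands(disk, instance, size="500GB"):
    """Return, in order, the workstation-side gcloud commands for one disk."""
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", size],
        ["gcloud", "compute", "instances", "attach-disk", instance, "--disk", disk],
        # Between these steps: ssh to the instance, format and mount the disk,
        # do your work, then unmount the disk and log out.
        ["gcloud", "compute", "instances", "detach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


steps = disk_lifecycle_commands("disk-1", "my-instance")

# Print only; none of these commands are executed by this sketch.
for cmd in steps:
    print(shlex.join(cmd))
```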

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype, and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
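The three set operations described above behave like ordinary set algebra over sample identifiers. The following Python sketch illustrates the semantics only; it is not the web application's implementation, and the function names and sample IDs are made up for the example.

```python
def union(*cohorts):
    """Samples that appear in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort (order does not matter)."""
    return set(first).intersection(*others)


def complement(base, *subtracted):
    """Samples in the base cohort that are in none of the subtracted cohorts."""
    return set(base).difference(*subtracted)


# Hypothetical cohorts of sample identifiers.
a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
c = {"S3", "S5"}

print(sorted(union(a, b, c)))      # ['S1', 'S2', 'S3', 'S4', 'S5']
print(sorted(intersect(a, b, c)))  # ['S3']
print(sorted(complement(a, b)))    # ['S1']
print(sorted(complement(b, a)))    # ['S4']
```

The last two lines show why the complement needs a designated base cohort: swapping the base and the subtracted cohort gives a different result.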

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists: one for cohorts and one for visualizations. This search only performs string matching against the names of cohorts and visualizations.
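As a rough illustration of a name-only search that returns two result lists, here is a small Python sketch. It assumes a case-insensitive substring match, which the text above does not specify, and the function name and example names are hypothetical.

```python
def search_by_name(query, cohort_names, visualization_names):
    """Return names containing the query, split into the two result lists."""
    q = query.lower()  # assumption: matching ignores case
    return {
        "cohorts": [n for n in cohort_names if q in n.lower()],
        "visualizations": [n for n in visualization_names if q in n.lower()],
    }


# Hypothetical cohort and visualization names.
results = search_by_name(
    "brca",
    ["BRCA female", "LUAD smokers", "brca stage II"],
    ["Age histogram", "BRCA expression scatter"],
)

print(results["cohorts"])         # ['BRCA female', 'brca stage II']
print(results["visualizations"])  # ['BRCA expression scatter']
```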

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
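The counts shown beside each filter value can be thought of as follows: for a given attribute, count the samples that pass every other selected filter, grouped by that attribute's value. The Python sketch below illustrates this idea with made-up records; the field names and data are hypothetical, and this is not the web application's actual code.

```python
from collections import Counter


def facet_counts(samples, attribute, selected_filters):
    """Count values of `attribute` among samples passing all other filters."""
    others = {k: v for k, v in selected_filters.items() if k != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample.get(k) in allowed for k, allowed in others.items()):
            counts[sample.get(attribute)] += 1
    return counts


# Hypothetical sample records.
samples = [
    {"sample": "S1", "vital_status": "Alive", "gender": "FEMALE"},
    {"sample": "S2", "vital_status": "Dead", "gender": "FEMALE"},
    {"sample": "S3", "vital_status": "Alive", "gender": "MALE"},
    {"sample": "S4", "vital_status": None, "gender": "FEMALE"},
]
filters = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}

vital_counts = facet_counts(samples, "vital_status", filters)
gender_counts = facet_counts(samples, "gender", filters)

print(dict(vital_counts))   # {'Alive': 1, 'Dead': 1, None: 1}
print(dict(gender_counts))  # {'FEMALE': 1, 'MALE': 1}
```

Note that the Vital Status counts ignore the Vital Status filter itself, which is why "Dead" still shows a count even though only "Alive" is selected.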

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. Hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
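The number shown when hovering between two vertical bars is, in effect, a count of samples for each pair of platforms on the adjacent data types. A minimal Python sketch of that tally follows; the platform names and records are illustrative assumptions, not data taken from the application.

```python
from collections import Counter


def ribbon_counts(samples, left_type, right_type):
    """Count samples per (left platform, right platform) pair, using 'NA'
    when a sample has no data of that type."""
    return Counter(
        (s.get(left_type) or "NA", s.get(right_type) or "NA") for s in samples
    )


# Hypothetical per-sample platform assignments.
samples = [
    {"DNA Methylation": "HumanMethylation450", "RNA-Sequence": "IlluminaHiSeq"},
    {"DNA Methylation": "HumanMethylation450", "RNA-Sequence": "IlluminaHiSeq"},
    {"DNA Methylation": "HumanMethylation27", "RNA-Sequence": None},
    {"RNA-Sequence": "IlluminaGA"},
]

counts = ribbon_counts(samples, "DNA Methylation", "RNA-Sequence")
for pair, n in sorted(counts.items()):
    print(pair, n)
```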

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears, and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These vary because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
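To see why the sample and participant counts can differ, consider TCGA-style barcodes, where the first three dash-separated fields of a sample barcode identify the participant. The Python sketch below uses made-up barcodes and assumes that barcode layout; it is not how the web application itself computes the counts.

```python
def participant_barcode(sample_barcode):
    """Return the participant portion (first three dash-separated fields)."""
    return "-".join(sample_barcode.split("-")[:3])


# Illustrative barcodes: two samples from one participant, one from another.
sample_barcodes = ["TCGA-02-0001-01C", "TCGA-02-0001-10A", "TCGA-02-0003-01A"]
participants = {participant_barcode(b) for b in sample_barcodes}

print(len(sample_barcodes))  # 3 samples
print(len(participants))     # 2 participants
```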

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
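Once downloaded, the file list can be processed with any CSV reader. The following Python sketch groups Cloud Storage paths by platform. The exact header names, the pipeline and platform values, and the gs:// paths are assumptions made for illustration; check the header row of your own download and adjust the column names accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file contents standing in for a downloaded file list.
csv_text = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-02-0001-01C,IlluminaHiSeq,pipeline-a,Level 3,gs://example-bucket/path/file1.txt
TCGA-02-0003-01A,HumanMethylation450,pipeline-b,Level 3,gs://example-bucket/path/file2.txt
TCGA-02-0003-01A,IlluminaHiSeq,pipeline-a,Level 3,gs://example-bucket/path/file3.txt
"""


def paths_by_platform(handle):
    """Group the 'File Path' column by the 'Platform' column."""
    grouped = defaultdict(list)
    for row in csv.DictReader(handle):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


grouped = paths_by_platform(io.StringIO(csv_text))
for platform, paths in grouped.items():
    print(platform, len(paths))
```

To read a real download, replace io.StringIO(csv_text) with open("your_file_list.csv", newline=""), where the file name is whatever you saved the list as.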

Commenting: Any user who owns a cohort or has had it shared with them can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

On the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy select the ldquoMake A Copyrdquo item from the top right menu

This will take you to your copy of the cohort

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

bull Clinical

ndash Single autocomplete textbox This input searches through the names of the all the clinical featuresavailable

bull Gene Expression

ndash Gene Filter filters down the plot-able features by a specific gene This is an autocomplete search fieldfor a gene name

ndash Platform Filter filters down the plot-able features by platforms

ndash Center Filter filters down the plot-able features by processing center

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platform.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platform.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the selected cohorts both as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will become active and you can proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs on the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
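Because access lasts for 24 hours after a successful NIH authentication, a long-running script may want to check how much of that window remains before starting a large job. The sketch below is purely illustrative: the function names and the idea of recording the authentication time yourself are assumptions, not part of any ISB-CGC API.

```python
from datetime import datetime, timedelta, timezone

# The 24-hour access window described in the answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expiry(authenticated_at: datetime) -> datetime:
    """Return the time at which controlled-access data access lapses."""
    return authenticated_at + ACCESS_WINDOW


def needs_reauthentication(authenticated_at: datetime, now: datetime) -> bool:
    """True once the 24-hour window has fully elapsed."""
    return now >= access_expiry(authenticated_at)


# Hypothetical example: authenticated at 09:00 UTC on 3 March 2016.
auth_time = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
print(access_expiry(auth_time))
```

A script could call needs_reauthentication before each batch of work and stop cleanly, rather than failing part-way through, when the window has closed.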

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

the SDK are updated quite frequently. You will be notified when updates are available any time you use one of the SDK tools. The command will still run, but you will be notified that “Updates are available for some Cloud SDK components” and you will be given instructions on how to update your local copy of the SDK.)

Confirm that you have installed the SDK and have access to it by typing gcloud --version at the command line of your own Linux workstation or from the Cloud Shell (for more details about the Cloud Shell, see the next section). You should see something like this:

Google Cloud SDK 98.0.0

bq 2.0.18
bq-nix 2.0.18
core 2016.02.22
core-nix 2016.02.05
gcloud
gsutil 4.16
gsutil-nix 4.15

Google Cloud Shell

Google Cloud Shell provides you with command-line access to computing resources hosted on GCP and is available from the Console. Cloud Shell provides you with a temporary VM running a Debian-based Linux OS, with 5 GB of persistent disk storage per user and the Google Cloud SDK and other tools pre-installed.

From the Console, you will find the icon for the Cloud Shell in the top-most blue bar near the right-hand corner, between your GCP project name and the “Send feedback” icon. If you click on that icon (the hover-card should read “Activate Google Cloud Shell”), it will take a minute or two for your VM to be provisioned, after which you will see a prompt saying “Welcome to Cloud Shell” in the new window that has appeared at the bottom of your Console page. You can “pop” that window out of your browser page by clicking on the “Open in new window” icon in the upper right-hand corner of the shell window.

Authenticate with Google

Regardless of how you choose to interact with the Google Cloud, you will need to authenticate yourself. How this authentication takes place will depend on “where” you are. If you have signed into Chrome using your Google identity and you then go to the Console, you will already have been authenticated. If you are at the Linux prompt of the Cloud Shell, you have also already been authenticated, because that Shell (and that VM) were launched for you from your Console. If you are at the Linux prompt of your local workstation, you will need to authenticate using the gcloud command-line utility.

There are two approaches:

• gcloud init: launches an interactive Getting Started workflow for gcloud

• gcloud auth login: obtains access credentials for your user account via a web-based authorization flow

These approaches may ask you to cut-and-paste a long URL into a browser, sign in using your Google credentials, and click “Allow” to allow Google to access certain information about you; they may also ask you to cut-and-paste an authorization token from your browser back into the Linux shell.

Once you have authenticated, you can see information about your current configuration by typing gcloud config list. You can set additional properties using the gcloud config set command. The most common properties you are likely to want to verify (list) or set explicitly are:

• account

• project

• compute/region

• compute/zone

• container/cluster
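As a rough illustration of how these property names are organized, the sketch below parses INI-style text of the kind that gcloud config list prints. The sample text, the project ID and e-mail address in it, and the helper name flatten_properties are all invented for illustration; only the section/property spelling (for example compute/zone) mirrors the names accepted by gcloud config set.

```python
import configparser

# Hypothetical sample of `gcloud config list` output (all values are invented).
SAMPLE_OUTPUT = """\
[compute]
region = us-central1
zone = us-central1-a
[core]
account = researcher@example.org
project = my-gcp-project
"""


def flatten_properties(text: str) -> dict:
    """Map INI sections to the section/property names used by `gcloud config set`."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    props = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            # Properties in the [core] section are written without a prefix.
            name = key if section == "core" else f"{section}/{key}"
            props[name] = value
    return props


for name, value in sorted(flatten_properties(SAMPLE_OUTPUT).items()):
    print(f"{name} = {value}")
```

In a real script, the text would come from running gcloud config list yourself; here it is a fixed string so that the example runs without the SDK installed.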

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose “Compute Engine” from the flyout panel.) The first time, you may need to wait a minute or so while “Compute Engine is getting ready”.

You will now be on the “VM instances” page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page, you will see two options: “Create Instance” or “Take the quickstart”. After the first time, you may see a different page with a list of existing (running or stopped) VMs and a CPU utilization graph. At the top of this page you will see options to “CREATE INSTANCE”, “CREATE INSTANCE GROUP”, “RESET”, “START”, “STOP”, and “DELETE” VM instances.

After selecting the “Create Instance” option, you will be sent to the “Create an instance” page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you.

• Zone: choose one of the us-east or us-central zones.

• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the “Customize” view if you prefer a more graphical approach). Note that as you change the specifications of the VM, the estimated cost shown on this page will update.

• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish. The “Change” button opens a flyout panel where you can choose from a variety of preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks. You can also choose between “standard disks” and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB).

Other options below the “Management, disk, ...” line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to “migrate VM” so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the “VM instances” page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of each instance, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
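If you launch VMs from scripts, it can help to assemble the command as an argument list before handing it to a subprocess. The following is a minimal sketch, assuming gcloud is installed and on your PATH; the helper name create_instance_command is hypothetical, and the call that would actually create (and start billing for) a VM is left commented out.

```python
import shlex
import subprocess  # only needed if you uncomment the call at the bottom


def create_instance_command(name, machine_type="g1-small", zone=None):
    """Build the argument list for `gcloud compute instances create`."""
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    if zone is not None:
        # Without --zone, gcloud falls back to the compute/zone property
        # (or prompts for a zone if none is set).
        cmd += ["--zone", zone]
    return cmd


cmd = create_instance_command("my-instance", zone="us-central1-a")
print(shlex.join(cmd))
# subprocess.run(cmd, check=True)  # uncomment to really create the VM
```

Keeping the command as a list rather than a single string avoids shell-quoting problems if an instance name or zone ever comes from user input.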

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.
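The two prices quoted above both work out to $0.04 per GB per month, so a quick estimate for other disk sizes is simple arithmetic. The rate below is inferred from those two figures (treating 1 TB as 1000 GB) and is therefore an assumption; check current Google pricing before budgeting.

```python
# Rate implied by the figures in the text: $2 / 50 GB = $0.04 per GB-month.
STANDARD_PD_RATE_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=STANDARD_PD_RATE_USD_PER_GB_MONTH):
    """Estimated monthly cost (USD) of a standard persistent disk."""
    return round(size_gb * rate, 2)


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB -> ${monthly_disk_cost(size_gb):.2f} per month")
```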

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example,

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you’ve attached to an instance, you first need to log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point. Attached disks appear under /dev/disk/by-id/ with a google- prefix on the device name, so the commands below assume the disk was attached with the device name disk-1:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button becomes selectable once at least one cohort is selected. Click on the "Trash" button to delete the selected cohorts. The same steps apply to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button becomes selectable once at least one cohort is selected. Click on the "Share" button and a dialogue box will appear. In this dialogue box you can remove any cohorts you no longer want to share, and select one or more users to share with from the list of users already registered in the system. Click the "Share Cohort" button when you are done. The same steps apply to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the "Set Operations" button you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
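The three set operations behave like ordinary set algebra over sample identifiers. The following is a minimal sketch only, assuming each cohort can be represented as a Python set of sample barcodes; the barcodes and the helper names (union, intersect, complement) are hypothetical and are not part of the web-app.

```python
from functools import reduce


def union(*cohorts):
    """Samples that appear in at least one cohort; order does not matter."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort; order does not matter."""
    return set(reduce(set.intersection, cohorts)) if cohorts else set()


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, used only for illustration.
cohort_a = {"TCGA-AA-0001-01A", "TCGA-AA-0002-01A", "TCGA-AA-0003-01A"}
cohort_b = {"TCGA-AA-0002-01A", "TCGA-AA-0003-01A", "TCGA-AA-0004-01A"}
cohort_c = {"TCGA-AA-0003-01A"}

print(sorted(union(cohort_a, cohort_b)))
print(sorted(intersect(cohort_a, cohort_b, cohort_c)))
print(sorted(complement(cohort_a, cohort_c)))
```

Unlike union and intersect, complement depends on which cohort is chosen as the base: complement(cohort_a, cohort_b) and complement(cohort_b, cohort_a) give different results.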

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, use the search bar at the top of the page. This produces a results page with two lists: one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it indicates whether that column is sorted in ascending or descending order.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field expands and provides additional filtering options. For example, when you click on "Vital Status", it expands and provides "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
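The way the per-value counts relate to the selected filters can be illustrated with a small sketch. It assumes that values selected within one attribute are combined with OR, that different attributes are combined with AND, and that the count shown for an attribute ignores that attribute's own filter. The sample records, field names, and function names are hypothetical, not the web-app's actual data model.

```python
from collections import Counter


def matches(sample, filters, skip=None):
    """True if the sample passes every selected filter except the skipped attribute."""
    return all(
        sample.get(attr) in allowed
        for attr, allowed in filters.items()
        if attr != skip
    )


def value_counts(samples, filters, attr):
    """Count samples per value of `attr`, applying all *other* selected filters."""
    return Counter(s.get(attr) for s in samples if matches(s, filters, skip=attr))


# Hypothetical sample records, used only for illustration.
samples = [
    {"barcode": "S1", "vital_status": "Alive", "gender": "FEMALE"},
    {"barcode": "S2", "vital_status": "Dead", "gender": "FEMALE"},
    {"barcode": "S3", "vital_status": "Alive", "gender": "MALE"},
    {"barcode": "S4", "vital_status": None, "gender": "MALE"},
]
selected = {"vital_status": {"Alive"}, "gender": {"FEMALE"}}

# Samples in the filtered cohort.
cohort = [s["barcode"] for s in samples if matches(s, selected)]
print(cohort)

# Counts shown beside the Vital Status values: only the Gender filter applies.
print(dict(value_counts(samples, selected, "vital_status")))
```

Skipping an attribute's own filter when counting its values is what lets the remaining options keep meaningful counts after one of them has been selected.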

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
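The number shown when hovering over a swatch can be thought of as a count of samples per pair of platforms on two adjacent bars. The sketch below illustrates that idea under stated assumptions: the sample identifiers, platform names, and the swatch_counts helper are hypothetical, and this is not how the web-app itself computes the graph.

```python
from collections import Counter

# Hypothetical per-sample platform assignments; "NA" marks a missing data type.
availability = {
    "S1": {"DNA-Sequence": "PlatformX", "RNA-Sequence": "PlatformY"},
    "S2": {"DNA-Sequence": "PlatformX", "RNA-Sequence": "NA"},
    "S3": {"DNA-Sequence": "PlatformX", "RNA-Sequence": "PlatformY"},
}


def swatch_counts(availability, left, right):
    """Count samples for each (left platform, right platform) pair of two bars."""
    return Counter(
        (row.get(left, "NA"), row.get(right, "NA"))
        for row in availability.values()
    )


counts = swatch_counts(availability, "DNA-Sequence", "RNA-Sequence")
for pair, n in sorted(counts.items()):
    print(pair, n)
```

Reordering the vertical bars in the panel corresponds to choosing a different pair of data types for left and right.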

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the "Set Operations" button you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and participants, and make you the owner of the copy.

• Share with Others: This behaves the same way as on the User Dashboard page. A dialogue box appears and you are prompted to select users registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
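The difference between the two counts can be seen by grouping sample barcodes by participant. This sketch relies on the TCGA barcode convention in which the first three hyphen-separated fields of a sample barcode identify the participant; the barcodes themselves and the participant_of helper are hypothetical examples.

```python
def participant_of(sample_barcode):
    """Return the participant portion (first three fields) of a TCGA sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])


# Hypothetical sample barcodes; two of them come from the same participant.
sample_barcodes = [
    "TCGA-AA-0001-01A",  # tumor sample
    "TCGA-AA-0001-10A",  # normal sample from the same participant
    "TCGA-AA-0002-01A",
]

participants = {participant_of(b) for b in sample_barcodes}
print("samples:", len(sample_barcodes))
print("participants:", len(participants))
```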

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list accordingly. The number next to each platform is the number of files available for that platform. The only menu item available is "Download File List as CSV". Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
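Once downloaded, a file list of this kind can be processed with a few lines of Python. The sketch below is illustrative only: it assumes the CSV has a header row whose column names match the field names listed above, which may not be exactly how the real file is laid out, and the platform names, pipeline names, and gs:// paths are made up.

```python
import csv
import io
from collections import defaultdict

# Hypothetical file-list contents; the real header names and paths may differ.
FILE_LIST_CSV = """\
Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/a1.txt
TCGA-AA-0001-01A,PlatformB,pipeline-2,Level 3,gs://example-bucket/b1.txt
TCGA-AA-0002-01A,PlatformA,pipeline-1,Level 3,gs://example-bucket/a2.txt
"""


def paths_by_platform(csv_text):
    """Group Cloud Storage paths from a file-list CSV by platform."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


for platform, paths in sorted(paths_by_platform(FILE_LIST_CSV).items()):
    print(platform, len(paths), paths)
```

To read a real download, the same function could be given the text of the saved file instead of the inline string.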

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top-right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top-right corner of the plot panel will be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top-right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top-right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top-right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top-right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top-right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top-right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top-right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot:

• Clinical

– Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Platform Filter: filters down the plot-able features by platform.

– Center Filter: filters down the plot-able features by processing center.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

– miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

– Platform Filter: filters down the plot-able features by platform.

– Value Filter: filters down the plot-able features by value.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

– Platform Filter: filters down the plot-able features by platform.

– Gene Region Filter: filters down the plot-able features by specific gene regions.

– CpG Island Region Filter: filters down the plot-able features by CpG island region.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Value Filter: filters down the plot-able features by value.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

– Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

– Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

– Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the list of currently selected cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button becomes active and you can proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

– You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

– You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

– You may make a copy ("clone") of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
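The two permission levels can be summarized as a small lookup table. This is only a sketch of the rules described above: the action names, the ALLOWED_ACTIONS table, and the can and make_copy helpers are illustrative assumptions and do not reflect how the web-app actually stores or checks permissions.

```python
# Permission levels and the actions each one allows, following the list above.
# The action names are illustrative labels, not identifiers used by the web-app.
ALLOWED_ACTIONS = {
    "owner": {"view", "comment", "copy", "edit", "share"},
    "reader": {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if the given permission level allows the action."""
    return action in ALLOWED_ACTIONS.get(permission, set())


def make_copy(item, new_owner):
    """Copying yields an independent item owned by whoever made the copy."""
    return {"name": item["name"], "samples": set(item["samples"]), "owner": new_owner}


# Hypothetical cohort shared by one user with another.
shared = {"name": "my cohort", "samples": {"S1", "S2"}, "owner": "alice"}

print(can("reader", "edit"))   # a reader cannot change a shared cohort
clone = make_copy(shared, "bob")
print(clone["owner"], can("owner", "edit"))
```

The copy holds its own set of samples, so later changes to it do not affect the cohort that was originally shared.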


1.4.8 Release Notes

• December 23, 2015: v0.2

– Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

– Cohort file list download not working.

• December 3, 2015: v0.1

– First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just "sign in" to the web-app using your Google identity. (Please be patient: we're working on a major revision to the web-app right now and will let you know when it's ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your "user credentials"). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the "high-level" molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery "view", click on the blue arrow next to your username in the left side-bar, select "Switch to Project", then "Display Project", and enter "isb-cgc" (without quotes) in the text box labeled "Project ID". All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a "data downloader" to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and "unlink" your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.

1.5.3 Python Users

I want to write python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For the programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

Launching a Virtual Machine (VM)

You can launch a virtual machine (which we will generally refer to as a VM) from the Console or from the command line using the Google Cloud SDK. We will describe both of these approaches here.

You should already be somewhat familiar with the Console, and hopefully you have tried invoking the gcloud command from your command line. The gcloud command-line tool can be used to manage both your development workflow and your GCP resources. (For more details, please look at the official gcloud Tool Guide.)

Bundled into the gcloud CLI are several commands and groups of sub-commands. The group of sub-commands that allows you to read and manipulate GCE resources is gcloud compute.

Launch a VM using the Console

After you have enabled the Compute Engine API for your project, you can go to the Compute Engine section of the Console. (Select the menu icon in the far upper-left corner and then choose "Compute Engine" from the flyout panel.) The first time, you may need to wait a minute or so while "Compute Engine is getting ready".

You will now be on the "VM instances" page. (There are many other pages that are accessible from the left side-panel.) The first time you visit this page you will see two options: "Create Instance" or "Take the quickstart". After the first time, you may see a different page with a list of existing (running or stopped) VMs, with a CPU utilization graph. At the top of this page you will see options to "CREATE INSTANCE", "CREATE INSTANCE GROUP", "RESET", "START", "STOP" and "DELETE" VM instances.

After selecting the "Create Instance" option, you will be sent to the "Create an instance" page, where defaults will be selected for the Name, Zone, Machine type, etc.

• Name: this name is relatively arbitrary; choose something that is meaningful to you
• Zone: choose one of the us-east or us-central zones
• Machine type: you can specify a VM with anywhere between 1 and 16 cores (aka vCPUs) and with up to 100 GB of RAM (you can try the "Customize" view if you prefer a more graphical approach); note that as you change the specifications of the VM, the estimated cost shown on this page will update
• Boot disk: the default boot disk and OS will be shown, but you can change this as you wish; the "Change" button will result in a flyout panel where you can choose from a variety of Preconfigured images (Debian, CentOS, Ubuntu, RedHat, etc.) or previously created images or disks; you can also choose between "standard disks" and faster (and more expensive) solid-state drives (SSDs), and specify the size of the disk (up to 64 TB)

Other options below the "Management, disk, ..." line include Preemptibility (default is OFF), Automatic restart (default is ON), and what to do during infrastructure maintenance (default is to "migrate VM" so that you will not experience any downtime).

Once you have all of the options set, you can click on the blue Create button. You can also see how you could use the REST or command-line interfaces to perform the exact same operation. (The Console is just a friendlier interface between you and more direct REST-based access to the same functionality.)

Creating the VM should take less than a minute, after which you will see it listed on the "VM instances" page with the Name, Zone, Disk, Network, and External IP address shown. There is also an SSH button that you can use directly from the Console.

Launch a VM using the CLI

The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don't want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
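If you want to launch VMs from a script, the create command can be assembled as an argument list first. The following is only a minimal Python sketch and is not part of the ISB-CGC tooling: the helper name build_create_command is hypothetical, and it relies on nothing beyond the --machine-type and --zone flags of gcloud compute instances create. It builds the command without running it.

```python
import shlex


def build_create_command(name, machine_type="g1-small", zone=None):
    """Return the gcloud argument list that would create one VM instance.

    Hypothetical helper for illustration; it only assembles the command.
    """
    cmd = ["gcloud", "compute", "instances", "create", name,
           "--machine-type", machine_type]
    if zone is not None:
        # Leave --zone out to fall back on the compute/zone config property.
        cmd += ["--zone", zone]
    return cmd


if __name__ == "__main__":
    cmd = build_create_command("my-instance", zone="us-central1-a")
    # Print a copy-pasteable shell line; nothing is created here.
    print(shlex.join(cmd))
```

To actually create the instance, the list could be passed to subprocess.run(cmd, check=True) on a machine where the Google Cloud SDK is installed and configured.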

Accessing your new VM

Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.
• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM

Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note that resources that are attached to a stopped VM (such as persistent disks) will, however, continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk only costs $2 per month, and 1 TB costs $40.

If you know that you won't ever need this specific VM again, or you don't want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.
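The two disk prices quoted above both work out to about $0.04 per GB per month if 1 TB is counted as 1000 GB. A rough estimate for other disk sizes can be sketched as follows; the rate is an assumption derived from those two figures, not a current price, and the function name is hypothetical.

```python
# Rate implied by the figures quoted in this section ($2 for 50 GB,
# $40 for 1 TB); an assumption for illustration, not a current price.
STANDARD_PD_USD_PER_GB_MONTH = 0.04


def monthly_disk_cost(size_gb, rate=STANDARD_PD_USD_PER_GB_MONTH):
    """Estimated monthly cost in USD of a standard persistent disk."""
    if size_gb < 0:
        raise ValueError("size_gb must be non-negative")
    return round(size_gb * rate, 2)


if __name__ == "__main__":
    for size_gb in (50, 500, 1000):
        print(f"{size_gb} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```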

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk

The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you'll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named "disk-1" using default settings (e.g. the type will be pd-standard).

Attach a Persistent Disk

The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk

In order to format a disk that you've attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point. On the instance, the disk appears under /dev/disk/by-id/ as google- followed by its device name.

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk

Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk

Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to "yes" (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console, on the Compute Engine > Disks page.
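The gcloud side of the disk life cycle described in this section (create, attach, detach, delete) can be summarized in a short Python sketch. The helper name disk_lifecycle_commands is hypothetical, and it only builds the argument lists so that the order of operations is visible; the formatting, mounting and unmounting steps run on the VM itself and are deliberately left out.

```python
def disk_lifecycle_commands(disk, instance, size_gb=500):
    """Return the gcloud commands for a disk's life cycle, in order.

    Hypothetical helper: it only builds argument lists. Formatting,
    mounting and unmounting happen on the VM and are not included.
    """
    return [
        ["gcloud", "compute", "disks", "create", disk, "--size", f"{size_gb}GB"],
        ["gcloud", "compute", "instances", "attach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "instances", "detach-disk", instance, "--disk", disk],
        ["gcloud", "compute", "disks", "delete", disk],
    ]


if __name__ == "__main__":
    for cmd in disk_lifecycle_commands("disk-1", "my-instance"):
        print(" ".join(cmd))
```

Keeping the commands as lists rather than one shell string makes it straightforward to run them one at a time and stop at the first failure.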

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button should become selectable after selecting at least one cohort. Click on the "Trash" button to delete the selected cohorts. This is the same for Visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The "Share" button should become selectable after selecting at least one cohort. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the "Share Cohort" button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
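For readers who think in code, the three set operations behave like ordinary set arithmetic on sample identifiers. The following Python sketch illustrates the semantics described above; it is not the web-app's implementation, and the function names and sample identifiers are made up.

```python
from functools import reduce


def union(*cohorts):
    """Samples that appear in at least one cohort."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort."""
    if not cohorts:
        return set()
    return reduce(lambda a, b: a & b, (set(c) for c in cohorts))


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)


if __name__ == "__main__":
    # Made-up sample identifiers, for illustration only.
    a = {"S1", "S2", "S3"}
    b = {"S2", "S3", "S4"}
    print(sorted(union(a, b)))       # ['S1', 'S2', 'S3', 'S4']
    print(sorted(intersect(a, b)))   # ['S2', 'S3']
    print(sorted(complement(a, b)))  # ['S1']
```

Union and intersect give the same result whatever the order of their arguments, while complement depends on which cohort is chosen as the base.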

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only does string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column, and the header indicates whether the sort is in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span across multiple projects, only contain samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead", and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
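One way to read the counts shown beside the filter values is sketched below in Python. This is an interpretation of the description above, not the web-app's code: the counts for one attribute are computed over the samples that pass every other selected filter, so that a filter does not hide its own alternatives. The function names, attribute names and records are all made up.

```python
from collections import Counter


def matches(sample, filters):
    """True if the sample satisfies every selected filter."""
    return all(sample.get(attr) in allowed for attr, allowed in filters.items())


def value_counts(samples, attribute, selected_filters):
    """Count samples per value of one attribute, applying all *other* filters."""
    others = {a: v for a, v in selected_filters.items() if a != attribute}
    return Counter(s.get(attribute) for s in samples if matches(s, others))


if __name__ == "__main__":
    # Made-up records; attribute names and values are illustrative only.
    samples = [
        {"vital_status": "Alive", "gender": "FEMALE"},
        {"vital_status": "Dead", "gender": "FEMALE"},
        {"vital_status": "Alive", "gender": "MALE"},
    ]
    selected = {"gender": {"FEMALE"}}
    print(value_counts(samples, "vital_status", selected))
```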

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see what filters have been selected. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more tree maps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page

To activate the set operations button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may do the following things: enter a name for the new cohort you're about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.
• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the "Show More" button, you can see two more tree maps available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have data for both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The numbers next to each platform refer to the number of files available for that platform. There is only one menu item available, and that is "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
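Once downloaded, the file list can be processed with any CSV reader. The Python sketch below groups the file paths by platform. The exact header text used in the CSV is not given here, so the column names in the code are assumptions to be adjusted to the real header row, and the rows in the example are made-up placeholders.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text, platform_col="Platform",
                      path_col="Cloud Storage Location"):
    """Group file paths by platform from a downloaded file-list CSV.

    The column names are assumptions; adjust them to the real header row.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row[platform_col]].append(row[path_col])
    return dict(grouped)


if __name__ == "__main__":
    # Made-up rows; barcodes, platform names and gs:// paths are placeholders.
    example = (
        "Sample Barcode,Platform,Pipeline,Data Level,Cloud Storage Location\n"
        "SAMPLE-01,PlatformA,pipeline1,Level 3,gs://example-bucket/a.txt\n"
        "SAMPLE-02,PlatformA,pipeline1,Level 3,gs://example-bucket/b.txt\n"
        "SAMPLE-03,PlatformB,pipeline2,Level 3,gs://example-bucket/c.txt\n"
    )
    for platform, paths in paths_by_platform(example).items():
        print(platform, len(paths))
```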

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side and any previously created comments will be shown.

On the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent on the bottom.

To add a new comment, type in your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature", "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical
  – Single autocomplete textbox: this input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platforms.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: this button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: this checkbox will override any feature that is in the Color By Feature setting. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: this is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.
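The filter-then-select flow described in this section can be pictured as successive narrowing of a list of candidate features. The following is only a minimal sketch of that idea: the Feature record, its field names, the platform labels, and the sample entries are all hypothetical and are not taken from the ISB-CGC web-app.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Feature:
    """Hypothetical plot-able feature record (illustrative fields only)."""
    datatype: str
    gene: str
    platform: str
    label: str


def filter_features(features: List[Feature], datatype: str,
                    gene: Optional[str] = None,
                    platform: Optional[str] = None) -> List[Feature]:
    """Narrow the candidates by datatype, then by each filter that was set."""
    selected = [f for f in features if f.datatype == datatype]
    if gene is not None:
        selected = [f for f in selected if f.gene.upper() == gene.upper()]
    if platform is not None:
        selected = [f for f in selected if f.platform == platform]
    return selected


# Made-up catalogue entries, for illustration only.
CATALOGUE = [
    Feature("Gene Expression", "TP53", "platform-A", "TP53 expression (A)"),
    Feature("Gene Expression", "TP53", "platform-B", "TP53 expression (B)"),
    Feature("Gene Expression", "EGFR", "platform-A", "EGFR expression (A)"),
    Feature("Methylation", "TP53", "platform-C", "TP53 probe"),
]

# Equivalent of choosing the datatype and typing a gene into the Gene Filter.
choices = filter_features(CATALOGUE, "Gene Expression", gene="tp53")
for feature in choices:
    print(feature.label)
```

Each additional filter only ever shrinks the list offered under “Select Feature”, which is why the order in which the filters are applied does not matter in this sketch.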

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, select from the following options in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From the User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From the SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
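The Owner and Reader levels above can be summarized as a small table of allowed actions. The sketch below is an illustration only, not the web-app's implementation: the class and action names are hypothetical, and it assumes that an Owner can also view, comment on, and clone their own cohort, which the text above does not state explicitly.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, FrozenSet


class Role(Enum):
    OWNER = "owner"
    READER = "reader"


# Actions per role, following the Owner/Reader description above.
# (Owner view/comment/clone rights are an assumption of this sketch.)
ALLOWED = {
    Role.OWNER: frozenset({"view", "comment", "edit", "share", "clone"}),
    Role.READER: frozenset({"view", "comment", "clone"}),
}


@dataclass
class SharedCohort:
    """Hypothetical cohort record: a name, its samples, and per-user roles."""
    name: str
    samples: FrozenSet[str]
    roles: Dict[str, Role] = field(default_factory=dict)

    def can(self, user: str, action: str) -> bool:
        role = self.roles.get(user)
        return role is not None and action in ALLOWED[role]

    def share(self, owner: str, reader: str) -> None:
        if not self.can(owner, "share"):
            raise PermissionError(f"{owner} may not share {self.name}")
        # Sharing never downgrades an existing role.
        self.roles.setdefault(reader, Role.READER)

    def clone(self, user: str, new_name: str) -> "SharedCohort":
        if not self.can(user, "clone"):
            raise PermissionError(f"{user} may not clone {self.name}")
        # Only the sample list is copied; the cloner owns the copy.
        return SharedCohort(new_name, self.samples, {user: Role.OWNER})


original = SharedCohort("my-cohort", frozenset({"s1", "s2"}), {"alice": Role.OWNER})
original.share("alice", "bob")
bobs_copy = original.clone("bob", "my-cohort-copy")
print(original.can("bob", "edit"), bobs_copy.can("bob", "edit"))
```

Cloning produces an independent object owned by the person who cloned it, which mirrors the rule that a Reader must clone before making changes.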

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
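Two rules from the answers above lend themselves to a small model: an eRA Commons id can be linked to at most one Google identity at a time, and verified access lasts 24 hours. The sketch below is purely illustrative; the LinkRegistry class, its methods, and the example identities are hypothetical and say nothing about how the ISB-CGC platform stores or checks these links.

```python
from datetime import datetime, timedelta
from typing import Dict

ACCESS_WINDOW = timedelta(hours=24)  # from the FAQ: access lasts 24 hours


class LinkRegistry:
    """Toy model: one eRA Commons id maps to at most one Google identity."""

    def __init__(self) -> None:
        self._google_by_era: Dict[str, str] = {}
        self._verified_at: Dict[str, datetime] = {}

    def link(self, era_id: str, google_id: str, now: datetime) -> None:
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_id:
            # Mirrors the FAQ: unlink from the previous identity first.
            raise ValueError("eRA Commons id is already linked; unlink it first")
        self._google_by_era[era_id] = google_id
        self._verified_at[google_id] = now

    def unlink(self, era_id: str) -> None:
        google_id = self._google_by_era.pop(era_id, None)
        if google_id is not None:
            self._verified_at.pop(google_id, None)

    def has_access(self, google_id: str, now: datetime) -> bool:
        verified = self._verified_at.get(google_id)
        return verified is not None and now - verified < ACCESS_WINDOW


registry = LinkRegistry()
start = datetime(2016, 3, 3, 12, 0)
registry.link("era-id-1", "first@example.com", start)
print(registry.has_access("first@example.com", start + timedelta(hours=23)))
print(registry.has_access("first@example.com", start + timedelta(hours=25)))
```

In this model, switching to a new Google identity means calling unlink and then link again, which matches the two-step procedure described in the FAQ.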

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g., Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow, to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Launch a VM using the CLI: The command to create a new GCE VM instance is gcloud compute instances create. The complete documentation can be found online or by typing gcloud compute instances create --help on the command line.

Some defaults can be obtained (if available) from your configuration settings. For example, if you don’t want to have to specify the zone of the instances, you can set the compute/zone property:

gcloud config set compute/zone us-central1-a

A list of zones can be fetched by running:

gcloud compute zones list

Here is a very simple command to create a VM:

gcloud compute instances create my-instance --machine-type g1-small
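If you prefer to launch VMs from a script rather than by typing the command, one option is to assemble the same gcloud arguments in Python. This is a minimal sketch under stated assumptions: the helper name create_instance_command is hypothetical, it only builds and prints the command, and actually running it requires an installed and authenticated gcloud CLI.

```python
import shlex
from typing import List, Optional


def create_instance_command(name: str, machine_type: str = "g1-small",
                            zone: Optional[str] = None) -> List[str]:
    """Build the argument list for `gcloud compute instances create`."""
    args = ["gcloud", "compute", "instances", "create", name,
            "--machine-type", machine_type]
    if zone is not None:
        # Only needed when no default compute/zone property has been set.
        args += ["--zone", zone]
    return args


cmd = create_instance_command("my-instance", zone="us-central1-a")
print(" ".join(shlex.quote(arg) for arg in cmd))

# To actually launch the VM (this will incur charges):
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the arguments as a list, rather than one string, avoids shell-quoting problems if an instance name or zone ever contains unexpected characters.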

Accessing your new VM: Whether you have created your VM from the Console or using the gcloud CLI, you can find it and ssh to it, again using either the Console or the CLI.

• From the Console, go to Compute Engine > VM instances and then click on the SSH button on the far right of the row describing the specific VM you would like to connect to.

• Using the CLI, simply use the command gcloud compute ssh followed by the instance name.

Shutting down your VM: Remember that as long as your VM is running, whether or not you are actually doing anything with it, charges will be incurred. It is therefore a good idea to get in the habit of shutting down VMs as soon as you are finished with your work. They can easily be restarted an hour, day, or week later. Note, however, that resources attached to a stopped VM (such as persistent disks) will continue to incur charges. Compared to the cost of the VM, though, the cost of a persistent disk is typically negligible: a 50 GB standard persistent disk costs only $2 per month, and 1 TB costs $40.
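The two disk prices quoted above imply a flat rate of $0.04 per GB per month (taking 1 TB as 1000 GB). As a rough sketch, assuming that rate still holds and that pricing is linear in size, the monthly cost of keeping a disk attached to a stopped VM can be estimated as follows; check current Google pricing before relying on these numbers.

```python
# Rate implied by the figures quoted above ($2 for 50 GB, $40 for 1 TB).
# This is an assumption of the sketch; actual pricing may differ.
USD_PER_GB_MONTH = 2.0 / 50.0


def monthly_disk_cost(size_gb: float) -> float:
    """Estimated monthly cost in USD of a standard persistent disk."""
    return round(size_gb * USD_PER_GB_MONTH, 2)


for size_gb in (50, 500, 1000):
    print(f"{size_gb:>5} GB: ${monthly_disk_cost(size_gb):.2f} per month")
```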

If you know that you won’t ever need this specific VM again, or you don’t want to continue paying for the persistent disk, or you would rather start a fresh VM with an updated OS next time, then you can go ahead and delete the VM rather than just stopping it.

From the command line, the relevant commands are gcloud compute instances stop and gcloud compute instances delete.

Creating and Managing Persistent Disks

As described in the previous section, you can specify the boot disk when launching a VM from the Console and from the command line. There are times when you may want to create and attach additional disks to an instance. There are three main steps in this process: you must first create the disk, then you must attach it to the instance, and finally you must format it. When you are finished, you may want to detach the disk, and when you are done with it you will want to delete it. We will describe each of these steps in a bit more detail below. You may also want to see the Google documentation on Adding Persistent Disks.

Create a Persistent Disk: The gcloud command for creating a persistent disk is gcloud compute disks create. The most common options you’ll probably use are --size, --type, and --zone (see this page for more details). For example:

gcloud compute disks create disk-1 --size 500GB

will create a 500 GB disk named “disk-1” using default settings (e.g., the type will be pd-standard).

Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second you must use the mount tool to mount the disk at a specified mount-point. On the instance, the disk appears under /dev/disk/by-id/ with a google- prefix on the device name that was given when it was attached:

sudo mkfs.ext4 -F /dev/disk/by-id/google-disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/google-disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/google-disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to (as long as the auto-delete property for the disk was set to “yes” (the default) when it was created). In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.
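The create, attach, format, mount, detach, and delete steps above must happen in a fixed order, and some run on your workstation while others run on the VM. The sketch below lays out that order as data. It is an illustration only: the disk_lifecycle helper is hypothetical, and it assumes the disk is attached with --device-name equal to the disk name so that it appears on the instance as /dev/disk/by-id/google-<disk name>.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (where the command runs: "local" or "vm", command)


def disk_lifecycle(instance: str, disk: str, size: str = "500GB",
                   mount_point: str = "/mnt/pd1") -> List[Step]:
    """Return the ordered commands for one persistent disk's lifetime."""
    # Assumes --device-name is set to the disk name when attaching.
    device = f"/dev/disk/by-id/google-{disk}"
    return [
        ("local", f"gcloud compute disks create {disk} --size {size}"),
        ("local", f"gcloud compute instances attach-disk {instance} "
                  f"--disk {disk} --device-name {disk}"),
        ("local", f"gcloud compute ssh {instance}"),
        ("vm", f"sudo mkfs.ext4 -F {device}"),  # erases any data on the disk
        ("vm", f"sudo mkdir -p {mount_point}"),
        ("vm", f"sudo mount -o discard,defaults {device} {mount_point}"),
        ("vm", f"sudo umount {device}"),  # when finished, before detaching
        ("local", f"gcloud compute instances detach-disk {instance} --disk {disk}"),
        ("local", f"gcloud compute disks delete {disk}"),
    ]


steps = disk_lifecycle("my-instance", "disk-1")
for where, command in steps:
    print(f"[{where:>5}] {command}")
```

Printing the plan rather than executing it keeps the sketch safe to run; nothing here creates or deletes cloud resources.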

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button should become selectable after selecting at least one cohort. Click on the “Trash” button to delete the selected cohorts. This is the same for visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first select the boxes next to the cohorts you wish to share. The “Share” button should become selectable after selecting at least one cohort. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. You will be able to remove cohorts you no longer want to share. You will be able to select from a list of users that are already registered in the system, and you may select one or more users you wish to share with. Click the “Share Cohort” button when you are done. This is the same for visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create.

• Select a set operation.

• Edit the cohorts to be operated upon.

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
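The three set operations behave like ordinary set algebra on the samples in each cohort. The sketch below illustrates this with Python sets; it treats a cohort as nothing more than a set of sample identifiers, which is a simplification, and the identifiers themselves are made up.

```python
from functools import reduce
from typing import FrozenSet, Iterable

SampleSet = FrozenSet[str]  # a cohort reduced to its sample identifiers


def union(cohorts: Iterable[SampleSet]) -> SampleSet:
    """Samples present in at least one cohort; order does not matter."""
    return frozenset().union(*cohorts)


def intersect(cohorts: Iterable[SampleSet]) -> SampleSet:
    """Samples present in every cohort; order does not matter."""
    cohorts = list(cohorts)
    if not cohorts:
        return frozenset()
    return reduce(frozenset.intersection, cohorts)


def complement(base: SampleSet, others: Iterable[SampleSet]) -> SampleSet:
    """Samples in the base cohort that are in none of the other cohorts."""
    return base.difference(*others)


a = frozenset({"s1", "s2", "s3"})
b = frozenset({"s2", "s3", "s4"})
c = frozenset({"s3", "s5"})

print(sorted(union([a, b, c])))
print(sorted(intersect([a, b, c])))
print(sorted(complement(a, [b, c])))
```

Unlike union and intersect, complement is not symmetric: swapping the base cohort with one of the others generally gives a different result, which is why the dialogue asks you to designate a base cohort.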

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only does string matching against the names of cohorts and visualizations.

Sorting

You can sort the listing of cohorts and visualizations using the column headers. When you click on a column header, it will indicate whether that column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
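The counts shown beside each filter value can be read as "how many samples would have this value, given every other filter that is currently selected". The following sketch shows one way such counts could be computed. It is not the web-app's code: the attribute names, the sample records, and the helper functions are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set

Sample = Dict[str, str]
Filters = Dict[str, Set[str]]  # attribute -> selected values


def matches(sample: Sample, filters: Filters) -> bool:
    """True if the sample has a selected value for every filtered attribute."""
    return all(sample.get(attr) in allowed for attr, allowed in filters.items())


def facet_counts(samples: List[Sample], filters: Filters, attr: str) -> Counter:
    """Count values of `attr`, applying all filters except those on `attr`."""
    others = {a: v for a, v in filters.items() if a != attr}
    return Counter(s.get(attr, "None") for s in samples if matches(s, others))


# Made-up sample records, for illustration only.
SAMPLES = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "MALE"},
    {"vital_status": "Alive", "gender": "FEMALE"},
]

selected = {"vital_status": {"Alive"}}
cohort = [s for s in SAMPLES if matches(s, selected)]
print(len(cohort))
print(facet_counts(SAMPLES, selected, "gender"))
print(facet_counts(SAMPLES, selected, "vital_status"))
```

Excluding an attribute's own filter when counting its values is what lets the unselected values (here “Dead”) keep showing a non-zero count, so you can see what adding them would contribute.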

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

bull DNA-Sequence

bull RNA-Sequence

bull miRNA-Sequence

bull Protein

bull SNP Copy-Number

bull DNA Methylation

Selected Filters Panel This is where selected filters are shown so there is an easy way to see what filters have beenselected Clicking on ldquoClear Allrdquo will remove all selected filters

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
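The number shown when hovering over a swatch can be thought of as a count of samples per pair of platform segments. The following Python sketch illustrates that idea only; `swatch_counts`, the record layout, and the platform labels are hypothetical and are not taken from the web-app.

```python
from collections import Counter


def swatch_counts(samples, left_data_type, right_data_type):
    """Count samples for each (left platform, right platform) pair.

    A sample with no data for a data type is placed in the "NA" segment.
    """
    return Counter(
        (
            sample.get(left_data_type) or "NA",
            sample.get(right_data_type) or "NA",
        )
        for sample in samples
    )


# Hypothetical cohort: each sample maps a data type to the platform used.
cohort = [
    {"DNA Methylation": "platform A", "Protein": "platform X"},
    {"DNA Methylation": "platform A", "Protein": "platform X"},
    {"DNA Methylation": "platform B"},
    {"Protein": "platform X"},
]

counts = swatch_counts(cohort, "DNA Methylation", "Protein")
```

Each key of `counts` corresponds to one swatch between two adjacent vertical bars, and its value is the number of sample lines passing through it.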

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following things:

• Enter a name for the new cohort you’re about to create
• Select a set operation
• Edit the cohorts to be used in the operation

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
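The three operations behave like ordinary set operations on the samples in each cohort. The Python sketch below illustrates this with made-up sample identifiers; the function names are hypothetical and the code is not the web-app's implementation.

```python
def union(*cohorts):
    """Samples that appear in at least one of the cohorts."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Samples in the base cohort that appear in none of the other cohorts."""
    return set(base).difference(*others)


# Hypothetical cohorts, each a set of sample identifiers.
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}

all_samples = union(cohort_a, cohort_b, cohort_c)
shared = intersect(cohort_a, cohort_b, cohort_c)
only_a = complement(cohort_a, cohort_b, cohort_c)
```

Union and intersect give the same result regardless of cohort order, whereas complement depends on which cohort is chosen as the base.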

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either Owner or Reader.
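The difference between the two counts can be illustrated with TCGA barcodes. The sketch below assumes the standard TCGA barcode layout, in which the first three dash-separated fields of a sample barcode identify the participant. How the web-app computes these counts is not described here, and the function names and barcodes are made up for illustration.

```python
def participant_barcode(sample_barcode):
    """Return the participant part of a TCGA sample barcode.

    Assumes the standard layout, e.g. "TCGA-02-0001-01A", where the first
    three dash-separated fields identify the participant.
    """
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(barcode) for barcode in samples}
    return len(samples), len(participants)


# Made-up barcodes: the first participant contributed two samples.
n_samples, n_participants = cohort_counts(
    ["TCGA-AA-0001-01A", "TCGA-AA-0001-10A", "TCGA-BB-0002-01A"]
)
```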

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use the “Previous Page” and “Next Page” buttons to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will start a download of the list of all files available for the cohort, taking into account the selected platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
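A downloaded file list could be processed with a few lines of Python. This is a minimal sketch under assumptions: the exact header names in the CSV are not specified above, so the column names used here (“Platform”, “File Path”), the example rows, and the `gs://example-bucket` paths are all hypothetical.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text):
    """Group Cloud Storage paths from a file-list CSV by platform."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


# Made-up rows; header names and gs:// paths are assumptions for illustration.
example_csv = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AA-0001-01A,platform A,pipeline 1,Level 3,gs://example-bucket/a1.txt
TCGA-AA-0001-01A,platform B,pipeline 2,Level 3,gs://example-bucket/b1.txt
TCGA-BB-0002-01A,platform A,pipeline 1,Level 3,gs://example-bucket/a2.txt
"""

grouped = paths_by_platform(example_csv)
```

To use this with a real download, read the file's text and pass it to `paths_by_platform`, adjusting the column names to match the actual header row.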

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: this input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox overrides any feature selected under Color By Feature. The plot is then colored by cohort, and the cohorts provided are used as the legend.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on the check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
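The two permission levels can be summarized as a small lookup table. The Python sketch below restates the rules above; it is not the web-app's code, and the action names (“view”, “comment”, “copy”, “edit”, “share”, “save”) are hypothetical labels chosen for illustration.

```python
from enum import Enum


class Permission(Enum):
    OWNER = "owner"
    READER = "reader"


# Actions taken from the description above; the action names are hypothetical.
ALLOWED_ACTIONS = {
    Permission.OWNER: {"view", "comment", "copy", "edit", "share", "save"},
    Permission.READER: {"view", "comment", "copy"},
}


def can(permission, action):
    """Return True if the permission level allows the action."""
    return action in ALLOWED_ACTIONS[permission]


def permission_on_copy(_original_permission):
    """Whoever makes a copy becomes the owner of that copy."""
    return Permission.OWNER
```

For example, `can(Permission.READER, "save")` is false, but after cloning, `permission_on_copy(Permission.READER)` gives the Owner level on the clone, so saving becomes possible there.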

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs on the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient; we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
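Two rules from the answers above lend themselves to a short illustration: an eRA Commons id can be linked to only one Google identity at a time, and controlled-data access lasts 24 hours after verification. The Python sketch below models just those two rules. It is not how the ISB-CGC platform is implemented, and every name in it (`IdentityLinks`, `has_controlled_access`, the example identities) is hypothetical.

```python
from datetime import datetime, timedelta

ACCESS_WINDOW = timedelta(hours=24)


def has_controlled_access(verified_at, now):
    """True while `now` falls within 24 hours of the dbGaP verification."""
    return verified_at <= now < verified_at + ACCESS_WINDOW


class IdentityLinks:
    """Tracks which Google identity each eRA Commons id is linked to."""

    def __init__(self):
        self._google_by_era = {}

    def link(self, era_commons_id, google_identity):
        current = self._google_by_era.get(era_commons_id)
        if current is not None and current != google_identity:
            # The id must be unlinked from the previous identity first.
            raise ValueError("eRA Commons id is already linked; unlink it first")
        self._google_by_era[era_commons_id] = google_identity

    def unlink(self, era_commons_id):
        self._google_by_era.pop(era_commons_id, None)

    def linked_identity(self, era_commons_id):
        return self._google_by_era.get(era_commons_id)
```

Switching to a new Google identity in this model means calling `unlink` for the eRA Commons id and then `link` with the new identity, mirroring the sequence described in the answer above.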

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g., Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful, web-based, interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Attach a Persistent Disk: The gcloud command to attach a newly created disk to a previously created instance looks like this:

gcloud compute instances attach-disk my-instance --disk disk-1 --device-name disk-1

Note that this command is part of the gcloud compute instances group rather than the gcloud compute disks group. Details about additional options can be found in the documentation. For example, the default mode is rw (read-write), but you can also specify that a disk be attached ro (read-only).

Format a Persistent Disk: In order to format a disk that you’ve attached to an instance, you need to first log on to that instance:

gcloud compute ssh my-instance

For complete details, please refer to the Google documentation on formatting and mounting non-root persistent disks, but there are two main steps: first, you must format the disk using the mkfs tool (note that this will delete any existing data on the disk), and second, you must use the mount tool to mount the disk at a specified mount-point:

sudo mkfs.ext4 -F /dev/disk/by-id/disk-1
sudo mkdir /mnt/pd1
sudo mount -o discard,defaults /dev/disk/by-id/disk-1 /mnt/pd1

Detach a Persistent Disk: Detaching a disk is a two-step process: first you unmount the disk (using the umount command from the instance to which it is attached), and then (after logging out from that instance) you use the gcloud tool:

$ sudo umount /dev/disk/by-id/disk-1
$ exit

gcloud compute instances detach-disk my-instance --disk disk-1

Delete a Persistent Disk: Note that a boot disk will be deleted if you delete the instance that it is attached to, as long as the auto-delete property for the disk was set to “yes” (the default) when it was created. In all other cases, you will need to delete the disk manually using the gcloud compute disks delete command. Note that disks can be deleted only if they are not being used by any VM instances.

You can also see and manage persistent disks from the Console on the Compute Engine > Disks page.

1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables, and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select “Help”.

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the “+ Create” button and select “Visualization”. This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the “+ Create” button and select “SeqPeek”. This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The “Trash” button becomes selectable once at least one cohort is selected. Click on the “Trash” button to delete the selected cohorts. The same procedure applies to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The “Share” button becomes selectable once at least one cohort is selected. Click on the “Share” button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the “Share Cohort” button when you are done. The same procedure applies to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create
• Select a set operation
• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.
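The three set operations behave like ordinary set algebra over sample identifiers. The following is a minimal sketch, assuming each cohort can be treated as a plain set of sample identifiers; the function name and the identifiers are hypothetical and are not part of the ISB-CGC API.

```python
from functools import reduce


def combine_cohorts(operation, cohorts):
    """Combine cohorts (sets of sample identifiers) with a set operation.

    For "complement", the first cohort is the base cohort and every
    other cohort is subtracted from it, so the order matters.
    """
    if not cohorts:
        raise ValueError("at least one cohort is required")
    if operation == "union":
        return set(reduce(set.union, cohorts))
    if operation == "intersect":
        return set(reduce(set.intersection, cohorts))
    if operation == "complement":
        base, *others = cohorts
        return set(reduce(set.difference, others, base))
    raise ValueError(f"unknown operation: {operation}")


# Hypothetical cohorts
cohort_a = {"S1", "S2", "S3"}
cohort_b = {"S2", "S3", "S4"}
cohort_c = {"S3", "S5"}
```

Union and intersect give the same result for any ordering of the inputs, whereas complement depends on which cohort is chosen as the base.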

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, use the search bar at the top of the page. This produces a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
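As an illustration of name-only matching, here is a minimal sketch. It assumes case-insensitive substring matching, which the text above does not specify; the function name and the example names are hypothetical.

```python
def search_by_name(query, cohort_names, visualization_names):
    """Return (matching cohorts, matching visualizations) for a query.

    Only names are searched. Case-insensitive substring matching is an
    assumption made for this sketch.
    """
    needle = query.lower()
    cohorts = [n for n in cohort_names if needle in n.lower()]
    visualizations = [n for n in visualization_names if needle in n.lower()]
    return cohorts, visualizations
```

The two returned lists correspond to the two lists on the results page.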

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the list of filters provided on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field expands and provides additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
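The counts beside each filter value can be read as faceted counts: for a given attribute, samples are counted after applying every selected filter except the one on that attribute itself. The sketch below shows this reading under stated assumptions; the field names, values, and function name are hypothetical, and this is not the web-app's actual implementation.

```python
from collections import Counter


def facet_counts(samples, selected, attribute):
    """Count samples per value of `attribute`, applying every selected
    filter except the one on `attribute` itself."""
    other_filters = {k: v for k, v in selected.items() if k != attribute}
    counts = Counter()
    for sample in samples:
        if all(sample.get(k) in allowed for k, allowed in other_filters.items()):
            counts[sample.get(attribute)] += 1
    return counts


# Hypothetical samples and filter selections
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
    {"vital_status": None, "gender": "MALE"},
]
selected = {"gender": {"FEMALE"}, "vital_status": {"Alive"}}
```

With these inputs, the counts for `vital_status` are computed over the two FEMALE samples only, while the counts for `gender` are computed over the two Alive samples only.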

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of the data available for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
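The number shown when hovering between two vertical bars is effectively a pairwise count over samples. Here is a minimal sketch, assuming each sample maps a data type to the platform that produced it; the sample identifiers and platform names below are hypothetical.

```python
from collections import Counter


def platform_pair_counts(samples, left_type, right_type):
    """Count samples for each (platform, platform) pair between two
    data types; a missing data type is recorded as "NA"."""
    pairs = Counter()
    for platforms in samples.values():
        left = platforms.get(left_type, "NA")
        right = platforms.get(right_type, "NA")
        pairs[(left, right)] += 1
    return pairs


# Hypothetical samples and platform names
samples = {
    "S1": {"DNA Methylation": "platform_x", "Protein": "platform_p"},
    "S2": {"DNA Methylation": "platform_x"},
    "S3": {"DNA Methylation": "platform_y", "Protein": "platform_p"},
    "S4": {"DNA Methylation": "platform_x", "Protein": "platform_p"},
}
```

Each key of the result corresponds to one swatch between the two bars, and the pair counts always sum to the number of samples in the cohort.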

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be additive to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort, and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves the same way as sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
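To see why the two numbers can differ, consider the following minimal sketch. It assumes TCGA-style sample barcodes in which the first three hyphen-separated fields identify the participant; the barcodes and the function name are made up for illustration.

```python
def count_samples_and_participants(sample_barcodes):
    """Return (number of samples, number of participants).

    Assumes TCGA-style sample barcodes such as "TCGA-AB-1234-01A",
    where the first three hyphen-separated fields name the participant.
    """
    samples = set(sample_barcodes)
    participants = {"-".join(b.split("-")[:3]) for b in samples}
    return len(samples), len(participants)


# Hypothetical barcodes: one participant contributed two samples
barcodes = ["TCGA-AB-1234-01A", "TCGA-AB-1234-11A", "TCGA-CD-5678-01A"]
```

Here the cohort contains three samples from only two participants.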

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of the data available for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The downloaded file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
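Once downloaded, the file list can be processed with a few lines of Python. The sketch below groups Cloud Storage paths by platform. The exact CSV header names are an assumption based on the fields listed above, and the barcodes, platform names, and bucket paths are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text):
    """Group Cloud Storage paths by platform from a file-list CSV."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)


# Hypothetical file-list contents; header names and paths are assumed
example_csv = """Sample Barcode,Platform,Pipeline,Data Level,File Path
TCGA-AB-1234-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/a.txt
TCGA-AB-1234-01A,PlatformB,pipeline_2,Level 3,gs://example-bucket/b.txt
TCGA-CD-5678-01A,PlatformA,pipeline_1,Level 3,gs://example-bucket/c.txt
"""
```

For a real download, read the file contents with `open(...).read()` and adjust the column names to match the actual header row.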

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort that you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots. Each plot is configurable to show relevant data, and a visualization can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization that you created, you can delete the visualization using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e., “X Axis Feature” or “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on the check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As the owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to the plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes.
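The Owner and Reader rules above can be summarized in a small model. This is an illustrative sketch only, not the web-app's implementation: the `Cohort` class, its method names, and the example email addresses are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)

    def permission(self, user):
        """Return "owner", "reader", or None for a user."""
        if user == self.owner:
            return "owner"
        if user in self.readers:
            return "reader"
        return None

    def can_edit(self, user):
        return self.permission(user) == "owner"

    def can_comment(self, user):
        return self.permission(user) in ("owner", "reader")

    def share(self, by_user, with_user):
        """Only the owner may share; the recipient becomes a reader."""
        if not self.can_edit(by_user):
            raise PermissionError("only the owner can share a cohort")
        self.readers.add(with_user)

    def clone(self, user):
        """Copy the sample list into a new cohort owned by `user`."""
        if self.permission(user) is None:
            raise PermissionError("cohort is not visible to this user")
        return Cohort(f"Copy of {self.name}", user, self.samples)
```

A reader who clones a cohort becomes the owner of the copy, which starts out private (no readers), while the original cohort is left unchanged.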

1.4.8 Release Notes

• December 23, 2015: v0.2
  – A treemap graph in the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.

• December 3, 2015: v0.1
  – First tagged release of the web-app.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
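From Python, tables in the isb-cgc project could be queried roughly as follows. This is a sketch only. It assumes the separately installed google-cloud-bigquery client library, standard SQL syntax, and a Google Cloud project of your own for billing. The dataset, table, and column names are placeholders rather than real ISB-CGC names, and the helper functions are hypothetical; only the `isb-cgc` project ID comes from the answer above.

```python
def build_count_query(dataset, table, column, project="isb-cgc"):
    """Build a standard-SQL query counting rows per value of `column`."""
    return (
        f"SELECT {column}, COUNT(*) AS n "
        f"FROM `{project}.{dataset}.{table}` "
        f"GROUP BY {column} ORDER BY n DESC"
    )


def run_query(sql, billing_project):
    """Run the query; requires google-cloud-bigquery and credentials."""
    from google.cloud import bigquery  # imported lazily on purpose

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names: substitute a real dataset, table, and column
sql = build_count_query("some_dataset", "some_table", "some_column")
```

Calling `run_query(sql, "your-project-id")` would bill the query to your own project, which is why membership in a Google Cloud Project is required.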

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g., your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
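The one-identity-at-a-time rule can be pictured as a simple mapping from eRA Commons id to Google identity. The sketch below is illustrative only: the `NihLinkRegistry` class and the example ids are hypothetical and do not describe how the ISB-CGC platform stores this information.

```python
class NihLinkRegistry:
    """Track which Google identity each eRA Commons id is linked to.

    An eRA Commons id may be linked to at most one Google identity at
    a time, so it must be unlinked before it can be linked elsewhere.
    """

    def __init__(self):
        self._linked = {}  # eRA Commons id -> Google identity

    def link(self, era_id, google_identity):
        current = self._linked.get(era_id)
        if current is not None and current != google_identity:
            raise ValueError(f"{era_id} is already linked; unlink it first")
        self._linked[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        if self._linked.get(era_id) != google_identity:
            raise ValueError("only the linked identity can unlink")
        del self._linked[era_id]

    def linked_identity(self, era_id):
        return self._linked.get(era_id)
```

Linking the same id to a second identity fails until the first identity has unlinked it, which mirrors the sign-in, unlink, then re-link sequence described above.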

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
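For scripts that depend on controlled-access data, it can help to track when the 24-hour window ends. The sketch below assumes you record the verification time yourself; the function name and the treatment of the exact boundary are assumptions, not documented platform behavior.

```python
from datetime import datetime, timedelta

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the answer above


def has_controlled_access(verified_at, now):
    """Return True while `now` falls within 24 hours of verification."""
    return verified_at <= now < verified_at + ACCESS_WINDOW


verified = datetime(2016, 3, 3, 9, 0)
still_valid = has_controlled_access(verified, datetime(2016, 3, 3, 20, 0))
expired = not has_controlled_access(verified, datetime(2016, 3, 4, 9, 30))
```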


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure in a variety of ways.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcases some of the tools at your disposal.


1.4 ISB-CGC Web Interface

The documentation contained in this section is for the prototype ISB-CGC web interface.

Please understand that the currently deployed web-app is an early prototype and our developers are working on a complete overhaul. We encourage you to explore the TCGA data that we have made available in BigQuery tables and to Contact Us for more information about upcoming releases.

The information in the sections below is also directly accessible from the ISB-CGC web application. After you sign in, click on the down-arrow next to your name in the upper-right corner and select "Help".

1.4.1 User Dashboard

Create Cohorts and Visualizations

Create a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Create a general visualization

To create a visualization from the User Dashboard, click on the "+ Create" button and select "Visualization". This will take you to a new visualization with default settings selected.

Create a SeqPeek visualization

To create a SeqPeek visualization from the User Dashboard, click on the "+ Create" button and select "SeqPeek". This will take you to a new SeqPeek visualization with no default settings selected.

Operations on Cohorts and Visualizations

Delete a Cohort or a Visualization

To delete a set of cohorts, first check the boxes next to the cohorts you wish to delete. The "Trash" button becomes selectable once at least one cohort is selected. Click on the "Trash" button to delete the selected cohorts. The same procedure applies to visualizations.

Share a Cohort or a Visualization

To share a set of cohorts, first check the boxes next to the cohorts you wish to share. The "Share" button becomes selectable once at least one cohort is selected. Click on the "Share" button to share the selected cohorts. A dialogue box will appear and prompt you to select the users you wish to share with. In this dialogue you can remove any cohorts you no longer want to share, and you can select one or more users from a list of users that are already registered in the system. Click the "Share Cohort" button when you are done. The same procedure applies to visualizations.

Note that when you share a cohort with another user, that user will be able to view and comment on the cohort but will not be able to make changes. If you want to make changes to a cohort that has been shared with you, first clone that cohort.


Set Operations on Cohorts

To activate the "Set Operations" button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit the cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires a base cohort, from which the other cohort(s) will be subtracted.

Click "Okay" to complete the operation and create the new cohort.
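The three set operations behave like ordinary set arithmetic on sample identifiers. The following sketch illustrates this with Python sets; the function names and the sample identifiers are made up for illustration and do not correspond to anything in the web-app.

```python
from functools import reduce


def union(cohorts):
    """Samples that appear in at least one cohort."""
    return reduce(lambda a, b: a | b, cohorts, set())


def intersect(cohorts):
    """Samples that appear in every cohort (needs at least one cohort)."""
    cohorts = list(cohorts)
    return reduce(lambda a, b: a & b, cohorts[1:], set(cohorts[0]))


def complement(base, others):
    # Subtract every other cohort from the base cohort; unlike union and
    # intersect, the result depends on which cohort is chosen as the base.
    return set(base) - union(others)


a = {"S1", "S2", "S3"}
b = {"S2", "S3", "S4"}
c = {"S3", "S5"}

all_samples = union([a, b, c])
shared = intersect([a, b, c])
only_in_a = complement(a, [b, c])
```

Swapping the order of the inputs leaves the union and intersection unchanged, whereas choosing a different base cohort changes the complement.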

Your Cohorts

When the "Cohorts" tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the "Visualizations" tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the "SeqPeek" tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. The search only performs string matching against the names of cohorts and visualizations.
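A name-only search of this kind can be sketched as follows. The text above does not say whether matching is case-sensitive or substring-based, so the case-insensitive substring match used here is an assumption, as are the function name and the example names.

```python
def search_by_name(term, cohorts, visualizations):
    """Return (matching cohort names, matching visualization names)."""
    needle = term.lower()
    return (
        [name for name in cohorts if needle in name.lower()],
        [name for name in visualizations if needle in name.lower()],
    )


cohort_hits, viz_hits = search_by_name(
    "brca",
    cohorts=["BRCA alive", "LUAD smokers"],
    visualizations=["brca expression", "age histogram"],
)
```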

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column and indicates whether the order is ascending or descending.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the "+ Create" button and select "New Cohort". This will take you to the cohort creation page.

Cohort Creation Page

Using the list of filters provided on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field expands and provides additional filtering options. For example, when you click on "Vital Status", it expands and provides a list of "Alive", "Dead" and "None" as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
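The counting rule in the last sentence (each value's count takes all other selected filters into account) can be made concrete with a small sketch. The sample records are invented, and the assumption that several values selected within one attribute are OR-ed together is ours; the text above does not specify it.

```python
from collections import Counter


def matches(sample, filters):
    # A sample passes when, for every filtered attribute, its value is one
    # of the selected values (values within one attribute are OR-ed).
    return all(sample.get(attr) in allowed for attr, allowed in filters.items())


def facet_counts(samples, filters, attribute):
    """Count samples per value of `attribute`, applying all *other* filters."""
    others = {a: v for a, v in filters.items() if a != attribute}
    return Counter(s.get(attribute) for s in samples if matches(s, others))


# Hypothetical sample records, for illustration only.
samples = [
    {"Study": "BRCA", "Vital Status": "Alive"},
    {"Study": "BRCA", "Vital Status": "Dead"},
    {"Study": "LUAD", "Vital Status": "Alive"},
    {"Study": "LUAD", "Vital Status": None},
]
selected = {"Study": {"BRCA"}, "Vital Status": {"Alive"}}

cohort = [s for s in samples if matches(s, selected)]
vital_counts = facet_counts(samples, selected, "Vital Status")
study_counts = facet_counts(samples, selected, "Study")
```

Here the counts shown next to the "Vital Status" values are computed over BRCA samples only, because "Study" is the other selected filter.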

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site


Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where the selected filters are shown, so that it is easy to see which filters have been applied. Clicking on "Clear All" will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with "NA" indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that "flows" horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
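The number shown when hovering over a swatch is simply a count of samples per pair of platforms on two adjacent bars. A minimal sketch of that tally, using invented platform names and sample records:

```python
from collections import Counter

# Hypothetical samples: platform per data type, "NA" when not available.
samples = [
    {"DNA-Sequence": "PlatformA", "RNA-Sequence": "PlatformX"},
    {"DNA-Sequence": "PlatformA", "RNA-Sequence": "PlatformX"},
    {"DNA-Sequence": "PlatformA", "RNA-Sequence": "NA"},
    {"DNA-Sequence": "NA", "RNA-Sequence": "PlatformX"},
]


def swatch_counts(samples, left_type, right_type):
    """Number of samples for each (left platform, right platform) pair."""
    return Counter((s[left_type], s[right_type]) for s in samples)


counts = swatch_counts(samples, "DNA-Sequence", "RNA-Sequence")
```

Because every sample contributes exactly one line, the counts across all swatches between two bars add up to the cohort size.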

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the "Set Operations" button, you must have at least one cohort selected. Upon clicking the "Set Operations" button, a dialogue box will appear. Here you may enter a name for the new cohort you are about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click "Okay" to complete the operation and create the new cohort.

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or cancel adding new filters.

• Comments: Selecting "Comments" will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.

• Share with Others: This behaves in the same way as on the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and the number of participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays "Your Permissions", which can be either owner or reader.
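To see why the two numbers differ, the sketch below derives participants from sample barcodes. It assumes the standard TCGA barcode layout, in which the first three dash-separated fields identify the participant; the barcodes themselves are invented.

```python
def participant_of(sample_barcode):
    # In the TCGA barcode layout, the first three dash-separated fields
    # (project, tissue source site, participant) identify the participant.
    return "-".join(sample_barcode.split("-")[:3])


# Hypothetical sample barcodes: the first participant has two samples.
sample_barcodes = [
    "TCGA-AA-0001-01A",
    "TCGA-AA-0001-11A",
    "TCGA-AA-0002-01A",
]

n_samples = len(set(sample_barcodes))
n_participants = len({participant_of(b) for b in sample_barcodes})
```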

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the "Show More" button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of the available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with "NA" for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. When you hover over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

"View File List" takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of the files available for all samples in the cohort. Here "available" refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use "Previous Page" and "Next Page" to show more entries in the list. You may filter the files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available: "Download File List as CSV". Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
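Once downloaded, the file list can be processed with any CSV reader. The sketch below uses Python's standard csv module on invented contents; the exact header names, platform and pipeline values, and storage paths in a real download may differ from these placeholders.

```python
import csv
import io
from collections import Counter

# Hypothetical file contents; real header names and paths may differ.
file_list_csv = io.StringIO(
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "TCGA-AA-0001-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-AA-0002-01A,PlatformA,pipeline1,Level 3,gs://example-bucket/b.txt\n"
    "TCGA-AA-0002-01A,PlatformB,pipeline2,Level 3,gs://example-bucket/c.txt\n"
)

rows = list(csv.DictReader(file_list_csv))

# Number of files per platform, like the counts shown next to each filter.
files_per_platform = Counter(row["Platform"] for row in rows)

# Cloud Storage paths for a single platform.
paths_for_a = [r["File Path"] for r in rows if r["Platform"] == "PlatformA"]
```

To read a real download, replace the in-memory buffer with an open file handle.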

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select "Comments". A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled "Save as Cohort". Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the "Save" button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, each of which can be configured to show the relevant data, and a visualization can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the "Visualization" option from the "+ Create" menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click "Create Visualization".

Save

Use the "Save Visualizations" option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization using the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the "Make A Copy" item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the "Add a Plot" item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the "Trash" icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the "Comment" icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the "Comment" button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the "Edit" icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. "X Axis Feature" or "Y Axis Feature"), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical
    – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Platform Filter: filters down the plot-able features by platform.
    – Center Filter: filters down the plot-able features by processing center.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA
    – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
    – Platform Filter: filters down the plot-able features by platform.
    – Value Filter: filters down the plot-able features by value.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
    – Platform Filter: filters down the plot-able features by platform.
    – Gene Region Filter: filters down the plot-able features by specific gene regions.
    – CpG Island Region Filter: filters down the plot-able features by CpG island region.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters down the plot-able features by value.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been chosen, you can click "Update Plot" to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the "SeqPeek" option from the "+ Create" menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, the following options must be selected in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the "+ Cohort" option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been chosen, you can click "Update Plot" to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the "Save Visualization" button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualizations. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled "IGV". For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that check mark will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This copies only the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

    – You may make a copy ("clone") of the cohort or visualization, after which you will be the owner and you will be able to make changes.
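The Owner and Reader rules above can be summarized in a short sketch. This is a simplified model for illustration only, not the web-app's implementation; the class, its fields and method names, and the example users are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Cohort:
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)

    def permission(self, user):
        if user == self.owner:
            return "owner"
        return "reader" if user in self.readers else None

    def can_edit(self, user):
        return self.permission(user) == "owner"

    def can_comment(self, user):
        return self.permission(user) in ("owner", "reader")

    def share(self, by, with_user):
        if not self.can_edit(by):
            raise PermissionError("only the owner can share")
        self.readers.add(with_user)

    def clone(self, user):
        if self.permission(user) is None:
            raise PermissionError("no access to this cohort")
        # Only the sample list is copied; the clone starts with no readers.
        return Cohort("Copy of " + self.name, user, self.samples)


original = Cohort("My cohort", "alice@example.com", frozenset({"S1", "S2"}))
original.share(by="alice@example.com", with_user="bob@example.com")
copy = original.clone("bob@example.com")
```

After cloning, the Reader becomes the Owner of the copy and can edit it, while the original cohort remains read-only for that user.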


1.4.8 Release Notes

• December 23, 2015: v0.2

    – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

    – Cohort file list download is not working.

• December 3, 2015: v0.1

    – First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient; we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts, with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data, without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
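As a rough sketch of what programmatic access to these tables might look like from Python, the helper below builds a simple counting query and, optionally, runs it with the `google-cloud-bigquery` client library (which must be installed separately). Only the `isb-cgc` project ID comes from this documentation. The dataset, table, and column names passed in are placeholders that you would replace with names you see in the BigQuery web interface, and the standard-SQL syntax is an assumption about how your client is configured.

```python
def build_count_query(dataset, table, column, project="isb-cgc"):
    """Build a standard-SQL query that counts rows per value of one column."""
    full_name = "`{}.{}.{}`".format(project, dataset, table)
    return (
        "SELECT {col}, COUNT(*) AS n\n"
        "FROM {tbl}\n"
        "GROUP BY {col}\n"
        "ORDER BY n DESC"
    ).format(col=column, tbl=full_name)


def run_query(sql, billing_project):
    """Run the query; needs the google-cloud-bigquery package and credentials."""
    # Imported here so that build_count_query works without the library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]
```

For example, `build_count_query("some_dataset", "some_table", "some_column")` returns the SQL text, and `run_query(sql, "my-gcp-project")` would execute it against your own project, which is why membership in a Google Cloud Project is required.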

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
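The one-identity-at-a-time rule can be pictured with a small registry. This is a hypothetical sketch of the constraint only, not the platform's code: the `IdentityLinks` class, its method names, and the example ids are all invented for illustration.

```python
class LinkError(Exception):
    """Raised when an eRA Commons id is already linked to another identity."""


class IdentityLinks:
    """Tracks which Google identity each eRA Commons id is linked to."""

    def __init__(self):
        self._google_by_era = {}

    def link(self, era_id, google_identity):
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            # The existing link must be removed before a new one is made.
            raise LinkError("unlink {} from {} first".format(era_id, current))
        self._google_by_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        # Only the identity that currently holds the link can remove it.
        if self._google_by_era.get(era_id) == google_identity:
            del self._google_by_era[era_id]

    def linked_identity(self, era_id):
        return self._google_by_era.get(era_id)
```

Linking the same eRA Commons id to a second Google identity fails until the first identity has unlinked it, which mirrors the sign-in, unlink, then re-link sequence described above.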

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
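If you schedule long-running jobs, it can help to work out when that 24-hour window closes. The sketch below assumes the window starts at the moment your authorization is verified and that the end of the window is exclusive; both are assumptions made for illustration, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Duration stated in the FAQ answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Return the time at which controlled-data access would lapse."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while 'now' falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)
```

A job that needs controlled-access data for longer than this would require you to re-authenticate through the web-app before the window ends.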

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally, or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis, and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For the programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful, web-based, interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

Set Operations on Cohorts

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. From here you may do the following:

• Enter a name for the resulting cohort you will create

• Select a set operation

• Edit cohorts to be operated upon

The intersect and union operations can take any number of cohorts, in any order.

The complement operation requires that there be a base cohort from which the other cohort(s) will be subtracted.

Click “Okay” to complete the operation and create the new cohort.

Your Cohorts

When the “Cohorts” tab is selected, the system displays all the cohorts that you own and all the cohorts that have been shared with you.

Clicking on the name of a cohort in the list will take you to the cohort details and editing page.

Your Visualizations

When the “Visualizations” tab is selected, the system displays all the visualizations that you own and all the visualizations that have been shared with you.

Your SeqPeeks

When the “SeqPeek” tab is selected, the system displays all the SeqPeek visualizations that you own and all the SeqPeek visualizations that have been shared with you.

Searching

To search for cohorts and visualizations by name, you can use the search bar at the top of the page. This will produce a results page with two lists, one for cohorts and one for visualizations. This search only does string matching against the names of cohorts and visualizations.
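To make the behavior concrete, a name-only search that returns two lists could be sketched as follows. This is an illustration rather than the web-app's code: the function name is hypothetical, and treating the match as a case-insensitive substring test is an assumption, since the documentation only says that names are string-matched.

```python
def search_by_name(query, cohort_names, visualization_names):
    """Return the cohort and visualization names that contain the query."""
    needle = query.lower()

    def hits(names):
        # Only the name is examined; filters and contents are not searched.
        return [name for name in names if needle in name.lower()]

    return {
        "cohorts": hits(cohort_names),
        "visualizations": hits(visualization_names),
    }
```

Because only names are matched, giving cohorts and visualizations descriptive names makes them easier to find later.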

Sorting

You can sort the listing of cohorts and visualizations using the column headers. Clicking on a column header sorts by that column, and the header indicates whether the column is sorted in ascending or descending order.

1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. By clicking on a feature, the field will expand and provide you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
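The counting rule in the last sentence (each value's count takes into account all other selected filters, but not the filter on the attribute being counted) can be sketched in Python. The data layout, the attribute names, and the `value_counts` helper are assumptions made for this illustration; the web-app may compute these numbers differently.

```python
from collections import Counter


def matches(sample, filters, skip=None):
    """True if the sample passes every filter except the one on 'skip'."""
    return all(
        sample.get(attribute) in allowed
        for attribute, allowed in filters.items()
        if attribute != skip
    )


def value_counts(samples, filters, attribute):
    """Count samples per value of 'attribute', applying only the other filters."""
    return Counter(
        sample.get(attribute)
        for sample in samples
        if matches(sample, filters, skip=attribute)
    )
```

With filters on both Gender and Vital Status selected, the counts shown next to the Vital Status values would reflect only the Gender filter, so you can still see how many samples each unselected Vital Status value would add.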

Cohort Filters

Participant Filters List

• Project

• Study

• Vital Status

• Gender

• Age At Diagnosis

• Sample Type Code

• Tumor Tissue Site

• Histological Type

• Prior Diagnosis

• Pathologic State

• Tumor Status

• New Tumor Event After Initial Treatment

• Histological Grade

• Residual Tumor

• Tobacco Smoking History

• ICD-10

• ICD-O-3 Histology

• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. By hovering on a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
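The number shown when hovering on a swatch can be thought of as a count of samples per pair of platforms on two neighboring bars. The sketch below illustrates that idea under an assumed data layout (one dictionary per sample, mapping a data type to the platform used); the `pair_counts` helper and the platform names in any example are hypothetical.

```python
from collections import Counter


def pair_counts(samples, left_type, right_type):
    """Count samples for each (left platform, right platform) combination.

    A sample without data for a given type is counted under "NA",
    matching the "NA" segment described above.
    """
    return Counter(
        (sample.get(left_type, "NA"), sample.get(right_type, "NA"))
        for sample in samples
    )
```

Reordering the vertical bars in the panel corresponds to choosing a different pair of data types for `left_type` and `right_type`.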

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following: enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires that there be a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
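These three operations behave like ordinary set operations on the sample identifiers in each cohort. The following sketch uses Python's built-in `set` type to show the semantics; the function names and the idea of representing a cohort as a set of sample identifiers are assumptions made for illustration.

```python
def union(*cohorts):
    """Samples that appear in at least one of the cohorts (order-independent)."""
    return set().union(*cohorts)


def intersect(first, *others):
    """Samples that appear in every cohort (order-independent)."""
    return set(first).intersection(*others)


def complement(base, *others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)
```

Unlike union and intersect, the result of complement depends on which cohort is chosen as the base, which is why the dialogue asks you to designate one.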

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort, and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users that are registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
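The difference between the two counts can be illustrated with TCGA barcodes, in which the first three hyphen-separated fields of a sample barcode identify the participant. The barcodes used with this sketch are made up for illustration, and the helper names are hypothetical.

```python
def participant_barcode(sample_barcode):
    """Return the participant part of a TCGA sample barcode."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_sizes(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(barcode) for barcode in samples}
    return len(samples), len(participants)
```

A participant who contributed both a tumor sample and a normal sample adds two to the sample count but only one to the participant count.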

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the list of files associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
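Once downloaded, the file list can be processed with Python's standard `csv` module. The sketch below groups Cloud Storage paths by platform. The exact column header spellings (“Platform”, “File Path”, and so on) and every value in `EXAMPLE_CSV` are assumptions made up for illustration, so check the header row of your own download before relying on them.

```python
import csv
import io
from collections import defaultdict

# Made-up example content; real downloads will differ.
EXAMPLE_CSV = (
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "TCGA-XX-0001-01A,platform_A,pipeline_1,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-XX-0002-01A,platform_B,pipeline_2,Level 3,gs://example-bucket/b.txt\n"
    "TCGA-XX-0003-01A,platform_A,pipeline_1,Level 3,gs://example-bucket/c.txt\n"
)


def paths_by_platform(csv_text):
    """Group the file paths in a downloaded file list by platform."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["Platform"]].append(row["File Path"])
    return dict(grouped)
```

For a file saved to disk, you could read its contents with `open(path, newline="").read()` and pass the text to `paths_by_platform`.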

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

On the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent on the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platforms.

  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platforms.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG Probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platforms.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG Island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

  – Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there will be options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables, as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  - You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  - You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  - You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
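The Owner and Reader rules above can be sketched in a few lines of Python. This is only an illustration of the rules as described here, not the web-app's actual data model: the `Cohort` class, its fields, and the user names are all hypothetical.

```python
from dataclasses import dataclass, field

OWNER = "Owner"
READER = "Reader"


@dataclass
class Cohort:
    """Hypothetical in-memory stand-in for a cohort (not the web-app's model)."""
    name: str
    owner: str
    samples: frozenset
    readers: set = field(default_factory=set)

    def permission_for(self, user):
        """Return OWNER, READER, or None (cohorts are private by default)."""
        if user == self.owner:
            return OWNER
        if user in self.readers:
            return READER
        return None

    def share(self, acting_user, other_user):
        """Only the owner may share; the recipient becomes a Reader."""
        if self.permission_for(acting_user) != OWNER:
            raise PermissionError("only the owner can share this cohort")
        self.readers.add(other_user)

    def clone(self, acting_user):
        """Anyone who can view the cohort may copy it and then owns the copy.

        Only the sample list is copied; the list of readers is not.
        """
        if self.permission_for(acting_user) is None:
            raise PermissionError("no access to this cohort")
        return Cohort(self.name + " (copy)", acting_user, self.samples)


# Illustrative users and sample identifiers only.
original = Cohort("my cohort", "alice", frozenset({"S1", "S2"}))
original.share("alice", "bob")
copy_for_bob = original.clone("bob")
```

In this sketch `bob` is a Reader of the original cohort but the Owner of his copy, which is why cloning is the route to making editable changes.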


1.4.8 Release Notes

• December 23, 2015: v0.2

  - A treemap graph on the cohort details and cohort creation pages does not apply its own filter to itself. For example, if you select a study, the study treemap graph will not update.

  - Cohort file list download is not working.

• December 3, 2015: v0.1

  - First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project (Section 1.6.2) for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
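For users who prefer Python to the web interface, the same tables can in principle be queried with the `google-cloud-bigquery` client library. The following is a minimal sketch under several assumptions: the dataset and table names passed in are placeholders (this section does not list the actual names), the query uses standard-SQL table syntax, and `billing_project` stands for your own GCP project ID. Only the project ID `isb-cgc` comes from the text above.

```python
def build_preview_query(dataset, table, project="isb-cgc", limit=10):
    """Return a standard-SQL query that previews a few rows of a table.

    `dataset` and `table` are placeholders supplied by the caller; look up
    the real names in the BigQuery web interface under the isb-cgc project.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    return "SELECT * FROM `{}.{}.{}` LIMIT {}".format(
        project, dataset, table, int(limit)
    )


def run_query(sql, billing_project):
    """Run `sql`, charging the query to your own GCP project.

    Requires the third-party google-cloud-bigquery package and valid Google
    credentials, so the import is deferred until the function is called.
    """
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Hypothetical dataset and table names, for illustration only.
sql = build_preview_query("example_dataset", "example_table", limit=5)
```

Calling `run_query(sql, "your-project-id")` would then execute the query; the data is read from the `isb-cgc` project while the query itself runs in, and is billed to, your own project.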

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page and, after you successfully authenticate, you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
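Because access lapses 24 hours after verification, a long-running script may want to check how much of that window remains before starting a job. The sketch below only does the date arithmetic; it assumes you record the verification time yourself, since no API for retrieving it is described here, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

# The 24-hour window stated in the answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Time at which controlled-access permission lapses."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while `now` falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


# Example: verification recorded by hand at 09:00 on March 3, 2016.
verified = datetime(2016, 3, 3, 9, 0)
still_valid = has_controlled_access(verified, datetime(2016, 3, 3, 20, 0))
```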


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at Bioconductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


1.4.2 Cohorts

Cohorts are a way of creating custom groupings of the samples and/or participants that you are interested in analyzing further. For example, you can create cohorts that span multiple projects, contain only samples for which certain types of data are available, or focus on specific phenotypic characteristics.

Creating and saving a cohort

To create a cohort from the User Dashboard, click on the “+ Create” button and select “New Cohort”. This will take you to the cohort creation page.

Cohort Creation Page

Using the provided list of filters on the left-hand side, you can select the attributes and features that you are interested in. When you click on a feature, the field expands and provides you with additional filtering options. For example, when you click on “Vital Status”, it expands and provides a list of “Alive”, “Dead”, and “None” as options to choose from. Selecting one or more of these will cause the filter(s) to appear in the Selected Filters panel, and the visualizations on the page will be updated to reflect that the current cohort has been filtered by Vital Status. The numbers beside the selectable filter values reflect the number of samples that have that attribute, based on all other filters that have been selected.
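The counts shown beside each filter value can be thought of as follows: for a given attribute, apply every selected filter except the one on that attribute, then count the remaining samples per value. The sketch below illustrates that reading in Python. It is an interpretation of the behaviour described above, not the web-app's implementation, and the attribute names and sample records are made up.

```python
from collections import Counter


def matches(sample, filters, skip=None):
    """True if the sample satisfies every selected filter except `skip`."""
    return all(
        sample.get(attribute) in allowed
        for attribute, allowed in filters.items()
        if attribute != skip and allowed
    )


def facet_counts(samples, filters, attribute):
    """Count samples per value of `attribute`, applying all *other* filters."""
    return Counter(
        sample.get(attribute)
        for sample in samples
        if matches(sample, filters, skip=attribute)
    )


# Illustrative records; the keys are hypothetical attribute names.
samples = [
    {"vital_status": "Alive", "gender": "FEMALE"},
    {"vital_status": "Dead", "gender": "FEMALE"},
    {"vital_status": "Alive", "gender": "MALE"},
]
selected = {"vital_status": {"Alive"}, "gender": {"FEMALE"}}

vital_status_counts = facet_counts(samples, selected, "vital_status")
gender_counts = facet_counts(samples, selected, "gender")
```

Here the Vital Status counts ignore the Vital Status selection itself, so “Dead” still shows a non-zero count even though only “Alive” is selected.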

Cohort Filters

Participant Filters List

• Project
• Study
• Vital Status
• Gender
• Age At Diagnosis
• Sample Type Code
• Tumor Tissue Site
• Histological Type
• Prior Diagnosis
• Pathologic State
• Tumor Status
• New Tumor Event After Initial Treatment
• Histological Grade
• Residual Tumor
• Tobacco Smoking History
• ICD-10
• ICD-O-3 Histology
• ICD-O-3 Site

Data Type Filters List

• DNA-Sequence
• RNA-Sequence
• miRNA-Sequence
• Protein
• SNP Copy-Number
• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
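The number shown when hovering over a swatch is, in effect, a count of samples per pair of platforms on two neighbouring data types. A minimal sketch of that tally is given below; the platform names are placeholders, and the record layout (one dictionary per sample, keyed by data type) is an assumption made for the example.

```python
from collections import Counter


def swatch_counts(samples, left_type, right_type):
    """Count samples for each (left platform, right platform) pair.

    A sample with no data of a given type is counted under "NA",
    mirroring the "NA" segment on each vertical bar.
    """
    return Counter(
        (sample.get(left_type, "NA"), sample.get(right_type, "NA"))
        for sample in samples
    )


# Illustrative samples; "platform_A" etc. are placeholder platform names.
samples = [
    {"DNA Methylation": "platform_A", "Protein": "platform_X"},
    {"DNA Methylation": "platform_A"},
    {"DNA Methylation": "platform_B", "Protein": "platform_X"},
    {"DNA Methylation": "platform_A", "Protein": "platform_X"},
]
counts = swatch_counts(samples, "DNA Methylation", "Protein")
```

Reordering the vertical bars in the panel corresponds to calling the function with a different pair of neighbouring data types.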

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the “Set Operations” button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may do the following:

• Enter a name for the new cohort you’re about to create.
• Select a set operation.
• Edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort, from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
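The three operations behave like ordinary set operations on the sample lists of the chosen cohorts. The sketch below shows the idea with Python sets; the function names and sample identifiers are illustrative and are not part of the web-app.

```python
from functools import reduce


def union(*cohorts):
    """Samples that appear in at least one cohort (order does not matter)."""
    return set().union(*cohorts)


def intersect(*cohorts):
    """Samples that appear in every cohort (order does not matter)."""
    if not cohorts:
        raise ValueError("intersect needs at least one cohort")
    return reduce(lambda a, b: a & b, (set(c) for c in cohorts))


def complement(base, *others):
    """Samples in the base cohort after subtracting all other cohorts."""
    return set(base).difference(*others)


# Illustrative cohorts of made-up sample identifiers.
cohort_a = {"s1", "s2", "s3"}
cohort_b = {"s2", "s3", "s4"}
cohort_c = {"s3"}

all_samples = union(cohort_a, cohort_b, cohort_c)
shared = intersect(cohort_a, cohort_b, cohort_c)
only_in_a = complement(cohort_a, cohort_b, cohort_c)
```

Unlike union and intersect, the result of complement depends on which cohort is chosen as the base: `complement(cohort_b, cohort_a)` gives a different set than `complement(cohort_a, cohort_b)`.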

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to the filters that have already been selected. To return to the previous view, you must either save the selected filters or choose to cancel adding new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to sharing from the User Dashboard page. A dialogue box appears and the user is prompted to select users registered in the system with whom to share the cohort.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
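To see why the two counts can differ, consider TCGA-style barcodes, where a sample barcode extends the participant barcode with a sample-type suffix. The sketch below assumes that convention (the first three dash-separated fields identify the participant); the barcodes themselves are invented for the example.

```python
def participant_of(sample_barcode):
    """Participant barcode: the first three dash-separated fields.

    Assumes the TCGA convention in which a sample barcode such as
    "TCGA-AA-0001-01A" belongs to participant "TCGA-AA-0001".
    """
    return "-".join(sample_barcode.split("-")[:3])


def cohort_sizes(sample_barcodes):
    """Return (number of samples, number of participants) for a cohort."""
    samples = set(sample_barcodes)
    participants = {participant_of(barcode) for barcode in samples}
    return len(samples), len(participants)


# Made-up barcodes: the first participant contributed two samples.
barcodes = ["TCGA-AA-0001-01A", "TCGA-AA-0001-11A", "TCGA-BB-0002-01A"]
n_samples, n_participants = cohort_sizes(barcodes)
```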

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use the “Previous Page” and “Next Page” buttons to show more values in the list. You may filter these files if you are only interested in a specific data type and platform; selecting a filter will update the list. The number next to each platform is the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file:

• Sample Barcode
• Platform
• Pipeline
• Data Level
• File Path to the Cloud Storage Location
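Once downloaded, the file list can be processed with Python's standard `csv` module, for example to group Cloud Storage paths by platform. This is a sketch only: the exact column headers in the downloaded file are not given here, so the header names, bucket name, and values below are assumptions chosen to match the five fields listed above.

```python
import csv
import io
from collections import defaultdict


def paths_by_platform(csv_text, platform_column="Platform",
                      path_column="File Path"):
    """Group file paths by platform from the text of a file-list CSV.

    The default column names are assumptions; adjust them to match the
    header row of the file you actually downloaded.
    """
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row[platform_column]].append(row[path_column])
    return dict(grouped)


# Invented example content with hypothetical headers and values.
example_csv = (
    "Sample Barcode,Platform,Pipeline,Data Level,File Path\n"
    "TCGA-AA-0001-01A,platform_A,pipeline_1,Level 3,gs://example-bucket/a.txt\n"
    "TCGA-AA-0001-01A,platform_B,pipeline_2,Level 3,gs://example-bucket/b.txt\n"
    "TCGA-BB-0002-01A,platform_A,pipeline_1,Level 3,gs://example-bucket/c.txt\n"
)
grouped = paths_by_platform(example_csv)
```

For a file on disk, read its contents first (for example with `open(path, newline="").read()`) and pass the text to `paths_by_platform`.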

Commenting: Any user who owns a cohort, or has had it shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  - Single autocomplete textbox: this input searches through the names of all the clinical features available.

• Gene Expression

  - Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Platform Filter: filters the plot-able features by platform.
  - Center Filter: filters the plot-able features by processing center.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• miRNA

  - miRNA Name Filter: filters the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  - Platform Filter: filters the plot-able features by platform.
  - Value Filter: filters the plot-able features by value.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Methylation

  - Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - CpG Probe Filter: filters the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  - Platform Filter: filters the plot-able features by platform.
  - Gene Region Filter: filters the plot-able features by specific gene regions.
  - CpG Island Region Filter: filters the plot-able features by CpG island region.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Copy Number

  - Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Value Filter: filters the plot-able features by value.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Protein

  - Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Protein Filter: filters the plot-able features by protein. This is an autocomplete search field for a protein name.
  - Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Mutation

  - Gene Filter: filters the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  - Value Filter: filters the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox overrides any feature selected in Color By Feature. The cohorts provided are used as the legend and as the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button becomes active and you can proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.



Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may make configuration changes to the plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and will be able to make changes.
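The Owner and Reader rules above can be summarized in a small model. This is only an illustrative sketch: the `SharedItem` class and its method names are hypothetical and are not part of the ISB-CGC web-app or its API. It assumes that only the owner may edit or share, that readers may view and clone, and that a clone carries over the sample list but not the original sharing settings.

```python
from dataclasses import dataclass, field


@dataclass
class SharedItem:
    """Hypothetical stand-in for a cohort or visualization."""
    name: str
    owner: str
    samples: tuple = ()
    readers: set = field(default_factory=set)

    def role(self, user):
        """Return 'owner', 'reader', or None if the user has no access."""
        if user == self.owner:
            return "owner"
        if user in self.readers:
            return "reader"
        return None

    def can_view(self, user):
        return self.role(user) is not None

    def can_edit(self, user):
        # Only the owner may save changes or share the item.
        return self.role(user) == "owner"

    def share(self, acting_user, new_reader):
        if not self.can_edit(acting_user):
            raise PermissionError("only the owner can share this item")
        self.readers.add(new_reader)

    def clone(self, user):
        """Copy the sample list; the person cloning owns the copy."""
        if not self.can_view(user):
            raise PermissionError("user has no access to this item")
        return SharedItem(name=self.name + " (copy)", owner=user,
                          samples=self.samples)
```

In this sketch a reader who wants to save changes calls `clone` first and then edits the copy, which mirrors the workflow described above.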

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs in the cohort details and cohort creation pages will not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
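Once the “isb-cgc” project is visible, tables are addressed by project, dataset, and table name. The sketch below only builds the text of a preview query that you could paste into the BigQuery web interface; it does not contact BigQuery. The project ID “isb-cgc” comes from the answer above, while the dataset, table, and column names used in the usage note are placeholders and must be replaced with names you actually see in the side-bar. The query uses the backtick-quoted table reference of BigQuery standard SQL.

```python
def build_preview_query(dataset, table, columns, limit=10, project="isb-cgc"):
    """Return the text of a small SELECT against project.dataset.table.

    The dataset, table, and column names are supplied by the caller;
    nothing here checks that they exist in BigQuery.
    """
    if not columns:
        raise ValueError("at least one column name is required")
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    column_list = ", ".join(columns)
    return (
        f"SELECT {column_list}\n"
        f"FROM `{project}.{dataset}.{table}`\n"
        f"LIMIT {limit}"
    )
```

For example, `build_preview_query("example_dataset", "example_table", ["example_column"])` produces a three-line query reading from `isb-cgc.example_dataset.example_table`, where all three `example_` names are hypothetical.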

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page and, after you successfully authenticate, you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
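The last two answers describe two rules: an eRA Commons id can be linked to only one Google identity at a time, and verified access lasts 24 hours. The following sketch models those two rules so they are easier to reason about. The `IdentityLinks` class is hypothetical and is not how the ISB-CGC platform is implemented; in particular, treating access as expired at exactly 24 hours is an assumption.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the FAQ answer above


class IdentityLinks:
    """Hypothetical registry of eRA Commons id -> Google identity links."""

    def __init__(self):
        self._google_by_era = {}
        self._verified_at = {}

    def link(self, era_id, google_identity, now):
        """Link the ids and record the time of dbGaP verification."""
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            raise ValueError("unlink the previous Google identity first")
        self._google_by_era[era_id] = google_identity
        self._verified_at[google_identity] = now

    def unlink(self, era_id):
        google_identity = self._google_by_era.pop(era_id, None)
        if google_identity is not None:
            self._verified_at.pop(google_identity, None)

    def has_controlled_access(self, google_identity, now):
        """True while the most recent verification is under 24 hours old."""
        verified = self._verified_at.get(google_identity)
        return verified is not None and now - verified < ACCESS_WINDOW
```

After the window closes, `has_controlled_access` returns False until `link` is called again, which corresponds to signing in to the web-app and re-authenticating with NIH.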

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on github. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on github, and also in the documentation from the Google Genomics workshop at BioConductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on github will continue to grow, providing starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment, built on the familiar IPython (now known as Jupyter) environment and running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on github.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on github, showcase some of the tools at your disposal.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

Data Type Filters List

• DNA-Sequence

• RNA-Sequence

• miRNA-Sequence

• Protein

• SNP Copy-Number

• DNA Methylation

Selected Filters Panel: This is where selected filters are shown, so there is an easy way to see which filters have been selected. Clicking on “Clear All” will remove all selected filters.

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the selected samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis. By using the “Show More” button, you can see two more treemaps that are currently available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type (vertical bar) is subdivided according to the different platforms that were used to generate this type of data (with “NA” indicating samples for which this data type is not available). Each sample in the current cohort is represented by a single line that “flows” horizontally from left to right, crossing each vertical bar in the appropriate segment. When you hover over a swatch between two vertical bars, you will see the number of samples that have data from those two platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.
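The number shown when hovering over a swatch is a count of samples sharing a pair of platforms in two adjacent data types. A minimal sketch of that count is shown here; the input layout (a dictionary from sample to its platform per data type) and the function name are assumptions made for illustration, and the web-app may compute these numbers differently.

```python
from collections import Counter


def platform_pair_counts(samples, type_a, type_b):
    """Count samples for each (platform in type_a, platform in type_b) pair.

    `samples` maps a sample identifier to a dict of data type -> platform.
    A data type missing for a sample is counted under "NA".
    """
    counts = Counter()
    for platforms in samples.values():
        pair = (platforms.get(type_a, "NA"), platforms.get(type_b, "NA"))
        counts[pair] += 1
    return counts
```

Each key of the returned counter corresponds to one swatch between the two vertical bars, and its value is the number that would be displayed for that swatch.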

Operations on Cohorts

Set Operations

You can create cohorts using set operations on the User Dashboard page.

To activate the set operations button, you must have at least one cohort selected. Upon clicking the “Set Operations” button, a dialogue box will appear. Here you may enter a name for the new cohort you’re about to create, select a set operation, and edit the cohorts to be used in the operation.

The intersect and union operations can take any number of cohorts, in any order. The complement operation requires a base cohort from which the other cohorts will be subtracted. Click “Okay” to complete the operation and create the new cohort.
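The three operations behave like ordinary set operations on the sample lists of the selected cohorts. The sketch below illustrates this with Python sets; the function names are hypothetical, and cohorts are represented simply as sets of sample identifiers rather than as web-app objects.

```python
def union(cohorts):
    """Samples that appear in at least one of the cohorts."""
    return set().union(*cohorts)


def intersect(cohorts):
    """Samples that appear in every cohort (empty if no cohorts are given)."""
    if not cohorts:
        return set()
    return set.intersection(*map(set, cohorts))


def complement(base, others):
    """Samples in the base cohort that are in none of the other cohorts."""
    return set(base).difference(*others)
```

Union and intersection give the same result regardless of the order of the cohorts, whereas the complement depends on which cohort is chosen as the base, which is why the dialogue asks for a base cohort.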

Editing a Cohort

Details of cohort edit page

Main Menu

• Add New Filters: Selecting this menu item makes the filters panel appear. Any filters selected will be added to any filters that have already been selected. To return to the previous view, you must either save any selected filters or choose to cancel adding any new filters.

• Comments: Selecting “Comments” will cause the Comments panel to appear. Here anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.

• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients, and make you the owner of the copy.

• Share with Others: This behaves similarly to the sharing option on the User Dashboard page. A dialogue box appears and the user is prompted to select users registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers may differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
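To see why the two numbers in the Details Panel can differ, consider how TCGA barcodes are built: the first three hyphen-separated fields of a sample barcode identify the participant. The sketch below counts both from a list of sample barcodes. The function names are hypothetical, the barcodes in the usage note are made up, and the web-app may derive its counts in another way.

```python
def participant_barcode(sample_barcode):
    """Return the participant part of a TCGA barcode.

    Assumes the standard TCGA layout, in which the first three
    hyphen-separated fields (project, site, participant) identify
    the participant.
    """
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes):
    """Return (number of distinct samples, number of distinct participants)."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(barcode) for barcode in samples}
    return len(samples), len(participants)
```

For instance, a cohort containing a tumor sample and a blood sample from one participant plus a single tumor sample from another participant has three samples but only two participants.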

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps that are available.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, with “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering over a horizontal segment between two bars, you will see the number of samples that have data from both of those platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform is the number of files available for that platform. There is only one menu item available, “Download File List as CSV”. Selecting this item will begin a download of the list of all the files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, and File Path to the Cloud Storage Location.
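Once downloaded, the file list can be processed with any CSV reader. The sketch below uses Python's standard `csv` module to count files per platform and to collect the storage paths for one platform. The exact column headers in the downloaded file are an assumption here (“Platform” and “File Path”), so check the first line of your own download and adjust the names if they differ.

```python
import csv
import io
from collections import Counter

# Assumed header names; the real download may label its columns differently.
PLATFORM_COLUMN = "Platform"
PATH_COLUMN = "File Path"


def files_per_platform(csv_text):
    """Count the rows of the file list for each platform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[PLATFORM_COLUMN] for row in reader)


def paths_for_platform(csv_text, platform):
    """Return the Cloud Storage paths listed for one platform."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[PATH_COLUMN] for row in reader
            if row[PLATFORM_COLUMN] == platform]
```

To use these helpers on a real download, read the file into a string first, for example with `open(path, newline="").read()`, and pass that string as `csv_text`.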

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From within a cohort: If you are viewing a cohort you created, you can delete the cohort using the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots. Each plot is configurable to show relevant data, and a visualization can be shared with others.

Create

You can create a new visualization from the User Dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last cohort you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button becomes active and you can proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete it using the top right menu option.

Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platform.

  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platform.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platform.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is set in Color By Feature. It will use the cohorts provided as the legend in place of the Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there are options to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort select the ldquo+ Cohortrdquo option underneath the currently selected list of cohorts This will take you tothe cohorts listing panel where you can select a cohort from the list or use the autocomplete textbox to search in theirlist of cohorts When all the settings have been set you can click ldquoUpdate Plotrdquo to regenerate the plot with the newsettings

Saving a SeqPeek Visualization

To save changes to the visualization click ldquoSave Visualizationrdquo button This will prompt you to name your SeqPeekvisualization and save You will be notified after it saves correctly

Deleting a SeqPeek Visualization

bull From User Dashboard Select the SeqPeek visualizations that you wish to delete using the checkboxes next tothe visualization When one or more are selected the delete button will be active and you can then proceed todeleting them

bull From SeqPeek Visualization When viewing a SeqPeek visualization that you created you may delete it usingthe top right menu option

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

44 Chapter 1 Contents

ISB Cancer Genomics Cloud Documentation Release 100

145 Integrative Genomics Viewer

Important Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon Specif-ically the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as wellthe low-level sequence data via the GA4GH API

Accessing the IGV Browser

To access IGV go to the cohort file list page The file listing table includes a column labelled ldquoIGVrdquo For those filesthat also have a ReadGroupSet ID in Google Genomics a green check mark will appear in the column Clicking onthat link will take you to the IGV browser with the appropriate dataset and readgroupset preselected

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

146 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort

The person the cohort is shared with has read only access If they would like to be able to edit the cohort they wouldfirst need to make a copy of it This only copies the list of samples and participants associated to the cohort and notany additional information that may be private to the original owner of the cohort

Sharing a visualization

A visualization can only be shared by the owner of the visualization

The person the visualization is shared with has read only access They may make changes to the plot settings but theywill not be able to save those changes If they want to be able to save changes they need to clone the visualizationfirst Underlying cohorts are shared with the user when the visualization is sharedmaking changes to those cohortswill require the user to first clone the cohort as noted above

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization

The person the SeqPeek visualization is shared with has read only access They may make changes to the settings butthey will not be able to save those changes If they want to be able to save changes they need to clone the SeqPeekvisualization first Underlying cohorts are shared with the user when the visualization is sharedmaking changes tothose cohorts will require the user to first clone the cohort as noted above

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

14 ISB-CGC Web Interface 45

ISB Cancer Genomics Cloud Documentation Release 100

1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.
• Reader: If a cohort or visualization is shared with you, you have view-only access.
  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and will be able to make changes.
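The Owner and Reader rules above can be summarized as a small model. The following Python sketch is illustrative only: the `Cohort` class, its method names and the user names in the usage note are hypothetical and are not part of the ISB-CGC web-app or its API.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, List, Optional, Set

OWNER = "owner"
READER = "reader"


@dataclass
class Cohort:
    """Hypothetical in-memory model of a cohort and its permissions."""
    name: str
    samples: FrozenSet[str]
    owner: str
    readers: Set[str] = field(default_factory=set)
    comments: List[str] = field(default_factory=list)

    def permission(self, user: str) -> Optional[str]:
        """Return OWNER, READER, or None if the user cannot see the cohort."""
        if user == self.owner:
            return OWNER
        if user in self.readers:
            return READER
        return None

    def share(self, actor: str, other: str) -> None:
        """Only the owner may share; the recipient becomes a Reader."""
        if self.permission(actor) != OWNER:
            raise PermissionError("only the owner can share a cohort")
        self.readers.add(other)

    def comment(self, actor: str, text: str) -> None:
        """Owners and Readers may both comment."""
        if self.permission(actor) is None:
            raise PermissionError("no access to this cohort")
        self.comments.append(f"{actor}: {text}")

    def clone(self, actor: str) -> "Cohort":
        """Copy only the sample list; the actor owns the new cohort."""
        if self.permission(actor) is None:
            raise PermissionError("no access to this cohort")
        return Cohort(name=f"Copy of {self.name}", samples=self.samples, owner=actor)
```

In this sketch a Reader who calls `clone` receives a new cohort with an empty comment list and no readers, mirroring the statement that a copy carries only the list of samples and not information private to the original owner.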


1.4.8 Release Notes

• December 23, 2015: v0.2
  – A treemap graph on the cohort details and cohort creation pages does not apply its own filters to itself. For example, if you select a study, the study treemap graph will not update.
  – Cohort file list download is not working.
• December 3, 2015: v0.1
  – First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
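The same tables can also be queried from Python. The sketch below is a minimal illustration, assuming the `google-cloud-bigquery` client library is installed and that you are a member of a Google Cloud Project that can be billed for the query. The table name `isb-cgc.example_dataset.example_table` and the column name used in the usage note are placeholders, not real ISB-CGC names; look up actual dataset, table and column names in the BigQuery web interface.

```python
import re

# Hypothetical table name, for illustration only; browse the "isb-cgc"
# project in the BigQuery web interface to find real dataset and table names.
EXAMPLE_TABLE = "isb-cgc.example_dataset.example_table"

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_count_query(table: str, column: str, limit: int = 10) -> str:
    """Build a standard-SQL query counting rows per distinct value of `column`."""
    if not _IDENTIFIER.match(column):
        raise ValueError(f"unexpected column name: {column!r}")
    return (
        f"SELECT {column}, COUNT(*) AS n "
        f"FROM `{table}` "
        f"GROUP BY {column} ORDER BY n DESC LIMIT {int(limit)}"
    )


def run_query(sql: str, billing_project: str):
    """Run `sql`, billing the query to your own Google Cloud Project."""
    # Imported here so the query builder works without the client library.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]
```

For example, `run_query(build_count_query(EXAMPLE_TABLE, "Study"), "my-gcp-project")` would return one dictionary per distinct value, where both `"Study"` and `"my-gcp-project"` stand in for a real column name and your own project ID.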

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled data for 24 hours.
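The two rules described in the last two answers (one Google identity per eRA Commons id, and a 24-hour access window after verification) can be pictured with a short sketch. This is not the platform's implementation: the `IdentityLinks` class and its methods are hypothetical, and the choice to drop access when an id is unlinked is an assumption made for the example.

```python
from datetime import datetime, timedelta
from typing import Dict

# Access window stated in the FAQ above.
ACCESS_WINDOW = timedelta(hours=24)


class IdentityLinks:
    """Hypothetical bookkeeping for eRA Commons id <-> Google identity links."""

    def __init__(self) -> None:
        self._google_by_era: Dict[str, str] = {}
        self._verified_at: Dict[str, datetime] = {}

    def link(self, era_id: str, google_id: str) -> None:
        """Link an eRA Commons id to exactly one Google identity."""
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_id:
            raise ValueError("eRA Commons id is linked elsewhere; unlink it first")
        self._google_by_era[era_id] = google_id

    def unlink(self, era_id: str, google_id: str) -> None:
        """Remove a link; only the identity that holds it can do so."""
        if self._google_by_era.get(era_id) != google_id:
            raise ValueError("this Google identity does not hold the link")
        del self._google_by_era[era_id]
        # Assumption for this sketch: unlinking also ends the access window.
        self._verified_at.pop(google_id, None)

    def record_verification(self, google_id: str, when: datetime) -> None:
        """Record the time at which dbGaP authorization was verified."""
        if google_id not in self._google_by_era.values():
            raise ValueError("Google identity has no linked eRA Commons id")
        self._verified_at[google_id] = when

    def has_controlled_access(self, google_id: str, now: datetime) -> bool:
        """True while `now` falls within 24 hours of the last verification."""
        verified = self._verified_at.get(google_id)
        return verified is not None and now - verified < ACCESS_WINDOW
```

After the window closes, `has_controlled_access` returns `False` until a new verification is recorded, which corresponds to signing in through the web-app again.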


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course! The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
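When preparing the cost estimate mentioned above, a few lines of arithmetic are usually enough. The sketch below is only an aid for that estimate: the function names are hypothetical, and the unit prices are deliberately left as parameters because actual Google Cloud Platform pricing varies by product and over time and must be looked up separately.

```python
# Initial allocation stated above, in US dollars.
INITIAL_ALLOCATION_USD = 300.0


def estimate_cost(storage_gb: float, months: float, usd_per_gb_month: float,
                  vm_hours: float, usd_per_vm_hour: float) -> float:
    """Rough total: storage held for `months` plus VM time, at the given rates."""
    storage_cost = storage_gb * months * usd_per_gb_month
    compute_cost = vm_hours * usd_per_vm_hour
    return storage_cost + compute_cost


def fits_initial_allocation(cost_usd: float) -> bool:
    """True if the estimate stays within the initial $300 allocation."""
    return cost_usd <= INITIAL_ALLOCATION_USD
```

Calling `estimate_cost` with your own storage volume, duration, VM hours and the current unit prices gives a figure you can quote in the request; if `fits_initial_allocation` returns `False`, the initial allocation is better spent on prototype analyses that refine the estimate.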


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring and sharing DNA sequence reads, reference-based alignments and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


• Comments: Selecting “Comments” will cause the Comments panel to appear. Here, anyone who can see this cohort can comment on it. Comments are shared with anyone who can view this cohort and are ordered with the newest at the bottom.
• Make a Copy: Making a copy will create a copy of this cohort with the same list of samples and patients and make you the owner of the copy.
• Share with Others: This behaves similarly to the same option on the User Dashboard page. A dialogue box appears and the user is prompted to select users registered in the system to share the cohort with.

Selected Filters Panel: This panel displays any filters that have been used on the cohort or any of its ancestors. These cannot be modified, and any additional filters applied to this cohort will be appended to the list.

Details Panel: This panel displays the number of samples and participants in this cohort. These numbers can differ because some participants may have provided multiple samples. This panel also displays “Your Permissions”, which can be either owner or reader.
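The difference between the two counts in the Details Panel can be reproduced from a list of sample barcodes. The sketch below relies on the standard TCGA barcode layout, in which the first three hyphen-separated fields of a sample barcode identify the participant; the function names and the barcodes in the usage note are made up for illustration.

```python
from typing import Iterable, Tuple


def participant_barcode(sample_barcode: str) -> str:
    """Return the participant part (first three fields) of a TCGA barcode."""
    return "-".join(sample_barcode.split("-")[:3])


def cohort_counts(sample_barcodes: Iterable[str]) -> Tuple[int, int]:
    """Return (number of distinct samples, number of distinct participants)."""
    samples = set(sample_barcodes)
    participants = {participant_barcode(s) for s in samples}
    return len(samples), len(participants)
```

For a cohort containing the invented barcodes `TCGA-AB-1234-01A`, `TCGA-AB-1234-10A` and `TCGA-CD-5678-01A`, the first two samples come from the same participant, so the sample count is higher than the participant count.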

Clinical Features Panel: This panel shows a list of treemaps that give a high-level breakdown of the samples for a handful of features: Disease Code, Vital Status, Sample Type, Tumor Tissue Site, Gender, and Age at Initial Pathologic Diagnosis.

By using the “Show More” button, you can see two more treemaps.

Data Availability Panel: This panel shows a parallel sets graph of available data for the selected samples in the cohort. The large headers over the vertical bars are data types. Each data type is broken up into its different platforms, plus “NA” for samples that do not have that data type. The bars that flow horizontally indicate the number of samples that have that data. By hovering on a horizontal segment between the first two bars, you will see the number of samples that have both of those data type platforms. You can also reorder the vertical categories by dragging the headers left and right, and reorder the platforms by dragging the platform names up and down.

“View File List” takes you to a new page where you can view the file list associated with the cohort you are looking at. The file list page provides a paginated list of files available for all samples in the cohort. Here, “available” refers to files that have been uploaded to the ISB-CGC Google Cloud Project and that are open-access data. You can use “Previous Page” and “Next Page” to show more values in the list. You may filter these files if you are only interested in a specific data type and platform. Selecting a filter will update the associated list. The number next to each platform refers to the number of files available for that platform. There is only one menu item available: “Download File List as CSV”. Selecting this item will begin a download of the list of all files available for the cohort, taking into account the selected Platform filters. The file contains the following information for each file: Sample Barcode, Platform, Pipeline, Data Level, File Path to the Cloud Storage Location.
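Once downloaded, the file list can be processed with any CSV reader. The following Python sketch groups storage paths by platform. The exact header text of the downloaded file is an assumption here: the column names `Platform` and `Cloud Storage Location` are placeholders that should be adjusted to match the real file, and the function name is hypothetical.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List

# Assumed header names; check them against the first line of the real CSV.
PLATFORM_COLUMN = "Platform"
PATH_COLUMN = "Cloud Storage Location"


def paths_by_platform(csv_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group file paths by platform from an open CSV file or list of lines."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for row in csv.DictReader(csv_lines):
        grouped[row[PLATFORM_COLUMN]].append(row[PATH_COLUMN])
    return dict(grouped)
```

A typical call would be `paths_by_platform(open("file_list.csv", newline=""))`, where `file_list.csv` stands for whatever name the browser gave the download.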

Commenting: Any user who owns a cohort, or has had a cohort shared with them, can comment on it. To open comments, use the menu button at the top right and select “Comments”. A sidebar will appear on the right side, and any previously created comments will be shown.

At the bottom of the comments sidebar, you can create a new comment and save it. It should appear at the bottom of the list of comments.

Deleting a cohort

From the dashboard: Select the cohorts that you wish to delete using the checkboxes next to the cohorts. When one or more are selected, the delete button will be active and you can then proceed to delete them.


From within a cohort: If you are viewing a cohort you created, you can delete it from the top right menu.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Enter a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

From Visualization: If you are viewing a visualization you created, you can delete the visualization from the top right menu.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Setting panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical
  – Single autocomplete textbox: This input searches through the names of all the clinical features available.
• Gene Expression
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Platform Filter: filters down the plot-able features by platform.
  – Center Filter: filters down the plot-able features by processing center.


  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.
• miRNA
  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
  – Platform Filter: filters down the plot-able features by platform.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.
• Methylation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
  – Platform Filter: filters down the plot-able features by platform.
  – Gene Region Filter: filters down the plot-able features by specific gene regions.
  – CpG Island Region Filter: filters down the plot-able features by CpG island region.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.
• Copy Number
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by value.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.
• Protein
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.
• Mutation
  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
  – Value Filter: filters down the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.
• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.


• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.


1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, the following options are used to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.
• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.






1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
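For Python users, the same tables can in principle be queried from a script instead of the web interface. The following is a minimal sketch, not an official ISB-CGC example: only the project ID “isb-cgc” comes from the text above, while the dataset, table, and column names in the usage line are hypothetical placeholders that you would replace with names you see in the BigQuery side-bar. The optional run_query helper assumes the google-cloud-bigquery package is installed and that you are a member of a Google Cloud Project that can be billed for the query.

```python
def build_count_query(dataset, table, column, project="isb-cgc"):
    """Return a standard-SQL query counting rows per distinct value of `column`."""
    for name in (project, dataset, table, column):
        # Identifiers are wrapped in backticks below, so reject stray backticks.
        if "`" in name:
            raise ValueError("identifier must not contain a backtick: %r" % name)
    return (
        "SELECT `{col}` AS value, COUNT(*) AS n\n"
        "FROM `{proj}.{ds}.{tbl}`\n"
        "GROUP BY 1\n"
        "ORDER BY n DESC"
    ).format(col=column, proj=project, ds=dataset, tbl=table)


def run_query(sql, billing_project):
    """Run `sql`, billed to your own GCP project (optional dependency)."""
    from google.cloud import bigquery  # imported lazily; not needed to build SQL

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Hypothetical names, for illustration only.
sql = build_count_query("example_dataset", "example_table", "example_column")
print(sql)
```

Keeping the query construction separate from the client call means the SQL can be inspected, or pasted into the web interface, before anything is billed to your project.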

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
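The unlink-then-relink rule can be pictured with a small model. This is only an illustrative sketch of the constraint described above (one Google identity per eRA Commons id at any one time); the class and method names are hypothetical and do not correspond to any ISB-CGC API.

```python
class IdentityLinks:
    """Toy registry allowing at most one Google identity per eRA Commons id."""

    def __init__(self):
        self._google_by_era = {}  # eRA Commons id -> linked Google identity

    def link(self, era_id, google_identity):
        current = self._google_by_era.get(era_id)
        if current is not None and current != google_identity:
            # The previous identity must be unlinked before a new one is linked.
            raise ValueError(
                "%s is already linked to %s; unlink it first" % (era_id, current)
            )
        self._google_by_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        # Only the currently linked identity can remove the link.
        if self._google_by_era.get(era_id) != google_identity:
            raise ValueError("%s is not linked to %s" % (era_id, google_identity))
        del self._google_by_era[era_id]

    def linked_identity(self, era_id):
        return self._google_by_era.get(era_id)


links = IdentityLinks()
links.link("era_user", "old.address@example.com")
links.unlink("era_user", "old.address@example.com")  # sign in as old identity first
links.link("era_user", "new.address@example.com")
```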

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
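If you schedule long-running scripts, it may help to track when that 24-hour window closes. The sketch below only does the date arithmetic; the function names are hypothetical, and it assumes that you record the verification time yourself, since the text above does not describe a way to query it.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the answer above


def access_expires_at(verified_at):
    """Return the moment at which controlled-data access lapses."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while `now` falls inside the 24-hour window after verification."""
    return verified_at <= now < access_expires_at(verified_at)


verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)  # example timestamp
print(access_expires_at(verified).isoformat())
```

Using timezone-aware timestamps avoids off-by-hours mistakes when the script runs on a virtual machine in a different time zone from your workstation.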


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at Bioconductor 2015.

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
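You may also want to keep your own rough tally so that an alert never comes as a surprise. The following sketch is one way to do that; the $300 figure is the initial allocation mentioned above, while the 80% warning threshold and the function names are our own assumptions for illustration and are not how the ISB-CGC monitoring actually works.

```python
INITIAL_ALLOCATION_USD = 300.0  # initial allocation stated above
WARN_FRACTION = 0.8             # assumed warning threshold (not from the docs)


def funding_status(spent_usd, allocation_usd=INITIAL_ALLOCATION_USD,
                   warn_fraction=WARN_FRACTION):
    """Classify spending as 'ok', 'approaching limit', or 'exceeded'."""
    if spent_usd < 0 or allocation_usd <= 0:
        raise ValueError("spent must be >= 0 and allocation must be > 0")
    if spent_usd > allocation_usd:
        return "exceeded"
    if spent_usd >= warn_fraction * allocation_usd:
        return "approaching limit"
    return "ok"


def days_until_limit(spent_usd, daily_cost_usd,
                     allocation_usd=INITIAL_ALLOCATION_USD):
    """Estimate days left at a constant daily cost; None if nothing is spent daily."""
    if daily_cost_usd <= 0:
        return None
    remaining = max(allocation_usd - spent_usd, 0.0)
    return remaining / daily_cost_usd


print(funding_status(120.0), days_until_limit(120.0, 15.0))
```

A prototype run gives you the daily cost figure; the same arithmetic then supports the cost estimate you would include in a request for additional funding.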


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


From within a cohort: If you are viewing a cohort you created, then you can delete the cohort from the top right menu option.

Creating a Cohort from a Visualization

To create a cohort from a visualization, you must be in plot selection mode. If you are in plot selection mode, the crosshairs icon in the top right corner of the plot panel should be blue. If it is not, click on it and it should turn blue.

Once in plot selection mode, you can click and drag your cursor over the plot area to select the desired samples. For a cubbyhole plot, you will have to select each cubby that you are interested in.

When your selection has been made, a small window should appear that contains a button labelled “Save as Cohort”. Click on this when you are ready to create a new cohort.

Put in a name for your newly selected cohort and click the “Save” button.

Copying a cohort

Copying a cohort can only be done from the cohort details page of the cohort you want to copy.

When you are looking at the cohort you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the cohort.


1.4.3 Visualizations

The ISB-CGC web-app provides a variety of tools for visualizing the data associated with cohorts you have defined. A visualization is a collection of one or more plots, and each plot is configurable to show relevant data that can be shared with others.

Create

You can create a new visualization from the user dashboard. Select the “Visualization” option from the “+ Create” menu. This will prompt you to name the visualization and provide at least one cohort to start with. It will automatically choose the last one you created. Click “Create Visualization”.

Save

Use the “Save Visualizations” option in the top right menu. It will save the visualization and all plots in their current configuration. It will not save any selections that have been made within plots.

Delete

From User Dashboard: Select the visualizations that you wish to delete using the checkboxes next to the visualization. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

From Visualization: If you are viewing a visualization you created, then you can delete the visualization from the top right menu option.


Copy

Copying a visualization can only be done from the visualization you want to copy.

When you are looking at the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization with the default plot of an age histogram for the All TCGA Data Cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed with the most recent on the bottom.

To add a new comment, type in your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the feature that can be used in the plot.

• Clinical

  – Single autocomplete textbox: This input searches through the names of all the clinical features available.

• Gene Expression

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Platform Filter: filters down the plot-able features by platforms.

  – Center Filter: filters down the plot-able features by processing center.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• miRNA

  – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.

  – Platform Filter: filters down the plot-able features by platforms.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Methylation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.

  – Platform Filter: filters down the plot-able features by platforms.

  – Gene Region Filter: filters down the plot-able features by specific gene regions.

  – CpG Island Region Filter: filters down the plot-able features by CpG island region.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Copy Number

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by value.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Protein

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.

  – Select Feature: provides the filtered-down list of plot-able features to select from, based on selected filters.

• Mutation

  – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.

  – Value Filter: filters down the plot-able features by mutation value.

• Select Feature: provides the filtered list of plot-able features to select from, based on selected filters.

• Swap Values: This button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.

• Color By Cohort: This checkbox will override any feature that is in the Color By Feature. It will use the cohorts provided as the legend and Color By Feature.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.
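The way the filters narrow the “Select Feature” list can be sketched in a few lines of Python. This is only an illustration of the idea that each active filter removes non-matching features and unset filters are ignored; the function name, the field names, and the feature records are made up for the example and do not reflect how the web-app stores its features.

```python
def filter_features(features, **criteria):
    """Keep the features whose fields match every filter that is set.

    A filter left as None is treated as "not set" and is ignored.
    """
    active = {key: value for key, value in criteria.items() if value is not None}
    return [
        feature for feature in features
        if all(feature.get(key) == value for key, value in active.items())
    ]


# Made-up gene expression features, for illustration only.
features = [
    {"gene": "TP53", "platform": "platform_A", "center": "center_1"},
    {"gene": "TP53", "platform": "platform_B", "center": "center_1"},
    {"gene": "BRCA1", "platform": "platform_A", "center": "center_2"},
]

# Gene Filter set, Platform Filter set, Center Filter left empty.
selectable = filter_features(features, gene="TP53", platform="platform_A", center=None)
print(selectable)
```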

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

In the Settings panel, there will be options to select in order to make a new plot.

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search in your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save. You will be notified after it saves correctly.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualization. When one or more are selected, the delete button will be active and you can then proceed to deleting them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will by default be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: As owner of a cohort or visualization, you are able to edit and share it.

• Reader: If a cohort or visualization is shared with you, you have view-only access.

  – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.

  – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.

  – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner and you will be able to make changes.
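The two permission levels can be summarized as a small lookup table. The sketch below is a simplified model of the rules listed above, not the web-app's actual implementation: the action names and helper functions are hypothetical, and we assume that an Owner can also do everything a Reader can. The clone helper reflects the earlier note that copying a cohort carries over only its samples and participants.

```python
OWNER, READER = "owner", "reader"

# Actions each role may perform; owner-as-superset is an assumption.
ALLOWED_ACTIONS = {
    OWNER: {"view", "comment", "clone", "edit", "save", "share"},
    READER: {"view", "comment", "clone"},
}


def can(role, action):
    """Return True if `role` is allowed to perform `action`."""
    return action in ALLOWED_ACTIONS.get(role, set())


def clone_cohort(cohort, new_owner):
    """Copy only the sample and participant lists; the cloner becomes owner."""
    return {
        "owner": new_owner,
        "samples": list(cohort["samples"]),
        "participants": list(cohort["participants"]),
    }


shared = {
    "owner": "alice@example.com",
    "samples": ["sample-1", "sample-2"],
    "participants": ["participant-1"],
    "comments": ["private note"],  # not carried over by clone_cohort
}
mine = clone_cohort(shared, "bob@example.com")
print(can(READER, "save"), sorted(mine))
```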


1.4.8 Release Notes

• December 23, 2015: v0.2

  – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.

  – Cohort file list download is not working.

• December 3, 2015: v0.1

  – First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGCWhat should I do You will need to request a Google Cloud Platform (GCP) project Please see Your Own GCPproject for more details about requesting a project

Can I use any email address as a Google identity Yes you can If your email address is not already linkedto a Google account you can create a Google account with your current email address Please note however thatalthough these two accounts will then share the same name they will still be two separate accounts with two separate

46 Chapter 1 Contents

ISB Cancer Genomics Cloud Documentation Release 100

passwords etc (It is also possible that your institutional email address is already a Google account if your institutionuses Google Apps This is how to find out)

How do I connect my GCP project to the ISB-CGC Your GCP project gives you access to all of the technologiesthat make up the Google Cloud Platform (GCP) These technologies include BigQuery Cloud Storage ComputeEngine Google Genomics etc The ISB-CGC makes use of a variety of these technologies to provide access to theTCGA data without necessarily inserting an extra interface layer between you and the GCP Although one componentof the ISB-CGC is a web-app (running on Google App Engine) some users may prefer not to go through the web-app to access other components of the ISB-CGC For example the open-access TCGA data that we have loaded intoBigQuery tables can be accessed directly via the BigQuery web interface or from Python or R Similarly the ISB-CGCprogrammatic API is a REST service that can be used from many different programming languages

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your ownpersonal project) and the ISB-CGC is your Google identity (also referred to as your ldquouser credentialsrdquo) Access to allISB-CGC hosted data is controlled using access control lists (ACLs) which define the permissions attached to eachdataset bucket or object

152 Data Access

Does all TCGA data require dbGaP authorization prior to access No generally only the low-level sequence(DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization All of the ldquohigh-levelrdquo molecular dataas well as the clinical data are open-access and much of this has been made available in a convenient set of BigQuerytables

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables The BigQueryweb interface can be accessed at bigquerycloudgooglecom If you have not already added the ISB-CGC datasets toyour BigQuery ldquoviewrdquo click on the blue arrow next to your username in the left side-bar select ldquoSwitch to Projectrdquothen ldquoDisplay Projectrdquo and enter ldquoisb-cgcrdquo (without quotes) in the text box labeled ldquoProject IDrdquo All ISB-CGCpublic BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface Note thatin order to use BigQuery you need to be a member of a Google Cloud Project

How can I apply for access to the low-level DNA and RNA sequence data In order to access the TCGA controlled-access data you will need to apply to dbGaP

I have dbGaP authorization How do I provide this information to the ISB-CGC platform In order for us toverify your dbGaP authorization you first need to associate your Google identity (used to sign-in to the web-app) witha valid NIH login (eg your eRA Commons id) After you have signed in click on your avatar (next to your name in theupper-right corner) and you will be taken to your account details page where you can verify your dbGaP authorizationYou will be redirected to the NIH iTrust login page and after you successfully authenticate you will be brought backto the ISB-CGC web-app After you successfully authenticate we will verify that you also have dbGaP authorizationfor the TCGA controlled-access data

My professor has dbGaP authorization Do I have to have my own authorization too Yes your professor willneed to add you as a ldquodata downloaderrdquo to hisher dbGaP application so that you have your own dbGaP authorizationassociated with your own eRA Commons id (This video explains how an authorized user of controlled-access datacan assign a downloader role to someone in hisher institution)

I already authenticated using my eRA Commons id but now I want to use a different Google identity to accessthe ISB-CGC web-app Can I re-authenticate using the same eRA Commons id Yes but you will first need tosign-in using your previous Google identity and ldquounlinkrdquo your eRA Commons id from that one before you can link itwith your new Google identity An eRA Commons id cannot be associated with more than one Google identity withinthe ISB-CGC platform at any one time

Can I authenticate to NIH programmatically No the current NIH authentication flow requires web-based authen-tication and must therefore be done from within the ISB-CGC web-app Once you have authenticated to NIH via theweb-app and your dbGaP authorization has been verified the Google identity associated with your account will haveaccess to the controlled-data for 24 hours

15 Frequently Asked Questions (FAQ) 47

ISB Cancer Genomics Cloud Documentation Release 100

153 Python Users

I want to write python scripts that access the TCGA data hosted by the ISB-CGC Do you have some examplesthat can get me started Yes of course The best place to start is with our examples-Python repository on githubYou can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance ofGoogle Cloud Datalab

154 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data How can I do that You can runRStudio locally or deploy a dockerized version on a Google Compute Engine VM You can find some great examplesto get you started in our examples-R repository on github and also in the documentation from the Google Genomicsworkshop at BioConductor 2015

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

16 Support amp Other Useful Links

161 Contact Us

For general information about the ISB-CGC please contact us at infoisb-cgcorg We are especially keen on learningabout your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologiesto answer your research questions

For feature-requests or bug-reports please send e-mail to feedbackisb-cgcorg

162 Your Own GCP project

To request a Google Cloud Platform (GCP) project please send a request to request-gcpisb-cgcorg

In your request please describe your research goals in some detail including information such as the type of data thatyou plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC)the algorithms andor methods you plan to apply and an estimate of the storage and computing costs you expect toincur Please let us know if you have students or collaborators who will also be accessing the same cloud project Notethat if you are working as a team on a single project you should all use the same cloud project ndash if your group is largewe will take this into consideration when determining your funding level

If you have previous experience using the Google Cloud Platform that would be useful for us to know ndash includingwhich specific components (eg Compute Engine BigQuery Cloud Datalab etc)



Copy

A visualization can only be copied from within that visualization.

While viewing the visualization you wish to copy, select the “Make A Copy” item from the top right menu.

This will take you to your copy of the visualization.

Add Plot

To add a plot to a visualization, select the “Add a Plot” item in the top right menu.

This will append a new plot section to your visualization, with the default plot of an age histogram for the All TCGA Data cohort.

Delete Plot

To delete a plot from a visualization, select the “Trash” icon in the top right corner of the plot panel you wish to remove.

Plot Comments

To open the comments section for a particular plot, select the “Comment” icon in the top right corner of the plot panel. This will cause the comment sidebar to appear from the right side of the window.

All previous comments will be displayed, with the most recent at the bottom.

To add a new comment, type your comment in the text box at the bottom of the panel and click the “Comment” button. You should see your comment appear at the bottom of the list of comments.

Edit Plot

To change the settings of a plot, select the “Edit” icon in the top right corner of the plot panel. This will cause the Plot Settings panel to open.

Selecting new Feature(s)

When you click on the edit icon next to the feature you would like to change (i.e. “X Axis Feature”, “Y Axis Feature”), you will be taken to the feature selection panel. Here you must first specify the datatype of the feature you would like to plot. Each datatype requires a different set of parameters to narrow down the features that can be used in the plot:

• Clinical
    – Single autocomplete textbox: this input searches through the names of all the available clinical features.
• Gene Expression
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Platform Filter: filters down the plot-able features by platform.
    – Center Filter: filters down the plot-able features by processing center.


    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• miRNA
    – miRNA Name Filter: filters down the plot-able features by a specific miRNA. This is an autocomplete search field for a miRNA name.
    – Platform Filter: filters down the plot-able features by platform.
    – Value Filter: filters down the plot-able features by value.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Methylation
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – CpG Probe Filter: filters down the plot-able features by a specific CpG probe. This is an autocomplete search field for a particular probe.
    – Platform Filter: filters down the plot-able features by platform.
    – Gene Region Filter: filters down the plot-able features by specific gene regions.
    – CpG Island Region Filter: filters down the plot-able features by CpG island region.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Copy Number
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters down the plot-able features by value.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Protein
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Protein Filter: filters down the plot-able features by protein. This is an autocomplete search field for a protein name.
    – Select Feature: provides the filtered-down list of plot-able features to select from, based on the selected filters.
• Mutation
    – Gene Filter: filters down the plot-able features by a specific gene. This is an autocomplete search field for a gene name.
    – Value Filter: filters down the plot-able features by mutation value.
• Select Feature: provides the filtered list of plot-able features to select from, based on the selected filters.
• Swap Values: this button allows you to instantly swap the features on the X and Y axes without having to re-select each feature individually.
• Color By Cohort: this checkbox overrides any feature set in Color By Feature. The selected cohorts are then used as the legend and as the Color By Feature.


• Cohorts: this is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.
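The filter-then-select flow described above can be pictured with a small Python sketch. This is only an illustration of how successive filters narrow a list of plot-able features; the record fields (`label`, `datatype`, `gene`, `platform`), the function name and the toy catalog are all hypothetical and do not describe how the web-app is implemented.

```python
def select_features(features, datatype, gene=None, platform=None):
    """Return the features of one datatype, narrowed by whichever filters are set.

    `features` is a list of dicts with hypothetical keys:
    label, datatype, gene, platform.
    """
    selected = []
    for feature in features:
        if feature["datatype"] != datatype:
            continue
        # Gene names are compared case-insensitively, as an autocomplete might.
        if gene is not None and feature["gene"].upper() != gene.upper():
            continue
        if platform is not None and feature["platform"] != platform:
            continue
        selected.append(feature)
    return selected


# Toy catalog, invented for illustration only.
CATALOG = [
    {"label": "TP53 expression (platform A)", "datatype": "Gene Expression",
     "gene": "TP53", "platform": "platform A"},
    {"label": "TP53 expression (platform B)", "datatype": "Gene Expression",
     "gene": "TP53", "platform": "platform B"},
    {"label": "EGFR expression (platform A)", "datatype": "Gene Expression",
     "gene": "EGFR", "platform": "platform A"},
    {"label": "TP53 methylation (platform C)", "datatype": "Methylation",
     "gene": "TP53", "platform": "platform C"},
]

if __name__ == "__main__":
    # What a "Select Feature" list might contain after a Gene Filter is applied.
    for feature in select_features(CATALOG, "Gene Expression", gene="tp53"):
        print(feature["label"])
```

Each filter that is left unset is simply skipped, which mirrors the way the "Select Feature" list reflects only the filters that have been chosen.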

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.
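This documentation does not say which statistical test the web-app applies to a given pair of features. Purely as an illustration of what a pairwise test between two numeric features can look like, the sketch below computes a Pearson correlation and a two-sided permutation p-value using only the Python standard library. The choice of test, the function names and the example values are assumptions, not a description of the ISB-CGC implementation.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the observed |r|.

    The y values are shuffled repeatedly; the p-value is the share of
    shuffles whose |r| is at least as large as the observed one
    (with the usual +1 correction so it is never exactly zero).
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


if __name__ == "__main__":
    # Made-up values standing in for two numeric features across samples.
    x_feature = [1, 2, 3, 4, 5, 6]
    y_feature = [2, 4, 6, 8, 10, 12]
    print("r =", pearson_r(x_feature, y_feature))
    print("p =", permutation_p_value(x_feature, y_feature))
```

A pair involving a categorical feature would call for a different test, so this sketch covers only the numeric-versus-numeric case.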

Have feedback or corrections? You can file an issue here or email us at feedback@isb-cgc.org.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product of a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.
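To make the idea concrete, the data behind such a view can be thought of as one "track" per cohort, each counting mutations by position along the protein. The sketch below builds those tracks from a flat list of mutation records. The record keys (`gene`, `cohort`, `position`), the function name and the toy records are hypothetical and are not taken from the ISB-CGC data model.

```python
from collections import Counter


def mutation_tracks(mutations, gene, cohorts):
    """Count mutations per protein position, separately for each cohort.

    Only records for the single requested gene are kept, matching the
    one-gene-at-a-time, many-cohorts layout described above.
    """
    tracks = {cohort: Counter() for cohort in cohorts}
    for record in mutations:
        if record["gene"] == gene and record["cohort"] in tracks:
            tracks[record["cohort"]][record["position"]] += 1
    return tracks


# Toy records, invented for illustration only.
TOY_MUTATIONS = [
    {"gene": "GENE1", "cohort": "cohort A", "position": 12},
    {"gene": "GENE1", "cohort": "cohort A", "position": 12},
    {"gene": "GENE1", "cohort": "cohort B", "position": 40},
    {"gene": "GENE2", "cohort": "cohort A", "position": 7},
]

if __name__ == "__main__":
    tracks = mutation_tracks(TOY_MUTATIONS, "GENE1", ["cohort A", "cohort B"])
    for cohort, counts in tracks.items():
        print(cohort, sorted(counts.items()))
```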

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

The Settings panel contains the options you need to select in order to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.
• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search your list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From the User Dashboard: select the SeqPeek visualizations that you wish to delete using the checkboxes next to each visualization. When one or more are selected, the delete button will become active and you can then proceed to delete them.
• From a SeqPeek visualization: when viewing a SeqPeek visualization that you created, you may delete it using the top right menu option.


1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.


1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.


1.4.7 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions. Users may create cohorts and visualizations using the ISB-CGC web-app, and these cohorts and visualizations will, by default, be private to the user who created them. Users may, however, also share these components of their interactive analyses. There are two levels of permissions: Owner and Reader.

• Owner: as owner of a cohort or visualization, you are able to edit and share it.
• Reader: if a cohort or visualization is shared with you, you have view-only access.
    – You may be able to make configuration changes to plots in a visualization, but you will not be able to save those changes.
    – You are able to comment on a cohort or plot, and your comments will be shared with all other Readers.
    – You may make a copy (“clone”) of the cohort or visualization, after which you will be the owner of the copy and will be able to make changes to it.
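The Owner/Reader rules above can be summarised in a few lines of Python. This is a minimal sketch of the described behaviour (only the owner shares and edits, readers view, and a clone belongs to whoever made it); the class and method names are hypothetical and are not part of the web-app or its API.

```python
class SharedItem:
    """A cohort or visualization with one owner and any number of readers."""

    def __init__(self, name, owner, samples):
        self.name = name
        self.owner = owner
        self.samples = list(samples)
        self.readers = set()

    def share(self, acting_user, reader):
        # Only the owner may share the item with someone else.
        if acting_user != self.owner:
            raise PermissionError("only the owner can share this item")
        self.readers.add(reader)

    def can_view(self, user):
        return user == self.owner or user in self.readers

    def can_edit(self, user):
        # Readers have view-only access; saving changes requires ownership.
        return user == self.owner

    def clone(self, user):
        """Make an independent copy owned by `user` (who must be able to view)."""
        if not self.can_view(user):
            raise PermissionError("cannot clone an item you cannot view")
        # Only the sample list is copied; the reader list is not carried over.
        return SharedItem("Copy of " + self.name, user, self.samples)


if __name__ == "__main__":
    cohort = SharedItem("my cohort", "alice", ["sample-1", "sample-2"])
    cohort.share("alice", "bob")
    print(cohort.can_view("bob"), cohort.can_edit("bob"))
    bobs_copy = cohort.clone("bob")
    print(bobs_copy.owner, bobs_copy.can_edit("bob"))
```

Copying the sample list rather than sharing a reference means that edits to the clone never affect the original, which is the point of cloning before editing.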


1.4.8 Release Notes

• December 23, 2015: v0.2
    – Treemap graphs in the cohort details and cohort creation pages do not apply their own filters to themselves. For example, if you select a study, the study treemap graph will not update.
    – Cohort file list download is not working.
• December 3, 2015: v0.1
    – First tagged release of the web-app.


1.5 Frequently Asked Questions (FAQ)

1.5.1 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface? No, you can just “sign in” to the web-app using your Google identity. (Please be patient: we’re working on a major revision to the web-app right now and will let you know when it’s ready for you to explore.)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGC. What should I do? You will need to request a Google Cloud Platform (GCP) project. Please see Your Own GCP project for more details about requesting a project.

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
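For programmatic users, the same tables can be reached through the BigQuery REST API. The sketch below only assembles the URL and JSON body for a BigQuery `jobs.query` call; it does not send anything, because a real request also needs OAuth credentials for your Google identity. The project ID `isb-cgc` comes from the text above; the dataset and table names, the billing project name and the helper function are placeholders, so substitute the names you see in the BigQuery side-bar.

```python
import json

# BigQuery v2 "jobs.query" endpoint; the query is billed to your own project.
BIGQUERY_QUERY_URL = "https://www.googleapis.com/bigquery/v2/projects/{project}/queries"


def build_query_request(billing_project, dataset, table, limit=10):
    """Return (url, body) for a small query against a table in the isb-cgc project.

    `billing_project` is the Google Cloud Project you are a member of;
    `dataset` and `table` are placeholders for real ISB-CGC names.
    """
    sql = "SELECT * FROM `isb-cgc.{}.{}` LIMIT {}".format(dataset, table, int(limit))
    url = BIGQUERY_QUERY_URL.format(project=billing_project)
    body = {"query": sql, "useLegacySql": False}
    return url, body


if __name__ == "__main__":
    url, body = build_query_request("my-project", "example_dataset", "example_table", limit=5)
    print(url)
    print(json.dumps(body, indent=2))
```

The query runs in, and is billed to, your own project while reading tables that live in the `isb-cgc` project, which is why membership in a Google Cloud Project is required.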

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
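If you script long-running jobs, it can help to track when that 24-hour window closes. The sketch below assumes you record the time at which your authorization was verified yourself; the function names are hypothetical and the ISB-CGC platform does not expose this calculation as described here.

```python
from datetime import datetime, timedelta, timezone

# Access window stated in the FAQ answer above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at):
    """Time at which controlled-access permission lapses."""
    return verified_at + ACCESS_WINDOW


def has_controlled_access(verified_at, now):
    """True while `now` falls inside the 24-hour window after verification."""
    return verified_at <= now < access_expires_at(verified_at)


if __name__ == "__main__":
    verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)  # example timestamp
    print("expires:", access_expires_at(verified).isoformat())
    print("still valid after 23h:",
          has_controlled_access(verified, verified + timedelta(hours=23)))
    print("still valid after 25h:",
          has_controlled_access(verified, verified + timedelta(hours=25)))
```

Using timezone-aware timestamps avoids off-by-hours mistakes when the job runs on a VM whose clock is set to UTC.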


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components you have used (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
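You may also want to keep your own running tally so that an alert never comes as a surprise. The sketch below compares cumulative daily costs with the $300 initial allocation mentioned above. The 80% warning threshold, the function name and the example figures are assumptions made for illustration; they are not the thresholds the ISB-CGC team uses.

```python
INITIAL_ALLOCATION_USD = 300.0  # initial allocation stated above
ALERT_FRACTION = 0.8            # assumed warning threshold, for illustration only


def usage_status(daily_costs, allocation=INITIAL_ALLOCATION_USD,
                 alert_fraction=ALERT_FRACTION):
    """Classify cumulative spend as 'ok', 'approaching_limit' or 'over_limit'."""
    spent = sum(daily_costs)
    if spent > allocation:
        return "over_limit"
    if spent >= alert_fraction * allocation:
        return "approaching_limit"
    return "ok"


if __name__ == "__main__":
    # Made-up daily storage + compute costs in US dollars.
    print(usage_status([10.0, 20.0, 30.0]))
    print(usage_status([100.0, 100.0, 50.0]))
    print(usage_status([200.0, 150.0]))
```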


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring and sharing DNA sequence reads, reference-based alignments and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.


ndash Select Feature provides the filtered down list of plot-able features to select from based on selectedfilters

bull miRNA

ndash miRNA Name Filter filters down the plot-able features by a specific miRNA This is an autocompletesearch field for a miRNA name

ndash Platform Filter filters down the plot-able features by platforms

ndash Value Filter filters down the plot-able features by value

ndash Select Feature provides the filtered down list of plot-able features to select from based on selectedfilters

bull Methylation

ndash Gene Filter filters down the plot-able features by a specific gene This is an autocomplete search fieldfor a gene name

ndash CpG Probe Filter filters down the plot-able features by a specific CpG Probe This is an autocompletesearch field for a particular probe

ndash Platform Filter filters down the plot-able features by platforms

ndash Gene Region Filter filters down the plot-able features by specific gene regions

ndash CpG Island region Filter filters down the plot-able features by CpG Island region

ndash Select Feature provides the filtered down list of plot-able features to select from based on selectedfilters

bull Copy Number

ndash Gene Filter filters down the plot-able features by a specific gene This is an autocomplete search fieldfor a gene name

ndash Value Filter filters down the plot-able features by value

ndash Select Feature provides the filtered down list of plot-able features to select from based on selectedfilters

bull Protein

ndash Gene Filter filters down the plot-able features by a specific gene This is an autocomplete search fieldfor a gene name

ndash Protein Filter filters down the plot-able features by protein This is an autocomplete search field fora protein name

ndash Select Feature provides the filtered down list of plot-able features to select from based on selectedfilters

bull Mutation

ndash Gene Filter filters down the plot-able features by a specific gene This is an autocomplete search fieldfor a gene name

ndash Value Filter filters down the plot-able features by mutation value

bull Select Feature provides the filtered list of plot-able features to select from based on selected filters

bull Swap Values This button allows you to instantly swap the features on the X and Y Axes without having tore-select each feature individually

bull Color By Cohort This checkbox will override any feature that is in the Color By Feature It will use the cohortsprovided as the legend and Color By Feature

14 ISB-CGC Web Interface 43

ISB Cancer Genomics Cloud Documentation Release 100

bull Cohorts This is where you can select one or more cohorts to plot at one time

To add a cohort select the ldquo+ Cohortrdquo option underneath the currently selected list of cohorts This will take you tothe cohorts listing panel where you can select a cohort from the list or use the autocomplete textbox to search in theirlist of cohorts

When all the settings have been set you can click ldquoUpdate Plotrdquo to regenerate the plot with the new settings

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance and the results will be displayedbeneath the plot

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

144 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein-product from a gene Only one proteingene may be viewed at any one time but the user may select multiple cohortsin order to compare the mutation patterns observed in different cohorts

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard Select the ldquoSeqPeekrdquo option from the ldquo+Createrdquo menu This will automatically take you to a SeqPeek page with no pre-selected settings To make changesopen the Settings panel

In the Settings panel there will be options to select in order to make a new plot

bull Gene selection this is an autocompleting dropdown for valid gene symbols

bull Cohorts here you can select one or more cohorts for the visualization

To add a cohort select the ldquo+ Cohortrdquo option underneath the currently selected list of cohorts This will take you tothe cohorts listing panel where you can select a cohort from the list or use the autocomplete textbox to search in theirlist of cohorts When all the settings have been set you can click ldquoUpdate Plotrdquo to regenerate the plot with the newsettings

Saving a SeqPeek Visualization

To save changes to the visualization click ldquoSave Visualizationrdquo button This will prompt you to name your SeqPeekvisualization and save You will be notified after it saves correctly

Deleting a SeqPeek Visualization

bull From User Dashboard Select the SeqPeek visualizations that you wish to delete using the checkboxes next tothe visualization When one or more are selected the delete button will be active and you can then proceed todeleting them

bull From SeqPeek Visualization When viewing a SeqPeek visualization that you created you may delete it usingthe top right menu option

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

44 Chapter 1 Contents

ISB Cancer Genomics Cloud Documentation Release 100

145 Integrative Genomics Viewer

Important Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon Specif-ically the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as wellthe low-level sequence data via the GA4GH API

Accessing the IGV Browser

To access IGV go to the cohort file list page The file listing table includes a column labelled ldquoIGVrdquo For those filesthat also have a ReadGroupSet ID in Google Genomics a green check mark will appear in the column Clicking onthat link will take you to the IGV browser with the appropriate dataset and readgroupset preselected

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

146 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort

The person the cohort is shared with has read only access If they would like to be able to edit the cohort they wouldfirst need to make a copy of it This only copies the list of samples and participants associated to the cohort and notany additional information that may be private to the original owner of the cohort

Sharing a visualization

A visualization can only be shared by the owner of the visualization

The person the visualization is shared with has read only access They may make changes to the plot settings but theywill not be able to save those changes If they want to be able to save changes they need to clone the visualizationfirst Underlying cohorts are shared with the user when the visualization is sharedmaking changes to those cohortswill require the user to first clone the cohort as noted above

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization

The person the SeqPeek visualization is shared with has read only access They may make changes to the settings butthey will not be able to save those changes If they want to be able to save changes they need to clone the SeqPeekvisualization first Underlying cohorts are shared with the user when the visualization is sharedmaking changes tothose cohorts will require the user to first clone the cohort as noted above

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

14 ISB-CGC Web Interface 45

ISB Cancer Genomics Cloud Documentation Release 100

147 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions Users maycreate cohorts and visualizations using the ISB-CGC web-app and these cohorts and visualizations will by defaultbe private to the user who created them Users may however also share these components of their interactive analysesThere are two levels of permissions Owner and Reader

bull Owner As owner of a cohort or visualization you are able to edit and share your cohort

bull Reader If a cohort or visualization is shared with you you have view-only access

ndash You may be able to make configuration changes to plots in a visualizations but you will not be ableto save those changes

ndash You are able to comment on a cohort or plot and your comments will be shared with all other Readers

ndash You may make a copy (ldquoclonerdquo) of the cohort or visualization after which you will be the owner andyou will be able to make changes

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

148 Release Notes

bull December 23 2015 v02

ndash Treemap graphs in cohort details and cohort creation pages will not apply its own filters to itself Forexample if you select a study the study treemap graph will not update

ndash Cohort file list download not working

bull December 3 2015 v01

ndash First tagged release of the web-app

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

15 Frequently Asked Questions (FAQ)

151 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface No you can just ldquosign inrdquo tothe web-app using your Google identity (Please be patient wersquore working on a major revision to the web-app rightnow and will let you know when itrsquos ready for you to explore)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGCWhat should I do You will need to request a Google Cloud Platform (GCP) project Please see Your Own GCPproject for more details about requesting a project

Can I use any email address as a Google identity? Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC? Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access? No, generally only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables? The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
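The same tables can also be queried from Python. The sketch below assumes that the google-cloud-bigquery client library is installed and that you have a GCP project to bill the query to. The dataset and table name (`tcga_201510_alpha.Clinical_data`), the column name (`Study`), and the billing project ID are illustrative assumptions; check the left side-bar of the BigQuery web interface for the names that actually exist.

```python
def build_study_count_query(table: str, column: str = "Study") -> str:
    """Return standard-SQL text that counts rows per value of `column`.

    `table` is a fully qualified name such as "project.dataset.table".
    """
    return (
        f"SELECT {column}, COUNT(*) AS n "
        f"FROM `{table}` "
        f"GROUP BY {column} "
        f"ORDER BY n DESC"
    )


def run_query(billing_project: str, sql: str):
    """Run `sql`, billing it to `billing_project`, and return the rows."""
    # Imported here so that the query builder above can be used
    # even when the client library is not installed.
    from google.cloud import bigquery

    client = bigquery.Client(project=billing_project)
    return list(client.query(sql).result())


# Example usage (requires credentials and network access):
#
#   sql = build_study_count_query("isb-cgc.tcga_201510_alpha.Clinical_data")
#   for row in run_query("your-gcp-project-id", sql):
#       print(row["Study"], row["n"])
```

The data live in the isb-cgc project, but the query is billed to your own project, which is why BigQuery requires you to be a member of a Google Cloud Project.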

How can I apply for access to the low-level DNA and RNA sequence data? In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform? In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. We will then verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too? Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id? Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
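The one-to-one rule can be sketched as a small lookup table. This is a hypothetical illustration of the constraint described above, not the platform's implementation; `LinkRegistry` and its methods are invented for this example.

```python
class LinkRegistry:
    """Hypothetical table mapping each eRA Commons id to one Google identity."""

    def __init__(self):
        self._by_era_id = {}  # eRA Commons id -> Google identity

    def link(self, era_id: str, google_identity: str) -> None:
        current = self._by_era_id.get(era_id)
        if current is not None and current != google_identity:
            # The id is still attached to another identity: unlink it first.
            raise ValueError(
                f"{era_id} is already linked to {current}; unlink it first"
            )
        self._by_era_id[era_id] = google_identity

    def unlink(self, era_id: str, google_identity: str) -> None:
        # Unlinking must be done from the identity that holds the link.
        if self._by_era_id.get(era_id) != google_identity:
            raise ValueError(f"{era_id} is not linked to {google_identity}")
        del self._by_era_id[era_id]

    def linked_identity(self, era_id: str):
        return self._by_era_id.get(era_id)
```

Switching identities in this sketch is therefore always a two-step operation: `unlink` from the old identity, then `link` to the new one.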

Can I authenticate to NIH programmatically? No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
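If you script long-running jobs against controlled-access data, it can help to track when the 24-hour window closes. The following is a minimal sketch; the function names are hypothetical, and it assumes the window starts at the moment your dbGaP authorization is verified.

```python
from datetime import datetime, timedelta, timezone

# Length of the access window described above.
ACCESS_WINDOW = timedelta(hours=24)


def access_expires_at(verified_at: datetime) -> datetime:
    """Return the time at which controlled-data access lapses."""
    return verified_at + ACCESS_WINDOW


def has_access(verified_at: datetime, now: datetime) -> bool:
    """True while `now` falls inside the 24-hour window."""
    return verified_at <= now < access_expires_at(verified_at)


# Example: verified at 09:00 UTC on one day, access lapses at 09:00 UTC the next.
verified = datetime(2016, 3, 3, 9, 0, tzinfo=timezone.utc)
expires = access_expires_at(verified)
```

Once the window has closed, you need to sign in to the web-app and authenticate to NIH again before resubmitting jobs.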

1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started? Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that? You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at Bioconductor 2015.

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
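When preparing the cost estimate for your request, a back-of-the-envelope calculation is usually enough. The sketch below is one way to do it in Python; the function names are hypothetical and the unit prices in the example are placeholders, not actual Google Cloud Platform prices, so substitute current figures from the GCP pricing pages.

```python
# Initial allocation stated above, in US dollars.
INITIAL_ALLOCATION_USD = 300.0


def estimate_cost(storage_gb: float, months: float, vm_hours: float,
                  usd_per_gb_month: float, usd_per_vm_hour: float) -> float:
    """Estimate total cost as storage cost plus compute cost."""
    storage_cost = storage_gb * months * usd_per_gb_month
    compute_cost = vm_hours * usd_per_vm_hour
    return storage_cost + compute_cost


def fraction_of_allocation(cost_usd: float,
                           allocation_usd: float = INITIAL_ALLOCATION_USD) -> float:
    """Return the share of the allocation that `cost_usd` would consume."""
    return cost_usd / allocation_usd


# Example with placeholder prices: 1000 GB stored for 2 months,
# plus 500 VM-hours of compute.
cost = estimate_cost(storage_gb=1000, months=2, vm_hours=500,
                     usd_per_gb_month=0.02, usd_per_vm_hour=0.10)
share = fraction_of_allocation(cost)
```

A prototype run on a small subset of the data, scaled up with a calculation like this one, gives a more defensible figure when you later request additional funding.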

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For the programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

• Cohorts: This is where you can select one or more cohorts to plot at one time.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts.

When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Pairwise Statistical Test

Each pair of features selected for a plot will be tested for statistical significance, and the results will be displayed beneath the plot.

1.4.4 SeqPeek

SeqPeek is a special-purpose visualization designed to show mutations along a linear representation of the protein product from a gene. Only one protein/gene may be viewed at any one time, but the user may select multiple cohorts in order to compare the mutation patterns observed in different cohorts.

Creating a SeqPeek Visualization

You can create a new SeqPeek visualization from the User Dashboard. Select the “SeqPeek” option from the “+ Create” menu. This will automatically take you to a SeqPeek page with no pre-selected settings. To make changes, open the Settings panel.

The Settings panel contains the options needed to make a new plot:

• Gene selection: this is an autocompleting dropdown for valid gene symbols.

• Cohorts: here you can select one or more cohorts for the visualization.

To add a cohort, select the “+ Cohort” option underneath the currently selected list of cohorts. This will take you to the cohorts listing panel, where you can select a cohort from the list or use the autocomplete textbox to search the list of cohorts. When all the settings have been set, you can click “Update Plot” to regenerate the plot with the new settings.

Saving a SeqPeek Visualization

To save changes to the visualization, click the “Save Visualization” button. This will prompt you to name your SeqPeek visualization and save it. You will be notified once it has been saved successfully.

Deleting a SeqPeek Visualization

• From User Dashboard: Select the SeqPeek visualizations that you wish to delete using the checkboxes next to the visualization. When one or more are selected, the delete button will be active and you can then proceed to delete them.

• From SeqPeek Visualization: When viewing a SeqPeek visualization that you created, you may delete it using the top-right menu option.

1.4.5 Integrative Genomics Viewer

Important: Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon. Specifically, the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as well as the low-level sequence data via the GA4GH API.

Accessing the IGV Browser

To access IGV, go to the cohort file list page. The file listing table includes a column labelled “IGV”. For those files that also have a ReadGroupSet ID in Google Genomics, a green check mark will appear in the column. Clicking on that link will take you to the IGV browser with the appropriate dataset and readgroupset preselected.

1.4.6 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort.

The person the cohort is shared with has read-only access. If they would like to be able to edit the cohort, they would first need to make a copy of it. This only copies the list of samples and participants associated with the cohort, and not any additional information that may be private to the original owner of the cohort.

Sharing a visualization

A visualization can only be shared by the owner of the visualization.

The person the visualization is shared with has read-only access. They may make changes to the plot settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization.

The person the SeqPeek visualization is shared with has read-only access. They may make changes to the settings, but they will not be able to save those changes. If they want to be able to save changes, they need to clone the SeqPeek visualization first. Underlying cohorts are shared with the user when the visualization is shared; making changes to those cohorts will require the user to first clone the cohort, as noted above.

145 Integrative Genomics Viewer

Important Improved integration of the IGV browser and the ISB-CGC BigQuery data tables is coming soon Specif-ically the IGV browser will be able to access the high-level molecular data in the ISB-CGC BigQuery tables as wellthe low-level sequence data via the GA4GH API

Accessing the IGV Browser

To access IGV go to the cohort file list page The file listing table includes a column labelled ldquoIGVrdquo For those filesthat also have a ReadGroupSet ID in Google Genomics a green check mark will appear in the column Clicking onthat link will take you to the IGV browser with the appropriate dataset and readgroupset preselected

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

146 Sharing Cohorts and Visualizations

Sharing a cohort

A cohort can only be shared by the owner of the cohort

The person the cohort is shared with has read only access If they would like to be able to edit the cohort they wouldfirst need to make a copy of it This only copies the list of samples and participants associated to the cohort and notany additional information that may be private to the original owner of the cohort

Sharing a visualization

A visualization can only be shared by the owner of the visualization

The person the visualization is shared with has read only access They may make changes to the plot settings but theywill not be able to save those changes If they want to be able to save changes they need to clone the visualizationfirst Underlying cohorts are shared with the user when the visualization is sharedmaking changes to those cohortswill require the user to first clone the cohort as noted above

Sharing a SeqPeek visualization

A SeqPeek visualization can only be shared by the owner of the visualization

The person the SeqPeek visualization is shared with has read only access They may make changes to the settings butthey will not be able to save those changes If they want to be able to save changes they need to clone the SeqPeekvisualization first Underlying cohorts are shared with the user when the visualization is sharedmaking changes tothose cohorts will require the user to first clone the cohort as noted above

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

14 ISB-CGC Web Interface 45

ISB Cancer Genomics Cloud Documentation Release 100

147 General Permissions

Cohorts and visualizations have permissions that are distinct from the TCGA data access permissions Users maycreate cohorts and visualizations using the ISB-CGC web-app and these cohorts and visualizations will by defaultbe private to the user who created them Users may however also share these components of their interactive analysesThere are two levels of permissions Owner and Reader

bull Owner As owner of a cohort or visualization you are able to edit and share your cohort

bull Reader If a cohort or visualization is shared with you you have view-only access

ndash You may be able to make configuration changes to plots in a visualizations but you will not be ableto save those changes

ndash You are able to comment on a cohort or plot and your comments will be shared with all other Readers

ndash You may make a copy (ldquoclonerdquo) of the cohort or visualization after which you will be the owner andyou will be able to make changes

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

148 Release Notes

bull December 23 2015 v02

ndash Treemap graphs in cohort details and cohort creation pages will not apply its own filters to itself Forexample if you select a study the study treemap graph will not update

ndash Cohort file list download not working

bull December 3 2015 v01

ndash First tagged release of the web-app

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

15 Frequently Asked Questions (FAQ)

151 ISB-CGC Accounts and Cloud Projects

Do I have to request an ISB-CGC account before I can try out the web interface No you can just ldquosign inrdquo tothe web-app using your Google identity (Please be patient wersquore working on a major revision to the web-app rightnow and will let you know when itrsquos ready for you to explore)

I want to be able to run big jobs using Google Compute Engine on the TCGA data hosted by the ISB-CGCWhat should I do You will need to request a Google Cloud Platform (GCP) project Please see Your Own GCPproject for more details about requesting a project

Can I use any email address as a Google identity?
Yes, you can. If your email address is not already linked to a Google account, you can create a Google account with your current email address. Please note, however, that although these two accounts will then share the same name, they will still be two separate accounts with two separate passwords, etc. (It is also possible that your institutional email address is already a Google account, if your institution uses Google Apps. This is how to find out.)

How do I connect my GCP project to the ISB-CGC?
Your GCP project gives you access to all of the technologies that make up the Google Cloud Platform (GCP). These technologies include BigQuery, Cloud Storage, Compute Engine, Google Genomics, etc. The ISB-CGC makes use of a variety of these technologies to provide access to the TCGA data without necessarily inserting an extra interface layer between you and the GCP. Although one component of the ISB-CGC is a web-app (running on Google App Engine), some users may prefer not to go through the web-app to access other components of the ISB-CGC. For example, the open-access TCGA data that we have loaded into BigQuery tables can be accessed directly via the BigQuery web interface, or from Python or R. Similarly, the ISB-CGC programmatic API is a REST service that can be used from many different programming languages.

The connection between your GCP project (whether it is an ISB-CGC sponsored and funded project or your own personal project) and the ISB-CGC is your Google identity (also referred to as your “user credentials”). Access to all ISB-CGC hosted data is controlled using access control lists (ACLs), which define the permissions attached to each dataset, bucket, or object.
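To make the role of ACLs more concrete, here is a minimal Python sketch of the idea: each resource carries a set of identities that may read it, and a request succeeds only if the caller's Google identity is in that set. This is a toy model for illustration, not how Google Cloud implements ACLs; the resource names and email addresses are hypothetical.

```python
# Toy model of access control lists (ACLs); every name below is hypothetical.
ALL_USERS = "allUsers"  # stand-in marker meaning "anyone may read"

acls = {
    "open-access-dataset": {ALL_USERS},
    "controlled-access-bucket": {"authorized.user@example.org"},
}


def can_read(identity, resource):
    """Return True if the identity appears on the resource's ACL."""
    readers = acls.get(resource, set())
    return ALL_USERS in readers or identity in readers
```

In this model, an unknown resource has an empty ACL, so nobody can read it.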

1.5.2 Data Access

Does all TCGA data require dbGaP authorization prior to access?
No. Generally, only the low-level sequence (DNA and RNA) and SNP-array data (CEL files) require dbGaP authorization. All of the “high-level” molecular data, as well as the clinical data, are open-access, and much of this has been made available in a convenient set of BigQuery tables.

Where can I find the TCGA data that ISB-CGC has made publicly available in BigQuery tables?
The BigQuery web interface can be accessed at bigquery.cloud.google.com. If you have not already added the ISB-CGC datasets to your BigQuery “view”, click on the blue arrow next to your username in the left side-bar, select “Switch to Project”, then “Display Project”, and enter “isb-cgc” (without quotes) in the text box labeled “Project ID”. All ISB-CGC public BigQuery datasets and tables will now be visible in the left side-bar of the BigQuery web interface. Note that in order to use BigQuery, you need to be a member of a Google Cloud Project.
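The same tables can also be queried from Python. The sketch below builds a standard-SQL query string against the “isb-cgc” project. The dataset and table names are placeholders, not real ISB-CGC names. The optional run_query helper assumes that the google-cloud-bigquery client library is installed and that you are authenticated against your own Google Cloud Project, which is the project billed for the query.

```python
def build_row_count_query(dataset, table, project="isb-cgc"):
    """Build a standard-SQL query that counts the rows of one table."""
    return "SELECT COUNT(*) AS n_rows FROM `{}.{}.{}`".format(project, dataset, table)


def run_query(sql, billing_project):
    """Run the query; needs google-cloud-bigquery and valid credentials."""
    from google.cloud import bigquery  # imported lazily so the sketch loads without it

    client = bigquery.Client(project=billing_project)
    return [dict(row.items()) for row in client.query(sql).result()]


# Placeholder names: substitute a dataset and table visible under "isb-cgc".
sql = build_row_count_query("example_dataset", "example_table")
print(sql)
```

Keeping the query builder separate from the network call makes it easy to inspect the SQL before anything is billed to your project.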

How can I apply for access to the low-level DNA and RNA sequence data?
In order to access the TCGA controlled-access data, you will need to apply to dbGaP.

I have dbGaP authorization. How do I provide this information to the ISB-CGC platform?
In order for us to verify your dbGaP authorization, you first need to associate your Google identity (used to sign in to the web-app) with a valid NIH login (e.g. your eRA Commons id). After you have signed in, click on your avatar (next to your name in the upper-right corner) and you will be taken to your account details page, where you can verify your dbGaP authorization. You will be redirected to the NIH iTrust login page, and after you successfully authenticate you will be brought back to the ISB-CGC web-app. After you successfully authenticate, we will verify that you also have dbGaP authorization for the TCGA controlled-access data.

My professor has dbGaP authorization. Do I have to have my own authorization too?
Yes, your professor will need to add you as a “data downloader” to his/her dbGaP application so that you have your own dbGaP authorization associated with your own eRA Commons id. (This video explains how an authorized user of controlled-access data can assign a downloader role to someone in his/her institution.)

I already authenticated using my eRA Commons id, but now I want to use a different Google identity to access the ISB-CGC web-app. Can I re-authenticate using the same eRA Commons id?
Yes, but you will first need to sign in using your previous Google identity and “unlink” your eRA Commons id from that one before you can link it with your new Google identity. An eRA Commons id cannot be associated with more than one Google identity within the ISB-CGC platform at any one time.
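The unlink-then-link rule can be pictured as a registry that maps each eRA Commons id to at most one Google identity. The Python sketch below is only an illustration of that constraint. The class name, method names, and identifiers are hypothetical and do not describe the ISB-CGC implementation.

```python
class IdentityLinks:
    """Toy registry allowing one Google identity per eRA Commons id."""

    def __init__(self):
        self._google_for_era = {}  # eRA Commons id -> linked Google identity

    def link(self, era_id, google_identity):
        """Link the id, refusing if it is already linked to another identity."""
        current = self._google_for_era.get(era_id)
        if current is not None and current != google_identity:
            raise ValueError("unlink {} from {} first".format(era_id, current))
        self._google_for_era[era_id] = google_identity

    def unlink(self, era_id, google_identity):
        """Remove the link, but only when requested by the linked identity."""
        if self._google_for_era.get(era_id) == google_identity:
            del self._google_for_era[era_id]

    def linked_identity(self, era_id):
        return self._google_for_era.get(era_id)
```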

Can I authenticate to NIH programmatically?
No, the current NIH authentication flow requires web-based authentication and must therefore be done from within the ISB-CGC web-app. Once you have authenticated to NIH via the web-app and your dbGaP authorization has been verified, the Google identity associated with your account will have access to the controlled-access data for 24 hours.
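If you script against controlled-access data, it can help to track when the 24-hour window ends. The following Python sketch computes the expiry time from the moment of authentication. The function names are hypothetical, and treating the window as starting exactly at authentication and ending exactly 24 hours later is an assumption, not something the platform documents here.

```python
from datetime import datetime, timedelta, timezone

ACCESS_WINDOW = timedelta(hours=24)  # duration stated in the answer above


def access_expires_at(authenticated_at):
    """Return the moment the controlled-access window would end."""
    return authenticated_at + ACCESS_WINDOW


def has_access(authenticated_at, now):
    """Assumed rule: access holds from authentication until just before expiry."""
    return authenticated_at <= now < access_expires_at(authenticated_at)
```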


1.5.3 Python Users

I want to write Python scripts that access the TCGA data hosted by the ISB-CGC. Do you have some examples that can get me started?
Yes, of course. The best place to start is with our examples-Python repository on GitHub. You can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance of Google Cloud Datalab.

1.5.4 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data. How can I do that?
You can run RStudio locally or deploy a dockerized version on a Google Compute Engine VM. You can find some great examples to get you started in our examples-R repository on GitHub, and also in the documentation from the Google Genomics workshop at BioConductor 2015.


1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature requests or bug reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload, or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project. If your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.
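When estimating costs for your request, a rough calculation of how long an allocation lasts at a given daily spend can be useful. The Python sketch below uses the $300 initial allocation mentioned above. The 80% alert threshold and the function names are assumptions made for illustration and do not reflect an ISB-CGC policy.

```python
INITIAL_ALLOCATION_USD = 300.0  # initial allocation quoted above
ALERT_FRACTION = 0.8            # hypothetical alert threshold, not an ISB-CGC policy


def budget_status(spent_usd, allocation_usd=INITIAL_ALLOCATION_USD):
    """Classify spending relative to the allocation."""
    if spent_usd >= allocation_usd:
        return "exceeded"
    if spent_usd >= ALERT_FRACTION * allocation_usd:
        return "approaching limit"
    return "ok"


def days_remaining(spent_usd, daily_cost_usd, allocation_usd=INITIAL_ALLOCATION_USD):
    """Estimate days left at a constant daily cost; None if nothing is being spent."""
    if daily_cost_usd <= 0:
        return None
    return max(allocation_usd - spent_usd, 0.0) / daily_cost_usd
```

For example, a project that has spent $100 and costs $20 per day would have about ten days left under these assumptions.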


1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful web-based interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google’s infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcase some of the tools at your disposal.

  • Contents

ISB Cancer Genomics Cloud Documentation Release 100

153 Python Users

I want to write python scripts that access the TCGA data hosted by the ISB-CGC Do you have some examplesthat can get me started Yes of course The best place to start is with our examples-Python repository on githubYou can run any of those examples yourself by signing in to your Google Cloud Project and deploying an instance ofGoogle Cloud Datalab

154 R and Bioconductor Users

I want to use R and Bioconductor packages to work with the TCGA data How can I do that You can runRStudio locally or deploy a dockerized version on a Google Compute Engine VM You can find some great examplesto get you started in our examples-R repository on github and also in the documentation from the Google Genomicsworkshop at BioConductor 2015

Have feedback or corrections You can file an issue here or email us at feedbackisb-cgcorg

1.6 Support & Other Useful Links

1.6.1 Contact Us

For general information about the ISB-CGC, please contact us at info@isb-cgc.org. We are especially keen on learning about your particular use-cases and how we can help you take advantage of the latest in cloud-computing technologies to answer your research questions.

For feature-requests or bug-reports, please send e-mail to feedback@isb-cgc.org.

1.6.2 Your Own GCP project

To request a Google Cloud Platform (GCP) project, please send a request to request-gcp@isb-cgc.org.

In your request, please describe your research goals in some detail, including information such as the type of data that you plan to use (whether it is your own data that you plan to upload or TCGA data currently hosted by the ISB-CGC), the algorithms and/or methods you plan to apply, and an estimate of the storage and computing costs you expect to incur. Please let us know if you have students or collaborators who will also be accessing the same cloud project. Note that if you are working as a team on a single project, you should all use the same cloud project; if your group is large, we will take this into consideration when determining your funding level.

If you have previous experience using the Google Cloud Platform, that would be useful for us to know, including which specific components (e.g. Compute Engine, BigQuery, Cloud Datalab, etc.).

All reasonable requests will receive an initial allocation of $300 towards storage and compute costs. We expect that this amount of funding will be more than enough for you to become familiar with the platform. If you expect that you will need additional funding to complete your planned research, this initial amount should be used to perform prototype analyses and to better estimate your total costs. At that time, you may request additional funding.
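If it helps when preparing the cost estimate mentioned above, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the rates passed in are hypothetical placeholders, not actual Google Cloud Platform prices. Look up current pricing for the components you plan to use before relying on any number.

```python
ALLOCATION_USD = 300.0  # initial allocation stated in the text above


def estimate_monthly_cost(storage_gb, vm_hours, storage_rate, vm_rate):
    """Rough monthly cost in USD.

    storage_rate is USD per GB-month and vm_rate is USD per VM-hour.
    Both must be supplied by the caller from current pricing pages.
    """
    return storage_gb * storage_rate + vm_hours * vm_rate


def months_covered(monthly_cost, allocation=ALLOCATION_USD):
    """Number of months the allocation would last at a steady monthly cost."""
    if monthly_cost <= 0:
        raise ValueError("monthly_cost must be positive")
    return allocation / monthly_cost


if __name__ == "__main__":
    # Hypothetical usage and rates, for illustration only.
    cost = estimate_monthly_cost(
        storage_gb=1000, vm_hours=200, storage_rate=0.02, vm_rate=0.05
    )
    print("Estimated monthly cost (USD):", cost)
    print("Months covered by the allocation:", months_covered(cost))
```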

Please be aware that we will be monitoring your cloud resource usage on a daily basis and will alert you as you begin to approach your funding limit. If you exceed your allocation limit and we are not able to contact you by email for several days, we may need to take action to shut your project down, which could cause you to lose work and data.

1.6.3 Other Useful Links

The ISB-CGC platform is built on top of the Google Cloud Platform and has been designed to make the TCGA data as accessible as possible to a wide range of users. For programmatic users, this includes complete access to the tools that Google is pioneering to allow users to scale up their analyses on the Google infrastructure using a variety of means.

The ISB-CGC documentation and the example code on GitHub will continue to grow to provide starting-points and use-cases designed to suit the needs of a variety of end-users. If you have a particular use-case that has not yet been addressed, please contact us (email info@isb-cgc.org) and we will work with you to determine the best approach to run the analysis you have in mind.

Cloud Datalab is a powerful, web-based, interactive computational environment built on the familiar IPython (now known as Jupyter) environment, running on a Google VM in your own Google Cloud Project. Cloud Datalab allows you to combine SQL-like queries into the TCGA BigQuery tables with all the power of Python packages like Pandas and Matplotlib. See our examples-Python repository on GitHub.

Google Genomics provides tools for storing, processing, exploring, and sharing DNA sequence reads, reference-based alignments, and variant calls, using Google's infrastructure. An extensive Cookbook here on Read the Docs, as well as an ever-growing set of examples on GitHub, showcases some of the tools at your disposal.
