TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

TCGA

The Cancer Genome Atlas Project

January 24, 2008

Program

• Goal: find genomic alterations that cause cancer (mutations, CNA, methylation, …)
• Pilot project
  – $100M (NCI/NHGRI)
  – 3 years
  – 3 diseases
    • brain (glioblastoma multiforme)
    • lung (squamous)
    • ovarian (serous cystadenocarcinoma)

Organization

• Biospecimen Core Resource (BCR)
• Genome Sequencing Centers (GSCs) (3)
• Cancer Genome Characterization Centers (CGCCs) (7)
• Data Coordinating Center (DCC)
• Project Team (NCI/NHGRI)
• Steering Committee (NCI/NHGRI & PIs)
• External Scientific Committee
• Working Groups

PIs

Component   Institution   PI
BCR         IGC/TGEN      Robert Penny
GSC         Baylor        Richard Gibbs
GSC         Broad         Eric Lander
GSC         WashU         Rick Wilson
CGCC        Broad/DFCI    Matthew Meyerson
CGCC        Harvard/B&W   Raju Kucherlapati
CGCC        JHU           Steve Baylin
CGCC        LBL           Joe Gray
CGCC        MSKCC         Marc Ladanyi
CGCC        Stanford      Rick Myers
CGCC        UNC           Chuck Perou
DCC         SRA           Ari Kahn

URLs

• project site: http://cancergenome.nih.gov

• gforge: http://gforge.nci.nih.gov (search for TCGA)

• data: http://tcga-data.nci.nih.gov

• portal: http://tcga-portal.nci.nih.gov [coming]

Data Types

Institution   Analysis                        Platform
Broad/DFCI    Transcription and Copy Number   Affymetrix U133 Plus 2.0 & SNP Array 6.0
Harvard/B&W   Transcription and Copy Number   Agilent 244K Array
LBL           Transcription                   Affymetrix Exon 1.0 ST Array
MSKCC         Copy Number                     Agilent 244K Array
JHU           Methylation                     Illumina GoldenGate
UNC           Transcription                   Agilent 44K Array
Stanford      Copy Number                     Illumina Infinium 550K BeadChip Array
Broad         Somatic Mutations               DNA sequencing
Baylor        Somatic Mutations               DNA sequencing
WashU         Somatic Mutations               DNA sequencing

Data Levels

• raw
  – low-level data for a single sample, not normalized (e.g., trace file, .cel file)
• processed
  – single-sample, normalized & interpreted (e.g., mutation call, amplification call for a locus, .snp, .chp)
• segmented (n/a for mutation & expression)
  – single-sample, aggregation of loci into regions (e.g., amplification call for a region of a sample)
• summary finding (aka “region of interest”)
  – cross-sample findings (e.g., minimal common region of amplification across a sample set)
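
As a rough illustration of the four levels, here is a minimal Python sketch that tags a file with a data level from its extension. The names `DataLevel`, `EXTENSION_LEVELS`, and `level_for_file` are hypothetical and not part of any TCGA or DCC software. Only the extensions named on the slide are mapped, and the file names are made up.

```python
from enum import Enum
from pathlib import PurePosixPath


class DataLevel(Enum):
    """The four data levels listed on the slide (names are illustrative)."""
    RAW = "raw"
    PROCESSED = "processed"
    SEGMENTED = "segmented"
    SUMMARY = "summary finding"


# Assumption: only the extensions mentioned on the slide are mapped.
# Any other extension is treated as unknown rather than guessed.
EXTENSION_LEVELS = {
    ".cel": DataLevel.RAW,
    ".snp": DataLevel.PROCESSED,
    ".chp": DataLevel.PROCESSED,
}


def level_for_file(filename):
    """Return the DataLevel implied by the file extension, or None if unknown."""
    suffix = PurePosixPath(filename).suffix.lower()
    return EXTENSION_LEVELS.get(suffix)


# Hypothetical file names, for illustration only.
for name in ("sample01.CEL", "sample01.chp", "notes.txt"):
    print(name, level_for_file(name))
```

Segmented and summary-level data have no extension listed on the slide, so this sketch leaves them unmapped.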

Flow

[Diagram: flow of samples and data between TCGA components]

• Tissue Source (MD Anderson, Henry Ford, …)
• BCR
  1. check pathology, quality/quantity
  2. extract analytes
  3. prepare data file
• GSC, WGA, CGCC (analyte labels: DNA, mRNA; DNA)
• NCBI Trace Archive
• DCC (sample data)
• Bulk Download, caTissue Core, caArray, caIntegrator, “tracking database”

Data Formats

• BCR
  – XML (tags are CDEs)
  – images
• GSC
  – Called mutations (Genboree LFF format)
  – Linking table
    • sample-trace-target
• CGCC
  – MAGE-TAB
    • IDF: Investigation Description Format
    • SDRF: Sample and Data Relationship Format
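
Since an SDRF file is a tab-delimited table, the sketch below shows one way to read it in Python and group data files by source sample. It assumes the standard MAGE-TAB column headers `Source Name`, `Extract Name`, and `Array Data File`. The sample rows and the helper name `files_by_source` are hypothetical and do not come from TCGA data.

```python
import csv
import io

# Hypothetical SDRF content: tab-delimited, one row per sample-to-file relationship.
SDRF_TEXT = (
    "Source Name\tExtract Name\tArray Data File\n"
    "sample-A\textract-A1\tsample-A.cel\n"
    "sample-B\textract-B1\tsample-B.cel\n"
)


def files_by_source(sdrf_file):
    """Map each 'Source Name' to the list of its 'Array Data File' values."""
    reader = csv.DictReader(sdrf_file, delimiter="\t")
    mapping = {}
    for row in reader:
        mapping.setdefault(row["Source Name"], []).append(row["Array Data File"])
    return mapping


result = files_by_source(io.StringIO(SDRF_TEXT))
print(result)
```

Real SDRF files can repeat a column header (for example, several protocol columns), and `csv.DictReader` keeps only the last value for a repeated header, so a fuller reader would address columns by position.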

Where Does/Will the Data Go?

• ftp site (now with a simple web wrapper: “portal #1”)
• “tracking database”
• repositories with caBIG APIs
  – caArray
  – caTissue CORE
  – caIntegrator
  – NCIA
• NCBI trace archive
• a richer “portal #2”
  – more convenient download capability
  – filtering datasets by clinical information
  – summary level data
  – genome browser view
  – gene info page
  – visualization on pathways
  – etc.
