TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Page 1: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

TCGA

The Cancer Genome Atlas Project

January 24, 2008

Page 2: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Program

• Goal: find genomic alterations that cause cancer (mutations, CNA, methylation, …)

• Pilot project
  – $100M (NCI/NHGRI)
  – 3 years
  – 3 diseases
    • brain (glioblastoma multiforme)
    • lung (squamous)
    • ovarian (serous cystadenocarcinoma)

Page 3: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Organization

• Biospecimen Core Resource (BCR)
• Genome Sequencing Centers (GSCs) (3)
• Cancer Genome Characterization Centers (CGCCs) (7)
• Data Coordinating Center (DCC)
• Project Team (NCI/NHGRI)
• Steering Committee (NCI/NHGRI & PIs)
• External Scientific Committee
• Working Groups

Page 4: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

PIs

BCR     IGC/TGEN       Robert Penny
GSC     Baylor         Richard Gibbs
        Broad          Eric Lander
        WashU          Rick Wilson
CGCC    Broad/DFCI     Matthew Meyerson
        Harvard/B&W    Raju Kucherlapati
        JHU            Steve Baylin
        LBL            Joe Gray
        MSKCC          Marc Ladanyi
        Stanford       Rick Myers
        UNC            Chuck Perou
DCC     SRA            Ari Kahn

Page 5: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

URLs

• project site: http://cancergenome.nih.gov

• gforge: http://gforge.nci.nih.gov (search for TCGA)

• data: http://tcga-data.nci.nih.gov

• portal: http://tcga-portal.nci.nih.gov [coming]

Page 6: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Data Types

Institution    Analysis                         Platform
Broad/DFCI     Transcription and Copy Number    Affymetrix U133 Plus 2.0 & SNP Array 6.0
Harvard/B&W    Transcription and Copy Number    Agilent 244K Array
LBL            Transcription                    Affymetrix Exon 1.0 ST Array
MSKCC          Copy Number                      Agilent 244K Array
JHU            Methylation                      Illumina GoldenGate
UNC            Transcription                    Agilent 44K Array
Stanford       Copy Number                      Illumina Infinium 550K BeadChip Array
Broad          Somatic Mutations                DNA sequencing
Baylor         Somatic Mutations                DNA sequencing
WashU          Somatic Mutations                DNA sequencing
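
The institution, analysis, and platform assignments above can be held in a small lookup table. The following is only an illustrative sketch: the names `DATA_TYPES` and `centers_for` are hypothetical and not part of any TCGA software, and rows listed as "Transcription and Copy Number" are split into two analyses as an assumption about how one might query them.

```python
# Rows copied from the Data Types slide: (institution, analyses, platform).
# Splitting "Transcription and Copy Number" into two entries is an assumption.
DATA_TYPES = [
    ("Broad/DFCI", ("Transcription", "Copy Number"),
     "Affymetrix U133 Plus 2.0 & SNP Array 6.0"),
    ("Harvard/B&W", ("Transcription", "Copy Number"), "Agilent 244K Array"),
    ("LBL", ("Transcription",), "Affymetrix Exon 1.0 ST Array"),
    ("MSKCC", ("Copy Number",), "Agilent 244K Array"),
    ("JHU", ("Methylation",), "Illumina GoldenGate"),
    ("UNC", ("Transcription",), "Agilent 44K Array"),
    ("Stanford", ("Copy Number",), "Illumina Infinium 550K BeadChip Array"),
    ("Broad", ("Somatic Mutations",), "DNA sequencing"),
    ("Baylor", ("Somatic Mutations",), "DNA sequencing"),
    ("WashU", ("Somatic Mutations",), "DNA sequencing"),
]


def centers_for(analysis):
    """Return the institutions that list the given analysis, in slide order."""
    return [inst for inst, analyses, _platform in DATA_TYPES
            if analysis in analyses]
```

For example, `centers_for("Methylation")` returns only JHU, while `centers_for("Copy Number")` returns the four centers that list copy number.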

Page 7: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Data Levels

• raw
  – low-level data for a single sample, not normalized (e.g., trace file, .cel file)
• processed
  – single-sample, normalized & interpreted (e.g., mutation call, amplification call for a locus, .snp, .chp)
• segmented (n/a for mutation & expression)
  – single-sample, aggregation of loci into regions (e.g., amplification call for a region of a sample)
• summary finding (aka “region of interest”)
  – cross-sample findings (e.g., minimal common region of amplification across a sample set)
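
One way to make the four data levels concrete is an enumeration plus a lookup from file extension to level. This is a minimal sketch under stated assumptions: the names `DataLevel`, `EXTENSION_LEVEL`, `level_for_file`, and `is_single_sample` are hypothetical, the numeric values are an arbitrary ordering rather than anything the slide defines, and only the extensions named on the slide (.cel, .snp, .chp) are mapped.

```python
from enum import Enum
from pathlib import PurePath
from typing import Optional


class DataLevel(Enum):
    # Numeric values are an arbitrary ordering chosen for this sketch.
    RAW = 1              # single sample, not normalized
    PROCESSED = 2        # single sample, normalized & interpreted
    SEGMENTED = 3        # single sample, loci aggregated into regions
    SUMMARY_FINDING = 4  # cross-sample "region of interest"


# Only the file extensions given as examples on the slide.
EXTENSION_LEVEL = {
    ".cel": DataLevel.RAW,
    ".snp": DataLevel.PROCESSED,
    ".chp": DataLevel.PROCESSED,
}


def level_for_file(name: str) -> Optional[DataLevel]:
    """Guess a data level from a file extension; None if not recognized."""
    return EXTENSION_LEVEL.get(PurePath(name).suffix.lower())


def is_single_sample(level: DataLevel) -> bool:
    """Every level except the summary finding describes one sample."""
    return level is not DataLevel.SUMMARY_FINDING
```

Anything not recognized yields `None` rather than a guess, since the slide gives examples and not a complete list of file types.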

Page 8: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Flow

[Flow diagram; only the box and edge labels survive]

• Tissue Source (MD Anderson, Henry Ford, …)
• BCR
  1. check pathology, quality/quantity
  2. extract analytes
  3. prepare data file
• GSC
• WGA
• CGCC
• NCBI Trace Archive
• DCC
• Bulk Download
• caTissue Core
• caArray
• caIntegrator
• “tracking database”
• Edge labels: DNA, mRNA; DNA; sample data

Page 9: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Data Formats

• BCR
  – XML (tags are CDEs)
  – images
• GSC
  – Called mutations (Genboree LFF format)
  – Linking table
    • sample-trace-target
• CGCC
  – MAGE-TAB
    • IDF: Investigation Description Format
    • SDRF: Sample and Data Relationship Format
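
Because a MAGE-TAB SDRF is a tab-delimited table with a header row, it can be read with a generic delimited-text reader. The sketch below assumes nothing about the actual CGCC submissions: `read_sdrf`, `column`, and the `EXAMPLE` rows are hypothetical, and the sample values are invented for illustration. Rows are kept as lists rather than dictionaries because SDRF files may repeat a column header.

```python
import csv
import io


def read_sdrf(text):
    """Split SDRF text into (header, rows); blank lines are skipped."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    return header, rows


def column(header, rows, name):
    """Values of the first column whose header equals `name`."""
    idx = header.index(name)
    return [row[idx] for row in rows]


# Invented example content; not taken from any TCGA submission.
EXAMPLE = (
    "Source Name\tExtract Name\tArray Data File\n"
    "sample-1\textract-1\tsample-1.cel\n"
    "sample-2\textract-2\tsample-2.cel\n"
)

header, rows = read_sdrf(EXAMPLE)
data_files = column(header, rows, "Array Data File")
```

Here `data_files` holds the data file named on each row, which is the sample-to-file relationship the SDRF is meant to record.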

Page 10: TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)

Where Does/Will the Data Go?

• ftp site (now with a simple web wrapper: “portal #1”)
• “tracking database”
• repositories with caBIG APIs
  – caArray
  – caTissue CORE
  – caIntegrator
  – NCIA
• NCBI trace archive
• a richer “portal #2”
  – more convenient download capability
  – filtering datasets by clinical information
  – summary level data
  – genome browser view
  – gene info page
  – visualization on pathways
  – etc.
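
As a rough illustration of what "filtering datasets by clinical information" could look like, the sketch below selects dataset records whose fields match given values. It is not a description of the planned portal: the function `filter_datasets`, the record fields (`id`, `disease`, `center`), and the `DATASETS` entries are all hypothetical.

```python
def filter_datasets(datasets, **criteria):
    """Keep the records whose fields equal every given criterion."""
    return [
        record for record in datasets
        if all(record.get(field) == value for field, value in criteria.items())
    ]


# Invented records; identifiers and field names are placeholders.
DATASETS = [
    {"id": "ds-1", "disease": "glioblastoma multiforme", "center": "Broad/DFCI"},
    {"id": "ds-2", "disease": "ovarian serous cystadenocarcinoma", "center": "JHU"},
    {"id": "ds-3", "disease": "glioblastoma multiforme", "center": "UNC"},
]

gbm = filter_datasets(DATASETS, disease="glioblastoma multiforme")
```

Passing several criteria narrows the result further, and passing none returns every record.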