TCGA data coordination center: Carl Schaefer and Ari Kahn (NCICB)
TCGA
The Cancer Genome Atlas Project
January 24, 2008
Program
• Goal: find genomic alterations that cause cancer (mutations, CNA, methylation, …)
• Pilot project
– $100M (NCI/NHGRI)
– 3 years
– 3 diseases
  • brain (glioblastoma multiforme)
  • lung (squamous)
  • ovarian (serous cystadenocarcinoma)
Organization
• Biospecimen Core Resource (BCR)
• Genome Sequencing Centers (GSCs) (3)
• Cancer Genome Characterization Centers (CGCCs) (7)
• Data Coordinating Center (DCC)
• Project Team (NCI/NHGRI)
• Steering Committee (NCI/NHGRI & PIs)
• External Scientific Committee
• Working Groups
PIs
Component   Institution   PI
BCR         IGC/TGEN      Robert Penny
GSC         Baylor        Richard Gibbs
GSC         Broad         Eric Lander
GSC         WashU         Rick Wilson
CGCC        Broad/DFCI    Matthew Meyerson
CGCC        Harvard/B&W   Raju Kucherlapati
CGCC        JHU           Steve Baylin
CGCC        LBL           Joe Gray
CGCC        MSKCC         Marc Ladanyi
CGCC        Stanford      Rick Myers
CGCC        UNC           Chuck Perou
DCC         SRA           Ari Kahn
URLs
• project site: http://cancergenome.nih.gov
• gforge: http://gforge.nci.nih.gov (search for TCGA)
• data: http://tcga-data.nci.nih.gov
• portal: http://tcga-portal.nci.nih.gov [coming]
Data Types
Institution   Analysis                        Platform
Broad/DFCI    Transcription and Copy Number   Affymetrix U133 Plus 2.0 & SNP Array 6.0
Harvard/B&W   Transcription and Copy Number   Agilent 244K Array
LBL           Transcription                   Affymetrix Exon 1.0 ST Array
MSKCC         Copy Number                     Agilent 244K Array
JHU           Methylation                     Illumina GoldenGate
UNC           Transcription                   Agilent 44K Array
Stanford      Copy Number                     Illumina Infinium 550K BeadChip Array
Broad         Somatic Mutations               DNA sequencing
Baylor        Somatic Mutations               DNA sequencing
WashU         Somatic Mutations               DNA sequencing
Data Levels
• raw
– low-level data for a single sample, not normalized (e.g., trace file, .cel file)
• processed
– single-sample, normalized & interpreted (e.g., mutation call, amplification call for a locus, .snp, .chp)
• segmented (n/a for mutation & expression)
– single-sample, aggregation of loci into regions (e.g., amplification call for a region of a sample)
• summary finding (aka “region of interest”)
– cross-sample findings (e.g., minimal common region of amplification across a sample set)
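As a minimal sketch, the four data levels can be modeled as a small enumeration. The names `DataLevel`, `levels_for`, and `is_cross_sample`, and the data-type labels "mutation" and "expression", are illustrative assumptions and not identifiers used by the DCC.

```python
from enum import Enum
from typing import List


class DataLevel(Enum):
    """The four data levels named on the slide (values are the slide's labels)."""
    RAW = "raw"                  # single sample, not normalized (e.g. trace file, .cel file)
    PROCESSED = "processed"      # single sample, normalized and interpreted
    SEGMENTED = "segmented"      # single sample, loci aggregated into regions
    SUMMARY = "summary finding"  # cross-sample findings ("region of interest")


# Per the slide, the segmented level is n/a for mutation and expression data.
# These labels are hypothetical strings chosen for this sketch.
NO_SEGMENTED_LEVEL = {"mutation", "expression"}


def levels_for(data_type: str) -> List[DataLevel]:
    """Return the levels that apply to a data type, in slide order."""
    return [
        level
        for level in DataLevel
        if not (level is DataLevel.SEGMENTED and data_type in NO_SEGMENTED_LEVEL)
    ]


def is_cross_sample(level: DataLevel) -> bool:
    """Only summary findings combine more than one sample."""
    return level is DataLevel.SUMMARY
```

For example, `levels_for("mutation")` omits `DataLevel.SEGMENTED`, while `levels_for("copy number")` returns all four levels.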
Flow
[Figure: flow diagram; only the box and arrow labels survive]
• Tissue Source (MD Anderson, Henry Ford, …)
• BCR
– 1. check pathology, quality/quantity
– 2. extract analytes
– 3. prepare data file
• GSC
• WGA
• CGCC
• analyte labels: “DNA, mRNA” and “DNA”
• NCBI Trace Archive
• DCC
• sample data
• Bulk Download
• caTissue Core
• caArray
• caIntegrator
• “tracking database”
Data Formats
• BCR
– XML (tags are CDEs)
– images
• GSC
– Called mutations (Genboree LFF format)
– Linking table
  • sample-trace-target
• CGCC
– MAGE-TAB
  • IDF: Investigation Description Format
  • SDRF: Sample and Data Relationship Format
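An SDRF file is a tab-delimited table with one row per sample-to-data path, so it can be read with a standard CSV parser. The sketch below assumes a made-up fragment: the column headers are standard MAGE-TAB names, but the sample names, file names, and the helper `files_by_source` are hypothetical and do not come from TCGA.

```python
import csv
import io

# Made-up SDRF fragment for illustration only; real TCGA files have more
# columns and different identifiers.
SDRF_TEXT = (
    "Source Name\tExtract Name\tHybridization Name\tArray Data File\n"
    "sample-A\tsample-A-dna\thyb-001\thyb-001.cel\n"
    "sample-B\tsample-B-dna\thyb-002\thyb-002.cel\n"
)


def files_by_source(sdrf_text):
    """Map each 'Source Name' to the list of its 'Array Data File' entries."""
    reader = csv.DictReader(io.StringIO(sdrf_text), delimiter="\t")
    mapping = {}
    for row in reader:
        mapping.setdefault(row["Source Name"], []).append(row["Array Data File"])
    return mapping


if __name__ == "__main__":
    for source, files in files_by_source(SDRF_TEXT).items():
        print(source, files)
```

Reading rows by header name rather than by position keeps the sketch tolerant of column reordering, which matters because SDRF columns vary between submissions.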
Where Does/Will the Data Go?
• ftp site (now with a simple web wrapper: “portal #1”)
• “tracking database”
• repositories with caBIG APIs
– caArray
– caTissue CORE
– caIntegrator
– NCIA
• NCBI trace archive
• a richer “portal #2”
– more convenient download capability
– filtering datasets by clinical information
– summary level data
– genome browser view
– gene info page
– visualization on pathways
– etc.