1 ASHG Redux 2008 Session -- Using DNA sequence to detect variation related to disease –Richard...
1
ASHG Redux 2008
• Session -- Using DNA sequence to detect variation related to disease
– Richard Wilson – WashU – deep sequencing of cancer tumors (AML) identified variations in 8 genes
– Richard Gibbs – Baylor College of Medicine – "Complete Genomics" – genome for < $5,000
• Accurate sequencing by hybridization for DNA diagnostics and individual genomics, Drmanac, et al., Nature Biotechnology
2
ASHG Redux
• Session -- Using DNA sequence to detect variation related to disease
– Michael Stratton – Wellcome Trust Sanger Institute – genomic sequencing of breast cancer cell lines
• Copy number variations ("structural variants")
• "Genomic shards" – 305 rearrangements in a breast cancer cell line
• Difficult to assemble with short-read technology
3
ASHG Redux
• Session – Genomics I
– Sharp – whole-genome screen for novel imprinted genes
• Bisulphite treatment – converts all unmethylated C's to U (uracil); after sequencing, the C's that remain identify the methylated sites
• Drawback – harsh, fragments DNA
– High-density HapMap of humans, dogs, and cattle
• Genotyped 900 dogs with the Affy 2.0 array at 61,344 SNPs
• Dogs have a very uniform phylogenetic tree with breed-specific recombination rates
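The bisulphite step mentioned earlier on this slide can be sketched in a few lines. This is only a toy model: the reference string and the set of methylated positions are made up for illustration, and it assumes complete conversion and error-free sequencing, which real experiments do not achieve.

```python
def bisulfite_convert(seq, methylated_positions):
    """Turn every unmethylated C into U; methylated C's are protected."""
    return "".join(
        "U" if base == "C" and i not in methylated_positions else base
        for i, base in enumerate(seq)
    )


def call_methylated_sites(reference, converted):
    """U is read as T by the sequencer, so a reference C that is still
    read as C must have been methylated."""
    read = converted.replace("U", "T")
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, read))
        if ref_base == "C" and read_base == "C"
    ]


# Hypothetical example data (not from the talk).
reference = "ACGTCCGATCGA"
methylated = {1, 5}

converted = bisulfite_convert(reference, methylated)
sites = call_methylated_sites(reference, converted)
print(converted)  # ACGTUCGATUGA
print(sites)      # [1, 5]
```

Comparing the read against an untreated reference is what makes the call possible; without the reference, a converted C and a genuine T look the same.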
4
ASHG Redux
• Session – Genomics I
– Biesecker – ClinSeq – effort to map phenotypic features to genotypes for atherosclerosis
• 1000 subjects
[Figure: diagram labeled Subjects, Clinical data, Genome, Desired Data; plot of Penetrance against SNP Freq (tick at 0.5) with regions Common SNPs, Rare Mendelian Variants, Unknown Territory, Common Mendelian Variants]
5
ASHG Redux
• Session – Genomics II
– BGI (Beijing Genomics Institute)
• First Asian genome sequenced
• 100 bioinformaticians (-> 300)
• 18 Solexas
• 5 454's
• 4 SOLiDs (?)
– Altshuler (1000 Genomes Project) – effort to sequence 1000 genomes to catalogue variation in the genome
• www.1000genomes.org
• Doubled the amount of sequence in GenBank in Sept.
• Again in October
• Data release – Jan 2009
6
7
Genome Sequence
Reference:
"Discovering Genomics, Proteomics, and Bioinformatics," 2nd ed. Campbell and Heyer. 2007. ISBN: 0-8053-8219-4.
Chapter 2:
8
genomics
• reduction -- for a very long time, molecular methods were primarily tools to dissect cells and understand how parts work in isolation
• expansion -- genomics, in theory, enables science to begin piecing together how parts work together as a system (systems biology?)
9
Overview
• What is Genomics?
• How to sequence a genome?
• Annotating (annotation)
• Protein function
• Gene Ontology
10
Genomics
• "involves large data sets"
– human genome -- 3 billion nucleotides
– hundreds of genomes have been finished
• "high-throughput methods"
– sequencing
– measuring the expression of all genes
– genotyping (1,000,000 SNPs on 1 chip)
• other -omes
– proteome, transcriptome, metabolome, variome?, exome
– http://cancergenome.nih.gov/media/process_textonly.asp
11
How do we sequence a genome?
• preliminary sequencing
• finishing (not always performed -- coverage)
• annotating
• The "dideoxy method"
• Need (for DNA replication):
– DNA, DNA polymerase, primers, deoxyribonucleotide triphosphates (dNTPs: G, T, A, C; one with radioactive atoms), dideoxyribonucleotide triphosphates (ddNTPs)
12
Dideoxy Method Obsolete?
• Next-generation sequencing technology– Cost per nucleotide down by factor of 100-1000
– Cost per run is still very high
– Expen$ive for validation on an individual basis
– Dideoxy method is very mature, very well understood
13
dideoxy method
• Under normal DNA polymerization, dNTPs are added to the end of the elongating strand of DNA.
• If a ddNTP is incorporated, the elongation terminates -- also carries "label" -- radioactive isotope or fluorescent dye
• This is performed in 4 different containers (test tubes), with each test tube having one of ddATP, ddGTP, ddCTP, or ddTTP.
• Therefore, every fragment in a given tube terminates with the same ddNTP
• Run these out on a gel, and smallest migrate fastest.
• Expose to x-ray film (or scan with laser), read gel
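The four-tube logic above can be illustrated with a small idealized simulation. It is a sketch under loud assumptions: the strand `GATTACA` is a made-up example, the primer is ignored, and every possible termination length is assumed to show up as a band (a real reaction terminates at random and band intensities vary).

```python
def sanger_fragments(new_strand):
    """For each ddNTP tube, list the fragment lengths that end in that base."""
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes):
    """Shortest fragments migrate fastest, so read bands in order of length;
    the lane (tube) a band sits in tells you the terminating base."""
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


# Hypothetical newly synthesized strand (not from the slides).
new_strand = "GATTACA"
tubes = sanger_fragments(new_strand)
print(tubes)            # {'A': [2, 5, 7], 'C': [6], 'G': [1], 'T': [3, 4]}
print(read_gel(tubes))  # GATTACA
```

Each tube alone only says where one base occurs; it is the combined ordering of bands across the four lanes that recovers the sequence.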
14
Figure 2.1
15
Figure 2.2
16
Comment
• Note -- this is pretty awful work
• The gel material is toxic
• Working with radioactive molecules
• Slow and tedious
• reading bands on glass
• capturing/entering data
• 500 bases took 24 hours (16,438 years to do the human genome with this method)
17
Automated sequencing
• Leroy Hood -- developed nonradioactive dideoxy method
• Each of the four ddNTPs is "labeled" with a different fluorescent dye
• 1 lane could be used instead of 4 (why?)
• A laser excites the dye so the band can be "read", indicating which ddNTP terminated the sequence
• The intensities of these bands are now captured and graphed -- in what is called a chromatogram
• Lane in a gel is replaced with a capillary
• Can run 96 or 384 capillaries at a time (Applied Biosystems)
• A run is approximately 1 hour
• 500 bases * 384 capillaries per 1-hour run ==> about 651 days (roughly 1.8 years) for the human genome
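The throughput figures on these slides are simple arithmetic, and a short sketch makes them easy to check. It assumes a 3-billion-base genome (the number used earlier in the deck), a single pass with no redundant coverage, and back-to-back runs with no downtime; the function name is made up for illustration.

```python
GENOME_SIZE = 3_000_000_000  # bases, the figure quoted in the slides


def hours_to_sequence(genome_size, bases_per_read, reads_in_parallel, hours_per_run):
    """Hours of continuous running for a single pass over the genome."""
    bases_per_run = bases_per_read * reads_in_parallel
    runs_needed = genome_size / bases_per_run
    return runs_needed * hours_per_run


# Manual gel: one 500-base read every 24 hours.
manual_years = hours_to_sequence(GENOME_SIZE, 500, 1, 24) / 24 / 365

# Capillary machine: 384 reads of 500 bases in roughly 1 hour.
capillary_days = hours_to_sequence(GENOME_SIZE, 500, 384, 1) / 24

print(round(manual_years))    # 16438
print(round(capillary_days))  # 651
```

Under these assumptions the manual method comes out at about 16,438 years, and one 384-capillary instrument at about 651 days, or a little under two years.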
18
Box 2.1 Table
19
Choosing genomes
• Big 7
– human, mouse, yeast, E. coli, fly, worm, Arabidopsis
• medical applications
– Pseudomonas aeruginosa (CF infection), mosquito, trypanosomes, HIV
• evolutionary significance
– microbes, archaea, chimp, gorilla, fugu fish
• environmental impact
– microbes
• food production
– wheat, rice, bovine, pig, yeast
20
Figure 2.3
21
Figure 2.3 (detail)
22
Automated Reads
• Automated sequencing almost requires automated base-calling
– PHRED
• reads chromatograms
• quality assessment (for re-sequencing)
• peak height and spacing
– assemble multiple reads (PHRAP) into a "contig"
– What about mutations, variations, SNPs?
• Gaps
– requires human intervention -- techniques to try and span specific DNA regions
• ex) chromosome walking
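Two ideas from this slide can be sketched briefly: the Phred quality score, defined as Q = -10 * log10(error probability), and joining overlapping reads into a contig. Neither function is how PHRED or PHRAP actually works internally; the quality cutoff of 20, the exact-match overlap rule, and the example reads are assumptions made for illustration.

```python
import math


def phred_quality(error_probability):
    """Phred quality score: Q = -10 * log10(P_error)."""
    return -10 * math.log10(error_probability)


def low_quality_positions(error_probabilities, min_quality=20):
    """Positions whose base call falls below the (assumed) quality cutoff,
    i.e. candidates for re-sequencing."""
    return [
        i
        for i, p in enumerate(error_probabilities)
        if phred_quality(p) < min_quality
    ]


def merge_reads(left, right, min_overlap=3):
    """Join two reads on the longest exact suffix/prefix overlap.
    Returns None if no overlap of at least min_overlap bases exists."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return None


# Hypothetical per-base error probabilities and reads.
flagged = low_quality_positions([0.001, 0.05, 0.0001, 0.2])
contig = merge_reads("ACGTTGCA", "TGCAGGTC")
print(flagged)  # [1, 3]
print(contig)   # ACGTTGCAGGTC
```

An exact-match merge like this breaks as soon as the two reads disagree at a single base, which is one reason the slide's question about mutations, variations, and SNPs matters for assembly.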
23
Gaps
• 2001 draft sequence published
– 147,821 gaps
– pressure to publish a sequence because of Celera and Craig Venter
• 2004
– 341 gaps
• Usually repeats (but may be epigenetic)
• Very expensive to completely finish
– many genomes never "finished"
24
Figure 2.4
25
Figure MM2.1 Show BL2SEQ example
26
Annotation
• "functionally" important sections of a genome
– exons, introns, promoters, enhancers, splice sites, UTRs
– pseudogenes, SNPs, markers, repeats, Alus, gene duplications, gene families, micro-RNAs, methylation, phosphorylation, tissue-specific alternative splicing, copy number variations (CNVs, also called "structural variations"), differential expression, gene function, ????
27
Gene Identification
• Gene prediction (ORF finding)
– was a hot topic
– cooled when it became clear that EST sequencing was far superior
– EST sequencing in human (and some model organisms -- rat, mouse, others) was very extensive -- millions of sequencing reads
– The most effective approach to gene finding was overlaying EST sequences on the genomic sequence (but note you need both).
– Gene prediction was 40-60% accurate at best
– Gene prediction has made a bit of a resurgence because of the cost savings of "in silico" gene finding
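To make "ORF finding" concrete, here is a minimal sketch that scans the three forward reading frames for an ATG followed in-frame by a stop codon. It is deliberately naive: it ignores the reverse strand and introns, and the `min_codons` cutoff and the example sequence are made up. That naivety is part of why pure prediction performed poorly on eukaryotic genomes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, sequence) for each ATG...stop open reading frame
    on the forward strand; end is exclusive and includes the stop codon."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                # Count codons before the stop, including the ATG.
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs


# Hypothetical sequence (not from the slides).
dna = "CCATGAAATGGTAACC"
orfs = find_orfs(dna)
print(orfs)  # [(2, 14, 'ATGAAATGGTAA')]
```

Overlaying EST sequences on the genome, as the slide describes, sidesteps these weaknesses because the ESTs come from transcripts that were actually expressed.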
28
Pseudogenes
• text -- mammalian genome contains approximately 225 BP per KB of pseudogenes
• What are pseudogenes?