1 ASHG Redux 2008 Session -- Using DNA sequence to detect variation related to disease –Richard...

1

ASHG Redux 2008

• Session -- Using DNA sequence to detect variation related to disease

– Richard Wilson – WashU – deep sequencing of cancer tumors (AML) identified variations in 8 genes

– Richard Gibbs – Baylor College of Medicine – "Complete Genomics" – genome for < $5,000

• Accurate sequencing by hybridization for DNA diagnostics and individual genomics, Drmanac, et al., Nature Biotechnology

2

ASHG Redux

• Session -- Using DNA sequence to detect variation related to disease

– Michael Stratton – Wellcome Trust Sanger Institute – genomic sequencing of breast cancer cell lines

• Copy number variations ("structural variants")

• "genomic shards" – 305 rearrangements in breast cancer cell line

• Difficult to assemble with short-read technology

3

ASHG Redux

• Session – Genomics I

– Sharp – whole genome screen for novel imprinting genes

• Bisulphite treatment – converts all unmethylated C's to U (uracil); sequencing then identifies the methylated C sites, which remain C

• Drawback – harsh, fragments DNA

– High density HapMap of Humans, Dogs, and Cattle

• Genotyped 900 dogs w/ Affy 2.0 array at 61,344 SNPs

• Dogs have a very uniform phylogenetic tree with breed-specific recombination rates
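The bisulphite step above can be sketched in a few lines of Python. This is a toy model, not the workflow used in the talk: it assumes a single forward strand, perfect conversion, and that U is read as T after amplification. The function names and the example sequence are hypothetical.

```python
def bisulfite_convert(seq, methylated):
    """Simulate the read obtained after bisulphite treatment.

    Unmethylated C becomes U, which is read as T when sequenced;
    methylated C (positions listed in `methylated`) is protected and stays C.
    """
    return "".join(
        "T" if base == "C" and i not in methylated else base
        for i, base in enumerate(seq)
    )


def call_methylated(reference, converted):
    """Return positions where the reference has C and the converted read still has C."""
    return [
        i
        for i, (ref_base, read_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGATCGA"   # hypothetical reference sequence
true_methylated = {1, 5}     # assumed methylated C positions (0-based)

read = bisulfite_convert(reference, true_methylated)
called = call_methylated(reference, read)
print(read, called)
```

Comparing the converted read against the untreated reference is what makes the methylated sites stand out: every C that did not turn into T was protected by methylation.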

4

ASHG Redux

• Session – Genomics I

– Biesecker – ClinSeq – effort to map phenotypic features to genotypes for atherosclerosis

• 1000 subjects

[Figure: "Desired Data" diagram with labels Subjects, Clinical data, Genome]

[Figure: Penetrance vs. SNP frequency (axis mark 0.5); regions labeled Common SNPs, Rare Mendelian Variants, Common Mendelian Variants, Unknown Territory]

5

ASHG Redux

• Session – Genomics II

– BGI (Beijing Genomics Institute)

• First Asian genome sequenced

• 100 bioinformaticians (-> 300)

• 18 Solexas

• 5 454's

• 4 SOLiDs (?)

– Altshuler (1000 Genomes Project) – effort to sequence 1000 genomes to catalogue variation in the human genome

• www.1000genomes.org

• Doubled the amount of sequence in GenBank in Sept.

• Again in October

• Data release – Jan 2009


7

Genome Sequence

Reference:

"Discovering Genomics, Proteomics, and Bioinformatics." 2nd ed. Campbell and Heyer. 2007. ISBN: 0-8053-8219-4.

Chapter 2:

8

genomics

• reduction -- for a very long time molecular methods were primarily tools to dissect cells and understand how parts work in isolation

• expansion -- genomics, in theory, enables science to begin piecing together how parts work together as a system (systems biology?)

9

Overview

• What is Genomics?

• How to sequence a genome?

• Annotating (annotation)

• Protein function

• Gene Ontology

10

Genomics

• "involves large data sets"

– human genome -- 3 billion nucleotides

– hundreds of genomes have been finished

• "high-throughput methods"

– sequencing

– measuring the expression of all genes

– genotyping (1,000,000 SNPs on 1 chip)

• other -omes

– proteome, transcriptome, metabolome, variome?, exome

– http://cancergenome.nih.gov/media/process_textonly.asp

11

How do we sequence a genome?

• preliminary sequencing

• finishing (not always performed -- coverage)

• annotating

• The "dideoxy method"

• Need (for DNA replication):

– DNA, DNA polymerase, primers, deoxyribonucleotide triphosphates (dNTPs: G, T, A, C; one with radioactive atoms), dideoxyribonucleotide triphosphates (ddNTPs)

12

Dideoxy Method Obsolete?

• Next-generation sequencing technology

– Cost per nucleotide down by factor of 100-1000

– Cost per run is still very high

– Expen$ive for validation on an individual basis

– Dideoxy method is very mature, very well understood

13

dideoxy method

• Under normal DNA polymerization, dNTPs are added to the end of the elongating strand of DNA.

• If a ddNTP is incorporated, the elongation terminates -- the terminated fragment also carries a "label" -- radioactive isotope or fluorescent dye

• This is performed in 4 different containers (test tubes), each containing one of ddATP, ddGTP, ddCTP, or ddTTP.

• Therefore, every fragment in a given tube terminates with the same ddNTP

• Run these out on a gel; the smallest fragments migrate fastest.

• Expose to x-ray film (or scan with laser), read gel
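The logic of reading a four-tube dideoxy gel can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions: the input is the newly synthesized strand written 5' to 3', termination happens at every possible position in the matching tube, and the gel resolves fragments to single-base accuracy. The function names `run_tubes` and `read_gel` are hypothetical.

```python
def run_tubes(new_strand):
    """Return, per ddNTP tube, the fragment lengths produced.

    In the tube containing ddA, elongation can stop wherever the new
    strand has an A, so that tube holds fragments ending at each A.
    The same holds for the C, G, and T tubes.
    """
    tubes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        tubes[base].append(length)
    return tubes


def read_gel(tubes):
    """Read the gel from the shortest (fastest-migrating) band to the longest.

    Each band's lane tells us which ddNTP ended that fragment, so listing
    lanes in order of fragment length spells out the sequence.
    """
    bands = sorted(
        (length, base) for base, lengths in tubes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


new_strand = "GATTACA"  # hypothetical example sequence
tubes = run_tubes(new_strand)
print(tubes)
print(read_gel(tubes))
```

Each fragment length appears in exactly one lane, which is why sorting the bands by size across all four lanes recovers the sequence.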

14

Figure 2.1

15

Figure 2.2

16

Comment

• Note -- this is pretty awful work

• The gel material is toxic

• Working with radioactive molecules

• Slow and tedious

• reading bands on glass

• capturing/entering data

• 500 bases took 24 hours (16,438 years to do the human genome with this method)

17

Automated sequencing

• Leroy Hood -- developed nonradioactive dideoxy method

• Each of the four ddNTPs is "labeled" with a different fluorescent dye

• 1 lane could be used instead of 4 (why?)

• A laser excites the dye so the band can be "read", indicating which ddNTP terminated the sequence

• The intensities of these bands are now captured and graphed -- in what is called a chromatogram

• Lane in a gel is replaced with a capillary

• Can run 96 or 384 capillaries at a time (Applied Biosystems)

• A run is approximately 1 hour

• 500 bases * 384 capillaries per 1-hour run ==> about 651 days (roughly 1.8 years) for the human genome
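The throughput estimates on the manual and automated slides follow from simple arithmetic, sketched below. The assumptions are mine: a 3-billion-base genome read exactly once (1x coverage, no overlap between reads), and a single instrument running around the clock. Under those assumptions the manual method takes about 16,438 years, and one 384-capillary machine takes about 651 days.

```python
GENOME_BASES = 3_000_000_000  # human genome size used on the slides


def days_to_sequence(bases_per_run, hours_per_run, lanes=1):
    """Days needed to read the genome once at the given throughput."""
    bases_per_hour = bases_per_run * lanes / hours_per_run
    return GENOME_BASES / bases_per_hour / 24


# Manual gel: 500 bases per 24-hour run, one read at a time.
manual_days = days_to_sequence(500, 24)
manual_years = manual_days / 365

# Capillary instrument: 500 bases per capillary, 384 capillaries, 1-hour run.
capillary_days = days_to_sequence(500, 1, lanes=384)
capillary_years = capillary_days / 365

print(f"manual: {manual_years:,.0f} years")
print(f"384 capillaries: {capillary_days:,.0f} days ({capillary_years:.1f} years)")
```

Real projects need several-fold coverage and many overlapping reads, so these figures are lower bounds for one machine, not project timelines.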

18

Box 2.1 Table

19

Choosing genomes

• Big 7

– human, mouse, yeast, E. coli, fly, worm, Arabidopsis

• medical applications

– Pseudomonas aeruginosa (CF infection), mosquito, trypanosomes, HIV

• evolutionary significance

– microbes, archaea, chimp, gorilla, fugu fish

• environmental impact

– microbes

• food production

– wheat, rice, bovine, pig, yeast

20

Figure 2.3

21

Figure 2.3 (detail)

22

Automated Reads

• Automated sequencing almost requires automated base-calling

– PHRED

• reads chromatograms

• quality assessment (for re-sequencing)

• peak height and spacing

– assemble multiple reads (PHRAP) into a "contig"

– What about mutations, variations, SNPs?

• Gaps

– requires human intervention -- techniques to try and span specific DNA regions

• ex) chromosome walking
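The quality assessment mentioned above is usually expressed as a Phred score, Q = -10 * log10(P), where P is the estimated probability that a base call is wrong. The sketch below only converts between Q and P and flags weak calls. It does not reproduce how PHRED derives P from peak height and spacing, and the threshold of 20 (a 1-in-100 error rate) is an illustrative assumption, as are the function names.

```python
import math


def phred_quality(error_probability):
    """Phred score for a base call with the given error probability."""
    return -10 * math.log10(error_probability)


def error_probability(quality):
    """Inverse of phred_quality: error probability for a Phred score."""
    return 10 ** (-quality / 10)


def low_quality_positions(qualities, threshold=20):
    """Indices of base calls below the threshold (candidates for re-sequencing)."""
    return [i for i, q in enumerate(qualities) if q < threshold]


read_qualities = [40, 12, 35, 8]  # hypothetical per-base scores for one read
print(phred_quality(0.01))
print(error_probability(30))
print(low_quality_positions(read_qualities))
```

Because the scale is logarithmic, each 10-point increase in Q means a tenfold drop in the chance that the call is wrong.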

23

Gaps

• 2001 draft sequence published

– 147,821 gaps

– pressure to publish a sequence because of Celera and Craig Venter

• 2004

– 341 gaps

• Usually repeats (but may be epigenetic)

• Very expensive to completely finish

– many genomes never "finished"

24

Figure 2.4

25

Figure MM2.1 Show BL2SEQ example

26

Annotation

• "functionally" important sections of a genome

– exons, introns, promoters, enhancers, splice sites, UTRs

– pseudogenes, SNPs, markers, repeats, Alus, gene duplications, gene families, micro-RNAs, methylation, phosphorylation, tissue-specific alternative splicing, copy number variations (CNVs, also called "structural variations"), differential expression, gene function, ????

27

Gene Identification

• Gene prediction (ORF finding)

– was a hot topic

– cooled when it became clear that EST sequencing was far superior

– EST sequencing in human (and some model organisms -- rat, mouse, others) was very extensive -- millions of sequencing reads

– The most effective approach to gene finding was the overlaying of EST sequences onto genomic sequence (but note you need both).

– Gene prediction was 40-60% accurate at best

– Gene prediction has made a bit of a resurgence because of the cost savings of "in silico" gene finding
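ORF finding, the simplest form of in silico gene prediction named above, can be sketched as a scan for an ATG followed in-frame by a stop codon. This is a deliberately naive illustration, not any published gene finder: it checks only the three forward reading frames, ignores introns and the reverse strand, and uses a hypothetical `min_codons` cutoff.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each forward-strand ORF.

    An ORF here runs from the first ATG in a frame to the next in-frame
    stop codon. `end` is exclusive and includes the stop codon, and
    `min_codons` counts the stop codon as well.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


sequence = "CCATGAAATTTTAGGG"  # hypothetical example
orfs = find_orfs(sequence)
for start, end, frame in orfs:
    print(frame, sequence[start:end])
```

A scan like this produces many false positives on real genomic sequence and misses genes with introns, which is consistent with the slide's point that aligning ESTs to the genome worked better.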

28

Pseudogenes

• text -- the mammalian genome contains approximately 225 bp of pseudogenes per kb

• What are pseudogenes?
