Genomics: Looking at Life in New Ways
Mark D. Adams
Department of Genetics
Center for Computational Genomics
Center for Human Genetics
• Genome publications Feb/2001
• ~30,000 genes, 3 million SNPs
Computing the Genome - Assembly
Screener: Mask heterochromatin and ribo-DNA. Tag known interspersed repeats.
Overlapper: Find all overlaps 40bp allowing 6% mismatch. (1000X Blast)
Unitiger, Scaffolder, Repeat Rez I, II, III (ASSEMBLER CORE):
• Compute all consistent sub-assemblies = unitigs
• Identify those that cover unique DNA = U-unitigs
• Scaffold U-unitigs with confirmed shorts & longs
• Then with BAC ends
• Fill repeat gaps with:
– I. Doubly anchored mates
– II. O-path confirmed singly-anchored mates
– III. Greedy path completion using QVs
Consensus: Bayesian “SNP” consensus using quality values. Occurs throughout assembler core. (~25)
[Figure labels: 8:37; 86:25; 38:29; 4:12; 5:44+4:21+19:53]
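The overlap criterion on this slide (40bp, 6% mismatch) can be illustrated with a deliberately naive sketch. This is not the assembler's overlapper: it treats 40bp as a minimum length (an assumption), compares only substitutions (no indels), and tries every candidate length instead of seeding. The function names and defaults are hypothetical.

```python
def count_mismatches(x, y):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for a, b in zip(x, y) if a != b)


def best_overlap(read_a, read_b, min_len=40, max_mismatch=0.06):
    """Longest suffix of read_a that matches a prefix of read_b.

    An overlap is accepted when its mismatch count is at most
    max_mismatch * length. Returns 0 if no overlap of at least
    min_len bases qualifies. Substitutions only; indels are ignored.
    """
    longest = min(len(read_a), len(read_b))
    for length in range(longest, min_len - 1, -1):
        suffix, prefix = read_a[-length:], read_b[:length]
        if count_mismatches(suffix, prefix) <= max_mismatch * length:
            return length
    return 0


# Toy reads sharing a 40-base core (hypothetical data).
core = "G" * 20 + "T" * 20
print(best_overlap("A" * 30 + core, core + "C" * 30))
```

A production overlapper would index k-mers to find candidate pairs first; the quadratic scan here is only meant to make the acceptance rule concrete.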
Computing the Genome - Analysis
• Gene Prediction
• Repeat Elements
• Large-scale structure
– Did a genome-wide duplication occur in the evolution of the human genome?
After the genome….
• Define ‘finished’…
– Challenging regions
– Centromeres
– Annotation of genes (protein-coding and non-protein-coding)
– Annotation of non-genic functional elements
• More Genomes!
– Identification of functionally important regions through analysis of conservation through evolution
• ‘Comprehensive’ parts list
• High-throughput mentality
Protein Structure Prediction/Comparison
[Figure labels: 1B0B, 1O1O]
F28C12.5 ------------MSQLTAEELDSQKCASEGLT-SVLTSITMKFNFLFITTVILLSYC-FT
F28C12.7 ------------MNKTAEDLLDSLKCASKDLS-SALTSVTIKFNCIFISTIVLISYC-FI
T06G6.1 ------------MNKTAEELLDSLKCASDGLA-SALTSVTLKFNCAFISTIVLISYC-FS
F28C12.2 ------------MNKTAEELD-SRNCASESLT-NALISITMKFNFIFIITVVLISYC-FT
F28C12.3 ------------MNKTAEELLDSRKCASEGLT-NALTSFMMKMNFSFIVT----------
F28C12.4 ------------MNKTAEELVESLRCASEGLT-NALTSITVKVSFVFLATVILLSYY-FA
T06G6.2 ------------MNKTAEEIVESRRCASEGLT-NALTSITVKMSSVLVVTVILLSYY-FA
F28C12.1 --------------MNQTELLESLKCASEGMV-KAMTSTTMKLNFVFIATVIFLSFY-FA
T26E3.9 ----------------MNELIDGPKCASEGIV-NAMTSIPVKISFLIIATVIFLSFY-FA
F18C5.6 ---------------------MSSECARSDVH-NVLTSDSMKFNHCFIISIIIISFF-TT
F18C5.8 ------------------MENLNPACASEDVK-NALTSPIMMLSHGFILMIIVVSFI-TT
AH6.7 --------------------MSSQKCASHLEI-ARLESLNFKISQLIYFVLIITTLF-FT
AH6.11 --------------------MSAPNCARKYDI-ARLSSLNFQISQYVYLSLISLTFI-FS
AH6.8 --------------------MSLTKCASKLEI-DRLISLNFRINQIIVLIPVFITFI-FT
AH6.14 --------------------MATIACASIIEQ-QRLRSSNFVIAQYIDLLCIVITFV-TT
Systems Biology
DNA
Protein
Pathway/Partners
Cell
Organ/Tissue
Organism
Variation/Stimulus
Measurement
Systems Biology
Causality
Complexity
Coordination
Robustness
Resilience
Systems Theory: “The study of organization and behavior per se”
(Wolkenhauer, Brief. Bioinform. 2:258, 2001)
Outline
• Functional variation in the human genome
– Extent of common protein variation
– Genes that have evolved faster in human lineage
• Mouse models of complex disease
– Use of natural variation to infer a model of normal heart function
Aren’t there enough SNPs already?
• Depends on disease mapping strategy
Yes! No! Yes! No!
[Chart: # of missense SNPs (axis 0 to 200,000) for dbSNP + CRA, dbSNP + CRA + HGMD, and the SNP universe]
Risch 2000. Nature 405:847.
Deficiency of missense SNPs
[Diagram labels: disease causing allele; genetic marker; infer; direct]
Identifying Common Sites of Variation
March, 2001: <6,500 missense SNPs in 3,500 of 10,000 RefSeq genes
Identifying Common Sites of Variation
SNP Discovery in:
20 Female Caucasians
19 Female African-Americans
1 Male chimpanzee
Why 39 people?
Power to detect at least 1 minor allele
[Chart: probability of detection (0 to 1) vs. number of individuals sequenced (0 to 50), one curve per allele frequency: 1%, 2.5%, 5%, 10%, 20%]
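The power curves can be approximated with a simple sampling argument. The sketch below assumes each sequenced individual contributes two independently sampled chromosomes, so the chance of seeing a minor allele of frequency f at least once in n individuals is 1 - (1 - f)^(2n). Whether the slide's chart used exactly this model is an assumption, and the function name is hypothetical.

```python
def detection_power(allele_freq, n_individuals, ploidy=2):
    """Probability of observing a minor allele at least once.

    Assumes ploidy * n_individuals chromosomes are sampled independently
    from a population in which the allele has frequency allele_freq.
    """
    chromosomes = ploidy * n_individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes


# The allele frequencies plotted on the slide, for the 39 humans sequenced.
for freq in (0.01, 0.025, 0.05, 0.10, 0.20):
    print(f"{freq:.1%}: {detection_power(freq, 39):.3f}")
```

Under this model, 39 people give better than even odds of catching a 1% allele and near certainty for alleles at 5% or above.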
Re-sequencing Workflow
• Primer design
– Unique primers are designed around coding exons and human-mouse conserved segments in the 1 kbp upstream of the transcript
– Splice sites should be sequenced most of the time
[Figure labels: coding exons; 5’ UTR; conserved regions with TF binding sites]
• Amplification & Sequencing
– Re-arrayed primer and DNA plates are mixed to generate PCR and sequencing plate
– Both strands are sequenced using the M13 tails on the primers
• SNP detection
– Polyphred analysis, then SNP scoring by expert system, then manual QA
• SNP annotation
– SNPs mapped to the Celera reference genome and annotated with regard to gene location, mutation type, allele frequency, genotypes…
Data Source: Human and Chimp
• 201K amplicons designed to 30.8 Mb of Celera Genomics human coding sequence
• PCR and sequence in 39 humans: 24.7 Mb coding sequence obtained in at least 5 DNAs
• PCR and sequence in 1 chimpanzee: 18.3 Mb coding sequence obtained
[Figure labels: 25K genes; 23K genes; 20K genes]
Summary of SNPs found
• >18 million lanes run (compare to 36 million for shotgun sequencing human genome)
• 23,363 genes assayed from 30,115 in the genome
• 265,978 Total SNPs
• ~75% are novel
• 36,900 missense SNPs – Doubled the number that were previously known
Why are we different from chimpanzees?
Proteins are 97-100% identical
King and Wilson, Science 188:107-116 (1975)
• The differing 1-3% is important
• The important differences are in gene regulation
• A small number of genes of divergent function with a disproportionate impact
Goal
• Identify genes that have shaped a particular species
• Identify human genes that may be more likely to be involved in human disease
[Tree: mouse, human, chimp; human-chimp split 4.6 – 6.2 MY, mouse split 112 MY]
Random drift
Natural Selection
Metric
• dN – Non-synonymous substitution rate
– Nucleotide differences that CHANGE the amino acid sequence in orthologous proteins
– Example: CGC (Arg) → GGC (Gly)
• dS – Synonymous substitution rate
– Nucleotide changes that do not change the amino acid
– Example: CGC (Arg) → CGG (Arg)
• dN/dS Ratio
– dN = dS indicates neutral change
– dN/dS < 1 indicates constraint/negative selection
– dN/dS > 1 indicates possible positive selection
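The two codon examples above can be checked mechanically. The following sketch builds the standard genetic code and labels each aligned codon pair as synonymous or nonsynonymous. It only counts raw differences; real dN and dS estimates also normalize by the number of synonymous and nonsynonymous sites and correct for multiple hits, which is not attempted here. All names are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify(codon_a, codon_b):
    """Label a codon pair as 'identical', 'synonymous' or 'nonsynonymous'."""
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"


def count_differences(seq_a, seq_b):
    """Return (nonsynonymous, synonymous) codon differences in an alignment.

    Both sequences must be in frame and equally long; codons containing
    gaps or ambiguous bases are skipped.
    """
    nonsyn = syn = 0
    for i in range(0, min(len(seq_a), len(seq_b)) - 2, 3):
        a, b = seq_a[i:i + 3], seq_b[i:i + 3]
        if a not in CODON_TABLE or b not in CODON_TABLE:
            continue
        label = classify(a, b)
        nonsyn += label == "nonsynonymous"
        syn += label == "synonymous"
    return nonsyn, syn


print(classify("CGC", "GGC"))  # Arg vs Gly
print(classify("CGC", "CGG"))  # Arg vs Arg
```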
Caveats
• Low dS causes problems
– Divide by ~0 problem
• Must match true orthologs
– Paralogous genes are subject to differing evolutionary pressures
• Annotation and alignment must be correct
List of human genes
Human gene
Determine coding sequence
Chimp traces
Build chimp transcript
Align to human
QC alignment
Chimp Gene Passes
Determine mouse ortholog
Build mouse transcript
Align to human
QC alignment
Mouse Gene Passes
Determine what was “covered”
Alignment files (2 or 3 species) → Analysis
Data Set
7,645 coding sequence alignments
[Diagram labels: HUMAN; MOUSE ORTHOLOG; CHIMP]
Evidence Distribution 7645 MH Orthologs
Evidence
– Tblastx (+/-)
– Syntenic anchor (+/-)
– Syntenic block (+)
– Shared protein family (+/-/0)
Selected human chromosomes and their mouse orthologs
Nonsynonymous and synonymous divergence: human-chimp
Nonsynonymous and synonymous divergence: human-mouse
Nonsynonymous and synonymous divergence
Correlation between dN and dS
Method
• Generate three-species (human-chimp-mouse) coding sequence alignments
• Apply models of sequence divergence
• Identify genes that violate null hypothesis
[Trees: mouse, human, chimp; human-chimp split 4.6 – 6.2 MY, mouse split 112 MY. One tree shows the null hypothesis; the other shows a gene with accelerated evolution on the human branch]
Yang and Nielsen Evolutionary Model
• Allows variation in the dN/dS ratio among lineages and among sites at the same time
• Tests what is more likely:
– all sites are either neutral (dN/dS = 1) or evolve under negative selection (dN/dS < 1)
– some sites are evolving under positive selection in the human (or chimp) lineage only
Adapted from Mol. Biol. Evol. 19:908, 2002
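Comparing two nested models like these is commonly done with a likelihood ratio test. The sketch below assumes that the log-likelihoods of the null and alternative models have already been obtained from a fitting program, and that twice their difference is referred to a chi-square distribution with 1 degree of freedom. Both the degrees of freedom and the function name are assumptions for illustration; the slide does not state which reference distribution was used.

```python
import math


def lrt_p_value(lnl_null, lnl_alt):
    """Likelihood ratio test for two nested models.

    Returns (statistic, p_value), where statistic = 2 * (lnL_alt - lnL_null)
    and the p-value assumes a chi-square distribution with 1 degree of
    freedom (an assumption, not taken from the slides).
    """
    statistic = max(0.0, 2.0 * (lnl_alt - lnl_null))
    # Chi-square survival function for 1 df: P(X > s) = erfc(sqrt(s / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical log-likelihoods for one gene.
stat, p = lrt_p_value(lnl_null=-2450.7, lnl_alt=-2443.1)
print(f"2*delta lnL = {stat:.2f}, p = {p:.2e}")
```

Genes whose p-value falls below the chosen cutoff are the ones that "violate the null hypothesis" in the Method slide.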
Evolutionary Model
List of top 22 human accelerated genes (model 1)
symbol | name | p value
TECTA | tectorin alpha | 0.00001570
none | none | 0.00002110
none | none | 0.00004010
CSMD1 | CUB and Sushi multiple domains 1 | 0.00006880
FARP2 | FERM, RhoGEF and pleckstrin domain protein 2 | 0.00008390
OR5I1 | olfactory receptor, family 5, subfamily I, member 1 | 0.00015100
none | none | 0.00015400
MAN1A2 | mannosidase, alpha, class 1A, member 2 | 0.00020100
TRPV6 | transient receptor potential cation channel, subfamily V, member 6 | 0.00024200
none | none | 0.00025900
CDON | cell adhesion molecule-related/down-regulated by oncogenes | 0.00029300
SLC6A5 | solute carrier family 6 (neurotransmitter transporter, glycine), member 5 | 0.00033700
none | none | 0.00037300
DMRT3 | doublesex and mab-3 related transcription factor 3 | 0.00038600
none | none | 0.00038800
NUP155 | nucleoporin 155kDa | 0.00044400
none | none | 0.00048900
WHN | winged-helix nude | 0.00055700
UBP1 | upstream binding protein 1 (LBP-1a) | 0.00056500
none | none | 0.00057700
GCAT | glycine C-acetyltransferase (2-amino-3-ketobutyrate coenzyme A ligase) | 0.00061100
none | none | 0.00061800
TRAF5 | TNF receptor-associated factor 5 | 0.00063500
α-Tectorin and hearing
Tectorial membrane
Hair cells
• Protein plays a vital role in the tectorial membrane of the inner ear
• Single amino acid polymorphisms are associated with familial high frequency hearing loss
• Knockout mice are deaf
FOXP2
• Molecular evolution of FOXP2, a gene involved in speech and language
– Enard, et al. Nature, 418:869, 2002
• “The ability to develop articulate speech relies on capabilities, such as fine control of the larynx and mouth, that are absent in chimpanzees and other great apes”
• “FOXP2 seem to be required for acquisition of normal spoken language”
• “FOXP2 … has been the target of selection during recent human evolution”
Enrichment of biological processes
Model: dN/dS > 1 and >1 nonsyn sub, binomial test
* = significant in 1 species
** = significant in 2 species
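The binomial test named in the model line can be sketched as an upper-tail probability. The setup assumed here: a biological process contains n genes, k of them pass the filter (dN/dS > 1 and more than one nonsynonymous substitution), and p is the fraction of all genes in the data set that pass. The one-sided form, the variable names, and the example numbers are assumptions for illustration.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    k: genes in the category that pass the filter
    n: genes in the category
    p: background fraction of genes that pass the filter
    """
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Hypothetical category: 12 of 60 genes pass, against a 5% background rate.
print(f"{binomial_upper_tail(12, 60, 0.05):.2e}")
```

A small tail probability suggests the category holds more fast-evolving genes than the background rate would predict.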
Olfaction: human genes → pseudogenes?
Blue = pseudogene
Red = gene
Pseudogene status from HORDE: http://bioinformatics.weizmann.ac.il/HORDE/
Over-Representation of Certain Families
Correlation between diversity and divergence
Comparative Genomics
Photo: 1997 Purina Mills Calendar
C57BL/6J and A/J mice: Models for study of the metabolic syndrome
On a high fat, high sucrose diet:
[Table: columns C57BL/6J and A/J; rows Obesity, Hypertension, Hyperglycemia, Hypertriglyceridemia, Low HDL Cholesterol; one X per row]
Legend: a check mark indicates that the strain develops the condition; X indicates that the strain does not develop the condition
Functional Networks
Attributes
• applicable to all kinds of biological traits
• here - subtle, naturally-occurring, non-pathologic variation
• quantitative and qualitative biological properties
• monogenic and polygenic traits
• additive and epistatic traits
• uses results from all kinds of assays
• healthy individuals to learn about normal biological functions
• abnormal conditions to learn about disease processes
Computational and genomic synthesis of complex systems from assays of component traits
Perturbation tests
Traditional approach
Single gene mutations (endogenous challenge)
Drug treatments (exogenous challenges)
Both establish causal relations
But
How do we interpret networks derived from perturbations that have dramatic effects?
Alternative: Factorial design (after Fisher)
1. Segregating populations
2. Reference network based on normal variation
3. Use to evaluate single gene mutations, modifier genes and drug perturbations
Nadeau, et al. Genome Research 13:2082, 2003
[Figure labels: RV, LV, LA, Aorta, SW, PW, AWRV, CW, transducer]
Abbreviations
AWRV - anterior wall, right ventricle
CW - chest wall
LA - left atrium
LV - left ventricle
PW - posterior wall
RV - right ventricle
SW - septal wall
Echocardiography
Heart: Proof-of-concept study
[Figure labels: ESD, EDD, SWTh, PWTh, LV Cavity, PW, SW, CW, AWRV, RV Cavity, Time]
Abbreviations
EDD - end diastolic dimension
ESD - end systolic dimension
PWTh - posterior wall thickness
SWTh - septal wall thickness
HR - heart rate (beats per min)
Calculations
FS (fractional shortening) = (EDD - ESD) / EDD
LV mass = 1.06 x [(EDD + PWTh + SWTh)^3 – (EDD)^3]
Th/r = (PWTh + SWTh) / EDD
SV (stroke volume) = EDD^3 - ESD^3
CO (cardiac output) = SV x HR
Echocardiography: Measures and calculations
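The derived measures follow directly from the listed formulas. The sketch below applies them to a single set of measurements; the function name and dictionary keys are hypothetical, and the example inputs are the C57BL/6J group means from the deck, so the outputs will not match the reported group means exactly (a mean of ratios is not the ratio of means). With dimensions in mm, the cubed terms are in mm^3.

```python
def echo_measures(edd, esd, pwth, swth, hr):
    """Derived echocardiography measures, using the slide's formulas.

    edd, esd, pwth, swth: dimensions in mm; hr: beats per minute.
    """
    fs = (edd - esd) / edd                              # fractional shortening
    lv_mass = 1.06 * ((edd + pwth + swth) ** 3 - edd ** 3)
    th_r = (pwth + swth) / edd                          # relative wall thickness
    sv = edd ** 3 - esd ** 3                            # stroke volume (cube formula)
    co = sv * hr                                        # cardiac output
    return {"FS": fs, "LV_mass": lv_mass, "Th_r": th_r, "SV": sv, "CO": co}


# C57BL/6J group means from the deck, used only as example inputs.
b6 = echo_measures(edd=3.31, esd=2.01, pwth=0.49, swth=0.49, hr=433)
for name, value in b6.items():
    print(f"{name}: {value:.3f}")
```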
Alternative genetic solutions to the same cardiovascular problem
Trait | C57BL/6J | A/J
LV mass (g) | 46.2 ± 14.1 | 32.7 ± 11.5 *
LV EDD (mm) | 3.31 ± 0.42 | 2.83 ± 0.31 *
LV ESD (mm) | 2.01 ± 0.32 | 1.49 ± 0.25 *
Exercise time (min) | 9.6 ± 3.4 | 4.4 ± 1.9 *
LV frac. shortening (%) | 39.1 ± 6.2 | 47.1 ± 6.9 *
Vcf (s^-1) | 8.8 ± 1.9 | 11.7 ± 2.6 *
SW Th (mm) | 0.49 ± 0.06 | 0.47 ± 0.07
PW Th (mm) | 0.49 ± 0.05 | 0.45 ± 0.08
LV mass / BW (mg/g) | 1.96 ± 0.38 | 1.54 ± 0.43
Rel wall thickness | 0.30 ± 0.04 | 0.32 ± 0.04
HR (echo; bpm) | 433 ± 55 | 524 ± 45
HR (tail cuff; bpm) | 615 ± 79 | 694 ± 75
Systolic BP (mm Hg) | 122 ± 13 | 123 ± 20.8
Cardiac output (ml/min) | 0.58 ± 0.19 | 0.50 ± 0.17
These strains were not constructed to differ in CV functions
B6: ‘athlete’s heart’, physiologic hypertrophy, exercise endurance
Summary of cardiovascular traits
Perturbations
• Subtle
• Naturally-occurring
• Non-pathologic
[Figure: chromosomes (Chr 1 through Chr 7, ..., Chr X) shown for strains A/J, B6, AXB1, AXB2, AXB3, BXA30]
Randomizing genomes in recombinant inbred strains
Key features
Probability of coincidental match for 2 strains: 0.50 (50% chance of fixing A or B allele).
Probability of coincidental match for 30 strains: (0.50)^29 < 2 x 10^-9
(These results apply to a single-gene trait; probabilities are lower for polygenic traits)
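The arithmetic behind these figures is short enough to check directly. The sketch assumes each strain fixes the A or B allele independently with probability 0.50, so a pattern across n strains matches by coincidence with probability 0.50^(n-1), which is the exponent used on the slide. The function name is hypothetical.

```python
def coincidental_match_probability(n_strains, p_same=0.5):
    """Chance that a single-gene trait pattern matches by coincidence.

    Uses the slide's exponent, n_strains - 1, with each strain fixing
    the A or B allele independently with probability p_same.
    """
    return p_same ** (n_strains - 1)


print(coincidental_match_probability(2))
print(f"{coincidental_match_probability(30):.2e}")
```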
Strain: A/J | B6 | AXB1 | AXB2 | AXB3 | BXA30
Ht rate: 680 | 590 | 691 | 585 | 597 | 666
Exer time: 233 | 582 | 540 | 597 | 241 | 255
Methods: building functional networks
1. Type traits
[Table: traits T1 ... Tn (rows) measured in strains S1 ... Sn with randomized genetics (columns)]
2. Estimate cosegregation
[Matrix: trait x trait correlations r12, r13, r23, r14, r24, r34, ..., r1n]
2a. Cluster analysis
2b. Identify significant relations
[Matrix: only significant signed correlations are kept, e.g. +r12, +r14, +r24, -r34; the rest are blank (--)]
3. Identify networks
[Diagram: Trait 1, Trait 2, Trait 3, Trait 4, ..., Trait n linked as a network]
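Steps 2, 2b and 3 can be sketched in a few lines: correlate every pair of traits across strains, keep the pairs whose correlation magnitude passes a threshold, and treat the surviving pairs as network edges. The trait values below are made up for illustration, the function names are hypothetical, and Pearson correlation is assumed as the cosegregation measure.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists of strain means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cosegregation_edges(traits, threshold):
    """Return (trait_a, trait_b, r) for every pair with |r| >= threshold.

    traits maps a trait name to its values, one per strain, in the same
    strain order for every trait.
    """
    edges = []
    for a, b in combinations(traits, 2):
        r = pearson(traits[a], traits[b])
        if abs(r) >= threshold:
            edges.append((a, b, r))
    return edges


# Hypothetical strain means for three traits in six strains.
traits = {
    "T1": [0.45, 0.50, 0.52, 0.58, 0.61, 0.66],
    "T2": [0.47, 0.49, 0.53, 0.57, 0.60, 0.63],
    "T3": [610, 540, 655, 580, 600, 575],
}
for a, b, r in cosegregation_edges(traits, threshold=0.68):
    sign = "positive" if r > 0 else "inverse"
    print(f"{a} -- {b}: r = {r:+.2f} ({sign} cosegregation)")
```

The sign of r distinguishes positive from inverse cosegregation, which is how the network figures later in the deck label their edges.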
Segregation of CV traits in AXB / BXA RI strains
[Histogram: septal wall thickness; bins from 0.44 to 0.68 in steps of 0.03; y axis: Number (0 to 8)]
[Histogram: posterior wall thickness; bins from 0.46 to 0.64 in steps of 0.02; y axis: Number (0 to 8)]
Septal vs posterior wall thickness
[Scatter plot: SWTh vs PWTh, both axes 0.000 to 0.800]
• multigenic variation
• positive cosegregation: r = 0.88, r^2 = 0.77
• transgressive variation (traits exceeding parental values)
[Matrix of pairwise cosegregation (r) values. Rows: SWTh, PWTh, EDD, ESD, FS, LV mass, BW, LV/BW, SV, THR, HR. Columns: PWTh, EDD, ESD, FS, LV mass, BW, LV/BW, SV, THR, HR, EST. SWTh vs PWTh: 0.88. Other values shown: 0.72, 0.61; -0.69; -0.66; -0.68; 0.61, 0.74; 0.94, -0.84, 0.78, 0.97, -0.65; -0.97, 0.71, 0.84, -0.65; -0.63, -0.70; 0.65, 0.77; -0.61]
Cosegregation thresholds (+ / -): r = 0.61, P < 0.05; r = 0.68, P < 0.01
Cosegregation of CV traits in mouse RI strains
Threshold r values are based on 10,000 permutations for each trait, evaluated across all traits, which accounts for multiple testing (not highlighted)
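One way to obtain such thresholds is a max-statistic permutation test. In the sketch below, the strain labels of one trait are shuffled, the largest |r| against all other traits is recorded for each shuffle, and the (1 - alpha) quantile of those maxima becomes the threshold. Taking the maximum across traits is one common way to account for multiple testing; whether the original analysis did exactly this is an assumption, and all names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def permutation_threshold(target, others, alpha=0.05, n_perm=10_000, seed=0):
    """Threshold on |r| for one trait, from a max-statistic permutation test.

    target: values of the trait being tested, one per strain
    others: list of value lists for the remaining traits (same strain order)
    """
    rng = random.Random(seed)
    shuffled = list(target)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the strain-to-value link for this trait
        null_max.append(max(abs(pearson(shuffled, o)) for o in others))
    null_max.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null_max[index]


# Hypothetical data: 30 strains, one target trait and two other traits.
target = [(i * 7) % 30 for i in range(30)]
others = [[(i * 11) % 30 for i in range(30)], [(i * 13) % 30 for i in range(30)]]
print(permutation_threshold(target, others, alpha=0.05, n_perm=1000))
```

Observed correlations whose magnitude exceeds the returned value would be reported as significant at the chosen alpha.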
[Network diagram. Nodes: Post wall th, ESD, Stroke vol, Frac short, Heart rate, LV mass, Septal wall th, Body wt, Exercise, Th/r, LV / BW, EDD, CO. Edges: positive cosegregation or inverse cosegregation, P < 0.01]
Proof-of-concept: functional architecture of cardiovascular traits
[Network diagram. Nodes: Post wall th, ESD, Stroke vol, Frac short, Heart rate, LV mass, Septal wall th, Body wt, Exercise, Th/r, LV / BW, EDD, CO. Relations are marked as retained, lost, or not measured]
Over-expression of calsequestrin
C57BL/6J and 129/SV Background; 3 months; Anesthesia: Avertin
Phenotype: Hypertrophic cardiomyopathy
(Harris et al. Circ. Res. 2002; 90:594-601)
Single gene perturbations of CV networks
Genetics
linkages
QTLs
complex traits
modifier genes
Functional dependencies
perturbations
functional modularity
systems biology
Pathways
direct interactions
signal transduction
transcription
metabolism
Acknowledgements
• Applied Biosystems
– Stephen Glanowski
– Carey Gire
– Cheryl Evans
– Steve Ferriera
• Celera Diagnostics
– Michele Cargill
– Daniel Civello
– Steve Schrodi
– John Sninsky
– Tom White
• Celera Genomics
– Paul Thomas
– Anish Kejariwal
– Xianqgun Xheng
– Fu Lu
– Jim Duff
– David Tanenbaum
• Cornell University
– Andy Clark
– Melissa Todd
– Rasmus Nielsen
• Case Western Reserve University
– Joe Nadeau
– Brian Hoit
– Yo-Han Pao