Genomics: Looking at Life in New Ways

Mark D. Adams

Department of Genetics

Center for Computational Genomics

Center for Human Genetics

• Genome publications Feb/2001

• ~30,000 genes, 3 million SNPs

Computing the Genome - Assembly

Screener: Mask heterochromatin and ribosomal DNA; tag known interspersed repeats.

Overlapper: Find all overlaps of at least 40 bp allowing 6% mismatch. (1000X Blast)

Unitiger, Scaffolder, Repeat Rez I, II, III (ASSEMBLER CORE):
• Compute all consistent sub-assemblies = unitigs
• Identify those that cover unique DNA = U-unitigs
• Scaffold U-unitigs with confirmed shorts & longs
• Then with BAC ends
• Fill repeat gaps with:
– I. Doubly anchored mates
– II. O-path confirmed singly-anchored mates
– III. Greedy path completion using QVs

Consensus: Bayesian “SNP” consensus using quality values. Occurs throughout assembler core. (~25)

Values shown beside the stages on the slide: 8:37, 86:25, 38:29, 4:12, 5:44+4:21+19:53
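The Overlapper criterion (overlaps of 40 bp or more, allowing 6% mismatch) can be illustrated with a small sketch. This is not the Celera assembler's algorithm: it is a naive suffix-prefix scan that counts substitutions only (no insertions or deletions), and the function name and parameters are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=40, max_mismatch=0.06):
    """Length of the longest suffix of `a` that matches a prefix of `b`.

    Assumptions for illustration only: mismatches are substitutions
    (Hamming distance), and an overlap counts when it is at least
    `min_len` bases long with at most `max_mismatch` differing bases
    as a fraction of the overlap length. Returns 0 if none qualifies.
    """
    longest = min(len(a), len(b))
    # Try the longest candidate first so the first hit is the best one.
    for length in range(longest, min_len - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(1 for x, y in zip(suffix, prefix) if x != y)
        if mismatches <= max_mismatch * length:
            return length
    return 0
```

A real overlapper would use seeding and banded alignment rather than this quadratic scan; the sketch only makes the length and mismatch thresholds concrete.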

Computing the Genome - Analysis

• Gene Prediction

• Repeat Elements

• Large-scale structure
– Did a genome-wide duplication occur in the evolution of the human genome?

After the genome….

• Define ‘finished’….
– Challenging regions
– Centromeres
– Annotation of genes (protein-coding and non-protein-coding)
– Annotation of non-genic functional elements

• More Genomes!
– Identification of functionally important regions through analysis of conservation through evolution

• ‘Comprehensive’ parts list

• High-throughput mentality

Protein Structure Prediction/Comparison

1B0B   1O1O

F28C12.5  ------------MSQLTAEELDSQKCASEGLT-SVLTSITMKFNFLFITTVILLSYC-FT
F28C12.7  ------------MNKTAEDLLDSLKCASKDLS-SALTSVTIKFNCIFISTIVLISYC-FI
T06G6.1   ------------MNKTAEELLDSLKCASDGLA-SALTSVTLKFNCAFISTIVLISYC-FS
F28C12.2  ------------MNKTAEELD-SRNCASESLT-NALISITMKFNFIFIITVVLISYC-FT
F28C12.3  ------------MNKTAEELLDSRKCASEGLT-NALTSFMMKMNFSFIVT----------
F28C12.4  ------------MNKTAEELVESLRCASEGLT-NALTSITVKVSFVFLATVILLSYY-FA
T06G6.2   ------------MNKTAEEIVESRRCASEGLT-NALTSITVKMSSVLVVTVILLSYY-FA
F28C12.1  --------------MNQTELLESLKCASEGMV-KAMTSTTMKLNFVFIATVIFLSFY-FA
T26E3.9   ----------------MNELIDGPKCASEGIV-NAMTSIPVKISFLIIATVIFLSFY-FA
F18C5.6   ---------------------MSSECARSDVH-NVLTSDSMKFNHCFIISIIIISFF-TT
F18C5.8   ------------------MENLNPACASEDVK-NALTSPIMMLSHGFILMIIVVSFI-TT
AH6.7     --------------------MSSQKCASHLEI-ARLESLNFKISQLIYFVLIITTLF-FT
AH6.11    --------------------MSAPNCARKYDI-ARLSSLNFQISQYVYLSLISLTFI-FS
AH6.8     --------------------MSLTKCASKLEI-DRLISLNFRINQIIVLIPVFITFI-FT
AH6.14    --------------------MATIACASIIEQ-QRLRSSNFVIAQYIDLLCIVITFV-TT

Systems Biology

DNA

Protein

Pathway/Partners

Cell

Organ/Tissue

Organism

Variation/Stimulus

Measurement

Systems Biology

Causality

Complexity

Coordination

Robustness

Resilience

Systems Theory

“The study of organization and behavior per se”

(Wolkenhauer, Brief. Bioinform. 2:258, 2001)

Outline

• Functional variation in the human genome
– Extent of common protein variation
– Genes that have evolved faster in human lineage

• Mouse models of complex disease
– Use of natural variation to infer a model of normal heart function

Aren’t there enough SNPs already?

• Depends on disease mapping strategy

Yes! No! Yes! No!

[Bar chart: missense SNPs. y-axis: # of SNPs (0 to 200,000); bars: dbSNP + CRA, dbSNP + CRA + HGMD, SNP universe]

Risch 2000. Nature 405:847.

Deficiency of missense SNPs

[Diagram labels: Disease causing allele, Genetic Marker; arrows labelled infer and direct]

Identifying Common Sites of Variation

March 2001: <6,500 missense SNPs in 3,500 of 10,000 RefSeq genes

Identifying Common Sites of Variation

SNP Discovery in:

20 Female Caucasians

19 Female African-Americans

1 Male chimpanzee

Why 39 people?

Power to detect at least 1 minor allele

[Plot: probability of detection (0 to 1) versus number of individuals sequenced (0 to 50), with one curve per allele frequency: 1%, 2.5%, 5%, 10%, 20%]
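A minimal sketch of the kind of calculation behind such a power curve, assuming diploid individuals whose 2n chromosomes are sampled independently (an assumption, not stated on the slide). The probability of seeing at least one copy of a minor allele with frequency f is then 1 - (1 - f)^(2n). The function name is hypothetical.

```python
def detection_power(n_individuals, allele_freq):
    """Probability of observing a minor allele at least once.

    Assumes each diploid individual contributes two independent
    chromosomes, so the allele is missed only if all 2n draws
    carry the major allele.
    """
    chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - allele_freq) ** chromosomes


# Power for 39 individuals at the allele frequencies plotted on the slide.
for freq in (0.01, 0.025, 0.05, 0.10, 0.20):
    print(f"f = {freq:.3f}: power = {detection_power(39, freq):.3f}")
```

Under this model, power rises quickly with allele frequency, which is one way to read the choice of 39 people: common alleles are very likely to be seen, while alleles near 1% are caught only about half the time.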

Re-sequencing Workflow

• Primer design

– Unique primers are designed around coding exons and human-mouse conserved segments in the 1 kbp upstream of the transcript

– Splice sites should be sequenced most of the time

[Diagram labels: coding exons, 5’ UTR, conserved regions with TF binding sites]

• Amplification & Sequencing

– Re-arrayed primer and DNA plates are mixed to generate PCR and sequencing plate

– Both strands are sequenced using the M13 tails on the primers

• SNP detection

– Polyphred analysis → SNP scoring by expert system → Manual QA

• SNP annotation

– SNPs mapped to the Celera reference genome and annotated with regard to gene location, mutation type, allele frequency, genotypes…

Data Source: Human and Chimp

201K amplicons designed to 30.8 Mb of Celera Genomics human coding sequence (25K genes)

PCR and sequence in 39 humans: 24.7 Mb coding sequence in at least 5 DNAs obtained (23K genes)

PCR and sequence in 1 chimpanzee: 18.3 Mb coding sequence obtained (20K genes)

Summary of SNPs found

• >18 million lanes run (compare to 36 million for shotgun sequencing human genome)

• 23,363 genes assayed from 30,115 in the genome

• 265,978 Total SNPs

• ~75% are novel

• 36,900 missense SNPs – Doubled the number that were previously known

Why are we different from chimpanzees?

Proteins are 97-100% identical

King and Wilson, Science 188:107-116 (1975)

• The differing 1-3% is important

• The important differences are in gene regulation

• A small number of genes of divergent function with a disproportionate impact

Goal

• Identify genes that have shaped a particular species

• Identify human genes that may be more likely to be involved in human disease

[Tree: mouse, human, chimp. Human-chimp split: 4.6 – 6.2 MY; split from mouse: 112 MY]

Random drift

Natural Selection

Metric

• dN – Non-synonymous substitution rate
– Nucleotide differences that CHANGE the amino acid sequence in orthologous proteins
– Example: CGC (Arg) → GGC (Gly)

• dS – Synonymous substitution rate
– Nucleotide changes that do not change the amino acid
– Example: CGC (Arg) → CGG (Arg)

• dN/dS Ratio
– dN = dS indicates neutral change
– dN/dS < 1 indicates constraint/negative selection
– dN/dS > 1 indicates possible positive selection
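The distinction between the two kinds of substitution can be made concrete with a short sketch that translates two codons with the standard genetic code and compares the amino acids. The function name is hypothetical, and this only classifies a single codon change; estimating dN and dS rates for a gene requires a substitution model that this sketch does not attempt.

```python
# Standard genetic code, laid out in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def classify_substitution(codon_a, codon_b):
    """Label a codon change as identical, synonymous, or nonsynonymous."""
    codon_a, codon_b = codon_a.upper(), codon_b.upper()
    if codon_a == codon_b:
        return "identical"
    if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
        return "synonymous"
    return "nonsynonymous"


# The two examples from the slide.
print(classify_substitution("CGC", "GGC"))  # Arg -> Gly
print(classify_substitution("CGC", "CGG"))  # Arg -> Arg
```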

Caveats

• Low dS causes problems
– Divide by ~0 problem

• Must match true orthologs
– Paralogous genes are subject to differing evolutionary pressures

• Annotation and alignment must be correct
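One way to handle the low-dS caveat is to refuse to report a ratio when dS is too small to divide by safely. The sketch below is illustrative only: the cutoff `min_ds` and the neutrality `tolerance` are hypothetical values chosen for the example, not thresholds used in the analysis.

```python
def dn_ds_ratio(dn, ds, min_ds=1e-3):
    """Return dN/dS, or None when dS is below a (hypothetical) cutoff."""
    if ds < min_ds:
        return None  # avoids the divide-by-~0 problem
    return dn / ds


def interpret(ratio, tolerance=0.05):
    """Map a dN/dS ratio onto the categories listed on the Metric slide."""
    if ratio is None:
        return "undetermined (dS too low)"
    if abs(ratio - 1.0) <= tolerance:
        return "neutral"
    if ratio > 1.0:
        return "possible positive selection"
    return "constraint/negative selection"


print(interpret(dn_ds_ratio(0.002, 0.010)))
print(interpret(dn_ds_ratio(0.004, 0.0)))
```

In practice a point estimate is not enough to call selection; a statistical test on the ratio is still needed.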

[Flowchart]

List of human genes → Human gene → Determine coding sequence

Chimp branch: Chimp traces → Build chimp transcript → Align to human → QC alignment → Chimp Gene Passes

Mouse branch: Determine mouse ortholog → Build mouse transcript → Align to human → QC alignment → Mouse Gene Passes

Determine what was “covered” → Alignment files (2 or 3 species) → Analysis

Data Set

7,645 coding sequence alignments

[Diagram: HUMAN, MOUSE ORTHOLOG, CHIMP]

Evidence Distribution: 7645 MH Orthologs

Evidence
– Tblastx (+/-)
– Syntenic anchor (+/-)
– Syntenic block (+)
– Shared protein family (+/-/0)

Selected human chromosomes and their mouse orthologs

Nonsynonymous and synonymous divergence: human-chimp

Nonsynonymous and synonymous divergence: human-mouse

Nonsynonymous and synonymous divergence

Correlation between dN and dS

Method

• Generate three-species (human-chimp-mouse) coding sequence alignments

• Apply models of sequence divergence

• Identify genes that violate null hypothesis

[Tree: mouse, human, chimp. Human-chimp split: 4.6 – 6.2 MY; split from mouse: 112 MY]

[Two trees of mouse, human, and chimp: the null hypothesis, and a gene with accelerated evolution on the human branch]

Yang and Nielsen Evolutionary Model

• Allows variation in the dN/dS ratio among lineages and among sites at the same time

• Tests what is more likely:

– all sites are either neutral (dN/dS = 1) or evolve under negative selection (dN/dS < 1)

– some sites are evolving under positive selection in the human (or chimp) lineage only

Adapted from Mol. Biol. Evol. 19:908, 2002
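Comparing two nested models of this kind is commonly done with a likelihood-ratio test. The sketch below assumes the log-likelihoods of the null and alternative models have already been obtained from a fitting program, and that the test statistic is compared to a chi-square distribution with 1 or 2 degrees of freedom. The appropriate degrees of freedom for this particular model pair is an assumption here, and the function name is hypothetical.

```python
import math


def lrt_p_value(lnl_null, lnl_alt, df=2):
    """P-value for a likelihood-ratio test between nested models.

    The statistic is 2 * (lnL_alt - lnL_null). Closed-form chi-square
    tail probabilities are used for df = 1 and df = 2 only.
    """
    statistic = 2.0 * (lnl_alt - lnl_null)
    if statistic <= 0.0:
        return 1.0  # the alternative fits no better than the null
    if df == 1:
        return math.erfc(math.sqrt(statistic / 2.0))
    if df == 2:
        return math.exp(-statistic / 2.0)
    raise ValueError("only df = 1 or df = 2 are supported in this sketch")


# Hypothetical log-likelihoods for one gene.
print(lrt_p_value(lnl_null=-1000.0, lnl_alt=-995.0, df=2))
```

Genes whose p-value falls below a chosen cutoff are the ones that "violate the null hypothesis" in the sense of the Method slide.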

Evolutionary Model

List of top 22 human accelerated genes (model 1)

symbol   p value      name
TECTA    0.00001570   tectorin alpha
none     0.00002110   none
none     0.00004010   none
CSMD1    0.00006880   CUB and Sushi multiple domains 1
FARP2    0.00008390   FERM, RhoGEF and pleckstrin domain protein 2
OR5I1    0.00015100   olfactory receptor, family 5, subfamily I, member 1
none     0.00015400   none
MAN1A2   0.00020100   mannosidase, alpha, class 1A, member 2
TRPV6    0.00024200   transient receptor potential cation channel, subfamily V, member 6
none     0.00025900   none
CDON     0.00029300   cell adhesion molecule-related/down-regulated by oncogenes
SLC6A5   0.00033700   solute carrier family 6 (neurotransmitter transporter, glycine), member 5
none     0.00037300   none
DMRT3    0.00038600   doublesex and mab-3 related transcription factor 3
none     0.00038800   none
NUP155   0.00044400   nucleoporin 155kDa
none     0.00048900   none
WHN      0.00055700   winged-helix nude
UBP1     0.00056500   upstream binding protein 1 (LBP-1a)
none     0.00057700   none
GCAT     0.00061100   glycine C-acetyltransferase (2-amino-3-ketobutyrate coenzyme A ligase)
none     0.00061800   none
TRAF5    0.00063500   TNF receptor-associated factor 5

α-Tectorin and hearing

Tectorial membrane

Hair cells

• Protein plays a vital role in the tectorial membrane of the inner ear

• Single amino acid polymorphisms are associated with familial high frequency hearing loss

• Knockout mice are deaf

FOXP2

• Molecular evolution of FOXP2, a gene involved in speech and language
– Enard, et al. Nature, 418:869, 2002

• “The ability to develop articulate speech relies on capabilities, such as fine control of the larynx and mouth, that are absent in chimpanzees and other great apes”

• “FOXP2 seem to be required for acquisition of normal spoken language”

• “FOXP2 … has been the target of selection during recent human evolution”

Enrichment of biological processes

Model: dN/dS > 1 and >1 nonsyn sub, binomial test

* = significant in 1 species
** = significant in 2 species

Olfaction: human genes → pseudogenes?

Blue = pseudogene
Red = gene

Pseudogene status from HORDE: http://bioinformatics.weizmann.ac.il/HORDE/

Over-Representation of Certain Families

Correlation between diversity and divergence

Comparative Genomics

http://www.primates.com/chimps/chimpanzee.html

Photo: 1997 Purina Mills Calendar

C57BL/6J and A/J mice: Models for study of the metabolic syndrome

On a high fat, high sucrose diet:

C57BL/6J A/J

Obesity

Hypertension

Hyperglycemia

Hypertriglyceridemia

Low HDL Cholesterol

X X X X X

[check mark] Indicates that the strain develops the condition

X Indicates that the strain does not develop the condition

Functional Networks

Attributes

• applicable to all kinds of biological traits

• here - subtle, naturally-occurring, non-pathologic variation

• quantitative and qualitative biological properties

• monogenic and polygenic traits

• additive and epistatic traits

• uses results from all kinds of assays

• healthy individuals to learn about normal biological functions

• abnormal conditions to learn about disease processes

Computational and genomic synthesis of complex systems from assays of component traits

Perturbation tests

Traditional approach

Single gene mutations (endogenous challenge)

Drug treatments (exogenous challenges)

Both establish causal relations

But

How do we interpret networks derived from perturbations that have dramatic effects?

Alternative: Factorial design (after Fisher)

1. Segregating populations

2. Reference network based on normal variation

3. Use to evaluate single gene mutations, modifier genes and drug perturbations

Nadeau, et al. Genome Research 13:2082, 2003

[Echocardiography diagram labels: RV, LV, LA, Aorta, SW, PW, AWRV, CW, transducer]

Abbreviations

AWRV - anterior wall, right ventricle
CW - chest wall
LA - left atrium
LV - left ventricle
PW - posterior wall
RV - right ventricle
SW - septal wall

Echocardiography

Heart:Proof-of-concept study

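[Echocardiogram trace labels: ESD, EDD, SWTh, PWTh, LV Cavity, RV Cavity, PW, SW, CW, AWRV; x-axis: Time]

Abbreviations

EDD - end diastolic dimension
ESD - end systolic dimension
PWTh - posterior wall thickness
SWTh - septal wall thickness
HR = beats per min

Calculations

FS (fractional shortening): $\mathrm{FS} = (\mathrm{EDD} - \mathrm{ESD}) / \mathrm{EDD}$
LV mass: $\mathrm{LV\ mass} = 1.06 \times [(\mathrm{EDD} + \mathrm{PWTh} + \mathrm{SWTh})^3 - \mathrm{EDD}^3]$
Th/r: $\mathrm{Th/r} = (\mathrm{PWTh} + \mathrm{SWTh}) / \mathrm{EDD}$
SV (stroke volume): $\mathrm{SV} = \mathrm{EDD}^3 - \mathrm{ESD}^3$
CO (cardiac output): $\mathrm{CO} = \mathrm{SV} \times \mathrm{HR}$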

Echocardiography: Measures and calculations
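The calculations on this slide translate directly into code. The sketch below simply restates the slide's formulas; function names are hypothetical, and units follow whatever units the input dimensions are measured in. The example values are the C57BL/6J means from the summary table later in the talk.

```python
def fractional_shortening(edd, esd):
    """FS = (EDD - ESD) / EDD."""
    return (edd - esd) / edd


def lv_mass(edd, pwth, swth):
    """LV mass = 1.06 x [(EDD + PWTh + SWTh)^3 - EDD^3]."""
    return 1.06 * ((edd + pwth + swth) ** 3 - edd ** 3)


def relative_wall_thickness(edd, pwth, swth):
    """Th/r = (PWTh + SWTh) / EDD."""
    return (pwth + swth) / edd


def stroke_volume(edd, esd):
    """SV = EDD^3 - ESD^3."""
    return edd ** 3 - esd ** 3


def cardiac_output(sv, hr):
    """CO = SV x HR."""
    return sv * hr


# C57BL/6J mean dimensions (mm) taken from the summary table.
edd, esd, swth, pwth = 3.31, 2.01, 0.49, 0.49
print(f"FS      = {fractional_shortening(edd, esd):.3f}")
print(f"LV mass = {lv_mass(edd, pwth, swth):.1f}")
print(f"Th/r    = {relative_wall_thickness(edd, pwth, swth):.3f}")
print(f"SV      = {stroke_volume(edd, esd):.1f}")
```

Values computed from strain means need not equal the tabulated means of per-animal calculations, because the formulas are nonlinear.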

Alternative genetic solutions to the same cardiovascular problem

Trait                      C57BL/6J        A/J
LV mass (mg)               46.2 ± 14.1     32.7 ± 11.5 *
LV EDD (mm)                3.31 ± 0.42     2.83 ± 0.31 *
LV ESD (mm)                2.01 ± 0.32     1.49 ± 0.25 *
Exercise time (min)        9.6 ± 3.4       4.4 ± 1.9 *
LV frac. shortening (%)    39.1 ± 6.2      47.1 ± 6.9 *
Vcf (s^-1)                 8.8 ± 1.9       11.7 ± 2.6 *
SW Th (mm)                 0.49 ± 0.06     0.47 ± 0.07
PW Th (mm)                 0.49 ± 0.05     0.45 ± 0.08
LV mass / BW (mg/g)        1.96 ± 0.38     1.54 ± 0.43
Rel wall thickness         0.30 ± 0.04     0.32 ± 0.04
HR (echo; bpm)             433 ± 55        524 ± 45
HR (tail cuff; bpm)        615 ± 79        694 ± 75
Systolic BP (mm Hg)        122 ± 13        123 ± 20.8
Cardiac output (ml/min)    0.58 ± 0.19     0.50 ± 0.17

These strains were not constructed to differ in CV functions

B6: ‘athlete’s heart’, physiologic hypertrophy, exercise endurance

Summary of cardiovascular traits

Perturbations

Subtle

Naturally-occurring

Non-pathologic

[Figure: chromosomes (Chr 1 to Chr 7, …, Chr X) shown for strains A/J, B6, AXB1, AXB2, AXB3, BXA30]

Randomizing genomes in recombinant inbred strains

Key features

Probability of coincidental match for 2 strains: 0.50 (50% chance of fixing A or B allele).

Probability of coincidental match for 30 strains: $(0.50)^{29} < 2 \times 10^{-9}$ !!!

(These results apply to a single-gene trait; probabilities are lower for polygenic traits)
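The arithmetic behind these figures is short enough to check directly. The sketch below follows the slide's reasoning for a single-gene trait, where each additional strain has a 0.50 chance of matching by coincidence; the function name is hypothetical.

```python
def coincidental_match_probability(n_strains, p_match=0.5):
    """Chance that all strains match a pattern by coincidence.

    Following the slide, the first strain sets the pattern and each
    of the remaining n - 1 strains matches with probability p_match.
    """
    return p_match ** (n_strains - 1)


for n in (2, 10, 30):
    print(f"{n} strains: {coincidental_match_probability(n):.3g}")
```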

Strain:     A/J   B6    AXB1  AXB2  AXB3  BXA30
Ht rate:    680   590   691   585   597   666
Exer time:  233   582   540   597   241   255

Methods: building functional networks

1. Type traits: measure each trait (T1 … Tn) in each strain (S1 … Sn; randomized genetics), giving a trait-by-strain table of values

2. Estimate cosegregation: compute a trait-by-trait matrix of correlations (r12, r13, r23, r14, r24, r34, …, r1n)

2a. Cluster analysis (Trait 1, Trait 2, Trait 3, Trait 4, …, Trait n)

2b. Identify significant relations: keep the signed entries that are significant (+r12, +r14, +r24, −r34, …); the others are shown as --

3. Identify networks (Trait 1, Trait 2, Trait 3, Trait 4, …, Trait n)
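Steps 1 to 2b can be sketched as follows, assuming cosegregation is measured with the Pearson correlation across strains and that a single threshold on |r| decides which relations are kept. Both are assumptions for illustration, and the function names and toy trait values are hypothetical.

```python
import math
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cosegregation_edges(traits, threshold):
    """Return {(trait_a, trait_b): r} for pairs with |r| >= threshold.

    `traits` maps a trait name to its per-strain values, with strains
    listed in the same order for every trait.
    """
    edges = {}
    for (name_a, vals_a), (name_b, vals_b) in combinations(traits.items(), 2):
        r = pearson_r(vals_a, vals_b)
        if abs(r) >= threshold:
            edges[(name_a, name_b)] = r
    return edges


# Toy trait-by-strain table (hypothetical values, four strains).
toy = {
    "T1": [1.0, 2.0, 3.0, 4.0],
    "T2": [2.0, 4.0, 6.0, 8.0],
    "T3": [4.0, 3.0, 2.0, 1.0],
    "T4": [1.0, -1.0, -1.0, 1.0],
}
for pair, r in cosegregation_edges(toy, threshold=0.68).items():
    print(pair, f"{r:+.2f}")
```

The sign of each retained r corresponds to the + and - entries in step 2b, and the retained pairs are the edges of the network in step 3.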

Segregation of CV traits in AXB / BXA RI strains

[Histogram: septal wall thickness, bins of 0.03 from 0.44 to 0.68; y-axis: Number (0 to 8)]

[Histogram: posterior wall thickness, bins of 0.02 from 0.46 to 0.64; y-axis: Number (0 to 8)]

Septal vs posterior wall thickness

[Scatter plot: SWTh versus PWTh, both axes 0.000 to 0.800]

• multigenic variation
• positive cosegregation
• r = 0.88, r^2 = 0.77
• transgressive variation (traits exceeding parental values)

[Correlation matrix of CV traits: SWTh, PWTh, EDD, ESD, FS, LV mass, BW, LV/BW, SV, THR, HR, EST. Values shown on the slide, in slide order: 0.88; 0.72, 0.61; -0.69; -0.66; -0.68; 0.61, 0.74; 0.94, -0.84, 0.78, 0.97, -0.65; -0.97, 0.71, 0.84, -0.65; -0.63, -0.70; 0.65, 0.77; -0.61]

Cosegregation thresholds (+ and -)

r = 0.61, P < 0.05

r = 0.68, P < 0.01

Cosegregation of CV traits in mouse RI strains

Threshold r values are based on 10,000 permutations for each trait, evaluated across all traits, which accounts for multiple testing (not highlighted)
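A permutation threshold of this kind can be sketched as below. The exact procedure used in the study is not given on the slide, so this is one plausible reading: trait values are shuffled across strains to break any real association, the largest |r| over all trait pairs is recorded for each permutation (which is what accounts for multiple testing), and the threshold is the (1 - alpha) quantile of those maxima. Function names and the toy data are hypothetical.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_threshold(traits, alpha=0.05, n_perm=10_000, seed=0):
    """Estimate the |r| threshold for significance at level alpha."""
    rng = random.Random(seed)
    names = list(traits)
    null_max = []
    for _ in range(n_perm):
        # Shuffle each trait's values independently across strains.
        shuffled = {
            name: rng.sample(traits[name], len(traits[name])) for name in names
        }
        # Keep the strongest correlation seen among all trait pairs.
        null_max.append(
            max(
                abs(pearson_r(shuffled[a], shuffled[b]))
                for a, b in combinations(names, 2)
            )
        )
    null_max.sort()
    index = min(len(null_max) - 1, int((1.0 - alpha) * len(null_max)))
    return null_max[index]


# Toy data: three traits measured in six strains (hypothetical values).
toy = {
    "T1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "T2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0],
    "T3": [3.0, 6.0, 1.0, 5.0, 2.0, 4.0],
}
print(permutation_threshold(toy, alpha=0.05, n_perm=1000))
```

With few strains the threshold is high, which is consistent with only strong correlations passing on the slide.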

[Network diagram. Nodes: Post wall th, ESD, Stroke vol, Frac short, Heart rate, LV mass, Septal wall th, Body wt, Exercise, Th/r, LV / BW, EDD, CO. Edges: positive cosegregation and inverse cosegregation, P < 0.01]

Proof-of-concept: functional architecture of cardiovascular traits

[Network diagram. Nodes: Post wall th, ESD, Stroke vol, Frac short, Heart rate, LV mass, Septal wall th, Body wt, Exercise, Th/r, LV / BW, EDD, CO. Edges marked as retained, lost, or not measured]

Over-expression of calsequestrin

C57BL/6J and 129/SV Background; 3 months; Anesthesia: Avertin

Phenotype: Hypertrophic cardiomyopathy

(Harris et al. Circ. Res. 2002; 90: 594-601)

Single gene perturbations of CV networks

Genetics

linkages

QTLs

complex traits

modifier genes

Functional dependencies

perturbations

functional modularity

systems biology

Pathways

direct interactions

signal transduction

transcription

metabolism

Acknowledgements

• Applied Biosystems
– Stephen Glanowski
– Carey Gire
– Cheryl Evans
– Steve Ferriera

• Celera Diagnostics
– Michele Cargill
– Daniel Civello
– Steve Schrodi
– John Sninsky
– Tom White

• Celera Genomics
– Paul Thomas
– Anish Kejariwal
– Xianqgun Xheng
– Fu Lu
– Jim Duff
– David Tanenbaum

• Cornell University
– Andy Clark
– Melissa Todd
– Rasmus Nielsen

• Case Western Reserve University
– Joe Nadeau
– Brian Hoit
– Yo-Han Pao
