From Genome Sequences to Regulatory Network Phenotypes


Page 1: From Genome Sequences to Regulatory Network Phenotypes

From Genome Sequences to Regulatory Network Phenotypes

• Study the systematic operation of genes and their products in whole genome, whole cell contexts.

• Discover the effect of every gene on growth, expression, & interaction.

• Test quantitative network models.

(bioinformatic functional genomics:)

Page 2: From Genome Sequences to Regulatory Network Phenotypes

Growth, Expression, & Interaction

Harvard Center for Computational Genetics: John Aach, Tim Chen, George Church, Jason Hughes, Jason Johnson, Abby McGuire, Jong Park, Fritz Roth

Affymetrix: David Lockhart, Erik Gentalen

NCBI: Andrew Neuwald

DOE, DARPA, Lipper, NIST, HMR

HMS Genetics: Andy Link, Doug Selinger, Pete Estep, Michael Ching, Martha Bulyk, Sonali Bose, Martin Steffen, Saeed Tavazoie, Annie Chan, Dereth Phillips, Chris Harbison

UCSD: Bernhard Palsson

Page 3: From Genome Sequences to Regulatory Network Phenotypes

Sequenced genomes

Organism                  # Genes   % Unknown function
S. cerevisiae             6034      49%
E. coli                   4288      38%
B. subtilis               4000      42%
Synechocystis sp.         3168      56%
A. fulgidus               2471      52%
H. influenzae             1740      42%
M. thermoautotrophicum    1855      56%
H. pylori                 1590      43%
M. jannaschii             1692      54%
B. burgdorferi            863       42%
M. pneumoniae             677       51%
M. genitalium             470       31%
Total                     28848     47%

Science 277: 1433 (1997) FUNs

Page 4: From Genome Sequences to Regulatory Network Phenotypes

Choice of Cells

Small genome size: Mycoplasma, Haemophilus, Methanococcus

Energy relevance: Methanobacterium, Synechocystis

Major Pathogens: Mycobacterium, Escherichia, Helicobacter

Biotech Production: Escherichia, Saccharomyces, Homo

Recombinant protein production, in vivo combinatorial chemistry, BACs, gene delivery, etc.

15 going on 40 complete genomes. 30,000 going on 150,000 complete genes (& intergenic regions).

Smith, et al. (1997) J. Bacteriol. 179:7135-55. Methanobacterium
Blattner, et al. (1997) Science 277, 1453-74. Escherichia
Goffeau, et al. (1996) Science 274, 563-7. Saccharomyces

Page 5: From Genome Sequences to Regulatory Network Phenotypes

Metabolic & regulatory databases

4288 / 4909 E. coli orfs / genes
587 - 804 enzymes
720 - 988 metabolic reactions
436 / 1303 metabolites / compounds

Varma & Palsson (1994) Appl. Env. Micro. 60:3724.
Karp et al. (1998) NAR 26:50. EcoCyc
Selkov, et al. (1997) NAR 25:37. WIT
Robison and Church http://arep.med.harvard.edu

Page 6: From Genome Sequences to Regulatory Network Phenotypes

BIGED: Biomolecule Interaction, Growth, & Expression Database

John Aach, Harvard Center for Computational Genetics

Conceptual Data Model (Project: TBEID1; Model: TBEID; Author: John Aach; Version: 1.04, 7/7/97)

[Diagram: entity-relationship model. Relationship labels: has, exhibits, used in, described by, input to. Entities and their attributes are listed below.]

• Experiment Info: Experiment Number, Experiment Type, Experimenter Name, Description, Comment, Start Time, End Time, Outcome Comment, Success Code, Sample Size, OpenInd

• Experiment Measures Set: Expt Measures Set No, Time of Measurement, Expt Measures Set Type, Description, Comment, Raw Data Sets Descrip, Data Transform Descrip, Outcome Comment, Success Code, Date Recorded, Sample Size, OpenInd

• Results Selection: Results Selection Code, Expt Measures Set Type, Results Selection Description

• Condition Set (C): Condition Set Number, Description, Comment

• Strain (S,N): Strain Number, ProgenitorInd, Description, Comment

• Strain Mix (S): Strain Mix Number, Strain Mix Name, Description, Preparation Comments

• Strain Phenotype Expt: Starting Cell Count, Starting Cell Density

• Competition Phenotype Expt: Starting Cell Count, Starting Cell Density

• Growth: Rel Growth Mutant, Std dev Rel Growth Mutant, Winner Mutant Ind, Rel Growth All, Std dev Rel Growth All, Winner All Ind

• mRNA Expression: mRNA Expression Level, Std dev Express Level

• Protein Expression: Cell Fraction, Protein State Exp Level, Std Dev Prot State Level

• Footprint: Fraction Occupancy, St Dev Frac Occupancy

• DNA Protein Binding Expt

• DNA Seq Binding: DNA Seq Bind Const Num, DNA Sequence, Binding Constant, Std Dev Binding Constant

• Non Specific DNA Binding: Non Specific Binding Const, Std Dev Non Spec Bind Const

• Protein Preparation Set (P): Prot Prep Set Number, Description, Comment

• Protein Protein Binding Expt

• Protein Protein Binding: Binding Level, Std Dev Binding Level

Submodel cross-references: * = main model, C = Condition Set Entities, D = DNA and Protein Elements, N = Names, P = Protein Preparation Entities, S = Strain and Strain Mix Entities
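As a rough illustration of how two of the entities in this data model might be held in code, here is a minimal Python sketch. The field names follow the attribute labels on the slide; the types, the `experiment_number` link on `Growth`, and the example values are assumptions made for illustration and are not part of the BIGED schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExperimentInfo:
    # Attribute names follow the "Experiment Info" entity; types are assumed.
    experiment_number: int
    experiment_type: str
    experimenter_name: str
    description: str = ""
    comment: str = ""
    success_code: Optional[str] = None
    sample_size: Optional[int] = None


@dataclass
class Growth:
    # Attribute names follow the "Growth" entity; the link field is hypothetical.
    experiment_number: int
    rel_growth_mutant: float
    std_dev_rel_growth_mutant: float
    winner_mutant_ind: bool


# Made-up values, only to show how the records relate.
expt = ExperimentInfo(1, "competition", "example experimenter")
growth = Growth(expt.experiment_number, 0.62, 0.05, False)
```

Keeping the standard deviation next to each measured value mirrors the slide, where every measurement attribute has a matching "Std dev" attribute.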

Page 7: From Genome Sequences to Regulatory Network Phenotypes

Functional Genomics: Growth, Expression, & Interaction

Why?
• Sampled sequence vs. Completed genomes
• Random vs. Engineered mutations & environments
• Evolutionary models vs. High-throughput assays

Pure comparative genomics challenge:
• 15% amino acid identity: Globins retain heme & oxygen binding functions.
• 100% amino acid identity: Enolase functions vary from enzymatic to major vertebrate lens structural component.

Page 8: From Genome Sequences to Regulatory Network Phenotypes

Escherichia coli & Saccharomyces cerevisiae Regulatory and Metabolic Networks

[Diagram labels: Environments; Metabolites; Growth rate; RNA, DNA, Protein; Expression; Interactions; rate constants kD, kR, kP, kI, kc]

kD, kD, kD: Initiate, Elongate, Terminate, Fold, Modify, Localize, Degrade

Page 9: From Genome Sequences to Regulatory Network Phenotypes

Translating successful strategies: Metrics (physics envy & killer applications)

              Automate   Data quality    Model quality   Similarity search
X-ray         1960       resolution      |o-c|/o         DALI
diffraction              < 0.2 nm        R < 0.2
Sequence      1988       discrepancy     conserved       BLAST
                         bp < 0.01%      proteins
Function      1999       completion      DNAgibbs        CorFun
(growth, expression, & interaction; CorEnvironment)

Page 10: From Genome Sequences to Regulatory Network Phenotypes

Ratio of strains, R, over environments, e, times, $t_e$, and selection coefficients, $s_e$:

$R = R_0 \exp[-s_e t_e]$

80% of 34 random yeast insertions have s < -0.3% or s > 0.3% (t = 160 generations, e = 1, rich media); ~50% for t = 15, e = 7.
Should allow comparisons with population allele models.

Other multiplex competitive growth experiments:
Thatcher, et al. (1998) PNAS 95:253.
Link AJ (1994) thesis; (1997) J Bacteriol 179:6228.
Smith V, et al. (1995) PNAS 92:6479.
Shoemaker D, et al. (1996) Nat Genet 14:450.
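The relation between strain ratio, time, and selection coefficient on this slide can be worked through numerically. The sketch below applies R = R0 exp(-s t) in both directions; the function names and the example values (s = 0.3% per generation, 160 generations) are chosen here for illustration and are not measurements from the talk.

```python
import math


def strain_ratio(r0: float, s: float, t: float) -> float:
    """Ratio after t generations: R = R0 * exp(-s * t)."""
    return r0 * math.exp(-s * t)


def selection_coefficient(r0: float, r: float, t: float) -> float:
    """Invert R = R0 * exp(-s * t) to get s = -ln(R / R0) / t."""
    return -math.log(r / r0) / t


# Illustrative values only: s = 0.3% per generation over 160 generations.
r = strain_ratio(1.0, 0.003, 160)
s = selection_coefficient(1.0, r, 160)
print(round(r, 3), round(s, 4))
```

With these numbers a strain at the 0.3% threshold drops to roughly 62% of its starting ratio after 160 generations, which gives a sense of why long competitions are needed to resolve small selection coefficients.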

Page 11: From Genome Sequences to Regulatory Network Phenotypes

Multiplex DNA sequencing.Church GM. Kieffer-Higgins S. (1988) Science. 240:185.

Physical mapping of complex genomes by cosmid multiplex analysis. Evans GA. Lewis KA. (1989) PNAS 86: 5030.

Multiplexed biochemical assays with biological chips.Fodor SP, et al. (1993) Nature 364:555.

Lashkari DA, et al. (1995) An automated multiplex oligonucleotide synthesizer. PNAS 92(17):7912.

Multiplex: Tag (Mix) > Process > Decode
Internal standards, identical conditions, microscale

Page 12: From Genome Sequences to Regulatory Network Phenotypes

Multiplex Competitive Growth Experiments

[Diagram: in-frame mutants + wild-type > Pool (t=0) > Select (40°, pH5, NaCl, Complex) > Multiplex PCR size-tag or chip readout]

Page 13: From Genome Sequences to Regulatory Network Phenotypes

107 Environments (so far)

minimal media, yeast extract, synthetic rich, Low N, Low P, NaCl, urine, pancreatin, Bile Cholate, triton X-100, 2 acetate, 4 butyrate, 6 hexanoate, homoserine lactone

Combinatorial:
a,H,F,Q,t
g,L,Y,N,S
C,I,W,u,E
M,K,T,D,dap
V,P,R,G,thiamine

a,g,C,M,thiamine
H,L,I,K,V
F,Y,W,T,P
Q,N,u,D,R
t,S,E,dap,G

pH: 5, 6, 7, 8, 9
Temperature: 25, 30, 37, 45

pyridoxin, nicotinate, biotin, pantothenate, A

Page 14: From Genome Sequences to Regulatory Network Phenotypes

Genome Engineering

Challenges: Construct any mutant in any background, multiple mutants, minimizing hitchhiking mutants.

Avoid undesired residual activities and neomorphic effects on adjacent genes in most deletion, insertion, nonsense, or antisense alleles.

Full in-frame replacements, computationally track gene overlaps, primer & genomic repeats.

Link, et al. (1997) J. Bacteriol. 179: 6228-6237. (pKO3)
http://arep.med.harvard.edu

Page 15: From Genome Sequences to Regulatory Network Phenotypes

Crossover PCR in-frame deletions / tag substitutions

[Diagram labels: gene of interest and nearby gene, each with ATG and TAA; primer with NotI site; primer with Bam site; c-tag; tag placed between ATG and TAA]

Page 16: From Genome Sequences to Regulatory Network Phenotypes

pKO3: in-frame tagged deletions

Resolving the cointegrant

[Diagram labels: pKO3 with repAts, camR, sacB, M13 ori; 43° Cam; 30° sucrose; outcomes: wild type = 1, mutant (tag) = 2]

Page 17: From Genome Sequences to Regulatory Network Phenotypes

Primer design for size-tagged PCR

[Figure: 3% agarose gel of size-tagged PCR products. Deleted ORFs: yiaU, yhcS, ydhB, yfiE, pssR, ygfX, ygoX. Lengths: 789, 518, 348, 266, 194, 141, 106. Primers: universal tag primer, size-tagged primers.]

Page 18: From Genome Sequences to Regulatory Network Phenotypes

Competitive Growth Rate Tag Readout

[Figure: tag readout for ygfX, yiaU, ydhB, yfiE, ygoX, pssR, yhcS; lanes 1 and 2 in rich, P- minimal, and N- minimal media]

Page 19: From Genome Sequences to Regulatory Network Phenotypes
Page 20: From Genome Sequences to Regulatory Network Phenotypes

Effects of pH in rich media

[Plot: % change from inoculate (y-axis, -200 to 700) for pssR, farR, nhaR, ydhB, yhcS, yidP, yhiF, yidL, uw6519; series: r' pH5, r' pH6, r' pH7, r' pH8, r' pH9]

Page 21: From Genome Sequences to Regulatory Network Phenotypes

Genome Engineering: Current status

5 Highly Expressed Genes (Link)
46 Putative regulatory FUNs (Phillips)
24 Highly conserved FUNs (Loferer)
20 Flux Balance Predictions (in prep.)

Page 22: From Genome Sequences to Regulatory Network Phenotypes

Flux balance model with max growth objective:

$S \cdot v = b$
S = stoichiometric matrix (m x n)
v = vector of n fluxes
b = I/O rate vector
n = 720 metabolic fluxes
m = 436 metabolites

Predict major flux changes: zwf-, zwf- pnt-
& synthetic lethals: zwf- pgi-

[Figure: central metabolism flux map. Metabolites: Glucose, G6P, F6P, FDP, DHAP, GA3P, DPG, 3PG, 2PG, PEP, Pyr, AcCoA, Ac, For; 6PGA, 6PG, Ru5P, X5P, R5P, S7P, E4P; Cit, Icit, KG, SuccCoA, Succ, Fum, Mal, OAA; cofactors QH2, FADH, NADH, NADPH, ATP, H+. Reactions are annotated with up to three flux values each, e.g. glucose uptake 10.50 / 10.50 / 10.50.]
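The steady-state constraint S · v = b from this slide can be illustrated on a toy network. The sketch below only checks the mass-balance constraint for candidate flux vectors; it does not perform the growth-maximizing optimization, and the three-metabolite network, the sign convention for b, and all flux values are hypothetical (only the uptake value 10.5 echoes the figure).

```python
def mass_balance(S, v):
    """Return S . v, the net production rate of each metabolite."""
    return [sum(s_ij * v_j for s_ij, v_j in zip(row, v)) for row in S]


def satisfies(S, v, b, tol=1e-9):
    """True if the flux vector v meets the constraint S . v = b."""
    return all(abs(x - y) <= tol for x, y in zip(mass_balance(S, v), b))


# Hypothetical toy network (not the 436 x 720 E. coli model):
# rows = metabolites A, B, C; columns = fluxes A->B, A->C, C->B.
S = [
    [-1, -1, 0],   # A is consumed by both branches
    [1, 0, 1],     # B is produced by both branches
    [0, 1, -1],    # C is an intermediate of the second branch
]
b = [-10.5, 10.5, 0.0]  # assumed sign convention: uptake of A, output of B

wild_type = [7.0, 3.5, 3.5]
knockout = [0.0, 10.5, 10.5]   # A->B deleted, flux rerouted through C

print(satisfies(S, wild_type, b), satisfies(S, knockout, b))
```

In this toy case a single deletion can still balance by rerouting, while deleting both branches leaves no flux vector that satisfies the constraint, which is the same logic the slide applies to flux changes and synthetic lethals.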

Page 23: From Genome Sequences to Regulatory Network Phenotypes

Non-coding regions:
E. coli: 11%
Yeast: 25%
Human: 95%

Similarity searching for environments, growth, expression, & interaction data, and then the challenges of DNA sequence motifs: short motifs & limited alphabet (4)

Page 24: From Genome Sequences to Regulatory Network Phenotypes

[Figure: correlation matrix of gene sites (Yggn, pspA, o85, YiaK, carAB, f214, hrsA, f105, ppiA, o184, mtlA 5', mtlA 3', rspA, YidX, kdgT) with groups labeled A to F. Legend: positive correlation, negative correlation. Annotations: catabolite repression (glucose & Crp regulated); log vs. stationary-phase regulated.]

$\mathrm{CorFun} = Z_g Z_g^{T} / n$
n = # environments + genotypes
g = gene sites
(switching n & g gives CorEnv)

growth, expression, &/or interaction
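One way to read the CorFun formula is as a gene-by-gene correlation matrix computed from z-scored measurements. The sketch below assumes Z is a g x n matrix whose rows (genes) are z-scored across the n conditions using the population standard deviation; that normalization, the function names, and the data are assumptions for illustration, not details given on the slide.

```python
import math


def zscore_rows(X):
    """Z-score each row (gene) across its n columns (conditions)."""
    Z = []
    for row in X:
        n = len(row)
        mean = sum(row) / n
        sd = math.sqrt(sum((x - mean) ** 2 for x in row) / n)  # population sd
        Z.append([(x - mean) / sd for x in row])
    return Z


def cor_fun(X):
    """CorFun = Z Z^T / n: gene-by-gene correlation over n conditions."""
    Z = zscore_rows(X)
    n = len(X[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in Z] for zi in Z]


def cor_env(X):
    """Switch the roles of n and g: condition-by-condition correlation."""
    return cor_fun([list(col) for col in zip(*X)])


# Made-up measurements: 3 genes (rows) x 4 conditions (columns).
X = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
C = cor_fun(X)
E = cor_env(X)
```

Here genes 0 and 1 are perfectly correlated and gene 2 is perfectly anti-correlated with both. A row with no variation would give a zero standard deviation and would need to be filtered out first.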

Page 25: From Genome Sequences to Regulatory Network Phenotypes

Expression data from four cultures allow three comparisons:

• glucose, 30°C, mating type a

• galactose, 30°C, mating type a

• glucose, 30°C, mating type α

• glucose, 30°C -> 39°C shock, mating type a

Page 26: From Genome Sequences to Regulatory Network Phenotypes

Expression Quantitation Options

1) n-dimensional cDNA or protein displays
2) Computer selected oligomer-arrays: photolithographic or piezoelectric deposition
3) Gridded microarrays from clones
4) Counting 13-bp cDNA tags (SAGE) (20,000 tags means <800 RNAs have S/N>4)

Lockhart, et al. (1997) Nature Biotechnology 15:1359.
DeRisi, et al. (1997) Science 278:680.
Velculescu, et al. (1997) Cell 88:243.

Page 27: From Genome Sequences to Regulatory Network Phenotypes

Galactose Regulatory Network

[Diagram labels: GALACTOSE; Gal3p; Gal1p; Gal4p-Gal80p inactive complex; Gal4p-Gal80p active complex; GAL80; GAL4; GAL3; GCY1 (?); Structural Genes For Galactose Metabolism: GAL1, MEL1, GAL7, PGM2, GAL2, GAL10]

Page 28: From Genome Sequences to Regulatory Network Phenotypes

Fold Change in GAL3 in Galactose vs. Glucose (Median Fold Change is 3.1)

[Plot: GAL3: Fold Change in Expression between Growth in Galactose and Growth in Glucose. x-axis: Probe Number (1 to 19); y-axis: Fold Change (0 to 25)]

Page 29: From Genome Sequences to Regulatory Network Phenotypes

orfID/gene:chip          #probes  medFC    consFC   thrshld/missingMM?  expr ratio   log expr ratio
YBR020w/GAL1:A           21       64.81    24.57    2                   64.81        1.81164202
YBR018c/GAL7:A           21       41.91    10.58    2                   41.91        1.62231766
YBR019c/GAL10:A          20       37.8     13.03    2                   37.8         1.5774918
YDR345c/HXT3:A           20       -25.05   -13.58                       0.03992016   -1.39880773
YOR120W/GCY1:D           20       12.31    7.81     2                   12.31        1.09025805
YLR081w/GAL2:C           21       8.19     3.56     2                   8.19         0.9132839
YGL189C/RPS26A:B         19       -7.82    -0.45                        0.12787724   -0.89320675
YPL066W/VPS28:D          20       6.35     2.75     2                   6.35         0.80277373
YHR094c/HXT1:B           20       -6.26    -2.38    1                   0.15974441   -0.79657433
YOL154W/:D               21       -6.04    -3.27                        0.16556291   -0.78103694
YPL067C/:D               21       5.95     3.13     2                   5.95         0.77451697
YGL030W/RPL32_ex1:B      21       -5.32    -3.11                        0.18796992   -0.72591163
YFL045C/SEC53:B          21       -5.17    -2.73                        0.1934236    -0.71349054
YBR106w/:A               21       -5.03    -2.66    1                   0.19880716   -0.70156799
YER190w/_f:B             20       -4.9     -2.48    1                   0.20408163   -0.69019608
YMR318C/:D               20       4.02     2.36                         4.02         0.60422605
YNL015W/PBI2:D           20       3.89     2.3      2                   3.89         0.5899496
YBR011c/IPP1:A           20       -3.73    -1.75                        0.26809651   -0.57170883
YER178w/PDA1:B           20       -3.46    -2.22                        0.28901734   -0.5390761
YOL058W/ARG1:D           20       3.36     2.24                         3.36         0.52633928
YCR005c/CIT2:A           20       -3.3     -2.15                        0.3030303    -0.51851394
YHR092c/HXT4:B           20       -3.27    -1.52    1                   0.3058104    -0.51454775
25srRnaa:A::25srRnaa:B::25srRnaa:C::25srRnaa:D
                         84       -3.27    -1.49                        0.3058104    -0.51454775
YGL055W/OLE1:B           20       3.21     1.98                         3.21         0.50650503
YFR024C/_r:B             20       -3.21    -1.43    1                   0.31152648   -0.50650503
YHR033W/:B               20       3.15     1.52                         3.15         0.49831055
YDR009W/GAL3:A           20       3.08     1.38     2                   3.08         0.48855072
YGR244C/:B               20       2.99     1.55     2                   2.99         0.47567119
YKL096W/CWP1:C           21       -2.97    -1.78                        0.33670034   -0.47275645
YNL052W/COX5A:D          20       2.94     1.96                         2.94         0.46834733
YJR073C/OPI3:C           20       -2.92    -1.52                        0.34246575   -0.46538285
YMR256c/COX7:D           21       2.84     1.64                         2.84         0.45331834

BINS log expr ratio : FREQ (as far as shown): -2.00 to -1.40: 0; -1.35: 1; -1.30 to -0.90: 0; -0.85: 1; -0.80: 0; -0.75: 2; -0.70: 3; -0.65: 1; -0.60: 0; -0.55: 1; -0.50: 5; -0.45: 3

Relative expression of all genes: Galactose vs. Glucose

[Histogram: x-axis: Log of Fold Change (-2.0 to 2.0); y-axis: Number of Genes (log scale, 0.1 to 10000)]
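The "expr ratio" and "log expr ratio" columns in the table on this slide appear to follow from the signed median fold change: a positive value is used as is, a negative value -x becomes 1/x, and the log is base 10. The sketch below reproduces that conversion; the function names are chosen here, and the rule is inferred from the table values rather than stated in the talk.

```python
import math


def expression_ratio(fold_change: float) -> float:
    """Signed fold change -> ratio: +x stays x, -x becomes 1/x."""
    return fold_change if fold_change > 0 else 1.0 / abs(fold_change)


def log_expression_ratio(fold_change: float) -> float:
    """Base-10 log of the expression ratio."""
    return math.log10(expression_ratio(fold_change))


# medFC values taken from the table rows for GAL1, HXT3, and GAL3.
for gene, med_fc in [("GAL1", 64.81), ("HXT3", -25.05), ("GAL3", 3.08)]:
    print(gene, round(expression_ratio(med_fc), 8),
          round(log_expression_ratio(med_fc), 8))
```

Working on the log scale makes induction and repression symmetric around zero, which is what the histogram of all genes relies on. A fold change of exactly 0 is not meaningful in this convention and would raise an error here.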

Page 30: From Genome Sequences to Regulatory Network Phenotypes

To analyze the most induced genes, we...

• Extracted the intergenic DNA sequence upstream of each translation start using the Saccharomyces Genome Database.

• Used an algorithm for multiple sequence alignment to look for sequence motifs conserved among the most induced (or repressed) genes.

• Looked at the intersection of genes which both matched a conserved motif and were induced (or repressed).

Page 31: From Genome Sequences to Regulatory Network Phenotypes

Gibbs Motif Sampling Strategy

1 Initialize the alignment by choosing a random subset of all possible sites as the ‘site’ alignment, and use all remaining sequences to give a ‘non-site’ alignment.

2 Select a potential site from among all possible sites.

3 If the site is in the alignment, take it out.

4 Calculate the relative likelihood that the potential site belongs with the site alignment rather than the ‘non-site’ alignment, based on a Bayesian multinomial distribution model.

5 Randomly choose whether or not to add the site, weighted by this relative likelihood.

6 Repeat from Step 2.
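The six steps above can be sketched in a few lines of Python. This is a simplified illustration, not the DNAGibbs program: it uses a fixed background model instead of re-estimating the non-site alignment, allows overlapping sites, and assumes a prior site probability (`prior`), a pseudocount of 1 per base, and made-up input sequences.

```python
import random

BASES = "ACGT"


def column_counts(sites, width, pseudo=1.0):
    """Per-position base counts (with pseudocounts) for the site alignment."""
    counts = [{b: pseudo for b in BASES} for _ in range(width)]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts


def likelihood_ratio(candidate, sites, background, width):
    """Step 4: P(candidate | site model) / P(candidate | non-site model)."""
    counts = column_counts(sites, width)
    total = len(sites) + 4.0  # one pseudocount for each of the 4 bases
    ratio = 1.0
    for i, base in enumerate(candidate):
        ratio *= (counts[i][base] / total) / background[base]
    return ratio


def gibbs_sample(seqs, width, n_iter=2000, prior=0.05, seed=0):
    rng = random.Random(seed)
    # All possible sites as (sequence index, start position) pairs.
    all_sites = [(k, p) for k, s in enumerate(seqs)
                 for p in range(len(s) - width + 1)]
    # Step 1: a random subset of sites forms the initial site alignment.
    aligned = set(rng.sample(all_sites, max(1, len(all_sites) // 20)))
    # Simplification: fixed non-site model from overall base frequencies.
    text = "".join(seqs)
    background = {b: (text.count(b) + 1) / (len(text) + 4) for b in BASES}
    for _ in range(n_iter):                      # Step 6: repeat
        k, p = rng.choice(all_sites)             # Step 2: pick a site
        aligned.discard((k, p))                  # Step 3: take it out
        others = [seqs[si][st:st + width] for si, st in aligned]
        lr = likelihood_ratio(seqs[k][p:p + width], others, background, width)
        odds = lr * prior / (1 - prior)
        if rng.random() < odds / (1 + odds):     # Step 5: weighted choice
            aligned.add((k, p))
    return aligned


# Made-up sequences for illustration only.
seqs = ["ACGTTGACGTCAGT", "TTGACGAACGTACC", "GGCATTGACGTTAA"]
sites = gibbs_sample(seqs, width=6)
```

The returned set holds (sequence index, start) pairs for the sites in the alignment when sampling stops. Because the procedure is stochastic, different seeds can settle on different alignments, which is why samplers of this kind are usually run several times.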

Page 32: From Genome Sequences to Regulatory Network Phenotypes

‘DNAGibbs’: A Modified Gibbs Motif Sampler Optimized for DNA searches.

• Either forward or reverse strand of a potential site -- but not both -- may be added to the alignment.

• Near-optimum sampling method was improved so that it is faster and tends to result in higher scoring alignments.

• Simultaneous multiple motif searching was replaced with a more efficient iterative masking approach.

• The model for base frequencies of non-site sequence was fixed using the average nucleotide frequencies of S. cerevisiae.

• Now runs on DEC Unix and Windows platforms, in addition to the formerly supported SGI and Sun Unix platforms.
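Two of the modifications listed above, single-strand selection and a fixed base-frequency model for non-site sequence, can be illustrated with a short sketch. It scores a candidate site in both orientations against a position weight matrix and keeps only one strand. Picking the higher-scoring strand deterministically is a simplification of the sampling DNAGibbs performs, and the background frequencies and 3-bp matrix are illustrative numbers, not values from the program.

```python
import math

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(site: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[b] for b in reversed(site))


def log_odds(site, pwm, background):
    """Sum of log2(p_site / p_background) over the positions of the site."""
    return sum(math.log2(pwm[i][b] / background[b]) for i, b in enumerate(site))


def best_strand(site, pwm, background):
    """Score both orientations and keep only the better one."""
    fwd = log_odds(site, pwm, background)
    rev = log_odds(reverse_complement(site), pwm, background)
    return ("+", fwd) if fwd >= rev else ("-", rev)


# Illustrative numbers only: an AT-rich fixed background and a 3-bp motif.
background = {"A": 0.31, "C": 0.19, "G": 0.19, "T": 0.31}
pwm = [
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
print(best_strand("CGG", pwm, background)[0],
      best_strand("CCG", pwm, background)[0])
```

Since "CCG" is the reverse complement of "CGG", the same physical site is reported on the forward strand in one case and the reverse strand in the other, and it is never counted twice.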

Page 33: From Genome Sequences to Regulatory Network Phenotypes

Finally, exclude motifs with:

• DNAGibbs (maximum log a posteriori likelihood ratio) scores less than 5.

• Good matches (Z < 3 sd below the mean of the aligned positive motifs) with greater than 10% of all yeast genes (ORFs).

*O.G. Berg & P.H. von Hippel, J. Mol. Biol., 193: 723-750 (1987)

Page 34: From Genome Sequences to Regulatory Network Phenotypes

Using the top 10 genes induced in galactose, DNAGibbs found UASG, the site recognized by Gal4p

[Figure: sequence logo of the motif found; y-axis: Information (Bits)]

Sequence logos were developed by T.D. Schneider & R.M. Stephens, Nucleic Acids Res., 18: 6097-6100 (1990).

CGYTCGGA-GA-AGT---CCGA Previous UASG consensus
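The "Information (Bits)" axis of a sequence logo is the information content of each aligned position, which for DNA is 2 bits minus the Shannon entropy of the base frequencies. The sketch below computes that quantity for a small alignment. It omits the small-sample correction used in the published logo method, and the aligned sites are made up for illustration (they are not the UASG alignment).

```python
import math


def information_bits(column: str) -> float:
    """Information at one aligned position: 2 - H, with H in bits (DNA)."""
    n = len(column)
    entropy = 0.0
    for base in "ACGT":
        p = column.count(base) / n
        if p > 0:
            entropy -= p * math.log2(p)
    return 2.0 - entropy


# Made-up aligned sites, for illustration only.
sites = ["CGGA", "CGGT", "CGCA", "CGTT"]
profile = [information_bits("".join(col)) for col in zip(*sites)]
print([round(x, 2) for x in profile])
```

Fully conserved positions reach the 2-bit maximum, while a position split evenly between two bases carries 1 bit, so tall logo columns such as the CGG and CCG ends of the consensus mark the most constrained bases.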

Page 35: From Genome Sequences to Regulatory Network Phenotypes

Genes that changed between galactose and glucose by more than 2-fold and have strong matches to the UASG motif

Gene       Fold Change   Best Z-Score   # of Sites
GAL1       >65           -1.4           5
GAL7       >42           -0.7           2
GAL10      >38           -1.4           5
GCY1       >12           0.5            1
GAL2       >8            0.4            4
YPL066W    >6            -1.1           1
YPL067C    >6            -1.1           1
YMR318C    4             1.1            1
GAL3       >3            2              2

Page 36: From Genome Sequences to Regulatory Network Phenotypes

Galactose Regulatory Network

[Diagram labels: GALACTOSE; Gal3p; Gal1p; Gal4p-Gal80p inactive complex; Gal4p-Gal80p active complex; GAL80; GAL4; GAL3; GCY1; Structural Genes For Galactose Metabolism: GAL1, MEL1, GAL7, PGM2, GAL2, GAL10; candidates marked "?": YPL067C, YPL066W, YMR318C]

Page 37: From Genome Sequences to Regulatory Network Phenotypes

DNAGibbs and mating type

Motif          Score   %ORF   Consensus                    Similarity
mt-1 (A)       8.9     0.11   ttcctarttng                  P Box
mta-1 (B)      8.5     0.05   anwncwnkmaananantcwtbwtnw    -
mta-2 (C)      5.0     0.10   aaaycawmawnanwa              -
mta-3 (D)      28.1    0.31   grnawktacayg                 2-bind, mt-mta-1
mt-mta-1 (E)   20.7    0.34   crtgtanntwyc                 2-bind, mta-3
mt-mta-2 (F)   5.3     0.13   kwtnywnnnknnntgtttsa         PRE, mt-mta-2
mt-mta-3 (G)   8.6     0.27   tgamaywwtnaama               PRE, mt-mta-1
mt-mta-4 (H)   5.3     0.31   rmtgmcngcma                  Q Box

Expect    DNABP    Consensus
P Box     Mcm1p    tttcctaattaggnan
Q Box     Mat1p    tcaatgacag
2-bind    Mat2p    crtgtaawt
PRE       Ste12p   tgaaaca

Ref: Herskowitz, et al., in Gene Expression, E. W. Jones, et al., Eds. (CSHL Press, NY, 1992), vol. 2: pp. 583-656

Page 38: From Genome Sequences to Regulatory Network Phenotypes

Calibration of 60 E. coli binding site matrices

[Plot: Z-score (x-axis, 0 to 10) for each matrix: rpoD15, rpoD17, rpoD16, rpoD18, ompR, hns, lrp, rpoD19, malT, rpoS, crp, dnaA, fis, narL, farR, glpR, trpR, soxS, ihf, oxyR, metR, tyrR, argR, cytR, fur, metJ, phoB, fruR, cspA, torR, nagC, fadR, purR, arcA, pdhR, lexA, gcvA, fnr, galR, ntrC, rhaS, iclR, fhlA, cynR, ada, deoR, carP, lacI, marR, rpoH14, ilvY, rpoH13, araC, tus, hipB, flhCD, rpoE, melR, cysB, rpoN]

Page 39: From Genome Sequences to Regulatory Network Phenotypes

Interaction Quantitation Options

Over-expression:
• Yeast two-hybrid screens (in vivo complexity)
• In vitro chip assays (Martha Bulyk, David Lockhart, Erik Gentalen)

Natural levels, environmental regulation:
• Subcellular fractionation (unstable)
• In vivo footprinting (partners unknown)
• In vivo crosslinking

Page 40: From Genome Sequences to Regulatory Network Phenotypes

Combinatorial ds-DNA Chips (chemical, photo & enzymatic synthesis)

[Diagram labels: SiO2 surface; mask 2; h; A/C/G synthesis columns (A A C C G G, A C A C A C); specific 16-mer; spacer; n-mer; primer; Polymerase; 5' and 3' ends]

Page 41: From Genome Sequences to Regulatory Network Phenotypes

RsaI Digestion of a Fixed Density Double-Stranded DNA Chip with a Variable Spacer Length of 0 to 14 bp Between the Half-Sites

[Figure: chip images BEFORE and AFTER RsaI digestion (zoomed in views). One axis: length of spacer between half-sites (0 to 14). Other axis: 2nd strand sequence at half-sites (GTA: GTAA, GTAC, GTAG, GTAT; GTC: GTCA, GTCC, GTCG, GTCT; GTG: GTGA, GTGC, GTGG, GTGT). Inset: RsaI site, 5' GTAC / CA*TG.]

Conclusion: Loss of Signal Intensity Corresponds to Cleavage of dsDNA by RsaI

Significance:
1) Double-Stranded DNA is Created by Primer Extension of ssDNA Chips
2) Double-Stranded DNA on the Surface of the Chip is Accessible for Interaction with a DNA-Binding Protein

Page 42: From Genome Sequences to Regulatory Network Phenotypes

Interaction Quantitation Options

Over-expression:
• Yeast two-hybrid screens (in vivo complexity)
• In vitro chip assays

Natural levels, environmental regulation:
• Subcellular fractionation (unstable)
• In vivo footprinting (partners unknown)
• In vivo crosslinking (Martin Steffen, Andy Link)

Page 43: From Genome Sequences to Regulatory Network Phenotypes

Isolate in vivo crosslinked complexes:
• by nucleic acid: CsCl (or hybridization)
• by protein: epitope tag

Analyze protein by DNase, 2D gel (pH vs. kdal), trypsin-LC-ESI-MS/MS

Analyze DNA/RNA by chip

Link et al. (1997) Electrophoresis 18:1259 & 1314

Page 44: From Genome Sequences to Regulatory Network Phenotypes

Rich media log-phase, in vivo crosslink, DNaseI digest

[Figure: 2D gel. x-axis: pH (4 to 7); y-axis: kdal (10 to 100). Labeled spots: lacI, fur, grpE, dps, hns, efp, purE, dps, sspA, ihfB, ssb]

Page 45: From Genome Sequences to Regulatory Network Phenotypes

In vivo crosslinking & footprinting summary

11% of the E. coli genome is non-coding.

About 340 / 4328 proteins are likely DNA-binding proteins (2 or the top 380 proteins).

24/25 footprinted GATC sites are non-coding. Odds = $10^{-27}$.

2/3 crosslinked DNA molecules are likely regulatory binding sites. Odds = 0.04.

8/11 top DNA-crosslinked proteins are known DNA-binding proteins. Odds = $10^{-16}$.

Page 46: From Genome Sequences to Regulatory Network Phenotypes

Thoughts on chips for crosslinked epitope selections (& generally).

An easy 10-fold enrichment, but with 40,000 fragments, means an expensive 1:4000 Signal:Noise if sequencing (or SAGE) were used.

However, spread over a chip, 1:10.

Page 47: From Genome Sequences to Regulatory Network Phenotypes

E. coli oligonucleotide chip challenges:

#1) Closely spaced transcripts, e.g. carAB: (Intergenic 25-mers overlap, start 6 bp apart on average)

P1(pyrimidine) ... 48 bp ... P2(arginine)

gggtaagcaaatttgcattgcttcatactgactgaatgaattaatatgcaaataaagtg

#2) Repeats, e.g. tufA & tufB DNA.

[Mismatch map of the tufA / tufB alignment: "*" marks a mismatch and "." a match; mismatches are isolated and clustered near the two ends of the alignment.]

Page 48: From Genome Sequences to Regulatory Network Phenotypes

Summary

Expression: Cell-type & condition clustering plus the DNAGibbs algorithm extracts intergenic binding motifs for yeast Gal-Glc, Matα-Mata, & 30°C-39°C comparisons.

Interaction: Strong enrichment for low abundance wild-type & mutant in vivo E. coli DNA-protein contacts establishes mechanistically anchored intergenic elements.

Growth: Multiplex competitive growth of in-frame replacements for novel E. coli regulatory genes defines cellular system integration & environments.

Page 49: From Genome Sequences to Regulatory Network Phenotypes

Escherichia coli & Saccharomyces cerevisiae Regulatory and Metabolic Networks

[Diagram labels: Environments; Metabolites; Growth rate; RNA, DNA, Protein; Expression; Interactions; rate constants kD, kR, kP, kI, kc]

kD, kD, kD: Initiate, Elongate, Terminate, Fold, Modify, Localize, Degrade

Population Selection, Flux Balance, & Gibbs
