Life, diversity and the pursuit of haplotypes

2
1376 VOLUME 23 NUMBER 11 NOVEMBER 2005 NATURE BIOTECHNOLOGY tors) in response to the presence or absence of several input mRNAs 8 . The second system, developed in our labo- ratory 3,6 , uses circuits of nucleic acid–based logic gates like those described by Penchovsky and Breaker. In this approach, a series of (deoxy)riboswitches would recognize multiple disease markers and release oligonucleotides that would serve as inputs for downstream cir- cuits of logic gates, which would in turn trigger the release of drugs 5 (Fig. 1b). Again, a protocol for high-throughput, automated design of reli- able molecular switches brings this application one step closer to reality. Penchovsky and Breaker’s work will also make a significant contribution to the field of synthetic biology and engineered intracellular circuits. For example, the recent demonstra- tion in a mouse model of ribozymes as oligo- nucleotide-responsive gene expression control devices 9 foretells next-generation experiments in which ribozyme logic gates organized in molecular cascades can reprogram cellular behavior in response to combinatorial molecu- lar inputs. Finally, this work supports the contention that it might be easier to design complex intra- cellular circuits made of RNAs than of proteins; currently, computational predictions based on extensive search of potential sequences is pos- sible only for nucleic acid–based systems, for which the principles of Watson-Crick base pair- ing are well understood, the thermodynamic parameters are well studied and catalytic activ- ity can be reliably predicted through knowledge of the secondary structure. However, cells are adept at assembling protein-based molecular circuits from individual protein gates. Recent progress in understanding the underlying principles of cellular signaling network orga- nization, in particular the ability to construct new protein gates using principles of modular allostery and to rewire networks using engi- neered scaffold proteins 10 , argues that proteins may be equally suited to the construction of artificial circuits. Thus, although nucleic acids seem to have the initial advantage in this field, the eventual outcome of this rerun of the old evolutionary battle between two biopolymers is not easy to predict. 1. De Silva, A.P. & McClenaghan, N.D. Chemistry 10, 574-586 (2004). 2. Penchovsky, R. & Breaker, R.R. Nat. Biotechnol. 23, 1424–1433, (2005) 3. Stojanovic, M.N. et al. J. Am. Chem. Soc. 124, 3555–3561 (2002). 4 Stojanovic, M.N. et al. J. Am. Chem. Soc. 127, 6914–6915 (2005). 5. Kolpashchikov, D.K. & Stojanovic, M.N. J. Am. Chem. Soc. 127, 11348–11351 (2005). 6. Stojanovic, M.N. & Stefanovic, D. Nat. Biotechnol. 21, 1069–1074 (2003). 7. Seeman, N.C. Trends Biochem. Sci. 30, 119–235 (2005). 8. Benenson, Y. et al. Nature 429, 423–429 (2004). 9. Yen, L. et al. Nature 431, 471–476 (2004). 10. Dueber, J.E. et al. Curr. Opin. Struct. Biol. 14, 690– 699 (2004). Life, diversity and the pursuit of haplotypes Jeanette McCarthy The complete haplotype map of the human genome provides an unprecedented view of human genetic variation. The completion of the Human Genome Project revealed, among other things, that sequence variation in the human genome is abundant 1,2 —so abundant, in fact, that identifying disease-causing variation from among the vast amount of inconsequen- tial variation poses a formidable challenge. Understanding how genetic variation is organized in the genome would greatly improve the efficiency with which we can identify alleles underlying common diseases. In a recent paper in Nature, the International HapMap Consortium reports on its genome- wide survey of common genetic variation in the human genome and reveals the under- lying architecture of genetic variation in human populations 3 . It is estimated that there are 10 million polymorphic sites in the human genome, the result of mutations in the DNA of our ancestors that have been transmitted through the human population over time. It has also long been recognized that genetic variation at Jeanette McCarthy is at the Graduate School of Public Health, San Diego State University, 5500 Campanile Drive, San Diego, California 92182-4162, USA. e-mail: [email protected] different nearby sites is correlated, or in link- age disequilibrium, but the degree of correla- tion varies between any two sites. The patterns of linkage disequilibrium in the genome reflect, among other things, the process of chromosomal recombination, whereby seg- ments of DNA are swapped between mater- nally and paternally derived chromosomes. The segments of chromosomes that remain intact, without disruption from recombina- tion, are inherited in blocks (Fig. 1). The delineation of these blocks, and the genetic diversity represented within them, is revealed in the human genome HapMap. The International HapMap Consortium identified a set of over one million common (>5% frequency of less common allele) single nucleotide polymorphisms (SNPs) evenly spaced throughout the genome at an average distance of one per 5 kb, which they genotyped in 269 individuals from four diverse populations. Their analysis of haplotype structure and diversity confirmed that recombination rates vary extensively throughout the genome, with some regions (centromeres for example) exhibiting little or no recombination and others, hotspots, demonstrating recombination rates dra- matically higher than background rates. For example, within one 5-Mb region, 80% of all recombinations occurred in 15% of the sequence. The result is that most of the genome omprises long blocks of DNA that are dis- rupted by recombination sites. These blocks harbor many sequence variants, but because these variants are in strong linkage disequi- librium, the diversity of the block is actu- ally quite limited. On average, for a given block, there are only four to six haplotypes or unique combinations of alleles at adjacent sites. Furthermore, the investigators report that common and rare haplotypes are often shared across ethnic populations, although the frequency of a particular haplotype may vary between populations. As it turns out, the average common SNP is in strong linkage disequilibrium with three to ten other SNPs, indicating that the redundant SNPs add no new information. Therefore, a set of less than 500,000 SNPs is sufficient to capture information on all common variation in the human genome. How will this information be used? One of the goals of the HapMap project was to improve the ability to carry out genome- wide association studies for the purpose of identifying gene variants underlying quanti- tative traits, common diseases and response to therapeutics (pharmacogenomics). By surveying the entire genome, the genome- NEWS AND VIEWS © 2005 Nature Publishing Group http://www.nature.com/naturebiotechnology

Transcript of Life, diversity and the pursuit of haplotypes

1376 VOLUME 23 NUMBER 11 NOVEMBER 2005 NATURE BIOTECHNOLOGY

tors) in response to the presence or absence of several input mRNAs8.

The second system, developed in our labo-ratory3,6, uses circuits of nucleic acid–based logic gates like those described by Penchovsky and Breaker. In this approach, a series of (deoxy)riboswitches would recognize multiple disease markers and release oligonucleotides that would serve as inputs for downstream cir-cuits of logic gates, which would in turn trigger the release of drugs5 (Fig. 1b). Again, a protocol for high-throughput, automated design of reli-able molecular switches brings this application one step closer to reality.

Penchovsky and Breaker’s work will also make a significant contribution to the field of synthetic biology and engineered intracellular circuits. For example, the recent demonstra-tion in a mouse model of ribozymes as oligo-nucleotide-responsive gene expression control devices9 foretells next-generation experiments in which ribozyme logic gates organized in molecular cascades can reprogram cellular behavior in response to combinatorial molecu-lar inputs.

Finally, this work supports the contention that it might be easier to design complex intra-cellular circuits made of RNAs than of proteins; currently, computational predictions based on extensive search of potential sequences is pos-sible only for nucleic acid–based systems, for which the principles of Watson-Crick base pair-

ing are well understood, the thermodynamic parameters are well studied and catalytic activ-ity can be reliably predicted through knowledge of the secondary structure. However, cells are adept at assembling protein-based molecular circuits from individual protein gates. Recent progress in understanding the underlying principles of cellular signaling network orga-nization, in particular the ability to construct new protein gates using principles of modular allostery and to rewire networks using engi-neered scaffold proteins10, argues that proteins may be equally suited to the construction of artificial circuits. Thus, although nucleic acids seem to have the initial advantage in this field, the eventual outcome of this rerun of the old evolutionary battle between two biopolymers is not easy to predict.

1. De Silva, A.P. & McClenaghan, N.D. Chemistry 10, 574-586 (2004).

2. Penchovsky, R. & Breaker, R.R. Nat. Biotechnol. 23, 1424–1433, (2005)

3. Stojanovic, M.N. et al. J. Am. Chem. Soc. 124, 3555–3561 (2002).

4 Stojanovic, M.N. et al. J. Am. Chem. Soc. 127, 6914–6915 (2005).

5. Kolpashchikov, D.K. & Stojanovic, M.N. J. Am. Chem. Soc. 127, 11348–11351 (2005).

6. Stojanovic, M.N. & Stefanovic, D. Nat. Biotechnol. 21, 1069–1074 (2003).

7. Seeman, N.C. Trends Biochem. Sci. 30, 119–235 (2005).

8. Benenson, Y. et al. Nature 429, 423–429 (2004).9. Yen, L. et al. Nature 431, 471–476 (2004).10. Dueber, J.E. et al. Curr. Opin. Struct. Biol. 14, 690–

699 (2004).

Life, diversity and the pursuit of haplotypesJeanette McCarthy

The complete haplotype map of the human genome provides an unprecedented view of human genetic variation.

The completion of the Human Genome Project revealed, among other things, that sequence variation in the human genome is abundant1,2—so abundant, in fact, that identifying disease-causing variation from among the vast amount of inconsequen-tial variation poses a formidable challenge. Understanding how genetic variation is

organized in the genome would greatly improve the efficiency with which we can identify alleles underlying common diseases. In a recent paper in Nature, the International HapMap Consortium reports on its genome-wide survey of common genetic variation in the human genome and reveals the under-lying architecture of genetic variation in human populations3.

It is estimated that there are 10 million polymorphic sites in the human genome, the result of mutations in the DNA of our ancestors that have been transmitted through the human population over time. It has also long been recognized that genetic variation at

Jeanette McCarthy is at the Graduate School of Public Health, San Diego State University, 5500 Campanile Drive, San Diego, California 92182-4162, USA. e-mail: [email protected]

different nearby sites is correlated, or in link-age disequilibrium, but the degree of correla-tion varies between any two sites. The patterns of linkage disequilibrium in the genome reflect, among other things, the process of chromosomal recombination, whereby seg-ments of DNA are swapped between mater-nally and paternally derived chromosomes. The segments of chromosomes that remain intact, without disruption from recombina-tion, are inherited in blocks (Fig. 1). The delineation of these blocks, and the genetic diversity represented within them, is revealed in the human genome HapMap.

The International HapMap Consortium identified a set of over one million common (>5% frequency of less common allele) single nucleotide polymorphisms (SNPs) evenly spaced throughout the genome at an average distance of one per 5 kb, which they genotyped in 269 individuals from four diverse populations. Their analysis of haplotype structure and diversity confirmed that recombination rates vary extensively throughout the genome, with some regions (centromeres for example) exhibiting little or no recombination and others, hotspots, demonstrating recombination rates dra-matically higher than background rates. For example, within one 5-Mb region, 80% of all recombinations occurred in 15% of the sequence.

The result is that most of the genome omprises long blocks of DNA that are dis-rupted by recombination sites. These blocks harbor many sequence variants, but because these variants are in strong linkage disequi-librium, the diversity of the block is actu-ally quite limited. On average, for a given block, there are only four to six haplotypes or unique combinations of alleles at adjacent sites. Furthermore, the investigators report that common and rare haplotypes are often shared across ethnic populations, although the frequency of a particular haplotype may vary between populations. As it turns out, the average common SNP is in strong linkage disequilibrium with three to ten other SNPs, indicating that the redundant SNPs add no new information. Therefore, a set of less than 500,000 SNPs is sufficient to capture information on all common variation in the human genome.

How will this information be used? One of the goals of the HapMap project was to improve the ability to carry out genome-wide association studies for the purpose of identifying gene variants underlying quanti-tative traits, common diseases and response to therapeutics (pharmacogenomics). By surveying the entire genome, the genome-

NEWS AND V IEWS©

2005

Nat

ure

Pub

lishi

ng G

roup

ht

tp://

ww

w.n

atur

e.co

m/n

atur

ebio

tech

nolo

gy

NATURE BIOTECHNOLOGY VOLUME 23 NUMBER 11 NOVEMBER 2005 1377

wide approach makes no a priori assump-tions about which genes may be important, in contrast to a candidate-gene approach in which specific genes are targeted for analy-sis. The approach has many strengths, but has met with skepticism for a number of reasons, including a lack of information on the extent of linkage disequilibrium through the genome4.

With the knowledge gained from the HapMap project, we are one step closer to making genome-wide association studies feasible. By selecting a set of nonredundant tagged SNPs, a tenfold reduction in SNP genotyping can be realized, a vast improve-ment in efficiency for genome-wide associa-tion studies. These genotyped SNPs will be used in population studies to examine asso-ciation with disease. A positive finding may indicate that the specific block, represented by the haplotype-tagged SNP may harbor disease-causing variants. As reported by the investigators, this approach will help to iden-tify not only SNPs but structural variants (deletions, duplications, rearrangements) that may underlie disease.

Was it worth the investment? The HapMap project has validated and estimated fre-quencies of over a million common SNPs throughout the genome in the public SNP database, dbSNP, with 11,500 of them resid-ing within genes. Coupled with information on linkage disequilibrium, this provides a rich resource for all consumers of human genetic information, not just those under-taking genome-wide studies. In addition, the HapMap project provides empirical data to validate previous hypotheses concerning natural selection. Furthermore, the data have allowed us to resolve uncertainty about the utility of nongenic sequences that are con-served between species. Theoretical predic-tions that these regions lacked diversity have been refuted by HapMap data, which sug-gests instead that variation in these regions is skewed toward rare alleles. These novel find-ings highlight the importance of continued research into the function of these variants and their possible role in disease.

The question remains whether this extraordinary resource will greatly improve our ability to uncover genetic associations with disease. As a first step in cataloging human genome variation, the HapMap is an impressive achievement. In the absence of a robust strategy for identifying candi-date genes a priori, whole-genome haplo-type mapping offers an enticing approach for discovery of disease-causing alleles. However, much of this is predicated on the theory that common variants cause common

diseases. Several examples can be found in support of this theory, but few investigators have explored alternative theories, such as one that ascribes a role to rare variants in common diseases. One exception is a recent study by Cohen et al.5, which found that the sum of rare variants in three candidate genes contributed significantly to low HDL-cholesterol levels in the general population, supporting a hypothesis that common dis-eases can be the result of the aggregate effects of rare variants.

In the human genome, rare variants (those having minor allele frequencies <5%) are plen-tiful: 45% of SNPs in specific regions evaluated by HapMap investigators were rare. Although haplotypes capture common variation, their utility as markers of rare variants is unknown. Regardless of how well haplotypes perform in this situation, if rare variants are important in

common diseases, they will need to be stud-ied directly. Since PCR-based sequencing of diploid samples (used to populate dbSNP) may be biased against very rare variants, con-certed efforts to find these SNPs will need to be undertaken. To this end, technologies that make it possible to carry out deep resequenc-ing of gene regions at a substantially reduced cost and time savings, such as that highlighted in a recent issue of Nature6, will become increasingly important as a complement to genome-wide HapMap approaches.

1. Lander, E.S. et al. Nature 409, 860–921 (2001).2. Venter, J.C. et al. Science 291, 1304–1351

(2001).3. The International HapMap Consortium Nature 437,

1299–1320 (2005).4. McCarthy, J.J. & Hilfiker, R. Nat. Biotechnol. 18,

505–508 (2000).5. Cohen, J.C. et al. Science 305, 869–872 (2004).6. Margulies, M. et al. Nature 437, 376–380 (2005).

Figure 1 Emergence of haplotypes. Crossing over, or exchange of genetic information between homologous chromosomes during meiosis, results in recombination of parental chromosomes. This process generates an additional level of genetic diversity in the population. After many generations, the chromosomes are quite different, with some regions of the ancestral chromosomes no longer inherited together. Some blocks continue to be inherited together and as such, genetic variation in those blocks is inherited as a unit, called a haplotype. Haplotype blocks may harbor many genetic polymorphisms, including disease-causing alleles, which are highly correlated with each other, or redundant, even becoming ‘perfect proxies’ for each other. Because of this correlation, or linkage disequilibrium, surveying only a subset of polymorphisms per block can yield useful information. The HapMap project has produced a minimum set of polymorphisms, called haplotype-tagged SNPs that captures the identity and diversity of haplotype blocks in the genome. These haplotype-tagged SNPs can be used to identify the location of genetic variants underlying diseases.

. . t a a c t a a t t t c a t c c g g a a g t c c

Redundant SNPs

* All SNPs in each block

Many generations

Ancestral chromosomes

X

* ht-SNPs uniquely identifying blocks

. . t a g c t a a t a t c a t t c g g c a g t c c

. . t a g c t a a t t t c a t c c g g a a g t gc

. . t a a c t a a t t t c a t c c g g a a g t gc

. . t a a c t a a t a t c a t t c g g c a g t c c

. . a g t c a t g g a a c t t g a g c t g g t . .

. . a g t c a t g g a a t t t g a c c t g g t . .

. . a c t c a t g g a a t t t g a c c t g g t . .

. . a g t c a t t g a a c t tg a g c t g a t . .

. . a g t c a t t g a a t t t g a c c t g a t . .

*** * ** * *

* * * * * * * *

Bob

Crim

i

NEWS AND V IEWS©

2005

Nat

ure

Pub

lishi

ng G

roup

ht

tp://

ww

w.n

atur

e.co

m/n

atur

ebio

tech

nolo

gy