Supplementary Materials for · 2012-09-04 · heterozygosity (estimated as π = 2pq, where p and q...

www.sciencemag.org/cgi/content/full/science.1225057/DC1

Supplementary Materials for

Evidence of Abundant Purifying Selection in Humans for

Recently Acquired Regulatory Functions

Lucas D. Ward and Manolis Kellis*

*To whom correspondence should be addressed. E-mail: [email protected]

Published 5 September 2012 on Science Express

DOI: 10.1126/science.1225057

This PDF file includes:

Materials and Methods

Figs. S1 to S10

Tables S1 to S6

References

Materials and Methods

Software

For all set operations on genomic intervals, the BEDTools package was used (31).

Gene annotations

The GENCODE v7 annotation (32) was obtained and exons were defined as follows: features annotated as “CDS” were selected as protein-coding, and all protein-coding gene features annotated as “UTR” were selected as UTRs. Protein-coding and non-coding genes were selected as genes, and the set difference between bases annotated as being in a gene and bases annotated as being exonic was selected as intronic. Both CAGE and non-CAGE TSS clusters from GENCODE were used, and regions within 2 kb of a TSS were defined as TSS-proximal and excluded from the analysis (324 Mb). The genome was masked as follows: autosomes from the hg19 version of the human genome were selected, and the following regions were excluded: SimpleRepeat regions (124 Mb), from the UCSC Table Browser (33); two ENCODE blacklist tracks at which signal artifacts are predicted, the DAC Blacklisted Regions and the Duke Excluded Regions (together, 14 Mb) (5); regions not included in the EPO (Enredo, Pecan, Ortheus) multi-species alignment from ENSEMBL (277 Mb) (34); regions to which 36-bp sequences would not be mappable allowing at most one mismatch, using the CRG Alignability track from the UCSC Table Browser (815 Mb); all CpG islands from the UCSC Table Browser and any dinucleotide that is “CG” in either the reference genome or when mutated to a 1000 Genomes SNP observed in the YRI population (285 Mb); and any regions not falling within a 1000 Genomes callable region (591 Mb).
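The set operations described above (gene minus exon to obtain introns, then removal of masked regions) were performed with BEDTools in the original analysis. The following is only a minimal pure-Python sketch of the same interval arithmetic on a toy chromosome; the function names `merge` and `subtract` and all coordinates are illustrative assumptions, not part of BEDTools or of the authors' pipeline.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open [start, end), BED-style coordinates


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(keep: List[Interval], mask: List[Interval]) -> List[Interval]:
    """Return the parts of `keep` that are not covered by any interval in `mask`."""
    result: List[Interval] = []
    mask = merge(mask)
    for start, end in merge(keep):
        cursor = start
        for m_start, m_end in mask:
            if m_end <= cursor:
                continue  # mask interval lies entirely before the cursor
            if m_start >= end:
                break  # mask interval lies entirely after this region
            if m_start > cursor:
                result.append((cursor, m_start))
            cursor = max(cursor, m_end)
            if cursor >= end:
                break
        if cursor < end:
            result.append((cursor, end))
    return result


# Toy example (hypothetical coordinates): introns are gene bases minus exon bases.
gene = [(100, 1000)]
exons = [(100, 200), (400, 500), (900, 1000)]
introns = subtract(gene, exons)

# A hypothetical TSS-proximal window is then masked out of the intronic set.
tss_proximal = [(150, 350)]
tss_distal_introns = subtract(introns, tss_proximal)
```

Half-open coordinates are used so that adjacent intervals can be merged without off-by-one adjustments, which matches the BED convention.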

Transcription factor binding annotations

ChIP-seq peaks were called using the SPP method (5), and peaks with an irreproducible discovery rate (IDR) of less than 1% were retained. DNase regions from both the Duke and University of Washington groups were included, and peaks were called using a uniform pipeline (5).

Chromatin state annotations

The ChromHMM segmentation of the ENCODE data (5) was used to define four broad sets of functional elements across six cell lines: promoters (states 1-4), enhancers (states 5-11), insulators (states 12 and 13), and transcribed regions (states 14-19). Each 200-bp window comprising the segmentation was annotated as being in one of these four sets in each cell type, and the union was taken for each annotation across cell types.
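The grouping of states into four sets and the union across cell types can be sketched as follows. This is a minimal illustration, assuming a segmentation stored as a mapping from 200-bp windows to state numbers; the data layout, the cell-type labels, and the treatment of states outside 1-19 as unassigned are assumptions made for the example.

```python
from typing import Dict, Optional, Set, Tuple

Window = Tuple[str, int]  # (chromosome, start coordinate of a 200-bp window)


def state_class(state: int) -> Optional[str]:
    """Map a ChromHMM state number to one of the four broad sets."""
    if 1 <= state <= 4:
        return "promoter"
    if 5 <= state <= 11:
        return "enhancer"
    if 12 <= state <= 13:
        return "insulator"
    if 14 <= state <= 19:
        return "transcribed"
    return None  # assumption: any other state belongs to none of the four sets


def union_across_cell_types(
    segmentations: Dict[str, Dict[Window, int]]
) -> Dict[str, Set[Window]]:
    """For each set, collect every window assigned to it in at least one cell type."""
    classes: Dict[str, Set[Window]] = {
        name: set() for name in ("promoter", "enhancer", "insulator", "transcribed")
    }
    for windows in segmentations.values():
        for window, state in windows.items():
            name = state_class(state)
            if name is not None:
                classes[name].add(window)
    return classes


# Toy segmentation for two hypothetical cell types.
segmentations = {
    "cell_A": {("chr1", 0): 2, ("chr1", 200): 7, ("chr1", 400): 21},
    "cell_B": {("chr1", 0): 15, ("chr1", 200): 7, ("chr1", 400): 12},
}
classes = union_across_cell_types(segmentations)
```

Because the union is taken per annotation, a window can belong to more than one set, as window ("chr1", 0) does here.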

New transcript annotations

RNA contigs from the ENCODE data were thresholded with an IDR of less than 10% (5) and were split as follows: novel intronic RNA contigs were selected from each experiment as contigs that entirely overlap an intron and have no overlap with an exon or any base within 2 kb of a TSS, and novel intergenic RNA contigs were selected from each experiment as contigs that are entirely annotated as intergenic and have no overlap with any base within 2 kb of a TSS. Novel intergenic RNAs were then split based on whether they were polyadenylated.

Mammalian-conserved regions

Mammalian-conserved regions were defined within EPO blocks based on SiPhy-ω, which identifies regions of rejected substitutions at 12-mer resolution and an FDR of 10% (2, 35). Regions called as non-conserved were nevertheless required to be in the EPO alignment. The human-macaque genomic alignment (36) was obtained from the UCSC Genome Browser. Within regions, divergence was counted as the ratio of mismatches to the total number of alignable bases; positions with an “N” in either genome were not counted.
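The divergence calculation can be illustrated with a short sketch operating on two aligned sequences of equal length. The function name and the handling of gap characters (treated as non-alignable, like “N”) are assumptions of this example; the text above specifies only that positions with an “N” are not counted.

```python
def divergence(human: str, macaque: str) -> float:
    """Ratio of mismatches to alignable bases for two aligned sequences."""
    if len(human) != len(macaque):
        raise ValueError("aligned sequences must have the same length")
    mismatches = 0
    alignable = 0
    for h, m in zip(human.upper(), macaque.upper()):
        # Skip positions with an N in either genome; gaps are skipped as well
        # (the gap rule is an assumption of this sketch).
        if h in "N-" or m in "N-":
            continue
        alignable += 1
        if h != m:
            mismatches += 1
    if alignable == 0:
        return float("nan")
    return mismatches / alignable


# Toy alignment: 8 alignable positions, 2 mismatches.
example = divergence("ACGTNACGT", "ACGANACTT")
```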

Human diversity estimates

Pilot-phase data for the 60 Yoruba individuals from Ibadan, Nigeria (YRI population) from the 1000 Genomes Project were obtained in Variant Call Format (VCF). Analysis was restricted to this population because it provides the highest diversity and lowest LD (37), thus increasing power while minimizing the influence of population bottlenecks, admixture, or population substructure. We then used three metrics of human variation associated with purifying selection (38): SNP density, heterozygosity (estimated as π = 2pq, where p and q are the population allele frequencies estimated from the 120 chromosomes observed in the sample), and derived allele frequency (DAF; the ancestral allele was chosen as defined by the authors; in cases where an ancestral allele was not called, the SNP was included in our heterozygosity analysis but not in our DAF analysis).
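As a worked illustration of the three metrics, the sketch below computes SNP density, heterozygosity, and mean DAF for a toy region. The `Snp` record and its field names are hypothetical. Normalizing the summed 2pq by the region length, so that heterozygosity is expressed per base, is an assumption chosen to match the units reported in Tables S3 and S4.

```python
import math
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Snp:
    alt_count: int  # copies of the non-reference allele among sampled chromosomes
    n_chrom: int = 120  # 60 diploid individuals
    ancestral_is_ref: Optional[bool] = None  # None when no ancestral allele is called


def heterozygosity(snp: Snp) -> float:
    """pi = 2pq for a single biallelic site."""
    p = snp.alt_count / snp.n_chrom
    return 2.0 * p * (1.0 - p)


def derived_allele_frequency(snp: Snp) -> Optional[float]:
    """Frequency of the derived allele, or None if the ancestral state is unknown."""
    if snp.ancestral_is_ref is None:
        return None
    p = snp.alt_count / snp.n_chrom
    return p if snp.ancestral_is_ref else 1.0 - p


def summarize(snps: List[Snp], region_bp: int):
    """Return (SNPs per kb, heterozygosity per bp, mean DAF) for one region."""
    density_per_kb = 1000.0 * len(snps) / region_bp
    het_per_bp = sum(heterozygosity(s) for s in snps) / region_bp
    dafs = [d for d in map(derived_allele_frequency, snps) if d is not None]
    mean_daf = sum(dafs) / len(dafs) if dafs else math.nan
    return density_per_kb, het_per_bp, mean_daf


# Toy region of 1,000 bp with three SNPs; the third has no ancestral allele call,
# so it contributes to density and heterozygosity but not to DAF.
snps = [Snp(12, ancestral_is_ref=True), Snp(60, ancestral_is_ref=False), Snp(30)]
density, het, daf = summarize(snps, region_bp=1000)
```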

Bootstrapping procedure

Because of the LD structure of the genome, DAF values at adjacent SNPs are not independent, which will tend to exaggerate the statistical significance of the difference in DAF between SNPs in a segment of the genome (such as those defined by ENCODE) and the rest of the genome. Additionally, variation in local mutation rate causes interdependencies in SNP density and heterozygosity. Therefore, a bootstrapping procedure was developed to determine the distribution of SNP density, heterozygosity, and DAF over a set of intervals that would be expected if the entire ensemble of intervals were rotated relative to its actual genomic position. For each feature-background comparison, a single background space was constructed by masking the genome appropriately and concatenating all of the resulting segments across all autosomes. Features and SNPs were then mapped to this space. For each iteration of bootstrapping (10,000 for all tests except for GO/KEGG/Reactome categories, for which an initial round of 1,000 was used for FDR thresholding and a second round of 10,000 was used to calculate final p-values), the feature intervals were all shifted by the same random number of bases within the background space. The resulting null distribution was observed to be normally distributed for all three metrics, allowing a Z-score and p-value to be assigned to the initially observed difference in means.
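The rotation procedure can be sketched on a toy background space as follows. This is an illustration under stated assumptions rather than the authors' implementation: the wrap-around of intervals at the end of the space, the two-sided normal p-value, the function names, and the simulated data are all choices made for the example, and only 200 iterations are run to keep it fast.

```python
import bisect
import math
import random
import statistics
from typing import List, Sequence, Tuple


def mean_difference(snps: Sequence[Tuple[int, float]], starts: List[int],
                    ends: List[int], offset: int, length: int) -> float:
    """Mean SNP value inside the shifted features minus the mean outside them."""
    inside: List[float] = []
    outside: List[float] = []
    for pos, value in snps:
        # Shifting every feature by +offset (with wrap-around) is equivalent to
        # looking up the SNP at (pos - offset) in the unshifted features.
        original = (pos - offset) % length
        i = bisect.bisect_right(starts, original) - 1
        hit = i >= 0 and original < ends[i]
        (inside if hit else outside).append(value)
    return statistics.fmean(inside) - statistics.fmean(outside)


def rotation_test(snps, features, length: int, iterations: int, seed: int = 0):
    """Return (observed difference, Z-score, two-sided normal p-value)."""
    rng = random.Random(seed)
    features = sorted(features)  # assumed disjoint
    starts = [s for s, _ in features]
    ends = [e for _, e in features]
    observed = mean_difference(snps, starts, ends, 0, length)
    null = [mean_difference(snps, starts, ends, rng.randrange(length), length)
            for _ in range(iterations)]
    z = (observed - statistics.fmean(null)) / statistics.stdev(null)
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return observed, z, p


# Toy background space: 50 short features, with lower simulated DAF inside them.
length = 20_000
features = [(i * 400 + (i * i * 37) % 300, i * 400 + (i * i * 37) % 300 + 40)
            for i in range(50)]
data_rng = random.Random(1)
snps = []
for pos in range(0, length, 4):
    in_feature = any(s <= pos < e for s, e in features)
    mean = 0.15 if in_feature else 0.21
    snps.append((pos, min(1.0, max(0.0, data_rng.gauss(mean, 0.05)))))

observed, z, p = rotation_test(snps, features, length, iterations=200)
```

Shifting all intervals by one common offset preserves their spacing and size distribution, so the null retains the clustering of the real features while breaking their association with the SNP values.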

Regulatory motif annotations

A joint annotation of position weight matrices (PWMs) and assayed transcription factors (TFs) into families was used for the TF binding site analysis (5). A representative PWM for each TF family was selected based on which showed the most significant enrichment for any of the ChIP-seq experiments performed within that family (possibly across a variety of assayed paralogous proteins, cell types, and experimental conditions). Genomewide matches to the PWM were then defined as described previously (2), and each set of matches was filtered based on whether it was ever bound in one of the TF family’s experiments, or never bound.

Gene ontology annotations

Gene Ontology (GO) annotations for all human genes were obtained as a table using the ENSEMBL BioMart tool (39). KEGG (40) and Reactome (41) annotations were obtained from MSigDB (42). Categories containing fewer than five ENSEMBL genes were discarded. Enhancers from the ChromHMM segmentation were then aggregated over the six Tier 1 and Tier 2 cell lines and classified according to the GO category (or categories) of their nearest ENSEMBL gene. SNP density, heterozygosity, and DAF were calculated at non-conserved enhancers grouped by the GO categories of their target genes, and bootstrap p-values were calculated for all three metrics. Categories were retained as significant if their heterozygosity and DAF bootstrap p-values both passed a false discovery rate threshold of q=0.05.
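The requirement that both p-values pass an FDR threshold can be sketched as below. The text does not state which FDR procedure was used; this example assumes the Benjamini-Hochberg adjustment applied separately to each metric, and the category names and p-values are invented for illustration.

```python
from typing import Dict, List


def benjamini_hochberg(pvalues: List[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    qvalues = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the q-values.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        qvalues[i] = running_min
    return qvalues


def significant_categories(p_het: Dict[str, float], p_daf: Dict[str, float],
                           q_threshold: float = 0.05) -> List[str]:
    """Keep categories whose heterozygosity and DAF q-values both pass."""
    names = sorted(p_het)
    q_het = benjamini_hochberg([p_het[name] for name in names])
    q_daf = benjamini_hochberg([p_daf[name] for name in names])
    return [name for name, qh, qd in zip(names, q_het, q_daf)
            if qh <= q_threshold and qd <= q_threshold]


# Hypothetical bootstrap p-values for four categories.
p_het = {"A": 0.001, "B": 0.002, "C": 0.04, "D": 0.5}
p_daf = {"A": 0.001, "B": 0.2, "C": 0.003, "D": 0.6}
kept = significant_categories(p_het, p_daf)
```

In this toy input only category A passes on both metrics: B fails on DAF, and C fails on heterozygosity after adjustment.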

Background selection

Background selection leads to locally reduced diversity near strongly selected elements, and since ENCODE active elements are biased to be close to such elements, it is a potential confounder. For the background selection analysis, the HapMap genetic map calculated on the hg19 genome was used to assign a genetic coordinate along each chromosome (in cM) to each feature boundary and each SNP. SNPs were then binned according to their genetic distance to exons, using ten bins equally spaced between 0 and 0.1 cM. SNPs were also binned using expected background selection values, B, from (23). Heterozygosity calculations were then performed as described previously for ENCODE feature and background sets, restricted to each bin. Genomic regions in the bin corresponding to values of B between 0.12 and 0.14 showed abnormally high heterozygosity. The result was found to be driven by a single outlier region (11q11.2) which contains four genes, all encoding olfactory receptors. After excluding this 5.3 Mb region, heterozygosity in this bin is consistent with the genomewide relationship between B and diversity (Figure 2B). We confirmed that heterozygosity is consistently depleted at ENCODE elements within these bins.
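The distance binning can be illustrated with a short sketch that assigns a SNP to one of ten equal-width bins by its genetic distance (in cM) to the nearest exon boundary. Representing exons by a sorted list of boundary coordinates, and returning no bin for SNPs at 0.1 cM or farther, are assumptions of this example; the coordinates are invented.

```python
import bisect
from typing import List, Optional


def nearest_distance(pos_cm: float, boundaries_cm: List[float]) -> float:
    """Genetic distance from a position to the closest exon boundary (sorted input)."""
    i = bisect.bisect_left(boundaries_cm, pos_cm)
    candidates = []
    if i < len(boundaries_cm):
        candidates.append(boundaries_cm[i] - pos_cm)
    if i > 0:
        candidates.append(pos_cm - boundaries_cm[i - 1])
    return min(candidates)


def distance_bin(distance_cm: float, n_bins: int = 10,
                 max_cm: float = 0.1) -> Optional[int]:
    """Index (0-based) of the equal-width bin on [0, max_cm), or None if outside."""
    if distance_cm < 0 or distance_cm >= max_cm:
        return None
    return int(distance_cm / (max_cm / n_bins))


# Hypothetical exon boundaries and SNP positions on one chromosome, in cM.
boundaries = [1.00, 1.20]
bin_near = distance_bin(nearest_distance(1.025, boundaries))  # about 0.025 cM away
bin_mid = distance_bin(nearest_distance(1.155, boundaries))   # about 0.045 cM away
bin_far = distance_bin(nearest_distance(0.85, boundaries))    # about 0.15 cM away
```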

Biased gene conversion

Biased gene conversion favors the fixation of strong (CG) over weak (AT) alleles in strong/weak polymorphic sites, and can mimic selection (43). To isolate sites immune to biased gene conversion, SNPs for the DAF analysis were discarded unless they were weak-weak (A-T) or strong-strong (C-G), retaining 833,979 sites (18%), and the bootstrapping analysis was then performed as described previously.
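The site filter described above amounts to keeping only SNPs whose two alleles fall in the same class. A minimal sketch, with a hypothetical function name:

```python
WEAK = {"A", "T"}
STRONG = {"C", "G"}


def is_bgc_neutral(allele1: str, allele2: str) -> bool:
    """True for weak-weak (A/T) or strong-strong (C/G) SNPs."""
    pair = {allele1.upper(), allele2.upper()}
    return len(pair) == 2 and (pair <= WEAK or pair <= STRONG)


# Toy list of (reference, alternate) allele pairs.
snps = [("A", "T"), ("C", "G"), ("A", "G"), ("C", "T"), ("g", "c")]
retained = [s for s in snps if is_bgc_neutral(*s)]
```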

Non-reference allele mapping

A third potential confounder is a decrease in measured biochemical activity for non-reference alleles, due to a bias towards mapping to the reference allele. For the read mapping bias analysis, only ENCODE features and chromatin segmentations derived from experiments on the GM12878 lymphoblastoid cell line were used. Then, the NA12878 variant calls from the 1000 Genomes trio pilot project were used to restrict the DAF analysis to the 3.5M SNPs (77%) that are homozygous for the reference allele in the GM12878 cell line. Thus, only SNPs whose non-reference allele is not actually present in the cell being assayed were considered.

Proportion under constraint

To estimate the proportion of the human genome under constraint (PUC), every nucleotide was assigned to one of ten bins of expected background constraint B as described above, and within each bin the SNP density, heterozygosity, and DAF of the feature being tested were scaled between a value of zero and unity, with zero (the most constrained) defined as the constraint value of non-degenerate coding conserved bases, and unity (the least constrained) defined as non-ENCODE non-conserved regions. This scaled constraint value within each bin was then multiplied by the number of nucleotides covered by the feature in each bin, and this product was summed across bins, providing a total number of bases under constraint. This was then compared to the total overall coverage of the feature to arrive at an overall PUC for each class of elements. A confidence interval was calculated for each PUC by generating three sets of 1000 random normally distributed constraint values within each bin (one for the test set and one each for the two references), with a standard deviation equal to the standard error of the mean constraint value. The PUC calculation was then performed using each of these 1000 trials, and quantiles in the resulting PUC distribution were used to report a 95% confidence interval.
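The PUC calculation and its resampled confidence interval can be sketched as follows. Because the scale runs from zero (most constrained) to unity (least constrained), this sketch takes one minus the scaled value as the constrained fraction of each bin; that reading, the function names, and all numbers are assumptions made for illustration, and only two bins are used instead of ten.

```python
import random
from typing import List, Tuple

# One entry per bin: (test value, its SE, constrained-reference value, its SE,
# unconstrained-reference value, its SE, bases covered by the feature).
Bin = Tuple[float, float, float, float, float, float, int]


def puc(values: List[Tuple[float, float, float, int]]) -> float:
    """Proportion under constraint from (test, constrained ref, neutral ref, bases)."""
    total_bases = sum(n for _, _, _, n in values)
    constrained_bases = 0.0
    for test, constrained_ref, neutral_ref, n in values:
        scaled = (test - constrained_ref) / (neutral_ref - constrained_ref)
        constrained_bases += (1.0 - scaled) * n  # assumption: complement of the scale
    return constrained_bases / total_bases


def puc_interval(bins: List[Bin], trials: int = 1000,
                 seed: int = 0) -> Tuple[float, float]:
    """95% interval from normally distributed draws around each mean (SD = SE)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        draw = [(rng.gauss(t, t_se), rng.gauss(c, c_se), rng.gauss(u, u_se), n)
                for t, t_se, c, c_se, u, u_se, n in bins]
        estimates.append(puc(draw))
    estimates.sort()
    return estimates[int(0.025 * trials)], estimates[int(0.975 * trials) - 1]


# Two hypothetical bins of a heterozygosity-like metric.
bins = [(5.0, 0.05, 1.0, 0.05, 7.0, 0.05, 1000),
        (6.0, 0.05, 2.0, 0.05, 7.0, 0.05, 3000)]
point = puc([(t, c, u, n) for t, _, c, _, u, _, n in bins])
low, high = puc_interval(bins)
```

In the first bin the feature sits two thirds of the way from the constrained to the unconstrained reference, so one third of its 1,000 bases count as constrained; the second bin contributes one fifth of 3,000 bases.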

Fig. S1.

Comparison of (A) SNP density and (B) DAF for ENCODE-annotated elements within

and outside mammalian-conserved regions, as in Fig. 1B.

Fig. S2.

Distribution of mammalian conservation values by SiPhy in the genomic subsets

shown in Figure 1.

Fig. S3

SNP density in the unconserved genome at variously annotated features. As in Fig. 2, the

histograms represent the distribution of background values from the bootstrap procedure,

and Z-scores report the difference between the tested feature (shown as a vertical dash)

and the background region in units of the standard deviation of the values obtained from

the bootstrap procedure.

Fig. S4

Heterozygosity in the unconserved genome at variously annotated features. As in Fig. 2,

the histograms represent the distribution of background values from the bootstrap

procedure, and Z-scores report the difference between the tested feature (shown as a

vertical dash) and the background region in units of the standard deviation of the values

obtained from the bootstrap procedure.

Fig. S5

Derived allele frequency in the unconserved genome at ENCODE feature classes,

conditioning on whether regions are annotated as intronic or intergenic. As in Fig. 2, the

histograms represent the distribution of background values from the bootstrap procedure,

and Z-scores report the difference between the tested feature (shown as a vertical dash)

and the background region in units of the standard deviation of the values obtained from

the bootstrap procedure.

Fig. S6

Mean heterozygosity and human-macaque divergence in the genomic subsets shown in

Figure 1.

Fig. S7

Pathway analysis of unconserved enhancers near genes involved with nerve growth factor

signaling (by three pathway annotations). Only genes with at least 30 kb of neighboring

enhancer sequence are included. Genes for which enhancers have a heterozygosity below

the unconserved unannotated genome background of 6.22 × 10⁻⁴ are listed with blue

labels, and those with higher heterozygosity are listed with red labels. Bootstrap results

for all significant categories are shown in Table S5.

Fig. S8

Empirical cumulative distribution function of heterozygosity at individual elements and at 1,000 samples each of 10 to 100,000 elements (in multiples of 10) sampled at a time from two populations: (red) DNase hypersensitive sites located in TSS-distal intergenic regions with background selection values B < 0.1, and (black) matched control non-ENCODE regions.

Fig. S9

The effect of background selection on SNP density, heterozygosity, DAF, and divergence

at reference minimum and maximum constraint regions used for the PUC estimates

(conserved non-degenerate coding, and unconserved non-ENCODE).

Fig. S10

Partitioning of human-constrained bases (nucleotides under constraint, NUC) between the

conserved and unconserved genome at annotated exons and DNase I hypersensitive sites.

Table S1.

Human selection on SNPs not prone to biased gene conversion.

Feature                                     Weak-weak or strong-strong SNPs   DAF     p (DAF)
Genome (TSS-distal, nonexonic)              833979                            0.211
ENCODE-annotated (TSS-distal, nonexonic)    543065                            0.207   5.10E-43
Non-ENCODE                                  290914                            0.218
Active chromatin (TSS-distal, nonexonic)    270959                            0.203   2.20E-43
Inactive chromatin                          563020                            0.215

Table S2.

Human selection on features observed solely in GM12878, on derived alleles not present

in that individual.

Feature                                     non-NA12878 SNPs   DAF     p (DAF)
Genome (TSS-distal, nonexonic)              3509971            0.17
ENCODE-annotated (TSS-distal, nonexonic)    1417654            0.164   5.10E-49
Non-ENCODE                                  2092317            0.174
Active chromatin (TSS-distal, nonexonic)    454740             0.158   1.20E-42
Inactive chromatin                          3055231            0.171

Table S3

Human selection in the unconserved genome.

Feature bp SNPs dens (kb⁻¹) p (density) het (π × 10⁻⁴) p (het) DAF p (DAF)

Mappable genome 1692105614 5059631 2.99 6.13 0.208

Genome (TSS-distal, nonexonic) 1500452776 4539733 3.03 2.20E-58 6.22 1.90E-59 0.209 3.70E-22

Intergenic (TSS-distal) 775609576 2454978 3.17 6.10E-63 6.58 1.80E-69 0.212 1.80E-37

Intronic (TSS-distal) 724843200 2084755 2.88 7.70E-35 5.82 5.60E-40 0.204 9.70E-28

Non-degenerate coding 3282088 5959 1.82 2.20E-130 3.47 8.80E-109 0.187 8.10E-15

Annotated CDS 5076572 10082 1.99 4.40E-113 3.88 1.30E-93 0.195 6.60E-09

Annotated UTR 20114137 50365 2.5 6.30E-77 4.94 9.00E-72 0.196 1.40E-22

TSS 172497879 471430 2.73 4.00E-44 5.5 1.20E-45 0.202 1.80E-21

Intron 724843200 2084755 2.88 7.70E-35 5.82 5.60E-40 0.204 9.70E-28

Four-fold degenerate coding 875586 1758 2.01 2.80E-45 4.07 1.60E-31 0.206 0.36

Repeats 679717150 2093669 3.08 5.30E-179 6.34 1.30E-157 0.209 2.70E-14

Intergenic 775609576 2454978 3.17 6.10E-63 6.58 1.80E-69 0.212 1.80E-37

ENCODE-annotated (TSS-distal, nonexonic) 1034532977 3024992 2.92 3.00E-64 5.93 7.20E-85 0.205 1.70E-79

Bound motifs 2676169 6976 2.61 6.80E-31 5.04 8.20E-37 0.193 5.10E-09

DNAse I hypersensitive 310016583 884421 2.85 4.20E-38 5.71 8.30E-58 0.202 9.30E-63

FAIRE 238662992 675064 2.83 3.40E-22 5.67 1.30E-30 0.203 2.70E-27

Long RNA 881679464 2552187 2.89 2.20E-65 5.86 1.70E-78 0.204 2.00E-67

Protein bound by ChIP 112679654 323954 2.87 5.50E-33 5.77 7.60E-48 0.203 2.60E-29

Short RNA 1892306 5033 2.66 2.40E-15 5.2 2.30E-17 0.193 2.70E-06

Novel long intergenic PolyA RNA 191506259 580177 3.03 4.80E-21 6.22 1.20E-22 0.208 3.00E-15

Novel long intergenic non-PolyA RNA 100922369 316303 3.13 0.014 6.48 0.0034 0.21 0.00034

Novel long intronic RNA 555945590 1562135 2.81 8.00E-34 5.63 4.70E-46 0.202 1.10E-40

Non-ENCODE 465919799 1514741 3.25 4.20E-65 6.84 1.00E-86 0.215 1.10E-65

Active chromatin (TSS-distal, nonexonic) 546833480 1538999 2.81 4.60E-57 5.64 3.00E-73 0.202 1.80E-81

Enhancer chromatin 285221744 818955 2.87 8.50E-25 5.79 5.50E-32 0.204 3.90E-30

Insulator chromatin 34518467 101998 2.95 2.50E-06 6.02 1.60E-07 0.207 0.019

Promoter chromatin 47213250 130661 2.77 1.20E-42 5.53 3.40E-47 0.202 3.90E-16

Transcribed chromatin 301694820 812255 2.69 3.80E-85 5.31 1.00E-103 0.197 3.00E-102

Inactive chromatin 953619296 3000734 3.15 1.80E-57 6.55 4.40E-74 0.212 9.60E-67

Table S4

Human selection in the conserved genome.

Feature bp SNPs dens (kb⁻¹) p (density) het (π × 10⁻⁴) p (het) DAF p (DAF)

Mappable genome 115312286 206185 1.79 3.35 0.179

Genome (TSS-distal, nonexonic) 67141385 147277 2.19 0 4.19 9.30E-298 0.184 2.54E-15

Intergenic (TSS-distal) 32704673 75630 2.31 3.00E-137 4.47 2.80E-128 0.187 1.05E-12

Intronic (TSS-distal) 34436712 71647 2.08 3.10E-44 3.92 3.80E-31 0.181 0.441

Non-degenerate coding 10885201 6992 0.642 0 1.03 0 0.147 5.41E-98

Annotated CDS 16601142 16879 1.02 0 1.79 0 0.167 3.10E-25

Annotated UTR 9078363 11872 1.31 5.90E-190 2.27 6.10E-174 0.163 7.51E-23

TSS 17757580 25938 1.46 1.20E-111 2.63 1.20E-105 0.171 6.82E-16

Intron 34436712 71647 2.08 3.10E-44 3.92 3.80E-31 0.181 0.441

Four-fold degenerate coding 2551676 4404 1.73 2.50E-09 3.25 1.40E-06 0.186 0.0774

Repeats 5948434 12880 2.17 7.00E-39 4.08 3.40E-25 0.178 0.0983

Intergenic 32704673 75630 2.31 3.00E-137 4.47 2.80E-128 0.187 1.05E-12

ENCODE-annotated (TSS-distal, nonexonic) 51655262 109738 2.12 2.10E-41 4 3.50E-52 0.181 5.04E-18

Bound motifs 556844 963 1.73 1.60E-13 2.92 2.70E-16 0.166 0.00321

DNAse I hypersensitive 24658238 50270 2.04 5.00E-50 3.76 2.00E-67 0.176 1.11E-19

FAIRE 15761407 31746 2.01 3.80E-31 3.7 8.30E-41 0.177 7.36E-10

Long RNA 41538141 87166 2.1 6.60E-34 3.95 3.50E-37 0.18 2.47E-13

Protein bound by ChIP 10776655 21797 2.02 2.10E-27 3.72 1.50E-33 0.177 7.89E-08

Short RNA 158159 278 1.76 0.00015 3.17 0.00031 0.162 0.0425

Novel long intergenic PolyA RNA 7976838 17274 2.17 4.80E-13 4.12 2.40E-12 0.183 0.00562

Novel long intergenic non-PolyA RNA 4806880 11288 2.35 0.068 4.5 0.27 0.182 0.0117

Novel long intronic RNA 25745396 53444 2.08 0.24 3.89 0.015 0.179 0.00336

Non-ENCODE 15486123 37539 2.42 9.00E-43 4.79 6.50E-53 0.193 3.09E-14

Active chromatin (TSS-distal, nonexonic) 28105340 55850 1.99 4.50E-59 3.67 1.80E-67 0.176 2.81E-24

Enhancer chromatin 17751923 35982 2.03 2.80E-31 3.73 9.60E-41 0.176 5.94E-15

Insulator chromatin 1836220 3800 2.07 0.00033 3.86 0.00015 0.183 0.43

Promoter chromatin 3703789 6962 1.88 3.10E-27 3.43 8.80E-27 0.172 1.34E-06

Transcribed chromatin 11870117 22003 1.85 1.10E-55 3.37 9.50E-58 0.172 2.33E-16

Inactive chromatin 39036045 91427 2.34 9.70E-62 4.56 7.60E-70 0.189 2.58E-18

Table S5

Human selection in unconserved enhancers associated with gene sets.

GO category of nearest genes Enhancer bp SNPs density p (density) het p (het) DAF p (DAF)

retinal cone cell development 73555 136 1.85 0.0051 2.46 0.00069 0.124 7.70E-05

transcription initiation from RNA polymerase II promoter 717810 1661 2.31 9.40E-05 4.19 2.40E-05 0.179 0.00052

fibroblast growth factor receptor signaling pathway 1847741 4720 2.55 2.00E-04 4.89 6.90E-05 0.187 1.00E-04

phosphatidylinositol-mediated signaling 1749490 4449 2.54 0.00037 4.84 8.80E-05 0.189 0.00052

potassium ion transmembrane transport 1551767 4067 2.62 0.0019 5.07 0.00071 0.189 0.00059

voltage-gated potassium channel activity 1856383 4782 2.58 0.00015 5.1 2.00E-04 0.19 0.00023

insulin receptor signaling pathway 2888724 7650 2.65 0.00025 5.25 0.00031 0.191 0.00013

DNA repair 3366924 8926 2.65 2.00E-04 5.22 0.00012 0.193 0.00014

transcription factor binding 6670091 17617 2.64 6.60E-07 5.08 1.10E-08 0.193 6.00E-07

negative regulation of transcription, DNA-dependent 10946505 29687 2.71 3.10E-07 5.29 2.60E-09 0.194 1.30E-09

chromatin binding 4652818 11817 2.54 4.40E-07 4.9 5.60E-08 0.194 8.10E-05

negative regulation of transcription from RNA polymerase II promoter 10305800 27903 2.71 9.30E-07 5.28 1.00E-08 0.194 1.90E-08

mitotic cell cycle 3142263 8325 2.65 2.80E-05 5.12 3.20E-06 0.195 0.00048

transcription factor complex 6332741 16852 2.66 5.40E-06 5.21 7.00E-07 0.195 1.70E-05

transcription coactivator activity 4943837 13029 2.64 1.10E-05 5.12 1.20E-06 0.195 0.00017

protein kinase activity 12795332 35155 2.75 5.00E-08 5.47 1.20E-08 0.195 3.40E-10

in utero embryonic development 5609697 15414 2.75 0.00053 5.38 5.80E-05 0.195 5.10E-05

protein tyrosine kinase activity 11946271 32789 2.74 1.40E-07 5.46 3.60E-08 0.195 3.40E-09

protein serine/threonine kinase activity 12656016 34778 2.75 4.70E-08 5.46 7.40E-09 0.195 1.30E-09

nerve growth factor receptor signaling pathway 6111437 16512 2.7 7.50E-05 5.34 2.80E-05 0.195 5.40E-05

signal transducer activity 6688041 18137 2.71 8.70E-06 5.32 8.40E-07 0.195 8.10E-06

kinase activity 4645863 12901 2.78 0.0027 5.44 0.00036 0.195 0.00022

protein complex 5190432 14292 2.75 0.00035 5.39 3.40E-05 0.196 0.00011

protein phosphorylation 14162029 39092 2.76 4.10E-08 5.49 6.80E-09 0.196 9.00E-10

protein kinase binding 4895492 13275 2.71 0.00047 5.33 1.00E-04 0.196 0.00051

positive regulation of transcription, DNA-dependent 13920086 37756 2.71 1.10E-08 5.36 6.70E-10 0.197 4.00E-08

transcription from RNA polymerase II promoter 5003073 13335 2.67 1.30E-05 5.25 4.60E-06 0.197 0.00029

negative regulation of apoptosis 5705645 15766 2.76 0.00062 5.53 5.00E-04 0.197 0.00045

protein heterodimerization activity 6512788 17785 2.73 2.50E-05 5.4 5.80E-06 0.197 0.00016

regulation of transcription from RNA polymerase II promoter 5544704 15048 2.71 4.90E-05 5.26 2.00E-06 0.197 0.00047

nucleoplasm 12471336 32833 2.63 1.30E-12 5.17 1.50E-13 0.198 6.80E-07

positive regulation of cell proliferation 8010653 22721 2.84 0.003 5.6 0.00024 0.198 4.20E-05

cell proliferation 6004861 16670 2.78 0.00033 5.42 1.20E-05 0.198 0.00037

apoptosis 11882831 33339 2.81 1.70E-05 5.54 5.10E-07 0.198 3.40E-06

DNA binding 29156331 79806 2.74 7.40E-14 5.42 2.80E-16 0.198 2.70E-12

zinc ion binding 28823266 79223 2.75 8.00E-15 5.44 1.10E-17 0.198 1.40E-12

ATP binding 27325112 75898 2.78 4.10E-12 5.55 3.00E-13 0.198 2.90E-12

sequence-specific DNA binding 12483858 34287 2.75 1.60E-06 5.44 1.20E-07 0.199 1.10E-05

regulation of transcription, DNA-dependent 34108182 92840 2.72 1.80E-16 5.4 7.10E-19 0.199 8.90E-12

transcription, DNA-dependent 14784017 40164 2.72 1.50E-09 5.34 1.40E-11 0.199 2.90E-06

positive regulation of transcription from RNA polymerase II promoter 14530520 40366 2.78 1.20E-06 5.47 1.20E-08 0.199 7.70E-06

sequence-specific DNA binding transcription factor activity 19434431 53728 2.76 8.00E-09 5.46 7.50E-11 0.2 7.30E-07

nucleus 80575833 223497 2.77 1.70E-23 5.52 1.00E-28 0.2 3.70E-22

nucleotide binding 34315822 95814 2.79 3.40E-12 5.63 5.70E-12 0.2 8.70E-11

metal ion binding 45272248 125526 2.77 2.00E-18 5.54 4.50E-20 0.2 1.50E-12

nucleic acid binding 12792116 35143 2.75 1.20E-08 5.49 6.10E-09 0.201 0.00011

signal transduction 29647443 83795 2.83 1.30E-08 5.66 9.20E-10 0.201 1.10E-07

intracellular 34236666 96086 2.81 6.00E-10 5.62 1.80E-11 0.201 3.20E-08

Golgi apparatus 17726613 50369 2.84 9.20E-06 5.72 3.60E-06 0.201 3.70E-05

hydrolase activity 15678281 43183 2.75 7.70E-09 5.49 1.50E-09 0.201 1.00E-04

cytoplasm 92222808 260002 2.82 1.50E-19 5.65 7.90E-24 0.202 5.30E-18

cytosol 36606393 101494 2.77 8.00E-14 5.57 3.90E-14 0.202 4.40E-08

protein binding 151008527 425254 2.82 2.80E-28 5.64 4.80E-36 0.202 1.70E-27

plasma membrane 66136117 192408 2.91 1.50E-05 5.9 2.70E-06 0.205 1.50E-05

membrane 77310603 225616 2.92 3.40E-06 5.92 2.30E-07 0.206 0.00026

Reactome category of nearest genes Enhancer bp SNPs density p (density) het p(het) DAF p (DAF)

MAPK targets - nuclear events mediated by MAP kinases 866717 2137 2.47 0.0042 4.53 0.0011 0.177 2.00E-04

TRKA signalling from the plasma membrane 2810831 7304 2.6 0.00012 5.01 2.50E-05 0.191 9.90E-05

signalling by NGF 5898144 15797 2.68 1.60E-05 5.26 4.00E-06 0.194 1.90E-05

signaling in immune system 5634277 15421 2.74 8.10E-05 5.43 4.00E-05 0.195 1.60E-05

Cell cycle - mitotic 3075972 7845 2.55 1.10E-07 4.95 6.80E-08 0.196 0.0014

Axon guidance 4478364 12325 2.75 0.00082 5.4 0.00014 0.198 0.0015

KEGG category of nearest genes Enhancer bp SNPs density p (density) het p(het) DAF p (DAF)

Prostate cancer 2015129 5043 2.5 1.40E-05 4.77 3.60E-06 0.185 9.50E-06

Progesterone mediated oocyte maturation 1571198 3966 2.52 8.70E-05 4.93 1.00E-04 0.185 6.00E-05

melanoma 1734357 4392 2.53 9.80E-05 4.83 3.10E-05 0.186 8.80E-05

glioma 1751888 4651 2.65 0.0044 5.07 0.00092 0.188 4.00E-04

neurotrophin signaling pathway 2957205 8134 2.75 0.0042 5.28 0.00025 0.188 4.90E-06

epithelial cell signaling in Helicobacter pylori infection 1198962 3205 2.67 0.0097 4.95 0.00066 0.188 0.0018

ERBB signaling pathway 2588388 7063 2.73 0.0041 5.26 0.00055 0.19 7.70E-05

colorectal cancer 1713597 4348 2.54 0.00027 4.96 0.00029 0.191 0.0017

cell cycle 1867497 4706 2.52 7.00E-05 4.82 3.30E-05 0.191 0.0015

MAPK signaling pathway 6543557 17341 2.65 2.70E-07 5.16 1.60E-08 0.191 6.30E-09

T cell receptor signaling pathway 2661933 6991 2.63 0.00024 5.13 0.00012 0.194 0.0014

endocytosis 4269724 11763 2.75 0.002 5.35 0.00016 0.197 0.0019

regulation of actin cytoskeleton 4479622 12040 2.69 2.20E-05 5.3 9.30E-06 0.197 0.0011

pathways in cancer 8827601 24242 2.75 7.60E-06 5.46 2.10E-06 0.2 0.0012

Table S6

Estimated proportion of bases under constraint (PUC) and total nucleotides under

constraint (NUC) in human and primates, by conservation and feature, corrected for

background selection.

References

1. E. S. Lander et al.; International Human Genome Sequencing Consortium, Initial sequencing and analysis of the human genome. Nature 409, 860 (2001). doi:10.1038/35057062

2. K. Lindblad-Toh et al.; Broad Institute Sequencing Platform and Whole Genome Assembly Team; Baylor College of Medicine Human Genome Sequencing Center Sequencing Team; Genome Institute at Washington University, A high-resolution map of human evolutionary constraint using 29 mammals. Nature 478, 476 (2011). doi:10.1038/nature10530

3. C. P. Ponting, R. C. Hardison, What fraction of the human genome is functional? Genome Res. 21, 1769 (2011). doi:10.1101/gr.116814.110

4. E. Birney et al.; ENCODE Project Consortium; NISC Comparative Sequencing Program; Baylor College of Medicine Human Genome Sequencing Center; Washington University Genome Sequencing Center; Broad Institute; Children’s Hospital Oakland Research Institute, Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447, 799 (2007). doi:10.1038/nature05874

5. The ENCODE Project Consortium, Nature, 5 September 2012. doi:10.1038/nature11247

6. J. Ernst et al., Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43 (2011). doi:10.1038/nature09906

7. M. R. Nelson et al., An abundance of rare functional variants in 202 drug target genes sequenced in 14,002 people. Science 337, 100 (2012). doi:10.1126/science.1217876

8. L. A. Hindorff et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U.S.A. 106, 9362 (2009). doi:10.1073/pnas.0903103106

9. C. B. Lowe et al., Three periods of regulatory innovation during vertebrate evolution. Science 333, 1019 (2011). doi:10.1126/science.1202702

10. D. Brawand et al., The evolution of gene expression levels in mammalian organs. Nature 478, 343 (2011). doi:10.1038/nature10532

11. D. Schmidt et al., Five-vertebrate ChIP-seq reveals the evolutionary dynamics of transcription factor binding. Science 328, 1036 (2010). doi:10.1126/science.1186176

12. 1000 Genomes Project Consortium, A map of human genome variation from population-scale sequencing. Nature 467, 1061 (2010).

13. S. R. Eddy, A model of the statistical power of comparative genome sequence analysis. PLoS Biol. 3, e10 (2005). doi:10.1371/journal.pbio.0030010

14. S. Asthana et al., Widely distributed noncoding purifying selection in the human genome. Proc. Natl. Acad. Sci. U.S.A. 104, 12410 (2007). doi:10.1073/pnas.0705140104

15. J. A. Drake et al., Conserved noncoding sequences are selectively constrained and not mutation cold spots. Nat. Genet. 38, 223 (2006). doi:10.1038/ng1710

16. D. G. Torgerson et al., Evolutionary processes acting on candidate cis-regulatory regions in humans inferred from patterns of polymorphism and divergence. PLoS Genet. 5, e1000592 (2009). doi:10.1371/journal.pgen.1000592

17. S. Katzman et al., Human genome ultraconserved elements are ultraselected. Science 317, 915 (2007). doi:10.1126/science.1142430

18. D. Lomelin, E. Jorgenson, N. Risch, Human genetic variation recognizes functional elements in noncoding sequence. Genome Res. 20, 311 (2010). doi:10.1101/gr.094151.109

19. X. J. Mu, Z. J. Lu, Y. Kong, H. Y. Lam, M. B. Gerstein, Analysis of genomic variation in non-coding elements using population-scale sequencing data from the 1000 Genomes Project. Nucleic Acids Res. 39, 7058 (2011). doi:10.1093/nar/gkr342

20. K. S. Pollard et al., An RNA gene expressed during cortical development evolved rapidly in humans. Nature 443, 167 (2006). doi:10.1038/nature05113

21. P. C. Sabeti et al., Positive natural selection in the human lineage. Science 312, 1614 (2006). doi:10.1126/science.1124309

22. Materials and methods are available as Supporting Online Materials on Science Online.

23. G. McVicker, D. Gordon, C. Davis, P. Green, Widespread genomic signatures of natural selection in hominid evolution. PLoS Genet. 5, e1000471 (2009). doi:10.1371/journal.pgen.1000471

24. J. Ernst, M. Kellis, Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat. Biotechnol. 28, 817 (2010). doi:10.1038/nbt.1662

25. G. Bejerano et al., A distal enhancer and an ultraconserved exon are derived from a novel retroposon. Nature 441, 87 (2006). doi:10.1038/nature04696

26. S. Dorus et al., Accelerated evolution of nervous system genes in the origin of Homo sapiens. Cell 119, 1027 (2004). doi:10.1016/j.cell.2004.11.040

27. G. H. Jacobs, The evolution of vertebrate color vision. Adv. Exp. Med. Biol. 739, 156 (2012). doi:10.1007/978-1-4614-1704-0_10

28. S. Meader, C. P. Ponting, G. Lunter, Massive turnover of functional sequence in human and other mammalian genomes. Genome Res. 20, 1335 (2010). doi:10.1101/gr.108795.110

29. T. S. Mikkelsen et al.; Broad Institute Genome Sequencing Platform; Broad Institute Whole Genome Assembly Team, Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature 447, 167 (2007). doi:10.1038/nature05805

30. X. Y. Li et al., Transcription factors bind thousands of active and inactive regions in the Drosophila blastoderm. PLoS Biol. 6, e27 (2008). doi:10.1371/journal.pbio.0060027

31. A. R. Quinlan, I. M. Hall, BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841 (2010). doi:10.1093/bioinformatics/btq033

32. J. Harrow et al., GENCODE: Producing a reference annotation for ENCODE. Genome Biol. 7, (Suppl 1), S4, 1 (2006). doi:10.1186/gb-2006-7-s1-s4

33. D. Karolchik et al., The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32, (Database issue), D493 (2004). doi:10.1093/nar/gkh103

34. B. Paten et al., Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome Res. 18, 1829 (2008). doi:10.1101/gr.076521.108

35. M. Garber et al., Identifying novel constrained elements by exploiting biased substitution patterns. Bioinformatics 25, i54 (2009). doi:10.1093/bioinformatics/btp190

36. R. A. Gibbs et al.; Rhesus Macaque Genome Sequencing and Analysis Consortium, Evolutionary and biomedical insights from the rhesus macaque genome. Science 316, 222 (2007). doi:10.1126/science.1139247

37. S. B. Gabriel et al., The structure of haplotype blocks in the human genome. Science 296, 2225 (2002). doi:10.1126/science.1069424

38. D. L. Hartl, A. G. Clark, Principles of Population Genetics (Sinauer Associates, Sunderland, Mass., ed. 4, 2007).

39. P. Flicek et al., Ensembl 2012. Nucleic Acids Res. 40, (Database issue), D84 (2012). doi:10.1093/nar/gkr991

40. M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, M. Tanabe, KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, (Database issue), D109 (2012). doi:10.1093/nar/gkr988

41. D. Croft et al., Reactome: A database of reactions, pathways and biological processes. Nucleic Acids Res. 39, (Database issue), D691 (2011). doi:10.1093/nar/gkq1018

42. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545 (2005). doi:10.1073/pnas.0506580102

43. J. Berglund, K. S. Pollard, M. T. Webster, Hotspots of biased nucleotide substitutions in human genes. PLoS Biol. 7, e26 (2009). doi:10.1371/journal.pbio.1000026