www.sciencemag.org/cgi/content/full/science.1225057/DC1
Supplementary Materials for
Evidence of Abundant Purifying Selection in Humans for
Recently Acquired Regulatory Functions
Lucas D. Ward and Manolis Kellis*
*To whom correspondence should be addressed. E-mail: [email protected]
Published 5 September 2012 on Science Express
DOI: 10.1126/science.1225057
This PDF file includes:
Materials and Methods
Figs. S1 to S10
Tables S1 to S6
References
Materials and Methods
Software
For all set operations on genomic intervals, the BEDTools package was used (31).
Gene annotations
The GENCODE v7 annotation (32) was obtained and exons were defined as
follows: features annotated as “CDS” were selected as protein-coding and all protein-
coding gene features annotated as “UTR” were selected as UTRs. Protein-coding and
non-coding genes were selected as genes, and the set difference between bases annotated
as being in a gene and bases annotated as being exonic were selected as intronic. Both
CAGE and non-CAGE TSS clusters from GENCODE were used, and regions within 2kb
were defined as TSS-proximal in order to be excluded from the analysis (324 Mb). The
genome was masked as follows: Autosomes from the hg19 version of the human genome
were selected, and the following regions were excluded: SimpleRepeat regions (124 Mb),
from the UCSC table browser (33); two ENCODE blacklist regions at which signal
artifacts are predicted, the DAC Blacklisted Regions and the Duke Excluded Regions
(together, 14 Mb) (5); regions not included in the EPO (Enredo, Pecan, Ortheus) multi-
species alignment from ENSEMBL (277 Mb) (34); regions to which 36-bp sequences
would not be mappable allowing at most one mismatch, using the CRG Alignability track
from the UCSC table browser (815 Mb); all CpG islands from the UCSC Table Browser
and any dinucleotide that is “CG” in either the reference genome or when mutated to a
1000 Genomes SNP observed in the YRI population (285 Mb); and any regions not
falling within a 1000 Genomes callable region (591 Mb).
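The set difference used above (for example, intronic bases as gene bases minus exonic bases, or the genome minus each masked track) can be illustrated with a small pure-Python sketch. The analysis itself used BEDTools; the functions `merge` and `subtract` and the coordinates below are hypothetical and only show the interval logic on a single chromosome with half-open coordinates.

```python
def merge(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def subtract(keep, remove):
    """Return the parts of `keep` that are not covered by `remove`."""
    remove = merge(remove)
    result = []
    for start, end in merge(keep):
        pos = start
        for r_start, r_end in remove:
            if r_end <= pos:
                continue          # removed interval lies entirely to the left
            if r_start >= end:
                break             # removed interval lies entirely to the right
            if r_start > pos:
                result.append((pos, r_start))
            pos = max(pos, r_end)
            if pos >= end:
                break
        if pos < end:
            result.append((pos, end))
    return result


# Hypothetical gene with three exons; introns are the gene minus its exons.
genes = [(100, 1000)]
exons = [(100, 200), (500, 600), (900, 1000)]
print(subtract(genes, exons))  # [(200, 500), (600, 900)]
```

The same `subtract` call, applied repeatedly, would mimic excluding each masked track in turn from the autosomal background.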
Transcription factor binding annotations
ChIP-seq peaks were defined using the SPP method (5) and peaks were chosen with
an irreproducible discovery rate (IDR) of less than 1%. DNase regions from both the
Duke and University of Washington groups were included, and peaks were called using a
uniform pipeline (5).
Chromatin state annotations
The ChromHMM segmentation of the ENCODE data (5) was used to define four
broad sets of functional elements across six cell lines: promoters (states 1-4), enhancers
(states 5-11), insulators (states 12 and 13), and transcribed regions (states 14-19). Each
200-bp window comprising the segmentation was annotated as being in one of these four
sets in each cell type, and the union was taken for each annotation across cell types.
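As a minimal sketch of the grouping just described, the state ranges can be mapped to the four classes and the per-class union taken across cell types. The state ranges come from the text; the function names, the dictionary layout, and the cell-type labels are assumptions made for illustration.

```python
def state_class(state):
    """Map a ChromHMM state number to one of the four broad classes."""
    if 1 <= state <= 4:
        return "promoter"
    if 5 <= state <= 11:
        return "enhancer"
    if 12 <= state <= 13:
        return "insulator"
    if 14 <= state <= 19:
        return "transcribed"
    return None  # any other state is left unclassified in this sketch


def union_across_cell_types(calls):
    """calls maps cell type -> {window start: state}.

    Returns a dict mapping each class to the set of 200-bp window starts
    that carry that class in at least one cell type.
    """
    union = {}
    for windows in calls.values():
        for start, state in windows.items():
            label = state_class(state)
            if label is not None:
                union.setdefault(label, set()).add(start)
    return union


# Hypothetical segmentation of three windows in two cell types.
calls = {
    "cell_type_A": {0: 2, 200: 7, 400: 15},
    "cell_type_B": {0: 2, 200: 13, 400: 20},
}
classes = union_across_cell_types(calls)
```

Because the union is taken per class, a window can belong to more than one class: here the window at 200 is an enhancer in one cell type and an insulator in the other, so it appears in both sets.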
New transcript annotations
RNA contigs from the ENCODE data were thresholded with an IDR of less than
10% (5) and were split as follows: novel intronic RNA contigs were selected from each
experiment by selecting contigs that entirely overlap an intron and have no overlap with
an exon or any base within 2kb of a TSS, and novel intergenic RNA contigs were
selected from each experiment by selecting contigs that are entirely annotated as
intergenic and have no overlap with any base within 2kb of a TSS. Novel intergenic
RNAs were then split based on whether they were polyadenylated.
Mammalian-conserved regions
Mammalian-conserved regions were defined within EPO blocks based on SiPhy-ω,
which identifies regions of rejected substitutions at 12-mer resolution and an FDR of
10% (2, 35). Regions called as non-conserved were nevertheless required to be in the
EPO alignment. The human-macaque genomic alignment (36) was obtained from the
UCSC Genome Browser. Within regions, divergence was counted as the ratio of
mismatches to the total number of alignable bases; positions with an “N” in either genome
were not counted.
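The divergence count described above can be sketched as follows for one pair of aligned sequences. The function name and the example sequences are hypothetical, and treating gap characters as non-alignable is an assumption, since the text only specifies how “N” positions are handled.

```python
def divergence(human, macaque):
    """Ratio of mismatches to alignable bases for two aligned sequences."""
    mismatches = 0
    alignable = 0
    for h, m in zip(human.upper(), macaque.upper()):
        if h == "N" or m == "N":
            continue  # positions with an N in either genome are not counted
        if h == "-" or m == "-":
            continue  # assumption: gapped columns are not alignable bases
        alignable += 1
        if h != m:
            mismatches += 1
    return mismatches / alignable if alignable else float("nan")


# Two mismatches over eight countable columns (the N column is skipped).
print(divergence("ACGTNACGT", "ACCTAACGA"))  # 0.25
```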
Human diversity estimates
Pilot-phase data from the 1000 Genomes Project for the 60 Yoruba individuals from
Ibadan, Nigeria (YRI population) were obtained in Variant Call
Format (VCF). Analysis was restricted to this population because it provides the highest
diversity and lowest LD (37), thus increasing power while minimizing the influence of
population bottlenecks, admixture, or population substructure. We then used three
metrics of human variation associated with purifying selection (38): SNP density,
heterozygosity (estimated as π = 2pq, where p and q are the population allele frequencies
estimated from the 120 chromosomes observed in the sample), and derived allele
frequency (DAF), for which the ancestral allele was chosen as defined by the authors; in
cases where an ancestral allele was not called, the SNP was included in our
heterozygosity analysis but not in our DAF analysis.
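A minimal sketch of the three metrics for one set of intervals is given below. The formula π = 2pq and the 120 sampled chromosomes come from the text; the function name, the input layout, and the normalization of heterozygosity by the number of bases covered (suggested by the per-base scale of the heterozygosity columns in Tables S3 and S4) are assumptions.

```python
def summarize(snps, region_bp, n_chrom=120):
    """Compute SNP density, heterozygosity, and mean DAF for one region set.

    snps: list of (alt_count, alt_is_derived) pairs, where alt_count is the
    number of sampled chromosomes carrying the alternate allele and
    alt_is_derived is True, False, or None when no ancestral allele is called.
    """
    het_sum = 0.0
    dafs = []
    for alt_count, alt_is_derived in snps:
        p = alt_count / n_chrom
        het_sum += 2 * p * (1 - p)          # pi = 2pq with q = 1 - p
        if alt_is_derived is not None:      # SNPs without ancestral call skip DAF
            dafs.append(p if alt_is_derived else 1 - p)
    return {
        "snp_density_per_kb": 1000 * len(snps) / region_bp,
        "heterozygosity_per_bp": het_sum / region_bp,  # assumed normalization
        "mean_daf": sum(dafs) / len(dafs) if dafs else float("nan"),
    }


# Hypothetical region of 1 kb containing three SNPs.
stats = summarize([(60, True), (30, False), (12, None)], region_bp=1000)
```

In this example the third SNP contributes to density and heterozygosity but not to the DAF mean, mirroring the rule for SNPs without a called ancestral allele.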
Bootstrapping procedure
Because of the LD structure of the genome, DAF values at adjacent SNPs are not independent,
which will tend to exaggerate the statistical significance of the difference in DAF
between SNPs in a segment of the genome (such as those defined by ENCODE) and the
rest of the genome. Additionally, variation in local mutation rate causes
interdependencies in SNP density and heterozygosity. Therefore, a bootstrapping
procedure was developed to determine the distribution of SNP density, heterozygosity,
and DAF considered over a set of intervals expected if the entire ensemble of intervals
was rotated relative to their actual genomic positions. For each feature-background
comparison, a single background space was constructed by masking the genome
appropriately and concatenating all of the resulting segments across all autosomes.
Features and SNPs were then mapped to this space. For each iteration of bootstrapping
(10,000 for all tests except for GO/KEGG/Reactome categories, for which an initial
round of 1,000 was used for FDR thresholding and a second round of 10,000 was used to
calculate final p-values), the feature intervals were all shifted by the same random
number of bases within the background space. This null distribution was observed to be
normally-distributed for all three metrics, allowing a Z-score and p-value to then be
assigned to the initial difference in means observed.
Regulatory motif annotations
A joint annotation of position weight matrices (PWMs) and assayed transcription
factors (TFs) into families was used for the TF binding site analysis (5). A representative
PWM for each TF family was selected based on which showed the most significant
enrichment for any of the ChIP-seq experiments performed within that family (possibly
across a variety of assayed paralogous proteins, cell types, and experimental conditions).
Genomewide matches to the PWM were then defined as described previously (2), and
each set of matches was filtered based on whether it was bound in at least one of the
TF family’s experiments or never bound.
Gene ontology annotations
Gene Ontology (GO) annotations for all human genes were obtained as a table using
the ENSEMBL BioMart tool (39). KEGG (40) and Reactome (41) annotations were
obtained from MSigDB (42). Categories containing fewer than five ENSEMBL genes
were discarded. Enhancers from the ChromHMM segmentation were then aggregated
over the six Tier 1 and Tier 2 cell lines and classified according to the GO category (or
categories) of their nearest ENSEMBL gene. SNP density, heterozygosity, and DAF were
calculated at non-conserved enhancers grouped by the GO categories of their target
genes, and bootstrap p-values were calculated for all three metrics. Categories were
retained as significant if their heterozygosity and DAF bootstrap p-values both passed a
false discovery rate threshold of q=0.05.
Background selection
Background selection leads to locally reduced diversity near strongly selected
elements, and since ENCODE active elements are biased to be close to such elements, it
is a potential confounder. For the background selection analysis, the HapMap genetic
map calculated on the hg19 genome was used to assign a genetic coordinate along each
chromosome (in cM) for each feature boundary and each SNP. SNPs were then binned
according to their genomic distance to exons, using ten bins equally spaced between 0 and
0.1 cM. SNPs were also binned using expected background selection values, B, from
(23). Heterozygosity calculations were then performed as described previously for
ENCODE feature and background sets, restricted to each bin. Genomic regions in the bin
corresponding to values of B between 0.12 and 0.14 showed abnormally high
heterozygosity. The result was found to be driven by a single outlier region (11q11.2)
which contains four genes, all encoding olfactory receptors. After excluding this 5.3 Mb
region, heterozygosity in this bin is consistent with the genomewide relationship between
B and diversity (Figure 2B). We confirmed that heterozygosity is consistently depleted at
ENCODE elements within these bins.
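The distance binning described above (ten equal-width bins between 0 and 0.1 cM) can be sketched as follows. The function name is hypothetical, and returning `None` for SNPs farther than 0.1 cM from an exon is an assumption, since the text does not say how such SNPs were handled.

```python
import bisect

# Interior bin edges at 0.01, 0.02, ..., 0.09 cM give ten equal-width bins.
EDGES_CM = [i / 100 for i in range(1, 10)]


def distance_bin(distance_cm):
    """Return the bin index (0-9) for a genetic distance to the nearest exon."""
    if distance_cm < 0 or distance_cm > 0.1:
        return None  # assumption: out-of-range SNPs are left unbinned
    return bisect.bisect_right(EDGES_CM, distance_cm)


print(distance_bin(0.005))  # 0
print(distance_bin(0.095))  # 9
```

Comparing against precomputed edges avoids the rounding surprises that can arise from dividing a floating-point distance by the bin width.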
Biased gene conversion
Biased gene conversion favors the fixation of strong (CG) over weak (AT) alleles in
strong/weak polymorphic sites, and can mimic selection (43). To isolate sites immune to
biased gene conversion, SNPs for the DAF analysis were discarded unless they were
weak-weak (A-T) or strong-strong (C-G), retaining 833,979 sites (18%), and the
bootstrapping analysis was then performed as described previously.
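The allele-class filter above amounts to keeping a SNP only when both of its alleles are weak or both are strong. A minimal sketch, with a hypothetical function name:

```python
WEAK = {"A", "T"}
STRONG = {"C", "G"}


def immune_to_bgc(allele_1, allele_2):
    """True for weak-weak (A-T) or strong-strong (C-G) SNPs."""
    alleles = {allele_1.upper(), allele_2.upper()}
    return alleles <= WEAK or alleles <= STRONG


print(immune_to_bgc("A", "T"))  # True
print(immune_to_bgc("A", "G"))  # False
```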
Non-reference allele mapping
A third potential confounder is a decrease in measured biochemical activity for non-
reference alleles, due to a bias towards mapping to the reference allele. For the read
mapping bias analysis, only ENCODE features and chromatin segmentations derived
from experiments on the GM12878 lymphoblastoid cell line were used. Then, the
NA12878 variant calls from the 1000 Genomes trio pilot project were used to select, for
DAF analysis, only the 3.5M SNPs (77%) that are homozygous for the reference allele in the
GM12878 cell line. Thus, the only SNPs considered were those whose non-reference
allele is not actually present in the cell being assayed.
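As a small sketch of this selection step, a SNP would be kept only when the NA12878 genotype is homozygous reference. The function name is hypothetical, and the genotype strings follow the usual VCF convention (`0` for the reference allele, `/` or `|` as separators); how missing genotypes were treated is not stated in the text, so they are simply rejected here.

```python
def is_hom_ref(genotype):
    """True when a diploid VCF genotype string is homozygous reference."""
    alleles = genotype.replace("|", "/").split("/")
    return len(alleles) == 2 and all(a == "0" for a in alleles)


# Hypothetical NA12878 genotypes keyed by SNP identifier.
genotypes = {"snp1": "0/0", "snp2": "0|1", "snp3": "1/1", "snp4": "./."}
kept = [snp for snp, gt in genotypes.items() if is_hom_ref(gt)]
print(kept)  # ['snp1']
```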
Proportion under constraint
To estimate the proportion of the human genome under constraint (PUC), every
nucleotide was assigned to one of ten bins of expected background constraint B as
described above, and within each bin the SNP density, heterozygosity, and DAF of the
feature being tested were scaled between a value of zero and unity, with zero (the most
constrained) defined as the constraint value of non-degenerate coding conserved bases,
and unity (the least constrained) defined as non-ENCODE non-conserved regions. This
scaled constraint value within each bin was then multiplied by the number of nucleotides
covered by the feature in each bin, and this product was summed across bins, providing a
total number of bases under constraint. This was then compared to the total overall
coverage of the feature to arrive at an overall PUC for each class of elements. A
confidence interval was calculated for each PUC by generating three sets of 1000 random
normally-distributed constraint values within each bin (one for the test set and one each
for the two references), with a standard deviation equal to the standard error of the mean
constraint value. The PUC calculation was then performed using each of these 1000
trials, and quantiles in the resulting PUC distribution were used to report a 95%
confidence interval.
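The PUC calculation and its resampled confidence interval can be sketched as below. Two points are assumptions rather than statements from the text: the fraction of a feature under constraint in each bin is taken as one minus the scaled value (so that a feature matching the coding reference counts as fully constrained), and the interval uses the 2.5% and 97.5% order statistics of the trials. Function names and all numbers in the example are hypothetical.

```python
import random


def puc(feature, coding, background, bp):
    """Proportion under constraint; each argument has one entry per B bin."""
    constrained = 0.0
    for x, lo, hi, n in zip(feature, coding, background, bp):
        scaled = (x - lo) / (hi - lo)      # 0 = coding-like, 1 = background-like
        constrained += (1.0 - scaled) * n  # assumed: constrained share = 1 - scaled
    return constrained / sum(bp)


def puc_interval(feature, coding, background, bp, sems, n_trials=1000, seed=0):
    """95% interval from normal draws around each mean (sd = standard error).

    sems is a tuple of three per-bin lists: feature, coding, background.
    """
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        f, c, b = (
            [rng.gauss(mean, sem) for mean, sem in zip(means, sem_list)]
            for means, sem_list in zip((feature, coding, background), sems)
        )
        trials.append(puc(f, c, b, bp))
    trials.sort()
    return trials[int(0.025 * n_trials)], trials[int(0.975 * n_trials) - 1]


# Hypothetical heterozygosity values in two bins of B.
feature, coding, background = [5.7, 5.9], [1.0, 1.1], [6.8, 6.9]
bp = [1000, 3000]
sems = ([0.01, 0.01], [0.01, 0.01], [0.01, 0.01])
point = puc(feature, coding, background, bp)
low, high = puc_interval(feature, coding, background, bp, sems)
```

Scaling within each bin before summing is what corrects the estimate for background selection: the two references are measured at the same B value as the feature they are compared with.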
Fig. S1.
Comparison of (A) SNP density and (B) DAF for ENCODE-annotated elements within
and outside mammalian-conserved regions, as in Fig. 1B.
Fig. S2.
Distribution of mammalian conservation values by SiPhy in the genomic subsets
shown in Figure 1.
Fig. S3
SNP density in the unconserved genome at variously annotated features. As in Fig. 2, the
histograms represent the distribution of background values from the bootstrap procedure,
and Z-scores report the difference between the tested feature (shown as a vertical dash)
and the background region in units of the standard deviation of the values obtained from
the bootstrap procedure.
Fig. S4
Heterozygosity in the unconserved genome at variously annotated features. As in Fig. 2,
the histograms represent the distribution of background values from the bootstrap
procedure, and Z-scores report the difference between the tested feature (shown as a
vertical dash) and the background region in units of the standard deviation of the values
obtained from the bootstrap procedure.
Fig. S5
Derived allele frequency in the unconserved genome at ENCODE feature classes,
conditioning on whether regions are annotated as intronic or intergenic. As in Fig. 2, the
histograms represent the distribution of background values from the bootstrap procedure,
and Z-scores report the difference between the tested feature (shown as a vertical dash)
and the background region in units of the standard deviation of the values obtained from
the bootstrap procedure.
Fig. S6
Mean heterozygosity and human-macaque divergence in the genomic subsets shown in
Figure 1.
Fig. S7
Pathway analysis of unconserved enhancers near genes involved with nerve growth factor
signaling (by three pathway annotations). Only genes with at least 30 kb of neighboring
enhancer sequence are included. Genes for which enhancers have a heterozygosity below
the unconserved unannotated genome background of 6.22 × 10⁻⁴ are listed with blue
labels, and those with higher heterozygosity are listed with red labels. Bootstrap results
for all significant categories are shown in Table S5.
Fig. S8
Empirical cumulative distribution function of heterozygosity at individual elements and
1,000 samples each of 10-100,000 elements, in multiples of 10, sampled at a time from
two populations: (red) DNase I hypersensitive sites, located in TSS-distal intergenic
regions and background selection values B < 0.1, and (black) matched control non-
ENCODE regions.
Fig. S9
The effect of background selection on SNP density, heterozygosity, DAF, and divergence
at reference minimum and maximum constraint regions used for the PUC estimates
(conserved non-degenerate coding, and unconserved non-ENCODE).
Fig. S10
Partitioning of human-constrained bases (nucleotides under constraint, NUC) between the
conserved and unconserved genome at annotated exons and DNase I hypersensitive sites.
Table S1.
Human selection on SNPs not prone to biased gene conversion.
Feature                                     Weak-weak or strong-strong SNPs   DAF     p (DAF)
Genome (TSS-distal, nonexonic)              833979                            0.211
ENCODE-annotated (TSS-distal, nonexonic)    543065                            0.207   5.10E-43
Non-ENCODE                                  290914                            0.218
Active chromatin (TSS-distal, nonexonic)    270959                            0.203   2.20E-43
Inactive chromatin                          563020                            0.215
Table S2.
Human selection on features observed solely in GM12878, on derived alleles not present
in that individual.
Feature                                     Non-NA12878 SNPs   DAF     p (DAF)
Genome (TSS-distal, nonexonic)              3509971            0.17
ENCODE-annotated (TSS-distal, nonexonic)    1417654            0.164   5.10E-49
Non-ENCODE                                  2092317            0.174
Active chromatin (TSS-distal, nonexonic)    454740             0.158   1.20E-42
Inactive chromatin                          3055231            0.171
Table S3
Human selection in the unconserved genome.
Feature bp SNPs dens (kb⁻¹) p (density) het (π × 10⁻⁴) p (het) DAF p (DAF)
Mappable genome 1692105614 5059631 2.99 6.13 0.208
Genome (TSS-distal, nonexonic) 1500452776 4539733 3.03 2.20E-58 6.22 1.90E-59 0.209 3.70E-22
Intergenic (TSS-distal) 775609576 2454978 3.17 6.10E-63 6.58 1.80E-69 0.212 1.80E-37
Intronic (TSS-distal) 724843200 2084755 2.88 7.70E-35 5.82 5.60E-40 0.204 9.70E-28
Non-degenerate coding 3282088 5959 1.82 2.20E-130 3.47 8.80E-109 0.187 8.10E-15
Annotated CDS 5076572 10082 1.99 4.40E-113 3.88 1.30E-93 0.195 6.60E-09
Annotated UTR 20114137 50365 2.5 6.30E-77 4.94 9.00E-72 0.196 1.40E-22
TSS 172497879 471430 2.73 4.00E-44 5.5 1.20E-45 0.202 1.80E-21
Intron 724843200 2084755 2.88 7.70E-35 5.82 5.60E-40 0.204 9.70E-28
Four-fold degenerate coding 875586 1758 2.01 2.80E-45 4.07 1.60E-31 0.206 0.36
Repeats 679717150 2093669 3.08 5.30E-179 6.34 1.30E-157 0.209 2.70E-14
Intergenic 775609576 2454978 3.17 6.10E-63 6.58 1.80E-69 0.212 1.80E-37
ENCODE-annotated (TSS-distal, nonexonic) 1034532977 3024992 2.92 3.00E-64 5.93 7.20E-85 0.205 1.70E-79
Bound motifs 2676169 6976 2.61 6.80E-31 5.04 8.20E-37 0.193 5.10E-09
DNAse I hypersensitive 310016583 884421 2.85 4.20E-38 5.71 8.30E-58 0.202 9.30E-63
FAIRE 238662992 675064 2.83 3.40E-22 5.67 1.30E-30 0.203 2.70E-27
Long RNA 881679464 2552187 2.89 2.20E-65 5.86 1.70E-78 0.204 2.00E-67
Protein bound by ChIP 112679654 323954 2.87 5.50E-33 5.77 7.60E-48 0.203 2.60E-29
Short RNA 1892306 5033 2.66 2.40E-15 5.2 2.30E-17 0.193 2.70E-06
Novel long intergenic PolyA RNA 191506259 580177 3.03 4.80E-21 6.22 1.20E-22 0.208 3.00E-15
Novel long intergenic non-PolyA RNA 100922369 316303 3.13 0.014 6.48 0.0034 0.21 0.00034
Novel long intronic RNA 555945590 1562135 2.81 8.00E-34 5.63 4.70E-46 0.202 1.10E-40
Non-ENCODE 465919799 1514741 3.25 4.20E-65 6.84 1.00E-86 0.215 1.10E-65
Active chromatin (TSS-distal, nonexonic) 546833480 1538999 2.81 4.60E-57 5.64 3.00E-73 0.202 1.80E-81
Enhancer chromatin 285221744 818955 2.87 8.50E-25 5.79 5.50E-32 0.204 3.90E-30
Insulator chromatin 34518467 101998 2.95 2.50E-06 6.02 1.60E-07 0.207 0.019
Promoter chromatin 47213250 130661 2.77 1.20E-42 5.53 3.40E-47 0.202 3.90E-16
Transcribed chromatin 301694820 812255 2.69 3.80E-85 5.31 1.00E-103 0.197 3.00E-102
Inactive chromatin 953619296 3000734 3.15 1.80E-57 6.55 4.40E-74 0.212 9.60E-67
Table S4
Human selection in the conserved genome.
Feature bp SNPs dens (kb⁻¹) p (density) het (π × 10⁻⁴) p (het) DAF p (DAF)
Mappable genome 115312286 206185 1.79 3.35 0.179
Genome (TSS-distal, nonexonic) 67141385 147277 2.19 0 4.19 9.30E-298 0.184 2.54E-15
Intergenic (TSS-distal) 32704673 75630 2.31 3.00E-137 4.47 2.80E-128 0.187 1.05E-12
Intronic (TSS-distal) 34436712 71647 2.08 3.10E-44 3.92 3.80E-31 0.181 0.441
Non-degenerate coding 10885201 6992 0.642 0 1.03 0 0.147 5.41E-98
Annotated CDS 16601142 16879 1.02 0 1.79 0 0.167 3.10E-25
Annotated UTR 9078363 11872 1.31 5.90E-190 2.27 6.10E-174 0.163 7.51E-23
TSS 17757580 25938 1.46 1.20E-111 2.63 1.20E-105 0.171 6.82E-16
Intron 34436712 71647 2.08 3.10E-44 3.92 3.80E-31 0.181 0.441
Four-fold degenerate coding 2551676 4404 1.73 2.50E-09 3.25 1.40E-06 0.186 0.0774
Repeats 5948434 12880 2.17 7.00E-39 4.08 3.40E-25 0.178 0.0983
Intergenic 32704673 75630 2.31 3.00E-137 4.47 2.80E-128 0.187 1.05E-12
ENCODE-annotated (TSS-distal, nonexonic) 51655262 109738 2.12 2.10E-41 4 3.50E-52 0.181 5.04E-18
Bound motifs 556844 963 1.73 1.60E-13 2.92 2.70E-16 0.166 0.00321
DNAse I hypersensitive 24658238 50270 2.04 5.00E-50 3.76 2.00E-67 0.176 1.11E-19
FAIRE 15761407 31746 2.01 3.80E-31 3.7 8.30E-41 0.177 7.36E-10
Long RNA 41538141 87166 2.1 6.60E-34 3.95 3.50E-37 0.18 2.47E-13
Protein bound by ChIP 10776655 21797 2.02 2.10E-27 3.72 1.50E-33 0.177 7.89E-08
Short RNA 158159 278 1.76 0.00015 3.17 0.00031 0.162 0.0425
Novel long intergenic PolyA RNA 7976838 17274 2.17 4.80E-13 4.12 2.40E-12 0.183 0.00562
Novel long intergenic non-PolyA RNA 4806880 11288 2.35 0.068 4.5 0.27 0.182 0.0117
Novel long intronic RNA 25745396 53444 2.08 0.24 3.89 0.015 0.179 0.00336
Non-ENCODE 15486123 37539 2.42 9.00E-43 4.79 6.50E-53 0.193 3.09E-14
Active chromatin (TSS-distal, nonexonic) 28105340 55850 1.99 4.50E-59 3.67 1.80E-67 0.176 2.81E-24
Enhancer chromatin 17751923 35982 2.03 2.80E-31 3.73 9.60E-41 0.176 5.94E-15
Insulator chromatin 1836220 3800 2.07 0.00033 3.86 0.00015 0.183 0.43
Promoter chromatin 3703789 6962 1.88 3.10E-27 3.43 8.80E-27 0.172 1.34E-06
Transcribed chromatin 11870117 22003 1.85 1.10E-55 3.37 9.50E-58 0.172 2.33E-16
Inactive chromatin 39036045 91427 2.34 9.70E-62 4.56 7.60E-70 0.189 2.58E-18
Table S5
Human selection in unconserved enhancers associated with gene sets.
GO category of nearest genes Enhancer bp SNPs density p (density) het p (het) DAF p (DAF)
retinal cone cell development 73555 136 1.85 0.0051 2.46 0.00069 0.124 7.70E-05
transcription initiation from RNA polymerase II promoter 717810 1661 2.31 9.40E-05 4.19 2.40E-05 0.179 0.00052
fibroblast growth factor receptor signaling pathway 1847741 4720 2.55 2.00E-04 4.89 6.90E-05 0.187 1.00E-04
phosphatidylinositol-mediated signaling 1749490 4449 2.54 0.00037 4.84 8.80E-05 0.189 0.00052
potassium ion transmembrane transport 1551767 4067 2.62 0.0019 5.07 0.00071 0.189 0.00059
voltage-gated potassium channel activity 1856383 4782 2.58 0.00015 5.1 2.00E-04 0.19 0.00023
insulin receptor signaling pathway 2888724 7650 2.65 0.00025 5.25 0.00031 0.191 0.00013
DNA repair 3366924 8926 2.65 2.00E-04 5.22 0.00012 0.193 0.00014
transcription factor binding 6670091 17617 2.64 6.60E-07 5.08 1.10E-08 0.193 6.00E-07
negative regulation of transcription, DNA-dependent 10946505 29687 2.71 3.10E-07 5.29 2.60E-09 0.194 1.30E-09
chromatin binding 4652818 11817 2.54 4.40E-07 4.9 5.60E-08 0.194 8.10E-05
negative regulation of transcription from RNA polymerase II promoter 10305800 27903 2.71 9.30E-07 5.28 1.00E-08 0.194 1.90E-08
mitotic cell cycle 3142263 8325 2.65 2.80E-05 5.12 3.20E-06 0.195 0.00048
transcription factor complex 6332741 16852 2.66 5.40E-06 5.21 7.00E-07 0.195 1.70E-05
transcription coactivator activity 4943837 13029 2.64 1.10E-05 5.12 1.20E-06 0.195 0.00017
protein kinase activity 12795332 35155 2.75 5.00E-08 5.47 1.20E-08 0.195 3.40E-10
in utero embryonic development 5609697 15414 2.75 0.00053 5.38 5.80E-05 0.195 5.10E-05
protein tyrosine kinase activity 11946271 32789 2.74 1.40E-07 5.46 3.60E-08 0.195 3.40E-09
protein serine/threonine kinase activity 12656016 34778 2.75 4.70E-08 5.46 7.40E-09 0.195 1.30E-09
nerve growth factor receptor signaling pathway 6111437 16512 2.7 7.50E-05 5.34 2.80E-05 0.195 5.40E-05
signal transducer activity 6688041 18137 2.71 8.70E-06 5.32 8.40E-07 0.195 8.10E-06
kinase activity 4645863 12901 2.78 0.0027 5.44 0.00036 0.195 0.00022
protein complex 5190432 14292 2.75 0.00035 5.39 3.40E-05 0.196 0.00011
protein phosphorylation 14162029 39092 2.76 4.10E-08 5.49 6.80E-09 0.196 9.00E-10
protein kinase binding 4895492 13275 2.71 0.00047 5.33 1.00E-04 0.196 0.00051
positive regulation of transcription, DNA-dependent 13920086 37756 2.71 1.10E-08 5.36 6.70E-10 0.197 4.00E-08
transcription from RNA polymerase II promoter 5003073 13335 2.67 1.30E-05 5.25 4.60E-06 0.197 0.00029
negative regulation of apoptosis 5705645 15766 2.76 0.00062 5.53 5.00E-04 0.197 0.00045
protein heterodimerization activity 6512788 17785 2.73 2.50E-05 5.4 5.80E-06 0.197 0.00016
regulation of transcription from RNA polymerase II promoter 5544704 15048 2.71 4.90E-05 5.26 2.00E-06 0.197 0.00047
nucleoplasm 12471336 32833 2.63 1.30E-12 5.17 1.50E-13 0.198 6.80E-07
positive regulation of cell proliferation 8010653 22721 2.84 0.003 5.6 0.00024 0.198 4.20E-05
cell proliferation 6004861 16670 2.78 0.00033 5.42 1.20E-05 0.198 0.00037
apoptosis 11882831 33339 2.81 1.70E-05 5.54 5.10E-07 0.198 3.40E-06
DNA binding 29156331 79806 2.74 7.40E-14 5.42 2.80E-16 0.198 2.70E-12
zinc ion binding 28823266 79223 2.75 8.00E-15 5.44 1.10E-17 0.198 1.40E-12
ATP binding 27325112 75898 2.78 4.10E-12 5.55 3.00E-13 0.198 2.90E-12
sequence-specific DNA binding 12483858 34287 2.75 1.60E-06 5.44 1.20E-07 0.199 1.10E-05
regulation of transcription, DNA-dependent 34108182 92840 2.72 1.80E-16 5.4 7.10E-19 0.199 8.90E-12
transcription, DNA-dependent 14784017 40164 2.72 1.50E-09 5.34 1.40E-11 0.199 2.90E-06
positive regulation of transcription from RNA polymerase II promoter 14530520 40366 2.78 1.20E-06 5.47 1.20E-08 0.199 7.70E-06
sequence-specific DNA binding transcription factor activity 19434431 53728 2.76 8.00E-09 5.46 7.50E-11 0.2 7.30E-07
nucleus 80575833 223497 2.77 1.70E-23 5.52 1.00E-28 0.2 3.70E-22
nucleotide binding 34315822 95814 2.79 3.40E-12 5.63 5.70E-12 0.2 8.70E-11
metal ion binding 45272248 125526 2.77 2.00E-18 5.54 4.50E-20 0.2 1.50E-12
nucleic acid binding 12792116 35143 2.75 1.20E-08 5.49 6.10E-09 0.201 0.00011
signal transduction 29647443 83795 2.83 1.30E-08 5.66 9.20E-10 0.201 1.10E-07
intracellular 34236666 96086 2.81 6.00E-10 5.62 1.80E-11 0.201 3.20E-08
Golgi apparatus 17726613 50369 2.84 9.20E-06 5.72 3.60E-06 0.201 3.70E-05
hydrolase activity 15678281 43183 2.75 7.70E-09 5.49 1.50E-09 0.201 1.00E-04
cytoplasm 92222808 260002 2.82 1.50E-19 5.65 7.90E-24 0.202 5.30E-18
cytosol 36606393 101494 2.77 8.00E-14 5.57 3.90E-14 0.202 4.40E-08
protein binding 151008527 425254 2.82 2.80E-28 5.64 4.80E-36 0.202 1.70E-27
plasma membrane 66136117 192408 2.91 1.50E-05 5.9 2.70E-06 0.205 1.50E-05
membrane 77310603 225616 2.92 3.40E-06 5.92 2.30E-07 0.206 0.00026
Reactome category of nearest genes Enhancer bp SNPs density p (density) het p(het) DAF p (DAF)
MAPK targets - nuclear events mediated by MAP kinases 866717 2137 2.47 0.0042 4.53 0.0011 0.177 2.00E-04
TRKA signalling from the plasma membrane 2810831 7304 2.6 0.00012 5.01 2.50E-05 0.191 9.90E-05
signalling by NGF 5898144 15797 2.68 1.60E-05 5.26 4.00E-06 0.194 1.90E-05
signaling in immune system 5634277 15421 2.74 8.10E-05 5.43 4.00E-05 0.195 1.60E-05
Cell cycle - mitotic 3075972 7845 2.55 1.10E-07 4.95 6.80E-08 0.196 0.0014
Axon guidance 4478364 12325 2.75 0.00082 5.4 0.00014 0.198 0.0015
KEGG category of nearest genes Enhancer bp SNPs density p (density) het p(het) DAF p (DAF)
Prostate cancer 2015129 5043 2.5 1.40E-05 4.77 3.60E-06 0.185 9.50E-06
Progesterone mediated oocyte maturation 1571198 3966 2.52 8.70E-05 4.93 1.00E-04 0.185 6.00E-05
melanoma 1734357 4392 2.53 9.80E-05 4.83 3.10E-05 0.186 8.80E-05
glioma 1751888 4651 2.65 0.0044 5.07 0.00092 0.188 4.00E-04
neurotrophin signaling pathway 2957205 8134 2.75 0.0042 5.28 0.00025 0.188 4.90E-06
epithelial cell signaling in Helicobacter pylori infection 1198962 3205 2.67 0.0097 4.95 0.00066 0.188 0.0018
ERBB signaling pathway 2588388 7063 2.73 0.0041 5.26 0.00055 0.19 7.70E-05
colorectal cancer 1713597 4348 2.54 0.00027 4.96 0.00029 0.191 0.0017
cell cycle 1867497 4706 2.52 7.00E-05 4.82 3.30E-05 0.191 0.0015
MAPK signaling pathway 6543557 17341 2.65 2.70E-07 5.16 1.60E-08 0.191 6.30E-09
T cell receptor signaling pathway 2661933 6991 2.63 0.00024 5.13 0.00012 0.194 0.0014
endocytosis 4269724 11763 2.75 0.002 5.35 0.00016 0.197 0.0019
regulation of actin cytoskeleton 4479622 12040 2.69 2.20E-05 5.3 9.30E-06 0.197 0.0011
pathways in cancer 8827601 24242 2.75 7.60E-06 5.46 2.10E-06 0.2 0.0012
Table S6
Estimated proportion of bases under constraint (PUC) and total nucleotides under
constraint (NUC) in human and primates, by conservation and feature, corrected for
background selection.
References
1. E. S. Lander et al.; International Human Genome Sequencing Consortium, Initial sequencing
and analysis of the human genome. Nature 409, 860 (2001). doi:10.1038/35057062
Medline
2. K. Lindblad-Toh et al.; Broad Institute Sequencing Platform and Whole Genome Assembly
Team; Baylor College of Medicine Human Genome Sequencing Center Sequencing
Team; Genome Institute at Washington University, A high-resolution map of human
evolutionary constraint using 29 mammals. Nature 478, 476 (2011).
doi:10.1038/nature10530 Medline
3. C. P. Ponting, R. C. Hardison, What fraction of the human genome is functional? Genome Res.
21, 1769 (2011). doi:10.1101/gr.116814.110 Medline
4. E. Birney et al.; ENCODE Project Consortium; NISC Comparative Sequencing Program;
Baylor College of Medicine Human Genome Sequencing Center; Washington University
Genome Sequencing Center; Broad Institute; Children’s Hospital Oakland Research
Institute, Identification and analysis of functional elements in 1% of the human genome
by the ENCODE pilot project. Nature 447, 799 (2007). doi:10.1038/nature05874 Medline
5. The ENCODE Project Consortium, Nature, 5 September 2012; doi:10.1038/nature11247.
6. J. Ernst et al., Mapping and analysis of chromatin state dynamics in nine human cell types.
Nature 473, 43 (2011). doi:10.1038/nature09906 Medline
7. M. R. Nelson et al., An abundance of rare functional variants in 202 drug target genes
sequenced in 14,002 people. Science 337, 100 (2012). doi:10.1126/science.1217876
Medline
8. L. A. Hindorff et al., Potential etiologic and functional implications of genome-wide
association loci for human diseases and traits. Proc. Natl. Acad. Sci. U.S.A. 106, 9362
(2009). doi:10.1073/pnas.0903103106 Medline
9. C. B. Lowe et al., Three periods of regulatory innovation during vertebrate evolution. Science
333, 1019 (2011). doi:10.1126/science.1202702 Medline
10. D. Brawand et al., The evolution of gene expression levels in mammalian organs. Nature
478, 343 (2011). doi:10.1038/nature10532 Medline
11. D. Schmidt et al., Five-vertebrate ChIP-seq reveals the evolutionary dynamics of
transcription factor binding. Science 328, 1036 (2010). doi:10.1126/science.1186176
Medline
12. 1000 Genomes Project Consortium, A map of human genome variation from population-
scale sequencing. Nature 467, 1061 (2010). Medline
13. S. R. Eddy, A model of the statistical power of comparative genome sequence analysis. PLoS
Biol. 3, e10 (2005). doi:10.1371/journal.pbio.0030010 Medline
14. S. Asthana et al., Widely distributed noncoding purifying selection in the human genome.
Proc. Natl. Acad. Sci. U.S.A. 104, 12410 (2007). doi:10.1073/pnas.0705140104 Medline
15. J. A. Drake et al., Conserved noncoding sequences are selectively constrained and not
mutation cold spots. Nat. Genet. 38, 223 (2006). doi:10.1038/ng1710 Medline
16. D. G. Torgerson et al., Evolutionary processes acting on candidate cis-regulatory regions in
humans inferred from patterns of polymorphism and divergence. PLoS Genet. 5,
e1000592 (2009). doi:10.1371/journal.pgen.1000592 Medline
17. S. Katzman et al., Human genome ultraconserved elements are ultraselected. Science 317,
915 (2007). doi:10.1126/science.1142430 Medline
18. D. Lomelin, E. Jorgenson, N. Risch, Human genetic variation recognizes functional elements
in noncoding sequence. Genome Res. 20, 311 (2010). doi:10.1101/gr.094151.109
Medline
19. X. J. Mu, Z. J. Lu, Y. Kong, H. Y. Lam, M. B. Gerstein, Analysis of genomic variation in
non-coding elements using population-scale sequencing data from the 1000 Genomes
Project. Nucleic Acids Res. 39, 7058 (2011). doi:10.1093/nar/gkr342 Medline
20. K. S. Pollard et al., An RNA gene expressed during cortical development evolved rapidly in
humans. Nature 443, 167 (2006). doi:10.1038/nature05113 Medline
21. P. C. Sabeti et al., Positive natural selection in the human lineage. Science 312, 1614 (2006).
doi:10.1126/science.1124309 Medline
22. Materials and methods are available as Supporting Online Materials on Science Online.
23. G. McVicker, D. Gordon, C. Davis, P. Green, Widespread genomic signatures of natural
selection in hominid evolution. PLoS Genet. 5, e1000471 (2009).
doi:10.1371/journal.pgen.1000471 Medline
24. J. Ernst, M. Kellis, Discovery and characterization of chromatin states for systematic
annotation of the human genome. Nat. Biotechnol. 28, 817 (2010). doi:10.1038/nbt.1662
Medline
25. G. Bejerano et al., A distal enhancer and an ultraconserved exon are derived from a novel
retroposon. Nature 441, 87 (2006). doi:10.1038/nature04696 Medline
26. S. Dorus et al., Accelerated evolution of nervous system genes in the origin of Homo
sapiens. Cell 119, 1027 (2004). doi:10.1016/j.cell.2004.11.040 Medline
27. G. H. Jacobs, The evolution of vertebrate color vision. Adv. Exp. Med. Biol. 739, 156 (2012).
doi:10.1007/978-1-4614-1704-0_10 Medline
28. S. Meader, C. P. Ponting, G. Lunter, Massive turnover of functional sequence in human and
other mammalian genomes. Genome Res. 20, 1335 (2010). doi:10.1101/gr.108795.110
Medline
29. T. S. Mikkelsen et al.; Broad Institute Genome Sequencing Platform; Broad Institute Whole
Genome Assembly Team, Genome of the marsupial Monodelphis domestica reveals
innovation in non-coding sequences. Nature 447, 167 (2007). doi:10.1038/nature05805
Medline
30. X. Y. Li et al., Transcription factors bind thousands of active and inactive regions in the
Drosophila blastoderm. PLoS Biol. 6, e27 (2008). doi:10.1371/journal.pbio.0060027
Medline
31. A. R. Quinlan, I. M. Hall, BEDTools: A flexible suite of utilities for comparing genomic
features. Bioinformatics 26, 841 (2010). doi:10.1093/bioinformatics/btq033 Medline
32. J. Harrow et al., GENCODE: Producing a reference annotation for ENCODE. Genome Biol.
7, (Suppl 1), S4, 1 (2006). doi:10.1186/gb-2006-7-s1-s4 Medline
33. D. Karolchik et al., The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32,
(Database issue), D493 (2004). doi:10.1093/nar/gkh103 Medline
34. B. Paten et al., Genome-wide nucleotide-level mammalian ancestor reconstruction. Genome
Res. 18, 1829 (2008). doi:10.1101/gr.076521.108 Medline
35. M. Garber et al., Identifying novel constrained elements by exploiting biased substitution
patterns. Bioinformatics 25, i54 (2009). doi:10.1093/bioinformatics/btp190 Medline
36. R. A. Gibbs et al.; Rhesus Macaque Genome Sequencing and Analysis Consortium,
Evolutionary and biomedical insights from the rhesus macaque genome. Science 316, 222
(2007). doi:10.1126/science.1139247 Medline
37. S. B. Gabriel et al., The structure of haplotype blocks in the human genome. Science 296,
2225 (2002). doi:10.1126/science.1069424 Medline
38. D. L. Hartl, A. G. Clark, Principles of Population Genetics (Sinauer Associates, Sunderland,
Mass., ed. 4, 2007)
39. P. Flicek et al., Ensembl 2012. Nucleic Acids Res. 40, (Database issue), D84 (2012).
doi:10.1093/nar/gkr991 Medline
40. M. Kanehisa, S. Goto, Y. Sato, M. Furumichi, M. Tanabe, KEGG for integration and
interpretation of large-scale molecular data sets. Nucleic Acids Res. 40, (Database issue),
D109 (2012). doi:10.1093/nar/gkr988 Medline
41. D. Croft et al., Reactome: A database of reactions, pathways and biological processes.
Nucleic Acids Res. 39, (Database issue), D691 (2011). doi:10.1093/nar/gkq1018 Medline
42. A. Subramanian et al., Gene set enrichment analysis: A knowledge-based approach for
interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. U.S.A. 102, 15545
(2005). doi:10.1073/pnas.0506580102 Medline
43. J. Berglund, K. S. Pollard, M. T. Webster, Hotspots of biased nucleotide substitutions in
human genes. PLoS Biol. 7, e26 (2009). doi:10.1371/journal.pbio.1000026 Medline