Genome scan for cis-regulatory DNA motifs associated with social behavior in honey bees

Saurabh Sinha*†‡, Xu Ling*, Charles W. Whitfield†§¶, Chengxiang Zhai*†, and Gene E. Robinson†§¶∥

Departments of *Computer Science and §Entomology, †Institute of Genomic Biology, and ¶Neuroscience Program, University of Illinois at Urbana–Champaign, Urbana, IL 61801

Contributed by Gene E. Robinson, August 28, 2006

Honey bees (Apis mellifera) undergo an age-related, socially regulated transition from working in the hive to foraging, which is associated with changes in the expression of thousands of genes in the brain. To begin to study the cis-regulatory code underlying this massive social regulation of gene expression, we used the newly sequenced honey bee genome to scan the promoter regions of eight sets of behaviorally related genes differentially expressed in the brain in the context of division of labor among worker bees, for 41 cis-regulatory motifs previously characterized in Drosophila melanogaster. Binding sites for the transcription factors Hairy, GAGA, Adf1, Cf1, Snail, and Dri, known to function in nervous system development, olfactory learning, or hormone binding in Drosophila, were significantly associated with one or more gene sets. The presence of some binding sites also predicted expression patterns for as many as 71% of the genes in some gene sets. These results suggest that there is a robust relationship between cis and social regulation of brain gene expression, especially considering that we studied <15% of all known transcription factors. These results also suggest that transcriptional networks involved in the regulation of development in Drosophila are used to regulate behavioral development in adult honey bees. However, differences in gene regulation between these two processes are suggested by the finding that the promoter regions for the behaviorally related bee genes differed in both motif occurrence and G+C content relative to their Drosophila orthologs.

Apis mellifera | gene regulation | microarray | position weight matrix | transcription factors

Behavioral development in honey bees (Apis mellifera) gives rise to an intricate division of labor in the honey bee colony (1). Adult worker honey bees perform a series of tasks in the hive, such as brood care (“nursing”), when they are young and then shift to foraging for nectar and pollen outside the hive when they are 2–3 weeks of age for the remainder of their 5- to 7-week life. The transition from hive work to foraging involves changes in endocrine activity, brain chemistry, and brain structure. This transition is socially regulated and responsive to the needs of the colony, which are communicated via pheromones and other means. These social factors act directly or indirectly on physiological mechanisms that influence behavioral maturation, including mechanisms involving juvenile hormone (JH), the foraging and malvolio genes, and other genes still to be identified.

The transition from hive work to foraging in honey bee colonies also involves changes in the expression of thousands of genes in the brain (2, 3). This finding suggests that honey bees will be useful in elucidating the mechanisms by which social factors regulate gene expression in the brain, one frontier in the study of genes and social behavior.

To begin to study the cis-regulatory code underlying this massive social regulation of brain gene expression, we used the newly sequenced honey bee genome (4) to scan for cis-regulatory motifs in the promoter regions of genes previously shown to be differentially expressed in the brain in the context of socially regulated division of labor (2, 3). We searched for 41 motifs that were previously well characterized in Drosophila melanogaster, primarily for embryonic development, which represents <15% of all known transcription factors (www.godatabase.org). Starting with these experimentally validated motifs allowed us to employ solely bioinformatics methods for this study; identification of motifs de novo would require additional molecular validation. With this approach, we also were able to begin to explore the hypothesis that transcriptional networks involved in the regulation of embryonic development in Drosophila are also used to regulate adult behavioral development in honey bees. The premise of this hypothesis is that many transcription factors show high levels of functional conservation across broad evolutionary distances (5).

Microarray analysis has revealed that nurse and forager honey bees show differences in brain mRNA abundance for about one-third of the genes analyzed, and subsequent experiments indicated that many of these genes are socially regulated and predictive of behavioral status (2). Whitfield et al. (3) generated additional sets of genes implicated in social regulation by performing microarray analyses to identify genes regulated in the brain by physiological, ontogenetic, and genetic factors that are also known to influence the age at onset of foraging. We used these sets of “behaviorally related” genes in our study.

We divided eight gene sets (2, 3), drawn from a total of 3,129 genes (estimated to represent ≈25% of the genes in the bee genome), into those sets that were up-regulated or down-regulated in the brain in response to a specific condition (Table 1). Each gene set contained 100–871 genes. Three additional pairs of gene sets were derived from these gene sets (see Methods) to capture specific combinations of conditions. We identified and quantified patterns of occurrence of the 41 cis-regulatory motifs, listed in Table 5, which is published as supporting information on the PNAS web site, in the promoter regions of these gene sets.

We scanned for these motifs by modeling the binding specificity of each transcription factor by using a position-specific weight matrix (PWM). We used the computer algorithm Stubb (6) to score a promoter (2,000-bp region upstream of the translation start site for each gene) for matches to PWMs. The Stubb algorithm was previously found to accurately predict cis-regulatory modules involved in the segmentation pathway in Drosophila (7).
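As a rough illustration of what scoring a promoter against a PWM involves, the sketch below builds a log-odds matrix from per-position base counts and reports the best-scoring window in a sequence. This is a simplified single-site, forward-strand scan and not the Stubb algorithm itself; the function names, the toy count matrix, the pseudocount, and the uniform background are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def log_odds_pwm(counts, pseudocount=1.0, background=None):
    """Turn per-position base counts into a log2-odds matrix.

    counts: list of dicts mapping each base to its observed count.
    background: base frequencies (assumed uniform when not given).
    """
    background = background or {b: 0.25 for b in BASES}
    pwm = []
    for column in counts:
        total = sum(column[b] + pseudocount for b in BASES)
        pwm.append({
            b: math.log2(((column[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return pwm

def best_window(pwm, seq):
    """Return (score, start) of the highest-scoring window on the forward strand."""
    width = len(pwm)
    best = (float("-inf"), -1)
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing ambiguous bases
        score = sum(pwm[i][window[i]] for i in range(width))
        if score > best[0]:
            best = (score, start)
    return best

# Toy motif (hypothetical counts, not one of the 41 motifs in the study).
toy_counts = [{b: (10 if b == c else 0) for b in BASES} for c in "CACGTG"]
toy_pwm = log_odds_pwm(toy_counts)
score, position = best_window(toy_pwm, "TTTTCACGTGAAAA")
```

Passing a promoter-specific `background` instead of the uniform default is one simple way to mimic a local background model, although the way Stubb handles background is more involved than this.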

The general scheme for this study is illustrated in Fig. 1. We performed “enrichment analysis” by using hypergeometric tests to analyze each gene set for statistical enrichment for each of the 41 motifs, compared with the entire complement of 3,129 genes. We also performed “correlation analysis,” analyzing each pair of up- and down-regulated gene sets for correlation between motif counts and gene expression. Because all of the gene sets are related to socially regulated behavioral maturation, in either a correlative or causal sense, an association between a motif and a gene set discovered with either enrichment or correlation analysis would implicate the corresponding transcription factor as being active in a regulatory pathway in the brain that is related to social behavior. In addition, because the PWMs for the motifs are derived from studies of Drosophila (with ≈300 Myr divergence time to honey bee; see ref. 8), an association between a motif and a gene set would also provide evidence for considerable evolutionary conservation of that motif.

Author contributions: S.S., C.W.W., C.Z., and G.E.R. designed research; S.S. and X.L. performed research; S.S., C.W.W., and C.Z. contributed new reagents/analytic tools; S.S. and X.L. analyzed data; and S.S. and G.E.R. wrote the paper.

The authors declare no conflict of interest.

Abbreviations: GO, Gene Ontology; JH, juvenile hormone; PWM, position-specific weight matrix; SVM, support vector machine.

‡To whom correspondence may be addressed at: University of Illinois, 201 North Goodwin Avenue, Urbana, IL 61801. E-mail: [email protected].

∥To whom correspondence may be addressed. E-mail: [email protected].

© 2006 by The National Academy of Sciences of the USA

16352–16357 | PNAS | October 31, 2006 | vol. 103 | no. 44 | www.pnas.org/cgi/doi/10.1073/pnas.0607448103

Results and Discussion

Behaviorally Related Genes in Honey Bees Have High G+C Content Promoters. Our initial analyses revealed a surprisingly large number of motif–gene set associations. We detected 41 associations, compared with an expected 1 under the null hypothesis (see Table 6, which is published as supporting information on the PNAS web site). Most of these associations involved G+C-rich motifs (see Table 7, which is published as supporting information on the PNAS web site), which led us to explore whether the promoters for these gene sets were, in general, high in G+C content.

Three gene sets were indeed significantly enriched for G+C-rich promoters (Table 2, hypergeometric test). In addition, G+C content correlated significantly with brain expression levels in four pairs of gene sets (Table 2, correlation coefficient). By using a supervised learning method (2-fold cross-validation with a simple threshold-based classifier), promoter G+C content correctly classified a significant fraction of genes in some gene sets as being either up- or down-regulated. The strongest result was obtained for manganese-responsive genes; classification accuracy was 74.6% (P < 1E-38, binomial test; see Table 8, which is published as supporting information on the PNAS web site).
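The G+C enrichment test can be sketched as follows: rank all promoters by G+C fraction, take the top-ranked genes, and ask with a hypergeometric tail probability whether a gene set is overrepresented among them. The helper names and the toy data are assumptions for illustration; the default of 200 top genes follows the Table 2 legend, and the one-sided tail P(X ≥ overlap) is an assumed convention.

```python
from math import comb

def gc_content(seq):
    """Fraction of the sequence that is G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(population, draws)

def gc_enrichment(promoters, gene_set, top_n=200):
    """Overlap between the top_n G+C-richest promoters and gene_set, with its P value.

    promoters: dict mapping gene id -> promoter sequence (the full gene universe).
    gene_set: iterable of gene ids to test for enrichment.
    """
    ranked = sorted(promoters, key=lambda g: gc_content(promoters[g]), reverse=True)
    top = set(ranked[:top_n])
    members = {g for g in gene_set if g in promoters}
    overlap = len(top & members)
    return overlap, hypergeom_tail(overlap, len(promoters), len(members), len(top))

# Toy universe of 10 genes (hypothetical): g0..g4 are G+C-rich, g5..g9 are A+T-rich.
toy_promoters = {
    f"g{i}": ("GC" * 8 + "AT" * 2) if i < 5 else ("AT" * 8 + "GC" * 2)
    for i in range(10)
}
overlap, p_value = gc_enrichment(toy_promoters, {"g0", "g1", "g2", "g3", "g4"}, top_n=5)
```

In the study itself the universe would be the 3,129 genes, so the call would use `population=3129`; the exact integer arithmetic above stays accurate at that size because Python integers are unbounded.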

We tested whether the G+C enrichment of the above gene sets also occurred in the Drosophila genome and thus perhaps reflects some regulatory phenomenon common to both species. Hypergeometric tests showed no association between Drosophila orthologs of the bee gene sets and high G+C promoter content (see Table 9, which is published as supporting information on the PNAS web site). This difference between honey bee and Drosophila promoter regions was further substantiated with Gene Ontology (GO) analysis (www.godatabase.org). The bee genes with the highest G+C content promoters are significantly overrepresented in the categories of transcriptional regulation and ectoderm, midgut, heart, and nervous system development (based on experimental evidence for their Drosophila orthologs; see Table 10, which is published as supporting information on the PNAS web site). However, Drosophila genes with the highest G+C content promoters were not overrepresented in any GO category (see Table 11, which is published as supporting information on the PNAS web site).

The reasons for these bee/fly differences are not known but might involve enhanced transcription factor binding to known G+C-rich motifs in bees, the presence of additional as yet unidentified G+C-rich motifs in bees, species differences in methylation, or species differences in gene expression. Genes with brain-specific patterns of expression are known to have high G+C content promoters in humans (9).

Table 1. Gene sets associated with honey bee socially regulated division of labor, derived from microarray experiments on brain gene expression

Condition | Gene sets | Size (no. of genes) | Description
Behavioral development | Preforaging maturation1 / Preforaging maturation2 | 710 / 871 | Up/down-regulated in 1-day-old relative to 4-day-old hive (preforaging) bees
Behavioral development | Forager1 / Nurse1 | 560 / 789 | Up-regulated in foragers relative to nurses; up-regulated in nurses relative to foragers
Behavioral development | Hive bee-to-forager transition1 / Hive bee-to-forager transition2 | 100 / 100 | Up/down-regulated in foragers relative to 4-day-old hive bees
Treatment | Methoprene1 / Methoprene2 | 361 / 236 | Up/down-regulated in bees treated orally with the JH* analog methoprene relative to sucrose-fed control bees
Treatment | cGMP1 / cGMP2 | 279 / 226 | Up/down-regulated in bees treated orally with cGMP relative to sucrose-fed control bees (related to foraging gene†)
Treatment | cGMP > cAMP / cGMP < cAMP | 215 / 210 | Up/down-regulated in bees treated orally with cGMP relative to those treated orally with cAMP‡
Treatment | Manganese1 / Manganese2 | 333 / 529 | Up/down-regulated in bees treated orally with manganese relative to sucrose-fed control bees (related to malvolio gene§)
Genetic variation | A.m. ligustica1 / A.m. mellifera1 | 373 / 331 | Up/down-regulated in A.m. ligustica relative to mellifera (mellifera have faster behavioral development)

Brief behavioral descriptions are given in the text. Microarray experiments on brain gene expression are described in refs. 2 and 3. The most important gene sets are the Forager1 and Nurse1 sets, each of which contains genes significantly up-regulated in the brains of bees exhibiting these behaviors; the expression of these sets of genes also has been shown to be predictive of social behavior (2). Many derivative gene sets were obtained from the primary compendium of 16 sets by refining the characteristics of the sets or intersecting them with each other (not shown; see Methods).
*JH causes precocious foraging in bees (31).
†The foraging gene encodes a cGMP-dependent protein kinase, which is up-regulated in the bee brain during behavioral maturation; cGMP treatment causes precocious foraging (32).
‡cAMP is similar to cGMP but does not cause early foraging (32).
§The malvolio gene, which encodes a manganese transporter in neurons, is up-regulated in the bee brain during behavioral maturation; manganese treatment causes early foraging (33).

Behaviorally Related Genes in Honey Bees Are Enriched for cis-Regulatory Motifs of Hairy, Cf1, Adf1, Dri, and Snail. To identify motif–gene set associations on the basis of the specific identity of the motifs, rather than the G+C content of the motifs and promoters, we repeated the analyses while controlling for the effects of G+C content. First, Stubb was run with a local, rather than global, background model based on each gene’s promoter. This conservative strategy likely resulted in a failure to detect some true motif–gene set associations (G+C-rich motifs in G+C-rich promoters will have suppressed Stubb scores), but it enhances the reliability of our results. Second, the enrichment analysis was done with the additional requirement that the gene set be more enriched for the motif than for G+C content. Third, the correlation analysis used partial correlation coefficients to factor out the correlation between G+C content and expression (10).
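The third control can be made concrete with a first-order partial correlation: the correlation between motif score x and expression y after removing what each shares with G+C content z. The sketch below uses the standard textbook formula; the function names and the toy numbers are assumptions, and the source does not say how the partial correlations were actually computed.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """Correlation of x and y with the linear effect of z factored out.

    r_xy.z = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
    """
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Toy data (hypothetical): z is uncorrelated with both x and y,
# so factoring it out leaves the plain correlation unchanged.
motif_score = [1, 2, 3, 4]
expression = [2, 1, 4, 3]
gc_fraction = [1, -1, -1, 1]
plain = pearson(motif_score, expression)
controlled = partial_corr(motif_score, expression, gc_fraction)
```

When G+C content does correlate with both quantities, `controlled` moves away from `plain`, which is the effect the control is designed to expose.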

Adf1, Hairy, Snail, Dri, and Cf1 were found to be associated with several different gene sets (Table 3; and Tables 12 and 13, which are published as supporting information on the PNAS web site). Each of these motif–gene set associations was the most statistically significant (P < 0.001) among all motifs for that set. Three of these associations (involving Adf1, CF1, and Hairy) were significant by both enrichment and correlation analyses (Table 3).

As an additional negative control apart from that provided inherently by our statistical framework, we performed the same enrichment analysis on a randomized promoter sequence obtained by permuting the bases of each of the 3,129 promoters. Hypergeometric tests revealed no statistically significant associations between these five (or any other) motifs and gene sets (data not shown).
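A minimal sketch of this kind of negative control is shown below: each promoter is shuffled independently, which destroys motif instances while preserving length and base composition (and therefore G+C content). The function names, the seed, and the toy sequences are assumptions for the example; the source does not describe the permutation procedure in more detail.

```python
import random

def shuffle_promoter(seq, rng):
    """Return a random permutation of the bases in seq."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def randomized_promoters(promoters, seed=0):
    """Shuffle every promoter independently, using one seeded generator.

    promoters: dict mapping gene id -> promoter sequence.
    """
    rng = random.Random(seed)
    return {gene: shuffle_promoter(seq, rng) for gene, seq in promoters.items()}

# Toy input (hypothetical sequences).
toy = {"geneA": "ACGTACGTGGCC", "geneB": "TTTTAAAACGCG"}
control = randomized_promoters(toy, seed=1)
```

The shuffled set can then be passed through exactly the same scoring and enrichment steps as the real promoters, so that any association surviving the shuffle points to a composition artifact.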

For these five motifs, the genome has more promoters with extremely high Stubb scores than expected by chance (in a randomized genome) (see Fig. 2, which is published as supporting information on the PNAS web site). These statistical results suggest that the motifs have some functional relevance in the honey bee genome. Additional evidence is provided by GO analysis. The 200 genes with the strongest association with Adf1 and Dri (in terms of Stubb scores) are significantly associated with the biological process “nervous system development,” whereas the genes for Hairy are significantly associated with “sensory organ development” (see Table 14, which is published as supporting information on the PNAS web site), a process that is JH-dependent (11, 12).

Fig. 1. Flow chart for the discovery of motif–gene set associations using four statistical tests. We identified and quantified patterns of occurrence of transcription factor binding motifs in the promoter regions of selected sets of genes (DATA1) related to socially regulated division of labor in honey bee colonies (2, 3). Each motif was scored against each gene’s promoter (DATA2). The top 200 targets of each motif (the genes with the highest Stubb scores for each motif) were analyzed for enrichment for behavior-related gene sets (Tests 1A and 1B). The union of up- and down-regulated gene sets was analyzed for correlation between gene expression and each motif’s score (Test 2) and for binary classifiability into up- or down-regulated genes based on each individual motif’s scores (Test 3) or based on scores for all motifs (Test 4).

Table 2. Some gene sets are enriched for high G+C content (fraction of sequence length that is G or C) in their promoters

Gene set | Statistical test | P value
Preforaging maturation1 | Hypergeometric | 2.8E-5
Manganese1 | Hypergeometric | 4.5E-11
A.m. ligustica1 | Hypergeometric | 9.3E-4
Preforaging maturation1 vs. preforaging maturation2 | Correlation coefficient | 3.2E-9
Manganese1 vs. manganese2 | Correlation coefficient | <2.2E-16
cGMP > cAMP vs. cGMP < cAMP | Correlation coefficient | 6.7E-4
Forager1 and manganese1 vs. nurse1 and manganese2 | Correlation coefficient | 6.7E-5

Enrichment was established by using a hypergeometric test (using the top 200 genes by G+C content) and with a correlation coefficient.

The known functions of the transcription factors associated with these five motifs are consistent with the possibility that these motifs are involved in the regulation of neural and behavioral plasticity in the honey bee. Adf1, Hairy (13, 14), and Snail (15) proteins have roles in nervous system development in Drosophila, and Adf1 is also implicated in olfactory learning (16) and synapse formation in adults (17). Cf1 (also called “Ultraspiracle” or “USP”) binds specifically to JH in vitro (18), is rapidly up-regulated by JH in the adult honey bee (19), and is expressed throughout the adult bee brain, especially in the mushroom bodies, a region of multimodal sensory integration and memory (20). JH blood levels increase during honey bee behavioral development; levels are lower in nurse bees than in foragers, and methoprene (JH analog) treatment causes precocious foraging (1). Thus, JH is likely to be a crucial mechanism by which social cues interact with the genome (3).

We were able to find orthologs in the bee genome (http://racerx00.tamu.edu/bee/resources.html) for four of five corresponding Drosophila transcription factors (all but Adf1). We also found a high degree of sequence identity for the DNA binding domains of these four orthologs, suggesting that they bind to the same motifs as in Drosophila (see Fig. 3, which is published as supporting information on the PNAS web site). For example, the basic N-box-specific region of the Hairy protein and the P box of the Cf1 protein are completely conserved across insects and humans.

However, despite these similarities, the motif–gene set associations we found for honey bee were not detected in fly. Hypergeometric tests on orthologs of the bee gene sets revealed no strongly significant associations in Drosophila; for the six significant associations in Table 12, P values in Drosophila were 0.33, 0.54, 0.69, 0.91, 0.02, and 0.54, respectively. These results suggest that regulatory processes related to the observed motif–gene set associations might be associated with some specific aspects of honey bee social behavior that differ from Drosophila behavior. Consistent with this finding is the observation that although ≈76% of the genes annotated in the honey bee genome have detectable orthologs in D. melanogaster (4), orthologs were present for a significantly lower fraction of up-regulated genes in our gene sets (69%, P = 0.006; and 77%, P = 0.67, for down-regulated genes). Some of the gene sets studied here may encode a slightly higher proportion of “novel” or rapidly evolving proteins than does the whole genome.

Several gene sets (e.g., those related to “genetic variation”) were not found to be associated with particular cis-regulatory motifs. Possible reasons for this finding include associations with motifs other than those studied here or extensive regulation at levels other than transcription.

Patterns of Occurrence of Hairy, Snail, and GAGA Binding Factors Classify Expression Patterns of Honey Bee Behaviorally Related Genes. Our results provide evidence for a connection between cis and social regulation of brain gene expression. If this connection is strong, then the patterns of motif occurrence we detected should have sufficient discriminative power to classify the expression pattern of genes in our gene sets as up- or down-regulated. We tested this hypothesis by looking for cases in which motif-based classification accuracy is statistically significant and G+C content-based classification accuracy is less so.

Table 3. Identification of motif–gene set associations by enrichment or correlation analysis

Gene set | Statistical test* | Statistical significance | Motif† | Test 2 P value‡
cGMP � cAMP | Hypergeometric | 9.98E-7 | Dri | 0.95
cGMP2 | Hypergeometric | 3.5E-4 | CF1 | 0.008
Methoprene2 | Hypergeometric | 3.4E-4 | Hairy | 0.009
Methoprene1 vs. methoprene2 | Partial correlation | 6.9E-5 | Adf1 | 0.0007
Manganese1 vs. manganese2 | Partial correlation | 5.4E-4 | Hairy | 0.22
cGMP1 vs. cGMP2 | Partial correlation | 8.7E-4 | Snail | 0.47
Forager1 and cGMP1 vs. nurse1 and cGMP2 | Partial correlation | 4.6E-4 | Adf1 | 0.13

*Hypergeometric, hypergeometric test (P < 0.001, Q < 0.150), with the additional requirement that the gene set be more enriched for the motif than for G+C content; Partial correlation, Pearson’s correlation coefficient (P < 0.001, Q < 0.075) for motif score and gene expression, using partial correlation analysis to factor out the correlation between G+C content and expression (10).
†Enrichment analysis identified transcription factor binding sites for Dri, CF1, and Hairy, and correlation analysis identified Hairy (again), Adf1, and Snail. Each of these associations was the most statistically significant among all motifs for the gene set listed in column 1.
‡The statistical significance from the second test (i.e., partial correlation if the first test is hypergeometric, and conversely).

Table 4. Classification of brain gene expression (up- or down-regulated) on the basis of the pattern of (single) motif occurrences

Gene set | Motif | Classification accuracy, % | Statistical significance | Statistical significance for G+C content-based classification
Hive bee-to-forager transition2 vs. hive bee-to-forager transition1* | GAGA | 71 | 1.6E-5 | 0.62
Methoprene2 vs. methoprene1 | Hairy | 58 | 1.9E-5 | 1.9E-2
cGMP2 vs. cGMP1 | Snail | 57.5 | 8E-4 | 0.56
Hive bee-to-forager transition2 vs. hive bee-to-forager transition1 | GAGA | 62 | 4.2E-4 | 6.6E-3

Listed are cases in which motif-based classification accuracy is statistically significant (P < 0.001, binomial test) and G+C content-based classification accuracy is less significant.
*Analysis was performed on the top 50% of genes in each gene set (based on the magnitude of gene expression differences; see Methods).

We first tested each motif separately, using a threshold-based classifier, and found four such cases (Table 4; and Tables 15 and 16, which are published as supporting information on the PNAS web site). These cases involved three different motifs: Hairy and Snail (again) and GAGA. The presence or absence of the GAGA motif alone correctly classified the expression (up- or down-regulated) of 71% of 100 of the most differentially expressed genes in the Hive bee-to-forager transition1 vs. Hive bee-to-forager transition2 set, compared with 50% expected on the basis of chance alone (P = 1.6E-5, binomial test). Similar results were obtained for Hairy and the response of genes to methoprene (JH analog), and for Snail and the response of genes to cGMP (related to the foraging gene; see ref. 1).
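The single-motif test can be sketched as a threshold rule learned on one half of the genes and evaluated on the other half, followed by a binomial tail probability against the 50% chance level. Everything below is an illustrative assumption rather than the authors' procedure: the alternating fold assignment, the midpoint candidate thresholds, the one-sided test, and the function names are all chosen for the example.

```python
from math import comb

def fit_threshold(scores, labels):
    """Pick the (threshold, sign) rule with the best training accuracy.

    A gene is predicted up-regulated (label 1) when sign * (score - threshold) > 0.
    """
    uniq = sorted(set(scores))
    candidates = [(a + b) / 2 for a, b in zip(uniq, uniq[1:])] or uniq
    best_correct, best_rule = -1, None
    for t in candidates:
        for sign in (1, -1):
            correct = sum(
                (sign * (s - t) > 0) == bool(y) for s, y in zip(scores, labels)
            )
            if correct > best_correct:
                best_correct, best_rule = correct, (t, sign)
    return best_rule

def two_fold_correct(scores, labels):
    """Number of genes classified correctly under 2-fold cross-validation."""
    n = len(scores)
    folds = [list(range(0, n, 2)), list(range(1, n, 2))]
    correct = 0
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(n) if i not in held_out]
        t, sign = fit_threshold(
            [scores[i] for i in train_idx], [labels[i] for i in train_idx]
        )
        correct += sum(
            (sign * (scores[i] - t) > 0) == bool(labels[i]) for i in test_idx
        )
    return correct

def binomial_tail(correct, n, p=0.5):
    """P(X >= correct) for X ~ Binomial(n, p); fine for n in the hundreds."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(correct, n + 1))

# Toy motif scores (hypothetical): low scores are down-regulated, high are up-regulated.
toy_scores = [0.30, 0.32, 0.31, 0.33, 0.45, 0.47, 0.46, 0.48]
toy_labels = [0, 0, 0, 0, 1, 1, 1, 1]
toy_correct = two_fold_correct(toy_scores, toy_labels)

# Significance of 71 correct calls out of 100 against a 50% chance level.
p_gaga = binomial_tail(71, 100)
```

Under the one-sided assumption used here, `p_gaga` comes out at roughly 1.6E-5, in line with the value quoted for the GAGA motif in Table 4.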

Using all 41 motifs in our compendium and a support vector machine (SVM) classifier (http://svmlight.joachims.org), we found one more case: 2-fold cross-validation analysis correctly classified the expression pattern of 71% of the genes up-regulated in foragers and by methoprene compared with genes up-regulated in nurses and down-regulated by methoprene (n = 78 genes, 39 of each; P = 0.00018, binomial test). G+C content alone had insignificant predictive power (P > 0.1, binomial test). These findings are remarkable, considering that our motif compendium comprises <15% of the estimated transcription factors in the Drosophila (www.godatabase.org) or honey bee genomes.

cis-Regulatory Motifs Show Combinatorial Regulation of Honey Bee Behaviorally Related Genes. The transcriptional pathways involved in embryonic development are well known for their combined action of multiple transcription factors on common gene targets (21). The combinatorial nature of transcriptional pathways is a highly conserved feature of cis-regulation, so we expected to be able to detect this in the context of social regulation. Pairs of cis-regulatory binding sites do indeed co-occur in the promoter regions of genes in our gene sets.

Focusing on the Forager1 and Nurse1 gene sets, we counted the common gene targets of each pair of transcription factors and found this number to be significantly higher than expected by chance for 78 of the 820 pairs (P < 0.001, Q < 0.01, hypergeometric test; see ref. 22), even after controlling for the effect of local background composition. In comparison, a negative control experiment involving randomly permuted versions of the promoters produced only five significant pairs of transcription factors. After generating a pairwise interaction graph from these data and performing further analysis, we found a cohesive set of 7 transcription factors (Adf1, Hairy, Cf1, and GAGA again, plus AbdB, Zeste, and Eve) of the 41 studied, with almost every pair having significant patterns of interaction (see Fig. 4, which is published as supporting information on the PNAS web site).

Conclusions

These results demonstrate a robust association among social behavior, brain gene expression, and distributions of transcription factor binding sites throughout the honey bee genome and are consistent with results from studies of social behavior in rodents, which have revealed important roles for cis regulation for single genes (23, 24). Our finding of motifs by using PWMs primarily derived from studies of Drosophila development demonstrates that transcriptional networks involved in the regulation of embryonic development are reused by nature for adult behavioral functions. We also detected differences in motif–gene set associations and promoter G+C content between honey bee and Drosophila that might reflect unique aspects of gene regulation associated with social regulation. Social behavior is a highly derived trait, and we predict that the evolution of transcriptional combinatorial codes for socially regulated gene expression in the brain will involve both conserved and novel motifs that await discovery.

Methods

Scoring Genes for Motif Occurrence. The Stubb algorithm (6) computes a statistical (log-likelihood ratio) score for the clustering of binding sites of one or more transcription factors in a sequence, accounting for varying numbers and strengths of the sites. The software was run with a window size of 200, shift of 100, and a zeroth-order background Markov model. The training sequence for the Markov model was either the entire complement of 2-kbp promoters (‘‘global background’’) or the sequence of the promoter being analyzed (‘‘local background’’). (The local background of 2 kbp provides sufficient statistics to train a zeroth-order Markov model of four states.) The best-scoring 200-bp window in a gene’s promoter was used to score that gene. We masked short tandem repeats (putative micro- and minisatellites) in the sequence to prevent the Stubb algorithm from confounding these repeats with weak binding sites. Tandem repeats were masked using the program Tandem Repeat Finder (25), with recommended parameters 2 7 7 80 10 25 500. The heat shock element was represented by a set of eight PWMs from TRANSFAC (www.gene-regulation.com/pub/databases.html#transfac), corresponding to trimers of the AGAAN site in all possible orientations (26). Other PWM motifs, corresponding to binding sites in fruit fly, were obtained from TRANSFAC, JASPAR (27), and Schroeder et al. (28), and redundant matrices were removed manually to obtain a set of 41 motifs for 37 distinct transcription factors. Unless specified otherwise, a gene was considered as containing binding sites for a particular transcription factor if it was among the top 200 genes ranked by Stubb score for the corresponding motif; these genes are referred to as ‘‘targets’’ of a motif. Thus, motif targets are decided based on clustering of strong and weak motif occurrences in their promoters. The G+C content of a promoter was measured as the fraction of the promoter that is Gs or Cs.
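The bookkeeping around the Stubb scores (not the Stubb algorithm itself) can be sketched as follows. The function names are hypothetical, and the handling of masked or ambiguous bases (ignored here) and of ties in the ranking are assumptions that the text does not specify.

```python
from collections import Counter

def gc_fraction(promoter: str) -> float:
    """Fraction of unambiguous bases that are G or C (N and other symbols are
    ignored, which is an assumption of this sketch)."""
    counts = Counter(promoter.upper())
    acgt = sum(counts[b] for b in "ACGT")
    return (counts["G"] + counts["C"]) / acgt if acgt else 0.0

def zeroth_order_background(sequence: str) -> dict:
    """Base frequencies of a non-empty sequence: the four parameters of a
    zeroth-order Markov background model."""
    counts = Counter(b for b in sequence.upper() if b in "ACGT")
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def top_targets(scores: dict, k: int = 200) -> set:
    """The k genes with the highest motif score (ties broken by gene name)."""
    ranked = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return set(ranked[:k])
```

Passing a single promoter to `zeroth_order_background` corresponds to the local background, and passing the concatenation of all promoters corresponds to the global background.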

Gene Sets. Eight sets of genes established as being significantly up-regulated or down-regulated in the brain in response to specific conditions (behavioral development, genetic variation, or neuroactive treatment) were taken from refs. 2 and 3. Each set was partitioned into up- and down-regulated subsets, as listed in Table 1. Treatments with cGMP, methoprene (JH analog), or manganese are known to induce precocious foraging in the honey bee. Hence, we intersected the Forager1 set with each of the cGMP1, Methoprene1, and Manganese1 sets, and likewise the Nurse1 set with each of cGMP2, Methoprene2, and Manganese2 sets to obtain six additional sets (set sizes as given in Table 17, which is published as supporting information on the PNAS web site), for a total of 22 sets. Enrichment analysis was done for each of these 22 sets separately.

Enrichment Analyses. Hypergeometric test. The set of genes targeted by a motif was intersected with a given gene set, and the P value of the intersection size was calculated with the hypergeometric test: H(N, g, m, i), where N is the total number of genes (3,129), g is the size of the given gene set, m is the number of genes targeted by motif (200), and i is the intersection size. For motif enrichment vs. G+C enrichment, let i_{G/C} be the size of the intersection of the gene set with the top m G+C-rich genes (promoters). We heuristically computed the P value of the hypergeometric distribution H(N, i_{G/C}N/m, m, i) and required this P value to be <0.05 as an additional criterion for reporting the motif as significantly associated with the gene set. Note that the original hypergeometric test H(N, g, m, i) compares the fraction i/m (the specificity of predicting gene set membership based on motif score) with the fraction g/N, which is the baseline specificity of a random predictor. In the second statistical test, we change the baseline performance to be that achieved by a G+C content-based predictor, i_{G/C}/m, and compare the fraction i/m with this fraction, using H(N, i_{G/C}N/m, m, i).
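A minimal sketch of the two tests follows. It assumes that H(N, g, m, i) denotes the upper-tail probability of observing an intersection of size i or larger, and that the effective gene set size i_{G/C}N/m is rounded to the nearest integer; neither detail is stated in the text. The function names and the example values of g, i, and i_{G/C} are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(N: int, g: int, m: int, i: int) -> float:
    """P(overlap >= i) when m motif targets are drawn without replacement
    from N genes, g of which belong to the gene set."""
    total = comb(N, m)
    upper = min(g, m)
    hits = sum(comb(g, k) * comb(N - g, m - k) for k in range(i, upper + 1))
    return hits / total

def gc_baseline_p(N: int, m: int, i: int, i_gc: int) -> float:
    """Same test with the baseline set by the G+C predictor: g is replaced by
    i_gc * N / m (rounded here, which is an assumption of this sketch)."""
    g_effective = round(i_gc * N / m)
    return hypergeom_upper_tail(N, g_effective, m, i)

# Hypothetical example: gene set of 150 genes, 20 of them among the 200 motif
# targets, and 12 of them among the 200 most G+C-rich promoters.
p_motif = hypergeom_upper_tail(3129, 150, 200, 20)
p_vs_gc = gc_baseline_p(3129, 200, 20, 12)
```

Python integers are arbitrary precision, so the large binomial coefficients are summed exactly before the final division.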

16356 | www.pnas.org/cgi/doi/10.1073/pnas.0607448103 Sinha et al.

Partial correlation analysis. The sets of up- and down-regulated genes for a particular condition (e.g., methoprene response) were combined into one set, and for each gene we noted the expression value (E), the Stubb score (M) for a motif, and the G+C percentage (GC) of its promoter. A partial correlation analysis was performed to assess the correlation between variables E and M, while factoring out the effect of variable GC, and a P value was computed. The partial Pearson’s correlation between two variables X and Y, after controlling for a third variable Z, was computed from the pairwise Pearson’s correlation coefficients by the formula

$$r_{XY,Z} = \frac{r_{XY} - r_{XZ}\, r_{YZ}}{\sqrt{\left(1 - r_{XZ}^{2}\right)\left(1 - r_{YZ}^{2}\right)}}. \qquad [1]$$

The significance of a partial correlation $r_{XY,Z}$ with $n$ data points was assessed with a two-tailed t test on $t = r_{XY,Z}\sqrt{(n - 3)/(1 - r_{XY,Z}^{2})}$, with $n - 3$ degrees of freedom.
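Eq. 1 and the associated t statistic can be written out directly. The sketch below uses only the standard library and hypothetical function names; converting t to a two-tailed P value requires the CDF of the t distribution with n − 3 degrees of freedom, which is left to a statistics package.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_correlation(x, y, z):
    """Correlation of x and y after controlling for z (Eq. 1)."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz**2) * (1 - r_yz**2))

def t_statistic(r, n):
    """t value for a partial correlation r from n data points (n - 3 d.f.)."""
    return r * sqrt((n - 3) / (1 - r**2))
```

In the setting described above, x would be the expression values E, y the Stubb scores M, and z the G+C percentages GC.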

Q Value. To account for multiple-hypothesis testing, Q values (22) (estimated false discovery rates) were calculated as Q = (estimated no. of false positives)/(no. of called positives at a given P value) = (P × n)/i, where P is the P value, n is the total number of tests, and i is the sorted rank of the P value.
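As a sketch, the Q value formula amounts to ranking the P values in ascending order and rescaling each one. The function name is hypothetical, and the code applies the formula as written, without capping values at 1 or enforcing monotonicity, since the text mentions neither.

```python
def q_values(p_values):
    """Q = P * n / i, where n is the number of tests and i is the rank of P
    among all P values sorted in ascending order (rank 1 = smallest)."""
    n = len(p_values)
    order = sorted(range(n), key=lambda idx: p_values[idx])
    q = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        q[idx] = p_values[idx] * n / rank
    return q
```

The result is returned in the same order as the input P values.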

Orthology of Genes. Predicted honey bee genes were assigned to orthology groups with D. melanogaster genes on the basis of reciprocal best BLASTX match.
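The reciprocal best match criterion can be sketched as follows, assuming the best hit of every gene in each direction has already been extracted from the BLASTX output. The function name and the dictionary inputs are hypothetical.

```python
def reciprocal_best_hits(best_bee_to_fly: dict, best_fly_to_bee: dict) -> list:
    """Pairs (bee_gene, fly_gene) in which each gene is the other's best hit."""
    return sorted(
        (bee, fly)
        for bee, fly in best_bee_to_fly.items()
        if best_fly_to_bee.get(fly) == bee
    )
```

A bee gene whose best fly hit points back to a different bee gene is left without an ortholog under this criterion.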

GO Analysis. GO analysis was done using the program GeneMerge (29) that computes the significance (E value) of the enrichment of a particular set of genes for a GO term. We used the biological processes ontology (D_melanogaster.BP) and an E value threshold (computed using Bonferroni correction) of 0.1.

Threshold-Based Classification. Up- and down-regulated genes for a particular condition were labeled as positive and negative, respectively, and for any given motif, a score threshold was sought that maximized the number of correctly classified genes (motif score above threshold predicts positive and below threshold predicts negative). This technique is known as the TNoM score (30) and has been used in feature selection tasks related to cancer tissue classification. Two-fold cross-validation was done for each pair of gene sets and each motif, and the fraction of correctly classified genes on the test set was used to evaluate the classification accuracy of the motif. In 2-fold cross-validation, both the positively and negatively labeled genes are partitioned into two equal parts: the training set and the test set. Parameters of the classifier are trained on the training set, and the results are evaluated on the test set. The roles of the training and test sets are then reversed, and the overall results are the average of the two experiments.
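The threshold search and the 2-fold cross-validation can be sketched as below. The function names are hypothetical, and the split into halves by alternating positions is an assumption made to keep the example deterministic; the text does not say how the two parts were chosen.

```python
def count_correct(scores, labels, threshold):
    """Genes for which 'score above threshold' agrees with the positive label."""
    return sum((s > threshold) == y for s, y in zip(scores, labels))

def best_threshold(scores, labels):
    """Threshold that maximizes the number of correctly classified genes.
    Candidates are each observed score plus one value below all of them."""
    candidates = sorted(set(scores))
    thresholds = [candidates[0] - 1.0] + candidates
    return max(thresholds, key=lambda t: count_correct(scores, labels, t))

def two_fold_cv(pos_scores, neg_scores):
    """Average test accuracy over the two train/test assignments."""
    folds = []
    for start in (0, 1):  # alternate positions so both classes are split in half
        pos, neg = pos_scores[start::2], neg_scores[start::2]
        folds.append((pos + neg, [True] * len(pos) + [False] * len(neg)))
    accuracies = []
    for train, test in ((0, 1), (1, 0)):
        threshold = best_threshold(*folds[train])
        scores, labels = folds[test]
        accuracies.append(count_correct(scores, labels, threshold) / len(scores))
    return sum(accuracies) / 2
```

Here `pos_scores` and `neg_scores` would hold the motif scores of the up- and down-regulated genes, respectively, as plain lists.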

SVM Classification. The SVM classifier used was SVMlight (http://svmlight.joachims.org), and 2-fold cross-validation was done. The fraction of correctly classified genes in the test set gives the accuracy of the classifier.

Data Sets Used in Classification. For both classification exercises (threshold-based and SVM), the data sets were modified from the original, as follows. Each pair of up- and down-regulated gene sets (eight pairs listed in Table 1, as well as the three additional pairs derived from these, as described above) was taken, and the larger of the two was shrunk to match the size of the smaller set, retaining the most up- or down-regulated genes. Thus, every classification test was done on positive and negative sets of the same size. The test was also repeated with each of these gene sets further shrunk to half of its original size, retaining the most up- or down-regulated genes (‘‘top 50%’’). Threshold-based classification was also done on the original gene sets (before the above modifications), and results are shown in Fig. 5, which is published as supporting information on the PNAS web site.
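The size-matching step can be sketched as follows, assuming each gene set is a mapping from gene name to a signed expression difference and that ‘‘most up- or down-regulated’’ means largest absolute difference. The function names, the tie-breaking by gene name, and the rounding down for odd set sizes are assumptions of this illustration.

```python
def most_extreme(genes: dict, k: int) -> list:
    """The k genes with the largest absolute expression difference."""
    return sorted(genes, key=lambda g: (-abs(genes[g]), g))[:k]

def balanced_pair(up: dict, down: dict, fraction: float = 1.0):
    """Shrink both sets to the size of the smaller one (times `fraction`),
    keeping the most strongly regulated genes in each."""
    k = int(min(len(up), len(down)) * fraction)
    return most_extreme(up, k), most_extreme(down, k)
```

Calling `balanced_pair(up, down, fraction=0.5)` corresponds to the ‘‘top 50%’’ variant.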

Pairwise Associations Between Motifs. Association among motifs was measured as follows. For every pair of motifs, the intersection set X of their target gene sets was assessed for significance, using the hypergeometric test. To factor out the effect of G+C content, (i) the Stubb score of each promoter sequence was recomputed after randomly shuffling the bases in the sequence, (ii) the intersection of the two motifs’ target gene sets according to these newly computed randomized Stubb scores was obtained, and (iii) this intersection set was subtracted from the original intersection set X. Hence, the hypergeometric test on the shrunk intersection is made more stringent by removing certain common gene targets that might be due to similar G+C content.
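Steps (i) to (iii) can be sketched with set operations, assuming the target sets for the original and the shuffled promoters have already been obtained from Stubb scores (the scoring itself is not reproduced). The function names are hypothetical, and applying the hypergeometric test to the reduced intersection with the original target set sizes is an assumption of this sketch.

```python
import random
from math import comb

def shuffle_sequence(sequence: str, rng: random.Random) -> str:
    """Randomly permute the bases, preserving base composition (step i)."""
    bases = list(sequence)
    rng.shuffle(bases)
    return "".join(bases)

def corrected_overlap(targets_a, targets_b, shuffled_a, shuffled_b) -> set:
    """Common targets minus those also shared under shuffled promoters
    (steps ii and iii)."""
    return (targets_a & targets_b) - (shuffled_a & shuffled_b)

def overlap_p_value(N: int, size_a: int, size_b: int, overlap: int) -> float:
    """Hypergeometric upper tail for an intersection of the given size."""
    upper = min(size_a, size_b)
    hits = sum(comb(size_a, k) * comb(N - size_a, size_b - k)
               for k in range(overlap, upper + 1))
    return hits / comb(N, size_b)
```

Because shuffling preserves the base composition of each promoter, targets that survive the shuffle are plausibly attributable to G+C content rather than to the motifs.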

We thank P. Kheradpour for early assistance; R. Velarde for the USP alignment; and Y. Ben-Shahar, J. H. Hunt, S. Zhang, members of the Robinson laboratory, and G. D. Stormo for comments that improved the manuscript. This work was supported by the University of Illinois Sociogenomic Initiative (G.E.R.) and by a National Science Foundation Frontiers in Integrative Biological Research grant.

1. Robinson GE, Grozinger CM, Whitfield CW (2005) Nat Rev Genet 6:257–270.
2. Whitfield CW, Cziko AM, Robinson GE (2003) Science 302:296–299.
3. Whitfield CW, Ben-Shahar Y, Brillet C, Leoncini I, Caruser D, LeConte Y, Rodriguez-Zas S, Robinson GE (2006) Proc Natl Acad Sci USA 103:16068–16075.
4. Honeybee Genome Sequencing Consortium (2006) Nature 443:931–949.
5. Carroll SB, Grenier JK, Weatherbee SD (2001) From DNA to Diversity: Molecular Genetics and the Evolution of Animal Design (Blackwell Science, Malden, MA).
6. Sinha S, van Nimwegen E, Siggia ED (2003) Bioinformatics 19(Suppl 1):i292–i301.
7. Sinha S, Schroeder M, Unnerstall U, Gaul U, Siggia ED (2004) BMC Bioinformatics 5:E129.
8. Grimaldi D, Engel MS (2005) Evolution of the Insects (Cambridge Univ Press, Cambridge, UK).
9. Vinogradov AE (2003) Nucleic Acids Res 31:5212–5220.
10. Lee S, Kohane I, Kasif S (2005) BMC Genomics 6:E168.
11. Orenic TV, Held LI, Jr, Paddock SW, Carroll SB (1993) Development (Cambridge, UK) 118:9–20.
12. Zhou X, Riddiford LM (2002) Development (Cambridge, UK) 129:2259–2269.
13. Heng JIT, Tang SS (2003) Bioessays 25:709–716.
14. Frankfort BJ, Mardon G (2002) Development (Cambridge, UK) 129:1295–1306.
15. Ashraf SI, Hu X, Roote J, Ip YT (1999) EMBO J 18:6426–6438.
16. Roman G, Davis RL (2001) Bioessays 23:571–581.
17. DeZazzo J, Sandstrom D, de Belle S, Velinzon K, Smith P, Grady L, DelVecchio M, Ramaswami M, Tully T (2000) Neuron 27:145–158.
18. Jones G, Sharp PA (1997) Proc Natl Acad Sci USA 94:13499–13503.
19. Barchuk AR, Maleszka R, Simoes ZL (2004) Insect Mol Biol 13:459–467.
20. Velarde R, Robinson GE, Fahrbach SE Insect Mol Biol, in press.
21. Stathopoulos A, Levine M (2005) Dev Cell 9:449–462.
22. Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T, Pandey A, Chinnaiyan AM (2004) Proc Natl Acad Sci USA 101:9309–9314.
23. Insel TR, Young LJ (2001) Nat Rev Neurosci 2:129–136.
24. Weaver IC, Szyf M, Meaney MJ (2002) Endocr Res 28:699.
25. Benson G (1999) Nucleic Acids Res 27:573–580.
26. Fernandes M, Xiao H, Lis JT (1994) Nucleic Acids Res 22:167–173.
27. Sandelin A, Alkema W, Engstrom P, Wasserman WW, Lenhard B (2004) Nucleic Acids Res 32:D91–D94.
28. Schroeder MD, Pearce M, Fak J, Fan H, Unnerstall U, Emberly E, Rajewsky N, Siggia ED, Gaul U (2004) PLoS Biol 2:E271.
29. Castillo-Davis CI, Hartl DL (2003) Bioinformatics 19:891–892.
30. Ben-Dor A, Bruhn L, Friedman N, Nachman I, Schummer M, Yakhini Z (2000) J Comput Biol 7:559–583.
31. Bloch G, Wheeler DL, Robinson GE (2002) in Hormones, Brain, and Behavior, eds Pfaff DW, Arnold AP, Etgen AM, Fahrbach SE, Rubin RT (Academic, New York), Vol 3, pp 195–236.
32. Ben-Shahar Y, Robichon A, Sokolowski MB, Robinson GE (2002) Science 296:741–744.
33. Ben-Shahar Y, Dudek NL, Robinson GE (2004) J Exp Biol 207:3281–3288.
