Orphan genes in Leishmania major


description

Orphan genes are protein-coding genes that lack recognizable homologs in other organisms. They have been reported to make up a considerable fraction of the coding regions in all sequenced genomes and are thought to be associated with an organism's lineage-specific traits. However, their evolutionary persistence and functional significance remain elusive. Because they lack homologs in the host genome and probably have lineage-specific functional roles, the orphan gene products of pathogenic protozoa may be considered possible therapeutic targets. L. major is an important parasitic protozoan of the genus Leishmania and is associated with cutaneous leishmaniasis. Evolutionary and functional characterization of orphan genes in this organism may therefore help in understanding the factors governing pathogen evolution and parasitic adaptation. In this study, we systematically identified the orphan genes of L. major and employed several in silico analyses to understand their evolutionary and functional attributes. To trace the signatures of molecular evolution, we compared their evolutionary rate with that of non-orphan genes. In agreement with prior observations, we found that orphan genes evolve at a higher rate than non-orphan genes. The lower sequence conservation of orphan genes was previously attributed solely to their younger gene age. Here, however, we observed that, together with gene age, a number of genomic factors (such as expression level, GC content and variation in codon usage) and proteomic factors (such as protein length, intrinsic disorder content and hydropathicity) could independently modulate their evolutionary rate. We considered the interplay of all these factors and analyzed their relative contributions to protein evolutionary rate by regression analysis. At the functional level, we observed that orphan genes are associated with regulatory, growth factor and transport-related processes. Moreover, these genes were found to be enriched in various types of interaction and trafficking motifs, implying their possible involvement in host-parasite interactions. Thus, our comprehensive analysis of L. major orphan genes provides evidence for their extensive roles in host-pathogen interactions and virulence.

Transcript of Orphan genes in Leishmania major

  • Infection, Genetics and Evolution 32 (2015) 330–337

    Contents lists available at ScienceDirect

    Infection, Genetics and Evolution

    journal homepage: www.elsevier.com/locate/meegid

    Elucidating evolutionary features and functional implications of orphan genes in Leishmania major

    Sumit Mukherjee a,b, Arup Panda a, Tapash Chandra Ghosh a,*

    a Bioinformatics Centre, Bose Institute, P 1/12, C.I.T. Scheme VII M, Kolkata 700 054, West Bengal, India
    b Department of Physical Sciences, Indian Institute of Science Education and Research-Kolkata, Mohanpur 741246, Nadia, West Bengal, India

    Article info

    Article history: Received 13 January 2015; Received in revised form 25 March 2015; Accepted 26 March 2015; Available online 2 April 2015

    Keywords: Orphan genes; Evolutionary rate; Protein disorder; Interaction and trafficking motifs; Host–parasite interaction; Lineage-specific adaptation

    Abbreviations: L. major, Leishmania major; BLAST, Basic Local Alignment Search Tool; GRAVY, grand average of hydropathy index; Nc, effective number of codons; CAI, Codon Adaptation Index; FPKM, Fragments Per Kilobase of exon per Million fragments mapped.

    * Corresponding author. Tel.: +91 33 2355 6626; fax: +91 33 2355 3886.

    E-mail address: [email protected] (T.C. Ghosh).

    http://dx.doi.org/10.1016/j.meegid.2015.03.031
    1567-1348/© 2015 Elsevier B.V. All rights reserved.

    1. Introduction

    Orphan genes are protein coding genes that do not share detectable sequence similarity with the genomes of other organisms (Tautz and Domazet-Loso, 2011). Due to their phylogenetic restriction these genes are also called lineage-specific or taxonomically restricted genes (Wilson et al., 2005). Orphan genes comprise a considerable fraction of genes in all domains of life including viruses (Khalturin et al., 2009; Wilson et al., 2005; Yin and Fischer, 2008). These genes can be broadly classified into two categories: (i) taxon-specific orphan genes (TSOGs) that lack homology outside of a focal taxonomic group and (ii) species-specific orphan genes (SSOGs), a subset of TSOGs sharing no homology with any gene in any other species (Wissler et al., 2013). Several hypotheses have been put forward to explain the origin of orphan genes. For instance, gene duplication and rearrangement processes followed by rapid divergence were considered to be an important pathway for the emergence of orphan genes in primates, Arabidopsis and zebrafish (Donoghue et al., 2011; Toll-Riera et al., 2009; Yang et al., 2013). In primates, it has been found that the majority of orphan genes arise from frequent recruitment of transposable elements (Toll-Riera et al., 2009). Orphan genes may also arise de

  • novo from non-coding regions (Cai et al., 2008; Heinen et al., 2009; Knowles and McLysaght, 2009; Neme and Tautz, 2013; Wu et al., 2011; Xie et al., 2012; Yang and Huang, 2011). These genes were also found to emerge from overlapping of anti-sense reading frames and frameshift mutations in protein coding sequences (Wissler et al., 2013).

    Orphan genes are emerging as critical players in the lineage-specific adaptation of different species to a broad range of ecological conditions (Khalturin et al., 2009). These genes were reported to play substantial roles in response to a variety of abiotic stresses in plant genomes (Donoghue et al., 2011). Important roles of orphan genes were also evidenced in several developmental processes. For instance, orphan gene products were found to be crucial for early human brain development (Zhang et al., 2011) and also for regulation of tentacle formation in Hydra species (Khalturin et al., 2008). Lineage-specific putative surface antigens of Plasmodium were shown to be involved in host–parasite interactions (Kuo and Kissinger, 2008). In 2010, Zhang et al. ectopically expressed 14 Leishmania donovani-specific genes in Leishmania major and observed that two of these genes could increase L. major survival in visceral organs (Zhang and Matlashewski, 2010).

    Studies conducted on different eukaryotes demonstrated that orphan genes evolve faster than non-orphan genes (Cai et al., 2006; Domazet-Loso and Tautz, 2003; Donoghue et al., 2011; Kuo and Kissinger, 2008; Toll-Riera et al., 2009). An inverse relationship between gene age and protein evolutionary rate has been widely observed in a broad range of organisms including primates (Toll-Riera et al., 2009), mammals (Alba and Castresana, 2005), Drosophila (Domazet-Loso and Tautz, 2003), Plasmodium (Kuo and Kissinger, 2008), fungi (Cai et al., 2006) and bacteria (Daubin and Ochman, 2004). Since orphan genes are the younger genes in a particular lineage, it was hypothesized that these genes evolve faster mainly due to their recent evolutionary origin (Cai et al., 2006; Domazet-Loso and Tautz, 2003; Toll-Riera et al., 2009). Later it was found that protein evolutionary rate could not be determined by a single factor; rather, proteins' intrinsic properties as well as their evolutionary age independently modulate the rates of protein evolution (Toll-Riera et al., 2012). Protein evolutionary rate was shown to correlate with a number of gene level and protein level attributes, such as expression level (Drummond et al., 2005; Drummond et al., 2006; Pal et al., 2001), number of protein–protein interactions (Fraser et al., 2002), protein complex number (Chakraborty and Ghosh, 2013), centrality in the protein interaction network (Hahn and Kern, 2005), protein dispensability (Hirsh and Fraser, 2001), sequence length (Marais and Duret, 2001), Codon Adaptation Index (CAI), effective number of codons (Nc) (Pal et al., 2001; Wall et al., 2005), protein disorder content (Chen et al., 2011; Podder and Ghosh, 2010), etc. In spite of all these findings, the factors determining the evolutionary rate of orphan genes are still under debate and the relative contribution of different genomic and proteomic attributes to the evolutionary rates of orphan genes remains elusive.

    With the availability of high-throughput genomic sequences together with expression data and bioinformatics prediction tools, it has now become easier to identify and characterize orphan genes in different species. L. major is one of the most important protozoan parasites of the genus Leishmania. It is associated with the disease cutaneous leishmaniasis, affecting more than 2 million people throughout the world every year (Ivens et al., 2005). In spite of multiple research endeavors, to date there is no available vaccine for this disease. Because of their absence in the host genomes, orphan gene products in pathogenic protozoa were considered to be possible therapeutic targets (Kuo and Kissinger, 2008). Therefore, profiling orphan genes of L. major from the perspective of protein evolutionary rates and comparing them with non-orphan genes, along with understanding their functional roles, will be helpful to recognize the molecular signature of parasitic adaptation. With this aim we carried out a rigorous analysis to understand the functionality of orphan genes and investigated the evolutionary forces affecting orphan gene evolution. To evaluate the attributes of orphan genes in the evolutionary framework we performed a comprehensive analysis comparing orphan genes with the non-orphan genes. In this study our primary objective is to characterize all the possible determinants that may have shaped the evolutionary rate of orphan genes in L. major. One of the main obstacles to such a study is the limitation of required data on orphan genes. Therefore, in this study we consider several genomic and proteomic attributes that could be easily identified from coding sequences and analyzed their relative influence on the evolutionary rate heterogeneity between orphan and non-orphan genes.

    Confirming earlier observations, our study revealed that orphan genes evolve faster than non-orphan genes (Domazet-Loso and Tautz, 2003; Toll-Riera et al., 2009). However, contrary to the suggestions of those studies, here we found that gene age could account for only a fraction of the variation of their evolutionary rate. Instead, together with gene age, a number of factors like gene expression, codon bias, genic GC content, protein hydropathicity, protein disorder content and protein length were found to contribute substantially to the evolutionary rate difference between orphan and non-orphan genes. On the functional level, we found that sequences of orphan genes are endowed with host targeting motifs, prenylation motifs, heparin-binding consensus sequences, signal peptides and transmembrane domains, implying their possible roles in host–parasite interactions. Thus, our study on orphan genes of L. major sheds light on the factors governing pathogen evolution and reveals their contribution in parasitic adaptations.

    2. Materials and methods

    2.1. Collection of dataset and gene expression data

    We retrieved the protein coding sequences of L. major (strain Friedlin) from TriTrypDB version 7.0 (http://tritrypdb.org/tritrypdb/) (Aslett et al., 2010). CDS sequences containing internal stop codons and partial codons were removed using CodonW (http://codonw.sourceforge.net). Signal peptide, transmembrane domain, epitope, paralog and pathway information for all L. major genes was downloaded from TriTrypDB version 7.0. To compute gene expression level, we retrieved high-throughput RNA-seq expression profile data of the L. major promastigote stage from the dataset of Rastrojo et al. (2013). We searched for protein domains via InterProScan (Zdobnov and Apweiler, 2001).

    2.2. Identification of orphan genes

    To identify orphan gene models which are restricted to the Leishmania genus, we used a systematic approach based on homology search. First, a BLASTP followed by TBLASTN filtering approach (E < 10⁻⁵ and use of low-complexity filters) was used against NCBI nr databases. Additionally, to further screen for similarity between sequences we employed Position-Specific Iterated BLAST (PSI-BLAST) (Altschul et al., 1997), which can detect weaker homologous relationships that would otherwise be missed by the standard BLAST algorithms.

    2.3. Calculation of nucleotide substitution rate

    The ratio of the rate of non-synonymous substitutions (dN) to the rate of synonymous substitutions (dS) was widely used as an

  • indicator of selective pressure acting on protein-coding genes (Kryazhimskiy and Plotkin, 2008). dN/dS values of L. major genes were calculated with respect to their one-to-one orthologous sequences in four other Leishmania species: Leishmania infantum, Leishmania braziliensis, Leishmania mexicana and L. donovani. To calculate dN/dS values, each orthologous gene pair was aligned using ClustalW (Larkin et al., 2007). dN/dS values were calculated by the Yang and Nielsen method using the PAML package v-4 (Yang and Nielsen, 2000). We averaged the dN/dS values of each gene and represented that as their evolutionary rate (Supplementary dataset).

    2.4. Codon usage indices calculation

    Codon Adaptation Index (CAI) and effective number of codons (Nc) of L. major genes were computed using CodonW (Sharp et al., 1986). For calculating CAI values, a highly expressed gene set of L. major was prepared based on RNA-seq data of the promastigote stage (Rastrojo et al., 2013). Overall genic GC content and an average protein hydropathy index were also computed using CodonW (Sharp et al., 1986).

    2.5. Calculation of gene age

    To calculate the phylogenetic age of L. major genes we used a phylostratigraphic approach (Domazet-Loso and Tautz, 2010). Briefly, according to the significant BLAST hits found in the most remote species (as documented in NCBI nr databases), L. major genes were classified into four taxonomic levels: genes shared by only Leishmania species, genes shared by Trypanosomatidae, genes distributed among basal Eukaryota and genes distributed among all organisms. Genes for which BLAST hits were found only in the Leishmania genus (genus restricted orphan genes) were considered as the youngest genes, whereas genes distributed in prokaryotic species were considered as the oldest genes.

    2.6. Prediction of protein intrinsic disorder

    Protein disorder content was predicted using the IUPred algorithm (Dosztanyi et al., 2005). Based on pairwise interaction energy, IUPred assigns a score to each amino acid (Dosztanyi et al., 2005). For each protein we calculated the proportion of amino acids with disorder score ≥ 0.5 and represented this as its disorder content.

    2.7. Prediction of GO term, subcellular localization and pathogenic protein

    For Gene Ontology (GO) annotations of orphan genes we primarily focused on TriTrypDB v 7.0. However, we found that only 43 orphan genes have annotated GO terms. Therefore, for the rest of the orphan genes in our dataset we predicted GO categories using the ProtFun 2.2 webserver (http://www.cbs.dtu.dk/services/ProtFun/) (Jensen et al., 2003). ProtFun 2.2 is a homology independent method and predicts protein function based on physico-chemical properties. Therefore, this algorithm was considered to be useful for prediction of protein function even for orphan genes (Yang et al., 2013).

    Subcellular localization of orphan genes was predicted using two independent web servers: CELLO v.2.5 (http://cello.life.nctu.edu.tw/) (Yu et al., 2004) and SubCellProt (http://www.databases.niper.ac.in/SubCellProt) (Garg et al., 2009). CELLO predicts protein subcellular localization using a two-level support vector machine (SVM), while SubCellProt is based on two machine learning approaches, k Nearest Neighbor (k-NN) and Probabilistic Neural Network (PNN). When two of these three approaches (k-NN, PNN and SVM) predicted the same localization for a gene, we took that as its subcellular localization.

    We predicted the pathogenic ability of the orphan genes using the MP3 server (http://metagenomics.iiserb.ac.in/mp3/index.php) (Gupta et al., 2014). This server predicts pathogenic and virulent proteins from genomic and metagenomic datasets using an integrated SVM-HMM approach.

    2.8. Identification of interaction and trafficking motifs

    We searched for the host-cell targeting motif RXLXE/D/Q (where X is a neutral or a hydrophobic amino acid residue), previously reported to export Plasmodium falciparum proteins from the intracellular parasites (Bhattacharjee et al., 2012) to the surrounding erythrocytes. We also searched for the presence of the consensus sequences XBBXBX, XBBBXXBX and XBBBXXBBBXXBBX (where X is a neutral or hydrophobic amino acid residue and B is a basic amino acid residue), which were implicated in heparin binding (de Castro Cortes et al., 2012). We searched for all of these sequence patterns using in-house Perl scripts.

    2.9. Identification of CAAX prenylation motifs

    We searched for CaaX prenylation motifs (C is cysteine, a is an aliphatic amino acid, and X is any amino acid) in orphan genes using the PrePS webserver (http://mendel.imp.univie.ac.at/sat/PrePS) (Maurer-Stroh and Eisenhaber, 2005). This server classified CaaX motifs as farnesyltransferase (FT), geranylgeranyltransferase I (GGT1) and geranylgeranyltransferase II (GGT2) motifs.

    2.10. Statistical analysis

    For correlation analyses we used the non-parametric Spearman's rank correlation ρ. Significant differences of variables between orphan and non-orphan genes were evaluated using the Mann–Whitney U test, following their non-parametric distribution (Kolmogorov–Smirnov test, P < 0.05). Here, P < 0.05 denotes the measure of significance at the 95% confidence level. All tests were done using the software SPSS (v-13.0).

    3. Results and discussion

    3.1. Searching for orphan genes in L. major

    Basic Local Alignment Search Tool (BLAST) is a standalone method to identify orphan genes in any sequenced genome (Tautz and Domazet-Loso, 2011). The number of orphan genes in an organism depends on the filtering procedures applied during the identification steps. Previous studies have used BLASTP and TBLASTN methods to identify orphan genes in various species (Lin et al., 2010; Yang et al., 2013). Following these studies, we used rigorous BLAST searches against NCBI non-redundant (nr) databases to identify orphan genes in the L. major genome. To identify matches missed by the BLASTP and TBLASTN searches we performed a PSI-BLAST search with a cut-off of E-value < 1 × 10⁻⁵. Finally, we identified 881 genus specific orphan genes, corresponding to 10.65% of all L. major protein coding sequences (Supplementary_dataset). According to high-throughput RNA-seq expression data of the promastigote stage (Rastrojo et al., 2013), we found gene expression intensity (FPKM value) for 864 (out of 881) orphan genes, which indicates that orphan genes are not artifacts of genome annotation.
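
The genus-level orphan call described in this section can be illustrated with a minimal Python sketch. It assumes the BLAST output has already been parsed into (gene, subject genus, E-value) tuples; the function name, the tuple layout and the toy gene IDs are hypothetical and are not taken from the study, which does not publish its filtering code.

```python
# Hypothetical sketch: call genus-restricted orphan genes from parsed BLAST hits.
# A gene is kept as an orphan when it has no significant hit outside the focal genus.

E_CUTOFF = 1e-5  # E-value threshold quoted in the text


def genus_restricted_orphans(all_genes, hits, focal_genus="Leishmania",
                             e_cutoff=E_CUTOFF):
    """Return the sorted genes that lack a significant hit outside focal_genus.

    all_genes: iterable of gene IDs in the query proteome.
    hits: iterable of (gene_id, subject_genus, evalue) tuples (assumed layout).
    """
    with_outside_homolog = {
        gene
        for gene, subject_genus, evalue in hits
        if subject_genus != focal_genus and evalue < e_cutoff
    }
    return sorted(set(all_genes) - with_outside_homolog)


if __name__ == "__main__":
    genes = ["geneA", "geneB", "geneC"]  # toy IDs, not real L. major genes
    toy_hits = [
        ("geneA", "Leishmania", 1e-50),   # hit inside the genus only
        ("geneB", "Trypanosoma", 1e-30),  # significant hit outside the genus
        ("geneC", "Homo", 0.5),           # outside hit, but not significant
    ]
    orphans = genus_restricted_orphans(genes, toy_hits)
    print(orphans, f"{100 * len(orphans) / len(genes):.2f}% of genes")
```

Keeping the cut-off and the focal genus as parameters makes it easy to see how the orphan count shifts with the filtering procedure, which the text notes is a determinant of the final number.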

  • 3.2. Evolutionary rate heterogeneity of orphan and non-orphan genes: effect of gene age

    To examine whether orphan genes of L. major have come under selective constraints, we computed their evolutionary rate and compared it with that of non-orphan genes. Consistent with previous reports on other genomes (Domazet-Loso and Tautz, 2003; Toll-Riera et al., 2009; Wissler et al., 2013), we found that the evolutionary rate of orphan genes is significantly higher than that of the rest of the genes in L. major (0.55 ± 0.06 vs. 0.24 ± 0.002 for orphan vs. non-orphan genes, P = 1 × 10⁻⁶, Mann–Whitney U test). Previously, it was suggested that phylogenetically conserved genes are associated with basic cellular processes and are functionally more constrained, whereas lineage-specific genes are functionally less constrained and could evolve many new functions (Kuo and Kissinger, 2008). Thus, the higher sequence divergence of orphan genes was regarded to be due to weaker selection on functional requirement (Toll-Riera et al., 2009). Since newly evolved genes tend to be under weaker purifying selection, earlier studies reported that proteins with accelerated evolutionary rate are younger in gene age (Alba and Castresana, 2005; Cai et al., 2006). Thus, a protein's evolutionary origin, i.e. its gene age, was considered to be a potential determinant of its evolutionary rate (Vishnoi et al., 2010). Orphan genes are evolutionarily younger genes that appeared at some time within a phylogenetic lineage towards an extant species (Toll-Riera et al., 2009). Therefore, their accelerated evolutionary rates were thought to be due to their recent evolutionary origin (Cai et al., 2006; Tautz and Domazet-Loso, 2011). To test this hypothesis we assigned a phyletic age to all the protein coding genes of L. major according to their phylogenetic distribution: (i) genes restricted to the Leishmania genus (orphan genes), (ii) Trypanosomatidae restricted genes, (iii) extensively distributed eukaryotic genes and (iv) genes distributed among all organisms (including viruses, bacteria and all other life forms). We found that evolutionary rate decreases with increasing gene age (Table 1). Consequently, we found a negative correlation between gene age and evolutionary rate (Spearman's ρ = −0.519, P = 1 × 10⁻⁶), which suggests that gene age is an important correlate of protein evolutionary rate. To test its independent impact on the evolutionary rate of L. major proteins, we performed linear regression taking gene age as the predictor variable and protein evolutionary rate as the dependent variable. Results revealed that gene age could account for 12.4% of the variation of protein evolutionary rate in our dataset (R² = 0.124, F = 1054.839, P < 1 × 10⁻⁶). Thus, it is apparent that although gene age contributes substantially to the variation of protein evolutionary rate between orphan and non-orphan genes, there could be other evolutionary force(s) behind this variation.
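
The two statistics reported for gene age, a Spearman rank correlation and the R² of a one-predictor linear regression, can be sketched in plain Python as follows. The authors ran their tests in SPSS; this is only an illustrative stand-in. The age coding (1 = genus-restricted, 4 = shared by all organisms) and the toy dN/dS values are assumptions made for the example, not data from the study. For a single predictor, R² equals the squared Pearson correlation, which is what `r_squared` computes.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def r_squared(x, y):
    """R^2 of a simple (one-predictor) least-squares regression of y on x."""
    return pearson(x, y) ** 2


if __name__ == "__main__":
    # Hypothetical toy data: age class per gene and its mean dN/dS.
    gene_age = [1, 1, 2, 2, 3, 3, 4, 4]
    dn_ds = [0.60, 0.50, 0.40, 0.35, 0.30, 0.28, 0.20, 0.25]
    print("Spearman rho:", round(spearman(gene_age, dn_ds), 3))
    print("R^2:", round(r_squared(gene_age, dn_ds), 3))
```

A negative ρ together with a modest R² is the pattern the paragraph describes: age and rate move in opposite directions, yet age alone leaves most of the variance unexplained.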

    3.3. Protein evolutionary rate: impact of gene level variants

    Previously, it was shown that genes with higher GC content tend to evolve at a slower rate as compared to lower GC genes (Xia et al., 2009). Thus, GC content was regarded as a strong predictor of protein evolutionary rate. In many species orphan genes

    Table 1. Evolutionary rate of L. major genes according to their phylogenetic distribution.

    Phylogenetic class of L. major genes                    dN/dS (Mean ± SE)    P-values
    Genes restricted to Leishmania genus (orphan genes)     0.55 ± 0.0068

  • Table 2. Categorical regression to illustrate the independent influence of different variables on protein evolutionary rate.

    Parameter                     β score    P-values
    Protein level properties
    Intrinsic disorder content    0.371

    … Nc > CAI > protein hydrophilicity > gene expression level > gene age > protein length.

    3.6. Functional attributes of orphan genes

    The vast number of orphan genes detected in different taxa, with their possible lineage-specific functions, motivated us to investigate whether orphan genes confer any advantage for lineage-specific adaptation of L. major. We searched for the Gene Ontology (GO) annotations of orphan genes to explore their functional significance. GO annotations are currently available only for a limited number of L. major orphan genes (43 genes with annotated GO terms out of 881 orphan genes in L. major) (Table 3). Therefore, to trace the potential functional roles of orphan genes we predicted GO annotations with the ProtFun 2.2 server (Jensen et al., 2003). Using ProtFun we were able to predict the GO annotations for 674 orphan genes. Similar to the study of Yang et al. in zebrafish (Yang et al., 2013), our analysis revealed a non-random distribution of orphan genes across different functional categories. Here, we observed that growth factor is the most abundant functional category for orphan genes (29.2%), followed by transcription regulation (24.18%) and cellular transportation (20.02%) (Table 4). Therefore, the predicted functional annotations indicate that the largest share of orphan genes lies in the growth factor category. Such genes could stimulate cell growth and proliferation and are important for regulating a variety of cellular processes, suggesting that these genes could be involved in various biochemical pathways leading to parasitic lineage-specific adaptations.

    Prediction of protein subcellular localization is an important component of in silico prediction of protein function (Yu et al., 2006). Computational prediction of subcellular localization of proteins may be error prone (Nair and Rost, 2003). Therefore, for the prediction of subcellular localization of orphan genes we employed two prediction servers, CELLO (Yu et al., 2004) and SubCellProt (Garg et al., 2009), which are based on three different methods (k-NN, PNN and SVM). We assigned a subcellular localization to a protein if at least two of those three methods predicted the same localization. Predictions from these two web servers suggest that orphan genes are mainly located within the nucleus and plasma membrane (Supplementary_dataset). Further gene ontology (GO) analysis of the plasma membrane orphan genes revealed that most of these genes are involved in processes like transport, ion channel and voltage-gated ion channel, etc.

    Table 3. Functional categorization of orphan genes as per annotated GO term in TriTrypDB.

    Annotated GO function              Number of orphan genes
    Acid-amino acid ligase activity    1
    ATP binding                        5
    Calcium ion binding                1

    Note: Total 43 orphan genes were assigned to various GO terms in TriTrypDB. Some orphan genes were assigned into multiple GO functional terms in TriTrypDB.
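
The two-out-of-three rule used for subcellular localization can be written as a short Python sketch. It assumes the per-method predictions have already been collected from the two web servers into a dictionary; the function name and the example labels are hypothetical and serve only to illustrate the voting logic.

```python
from collections import Counter


def consensus_localization(predictions, min_votes=2):
    """Return the compartment predicted by at least min_votes methods, else None.

    predictions: dict mapping method name (e.g. "k-NN", "PNN", "SVM")
    to the compartment that method predicted (assumed input layout).
    """
    if not predictions:
        return None
    label, votes = Counter(predictions.values()).most_common(1)[0]
    return label if votes >= min_votes else None


if __name__ == "__main__":
    agree = {"k-NN": "nucleus", "PNN": "nucleus", "SVM": "plasma membrane"}
    split = {"k-NN": "nucleus", "PNN": "cytoplasm", "SVM": "plasma membrane"}
    print(consensus_localization(agree))  # two methods agree
    print(consensus_localization(split))  # no two methods agree
```

Returning `None` when no two methods agree keeps ambiguous proteins out of the localization summary instead of forcing a label on them.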

    Involvement of genes in metabolic pathways indicates their important functional consequences in several biosynthetic processes. Here, we investigated whether orphan genes have any role in the metabolic pathways of L. major. To this end, we retrieved from TriTrypDB the list of L. major genes which are involved in various pathways. In this way we found evidence for the involvement of three orphan genes in different biosynthetic pathways of L. major. For instance, one orphan gene (Gene ID: LmjF.36.4180) was found to be associated with N-Glycan biosynthesis pathways. Another two orphan genes (Gene ID: LmjF.06.0780 and LmjF.35.0550) were found to be associated with ascorbate and aldarate metabolism pathways, ubiquinone and other terpenoid-quinone biosynthesis pathways as well as glycosaminoglycan degradation pathways. To search for their functional roles in those pathways, we considered their GO annotations and Enzyme Commission (EC) numbers. The TriTrypDB annotated GO process indicates that gene LmjF.36.4180 is involved in the dolichol-linked oligosaccharide biosynthetic process in N-Glycan biosynthesis pathways (annotated GO function: UDP-N-acetylglucosamine-dolichyl-phosphate N-acetylglucosaminephosphotransferase activity; EC number 2.7.8.15 (UDP-N-acetylglucosamine-dolichyl-phosphate N-acetylglucosaminephosphotransferase)). Annotated GO terms and EC numbers are unavailable for the genes LmjF.06.0780 and LmjF.35.0550 in TriTrypDB. Therefore, we considered EC numbers inferred from OrthoMCL (Li et al., 2003) for their functional assignments. This observation indicates that these two genes are possibly involved in glycosidase activity in those pathways (EC numbers inferred from OrthoMCL: 3.2.1.- (glycosidases, i.e. enzymes hydrolyzing O- and S-glycosyl compounds)). Thus, our findings suggest that orphan genes of L. major could integrate into its metabolic pathways to play important functional roles.

    Table 4. Functional categorization of orphan genes as predicted by ProtFun. Using ProtFun we were able to predict the GO term for 674 orphan genes out of 838 orphan genes with unknown GO term. Percentage of orphan genes indicates the distribution of each GO category within the 674 orphan genes.

    Predicted GO function              Number of orphan genes    Percentage of orphan genes
    Growth_factor                      197                       29.22
    Transcription_regulation           163                       24.18
    Transporter                        135                       20.02
    Structural_protein                 64                        9.49
    Transcription                      51                        7.56
    Central_intermediary_metabolism    15                        2.22
    Receptor                           11                        1.63
    Signal_transducer                  9                         1.33
    Cation_channel                     8                         1.18
    Ion_channel                        8                         1.18
    Stress_response                    7                         1.03
    Voltage-gated_ion_channel          6                         0.89
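
As a quick arithmetic check on Table 4, the category counts can be summed and converted to percentages in a few lines of Python. The counts are copied from the table; the observation that the printed percentages match values truncated (not rounded) to two decimals is an inference from the numbers themselves, not something the source states.

```python
# Counts per predicted GO function, copied from Table 4.
TABLE4_COUNTS = {
    "Growth_factor": 197,
    "Transcription_regulation": 163,
    "Transporter": 135,
    "Structural_protein": 64,
    "Transcription": 51,
    "Central_intermediary_metabolism": 15,
    "Receptor": 11,
    "Signal_transducer": 9,
    "Cation_channel": 8,
    "Ion_channel": 8,
    "Stress_response": 7,
    "Voltage-gated_ion_channel": 6,
}


def truncated_percentages(counts):
    """Percentage of each category, truncated to two decimal places."""
    total = sum(counts.values())
    # Integer arithmetic avoids floating-point surprises before truncation.
    return {name: (n * 10000 // total) / 100 for name, n in counts.items()}


if __name__ == "__main__":
    print("Total genes with a predicted GO term:", sum(TABLE4_COUNTS.values()))
    for name, pct in truncated_percentages(TABLE4_COUNTS).items():
        print(f"{name}: {pct:.2f}%")
```

The counts add up to the 674 genes quoted in the caption, and 43 annotated plus 838 unannotated genes give the 881 orphan genes reported earlier.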

    3.7. Orphan genes of L. major putatively involved in secretory pathways, host–parasite interactions and virulence

    Host–pathogen interactions are a type of environmental interaction in which intracellular pathogens exploit host cells to ensure their survival and replication within the host (Tautz and Domazet-Loso, 2011). Virulence is one of the potential results of host–pathogen interaction (Casadevall and Pirofski, 2001). Therefore, involvement of orphan genes in host–pathogen interactions may suggest a crucial role in parasitic adaptation to the host systems. In our endeavor to understand the role of orphan genes in host–parasite interactions, we investigated the presence of different interaction and trafficking motifs in orphan genes. Prenylation is a post-translational modification that leads to farnesylation or geranylgeranylation, which are required for proteins to be fully functional (Zhang and Casey, 1996). Pathogen proteins undergoing this type of host directed post-translational modification were considered to be crucial for host–pathogen interactions (Amaya et al., 2011). We found prenylation motifs in five orphan genes (Supplementary_dataset). This provides evidence that orphan genes may take part in protein prenylation, facilitating interactions with the host. Heparin binding proteins present at the surface of Leishmania were shown to be involved in host–pathogen interactions (de Castro Cortes et al., 2012). Therefore, we searched for heparin-binding consensus sequences in orphan genes. We found that the heparin binding consensus sequence XBBXBX is present in 143 orphan genes, whereas XBBBXXBX is present in 15 orphan genes (Supplementary_dataset), which suggests that these genes possibly interact with heparin/heparan sulfate at the surface of the mammalian host. It has been reported that P. falciparum employs host targeting (HT) motifs (RXLXE/D/Q) on secretory proteins to deliver hundreds of effectors that must cross the haustorial/vacuolar membrane to enter the cytoplasm of host cells (Bhattacharjee et al., 2012). 
Thus, presenceof HT motif was regarded as a key signature of virulence factors.We found HT motifs in protein sequences of 138 L. major orphangenes (Supplementary_dataset), indicating that these genes possi-bly exported to the host during intracellular stage of infections.Membrane proteins and proteins destined for secretion are tar-geted to the appropriate intracellular membrane by their signalpeptides (Martoglio and Dobberstein, 1998). Transmembrane pro-teins mainly function as gateways to deny or permit the transportof specic substances across the biological membrane and areinvolved in a broad range of biological processes (Arinaminpathyet al., 2009). Their imperative roles make them rewarding drug tar-gets. Our analysis demonstrated that 239 orphan genes contain atleast a putative signal peptide or one transmembrane domain(Supplementary_dataset), which suggests that these proteins couldbe exported to the host cell or integrated on the extracellular sur-face of the parasite where it can interact with host cell receptors.Secreted and surface-exposed antigens of parasites are thoughtto provide targeted structures for detection by the host immunesystem (Silverman et al., 2010). Previous studies showed thatmany lineage-specic genes of Plasmodium and Theileria areencoded surface antigens that interact with their host genomes(Kuo and Kissinger, 2008). We also found that orphan genes of L.major contain several genus restricted surface antigens and hydro-philic surface proteins (HASPA2) (Supplementary_dataset) whichimply their possible involvement in hostparasite interactions.Proteophosphoglycan (PPG) are surface glycoproteins of L. majorwhich contribute to the binding of Leishmania to host cells and playa vital role in immunomodulatory effect on macrophage function(Piani et al., 1999). We noticed that one of the orphan gene(Gene ID: LmjF.35.0550) are encoded proteophosphoglycan whichindicates its possible role in attachment of parasite with the hostcells. 
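    The consensus searches described above (XBBXBX, XBBBXXBX and the HT motif RXLXE/D/Q) can be approximated with regular expressions. The sketch below is illustrative only and is not the pipeline used in this study. It assumes that B denotes a basic residue (K, R or H) and, as a simplification, that X is any of the 20 standard amino acids; the function names and the toy sequences are hypothetical.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ANY = f"[{AMINO_ACIDS}]"   # X: any residue (simplifying assumption)
BASIC = "[KRH]"            # B: basic residue (assumed to be K, R or H)


def consensus_to_regex(consensus: str) -> str:
    """Translate an X/B consensus string such as 'XBBXBX' into a regex body."""
    return "".join({"X": ANY, "B": BASIC}[symbol] for symbol in consensus)


# Motif name -> regex body. A zero-width lookahead is wrapped around each
# body at compile time so that overlapping occurrences are all reported.
MOTIF_BODIES = {
    "XBBXBX": consensus_to_regex("XBBXBX"),
    "XBBBXXBX": consensus_to_regex("XBBBXXBX"),
    "HT": f"R{ANY}L{ANY}[EDQ]",  # RXLXE/D/Q
}
MOTIFS = {name: re.compile(f"(?={body})") for name, body in MOTIF_BODIES.items()}


def find_motifs(sequence: str) -> dict[str, list[int]]:
    """Return 0-based start positions of every motif occurrence in a protein."""
    sequence = sequence.upper()
    return {name: [m.start() for m in rx.finditer(sequence)]
            for name, rx in MOTIFS.items()}


def genes_with_motif(proteins: dict[str, str]) -> dict[str, int]:
    """Count how many proteins contain each motif at least once."""
    counts = {name: 0 for name in MOTIFS}
    for sequence in proteins.values():
        for name, positions in find_motifs(sequence).items():
            if positions:
                counts[name] += 1
    return counts


# Hypothetical toy sequences, not taken from the L. major proteome.
toy_proteins = {
    "toy_1": "MSAKRGKLT",   # contains XBBXBX (AKRGKL)
    "toy_2": "MRSLAEKK",    # contains the HT motif (RSLAE)
    "toy_3": "MAKRKLLRA",   # contains XBBBXXBX (AKRKLLRA)
}

for gene_id, seq in toy_proteins.items():
    print(gene_id, find_motifs(seq))
print(genes_with_motif(toy_proteins))
```

    Counting genes rather than motif occurrences mirrors how the results are reported in the text (number of orphan genes carrying each motif). Short degenerate patterns of this kind are expected to match by chance in many sequences, so hits are best read as candidates rather than as evidence of function.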
    The 57-residue small hydrophilic endoplasmic reticulum-associated proteins (SHERP) show a high level of stage-specific expression and are considered important for modulating cellular processes related to membrane organization and acidification during vector transmission of infective Leishmania (Moore et al., 2011). Here, we noticed that some orphan genes of L. major encode 57-residue SHERPs. This result suggests that these genes could play a crucial role in metacyclic parasites during transmission to the mammalian host. Amastin-like surface proteins were assumed to have evolved novel functions crucial to the growth of leishmanial parasites after the acquisition of a vertebrate host (Jackson, 2010). We found multiple copies of amastin-like surface proteins in the orphan gene dataset (Supplementary_dataset), which suggests that these genes may be important for survival of L. major within the host.

    To further understand the role of orphan genes in virulence, we retrieved predicted epitope information for L. major genes from TriTrypDB. Prediction of epitopes in protein-coding genes is a powerful approach for unbiased antigen discovery (Dumonteil, 2009). Our results suggest that five orphan genes contain epitopes in their sequences (Supplementary_dataset), which indicates their possible role in immunogenicity within the host. Ubiquitination is a post-translational modification in which ubiquitin is attached to a substrate protein; it can target proteins for degradation via the proteasome, alter their cellular location, affect their activity, and promote or prevent protein–protein interactions (Mukhopadhyay and Riezman, 2007). We found that three orphan genes (Gene ID: LmjF.05.0620, LmjF.13.0730 and LmjF.35.2610) contain a ubiquitin domain, which indicates their possible role in host protein targeting and virulence.

    Finally, to test whether orphan genes have virulence-like properties, we predicted their propensity to encode pathogenic proteins using the MP3 web server (Gupta et al., 2014). Interestingly, we found that 90.98% of proteins in the orphan dataset are predicted to be pathogenic, whereas the figure is only 54.29% for the non-orphan dataset. The difference is statistically significant at the 99% confidence level by z-test (Z score = 19.963). Overall, these results suggest that orphan genes of L. major are likely to be involved in the parasite's virulence. The varied roles of orphan genes in several biochemical processes of L. major therefore indicate that these genes could be potential targets for the development of new therapeutics.

    4. Conclusion

    Our study constitutes the first attempt at an evolutionary and functional analysis of genus-restricted orphan genes in L. major. Assessing the results from our multivariate regression analysis, we concluded that, together with gene age, different genomic and proteomic attributes of orphan genes are responsible for their faster evolutionary rate. Our functional analysis highlighted the role of orphan genes in parasitic lineage-specific adaptation, which influences both survival and virulence in the host. Our study provides valuable information on L. major orphan genes and advocates further analysis and experimental studies to facilitate the development of novel therapeutic targets in the near future.

    Acknowledgements

    We are thankful to the anonymous reviewers for their valuable comments, which helped us immensely to improve our manuscript. We thank the Department of Biotechnology, Govt. of India, and Bose Institute for financial support.

    Appendix A. Supplementary data

    All of the data sets supporting the results of this article are available in the supplementary datasets as well as in the TriTrypDB community files portal under the title: Orphan genes of Leishmania major (http://tritrypdb.org/tritrypdb/showApplication.do). Supplementary data associated with this article can be found, in the online version, at http://dx.doi.org/10.1016/j.meegid.2015.03.031.

    References

    S. Mukherjee et al. / Infection, Genetics and Evolution 32 (2015) 330–337

    Alba, M.M., Castresana, J., 2005. Inverse relationship between evolutionary rate and age of mammalian genes. Mol. Biol. Evol. 22, 598–606.

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J.H., Zhang, Z., Miller, W., Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402.

    Amaya, M., Baranova, A., van Hoek, M.L., 2011. Protein prenylation: a new mode of host–pathogen interaction. Biochem. Biophys. Res. Commun. 416, 1–6.

    Arinaminpathy, Y., Khurana, E., Engelman, D.M., Gerstein, M.B., 2009. Computational analysis of membrane proteins: the largest class of drug targets. Drug Discovery Today 14, 1130–1135.

    Aslett, M., Aurrecoechea, C., Berriman, M., Brestelli, J., Brunk, B.P., Carrington, M., Depledge, D.P., Fischer, S., Gajria, B., Gao, X., Gardner, M.J., Gingle, A., Grant, G., Harb, O.S., Heiges, M., Hertz-Fowler, C., Houston, R., Innamorato, F., Iodice, J., Kissinger, J.C., Kraemer, E., Li, W., Logan, F.J., Miller, J.A., Mitra, S., Myler, P.J., Nayak, V., Pennington, C., Phan, I., Pinney, D.F., Ramasamy, G., Rogers, M.B., Roos, D.S., Ross, C., Sivam, D., Smith, D.F., Srinivasamoorthy, G., Stoeckert Jr., C.J., Subramanian, S., Thibodeau, R., Tivey, A., Treatman, C., Velarde, G., Wang, H., 2010. TriTrypDB: a functional genomic resource for the Trypanosomatidae. Nucleic Acids Res. 38, D457–D462.

    Bhattacharjee, S., Stahelin, R.V., Speicher, K.D., Speicher, D.W., Haldar, K., 2012. Endoplasmic reticulum PI(3)P lipid binding targets malaria proteins to the host cell. Cell 148, 201–212.

    Botzman, M., Margalit, H., 2011. Variation in global codon usage bias among prokaryotic organisms is associated with their lifestyles. Genome Biol. 12, 109.

    Cai, J., Zhao, R., Jiang, H., Wang, W., 2008. De novo origination of a new protein-coding gene in Saccharomyces cerevisiae. Genetics 179, 487–496.

    Cai, J.J., Woo, P.C.Y., Lau, S.K.P., Smith, D.K., Yuen, K.-Y., 2006. Accelerated evolutionary rate may be responsible for the emergence of lineage-specific genes in Ascomycota. J. Mol. Evol. 63, 1–11.

    Casadevall, A., Pirofski, L.A., 2001. Host–pathogen interactions: the attributes of virulence. J. Infect. Dis. 184, 337–344.

    Chakraborty, S., Ghosh, T.C., 2013. Evolutionary rate heterogeneity of core and attachment proteins in yeast protein complexes. Genome Biol. Evol. 5, 1366–1375.

    Chen, S.C.-C., Chuang, T.-J., Li, W.-H., 2011. The relationships among microRNA regulation, intrinsically disordered regions, and other indicators of protein evolutionary rate. Mol. Biol. Evol. 28, 2513–2520.

    Daubin, V., Ochman, H., 2004. Bacterial genomes as new gene homes: the genealogy of ORFans in E-coli. Genome Res. 14, 1036–1042.

    de Castro Cortes, L.M., de Souza Pereira, M.C., da Silva, F.S., Santini Pereira, B.A., de Oliveira Junior, F.O., de Araujo Soares, R.O., Brazil, R.P., Toma, L., Vicente, C.M., Nader, H.B., Madeira, M.d.F., Bello, F.J., Alves, C.R., 2012. Participation of heparin binding proteins from the surface of Leishmania (Viannia) braziliensis promastigotes in the adhesion of parasites to Lutzomyia longipalpis cells (Lulo) in vitro. Parasit. Vectors 5, 142.

    Domazet-Loso, T., Tautz, D., 2003. An evolutionary analysis of orphan genes in Drosophila. Genome Res. 13, 2213–2219.

    Domazet-Loso, T., Tautz, D., 2010. Phylostratigraphic tracking of cancer genes suggests a link to the emergence of multicellularity in metazoa. BMC Biol. 8, 66.

    Donoghue, M.T.A., Keshavaiah, C., Swamidatta, S.H., Spillane, C., 2011. Evolutionary origins of Brassicaceae specific genes in Arabidopsis thaliana. BMC Evol. Biol. 11, 47.

    Dosztanyi, Z., Csizmok, V., Tompa, P., Simon, I., 2005. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics 21, 3433–3434.

    Drummond, D.A., Bloom, J.D., Adami, C., Wilke, C.O., Arnold, F.H., 2005. Why highly expressed proteins evolve slowly. Proc. Natl. Acad. Sci. U.S.A. 102, 14338–14343.

    Drummond, D.A., Raval, A., Wilke, C.O., 2006. A single determinant dominates the rate of yeast protein evolution. Mol. Biol. Evol. 23, 327–337.

    Dumonteil, E., 2009. Vaccine development against Trypanosoma cruzi and Leishmania species in the post-genomic era. Infect. Genet. Evol. 9, 1075–1082.

    Dunker, A.K., Brown, C.J., Lawson, J.D., Iakoucheva, L.M., Obradovic, Z., 2002. Intrinsic disorder and protein function. Biochemistry 41, 6573–6582.

    Dyson, H.J., Wright, P.E., 2005. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol. 6, 197–208.

    Fraser, H.B., Hirsh, A.E., Steinmetz, L.M., Scharfe, C., Feldman, M.W., 2002. Evolutionary rate in the protein interaction network. Science 296, 750–752.

    Garg, P., Sharma, V., Chaudhari, P., Roy, N., 2009. SubCellProt: predicting protein subcellular localization using machine learning approaches. In Silico Biol. 9, 35–44.

    Gouy, M., Gautier, C., 1982. Codon usage in bacteria: correlation with gene expressivity. Nucleic Acids Res. 10, 7055–7074.

    Gupta, A., Kapil, R., Dhakan, D.B., Sharma, V.K., 2014. MP3: a software tool for the prediction of pathogenic proteins in genomic and metagenomic data. PLoS ONE 9, e93907.

    Hahn, M.W., Kern, A.D., 2005. Comparative genomics of centrality and essentiality in three eukaryotic protein-interaction networks. Mol. Biol. Evol. 22, 803–806.

    Heinen, T.J.A.J., Staubach, F., Haeming, D., Tautz, D., 2009. Emergence of a new gene from an intergenic region. Curr. Biol. 19, 1527–1531.

    Hirsh, A.E., Fraser, H.B., 2001. Protein dispensability and rate of evolution. Nature 411, 1046–1049.

    Ikemura, T., 1985. Codon usage and tRNA content in unicellular and multicellular organisms. Mol. Biol. Evol. 2, 13–34.

    Ivens, A.C., Peacock, C.S., Worthey, E.A., Murphy, L., Aggarwal, G., Berriman, M., Sisk, E., Rajandream, M.A., Adlem, E., Aert, R., Anupama, A., Apostolou, Z., Attipoe, P., Bason, N., Bauser, C., Beck, A., Beverley, S.M., Bianchettin, G., Borzym, K., Bothe, G., Bruschi, C.V., Collins, M., Cadag, E., Ciarloni, L., Clayton, C., Coulson, R.M.R., Cronin, Cruz, A.K., Davies, R.M., De Gaudenzi, J., Dobson, D.E., Duesterhoeft, A., Fazelina, G., Fosker, N., Frasch, A.C., Fraser, A., Fuchs, M., Gabel, C., Goble, A., Goffeau, A., Harris, D., Hertz-Fowler, C., Hilbert, H., Horn, D., Huang, Y.T., Klages, S., Knights, A., Kube, M., Larke, N., Litvin, L., Lord, A., Louie, T., Marra, M., Masuy, D., Matthews, K., Michaeli, S., Mottram, J.C., Muller-Auer, S., Munden, H., Nelson, S., Norbertczak, H., Oliver, K., O'Neil, S., Pentony, M., Pohl, T.M., Price, C., Purnelle, B., Quail, M.A., Rabbinowitsch, E., Reinhardt, R., Rieger, M., Rinta, J., Robben, J., Robertson, L., Ruiz, J.C., Rutter, S., Saunders, D., Schafer, M., Schein, J., Schwartz, D.C., Seeger, K., Seyler, A., Sharp, S., Shin, H., Sivam, D., Squares, R., Squares, S., Tosato, V., Vogt, C., Volckaert, G., Wambutt, R., Warren, T., Wedler, H., Woodward, J., Zhou, S.G., Zimmermann, W., Smith, D.F., Blackwell, J.M., Stuart, K.D., Barrell, B., Myler, P.J., 2005. The genome of the kinetoplastid parasite, Leishmania major. Science 309, 436–442.

    Jackson, A.P., 2010. The evolution of amastin surface glycoproteins in trypanosomatid parasites. Mol. Biol. Evol. 27, 33–45.

    Jensen, L.J., Gupta, R., Staerfeldt, H.H., Brunak, S., 2003. Prediction of human protein function according to Gene Ontology categories. Bioinformatics 19, 635–642.

    Johnson, B.R., Tsutsui, N.D., 2011. Taxonomically restricted genes are associated with the evolution of sociality in the honey bee. BMC Genomics 12, 164.

    Khalturin, K., Anton-Erxleben, F., Sassmann, S., Wittlieb, J., Hemmrich, G., Bosch, T.C.G., 2008. A novel gene family controls species-specific morphological traits in hydra. PLoS Biol. 6, 2436–2449.

    Khalturin, K., Hemmrich, G., Fraune, S., Augustin, R., Bosch, T.C.G., 2009. More than just orphans: are taxonomically-restricted genes important in evolution? Trends Genet. 25, 404–413.

    Knowles, D.G., McLysaght, A., 2009. Recent de novo origin of human protein-coding genes. Genome Res. 19, 1752–1759.

    Kryazhimskiy, S., Plotkin, J.B., 2008. The population genetics of dN/dS. PLoS Genet. 4, e1000304.

    Kuo, C.-H., Kissinger, J.C., 2008. Consistent and contrasting properties of lineage-specific genes in the apicomplexan parasites Plasmodium and Theileria. BMC Evol. Biol. 8, 108.

    Kyte, J., Doolittle, R.F., 1982. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132.

    Larkin, M.A., Blackshields, G., Brown, N.P., Chenna, R., McGettigan, P.A., McWilliam, H., Valentin, F., Wallace, I.M., Wilm, A., Lopez, R., Thompson, J.D., Gibson, T.J., Higgins, D.G., 2007. Clustal W and clustal X version 2.0. Bioinformatics 23, 2947–2948.

    Li, L., Stoeckert, C.J., Roos, D.S., 2003. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Res. 13 (9), 2178–2189.

    Lin, H., Moghe, G., Ouyang, S., Iezzoni, A., Shiu, S.-H., Gu, X., Buell, C.R., 2010. Comparative analyses reveal distinct sets of lineage-specific genes within Arabidopsis thaliana. BMC Evol. Biol. 10, 41.

    Marais, G., Duret, L., 2001. Synonymous codon usage, accuracy of translation, and gene length in Caenorhabditis elegans. J. Mol. Evol. 52, 275–280.

    Martoglio, B., Dobberstein, B., 1998. Signal sequences: more than just greasy peptides. Trends Cell Biol. 8, 410–415.

    Maurer-Stroh, S., Eisenhaber, F., 2005. Refinement and prediction of protein prenylation motifs. Genome Biol. 6, R55.

    Mohan, A., Sullivan Jr., W.J., Radivojac, P., Dunker, A.K., Uversky, V.N., 2008. Intrinsic disorder in pathogenic and non-pathogenic microbes: discovering and analyzing the unfoldomes of early-branching eukaryotes. Mol. BioSyst. 4, 328–340.

    Moore, B., Miles, A.J., Guerra-Giraldez, C., Simpson, P., Iwata, M., Wallace, B.A., Matthews, S.J., Smith, D.F., Brown, K.A., 2011. Structural basis of molecular recognition of the leishmania small hydrophilic endoplasmic reticulum-associated protein (SHERP) at membrane surfaces. J. Biol. Chem. 286, 9246–9256.

    Mukhopadhyay, D., Riezman, H., 2007. Proteasome-independent functions of ubiquitin in endocytosis and signaling. Science 315, 201–205.

    Nair, R., Rost, B., 2003. LOC3D: annotate sub-cellular localization for protein structures. Nucleic Acids Res. 31, 3337–3340.

    Neme, R., Tautz, D., 2013. Phylogenetic patterns of emergence of new genes support a model of frequent de novo evolution. BMC Genomics 14, 117.

    Pal, C., Papp, B., Hurst, L.D., 2001. Highly expressed genes in yeast evolve slowly. Genetics 158, 927–931.

    Palmieri, N., Kosiol, C., Schloetterer, C., 2014. The life cycle of Drosophila orphan genes. Elife 3, e01311.

    Panda, A., Ghosh, T.C., 2014. Prevalent structural disorder carries signature of prokaryotic adaptation to oxic atmosphere. Gene 548, 134–141.

    Piani, A., Ilg, T., Elefanty, A.G., Curtis, J., Handman, E., 1999. Leishmania major proteophosphoglycan is expressed by amastigotes and has an immunomodulatory effect on macrophage function. Microbes Infect. 1, 589–599.

    Podder, S., Ghosh, T.C., 2010. Exploring the differences in evolutionary rates between monogenic and polygenic disease genes in human. Mol. Biol. Evol. 27, 934–941.

    Rastrojo, A., Carrasco-Ramiro, F., Martin, D., Crespillo, A., Reguera, R.M., Aguado, B., Requena, J.M., 2013. The transcriptome of Leishmania major in the axenic promastigote stage: transcript annotation and relative expression levels by RNA-seq. BMC Genomics 14, 223.

    Sharp, P.M., Tuohy, T.M.F., Mosurski, K.R., 1986. Codon usage in yeast: cluster analysis clearly differentiates highly and lowly expressed genes. Nucleic Acids Res. 14, 5125–5143.

    Silverman, J.M., Clos, J., de Oliveira, C.C., Shirvani, O., Fang, Y., Wang, C., Foster, L.J., Reiner, N.E., 2010. An exosome-based secretion pathway is responsible for protein export from Leishmania and communication with macrophages. J. Cell Sci. 123, 842–852.

    Tautz, D., Domazet-Loso, T., 2011. The evolutionary origin of orphan genes. Nat. Rev. Genet. 12, 692–702.

    Toll-Riera, M., Bosch, N., Bellora, N., Castelo, R., Armengol, L., Estivill, X., Mar Alba, M., 2009. Origin of primate orphan genes: a comparative genomics approach. Mol. Biol. Evol. 26, 603–612.

    Toll-Riera, M., Bostick, D., Mar Alba, M., Plotkin, J.B., 2012. Structure and age jointly influence rates of protein evolution. PLoS Comput. Biol. 8, e1002542.

    Uversky, V.N., Gillespie, J.R., Fink, A.L., 2000. Why are natively unfolded proteins unstructured under physiologic conditions? Proteins 41, 415–427.

    Vishnoi, A., Kryazhimskiy, S., Bazykin, G.A., Hannenhalli, S., Plotkin, J.B., 2010. Young proteins experience more variable selection pressures than old proteins. Genome Res. 20, 1574–1581.

    Wall, D.P., Hirsh, A.E., Fraser, H.B., Kumm, J., Giaever, G., Eisen, M.B., Feldman, M.W., 2005. Functional genomic analysis of the rates of protein evolution. Proc. Natl. Acad. Sci. U.S.A. 102, 5483–5488.

    Wilson, G.A., Bertrand, N., Patel, Y., Hughes, J.B., Feil, E.J., Field, D., 2005. Orphans as taxonomically restricted and ecologically important genes. Microbiology-Sgm 151, 2499–2501.

    Wissler, L., Gadau, J., Simola, D.F., Helmkampf, M., Bornberg-Bauer, E., 2013. Mechanisms and dynamics of orphan gene emergence in insect genomes. Genome Biol. Evol. 5, 439–455.

    Wright, F., 1990. The effective number of codons used in a gene. Gene 87, 23–29.

    Wu, D.-D., Irwin, D.M., Zhang, Y.-P., 2011. De novo origin of human protein-coding genes. PLoS Genet. 7, e1002379.

    Xia, Y., Franzosa, E.A., Gerstein, M.B., 2009. Integrated assessment of genomic correlates of protein evolutionary rate. PLoS Comput. Biol. 5, e1000413.

    Xie, C., Zhang, Y.E., Chen, J.-Y., Liu, C.-J., Zhou, W.-Z., Li, Y., Zhang, M., Zhang, R., Wei, L., Li, C.-Y., 2012. Hominoid-specific de novo protein-coding genes originating from long non-coding RNAs. PLoS Genet. 8, e1002942.

    Yang, L., Zou, M., Fu, B., He, S., 2013. Genome-wide identification, characterization, and expression analysis of lineage-specific genes within zebrafish. BMC Genomics 14, 65.

    Yang, Z., Huang, J., 2011. De novo origin of new genes with introns in Plasmodium vivax. FEBS Lett. 585, 641–644.

    Yang, Z.H., Nielsen, R., 2000. Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol. Biol. Evol. 17, 32–43.

    Yin, Y., Fischer, D., 2008. Identification and investigation of ORFans in the viral world. BMC Genomics 9, 24.

    Yu, C.-S., Chen, Y.-C., Lu, C.-H., Hwang, J.-K., 2006. Prediction of protein subcellular localization. Proteins 64, 643–651.

    Yu, C.S., Lin, C.J., Hwang, J.K., 2004. Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions. Protein Sci. 13, 1402–1406.

    Zdobnov, E.M., Apweiler, R., 2001. InterProScan – an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17, 847–848.

    Zhang, F.L., Casey, P.J., 1996. Protein prenylation: molecular mechanisms and functional consequences. Annu. Rev. Biochem. 65, 241–269.

    Zhang, W.-W., Matlashewski, G., 2010. Screening Leishmania donovani-specific genes required for visceral infection. Mol. Microbiol. 77, 505–517.

    Zhang, Y.E., Landback, P., Vibranovski, M.D., Long, M., 2011. Accelerated recruitment of new brain development genes into the human genome. PLoS Biol. 9, e1001179.

    Elucidating evolutionary features and functional implications of orphan genes in Leishmania major

    1 Introduction
    2 Materials and methods
    2.1 Collection of dataset and gene expression data
    2.2 Identification of orphan genes
    2.3 Calculation of nucleotide substitution rate
    2.4 Codon usage indices calculation
    2.5 Calculation of gene age
    2.6 Prediction of protein intrinsic disorder
    2.7 Prediction of GO term, subcellular localization and pathogenic protein
    2.8 Identification of interaction and trafficking motifs
    2.9 Identification of CAAX prenylation motifs
    2.10 Statistical analysis
    3 Results and discussion
    3.1 Searching for orphan genes in L. major
    3.2 Evolutionary rate heterogeneity of orphan and non-orphan genes: effect of gene age
    3.3 Protein evolutionary rate: impact of gene level variants
    3.4 Protein evolutionary rate: impact of protein level variants
    3.5 Relative contribution of the factors in determining evolutionary rate variation
    3.6 Functional attributes of orphan genes
    3.7 Orphan genes of L. major putatively involved in secretory pathways, host–parasite interactions and virulence
    4 Conclusion
    Acknowledgements
    Appendix A Supplementary data
    References