Post on 11-Jul-2020
Hindawi Publishing CorporationComparative and Functional GenomicsVolume 2010, Article ID 520238, 13 pagesdoi:10.1155/2010/520238
Research Article
Comparative Analysis and EST Mining Reveals High Degree ofConservation among Five Brassicaceae Species
Jyotika Bhati, Humira Sonah, Tripta Jhang, Nagender Kumar Singh, and Tilak Raj Sharma
Genoinformatics Laboratory, National Research Centre on Plant Biotechnology, Pusa Campus (IARI), New Delhi 110012, India
Correspondence should be addressed to Tilak Raj Sharma, trsharma@nrcpb.org
Received 30 January 2010; Revised 21 May 2010; Accepted 11 July 2010
Academic Editor: Graziano Pesole
Copyright © 2010 Jyotika Bhati et al. This is an open access article distributed under the Creative Commons Attribution License,which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Brassicaceae is an important family of the plant kingdom which includes several plants of major economic importance. The Brassicaspp. and Arabidopsis share much-conserved colinearity between their genomes which can be exploited for the genomic research inBrassicaceae crops. In this study, 131,286 ESTs of five Brassicaceae species were assembled into unigene contigs and compared withArabidopsis gene indices. Almost all the unigenes of Brassicaceae species showed high similarities with Arabidopsis genes exceptthose of B. napus, where 90% of unigenes were found similar. A total of 9,699 SSRs were identified in the unigenes. PCR primerswere designed based on this information and amplified across species for validation. Functional annotation of unigenes showedthat the majority of the genes are present in metabolism and energy functional classes. It is expected that comparative genomeanalysis between Arabidopsis and related crop species will expedite research in the more complex Brassica genomes. This would behelpful for genomics as well as evolutionary studies, and DNA markers developed can be used for mapping, tagging, and cloningof important genes in Brassicaceae.
1. Introduction
Brassicaceae species consisting of various agronomicallyimportant crops like oilseeds, broccoli, cabbage, black mus-tard, and other leafy vegetables are cultivated in most partsof the world. The genus Brassica is evolutionarily closelyrelated to model crucifer plant Arabidopsis thaliana, sinceboth are members of the family Brassicaceae and reportedto have diverged 14−20 million years ago [1]. The majorcenters of diversity of Brassicaceae family are southwesternand central Asia and the Mediterranean region whereas thearctic, western North America, and the mountains of SouthAmerica are secondary centers of diversity [2]. The genusBrassica is a monophyletic group within the Brassicaceae. Itincludes the cultivated oil seeded species, Brassica juncea,B. napus, and B. rapa and vegetable B. oleracea, which arealso very closely related to A. thaliana. The genomes of thethree diploid Brassica species, that is, B. rapa, B. nigra, andB. oleracea, have been designated as A, B, and C, respectively,where as the genomes of the amphidiploids, B. juncea andB. napus, have been designated as AB and AC, respectively[3–5].
Comparative genomics is a powerful tool for genomeanalysis and annotation. There are two basic objectives forcomparative genomics. First, to understand the detailedprocess of evolution at the gross level (the origin of the majorclasses of organism) and at a local level (what makes relatedspecies unique) [6]. Second, to translate DNA sequence datainto proteins of known functions. The rationale here is thatDNA sequences encoding important cellular functions aremore likely to be conserved between species than sequencesencoding dispensable functions or noncoding sequences.
The biology of Arabidopsis and Brassica are very simi-lar. However, because of polyploidy nature of Brassicaceaespecies, their genomes are more complex compared to A.thaliana. A. thaliana serves as a model for comparativemicrosynteny studies with Brassica species because of itssmall genome (with less repetitive DNA), short generationtime, and well-established genetic and genomics resources[7]. A pattern of chromosomal colinearity has been identifiedbetween Arabidopsis and Brassica plants [7]. Since theBrassica and Arabidopsis belong to the same Brassicaceaefamily, the level of synteny between them may provide agood opportunity to study how genetic and morphological
2 Comparative and Functional Genomics
variation has developed during the evolution of the genome,including the endurance of certain genetic structures in Ara-bidopsis and related Brassica species [7]. Hence, comparativegenome analysis may lead to a better understanding of plantof closely related species.
ESTs are considered as important genomic resources formining DNA markers based on simple sequence repeats(SSRs). The SSRs are present and distributed in the genomesof all eukaryotes. Because of the abundance and specificityof SSRs, these are considered as important DNA markersfor genetic mapping and population studies. The importantfeatures of SSR markers coupled with their ease of detectionhave made them useful molecular marker in different crops[8]. Therefore, detection of SSRs in the unigenes and ESTs ofBrassicaceae species may help in designing a new set of DNAmarkers and may provide more insight in the evolution ofthese species. Once validated, these markers can be used bythe breeders in different Brassica improvement programmes.
The analysis of GC contents among unigenes and ESTsgives important indication about the gene and genomecompositions. The GC content of the sequence gives a fairindication of the melting temperature (Tm) and stabilityof the DNA molecules. The positive correlation has beenobtained with the higher GC content and absolute valuesof thermostability, bendability, and ability to B−Z transitionof DNA structure whereas negative correlation has beenobtained between the curvature and high GC content ofthe DNA molecule. The GC-rich DNA constitutes gene-rich,actively transcribed genomic regions hence considered goodas functional or expressed DNA [9]. The GC content ofsequences surrounding to the gene(s) also considered as thebest predictor of the rates of substitution during evolution[10]. However, such analysis is lacking in case of differentBrassica species.
In this study, the gene indices were constructed andcomparative analysis for five Brassicaceae species, namely,B. juncea, B. napus, B. oleracea, B. rapa, and R. sativus wasreported for the first time. These gene indices constitutea total of 131,286 nonredundant sequences which wasutilized to assess sequence conservation among Brassicaceaeon a genomic scale, mining SSRs, frequency and type ofrepeat elements, and finding GC contents. DNA markerswere designed and validated across Brassica species usingPCR. Using the computational method, we have identifiedsequence and functional similarity of Brassicaceae transcriptsto that of Arabidopsis, suggesting that a portion of thesetranscripts have a high degree of conservation with Arabidop-sis genome. These analyses provide insight into the overallsequence conservation among Arabidopsis and Brassicaceaeand within Brassicaceae.
2. Materials and Methods
2.1. Clustering of ESTs of Brassicaceae Species. For this study, atotal of 131,286 ESTs deposited till August 2006 in the publicdatabase NCBI (http://www.ncbi.nlm.nih.gov/) representingthe Brassicaceae species; B. juncea (235), B. napus (88,573),B. oleracea (20,923), B. rapa (21,422), and R. sativus (133)
were downloaded. The available ESTs of these species wereclustered into gene indices that represent a nonredundantset of transcripts or unigenes. Batch files of EST sequencesfor these species were downloaded in FASTA format. Thesequences were clustered by using the SeqMan programmeof DNASTAR software (http://www.dnastar.com/) to elim-inate redundancies and generate unigene sequences. Forclustering, we optimized clustering parameters in DNA Starsoftware by using sample data created by taking randomsequences of known genes. The optimized parameters werefound to be efficient to cluster ESTs to a specific expectedcluster and did not produce false joins among the ESTs.
2.2. Analysis of GC Content and SSR. The GC content ofall the five Brassicaceae species was calculated using theformulae in excel sheet. We calculated the number of G andC separately, summing the two quantities and dividing bythe total number of bases in that unigene sequence and thencomputing the percentage of GC contents.
The unigene sequences were used to identify SSRs usingMISA software (http://pgrc.ipk-gatersleben.de/misa/). Sixclasses of SSRs, that is, mono-, di-, tri-, tetra-, penta-,and hexanucleotide repeats were targeted for identificationusing this tool. The default setting used in the program forminimum number of repeats was 10 for mononucleotide, 6for dinucleotide, and 5 for tri-, tetra-, penta-, and hexanu-cleotides. In addition, this program also identifies complexrepeats. Batch files of the target species were exported to thelocal database in Sun server using FTP and were run throughMISA by passing the sequence file as input to the programat the command prompt. The output files were transferredto desktop using FTP and opened using excel sheets forvisualizing the results. The four classes of mononucleotideSSRs were defined based on the repeat length, that is,mononucleotides 15 or less bp, 16−30, 31−45, and 46 or morebp repeats. The class chosen for dinucleotide repeats were5−10 bp repeats, 11−16 bp repeats, and 17 or more bp repeats,while that for trinucleotide repeats were 5−10 and 11−16 bp.Results on repeat types, number of repeats, and frequencyacross all species were tabulated and significant resultsand observations were depicted in the form of differentfigures.
2.3. Functional Annotation of Unigenes. The unigene sequen-ces of the five Brassicaceae species were matched with Ara-bidopsis gene sequence database at local BLAST server usingBLASTN (with advanced options: -G5, -E1, -q1, -r1, -v1, and-b1). The results were extracted using in-house developedPerl scripts, and tabulated in excel sheet. The Arabidopsisunigene set was used as a reference, and the sequences ofeach of the five crops were split into batches of 200 each forcomparisons. The results were tabulated and the bit scorecutoff of 100 was applied to filter significant matches. Thesesieved hits were then BLAST searched against nr databaseusing BLASTX (http://blast.ncbi.nlm.nih.gov/Blast.cgi) forannotation. The annotated genes were classified into 28different functional categories based on their homology toknown proteins.
Comparative and Functional Genomics 3
Table 1: List of eighteen cultivars belonging to seven differentspecies of Brassica used for the analysis of SSR cross transferability.
Sr.no.
SpeciesGenome
(Chr. no.)Cultivars
1Brassicarapa
AA(n = 10)
B. rapa toria PT303
B. rapa toria TL-15
B. rapa toria Sangram
B. rapa cv Yellow sarson Pusa Gold
B. rapa cv Yellow sarson YSPB-24
2Brassicacarinata
BBCC(n = 17)
B. carinata Kiran
B. carinata NPC-9
B. carinata KC-01
3Brassicajuncea
AABB(n = 18)
B. juncea Pusa Agram
B. juncea Bio 902
B. juncea BEC1.44
4Brassicanapus
AACC(n = 19)
B. napus ISN-129
B. napus GSL-1
B. napus GSL-2
5Brassicaoleracea
CC(n = 9)
B. oleracea italica Palam Smridhi
B. oleracea botrytis Pusa Sharad
B. oleracea capitata Pusa Ageti
6 Raphanussativus
(n = 9) Raphanus sativus
2.4. Validation of SSR Markers. Five different species ofBrassica, namely, B. rapa, B. carinata, B. juncea, B. napus,and B. oleracea as well as R. sativa were used in thepresent study. All the species were subdivided into 2 to3 groups (Table 1). Total genomic DNA was extractedfrom the fresh leaves of all Brassica species using CTABmethod. Thirty four Genomic SSR markers, 15 unigene-derived and 39 genomic survey sequences (GSS) SSR wereused to study their transferability across the species. Thepolymerase chain reaction (PCR) conditions, particularlyannealing temperature for each primer, were standardizedusing gradient temperature ranges from 50◦C to 60◦C. ThePCR reactions were performed using PTC 225 gradientcycler (BIO-RAD Inc.) in 10 µL volumes containing 30 ngof brassica genomic DNA, 5 pmole, each of the forwardand reverse primers, 0.1 mM dNTPs, 1x PCR buffer (10 mMTris, pH 8.0, 50 mM KCl and 50 mM ammonium sulphate),1.8 mM MgCl2, and 0.2 unit of Taq DNA polymerase. ThePCR cycling conditions involved initial DNA denaturationat 94◦C for 5 min followed by 30 cycles of denaturation at94◦C for 1 min primer annealing at 55◦C-60◦C for 1 min andprimer extension at 72◦C for 1 min. This was followed by afinal extension step at 72◦C for 10 min followed by storage at4.0◦C. The amplified products were resolved on 3% agarosegel using 1x TBE buffer, run at 120 V for 2 to 3 h dependingon the size of the expected PCR product, and visualizedusing ethidium bromide staining using GEL documentationsystem. The band sizing of the amplicon generated by eachSSR marker was determined as against 100 bp DNA ladder.
B. napusB. junceaB. oleracea
B. rapaR. sativus
0
5
10
15
20
25
30
35
40
Un
igen
e(%
)
10–1
5
20–2
5
30–3
5
40–4
5
50–5
5
60–6
5
70–7
5
80–8
5
90–9
5
GC content range (%)
Figure 1: Frequency distribution of Unigenes with respect to GCcontent in five brassica species The average GC content of all thespecies was between 50%−55% and symmetrical in distributionexcept for B. napus which showed skewed distribution ranging from30%−95%.
3. Results
3.1. Clustering of ESTs into Unigenes. A total of 131,286 ESTsequences for five different crucifer family members weredownloaded from the GenBank including dbESTs. TheseESTs were generated from different tissues and stress levelsby various workers (http://www.ncbi.nlm.nih.gov/). Allsequences for each species were clustered into 25,428 uni-genes (http://203.122.19.19/plantgenomedb/plantgenomedb.html) in five species. Less-abundant or lowly expressedtranscripts could not be assembled into larger contigsremained as singletons. A summary of the EST and unigenesof each species is given in Table 2. In case of B. juncea, 83.4%of EST formed unigenes followed by B. oleracea (49.14%),B. rapa (41.14%), and B. napus (6.82%). We found only 133EST sequences in case of R. sativus of these 70.68% formed94 unigenes.
3.2. Similarity of Brassicaceae Gene Indices with ArabidopsisGenes. Using Arabidopsis gene indices, a comparative anal-ysis of Arabidopsis with the five Brassicaceae species geneindices exhibited high level of similarity with the unigenesof B. juncea, B. napus, B. oleracea, B. rapa, and R. sativus(Table 3). The analysis based on EST-derived unigenes inthese five Brassica species revealed that the majority of thegene indices have very less sequence variation comparedto Arabidopsis gene indices and are conserved across theBrassicaceae family.
3.3. Analysis of GC Content of Brassicaceae Unigenes. Weanalyzed the GC content (ratio of guanine and cytosine)of all the unigenes, and results were tabulated based onthe class intervals defined in the range from 10%−95% GC
4 Comparative and Functional Genomics
Table 2: Summary of gene indices of different species of Brassicaceae family.
Species ESTs∗ Unigene sESTs % EST forming unigenes
B.juncea 235 196 176 83.40
B. napus 88573 6045∗ 5468 6.82
B. oleracea 20923 10281 7104 49.14
B. rapa 21422 8812 5546 41.14
R. sativus 133 94 75 70.68
Total 131286 25428 18369∗http://www.ncbi.nlm.nih.gov/.
Table 3: Genome size, number of unigenes, and similarity betweenunigenes and the genes of Arabidopsis.
SpeciesGenome Size
(Mbp)Number ofUnigenes
BLASTNhomology
B. juncea 1068∗ 196 195
B. napus 1129−1235∗ 6045 5985
B. oleracea 599−618∗ 10281 10280
B. rapa 468−516∗ 8812 8812
R. sativus 573∗∗ 94 94∗http://www.brassica.info/information/GenomeSize.htm.∗∗http://radish.plantbiology.msu.edu/index.php/RadishDB:Analysis.
content, with an interval of 5%. The GC content rangeof the transcripts of all the unigenes of 5 Brassicaceaespecies is given in Figure 1. The average GC content ofall the species was between 50%−55% and symmetricalin distribution except for B. napus which showed skeweddistribution ranging from 30%−95%. The GC content of R.sativus unigenes was quite variable (Figure 1).
3.4. Distribution of Repeat Length Classes in Unigenes. Wefound that in all the five Brassicaceae species explored inpresent study, most of the unigenes contained a single SSRstretch from which potential unique markers can be derived.The frequency of single SSR-containing unigene ranged from60% (B. rapa) to 92% (R. sativus). The average frequency ofunigenes containing multiple SSRs across all five species was25%. The maximum number of unigene containing singleSSR was found in case of B. rapa, followed by B. juncea andB. oleracea (Table 4). The SSR frequency observed was notuniform among these Brassica species (x2 = 456.2, df = 4).The relative abundance of mono-, di-, tri-, tetra-, penta-, andhexanucleotide repeats in all the five Brassicaceae species weredetermined by calculating their frequencies in the unigenes.The mononucleotide repeats were predominant in all thefive species studied in present investigation. The frequency ofmononucleotide repeats varied from 60% in B. rapa to 92%in R. sativus. The second dominant class was dinucleotiderepeat in all species except B. juncea, which had trinucleotiderepeat at second position. In rest of the species, highest per-centage of mononucleotide repeats were obtained followedby di-, tri- and tetranucleotide repeats. A little variationobserved at penta- and hexanucleotide where frequency ofhexanucleotide was greater than pentanucleotide repeats.
3.5. Frequencies of Different SSR Repeat Types. The relativefrequencies of SSRs were calculated for five species. Thefrequency estimates shown are based on the total numberof SSRs observed in all unigenes that have either singleor multiple SSRs. It was seen that A/T repeats were thepredominant mononucleotides in all the five species. Theresults indicated that A/T SSRs represent more than 50%of the total SSRs in all five species whereas the frequencyof C/G repeats were 19.14% in B. oleracea, 4.55% in B.juncea, and 4.17% in R. sativus (Figure 2(a)). Amongdinucleotide SSRs, AG/GA/CT/TC group was a ruling classof dinucleotide repeats in all of the species analyzed duringthis investigation. It ranged from 4.2% to 18.7% of the totalSSRs explored. These repeats were maximum in B. rapafollowed by B. juncea, B. oleracea, B. napus, and R. sativus.The average frequency of AT/TA and AC/CA/TG/GT wasalmost same (0.61% and 0.67%, resp.) among the five species(Figure 2(b)).
An assay of frequencies of trinucleotide repeats oftotal SSRs showed the predominance of AAG/AGA/GAA/CTT/TTC/TCT repeats class in 4 out of 5 species. Forinstance, the trinucleotide repeats were 22.73% in B. juncea,18.48% in B. rapa, 12.85% in B. oleracea, 6.62% in B. napus,and 4.17% in R. sativus (Figure 2(c)). In R. sativus, theonly ATG/TGA/GAT/CAT/ATC/TCA repeat class was found,which is the second dominant class of repeats in B. juncea.The AGG/GGA/GAG/CCT/CTC/TCC repeat was the seconddominant class in B. napus, B. oleracea, and B. rapa.
The possibility of tetranucleotide repeats is 33 acrossthe genomes [11, 12], but only a small number of tetranucleotide repeats were observed among the 5 Brasiscaspecies in present study. As the numbers are too low forfrequency evaluation, all of the observed tetranucleotiderepeats were assayed in order to figure out the most recurrenttetranucleotide SSRs across these Brassicaceae species. Thetop 15 tetranucleotide repeats obtained in the 5 Brassicaceaespecies were AAAC, AAAG, ATGA, CCAA, CTTT, GAAC,TACA, GAAA, AGAA, TTGT, TCAA, TTTG, AATC, CAAA,and GAAG. The AAAC and AGAA repeats were the mostabundant tetranucleotide SSRs.
3.6. Frequencies of Different SSRs Repeat Length Classes.It was found that the majority of mononucleotide SSRsfall in 16−30 repeat classes followed by 15 or less repeatclasses, except in B. juncea and B. oleracea, where 15 or lessrepeat classes were more abundant than 16−30 repeat classes
Comparative and Functional Genomics 5
Table 4: Different types of SSR identified in the unigenes of five Brassicaceae crops.
Crops Unigenes Monomer Dimer Trimer Tetramer Pentamer Hexamer
B.juncea 196 14 3 5 0 0 0
B. napus 6045 3211 301 250 7 1 5
B. oleracea 10281 1904 385 339 5 1 4
B. rapa 8812 1187 423 367 5 0 4
R. sativus 94 22 1 1 0 0 0
Total 25428 6338 1113 962 17 2 13
(Figure 3(a)). In B. rapa, the 15 or less and 16−30 repeatclasses almost shared nearly equal distribution of the SSRs.Although SSRs with 46 or more repeats were less frequentin all species. Distribution of dinucleotide SSRs showedthat in most of species, they fall in the category of 5−10repeat classes succeeded by 11−16 repeat classes (Figure 3(b)).However, in R. sativus, SSRs were detected in 17 or morerepeat classes. With respect to the occurrence of trinucleotideSSR distribution into repeat length classes, the 5−10 repeatclasses were most predominant in all the species analyzed(Figure 3(c)). Thus, the distribution of SSRs clearly showedthe predominance of mononucleotide SSRs containing 16−30repeats and di- and trinucleotide containing 5−10 repeats.
3.7. Functional Annotation of the Unigenes. The data from thecompletely sequenced Arabidopsis genome was used to pre-dict genes and use them to compare with other species. Theunigene sequences from five Brassicaceae species; namely, B.juncea, B. napus, B. oleracea, B. rapa, and R. sativus wereused in this analysis. The functional categories of differentunigenes are given in Figure 5. The most predominantfunctional category of unigenes were metabolism and energy,consisting of 33.71% of the total unigenes in B. juncea and25.32% in R. sativus followed by B. napus (21.15%), B. rapa(20.72%), and B. oleracea (19.48%). The second dominantfunctional category was structural/catalytic proteins, whichconsisted of 15.41% of the total unigenes in B. rapa,followed by 13.92% in R. sativus, 13.52% in B. oleracea,10.07% in B. napus, and 8.99% of the total unigene inB. juncea. Few other dominant functional categories werecell localization, protein activity regulation, and cellulartransport (Figure 5). In two Brassicaceae species, that is,B. juncea and R. sativus common functional categorieslike cellular communication/signal transduction, interactionwith environment (systemic), transposable elements, viraland plasmid proteins, cell type differentiation, organ dif-ferentiation, subcellular localization, organ localization, andnuclear protein were not obtained (supplementary Table 1available online at doi:10.1155/2010/520238).
3.8. Validation of SSR Markers. To determine amplificationefficiency of SSR markers, 35 genomic, 39 GSS and, 15unigene-derived markers were chosen and used in PCRamplification. Thirty-one (88.57%) of the 35 genomic SSRmarkers, thirty-two (82.05%) of 39 GSS-SSR, and fourteen(93.3%) of 15 unigene SSR were successfully amplified(Table 5). Most of the markers produced fragments of
expected size. The number of alleles amplified per locusranged from 1 to 5 for genomic SSR, 1 to 2 alleles in caseof unigene SSR and from 1 to 3 alleles in case of GSS-SSR (Figure 4). All the markers amplified similar as wellas different size of DNA fragments in case of Brassica spp.Most of the primers were showing polymorphism withinand between Brassica species. Genomic SSR showed 63%polymorphism, unigene-derived SSR showed 40% whereasGSS-SSR showed 86% polymorphism across all brassicagenotypes analyzed in this study. Our study thus identifiedmarkers that are cross-transferable among different Brassicaspecies.
4. Discussion
Crops belonging to Brassicaceae family are closely related toArabidopsis thaliana. Since the whole sequence of A. thalianagenome has been decoded and is in public domain [13],it can be effectively used in comparative genome analysiswith the genomic sequence of Brassica species to understandbiological processes and manipulating different traits. In thepresent investigation, a comprehensive and detailed analysisof Brassicaceae unigenes was made and compared with thatof A. thaliana gene indices. Our analysis showed that Brassicaand Arabidopsis genes share high percentage of sequenceidentity hence can be used in various functional genomicstudies in Brassicaceae.
Analysis of GC contents showed that the unigenes ofB. juncea, a tetraploid species have more GC content thananother tetraploid species like B. napus. Even the unigenesof B. napus were less than that of diploid species B. oleraceaand B. rapa [14]. It has also been reported that the GCcontents may vary even in phylogenetically related specieslike onion and rice [14]. In other studies the mean GCcontent of coding regions is higher in angiosperms comparedto the dicots [15]. However, from present investigation,such conclusions cannot be drawn since we have takenall the unigene sequences and did not distinguish amongcoding or noncoding regions. A gradient in GC contentsalong the direction of transcription has been obtainedin case of gramineae genes [16]. Their exhaustive analy-sis showed that 5′-ends of gramineae genes were having25% higher GC contents than their 3′-ends. Similarly,microsynteny analysis between Oryza sativa spp japonicaand O. sativa spp. indica showed presence of higher averageGC contents in japonica genes than in the indica genes[17].
6 Comparative and Functional Genomics
Ta
ble
5:D
etai
lsof
the
SSR
mar
kers
use
dfo
rev
alu
atio
nof
ampl
ifica
tion
amon
gcu
ltiv
ars
ofB
rass
ica
and
cult
ivar
ofR
apha
nus
sati
vus.
S.n
o.P
rim
erId
#Fo
rwar
dpr
imer
sequ
ence
(5′→
3′)
Rev
erse
prim
erse
quen
ce(5′→
3′)
Pro
duct
Size∗
An
nea
l.te
mp.
$
1B
nG
enom
ic11
4T
GG
TAT
TC
GG
GC
AT
TG
GTA
TG
CA
AC
CA
CA
CA
GA
AG
GA
CA
A15
360
2B
nG
enom
ic29
GA
GT
CT
GA
TC
CT
GC
CT
CC
AT
CT
TT
GT
TAG
CG
GC
TC
CA
TG
T17
155
3B
nG
enom
ic33
CTA
AC
GA
CC
CT
TT
TC
CG
TC
AA
AA
CC
CC
AC
GT
TC
AT
TT
TT
G19
455
4B
oG
enom
ic16
6G
CA
TG
CTA
TG
TT
GG
GA
AC
CT
AA
GA
AG
AC
GA
GC
AA
GG
AC
GA
181
655
Bo
Gen
omic
76C
AA
GG
CC
AA
GG
TAT
CT
GG
AA
CT
CT
CT
GA
CC
GA
GC
CA
TC
AT
160
556
Bo
Gen
omic
77G
AT
TG
AA
GG
CG
CT
TAG
GA
GA
AC
AT
GC
GG
GA
TT
TT
GA
AG
AC
151
607
Bo
Gen
omic
85G
AG
GC
GA
GTA
CG
GT
TC
AA
GT
CC
CA
AA
TG
AG
GG
AA
AA
CA
AA
160
658
Bo
Gen
omic
90A
GC
AA
GA
AG
AG
GC
TC
AC
CA
AC
CG
AA
TAG
TG
AT
GC
AC
CA
AA
199
559
Br
Gen
omic
654
GG
CT
CTA
AG
CA
AG
TG
GG
AC
AA
AG
TG
CC
GA
TG
AC
GA
AG
AA
G16
560
10B
rG
enom
ic66
4C
TC
TT
TC
CC
TT
TC
CG
GTA
GC
CT
CA
AT
CG
GT
TC
TT
CT
GC
TG
187
5511
Br
Gen
omic
665
CG
GC
CA
TT
CA
TAC
CA
TC
TC
TA
CA
GG
AG
AA
AG
GA
GC
CC
GTA
152
5512
Br
Gen
omic
674
CG
AA
GC
AA
AT
GA
AG
CA
GA
CA
GA
GG
AA
CG
GT
TC
AG
CA
AG
AG
159
5513
Br
Gen
omic
677
TT
TTA
CC
AA
CG
CC
TT
GA
TC
GA
CC
GC
CT
TC
AT
CG
TAT
TG
AC
184
5514
Br
Gen
omic
684
TC
AT
TC
AT
CC
GC
AG
CA
CTA
CC
AC
GT
CT
GG
TT
GT
GC
GTA
TC
171
5515
Br
Gen
omic
692
AA
TG
TT
TG
CA
GG
AG
GA
GA
GC
AT
GC
AA
CT
GA
AT
GT
GG
TG
GA
156
6016
Br
Gen
omic
697
CT
CT
TTA
CT
TG
CG
GG
AG
AC
GC
AG
CC
TC
CT
TC
AC
CA
AG
AA
C19
855
17B
rG
enom
ic70
8T
CC
CC
AT
CG
AC
CT
TT
CT
TC
TG
GG
AA
CC
AG
CA
TC
AT
CA
GT
T15
3N
A18
Br
Gen
omic
709
GC
TT
GC
TT
GG
CT
GA
TT
TG
TT
CA
AG
CA
TT
CA
CG
AC
AT
CA
CC
154
6519
Br
Gen
omic
718
CC
TC
GT
GT
CA
GC
GA
CC
ATA
AA
GA
AC
AC
AT
CG
CC
TC
CT
TC
178
5520
Br
Gen
omic
730
AC
TAG
GA
AG
AA
GA
AG
GG
AG
AA
AC
CC
CG
AG
CC
CA
TT
TC
AT
TG
TAT
150
5521
Br
Gen
omic
732
AG
CA
TT
TG
CA
CA
TG
TT
GG
TT
AT
GT
GC
CT
CG
CA
TG
TG
AA
T15
955
22B
rG
enom
ic73
9TA
GG
GT
GA
AA
GG
GA
AG
CT
CA
CG
CTA
ATA
AT
GG
CG
CTA
AG
G15
060
23B
rG
enom
ic77
3G
TT
TC
TAT
CG
GT
GC
CT
CG
TG
TG
GC
CA
CTA
TG
GT
TT
TG
TC
A16
355
24B
rG
enom
ic77
7G
CA
CT
CA
GG
AG
AA
CA
GA
GC
AC
CT
GG
TAA
AA
CG
TG
TT
GC
AT
T16
355
25B
rG
enom
ic91
0A
AC
CC
CG
CT
GA
GC
TT
TAC
TC
GT
TT
CC
AG
GA
AT
CG
CT
TT
GA
159
5526
Br
Gen
omic
935
TAC
CC
AG
TC
CC
AA
CT
TG
CT
CG
AT
CT
GC
TG
GT
TG
GG
GA
TAG
192
5527
Br
Gen
omic
938
TT
TT
GT
CC
CT
TT
CC
CA
AA
TC
AG
CA
CA
AA
CC
AC
AC
CC
AA
A17
355
28B
rG
enom
ic93
9T
GA
GA
AC
AG
AA
GC
TC
AG
AG
TC
GT
TC
CT
CT
GT
GA
CT
GA
CTA
CA
AA
AC
A15
0N
A29
Br
Gen
omic
940
TT
TC
TC
CA
AG
GC
AG
AG
AG
GA
CA
CC
GC
AT
GA
TT
TG
AT
TG
AA
178
5530
Br
Gen
omic
942
AG
GG
AG
AA
AA
CG
CG
TT
GA
TAC
CT
GG
AT
GT
TT
GC
AC
CC
TAT
171
5531
Br
Gen
omic
944
AC
TT
GG
CT
CT
CG
AT
TT
GC
TC
TG
GA
GT
GT
TC
CA
TC
TT
CC
AT
176
NA
32B
rG
enom
ic94
5C
GG
AA
CC
AC
CT
CC
TT
GT
CTA
AA
CC
GG
TT
TG
TC
AG
GT
TT
TG
157
NA
33B
rG
enom
ic94
6C
GG
TG
AG
AG
AG
AG
GG
AG
AT
TC
TC
TC
CT
TT
GT
TT
GG
GC
TC
AG
181
5534
Br
Gen
omic
948
AG
TT
CA
AA
CC
AG
GT
GC
TG
CT
GC
CG
GC
CC
TAA
GA
TT
TAT
GT
152
6035
Br
Gen
omic
952
TG
AG
CA
GA
CG
AA
AC
CT
GC
TA
GA
AG
GG
AG
GA
AG
GA
AC
GA
A16
855
36G
SSB
n15
4C
AT
GC
TC
AA
AG
TAG
AC
GC
AG
AG
CC
TT
TG
CT
TC
AC
AA
CA
CA
A17
055
Comparative and Functional Genomics 7
Ta
ble
5:C
onti
nu
ed.
S.n
o.P
rim
erId
#Fo
rwar
dpr
imer
sequ
ence
(5′→
3′)
Rev
erse
prim
erse
quen
ce(5′→
3′)
Pro
duct
Size∗
An
nea
l.te
mp.
$
37G
SSB
n31
3T
TC
CG
TC
AC
CTA
AC
TAG
TC
AC
CA
TAT
TG
AG
AC
GC
CG
CA
GA
AC
164
5538
GSS
Bn
366
CC
AA
GG
GG
GA
GT
GT
TAT
GA
AA
GA
GT
GA
AA
AG
GA
AA
AC
CT
CC
T17
555
39G
SSB
n41
8G
AA
GC
AG
CA
TC
AT
GC
CT
GTA
CA
AA
CC
ATA
GT
CA
AG
GC
CT
CA
158
5540
GSS
Bn
423
TC
AG
TG
TG
CG
TAG
GA
AA
CA
AA
GG
TG
GG
TAC
CTA
AT
GG
TT
GG
150
NA
41G
SSB
n42
7T
TG
GA
TC
GTA
CT
GG
CC
TG
AT
GT
CC
TT
GT
TAT
GC
GG
CA
AA
G15
355
42G
SSB
n43
2C
AC
AT
GA
AC
AA
TT
GC
TT
GG
AG
TT
CG
TC
AA
GC
TAC
CA
CT
GG
A17
455
43G
SSB
n44
8A
GC
GG
AG
AT
TG
AC
TC
AG
AC
CG
GC
TT
CG
TC
TAA
AG
CC
AC
AG
170
NA
44G
SSB
n45
0A
AG
AG
CA
GG
CA
AC
AA
AT
CG
TG
CT
TG
GC
GA
AG
TAA
AA
AC
AA
156
NA
45G
SSB
n46
4T
CC
GA
CG
CA
AA
CTA
TC
AT
CA
TC
GA
TC
AC
CG
AT
GA
AG
TC
AC
150
5546
GSS
Bn
469
TC
TC
TG
GT
CA
CC
CC
TC
TAG
CC
TC
CA
TC
GA
AG
AA
AG
CC
AA
A15
155
47G
SSB
n48
3T
GT
TG
CT
GC
TT
CT
TC
GT
TT
GA
AT
CT
TC
AC
CA
AC
GC
TG
CT
T15
855
48G
SSB
n50
6G
GG
AA
GA
CC
CA
GA
AG
GA
AA
CA
GG
GA
GA
GG
GA
GA
AG
TG
AG
G15
155
49G
SSB
n51
7G
TG
GC
GA
CT
CT
GG
TG
AA
GA
CA
TT
CA
AT
TC
GC
TT
CC
TC
AC
G16
155
50G
SSB
n54
6A
AA
TAG
TC
GC
GA
TG
CG
TT
TT
TC
CT
TT
CG
TT
CC
GA
CA
AT
TC
159
5551
GSS
Bn
549
GC
CC
TG
CT
CG
TTA
GA
AG
AA
AA
GC
TT
GT
CC
CA
TT
CC
AA
CA
C16
460
52G
SSB
n55
2T
CC
TT
TC
GT
TC
CG
AC
AA
TT
CA
AA
TAG
TC
GC
GA
TG
CG
TT
TT
165
6553
GSS
Bn
561
TT
GC
CA
TC
TC
TC
CT
TC
GA
TT
CT
GG
AG
GT
GC
AG
CT
TT
GA
CT
168
5554
GSS
Bn
568
TC
CC
GTA
CG
AT
CC
TT
TG
AA
CA
TG
AC
GC
GA
CG
GA
TTA
TG
A15
155
55G
SSB
n57
0G
AC
AA
AA
AG
AG
CC
CA
CA
TG
AA
AG
CA
GG
TT
CT
TC
TC
CA
CC
AA
168
6556
GSS
Bn
571
GG
TT
GC
TAC
GG
TG
GA
GC
TAA
GA
GG
TT
GA
GA
CG
GA
AA
AG
CA
156
6557
GSS
Bn
574
CTA
AT
CG
AC
GC
AG
AC
GA
CA
GT
TT
GG
CT
TC
TC
CT
CG
AA
CT
C17
955
58G
SSB
n57
7G
GC
AT
TC
TC
AG
GT
CA
GT
GC
TG
GT
CC
TG
TT
CC
TG
AA
TT
CC
TC
159
NA
59G
SSB
n57
9C
TAC
CA
CC
GG
GA
AG
AA
AA
CA
CC
CT
CT
GT
CT
CC
CA
GTA
CC
A15
055
60G
SSB
n58
3T
GG
GA
AT
TG
CA
AC
AT
GA
AG
AC
AA
AG
AT
CG
GC
GA
AG
AA
GA
C19
055
61G
SSB
n60
6C
CG
GA
TAG
AG
AT
GG
AA
AT
GG
AT
TC
TC
CT
CA
GC
AG
CA
GC
A18
955
62G
SSB
n61
2G
CT
TG
CA
TG
TG
CTA
GG
TT
CA
ATA
CG
AG
CA
AG
GA
CG
AG
AC
G18
3N
A63
GSS
Bn
613
AG
GA
AT
GG
GT
CA
GA
TC
AA
GC
GT
CG
CT
GT
CT
CT
CT
CG
TC
CT
194
NA
64G
SSB
n61
4G
CC
CA
CA
AA
AA
TG
CT
GA
AA
TT
TC
GC
TT
GA
TTA
GC
TG
GA
AA
156
6565
GSS
Bn
616
GA
AT
CA
AG
CC
AC
CA
GC
AC
TT
AG
GA
GT
GA
TG
AG
CT
GG
CA
GT
152
6066
GSS
Bn
617
TC
GA
GT
CA
TAA
CG
CC
TT
TC
GA
AA
CC
GG
TC
GG
TTA
AA
AT
CA
160
6567
GSS
Bn
620
AA
GG
TG
AA
GC
CT
TT
GG
GT
CT
AA
GC
CA
AC
GA
AA
GC
AA
AA
AC
165
5568
GSS
Bn
622
TG
TG
GT
GA
TT
GC
TG
CTA
CA
GA
TAT
GC
AG
CT
GC
TT
TT
GC
TT
C18
9N
A69
GSS
Bn
623
AA
GA
CC
CA
AG
AC
CC
AA
GA
CC
CA
GC
TT
GG
AG
AG
AG
AG
GA
AG
G15
155
70G
SSB
n62
4A
TAA
CA
GT
CG
TC
CC
CC
TT
CC
CA
GC
AG
AG
AC
TT
GT
GG
GA
CA
189
5571
GSS
Bn
625
CT
TC
GC
CT
CG
ATA
AA
GA
AC
GT
GT
TATA
TG
GG
GA
AA
GG
TAG
GC
177
5572
GSS
Bn
626
AA
CC
GA
AC
CG
AA
AA
CC
AA
AT
GT
TG
CG
CG
GG
AT
TAT
TTA
T15
255
73G
SSB
n62
8G
GG
TG
AC
CG
TG
AT
CA
TAG
TG
TC
TT
CT
GC
AT
CG
TC
CA
AA
TC
A16
755
74G
SSB
n62
9C
GG
AC
CTA
TT
CC
TC
AA
GA
GC
AG
GA
GA
CA
AG
GA
GC
CA
CT
CA
199
5575
UB
oler
acea
320
AA
TC
TG
AA
TG
GG
CA
GT
TT
GG
GA
GC
AG
CC
CT
CA
TC
AT
CA
TC
190
6076
UB
oler
acea
321
TG
AG
TC
CA
CA
AA
CC
AG
AC
CA
TC
AT
CA
TC
TT
TG
TT
GG
GA
CC
T16
255
8 Comparative and Functional Genomics
Ta
ble
5:C
onti
nu
ed.
S.n
o.P
rim
erId
#Fo
rwar
dpr
imer
sequ
ence
(5′→
3′)
Rev
erse
prim
erse
quen
ce(5′→
3′)
Pro
duct
Size∗
An
nea
l.te
mp.
$
77U
Bol
erac
ea40
4A
GA
AG
AA
GA
AG
GG
CC
AA
AC
CT
CC
GA
GC
TTA
GA
TC
TG
TC
GA
G16
1N
A78
UB
oler
acea
433
GA
AC
CG
CT
CA
TG
AA
TG
CTA
CT
CC
CC
AA
GG
AT
TAG
GA
AG
TG
G20
055
79U
Bol
erac
ea44
7C
TAT
TC
CG
GC
GA
AC
TT
CT
CA
AC
GA
TC
CG
TTA
CT
CC
CG
TTA
150
5580
UB
oler
acea
460
CC
CG
GA
AG
AG
TT
TC
CC
TAT
CA
TC
CC
CT
GA
GA
AG
CT
GG
AA
T17
455
81U
Bol
erac
ea50
3A
TC
AA
CA
CA
GC
CT
CC
AG
CT
TG
TT
GA
TT
CG
GG
TG
CA
AG
AG
T15
355
82U
Bol
erac
ea50
4T
CC
GA
TC
AA
GT
CC
CA
TC
AA
TC
AC
TG
CT
TC
CC
CT
TTA
GC
AG
152
5583
UB
oler
acea
505
TAA
CC
CA
GG
AG
GT
GG
TC
AA
GT
TT
GG
GG
TC
ATA
CC
GT
TT
GT
171
NA
84U
Bol
erac
ea50
6A
AG
GA
CG
CT
GA
AG
CTA
CC
AA
TC
AA
GG
CC
GC
TAC
CA
TT
TAG
187
5585
UB
oler
acea
507
TG
AA
GA
GC
TT
CG
AA
GG
AA
GG
AC
CG
TG
TG
AG
AA
TC
CG
AA
AG
175
5586
UB
oler
acea
508
TAC
CG
GG
GA
AA
GA
AG
AA
GG
TT
TT
CA
GA
AT
CT
GC
AG
CA
AC
C16
355
87U
Bra
pa24
4A
TC
GC
AG
CC
TC
GA
GA
TTA
CT
GA
TC
GG
TG
AA
CG
GA
TAA
GG
A15
855
88U
Bra
pa42
1G
GC
AT
GG
CC
AG
CA
TATA
AG
TC
CT
CC
AT
GA
GA
CT
TC
GT
CA
A17
055
89U
Bra
pa42
2G
GC
AG
AA
GC
AT
CT
CC
TG
AA
GC
GC
TAT
GG
TC
CC
TT
TT
CA
GT
172
55#P
rim
erId
den
oted
asB
n:B
rass
ica
napu
s,B
o:B
rass
ica
oler
acea
,Br:
Bra
ssic
ara
pa,G
SS:G
enom
icsu
rvey
sequ
ence
s,U
:Un
igen
e.∗ E
xpec
ted
prod
uct
size
inba
sepa
irs.
$O
ptim
ized
ann
ealin
gte
mpe
ratu
rein◦ C
,N
A:n
otam
plifi
ed.
Comparative and Functional Genomics 9
B. napusB. juncea B. oleracea B. rapa R. sativus
A/TC/G
010
20
30
40
50
60
70
80
90
100
Freq
uen
cyof
mot
ifca
tego
ries
(%)
Brassicaceae species
(a)
B. napusB. juncea B. oleracea B. rapa R. sativus
AT/TAAG/GA/CT/TC
AC/CA/TG/GTGC/CG
02468
101214161820
Freq
uen
cyof
mot
ifca
tego
ries
(%)
Brassicaceae species
(b)
AAT/ATA/TAA/ATT/TTA/TATAAC/ACA/CAA/GTT/TTG/TGTAGT/GTA/TAG/ACT/CTA/TACAGC/GCA/CAG/GCT/CTG/TGCACC/CCA/CAC/GGT/GTC/TGGAAG/AGA/GAA/CTT/TTC/TCT
B. napusB. juncea B. oleracea B. rapa R. sativus0
2
4
6
8
10
12
14
16
Freq
uen
cyof
mot
ifca
tego
ries
(%)
Brassicaceae species
(c)
Figure 2: Frequency of SSR motif types in the unigenes. (a) mono-,(b) di-, and (c) trinucleotide repeats in unigene sequences signifyinguneven distribution of different motifs in five Brassica species.
B. napusB. juncea B. oleracea B. rapa R. sativus0
10
20
30
40
50
60
70
80
90
100
15 or less16–30
31–4546 or more
Brassicaceae species
Sin
gle
sequ
ence
repe
ats
(%)
(a)
B. napusB. juncea B. oleracea B. rapa R. sativus0
20
40
60
80
100
5–1011–16
17 or more
Sin
gle
sequ
ence
repe
ats
(%)
Brassicaceae species
(b)
0
20
40
60
80
100
120
Sim
ple
sequ
ence
rep
eats
(%)
B. napusB. juncea B. oleracea B. rapa R. sativus
5–1011–16
Brassicaceae species
(c)
Figure 3: Relationship between different motif types of SSRs. (a)mono-, (b) di-, and (c) trinucleotide repeats and repeat lengthobserved in unigene-derived SSRs of five Brassica species.
The frequencies of different classes and types of SSRshave been calculated in the unigenes of five species withinBrassicaceae species. Simple sequence repeats are found to bein abundance and consistently distributed in plant genomes.It has also been reported that SSRs occur as frequently asonce in about 6 kb in case of plant genomes [18]. SSRs are
10 Comparative and Functional Genomics
(bp)M M1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18500
100
(a)
(bp)M M1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
100
300
(b)
M M (bp)1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
100
200
(c)
Figure 4: Amplification profile of (a) genomic SSR marker Bo Genomic 90, (b) unigene SSR marker U boleracea 506 (c) GSS-SSR markerGSS Bn 464 in 18 genotypes belonging to Brassica species, lane 1, 2, 3 B. rapa toria, lane 4, 5 B. rapa cv Yellow sarson, lane 6, 7, 8 B. carinata,lane 9, 10, 11 B. juncea, lane 12, 13, 14 B. napus, lane 15, 16, 17 B. oleracea, lane 18 Raphanus sativa.
Hypothetical protein
B. napus
B. juncea
B. oleracea
B. rapa
R. sativus
0 20 40 60 80 100
(%)
Storage protein
Structural/catalytic proteinCellular transport
T.E./viral/plasmid proteinL.Z. protein Developmental
Ubiquitous expression Nuclear protein
Protein synthesisCell cycle and DNA processing
Protein fold/modification/destination
Metabolism and energyTranscription factor
Protein activity regulationCellular communication/signal transduction
Organ differentiationCell differentiation
Cell localization Organ localizationSubcellular differentiation
Cellular environ. interEnviron. inter.
Biogenesis of cellular components
Cell rescue, defense, and virulence
Figure 5: Frequency of genes in different functional categories analysed in five Brassicaceae species. The predominant functional categoryof unigenes was metabolism and energy followed by structural/catalytic protein. Most of the unigenes of all the species were hypothetical innature.
Comparative and Functional Genomics 11
more common in the vicinity of genes than in other regionsof the genome [19]. However, among five Brassicaceae cropsstudied in present investigation, 62.45% of the unigenes of B.napus contained SSRs.
Theoretically, the probability of finding mononucleotiderepeats in a genome is higher followed by dinucleotiderepeats and then by trinucleotide repeats followed by tetra-,penta-, and hexanucleotide repeats [20]. This trend ofdistribution of repeats for all the species, namely, B. napus, B.oleracea, B. rapa, and R. sativus has also been found in presentstudy. However, the trinucleotide repeats were the secondabundant in B. juncea. The frequency of hexanucleotiderepeats found in B. napus, B. oleracea, and B. rapa ismore than that of pentanucleotide repeats. The generaltrend showed that mononucleotides were the most abundantrepeats in all five species followed by di- and trinucleotiderepeats.
The available SSR motif combination could be groupedinto unique classes based on the property of DNA-basedcomplementarities. For mononucleotides, although A, T, C,and G are possible, A and T could be grouped into onecategory since an A repeat on one strand is same as a T repeaton the opposite strand and a poly C on one strand is the sameas a poly G on the opposite strand, resulting in two uniqueclasses of mononucleotides, A/T and C/G [11]. Similarly, inour study, all dinucleotides can be grouped into four uniqueclasses: (i) AT/TA; (ii) AG/GA/CT/TC; (iii) AC/CA/TG/GTand (iv) GC/CG. Thus, the number of unique classes possiblefor mono-, di-, tri-, and tetranucleotide repeats is 2, 4,10, and 33, respectively, [11, 12]. Major role of repeatelements has been attributed to the gene duplication andamplification for generating new alleles in a population.The whole genome analysis of rice and Arabidopsis hasshown very interesting observations. In whole rice genome,a total of 18,828 classes of di-, tri-, and tetranucleotide SSRsrepresenting 47 distinct motif families have been annotated[21]. It has been reported that 51 hypervariable SSR per Mbof the rice genome are available. These SSRs also used asDNA markers for specific regions of the genome, amplifiedwell with PCR, polymorphic among different genotypesthus are of immense applications in genetic analysis [21]. Acomprehensive analysis on presence of SSRs in Arabidopsisgenome has been performed [22, 23]. It has been reportedthat the majority (80%) of all SSRs found in Arabidopsisgenome were mono-, di-, tri-, tetra- and pentanucleotides[23]. In our analysis, maximum (22.73%) of trinucleotideswere obtained in B. juncea compared to other 4 speciesstudied. In Arabidopsis genome, SSRs in general are morefavored in upstream region of the genes and trinucleotiderepeated were the most common repeats found in the codingregions [22].
Comparative genomics has progressed the discovery andunderstanding of orthologues, but it has brought to lightmany fast evolving “orphan” genes of unknown functionand evolutionary history. In Brassica species, comparativeanalysis provides an opportunity to study rapid genomechanges associated with polyploidy level in this largest plantfamily. Brassica genome analysis might provide new insightsinto the organization of plant genome and the size and
shape of plants as well. To accomplish this task, the completesequence of Brassica’s close relative, Arabidopsis thaliana,would be an important genomic resource.
The abundance of unigenes with cellular roles in Bras-sicaceae species was estimated by classifying the BLASTXmatches with similarity to known proteins into 26 func-tional categories. The proportion of transcripts involved inmetabolism and energy was 24.1% (between 20% and 34%among Brassicaceae species). Though such analysis has notbeen performed in case of Brassica species, in sugarcaneassembled EST sequences with 23.8% transcripts involvedin various metabolism and energy processes like bioen-ergetics, secondary metabolism, lipid metabolism, aminoacid metabolism, DNA metabolism, nucleotide metabolism,and N, S, and P metabolism were obtained [24]. The22% of unigenes showed similarity with that of thegenes involved in storage protein, cell cycle, and DNAprocesses, transcription factor, protein synthesis, proteinfold/modification/destination, structural/catalytic protein,protein activity regulation, and nuclear protein in differentorganisms. Similar types of analysis was performed inwild Arachis stenosperma and found that ∼22% ESTs wereinvolved in the same function [25]. Maximum numbersof unigenes analyzed in our study are still hypotheticalor unknown hence could be used in functional analysisstudy, which may lead to discovery of some unique genes inBrassicaceae crops.
PCR-based markers designed from various genomicsequences can be used for various molecular and geneticstudies after their validation for quality and robustness ofthe amplification. Earlier reports suggest that a portion ofgenomic SSRs, developed in the past, have produced faintbands or stuttering [26, 27]. However, in the present study,all the genomic SSR produced clear and high-intensity bands.SSR derived from the genes have produced a high proportionof high-quality markers with strong bands and distinct allelesin most of the reports [28, 29]. The quality of genotypingdata obtained from EST-SSR is highly dependent on thequality and robustness of amplification patterns. Varshneyet al. [30] reported that markers derived from the conservedregion of genome are expected to show greater cross-transferability between species and genera. The unigene-derived SSR markers have unique identity and positions inthe transcribed region of the genome. With the availabilityof huge unigene databases, large-numbered SSR can be easilyidentified. The markers developed in present study wouldbe an important resource for the brassica breeders. Thesemarkers would be useful for generating comparative geneticand physical maps, study of genetic diversity, marker-assistedselection, and even positional cloning of useful genes inBrassica and other related species.
5. Conclusions
Our analysis on the comparative analysis of Brassicaceaecrops with A. thaliana confirmed a high level of nucleotidesequence conservation. Thus, a genome scale comparison ofArabidopsis with Brassica at the sequence level provides an
12 Comparative and Functional Genomics
excellent opportunity to find some agriculturally importantgenes, to clone and use them in breeding programmes.The average GC content of Brassicaceae species was between50%−55%. The mining of SSRs showed highest percent-age of mononucleotide repeats followed by di-, tri-, andtetranucleotide repeats in all of the species except B. juncea.A/T repeats were the prevalent mononucleotides with morethan 50% in all the 5 species. The predominant class ofdinucleotide repeats in all the species was AG/GA/CT/TC,maximum in B. rapa. The distribution of SSRs showedthe abundance of mononucleotide SSRs containing 16−30repeats while di- and trinucleotide containing 5−10 repeats.Out of the 28 functional categories, the ruling functionalcategory of unigenes was metabolism and energy followedby structural/catalytic protein. Comparative genomics canfacilitate the study of the evolution of sequences andfunctions of orthologous genes and also to understanddiversification and adaptation. These comparative studieshave contributed to analysis of complicated quantitativetraits and comparisons of the organization of the chromo-somes of Brassica. It is expected that comparative genomeanalysis between Arabidopsis and related crop species willexpedite research in the more complex Brassica genomes. Themarkers developed in present study would be an importantresource for the brassica breeders. These markers wouldbe useful for generating comparative genetic and physicalmaps, study of genetic diversity, marker-assisted selection,and even positional cloning of useful genes in Brassica andother related species.
Acknowledgment
Financial assistance received by T. R. Sharma from the IndianCouncil of Agricultural Research, New Delhi under NPTCProject on Bioinformatics and Comparative Genomics isgratefully acknowledged.
References
[1] Y. W. Yang, K. N. Lai, P. Y. Tai, and W. H. Li, “Rates ofnucleotide substitution in angiosperm mitochondrial DNAsequences and dates of diversification between Brassica andother angiosperm lineages,” Journal of Molecular Evolution,vol. 48, pp. 597–604, 1999.
[2] R. Price, J. Palmer, and I. Al-Shehbaz, “Systematic rela-tionships of Arabidopsis, a molecular and morphologicalperspective,” in Arabidopsis, E. Meyerowitz and C. Somerville,Eds., pp. 7–19, Cold Spring Harbor Laboratory Press, ColdSpring Harbor, NY, USA, 1994.
[3] M. Iwabuchi, K. Itoh, and K. Shimamoto, “Molecular andcytological characterization of repetitive DNA sequences inBrassica,” Theoretical and Applied Genetics, vol. 81, no. 3, pp.349–355, 1991.
[4] U. Lagercrantz and D. J. Lydiate, “Comparative genomemapping in Brassica,” Genetics, vol. 144, no. 4, pp. 1903–1910,1996.
[5] R. J. Snowdon, W. Kohler, W. Friedt, and A. Kohler, “Genomicin situ hybridization in Brassica amphidiploids and interspe-cific hybrids,” Theoretical and Applied Genetics, vol. 95, no. 8,pp. 1320–1324, 1997.
[6] S. B. Primrose and R. M. Twyman, Principles of GenomeAnalysis and Genomics, Blackwell Publishing, Boston, Mass,USA, 2002.
[7] K. Suwabe, H. Tsukazaki, H. Iketani et al., “Simple sequencerepeat-based comparative genomics between Brassica rapa andArabidopsis thaliana: the genetic origin of clubroot resistance,”Genetics, vol. 173, no. 1, pp. 309–319, 2006.
[8] H. S. Chawla, Introduction to Plant Biotechnology, SciencePublishers, 2002.
[9] A. E. Vinogradov, “DNA helix, the importance of being GC-rich,” Nucleic Acids Research, vol. 31, no. 7, pp. 1838–1844,2003.
[10] P. F. Arndt, T. Hwa, and D. A. Petrov, “Substantial regionalvariation in substitution rates in the human genome: impor-tance of GC content, gene density, and telomere-specificeffects,” Journal of Molecular Evolution, vol. 60, no. 6, pp. 748–763, 2005.
[11] M. V. Katti, P. K. Ranjekar, and V. S. Gupta, “Differentialdistribution of simple sequence repeats in eukaryotic genomesequences,” Molecular Biology and Evolution, vol. 18, no. 7, pp.1161–1167, 2001.
[12] J. Jurka and C. Pethiyagoda, “Simple repetitive DNA sequencesfrom primates: compilation and analysis,” Journal of MolecularEvolution, vol. 40, no. 2, pp. 120–126, 1995.
[13] S. Kaul, H. L. Koo, J. Jenkins et al., “Analysis of the genomesequence of the flowering plant Arabidopsis thaliana,” Nature,vol. 408, no. 6814, pp. 796–815, 2000.
[14] J. C. Kuhl, F. Cheung, Q. Yuan et al., “A unique set of 11,008onion expressed sequence tags reveals expressed sequence andgenomic differences between the monocot orders asparagalesand poales,” Plant Cell, vol. 16, no. 1, pp. 114–125, 2004.
[15] S. Jansson, G. Meyer-Gauen, R. Cerff, and W. Martin,“Nucleotide distribution in gymnosperm nuclear sequencessuggests a model for GC-content change in land-plant nucleargenomes,” Journal of Molecular Evolution, vol. 39, no. 1, pp.34–46, 1994.
[16] G. K.-S. Wong, J. Wang, L. Tao et al., “Compositional gradientsin Gramineae genes,” Genome Research, vol. 12, no. 6, pp. 851–856, 2002.
[17] S. P. Kumar, V. Dalai, N. K. Singh, and T. R. Sharma,“Comparative analysis of the 100 kb region containing thePi-Kh locus Between indica and japonica rice lines,” Genomics,Proteomics and Bioinformatics, vol. 5, no. 1, pp. 35–44, 2007.
[18] L. Cardle, L. Ramsay, D. Milbourne, M. Macaulay, D. Marshall,and R. Waugh, “Computational and experimental charac-terization of physically clustered simple sequence repeats inplants,” Genetics, vol. 156, no. 2, pp. 847–854, 2000.
[19] M. Morgante, M. Hanafey, and W. Powell, “Microsatellitesare preferentially associated with nonrepetitive DNA in plantgenomes,” Nature Genetics, vol. 30, no. 2, pp. 194–200, 2002.
[20] S. P. Kumpatla, Computational mining and survey of SimpleSequence Repeats (SSRs) in expressed sequence tags (ESTs) ofDicotyledonous plants, Ph.D. thesis, School of Informatics,Indiana University, Ind, USA, 2004.
[21] T. Sasaki, “The map-based sequence of the rice genome,”Nature, vol. 436, no. 7052, pp. 793–800, 2005.
[22] L. Zhang, D. Yuan, S. Yu et al., “Preference of simple sequencerepeats in coding and non-coding regions of Arabidopsisthaliana,” Bioinformatics, vol. 20, no. 7, pp. 1081–1086, 2004.
[23] M. J. Lawson and L. Zhang, “Distinct patterns of SSRdistribution in the Arabidopsis thaliana and rice genomes,”Genome Biology, vol. 7, no. 2, article no. R14, 2006.
[24] A. L. Vettore, F. R. da Silva, E. L. Kemper et al., “Analysis andfunctional annotation of an expressed sequence tag collection
Comparative and Functional Genomics 13
for tropical crop sugarcane,” Genome Research, vol. 13, no. 12,pp. 2725–2735, 2003.
[25] K. Proite, S. C. M. Leal-Bertioli, D. J. Bertioli et al., “ESTsfrom a wild Arachis species for gene discovery and markerdevelopment,” BMC Plant Biology, vol. 7, article no. 7, 2007.
[26] P. Stephenson, G. Bryan, J. Kirby et al., “Fifty new microsatel-lite loci for the wheat genetic map,” Theoretical and AppliedGenetics, vol. 97, no. 5-6, pp. 946–949, 1998.
[27] L. Ramsay, M. Macaulay, S. Degli Ivanissevich et al., “A simplesequence repeat-based linkage map of Barley,” Genetics, vol.156, no. 4, pp. 1997–2005, 2000.
[28] M. C. Saha, M. A. R. Mian, I. Eujayl, J. C. Zwonitzer, L.Wang, and G. D. May, “Tall fescue EST-SSR markers withtransferability across several grass species,” Theoretical andApplied Genetics, vol. 109, no. 4, pp. 783–791, 2004.
[29] N. Nicot, V. Chiquet, B. Gandon et al., “Study of sim-ple sequence repeat (SSR) markers from wheat expressedsequence tags (ESTs),” Theoretical and Applied Genetics, vol.109, no. 4, pp. 800–805, 2004.
[30] R. K. Varshney, R. Sigmund, A. Borner et al., “Interspecifictransferability and comparative mapping of barley EST-SSRmarkers in wheat, rye and rice,” Plant Science, vol. 168, no. 1,pp. 195–202, 2005.
Submit your manuscripts athttp://www.hindawi.com
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Anatomy Research International
PeptidesInternational Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporation http://www.hindawi.com
International Journal of
Volume 2014
Zoology
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Molecular Biology International
GenomicsInternational Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
The Scientific World JournalHindawi Publishing Corporation http://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
BioinformaticsAdvances in
Marine BiologyJournal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Signal TransductionJournal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
BioMed Research International
Evolutionary BiologyInternational Journal of
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Biochemistry Research International
ArchaeaHindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Genetics Research International
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Advances in
Virolog y
Hindawi Publishing Corporationhttp://www.hindawi.com
Nucleic AcidsJournal of
Volume 2014
Stem CellsInternational
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
Enzyme Research
Hindawi Publishing Corporationhttp://www.hindawi.com Volume 2014
International Journal of
Microbiology