Genome organization & its genetic implications
Lander, ES (2011) Initial impact of the sequencing of the human genome. Nature 470:187
Feuillet, C, JE Leach, J Rogers, PS Schnable, K Eversole (2011) Crop genome sequencing: lessons and rationales. Trends Plant Sci 16:77
DNA sequencing technologies
                      First gen (Sanger)   Next gen (454/Illumina/APG)
Read length           800 bases            30-300 bases
Speed                 0.1 Gb/day           1-5 Gb/day
Cost / human genome   $70,000,000          $75,000-$250,000
Metzker, M (2010) Sequencing technologies – the next generation. Nature Rev Genet 11:31
What are the challenges for the correct assembly of genome sequence information?
• Genome size
Eukaryotic genomes ~ 10^9 – 10^10 bp
• Genome composition
Eukaryotic genomes ~ 50% repetitive DNA
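To give a feel for the scale of the assembly problem, the following minimal sketch estimates how many reads are needed to cover a genome at a chosen average depth. The 3 x 10^9 bp genome size, the 10x depth and the 100-base next-gen read length are illustrative assumptions; the 800-base read length is the Sanger figure from the technology comparison above. The function name `reads_needed` is hypothetical.

```python
def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads giving the requested average coverage (rounded up)."""
    total_bases = genome_size_bp * coverage
    # Ceiling division on integers, so a partial read counts as one read.
    return -(-total_bases // read_length_bp)


genome = 3 * 10**9  # assumed genome size in bp (illustrative)
depth = 10          # assumed average coverage (illustrative)

for name, read_len in [("Sanger, 800 bases", 800), ("next gen, 100 bases", 100)]:
    print(name, reads_needed(genome, read_len, depth))
```

Depth alone does not solve the second challenge: a repeat that is longer than a read cannot be placed uniquely, so repetitive DNA remains a problem at any coverage.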
Genome size – the C-value paradox
[Figure: genome size in base pairs]
The amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes
Genome Size – the C value paradox:
• Complexity = length in nucleotides of the longest non-repeating sequence that can be formed by splicing together all unique sequences in a sample
• Eukaryotic genomes contain different classes of DNA based on sequence complexity:
highly repetitive
middle repetitive
unique
Genome composition
Genome composition – DNA re-association kinetics
[Figure: DNA re-association (C0t) curve; axes: complexity and C0t in [moles of nucleotide / liter] x sec]
Genome composition - DNA re-association kinetics for a complex eukaryotic genome
[Figure: C0t curve, x-axis in [moles of nucleotide / liter] x sec, with three kinetic components: highly repetitive sequences, middle repetitive sequences, single copy sequences]
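The three-component curve can be sketched with ideal second-order re-association kinetics, where the fraction of DNA still single-stranded is 1 / (1 + k x C0t) for each kinetic class. This is a minimal sketch: the genome fractions and rate constants below are made-up illustrative values, not data from the slides, and `fraction_single_stranded` is a hypothetical helper name.

```python
def fraction_single_stranded(cot, components):
    """Fraction of DNA still single-stranded at a given C0t.

    components: list of (fraction_of_genome, rate_constant_k) pairs,
    with k in liter / (mole of nucleotide x sec).
    """
    return sum(fraction / (1.0 + k * cot) for fraction, k in components)


# Hypothetical three-component genome (all values are illustrative).
components = [
    (0.25, 1e2),   # highly repetitive: re-associates fastest
    (0.30, 1e0),   # middle repetitive
    (0.45, 1e-3),  # single copy: re-associates slowest
]

for exponent in range(-4, 6):
    cot = 10.0 ** exponent
    remaining = fraction_single_stranded(cot, components)
    print(f"C0t = 1e{exponent:+d}  fraction single-stranded = {remaining:.3f}")
```

For a single component, half of the DNA has re-associated at C0t = 1/k, so a slower class (smaller k) shifts its step of the curve to the right. That shift is what separates single copy sequences from the repetitive classes.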
From genome composition to genome organization
How are unique, middle repetitive and highly repetitive sequences organized in the genome?
Genome organization
[Figure: arrangement of genes and repeats in E. coli, S. cerevisiae, H. sapiens and Z. mays; legend: repeat, gene; annotations: gene island, gene desert]
Genetic complexity
• Eukaryotic genomes contain ~ 20,000 – 30,000 genes
• 30% of protein coding genes are members of gene families
duplication & divergence of sequence & gene function
Gene complexity
• What does a gene look like from a sequence or transcript perspective?
no “typical gene”
• Introns and exons
introns can be numerous and long, i.e. some genes are more intron than exon!
alternative splicing variants are common
• Not all genes encode proteins
non-coding structural RNAs (e.g. rRNA, tRNA, snRNA, snoRNA)
non-coding regulatory RNAs (e.g. miRNA, lncRNA)
Implications of gene and genetic complexity
• Forward genetics: Have mutant – want gene
• Via map-based cloning:
Map your mutation
Look at the genome sequence in the map interval to identify candidate genes
• Candidate gene identification may not be trivial, even with good genome annotation!
Especially an issue for plant genome sequences – only Arabidopsis and rice are considered “finished” quality
• Note: further genetic tests are required, even if the perfect candidate is identified.
Gene identification - open reading frames
5'-atgcccaagctgaatagcgtagaggggttttcatcatga-3'
frame 1  atg ccc aag ctg aat agc gta gag ggg ttt tca tca tga
         M   P   K   L   N   S   V   E   G   F   S   S   *
frame 2  tgc cca agc tga ata gcg tag agg ggt ttt cat cat ga
         C   P   S   *   I   A   *   R   G   F   H   H
How to tell real ORFs from random chance ORFs?
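The frame-by-frame reading above can be reproduced with a short script. This is a minimal sketch, not a gene finder: it uses the standard genetic code, scans only the three forward frames (the reverse strand is omitted for brevity), and the helper names `translate`, `find_orfs` and `prob_open_by_chance` are hypothetical. The chance estimate assumes all 64 codons are equally likely and independent, which real genomes violate.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built in TCAG order; "*" marks a stop codon.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame=0):
    """Translate one forward frame (0, 1 or 2); incomplete final codons are dropped."""
    seq = seq.upper()
    codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def find_orfs(seq, min_codons=1):
    """Return (frame, start, end, protein) for each ATG...stop ORF on the forward strand."""
    orfs = []
    for frame in range(3):
        protein = translate(seq, frame)
        start = None
        for idx, aa in enumerate(protein):
            if aa == "M" and start is None:
                start = idx
            elif aa == "*" and start is not None:
                if idx - start >= min_codons:
                    orfs.append((frame, frame + 3 * start,
                                 frame + 3 * (idx + 1), protein[start:idx]))
                start = None
    return orfs


def prob_open_by_chance(n_codons):
    """Chance that n random codons contain no stop (61 of 64 codons are sense codons)."""
    return (61 / 64) ** n_codons


example = "atgcccaagctgaatagcgtagaggggttttcatcatga"
print(translate(example, 0))
print(translate(example, 1))
print(find_orfs(example))
print(prob_open_by_chance(100))
```

Under these assumptions a long open frame is unlikely to arise by chance, so ORF length is one common filter. A length cutoff is only a heuristic, though, and it would discard genuinely translated short ORFs.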
Galindo et al. PLoS Biol 5(5): e106 doi:10.1371/journal.pbio.0050106
Gene identification - short ORFs can be translated!
• e.g. the Drosophila tarsal-less gene
Gene identification – database searching
e.g. http://blast.ncbi.nlm.nih.gov/Blast.cgi
Gene identification – shared synteny
Preserved localization of genes on chromosomes of different species
e.g. mouse chromosome 11 and parts of 5 different human chromosomes
Perfect correspondence in order, orientation and spacing of 23 putative genes, and 245 conserved sequence blocks in noncoding regions
Caution! Even regions of high synteny may not show perfect gene-for-gene correspondence
from Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.
Gene identification – shared synteny
Preserved localization of genes on chromosomes of different species
e.g. maize – sorghum (G) – rice (H)
Schnable et al. Science 326:1112
Gene identification – promoter elements
• TATA-box elements
5'-TATAAA-3' or variant
plant and animal promoters
• CpG islands
Regions of higher than expected CpG dinucleotide content, unmethylated in active promoters
~ 40% of mammalian promoters
~ 70% of human promoters
but NOT in plant promoter regions
• Y patch (pyrimidine-rich patch)
plant, not mammalian, promoters
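"Higher than expected CpG content" can be made concrete with an observed/expected ratio: the number of CpG dinucleotides divided by (number of C x number of G / sequence length). The sketch below computes that ratio for one window. The thresholds (length at least 200 bp, GC content above 50%, ratio above 0.6) are a commonly used convention and an assumption here, not values from the slides; the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC content, observed/expected CpG ratio) for one sequence window."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides read 5' to 3' on this strand
    gc_content = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0
    ratio = cpg / expected if expected else 0.0
    return gc_content, ratio


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed length, GC-content and ratio thresholds to a window."""
    gc_content, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_content > min_gc and ratio > min_ratio


cg_rich = "CGCG" * 50  # artificial 200 bp window, illustrative only
at_rich = "ATAT" * 50  # artificial 200 bp window, illustrative only
print(cpg_stats(cg_rich), looks_like_cpg_island(cg_rich))
print(cpg_stats(at_rich), looks_like_cpg_island(at_rich))
```

In practice the window would be slid along the region upstream of a candidate gene. As noted above, this signal is informative for mammalian promoters but not for plant promoter regions.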
Gene identification – introns & exons
• Long gene space: more intron than exon
• Extreme example - human clotting factor VIII gene
Gene identification – alternative splicing variants
Pistoni et al. RNA Biol 7:441
Gene identification – trans-splicing
Gingeras, Nature 461: 206
Gene identification – non-coding RNAs
• non-coding structural RNAs
rRNA & tRNA – transcription & translation
snoRNA – small nucleolar RNAs
guide chemical modification of rRNAs & tRNAs
snRNA – small nuclear RNAs
guide splicing reactions
• non-coding regulatory RNAs
miRNA & siRNA - small interfering RNAs
RNAi pathway
lncRNA - long noncoding RNAs
Origins of long non-coding RNAs
Kapranov, Nature Rev Genet 8:413
Overlapping transcriptional architecture
• e.g. the human phosphatidylserine decarboxylase (PISD) gene
Wilusz et al. Genes Dev. 23: 1494–1504
Functions of lncRNAs
Genome - Transcriptome - Proteome
• Genome
Full complement of an organism’s hereditary information
• Transcriptome
Full set of RNA molecules, coding and non-coding, transcribed from the genome
• Proteome
Full set of proteins expressed from a genome
• Not a 1:1:1 correspondence
Implications of gene and genetic complexity
• What is the take-home message for forward genetics?
Implications of gene and genetic complexity
• Reverse genetics: Have gene – want phenotype
Predict phenotypes based on gene function in other organisms
Knock out or knock down your gene of interest & look for corresponding changes in phenotype
Gene families
• Gene duplication followed by:
Duplication of gene function
Divergence of gene function
Loss of gene function leading to a pseudogene
• e.g. human globin gene family
Gene families
• Gene duplication followed by:
Duplication of gene function
Divergence of gene function
Loss of gene function leading to a pseudogene
• e.g. human beta-globin gene cluster, chromosome 11
Five functional genes and two pseudogenes
Gene families – paralogs & orthologs
• Homologs
Protein or DNA sequences having shared ancestry
• Orthologs
Homologs created by a speciation event
May or may not retain the same function!
• Paralogs
Homologs created by a gene duplication event
May or may not retain the same function!
• It is not always easy or possible to distinguish orthologs from paralogs when comparing genes or proteins between species
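One widely used heuristic for proposing ortholog pairs is the reciprocal best hit: gene a in species A and gene b in species B are paired only if each is the other's top-scoring match. The sketch below is an illustration under stated assumptions, not a definitive method: the gene names and similarity scores are invented, and in practice the scores would come from a database search such as BLAST.

```python
def best_hits(scores):
    """scores maps (query, subject) -> similarity score; return the best subject per query."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: species A carries two copies, species B only one.
a_vs_b = {("geneA1", "geneB1"): 950, ("geneA2", "geneB1"): 700}
b_vs_a = {("geneB1", "geneA1"): 950, ("geneB1", "geneA2"): 700}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here geneA2 also matches geneB1 but is not its best hit, so it is left unpaired as a possible paralog. The heuristic can still mislead, for example when the true ortholog has been lost in one species, which is one reason the distinction is not always possible.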
Gene families – paralogs & orthologs
[Figure: globin gene paralogs]
Gene families – paralogs & orthologs
[Figure labels: orthologs, paralogs]
Storz et al. IUBMB Life 63:313
Implications of gene and genetic complexity
• What are the implications of gene families for forward genetics (i.e. looking for candidate genes that condition a mutant phenotype)?
• What are the implications of gene families for reverse genetics (i.e. altering gene function and looking for a phenotype)?
Genome organization – repeated sequences ~ 50% of the genome
• Segmental duplications and copy number variation
• Tandemly repeated genes
rRNA, tRNA and histone gene products needed in large amounts
• Duplicated gene families
• Transposons
• Tandem simple sequence repeats
centromeric & telomeric repeats
minisatellites
microsatellites
Repeated sequences – segmental duplications & copy number variants
• Segmental duplications
> 1 kb block of duplicated sequence with > 90% sequence identity
recombine to mediate further copy number variants
Koszul & Fischer, C.R. Biologies 332:254
Repeated sequences – segmental duplications & copy number variants
Girirajan et al. Annu Rev Genet 45:203
• Copy number variant (CNV)
Deviation from diploid copy number at a locus
• Copy number polymorphism (CNP)
CNV present in >1% of a population
• Recent association with human developmental syndromes
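The CNV and CNP definitions above translate directly into a small check on per-individual copy number calls. This is a minimal sketch: the cohorts are invented, the function names are hypothetical, an autosomal locus with an expected copy number of 2 is assumed, and "present in >1% of a population" is read here as the fraction of individuals carrying any deviation.

```python
def cnv_carrier_frequency(copy_numbers, expected=2):
    """Fraction of individuals whose copy number deviates from the expected diploid value."""
    if not copy_numbers:
        return 0.0
    carriers = sum(1 for cn in copy_numbers if cn != expected)
    return carriers / len(copy_numbers)


def is_cnp(copy_numbers, expected=2, threshold=0.01):
    """True if the CNV is carried by more than `threshold` of the sampled individuals."""
    return cnv_carrier_frequency(copy_numbers, expected) > threshold


# Hypothetical cohorts of copy number calls at one locus.
cohort = [2] * 194 + [3] * 5 + [1] * 1  # 6 carriers out of 200
rare = [2] * 499 + [3]                  # 1 carrier out of 500

print(cnv_carrier_frequency(cohort), is_cnp(cohort))
print(cnv_carrier_frequency(rare), is_cnp(rare))
```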
Transposon-derived repeated sequences
• ~ 45% of human & 85% of maize genome
Transposon-derived repeated sequences
Gogvadze & Buzdin Cell Mol Life Sci 66:3727
• Many are truncated & inactive
• Considered to be important in the evolution of genome organization & function
Repeated sequences – short tandem repeats
• Centromeric
Long array (~100,000 bp) of short tandem repeats
~ 5 bp Drosophila, ~150 bp maize, ~170 bp human
not conserved across species
in some cases not even conserved in all chromosomes of the same species
Association with a centromere-specific histone H3
• Telomeric
Length varies between species
~ 300 base pairs - 150 kilobase pairs
Conserved, G-rich repeat sequence
vertebrates TTAGGG; most plants TTTAGGG
Repeated sequences – short tandem repeats
• Minisatellites (Variable number tandem repeats, VNTRs)
10-100 bp repeat units
500-30,000 bp arrays
The original DNA fingerprinting marker via Southern blotting
Now supplanted by microsatellites
Repeated sequences – short tandem repeats
[CACACACA]
[GTGTGTGT]
variety A
[CACA]
[GTGT]
variety B
• Microsatellites (Simple sequence repeats, SSRs)
Di, tri or tetra-nucleotide repeats; 1-10 repeat units per locus
Repeat numbers expand or contract over a short evolutionary, or even generational time-frame
Amplified by PCR
Primers based on unique flanking sequence
Products fractionated by capillary or acrylamide gel electrophoresis
Co-dominant mapping & fingerprinting markers
Both alleles can be detected in a heterozygous individual
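The variety A / variety B example can be reproduced with a regular expression that looks for a 2-4 base unit repeated in tandem. This is a minimal sketch: the flanking sequences are invented stand-ins for unique primer sites, and `find_ssrs` is a hypothetical helper, not a published tool.

```python
import re


def find_ssrs(seq, min_repeats=2):
    """Return (start, unit, repeat_count) for di-, tri- and tetra-nucleotide tandem repeats."""
    # ([ACGT]{2,4}?) captures the shortest candidate unit; \1{n,} requires further copies.
    pattern = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            continue  # skip runs of a single base such as AAAA
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits


# Hypothetical unique flanking sequences shared by both varieties.
flank_5, flank_3 = "GATTC", "GGTAC"
variety_a = flank_5 + "CACACACA" + flank_3
variety_b = flank_5 + "CACA" + flank_3

print(find_ssrs(variety_a), len(variety_a))
print(find_ssrs(variety_b), len(variety_b))
```

Because the flanks are identical, the two amplicons differ in length only by the number of repeat units. That length difference is what the electrophoresis step resolves, and a heterozygote would show both product sizes.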