Genome organization & its genetic implications


Lander, ES (2011) Initial impact of the sequencing of the human genome. Nature 470:187

Feuillet, C, JE Leach, J Rogers, PS Schnable, K Eversole (2011) Crop genome sequencing: lessons and rationales. Trends Plant Sci 16:77

DNA sequencing technologies

                      First gen (Sanger)    Next gen (454/Illumina/APG)
Read length           800 bases             30-300 bases
Speed                 0.1 Gb/day            1-5 Gb/day
Cost / human genome   $70,000,000           $75,000-$250,000

Metzker, M (2010) Sequencing technologies – the next generation. Nature Rev Genet 11:31

What are the challenges for the correct assembly of genome sequence information?

• Genome size
  Eukaryotic genomes ~ 10^9 – 10^10 bp

• Genome composition
  Eukaryotic genomes ~ 50% repetitive DNA

Genome size – the C-value paradox

[Figure axis: genome size in base pairs]

The amount of DNA in the haploid cell of an organism is not related to its evolutionary complexity or number of genes

Genome Size – the C value paradox:

• Complexity = length in nucleotides of the longest non-repeating sequence that can be formed by splicing together all unique sequences in a sample

• Eukaryotic genomes contain different classes of DNA based on sequence complexity:

highly repetitive

middle repetitive

unique

Genome composition

Genome composition – DNA re-association kinetics

[Figure: DNA re-association curves relating sequence complexity to [moles of nucleotide / liter] x sec]

Genome composition - DNA re-association kinetics for a complex eukaryotic genome

[Figure: re-association curve with three components (highly repetitive sequences, middle repetitive sequences, single copy sequences); x-axis: [moles of nucleotide / liter] x sec]
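The shape of such a curve can be sketched numerically. The snippet below is a minimal illustration, assuming ideal second-order re-association, where the fraction of DNA still single-stranded is 1 / (1 + k x C0t) for each sequence class. The three genome fractions and rate constants are hypothetical values chosen only to separate the three classes on the plot; they are not measurements from any genome.

```python
def fraction_single_stranded(cot, components):
    """Fraction of DNA still single-stranded at a given C0t value.

    components: list of (fraction_of_genome, rate_constant_k) pairs.
    Each class follows ideal second-order kinetics: f / (1 + k * C0t).
    """
    return sum(f / (1.0 + k * cot) for f, k in components)


# Hypothetical three-component genome (fractions sum to 1.0).
# A larger k means the class re-associates at a lower C0t.
COMPONENTS = [
    (0.25, 1e3),   # highly repetitive sequences
    (0.30, 1e0),   # middle repetitive sequences
    (0.45, 1e-3),  # single copy sequences
]

if __name__ == "__main__":
    for exponent in range(-5, 6):
        cot = 10.0 ** exponent
        remaining = fraction_single_stranded(cot, COMPONENTS)
        print(f"C0t = 1e{exponent:+d}   single-stranded fraction = {remaining:.3f}")
```

Each class is half re-associated at C0t = 1/k, so repetitive classes anneal first and single copy sequences last, which is what produces the stepped curve.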

From genome composition to genome organization

How are unique, middle repetitive and highly repetitive sequences organized in the genome?

Genome organization

[Figure: arrangement of genes and repeats along the genomes of E. coli, S. cerevisiae, H. sapiens and Z. mays; legend: repeat, gene; labels: gene island, gene desert]

Genetic complexity

• Eukaryotic genomes contain ~ 20,000 – 30,000 genes

• 30% of protein coding genes are members of gene families

duplication & divergence of sequence & gene function

Gene complexity

• What does a gene look like from a sequence or transcript perspective?
  no “typical gene”

• Introns and exons
  introns can be numerous and long, i.e. some genes are more intron than exon!
  alternative splicing variants are common

• Not all genes encode proteins

non-coding structural RNAs (e.g. rRNA, tRNA, snRNA, snoRNA)

non-coding regulatory RNAs (e.g. miRNA, lncRNA)

Implications of gene and genetic complexity

• Forward genetics: Have mutant – want gene

• Via map-based cloning:
  Map your mutation
  Look at the genome sequence in the map interval to identify candidate genes

• Candidate gene identification may not be trivial, even with good genome annotation!
  Especially an issue for plant genome sequences – only Arabidopsis and rice are considered “finished” quality

• Note: further genetic tests are required, even if the perfect candidate is identified.

Gene identification - open reading frames

5' atgcccaagctgaatagcgtagaggggttttcatcatga

frame 1  atg ccc aag ctg aat agc gta gag ggg ttt tca tca tga
         M   P   K   L   N   S   V   E   G   F   S   S   *

frame 2  tgc cca agc tga ata gcg tag agg ggt ttt cat cat ga
         C   P   S   *   I   A   *   R   G   F   H   H
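The frame-by-frame reading above can be reproduced with a short script. This is a minimal sketch, assuming the standard genetic code and forward-strand frames only; the helper name `translate_frame` is made up for illustration and is not from any library.

```python
# Standard genetic code, built from the conventional t/c/a/g ordering.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_frame(seq, offset):
    """Translate seq starting at offset (0, 1 or 2); '*' marks a stop codon.

    Trailing bases that do not fill a whole codon are ignored.
    """
    seq = seq.lower()
    codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
    return "".join(CODON_TABLE[codon] for codon in codons)


SEQUENCE = "atgcccaagctgaatagcgtagaggggttttcatcatga"

if __name__ == "__main__":
    for offset in range(3):
        print(f"frame {offset + 1}: {translate_frame(SEQUENCE, offset)}")
```

Only frame 1 reads through from the start codon to a terminal stop; the other frames hit internal stop codons, which is the basic signal an ORF finder looks for.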

How to tell real ORFs from random chance ORFs?
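One line of reasoning that bears on this question is ORF length. The sketch below is a rough illustration, assuming a random sequence with equal, independent base frequencies, so that 3 of the 64 codons are stops. Under that assumption the chance that a reading frame stays open for n codons is (61/64)^n. Real genomes have biased base composition, so the numbers are indicative only.

```python
# Assumption: equal and independent base frequencies, so 3 of 64 codons are stops.
STOP_FRACTION = 3 / 64


def prob_open_by_chance(n_codons):
    """Probability that n consecutive random codons contain no stop codon."""
    return (1 - STOP_FRACTION) ** n_codons


if __name__ == "__main__":
    for n in (13, 50, 100, 300):
        print(f"{n:4d} codons: P(no stop by chance) = {prob_open_by_chance(n):.2e}")
```

Long open frames are therefore unlikely to arise by chance, while very short ones are common, which is why short ORFs are hard to call from sequence alone.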

Galindo et al. PLoS Biol 5(5): e106 doi:10.1371/journal.pbio.0050106

Gene identification - short ORFs can be translated!

• e.g. the Drosophila tarsal-less gene

Gene identification – database searching

e.g. http://blast.ncbi.nlm.nih.gov/Blast.cgi

Gene identification – shared synteny

Preserved localization of genes on chromosomes of different species

e.g. mouse chromosome 11 and parts of 5 different human chromosomes

Perfect correspondence in order, orientation and spacing of 23 putative genes, and 245 conserved sequence blocks in noncoding regions

Caution! Even regions of high synteny may not show perfect gene-for-gene correspondence

from Gibson & Muse (2002) A Primer of Genome Science, Sinauer Inc.

Gene identification – shared synteny

Preserved localization of genes on chromosomes of different species

e.g. maize – sorghum (G) – rice (H)

Schnable et al. Science 326:1112

Gene identification – promoter elements

• TATA-box elements
  5'-TATAAA-3' or variant
  plant and animal promoters

• CpG islands
  Regions of higher than expected CpG dinucleotide content, unmethylated in active promoters
  ~ 40% of mammalian promoters
  ~ 70% of human promoters
  but NOT in plant promoter regions

• Y patch (pyrimidine-rich patch)
  plant, not mammalian, promoters
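"Higher than expected CpG content" can be made concrete with the observed/expected ratio: (number of CpG x sequence length) / (number of C x number of G). The sketch below computes that ratio and GC content for a sequence window. The function names and the default thresholds (window of at least 200 bp, GC content at least 50%, observed/expected at least 0.6) are assumptions for illustration, reflecting a commonly used working definition rather than anything stated on the slide.

```python
def cpg_stats(seq):
    """Return (gc_content, observed_over_expected_cpg) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_content, obs_exp


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_obs_exp=0.6):
    """Apply illustrative thresholds (assumed defaults) to one window."""
    gc_content, obs_exp = cpg_stats(seq)
    return len(seq) >= min_len and gc_content >= min_gc and obs_exp >= min_obs_exp


if __name__ == "__main__":
    cpg_rich = "CGGCGCTACGCGTCGACGCG" * 10   # made-up CpG-rich window
    cpg_poor = "CAGTGCCTGAGCCATGTGCA" * 10   # made-up CpG-free window
    for name, window in (("rich", cpg_rich), ("poor", cpg_poor)):
        gc, ratio = cpg_stats(window)
        print(name, f"GC={gc:.2f}", f"obs/exp={ratio:.2f}", looks_like_cpg_island(window))
```

A real scan would slide this window along the promoter region; the single-window version is kept here to show the arithmetic.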

Gene identification – introns & exons

• Long gene space: more intron than exon

• Extreme example - human clotting factor VIII gene

Gene identification – alternative splicing variants

Pistoni et al. RNA Biol 7:441

Gene identification – trans-splicing

Gingeras, Nature 461: 206

Gene identification – non-coding RNAs

• non-coding structural RNAs
  rRNA & tRNA – transcription & translation
  snoRNA – small nucleolar RNAs
    guide chemical modification of rRNAs & tRNAs
  snRNA – small nuclear RNAs
    guide splicing reactions

• non-coding regulatory RNAs
  miRNA (microRNAs) & siRNA (small interfering RNAs)
    RNAi pathway
  lncRNA – long noncoding RNAs

Origins of long non-coding RNAs

Kapranov, Nature Rev Genet 8:413

Overlapping transcriptional architecture

• e.g. the human phosphatidylserine decarboxylase (PISD) gene

Wilusz et al. Genes Dev. 23: 1494–1504

Functions of lncRNAs

Genome - Transcriptome - Proteome

• Genome
  Full complement of an organism’s hereditary information

• Transcriptome
  Full set of RNA molecules, coding and non-coding, transcribed from the genome

• Proteome
  Full set of proteins expressed from a genome

• Not a 1:1:1 correspondence


• What is the take-home message for forward genetics?


• Reverse genetics: Have gene – want phenotype

Predict phenotypes based on gene function in other organisms

Knock out or knock down your gene of interest & look for corresponding changes in phenotype

Gene families

• Gene duplication followed by:
  Duplication of gene function
  Divergence of gene function
  Loss of gene function leading to a pseudogene

• e.g. human globin gene family
  human beta-globin gene cluster on chromosome 11
  Five functional genes and two pseudogenes

Gene families – paralogs & orthologs

• Homologs
  Protein or DNA sequences having shared ancestry

• Orthologs
  Homologs created by a speciation event
  May or may not retain the same function!

• Paralogs
  Homologs created by a gene duplication event
  May or may not retain the same function!

• It is not always easy or possible to distinguish orthologs from paralogs when comparing genes or proteins between species


[Figure: globin gene tree with paralogs and orthologs labeled]

Storz et al. IUBMB Life 63:313


• What are the implications of gene families for forward genetics (i.e. looking for candidate genes that condition a mutant phenotype)?

• What are the implications of gene families for reverse genetics (i.e. altering gene function and looking for a phenotype)?

Genome organization – repeated sequences ~ 50% of the genome

• Segmental duplications and copy number variation

• Tandemly repeated genes
  rRNA, tRNA and histone gene products needed in large amounts

• Duplicated gene families

• Transposons

• Tandem simple sequence repeats
  centromeric & telomeric repeats
  minisatellites
  microsatellites

Repeated sequences – segmental duplications & copy number variants

• Segmental duplications
  > 1 kb block of duplicated sequence with > 90% sequence identity
  recombine to mediate further copy number variants

Koszul & Fischer, C.R. Biologies 332:254


Girirajan et al. Annu Rev Genet 45:203

• Copy number variant (CNV)
  Deviation from diploid copy number at a locus

• Copy number polymorphism (CNP)
  CNV present in >1% of a population

• Recent association with human developmental syndromes
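The CNV and CNP definitions above amount to a simple frequency rule, sketched below. The function name `classify_locus` and the sample copy-number calls are hypothetical; the expected copy number of 2 (diploid, autosomal locus) and the 1% cutoff follow the definitions on the slide.

```python
def classify_locus(copy_numbers, expected=2, cnp_threshold=0.01):
    """Classify one locus from per-individual copy-number calls.

    A call that differs from the expected diploid copy number counts as a CNV.
    The locus is called a CNP when CNV carriers exceed cnp_threshold of the sample.
    Returns (label, carrier_frequency).
    """
    carriers = sum(1 for cn in copy_numbers if cn != expected)
    frequency = carriers / len(copy_numbers)
    if carriers == 0:
        return "no CNV observed", frequency
    label = "CNP" if frequency > cnp_threshold else "rare CNV"
    return label, frequency


if __name__ == "__main__":
    # Made-up samples: 1000 individuals each.
    print(classify_locus([2] * 1000))
    print(classify_locus([2] * 995 + [1] * 5))      # 0.5% carry a deletion
    print(classify_locus([2] * 900 + [3] * 100))    # 10% carry a duplication
```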

Transposon-derived repeated sequences

• ~ 45% of human & 85% of maize genome

Transposon-derived repeated sequences

Gogvadze & Buzdin Cell Mol Life Sci 66:3727

• Many are truncated & inactive

• Considered to be important in the evolution of genome organization & function

Repeated sequences – short tandem repeats

• Centromeric
  Long array (~100,000 bp) of short tandem repeats
  ~ 5 bp Drosophila, ~150 bp maize, ~170 bp human
  not conserved across species
  in some cases not even conserved in all chromosomes of the same species
  Association with a centromere-specific histone H3

• Telomeric
  Length varies between species
  ~ 300 base pairs - 150 kilobase pairs
  Conserved, G-rich repeat sequence
  vertebrates TTAGGG; most plants TTTAGGG


• Minisatellites (Variable number tandem repeats, VNTRs)
  10-100 bp repeat units
  500-30,000 bp arrays
  The original DNA fingerprinting marker via Southern blotting
  Now supplanted by microsatellites


variety A   [CACACACA]
            [GTGTGTGT]

variety B   [CACA]
            [GTGT]

• Microsatellites (Simple sequence repeats, SSRs)
  Di, tri or tetra-nucleotide repeats; 1-10 repeat units per locus
  Repeat numbers expand or contract over a short evolutionary, or even generational, time-frame
  Amplified by PCR
    Primers based on unique flanking sequence
    Products fractionated by capillary or acrylamide gel electrophoresis
  Co-dominant mapping & fingerprinting markers
    Both alleles can be detected in a heterozygous individual
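The variety A / variety B example can be worked through in code to show why SSR alleles separate by size and why the marker is co-dominant. This is a minimal sketch: the CA repeat counts come from the example above, while the flanking sequences are invented placeholders standing in for the unique primer-binding regions, and the "PCR product" is simply the whole flank-repeat-flank string.

```python
import re


def longest_repeat_run(seq, unit):
    """Number of repeat units in the longest uninterrupted run of unit in seq."""
    pattern = re.compile(f"(?:{re.escape(unit)})+")
    runs = [match.group(0) for match in pattern.finditer(seq)]
    if not runs:
        return 0
    return max(len(run) for run in runs) // len(unit)


# Hypothetical unique flanking sequences (placeholders, not real primers).
FLANK_LEFT = "GATTACAGGC"
FLANK_RIGHT = "TTGACCTGAA"

ALLELE_A = FLANK_LEFT + "CA" * 4 + FLANK_RIGHT   # variety A: [CACACACA]
ALLELE_B = FLANK_LEFT + "CA" * 2 + FLANK_RIGHT   # variety B: [CACA]


def product_sizes(genotype):
    """Distinct product lengths (bp) expected on a gel for a pair of alleles."""
    return sorted({len(allele) for allele in genotype})


if __name__ == "__main__":
    print("A repeat units:", longest_repeat_run(ALLELE_A, "CA"))
    print("B repeat units:", longest_repeat_run(ALLELE_B, "CA"))
    print("A/A homozygote bands:", product_sizes((ALLELE_A, ALLELE_A)))
    print("A/B heterozygote bands:", product_sizes((ALLELE_A, ALLELE_B)))
```

The heterozygote yields two product sizes, one per allele, which is what makes the marker co-dominant.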
