1 Previously on: Biological Sequences Analysis. 2 Motifs.
Date posted: 22-Dec-2015
1
Previously on:
Biological Sequences Analysis
2
Motifs
3
Patterns
W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].
x(9,11): any amino acid, repeated 9 to 11 times
[FYV]: F or Y or V
WOPLASDFGYVVPPPLAWSEE
WOPLASDFGYFFPPPLAWGQQ
WOPLASDFGYVWPPPLQQQQS
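A pattern written in this notation can be checked mechanically by translating it into a regular expression. The sketch below is a minimal illustration, assuming only the elements used on this slide plus the `{...}` exclusion form; the function name `prosite_to_regex` is hypothetical and the translation does not cover the full PROSITE syntax (anchors, for instance, are not handled).

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Supported elements (an assumption of this sketch): a single residue,
    x (any residue), [ABC] (allowed set), {ABC} (excluded set), each
    optionally followed by a repeat count (n) or (n,m).
    """
    element_re = re.compile(
        r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})"  # residue, wildcard, or set
        r"(?:\((\d+)(?:,(\d+))?\))?"         # optional (n) or (n,m)
    )
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = element_re.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            residue = "."                           # any amino acid
        elif residue.startswith("{"):
            residue = "[^" + residue[1:-1] + "]"    # excluded residues
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(residue + repeat)
    return "".join(parts)

regex = prosite_to_regex("W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].")
hit = re.search(regex, "WOPLASDFGYFFPPPLAWGQQ")
print(regex, hit.group(0) if hit else None)
```

Running each of the three example sequences through `re.search` with the translated expression shows which of them satisfy the pattern.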
4
Profile-pattern-consensus
Multiple alignment:
AACTTG
AAGTCG
CACTTC
Profile:
     1     2     3     4     5
A    0.66  1     0     0     ..
T    0     0     0     1     ..
C    0.33  0     0.66  0     ..
G    0     0     0.33  0     ..
Consensus:
AACTTG
Pattern:
[AC]-A-[GC]-T-[TC]-[GC]
• Information:
consensus < pattern < profile
NANTNN
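The three summaries can all be derived from the same column counts, which makes the ordering consensus < pattern < profile easy to see. The following is a minimal sketch using the three aligned sequences from the slide; the function names are hypothetical, gaps are not handled, and ties in the consensus are broken arbitrarily.

```python
from collections import Counter

def column_counts(alignment):
    """Count the residues in each column of an ungapped alignment."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(column) for column in zip(*alignment)]

def profile(alignment, alphabet="ATCG"):
    """Per-column frequency of every letter: keeps the most information."""
    n = len(alignment)
    counts = column_counts(alignment)
    return {base: [c[base] / n for c in counts] for base in alphabet}

def pattern(alignment):
    """Per-column set of observed letters: frequencies are lost."""
    elements = []
    for counts in column_counts(alignment):
        residues = "".join(sorted(counts))
        elements.append(residues if len(residues) == 1 else "[" + residues + "]")
    return "-".join(elements)

def consensus(alignment):
    """Most frequent letter per column: minority letters are lost."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))

alignment = ["AACTTG", "AAGTCG", "CACTTC"]
print(consensus(alignment))
print(pattern(alignment))
print(profile(alignment)["A"])
```

The pattern is printed with the letters of each set in alphabetical order, so `[GC]` on the slide appears as `[CG]` here.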
5
Chou-Fasman prediction
     Ala  Pro  Tyr  Phe  Phe  Lys  Lys  His  Val  Ala  Thr
α    142   57   69  113  113  114  114  100  106  142   83
β     83   55  147  138  138   74   74   87  170   83  119

Method                              Accuracy
Chou & Fasman                       50%
Adding the MSA                      69%
MSA + sophisticated computations    70-80%
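The propensity table lends itself to a small worked example. The sketch below averages the α and β propensities over a sliding window, using only the values shown on the slide. It is a deliberate simplification: the full Chou-Fasman procedure also has nucleation and extension rules that are not reproduced here, and the window size of 6 and the function names are assumptions made for illustration.

```python
# (alpha, beta) propensities for the residues that appear on the slide
PROPENSITY = {
    "A": (142, 83), "P": (57, 55), "Y": (69, 147), "F": (113, 138),
    "K": (114, 74), "H": (100, 87), "V": (106, 170), "T": (83, 119),
}

def average_propensities(peptide):
    """Mean alpha and beta propensity over a stretch of residues."""
    alpha = sum(PROPENSITY[r][0] for r in peptide) / len(peptide)
    beta = sum(PROPENSITY[r][1] for r in peptide) / len(peptide)
    return alpha, beta

def windowed_calls(peptide, window=6):
    """Label each window by whichever average propensity is larger.

    Simplified stand-in for the real nucleation/extension rules.
    """
    calls = []
    for start in range(len(peptide) - window + 1):
        alpha, beta = average_propensities(peptide[start:start + window])
        calls.append("alpha" if alpha >= beta else "beta")
    return calls

peptide = "APYFFKKHVAT"  # Ala Pro Tyr Phe Phe Lys Lys His Val Ala Thr
print(average_propensities(peptide))
print(windowed_calls(peptide))
```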
6
Gene Ontology
GO describes proteins in terms of:
biological process
e.g. induction of apoptosis by external signals
cellular component
e.g. membrane fraction
molecular function
e.g. protein kinase
[Figure: example GO hierarchy with the terms cell, nucleus, nuclear chromosome]
7
This week on:
Biological Sequences Analysis
(Lesson 6)
8
Inside the Genome
9
2001: the human genome
10
Neck and neck competition
Celera Genomics (private company) versus the International Human Genome Sequencing Consortium (public consortium)
11
The highlights
~30,000 genes in the human genome (today: estimated at 20-25K)
Oases of genes in empty deserts
Long-range variation in GC content
Repetitive elements rule
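Long-range variation in GC content is usually visualised by computing the GC fraction in windows along a sequence. The snippet below is a minimal sketch of that idea; the function names, the non-overlapping default step, and the toy sequence are assumptions for illustration, and real analyses use windows of many kilobases.

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, window, step=None):
    """GC fraction for each window; windows do not overlap by default."""
    step = step or window
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]

# Toy sequence: an AT-only stretch followed by a GC-only stretch
print(gc_windows("ATATGCGC", window=4))
```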
12
How many genes in the genome?
Estimate from the ratio of genome size to average gene size: 100,000
Estimate based on ESTs: 35,000-120,000
13
Detecting genes in the human genome
Gene finding methods:
Ab initio
The challenge: small exons in a sea of introns
Homology-based
The problem: will not detect novel genes
14
Genscan (ab initio)
Based on a probabilistic model of gene structure
Takes into account:
- promoters
- gene composition: exons/introns
- GC content
- splice signals
Goes over all 6 reading frames
Burge and Karlin, 1997, Prediction of complete gene structure in human genomic DNA, J. Mol. Biol. 268
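The six reading frames are the three codon offsets on the forward strand plus the three offsets on the reverse complement. The sketch below only enumerates them; it is not Genscan's probabilistic model, and the function names and the toy sequence are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def six_frames(seq):
    """Map (strand, offset) to the list of complete codons in that frame."""
    frames = {}
    strands = (("+", seq.upper()), ("-", reverse_complement(seq)))
    for strand, strand_seq in strands:
        for offset in range(3):
            frames[(strand, offset)] = [
                strand_seq[i:i + 3]
                for i in range(offset, len(strand_seq) - 2, 3)
            ]
    return frames

for frame, codons in six_frames("ATGAAATAG").items():
    print(frame, codons)
```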
15
Splicing
16
Splicing Mechanism
Note: small exons in an ‘ocean’ of introns
typical exon: hundreds of bp
typical intron: thousands of bp
17
Eukaryotic splice sites
Poly-pyrimidine tract
18
CpG Islands: another signal
CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base-pairs!) than the rest of the genome
CpG islands often occur near the beginning of genes, maybe related to the binding of the TF Sp1
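"Higher frequency of CG dinucleotides" is usually quantified as the ratio of observed CG dinucleotides to the number expected from the C and G counts alone. The sketch below computes that ratio together with the GC fraction. The thresholds in `looks_like_cpg_island` (length of at least 200 bp, GC fraction above 0.5, observed/expected above 0.6) are commonly used values and are an assumption here, not something stated on the slide; the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c, g = seq.count("C"), seq.count("G")
    observed = seq.count("CG")        # CG dinucleotides on one strand
    expected = c * g / n              # expected if C and G were independent
    ratio = observed / expected if expected else 0.0
    return (c + g) / n, ratio

def looks_like_cpg_island(seq, min_length=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a candidate region."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_length and gc_fraction > min_gc and ratio > min_ratio

print(cpg_stats("CGCGCGCG"))
print(looks_like_cpg_island("CG" * 100))
```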
19
Comparative proteome analysis
Functional categories based on GO, for genes which matched an entry in InterPro
20
Comparative proteome analysis
Humans have more proteins involved in cytoskeleton, immune defense, and transcription
21
Evolutionary conservation of human proteins
Performed BLASTP of each protein against the ‘nr’ NCBI database
PSI-BLAST: non-vertebrates also
22
Horizontal (lateral) gene transfer
Lateral Gene Transfer (LGT) is any process in which an organism transfers genetic material to another organism that is not its offspring
23
Mechanisms:
Transformation
Transduction (phages/viruses)
Conjugation
24
Bacteria to vertebrate LGT criteria
Homologs in bacteria
Homologs in vertebrates (detected in PSI-BLAST)
No significant homologs in non-vertebrates
25
Bacteria to vertebrate LGT detection
E-value of bacterial homolog X9 better than eukaryal homolog
Human query:
Hit              E-value
Frog             4e-180
Mouse            1e-164
E. coli          7e-124
Streptococcus    9e-71
Worm             0.1
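The criteria and the hit table can be combined into a simple filter over BLAST-style hits. The following is a minimal sketch under stated assumptions: the taxonomic group of every hit is supplied by hand, the function name `is_lgt_candidate` is hypothetical, and the default margin of 9 orders of magnitude is only one possible reading of the "X9" on the slide.

```python
import math

def is_lgt_candidate(hits, min_orders_better=9.0):
    """Flag a query as a bacteria-to-vertebrate LGT candidate.

    hits: list of (organism, group, e_value), where group is one of
    "vertebrate", "bacteria" or "non-vertebrate".
    min_orders_better is an assumed threshold, not a value from the slide.
    """
    def best(group):
        values = [e for _, g, e in hits if g == group]
        return min(values) if values else None

    bacterial = best("bacteria")
    vertebrate = best("vertebrate")
    other = best("non-vertebrate")
    if bacterial is None or vertebrate is None:
        return False          # need homologs in both bacteria and vertebrates
    if other is None:
        return True           # no non-vertebrate homolog at all
    if other <= 0.0:
        return False          # perfect non-vertebrate hit
    if bacterial <= 0.0:
        return True           # perfect bacterial hit, avoids log10(0)
    return math.log10(other) - math.log10(bacterial) >= min_orders_better

slide_hits = [
    ("Frog", "vertebrate", 4e-180),
    ("Mouse", "vertebrate", 1e-164),
    ("E. coli", "bacteria", 7e-124),
    ("Streptococcus", "bacteria", 9e-71),
    ("Worm", "non-vertebrate", 0.1),
]
print(is_lgt_candidate(slide_hits))
```

With the hits from the slide, the bacterial E-value is far better than the worm E-value, so the query is flagged as a candidate.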
26
Bacteria to vertebrate LGT
[Figure labels: Bacteria, vertebrates, non-vertebrates]
27
Bacteria to vertebrate LGT
Genes with a role in metabolism of xenobiotics or stress response
Selective advantage for these transfers. May be highly important immune genes
28
29
Bacteria to vertebrate LGT??
Hundreds of sequenced bacterial genomes vs. a handful of eukaryotes
Gene finding in bacteria is much easier than in eukaryotes
On the practical side: rigid mechanical barriers to LGT in eukaryotes (nucleus, germ line)
30
Repetitive Elements in the Human Genome
31
The C-value paradox
Genome size does not correlate with organism complexity
                   Yeast        Human        Rice          Amoeba
Genome size        12 million   3 billion    4.3 billion   67 billion
Number of genes    6,275        20-25,000    ~30,000       ?
32
Repetitive elements
The C-value mystery was partially resolved when it was found that large portions of genomes contain repetitive elements
33
Repeats in the human genome
~50% of the human genome (~1% coding):
– Transposon derived (= interspersed repeats)
– Retrotransposed cellular genes
– Sequence repeats (A)n, (CG)n, etc.
– Segmental duplications
34
DNA Transposons & Retrotransposons
DNA transposons
Encode a transposase enzyme
Cut-and-paste mechanism:
Transposase binds to the inverted repeats of the transposon, and to a target sequence in the DNA
Replicative transposition
Retrotransposons
Encode reverse transcriptase and endonuclease
Transposition via an RNA intermediate
Copy-and-paste mechanism
35
Transposable elements in the human genome
[Figure: transposable element classes. Legend: * Non-LTR retrotransposons; ** LTR transposon; Retrotransposons]
36
LINEs and SINEs
Highly successful elements in eukaryotes
LINE - Long Interspersed Nuclear Element (>5,000 bp)
SINE - Short Interspersed Nuclear Element (<500 bp)
SINEs are free riders on the backs of LINEs: they encode no proteins
37
Determining the age of transposable elements
For each family, a consensus sequence was built as an approximation of the ancestral sequence
Compute the divergence (%) of each sequence from the ancestor
Convert sequence divergence to actual ages
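The three steps above can be sketched as follows. The slide does not say how divergence is converted to age, so the Jukes-Cantor correction and the substitution rate used here are assumptions made for illustration, as are the function names and the toy sequences.

```python
import math

def divergence(copy, consensus):
    """Fraction of mismatching positions between a copy and the consensus.

    Both sequences must already be aligned; gap columns are skipped.
    """
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(copy, consensus) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no comparable positions")
    return sum(a != b for a, b in compared) / len(compared)

def jukes_cantor(p):
    """Substitutions per site corrected for multiple hits (assumed model)."""
    return -0.75 * math.log(1 - 4 * p / 3)

def age_in_years(p, substitutions_per_site_per_year):
    """Age of a copy, given an assumed neutral substitution rate."""
    return jukes_cantor(p) / substitutions_per_site_per_year

p = divergence("ACGA", "ACGT")   # toy copy vs. toy consensus
print(p, jukes_cantor(p), age_in_years(p, 1e-9))  # 1e-9 is a hypothetical rate
```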
38
Age of transposable elements
Most transposable elements date back to the emergence of placental mammals (low disposal rate of transposons)
DNA transposons in the human genome are dead (high divergence from ancestor)!
39
Where are the transposons located?
LINEs: AT-rich regions (fewer genes)
SINEs (MIR, Alu): GC-rich areas ... ?? ... they use the LINE machinery ... ??
40
Why are there SINEs in GC-rich regions?
1. SINEs target GC-rich regions
2. Evolutionary advantage for SINEs that ‘land’ in a GC-rich region
How do we decide between the two options?
41
Age distribution of Alus in GC regions
42
SINEs in GC-rich regions
1. High rate of random loss in AT-rich regions
2. Negative selection against Alu in AT-rich regions
3. Positive selection (evolutionary advantage) for Alu in GC-rich regions
Comparison with LINEs
Alus correlate with actively transcribed genes
43
Are Alus functional??
SINEs are transcribed under stress
SINE RNAs may bind a protein kinase, promoting translation under stress
Need to be in regions which are highly transcribed
Role in alternative splicing
44
Repeats in the human genome
~50% of the human genome (~1% coding):
1. Transposon derived (= interspersed repeats)
2. Retrotransposed cellular genes
3. Sequence repeats (A)n, (CG)n, etc.
4. Segmental duplications
45
Segmental duplications
1077 segmental duplications detected
Several genes in the duplicated regions are associated with diseases (may be related to homologous recombination)
Most are recent duplications (conservation of entire segment, versus conservation of coding sequences only)
46
47
Genome-wide studies
48
Sequenced genomes
Assembled and annotated eukaryote genomes in Ensembl
49
50
481 segments > 200 bp absolutely conserved (100% identity) between human, rat and mouse
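Finding such segments amounts to scanning a multiple alignment for long runs of columns in which every species carries the same base. The sketch below assumes the three genomes have already been aligned into equal-length strings with "-" for gaps; the function name, the toy sequences, and the small `min_length` in the usage line are hypothetical, while the threshold from the slide is more than 200 bp.

```python
def conserved_runs(aligned, min_length=200):
    """Return (start, end) half-open intervals of 100% identical columns."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    runs = []
    start = None
    for i in range(length + 1):       # one extra step closes a trailing run
        identical = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned[1:])
        )
        if identical:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i))
            start = None
    return runs

human = "ACGTACGTTT"
rat   = "ACGTACGATT"
mouse = "ACGTACGTTT"
print(conserved_runs([human, rat, mouse], min_length=5))
```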
51
Comparison with a neutral substitution rate
Compare the substitution rate in any 1 Mb region
Probability of 10^-22 of obtaining 1 ultraconserved element (UE) by chance
52
481 UEs
111 UEs overlap a known mRNA: exonic UEs
256 - no overlap (non-exonic)
    100 intronic
    156 inter-genic
114 - inconclusive
53
Who are the genes?
Type 1: exonic
Type 2: genes which are near non-exonic UEs
54
Type 1: enrichment for:
- RNA binding and splicing regulation
- RRM motif (RNA recognition)
Type 2: enrichment for:
- Transcription regulation, DNA binding
- DNA binding motifs
55
Intergenic UEs
Genes which flank intergenic UEs are enriched for early developmental genes
Are UEs distal enhancers of these genes?
56
Gene enhancer
A short region of DNA, usually quite distant from a gene (due to chromatin complex folding), which binds an activator
An activator recruits transcription factors to the gene
57
Experimental studies of UEs
Some UEs cluster within regions enriched for genes encoding developmentally important transcription factors
Within these loci, a special pattern of histone methylation (bivalent domains)
Silence the developmental genes when unnecessary
Suggest that the DNA pattern affects the histone methylation
Cell, Vol 125, 315-326, 21 April 2006
58
Experimental studies of UEs
Tested 167 UEs (both mouse-human UEs and fish-human UEs) for enhancer activity: cloned before a reporter gene to test their activity
45% functioned as enhancers
59
A bioinformatic success
Ultraconservation can predict highly important function!
60
BUT …
61
PLoS Biol. 2007 Sep;5(9):e234
Chose 4 UEs which are near specific genes:
genes which show a specific phenotype when knocked-out
Performed complete deletion of these UEs
… the mice were viable and did not show any different phenotype
62
Conclusions…
Ultraconservation can be indicative of important function
… And sometimes not:
- gene redundancy
- long-range phenotypes
- laboratories cannot mimic life