1 Previously on: Biological Sequences Analysis. 2 Motifs.

  • date post

    22-Dec-2015

Page 1

Previously on: Biological Sequences Analysis

Page 2

Motifs

Page 3

Patterns

W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE].

x(9,11): any amino acid, between 9 and 11 times

[FYV]: F or Y or V

WOPLASDFGYVVPPPLAWSEE
WOPLASDFGYFFPPPLAWGQQ
WOPLASDFGYVWPPPLQQQQS
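A PROSITE-style pattern like the one on this slide maps almost directly onto a regular expression. The sketch below is a minimal illustration, not a full PROSITE parser: the helper name `prosite_to_regex` is hypothetical, and it handles only single residues, `x`, `[...]` sets, `{...}` exclusions and `(n)` / `(n,m)` repeats. It assumes the example line on the slide holds three separate 21-residue sequences.

```python
import re

def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        # Separate the residue specification from an optional repeat, e.g. x(9,11).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        body, low, high = m.group(1), m.group(2), m.group(3)
        if body == "x":
            body = "."                      # x stands for any amino acid
        elif body.startswith("{"):
            body = "[^" + body[1:-1] + "]"  # {AB} means anything except A or B
        if low and high:
            body += "{" + low + "," + high + "}"
        elif low:
            body += "{" + low + "}"
        parts.append(body)
    return "".join(parts)

PATTERN = "W-x(9,11)-[FYV]-[FYW]-x(6,7)-[GSTNE]."
SEQUENCES = [
    "WOPLASDFGYVVPPPLAWSEE",
    "WOPLASDFGYFFPPPLAWGQQ",
    "WOPLASDFGYVWPPPLQQQQS",
]

regex = re.compile(prosite_to_regex(PATTERN))
matches = [bool(regex.search(seq)) for seq in SEQUENCES]
for seq, hit in zip(SEQUENCES, matches):
    print(seq, "match" if hit else "no match")
```

Under these assumptions only the second sequence satisfies the pattern. The first fails at the [FYW] position, and the third has no [GSTNE] residue at the required distance.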

Page 4

Profile-pattern-consensus

Multiple alignment:

AACTTG
AAGTCG
CACTTC

Profile:

     1     2     3     4     5
A  0.66    1     0     0    ..
T    0     0     0     1    ..
C  0.33    0   0.66    0    ..
G    0     0   0.33    0    ..

Consensus: AACTTG

Pattern: [AC]-A-[GC]-T-[TC]-[GC]

• Information: consensus < pattern < profile

NANTNN
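The three summaries on this slide can all be derived from the same column counts. The following sketch assumes the three-sequence alignment is AACTTG / AAGTCG / CACTTC (the slide text with the doubled letters removed). The function names are illustrative only.

```python
from collections import Counter

ALIGNMENT = ["AACTTG", "AAGTCG", "CACTTC"]

def column_counts(alignment):
    """Count residues in every column of a gap-free alignment."""
    return [Counter(column) for column in zip(*alignment)]

def profile(alignment, alphabet="ATCG"):
    """Per-column residue frequencies (the richest summary)."""
    n = len(alignment)
    counts = column_counts(alignment)
    return {base: [c[base] / n for c in counts] for base in alphabet}

def pattern(alignment):
    """Set of residues seen in each column (keeps alternatives, drops frequencies)."""
    parts = []
    for c in column_counts(alignment):
        residues = "".join(sorted(c))
        parts.append(residues if len(residues) == 1 else "[" + residues + "]")
    return "-".join(parts)

def consensus(alignment):
    """Most frequent residue per column (the least informative summary)."""
    return "".join(c.most_common(1)[0][0] for c in column_counts(alignment))

print(consensus(ALIGNMENT))
print(pattern(ALIGNMENT))
for base, freqs in profile(ALIGNMENT).items():
    print(base, [round(f, 2) for f in freqs])
```

The pattern comes out with alphabetically sorted sets ([CG] rather than [GC]), which denotes the same residues as on the slide.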

Page 5

Chou-Fasman prediction

     Ala  Pro  Tyr  Phe  Phe  Lys  Lys  His  Val  Ala  Thr
α    142   57   69  113  113  114  114  100  106  142   83
β     83   55  147  138  138   74   74   87  170   83  119

Method                              Accuracy
Chou & Fasman                       50%
Adding the MSA                      69%
MSA + sophisticated computations    70-80%
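To make the propensity table concrete, here is a deliberately simplified sketch: it averages the α and β propensities from the slide over a sliding window and labels each window helix (H), strand (E) or neither (-). This is not the full Chou-Fasman procedure with its nucleation and extension rules. The window size of 4 and the threshold of 100 are assumptions chosen for illustration.

```python
# Propensities (x100) as listed on the slide: residue -> (alpha, beta).
PROPENSITY = {
    "Ala": (142, 83), "Pro": (57, 55), "Tyr": (69, 147), "Phe": (113, 138),
    "Lys": (114, 74), "His": (100, 87), "Val": (106, 170), "Thr": (83, 119),
}

PEPTIDE = ["Ala", "Pro", "Tyr", "Phe", "Phe", "Lys", "Lys", "His", "Val", "Ala", "Thr"]

def window_calls(peptide, window=4, threshold=100):
    """Label each window by whichever average propensity is higher and above threshold."""
    calls = []
    for start in range(len(peptide) - window + 1):
        residues = peptide[start:start + window]
        alpha = sum(PROPENSITY[r][0] for r in residues) / window
        beta = sum(PROPENSITY[r][1] for r in residues) / window
        if alpha > threshold and alpha > beta:
            calls.append("H")
        elif beta > threshold and beta > alpha:
            calls.append("E")
        else:
            calls.append("-")
    return "".join(calls)

print(window_calls(PEPTIDE))
```

With these settings the eight windows come out as EEEHHHHE: strand-like at both ends, where Tyr, Phe and Val dominate, and helix-like in the Lys-rich middle.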

Page 6

Gene Ontology

GO describes proteins in terms of:

biological process
e.g. induction of apoptosis by external signals

cellular component
e.g. membrane fraction

molecular function
e.g. protein kinase

(Figure labels: cell, nucleus, nuclear chromosome)

Page 7

This week on: Biological Sequences Analysis

(Lesson 6)

Page 8

Inside the Genome

Page 9

2001: the human genome

Page 10

Neck-and-neck competition

Celera Genomics (private company) versus the International Human Genome Sequencing Consortium (publicly funded)

Page 11

The highlights

~30,000 genes in the human genome (today: estimated at 20-25K)

Oases of genes in empty deserts
Long-range variation in GC content
Repetitive elements rule

Page 12

How many genes in the genome?

Ratio of genome size to average gene size: 100,000

Based on ESTs: 35,000-120,000
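The ratio estimate is simple arithmetic: divide the genome length by a typical gene length and pretend genes tile the genome end to end. The sketch below reproduces the 100,000 figure. The 30 kb average gene size is an assumption picked for illustration (the slide does not state the value used), and the genome size is rounded to 3 billion bp.

```python
GENOME_SIZE_BP = 3_000_000_000   # ~3 billion bp, rounded
ASSUMED_AVG_GENE_BP = 30_000     # hypothetical average gene size, for illustration only

def genes_if_fully_packed(genome_bp: int, avg_gene_bp: int) -> int:
    """Upper-bound style estimate: how many genes fit if they tile the genome."""
    return genome_bp // avg_gene_bp

estimate = genes_if_fully_packed(GENOME_SIZE_BP, ASSUMED_AVG_GENE_BP)
print(f"{estimate:,} genes")
```

Because much of the genome is gene-poor, an estimate of this kind overshoots, which is consistent with the later, lower gene counts.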

Page 13

Detecting genes in the human genome

Gene finding methods:

Ab initio
The challenge: small exons in a sea of introns

Homology-based
The problem: will not detect novel genes

Page 14

Genscan (ab initio)

Based on a probabilistic model of gene structure

Takes into account:
- promoters
- gene composition (exons/introns)
- GC content
- splice signals

Goes over all 6 reading frames

Burge and Karlin, 1997, Prediction of complete gene structure in human genomic DNA, J. Mol. Biol. 268
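"All 6 reading frames" means three codon offsets on the given strand plus three on its reverse complement. The sketch below only enumerates those frames; it is not Genscan's probabilistic model, and the function names are illustrative.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna: str) -> str:
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]

def six_frames(dna: str) -> dict:
    """Split a DNA string into codons for each of the six reading frames."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Step through complete codons only; trailing 1-2 bases are dropped.
            codons = [seq[i:i + 3] for i in range(offset, len(seq) - 2, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames

for name, codons in six_frames("ATGAAATAG").items():
    print(name, " ".join(codons))
```

A gene finder would then score each frame for coding potential and signals. Here frame +1 reads ATG AAA TAG, a start codon followed by a stop.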


Page 15

Splicing

Page 16

Splicing Mechanism

Note: small exons in an ‘ocean’ of introns

typical exon: hundreds of bp
typical intron: thousands of bp

Page 17

Eukaryotic splice sites

Poly-pyrimidine tract

Page 18

CpG Islands: another signal

CpG islands are regions of the genome with a higher frequency of CG dinucleotides (not base-pairs!) than the rest of the genome

CpG islands often occur near the beginning of genes (may be related to the binding of the TF Sp1)
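One common way to quantify "a higher frequency of CG dinucleotides" is the observed/expected CpG ratio, reported together with GC content. The slide does not name a scoring method, so the sketch below is an assumption about how a window of sequence might be scored; `cpg_stats` is a hypothetical helper.

```python
def cpg_stats(seq: str):
    """Return (GC content, observed/expected CpG ratio) for a DNA string."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cg = seq.count("CG")  # CG dinucleotides on this strand, not C:G base pairs
    gc_content = (c + g) / n if n else 0.0
    # Expected CG count if C and G were placed independently: c * g / n.
    obs_exp = cg * n / (c * g) if c and g else 0.0
    return gc_content, obs_exp

print(cpg_stats("CGCGCGCG"))
print(cpg_stats("ATATCAGT"))
```

In practice the statistic is computed in sliding windows along the genome, and windows where both values are high are reported as candidate islands.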

Page 19

Comparative proteome analysis

Functional categories based on GO, for genes which matched an entry in Interpro

Page 20

Comparative proteome analysis

Humans have more proteins involved in cytoskeleton, immune defense, and transcription

Page 21

Evolutionary conservation of human proteins

Performed BLASTP of each protein against the ‘nr’ NCBI database

PSI-BLAST: non-vertebrates also

Page 22

Horizontal (lateral) gene transfer

Lateral Gene Transfer (LGT) is any process in which an organism transfers genetic material to another organism that is not its offspring

Page 23

Mechanisms:

Transformation

Transduction (phages/viruses)

Conjugation

Page 24

Bacteria to vertebrate LGT criteria

Homologs in bacteria
Homologs in vertebrates (detected in PSI-BLAST)
No significant homologs in non-vertebrates

Page 25

Bacteria to vertebrate LGT detection

E-value of bacterial homolog X9 better than eukaryal homolog

Human query:

Hit              E-value
Frog             4e-180
Mouse            1e-164
E. coli          7e-124
Streptococcus    9e-71
Worm             0.1
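The three criteria (homologs in bacteria, homologs in vertebrates, no significant homologs in non-vertebrates) can be written as a small filter over a hit table. The sketch below uses the E-values from the slide. The taxonomic group labels and the 1e-10 significance cutoff are assumptions added for illustration, and the additional margin criterion between bacterial and eukaryal E-values is left out.

```python
# (organism, group, E-value); group labels are added here for illustration.
HITS = [
    ("Frog", "vertebrate", 4e-180),
    ("Mouse", "vertebrate", 1e-164),
    ("E. coli", "bacteria", 7e-124),
    ("Streptococcus", "bacteria", 9e-71),
    ("Worm", "non-vertebrate eukaryote", 0.1),
]

def is_lgt_candidate(hits, significance=1e-10):
    """True if bacteria and vertebrates have significant hits but non-vertebrates do not."""
    def best(group):
        values = [evalue for _, g, evalue in hits if g == group]
        return min(values) if values else None

    def significant(evalue):
        return evalue is not None and evalue <= significance

    return (
        significant(best("bacteria"))
        and significant(best("vertebrate"))
        and not significant(best("non-vertebrate eukaryote"))
    )

print(is_lgt_candidate(HITS))
```

For the example query the worm hit (E-value 0.1) is not significant, so the gene is flagged as a candidate. Giving the worm a strong hit would remove it from the list.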

Page 26

Bacteria to vertebrate LGT

(Figure labels: vertebrates, bacteria, non-vertebrates)

Page 27

Bacteria to vertebrate LGT

Genes with a role in metabolism of xenobiotics or stress response

Selective advantage for these transfers. May be highly important immune genes

Page 28

Page 29

Bacteria to vertebrate LGT??

Hundreds of sequenced bacterial genomes vs. a handful of eukaryotes

Gene finding in bacteria is much easier than in eukaryotes

On the practical side: rigid mechanical barriers to LGT in eukaryotes (nucleus, germ line)

Page 30

Repetitive Elements in the Human Genome

Page 31

The C-value paradox

Genome size does not correlate with organism complexity

                   Yeast        Human        Rice          Amoeba
Genome size        12 million   3 billion    4.3 billion   67 billion
Number of genes    6,275        20-25,000    ~30,000       ?

Page 32

Repetitive elements

The C-value mystery was partially resolved when it was found that large portions of genomes contain repetitive elements

Page 33

Repeats in the human genome

~50% of the human genome (~1% coding):

- Transposon derived (=interspersed repeats)
- Retrotransposed cellular genes
- Sequence repeats (A)n, (CG)n, etc.
- Segmental duplications

Page 34

DNA Transposons & Retrotransposons

DNA transposons

Encode a transposase enzyme

Cut-and-paste mechanism:

Transposase binds to the inverted repeats of the transposon, and to a target sequence in the DNA

Replicative transposition

Retrotransposons

Encode reverse transcriptase and endonuclease

Transposition via an RNA intermediate

Copy-and-paste mechanism.

Page 35

Transposable elements in the human genome

(Figure legend: * non-LTR retrotransposons; ** LTR transposon. Label: Retrotransposons)

Page 36

LINEs and SINEs

Highly successful elements in eukaryotes

LINE - Long Interspersed Nuclear Element (>5,000 bp)

SINE - Short Interspersed Nuclear Element (< 500 bp)

SINEs are free riders on the backs of LINEs: they encode no proteins

Page 37

Determining the age of transposable elements

For each family, a consensus sequence was built ===> the ancestral sequence

Compute the divergence (%) of each sequence from the ancestor

Convert sequence divergence to actual ages
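The last two steps can be sketched as follows: count mismatches between a copy and the family consensus, correct for multiple hits, and divide by a substitution rate. The Jukes-Cantor correction and the rate of 2e-9 substitutions per site per year are assumptions chosen for illustration (the slide does not say which model or rate was used), and the copy is assumed to be already aligned to the consensus without gaps.

```python
import math

def percent_divergence(copy: str, consensus: str) -> float:
    """Percentage of mismatching positions between an aligned copy and its consensus."""
    if len(copy) != len(consensus):
        raise ValueError("sequences must be aligned to the same length")
    mismatches = sum(a != b for a, b in zip(copy, consensus))
    return 100.0 * mismatches / len(consensus)

def jukes_cantor(p: float) -> float:
    """Substitutions per site implied by an observed mismatch fraction p."""
    return -0.75 * math.log(1 - 4 * p / 3)

def age_in_years(divergence_percent: float, rate_per_site_per_year: float = 2e-9) -> float:
    """Age of a copy; only one lineage has diverged from the ancestral consensus."""
    return jukes_cantor(divergence_percent / 100) / rate_per_site_per_year

d = percent_divergence("AACTTG", "AACTTC")
print(f"{d:.1f}% divergence")
print(f"10% divergence is roughly {age_in_years(10.0) / 1e6:.0f} million years")
```

Older insertions have had more time to accumulate mutations, so a high divergence from the consensus indicates an ancient, inactive family.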

Page 38

Age of transposable elements

Most transposable elements date back to the emergence of placental mammals (low disposal rate of transposons)

DNA transposons in the human genome are dead (high divergence from ancestor)!

Page 39

Where are the transposons located?

LINEs: AT-rich regions (fewer genes)
SINEs (MIR, Alu): GC-rich areas …… ?? … they use the LINE machinery …….??

Page 40

Why are there SINEs in GC-rich regions?

1. SINEs target GC-rich regions
2. Evolutionary advantage for SINEs that ‘land’ in a GC-rich region

How do we resolve between the two options?

Page 41

Age distribution of Alus in GC regions

Page 42

SINEs in GC-rich regions

1. High rate of random loss in AT-rich regions
2. Negative selection against Alu in AT-rich regions
3. Positive selection (evolutionary advantage) for Alu in GC-rich regions

Comparison with LINEs

Alus correlate with actively transcribed genes

Page 43

Are Alus functional??

SINEs are transcribed under stress
SINE RNAs may bind a protein kinase, promoting translation under stress

Need to be in regions which are highly transcribed

Role in alternative splicing

Page 44

Repeats in the human genome

~50% of the human genome (~1% coding):

1. Transposon derived (=interspersed repeats)
2. Retrotransposed cellular genes
3. Sequence repeats (A)n, (CG)n, etc.
4. Segmental duplications

Page 45

Segmental duplications

1077 segmental duplications detected
Several genes in the duplicated regions are associated with diseases (may be related to homologous recombination)

Most are recent duplications (conservation of entire segment, versus conservation of coding sequences only)

Page 46

Page 47

Genome-wide studies

Page 48

Sequenced genomes

Assembled and annotated eukaryote genomes in Ensembl

Page 49

Page 50

481 segments > 200 bp absolutely conserved (100% identity) between human, rat and mouse
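Given a gap-aware multiple alignment of the three genomes, finding such segments amounts to scanning for long runs of columns where every sequence carries the same base. The sketch below is a toy version of that scan, not the pipeline used in the original study. The function name is hypothetical, and the test sequences are invented and far shorter than 200 bp.

```python
def ultraconserved_runs(aligned, min_length=200):
    """Return (start, end) column ranges, end exclusive, of identical runs longer than min_length."""
    runs = []
    start = None
    length = len(aligned[0])
    for i in range(length + 1):
        # A column is conserved if it has no gap and all sequences agree.
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and all(seq[i] == aligned[0][i] for seq in aligned)
        )
        if conserved:
            if start is None:
                start = i
        else:
            if start is not None and i - start > min_length:
                runs.append((start, i))
            start = None
    return runs

human = "ACGTACGTAA"
mouse = "ACGTACGTCA"
rat = "ACGTACGTAA"
print(ultraconserved_runs([human, mouse, rat], min_length=5))
```

Here the first eight columns are identical in all three sequences and the ninth differs in mouse, so a single run covering columns 0 to 7 is reported.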

Page 51

Comparison with a neutral substitution rate

Compare the substitution rate in any 1Mb region

Probability of 10^-22 of obtaining 1 ultraconserved element (UE) by chance

Page 52

481 UEs

111 UEs overlap a known mRNA: exonic UEs

256 - no overlap (non-exonic)
    100 intronic
    156 intergenic

114 - inconclusive

Page 53

Who are the genes?

Type 1: exonic

Type 2: genes which are near non-exonic UEs

Page 54

Type 1: enrichment for:
- RNA binding and splicing regulation
- RRM motif (RNA recognition)

Type 2: enrichment for:
- Transcription regulation, DNA binding
- DNA binding motifs

Page 55

Intergenic UEs

Genes which flank intergenic UEs are enriched for early developmental genes

Are UEs distal enhancers of these genes?

Page 56

Gene enhancer

A short region of DNA, usually quite distant from a gene (due to chromatin complex folding), which binds an activator

An activator recruits transcription factors to the gene

Page 57

Experimental studies of UEs

Some UEs cluster within regions enriched for genes encoding developmentally important transcription factors

Within these loci, a special pattern of histone methylation (bivalent domains)

Silence the developmental genes when unnecessary

Suggest that the DNA pattern affects the histone methylation

Cell, Vol 125, 315-326, 21 April 2006

Page 58

Experimental studies of UEs

Tested 167 UEs (both mouse-human UEs and fish-human UEs) for enhancer activity: cloned before a reporter gene to test their activity

45% functioned as enhancers

Page 59

A bioinformatic success

Ultraconservation can predict highly important function!

Page 60

BUT …

Page 61

PLoS Biol. 2007 Sep;5(9):e234

Chose 4 UEs which are near specific genes:

genes which show a specific phenotype when knocked-out

Performed complete deletion of these UEs

… the mice were viable and did not show any different phenotype

Page 62

Conclusions…

Ultraconservation can be indicative of important function

… And sometimes not:
- gene redundancy
- long-range phenotypes
- laboratories cannot mimic life