1-Month Practical Master Course Genome Analysis Jaap Heringa Centre for Integrative Bioinformatics...

1-Month Practical Master Course
Genome Analysis

Jaap Heringa
Centre for Integrative Bioinformatics VU (IBIVU)
Vrije Universiteit Amsterdam
The Netherlands

[email protected]


Bioinformatics (diagram): an interdisciplinary field drawing on Mathematics/Statistics, Computer Science/Informatics, Biology/Molecular biology, Medicine, Chemistry and Physics

Biological Sequence Analysis

• Pair-wise sequence alignment
• Residue exchange matrices
• Multiple sequence alignment
• Phylogeny


.....acctc ctgtgcaaga acatgaaaca nctgtggttc tcccagatgg gtcctgtccc aggtgcacct gcaggagtcg ggcccaggac tggggaagcc tccagagctc aaaaccccac ttggtgacac aactcacaca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc acggtgccca gagcccaaat cttgtgacac acctccccca tgcccacggt gcccagagcc caaatcttgt gacacacctc ccccgtgccc ccggtgccca gcacctgaac tcttgggagg accgtcagtc ttcctcttcc ccccaaaacc caaggatacc cttatgattt cccggacccc tgaggtcacg tgcgtggtgg tggacgtgag ccacgaagac ccnnnngtcc agttcaagtg gtacgtggac ggcgtggagg tgcataatgc caagacaaag ctgcgggagg agcagtacaa cagcacgttc cgtgtggtca gcgtcctcac cgtcctgcac caggactggc tgaacggcaa ggagtacaag tgcaaggtct ccaacaaagc aaccaagtca gcctgacctg cctggtcaaa ggcttctacc ccagcgacat cgccgtggag tgggagagca atgggcagcc ggagaacaac tacaacacca cgcctcccat gctggactcc gacggctcct tcttcctcta cagcaagctc accgtggaca agagcaggtg gcagcagggg aacatcttct catgctccgt gatgcatgag gctctgcaca accgctacac gcagaagagc ctctc.....

DNA sequence

Genome size

Organism                   Number of base pairs
φX-174 virus               5,386
Epstein-Barr virus         172,282
Mycoplasma genitalium      580,000
Haemophilus influenzae     1.8 × 10^6
Yeast (S. cerevisiae)      12.1 × 10^6
Human                      3.2 × 10^9
Wheat                      16 × 10^9
Lilium longiflorum         90 × 10^9
Salamander                 100 × 10^9
Amoeba dubia               670 × 10^9

Three main principles

• DNA makes RNA makes Protein

• Structure more conserved than sequence

• Sequence → Structure → Function

TERTIARY STRUCTURE (fold)

Genome

Expressome

Proteome

Metabolome

Functional Genomics

Regulation, signalling cascades, chaperonins, compartmentalisation

How to go from DNA to protein sequence

A piece of double stranded DNA:

5’ attcgttggcaaatcgcccctatccggc 3’
3’ taagcaaccgtttagcggggataggccg 5’

DNA direction is from 5’ to 3’

How to go from DNA to protein sequence

6-frame translation using the codon table (last lecture):

5’ attcgttggcaaatcgcccctatccggc 3’

3’ taagcaaccgtttagcggggataggccg 5’
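The 6-frame translation above can be sketched in a few lines of Python. This is a minimal illustration, not the course's own code: it assumes the standard genetic code, and the function names (`reverse_complement`, `translate`, `six_frame_translation`) and the frame labels `+1` to `-3` are chosen here for the example. Incomplete trailing codons are dropped.

```python
# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3' (upper case)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Map frame labels (+1..+3 and -1..-3) to translated protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# The double-stranded example from the slide (top strand, 5' to 3').
top = "attcgttggcaaatcgcccctatccggc"
for label, protein in six_frame_translation(top).items():
    print(label, protein)
```

Frames `+1` to `+3` read the top strand from its 5’ end with offsets 0, 1 and 2; frames `-1` to `-3` do the same on the reverse complement, which is the bottom strand read in its own 5’ to 3’ direction.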

Dean, A. M. and G. B. Golding: Pacific Symposium on Biocomputing 2000

Evolution and three-dimensional protein structure information

Isocitrate dehydrogenase:

The distance from the active site (in yellow) determines the rate of evolution (red = fast evolution, blue = slow evolution)

Protein Sequence-Structure-Function

Sequence

Structure

Function

Threading

Homology searching (BLAST)

Ab initio prediction and folding

Function prediction from structure

Widely used tool for homology detection: PSI-BLAST

• Heuristic tool to cut down computations required for database searching (~1M sequences in DB)

• Sensitivity gained by iteratively finding hits (local alignments) and repeating search

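[Diagram labels: Q, DB, T, hits, PSSM. The query Q is searched against the database DB, the hits are collected into a PSSM, and the search is repeated with the PSSM.]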
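To make the PSSM step more concrete, here is a minimal sketch of how a position-specific scoring matrix could be derived from a set of aligned hits. It is not PSI-BLAST's actual procedure: it assumes ungapped hit segments of equal length, a uniform background frequency of 1/20 and a simple pseudocount, and the names `build_pssm` and `score_segment` as well as the example hits are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_hits, pseudocount=1.0):
    """Return one {amino acid: log-odds score} dict per alignment column.

    Assumes equal-length, ungapped hit segments and a uniform background
    frequency of 1/20 (both are simplifications made for this sketch).
    """
    length = len(aligned_hits[0])
    if any(len(hit) != length for hit in aligned_hits):
        raise ValueError("hits must have equal length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(aligned_hits) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(hit[col] for hit in aligned_hits)
        pssm.append({
            aa: math.log2((counts[aa] + pseudocount) / total / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Score a segment of the same length as the PSSM, column by column."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


hits = ["HAMP", "HAMA", "HAMA"]   # hypothetical aligned hit segments
pssm = build_pssm(hits)
print(round(score_segment(pssm, "HAMA"), 2))
print(round(score_segment(pssm, "WWWW"), 2))
```

Residues seen often in a column score above zero and unseen residues score below zero, so a segment resembling the hits outscores an unrelated one. In an iterative search the profile would be rebuilt from the new hits after each round.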

Threading

Query sequence

Template sequence

+

Template structure

Compatibility score

Fold recognition by threading

Query sequence

Compatibility scores

Fold 1

Fold 2

Fold 3

Fold N

“Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975))

“Nothing in bioinformatics makes sense except in the light of Biology”

Bioinformatics


Divergent evolution

Ancestral sequence: ABCD

ACCD (B → C, mutation)        ABD (C → ø, deletion)

Pairwise Alignment:

ACCD        or        ACCD
AB─D                  A─BD
(true alignment)

Mutations under divergent evolution

Ancestral sequence

Sequence 1 Sequence 2

1: ACCTGTAATC
2: ACGTGCGATC
     *  **
D = 3/10 (fraction of different sites (nucleotides))
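The fraction D can be computed directly from two aligned sequences. The sketch below assumes the sequences are already aligned to equal length and contain no gaps; the function name `p_distance` is chosen here for illustration.

```python
def p_distance(seq1, seq2):
    """Fraction of aligned sites at which the two sequences differ."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to equal length")
    if not seq1:
        raise ValueError("sequences must not be empty")
    differences = sum(a != b for a, b in zip(seq1, seq2))
    return differences / len(seq1)


print(p_distance("ACCTGTAATC", "ACGTGCGATC"))   # 0.3, i.e. D = 3/10
```

D counts only the differences that are visible today, so it is a lower bound on the number of substitutions that actually occurred per site.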

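[Figure: substitutions at a single site, ancestral nucleotide G]

(a) Descendants G and C: one substitution, one visible
(b) Descendants A and C: two substitutions, one visible
(c) Descendants A and A: two substitutions, none visible
(d) One lineage G → A → G: back mutation, not visible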

Convergent evolution

• Often with shorter motifs (e.g. active sites)

• Motif (function) has evolved more than once independently, e.g. starting with two very different sequences adopting different folds

• Sequences and associated structures remain different, but (functional) motif can become identical

• Classical example: serine proteinase (subtilisin) and chymotrypsin

Serine proteinase (subtilisin) and chymotrypsin

• Different evolutionary origins, no sequence similarity

• Similarities in the reaction mechanisms. Chymotrypsin, subtilisin and carboxypeptidase C have a catalytic triad of serine, aspartate and histidine in common: serine acts as a nucleophile, aspartate as an electrophile, and histidine as a base.

• The geometric orientations of the catalytic residues are similar between families, despite different protein folds.

• The linear arrangements of the catalytic residues reflect different family relationships. For example the catalytic triad in the chymotrypsin clan (SA) is ordered HDS, but is ordered DHS in the subtilisin clan (SB) and SDH in the carboxypeptidase clan (SC).

A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    ******

What can sequence tell us about structure (HSSP)

Sander & Schneider, 1991

Searching for similarities

What is the function of the new gene?

The “lazy” investigation (i.e., no biological experiments, just bioinformatics techniques):

– Find a set of similar protein sequences to the unknown sequence

– Identify similarities and differences

– For long proteins: identify domains first

Evolutionary and functional relationships

Reconstruct evolutionary relation:

• Based on sequence
  - Identity (simplest method)
  - Similarity
• Homology (common ancestry: the ultimate goal)
• Other (e.g., 3D structure)

Functional relation: Sequence → Structure → Function

Common ancestry is more interesting: it makes it more likely that genes share the same function

Homology: sharing a common ancestor
– a binary property (yes/no)
– it’s a nice tool: when (an unknown) gene X is homologous to (a known) gene G, we gain a lot of information on X: what we know about G can be transferred to X as a good suggestion.

Searching for similarities

Biological definitions for related sequences

Homologues are similar sequences in two different organisms that have been derived from a common ancestor sequence. Homologues can be described as either orthologues or paralogues.

Orthologues are similar sequences in two different organisms that have arisen due to a speciation event. Orthologues typically retain identical or similar functionality throughout evolution.

Paralogues are similar sequences within a single organism that have arisen due to a gene duplication event.

Xenologues are similar sequences that do not share the same evolutionary origin, but rather have arisen out of horizontal transfer events through symbiosis, viruses, etc.

How to evolve

Important distinction:

• Orthologues: homologous proteins in different species (all deriving from same ancestor)

• Paralogues: homologous proteins in same species (internal gene duplication)

• In practice: to recognise orthology, the bi-directional best hit is used in conjunction with a database search program (this is called an operational definition)

Source: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
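The bi-directional best hit criterion can be written down compactly. The sketch below is illustrative only: it assumes the search results have already been reduced to similarity scores in nested dictionaries, and the gene identifiers and scores are made up for the example.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is {query: {subject: similarity score}}.
    """
    return {
        query: max(subjects, key=subjects.get)
        for query, subjects in scores.items()
        if subjects
    }


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return (gene in A, gene in B) pairs that are each other's best hit."""
    best_a = best_hits(a_vs_b)
    best_b = best_hits(b_vs_a)
    return sorted(
        (gene_a, gene_b)
        for gene_a, gene_b in best_a.items()
        if best_b.get(gene_b) == gene_a
    )


# Hypothetical search scores between species A and species B.
a_vs_b = {"a1": {"b1": 250, "b2": 90}, "a2": {"b1": 120, "b2": 80}}
b_vs_a = {"b1": {"a1": 245, "a2": 118}, "b2": {"a1": 88, "a2": 79}}
print(bidirectional_best_hits(a_vs_b, b_vs_a))   # [('a1', 'b1')]
```

Here `a2` also finds `b1` as its best hit, but `b1` prefers `a1`, so only the pair (`a1`, `b1`) is reported as putative orthologues.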

So this means …

Example today: Pairwise sequence alignment needs sense of evolution

Global dynamic programming

MDAGSTVILCFVG
MDAASTILCGS

Amino Acid Exchange Matrix

Gap penalties (open, extension)

Search matrix

MDAGSTVILCFVG-
MDAAST-ILC--GS

Evolution
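As a rough illustration of global dynamic programming, the sketch below fills a search matrix and traces back one optimal alignment (the Needleman-Wunsch scheme). It simplifies the slide's setting in two ways that are assumptions of this example: a toy identity score stands in for the amino acid exchange matrix, and a linear gap penalty replaces the (open, extension) pair. The names `global_align` and `identity` are chosen here.

```python
def global_align(seq1, seq2, score, gap=-1):
    """Global alignment with a linear gap penalty.

    `score(a, b)` plays the role of the exchange matrix; `gap` is the
    (negative) score added per gap position.
    Returns (optimal score, aligned seq1, aligned seq2).
    """
    n, m = len(seq1), len(seq2)
    # Search matrix: H[i][j] = best score for seq1[:i] versus seq2[:j].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        H[i][0] = i * gap
    for j in range(1, m + 1):
        H[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(
                H[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]),
                H[i - 1][j] + gap,
                H[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner to recover one optimal path.
    out1, out2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        diagonal = (
            i > 0 and j > 0
            and H[i][j] == H[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1])
        )
        if diagonal:
            out1.append(seq1[i - 1])
            out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and H[i][j] == H[i - 1][j] + gap:
            out1.append(seq1[i - 1])
            out2.append("-")
            i -= 1
        else:
            out1.append("-")
            out2.append(seq2[j - 1])
            j -= 1
    return H[n][m], "".join(reversed(out1)), "".join(reversed(out2))


def identity(a, b):
    """Toy stand-in for an exchange matrix: +1 match, -1 mismatch."""
    return 1 if a == b else -1


best, row1, row2 = global_align("MDAGSTVILCFVG", "MDAASTILCGS", identity)
print(best)
print(row1)
print(row2)
```

With a real exchange matrix and affine gap penalties the optimal alignment may differ from the one this toy scoring produces; the structure of the computation (fill the matrix, then trace back) stays the same.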

How to determine similarity

Frequent evolutionary events at the DNA level:

1. Substitution

2. Insertion, deletion

3. Duplication

4. Inversion

We will restrict ourselves to these events

A DNA sequence alignment
attcgttggcaaatcgcccctatccggccttaa
att---tggcggatcg-cctctacgggcc----
***   ****  **** **    ******

A protein sequence alignment
MSTGAVLIY--TSILIKECHAMPAGNE-----
---GGILLFHRTHELIKESHAMANDEGGSNNS
   *  *    *  **** ***

nucleotide one-letter code

amino acid one-letter code

– Substitution (or match/mismatch)

• DNA

• proteins

– Gap penalty

• Linear: gp(k)=ak

• Affine: gp(k)=b+ak

• Concave, e.g.: gp(k)=log(k)

The score for an alignment is the sum of the scores over all alignment columns
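The three gap penalty forms listed above differ in how the cost grows with the gap length k. A small sketch makes this visible; the parameter values a = 1 and b = 10 are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def linear_gap(k, a=1.0):
    """Linear: gp(k) = a*k."""
    return a * k


def affine_gap(k, a=1.0, b=10.0):
    """Affine: gp(k) = b + a*k (b is paid once, a per gap position)."""
    return b + a * k


def concave_gap(k):
    """Concave example from the list above: gp(k) = log(k)."""
    return math.log(k)


for k in (1, 2, 5, 10):
    print(k, linear_gap(k), affine_gap(k), round(concave_gap(k), 2))
```

Under the affine form, opening a gap is expensive but each extra position adds only a, so one long gap costs less than several short ones of the same total length. Under the concave form, each additional position costs less than the one before.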

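Dynamic programming
Scoring alignments

$S_{a,b} = \sum_{l} s(a_i, b_j) - \sum_{k} N_k \cdot gp(k)$

$gp(k) = gap_{init} + k \cdot gap_{extension}$   (affine gap penalties)

The first sum runs over the aligned residue pairs; $N_k$ is the number of gaps of length $k$.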

DNA: define a score for match/mismatch of letters

Simple:

       A     C     G     T
A      1    -1    -1    -1
C     -1     1    -1    -1
G     -1    -1     1    -1
T     -1    -1    -1     1

Used in genome alignments:

       A     C     G     T
A     91  -114   -31  -123
C   -114   100  -125   -31
G    -31  -125   100  -114
T   -123   -31  -114    91
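The two matrices can be compared by scoring the same ungapped alignment with each. The sketch below is only an illustration: the names `SIMPLE`, `GENOME`, `as_lookup` and `score_ungapped` are chosen here, and the test sequences are the pair used earlier for the fraction of different sites.

```python
BASES = "ACGT"

SIMPLE = [
    [1, -1, -1, -1],
    [-1, 1, -1, -1],
    [-1, -1, 1, -1],
    [-1, -1, -1, 1],
]
GENOME = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]


def as_lookup(rows):
    """Turn a 4 x 4 matrix (rows and columns in ACGT order) into a dict."""
    return {
        (a, b): rows[i][j]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
    }


def score_ungapped(seq1, seq2, lookup):
    """Sum the substitution scores over all columns (no gaps allowed here)."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    return sum(lookup[(a, b)] for a, b in zip(seq1.upper(), seq2.upper()))


seq1, seq2 = "ACCTGTAATC", "ACGTGCGATC"
print(score_ungapped(seq1, seq2, as_lookup(SIMPLE)))   # 4
print(score_ungapped(seq1, seq2, as_lookup(GENOME)))   # 477
```

In the second matrix the transitions A↔G and C↔T score -31, while the transversions score between -114 and -125, so the kind of mismatch matters and not just its presence.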


Amino Acid Exchange Matrix (20 × 20)

Affine gap penalties (open, extension): 10, 1

T D W V T A L K
T D W L - - I K

Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) - Po - 2Px + s(L,I) + s(K,K)
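The score of this example alignment can be reproduced with a short function that pays Po once per gap and Px for every gap position. This is a sketch under stated assumptions: `toy_exchange` is a hypothetical stand-in for a real 20 × 20 exchange matrix (+1 for identical residues, -1 otherwise), and the defaults Po = 10 and Px = 1 follow the values shown on the slide.

```python
def score_alignment(row1, row2, s, p_open=10, p_ext=1):
    """Sum column scores; a gap of length k costs p_open + k * p_ext.

    Simplification: adjacent gap columns are treated as one gap, even if
    the gap switches from one row to the other.
    """
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    in_gap = False
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            if not in_gap:
                total -= p_open      # paid once per gap
            total -= p_ext           # paid for every gap position
            in_gap = True
        else:
            total += s(a, b)
            in_gap = False
    return total


def toy_exchange(a, b):
    """Hypothetical stand-in for a 20 x 20 exchange matrix."""
    return 1 if a == b else -1


print(score_alignment("TDWVTALK", "TDWL--IK", toy_exchange))   # -10
```

With the toy scores the six residue columns contribute 1 + 1 + 1 - 1 - 1 + 1 = 2 and the gap of length two costs Po + 2Px = 12, giving -10. A real exchange matrix would change the residue terms but not the gap term.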
