Alignments and phylogeny

Peter Hantz EMBL Heidelberg

Mutations:

changes in the DNA sequence
causes: chemicals, physical conditions, replication errors (ca. 10 per replication per human genome)

coding region mutations:
> silent: results in the same AA
> mis-sense: results in a different AA
> nonsense: results in a stop codon
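A minimal Python sketch of the three coding-region classes, assuming the standard genetic code and a single changed codon. The function name `classify_point_mutation` and the compact table construction are illustrative choices, not part of the slides.

```python
from itertools import product

# Standard genetic code; codons are enumerated with the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_point_mutation(codon_before, codon_after):
    """Classify a codon change as silent, mis-sense or nonsense.

    Simplification: a change that turns a stop codon into an amino acid
    is reported as mis-sense here.
    """
    before = CODON_TABLE[codon_before]
    after = CODON_TABLE[codon_after]
    if after == before:
        return "silent"
    if after == "*":  # '*' marks a stop codon
        return "nonsense"
    return "mis-sense"


print(classify_point_mutation("GAA", "GAG"))  # Glu -> Glu
print(classify_point_mutation("GAG", "GTG"))  # Glu -> Val
print(classify_point_mutation("TGG", "TGA"))  # Trp -> stop
```

The GAG to GTG change (Glu to Val) is the kind of single-base mis-sense change known from sickle cell anemia.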

non-coding region mutations: e.g. in transcription factor binding sites, leading to developmental deficiencies

gene duplications + "neofunctionalization" of the extra copy: new function, new evolutionary pressure

Evolution

Homology, orthology, paralogy

Homologous: genes having a common ancestor
Orthologous: common ancestor, in different species
Paralogous: common ancestor, gene duplication in the same species

wikipedia

Alignments

Aims:
- to compare DNA or protein sequences that diverged during evolutionary processes

What is it good for?

- a new piece of DNA/protein: what is it? (proteins: structural relationships)

- to trace evolutionary relationships (phylogenetic trees)

- to find mutations that cause diseases

- for gene cloning: degenerate PCR primer design

- forensics, whale protection...

Alignments are the first step of building evolutionary trees

A. Budd

Malaria - Sickle Cell Anemia

23andme.com

Some applications...

Forensics

SNP

Global and Local Alignments

Global: the sequences are assumed to be similar overall (e.g. closely related ones)
"forces" the alignment to span the entire length of the sequences
e.g. to find evolutionary relationships
e.g. to find mutations in a gene

Local: used if only short pieces (motifs, domains) are similar
to identify regions of similarity within long sequences
e.g. to find coding regions in genomic DNA full of introns

Source: Wikipedia

How a very simple pairwise alignment works: the DotPlot method

Source: Gene Cloning
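The dot plot idea can be sketched in a few lines of Python: one sequence runs along the rows, the other along the columns, and a mark is placed wherever the two letters agree. The function name `dot_plot` and the `*`/`.` symbols are assumptions made for this illustration. Real dot plot tools usually compare short windows rather than single letters to reduce noise.

```python
def dot_plot(seq_a, seq_b):
    """Return one text row per letter of seq_a; '*' marks identical letters."""
    return [
        "".join("*" if a == b else "." for b in seq_b)
        for a in seq_a
    ]


for row in dot_plot("GATTACA", "GATCACA"):
    print(row)
```

Similar regions show up as diagonal runs of marks. A shifted diagonal hints at an insertion or deletion.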

Global/local alignments: Dynamic Programming

Global: Needleman-Wunsch algorithm
Local: Smith-Waterman algorithm
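Both algorithms fill a table in which each cell holds the best score of aligning two prefixes. The sketch below computes only the optimal scores, not the alignments themselves. It assumes a simple linear gap penalty (every gap position costs the same); the function names and default scores are illustrative.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score (the whole of a against the whole of b)."""
    prev = [j * gap for j in range(len(b) + 1)]  # first row: only gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # first column: only gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment score (best pair of sub-segments)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0: a local alignment may restart anywhere.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ACGT", "AGT"))
print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

The only differences between the two are the initialization (gap penalties versus zeros), the floor at zero, and where the final score is read (last cell versus the maximum over the table).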

Aligning DNA

|: identical
nothing: mismatch
-: gap

How do we quantify their similarity?

Measure of similarity: f(#of matches, mismatches, gaps)

How do we account for identities, mismatches, gaps?
A simple model (standard BLAST scoring matrix):

Identity +2
Mismatch -3
Gap creation -5
Gap extension -2
Terminal mismatch 0

Are all mismatches "similar"? Kimura model:
purine-purine and pyrimidine-pyrimidine substitutions are more probable
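The purine/pyrimidine distinction can be made explicit with a small helper. Purine-purine and pyrimidine-pyrimidine changes are usually called transitions, all others transversions. The function name `substitution_type` is an assumption for this sketch.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def substitution_type(x, y):
    """Classify a pair of aligned DNA bases."""
    if x == y:
        return "identity"
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return "transition"    # purine-purine or pyrimidine-pyrimidine
    return "transversion"      # purine-pyrimidine


print(substitution_type("A", "G"))
print(substitution_type("A", "C"))
```

A scoring scheme that follows this idea would penalize transversions more strongly than transitions instead of using one mismatch score for all.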

Score of two aligned DNA sequences (or parts of them):

TCCTAGGACTCATCGTAAGG
||||||    | ||||||||
TCCTAG--AACCTCGTAAGG

15 identities (+30), one gap of length 2 (-5 -2 = -7), 3 mismatches (-9):
30 - 7 - 9 = 14
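The score of this example can be recomputed with a short Python sketch that walks through the aligned columns, using the values of the simple model (identity +2, mismatch -3, gap creation -5, gap extension -2). The function name `score_alignment` is illustrative. As a simplification, the special treatment of terminal mismatches (score 0) is not implemented; it does not matter here because both ends match.

```python
def score_alignment(top, bottom, identity=2, mismatch=-3,
                    gap_open=-5, gap_extend=-2):
    """Score two already aligned sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            # First gap position pays the creation cost, later ones the extension cost.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += identity if x == y else mismatch
            in_gap = False
    return score


print(score_alignment("TCCTAGGACTCATCGTAAGG",
                      "TCCTAG--AACCTCGTAAGG"))  # 14
```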

Aligning DNA, continued, concepts for both global and local alignments

Aligning Proteins

some substitutions are less relevant than others
change to a similar AA: not too much change in the protein structure
change to something else: improbable, it will be selected against

letter = match, + = conservative substitution, Ø = non-conservative substitution, - = gap (software: BLAST)

Identities: same AA
Positives: "similar" AAs, incl. identities

How "similar" are these sequences?

Measure of similarity: f(# of matches, # of conservative/non-conservative changes, # of gaps)

AA changes: quantification by the "substitution matrix"

> the simplest one: the identity matrix
> a more realistic one: BLOcks SUbstitution Matrix (BLOSUM62): + probable, - improbable, diagonal: probability that the AA is left unchanged

Gap penalties: usually -10 for gap open and -2 for gap extension

The score: measure of similarity of a segment/the entire length of two aligned protein sequences
high score means: good alignment/real similarity
score for a given alignment = symbol-wise score total (matrix) + gap penalty total

AABBCCDD--EEF
AA----DDKKKEFGG
4+4 -10-2-2-2 +6+6 -10-1 +1+5+6 0 0 =

Significance: E-value

Given our query sequence (DNA or protein)
Let's take a random sequence from a hypothetical database of size D:

Can an alignment as good as or better than this occur BY CHANCE? (calculated from a random database sequence)

From the score S a so-called "Expectation value" is calculated: E = P(S(random) > S(query)) · D

If the chance is tiny (E<10^-6): it is unlikely that the observed alignment is due to chance alone

Note: if D is very large, the E-value increases
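A tiny numeric illustration of the relation E = P · D, as a sketch only: the function name and the example numbers are made up, and the statistics used by real BLAST are more involved than this simplified formula.

```python
def expectation_value(p_chance, database_size):
    """E = P * D: expected number of chance hits at least as good as ours."""
    return p_chance * database_size


p = 1e-9  # hypothetical chance that one random sequence scores this well
for d in (1_000, 1_000_000, 1_000_000_000):
    print(d, expectation_value(p, d))
```

The same alignment score is therefore less convincing when the database searched is larger.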

Why is the score S not enough? To what do we compare it? Statistics is needed.
Ex.

Multiple alignments: more complex than the pairwise ones
"Progressive alignment methods" (e.g. Clustal)
Iterative methods (e.g. MUSCLE)

Multiple alignment of Nitric Oxide Synthase protein sequences

(P. Hantz)

If two sequences have a large % of identity, they can be interpreted to be homologous (y/n)
histones - very conservative
MHC gene pool - evolves like crazy

Viagra!

The BLAST (Basic Local Alignment Search Tool)

How does it work? In a nutshell:

- List the words of length 3 (by default) of the query: PQGEFG >> PQG, QGE, GEF, EFG

- Scan the database sequences with "the relevant ones" of these

- Try extending the exact matches, via local alignments, as long as there are not too many mismatches (down to a score threshold)

>> HSP-s (High-Scoring Segment Pairs)

-Evaluate the significance and E-value of the HSP-s
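The first two steps can be sketched in Python as follows. This is a simplified illustration, not BLAST's implementation: it only looks for exact word matches ("seeds"), and the names `query_words` and `find_seeds` are assumptions. The extension and E-value steps are not shown.

```python
def query_words(query, word_length=3):
    """All overlapping words of the given length, in query order."""
    return [query[i:i + word_length]
            for i in range(len(query) - word_length + 1)]


def find_seeds(query, subject, word_length=3):
    """Exact word hits as (position in query, position in subject) pairs."""
    seeds = []
    for q_pos, word in enumerate(query_words(query, word_length)):
        s_pos = subject.find(word)
        while s_pos != -1:
            seeds.append((q_pos, s_pos))
            s_pos = subject.find(word, s_pos + 1)
    return seeds


print(query_words("PQGEFG"))            # ['PQG', 'QGE', 'GEF', 'EFG']
print(find_seeds("PQGEFG", "AAGEFGA"))  # [(2, 2), (3, 3)]
```

Each seed is a starting point from which the local alignment would be extended in both directions.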

Scored pairwise local alignments are generated very fast
"Is my new sequence related to something I know about?"

Also used for aligning sequences

Using it: BLAST sub-programs (beside the pairwise alignments)

Blastn: Search a nucleotide database using a nucleotide query
What can my sequence be? - close relatives
(What are the sequences similar to my sequence?)

Blastp: Search a protein database using a protein query
What can my protein be?
(What are the proteins similar to my sequence?)
Find conserved domains in the query
Find members of the protein family

Blastx: Search a protein database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Protein sequences: more conserved!

Tblastn: Search a translated nucleotide database using a protein query
Find similar proteins, e.g. in not yet annotated DNA sequences

Tblastx: Search a translated nucleotide database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA

Note: translation is done in all 6 frames, and all of these are locally analyzed
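What "all 6 frames" means can be shown with a short Python sketch: three reading frames on the given strand and three on its reverse complement. The standard genetic code is assumed; the function names and the frame labels (`+1` ... `-3`) are illustrative choices.

```python
from itertools import product

# Standard genetic code; codons are enumerated with the bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translation(dna):
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translation("ATGGCC").items():
    print(frame, protein)
```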

Making Phylogenetic Trees

Aims: to show evolutionary relationships of:

genes, species (they evolve ALL the time), even computer software...

A. Budd Ji et al., 2008

The new rRNA-based animal phylogeny (1995)

Halanych et al., Adoutte et al., after 1995

[Figure: animal phylogenetic tree. Labels: Mollusca, Annelida, Platyhelm.; Protostomia, Deuterostomia]

rRNA: present in all creatures
slow/fast evolving parts
secondary structure

The new rRNA-based Tree of Life

Woese et al., after 1990

Description of Evolutionary Trees

Internal nodes: hypothetical ancestral organisms/genes

Terminal nodes: existing organisms/genes (Operational Taxonomic Units, OTU)

Root: the last common ancestor of the entire group

Sister groups: on either side of a split, with a common ancestor and no additional descendants

Monophyletic group: a group containing an ancestor (the most recent common ancestor of the group) and all of its descendants

Only the terminal nodes (OTU) exist right now
They all evolve in time!

Description of Phylogenetic Trees

Cladogram: branch length unscaled

Phylogram: branch length = amount of evolutionary divergence (horizontals don't count)

Biology: sequences, organisms evolve all the time

Molecular clocks:
Can the # of mutations of a DNA or protein sequence be correlated with the time elapsed since the Last Common Ancestor?

If the rate is approximately constant: YES
Different rates - different advantages
GENE PHYLOGENY ≠ SPECIES PHYLOGENY

too fast: saturation - back mutations - underestimation of the distance: A-B-C-D-E-A-...

calibration: some dated fossil records are needed

Building Phylogenetic trees: A very simple example:

(a) MSATHC (b) ITATHC (c) ITAGHC (d) LTAAHC

Mutations (a)<>(b)<>(c) Mutations (a)<>(d)<>(b)

A "rooted" tree: an assumption about the common ancestor

source: Gene Cloning

One step-one mutation
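For the four example sequences (a) to (d), the number of differing positions between each pair can be counted with a few lines of Python. This is only a sketch of the counting step that underlies such a tree, with an illustrative function name; it does not build the tree.

```python
from itertools import combinations

SEQUENCES = {"a": "MSATHC", "b": "ITATHC", "c": "ITAGHC", "d": "LTAAHC"}


def hamming(s, t):
    """Number of positions at which two equally long sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(s, t))


distances = {
    (p, q): hamming(SEQUENCES[p], SEQUENCES[q])
    for p, q in combinations(SEQUENCES, 2)
}
for pair, d in distances.items():
    print(pair, d)
```

The closest pair is (b) and (c), which differ at a single position.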

Building Phylogenetic Trees

Distance-based methods
"Distance" of two sequences: a "metric" in mathematics (4 axioms)
several ones: Euclidean... non-Euclidean...

A distance between two strings: Levenshtein distance d(IJ)
the minimum number of edits (insertion, deletion, or substitution of a single character) needed to transform one string into the other

Its calculation: intuitively easy, practically complicated (dynamic programming)

Example: d(kitten/6/, sitting/7/)=3

Levenshtein distance: a special case of the score: gaps/mismatches/missing ends: 1; matches: 0

Green: Euclidean distance
Others: "Manhattan" distance

kitten → sitten ('s' for 'k')
sitten → sittin ('i' for 'e')
sittin → sitting (insert 'g')
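The dynamic-programming calculation can be sketched in Python as below. Only two rows of the table are kept in memory at a time; the function name is illustrative.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions or substitutions."""
    prev = list(range(len(t) + 1))  # distance from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```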

The Building of a Tree: the UPGMA method

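Sequential clustering method
Starting point: distance matrix [d(IJ)] of the sequences (triangular matrix)

\[
D = \begin{pmatrix}
0 & & & \\
d(S_2,S_1) & 0 & & \\
\vdots & \ddots & \ddots & \\
d(S_n,S_1) & \cdots & d(S_n,S_{n-1}) & 0
\end{pmatrix}
\]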

- grouping the pair of strings with the smallest pairwise distance

- node "placed" at half of the distance

- creating a reduced matrix: these two are "joined", distances are re-calculated:

\[
d(X, BF) = \frac{d(X,B) + d(X,F)}{2}
\]

http://www.nmsr.org/upgma.htm
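The clustering loop can be sketched in Python as follows. The sketch assumes the usual UPGMA update, in which distances to a joined cluster are averaged weighted by cluster size; when two single sequences are joined this reduces to the plain average of the two distances. The function name, the input format and the cluster labels are illustrative choices.

```python
from itertools import combinations


def upgma(names, pair_distances):
    """Return the merges as (cluster_1, cluster_2, node_height) tuples.

    pair_distances maps (name_1, name_2) tuples to distances.
    """
    size = {name: 1 for name in names}  # cluster label -> number of sequences
    dist = {frozenset(pair): d for pair, d in pair_distances.items()}
    merges = []
    while len(size) > 1:
        # Join the two clusters with the smallest distance.
        a, b = min(combinations(size, 2), key=lambda p: dist[frozenset(p)])
        d_ab = dist[frozenset((a, b))]
        new = f"({a},{b})"
        # Re-calculate the distances from every other cluster to the new one.
        for x in size:
            if x not in (a, b):
                dist[frozenset((x, new))] = (
                    size[a] * dist[frozenset((x, a))]
                    + size[b] * dist[frozenset((x, b))]
                ) / (size[a] + size[b])
        size[new] = size[a] + size[b]
        del size[a], size[b]
        merges.append((a, b, d_ab / 2))  # node placed at half of the distance
    return merges


example = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}
for merge in upgma(["A", "B", "C"], example):
    print(merge)
```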

FIRST DO AN ALIGNMENT
Pre-processing: eliminate obviously wrong sequence regions

e.g. "forgotten" introns when investigating proteins, AG|GT...AG|G
e.g. obvious sequencing errors
e.g. bad sequences

Correct the distances for multiple substitutions (homoplasy) (measure of distances: change/site, 0…1)
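One common correction of this kind for DNA is the Jukes-Cantor formula, d = -3/4 · ln(1 - 4p/3), where p is the observed fraction of differing sites. The slides do not name a specific correction, so the choice of Jukes-Cantor here is an assumption, shown only as a sketch.

```python
import math


def jukes_cantor_distance(p):
    """Estimated substitutions per site from the observed fraction p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the Jukes-Cantor correction")
    return -0.75 * math.log(1 - 4 * p / 3)


for p in (0.05, 0.1, 0.3, 0.6):
    print(p, round(jukes_cantor_distance(p), 3))
```

The corrected distance is always at least as large as the observed one, and the gap widens as the sequences approach saturation.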

Problems might still appear!...

Rooting a tree
Rooting induces a directionality

> "automatically" done by several software packages: midpoint rooting (root on the branch at equal distance from the most distant OTUs)
> "by hand", by choosing an "outgroup": a homologous, but "quite far" sequence; root: on the branch between the tree and the outgroup

Building a tree
use a program...

Resample datasets (with replacement)

[Figure: the alignment columns of Taxa1-Taxa4 (TAAT, AAAA, CCGG, GGCC, AATT) are resampled with replacement to give Sample1, Sample2, Sample3, ..., Sample99, Sample100; a tree branch is labeled 60%. Figure: T. Larsson]

Bootstrap values: how robust is our tree

The result:

...

Randomizing the sequences a bit... does the tree/subtree persist?
Larger % - better
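The resampling step can be sketched in Python: alignment columns are drawn with replacement until the resampled alignment is as long as the original. The four short sequences below are an assumption, reconstructed from the columns TAAT, AAAA, CCGG, GGCC, AATT that appear in the figure; the function names are illustrative. Building a tree from each sample and counting how often a branch reappears is left to a tree-building program.

```python
import random

# Assumed toy alignment; its columns are TAAT, AAAA, CCGG, GGCC, AATT.
ALIGNMENT = {
    "Taxa1": "TACGA",
    "Taxa2": "AACGA",
    "Taxa3": "AAGCT",
    "Taxa4": "TAGCT",
}


def columns(alignment):
    """The alignment columns, each as a string with one letter per taxon."""
    return ["".join(col) for col in zip(*alignment.values())]


def bootstrap_sample(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    n_columns = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[i] for i in picks)
            for name, seq in alignment.items()}


rng = random.Random(0)  # fixed seed so the run is repeatable
for number in range(1, 4):
    print(f"Sample{number}", columns(bootstrap_sample(ALIGNMENT, rng)))
```

If a branch of the original tree shows up in 60 of 100 such samples, its bootstrap value is 60%.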