Alignments and phylogeny
Peter Hantz EMBL Heidelberg
Mutations:
changes in the DNA sequence
causes: chemicals, physical conditions, replication errors (ca. 10 per replication of the human genome)
coding region mutations:
>silent: results in the same AA
>missense: results in a different AA
>nonsense: results in a stop codon
non-coding region mutations: e.g. in transcription factor binding sites, leading to developmental deficiencies
gene duplications + "neofunctionalization" of the extra copy: new function, new evolutionary pressure
Evolution
Homology, orthology, paralogy
Homologous: genes having a common ancestor
Orthologous: common ancestor, in different species
Paralogous: common ancestor, gene duplication in the same species
wikipedia
Alignments
Aims:
-to compare DNA or protein sequences that diverged during evolutionary processes
What is it good for?
-a new piece of DNA/protein: what is it? (proteins: structural relationships)
-to trace evolutionary relationships (phylogenetic trees)
-to find mutations that cause diseases
-for gene cloning: degenerate PCR primer design
-forensics, whale protection...
Alignments are the first step of building evolutionary trees
A. Budd
Malaria - Sickle Cell Anemia
23andme.com
Some applications...
Forensics
snp
Global and Local Alignments
Global: the sequences are assumed to be similar overall (e.g. closely related ones)
"forces" the alignment to span the entire length of the sequences
e.g. to find evolutionary relationships
e.g. to find mutations in a gene
Local: used if only short pieces (motifs, domains) are similar
to identify regions of similarity within long sequences
e.g. to find coding regions in genomic DNA full of introns
Source:wikipedia
How a very simple pairwise alignment works: the DotPlot method
Source: Gene Cloning
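A minimal sketch of the dot plot idea, assuming the simplest variant (one dot wherever two single letters are identical, no window or threshold). The function names `dot_plot` and `render` are hypothetical and chosen only for this illustration. Similar regions show up as diagonal runs of dots.

```python
def dot_plot(seq_x, seq_y):
    """Return a grid of booleans: grid[row][col] is True when
    seq_y[row] and seq_x[col] are the same letter."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the grid as text: '*' for a dot, '.' for no dot."""
    lines = ["  " + " ".join(seq_x)]
    for letter, row in zip(seq_y, dot_plot(seq_x, seq_y)):
        lines.append(letter + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


# Two short example sequences (made up for the illustration).
grid = dot_plot("ACGT", "ACGT")
print(render("ACGT", "ACGT"))
```

For identical sequences the main diagonal is fully dotted; insertions or deletions shift the diagonal sideways.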
Global/local alignments: Dynamic Programming
Global: Needleman-Wunsch algorithm
Local: Smith-Waterman algorithm
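The two dynamic programming schemes can be sketched as follows. This is a simplified illustration that only returns the best score (no traceback), and it assumes a single linear gap penalty instead of separate creation/extension penalties; the parameter values and function names are assumptions made for the example.

```python
def needleman_wunsch_score(s, t, match=2, mismatch=-3, gap=-5):
    """Best GLOBAL alignment score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(t) + 1)]  # first row: only gaps
    for i, a in enumerate(s, start=1):
        curr = [i * gap]  # first column: only gaps
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(s, t, match=2, mismatch=-3, gap=-5):
    """Best LOCAL alignment score: a cell never drops below 0,
    and the best cell anywhere in the table is reported."""
    best = 0
    prev = [0] * (len(t) + 1)
    for a in s:
        curr = [0]
        for j, b in enumerate(t, start=1):
            diag = prev[j - 1] + (match if a == b else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


print(needleman_wunsch_score("ACGT", "ACT"))        # global: one gap is forced
print(smith_waterman_score("GGGACGT", "ACGTCCC"))   # local: shared piece ACGT
```

The only differences between the two are the initialisation (gap penalties versus zeros), the floor at 0, and where the final score is read from.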
Aligning DNA
|: identical
nothing: mismatch
-: gap
How do we quantify their similarity?
Measure of similarity: f(#of matches, mismatches, gaps)
How do we score identities, mismatches and gaps?
A simple model (standard BLAST scoring matrix):
Identity +2
Mismatch -3
Gap creation -5
Gap extension -2
Terminal mismatch 0
Are all mismatches "similar"? Kimura model:
purine-purine and pyrimidine-pyrimidine changes (transitions) are more probable
Score of two aligned DNA sequences (or parts of them):
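TCCTAGGACTCATCGTAAGG
TCCTAG--AACCTCGTAAGG
6 identities (+12), gap creation (-5), gap extension (-2), 2 mismatches (-6), 1 identity (+2), 1 mismatch (-3), 8 identities (+16)
Total: 15 identities, one gap of length 2, 3 mismatches: 30 - 7 - 9 = 14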
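The scoring of an already aligned pair can be sketched in a few lines. The function name `score_alignment` is hypothetical; it assumes that the first position of a gap costs the creation penalty and every further position the extension penalty, and it ignores the terminal-mismatch rule.

```python
def score_alignment(top, bottom, match=2, mismatch=-3, gap_open=-5, gap_extend=-2):
    """Score two aligned sequences of equal length ('-' marks a gap)."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False  # True while we are inside a run of gap positions
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score


print(score_alignment("TCCTAGGACTCATCGTAAGG",
                      "TCCTAG--AACCTCGTAAGG"))  # 14
```

Note that this simplification treats adjacent gaps in either sequence as one run.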
Aligning DNA, continued, concepts for both global and local alignments
Aligning Proteins
some substitutions are less relevant than others
change to a similar AA: not too much change in the protein structure
change to something else: improbable, it will be selected against
letter = match, + = conservative substitution, Ø = non-conservative substitution, - = gap (software: BLAST)
Identities: same AA
Positives: "similar" AA-s, incl. identities
How "similar" are these sequences?
Measure of similarity: f(# of matches, # of conservative/non-conservative changes, # of gaps)
AA changes: quantification by the "substitution matrix"
>the simplest one: the identity matrix
>a more realistic one: BLOcks SUbstitution Matrix (BLOSUM62): + probable, - improbable, diagonal: the AA is left unchanged
Gap penalties: usually -10 for gap open and -2 for gap extension
The score: a measure of similarity of a segment/the entire length of two aligned protein sequences
a high score means: good alignment/real similarity
score for a given alignment = symbol-wise score total (matrix) + gap penalty total
A A B B C C D D - - E E F
A A - - - - D D K K K E F G G
4+4 -10-2-2-2 +6+6 -10-1 +1+5+6 0 0 =
Significance: E-value
Given our query sequence (DNA or protein), let's take a random sequence from a hypothetical database of size D:
Can an alignment as good as or better than this one occur BY CHANCE? (calculated from a random database sequence)
From the score S a so-called "Expectation value" is calculated: E = P(S(random) > S(query)) · D
If the chance is tiny (E < 10^-6): it is unlikely that the observed alignment is due to chance alone
Note: if D is very large, the E-value increases
Why is the score S not enough? To what do we compare it? Statistics are needed.
Multiple alignments: more complex than the pairwise ones
"Progressive alignment methods" (e.g. Clustal)
Iterative methods (e.g. MUSCLE)
Multiple alignment of Nitric Oxide Synthase protein sequences
(P. Hantz)
If two sequences have a large % of identity, can they be interpreted to be homologous? (y/n)
histones: very conserved
MHC gene pool: evolves like crazy
Viagra!
The BLAST (Basic Local Alignment Search Tool)
How does it work? In a nutshell:
-List the words of length 3 (by default) of the query: PQGEFG >> PQG, QGE, GEF, EFG
-Scan the database sequences with "the relevant ones" of these
-try extending the exact matches, via local alignments, as long as there are not too many mismatches (until a score level)
>> HSP-s (High-Scoring Segment Pairs)
-Evaluate the significance and E-value of the HSP-s
Scored pairwise local alignments are generated very fast
"Is my new sequence related to something I know about?"
Also used for aligning sequences
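The first two steps (listing the query words and scanning a sequence for them) can be sketched as below. This is only an illustration of the seeding idea with exact word matches; the real program is considerably more elaborate, and the function names `words` and `seed_hits` as well as the subject sequence are made up for the example.

```python
def words(query, k=3):
    """All overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


def seed_hits(query, subject, k=3):
    """Exact word matches as (query position, subject position, word)."""
    index = {}  # word -> list of start positions in the subject
    for j, word in enumerate(words(subject, k)):
        index.setdefault(word, []).append(j)
    return [(i, j, word)
            for i, word in enumerate(words(query, k))
            for j in index.get(word, [])]


print(words("PQGEFG"))                  # ['PQG', 'QGE', 'GEF', 'EFG']
print(seed_hits("PQGEFG", "AAGEFGAA"))  # [(2, 2, 'GEF'), (3, 3, 'EFG')]
```

Each hit is a starting point from which the match would then be extended in both directions into an HSP.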
Using it: BLAST sub-programs (beside the pairwise alignments)
Blastn: search a nucleotide database using a nucleotide query
What can my sequence be? Close relatives (What are the sequences similar to my sequence?)
BlastP: search a protein database using a protein query
What can my protein be? (What are the proteins similar to my sequence?)
Find conserved domains in the query
Find members of the protein family
Blastx: search a protein database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Protein sequences: more conserved!
Tblastn: search a translated nucleotide database using a protein query
Find similar proteins, e.g. in not yet annotated DNA sequences
Tblastx: search a translated nucleotide database using a translated nucleotide query
Find coding sequences in a piece of genomic DNA
Note: translation is done in all 6 frames, and all of these are locally analyzed
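What "all 6 frames" means can be sketched as follows: three reading offsets on the given strand and three on its reverse complement. The sketch stops at splitting each frame into codons (a codon table would be needed for the actual translation); the function names and the frame labels "+1" ... "-3" are conventions assumed for this example.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base, then read the strand backwards."""
    return dna.translate(COMPLEMENT)[::-1]


def six_frames(dna):
    """Map frame label -> list of complete codons read in that frame."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            # Last index that still closes a complete codon for this offset.
            usable = len(seq) - (len(seq) - offset) % 3
            codons = [seq[i:i + 3] for i in range(offset, usable, 3)]
            frames[f"{strand}{offset + 1}"] = codons
    return frames


frames = six_frames("ATGGCC")
print(frames["+1"])  # ['ATG', 'GCC']
print(frames["-1"])  # ['GGC', 'CAT']
```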
Making Phylogenetic Trees
Aims: to show evolutionary relationships of:
genes, species (they evolve ALL the time), even computer software...
A. Budd Ji et al., 2008
The new rRNA-based animal phylogeny (1995)
Halanych et al., Adoutte et al., after 1995
[Figure labels: Mollusca, Annelida, Platyhelminthes; Protostomia, Deuterostomia]
rRNA:
-present in all creatures
-slow/fast evolving parts
-secondary structure
The new rRNA-based Tree of Life
Woese et al., after 1990
Description of Evolutionary Trees
Internal nodes: hypothetical ancestral organisms/genes
Terminal nodes: existing organisms/genes (Operational Taxonomic Units, OTU)
Root: the last common ancestor of the entire group
Sister groups: on either side of a split, with a common ancestor and no additional descendants
Monophyletic group: a group containing an ancestor (the most recent common ancestor of the group) and all of its descendants
Only the terminal nodes (OTU) exist right now
They all evolve in time!
Description of Phylogenetic Trees
Cladogram: branch length unscaled
Phylogram: branch length = amount of evolutionary divergence (horizontals don't count)
Biology: sequences and organisms evolve all the time
Molecular clocks: can the # of mutations of a DNA or protein sequence be correlated with the time elapsed since the Last Common Ancestor?
If the rate is approximately constant: YES
Different rates - different advantages
GENE PHYLOGENY ≠ SPECIES PHYLOGENY
too fast: saturation - back mutations - underestimation of the distance:A-B-C-D-E-A-...
calibration: some dated fossil records are needed
Building Phylogenetic trees: A very simple example:
(a) MSATHC
(b) ITATHC
(c) ITAGHC
(d) LTAAHC
Mutations: (a)<>(b)<>(c)
Mutations: (a)<>(d)<>(b)
A "rooted" tree: an assumption for the common ancestor
source: Gene Cloning
One step-one mutation
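The number of steps between the four example sequences can be counted directly, since they all have the same length and need no gaps. A minimal sketch (the helper name `hamming` and the dictionary layout are assumptions for this illustration):

```python
from itertools import combinations

sequences = {"a": "MSATHC", "b": "ITATHC", "c": "ITAGHC", "d": "LTAAHC"}


def hamming(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(s, t))


# Pairwise number of differing positions (one step = one mutation).
dist = {(p, q): hamming(sequences[p], sequences[q])
        for p, q in combinations(sequences, 2)}

for pair, d in dist.items():
    print(pair, d)
```

The smallest value is between (b) and (c), which differ at a single position, so these two end up as closest neighbours in the tree.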
Building Phylogenetic Trees
Distance-based methods
"Distance" of two sequences: a "metric" in mathematics (4 axioms)
several ones: Euclidean... non-Euclidean...
Green: Euclidean distance
Others: "Manhattan" distance
A distance between two strings: Levenshtein distance d(IJ)
the minimum number of edits (insertion, deletion, or substitution of a single character) needed to transform one string into the other
Its calculation: intuitively easy, practically complicated (dynamic programming)
Levenshtein distance: a special case of the score: gaps/mismatches/missing ends: 1; matches: 0
Example: d(kitten/6/, sitting/7/)=3
kitten → sitten ('s' for 'k')
sitten → sittin ('i' for 'e')
sittin → sitting (insert 'g')
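The dynamic programming calculation can be sketched as follows. This is the standard row-by-row formulation; the function name `levenshtein` is chosen for the illustration.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn s into t."""
    # prev[j] = distance between the first i-1 letters of s and the first j letters of t
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning i letters into an empty string: i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```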
The Building of a Tree: the UPGMA method
Sequential clustering method
Starting point: distance matrix [d(IJ)] of the sequences (triangular matrix)
\[
\begin{pmatrix}
0 \\
d(S_1,S_2) & 0 \\
\vdots & & \ddots \\
d(S_1,S_n) & \cdots & & 0
\end{pmatrix}
\]
-grouping the pair of strings with the smallest pairwise distance
-node "placed" at half of the distance
-creating a reduced matrix: these two are "joined" and the distances are re-calculated:
\[
d(X, BF) = \frac{d(X,B) + d(X,F)}{2}
\]
http://www.nmsr.org/upgma.htm
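The clustering loop can be sketched as below. This is an illustrative implementation, not the code behind the linked page: the function name `upgma`, the nested-tuple tree representation and the example distances (the number of differing positions between four toy sequences a, b, c, d) are assumptions. The re-calculated distance is weighted by cluster size, which reduces to the plain average of two distances when both joined clusters are single sequences.

```python
from itertools import combinations


def upgma(names, dist):
    """names: list of labels; dist: frozenset({x, y}) -> distance.
    Returns the list of merges as (joined cluster, node height)."""
    clusters = {name: 1 for name in names}  # cluster -> number of sequences in it
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # 1) find the pair of clusters with the smallest distance
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2  # 2) node placed at half the distance
        merged = (a, b)
        na, nb = clusters.pop(a), clusters.pop(b)
        # 3) reduced matrix: distance of every other cluster to the joined one
        for x in clusters:
            d[frozenset((x, merged))] = (
                na * d[frozenset((x, a))] + nb * d[frozenset((x, b))]
            ) / (na + nb)
        clusters[merged] = na + nb
        merges.append((merged, height))
    return merges


names = ["a", "b", "c", "d"]
raw = {("a", "b"): 2, ("a", "c"): 3, ("a", "d"): 3,
       ("b", "c"): 1, ("b", "d"): 2, ("c", "d"): 2}
dist = {frozenset(pair): value for pair, value in raw.items()}
merges = upgma(names, dist)
for cluster, height in merges:
    print(cluster, height)
```

With these distances, b and c are joined first, then d, and finally a.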
FIRST DO AN ALIGNMENT
Pre-processing: eliminate obviously wrong sequence regions
e.g. "forgotten" introns when investigating proteins, AG|GT...AG|G
e.g. obvious sequencing errors
e.g. bad sequences
Correct the distances for multiple substitutions (homoplasy) (measure of distances: change/site, 0…1)
Problems might still appear!...
Rooting a tree
Rooting induces a directionality
>"automatically" done by several software packages: midpoint rooting (root on the branch at equal distance from the two most distant OTU-s)
>"by hand": by choosing an "outgroup": a homologous, but "quite far" sequence
root: on the branch between the tree and the outgroup
Building a tree
use a program...
[Figure: bootstrap resampling. The sites of an alignment of four taxa (Taxa1 to Taxa4) are resampled with replacement into the datasets Sample1, Sample2, Sample3, ..., Sample99, Sample100; example bootstrap value shown: 60%]
T. Larsson
Bootstrap values: how robust is our tree?
The result:
...
Randomizing the sequences a bit... does the tree/subtree persist?
Larger % - better
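The resampling step can be sketched as follows. The alignment columns (sites) are drawn with replacement, and a user-supplied check is run on every resampled dataset. Everything named here is hypothetical: `bootstrap_alignment`, `bootstrap_support`, and the stand-in check `b_c_are_closest`, which replaces a real tree-building program by simply asking whether the second and third sequences are still among the closest pairs.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_sites = len(alignment[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def bootstrap_support(alignment, has_group, n_samples=100, seed=0):
    """Percentage of resampled datasets in which has_group(...) is True."""
    rng = random.Random(seed)
    hits = sum(has_group(bootstrap_alignment(alignment, rng))
               for _ in range(n_samples))
    return 100.0 * hits / n_samples


def hamming(s, t):
    return sum(x != y for x, y in zip(s, t))


def b_c_are_closest(aln):
    """Stand-in for 'the grouping persists': rows 1 and 2 have the minimal distance."""
    pairs = {(i, j): hamming(aln[i], aln[j])
             for i in range(len(aln)) for j in range(i + 1, len(aln))}
    return pairs[(1, 2)] == min(pairs.values())


alignment = ["MSATHC", "ITATHC", "ITAGHC", "LTAAHC"]
support = bootstrap_support(alignment, b_c_are_closest)
print(f"support: {support:.0f}%")
```

In practice the check would build a tree from each resampled dataset and test whether a given subtree is present.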