Bio info statistical-methods[1]

Statistical methods in Bioinformatics


Dot Matrix
•First described by Gibbs and McIntyre (1970)
•Dot matrix analysis of DNA sequence (W=11, S=7)
[Figure: Phage P22 c2 repressor plotted against Phage lambda cI]

Dot Matrix
•Dot matrix analysis of amino acid sequence (W=1, S=1)
[Figure: Phage lambda cI plotted against Phage P22 c2 repressor]

Filtering in Dot Matrix
•Filtering can be applied using sliding windows

           Window size   Match requirement (Stringency)
DNA        15            10
Protein    2 or 3        2

•For DNA: long windows, high stringency
•For proteins: short windows, low stringency
•For protein domains: long windows, low stringency
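The window and stringency filter can be sketched in a few lines of Python. This is only an illustration, not the method used by any of the programs named in these slides. The function names `dot_matrix` and `render` are hypothetical, and placing the dot at the start of the window (rather than its centre) is an assumption.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) positions that receive a dot.

    A dot is placed at (i, j) when at least `stringency` of the `window`
    character pairs starting at seq_a[i] and seq_b[j] match.
    Assumption: the dot is recorded at the window start, not its centre.
    """
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_a across the top, seq_b down the side."""
    lines = ["  " + seq_a]
    for j, ch in enumerate(seq_b):
        row = "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        lines.append(ch + " " + row)
    return "\n".join(lines)


if __name__ == "__main__":
    a, b = "GATTACAGATTACA", "GATTACA"
    # Unfiltered plot (W=1, S=1) versus a filtered one (W=3, S=3).
    print(render(a, b, dot_matrix(a, b)))
    print()
    print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))
```

Running the script prints both plots. With W=1, S=1 every chance single-letter match is drawn, while the W=3, S=3 plot keeps mainly the diagonals.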

Dot Matrix Programs

•DNA Strider
•DOTTER
•COMPARE
•DOTPLOT

For sequence repeats:
•LALIGN
•PLALIGN

LALIGN/PLALIGN

Dot plot for Repeat analysis (Window=1, Stringency=1)

Dot plot for Repeat analysis (Window=23, Stringency=7)

Dynamic programming
•Compares every pair of characters in the two sequences and generates an alignment

•Alignment includes matches, mismatches and gaps

•Alignments obtained depend on the choice of scoring system
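The three points above can be illustrated with a minimal global alignment sketch in the style of Needleman and Wunsch. The function name `global_align` and the scoring values (match +1, mismatch -1, gap -2 per position) are assumptions chosen for illustration. A different scoring system may give a different alignment.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment.

    Assumed scoring: a fixed match/mismatch score and a linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First row and column: aligning a prefix against nothing costs gaps only.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    def pair(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the matrix: every pair of characters is compared once.
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # match or mismatch
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[len(a)][len(b)], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    s, x, y = global_align("ACGT", "AGT")
    print(s)
    print(x)
    print(y)
```

When several alignments share the best score, this traceback returns only one of them (it prefers the diagonal move).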

Programs for alignment of sequences

Scoring using Gap penalty

Derivation of Dynamic programming algorithm

Dynamic programming Algorithm


Formal description of Algorithm
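The slide content is not preserved in this transcript. As a hedged reminder, the recurrence usually given for dynamic programming alignment can be written as below. The symbols are assumptions, not taken from the slide: S is the score matrix, s(a_i, b_j) is the substitution score for the characters at positions i and j, and w_x is the penalty for a gap of length x (for example w_x = g + rx with an opening cost g and an extension cost r).

```latex
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + s(a_i, b_j) \\
\max_{x \ge 1} \left( S_{i-x,j} - w_x \right) \\
\max_{y \ge 1} \left( S_{i,j-y} - w_y \right)
\end{cases}
```

The first case extends the alignment by one aligned pair. The other two end it with a gap of length x or y in one of the sequences.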

Global and Local alignments
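A local alignment in the style of Smith and Waterman differs from the global case in two ways: a cell score may not fall below zero, and the result is the best cell anywhere in the matrix rather than the final corner. The sketch below returns only the score. The name `local_alignment_score` and the scoring values are assumptions for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best local alignment score of a and b.

    Assumed scoring: fixed match/mismatch scores and a linear gap penalty.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous matrix row; row 0 is all zeros
    for i in range(1, len(a) + 1):
        curr = [0]             # column 0 is zero in every row
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The extra 0 lets an alignment restart anywhere.
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # The shared core "ACGT" is found despite unrelated flanking sequence.
    print(local_alignment_score("TTTACGTAAA", "CCCACGTGGG"))
```

Only two rows are kept in memory because no traceback is done here. Recovering the aligned region itself would need the full matrix.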


Scoring matrices

•Certain amino acid substitutions are common in related proteins from different species

•Proteins still function with these substitutions


•The probability of changing A → B is identical to that of B → A

PAM (Percent Accepted Mutation)

•Based on evolutionary principles

•Each matrix gives the changes expected for a given period of evolutionary time

•Each change at a particular site is assumed to be independent of previous mutational events

•Estimations are based on 1572 changes in 71 groups of protein sequences that were at least 85% similar


•The PAM1 matrix estimates the rate of substitution that would be expected if 1% of the amino acids had changed
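Because each change is assumed independent of earlier ones, matrices for longer evolutionary periods are conventionally obtained by multiplying the PAM1 probability matrix by itself (PAM250 is PAM1 raised to the 250th power). The sketch below shows the idea on a hypothetical two-letter alphabet. `TOY_PAM1` is made-up data, not the real 20 x 20 matrix, and the helper names are assumptions.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, n):
    """Raise a square matrix to a non-negative integer power."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical PAM1-like matrix: each letter is unchanged with probability 0.99.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

if __name__ == "__main__":
    toy_pam250 = mat_pow(TOY_PAM1, 250)
    for row in toy_pam250:
        print([round(p, 4) for p in row])
```

After 250 steps the toy probabilities are close to 0.5 each, which illustrates why high-numbered PAM matrices suit distantly related sequences: most of the original signal has been overwritten by repeated change.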

Similarity   Matrix used
40%          PAM120
50%          PAM80
60%          PAM60
14-27%       PAM250

BLOSUM (Blocks Amino acid Substitution Matrices)

Matrix values are based on amino acid substitutions in a large set of ~2000 conserved amino acid patterns (blocks)

Note: patterns are found by the MOTIF program

BLOSUM – Derivation of the Matrix values
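The slide content is not preserved here. As a hedged sketch, BLOSUM-style values are commonly described as log-odds scores: the observed frequency of an aligned pair in the blocks divided by the frequency expected by chance, on a log base 2 scale, multiplied by 2 and rounded (half-bit units, as for BLOSUM62). The pair counts below are invented for a two-letter alphabet, and `log_odds_scores` is a hypothetical name.

```python
import math


def log_odds_scores(pair_counts):
    """Turn counts of unordered aligned pairs into rounded half-bit scores.

    pair_counts maps (x, y) tuples, one entry per unordered pair, to counts.
    Every count must be positive, otherwise the logarithm is undefined.
    """
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Frequency of each letter: p_x = q_xx + half of every mixed pair with x.
    p = {}
    for (x, y), freq in q.items():
        if x == y:
            p[x] = p.get(x, 0.0) + freq
        else:
            p[x] = p.get(x, 0.0) + freq / 2
            p[y] = p.get(y, 0.0) + freq / 2

    scores = {}
    for (x, y), freq in q.items():
        # Chance expectation: p_x^2 for identical pairs, 2 p_x p_y otherwise.
        expected = p[x] * p[y] if x == y else 2 * p[x] * p[y]
        scores[(x, y)] = round(2 * math.log2(freq / expected))
    return scores


if __name__ == "__main__":
    toy_counts = {("A", "A"): 60, ("A", "B"): 20, ("B", "B"): 20}  # invented data
    print(log_odds_scores(toy_counts))
```

Pairs seen more often than chance predicts get positive scores, and pairs seen less often get negative scores.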

PAM 250

BLOSUM62

BLAST home page

BLAST

BLAST results
