Homology and sequence alignment.

Homology = similarity between objects due to common ancestry

Hund = Dog, Schwein = Pig

Sequence homology

VLSPAVKWAKVGAHAAGHG
||| || |||| |  ||||
VLSEAVLWAKVEADVAGHG

Similarity between sequences as a result of common ancestry.

Sequence alignment

Alignment: Comparing two (pairwise) or more (multiple) sequences. Searching for a series of identical or similar characters in the sequences.

Why align?

VLSPAVKWAKV
||| || ||||
VLSEAVLWAKV

1. To detect whether two sequences are homologous. If so, homology may indicate similarity in function (and structure).

2. Required for evolutionary studies (e.g., tree reconstruction).

3. To detect conservation (e.g., a tyrosine that is evolutionarily conserved is more likely to be a phosphorylation site).

Sequence alignment

If two sequences share a common ancestor (for example, human and dog hemoglobin), we can represent their evolutionary relationship using a tree.

VLSPAV-WAKV
||| || ||||
VLSEAVLWAKV

VLSPAV-WAKV
VLSEAVLWAKV

Perfect match

A perfect match suggests that no change has occurred from the common ancestor (although this is not always the case).

A substitution

A substitution suggests that at least one change has occurred since the common ancestor (although we cannot say in which lineage it has occurred).

VLSPAV-WAKV

VLSEAVLWAKV

Option 1: The ancestor had L and it was lost here. In such a case, the event was a deletion.

VLSEAVLWAKV

VLSPAV-WAKV

VLSEAVWAKV

Option 2: The ancestor was shorter and the L was inserted here. In such a case, the event was an insertion.

VLSEAVLWAKV

VLSPAV-WAKV

Normally, given only two sequences, we cannot tell whether the event was an insertion or a deletion, so we call it an indel.

VLSEAVLWAKV

Deletion? Insertion?

Global vs. Local

• Global alignment – finds the best alignment across the entire length of both sequences.

• Local alignment – finds regions of similarity in parts of the sequences.

Global alignment: forces alignment in regions which differ

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

Local alignment: returns only regions of good alignment

ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ

Global alignment

PTK2 protein tyrosine kinase 2 of human and rhesus monkey

Proteins are composed of domains

[Figure: domain architecture of human PTK2, showing Domain A, Domain B and the protein tyrosine kinase domain]

In leukocytes, a different gene for tyrosine kinase is expressed.

[Figure: domain architectures of leukocyte TK and PTK2 compared; labels: Domain X, Domain A, Domain B, protein tyrosine kinase domain]

The sequence similarity is restricted to a single domain

Global alignment of PTK and LTK

Local alignment of PTK and LTK

Conclusions

Use global alignment when the two sequences share the same overall sequence arrangement.

Use local alignment to detect regions of similarity.
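The difference between the two modes can be sketched with a small dynamic-programming scorer. This is only an illustration, not necessarily what any particular tool does: it uses the textbook Needleman-Wunsch (global) and Smith-Waterman (local) recurrences with a linear gap penalty, and the function name `align_score` and the scores +1 / -2 / -1 are assumptions chosen for the example.

```python
def align_score(a, b, match=1, mismatch=-2, gap=-1, local=False):
    """Best alignment score of a and b under a linear gap penalty.

    local=False: global score (every position of both sequences is aligned).
    local=True:  local score (best-scoring pair of sub-sequences, never below 0).
    """
    cols = len(b) + 1
    # A global alignment pays for leading gaps; a local alignment starts for free.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)      # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


# The two sequences from the global vs. local slide (gaps removed).
s1, s2 = "ADLGAVFALCDRYFQ", "ADLGRTQNCDRYYQ"
print("global:", align_score(s1, s2))
print("local: ", align_score(s1, s2, local=True))
```

The local score can never be lower than the global one, because the local mode is free to drop the poorly matching regions that the global mode is forced to align.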

How alignments are computed

Pairwise alignment

AAGCTGAATTCGAA
AGGCTCATTTCTGA

One possible alignment:

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

This alignment includes: 10 perfect matches, 2 mismatches, 4 indels (gaps)

Choosing an alignment for a pair of sequences

Many different alignments are possible for 2 sequences:

AAGCTGAATTCGAA
AGGCTCATTTCTGA

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Which alignment is better?

Scoring system (naïve)

Perfect match: +1

Mismatch: -2

Indel (gap): -1

AAGCTGAATT-C-GAA
AGGCT-CATTTCTGA-

Score = (+1)x10 + (-2)x2 + (-1)x4 = 2

A-AGCTGAATTC--GAA
AG-GCTCA-TTTCTGA-

Score = (+1)x9 + (-2)x2 + (-1)x6 = -1

Higher score → better alignment

Alignment scoring - scoring of sequence similarity:

Assumes independence between positions: each position is considered separately

Scores each position:
• Positive if identical (match)
• Negative if different (mismatch or gap)

Total score = sum of position scores. Can be positive or negative
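The position-by-position scoring described above can be sketched in a few lines. The function name `score_alignment` is an assumption made for this example; the scores are the naive +1 / -2 / -1 scheme from the slides, and the two alignments are the ones shown for the pair AAGCTGAATTCGAA / AGGCTCATTTCTGA.

```python
def score_alignment(row1, row2, match=1, mismatch=-2, gap=-1):
    """Sum independent per-column scores of a gapped pairwise alignment."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            total += gap        # indel column
        elif x == y:
            total += match      # identical characters
        else:
            total += mismatch   # substitution
    return total


print(score_alignment("AAGCTGAATT-C-GAA", "AGGCT-CATTTCTGA-"))    # 2
print(score_alignment("A-AGCTGAATTC--GAA", "AG-GCTCA-TTTCTGA-"))  # -1
```

Under this scheme the first alignment scores higher and would be preferred; a different scoring scheme could reverse the ranking.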

Scoring system

• In the example above, the choice of +1 for match, -2 for mismatch, and -1 for gap is quite arbitrary

• Different scoring systems → different alignments

• We want a good scoring system…

DNA scoring matrices

Can take into account biological phenomena such as:

• Transition-transversion

Amino-acid scoring matrices

• Take into account physico-chemical properties

Scoring gaps (I)

In advanced algorithms, two separate gaps of one amino acid each are scored differently from a single gap of two amino acids. This is done by charging a penalty for each gap that is opened, plus a smaller penalty for each position by which the gap is extended.

Gap extension penalty < Gap opening penalty
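A minimal sketch of such an open-plus-extend (affine) gap penalty follows. The function name and the values 10 and 1 are assumptions chosen only for illustration, and conventions differ between programs (some charge the extension penalty for the first gapped position as well).

```python
def affine_gap_penalty(length, gap_open=10, gap_extend=1):
    """Penalty for ONE gap spanning `length` positions.

    The opening penalty is paid once; each further position pays only
    the (smaller) extension penalty.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


two_single_gaps = 2 * affine_gap_penalty(1)  # two gaps of one residue each
one_double_gap = affine_gap_penalty(2)       # one gap of two residues
print(two_single_gaps, one_double_gap)       # 20 11
```

With these example values a single longer gap is cheaper than several short ones, which matches the idea that one indel event can insert or delete several residues at once.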

Homology versus chance similarity

How to check if the score is significant?

A. Take the two sequences → compute their alignment score.

B. Take one of the sequences and randomly shuffle it → compute its score against the second sequence. Repeat 100,000 times.

If the score in A is in the top 5% of the scores in B, the similarity is significant.
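The shuffling test can be sketched as follows. This is an illustration under stated assumptions: the names `local_score` and `shuffle_p_value` are made up for the example, the score is a simple local alignment score with the naive +1 / -2 / -1 scheme, and the default number of shuffles is kept far below 100,000 so the example runs quickly.

```python
import random


def local_score(a, b, match=1, mismatch=-2, gap=-1):
    """Best local alignment score (linear gap penalty, never below 0)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
        best = max(best, max(curr))
        prev = curr
    return best


def shuffle_p_value(seq1, seq2, n_shuffles=1000, seed=0):
    """Return (observed score, fraction of shuffles scoring at least as high)."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    observed = local_score(seq1, seq2)
    letters = list(seq1)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)           # same composition, random order
        if local_score("".join(letters), seq2) >= observed:
            at_least += 1
    # Adding 1 to both counts avoids reporting a p-value of exactly zero.
    return observed, (at_least + 1) / (n_shuffles + 1)


score, p = shuffle_p_value("VLSPAVKWAKVGAHAAGHG", "VLSEAVLWAKVEADVAGHG")
print(score, p)
```

A returned fraction of 0.05 or less corresponds to the "top 5%" criterion. Shuffling keeps the amino-acid composition of the sequence fixed, so the test asks whether the order of the residues, not just their frequencies, explains the score.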

How close?

• Rule of thumb:

• Proteins are homologous if they are at least 25% identical (length >100)

• DNA sequences are homologous if they are at least 70% identical

Twilight zone

• < 25% identity in proteins: the sequences may or may not be homologous.

• (Note that about 5% identity is expected purely by chance!)

Searching a sequence database

Idea: To find sequences homologous to a sequence of interest, compute its pairwise alignment against every known sequence in a database, and report the best-scoring significant hits.

The same idea in short: Use your sequence as a query to find homologous sequences in a sequence database.

Some terminology

• Query sequence - the sequence with which we are searching

• Hit – a sequence found in the database, suspected as homologous

Query sequence: DNA or protein?

• For coding sequences, we can use the DNA sequence or the protein sequence to search for similar sequences.

• Which is preferable?

Protein is better!

• Selection (and hence conservation) works (mostly) at the protein level:

CTTTCA = Leu-Ser
TTGAGT = Leu-Ser
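The example can be checked with a short sketch. The dictionary below is a four-codon subset of the standard genetic code, just enough for these two sequences; the helper names `translate` and `identity` are assumptions for the illustration.

```python
# Subset of the standard genetic code covering only the codons in the example.
CODONS = {"CTT": "Leu", "TTG": "Leu", "TCA": "Ser", "AGT": "Ser"}


def translate(dna):
    """Translate a DNA string codon by codon (reading frame starts at position 0)."""
    return "-".join(CODONS[dna[i:i + 3]] for i in range(0, len(dna), 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna1, dna2 = "CTTTCA", "TTGAGT"
print(translate(dna1), translate(dna2))  # Leu-Ser Leu-Ser
print(identity(dna1, dna2))              # only 1 of the 6 nucleotides is shared
```

The two DNA sequences agree at a single position, yet they encode the same dipeptide, which is why a protein-level comparison can detect similarity that a DNA-level comparison misses.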

Query type

• Nucleotides: a four letter alphabet

• Amino acids: a twenty letter alphabet

• Two random DNA sequences will, on average, have 25% identity

• Two random protein sequences will, on average, have 5% identity
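These two percentages follow from the alphabet sizes (1/4 and 1/20) if every letter is assumed to be equally likely and the sequences are compared position by position without gaps. A small simulation under those assumptions is sketched below; the function name and parameters are made up for the example, and real sequences have uneven letter frequencies, so the actual chance level differs somewhat.

```python
import random

DNA = "ACGT"
PROTEIN = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def mean_random_identity(alphabet, length=1000, pairs=200, seed=0):
    """Average ungapped identity between pairs of uniformly random sequences."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(pairs):
        a = rng.choices(alphabet, k=length)
        b = rng.choices(alphabet, k=length)
        total += sum(x == y for x, y in zip(a, b)) / length
    return total / pairs


print(mean_random_identity(DNA))      # close to 1/4
print(mean_random_identity(PROTEIN))  # close to 1/20
```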

Conclusion

The amino-acid sequence is often preferable for homology search

How do we search a database?

• If each pairwise alignment takes 1/10 of a second, and if the database contains 10^7 sequences, it will take 10^6 seconds ≈ 11.6 days to complete one search.

• 150,000 searches (at least!!) are performed per day. >82,000,000 sequence records in GenBank.

Conclusion

• Running the exact pairwise alignment algorithm between the query and every database entry is too slow

Heuristic

• Definition: a heuristic is a method that does not guarantee an exact solution (but gives one that is not too bad) and in return reduces the time needed compared with the exact solution

• BLAST - Basic Local Alignment Search Tool

• A heuristic for searching a database for similar sequences

DNA or Protein

• All types of searches are possible

Query: DNA Protein

Database: DNA Protein

blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database

Translated databases: TrEMBL, GenPept

BLAST - underlying hypothesis

• The underlying hypothesis: when two sequences are similar there are short ungapped regions of high similarity between them

• The heuristic:

1. Discard irrelevant sequences

2. Perform exact local alignment only with the remaining sequences

How do we discard irrelevant sequences quickly?

• Divide the database into words of length w (default: w = 3 for protein and w = 7 for DNA)

• Save the words in a look-up table that can be searched quickly

WTDFGYPAILKGGTAC

WTD, TDF, DFG, FGY, GYP, …

BLAST: discarding sequences

• When the user enters a query sequence, it is also divided into words

• Search the database for consecutive neighboring words

Neighbor words

• Neighbor words are defined according to a scoring matrix (e.g., BLOSUM62 for proteins) with a certain cutoff level

GFC (20)

GPC (11)

WAC (5)

E-value

• The number of times we would theoretically expect to find an alignment with a score ≥ Y when comparing a random sequence against a random database

Theoretically, we could trust any result with an E-value ≤ 1

In practice, BLAST uses estimations:

• E-values of 10^-4 and lower indicate a significant homology.

• E-values between 10^-4 and 10^-2 should be checked (similar domains, maybe non-homologous).

• E-values between 10^-2 and 1 do not indicate a good homology.

Web servers for pairwise alignment

BLAST 2 sequences (bl2Seq) at NCBI

Produces the local alignment of two given sequences using the BLAST (Basic Local Alignment Search Tool) engine

• Does not use an exact algorithm but a heuristic

Back to NCBI

BLAST – bl2seq

Bl2Seq - query

blastn – nucleotide; blastp – protein

Bl2seq results

[Figure legend: Match, Dissimilarity, Gaps, Similarity, Low complexity]

BLAST – programs

Query: DNA Protein

Database: DNA Protein

BLAST – Blastp

Blastp - results

Blastp – results (cont’)

Blast scores:

• Bit score – A normalized score for the alignment, based on the number of identities, similarities, gaps, etc.

• Expect value (E-value) – The number of alignments with a score at least this high that one can “expect” to see by chance when searching a random database of a particular size. The closer the E-value is to zero, the greater the confidence that the hit is really a homolog.
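The link between the two scores can be sketched with the standard Karlin-Altschul relation E = m · n · 2^(-S'), where S' is the bit score, m the query length and n the total database length. The function name and the example lengths below are assumptions for the illustration; real BLAST output uses effective (edge-corrected) lengths, so reported E-values will not match this simple formula exactly.

```python
def expected_hits(bit_score, query_len, db_len):
    """Approximate E-value for a bit score: E = m * n * 2**(-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical search: a 300-residue query against 10^9 database residues.
for bits in (30, 50, 70):
    print(bits, expected_hits(bits, 300, 10**9))
```

Each additional bit halves the E-value, and the same bit score becomes less significant as the database grows.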

Blastp – acquiring sequences

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGTSSNIGS--ITVNWYQQLPG
LRLSCSSSGFIFSS--YAMYWVRQAPG
LSLTCTVSGTSFDD--YYSTWVRQPPG
PEVTCVVVDVSHEDPQVKFNWYVDG--
ATLVCLISDFYPGA--VTVAWKADS--
AALGCLVKDYFPEP--VTVSWNSG---
VSLTCLVKGFYPSD--IAVEWWSNG--

Similar to pairwise alignment BUT n sequences are aligned instead of just 2

Multiple sequence alignment

MSA = Multiple Sequence Alignment

Each row represents an individual sequence

Each column represents the ‘same’ position

Multiple sequence alignment

Conserved positions

• Columns in which all the sequences contain the same amino acids or nucleotides

• Important for the function or structure

VTISCTGSSSNIGAG-NHVKWYQQLPG
VTISCTGSSSNIGS--ITVNWYQQLPG
LRLSCTGSGFIFSS--YAMYWYQQAPG
LSLTCTGSGTSFDD-QYYSTWYQQPPG
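Finding such fully conserved columns is straightforward once the alignment is stored as equal-length rows. The sketch below uses the four-sequence alignment shown above; the function name `conserved_columns` and the choice of 0-based column indices are assumptions for the example.

```python
def conserved_columns(rows):
    """Return 0-based indices of columns where all rows share one non-gap character."""
    return [
        i
        for i, column in enumerate(zip(*rows))
        if len(set(column)) == 1 and column[0] != "-"
    ]


msa = [
    "VTISCTGSSSNIGAG-NHVKWYQQLPG",
    "VTISCTGSSSNIGS--ITVNWYQQLPG",
    "LRLSCTGSGFIFSS--YAMYWYQQAPG",
    "LSLTCTGSGTSFDD-QYYSTWYQQPPG",
]
columns = conserved_columns(msa)
print(columns)
print("".join(msa[0][i] for i in columns))  # the conserved residues themselves
```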

Consensus sequence

A T C T T G T

A A C T T G T

A A C T T C T

Consensus: A A C T T G T

A consensus sequence holds the most frequent character of the alignment at each column
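A minimal sketch of this definition, using the three aligned sequences from the slide. The function name `consensus` is an assumption, and ties between equally frequent characters are broken here by whichever character appears first in the column, which is an arbitrary choice.

```python
from collections import Counter


def consensus(rows):
    """Most frequent character in each column of an alignment."""
    return "".join(
        Counter(column).most_common(1)[0][0]  # ties: first character seen wins
        for column in zip(*rows)
    )


print(consensus(["ATCTTGT", "AACTTGT", "AACTTCT"]))  # AACTTGT
```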

Profile = PSSM = Position Specific Score Matrix

A T C T T G

A A C T T G

A A C T T C

      1     2     3     4     5     6
A     1     0.67  0     0     0     0
C     0     0     1     0     0     0.33
G     0     0     0     0     0     0.67
T     0     0.33  0     1     1     0
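The table above holds the frequency of each nucleotide in each column of the three aligned sequences, and it can be reproduced with a few lines. The function name `frequency_profile` is an assumption; note also that profiles used in practice usually go a step further and convert such frequencies into log-odds scores with pseudocounts, which is not shown here.

```python
def frequency_profile(rows, alphabet="ACGT"):
    """Per-column relative frequency of each character in an ungapped alignment."""
    n_rows = len(rows)
    n_cols = len(rows[0])
    return {
        ch: [sum(row[i] == ch for row in rows) / n_rows for i in range(n_cols)]
        for ch in alphabet
    }


profile = frequency_profile(["ATCTTG", "AACTTG", "AACTTC"])
for ch, freqs in profile.items():
    print(ch, [round(f, 2) for f in freqs])
```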

Alignment methods

No exact method for finding the optimal MSA is practical for realistic numbers of sequences, so all commonly used methods are heuristics:

• Progressive/hierarchical alignment (Clustal)

• Iterative alignment (MAFFT, MUSCLE)

Progressive alignment

First step:

Compute the pairwise alignments for all against all (6 pairwise alignments). The similarities are converted to distances and stored in a table

A B C D E

C 15 17

D 16 14 10

E 32 31 31 32

Second step:

Cluster the sequences to create a tree (guide tree):

• represents the order in which pairs of sequences are to be aligned
• similar sequences are neighbors in the tree
• distant sequences are distant from each other in the tree

A B C D E

C 15 17

D 16 14 10

E 32 31 31 32

The guide tree is imprecise and is NOT the tree which truly describes the evolutionary relationship between the sequences!
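One way to turn a distance table into a guide tree is average-linkage clustering (UPGMA), sketched below. The slides do not say which clustering method is used, so this is only one possible choice. The distances are those in the table, except that the table shows no value for the pair A-B: the value 12 used here is purely an assumption to make the example runnable. The function name `upgma_merge_order` is also made up for the illustration.

```python
from itertools import combinations


def upgma_merge_order(names, dist):
    """Return the merges (cluster1, cluster2, distance) in the order UPGMA makes them.

    `dist` maps frozenset({x, y}) to the distance between sequences x and y.
    """
    clusters = [(name,) for name in names]

    def average_distance(c1, c2):
        # UPGMA linkage: mean distance over all pairs of members.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


raw = {
    ("A", "B"): 12,  # ASSUMED value: this pair is missing from the slide's table
    ("A", "C"): 15, ("B", "C"): 17,
    ("A", "D"): 16, ("B", "D"): 14, ("C", "D"): 10,
    ("A", "E"): 32, ("B", "E"): 31, ("C", "E"): 31, ("D", "E"): 32,
}
dist = {frozenset(pair): d for pair, d in raw.items()}
merges = upgma_merge_order("ABCDE", dist)
for c1, c2, d in merges:
    print("+".join(c1), "joins", "+".join(c2), "at distance", d)
```

The merge order is the order in which a progressive method would align: the closest pair first, the most distant sequence last.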

Third step:

1. Align the most similar (neighboring) pairs (sequence to sequence)

2. Align pairs of pairs (profile to profile)

3. Align the remaining sequence (E) to the profile (sequence to profile)

Main disadvantages:

• Sub-optimal tree topology

• Misalignments resulting from globally aligning pairs of sequences.

Iterative alignment

Guide tree

Pairwise distance table

Iterate until the MSA does not change (convergence)

Case study: Using homology searching

• The human kinome

Kinases and phosphatases

Multi-tasking enzymes

• Signal transduction
• Metabolism
• Transcription
• Cell-cycle
• Differentiation
• Function of nervous and immune system
• … and more

How many kinases in the human genome?

• 1950’s, discovery that reversible phosphorylation regulates the activity of glycogen phosphorylase

• 1970’s, advent of cloning and sequencing produced a speculation that the vertebrate genome encodes as many as 1,001 kinases

• 2001 – human genome sequence …

• In addition: the GenBank, Swiss-Prot, and dbEST databases

• How can we find out how many kinases are out there?

How many kinases in the human genome?

The human kinome

• In 2002, Manning, Whyte, Martinez, Hunter and Sudarsanam set out to:

1. Search and cross-reference all these databases for all kinases

2. Characterize all found kinases

ePKs and aPKs

• ePKs: eukaryotic protein kinases (the majority). Sequence homology of the catalytic domain; additional regulatory domains are non-homologous

• aPKs: atypical protein kinases. No sequence homology to ePKs; some aPK subfamilies have structural similarity to ePKs

The search

• Several profiles were built, based on the catalytic domain of:

(a) 70 known ePKs from yeast, worm, fly, and human with > 50% identity in the ePK domain

(b) each subfamily of known aPKs

• HMM-profile searches and PSI-BLAST searches were performed

The results…

• 478 ePKs

• 40 aPKs

• Total of 518 kinases in the human genome (half of the prediction in the 1970’s) [1.7% of human genes]
