Matching Problems in Bioinformatics Charles Yan Fall 2008.

Matching Problems in Bioinformatics

Charles Yan

Fall 2008

Matching Problem

Given a string P (pattern) and a long string T (text), find all occurrences, if any, of P in T.

Example

T: Given a string P (pattern) and a long string T (text), find all occurrences, if any, of P in T.

P: any

Exact matching: does not allow any mismatch.

Inexact matching: allows up to k mismatches.
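The two variants can be sketched in a few lines of Python. This is a naive illustration that compares P against every window of T; it is not an efficient algorithm, and the function names `exact_matches` and `inexact_matches` are made up for this sketch. "Mismatch" is assumed here to mean a substituted character (no insertions or deletions).

```python
def exact_matches(pattern, text):
    """Return every start index where pattern occurs in text (overlaps included)."""
    m = len(pattern)
    return [i for i in range(len(text) - m + 1) if text[i:i + m] == pattern]


def inexact_matches(pattern, text, k):
    """Return every start index where pattern matches text with at most k mismatches.

    Assumption: a mismatch is a substituted character, so only windows of
    exactly len(pattern) characters are compared.
    """
    m = len(pattern)
    hits = []
    for i in range(len(text) - m + 1):
        mismatches = sum(1 for a, b in zip(pattern, text[i:i + m]) if a != b)
        if mismatches <= k:
            hits.append(i)
    return hits


# The example from the slide: T is the problem statement itself, P is "any".
T = ("Given a string P (pattern) and a long string T (text), "
     "find all occurrences, if any, of P in T.")
for start in exact_matches("any", T):
    print(start, T[start:start + 3])
```

With k = 0 the inexact version reduces to exact matching.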

Matching Problem

Unix: grep

MS Word: find

GenBank: http://www.ncbi.nlm.nih.gov/Genbank/

Human genome: http://www.ncbi.nlm.nih.gov/projects/mapview/map_search.cgi?taxid=9606

Given “TTGTTCCGGTTAAAGATGGTGAAAATTTTT”, does it appear in human genome? Where?

How about “ACCCCCAGGCGAGCATCTGACAGCCTGGAGCAGCACACACAACCCCAGGCGAG”?

Motifs

A motif is a conserved element corresponding to a certain function (or structure). Occurrence of a motif in a protein is likely to indicate that the protein has the corresponding function.

Motifs are usually represented as an alignment or as a regular expression.

Motifs

Protein function prediction using motifs

Each protein function is characterized by a single motif or by multiple motifs.

If a protein contains the motif(s), it probably has the function that the motif(s) correspond to.

A pertinent analogy is the use of fingerprints by the police for identification purposes. A fingerprint is generally sufficient to identify a given individual. Similarly, motif(s) can be used to formulate hypotheses about the function of a newly discovered protein.

PROSITE

PROSITE (http://ca.expasy.org/prosite/) is a database of protein families and domains, started in 1988.

PROSITE currently contains patterns (motifs) and profiles specific for more than a thousand protein families or domains. Release 20.36, of 22-Jul-2006, contains 1528 documentation entries.

Each of these signatures comes with documentation providing background information on the structure and function of these proteins.

PROSITE

Steps in the development of a new motif

Select a set of sequences that belong to a function family.

Make a multiple alignment.

Find a short (not more than four or five residues long) conserved sequence (core motif) which is part of a region known to be important or which includes biologically significant residue(s).
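The search for a conserved stretch can be illustrated with a small sketch. It assumes the multiple alignment is already available as equal-length strings with '-' for gaps, and it only reports columns that are identical in every sequence. A real core motif may tolerate some variation and also depends on biological knowledge about the region, which this sketch ignores. The function name `conserved_runs` and the toy alignment are hypothetical.

```python
def conserved_runs(alignment, min_len=2):
    """Return (start_column, residues) for each maximal run of columns that
    are gap-free and identical in every aligned sequence."""
    width = len(alignment[0])
    if any(len(row) != width for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    runs = []
    start = None
    # Iterate one column past the end so that a run touching the last
    # column is closed as well.
    for col in range(width + 1):
        conserved = (
            col < width
            and alignment[0][col] != "-"
            and all(row[col] == alignment[0][col] for row in alignment)
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_len:
                runs.append((start, alignment[0][start:col]))
            start = None
    return runs


# Toy alignment (made up for illustration).
toy_alignment = ["MKCGHSA",
                 "LKCGHTA",
                 "MRCGH-A"]
print(conserved_runs(toy_alignment))
```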

PROSITE

Steps in the development of a new motif (cont.)

The most recent version of the Swiss-Prot knowledgebase is then scanned with these core pattern(s). If a core motif detects all the proteins in the family and none (or very few) of the other proteins, we can stop at this stage.

In most cases we are not so lucky and we pick up a lot of extra sequences which clearly do not belong to the group of proteins under consideration. A further series of scans, involving a gradual increase in the size of the motif, is then necessary. In some cases we never manage to find a good motif.
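The scan-and-extend loop described above can be sketched as follows. The "database" here is a tiny in-memory dictionary with made-up entries, not Swiss-Prot, and the candidate motifs are written directly as Python regular expressions; `evaluate_motif` is a hypothetical helper name.

```python
import re


def evaluate_motif(regex, database, family):
    """Scan every sequence with regex and compare the hits with the known
    family membership.

    database maps a protein name to a (family, sequence) pair.
    """
    pattern = re.compile(regex)
    hits = {name for name, (_, seq) in database.items() if pattern.search(seq)}
    members = {name for name, (fam, _) in database.items() if fam == family}
    return {
        "true_positives": sorted(hits & members),
        "false_positives": sorted(hits - members),
        "missed": sorted(members - hits),
    }


# Made-up entries for illustration only.
toy_db = {
    "P1": ("kinase", "MKCGHSA"),
    "P2": ("kinase", "LKCGHTA"),
    "P3": ("other", "AACGWW"),
    "P4": ("other", "TTTCGHLL"),
}

# Gradually increase the size of the motif until the extra sequences drop out.
for candidate in ["CG", "CGH", "KCGH"]:
    print(candidate, evaluate_motif(candidate, toy_db, "kinase"))
```

In this toy data the shortest candidate picks up both unrelated proteins, the next one picks up one, and the longest detects only the family members.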

PROSITE

The motifs are described using the following conventions:

The standard IUPAC one-letter codes for the amino acids are used.

The symbol 'x' is used for a position where any amino acid is accepted.

Ambiguities are indicated by listing the acceptable amino acids for a given position between square brackets '[ ]'. For example: [ALT] stands for Ala or Leu or Thr.

Ambiguities are also indicated by listing between a pair of curly brackets '{ }' the amino acids that are not accepted at a given position. For example: {AM} stands for any amino acid except Ala and Met.

Each element in a pattern is separated from its neighbor by a '-'.

PROSITE

The motifs are described using the following conventions (cont.):

Repetition of an element of the pattern can be indicated by following that element with a numerical value or a numerical range between parentheses. Examples: x(3) corresponds to x-x-x; x(2,4) corresponds to x-x or x-x-x or x-x-x-x.

When a pattern is restricted to either the N- or C-terminal of a sequence, that pattern either starts with a '<' symbol or, respectively, ends with a '>' symbol. In some rare cases (e.g. PS00267 or PS00539), '>' can also occur inside square brackets for the C-terminal element. 'F-[GSTV]-P-R-L-[G>]' means that either 'F-[GSTV]-P-R-L-G' or 'F-[GSTV]-P-R-L>' is considered.

A period ends the pattern.

Example: [AC]-x-V-x(4)-{ED}.

This pattern is translated as: [Ala or Cys]-any-Val-any-any-any-any-{any but Glu or Asp}
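Because these conventions map closely onto regular-expression syntax, a pattern can be translated mechanically. The following is a minimal sketch that covers only the conventions listed on these slides; it is not PROSITE's own scanning software, and `prosite_to_regex` and `scan` are hypothetical names chosen for this example.

```python
import re

# One pattern element: a residue, 'x', [allowed], or {forbidden},
# optionally followed by (n) or (n,m).
ELEMENT = re.compile(
    r"(?P<core>\[[A-Z>]+\]|\{[A-Z]+\}|[A-Zx])"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    body = pattern.strip().rstrip(".")      # a period ends the pattern
    at_start = body.startswith("<")         # restricted to the N-terminal
    at_end = body.endswith(">")             # restricted to the C-terminal
    body = body.lstrip("<").rstrip(">")

    pieces = []
    for element in body.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core = m.group("core")
        if core == "x":
            piece = "."                     # any amino acid
        elif core.startswith("{"):
            piece = "[^" + core[1:-1] + "]" # any amino acid except these
        elif core.startswith("[") and ">" in core:
            # '>' inside brackets: one of the residues OR the sequence end
            residues = core[1:-1].replace(">", "")
            piece = f"(?:[{residues}]|$)" if residues else "$"
        else:
            piece = core                    # single residue or [..] class
        if m.group("lo"):
            lo, hi = m.group("lo"), m.group("hi")
            piece += "{" + lo + ("," + hi if hi else "") + "}"
        pieces.append(piece)
    return ("^" if at_start else "") + "".join(pieces) + ("$" if at_end else "")


def scan(pattern, sequence):
    """Return (start, matched_text) for every occurrence, overlaps included."""
    lookahead = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in lookahead.finditer(sequence)]


print(prosite_to_regex("[AC]-x-V-x(4)-{ED}."))
```

The lookahead in `scan` is a design choice so that overlapping occurrences are reported; a plain `re.finditer` on the translated expression would skip them.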

PROSITE

A profile or weight matrix is a table of position-specific amino acid weights and gap costs. These numbers (also referred to as scores) are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence.
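A simplified version of profile scoring can be sketched as below. It assumes an ungapped alignment, so the gap costs mentioned above are left out entirely, and it uses a made-up weight matrix; real PROSITE profiles are richer than this. The names `score_window` and `scan_profile`, the default weight, and the cut-off are all assumptions for illustration.

```python
def score_window(profile, window, default=-1):
    """Sum the position-specific weights for one ungapped window.

    profile is a list with one dict per motif position, mapping an amino
    acid to its weight; residues not listed get the default weight.
    """
    return sum(column.get(residue, default)
               for column, residue in zip(profile, window))


def scan_profile(profile, sequence, cutoff):
    """Return (start, score) for every window whose score reaches the cut-off."""
    width = len(profile)
    hits = []
    for start in range(len(sequence) - width + 1):
        score = score_window(profile, sequence[start:start + width])
        if score >= cutoff:
            hits.append((start, score))
    return hits


# Made-up three-position weight matrix for illustration.
toy_profile = [
    {"A": 2, "C": 1},
    {"V": 3},
    {"L": 1, "I": 1},
]
print(scan_profile(toy_profile, "GAVLCVI", cutoff=5))
```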

Motifs and Matching

Motif finding: given a set of protein sequences, find the motif(s) that are shared by these proteins.

Motif scanning: given a motif and a protein sequence, find the occurrences (not necessarily identical) of the motif in the protein sequence.

Motif scanning is the matching problem!
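To make the contrast concrete, here is a deliberately naive sketch of motif finding: it looks for substrings of length k that occur exactly in every input sequence. Real motif finding must allow for variation between occurrences, so this is only the exact-matching special case. The function name `shared_kmers` and the toy sequences are made up.

```python
def shared_kmers(sequences, k):
    """Return the sorted length-k substrings that occur in every sequence."""
    common = None
    for seq in sequences:
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        common = kmers if common is None else common & kmers
    return sorted(common or [])


toy_sequences = ["MKCGHSA", "LKCGHTA", "TTKCGHLL"]
print(shared_kmers(toy_sequences, 4))
```

Motif scanning then takes such a motif and looks for it in a new sequence, which is the pattern-in-text problem stated at the start.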

From Single Motif to Multiple Motifs

A single motif is often not sufficient to predict a protein function. Multiple motifs have stronger predictive power.

Multiple Motifs

Protein function prediction using multiple motifs

Each protein function is characterized by a set of motifs (instead of a single one).

If a protein contains a set of motifs, it probably has the function that the set of motifs corresponds to.
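The rule can be sketched as a simple check that every motif in a set occurs somewhere in the sequence. The function names, motif sets, and sequence below are invented for illustration, and the motifs are written as Python regular expressions.

```python
import re


def predict_functions(sequence, function_motifs):
    """Return the functions whose motifs ALL occur in the sequence.

    function_motifs maps a function name to a list of motif regexes.
    """
    predicted = []
    for function, motifs in function_motifs.items():
        if all(re.search(motif, sequence) for motif in motifs):
            predicted.append(function)
    return predicted


# Hypothetical motif sets.
toy_motif_sets = {
    "kinase-like": ["KCGH", "W[FY]"],
    "binding": ["GG", "PP"],
}
print(predict_functions("MKCGHLLWFGG", toy_motif_sets))
```

The sequence contains only one of the two "binding" motifs, so that function is not predicted; requiring the full set is what makes the prediction more selective than a single motif.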

PRINTS

PRINTS (http://umber.sbs.man.ac.uk/dbbrowser/PRINTS/) is a database of protein fingerprints.

A fingerprint is a group of conserved motifs used to characterize a protein family.

ftp.bioinf.man.ac.uk/pub/prints

PRINTS is now maintained at the University of Manchester.

PRINTS version 38.1 (25 May 2007): 1904 fingerprints, encoding 11,451 single motifs.

PRINTS

Two types of fingerprint are represented in the database: simple or composite, depending on their complexity. Simple fingerprints are essentially single motifs, while composite fingerprints encode multiple motifs. The bulk of the database entries are of the latter type because discrimination power is greater for multi-component searches.

Usually the motifs do not overlap, but are separated along a sequence, though they may be contiguous in 3D-space.

Fingerprints can encode protein folds and functionalities more flexibly and powerfully than can single motifs, full diagnostic potency deriving from the mutual context provided by motif neighbors.
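One way to picture a composite fingerprint match is as a series of motifs found one after another, without overlap, along the sequence. The sketch below makes two assumptions that are not stated in the slides: that the motifs must appear in a fixed order, and that each motif has a fixed length (so the leftmost match is also the earliest-ending one). It is not the method used by PRINTS or FPScan; `match_fingerprint` and the toy data are hypothetical.

```python
import re


def match_fingerprint(motifs, sequence):
    """Try to place each motif, in order and without overlap, on the sequence.

    Returns the list of (start, end) spans, or None if some motif cannot be
    placed after the previous one.
    """
    position = 0
    spans = []
    for motif in motifs:
        m = re.compile(motif).search(sequence, position)
        if m is None:
            return None
        spans.append(m.span())
        position = m.end()      # the next motif must start after this one
    return spans


toy_fingerprint = ["A.C", "GG", "W[FY]"]
print(match_fingerprint(toy_fingerprint, "MACCKGGLLWFD"))
print(match_fingerprint(toy_fingerprint, "MWFGGACC"))
```

The second sequence contains all three motifs but in the wrong order, so under the ordering assumption it does not match.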

PRINTS

FPScan

Submit a protein sequence to find the closest matching PRINTS fingerprint(s).

Related Projects

InterPro - Integrated Resources of Proteins Domains and Functional Sites
BLOCKS - BLOCKS db
Pfam - Protein families db (HMM derived) [Mirror at St. Louis (USA)]
PRINTS - Protein Motif fingerprint db
ProDom - Protein domain db (Automatically generated)
PROTOMAP - An automatic hierarchical classification of Swiss-Prot proteins
SBASE - SBASE domain db
SMART - Simple Modular Architecture Research Tool
TIGRFAMs - TIGR protein families db

Motifs and Matching

Motif finding: given a set of protein sequences, find the motif(s) that are shared by these proteins.

Motif scanning: given a motif and a protein sequence, find the occurrences (not necessarily identical) of the motif in the protein sequence.

Motif scanning is the matching problem!