Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.


Page 1: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Bioinformatics

CSC 391/691; PHY 392; BICM 715

Page 2: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Importance of bioinformatics

• A more global perspective in experimental design

• The ability to capitalize on the emerging technology of database-mining--the process by which testable hypotheses are generated regarding the function or structure of a gene or protein of interest by identifying similar sequences in better characterized organisms.

Page 3: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Amino acids: chemical composition or digital symbols for proteins

http://wbiomed.curtin.edu.au/teach/biochem/tutorials/AAs/AA.html

Link found on the Research Collaboratory for Structural Bioinformatics web site: www.rcsb.org/pdb/education.html

See also Table 2.2 (Mount)

Page 4: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Nucleotides: chemical composition or digital symbols for nucleic acids

http://ndbserver.rutgers.edu/NDB/archives/NAintro/

Link found on the Research Collaboratory for Structural Bioinformatics web site: www.rcsb.org/pdb/education.html

See also Table 2.1 (Mount)

http://www.web-books.com/MoBio/Free/Ch3A.htm

Page 5: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

http://www.accessexcellence.org/AB/GG/genetic.html

The Genetic Code: how DNA nucleotides encode protein amino acids
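
As a rough illustration of the encoding, the sketch below translates a DNA coding sequence into one-letter amino acid symbols. It assumes the standard genetic code and a sequence read in frame from its first base; the name `translate` and the example sequence are made up for illustration.

```python
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (bases in the order T, C, A, G at each position); '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA coding sequence, codon by codon, into one-letter amino acids."""
    dna = dna.upper()
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for start in range(0, len(dna) - 2, 3):
        protein.append(CODON_TABLE[dna[start:start + 3]])
    return "".join(protein)


print(translate("ATGGCCTGGTAA"))  # MAW*
```

Because 64 codons map onto 20 amino acids plus stop, several codons share one symbol, which is why the mapping cannot be reversed uniquely.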

Page 6: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Biologists think it’s a lot of data, but maybe it’s really not

He made fun of biologists for complaining that the human genome, which takes up about 3 gigabytes, is "a lot of data". He offered the comparison of the DVD movie "Evita", which is about 12 gigabytes, with the genome of Madonna (3 gigabytes). "The movie contains four times more information than Madonna's genome. And Madonna shares 99% of her DNA with a chimp... And 90% with Craig Venter’s dog."

More proof that the genome is not a lot of data: About 90-something percent of genetic information is common to all humans. "The unique part of you will fit on a floppy disk."

Nathan Myhrvold, former Chief Technology Officer for Microsoft
Keynote Speech at NIH Digital Biology Meeting 2003

Page 7: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Review of Lab 1

• What did you learn about the sites you visited: SGD, SwissProt, EntrezRefSeq, EntrezNeighbor, EntrezProtein, PIR-US

• Can you define the term protein function?

• Does the term gene function have any meaning?

• Questions?


Page 9: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Biologists think it’s a lot of data, and maybe it really is

• The genome is not a static, one-time picture

• Genome changes over time—mutations and other changes

• Genes expressed to make proteins
  – Set of genes that are expressed changes with cell type
  – Set of genes that are expressed changes over time and state

Page 10: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Definition of a Biological Database

A biological database is a large, organized body of persistent data, usually associated with computerized software designed to update, query, and retrieve components of the data stored within the system.

Page 11: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Sources of sequence data

1. GenBank at the National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD (nucleotides and proteins) http://www.ncbi.nlm.nih.gov/Entrez

2. European Molecular Biology Laboratory (EMBL) Outstation at Hinxton, England http://www.ebi.ac.uk/embl/index.html

3. DNA DataBank of Japan (DDBJ) at Mishima, Japan http://www.ddbj.nig.ac.jp/

4. Protein Information Resource (PIR) database at the National Biomedical Research Foundation in Washington, DC (see Barker et al. 1998) http://www-nbrf.georgetown.edu/pirwww/

5. The SwissProt protein sequence database at ISREC, Swiss Institute for Experimental Cancer Research in Epalinges/Lausanne http://www.expasy.ch/cgi-bin/sprot-search-de

6. The Sequence Retrieval System (SRS) at the European Bioinformatics Institute allows both simple and complex concurrent searches of one or more sequence databases. The SRS system may also be used on a local machine to assist in the preparation of local sequence databases. http://srs6.ebi.ac.uk

Table 2.5. Mount

Page 12: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Sources of protein structure data

• RCSB Protein Data Bank (PDB): www.rcsb.org

• BioMagResBank: http://www.bmrb.wisc.edu/

• MMDB: http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml

Page 13: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Review of Lab 2

• What did you learn about the RCSB web page?

• What are your thoughts about the PDB file format?

• Was RasMol easy or hard to use? Is there anything you tried to do, but couldn’t figure out how?

• What is the difference between the two glutaredoxin structures (1aaz and 1die)?

• MMDB: database of protein structures, ASN.1 format (http://www.ncbi.nlm.nih.gov/Structure/MMDB/mmdb.shtml)

• Other questions?

Page 14: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Levels of protein structure

• Primary structure

• Secondary structure

• (Super secondary structure)

• Tertiary structure

• Quaternary structure

Page 15: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Databases of protein structure classification

• SCOP: Murzin A. G., Brenner S. E., Hubbard T., Chothia C. (1995). J. Mol. Biol. 247, 536-540.

• CATH: Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. (1997) Structure 5(8), 1093-1108. http://www.biochem.ucl.ac.uk/bsm/cath/

• Dali: L. Holm and C. Sander (1996) Science 273:595-602. http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html

• VAST: S. H. Bryant and C. Hogue. http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

Page 16: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

RNA Structure

• Primary structure: sequence of GACU nucleotides

• Secondary structure: stem-loop structures

• Tertiary structure

• http://www.rnabase.org/

Page 17: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

DNA structure

• Primary structure: sequence of GACT nucleotides

• Secondary structure: double helix

• Higher levels of structure… nucleosome… chromatin… chromosome

Page 18: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

An example of pairwise alignment

(A) ./wwwtmp/lalign/.17728.1.seq Glutaredoxin, T4, 1AAZ.pdb - 87 aa
(B) ./wwwtmp/lalign/.17728.2.seq Unknown protein - 93 aa
using matrix file: BL50, gap penalties: -14/-4
27.0% identity in 89 aa overlap; score: 101 E(10,000): 0.0014

       10 20 30 40 50
Glutar KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDD--EKIAELLTKLGR
       ..:: .. :: : .: :: : .:.: .. . . :: ::. : .. .
Unknow EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM
       10 20 30 40 50 60

       60 70 80
Glutar DTQIGLTMPQVFAPDGSHIGGFDQLREYF
       .:. ..:..:. ::..::.. :... .
Unknow QTR-PTSFPRIFV-DGQYIGSLKQFKDLY
       70 80 90

Page 19: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Pairwise Sequence Alignment

• The alignment of two sequences (either protein or nucleic acid) based on some algorithm

• What is the “right answer”?
  – Align (pairwise) the following words: instruction, insurrection, incision

• There is NO unique, precise, and universally applicable method of pairwise alignment
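
One way to see why there is no single right answer is to align two of the words with a simple dynamic programming routine and vary the scoring. The sketch below is a minimal Needleman-Wunsch global alignment with a linear gap penalty; the function name and the scoring values are illustrative assumptions, not the settings of any program used in the course.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            step = match if a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + step:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


# Changing the gap penalty can change which alignment is reported as optimal.
for gap in (-1, -3):
    best, top, bottom = global_align("instruction", "insurrection", gap=gap)
    print(f"gap={gap}: score={best}")
    print(top)
    print(bottom)
```

Even for fixed parameters the traceback returns only one of possibly several equally scoring alignments, so the reported answer also depends on tie-breaking.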


Page 21: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Global vs Local Alignment

Figure 3.1, Mount
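
The difference between the two modes can be sketched in a few lines: a global score must account for every residue of both sequences, while a local score keeps only the best-matching pair of subsequences. The functions, scoring values, and toy sequences below are assumptions chosen for illustration.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch score: both sequences are aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(diag, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score: best score over all pairs of subsequences."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 lets an alignment start fresh at any position.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Toy sequences sharing the motif GATTACA inside unrelated flanks.
x, y = "TTTTTGATTACA", "GATTACACCCCC"
print("local :", local_score(x, y))   # 7, the shared motif
print("global:", global_score(x, y))
```

The local score picks out the shared motif, while the global score is dragged down by the unrelated flanking regions.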

Page 22: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Pairwise Sequence Alignment Websites

Table 3.1, Mount

• Bayes block aligner
  – http://www.wadsworth.org/res&res/bioinfo
  – Zhu et al. (1998)

• BCM Search Launcher: Pairwise sequence alignment
  – http://searchlauncher.bcm.tmc.edu/seq-search/alignment.html

• SIM: Local similarity program for finding alternative alignments
  – http://www.expasy.ch/tools/sim.html
  – Huang et al. (1990); Huang and Miller (1991); Pearson and Miller (1992)

• Global alignment programs (GAP, NAP)
  – http://genome.cs.mtu.edu/align/align.html
  – Huang (1994)

• FASTA program suite
  – http://fasta.bioch.virginia.edu/fasta/fasta_list.html
  – Pearson and Miller (1992); Pearson (1996)

• BLAST 2 sequence alignment (BLASTN, BLASTP)
  – http://www.ncbi.nlm.nih.gov/gorf/bl2.html
  – Altschul et al. (1990)

• LALIGN
  – http://www.ch.embnet.org/software/LALIGN_form.html
  – Huang and Miller, published in Adv. Appl. Math. (1991) 12:337-357

• Likelihood-weighted sequence alignment (lwa)
  – http://stateslab.bioinformatics.med.umich.edu/service/lwa.html

Page 23: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

What is multiple sequence alignment?

• Multiple sequence alignment is the alignment of more than two nucleotide or protein sequences

• Compare pairwise sequence alignment with multiple sequence alignment

Page 24: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Issues with multiple sequence alignment

• Try creating a multiple sequence alignment of the three words:
  – Insurrection
  – Incision
  – Instruction

Page 25: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Issues with multiple sequence alignment

• What’s the right answer?

in    cision
insurrection
ins truction

in     cision
insurrec tion
instr uc tion

in    cision
insurrection
instr uction

inci    sion
insurrection
ins truction

• Computational complexity

• What is a reasonable method for obtaining a cumulative score?

• Placement and scoring of gaps
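
One common answer to the cumulative-score question is a sum-of-pairs score: score every pair of rows in every column and add the results. The sketch below is a minimal version under assumed scoring values (match +1, mismatch -1, residue against gap -2, gap against gap 0). The gap widths in the two candidate alignments are also an assumption, since the slide's spacing is not reliable.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of characters; '-' denotes a gap."""
    if x == "-" and y == "-":
        return 0  # two gaps facing each other carry no information
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sum_of_pairs(alignment):
    """Sum pair_score over every column and every pair of rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )


candidates = {
    "gap after 'in'": ["in----cision", "insurrection", "ins-truction"],
    "gap after 'inci'": ["inci----sion", "insurrection", "ins-truction"],
}
for label, rows in candidates.items():
    print(label, sum_of_pairs(rows))
```

Different scoring values can rank the candidates differently, which is part of why gap placement and scoring are listed as open issues.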

Page 26: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Pairwise sequence alignment: LALIGN of OVCA2 and DYR_SCHPO (global)

./wwwt MAAQRPLRVLCLAGFRQSERGFREKTGALRKALRGRAELVCLSGPHPVPDPPGPEGARSD
       :. .::.:::: :. :: : .: :...: : ::: .:: . . :. .
dihydr MS--KPLKVLCLHGWIQSGPVFSKKMGSVQKYLSKYAELHFPTGPVVADEEADPNDEEEK
       10 20 30 40 50

       70 80 90 100 110 120
./wwwt FGSCPPEEQPRGWWFSEQEADVFSALEEPAVCRGLEESLGMVAQALNRLGPFDGLLGFSQ
       . : :. :.. :. . . .::: . : ... ::::::.::::
dihydr KRLAALGGEQNGGKFGWFEVEDFKN-----TYGSWDESLECINQYMQEKGPFDGLIGFSQ
       60 70 80 90 100 110

       130 140 150 160 170
./wwwt GAALAALVCALGQAGDPRFPL---P--RFILLVSSFCPRGIGFKESILQRPLSLPSLHVF
       ::...:.. . : :.: : : .:...:..: . : . . . :. ::::.
dihydr GAGIGAMLAQMLQPGQPPNPYVQHPPFKFVVFVGGFRAEKPEF-DHFYNPKLTTPSLHIA
       120 130 140 150 160 170

       180 190 200 210
./wwwt GDTDKVIPSQESVQLASQFPGAITLTHSGGHFIPA-------------AAP---------
       : .: ..: .: ::. . .: .: : : :..: .::
dihydr GTSDTLVPLARSKQLVERCENAHVLLHPGQHIVPQQAVYKTGIRDFMFSAPTKEPTKHPR

19.2% sequence identity; score -413

Page 27: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Multiple sequence alignment

Page 28: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

What is multiple sequence alignment used for?

• Consensus sequences: which residues can be used to identify other members of the family?

• Gene and protein families: which residues are functionally important; functional families

• Relationships and phylogenies: contains evolutionary “history” of sequences

• Data underlying some protein structure prediction algorithms

• Genome sequencing: sequence random, overlapping fragments; automation of assembly (in this case, there is a RIGHT answer)
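
A consensus sequence can be read directly off the columns of a multiple sequence alignment. The sketch below is a minimal illustration; the toy alignment, the majority threshold, and the use of `x` for columns with no dominant residue are assumptions made for the example.

```python
from collections import Counter


def consensus(alignment, threshold=0.5):
    """Return one character per column: the majority residue, or 'x' if none dominates."""
    result = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]
        if not residues:
            result.append("-")
            continue
        residue, count = Counter(residues).most_common(1)[0]
        # The top residue must appear in more than `threshold` of all rows.
        result.append(residue if count / len(column) > threshold else "x")
    return "".join(result)


# Hypothetical toy alignment, not data from the course.
msa = ["CPYC", "CGYC", "CPFC", "SPYC"]
print(consensus(msa))  # CPYC
```

Columns where a single residue survives in nearly every sequence are the natural candidates for functionally important positions.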

Page 29: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Consensus sequences and important functional residues

Baxter, et al, Mol Cell Prot 2003

Page 30: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Relationships and phylogenies

• Serine-threonine protein phosphatases

• Same biochemical function

• Clustering clearly shows PP1, PP2a and PP2B families

• What is different about these families?

Fetrow, Siew, Skolnick, FASEB J, 1999

Page 31: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Possible redox site in PP1 family

Only a clustering, not a true phylogenetic tree

Page 32: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Methods to solve computational complexity

• Progressive global alignment

• Iterative methods

• Alignments based on locally conserved patterns

• Statistical methods and probabilistic models

Page 33: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Multiple Sequence Alignment: Global

• CLUSTALW or CLUSTALX (latter has graphical interface)
  – FTP to ftp.ebi.ac.uk/pub/software
  – Thompson et al. (1994a, 1997); Higgins et al. (1996)

• MSA
  – http://www.psc.edu/
  – http://www.ibc.wustl.edu/ibc/msa.html
  – FTP to fastlink.nih.gov/pub/msa
  – Lipman et al. (1989); Gupta et al. (1995)

• PRALINE
  – http://mathbio.nimr.mrc.ac.uk/~jhering/praline/
  – Heringa (1999)

Table 4.1, Mount

Page 34: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Multiple Sequence Alignment: Iterative

• DIALIGN segment alignment
  – http://www.gsf.de/biodv/dialign.html
  – Morgenstern et al. (1996)

• MultAlin
  – http://protein.toulouse.inra.fr/multalin.html
  – Corpet (1988)

• Parallel PRRN progressive global alignment
  – http://prrn.ims.u-tokyo.ac.jp/
  – Gotoh (1996)

• SAGA genetic algorithm
  – http://igs-server.cnrs-mrs.fr/~cnotred/Projects_home_page/saga_home_page.html
  – Notredame and Higgins (1996)

Table 4.1, Mount

Page 35: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Multiple Sequence Alignment: Local

• Aligned Segment Statistical Evaluation Tool (Asset)
  – FTP to ncbi.nlm.nih.gov/pub/neuwald/asset
  – Neuwald and Green (1994)

• BLOCKS Web site
  – http://blocks.fhcrc.org/blocks/
  – Henikoff and Henikoff (1991, 1992)

• eMOTIF Web server
  – http://dna.Stanford.EDU/emotif/
  – Nevill-Manning et al. (1998)

• GIBBS, the Gibbs sampler statistical method
  – FTP to ncbi.nlm.nih.gov/pub/neuwald/gibbs9_95/
  – Lawrence et al. (1993); Liu et al. (1995); Neuwald et al. (1995)

• HMMER hidden Markov model software
  – http://hmmer.wustl.edu/
  – Eddy (1998)

• MACAW, a workbench for multiple alignment construction and analysis
  – FTP to ncbi.nlm.nih.gov/pub/macaw/
  – Schuler et al. (1991)

• MEME Web site, expectation maximization method
  – http://meme.sdsc.edu/meme/website/
  – Bailey and Elkan (1995); Grundy et al. (1996, 1997); Bailey and Gribskov (1998)

• Profile analysis at UCSD
  – http://www.sdsc.edu/projects/profile/
  – Gribskov and Veretnik (1996)

• SAM hidden Markov model Web site
  – http://www.cse.ucsc.edu/research/compbio/sam.html
  – Krogh et al. (1994); Hughey and Krogh (1996)

Table 4.1, Mount

Page 36: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Methods to solve computational complexity

• Progressive global alignment
  – Start with most related sequences
  – Problem is that errors in initial alignments are propagated

• Iterative methods
  – Iterative alignment of subgroup of sequences to find “best”; then align subgroups

• Alignments based on locally conserved patterns
  – Block analysis

• Statistical methods and probabilistic models
  – Expectation maximization; Gibbs sampler; Hidden Markov Models

Page 37: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Profile Methods

• Perform a global multiple sequence alignment on a group of sequences

• Extract more highly conserved regions

• Profile = scoring matrix for these highly conserved regions

• Used to search unknown sequences for membership in the family

Figures 4.11 (p. 162) and 4.12 (p. 166-167)
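
The steps above can be sketched as follows: count residues in each column of a conserved, gap-free region, convert the counts to log-odds scores, and slide the resulting matrix along an unknown sequence. This is a simplified illustration rather than the method in Mount's figures. It assumes a uniform background frequency, a pseudocount of 1, a made-up block, and sequences containing only the 20 standard amino acids.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(block, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per column of a gap-free block."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def best_match(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index)."""
    width = len(profile)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the profile")
    scores = [
        (sum(col[aa] for col, aa in zip(profile, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)


# Hypothetical conserved region from four family members.
block = ["CPYC", "CGYC", "CPFC", "CPYC"]
profile = build_profile(block)
score, start = best_match(profile, "MKVCPYCLA")
print(start, round(score, 2))
```

A real search would compare the best score against scores from unrelated sequences before claiming family membership.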

Page 38: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Limitations of such profiles

• Limited by sequences in original msa:
  – Sequence bias (too many of one type of sequence)
  – Sequences in msa not representative of entire family

Page 39: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Blocks

• Blocks are conserved regions of msa (like profiles) but no gaps allowed

• Servers for producing Blocks:
  – Blocks server
  – eMotif server

• Block libraries for database searching
  – Blocks (Henikoff and Henikoff)
  – Prosite (Bairoch)
  – Prints (Attwood)

Page 40: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Blocks that might be extracted from an msa

Baxter, et al, Mol Cell Prot 2003


Page 42: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Database searching

• Identify a new sequence by experimental methods: what is it?

• Search databases to find similar sequences

• If “enough similarity”, can say that function of new sequence is same as known sequence: function annotation transfer

• What is “enough similarity”?

• What is “function”?

Chapter 7, Mount

Page 43: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Relationships between family members

• Sequence relationships between family members

• Not all members of family have significant sequence similarity to all others

• Can be represented by nodes and edges of a graph

[Figure: graph of family members, with nodes labeled Z, C, A, B, E, D, F]

Page 44: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Beware of issues with function annotation transfer

• Multiple domains

• High sequence identity, but functional residues not conserved

• Sequence repeats (low complexity regions)

[Figure: domain diagram with labels “Function B”, “Function A”, “New”, “Function A”]

Known serine hydrolase: S D H

New sequence: S D L

Page 45: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Methods for database searching

• Sequence similarity with query sequence: FASTA, BLAST (Fig 7.5, p. 305)

• Profile search: ProfileSearch

• Position-specific scoring matrix: MAST

• Iterative alignment (combination of sequence searching and profile search): PSI-BLAST

• Patterns: Prosite, Blocks, Prints, CDD/Impala

Table 7.1, Mount

Page 46: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

The problem with speed

• Dynamic programming
  – Guaranteed to find optimal answer
  – Too slow (number of searches performed and number of sequences in databases that are searched): Smith-Waterman dynamic programming algorithm 50X slower than BLAST or FASTA
  – Faster hardware has made this problem feasible

• Heuristic methods
  – FASTA: short, common patterns in query and database sequences
  – BLAST: similar, but searches for rarer, more significant patterns
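
The heuristic idea can be sketched as a word lookup: index every short word in the database once, then look up the words of the query instead of running dynamic programming against every sequence. The code below is a simplified illustration of that first step only, not the actual FASTA or BLAST algorithm. The function names and the tiny database are made up; the two sequences are fragments borrowed from the alignment slides.

```python
from collections import defaultdict


def index_words(database, k):
    """Map every length-k word to the (sequence name, offset) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos in range(len(seq) - k + 1):
            index[seq[pos:pos + k]].append((name, pos))
    return index


def count_word_hits(query, index, k):
    """Count, per database sequence, the exact word matches with the query."""
    hits = defaultdict(int)
    for pos in range(len(query) - k + 1):
        for name, _ in index.get(query[pos:pos + k], []):
            hits[name] += 1
    return dict(hits)


database = {
    "unknown": "QTRPTSFPRIFVDGQYIGSLKQFKDLY",
    "ovca2_start": "MAAQRPLRVLCLAGFRQSER",
}
index = index_words(database, k=2)
print(count_word_hits("DTQIGLTMPQVFAPDGSHIGGFDQLREYF", index, k=2))
```

Only sequences with enough word hits would then be passed on to a slower, more careful alignment step.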

Page 47: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Searches on DNA vs Protein Sequences

• 20-letter alphabet vs 4-letter alphabet

• Fivefold larger variety of sequence characters in proteins: easier to detect patterns

• Searches with DNA sequences produce fewer significant matches

• What if you don’t know reading frame?

• Sometimes must do nucleic acid searches (searching for similarities in non-coding regions)

Page 48: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Sensitivity vs selectivity

• Sensitivity: method’s ability to find most members of the protein family

• Selectivity: method’s ability to distinguish true members from non-members

• Want a method to have high sensitivity (get all true positives) and high selectivity (not get false positives)

• Can be a difficult test with biological data sets: not all true positives are known
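
The two measures can be made concrete for a single search result. The sketch below takes sensitivity as the fraction of true family members that were found, and selectivity as the fraction of reported hits that are true members. That is one reading of the definitions above, and the identifiers and data are hypothetical.

```python
def evaluate_search(hits, family_members):
    """Return (sensitivity, selectivity) for one database search.

    Assumes both collections are non-empty and that the true family
    membership is fully known, which is rarely the case for real data.
    """
    hits, family_members = set(hits), set(family_members)
    true_pos = len(hits & family_members)
    false_pos = len(hits - family_members)
    false_neg = len(family_members - hits)
    sensitivity = true_pos / (true_pos + false_neg)
    selectivity = true_pos / (true_pos + false_pos)
    return sensitivity, selectivity


known_family = {"P1", "P2", "P3", "P4"}
search_hits = {"P1", "P2", "P3", "X9"}
print(evaluate_search(search_hits, known_family))  # (0.75, 0.75)
```

If some true family members are not yet annotated, hits such as `X9` may be counted as false positives when they are in fact real, which is the difficulty noted in the last bullet.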

Page 49: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Scoring matrices commonly used

• PAM250: point accepted mutation; Dayhoff, M., Schwartz, R. M., and Orcutt, B. C., Atlas of Protein Sequence and Structure (1978) 5(3):345

• BLOSUM62: blocks amino acid substitution matrices; Henikoff and Henikoff, Amino acid substitution matrices from protein blocks. (1992) Proc. Natl. Acad. Sci. USA 89:10915-10919.

Page 50: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

PAM250

– Calculated for families of related proteins (>85% identity)

– 1 PAM is the amount of evolutionary change that yields, on average, one substitution in 100 amino acid residues

– A positive score signifies a common replacement whereas a negative score signifies an unlikely replacement

– PAM250 matrix assumes/is optimized for sequences separated by 250 PAM, i.e. 250 substitutions in 100 amino acids (longer evolutionary time)
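
A distance of 250 PAM does not mean that every residue has changed, because repeated substitutions can hit the same site and can even restore the original residue. The sketch below illustrates this with a deliberately crude model, which is an assumption and not the Dayhoff data: at each 1-PAM step a residue changes with probability 0.01 and is replaced by any of the other 19 residues with equal probability.

```python
def fraction_unchanged(pam_units, n_states=20, change_per_pam=0.01):
    """Probability that a site shows its original residue after `pam_units` steps.

    Toy model (an assumption, not the Dayhoff matrices): uniform substitution
    among `n_states` residues, `change_per_pam` chance of change per step.
    """
    same = 1.0
    back = change_per_pam / (n_states - 1)  # chance of mutating back to the original
    for _ in range(pam_units):
        same = same * (1.0 - change_per_pam) + (1.0 - same) * back
    return same


for n in (1, 100, 250):
    print(n, round(fraction_unchanged(n), 3))
```

In this toy model roughly one site in eight still shows its original residue at 250 PAM. The real matrices give a different figure because observed substitutions are far from uniform.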

Page 51: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

BLOSUM62

• BLOSUM matrices are based on local alignments (“blocks” or conserved amino acid patterns)

• BLOSUM 62 is a matrix calculated from comparisons of sequences sharing no more than 62% identity (more similar sequences are clustered and counted as one)

• All BLOSUM matrices are based on observed alignments; they are not extrapolated from comparisons of closely related proteins

• BLOSUM 62 is the default matrix in BLAST 2.0

Page 52: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Comparison of PAM250 and BLOSUM62

The relationship between BLOSUM and PAM substitution matrices. BLOSUM matrices with higher numbers and PAM matrices with low numbers are both designed for comparisons of closely related sequences. BLOSUM matrices with low numbers and PAM matrices with high numbers are designed for comparisons of distantly related proteins. If distant relatives of the query sequence are specifically being sought, the matrix can be tailored to that type of search.

Less divergent                 More divergent
BLOSUM80      BLOSUM62      BLOSUM45
PAM1          PAM120        PAM250

Page 53: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Scoring matrices commonly used

• PAM250
  – Represents a period of time during which only about 20% of amino acids will remain unchanged
  – Shown to be appropriate for searching for sequences of 17-27% identity

• BLOSUM62
  – Matrix calculated from comparisons of sequences sharing no more than 62% identity
  – Though it is tailored for comparisons of moderately distant proteins, it performs well in detecting closer relationships

• BLOSUM50
  – Shown to be better for FASTA searches

Page 54: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Methods for database sequence searching

• Sequence similarity with query sequence: FASTA, BLAST

• Profile search: ProfileSearch

• Position-specific scoring matrix: MAST

• Iterative alignment (combination of sequence searching and profile search): PSI-BLAST

• Patterns: Prosite, PFAM, CDD/Impala

Page 55: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Review of protein structure

• Primary structure: sequence of amino acids

• Secondary structure: local segments of protein structure

• Tertiary structure: three-dimensional structure of a single protein chain

• Quaternary structure: packing of 2 or more protein chains

Page 56: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Classification of protein tertiary structure

• All alpha proteins

• All beta proteins

• Alpha+beta proteins

• Alpha/beta proteins

• Irregular proteins

Classify these proteins: T-cell protein CD8 (1cd8), myoglobin, triose phosphate isomerase, G-specific endonuclease (1rnb)

Page 57: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Representations of protein structures

• All atom

• CPK models

• Cartoons (ribbons, etc)

• Topology diagrams

Page 58: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Protein structure databases

• RCSB (PDB): http://www.rcsb.org/pdb
  – General repository for all protein coordinate files

• MMDB: http://www.ncbi.nlm.nih.gov/Structure
  – NCBI structure database; structures from pdb
  – Links to sequence and genome databases

• BioMagResBank: http://www.bmrb.wisc.edu/
  – General repository for NMR structure data

Page 59: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Alignment of protein structure

• Superposition of protein 3D structures

• Used in searching for structural similarity and grouping proteins into “fold families”

• Structural similarity is common and does not necessarily indicate an evolutionary relationship (different from sequence similarity)

Page 60: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Structure Alignment: A difficult problem

• Alignment in atom positions in 3D space

• Pieces of proteins may align
  – What is significant and what is not? (Is alignment of two helices significant?)

• Alignment of topology or secondary structure packing give different answers

More difficult examples: http://www.sbg.bio.ic.ac.uk/people/rob/sf/sf.html

Easy example (Eidhammer and Jonassen):

Page 61: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Structure alignment used to classify (group) protein structures

• SCOP (Structural Classification Of Proteins; http://scop.mrc-lmb.cam.ac.uk/scop/)
  – Class (all alpha, all beta, alpha+beta, alpha/beta), family, superfamily, fold
  – Reflects structural and evolutionary relationships
  – Mostly done by “hand” (expert analysis)

• CATH (classification by class, architecture, topology and homology; http://www.biochem.ucl.ac.uk/bsm/cath)
  – Class (all alpha, all beta, alpha+/beta), architecture, fold, superfamily, family
  – Uses SSAP structure alignment program

• FSSP (fold classification based on structure-structure protein alignment; http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html)
  – Based on pairwise alignment of all non-redundant proteins in PDB
  – Divides proteins into structures and domains: represents unique configuration of secondary structure elements
  – Uses Dali structure alignment program

• MMDB (molecular modeling database; http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Structure)
  – Proteins classified into structurally related groups by VAST, based on arrangements of secondary structures
  – Groupings of all PDB structures

• SARF (spatial arrangement of backbone fragments; http://123d.ncifcrf.gov/)

Page 62: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Web sites for structure alignment

• VAST: http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
  – NCBI structure comparison
  – Comparison of orientations of secondary structures (vector representation of secondary structures)
  – Approach from graph theory

• Dali: http://www.ebi.ac.uk/dali/
  – FSSP structure comparison
  – Protein represented as distance matrix between alpha carbons
  – Monte Carlo simulation to do random search for sub-distance-matrices

• SSAP: http://www.biochem.ucl.ac.uk/cgi-bin/cath/GetSsapRasmol.pl
  – CATH structure comparison
  – Set structure environment for each residue, then align residue by residue using double dynamic programming
  – Structure environment can use beta carbon vectors or phi/psi backbone dihedral angles

• Others: Lots, such as Structal (Gerstein and Levitt); Minarea (Falicov and Cohen); Lock (Singh and Brutlag)
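
The distance-matrix representation mentioned for Dali is easy to sketch. The matrix of alpha-carbon distances does not change when the whole protein is moved, so two structures can be compared without first superimposing them. The coordinates below are made up, and the code shows only the representation, not Dali's search procedure.

```python
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between points given as (x, y, z) tuples."""
    return [[math.dist(p, q) for q in coords] for p in coords]


# Hypothetical alpha-carbon coordinates (in angstroms) for three residues.
ca_atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0)]
# The same three atoms after translating the whole structure.
shifted = [(x + 10.0, y - 5.0, z + 2.0) for x, y, z in ca_atoms]

original = distance_matrix(ca_atoms)
moved = distance_matrix(shifted)
unchanged = all(
    math.isclose(a, b, abs_tol=1e-9)
    for row_a, row_b in zip(original, moved)
    for a, b in zip(row_a, row_b)
)
print(unchanged)  # True
```

`math.dist` requires Python 3.8 or later.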

Page 63: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Bioinformatics Course, Spring 2004

Protein Structure Prediction

• Goal is to understand the relationship between the primary amino acid sequence and the structure of the protein

• Relationship between sequence and structure is not simple and is not understood

• “Protein folding problem” remains unsolved

Page 64: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Protein Structure Prediction

• Secondary structure prediction: unsolved?

• Tertiary structure prediction: unsolved problem (CASP competition)

• Quaternary structure prediction: unsolved problem
– “Docking” of two subunits

Page 65: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Secondary structure prediction

• Prediction of three classes of secondary structure: helix, strand, “coil”
– Solved problem? 70-80% “correct predictions”
– Methods (web sites) can give very different answers

• Prediction of non-regular secondary structure (loops and turns) not as successful

Page 66: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Secondary structure prediction

• Method development
– Frequencies of the residue types found in each secondary structure
– Frequencies calculated from database of known structures (training set)

• Method evaluation
– Test method on proteins whose structures are known (testing set)

• Training and testing sets must not be the same
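
The development and evaluation cycle above can be sketched in Python under heavy simplification. The sequences, structure strings, and the functions `train` and `predict` are all hypothetical; the predictor simply assigns each residue the state it was seen in most often, which is far cruder than any published method. The point is only that frequencies come from one set and accuracy is measured on a different one.

```python
from collections import Counter, defaultdict

def train(pairs):
    """Count how often each residue type occurs in each state (H, E, C)."""
    counts = defaultdict(Counter)
    for seq, ss in pairs:
        for residue, state in zip(seq, ss):
            counts[residue][state] += 1
    return counts

def predict(counts, seq, default="C"):
    """Assign each residue its most frequent training-set state."""
    return "".join(
        counts[r].most_common(1)[0][0] if counts.get(r) else default
        for r in seq
    )

# Hypothetical training and testing sets; they share no proteins.
training = [("AAEELLKK", "HHHHHHHH"), ("VVIIYY", "EEEEEE"), ("GGPPSS", "CCCCCC")]
testing = [("AVGK", "HECH")]
assert not set(training) & set(testing)

counts = train(training)
for seq, actual in testing:
    predicted = predict(counts, seq)
    correct = sum(a == p for a, p in zip(actual, predicted))
    print(seq, predicted, f"{100.0 * correct / len(seq):.0f}% correct")
```

Evaluating on the training proteins themselves would overstate accuracy, because the frequencies were fitted to exactly those residues.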

Page 67: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Secondary structure prediction methods and references

Method classes: single residue statistics; explicit rules; nearest neighbors; neural networks; hidden Markov models

• 1st generation: Chou/Fasman (’74); GOR I; Lim (’74)

• 2nd generation: GOR III (’87); Predator (’96); Levin (’86); Nishikawa and Ooi (’86); Yi and Lander (’93); Qian and Sejnowski (’88); Holley and Karplus (’89); Asai/Handa (’93)

• 3rd generation: GOR IV; DSC (Prof) (’96); NNssp (’95); PHD (’93); Jnet (’99); PsiPred (’99); PASSML (’98)

See Table 9.7, Mount, for list of servers

Page 68: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

GOR IV secondary structure prediction

• Three state prediction: helix, strand, loop

• Statistics of pair frequencies observed within a window of 17 amino acid residues

• Based on information theory—sound statistical basis and no ad hoc rules

• Mean accuracy of 64.4% for a three state prediction (Q3)

Garnier, Gibrat, Robson; http://abs.cit.nih.gov/index.html

Page 69: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

PHD secondary structure prediction

• Three state prediction: helix, strand, loop

• Predicts secondary structure from multiple sequence alignments

• Three consecutive neural networks (feed forward)
– Raw 3-state prediction for each position, based on alignment composition in 13-residue window
– Filter 3-state probabilities based on probabilities of flanking positions in 17-residue window
– Jury network using several raw/filter combinations trained separately

• Expected average accuracy > 72% for three state prediction (Q3)

Rost and Sander; http://www.predictprotein.org

Page 70: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Method evaluation: how good is “good”?

• Testing of prediction methods involves
– Applying the method to a set of proteins whose secondary structures are known experimentally and comparing prediction results to known results
– Calculating measures of how good the performance is

• Q1 (h, s, or c)
– (number of residues correctly predicted in one state/number of residues in that state) * 100

• Q3 (h, s, and c):
– (number of residues correctly predicted in each of 3 states/number of all residues) * 100

• Matthews correlation coefficient (Cs)
– (TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)]

Num: ....,....1....,....2....,....3....,....4....,....5....,....6
Res: MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN|
Actu: HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH|
Pred: HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH |
Pred: HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH|

Page 71: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Method evaluation: how good is “good”?

• Matthews correlation coefficient (Cs)
– (TpTn - FpFn) / sqrt[(Tp+Fp)(Tn+Fn)(Tp+Fn)(Tn+Fp)]
– Where Tp, true positive prediction (method predicts helix, and residue is in a helix); Tn, true negative prediction (method predicts “not helix”, and residue is not in a helix); Fp, false positive prediction (method predicts helix, but residue is not in a helix); Fn, false negative prediction (method predicts “not helix”, but residue is in a helix)

Num: ....,....1....,....2....,....3....,....4....,....5....,....6
Res: MSTKQHSAKDELLYLNKAVVFGSGAFGTALAMVLSKKCREVCVWHMNEEEVRLVNEKREN|
Actu: HHHHHHHHHHHH EE HHHHHHHHHHHHHHH EE HHHHHHHHHHHHHH|
Pred: HHHHHH EEEEE HHHHHHHHHHHH EEEEEE HHHHHHHH |

Q1 (helix) = (4+12+8)/(12+15+14)*100 = 58.5%
Q3 = (4+12+8+2+2)/60*100 = 47%
Tp = 4+12+8 = 24; Tn = 9+8 = 17
Fp = 2; Fn = 8+1+2+5+1 = 17
Ch = [(24*17)-(2*17)]/sqrt[(24+2)(17+17)(24+17)(17+2)] ≈ 0.45
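
The three measures can be computed with a short Python sketch. The function names are illustrative, and the two 8-residue strings are invented because the spacing of the slide's alignment did not survive extraction; the Matthews coefficient is also evaluated on the helix counts given in the worked example (Tp=24, Tn=17, Fp=2, Fn=17).

```python
from math import sqrt

def q3(actual, predicted):
    """Percentage of all residues whose state is predicted correctly."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return 100.0 * correct / len(actual)

def q1(actual, predicted, state):
    """Percentage of residues in `state` that are predicted as `state`."""
    in_state = [p for a, p in zip(actual, predicted) if a == state]
    return 100.0 * in_state.count(state) / len(in_state)

def confusion(actual, predicted, state):
    """Return (Tp, Tn, Fp, Fn) for one state versus everything else."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == state and p == state for a, p in pairs)
    tn = sum(a != state and p != state for a, p in pairs)
    fp = sum(a != state and p == state for a, p in pairs)
    fn = sum(a == state and p != state for a, p in pairs)
    return tp, tn, fp, fn

def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 if the denominator vanishes."""
    denom = sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    return (tp * tn - fp * fn) / denom if denom else 0.0

# Invented 8-residue example (H helix, E strand, C coil).
actual, predicted = "HHHHCCEE", "HHCCCCEH"
print(q3(actual, predicted), q1(actual, predicted, "H"))
print(matthews(*confusion(actual, predicted, "H")))

# Helix counts from the worked example on this slide.
print(round(matthews(24, 17, 2, 17), 2))
```

Unlike Q1 and Q3, the Matthews coefficient penalizes over-prediction: a method that calls every residue helix has Tn = Fn = 0, so the denominator is zero and the coefficient is undefined; by convention, and in this sketch, it is reported as 0.0.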

Page 72: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Tertiary Structure Prediction

• Homology modeling: identifiable sequence similarity

• Fold recognition (“threading;” Table 9.8 for server list)

• “Ab initio” methods

Page 73: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Homology modeling

• Sequence alignment

• Side chain modeling

• Modeling insertions and deletions

• Optimizing the model

• Model evaluation

• Repeat?

Page 74: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Fold Recognition (“threading”)

• Template identification/sequence alignment/alignment optimization

• Side chain modeling

• Modeling insertions and deletions

• Optimizing the model

• Model evaluation

• Repeat?

Page 75: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Ab initio methods: folding “from scratch”

• Start with unfolded protein or random conformation
• Use atomic-level forces, solve energetic equations
• Identify most stable conformation (lowest free energy)
• Computational demands high: for protein of 100 amino acids
– Assume constant bond lengths and angles
– Allow 2 of 3 backbone torsion angles per amino acid to rotate
– Do not allow side chain torsion angles to move
– Assuming 10 allowed conformations per residue, must explore 10^100 conformations
– Calculation of 10^100 energies (one for each conformation) is not possible
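
A back-of-the-envelope check of that estimate, as a Python sketch. The evaluation rate of 10^12 energies per second is an assumption chosen only to be generous; it does not come from the slides.

```python
STATES_PER_RESIDUE = 10          # from the slide
N_RESIDUES = 100                 # from the slide
ENERGIES_PER_SECOND = 10**12     # assumed, deliberately optimistic rate
SECONDS_PER_YEAR = 3600 * 24 * 365

# Exhaustive search visits every combination of per-residue states.
conformations = STATES_PER_RESIDUE ** N_RESIDUES
years = conformations // (ENERGIES_PER_SECOND * SECONDS_PER_YEAR)

print(f"{conformations:.1e} conformations")
print(f"about {years:.1e} years at the assumed rate")
```

Even at this rate the search would take more than 10^80 years, which is why the simplifications on the next slide are needed.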

Page 76: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Ab initio methods: simplifications

• Lattice models to simplify the conformational search space

• Monte Carlo statistical sampling of conformational space

• Stepwise processes:
– Predict regular secondary structures
– Pack secondary structures to form tertiary structures

• Others…
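
Monte Carlo sampling of conformational space can be sketched with a Metropolis-style loop in Python. Everything specific here is an assumption for illustration: the conformation is a list of discrete per-residue states, and `toy_energy` is an invented function with no physical meaning. A real ab initio method would use a force field or a knowledge-based potential.

```python
import math
import random

def metropolis(energy, n_residues, n_states, steps, kT=1.0, seed=0):
    """Sample conformations, always accepting downhill moves and
    accepting uphill moves with probability exp(-dE/kT)."""
    rng = random.Random(seed)
    conf = [rng.randrange(n_states) for _ in range(n_residues)]
    e = energy(conf)
    best, best_e = conf[:], e
    for _ in range(steps):
        i = rng.randrange(n_residues)
        old = conf[i]
        conf[i] = rng.randrange(n_states)      # propose a random change
        new_e = energy(conf)
        if new_e <= e or rng.random() < math.exp(-(new_e - e) / kT):
            e = new_e                           # accept the move
            if e < best_e:
                best, best_e = conf[:], e
        else:
            conf[i] = old                       # reject and restore
    return best, best_e

def toy_energy(conf):
    """Invented energy: number of residues not in an arbitrary 'native' state 0."""
    return sum(1 for s in conf if s != 0)

best, best_e = metropolis(toy_energy, n_residues=20, n_states=10, steps=5000)
print(best_e, best)
```

The sampler evaluates only 5000 of the 10^20 possible conformations in this toy case, trading a guarantee of finding the global minimum for tractability.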

Page 77: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Review of Definitions

• Cell: fundamental working unit of biology

• DNA: encodes all information to create cells and allow them to function
– Linear arrangement of bases (AGTC)

• Genome: organism’s complete set of DNA

• Chromosome: physically distinct molecules of DNA
– Genomes can be composed of 1, 2 or more chromosomes

• Gene: basic physical and functional unit of heredity
– Linear arrangement of bases along the chromosome
– Contain instructions for encoding protein
– (Remember genetic code?)

Page 78: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Genomes and proteomes

• Genome: Sum of all genes and intergenic DNA sequences in a cell
– The smallest known genome for a free-living organism (a bacterium) contains about 600,000 DNA base pairs
– Human and mouse genomes have about 3 billion base pairs
– “Relatively” unchanging from cell to cell

• Proteome: The entire set of proteins encoded in the genome of an organism and produced by that organism
– Constellation of proteins in cells is highly dynamic

Page 79: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

The Human Genome

• 24 distinct chromosomes (22 autosomes plus X and Y)
• Chromosomes range in size from 50 million to 250 million base pairs
• Total size of the human genome is over 3 billion base pairs (3.1647 billion)
– 99.9% of all bases are the same in all people

• Genes comprise only 2% of the total genome
– Human genome is estimated to contain 30,000 to 40,000 genes
– Average gene size is about 3000 bases
– Largest identified so far is 2.4 million bases (dystrophin)
– Functions for less than 50% of genes and gene products are known

• Remainder of genome is non-coding regions
– Chromosomal structural integrity
– Repetitive sequences
– Regulation of protein production
– Other functions that we don’t know about

Page 80: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Human Genome Sequencing Project Goals

• Determine the sequences of the 3 billion chemical base pairs that make up human DNA

• Identify all the approximately 30,000 genes in human DNA

• Store this information in databases
• Improve tools for data analysis
• Transfer related technologies to the private sector
• Address the ethical, legal, and social issues that may arise from the project

Human Genome Project (DOE): http://www.ornl.gov/sci/techresources/Human_Genome/home.shtml

NIH: http://www.ncbi.nlm.nih.gov/genome/guide/human/

Page 81: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Other sequencing projects

• Over 200 genomes sequenced
• Range of archaeal, bacterial, and eukaryotic genomes
– Organisms that have been well-studied in the laboratory
– Organisms that are pathogenic to humans
– Organisms of special scientific or technical interest

NCBI list of sequenced genomes (NIH): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome

Page 82: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Prokaryotes and eukaryotes

• Prokaryotes (bacteria and archaea)
– No true nucleus
– DNA generally circular (one chromosome)

• Eukaryotes
– True nucleus contains (most) DNA
– DNA linear and arranged in chromosomes

Phylogenetic analysis of small subunit ribosomal RNAs, C. Woese, 1987

Page 83: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Anatomy of a prokaryotic genome

• DNA compact and circular

• ORFs (open reading frames) with start and stop codons

• No introns
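
Because prokaryotic genes lack introns, a first pass at locating them can simply scan for ORFs. The Python sketch below is a simplified illustration: it searches only the forward strand, treats ATG as the only start codon, and uses an arbitrary `min_codons` cutoff, all of which are assumptions rather than anything stated in the slides. The test sequence is invented.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs for forward-strand ORFs.

    `end` is exclusive and includes the stop codon; `min_codons`
    counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                        # open a reading frame
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                     # close it at the stop
    return orfs

dna = "CCATGAAATTTTAGGG"                         # invented test sequence
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

A fuller scan would also examine the three frames of the reverse complement. In eukaryotes this approach breaks down because introns interrupt the reading frame.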

Page 84: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Anatomy of a eukaryotic genome

• Linear DNA; chromosomes

• Centromeres
• Telomeres
• Tandem repeats
• Transposable elements
• Introns
• Pseudogenes

Example of chromosome maps: http://www.ncbi.nlm.nih.gov/genome/guide/human/

Page 85: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

DNA sequencing

• Separate strands of DNA
• Anneal primer to one strand
• Replicate using fluorescently labeled ddNTPs (as opposed to normal dNTPs)
• Separate fragments by size
• Image gel for fluorescent labels

See also electropherogram, Fig. 2.2, Mount

[Figure labels: A G C T]

Page 86: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Methods of genome sequencing

• Mapping method
– Fragment chromosome
– Identify markers and order them
– Arrange fragments, then sequence

• Shotgun method
– Fragment chromosome
– Sequence fragments, then arrange

• cDNA sequencing (ESTs)
– Isolate mRNA (expressed in cell)
– Reverse transcribe mRNA to create cDNA
– Sequence cDNA
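
The "sequence fragments, then arrange" step of the shotgun method can be illustrated with a toy greedy assembler in Python. This is a sketch under strong assumptions: reads are error-free, all come from the same strand, and there are no repeats. The fragments and the `min_len` overlap threshold are invented. Production assemblers use far more elaborate algorithms.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))       # drop exact duplicates
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break                                # nothing left to merge
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

# Invented overlapping reads from a 15-base sequence.
reads = ["ACGTTAGC", "ATGGCGTA", "GCGTACGT"]
print(greedy_assemble(reads))
```

The mapping method avoids much of this search by ordering the fragments before sequencing, at the cost of building the map first.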

Page 87: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Maps

• Gene map

• Chromosome map

• Sequence map

• Maps important for obtaining sequence information (mapping method)
– Restriction map
– Contig (contiguous clone) map

NCBI map viewer: http://www.ncbi.nlm.nih.gov/mapview/

Page 88: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Prediction of genes

• Method

• Difference between prokaryotes and eukaryotes

• Tests for validation of predictions

Page 89: Bioinformatics Course, Spring 2004 Bioinformatics CSC 391/691; PHY 392; BICM 715.

Genome Analysis

• General approach (p. 492)

• Comparative genomics
– Self-comparison reveals gene families and duplication
– Between-genome comparison reveals orthologs, gene families and domains
– Gene ordering on chromosomes

• Phylogenetic analysis

• Genetic diversity