online.universita.zanichelli.itonline.universita.zanichelli.it/.../Voet_bioinformatic_project07.docx ·...


Bioinformatics Exercises

Over the last two decades, information has been gaining increasing importance in both teaching and learning biochemistry. The most obvious case is the sequencing of the human genome and many other complete genomes. In 1990, the determination of the sequence of a protein was often the topic of a full publication in a peer-reviewed journal such as Science, Nature, or The Journal of Biological Chemistry. Now entire genomes are the topic of individual research papers. The term "bioinformatics" is a catch-all phrase that generally refers to the use of computers and computer science approaches in the study of biological systems. The main chapters where this information is discussed in the text are chapters 3 (Nucleotides, Nucleic Acids and Genetic Information), 5 (Proteins: Primary Structure), 6 (Proteins: Three-Dimensional Structure), 12 (Enzyme Kinetics, Inhibition and Regulation) and 13 (Introduction to Metabolism). Here we provide exercises appropriate to these chapters, aimed at introducing the techniques of bioinformatics that involve the use of computers, Internet-accessible databases and the tools that have been developed to “mine” those databases.

General principles

1. Open-ended questions. The exercises may include some questions that have definite answers, but in many cases there will also be questions that may be answered in a number of ways, depending on the approach you take or the topic you select.

2. Stable Internet Resources. As much as possible, the exercises will be based on well-established, stable web sites. Where it is necessary to use less reliable sites or resources, attempts have been made to provide multiple sites that perform similar functions.

3. Here are the stable online resources that will be used most frequently:

a. GenBank (http://www.ncbi.nlm.nih.gov/)
b. Protein Data Bank (http://www.rcsb.org)
c. Expasy Proteomics Server (http://us.expasy.org/)
d. European Bioinformatics Institute (http://www.ebi.ac.uk/)
e. Pfam (http://www.sanger.ac.uk/Software/Pfam/)
f. SCOP (http://scop.mrc-lmb.cam.ac.uk/scop/)
g. CATH (http://www.biochem.ucl.ac.uk/bsm/cath/)
h. PubMed (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi)
i. PubMed Central (http://www.pubmedcentral.nih.gov/)

4. Answer key. Where a definite answer is known, it will be provided in an answer key. For more open-ended questions, a typical correct answer will be presented.

5. Historical perspective. If historical resources are available online (including PubMed), there may be questions designed to help students identify some of the historical roots of biochemistry and molecular biology.

Project 7: Enzyme Commission Classes and Catalytic Site Alignments with PyMOL


Resources

The Protein Data Bank (http://www.rcsb.org/pdb/home/home.do)
The Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/)
Enzyme Nomenclature (http://www.chem.qmul.ac.uk/iubmb/enzyme/)
Classification and Nomenclature of Enzymes by the Reactions they Catalyze (http://www.chem.qmul.ac.uk/iubmb/enzyme/rules.html)
KEGG (Kyoto Encyclopedia of Genes and Genomes) Reaction (http://www.genome.jp/kegg/reaction/)
Expasy Enzyme (http://enzyme.expasy.org/)

Software

PyMOL for Educational Use Only (http://pymol.org/educational/)
ProMOL Download Site (http://sourceforge.net/projects/sbevsl/)
ProMOL User Guide
ChemAxon MarvinSketch (http://www.chemaxon.com/products/marvin/marvinsketch/)

Chapter 11 presents the general principles of enzyme function, along with specific mechanisms for lysozyme and serine proteases. The role of protein structures that have been determined by X-ray crystallography in our understanding of enzyme mechanisms is also discussed. In this exercise, we will explore a number of Internet resources relating to enzyme mechanism, with a particular focus on the role of catalytic residues in the active sites of enzymes. We will then use ProMOL, a plugin for the PyMOL molecular graphics environment, to search enzyme structures for catalytic sites.

1. The Enzyme Commission and IUPAC Classification of Enzymes

1. What is IUPAC?

2. Where else have you seen IUPAC? (Hint: think organic chemistry)

3. What are the six classes of enzymes in the IUPAC classification system? Now look at the different levels of classification. For example, consider an enzyme with the EC classification 1.2.3.4: each successive number places it in a narrower subgroup. A search of the Expasy Enzyme site by enzyme class (http://enzyme.expasy.org/enzyme-byclass.html) will help you to understand the hierarchy.

4. Finding enzymes by EC class. In Chapter 11, you have studied a number of enzymes: ribonuclease A, carbonic anhydrase, proline racemase, lysozyme, chymotrypsin, trypsin, elastase, subtilisin, serine carboxypeptidase A, and phospholipase A. Search for the EC classification for each of these enzymes using one or more of the following resources:


a. The Protein Data Bank (http://www.rcsb.org/pdb/home/home.do)
b. The Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/)
c. The DBGET Access Point to the KEGG Database (http://www.genome.jp/dbget/)
d. Expasy Enzyme (http://enzyme.expasy.org/)

5. Are any of these enzymes related to each other based on the numerical hierarchy of the EC classification system? The further down the hierarchy two enzymes share the same numbers, the more closely related they are. For example, an enzyme with the classification 2.3.5.17 is much closer to an enzyme with classification 2.3.5.25 than to an enzyme with classification 2.5.3.17. Identify the two enzymes from this list that are closest in the Enzyme Commission classification system.

6. For those who want to dig a little deeper: go to the IUBMB Enzyme Nomenclature page (http://www.chem.qmul.ac.uk/iubmb/enzyme/) and pick two enzymes that are from the same level 1 class, but differ in the 2nd level (for example, 1.2.3.4 and 1.5.3.4). Then look for these enzymes by EC class on BRENDA (http://www.brenda-enzymes.org/) and the KEGG Enzyme Database (http://www.genome.jp/dbget-bin/www_bfind?enzyme). What reaction does each enzyme catalyze? Compare the structures of the substrates and products. Are there any cofactors involved? Look for the enzymes by name on the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do) and see if you can find any similarities or differences.
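The idea of "closeness" in the EC hierarchy used in questions 5 and 6 can be sketched in a few lines of Python. This is only an illustration: the function name shared_ec_levels is hypothetical and is not part of any of the databases listed above. It counts how many leading numbers two EC classifications have in common.

```python
def shared_ec_levels(ec_a, ec_b):
    """Count how many leading EC levels two classifications share.

    Hypothetical helper for illustration; EC numbers are given as
    dot-separated strings such as "1.2.3.4".
    """
    shared = 0
    for level_a, level_b in zip(ec_a.split("."), ec_b.split(".")):
        if level_a != level_b:
            break  # the hierarchy diverges here, so stop counting
        shared += 1
    return shared


# Examples taken from the questions above
print(shared_ec_levels("2.3.5.17", "2.3.5.25"))  # 3 shared levels
print(shared_ec_levels("1.2.3.4", "1.5.3.4"))    # 1 shared level
```

A larger count means the two enzymes sit in the same, narrower branch of the classification. Note that the comparison stops at the first difference, so the matching 3rd and 4th numbers in 1.2.3.4 and 1.5.3.4 do not count.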

2. Exploring the Catalytic Site Atlas (CSA)

Now we’ll take some time to explore an online resource dedicated to the catalytic sites of enzymes: The Catalytic Site Atlas (http://www.ebi.ac.uk/thornton-srv/databases/CSA/). Visit this link and answer the following questions based on what you find on the opening page.

1. Who is the host of the Catalytic Site Atlas?

2. List three references to peer-reviewed articles about the CSA.

3. What kind of information can you find on the CSA?

4. What is their basis for defining catalytic site residues?

Searching for Catalytic Site Residues


There are a number of ways you can explore the Catalytic Site Atlas. You can search by EC#, you can browse their literature entries (http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Browse.pl), or you can enter a PDB code or a SwissProt code. Let’s start by looking at a structure that the CSA has elected to use as a template, catechol 2,3-dioxygenase, PDB code 1mpy. Enter 1mpy in the PDB code box that is located near the top of the CSA homepage.

This should bring up a page entitled, “CSA entry for 1MPY” that looks like this (as of August, 2011).


Explore this page and answer the following questions about the CSA entry for 1mpy:

1. What is the EC# for catechol 2,3-dioxygenase?

2. What residues are found in the active site? What is their source for that information?

3. Follow some of the links that are listed under “Other Databases” on the page. Notice that there are links to sites that were mentioned previously: the PDB, KEGG and Swiss-Prot. Can you find information at the top level of each of these links that is not found on the main CSA page for 1mpy?

4. How many homologs of 1mpy are found in the CSA (look for the link under “Other CSA entries”)?


Do they all have the same EC#? If the numbers are different, which digits are different? Remember that the first digit is one of the 6 main classes and that each subsequent number represents a subset of the number to its left.

Pick one of these homologs with a different EC# and see if it has the same catalytic residues listed. What did you find?

5. Now repeat this exercise starting from a PDB code (which you’ll need to get from the Protein Data Bank (http://www.rcsb.org/pdb/home/home.do)) for one of the structures that are covered in the textbook. I suggest trypsin or elastase.

a. What is the EC#?

b. What are the active site residues?

c. What does Swiss-Prot or KEGG say about this enzyme?

d. How many homologs of this enzyme are found in the CSA? Do these homologs have the same EC#? Do they have the same active site residues?

3. Active Site Alignment in PyMOL Using the ProMOL Plug-In

In this exercise, you will search for enzyme active sites using the ProMOL plug-in for the PyMOL molecular graphics environment. Before you start, you’ll need to install the software. The attached ProMOL User Guide (ProMOL_User_Guide.pdf) contains instructions for installing PyMOL and ProMOL on Windows, Macintosh and Linux operating systems. When you are done and have launched PyMOL, it should look like this on your screen (images taken from a MacBook Pro running OS X Snow Leopard). Your version of PyMOL may differ slightly depending on your operating system and the version of PyMOL you are running. In every operating system, you should see two windows: the PyMOL GUI (on top in this image) and the molecular viewer (below).


To launch the ProMOL plug-in, click on the Plugin dropdown menu and select ProMOL.

This will launch the GUI for ProMOL.


The ProMOL User Guide describes the functions that can be found on each of the tabs and buttons. You are encouraged to explore these, but for this exercise, we’ll focus on the Motif Finder tab, which looks like this. The Motif Finder will compare the structure you submit (by PDB code) to a library of over 400 active site motifs, which are based on the active sites described in the Catalytic Site Atlas.


In the Tools box in the lower left hand corner, make sure you click on the Show Alignment check box before proceeding.

Now enter the PDB code 1hbz (a catalase from Micrococcus lysodeikticus (http://www.rcsb.org/pdb/explore/explore.do?structureId=1HBZ)) in the box labeled “PDBs to Search” and click the Start button. There is a very good template match to 1hbz in the ProMOL motif library, so you will see a clear result and alignment. Before the search begins, a dialog box will appear that asks which set of motifs to use for comparison.

Select the “All motifs” radio button, then click the Select button to start the search. This process can take 5-15 minutes, depending on your operating system (Linux is easily the fastest) and the CPU on your computer. In any case, now would be a good time to get a cup of coffee.

Once the search process is complete, the results window will be populated with alignments. Each alignment is given a score based on a Levenshtein distance, which reflects the number of differences between the query structure (1hbz in this case) and the template motifs that were identified as matches. In simple terms, the Levenshtein distance from the word “horse” to the word “house” is 1, since you need to change only the 3rd letter, from the “r” in horse to the “u” in house.
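For readers who want to see how a Levenshtein distance is computed, here is a minimal Python sketch of the standard dynamic-programming method. It is not ProMOL's own scoring code, and the function name levenshtein is simply chosen for this illustration. Because it works on any sequence, it can compare lists of three-letter residue names as well as words; the residue lists shown are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-item insertions, deletions or
    substitutions needed to turn sequence a into sequence b."""
    # previous[j] holds the distance between the items of a processed
    # so far and the first j items of b
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("horse", "house"))  # 1

# Hypothetical residue-name lists: one residue missing gives a distance of 1
print(levenshtein(["HIS", "SER", "ASN"], ["HIS", "SER"]))  # 1
```

Only two rows of the table are kept at a time, which is enough because each row depends only on the one before it.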

While you’re waiting for the alignment to be completed, go to the CSA entry for 1hbz (http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Site_Wrapper.pl?pdb=1hbz). Scroll down the page to see the residues that constitute the active site for 1hbz: histidine-61, serine-99 and asparagine-133, all on chain A. The Levenshtein distance is based only on the residue names (not their numbers) and whether they align in 3D space. So if a template has a Levenshtein distance of 0, that means it contains histidine, serine and asparagine that align closely with the query structure.

Once the Motif Finder search is completed, you should see the top 19 alignments for 1hbz in the Motif Finder results window. Let’s look at the first result (0-P_1mqf_1_11_1_6) for an explanation of the code.


0 The Levenshtein distance. A value of 0 means that all the residues in the query match the residues in the template. In other cases, a range of distances is shown. A range of 0/2 means that the best fit has a Levenshtein distance of 0 and the next best fit has a distance of 2. In practice, Levenshtein distances greater than 1 usually don’t provide meaningful results.

P This is a motif created by ProMOL. You can learn more about designing your own motifs in the ProMOL User Guide.

1mqf The PDB code for the template from the library that aligns with the query structure (1hbz)

1_11_1_6 The EC# for catalase, with the digits separated by an underscore rather than a period (a programming convention)
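The result code described above can also be taken apart programmatically. The following Python sketch is only an illustration: the function parse_motif_label and its field names are hypothetical and are not part of ProMOL. It also assumes that a range of distances is written with a slash before the hyphen, following the 0/2 notation mentioned above.

```python
def parse_motif_label(label):
    """Split a Motif Finder result code such as '0-P_1mqf_1_11_1_6'.

    Hypothetical helper; the layout is inferred from the explanation
    of the code given in the text.
    """
    distance_part, rest = label.split("-", 1)
    fields = rest.split("_")
    return {
        # assumed: a range such as '0/2' becomes [0, 2]
        "distances": [int(d) for d in distance_part.split("/")],
        "origin": fields[0],                # 'P' = motif created by ProMOL
        "pdb_code": fields[1],              # template structure
        "ec_number": ".".join(fields[2:]),  # underscores back to periods
    }


result = parse_motif_label("0-P_1mqf_1_11_1_6")
print(result["pdb_code"])   # 1mqf
print(result["ec_number"])  # 1.11.1.6
```

Converting the EC# back to its usual dotted form makes it easy to paste into the Expasy Enzyme or KEGG search boxes.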


Now go to the results window and double click on 0-P_1mqf_1_11_1_6 to generate an alignment in the PyMOL visualization window. Remember that the Show Alignment check box in the Tools box in the lower left hand corner must be checked to see the alignment of the query (1hbz) and the template (1mqf).

The residues from the query structure (1hbz) are shown in red, while the residues from the template (1mqf) are shown in white. If you click on the individual residues, the image will change slightly.


The residues you clicked on will then be listed in the PyMOL GUI pane. Keep a record of the residues from your query and template structures. The format for the results in the PyMOL GUI is /PDB_id/chain/residue`number/atom, e.g., /1mqf/A/SER`93/C.

This is clearly a very close alignment. If you look for homologs on the CSA page for 1hbz (http://www.ebi.ac.uk/thornton-srv/databases/cgi-bin/CSA/CSA_Site_Wrapper.pl?pdb=1hbz), you will find that 1mqf is listed there.

Now check some of the other hits from the Motif Finder results page. If you double click on 0-P_1o2u_3_4_21_4, you’ll find that the template (1o2u) active site contains only two residues that match the query. Because the alignment is not as close as that seen with 1mqf and only two residues are involved, this is a weaker match.


Let’s look at an alignment with a Levenshtein distance of 1: 1-P_1bk7_3_1_27_1.

The Levenshtein distance is 1 because the query (1hbz) is missing one of the residues that is found in the template (1bk7). The alignment for the two matching residues is actually quite good. Notice that the matching residues are not part of the catalase active site in 1hbz.


Another interesting feature of enzyme active sites that is reinforced by this exercise is that the catalytic residues in enzymes can be quite distant from each other in the primary structure of an enzyme.
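To put a number on "distant in the primary structure", the short Python sketch below computes how many positions apart the 1hbz catalytic residues listed on its CSA page (histidine-61, serine-99 and asparagine-133) lie in the sequence. The function name sequence_separations is hypothetical and chosen only for this illustration.

```python
from itertools import combinations

# Catalytic residues of 1hbz (chain A) as listed on its CSA page
catalytic_residues = {"HIS": 61, "SER": 99, "ASN": 133}


def sequence_separations(residues):
    """Return the sequence separation for every pair of residues."""
    return {
        (name_a, name_b): abs(number_a - number_b)
        for (name_a, number_a), (name_b, number_b)
        in combinations(residues.items(), 2)
    }


for pair, separation in sequence_separations(catalytic_residues).items():
    print(pair, separation)
# ('HIS', 'SER') 38
# ('HIS', 'ASN') 72
# ('SER', 'ASN') 34
```

Residues that are dozens of positions apart in the sequence are brought together in the active site by the folding of the polypeptide chain.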

You are encouraged to repeat this same exercise with other PDB entries. Here are two structures that give interesting results: 1btl and 1b6g. After running these, pick a few structures at random from the PDB and see what you find.

As you complete this exercise, make sure that you record the catalytic site residues from both your query and template structures.

1btl: Catalytic site residues: SER-70, ???, ???, ???
template: 1pzp: residues:

1b6g: Catalytic site residues:
template: 2had: residues: