Exploring protein function and evolution using free online bioinformatics tools



Received for publication, April 25, 2005, and in revised form, June 23, 2005

Todd Weaver‡ and Scott Cooper

From the Chemistry and Biology Departments, University of Wisconsin, La Crosse, Wisconsin 54601

Keywords: Molecular model, secondary structure, motif, alignment, ortholog, paralog.

Bioinformatics provides a set of powerful research tools for predicting the function of a newly discovered protein and has quickly become an important field of training at many universities and medical institutions [1, 2]. Bioinformatics can also be used to explore regions of similarity and identity within families of proteins (paralogs) and across species (orthologs). Because the number of protein families with available three-dimensional (3D)¹ structures has increased 3-fold over the past 8 years, an integrative approach may be taken in which paralog and ortholog sequence alignments are superimposed and visualized onto a representative 3D structure [3]. The teaching philosophy within undergraduate laboratory courses has also shifted away from cookbook experiments toward guided inquiry. The field of bioinformatics is an excellent choice for a guided-inquiry experience because of the huge databases and wide variety of questions students can ask [4].

The following laboratory exercise highlights the successful integration of bioinformatics into the undergraduate laboratory setting. The procedures involve analysis of an amino acid sequence for motifs, domains, secondary structure, and paralog and ortholog alignments as well as mapping these alignments onto a representative protein structure. The exercise emphasizes laboratory skills deemed critical for success of students within their scientific and medical careers [5].

BACKGROUND

We teach the protein structure unit of a bioinformatics course first developed at University of Wisconsin-La Crosse 4 years ago. The course is taught in a computer laboratory twice a year to about 40 students; thus we have modified and refined this exercise over eight course offerings. The protein structure unit is taught over several days, requiring 6–8 h in class plus some out-of-class time to complete. One of the major challenges has been finding protein sequences that allow students to answer all of the questions posed. All of the exercises make use of the free web-based programs Biology Workbench (workbench.sdsc.edu), Protein Explorer (molvis.sdsc.edu/protexpl/frntdoor.htm), and ConSurf (consurf.tau.ac.il), and the free download MDL® Chime (www.mdl.com/products/framework/chime/index.jsp), which is used to view protein structures. We wanted students to develop an understanding of the algorithms used in protein bioinformatics as well as their strengths and limitations. In addition, we wanted students to experience how these tools add to our understanding of protein function and evolution. Specifically there are four goals of this exercise: 1) to use bioinformatics tools to make predictions about a protein based upon the amino acid sequence, 2) to assess the accuracy of these predictions by comparison with the 3D structure of the protein and published literature, 3) to observe the location of conserved amino acids on the 3D structure of the protein, and 4) to compare the degree of sequence similarity/identity between orthologs and paralogs.

LABORATORY EXERCISE

Predictions from Primary Sequence Data. In the first part of the unit, students are assigned an amino acid sequence for an unknown protein. Each student then uses bioinformatics programs in Biology Workbench to analyze the primary amino acid sequence to make several predictions. The program ProSearch is used to predict motifs and contains excellent links to references describing each motif. Regions of secondary structure are predicted using PELE, which uses eight different structural prediction algorithms simultaneously and presents all of the results together. This allows students to visualize which areas are most likely to be an α helix or β sheet and also nicely illustrates that different algorithms will produce different results with the same data. Three programs are used to predict hydrophobic regions and transmembrane domains. GREASE performs a Kyte-Doolittle hydrophobicity profile, TMAP performs the same analysis but also identifies hydrophobic patches that are long enough to span a lipid bilayer, and TMHMM looks for hydrophobic α helices that are long enough to span a membrane. Thus, by comparing all three outputs students can identify transmembrane domains and whether or not they are α helical. This is especially useful in identifying signal peptides, which can be used to determine the potential intra- or extracellular location of their protein (Fig. 1).
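For readers who want to see what a Kyte-Doolittle profile computes, the following is a minimal Python sketch of a sliding-window hydropathy average. It is not the code used by GREASE, TMAP, or TMHMM. The function names, the 19-residue window, and the 1.6 cutoff are assumptions chosen for illustration (a commonly quoted rule of thumb for membrane-spanning stretches), and the test sequence is hypothetical.

```python
# Kyte-Doolittle hydropathy values (Kyte & Doolittle, 1982).
KD_SCALE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Return the mean hydropathy of every full window along the sequence."""
    values = [KD_SCALE[residue] for residue in sequence.upper()]
    if window < 1 or window > len(values):
        return []
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


def hydrophobic_windows(sequence, window=19, threshold=1.6):
    """Return 0-based start positions of windows whose mean exceeds threshold.

    The window length and threshold are illustrative assumptions, not
    parameters taken from any of the programs named in the text.
    """
    profile = hydropathy_profile(sequence, window)
    return [start for start, score in enumerate(profile) if score > threshold]


if __name__ == "__main__":
    # Hypothetical sequence: a hydrophobic stretch flanked by charged residues.
    toy = "MKKDE" + "LLIVALLAVLLIVAGLLAV" + "KRDEEKRS"
    for start in hydrophobic_windows(toy):
        print(start, toy[start:start + 19])
```

A hydrophobic window near the N terminus is the kind of signal that, in the exercise, prompts students to ask whether their protein carries a signal peptide.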

‡ To whom correspondence should be addressed: Dept. of Chemistry and Biology, University of Wisconsin, La Crosse, WI 54601. E-mail: [email protected].

1 The abbreviations used are: 3D, three-dimensional; 2D, two-dimensional.

© 2005 by The International Union of Biochemistry and Molecular Biology. BIOCHEMISTRY AND MOLECULAR BIOLOGY EDUCATION, Vol. 33, No. 5, pp. 319–322, 2005. Printed in U.S.A.

This paper is available on line at http://www.bambed.org

After making their predictions, the students then find their sequences using the BLASTP algorithm against the PDBFINDER (protein sequence database for 3D structures solved by either x-ray or nuclear magnetic resonance techniques) and SWISSPROT-HUMAN (sequence from the DNA) databases in Biology Workbench. Although they already have their assigned amino acid sequence, finding the same sequence in these databases will provide them with the accession number for a Protein Data Bank file containing the 3D coordinates of their protein, the cDNA sequence with any signal peptides intact, and accompanying references about their protein. By comparing their predictions with the actual data, the students can analyze the accuracy of the programs. Specifically they are asked to compare how many α helices and β sheets were predicted with how many were in the 3D structure using the program Protein Explorer (Fig. 2). By comparing predictions about motifs and cellular localization with published data, the accuracy of the other bioinformatics programs can also be evaluated.
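The comparison of predicted and observed secondary structure can also be made per residue. The sketch below is a hedged illustration in Python, not a description of how PELE or Protein Explorer work: it assumes each prediction is written as a string with one letter per residue (H for helix, E for strand, C for everything else, a common three-state convention), builds a majority-vote consensus across several algorithms, and reports the fraction of residues that agree with the observed structure. The function names and the example strings are hypothetical.

```python
from collections import Counter


def consensus_prediction(predictions):
    """Majority vote per residue across equal-length prediction strings."""
    lengths = {len(p) for p in predictions}
    if len(lengths) != 1:
        raise ValueError("all predictions must have the same length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*predictions)
    )


def per_residue_accuracy(predicted, observed):
    """Fraction of positions where the predicted state matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    if not observed:
        return 0.0
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


if __name__ == "__main__":
    # Hypothetical outputs from three prediction algorithms for one peptide.
    algorithm_outputs = [
        "CCHHHHHHCCEEEECC",
        "CHHHHHHCCCEEEECC",
        "CCHHHHHCCCCEEECC",
    ]
    # Hypothetical states read from the solved 3D structure.
    observed = "CCHHHHHHCCCEEECC"

    for output in algorithm_outputs:
        print(output, round(per_residue_accuracy(output, observed), 2))
    consensus = consensus_prediction(algorithm_outputs)
    print(consensus, round(per_residue_accuracy(consensus, observed), 2))
```

Scoring each algorithm separately and then scoring the consensus makes the point of the exercise concrete: different algorithms disagree on the same data, and no single one needs to be treated as authoritative.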

This part of the exercise is valuable for several reasons. Students learn the types of predictions that can be made about a protein using bioinformatics tools. By comparing different algorithms, they see that different tools have different limitations. By comparing their predictions to actual published results and 3D structures, they can also assess the accuracy of these algorithms. Finally, the exercise allows students to apply their results to a biological question, i.e. what do these results say about the function and location of an enzyme in a cell, and is this consistent with the activity of this enzyme?

FIG. 1. TMAP analysis of cathepsin K showing a Kyte-Doolittle hydropathy profile. The bar in the upper left corner indicates a transmembrane signal peptide, indicating that this protein could be secreted. Source: A typical student laboratory report.

FIG. 2. Analysis of the accuracy of the eight algorithms that predict α helices and β sheets contained in PELE. Predicted secondary structures can be compared with the 3D protein structure of cathepsin K viewed with Protein Explorer. Source: A typical student laboratory report.

FIG. 3. Increased identity is observed in cathepsin K orthologs relative to cysteine protease human paralogs. Superimposition of a 2D ClustalW alignment of cysteine protease paralogs onto the 3D structure of cathepsin K using ConSurf reveals that the most conserved amino acids cluster near the active site of each enzyme. Source: A typical student laboratory report.

TABLE I. Proteins that work well for this exercise

Protein                                        Protein family           SwissProt sequence   Protein Data Bank structure
Alcohol dehydrogenase                          Dehydrogenase            ADHB                 3HUDᵃ
Calcium transporter/ATPase                     ATP hydrolase            ATA1                 1WPGᵃ
Acyl coenzyme A: cholesterol acyltransferase   Serine esterase          EST1                 1MX9ᵃ
Cathepsin                                      Cysteine protease        CATK                 1BY8ᵃ
Chymase                                        Serine protease          MCPT1                1KLT
Collagenase                                    Metalloprotease          MMP1                 1SU3
Cytochrome P450 2C9                            Oxidoreductase           CP2C9                1R9O
DNA polymerase                                 Nucleotidyltransferase   DPO1B                1TVAᵃ
Glutathione reductase                          Oxidoreductase           GSHR                 4GR1ᵃ
Glycogen phosphorylase                         Phosphotransferase       PHS1                 1FA9ᵃ
Hexokinase                                     Phosphotransferase       HXK1                 1QHA
Lactate dehydrogenase                          Oxidoreductase           LDHA                 1I10
Lipase                                         Carboxylic esterase      LIPP                 1LPA
Mitogen-activated protein kinase               Kinase                   MK12                 1CM8
Peroxidase                                     Oxidoreductase           PERM                 1MHLᵃ
Phosphotyrosyl phosphatase                     Hydrolase                PTN1                 1XBOᵃ
Cell division protein kinase                   Protein kinase           CDK2                 1HCKᵃ

ᵃ Several structures are available.

Orthologs and Paralogs. In the second part of the unit, students use BLASTP to find six to seven sequences of the same enzyme in different species (orthologs), going back as far as they can phylogenetically. They also use BLASTP to find six to seven other members of the same family of proteins in humans (paralogs) using the SWISSPROT HUMAN database in Biology Workbench. The program ClustalW is then used to create alignments of the orthologs and paralogs (Fig. 3). This allows the students to determine which sequences are most conserved. The students are then asked to explain if any differences in the degree of similarity and identity make sense given the functions of orthologs and paralogs. This is an important concept for students to grasp, and as part of the prelaboratory exercise many incorrectly predict that the human serine proteases trypsin and chymotrypsin (paralogs) will be more conserved than mouse and human trypsin (orthologs). In fact, orthologs have similar functions and thus more similar structures. Having students use their own data and come to the realization that similar function dictates similar structure is a valuable learning experience.
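The ortholog-versus-paralog comparison can be reduced to a single number per alignment. The following Python sketch is an illustration under stated assumptions, not part of the published exercise: it takes sequences that have already been aligned (for example, rows copied from a ClustalW output), counts identical residues over columns where neither sequence has a gap, and averages that percentage over all sequence pairs. Other definitions of percent identity use a different denominator, so the choice here is an assumption. The function names and the toy alignments are hypothetical and do not correspond to real proteins.

```python
from itertools import combinations


def percent_identity(seq_a, seq_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    identical = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # skip columns that contain a gap in either sequence
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def mean_pairwise_identity(alignment):
    """Average percent identity over every pair of sequences in an alignment."""
    pairs = list(combinations(alignment, 2))
    if not pairs:
        return 0.0
    return sum(percent_identity(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical aligned fragments, invented for illustration only.
    orthologs = ["MKCGSCWAFS", "MKCGSCWAFS", "MRCGSCWAFT"]
    paralogs = ["MKCGSCWAFS", "LQCGACWSF-", "-DCGSCYAFA"]
    print("orthologs:", round(mean_pairwise_identity(orthologs), 1))
    print("paralogs: ", round(mean_pairwise_identity(paralogs), 1))
```

Running the same function on a student's own ortholog and paralog alignments gives two numbers that can be compared directly with the prelaboratory prediction.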

The students also observe that the highly conserved regions are scattered throughout the alignment. We use the program ConSurf to allow students to superimpose their two-dimensional (2D) alignment onto the 3D structure of their protein (Fig. 3). When this is done, it is usually obvious that the highly conserved regions from the 2D alignment all fold around the active site in the 3D alignment. Again, students do this for both their ortholog and paralog alignments. This allows them to see that the most conserved residues are in the active site for both alignments. Thus, to be a cysteine protease (Fig. 3), some amino acids can be changed, whereas others cannot.
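The observation that conserved positions are scattered along the alignment can be checked with a simple per-column score. The sketch below is a deliberately crude stand-in written in Python, not the ConSurf method, which estimates conservation with the phylogeny of the sequences taken into account. Here the score for a column is just the fraction of sequences that share its most common residue. The function names, the cutoff, and the toy alignment are assumptions for illustration.

```python
from collections import Counter


def column_conservation(alignment, gap="-"):
    """Fraction of sequences sharing the most common residue in each column."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for column in zip(*alignment):
        residues = [r for r in column if r != gap]
        if not residues:
            scores.append(0.0)  # a column made only of gaps carries no signal
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        # Divide by the number of sequences so that gaps lower the score.
        scores.append(top_count / len(column))
    return scores


def conserved_positions(alignment, cutoff=1.0):
    """0-based alignment columns whose conservation score reaches the cutoff."""
    scores = column_conservation(alignment)
    return [index for index, score in enumerate(scores) if score >= cutoff]


if __name__ == "__main__":
    # Hypothetical aligned fragments, invented for illustration only.
    toy_alignment = [
        "MKCGSCWAFS",
        "LQCGACWSF-",
        "-DCGSCYAFA",
    ]
    print(conserved_positions(toy_alignment))
```

The printed column indices are positions in the alignment, not residue numbers in the structure file. Mapping them onto a 3D model requires the numbering of the reference sequence, which is the step ConSurf handles for the students.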

Student Conclusions. Finally, students craft a conclusion summarizing their findings. We ask them to specifically comment on the following:

• How accurate were the bioinformatics programs you used in predicting motifs, secondary structure, etc.?

• Where is the most sequence similarity seen on the 3D structural alignments?

• Were orthologs or paralogs more highly conserved?

• Is this consistent with the relative functions of orthologs and paralogs?

CONCLUSION

This exercise combines several key concepts in biochemistry and protein structure into an inquiry-based problem-solving format. By having students analyze an assigned protein sequence using several programs, they become familiar with the strengths and weaknesses of the algorithms. Comparing this information to the published data and 3D structures requires students to form a detailed explanation of where the protein is found in a cell, the location of its active site and other motifs, and the overall function of the protein. By next aligning the protein with a collection of orthologs and paralogs, the students are forced to wrestle with the integration between protein structure and function, and they soon learn that having a common function often requires a more conserved structure. The fact that conserved residues cluster around the active site in a 3D model but may be far apart in a 2D alignment is also a valuable observation about the 3D structure of proteins.

Over time, we discovered that certain proteins work well in this exercise and others don't. First, the protein must have a published high-quality structure. Second, enzymes work better than other proteins like globins, receptors, or structural proteins because the motif prediction programs find enzyme active-site motifs better than other motifs. Finally, there need to be several paralogs in humans. We were fairly naïve the first time we tried the exercise and picked several enzymes in the Krebs cycle only to find that each was essentially one of a kind and did not belong to an easily studied family of enzymes. Table I lists the proteins we currently use, the families they belong to, and examples of sequence and structural accession numbers. Our exercise can also be found online (bioweb.uwlax.edu/GenWeb/Molecular/Bioinformatics/Unit_4/Lab_4-1/lab_4-1.htm). We assign different proteins to each student, but this exercise could easily be adapted to a single protein that students are studying in your laboratory.

Overall, we have found the protein structure and function unit very effective at using bioinformatics to challenge students to really explore the links between protein structure and function. Our students were able to handle the web-based software, which includes downloading, submitting, and analyzing numerous types of data. Taking advantage of the expanding protein sequence and structure databases has allowed development of this exercise into a popular module integrating protein structure and function.

REFERENCES

[1] L. J. Pike, J. E. Sadler (2004) Proteomics, genomics and the future of medical education, Mol. Med. 101, 496–499.

[2] T. Ideker (2004) Systems biology 101: What you need to know, Nat. Biotechnol. 22, 473–475.

[3] A. G. Murzin, J.-M. Chandonia, A. Andreeva, D. Howorth, L. L. Conte, B. G. Ailey, S. E. Brenner, T. J. P. Hubbard, C. Chothia (2005) SCOP: Structural classification of proteins, available on-line at scop.mrc-lmb.cam.ac.uk/scop/.

[4] E. Bell (2001) The future of education in the molecular life sciences, Nat. Rev. Cell. Mol. Bio. 2, 221–225.

[5] R. Boyer (2003) Concepts and skills in the biochemistry/molecular biology lab, Biochem. Mol. Biol. Educ. 31, 102–105.
