REF52 Protein Database · A protein database provides the means to store and compare quantitative...



THE JOURNAL OF BIOLOGICAL CHEMISTRY
© 1989 by The American Society for Biochemistry and Molecular Biology, Inc.

Vol. 264, No. 9, Issue of March 25, pp. 5283-5298, 1989
Printed in U.S.A.

The REF52 Protein Database

METHODS OF DATABASE CONSTRUCTION AND ANALYSIS USING THE QUEST SYSTEM AND CHARACTERIZATIONS OF PROTEIN PATTERNS FROM PROLIFERATING AND QUIESCENT REF52 CELLS*

(Received for publication, October 12, 1988)

James I. Garrels‡ and B. Robert Franza, Jr.

From the Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724

The construction and analysis of protein databases using the QUEST system are described, and the REF52 protein database is presented. A protein database provides the means to store and compare quantitative and descriptive data for up to 2000 proteins from many experiments that employ computer-analyzed two-dimensional gel electrophoresis. The QUEST system provides the tools to manage, analyze, and communicate these data.

The REF52 database contains experiments with normal and transformed rat cell lines. In this report, many of the proteins on the REF52 map are identified by name, by subcellular localization, and by mode of post-translational modification. The quantitative experiments analyzed and compared here include 1) a study of the quantitative reproducibility of the analysis system, 2) a study of the clonal reproducibility of REF52 cells, 3) a study of growth-related changes in REF52 cells, and 4) a study of the effects of labeling cells for varying lengths of time. Of the proteins analyzed from REF52 cells, 10% are nuclear, 6% are phosphoproteins, and 4% are mannose-labeled glycoproteins. The mannose-labeled proteins are more prominent in patterns from quiescent cells, while the synthesis of cytoskeletal proteins is generally repressed at quiescence. A small set of proteins, selected for elevated rates of synthesis in quiescent versus proliferating cells, includes one of the tropomyosin isoforms, a myosin light chain isoform, and several prominent glycoproteins. These proteins are thought to be characteristic of the differentiated state of untransformed REF52 cells. Proteins induced early versus late after refeeding quiescent cells show very different patterns of growth regulation. These studies lay the foundations of the REF52 database and provide information needed to interpret the experiments with transformed REF52 cells, which are reported in the accompanying paper (Garrels, J., and Franza, B. R., Jr. (1989) J. Biol. Chem. 264, 5299-5312).

*This work was supported by Grant P41 RR02188 from the National Institutes of Health Division of Research Resources (J. I. G.), by grants from the Cancer Research Institute (to J. I. G. and B. R. F.), by Grant 321100 from the New York State Science and Technology Foundation (to J. I. G.), by National Institutes of Health Grant CA13106 (to J. I. G.), and by National Institutes of Health Grant CA40512 (to B. R. F.). The costs of publication of this article were defrayed in part by the payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

‡ To whom correspondence should be addressed.

Two-dimensional gel electrophoresis can be used to resolve and quantify as many as 2000 proteins from unfractionated cell or tissue lysates. Each protein pattern contains information that potentially can be used to identify the differentiated state, the proliferative status, and the possible pathological states of the cells analyzed. Perhaps even more informative than the protein patterns themselves are the changes in the patterns that occur as cells respond to environmental stimuli or experimental perturbations. Each cell type has a different repertoire of responses to external factors, and it is the changes in these response patterns that are the best indicators of pathological states such as transformation. Research into many aspects of cellular regulation could benefit from the power of two-dimensional electrophoresis to reveal the patterns of genetic expression at the protein level.

Although two-dimensional gel electrophoresis has been widely used, it is still not possible to fully interpret complex two-dimensional gel patterns. Improvements to the methods, equipment, and quality control procedures (1-4) have led to more reproducible patterns, and the development of specialized computer software (1, 5-11) has facilitated the quantitative analysis of two-dimensional gels. Even quantitative analysis, however, adds little to the interpretation of two-dimensional gel patterns without efforts to identify many more of the spots and to accumulate knowledge about the proteins they represent.

Many of the spots on two-dimensional gels can be identified with known proteins, and many more spots can be identified as proteins from particular subcellular fractions or as proteins with particular post-translational modifications. Additional knowledge about each of the proteins detected on two-dimensional gels comes from quantitative experiments designed to reveal the altered patterns of synthesis, modification, or turnover that result from particular experimental manipulations. The information gained from all such studies can be accumulated, analyzed, and shared between laboratories in the form of protein databases.

Several laboratories have begun the database approach to two-dimensional gel analysis. Neidhardt has placed the names of nearly 200 Escherichia coli proteins on a reference gel map, and he has determined the genetic map positions of genes coding for approximately 1000 of the proteins detected on two-dimensional gels (12, 13). McLaughlin and co-workers (14) and Boucherie and co-workers (15) have undertaken similar studies to identify yeast proteins, and Anderson and co-workers (16) have identified many of the proteins of human serum. Other laboratories have quantified many of the proteins on two-dimensional maps, either by manual means (17-20) or by computer methods (21-26), and several laboratories have described systems being developed for the construction of fully computerized protein databases (1, 5-11).


In this paper we present the QUEST system for building and analyzing protein databases, and we introduce a database for studies of growth regulation and transformation in the rat cell line REF52. The structure of the database is described to show how the QUEST system can manage descriptive and quantitative data derived from many different experiments. The REF52 database was built by annotating many of the spots on the protein map, by classification of spots on the map into sets based on common biological properties, and by the quantitative analysis of experiments with normal and transformed cell lines. In this report, the software tools for database analysis are presented, and these tools are used to analyze experiments on growth regulation and protein stability in REF52 cells. In the following report (27), the studies of transformed REF52 cells are reported, and the responses to viral transformation are compared to the normal proliferative responses of REF52 cells.

EXPERIMENTAL PROCEDURES

Materials, Cell Culture, Sample Preparations, and Electrophoresis

Materials-Cell medium was from Gibco (430-1200) and serum was Rehatuin F.S. from Armour Pharmaceutical (0268-00). Isotopes were [35S]methionine from Amersham Corp. (SJ104), [3H]mannose from Amersham Corp. (TRK-364), and [33P]phosphate from Du Pont-New England Nuclear (NEZ-080). Materials for sample preparation included ultrapure urea from Schwartz-Mann (821527), SDS¹ from Bio-Rad (161-0302), and DNase I (6330) and RNase A (5679) from Worthington (Cooper). Reagents for gel electrophoresis included ampholytes (pH 3.5-10) from LKB (1809-101); SDS (161-0302), bisacrylamide (161-0201), and glycine (161-0718) from Bio-Rad; acrylamide for IEF gels from Bio-Rad (161-0102); acrylamide for slab gels from Serva (810675); Tris base (T1503) and Tris-HCl (T3253) from Sigma; Nonidet P-40 from Particle Data Laboratories; and dithiothreitol from Calbiochem (233155). 2,5-Diphenyloxazole for fluorographic processing was from Du Pont-New England Nuclear (NEF-901), and dimethyl sulfoxide was from Kodak (811-6840). Film was Kodak XAR, XS, or XRP (10 × 12-inch size), and films were processed using a Kodak M7B processor.

Cell Culture-REF52 cells were from stocks maintained at the Cold Spring Harbor Laboratory. Cells were grown in Dulbecco's modified Eagle's medium supplemented with 10% fetal calf serum. Cells were maintained by replating every 4-6 days.

Radiolabeling of Cells-In the standard radiolabeling procedure, the culture medium is not changed before addition of radioisotopes, but is reduced in volume to the minimum needed to cover the cells. The cell sample used for replicate gels (Matchset 1) was from REF52-PSV7-8 cells, a derivative of REF52 cells isolated by B. R. Franza after microinjections of the PSV7 plasmid (from T. Shenk), which contains the SV40 early region genes. These cells were labeled for 24 h in medium containing glucose at 10% of the normal level and 10% fetal calf serum. Confluent cells were labeled with 500 µCi of [35S]methionine in 100-mm culture dishes.

Clones of REF52 cells (Matchset 2) were isolated from passage 14 cells by picking colonies derived from single cells. The clones were radiolabeled with [35S]methionine at passage 4-5 in 60-mm dishes at 60% confluence. Cells for the REF52 growth-regulation experiments (Matchset 3) were plated at 10% of saturation density in 24-well Linbro plates and labeled at days 2, 3, 4, 6, and 12 after plating without a change of the culture medium. Cells were labeled for 2 h in 100 µl of medium using 125 µCi of [35S]methionine. Some wells were refed with fresh medium and serum at 15 h prior to labeling on day 6, at 5 h prior to labeling on day 12, and at 15 h prior to labeling on day 12. In the time-of-labeling experiment (Matchset 4), cells at 60% confluence in 24-well Linbro plates were labeled without a change of media by adding 250 µCi/well of [35S]methionine to wells containing 0.15 ml of medium.

In the mannose-labeling experiments, cells at 70-80% confluence in 24-well Linbro plates were refed 3 h prior to labeling with medium containing 10% normal glucose and supplemented with pyruvate, glutamine, and 5% dialyzed serum. Tunicamycin (1 µg/ml) was added to some wells. [3H]Mannose was dehydrated and dissolved in labeling medium before addition to cells at a concentration of 150 µCi/well in a volume of 100 µl. Some wells were radiolabeled with [35S]methionine at a concentration of 250 µCi in 100 µl. All labelings were carried out for 3 h. For tunicamycin-pretreated cells, tunicamycin was present at 1 µg/ml throughout the labeling. Because the medium was changed prior to labeling, serum-stimulation responses were expected.

¹ The abbreviations used are: SDS, sodium dodecyl sulfate; HEPES, 4-(2-hydroxyethyl)-1-piperazineethanesulfonic acid; SSP, standard spot; IEF, isoelectric focusing; PCNA, proliferating cell nuclear antigen.

The phosphate-labeled patterns shown in the text are from a series of experiments in which REF52 cells were grown in defined medium, although most of the phosphoproteins have also been detected in cells grown in medium containing 10% fetal calf serum. Cells at 70% confluence were labeled without reduction of the phosphate concentration in the medium by adding 2 mCi of [33P]phosphate to 0.6 ml of medium in 35-mm dishes. Cells were harvested after a labeling period of 24 h. The labeling had no apparent effect on the frequency of mitotic cells in the culture.

Nuclear Isolation-REF52 and WT6 cells (SV40-transformed REF52 cells) at 70% confluence were labeled for 2 h with 500 µCi of [35S]methionine after refeeding with medium lacking methionine but still containing 10% fetal calf serum. After labeling, the cells were rinsed with cold phosphate-buffered saline and scraped from the dish into a buffer containing 0.01% Nonidet P-40, 50 mM Tris-HCl, 0.01% sodium deoxycholate, 1 mM phenylmethylsulfonyl fluoride, 0.6 mM MgCl2, 2 mM HEPES, pH 8.0. Crude nuclei were pelleted and resuspended in 0.25 M sucrose containing 1 mM MgCl2, 3 mM HEPES, 7 mM NaCl, and 7 mM Tris-HCl, pH 8.0. The suspension was passed through a 26-gauge syringe needle, layered onto 1.1 M sucrose containing 1.5 mM MgCl2, 3.5 mM HEPES, 5 mM NaCl, 5 mM Tris-HCl, pH 8.0, and centrifuged at 12,000 × g for 15 min. The pellet was resuspended in 0.3% SDS, 50 mM Tris-HCl, pH 8.0 at 100 °C, and prepared for electrophoresis by the same procedure as used for total cell lysates.

Sample Preparation-Samples of total cellular proteins were prepared from cultured cells that had been rinsed thoroughly with cold phosphate-buffered saline. Cells were scraped from the dish into a solution of hot (100 °C) 0.3% SDS, 1% β-mercaptoethanol, 50 mM Tris-HCl, pH 8.0. Samples were transferred to 1.5-ml centrifuge tubes, cooled on ice, treated with 0.1 volume of nuclease solution (1 mg/ml DNase I, 500 µg/ml RNase A, 50 mM MgCl2 in 50 mM Tris-HCl, pH 7.0) for 1 min, frozen in liquid nitrogen, and lyophilized. After lyophilization, samples were redissolved to the original volume with sample buffer (9.95 M urea, 4.0% Nonidet P-40, 2% LKB pH 6-8 ampholytes, and 100 mM dithiothreitol). Samples were stored at -70 °C. Further dilutions, if necessary, were done just before electrophoresis using sample buffer containing 0.3% SDS.

Electrophoresis Procedures-Two-dimensional gels were run in the QUEST two-dimensional gel laboratory at the Cold Spring Harbor Laboratory according to previously published procedures (1, 4). All gels analyzed in this report were run using pH 3.5-10 LKB ampholytes in the first dimension and 10% acrylamide in the second dimension. Many of the samples have also been run on other types of two-dimensional gels, typically using pH 5-7 and pH 6-8 ampholytes in the first dimension and 7.5% acrylamide in the second dimension, and using pH 3.5-10 ampholytes in the first dimension followed by 12.5% acrylamide in the second dimension. Typically 10 µl of sample containing 200,000-500,000 dpm of trichloroacetic acid-precipitable radioactivity and 5-20 µg of protein were applied to each gel.

Database Construction and Analysis

Scanning, Quantitation, and Pattern Matching-The equipment and methods used by the QUEST system for two-dimensional gel scanning, spot quantitation, pattern matching, and entry of data into the QUEST protein databases are presented in the accompanying report (28).

Quantitative Data Stored in the QUEST Protein Databases-As described previously (28), groups of 6-10 gels called matchsets are typically processed together through the image analysis and matching programs and are entered as a unit into a QUEST protein database. A QUEST protein database is a permanent on-line collection of data reduced from the analysis of many matchsets. During the analysis of each matchset, the detected spots are quantified in terms of parts per million (ppm) of the total protein radioactivity loaded onto the respective gel. The data entered into the database for each gel of a matchset are 1) the normalized spot ppm values, 2) the quality and maximum film density for each spot, 3) the x- and y-coordinates (in the local coordinate system of the gel) for each spot, and 4) the spot matching information for each pair of gels in the matchset. For a matchset of 10 gels, only 25 bytes of quantitative and matching information are stored per spot. From a typical gel with 4000 entries in its gelspots file, including entries for "combined" and "undetected" spots (28), about 100,000 bytes of database storage are needed.
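The figures in the preceding paragraph can be checked with a short Python sketch. It is illustrative only: the function names and the sample spot intensities are hypothetical, and only the 25 bytes per spot and the 4000 gelspots entries are taken from the text.

```python
def spots_to_ppm(spot_dpm, total_loaded_dpm):
    """Express each spot as parts per million of the radioactivity loaded on its gel."""
    return [1e6 * dpm / total_loaded_dpm for dpm in spot_dpm]


def storage_bytes(gelspots_entries, bytes_per_spot=25):
    """Database storage for one gel; 25 bytes per spot is the figure quoted for a 10-gel matchset."""
    return gelspots_entries * bytes_per_spot


# Hypothetical spot intensities on a gel loaded with 300,000 dpm.
ppm = spots_to_ppm([150.0, 30.0, 3.0], total_loaded_dpm=300_000.0)
print(ppm)                  # [500.0, 100.0, 10.0]
print(storage_bytes(4000))  # 100000
```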

Data from a given matchset is stored in the QUEST databases as four data vectors (ppm, quality, x-coordinate, and y-coordinate) for each gel, and as one data vector (the list of matched spot numbers) for the match between each pair of gels in the matchset. Each data vector is a record with a header for identification, a variable length data region, and a pad of unused bytes for expansion in case the matchset is re-edited and re-entered into the database. These records are arranged serially in a large file; the addresses and lengths of each data record are recorded in the MATCHSETS records maintained by the QST data management system (5, 28).
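One way to picture such a record is the following Python sketch. The byte layout is entirely an assumption made for illustration (a 4-byte tag, gel number, data length, and pad length in the header, with 32-bit floats as data); the actual QUEST record format is not given in the text.

```python
import struct

# Hypothetical header: 4-byte kind tag, gel number, data bytes, pad bytes.
HEADER = struct.Struct("<4sIII")


def pack_vector(kind, gel, values, pad_fraction=0.25):
    """Build one record: header, variable-length data region, spare pad bytes."""
    data = struct.pack(f"<{len(values)}f", *values)
    pad = b"\x00" * int(len(data) * pad_fraction)
    return HEADER.pack(kind, gel, len(data), len(pad)) + data + pad


def unpack_vector(blob, address):
    """Read the record that starts at `address` in the serial file."""
    kind, gel, n_data, n_pad = HEADER.unpack_from(blob, address)
    start = address + HEADER.size
    values = list(struct.unpack_from(f"<{n_data // 4}f", blob, start))
    return kind, gel, values


# Records are laid out serially; an index keeps each (address, length),
# playing the role the text assigns to the MATCHSETS records.
records = [
    pack_vector(b"PPM ", 8489, [500.0, 100.0, 10.0]),
    pack_vector(b"QUAL", 8489, [90.0, 75.0, 50.0, 60.0]),
]
index, blob = [], b""
for rec in records:
    index.append((len(blob), len(rec)))
    blob += rec

address, length = index[1]
kind, gel, values = unpack_vector(blob, address)
print(index, kind, gel, values)
```

The pad lets a re-edited vector grow in place without moving the records that follow it, which is the purpose the text gives for the unused bytes.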

The Network Structure of the Database and Retrieval of Quantitative Data-The spot matching information stored in the database allows the system to trace any given spot to the corresponding spot in any other gel of the same matchset. The match data vectors contain only local spot numbers and provide no information for matches to spots of gels outside the matchset. To trace the match of any given spot to its corresponding spot in a gel from another matchset, linker gels are used to form a bridge between matchsets.

Linker gels are gels that are members of more than one matchset. A linker gel is created by copying the gel image and gelspots files from one matchset to another. In the course of matching, refitting, and editing the spots of each matchset, some spots are normally added, deleted, combined, or shifted slightly in size or position. Therefore, the ppm, quality, and spot coordinate data stored in the database for the copies of a linker gel that appear in different matchsets will not be identical. However, no explicit matching of the spots between separate copies of linker gels is necessary. The system can automatically, whenever needed, take a spot from one copy of a linker gel and find the best match in another copy of the linker gel by searching for spots with matching coordinate positions. Therefore, given the gel-to-gel matches that are recorded within each matchset, and the matches that can be generated on demand between linker gels, the system is able to trace a spot from any given gel to the corresponding spot of any other gel within the network of linked matchsets.
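The on-demand match between two copies of a linker gel can be sketched as a nearest-coordinate search. This is a minimal Python illustration, not the QUEST algorithm: the function name, the distance tolerance, and the sample coordinates are all assumptions.

```python
import math


def best_coordinate_match(spot_xy, candidates, max_distance=2.0):
    """Return the number of the candidate spot closest to spot_xy,
    or None if no candidate lies within max_distance."""
    best, best_d = None, max_distance
    for number, (x, y) in candidates.items():
        d = math.hypot(x - spot_xy[0], y - spot_xy[1])
        if d <= best_d:
            best, best_d = number, d
    return best


# Two copies of the same linker gel after independent editing:
# spot numbers differ and coordinates have shifted slightly.
copy_a = {1: (10.0, 20.0), 2: (55.0, 40.0)}
copy_b = {7: (10.3, 19.8), 8: (54.6, 40.5), 9: (80.0, 80.0)}

links = {n: best_coordinate_match(xy, copy_b) for n, xy in copy_a.items()}
print(links)  # {1: 7, 2: 8}
```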

The gel-to-gel matches within each matchset, and the links between the matchsets that are made possible by the linker gels, generate the network structure of the database. There is no particular reference or standard gel; any gel of any matchset can be a starting point for entry into the database. The starting gel is usually a member of a matchset that is currently loaded into the workstation so that its gel image can be displayed. After selecting any spot or set of spots on the starting gel, the system can be asked to display quantitative information for the corresponding spot or spots on any other gel of any matchset in the database. To track a spot from one gel to another, the system finds paths through the network, and if multiple paths exist, each path is followed. Sometimes different paths lead to different destination spots, indicating inconsistency in the database. The system can choose the destination spot reached by the majority of paths, and it can flag inconsistencies.
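The path-following and majority rule described above can be sketched in Python as follows. This is an illustration under stated assumptions: gels are nodes, each recorded match is a dictionary from spot numbers on one gel to spot numbers on another, and links between copies of a linker gel are treated as ordinary matches. The names and data are hypothetical.

```python
from collections import Counter


def trace_spot(matches, start_gel, spot, target_gel):
    """Follow every loop-free path from start_gel to target_gel.
    Returns (majority destination spot, inconsistency flag)."""
    links = {}  # make every match usable in both directions
    for (a, b), pairs in matches.items():
        links.setdefault(a, {})[b] = pairs
        links.setdefault(b, {})[a] = {v: k for k, v in pairs.items()}

    destinations = []

    def walk(gel, current, visited):
        if gel == target_gel:
            destinations.append(current)
            return
        for nxt, pairs in links.get(gel, {}).items():
            if nxt not in visited and current in pairs:
                walk(nxt, pairs[current], visited | {nxt})

    walk(start_gel, spot, {start_gel})
    if not destinations:
        return None, False
    votes = Counter(destinations)
    winner, _ = votes.most_common(1)[0]
    return winner, len(votes) > 1


# Three paths from gel A to gel D; the direct match disagrees with the other two.
matches = {
    ("A", "B"): {1: 11}, ("B", "D"): {11: 41},
    ("A", "C"): {1: 21}, ("C", "D"): {21: 41},
    ("A", "D"): {1: 42},
}
print(trace_spot(matches, "A", 1, "D"))  # (41, True)
```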

Relative to a strictly hierarchical structure, in which gels are matched only to standard gels which in turn are matched to higher standards, the network structure results in shorter average paths between gels (fewer matches to connect any two gels in the database). The network can be built using groups of related gels that match easily, and each new matchset can be linked by the inclusion of any gel already in the database. As a result, some gels that would be difficult to match directly can be accurately compared because they are matched through intermediates. For example, patterns of nuclear proteins might be difficult to match to patterns of phosphoproteins, yet because each is matched to total cell patterns they can be accurately compared.

Storage and Retrieval of Spot Annotations-Users at the workstation can enter textual annotations for any spot of any displayed gel, and those annotations can be retrieved by users viewing any other gel in the database. Annotations can be made in various categories, which are selected from a menu. When a spot is selected for annotation, the system presents a form with the fields for spot number, gel number, matchset number, and annotation category already completed. The user enters a short note, such as a protein name, and a longer body of explanatory text, such as the method used to identify the protein. The annotation forms are managed by the QST data management system (5, 28).

Annotations are retrieved with reference to the gels of any matchset loaded into a workstation. When the matchset is loaded, the database system traces the paths from all gels named on annotation records to one gel of the current matchset, retaining a list of all spots that have annotations, their respective annotation categories, and their annotation record numbers. All spots with annotations in the selected category can be highlighted, and the short annotation note for each spot can be displayed as the user places a cursor over the spot. Full annotation records can also be reviewed for each spot. Because annotation records are traced through the network structure of the database, annotations entered while viewing one matchset can be retrieved while viewing another.

Storage and Retrieval of Spot Sets-Spot sets, created by the methods described in the text, can be stored as a permanent part of the database. Each spot set is defined relative to the spots of a chosen gel, and the set is stored as a string of bits, one for each spot of the chosen gel. Users who create and store spot sets enter names and descriptions of each set in the SPOTSETS form, which is one of the forms managed by the QST data management system (5, 28). The set number is taken from the record number and the line number on the SPOTSETS form. To show the members of a stored spot set on the image of any other gel in the database, the database management software finds the proper paths through the database and translates the spot numbers of the gel on which the set was defined to the spot numbers of the gel on which the set is to be displayed.
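A bit-string spot set and its translation to another gel can be sketched in Python as below. The bit order, the zero-based spot numbering, and the `spot_map` dictionary (standing in for the result of tracing spots through the database) are assumptions made for the example.

```python
def make_spot_set(members, n_spots):
    """Store a set as one bit per spot of the defining gel."""
    bits = bytearray((n_spots + 7) // 8)
    for s in members:
        bits[s // 8] |= 1 << (s % 8)
    return bytes(bits)


def members_of(bits):
    """Recover the spot numbers whose bits are set."""
    return [i for i in range(len(bits) * 8) if (bits[i // 8] >> (i % 8)) & 1]


def translate(bits, spot_map):
    """Re-express the set in the spot numbers of another gel.
    Spots with no traced counterpart are dropped."""
    return sorted(spot_map[s] for s in members_of(bits) if s in spot_map)


stored = make_spot_set([0, 3, 9], n_spots=12)  # 12 spots fit in 2 bytes
print(stored)                                  # b'\t\x02'
print(members_of(stored))                      # [0, 3, 9]
print(translate(stored, {0: 17, 3: 4, 9: 30})) # [4, 17, 30]
```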

Standard Spot (SSP) Numbering System-The standard spot numbering system and the REF52 protein map, as described in the text, are based on the pattern of spots in a particular gel (Gel 8489). The standard spot numbers are not an inherent part of the database structure; instead, they represent additional information attached to each spot, much like the annotations and spot sets. The system can report the standard spot number for any spot, regardless of the gel on which it is viewed, by tracing the spot through the database to the gel on which standard spot numbers are defined.

The standard spot numbers are determined by dividing the standard gel (Gel 8489) into 15 horizontal and 15 vertical zones. The width of each zone is determined automatically so that each zone contains approximately the same number of spots. A numbering system based on two letters and two (or more) digits gives rise to spot numbers that are short, easy to remember, and which refer to defined regions on the map. For databases being generated for other species, we plan to use the same coordinate grid that has been defined for REF52 cells. The positions of the REF52 zones can be defined on patterns from other species by preparing gels that represent mixtures of proteins from the two species. This strategy will ensure that all spots with the same letter prefix have approximately the same isoelectric point and molecular weight.
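Zones holding roughly equal numbers of spots can be obtained by cutting the sorted spot positions at equal-count intervals. The Python sketch below shows one such approach; it is not the QUEST implementation, and the mapping of zone indices to the letters A through O is a hypothetical choice made only to illustrate a two-letter prefix.

```python
import bisect
from collections import Counter

LETTERS = "ABCDEFGHIJKLMNO"  # 15 letters, one per zone (hypothetical assignment)


def zone_boundaries(positions, n_zones=15):
    """Cut points chosen so each zone holds about the same number of spots."""
    ordered = sorted(positions)
    if len(ordered) < n_zones:
        raise ValueError("need at least one spot per zone")
    cuts = []
    for k in range(1, n_zones):
        i = k * len(ordered) // n_zones
        cuts.append((ordered[i - 1] + ordered[i]) / 2)
    return cuts


def zone_of(position, cuts):
    """Zero-based index of the zone containing a position."""
    return bisect.bisect_right(cuts, position)


def zone_prefix(x, y, x_cuts, y_cuts):
    """Two-letter prefix from the horizontal and vertical zones."""
    return LETTERS[zone_of(x, x_cuts)] + LETTERS[zone_of(y, y_cuts)]


# Small demonstration with 30 evenly spaced positions and 3 zones.
xs = [float(i) for i in range(30)]
cuts = zone_boundaries(xs, n_zones=3)
print(cuts)                                      # [9.5, 19.5]
print(Counter(zone_of(x, cuts) for x in xs))     # 10 spots in each zone
print(zone_prefix(4.0, 25.0, cuts, cuts))        # AC
```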

Isoelectric Point and Molecular Weight Coordinates-The QUEST database system calculates apparent isoelectric point (pI) and apparent molecular weight (Mr) for each spot on the protein map. The calculations are based on spots with known pI and Mr values. These known values are entered as annotation records, and a program is run periodically to assign pI and Mr values to all spots of a standard gel based on the known values. To make the assignments, the pI versus horizontal spot position and the logarithm of Mr versus vertical spot position are each fitted as multiple straight line segments, which are then locally averaged to produce a smooth scale. As for the SSP numbers, the system can report consistent pI and Mr values for each spot regardless of the gel on which it is viewed, by tracing the spot through the database to the standard gel on which the pI and Mr values were calculated.
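The straight-line-segment part of this calibration can be sketched in Python as below. It is a simplification: the local averaging step that smooths the scale is omitted, positions outside the marker range are extrapolated from the end segments (an assumption), and the marker positions and values are invented for the example.

```python
import bisect
import math


def piecewise_linear(x, xs, ys):
    """Interpolate along the straight segments joining (xs, ys);
    positions beyond the markers use the nearest end segment."""
    i = bisect.bisect_right(xs, x)
    i = min(max(i, 1), len(xs) - 1)
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def apparent_pi(x, markers):
    """markers: (horizontal position, known pI) pairs."""
    xs, pis = zip(*sorted(markers))
    return piecewise_linear(x, xs, pis)


def apparent_mr(y, markers):
    """markers: (vertical position, known Mr) pairs; fitted on a log scale."""
    ys, mrs = zip(*sorted(markers))
    return 10 ** piecewise_linear(y, ys, [math.log10(m) for m in mrs])


pi_markers = [(100.0, 5.0), (300.0, 6.0), (500.0, 7.5)]
mr_markers = [(50.0, 100000.0), (250.0, 10000.0)]
print(apparent_pi(200.0, pi_markers))         # 5.5
print(apparent_pi(400.0, pi_markers))         # 6.75
print(round(apparent_mr(150.0, mr_markers)))  # 31623
```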

Measurements of Pattern Relatedness-The standard deviation of spot intensity ratios for spots of any two matched gels is calculated using all consistently matched spots having a global quality value (28) of at least 50 and having a peak density of at least 0.39 A on the darkest film for each gel. The standard deviation is calculated from the distribution of spot intensity ratios on a log2 scale, to avoid disproportionate effects from changes in the larger spots. The percentage of spots that are altered quantitatively by 2-, 4-, 8-, and 16-fold between any two gels is calculated only from consistently matched spots that have a peak density of at least 0.39 A on the darkest gel. Also excluded from the count of n-fold changes are those spots that have a density of less than 0.39/n A on the lighter gel. The denominator used to calculate the percentage of quantitative changes is the number of spots eligible for the comparison. Before the quantitative changes are scored, a normalization factor is automatically calculated. The normalization factor is the inverse of the average ratio for all spots that have a peak density of at least 0.39 A in at least one gel. Multiplying all spot ratios by the normalization factor corrects for any systematic errors, such as errors of sample loading, that lead ppm values to be generally higher in one gel than another.
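The relatedness measures can be sketched as below. The function names are hypothetical, and several details that the text leaves open are assumptions: the "average ratio" is taken as an arithmetic mean, the standard deviation as a population standard deviation, and the "darkest" and "lighter" gels as the larger and smaller of the two peak densities. Input pairs are (ppm in Gel A, ppm in Gel B) for spots already filtered for match consistency and quality.

```python
import math
from statistics import mean, pstdev

MIN_DENSITY = 0.39  # peak density threshold in absorbance units, from the text

def eligible_for_nfold(density_a, density_b, n):
    """Spot is counted only if dark enough in one gel and not too faint in the other."""
    darker, lighter = max(density_a, density_b), min(density_a, density_b)
    return darker >= MIN_DENSITY and lighter >= MIN_DENSITY / n

def normalization_factor(pairs):
    """Inverse of the average B/A intensity ratio (arithmetic mean assumed)."""
    return 1.0 / mean([b / a for a, b in pairs])

def log2_ratio_sd(pairs, norm=1.0):
    """Standard deviation of the normalized intensity ratios on a log2 scale."""
    return pstdev([math.log2(norm * b / a) for a, b in pairs])

def percent_changed(pairs, n, norm=1.0):
    """Percentage of eligible spots that differ by n-fold or more in either direction."""
    changed = sum(1 for a, b in pairs if abs(math.log2(norm * b / a)) >= math.log2(n))
    return 100.0 * changed / len(pairs)
```

Working on the log2 scale treats a 2-fold increase and a 2-fold decrease symmetrically, which is why the fold-change test uses the absolute log ratio.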

Qualitative differences were scored by the system and confirmed visually by examination of the gel images and the actual films. A qualitative difference between Gel A and Gel B is defined to exist when: 1) a spot exists in Gel A but no corresponding spot is detected in Gel B, and 2) the ppm value for the spot in Gel A is at least 10-fold greater than the ppm value calculated for a hypothetical spot in Gel B having a height equal to the minimum quantifiable density and having Gaussian half-widths typical of other spots in the region. The minimum quantifiable density is assumed to be 0.10 A.

Creation and Analysis of Spot Sets-Spot sets are created at the workstation by manual or automatic means. The QUEST program provides two “worksets” for each gel of the current matchset. These worksets can be loaded from the permanent spot sets stored in the database, they can be edited by manually adding or removing spots from a workset, and they can be created automatically by the system. The members of either workset can be highlighted on the screen (as in Fig. 1), and the members of the two worksets can be combined logically by “and”, “or”, or “not” operations to generate new worksets. Any workset can be stored to the database as a permanent spot set.
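The logical combination of two worksets maps naturally onto set operations. In the sketch below the spot identifiers and set memberships are invented for illustration, and "not" is interpreted as set difference (members of the first workset that are absent from the second), which is an assumption.

```python
# Hypothetical worksets keyed by made-up spot identifiers.
workset_1 = {"XA10", "XA20", "XB10"}
workset_2 = {"XA20", "XB10", "XC30"}

in_both = workset_1 & workset_2      # "and": spots present in both worksets
in_either = workset_1 | workset_2    # "or": spots present in at least one workset
only_first = workset_1 - workset_2   # "not": spots in workset 1 but not in workset 2
```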

Spots that change quantitatively by any preset amount can be selected as a workset, using the same methods to score quantitative changes between two gels as described above. The set-score method, as described in the text, is a way to generate sets based on complex quantitative criteria involving multiple samples. Sets generated by either method can be limited to spots with a minimum global quality. Sets generated by quantitative criteria are usually reviewed manually by displaying the gel image with set members highlighted on one side of the screen and spot graphs for the set members on the other side. Spots can be interactively deleted or added to any set.

Average properties, such as pI, Mr, and ppm, can be calculated for any workset. Because ppm values range over several orders of magnitude, these values are averaged as the logarithm of ppm. The distribution of log ppm often approximates a Gaussian curve, and changes in the mean of the distribution are used to measure changes of average spot intensity for the members of a spot set. In the calculation of average log ppm, values that deviate from the mean by more than 2.5 S.D. are discarded. The average ppm values reported in Table III, and the values displayed as set graphs in Fig. 6, are calculated as average log ppm, but are reported as ppm values by exponentiating the result.
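A minimal sketch of the averaging procedure. The function name is hypothetical, and three details are assumptions not stated in the text: base-10 logarithms, a population standard deviation, and a single pass of outlier rejection.

```python
import math
from statistics import mean, pstdev

def average_ppm(ppm_values, max_sd=2.5):
    """Average ppm on a log scale, discarding values more than max_sd S.D. from the mean."""
    logs = [math.log10(v) for v in ppm_values if v > 0]
    mu, sd = mean(logs), pstdev(logs)
    kept = [x for x in logs if sd == 0 or abs(x - mu) <= max_sd * sd]
    return 10 ** mean(kept)  # exponentiate to report the result as ppm
```

The result is the geometric mean of the retained values, so two spots at 10 and 1000 ppm average to 100 ppm rather than 505 ppm.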

Creation of Specific Spot Sets

Long- and Short-lived Proteins-Long-lived proteins were selected from Matchset 4, assigning the highest scores to spots that are induced by 2-fold or more in each of the 24-h labelings (with or without the 48-h chase) relative to the 2-h labelings. Spots with high scores were reviewed manually to select only spots that are not decreased in relative intensity during the 48-h chase.

Short-lived proteins were selected by the inverse procedure, yielding proteins that are lower in 24-h labelings (with or without the 48-h chase) relative to 2-h labelings, and which do not increase during the chase. Both sets were restricted to contain only high quality spots that are also members of Set 220G.

Proteins Induced and Repressed at Confluence-These proteins were selected, using the set-score method, by assigning maximum scores to spots that are induced (or repressed) by at least 3-fold at Day 6 versus Day 2, &fold at Day 6 versus Day 3, and 2-fold at Day 4 versus Day 2. Only spots that are also members of the high quality set (Set 220G) were considered, and a few artifactual spots were eliminated manually. The 100 remaining spots with highest scores were selected as the sets of proteins induced (Set 2620) and repressed (Set 262P) at confluence.

Proteins Induced at Quiescence-Spots were selected, using the set-score method, to give the highest scores to the spots that are induced most at Day 12 versus Days 2 and 3, up to a maximum score of 1.0 for a 10-fold increase. Only high quality spots (members of Set 220G) were considered. The 50 highest ranking spots were further edited to remove spots with spurious data and to remove spots that were not more intense at Day 12 than at Day 6. The remaining set (Set 262C) contained 39 spots.
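The scoring rule for this set can be sketched as follows. The set-score method itself is described later in the text, so everything below is a set of assumptions: the score is taken to rise linearly with the logarithm of the fold increase (0 for no increase, 1.0 at 10-fold or more), the scores against the reference days are averaged, and all names are hypothetical.

```python
import math

def induction_score(ppm_test, ppm_ref, full_fold=10.0):
    """Score from 0.0 to 1.0; reaches 1.0 at a full_fold increase (log-linear scale assumed)."""
    if ppm_test <= 0 or ppm_ref <= 0:
        return 0.0
    score = math.log(ppm_test / ppm_ref) / math.log(full_fold)
    return min(max(score, 0.0), 1.0)

def set_score(spot, test, refs):
    """Average the scores of one spot against each reference sample (averaging is assumed)."""
    return sum(induction_score(spot[test], spot[r]) for r in refs) / len(refs)

def top_spots(spots, test, refs, n):
    """Return the n spot identifiers with the highest set scores."""
    ranked = sorted(spots, key=lambda ssp: set_score(spots[ssp], test, refs), reverse=True)
    return ranked[:n]
```

A ranked list of this kind would still need the manual review described above to remove spots with spurious data.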

Proteins Induced Early and Late after Refeeding-Proteins induced early were selected by the set-score method, giving maximum score to spots that are at least %fold more intense in cells 5 h after refeeding than in cells from Day 12 or from Day 6 that have not been refed. The relative intensity of these spots at 15 h after refeeding was not considered.

Proteins induced late after refeeding were selected by giving maximum score to spots that are at least 3-fold more intense in cells labeled 15 h after refeeding than in 1) cells 5 h after refeeding, 2) cells at Day 12 that were not refed, and 3) cells at Day 6 that were not refed.

Both sets were initially created to contain 100 spots, but these were further selected to remove spots that are not members of the high quality set (Set 220G) and to manually remove faint spots or artifactual spots.

Access to the Database

The REF52 database has been developed as a public resource under a National Institutes of Health Biomedical Research Technology grant. The database can be distributed in hardcopy form or as computer files. A color wall map (22 × 22 inches) of the REF52 protein map is available with transparent overlays to indicate spot names, standard spot numbers, and membership of spots in permanent spot sets. The QUEST resource has facilities for visiting scientists and can provide limited services to analyze samples related to the REF52 database.

RESULTS

The types of data to be managed and processed by a protein database include: 1) the quantitative data derived from the computer programs for image analysis, spot quantitation, and pattern matching (28); 2) the spot annotations, such as protein names and literature references, that are entered as text by users; and 3) the permanent spot sets, which represent proteins of common biological interest selected during the analysis of each experiment. For a database to be shared by multiple users, the system must also provide a standard spot map and a standard spot numbering system. Finally, because much of the data that can be shared through a database becomes fully interpretable only in the light of future research, complete laboratory records to document each experiment and each gel must be available in the database. The classes of data managed by the QUEST system will be described with reference to the workstation screen shown in Fig. 1.

Classes of Data Managed by the QUEST Databases

The basic unit of organization for the QUEST databases is the matchset. A matchset is any group of gels, usually representing one experiment, that are quantified, matched, and edited as a unit. These steps are fully described in the previous report (28). The analysis of each matchset yields spot quantitation data and spot matching data that are entered into the permanent on-line database. Interaction with the database takes place at a workstation, where gel images from any matchset are displayed and where a variety of graphic and descriptive information can be presented. In Fig. 1, a portion of a gel for REF52 cells is shown and some of the current knowledge for the spots on this gel is displayed to illustrate the classes of data available from the database.

Quantitative data from the database are shown by the spot graphs in Fig. 1. One spot, the prominent cytoskeletal protein tropomyosin-1, has been selected, and quantitative data have been displayed for three matchsets. Each graph presents data from one matchset (the matchsets are numbered above the graphs and are further described below), and each bar presents the intensity of the selected spot in one gel. The intensity data for each spot are stored in the database as parts per million (ppm) of the total protein radioactivity applied to the gel. The vertical scale for the graphs is in units of ppm, with the full scale value indicated at the left. At the workstation, any matchsets from the database can be selected for graphic analysis, and spot graphs can be displayed for any selected spots that are visible on the displayed image.

FIG. 1. Classes of data available in the QUEST databases. Shown on the right half is a portion of a gel (Gel 8488) for REF52 cells labeled 4 days after plating. The spots highlighted in yellow are members of the set of cytoskeletal proteins. The identified spots on the image are labeled by the protein names. The spot graphs in the upper left present quantitative data for tropomyosin-1 from three matchsets. The matchsets, labeled by number and name at the top, are described more fully in Table IV. Each bar represents the intensity of tropomyosin-1 in one gel, and the scales are set so that the highest bar in each graph is at full scale. The number to the immediate left of the graphs is the full scale height, in parts per million (ppm), for the first graph. The number to the far left is the standard spot (SSP) number for tropomyosin-1. The form shown below the graphs is an example of data from the QST database of gel laboratory data. Information has been displayed by the system to indicate the name of the sample and the experiment represented by the gel shown, to indicate the dpm loaded on the gel and its dates of exposure, and to provide basic information about the experiment.

Annotations are lines of text entered by a user to describe any spot or set of spots indicated on a displayed gel. Each annotation record includes a category (such as “Names”), a short line of text (such as a protein name), and a longer comment (such as a description of the method used to identify the protein). Annotation data can be displayed by category for any spot, and all spots with annotations in the selected category can be highlighted automatically. In Fig. 1, the Names category has been selected, and the names of identified proteins have been displayed on the gel image.
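The annotation record described above could be modeled as a small data structure. This is an illustrative sketch only: the class, its field names, and the helper function are hypothetical and do not reflect the actual QST record layout.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    category: str                               # e.g. "Names"
    text: str                                   # short line, e.g. a protein name
    comment: str = ""                           # longer free-text comment
    spots: list = field(default_factory=list)   # SSP numbers the record refers to

def spots_in_category(annotations, category):
    """Return all spots carrying an annotation in the given category (for highlighting)."""
    return sorted({s for a in annotations if a.category == category for s in a.spots})
```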

Spot sets can be selected manually or by quantitative criteria using the database analysis tools presented below. The analysis of each experiment usually yields one or more carefully selected spot sets that can be named and stored permanently in the database. Any stored spot set can be recalled from the database, and its members can be highlighted on the current gel image. In Fig. 1, the visible members of the set of cytoskeletal proteins are highlighted in yellow.

The standard spot (SSP) numbering system provides a unique identifier for each spot. Each SSP number contains a two-letter code, which serves to locate the spot by horizontal and vertical position on the standard spot map, followed by a series of digits. The SSP number for the spot graphed in Fig. 1 is seen at the left of the graph. Since the SSP number of a spot is independent of the gel on which it is viewed, the system can locate a given spot on any gel by its SSP number.

Laboratory data for all experiments, protein samples, and gels run under the QUEST system are available through the QST record management system previously described (5, 28). This system, which has been expanded to handle matchset descriptions, spot set descriptions, and annotation records, can display the forms and data records directly on the workstation screen. In Fig. 1, a GELDATA record summarizes the relevant data for the gel on display including information about the protein sample and the experiment it represents.

Database Structure

The structure of the QUEST databases reveals how data from many gels can be connected and how the database can grow by addition of data from many more experiments. In the examples of Fig. 1, quantitative data, annotations, and spot sets have been displayed on the image of a particular gel, yet each type of information originated from matchsets other than the one that includes the displayed gel. The method for linking data from the matchsets in the database is diagrammed in Fig. 2. Matchsets are represented in the diagram as groups of gels (numbered nodes) interconnected by lines representing the matches. Within a typical matchset, every gel is matched to every other gel. For each node in the diagram, the database contains a spot list, which includes the x- and y-coordinate positions and intensity value for each spot in the gel. For each line in the diagram, the database contains two lists of matched spots (match lists). Each match list contains, for every spot of one gel, the number of the spot it is matched to in the other gel. There are two match lists because each pair of gels is matched in both directions.


FIG. 2. Schematic of QUEST database structure. Each hexagonal array represents one matchset, with each numbered node representing one gel. Within each matchset, every gel has been matched to every other gel, as indicated by the lines. Two gels (Gels 5 and 6) are members of both matchsets, which automatically defines these gels as linker gels. The paths from Gel 3 to Gel 8 are illustrated by heavier lines. To trace a spot of Gel 3 to its corresponding spot of Gel 8, the system uses gel-to-gel matches and linker matches, as described under “Experimental Procedures.” Both paths will be traced and compared for consistency. The actual REF52 database contains 12 matchsets, including those presented in the accompanying paper (25), and has multiple paths between most matchsets.

The matchsets are interconnected (dotted lines in Fig. 2) at the linker gels. Any gel that is included in more than one matchset is automatically a linker gel. The database contains a separate spot list for each copy of a linker gel, and each copy of the linker gel is matched independently into its respective matchset. The separate spot lists for a linker gel are not necessarily identical because spots can be added, deleted, and refitted in the course of matching the gels in each matchset (28). Matching of the spots between any two copies of a linker gel is carried out automatically by the system as needed.

Paths through the database can be found automatically by the system to connect any gel with any other gel. Once a path is found, any spot of the starting gel can be traced to its corresponding spot in the destination gel. If two or more paths of equal length are available, as in the example shown in Fig. 2, the spot is traced along each path, and the most consistent result is used. Inconsistencies in the database are flagged to warn the user of possible errors.
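Path finding and spot tracing can be sketched as a breadth-first search over the graph of matched gels. This is a simplified illustration with hypothetical names: each linker gel is treated as a single node (the database described here keeps a separate spot list for each copy), and "most consistent result" is taken to mean the result reported by the largest number of paths.

```python
from collections import deque

def shortest_paths(links, start, goal):
    """All shortest gel-to-gel paths; links maps each gel to the set of gels it is matched to."""
    paths, best = [], None
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        if best is not None and len(path) > best:
            break  # every remaining path is longer than the shortest one found
        if path[-1] == goal:
            best = len(path)
            paths.append(path)
            continue
        for nxt in sorted(links[path[-1]]):
            if nxt not in path:
                queue.append(path + [nxt])
    return paths

def trace_spot(match_lists, path, spot):
    """Follow a spot along one path; match_lists[(a, b)] maps spot numbers of gel a to gel b."""
    for a, b in zip(path, path[1:]):
        spot = match_lists[(a, b)].get(spot)
        if spot is None:
            return None  # the spot is unmatched at this step
    return spot

def trace_consistently(links, match_lists, start, goal, spot):
    """Trace along every shortest path; return (most frequent result, whether all paths agree)."""
    results = [trace_spot(match_lists, p, spot) for p in shortest_paths(links, start, goal)]
    found = [r for r in results if r is not None]
    if not found:
        return None, False
    winner = max(set(found), key=found.count)
    return winner, len(set(results)) == 1
```

The second return value plays the role of the inconsistency flag: it is false whenever the equal-length paths disagree or one of them loses the spot.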

In the example shown in Fig. 1, the user has selected a spot (tropomyosin-1) on a particular gel of a particular matchset and has requested quantitative data for this protein from three other matchsets. The system has found paths from the starting gel to each gel of the destination matchsets, and it has traced the selected spot to the corresponding spot of each destination gel. For each destination gel, the quantitative value of the spot was retrieved from the spot list in the database, and a bar of the appropriate height was plotted on a spot graph.

The database structure described above gives the user flexibility in choosing how to link each new matchset into the database. For a series of related experiments, it is often convenient to choose a representative gel as a linker to be matched into every matchset. Sometimes a large series of gels can be grouped into matchsets in several overlapping combinations, providing a highly interconnected group of matchsets. Both of the above strategies have been used in the construction of the REF52 database. The database structure does not require linkage of all matchsets. Groups of matchsets, developed by different users or representing different cell types, can co-exist in the database without linkage, effectively representing independent databases. If, at a later date, a new matchset is created containing one or more gels from each group, then all gels within the formerly independent groups automatically become linked.

Building the REF52 Database

The Protein Map-The REF52 protein map defines the standard coordinate system and the standard numbering system for proteins in the REF52 database. The map is based on the coordinate system of a particular gel (Gel 8489), shown in Fig. 3, which is typical of the quality and resolution of the gels in the REF52 database. This gel, which contains proteins of the SV40-transformed REF52 derivative, WT2, was chosen because it contains virtually all proteins detected in normal and transformed REF52 cells and because its pattern can be easily matched to the pattern for both normal and transformed REF52 cells. The coordinate system of the map can be expressed in units of distance (millimeters) from the lower left corner of the scanned image, or in units of apparent isoelectric point (pI) and apparent molecular weight (Mr). The pI and Mr coordinates are only approximations to the true physical values. Before assigning standard spot (SSP) numbers, the coordinate system was divided into 15 horizontal and 15 vertical zones (see letters at the edge of the gel in Fig. 3). The first two letters of each standard spot number are assigned according to the position of the spot within these zones, and the digits that follow serve to uniquely identify the spots within each region of the gel. The numbers are assigned beginning with the largest spots, and only spot numbers divisible by 10 have been used in the initial assignments. Since all gels in the REF52 database are matched directly or indirectly to Gel 8489, all spots can be assigned a standard spot number based on the standard coordinate system, even if they do not appear on Gel 8489.
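The initial number assignment within each zone can be sketched as below. The function name and input layout are hypothetical, and "largest" is assumed to mean highest ppm on the standard gel.

```python
from collections import defaultdict

def assign_ssp_numbers(spots):
    """spots: list of (spot_id, zone_prefix, ppm). Returns a dict spot_id -> SSP number."""
    by_zone = defaultdict(list)
    for spot_id, prefix, ppm in spots:
        by_zone[prefix].append((ppm, spot_id))
    ssp = {}
    for prefix, members in by_zone.items():
        # Largest spot first; only multiples of 10 are used in the initial assignment.
        for rank, (_, spot_id) in enumerate(sorted(members, reverse=True), start=1):
            ssp[spot_id] = f"{prefix}{10 * rank}"
    return ssp
```

Leaving the numbers between multiples of 10 unused keeps room for spots added to a zone later. The values in Table I are consistent with this ordering: within zone GG, for example, the spot numbered GG10 has a higher ppm value than GG20.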

Protein Names and Annotations-Many spots of the REF52 protein map have been associated with known proteins, as indicated in Fig. 3. For each named spot, the SSP number, pI, Mr, and ppm value from Gel 8489 are given in Table I. Most of the known spots have been identified by immunoprecipitation using antibodies provided by other investigators, as referenced in Table I.

The annotation categories used in the REF52 database are listed in Table II. Each spot that has been identified with a known protein has been annotated under the Names category. Other annotation categories are used to record information regarding subcellular location (e.g. nuclear, mitochondrial), post-translational modification (e.g. mannose-labeled, phosphate-labeled), mode of regulation (e.g. heat shock, serum-stimulated), NH2-terminal sequence (obtained using protein recovered from the gel), complete sequences from the PIR (Protein Identification Resource) database, and literature citations. One category records the sources of antibodies known to recognize a given spot, and another category records proteins that are coprecipitated by antibodies to known proteins, possibly indicating important interactions. The numbers of annotations currently available in the REF52 database are presented by category in Table II.

A Reference Set of High Quality Spots-More than 2600 different spots have been detected in the gels of REF52 cells, and the average ppm of these spots is 11-14 ppm (line 1, Table III). A better reference set of REF52 spots has been created by the use of spot quality. As described in the previous report (28), spot quality is a value from 0 (lowest) to 100 (highest) that is assigned to each spot based on its shape and the amount of overlap with neighboring spots. Global quality selection, as described previously (28), is an automatic procedure to create a spot set based on the quality of each spot averaged over the gels of a matchset. This procedure also removes spots that have low intensity in all gels. The set of high quality spots used here (Set 220G) contains the 1606 spots of global quality 50 or higher in the matchset of REF52 cells (Matchset 3 described below). As expected, these spots have higher average ppm (19-20 ppm in REF52 cells) than does the set of all spots (line 2 versus line 1, Table III). Membership in many of the specialized spot sets created below has been limited to those spots that are also members of the reference set of high quality spots (Set 220G).

FIG. 3. The reference map for the REF52 database. The gel shown is from analysis of the SV40-transformed REF52 cell line, WT2. For all gels in the REF52 database, only the area within the box has been analyzed. This gel is used to define the spot numbering system (SSP numbers) using the letter coordinates indicated at the margin. The scales of apparent isoelectric point (pI) and apparent molecular weight (Mr) are indicated. Some of the known spots on the map are indicated by numbers. Full information, including standard spot numbers, pI, and Mr coordinates, for the known spots is given in Table I. (Gel image axes: pH 4.5 to 7.5, horizontal; Mr x 10^-3 from 20 to 200, vertical; zone letters along the margins.)

Spot Sets Based on Subcellular Localization and Post-translational Modification-Many future studies will be more easily interpreted if we identify those spots on the REF52 map that are phosphorylated, glycosylated, or are members of particular subcellular fractions. Both phosphorylation and glycosylation substantially shift the position of a protein on the two-dimensional map, allowing the modified forms to be easily distinguished from unmodified forms. The knowledge that a protein is either phosphorylated or glycosylated has functional significance and provides the opportunity to look for altered levels of modification in response to biological stimuli. The classification of REF52 proteins into nuclear, mitochondrial, and cytoskeletal fractions also helps to limit the possible functions of each protein and allows us to determine whether proteins in distinct subcellular compartments are differentially affected by growth regulation and transformation. The spot sets created to describe these groups are presented in Table III.

Phosphate-labeled proteins were scored by alignment of gels containing [32P]phosphate-labeled proteins (Fig. 4C) with gels containing [35S]methionine-labeled proteins. A set of 122 phosphate-labeled spots was created (line 3, Table III); other spots on the phosphate-labeled gel did not appear on [35S]methionine-labeled gels because they were too faint or, in a few cases, because they did not contain methionine. Among the known members of this set are the nuclear matrix proteins lamins-B and -C, the heat shock protein HSP-90, and a myosin light chain isoform. Phosphorylated forms of the intermediate filament protein vimentin and the heat shock protein HSP-80 were not sufficiently resolved from their unmodified forms for separate quantitation. Approximately 6% of the high quality spots on the REF52 map can be detected by phosphate labeling.

An important and easily identified class of glycoproteins comprises those labeled by mannose. A gel containing [3H]mannose-labeled proteins (Fig. 5A) was aligned with a gel containing [35S]methionine-labeled proteins (Fig. 5B), and with a gel containing [35S]methionine-labeled proteins prepared from tunicamycin-treated cells (Fig. 5C). Tunicamycin inhibits the transfer of high mannose carbohydrate complexes to their protein acceptors. As expected, each of the mannose-labeled spots was absent or reduced in intensity in tunicamycin-treated cells (Fig. 5, C versus B). The unmodified precursor proteins (labeled p in Fig. 5C) were easily detected in the presence of tunicamycin. Among the mannose-labeled proteins are the heat shock protein HSP-100 and the "p45" group of secreted proteins (Fig. 5A) described by Santaren and Bravo (29). Secreted proteins, labeled s in Fig. 5A, were identified by two-dimensional gel analysis of proteins from the medium of REF52 cells (gels not shown). The average ppm value for mannose-labeled proteins is increased as REF52 cells reach confluence (line 4, Table III). Approximately 4% of the high quality spots on the REF52 map can be labeled with mannose.

TABLE I
Known proteins in REF52 cells
For each named spot, the standard spot number (SSP), apparent isoelectric point (pI), apparent molecular weight (Mr), and intensity (ppm) in Gel 8489 are given.

No.  Name                               SSP     pI     Mr        ppm
 1.  Actins (α, β, γ)                   GG10    5.39    45,700   16,400
 2.  Actins (δ, ε)                      GG20    5.45    45,600    2,800
 3.  Vimentin                           FK10    5.26    55,700    5,860
 4.  Vimentin-derived (ViD)             EJ40    5.05    50,700       99
 5.  Vimentin-derived (ViD)             DH20    4.93    48,200       65
 6.  α-Tubulin                          FK20    5.25    55,700    2,330
 7.  β-Tubulin                          EJ10    5.16    53,900    4,640
 8.  Tropomyosin-1                      DG20    4.82    39,700      845
 9.  Tropomyosin-2                      DF20    4.88    36,300      499
10.  Tropomyosin-3                      DF30    4.88    34,900       96
11.  Tropomyosin-4                      DE30    4.82    32,400      450
12.  Tropomyosin-5                      DE40    4.90    31,900      237
13.  Tropomyosin-6                      DG180   4.87    39,400       26
14.  Minor tropomyosin                  DF230   4.84    37,100        4
15.  α-Actinin                          GQ10    5.46   102,900    1,920
16.  Vinculin (acidic)                  LR10    6.25   121,000      702
17.  Vinculin (basic)                   LR20    6.31   121,400      413
18.  Intermediate filament-associated   GQ20    5.40    97,600      542
19.  Non-muscle caldesmon               NN10    6.60    81,000      336
20.  Non-muscle caldesmon satellite     MN10    6.40    81,200      211
21.  Myosin light chain isoform         DD20    4.85    19,600      414
22.  Myosin light chain (phos)          DD40    4.81    20,000       93
23.  Myosin light chain isoform         DD50    4.97    19,500       60
24.  Calpactin I (lipocortin II, p36)   SF10    7.60    36,800    1,140
25.  Calpactin II (lipocortin I, p35)   QF10    7.08    38,700      874
26.  PCNA                               DF10    4.86    34,600      501
27.  PCNA satellite                     DF170   4.82    34,700       32
28.  Lamin-B                            GM40    5.38    69,100      163
29.  Lamin-B (phos)                     FM10    5.34    69,200       70
30.  Lamin-C                            QL40    7.09    60,900       64
31.  Lamin-C                            QL20    6.96    60,900      188
32.  Lamin-C                            PL40    6.90    61,000       24
33.  Lamin-C (phos)                     NL20    6.74    61,200      103
34.  Lamin-C (phos)                     NL40    6.71    61,300       34
35.  Lamin-C (phos)                     NL210   6.65    61,400       21
36.  Lamin-C (phos)                     NL220   6.58    61,400       16
37.  Lactate dehydrogenase isoform      LF10    6.28    35,100      155
38.  HSP-110                            HR10    5.67   112,700      334
39.  HSP-100                            EQ10    5.03   103,900      726
40.  HSP-90                             FP10    5.21    90,900    2,120
41.  HSP-80                             FN10    5.22    78,500      644
42.  HSP-73                             GM10    5.50    72,200    1,440
43.  β-F1-ATPase                        EJ20    5.18    52,200      580
44.  Mitcon-2 (IEF24)                   HL20    5.61    58,300      366
45.  Mitcon-3                           HN10    5.67    76,200      250

Ref./notes: 36 36 34 42 42 1 1 34^a 34^a 34^a 34^a 34^a 34^a 19 34 34 43 44 44 d e 45 45 f f f f f f f f f g 46 46 46 46 46 47 32,48 32

^a Identified using polyclonal anti-tropomyosin serum from Drs. Fumio Matsumura and Shigeko Yamashiro-Matsumura (Rutgers).
^b Identified using monoclonal antiserum from Drs. Fumio Matsumura and Shigeko Yamashiro-Matsumura (Rutgers). Also coprecipitates with anti-tropomyosin antiserum (Footnote a).
^c Identified using monoclonal antiserum from Dr. Chandra Kumar (Schering Corp.).
^d Identified using antibodies provided by Drs. Tony Hunter and Jon Cooper (Salk Institute and Hutchinson Cancer Center), Drs. Robert Ralston and J. Michael Bishop (University of California, San Francisco), and from Dr. Ray Erickson (Harvard).
^e Identified using antibodies provided by Dr. Stanley Cohen (Vanderbilt) and Dr. John Northrup (Yale).
^f Identified in collaboration with Dr. Peter Hornbeck (National Institutes of Health) using antibodies provided by Dr. Larry Gerace (The Johns Hopkins).
^g Identified by comigration with purified lactate dehydrogenase, in collaboration with Dr. Garth Anderson (Roswell Park) and Dr. Michael Lambert (Scripps Clinic).

Nuclear proteins were defined by quantitative comparison of a two-dimensional gel of nuclei purified from WT6 cells (Fig. 4A) with a gel of the corresponding WT6 cell lysate (Fig. 4B). WT6 cells are SV40-transformed REF52 cells, and they contain virtually all proteins detected in REF52 cells. Only spots enriched by 2-fold or more in the nuclear fraction were selected to be members of the nuclear protein set (line 5, Table III). The nuclear set includes two known proteins, PCNA and lamin-B. PCNA (proliferating cell nuclear antigen) is a growth-regulated nuclear protein required for DNA replication (reviewed in Ref. 30), and lamin-B is a nuclear matrix protein (31). Some cytoskeletal proteins, such as actins, vimentin, and tubulins, are prominent in the nuclear fraction, but they are not enriched by nuclear purification and therefore do not qualify for inclusion in the nuclear set. Approximately 10% of the high quality spots on the REF52 map represent nuclear proteins.

TABLE II
Spot annotations in the REF52 database
Annotation categories presently used in the REF52 database are listed. Annotations are entered as records displayed on an "Annotation" form, and each annotation record can refer to more than one spot. The number of annotation records and the number of spots annotated in each category are listed.

Category              Records   Spots
Names                 49        62
Subcellular           4         283
Modifications         6         195
Antibodies            15        35
Coprecipitation       a         22
N-terminal sequence   1         1
PIR^a                 15        15
Ref.                  15        46

^a PIR, Protein Identification Resource.

Mitochondrial proteins on two-dimensional gel patterns have been identified by Anderson (32), who used mitochondrial poisons to block the uptake and processing of precursor forms from the cytoplasm. We have used the K+ ionophore nonactin, one of the drugs shown by Anderson to block the appearance of mature mitochondrial proteins (32), to identify a set of 25 mitochondrial spots on the REF52 map. To further confirm that these proteins appear as mature forms only after a delay for processing, we examined protein patterns derived from cells labeled for only 5 min. In these patterns, each of the putative mitochondrial proteins was absent or greatly reduced in intensity. The set of mitochondrial proteins (line 6, Table III) contains one identified protein, the β-subunit of F1-ATPase, and it contains two other proteins (Mitcon-2 and Mitcon-3) that have been described elsewhere (see Table I). The selection procedure used here does not necessarily identify all mitochondrial proteins that are detected on the REF52 map.

A set of 25 cytoskeletal proteins was assembled from the list of known proteins (Table I). Most of these proteins are synthesized at high rates in REF52 cells (line 7, Table III). Their average labeling intensity rises early in proliferation, but falls at confluence (Day 6) and at quiescence (Day 12), probably because most members of this set are stable proteins (see below).

Matchsets in the REF52 Database-The REF52 database contains 12 matchsets that have been analyzed quantitatively. Four of them, which represent studies of normal REF52 cells, are presented here (Table IV), and the rest, which represent studies of normal and transformed REF52 cells, are presented in the following paper (27). Each matchset is identified by a number and a short name that are unique in the REF52 database. The matchsets presented here were designed to quantify the variations in the spot patterns that result from (a) experimental error in the electrophoresis and analytical systems (Matchset 1), (b) clonal variability of REF52 cells (Matchset 2), (c) growth of cells to confluence and quiescence followed by refeeding (Matchset 3), and (d) variations in the time of labeling (Matchset 4). The interpretation of other experiments in the REF52 database, such as those presented in the following paper (27), will depend on the range of quantitative variation found in these experiments.

Software Tools for Analysis of the QUEST Databases

The database analysis tools allow us to extract meaningful biological results from large amounts of quantitative and descriptive data in the database. The tools presently available are designed 1) to explore the global relationships between protein patterns, 2) to select sets of proteins from the total pattern based on quantitative and descriptive criteria, 3) to characterize the average behavior of selected protein sets, and 4) to characterize individual proteins.

Relatedness of Protein Patterns-To begin the analysis of any matchset, it is useful to determine which gels in the matchset are most related. Three methods are used for pairwise comparisons of gel patterns (Table V). One method is to compute the spot intensity ratios for all spots (above a threshold of spot quality and spot intensity), and to plot these as a histogram on a log2 scale (1). The standard deviation (S.D.) of this distribution is influenced by both the number and the magnitude of the differences between the two samples. A second method to compare two gel patterns is to count the spots that differ by 2-, 4-fold, etc. Spots with low density or low quality in both gels are excluded from the comparison, and the result is divided by the number of spots that could be compared to yield a percentage of quantitative differences. Finally, the number of spots that are detected in one gel but

TABLE III
Spot sets for proteins classified by subcellular localization and post-translational modification

Each spot set is characterized by its permanent set number, by the number of spots contained, by the number of spots that are also members of the set of high quality spots (Set 220G), by the average intensity of the members (averaged as logarithms) in gels from Matchset 3, and by the known proteins they contain.

                                     Total  High quality  Average ppm in REF52                Known proteins in set
Spot set name              Set no.   spots  spots         Day 2  Day 3  Day 4  Day 6  Day 12  (numbers refer to lines in Table I)
1. All spots                         2610   1606          13     12     13     14     11      1-45
2. All high quality spots  220G      1606   1606          19     20     19     19     20      1-45
3. Phosphate-labeled       213L      122    97            23     22     21     21     18      3, 29, 33-36, 40, 41
4. Mannose-labeled         220D      65     60            35     4a     57     74     7a      39
5. Nuclear                 260T      199    165           21     21     19     18     17      26-36
6. Mitochondrial           218A      24     24            45     50     49     46     45      43-45
7. Cytoskeletal            220H      25     25            419    677    612    443    401     1-25

[Figure 4: two-dimensional gel images. Panel A, nuclear proteins; Panel B, total proteins; Panel C, phosphoproteins. Horizontal axis: pH (IEF dimension).]

FIG. 4. Nuclear proteins and phosphate-labeled proteins compared to total cell proteins. Panel A is a gel of nuclear proteins isolated from WT6 cells (SV40-transformed REF52 cells) and these are compared to a total WT6 cell lysate in Panel B. Shown in Panel C is a gel of phosphate-labeled proteins from a total cell lysate of REF52 cells. The nuclear proteins have been identified in the database by computer matching of the spots in Panels A and B, and only spots with 2-fold or more enrichments in the nuclear fraction are included in the nuclear set. The phosphate-labeled protein set was scored manually by alignment of the spots shown in Panel C with Gel 8489. Many alignments between the three patterns have been pointed out by arrows.

are absent from the other gel can be counted. The definition of qualitative difference used here requires that the ppm value of the spot, in the gel where it is detected, be at least 10-fold greater than the ppm value of a barely detectable spot in the other gel.
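The three relatedness measures described above can be illustrated with a minimal Python sketch. It is not the QUEST implementation: the function name `compare_gels`, the dictionary representation of a gel, the thresholds `min_ppm` and `detection_limit`, and the convention that "up" means detected only in the second gel are all assumptions made for illustration.

```python
import math
from statistics import pstdev

def compare_gels(gel_a, gel_b, min_ppm=5.0, detection_limit=1.0):
    """Pairwise relatedness of two gels given as {spot_id: ppm} dicts.

    min_ppm and detection_limit are hypothetical thresholds, not QUEST values.
    """
    log_ratios = []
    up = down = 0
    for spot in gel_a.keys() & gel_b.keys():
        a, b = gel_a[spot], gel_b[spot]
        if a < detection_limit and b < detection_limit:
            continue  # undetected in both gels: nothing to compare
        if b < detection_limit:
            # Qualitative difference: at least 10x the detection limit in A, absent in B
            if a >= 10 * detection_limit:
                down += 1
            continue
        if a < detection_limit:
            if b >= 10 * detection_limit:
                up += 1
            continue
        if max(a, b) < min_ppm:
            continue  # low intensity in both gels: excluded from the comparison
        log_ratios.append(math.log2(b / a))
    n = len(log_ratios)
    if n == 0:
        raise ValueError("no comparable spots")
    # Percentage of compared spots that differ by n-fold or more, in either direction
    pct_fold = {f: 100.0 * sum(abs(r) >= math.log2(f) for r in log_ratios) / n
                for f in (2, 4, 8, 16)}
    return {"sd": pstdev(log_ratios), "pct_fold": pct_fold,
            "up": up, "down": down, "compared": n}

# Hypothetical ppm values for five matched spots
gel_a = {"s1": 100.0, "s2": 50.0, "s3": 20.0, "s4": 0.0, "s5": 40.0}
gel_b = {"s1": 100.0, "s2": 200.0, "s3": 10.0, "s4": 30.0, "s5": 0.0}
result = compare_gels(gel_a, gel_b)
```

In this toy example three spots are compared quantitatively, two of them differ by 2-fold or more, and one spot is counted as a qualitative difference in each direction.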

Spot Sets Created by Quantitative Criteria-To find the particular spots that differ quantitatively between gels, two automatic methods are used to create spot sets. In the first method, the matched spots of two gels are compared, and spots with an intensity ratio of greater than a specified factor are selected. To create sets of spots that satisfy quantitative criteria involving multiple samples, the spot sets generated by pairwise comparisons can be combined using "and," and "not" operations.
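As a rough illustration of the first method, the sketch below builds spot sets from pairwise ratio thresholds and combines them with set operations. The function name `ratio_set`, the gel dictionaries, and the ppm values are hypothetical; Python set intersection and difference stand in for the "and" and "not" operations.

```python
def ratio_set(gel_a, gel_b, factor):
    """Spots whose intensity in gel_b exceeds that in gel_a by more than `factor`."""
    return {s for s in gel_a.keys() & gel_b.keys()
            if gel_a[s] > 0 and gel_b[s] / gel_a[s] > factor}

# Hypothetical ppm values for four matched spots in three gels
day2 = {"A": 10.0, "B": 10.0, "C": 10.0, "D": 10.0}
day6 = {"A": 30.0, "B": 12.0, "C": 25.0, "D": 5.0}
day12 = {"A": 40.0, "B": 50.0, "C": 15.0, "D": 4.0}

up_at_confluence = ratio_set(day2, day6, 2.0)    # {"A", "C"}
up_at_quiescence = ratio_set(day2, day12, 2.0)   # {"A", "B"}

up_in_both = up_at_confluence & up_at_quiescence        # "and"
quiescence_only = up_at_quiescence - up_at_confluence   # "and" combined with "not"
```

Note that spot C is dropped from `up_in_both` because it misses the threshold in one comparison, however close it comes.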

The second method for automatic set creation is called the set-score method. By this method, many pairwise gel comparisons can be specified at once, each with an upper bound and a lower bound for the spot intensity ratio. For each comparison of two gels, all matched spots receive a score. A spot with an intensity ratio greater than the upper bound receives a score of 1.0, and a spot with an intensity ratio lower than the lower bound receives a score of -1.0. Spots with intensity ratios between the lower and upper bounds receive graded scores between -1.0 and 1.0. A final score between -1.0 and 1.0 is reached for each spot by averaging the scores, with optional weighting factors, from the specified pairwise gel comparisons. After the scores are assigned, all spots are automatically ranked by score, and the spot graphs can be reviewed in order of rank. A spot set can be created to contain a predetermined number of spots based on rank, or a set can be created to contain only spots within a range of scores. The set-score method is effective because it combines the results of many pairwise comparisons in a graded way, whereas the previous method eliminates a spot from the set if it falls just below the threshold ratio in any comparison.
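The set-score calculation can be sketched as follows. The text does not state how scores are graded between the bounds, so linear interpolation on a logarithmic ratio scale is an assumption here, as are the names `pair_score` and `set_scores`, the tuple layout of a comparison, and the requirement that all ppm values be positive.

```python
import math

def pair_score(ratio, lower, upper):
    """Score one intensity ratio against its bounds, in the range -1.0 to 1.0."""
    if ratio >= upper:
        return 1.0
    if ratio <= lower:
        return -1.0
    # Assumed grading: linear in log(ratio) between the two bounds
    span = math.log(upper) - math.log(lower)
    return 2.0 * (math.log(ratio) - math.log(lower)) / span - 1.0

def set_scores(gels, comparisons):
    """Rank spots by weighted average score.

    gels: {gel_name: {spot_id: ppm}}, ppm values assumed positive.
    comparisons: list of (numerator_gel, denominator_gel, lower, upper, weight).
    """
    spots = set.intersection(*(set(g) for g in gels.values()))
    scores = {}
    for spot in spots:
        total = weight_sum = 0.0
        for num, den, lower, upper, weight in comparisons:
            ratio = gels[num][spot] / gels[den][spot]
            total += weight * pair_score(ratio, lower, upper)
            weight_sum += weight
        scores[spot] = total / weight_sum
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical gels and bounds
gels = {
    "day2": {"A": 10.0, "B": 10.0, "C": 10.0},
    "day6": {"A": 40.0, "B": 20.0, "C": 5.0},
    "day12": {"A": 10.0, "B": 20.0, "C": 2.5},
}
comparisons = [("day6", "day2", 0.5, 2.0, 1.0),
               ("day12", "day2", 0.5, 2.0, 1.0)]
ranked = set_scores(gels, comparisons)
```

Spot A is strongly elevated in one comparison and unchanged in the other. A strict threshold applied to both comparisons would discard it, whereas the graded score keeps it at an intermediate rank.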

Characterization of Spot Sets-Spot sets are usually created to represent proteins with similar biological properties, and in many cases it is useful to determine the quantitative behavior of the set as a whole. If the average intensity of the spots in the set changes with experimental perturbations to the cells, it is likely that at least a subset of the member proteins is changing in a coordinated manner. Because spot intensities vary widely, average spot intensities are calculated from the logarithm of ppm values. On a log scale, the ppm values have a distribution that is more nearly gaussian in shape, and intensity changes among the most intense proteins do not disproportionately affect the average intensity.

To display the average set intensities, we use set graphs, which are analogous to the spot graphs presented in Fig. 1. Each bar of a set graph (Fig. 6) represents the average intensity of the members of the set in one gel, and each set graph displays the average intensity of the set for the gels of one matchset. The vertical scale on each set graph is defined so that the height of the first bar is 1.0.
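A minimal sketch of the set-graph calculation, assuming each gel is a dictionary of ppm values: the average is taken over logarithms (a geometric mean), and the bars are scaled so that the first bar is 1.0. The function name `set_graph` and the choice to skip spots that are undetected in a gel are assumptions, not details given in the text.

```python
import math

def set_graph(gels, spot_set):
    """Bar heights for one set graph.

    gels: list of {spot_id: ppm} in display order.
    Returns the log-averaged ppm of the set in each gel, divided by the first value.
    """
    means = []
    for gel in gels:
        # Assumption: spots missing or at zero ppm in a gel are left out of its average
        logs = [math.log(gel[s]) for s in spot_set if gel.get(s, 0) > 0]
        means.append(math.exp(sum(logs) / len(logs)))
    return [m / means[0] for m in means]

# Hypothetical set of two spots, one intense and one faint, in three gels
gels = [
    {"a": 100.0, "b": 1.0},
    {"a": 400.0, "b": 1.0},   # intense spot rises 4-fold
    {"a": 100.0, "b": 4.0},   # faint spot rises 4-fold
]
bars = set_graph(gels, {"a", "b"})
```

Both 4-fold changes raise the bar by the same factor of 2. With an arithmetic mean of ppm values, the change in the intense spot would dominate and the change in the faint spot would be nearly invisible.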

Quantitative Analysis of the REF52 Database

Controls for Reproducibility-The experiments designed to test quantitative reproducibility were analyzed by relatedness of the spot patterns, as reported in Table V. Fewer than 2% of the spots differ in intensity by 2-fold or more between patterns representing proteins of the same sample (line 1). Only 2-3% of the spots in patterns derived from duplicate samples differ by 2-fold or more (lines 13-16), and approximately 3% of the spots in patterns derived from independent clones of REF52 cells differ by 2-fold or more (lines 2-4). The standard deviation (S.D.) values for replicate gels, for duplicate gels, and for clones are 0.30, 0.30-0.40, and 0.45-0.46, respectively. These values are small in comparison to the changes resulting from growth regulation and transformation, as presented below and in the following report (27).

Long- and Short-lived Proteins-The time-of-labeling experiment (Matchset 4) was designed to measure the effect of variations in the labeling intervals, and to give information about relative protein stability. As shown in Table V, comparisons of spot patterns from cells labeled for 30 min with patterns from cells labeled for 2 h reveal substantial differences (9% of spots differ by 2-fold or more, line 17), and the comparison of cells labeled for 2 h with cells labeled for 24 h reveals much larger differences (17% of spots differ by 2-fold or more, line 19). This shows that labeling times must be carefully controlled if comparisons between different experiments are to be meaningful.

[Figure 5: two-dimensional gel images, Panels A, B, and C. Horizontal axis: pH.]

FIG. 5. Proteins detected from REF52 cells labeled for 3 h with [3H]mannose are shown in Panel A. The corresponding cell lysate labeled with [35S]methionine is shown in Panel B, and proteins from a parallel dish of cells treated for 3 h with 1 µg/ml of tunicamycin are shown in Panel C. Alignment of spots between the three panels is shown by arrows. Spots known to represent secreted proteins (detected by analysis of samples prepared from the media of REF52 cells) are indicated by s. Spots labeled p in Panel C are precursor forms of glycoproteins that accumulate in the presence of tunicamycin.

To measure long-term protein stability, samples were prepared from cells that were labeled for 24 h and "chased" with unlabeled media for 48 h (Matchset 4). Proteins that are not detected after the long chase are short-lived or secreted; proteins that are detected as spots of higher intensity after the chase are more stable than average proteins. To select and characterize the long- and short-lived proteins, spot sets were created by the set-score method. The set of long-lived proteins was selected from those spots that are more intense in 24-h labelings than in 2-h labelings, and the set of short-lived proteins was selected from those spots that are more intense at 2 h than at 24 h (see "Experimental Procedures" for details of the set selections). One important result is that the set of short-lived proteins contains 10 times as many of the mannose-labeled proteins as does the set of long-lived proteins (line 2 versus line 1, Table VI). The stability of mannose-labeled proteins is further characterized below.

The set graphs (Fig. 6) reveal that long-lived proteins (Set 260M) are repressed, on average, by 30% at confluence and at quiescence, and that short-lived proteins (Set 260N) are induced, on average, by more than 2-fold at confluence and at quiescence (bars 4 and 6 versus bar 1, Matchset 3). The members of each set are sensitive to refeeding (bar 5 versus bar 4, Matchset 3). Repression of long-lived proteins as cells cease to proliferate is expected, except for proteins that are destined to accumulate to high amounts in the cell. Although the short-lived proteins represent a higher fraction of the total protein synthesis (higher average ppm values) at confluence than during proliferation, this result does not necessarily imply that the rates of synthesis for short-lived or secreted proteins are increased in absolute terms. The relative increase could be explained by a decrease in general protein synthesis at confluence while the synthesis rates for short-lived proteins remain constant.
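The last point follows directly from ppm being a fraction of total incorporation. The short calculation below uses entirely hypothetical synthesis rates (arbitrary units) to show that a protein whose absolute synthesis rate is unchanged still doubles in ppm when synthesis of everything else falls by about half.

```python
def to_ppm(rates):
    """Convert absolute synthesis rates to parts per million of the total."""
    total = sum(rates.values())
    return {name: 1e6 * rate / total for name, rate in rates.items()}

# Hypothetical absolute rates; the short-lived protein is held constant
proliferating = {"short_lived": 10.0, "all_other": 990.0}
confluent = {"short_lived": 10.0, "all_other": 490.0}

ppm_prolif = to_ppm(proliferating)["short_lived"]   # 10 of 1000 units
ppm_confl = to_ppm(confluent)["short_lived"]        # 10 of 500 units
apparent_induction = ppm_confl / ppm_prolif
```

Here the apparent induction is 2-fold although the absolute rate for the short-lived protein is identical in both conditions.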

Set graphs for the mannose-labeled proteins (Set 220D, Fig. 6) confirm that many mannose-labeled proteins are short-lived or secreted. The fact that the average intensity of the members of the set is high after 30-min and 2-h labeling periods (bars 1-4, Matchset 4), but is substantially lower in patterns representing cells labeled for 24 h, suggests that many members of the set are secreted.

The set graphs for cytoskeletal proteins (Set 220H, Fig. 6) were also examined because the members of this set are expected to be long-lived. These proteins are indeed stable during the long chase (bars 7 and 8 higher than bars 5 and 6, Matchset 4), but surprisingly, they are no more intense, on average, after 24-h labelings than after 2-h labelings (bars 5 and 6 versus bars 3 and 4, Matchset 4). This result explains why no cytoskeletal proteins were scored in the set of long-lived proteins (line 1, Table VI). It also suggests that newly synthesized cytoskeletal proteins are no more stable than average cellular proteins, but that those cytoskeletal protein molecules that survive for more than 24 h are extremely stable.

Proteins Induced and Repressed at Confluence-Sets of 100 proteins with greatest induction or repression at confluence were selected by the set-score method (lines 3 and 4, Table VI). Phosphate-labeled proteins and nuclear proteins are found preferentially among the proteins repressed at confluence, while the mannose-labeled proteins and the short-lived proteins are found preferentially among the proteins induced at confluence.
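One way to make "found preferentially" concrete is to compare the fraction of each class within a selected set to its fraction among all high quality spots. The sketch below does this with counts read from Tables III and VI; the enrichment ratio itself, the function name, and the use of the high quality counts as the background are illustrative assumptions rather than a calculation described in the text.

```python
HIGH_QUALITY_SPOTS = 1606  # Set 220G

# High quality members of each class (Table III; short-lived from Table VI)
background = {"nuclear": 165, "phosphate": 97, "mannose": 60, "short_lived": 83}

# Intersections of the two 100-spot sets with each class (Table VI, lines 3 and 4)
higher_at_confluence = {"nuclear": 9, "phosphate": 3, "mannose": 23, "short_lived": 25}
lower_at_confluence = {"nuclear": 17, "phosphate": 13, "mannose": 3, "short_lived": 2}
SET_SIZE = 100

def enrichment(overlap, set_size, class_size, total):
    """Observed fraction of a class in the set divided by its background fraction."""
    return (overlap / set_size) / (class_size / total)

induced = {k: enrichment(v, SET_SIZE, background[k], HIGH_QUALITY_SPOTS)
           for k, v in higher_at_confluence.items()}
repressed = {k: enrichment(v, SET_SIZE, background[k], HIGH_QUALITY_SPOTS)
             for k, v in lower_at_confluence.items()}
```

Values above 1 indicate over-representation. With these counts, mannose-labeled and short-lived proteins come out above 1 in the induced set, and nuclear and phosphate-labeled proteins come out above 1 in the repressed set, in line with the statement above.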

Set graphs (Fig. 6) show that the average intensity of proteins induced at confluence (Set 260O) rises steadily during proliferation (bars 1-4, Matchset 3), and that these proteins are markedly repressed when confluent cells are refed (bar 5 versus bar 4, Matchset 3). An inverse pattern was found for proteins repressed at confluence (Set 260P). The time-of-labeling experiment (Matchset 4) reveals that proteins induced at confluence (Set 260O) are less stable than average proteins (bars 5-8 versus bars 1-4), and that proteins repressed at confluence (Set 260P) are more stable than average proteins (bars 5-8 versus bars 1-4). These results confirm the

TABLE IV
Matchsets from the REF52 database

Matchset numbers are permanent in the REF52 database. The short names are useful for labeling of spot graphs and set graphs. The samples are listed in the same order as presented in spot graphs and set graphs. Matchsets 2, 3, and 4 contain Gel 8489 (the gel used for the REF52 map) as a linker, but data from the linker gel are not included in the graphs.

Matchset 1, REPS
  Label time and isotope: 24 h, [35S]Met
  Description: One sample, PSV7 cells (SV40-transformed REF52)
  Gels in matchset: 10 replicate gels (gels 1-10)

Matchset 2, CLONES
  Label time and isotope: 28 h, [35S]Met
  Description: Subclones of REF52 cells, labeled at subconfluence
  Gels in matchset: 1, Clone 5; 2, Clone 8; 3, Clone 9; 4, Clone 10

Matchset 3, REF52
  Label time and isotope: 2 h, [35S]Met
  Description: REF52 cells versus days after plating, quiescent at Day 12
  Gels in matchset: 1, Day 2; 2, Day 3; 3, Day 4; 4, Day 6; 5, Day 6, refed, 15 h(a); 6, Day 12; 7, Day 12, refed, 5 h(a); 8, Day 12, refed, 15 h(a)

Matchset 4, TIME
  Label time and isotope: Variable, [35S]Met
  Description: REF52 cells, subconfluent, test time of labeling
  Gels in matchset: 1, 30 min; 2, 30 min; 3, 2 h; 4, 2 h; 5, 24 h; 6, 24 h; 7, 24-h + 48-h chase; 8, 24-h + 48-h chase

(a) Hours from refeeding with fresh medium and serum to start of labeling.

TABLE V
Relatedness of gels in pairwise comparisons

Values in the S.D. column are standard deviations derived from a histogram (not shown) of spot intensity ratios on a log (base 2) scale. In the n-fold columns are the percentage of spots that differ by n-fold or more between the gels compared. Qualitative changes are those spots detected in one gel at more than 10-fold the level of sensitivity and not detected in the other gel. In the "Spots compared" column is given the number of high quality spots used to make the comparisons.

                                            Quantitative differences                     Qualitative   Spots
Gels compared                        S.D.   >2-fold  >4-fold  >8-fold  >16-fold          (Up Down)     compared

Replicate gels (Matchset 1)
 1. Average of 9 comparisons         0.30   1.7      0.2      0.1      0.0               0 0           1121

Clones of REF52 cells (Matchset 2)
 2. Clone 5 versus Clone 8           0.45   2.9      0.2      0.1      0.0               0 0           940
 3. Clone 5 versus Clone 9           0.46   3.1      0.1      0.1      0.0               0 0           940
 4. Clone 5 versus Clone 10          0.45   3.0      0.1      0.2      0.1               0 0           940

REF52 cells (Matchset 3)
 5. Day 2 versus Day 3               0.47   4.8      0.9      0.1      0.0               0             951
 6. Day 2 versus Day 4               0.75   13.4     1.6      0.5      0.1               1             951
 7. Day 2 versus Day 6               1.16   24.3     6.6      1.5      0.5               4             951
 8. Day 2 versus Day 12              1.18   29.6     9.1      2.2      0.4               11            951
 9. Day 2 versus Day 6r15            0.66   9.1      1.6      0.1      0.0               0             951
10. Day 2 versus Day 12r15           0.84   16.9     4.2      1.2      0.0               5             951
11. Day 6 versus Day 6r15            1.04   19.3     5.3      0.7      0.2               0             1019
12. Day 12 versus Day 12r15          0.97   20.2     4.7      0.8      0.2               3             1346

Time of labeling (Matchset 4)
13. 30 min versus 30 min             0.38   2.2      0.4      0.0      0.0               0             897
14. 2 h versus 2 h                   0.32   2.2      0.4      0.1      0.0               0             1401
15. 24 h versus 24 h                 0.30   2.0      0.4      0.1      0.0               0             1251
16. 24 h + chase versus 24 h + chase 0.40   3.0      0.8      0.2      0.1               0             899
17. 30 min versus 2 h                0.62   9.0      1.6      0.5      0.2               0             1177
18. 30 min versus 24 h               1.12   17.9     6.8      3.0      1.9               5             1177
19. 2 h versus 24 h                  0.95   17.3     5.1      2.3      0.6               6             1386
20. 2 h versus 24 h + chase          1.34   28.1     9.7      2.8      1.1               25            1148
21. 24 h versus 24 h + chase         1.01   20.5     4.0      1.6      0.1               2             1148

relationships between protein stability and induction or repression at confluence reported above.

Spot graphs for some of the individual spots with induction or repression at confluence are presented in Fig. 7. Spot DG10 (pI 4.98, Mr 45,100) is a prominent mannose-labeled protein that is known to be secreted (see Fig. 5A). It is synthesized at 4-fold higher levels in confluent and quiescent cells than in proliferating cells (bars 4 and 6 versus bar 1, Matchset 3), and its synthesis is repressed by refeeding (bar 5 versus bar 4 and bars 7 and 8 versus bar 6, Matchset 3). It is not readily detected in cells labeled for 24 h (bars 5-8 versus bars 1-4, Matchset 4).

Spots RH20 (pI 7.34, Mr 47,600) and RH30 (pI 7.18, Mr 48,700) are members of a basic family of mannose-labeled proteins (Fig. 5A) that are highly induced at confluence and quiescence (bars 4 and 6 versus bar 1, Matchset 3 in Fig. 7), and which are repressed by refeeding (bars 7 and 8 versus bar 6, Matchset 3). The time-of-labeling experiment (Matchset 4) shows that the intensity of spot RH20 is lower in 2-h labelings than in 30-min labelings (bars 3 and 4 versus bars 1 and 2, Matchset 4), which probably indicates that it is

[Figure 6: set graphs. Columns: Matchset 2 (CLONES), Matchset 3 (REF52), Matchset 4 (TIME). Rows: Set 260M, long-lived proteins; Set 260N, short-lived proteins; Set 220D, mannose-labeled; Set 220H, cytoskeletal; Set 260O, higher at confluence; Set 260P, lower at confluence; Set 262C, highest at quiescence; Set 262B, induced early after refeed; Set 262E, induced late after refeed.]

FIG. 6. Set graphs for selected spot sets. Each set graph represents data from one spot set (set number and name given above each row of graphs), for one matchset (matchset number and name given above the columns). The bars represent the average ppm values for the members of the set in each of the gels of the indicated matchset. Average set intensities are computed from the logarithms of individual spot ppm values, to avoid overemphasis of changes among the major proteins. The scale for each graph is set so that the first bar of each graph has a height of 1.0.

[Figure 7: spot graphs. Recoverable row labels: mannose-labeled (275.6); RH20, mannose-labeled (46.7); RH30, mannose-labeled; calpactin I.]

FIG. 7. Selected spot graphs. Each spot graph represents data for one spot (identified by standard spot number at the left) in one matchset (matchset number and name given above columns). Each bar represents the spot intensity in one gel of the matchset. The vertical scale is in units of ppm (parts per million of total radioactivity applied to the gel). Each graph is scaled independently; the full scale ppm value for the first graph is given at the left.

converted to related forms. Like spot DG10, these two proteins are difficult to detect in cells labeled for 24 h (bars 5-8, Matchset 4), although neither has been detected in the medium of REF52 cells (data not shown). The dramatic induction of these proteins at confluence and quiescence suggests that they may be expressions of the differentiated state of REF52 cells.

One of the spots repressed at confluence is calpactin I (Spot SF10), a protein first identified as a substrate for the tyrosine kinase activity of the pp60src protein of avian sarcoma virus (33). Spot graphs (Fig. 7) show that this protein is more stable than average REF52 proteins (Matchset 4), and as expected for stable proteins, it is synthesized at lower relative rates in confluent and quiescent cells than in proliferating cells (bars 4 and 6 versus bar 1, Matchset 3).

Proteins Induced at Quiescence-Although the cell lineage from which REF52 cells originated is not known, proteins characteristic of the differentiated state would be expected to be expressed most highly at quiescence. A set of the 50 spots most elevated at quiescence was selected by the set-score method. This set was further selected to remove low quality spots and spots that are higher at confluence (Day 6) than at quiescence (Day 12). The final set contains 39 proteins (line 5, Table VI), of which seven are mannose-labeled, four are nuclear, and one is cytoskeletal. The set graphs (Set 262C, Fig. 6) reveal that these spots are induced, on average, approximately 11-fold at quiescence (bar 6 versus bar 1, Matchset 3) but only about 3-fold at confluence (bar 4 versus bar 1, Matchset 3). They are strongly repressed in refed cells (bar 5 versus bar 4 and bars 7 and 8 versus bar 6, Matchset 3), suggesting that the members of this set are synthesized primarily in nondividing cells.

One member of this set is tropomyosin-6 (Fig. 7), a member of the tropomyosin family that migrates close to the position of known smooth muscle tropomyosins (34). The ppm value for tropomyosin-6 is 19-fold higher in quiescent than in proliferating cells (bar 6 versus bar 1, Matchset 3). Another member of the set, one of the myosin light chain isoforms (spot DD50), is induced by 48-fold at quiescence (spot graphs

TABLE VI
Spot sets defined by quantitative analysis of REF52 cells

The spot sets were created from Matchsets 3 and 4, as described under "Experimental Procedures." Each spot set has been characterized by the size of its intersection with other spot sets from the REF52 database: all high quality (Set 220G), nuclear (Set 260T), phosphate-labeled (Set 213L), mannose-labeled (Set 220D), mitochondrial (Set 218A), cytoskeletal (Set 220H), long-lived (Set 260M), and short-lived (Set 260N).

Set name                        Set no.  All   Nuc   Phos  Mann  Mito  Cyto  Long  Short
                                         220G  260T  213L  220D  218A  220H  260M  260N
1. Long-lived                   260M     66    10    7     2     4     0     66    0
2. Short-lived                  260N     83    8     3     23    1     0     0     83
3. Higher at confluence         260O     100   9     3     23    2     4     2     25
4. Lower at confluence          260P     100   17    13    3     2     3     14    2
5. Highest at quiescence        262C     39    4     0     7     0     1     0     6
6. Induced early after refeed   262B     69    6     6     6     0     1     2     4
7. Induced late after refeed    262E     48    5     2     0     3     1     8     4

not shown). One other cytoskeletal protein that is known to be induced at quiescence in REF52 cells is the smooth-muscle isoform of α-actin (34). (Smooth muscle α-actin is not a member of the set selected here because the actin isoforms were not quantified separately.) Further characterization of the cytoskeletal proteins and glycoproteins induced in quiescent REF52 cells should help to reveal whether REF52 cells are related to smooth muscle cells.

Proteins Induced Early and Late after Refeeding-Spot sets were created to contain proteins induced early (5 h) and late (15 h) after quiescent cells are refed with fresh medium and serum (lines 6 and 7, Table VI). Both sets contain phosphate-labeled proteins and nuclear proteins, but only the set of proteins induced early contains mannose-labeled proteins. The set graphs (Set 262B, Fig. 6) reveal that, on average, the proteins induced early (bar 7 versus bars 6 and 8, Matchset 3) are only slightly repressed during growth to confluence (bars 1-4, Matchset 3), and that, on average, they are neither long-lived nor short-lived (Matchset 4). In contrast, proteins induced late after refeeding (Set 262E) are, on average, highly growth-regulated (decreasing in bars 1-4, Matchset 3), and they tend to be more stable than average proteins (Matchset 4).

The proteins induced early after refeeding are interesting because many of them are elevated only transiently. Among the known proteins are one of the tropomyosin isoforms, tropomyosin-3, and one of the heat shock proteins, HSP-80. Also in the set are the members of the "p45" family of mannose-labeled proteins (Fig. 5A and Ref. 29). Spot graphs are shown in Fig. 7 for HSP-80 and for another interesting member of the set, spot ME10 (pI 6.41, Mr 29,700). The latter spot is readily detected in 2-h labelings (bars 3 and 4, Matchset 4), but is barely detected in cells labeled for 30 min (bars 1 and 2) or for 24 h (bars 5-8). No precursor form has been found that might explain its delayed appearance.

Only one known protein, calpactin-I (Fig. 7), is a member of the set induced 15 h after quiescent cells are refed. Because the set does not contain PCNA, a nuclear protein known to be induced just prior to the initiation of DNA synthesis in mitogenically stimulated cells (35), we conclude that the set selected here represents inductions that occur prior to DNA replication. Further studies of the early and late responses of mitogenically stimulated cells are presented in the accompanying paper (27).

DISCUSSION

The REF52 protein database, introduced here, has improved our ability to interpret the complex two-dimensional gel patterns obtained from studies of REF52 cells and related cell lines. The database allows us to store and access the available knowledge for each of the proteins on the two-dimensional map. Some of the knowledge is descriptive, consisting of protein names, subcellular localizations, known protein modifications, literature citations, and other useful information accumulated through the experience of the users. Other knowledge is quantitative and is derived from experiments with cultured cells that employ computer-analyzed two-dimensional gel electrophoresis as an assay procedure. The interpretation of each new experiment is assisted by the prior knowledge from the database, and in turn each new experiment adds new knowledge to the database.

How much knowledge is needed for adequate interpretation of two-dimensional gel patterns? The REF52 database is still very small in terms of the number of identified proteins, the number of permanent spot sets, and the number of quantitative experiments that it could contain. However, the information that is now available is large compared to what has been previously available for the interpretation of these gel patterns. The REF52 database contains numerous marker spots that are indicators for proliferating cells, quiescent cells, cells in heat shock, and cells responding to mitogenic stimulation. It also contains, as presented in the following report (27), markers for cellular transformation by tumor viruses. Therefore, even at the present state of knowledge, a reasonable "diagnosis" of the proliferative status of the cell can be made from its two-dimensional gel pattern.

How can a database such as that created for REF52 cells be used to guide future research efforts? Two-dimensional gels have long been successful in identifying new members of protein families, such as the multiple forms of actin (36) and the multiple forms of tropomyosin (34, 37) that have later been analyzed biochemically and genetically. This paper shows that quiescent REF52 cells are characterized by dramatically induced synthesis of at least 39 proteins, including a tropomyosin isoform, a myosin light chain isoform, and several major glycoproteins. It is likely that the further characterization of these and other proteins in the set would reveal more about the differentiated state of REF52 cells.

Two-dimensional gels can also be used to discover and characterize unknown proteins with roles in particular cellular structures or regulatory pathways. In this report, spots representing growth-regulated proteins were selected and recorded in the REF52 database as permanent spot sets. Efforts to purify some of these proteins and to clone their genes could add significantly to our knowledge of the cellular machinery for cell division and for the regulation of the cell cycle. Although the sets selected by just a single trait usually contain too many members to be a focus for future research, the accumulated information available in the database often allows the focus to be narrowed to one or a few proteins of greatest interest. As the database grows, we can select proteins based on their patterns of regulation in multiple experiments, their subcellular localizations, their post-translational modifications, their half-lives, their tissue specificity, and many other properties. To characterize relatively abundant proteins, such as those that appear in quiescent REF52 cells, the two-dimensional gels can provide enough material for partial protein sequencing (38) and for antibody production (39). For less abundant proteins, such as those that are transiently induced after mitogenic stimulation, the database can guide efforts to enrich or to purify the protein by revealing which cell types and cell fractions contain the protein in greatest abundance and by revealing the experimental conditions that best stimulate its synthesis.

How complete can a protein database derived from two-dimensional gels ever be? The gels analyzed here certainly do not reveal all proteins contained in the cell, and many of the proteins of greatest interest may be missed. In the previous paper (28), it was shown statistically that at least 1400 spots are likely to be hidden beneath other spots or obscured by the background. Other proteins are not detected because they are below the limits of sensitivity, too basic, or too low in molecular weight. We know from the analyses of nuclear proteins and phosphoproteins presented here, and from immunoprecipitations of proteins such as p53, SV40 T-antigen, adenovirus E1A, Fos, and Fos-related antigens presented elsewhere (34, 40, 41), that many proteins of interest are not represented among the spots of the current REF52 protein map. Many of these proteins would be detected if they could be resolved from the more intense neighboring spots. Higher resolution can be achieved by the use of multiple gels optimized for narrower ranges of pH and molecular weight, and by prefractionation of cell lysates before electrophoresis. Higher resolution should also result from further improvements to the methods for two-dimensional gel electrophoresis. Therefore, as it becomes easier to produce and analyze large numbers of two-dimensional gels, and as the quality of the gels improves further, it will be possible to analyze routinely many more proteins than those presented here.

The REF52 database has been developed as a resource for studies of transformation and growth regulation in the REF52 family of cell lines, and as a prototype for future databases. As a resource, the database can be used to interpret two-dimensional gel patterns of REF52 and related rat cell lines. The entire database generated from these studies is available as computer-generated wall maps with annotated overlays, and as computer-readable data files. As a prototype, the REF52 database is providing the experience needed to build databases of larger size and broader scope. The work reported here and in the accompanying papers (27, 28) shows that the computer methods are now adequate for analysis of complex two-dimensional gel images and that useful results can be distilled from the massive amounts of data available on these gels.

Future improvements to the analysis system can come through continuing software enhancements and through implementation of the software on more powerful computers. The most serious obstacle to the development of widely shared databases is the lack of standardization in the electrophoresis technique and difficulties in the long term reproducibility of the gel patterns. These problems will be overcome when more highly standardized electrophoresis reagents (especially ampholytes) and more highly automated electrophoresis equipment become available. We believe that the establishment of successful model protein databases will further stimulate development of the technology for standardized two-dimensional electrophoresis, and that larger, widely shared databases will become a reality in the future.

Acknowledgments-We thank James D. Watson for support and encouragement. We thank Bill Topp for providing the REF52 cell line. For helpful discussions and comments on the manuscript, we thank Fumio Matsumura, William Welch, Calvin McLaughlin, Doug Hanahan, Michael Lambert, Stephen Blose, and Richard Roberts. For generously providing antibodies, we thank Fumio Matsumura, Shigeko Yamashiro-Matsumura, Stanley Cohen, John Northrup, Robert Ralston, Ray Erikson, Anthony Pegg, Peter Hornbeck, and Chandra Kumar. We thank Cecile Chang and Heidi Sacco for expert management of the QUEST facilities, Mary Hannaford for assistance in preparing the manuscript, Jim Duffy for graphics, and Phil Renna for photography. Finally, we thank all the members of the QUEST facility for helping to make this work possible.

REFERENCES

1. Garrels, J. I. (1979) J. Biol. Chem. 254, 7961-7977
2. Anderson, N. L., and Anderson, N. G. (1978) Anal. Biochem. 85, 341-354
3. Anderson, N. G., and Anderson, N. L. (1978) Anal. Biochem. 85, 331-340
4. Garrels, J. I. (1983) Methods Enzymol. 100, 411-423
5. Garrels, J. I., Farrar, J. T., and Burwell, C. B., IV (1984) in Two-Dimensional Gel Electrophoresis of Proteins (Celis, J. E., and Bravo, R., eds) pp. 38-90, Academic Press, Inc., Orlando, FL
6. Lipkin, L. E., and Lemkin, P. F. (1980) Clin. Chem. 26, 1403-1412
7. Vo, K.-P., Miller, M. J., Geiduschek, E. P., Nielsen, C., Olson, A., and Xuong, N. H. (1981) Anal. Biochem. 112, 258-271
8. Skolnick, M. M. (1982) Clin. Chem. 28, 979-986
9. Taylor, J., Anderson, N. L., Scandora, A. E., Jr., Willard, K. E., and Anderson, N. G. (1982) Clin. Chem. 28, 861-866
10. Tarroux, P. (1983) Electrophoresis 4, 63-70
11. Anderson, N. L., Nance, S. L., Tollaksen, S. L., Giere, F. A., and Anderson, N. G. (1985) Electrophoresis 6, 592-599
12. Neidhardt, F. C., Vaughn, V., Phillips, T. A., and Bloch, P. L. (1983) Microbiol. Rev. 47, 231-284
13. Phillips, T. D., Vaughn, V., Bloch, P. L., and Neidhardt, F. C. (1987) in Escherichia coli and Salmonella typhimurium: Cellular and Molecular Biology (Neidhardt, F. C., Ingraham, J. L., Low, K. B., Magasanik, B., Schaechter, M., and Umbarger, H. E., eds) pp. 919-967, American Society for Microbiology, Washington, D.C.
14. Ludwig, J. R., II, Foy, J. J., Elliott, S. G., and McLaughlin, C. S. (1982) Mol. Cell. Biol. 2, 117-126
15. Bataille, N., Peypouquet, M.-F., and Boucherie, H. (1987) Yeast 2, 1-11
16. Anderson, L., and Anderson, N. G. (1977) Proc. Natl. Acad. Sci. U. S. A. 74, 5421-5425
17. Pedersen, S., Bloch, P. L., Reeh, S., and Neidhardt, F. C. (1978) Cell 14, 179-190
18. Fey, S. J., Bravo, R., Larsen, P. M., Bellatin, J., and Celis, J. E. (1981) Cell Biol. Int. Rep. 5, 491-500
19. Bravo, R., and Celis, J. E. (1982) Clin. Chem. 28, 766-781
20. Fransen, L., Van Roy, F., and Fiers, W. (1983) J. Biol. Chem. 258, 5276-5290
21. Leavitt, J., Goldman, D., Merril, C., and Kakunaga, T. (1982) Clin. Chem. 28, 850-860
22. Anderson, N. L., Giometti, C. S., Gemmell, M. A., Nance, S. L., and Anderson, N. G. (1982) Clin. Chem. 28, 1084-1092
23. Lester, E. P., Lemkin, P. F., Lowery, J. F., and Lipkin, L. E. (1982) Electrophoresis 3, 364-375
24. Wirth, P. J., Doniger, J., Thorgeirsson, S. S., and DiPaolo, J. A. (1986) Cancer Res. 46, 390-399
25. Celis, J. E., Ratz, G. P., Celis, A., Madsen, P., Gesser, B., Kwee, S., Madsen, P. S., Nielsen, H. V., Yde, H., Lauridsen, J. B., and Basse, B. (1988) Leukemia 2, 561-601
26. Lefkovits, I., Kuhn, L., Valiron, O., Merle, A., and Kettman, J. (1988) Proc. Natl. Acad. Sci. U. S. A. 85, 3565-3569
27. Garrels, J. I., and Franza, B. R., Jr. (1989) J. Biol. Chem. 264, 5299-5312
28. Garrels, J. I. (1989) J. Biol. Chem. 264, 5269-5282
29. Santaren, J. F., and Bravo, R. (1987) Exp. Cell Res. 168, 494-506
30. Mathews, M. B. (1988) in Growth Control During Cell Aging (Wang, E., and Warner, H. R., eds) C. R. C. Press, Boca Raton, FL, in press
31. Francke, W. W. (1987) Cell 48, 3-4
32. Anderson, L. (1981) Proc. Natl. Acad. Sci. U. S. A. 78, 2407-2411
33. Bishop, J. M. (1985) Cell 41, 23-38
34. Franza, B. R., Jr., and Garrels, J. I. (1984) Cancer Cells 1, 137-146
35. Bravo, R., and Macdonald-Bravo, H. (1984) EMBO J. 3, 3177-3181
36. Garrels, J. I., and Gibson, W. (1976) Cell 9, 793-805
37. Garrels, J. I. (1979) Dev. Biol. 73, 134-152
38. Aebersold, R. H., Leavitt, J., Saavedra, R. A., Hood, L. E., and Kent, S. B. H. (1987) Proc. Natl. Acad. Sci. U. S. A. 84, 6970-6974
39. Spitz, M. (1986) Methods Enzymol. 121, 33-41
40. Harlow, E., Franza, B. R., Jr., and Schley, C. (1985) J. Virol. 56, 533-546
41. Franza, B. R., Jr., Sambucetti, L. C., Cohen, D. R., and Curran, T. (1987) Oncogene 1, 213-221
42. Ochs, D. C., McConkey, E. H., and Guard, N. L. (1981) Exp. Cell Res. 135, 355-362
43. Lin, J. J. C., and Feramisco, J. R. (1981) Cell 24, 185-193
44. Yamashiro-Matsumura, S., and Matsumura, F. (1988) J. Cell Biol. 106, 1973-1983
45. Mathews, M. B., Bernstein, R. M., Franza, B. R., Jr., and Garrels, J. I. (1984) Nature 309, 374-376
46. Welch, W. J., Garrels, J. I., Thomas, G. P., Lin, J. J.-C., and Feramisco, J. R. (1983) J. Biol. Chem. 258, 7102-7111
47. Steinberg, R. A. (1984) J. Cell Biol. 98, 2174-2178
48. Larsen, P. M., Fey, S. J., Bravo, R., and Celis, J. E. (1983) Electrophoresis 4, 247-256