Automatic annotation of matrix-assisted laser desorption/ionization N-glycan spectra



REGULAR ARTICLE


David Goldberg¹, Mark Sutton-Smith², James Paulson³ and Anne Dell²

¹ Scripps-PARC Institute for Advanced Biomedical Sciences, Palo Alto, CA, USA
² Department of Biological Sciences, Imperial College London, London, UK
³ Department of Molecular Biology, Joint Department of Molecular Experimental Medicine, The Scripps Research Institute, La Jolla, CA, USA

Matrix-assisted laser desorption/ionization-mass spectrometry (MALDI-MS) is the pre-eminent technique for mass mapping of glycans. In order to make this technique practical for high-throughput screening, reliable automatic methods of annotating peaks must be devised. We describe an algorithm called Cartoonist that labels peaks in MALDI spectra of permethylated N-glycans with cartoons which represent the most plausible glycans consistent with the peak masses and the types of glycans being analyzed. There are three main parts to Cartoonist. (i) It selects annotations from a library of biosynthetically plausible cartoons. The library we currently use has about 2800 cartoons, but was constructed using only about 300 archetype cartoons entered by hand. (ii) It determines the precision and calibration of the machine used to generate the spectrum. It does this automatically based on the spectrum itself. (iii) It assigns a confidence score to each annotation. In particular, rather than making a binary yes/no decision when annotating a peak, it makes all plausible annotations and associates them with scores indicating the probability that they are correct.

Received: April 28, 2004
Revised: October 6, 2004
Accepted: October 15, 2004

Keywords: Glycan / Glycomics / High-throughput screening

Proteomics 2005, 5, 865–875

Correspondence: Dr. David Goldberg, Scripps-PARC Institute for Advanced Biomedical Sciences, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA
E-mail: [email protected]
Fax: +1-650-812-4471

Abbreviations: Fuc, fucose; Gal, galactose; GalNAc, N-acetylgalactosamine; GlcNAc, N-acetylglucosamine; Hex, hexose; HexNAc, N-acetylhexosamine; LacdiNAc, GalNAcβ1-4GlcNAc; LacNAc, N-acetyl-lactosamine; NeuAc, N-acetylneuraminic acid; NeuGc, N-glycolylneuraminic acid

© 2005 WILEY-VCH Verlag GmbH & Co. KGaA, Weinheim www.proteomics-journal.de

DOI 10.1002/pmic.200401071

1 Introduction

Profiling of N-linked glycan structures using MALDI-TOF-MS has become a widely used and desirable method for rapid acquisition of information about the nature and diversity of structures of glycans derived from native and recombinant glycoproteins, and even more complex mixtures from cells and tissues [1–3]. Annotation of the spectra and assignment of glycan structures to the mass peaks is typically done manually by an expert. Peaks are conveniently annotated with symbol structures, or cartoons, that represent the topology of the structure but not the precise nature of the glycosidic bonds. Although the only information provided by the spectra is molecular weight, specific monosaccharides can be assigned for portions of the structure with high probability, based on constraints imposed by the biosynthetic pathway. Thus, all core regions are composed of three mannoses (Man) and two N-acetylglucosamine (GlcNAc) residues and many hybrid and complex structures carry a single fucose (Fuc) on the core. Moreover, the antennae added to the core mannose residues in many complex N-glycans typically contain only GlcNAc, Gal, sialic acid (NeuAc or NeuGc) and Fuc, and frequently contain common patterns such as the Lewis X trisaccharide (Galβ1-4[Fucα1-3]GlcNAc). Thus, the most plausible structures corresponding to each peak can be deduced and represented in symbolic form by taking into account prior knowledge of the biosynthetic pathways known to occur in the organism from which the glycans are derived. The resulting glycan profile can then be used to determine which additional analytical methods should be employed to verify the structures of key mass peaks (e.g., nanoelectrospray (nanoES)-MS/MS before and after glycosidase digestions, methylation analysis, etc.). An attractive feature of the resulting MALDI-TOF glycan profiles is that they can be obtained with very small amounts of biological samples, and provide a structural ‘fingerprint’ that can be readily compared with the profiles of other samples (e.g., two tissues or cell lines, two batches of recombinant glycoprotein). For these reasons, the Consortium for Functional Glycomics has adopted this method for screening the glycan profiles of murine and human tissues and cell populations (http://www.functionalglycomics.org).

Currently MS glycan analysis is restricted to specialist laboratories and data are manually interpreted. While manual annotation of MALDI spectra is relatively straightforward for a trained mass spectroscopist, it is challenging for those not skilled in glycan MS and it is also very time-consuming. Typically only the major peaks are annotated even though many minor peaks are observed in complex samples. When glycan profiles from a series of samples are assessed, the annotation becomes a major bottleneck in the analysis. Even worse, the manual annotation can be resource limiting or even prohibitive for a high-throughput glycomics type screen of the glycan profiles of glycoproteins, cells, and tissues. To address this need we have developed an algorithm called Cartoonist, which performs automated annotation of MALDI-TOF spectra. Cartoonist incorporates the assumptions used in manual annotation of spectra with an algorithm capable of identifying glycan peaks, assigning plausible glycan structures and providing a confidence level assignment to aid further analysis as desired. Cartoonist can analyze a complete spectrum in several seconds, and produce results either as a table, or in several different graphical formats. The graphical forms consist of the spectra annotated with either cartoons of the structures in symbol form, or with formulas giving the counts of each monosaccharide.

We illustrate the use of Cartoonist by describing its results on spectra from mouse kidney. This is only one of over 100 different samples that have been successfully processed by Cartoonist. This automated method of annotating MALDI spectra should be of immediate utility to investigators performing MALDI-TOF glycan analysis on biological samples. Elements of the algorithm are anticipated to have utility in developing similar algorithms for automated analysis of O-linked glycans and for nanoES-MS/MS and/or MALDI-TOF/TOF data acquired from collisional activation of selected molecular ions.

2 Materials and methods

2.1 Mouse kidney

The C57BL/6 mice were bred and maintained in a restricted-access barrier facility under specific pathogen-free conditions. The kidney was excised from a mouse of several months of age and stored at −80 °C until the N-glycans were isolated from the tissue.

2.2 Isolation of N-glycans and chemical derivatization

N-Glycans were released from a kidney peptide extract by peptide:N-glycanase (PNGase F) treatment and were subsequently permethylated before analysis by MS, using MALDI-TOF-MS in the strategy described previously [4]. The spectra shown in the paper were acquired from analysis of 10% of glycan preparations isolated from a single mouse kidney. Permethylation was performed using the sodium hydroxide procedure as described previously [5].

2.3 MALDI-TOF analysis and data type

MALDI data were acquired using an ABI PerSeptive Biosystems Voyager-DE STR mass spectrometer (Cheshire, UK) in the reflectron mode with delayed extraction. Permethylated samples were dissolved in 10 µL methanol, and 1 µL dissolved sample was premixed with 1 µL matrix (2,5-dihydroxybenzoic acid; Sigma-Aldrich, Poole, UK) before loading onto a 100-well stainless steel sample plate. The input to Cartoonist was generated from Voyager Data Explorer data files by using the “Copy, Trace Data” function from the “Edit” menu.

2.4 Assigning cartoons to peaks

The first step in matching peaks to potential glycans is to construct a table of potential glycans. Each row of the table contains a potential glycan, its theoretical mass, and the abundance of its +0, +1, ..., +5 isotopes. (The most abundant isotope for a glycan of mass 6000 is at m + 4 (18%), with 13% at +5: this is the reason for going up to the +5 isotope.) For the instrument used here, after bias correction the mass uncertainty is less than 0.1 Da over the majority of the mass range. Thus, for each peak there is at most one set of glycans (those with the same atomic number, and hence having masses within 0.1 Da of each other) that can be assigned to the peak. However, before bias correction, or at high mass where signal intensity, resolution and accuracy are reduced, there are often several possible sets of glycans with masses differing by 1 Da that can be assigned to a peak, or alternatively several possible peaks that can be assigned to a set of glycans. We handle this systematically by fixing a tolerance and, for each potential glycan in the table, finding all candidate peaks of the spectrum that match the theoretical mass of the glycan within the tolerance. From among the candidate peaks we select the one whose observed isotope abundance series (envelope) in the spectrum best matches the theoretical envelope of the glycan. The table of potential glycans is generated by using all possible sets of monosaccharides within a range set by the experimenter. For mammalian tissues, we use constraints such as requiring a minimum of two HexNAc and three Hex residues. In addition to limiting the number of each monosaccharide used to generate potential glycans, the experimenter can also include relations that must be satisfied.
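The table construction and tolerance matching described above can be sketched as follows. This is a minimal illustration, not the Cartoonist implementation: the residue masses are standard monoisotopic values for permethylated, singly sodiated glycans supplied here for illustration (the paper does not list them), and the function names, the example ranges and the example relation are all hypothetical. The isotope-envelope comparison used to pick among candidate peaks is omitted.

```python
from itertools import product

# Monoisotopic masses (Da) of permethylated residues. Standard values used
# for illustration only; they are not taken from the paper.
RESIDUE_MASS = {
    "HexNAc": 245.1263,
    "Hex": 204.0998,
    "Fuc": 174.0892,
    "NeuAc": 361.1737,
    "NeuGc": 391.1842,
}
REDUCING_END = 46.0419  # methyl + methoxy caps of the free permethylated glycan
SODIUM = 22.9892        # assumption: peaks are [M+Na]+ ions


def glycan_mass(composition):
    """Theoretical m/z of a singly sodiated permethylated composition."""
    residues = sum(RESIDUE_MASS[name] * n for name, n in composition.items())
    return residues + REDUCING_END + SODIUM


def build_table(ranges, relations=()):
    """Enumerate every composition inside the experimenter's ranges
    that also satisfies all of the supplied relations."""
    names = list(ranges)
    spans = [range(lo, hi + 1) for lo, hi in ranges.values()]
    table = []
    for counts in product(*spans):
        comp = dict(zip(names, counts))
        if all(rel(comp) for rel in relations):
            table.append((comp, glycan_mass(comp)))
    return table


def candidate_peaks(theoretical_mass, peak_masses, tolerance):
    """All observed peaks within the fixed tolerance of a theoretical mass."""
    return [m for m in peak_masses if abs(m - theoretical_mass) <= tolerance]


# Hypothetical ranges: at least two HexNAc and three Hex, as in the text.
ranges = {"HexNAc": (2, 6), "Hex": (3, 9), "Fuc": (0, 2), "NeuAc": (0, 4)}
# Hypothetical relation: no more sialic acids than non-core HexNAc residues.
table = build_table(ranges, relations=[lambda c: c["NeuAc"] <= c["HexNAc"] - 2])
```

Keeping the relations as plain predicates on a composition makes it easy to swap in a different constraint set for another species or tissue without touching the enumeration itself.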


2.5 Cartoon library

The cartoon library is constructed by giving a set of archetype cartoons and generating the library from them using a set of rules. A sample of archetype cartoons is shown in Fig. 1a. Currently we are using only three rules. The first rule concerns a Fuc attached to the GlcNAc at the reducing end of the core. The archetype set always contains such a Fuc; the first rule generates glycans without that Fuc. The second rule concerns bisecting GlcNAcs. The archetype set never contains them; the second rule generates them automatically. The third rule concerns sialic acids. The archetype set only contains N-acetylneuraminic acid (NeuAc); these residues are systematically replaced by NeuGc. An example application of the rules is shown in Fig. 1b. The result of applying these rules is that 300 reference N-glycans are expanded to a pool of 2800 N-glycans. We are working on the addition of further rules, which will enable the user-specified archetype set to be made significantly smaller. The glycan library can contain multiple cartoons for the same atomic number, and currently Cartoonist discriminates between them by using its ranking rules. Multiple cartoons can arise from different glycans of similar mass, or from alternate topologies for a glycan (illustrated with red labels in Fig. 1a).
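As a rough illustration of the three rules, the sketch below expands one archetype at the level of monosaccharide counts only. It is an assumption-laden simplification: real cartoons carry topology, which is ignored here; the function name and dictionary keys are hypothetical; and the third rule is assumed to replace any number of NeuAc residues (from none to all) by NeuGc, which the text does not spell out.

```python
def expand_archetype(archetype):
    """Apply the three library rules to one archetype composition.

    The archetype is assumed to carry a core Fuc, no bisecting GlcNAc,
    and NeuAc as its only sialic acid. The result includes the archetype itself.
    """
    library = []
    for core_fuc in (True, False):                      # rule 1: with / without core Fuc
        for bisect in (False, True):                    # rule 2: optional bisecting GlcNAc
            for n_gc in range(archetype["NeuAc"] + 1):  # rule 3: NeuAc -> NeuGc
                comp = dict(archetype)
                if not core_fuc:
                    comp["Fuc"] -= 1
                if bisect:
                    comp["GlcNAc"] += 1
                comp["NeuAc"] -= n_gc
                comp["NeuGc"] = n_gc
                library.append(comp)
    return library


# Hypothetical archetype with one NeuAc: 2 x 2 x 2 = 8 variants.
variants = expand_archetype({"GlcNAc": 4, "Hex": 5, "Fuc": 1, "NeuAc": 1})
```

For an archetype with a single NeuAc this yields eight variants, the same count as the eight library cartoons that Fig. 1b reports for a single archetype; whether the two sets correspond one-to-one is not something this sketch can confirm.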

2.6 Cartoon archetype set

The main biosynthetic pathways for the assembly of N-glycans from the common dolichol-linked glycan have been well established in mammalian cells [6, 7]. Since then, identification and cloning of a number of glycosyltransferases and gene-targeting technology have allowed more biosynthetic routes to be identified [8]. This has been aided by tools that predict the structures of glycans generated from the activities of glycosyltransferases [9, 10]. Complementary information is provided by structural studies performed during the past three decades [8]. All this information was used to assign by hand one or more cartoons to the peaks from a few dozen spectra that matched an entry in the potential glycan table.

These cartoons were used to initialize the archetype set. Once this was done, the archetype set was semi-automatically expanded as follows. Each time Cartoonist was run on a new spectrum, it flagged intense peaks that matched an entry in the potential glycan table but did not have a cartoon. These peaks were examined by hand, and for those that had the mass of an N-glycan that can be produced by mammalian cells, a cartoon for that glycan was added to the archetype set.

2.7 Calibration and precision

In order to determine the confidence of a peak assignment, we need to correct for any miscalibration of the spectrum, and obtain an estimate for the precision of the spectral peak masses. The calibration of a spectrum is determined by first finding about 15 high-confidence peak assignments. These are peaks that are relatively large (typically among the highest 200 peaks), have isotope envelopes that closely match their theoretical values, and have a mass very close to the mass of a predicted glycan. For each such peak of mass $m_i$, the deviation between theoretical and observed peak masses, $d_i = \text{pred} - \text{obs}$, is computed. The $(m_i, d_i)$ pairs are fitted to a line $a m + b$, giving corrected deviations $d_i' = d_i - (a m_i + b)$. Since some of the assignments may be wrong, we find the best linear fit using an algorithm that is insensitive to outliers [11]. The constants $a$ and $b$ represent a “software calibration” of the machine, and the spread of the deviations $d_i'$ gives a measure of the precision. As a final refinement, this new measure $d'$ is used to recompute a new set of high-confidence peak assignments, which are then used to get a refined estimate of $d'$.

The precision of the spectra is controlled by the spread of the deviations $d'$ for correct peak assignments. The difficulty is that we need to know the spread in order to determine if a peak assignment is correct. We break this circularity in the following way. First we generate a list of all potential peak assignments. These are peaks that have a reasonable isotope envelope and come within about 100 ppm of a predicted glycan. The $d'$ for these peaks will cluster around zero as in Fig. 2. If there is a central cluster with some distant outliers, we assume the outliers are not valid assignments, and remove them. We use the spread of the remaining deviations as a measure of precision of the spectral masses. Specifically, we take the standard deviation $\sigma$ of the remaining deviations and assume that any deviation that is more than $3\sigma$ is very unlikely. Most of the spectra (about 80%) have a very prominent gap separating the central cluster from outliers, and so any reasonable method of gap detection can be used. Figure 2 is typical, with a clear cluster of deviations within roughly 0.05 Da of 0.

The specific algorithm we employ works as follows. We sort the deviations by magnitude from smallest to largest, giving $|d_1| \le |d_2| \le \dots \le |d_N|$ (from now on we drop the primes, but these are still corrected deviations). Then we compute the difference between successive magnitudes, $\Delta_k = |d_k| - |d_{k-1}|$. If a difference $\Delta_k$ is more than 2.0 times the maximum of all the previous $\Delta_j$ with $j < k$, it is flagged as a potential gap. The largest potential gap $\Delta_k$ (i.e., the one with the largest $\Delta_k / \max_{j<k} \Delta_j$) satisfying $k > n$ (where $n$ is the number of peaks used to compute the linear fit $a m + b$) is used to identify the outliers. Specifically, outliers are all peaks with deviations $d_j$ beyond the gap, that is, $|d_j| \ge |d_k|$. As an example, for the kidney spectra used in this paper, the largest gap has a ratio of 9.0. For randomly generated deviations following a Gaussian distribution, such a large gap would occur less than 1 time out of 1000.

Figure 1. Cartoon generation. Cartoons are generated from an archetype set (a) using rules (b). Part (a) shows a sample of eight archetype cartoons. The labels are not masses, but rather represent the number of GlcNAc, Hex, Fuc, and NeuAc monosaccharides, respectively. The red labels mark an example of two different cartoons with the same label (i.e., identical monosaccharide composition). Part (b) shows a single archetype cartoon, and the eight library cartoons it generates.

Figure 2. Distribution of deviations for the kidney spectrum. The histogram shows the deviations (theoretical mass – observed mass) for all peaks that can be assigned to a glycan, including very low confidence assignments.
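The calibration and outlier-removal steps of Section 2.7 can be sketched as below. Two caveats apply. The paper cites [11] for its outlier-insensitive fit without naming the method, so this sketch substitutes a Theil-Sen estimator (median of pairwise slopes), which is a different but comparable robust technique. The function names, the `n_fit` parameter name, and the use of the population standard deviation for the spread are likewise assumptions made for illustration.

```python
from itertools import combinations
from statistics import median, pstdev


def robust_line(masses, deviations):
    """Theil-Sen fit d ~ a*m + b (a stand-in for the robust fit of [11])."""
    slopes = [
        (d2 - d1) / (m2 - m1)
        for (m1, d1), (m2, d2) in combinations(zip(masses, deviations), 2)
        if m2 != m1
    ]
    a = median(slopes)
    b = median([d - a * m for m, d in zip(masses, deviations)])
    return a, b


def corrected(masses, deviations, a, b):
    """Corrected deviations d' = d - (a*m + b)."""
    return [d - (a * m + b) for m, d in zip(masses, deviations)]


def split_at_gap(devs, n_fit, factor=2.0):
    """Return (kept, outliers) using the gap rule described in the text."""
    mags = sorted(abs(d) for d in devs)
    best_ratio, cutoff = 0.0, None
    max_prev = 0.0  # largest difference seen before the current one
    for k in range(1, len(mags)):
        delta = mags[k] - mags[k - 1]
        # 0-based index k corresponds to position k + 1 in the text's numbering.
        if max_prev > 0 and delta > factor * max_prev and k + 1 > n_fit:
            ratio = delta / max_prev
            if ratio > best_ratio:
                best_ratio, cutoff = ratio, mags[k]
        max_prev = max(max_prev, delta)
    if cutoff is None:          # no qualifying gap: keep everything
        return list(devs), []
    kept = [d for d in devs if abs(d) < cutoff]
    outliers = [d for d in devs if abs(d) >= cutoff]
    return kept, outliers


# Hypothetical example: five tight deviations and two distant ones.
kept, outliers = split_at_gap([0.01, -0.02, 0.03, -0.04, 0.05, 0.9, -1.1], n_fit=3)
sigma = pstdev(kept)  # spread of the surviving deviations, used as the precision
```

Theil-Sen was chosen here only because it needs nothing beyond the standard library and tolerates a sizeable fraction of wrong assignments; any outlier-resistant line fit could take its place.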

2.8 Confidence score

We compare the deviation to the standard deviation $\sigma$ and compare the observed isotope envelope and the predicted envelope to compute a confidence score. Figure 3 illustrates that the theoretical isotope abundances and the peak intensities of isotopes can match quite closely. In detail, for a peak of mass $m$, we use $d'/\sigma$ as a measure of confidence based on closeness. We also measure the height of the peaks at mass $m-1, m, m+1, \dots, m+3$ and compare these heights to the sequence $0, f_0, f_1, f_2, f_3$, where $f_k$ is the predicted abundance of the $+k$ isotope of the associated glycan (unless the mass is large, in which case we start with the isotope whose abundance is less than 10%). We treat the sequence of heights $(h_i)$ and the sequence of frequencies $(f_i)$ as vectors, normalize them and compute the norm of their difference as a measure of the match between theoretical and observed isotope envelopes. The reason for examining a peak (if any) at mass $m-1$ is to preclude matching the +1 (or higher) isotope in the spectrum to a +0 isotope of the potentially matching glycan. The confidence score is the sum of the deviation and vector difference norm, scaled so that a peak with a deviation of $1\sigma$ and a vector difference of $1\sigma$ has score 10, with lower scores indicating higher confidence assignments. The hand annotations of Fig. 4 have a median confidence score of 4, with scores ranging from 1 to 21. The distribution of confidence scores for assignments made by Cartoonist to the mouse kidney spectrum is shown in Fig. 5.
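One way to read the scoring rule above is sketched here. The text does not give the exact scaling, so several choices are assumptions: vectors are normalized to unit Euclidean length, the envelope mismatch is divided by a hypothetical spread parameter `sigma_env`, and the two terms are weighted equally so that one sigma of each yields a score of 10. The special handling of very large masses is omitted.

```python
from math import sqrt


def unit(v):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def envelope_mismatch(heights, abundances):
    """Norm of the difference between normalized observed and predicted envelopes.

    heights:    observed peak heights at m-1, m, m+1, m+2, m+3
    abundances: predicted abundances f0..f3 of the +0..+3 isotopes
    """
    predicted = [0.0] + list(abundances)  # nothing is expected at m-1
    h, f = unit(heights), unit(predicted)
    return sqrt(sum((x - y) ** 2 for x, y in zip(h, f)))


def confidence_score(deviation, sigma, heights, abundances, sigma_env):
    """Lower is better; one sigma of each term gives a score of 10 (assumed weighting)."""
    closeness = abs(deviation) / sigma
    shape = envelope_mismatch(heights, abundances) / sigma_env
    return 5.0 * (closeness + shape)
```

Because the expected height at m-1 is zero, a strong observed peak there raises the mismatch term, which is how the score penalizes assigning a glycan's +0 isotope to what is really the +1 peak of a lighter species.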

3 Results

Cartoonist is an algorithm that annotates raw MALDI-TOF data of permethylated N-glycans released from glycoproteins. The raw MALDI-TOF data is transformed by Cartoonist into a spectrum annotated with N-glycan cartoons. All mass spectral peaks that match mass-to-charge values of permethylated N-glycans are annotated with cartoons. The cartoons provide both compositional information and predictions of probable topologies. Since a ‘fingerprint’ of a given analyte is completed in a few seconds, this algorithm goes some way towards high-throughput MALDI-TOF-based glycan profiling.


Figure 3. Isotope envelopes. A portion of the mouse kidney spectrum is shown illustrating the isotope envelope of a glycan. The envelope is for isotopes of the glycan with theoretical mass 4084.05. The +2 isotope (theoretical mass 4086.05, observed mass 4085.82) is predicted to be the most abundant and is indeed the most intense peak. The large dots are drawn at a height that corresponds to the theoretical isotope intensities.

It is outside the scope of this paper to discuss the methodologies of N-glycan profiling, which are well documented elsewhere [5], but suffice it to say that N-glycans are enzymatically released from the glycoproteins isolated from mammalian matrices. These are subsequently permethylated to allow high sensitivity MALDI-TOF glycan profiling, and to provide highly informative follow-up MS/MS studies by directing fragmentation pathways. Once the MALDI-TOF experiment has been completed, a text file of the spectrum is imported directly into Cartoonist, which then produces a report associating glycans to peaks as well as image files showing the spectrum with annotations.

The Cartoonist program is described in detail in Section 2. The algorithm is based on three main elements. (i) It restricts annotations to a library of about 2800 biosynthetically plausible cartoons constructed using only about 300 archetype cartoons entered by hand. (ii) It determines the precision and calibration of the machine used to generate the spectrum. It does this automatically based on the spectrum itself. (iii) It assigns a confidence score to each cartoon identification. In particular, rather than making a binary yes/no decision when assigning cartoons, it makes potential assignments and associates a confidence score to each one. The whole process requires minimal user intervention and therefore is ideally suited for the nonspecialist.

Figure 4. Manual annotation of mouse kidney spectrum by a human expert. For clarity, only 19 of the major peaks have been annotated.


Figure 5. Histogram of annotation confidence. The histogram shows the number of annotated peaks with confidence scores ranging from 1 to 30. Lower scores indicate higher confidence assignments. There are almost 65 peaks that had an annotation with confidence score less than 10. These are almost certainly valid annotations.

The core 300 archetype cartoon library is based on published structures along with the most recent working knowledge of the N-glycan biosynthetic pathway [8]. The cartoons provide compositional information as well as high probability topological configurations. High mannose, hybrid and complex N-glycans are the three main classes of N-glycans that form the cartoon library. All cartoons are based on the common pentasaccharide N-glycan core structure composed of three mannoses and two GlcNAcs derived from the common lipid-linked oligosaccharide precursor, oligosaccharide dolichol-pyrophosphate, synthesized in the endoplasmic reticulum [12]. High mannose structures contain up to six additional mannose residues attached to the core in a topology defined by the oligosaccharide precursor degradation pathway [13].

The remainder of the library is composed of hybrid and complex N-glycans. Hybrid structures have one or two GlcNAc residues attached to the 3-arm, and mannose residues attached to the 6-arm of the N-glycan core. Complex structures have GlcNAc residues attached to the 3-arm and the 6-arm of the core. Each of the GlcNAc residues attached to the core forms a ‘stub’ that may be elongated by an array of glycosyltransferases forming antennae. This results in the hybrid N-glycans existing as mono- and bi-antennary species, and complex N-glycans existing as mono-, bi-, tri-, tetra- and penta-antennary species. In addition, the library contains a few truncated structures including some without a complete core.

The GlcNAc antennae stubs are usually linked to galactose residues in β1-4 or β1-3 linkages, resulting in the generation of type-2 (Galβ1-4GlcNAcβ-) or type-1 (Galβ1-3GlcNAcβ-) lactosamine units (LacNAc), respectively. The type-2 LacNAc unit is much more prevalent than the type-1 LacNAc unit. Less commonly, type-2 chain production can proceed through an alternative route when a GalNAc residue is added to a GlcNAc residue, generating a “LacdiNAc” unit (GalNAcβ1-4GlcNAcβ-). The LacNAc and LacdiNAc structures may be further elongated by additional LacNAc and LacdiNAc repeats to form N-glycans with polylactosamine repeats such as the “I blood group antigen”. The galactose and GlcNAc residues may then be acted on by an array of glycosyltransferases to produce terminal and subterminal sequences such as the Lewis antigens and various sialylated structures (see Fig. 6). These hybrid and complex N-glycans form the bulk of the archetype set.

To maintain the cartoon set at manageable levels, constraints are imposed based on knowledge of biosynthetic pathways. These constraints can be removed or altered depending on the analyses being carried out. For example, for initial assignment of mammalian N-glycan data, cartoon antennae are limited to eight types. These can be combined with lactosamine units and fucosylated lactosamine units, as shown in Fig. 6. A sample of the archetype mammalian database is shown in Fig. 1a. Multiple cartoons of the same mass are given ranks. For example, LacdiNAc moieties are given a low rank by default, but if the glycans are derived from pituitary tissue, which is known to be rich in LacdiNAc moieties, then the rank of these cartoons can be raised.

To illustrate the use of Cartoonist we have selected data from MALDI profiling of the mouse kidney because this organ has been extensively characterized in our studies of knockout mice [14, 15]. N-Glycans were isolated from mouse kidney as previously described [4] and MALDI profiles of their permethylated derivatives were obtained according to our published strategies [3, 16]. Figure 4 shows a portion of the experimental data which is manually annotated with cartoons. The experimental data can be analyzed by Cartoonist in a number of ways. (i) Approximately 20 of the most abundant peaks can be annotated with high-probability cartoon assignments. (ii) Portions of the spectrum can be more extensively annotated with or without constraints on the types of possible sequences, that is, showing all cartoons, no matter their rank. This is illustrated in Fig. 8, which zooms in on the spectrum between 2500 and 3500 Da, showing additional annotations for the peaks in Fig. 7 as well as annotations for minor peaks that were not annotated there. All matching cartoons from the library are shown, although those of lower rank are greyed out. Yet another format annotates with monosaccharide compositions, as shown in Fig. 9. The results of Cartoonist can also be presented in tabular form, listing each spectral peak with a potential glycan assignment, the observed peak mass and theoretical glycan mass and information about the confidence of the match (not shown).

Figure 6. Biologically plausible mammalian antennae. Left of the brace, the top row shows the default set of eight basic antennae used in the cartoons that annotate mammalian glycan spectra. They can be modified if additional information is known. The bottom row shows a lactosamine unit and a fucosylated lactosamine. On the right side of the brace are a few examples of the many ways the basic antennae and lactosamine units can be combined to form longer antennae.

Figure 7. Automatic annotation of mouse kidney spectrum by Cartoonist. For clarity only the 18 most intense peaks have been annotated. Cartoonist is capable of annotating all peaks exhibiting correct isotope envelopes, including very minor components. When the library contains multiple cartoons for a peak, only the cartoon(s) of highest rank are shown. The confidence of the annotation is indicated by the color of the arrow, with darker arrows indicating higher confidence. The spectrum contains a high number of glycans, and so the small number of annotated peaks are all of very high confidence.

The annotations take into account information provided by the user such as biological source and prior structural studies in the form of ranks in the cartoon library. The Cartoonist-annotated spectrum shown in Fig. 7 was derived using the following constraints. First, the sample is of mammalian origin. This is captured by the library itself, which is restricted to the glycans found in mammals. Then the cartoons in the library are ranked, by giving demerits to: (i) terminal GlcNAcs (i.e., not substituted by a galactose), (ii) LacdiNAc moieties, (iii) sialyl-Lewis X moieties, (iv) a sialic acid linked to another sialic acid, (v) antennae containing multiple fucose residues, (vi) hybrid structures, (vii) cartoons with five antennae. As an example, consider the two cartoons in Fig. 1a with the same monosaccharide composition “5 4 2 0”. The leftmost one has a higher rank, since it is given only a single demerit for LacdiNAc, whereas the rightmost cartoon has two demerits for the two terminal uncapped GlcNAcs. Although this ranking system is simple, the highest ranked cartoons are similar to what a human expert would assign, as can be seen by comparing Figs. 4 and 7. Currently, we use these ranks for all the glycan spectra, and get reasonable results. We plan to improve the performance of Cartoonist by constructing some standard ranking sets that can be selected by researchers (e.g., a ranking for human cells would give NeuGc a very low rank).
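The demerit-based ranking described above can be illustrated with a short sketch. It is not the Cartoonist implementation: the `Cartoon` class, the feature names, and the equal weighting of one demerit per occurrence are assumptions made for illustration, chosen to be consistent with the Fig. 1a example (one demerit for a LacdiNAc, two for two uncapped GlcNAcs).

```python
from dataclasses import dataclass, field

# Hypothetical feature names, one per demerit rule (i)-(vii) in the text.
DEMERIT_FEATURES = (
    "terminal_glcnac",       # (i)   GlcNAc not substituted by a galactose
    "lacdinac",              # (ii)  LacdiNAc moiety
    "sialyl_lewis_x",        # (iii) sialyl-Lewis X moiety
    "sia_on_sia",            # (iv)  sialic acid linked to another sialic acid
    "multi_fucose_antenna",  # (v)   antenna with more than one fucose
    "hybrid",                # (vi)  hybrid structure
    "five_antennae",         # (vii) cartoon with five antennae
)


@dataclass
class Cartoon:
    """Illustrative stand-in for a library entry (not the authors' data model)."""
    name: str
    features: dict = field(default_factory=dict)  # feature name -> occurrence count


def demerits(cartoon):
    """Total demerits, assuming one demerit per occurrence of each feature."""
    return sum(cartoon.features.get(f, 0) for f in DEMERIT_FEATURES)


def rank_cartoons(cartoons):
    """Order cartoons from highest rank (fewest demerits) to lowest."""
    return sorted(cartoons, key=demerits)


# The two same-composition cartoons discussed for Fig. 1a.
left = Cartoon("left", {"lacdinac": 1})
right = Cartoon("right", {"terminal_glcnac": 2})
ranked = rank_cartoons([right, left])
```

Because `sorted` is stable, cartoons with equal demerits keep their library order; a species-specific ranking set could be modelled by replacing the equal weights with a per-feature weight table.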

Figure 8. Automatic annotation of a portion of the mouse kidney spectrum. All matching cartoons from the library are shown, but the ones of lower rank have been greyed out. Only the portion of the spectrum between 2500 and 3500 is shown.

Whether done by machine or by hand, there are two issues that have to be addressed when making assignments. First, what is the pool of potential glycans and cartoons that assignments are drawn from? Without some restrictions the output of the labelling process will be cluttered with many biologically implausible structures. Of course the pool of potential glycans will depend on the source of the sample being analyzed. As described in Section 2, the experimenter specifies parameters on the set of potential glycans, and creates a set of archetype cartoons which is then used to automatically generate a much larger library of cartoons used for mapping. In this way different databases can be created for different species, tissues, etc. The second issue concerns the precision and calibration of the data. For example, should a peak of mass 2285.33 be matched to a potential glycan of theoretical mass 2286.15, even though it is 0.8 Da low? This obviously depends on the precision of the machine. But calibration is also important, because frequently the measured mass spectral peaks are systematically biased [17]. Figure 10 shows an analysis of two spectra taken at different times on the same TOF instrument from the same mouse kidney sample. They show deviation (theoretical mass – observed mass) plotted against the mass of the peak. In each case the deviation varies linearly with the mass of the peak. A linear fit is shown, and the spread around the fitting line is similar (less than 0.1 Da), but the calibration (slope of line) is quite different. From the plot it is clear that near 2200 Da, the measured peak masses are consistently about 0.05 Da lower than the theoretical glycan masses for the first spectrum, and about 0.7 Da lower for the second spectrum. Thus 2285.33 is an excellent correspondence to 2286.15 for the second spectrum, but quite unlikely for the first one. Most of the spectra show a consistent bias in peak masses, similar to Fig. 10, but the bias varies considerably between spectra, even those taken on the same instrument. Section 2 explains how Cartoonist can automatically adjust for the bias and also estimate the noise in mass values. The significance of this is that the experimenter can get detailed information from the spectra without having expert knowledge about the spectrometer, and can spend less time on continual recalibration during high-throughput data collection.
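To make the role of calibration concrete, the following sketch fits a line to deviation versus mass and asks how many residual standard deviations a candidate match lies from that line. It is an illustration only: ordinary least squares is used as a stand-in for the procedure of Section 2, which may differ, and the calibration points are synthetic values invented to mimic the two spectra described above (biases of roughly 0.05 Da and 0.8 Da near 2200 Da). All function and variable names are hypothetical.

```python
import math


def fit_line(xs, ys):
    """Ordinary least-squares fit y = a + b*x; returns (a, b, residual_sd)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return a, b, sd


def standardized_deviation(observed, theoretical, a, b, sd):
    """Deviation of a candidate match from the fitted bias, in units of sd."""
    expected = a + b * observed
    return ((theoretical - observed) - expected) / sd


# Synthetic calibration data (assumed, not measured): peak masses and a
# small fixed scatter around two different linear biases.
masses = [1500.0, 1800.0, 2200.0, 2600.0, 3000.0, 3400.0]
scatter = [0.03, -0.02, 0.01, -0.03, 0.02, -0.01]
dev_first = [0.000023 * m + e for m, e in zip(masses, scatter)]
dev_second = [0.00035 * m + e for m, e in zip(masses, scatter)]

z_first = standardized_deviation(2285.33, 2286.15, *fit_line(masses, dev_first))
z_second = standardized_deviation(2285.33, 2286.15, *fit_line(masses, dev_second))
```

With these assumed data the same 0.82 Da gap is within one residual standard deviation of the fitted line for the second spectrum and tens of standard deviations away for the first, which is the qualitative behaviour described in the text.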

Figure 9. Automatic annotation of mouse kidney spectrum by monosaccharide composition.

Figure 10. Plot of mass against deviation for two mouse kidney spectra. Two different spectra are shown for the same kidney sample. Each plot shows the 50 most intense peaks that have a glycan assignment. The y-axis gives the deviation, the difference between the theoretical glycan mass and the observed peak mass. Note that the left-hand spectrum has deviations of at most 0.2 Da, whereas the right-hand spectrum has deviations as large as 3.0 Da.

When confronted with a potential glycan identification that is somewhat uncertain, Cartoonist does not apply an arbitrary cut-off to decide whether to report the assignment. Rather it associates a confidence score to each potential assignment, representing the probability that the assignment is valid. The score is indicated using color in Figs. 7 and 8. The score is based on two factors. (i) How does the deviation of the measured peak mass from the theoretical glycan mass compare to the estimated precision of the machine? (ii) How reasonable is the shape of the isotope envelope? It is straightforward to compute theoretical heights for the +1, +2, . . . isotope peaks of a glycan based on the atomic composition of the glycan and isotope frequencies of atoms. Cartoonist compares the theoretical and observed isotope peaks, and a significant mismatch will reduce the confidence of our assignment. These factors are combined to form a confidence score. Figure 5 shows the distribution of confidence scores for the annotated peaks in the kidney spectrum. A score of 10 corresponds roughly to one standard deviation. Even low intensity peaks can have a high confidence assignment: there are 13 annotations with an excellent confidence score (10 or less) that are assigned to peaks that are not among the 500 most intense peaks.
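The isotope-envelope calculation mentioned above can be sketched as a convolution of per-element isotope distributions. The sketch below is not taken from Cartoonist: the abundance table uses standard approximate natural abundances, the root-mean-square mismatch measure is an assumption, and the example formula (C69 H124 N2 O36 Na, which we take to be the sodium adduct of permethylated Man5GlcNAc2) is our own worked example. How Cartoonist combines the envelope mismatch with the mass deviation is not specified here, so no combined score is attempted.

```python
# Approximate natural isotope abundances as (nominal mass offset, fraction).
ISOTOPES = {
    "C": [(0, 0.9893), (1, 0.0107)],
    "H": [(0, 0.999885), (1, 0.000115)],
    "N": [(0, 0.99636), (1, 0.00364)],
    "O": [(0, 0.99757), (1, 0.00038), (2, 0.00205)],
    "Na": [(0, 1.0)],
}


def convolve(p, q, n_peaks):
    """Convolve two peak-height lists, keeping only the first n_peaks offsets."""
    out = [0.0] * n_peaks
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            if i + j < n_peaks:
                out[i + j] += a * b
    return out


def element_envelope(symbol, count, n_peaks):
    """Isotope distribution of `count` atoms of one element (repeated squaring)."""
    power = [0.0] * n_peaks
    for offset, fraction in ISOTOPES[symbol]:
        if offset < n_peaks:
            power[offset] = fraction
    result = [1.0] + [0.0] * (n_peaks - 1)
    while count:
        if count & 1:
            result = convolve(result, power, n_peaks)
        power = convolve(power, power, n_peaks)
        count >>= 1
    return result


def isotope_envelope(formula, n_peaks=5):
    """Theoretical heights of the +0, +1, ... peaks, scaled so the tallest is 1."""
    env = [1.0] + [0.0] * (n_peaks - 1)
    for symbol, count in formula.items():
        env = convolve(env, element_envelope(symbol, count, n_peaks), n_peaks)
    top = max(env)
    return [h / top for h in env]


def envelope_mismatch(theoretical, observed):
    """RMS difference between envelopes (an assumed, illustrative measure)."""
    top = max(observed)
    scaled = [h / top for h in observed]
    return (sum((t - o) ** 2 for t, o in zip(theoretical, scaled)) / len(theoretical)) ** 0.5


# Assumed example formula; see the lead-in.
env = isotope_envelope({"C": 69, "H": 124, "N": 2, "O": 36, "Na": 1})
```

For this formula the monoisotopic peak is the tallest and the +1 peak is a little under four fifths of its height, dominated by the 13C contribution. An observed envelope that departs strongly from this shape would yield a large `envelope_mismatch` and so argue against the assignment.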

4 Discussion

Limitations imposed by sample preparation are an issue in MS-based high-throughput glycoproteomics. However, downstream bioinformatics, from the acquisition of MS data to precise glycan identification, is an even bigger bottleneck. This is an issue that can be readily addressed by computer scientists. Accordingly, our main objective is to develop and make available to glycoscientists algorithms that enable glycans and glycoproteins to be screened and identified via MS. We have implemented a prototype algorithm in a program called Cartoonist that can perform cartoon annotation of MALDI-TOF screens of mammalian N-glycans. As exemplified by data depicted in Figs. 7–9, Cartoonist is a highly versatile program that allows the rapid annotation of data from MALDI profiling of N-glycans. The current version of Cartoonist annotates raw MALDI-TOF data from permethylated N-glycans. It is intended that future versions will accommodate native as well as other derivatized glycans. This will provide the community with programs capable of analyzing data generated from the wide variety of experiments that are currently used to study glycan structures.

We have applied Cartoonist to over 100 spectra and in all cases the program has provided meaningful assignments. Each MALDI spectrum was first manually assigned compositions and probable structures by an expert before being automatically annotated by Cartoonist. Comparison of the manual interpretations with the automatic cartoon assignments shows that the compositional information was invariably “correct” and the disparity between the predicted structures and the cartoons was minimal. Comparison of Figs. 4 and 7 is typical. Fifteen of the hand-annotated peaks were also identified by Cartoonist. For 12 of the 15 the annotations were identical; for the other three Cartoonist offered two possible cartoons, one of which matched the hand-annotated one (for the peak with two alternate expert annotations, Cartoonist found both). The remaining differences are due to variations in the set of peaks chosen for annotation. Cartoonist selects peaks for annotation starting with the most intense peak and continuing systematically to the second most intense, etc., whereas human annotators deviate slightly from this.

In cases where Cartoonist does not reach sensible structure predictions, “biological constraints” can be applied by adjusting the ranks in the working cartoon library accordingly. Peaks with high mass-to-charge values (>m/z 4000) tend to be assigned multiple cartoons as these are indistinguishable from each other when a single MALDI-TOF glycan profiling experiment is performed. Consequently, these ambiguous glycan assignments usually become targets for follow-up experiments, e.g., exoglycosidase, MS/MS experiments, etc. To take a straightforward example, if a peak is assigned as a sialylated tri-antennary N-glycan or as a tri-antennary N-glycan with exclusively Lewis X then these can be distinguished by a sialidase experiment. Alternatively, the same peak can be selected for an MS/MS experiment and diagnostic fragment ions can easily show if sialic acid or Fuc residues are present on the nonreducing terminal sequences. This example demonstrates the value of simple cartoon assignments as they can direct follow-up experiments by indicating important structural features that may be associated with a given glycan, e.g., specific terminal sequences that correlate with biological activity [8].

A further issue is the idea of “constraints”. To generate an accurate annotated N-glycan kidney MALDI profile, ranks had to be introduced into the library in order to get a favorable output. These assumptions are based on prior knowledge of the N-glycan structures and relative glycosyltransferase activities present in the mouse kidney. It is intended that this new capability will be implemented in the form of an interface where the end-user can enter information, such as the biological source of the analyte. Appropriate ranks will then be imposed based on the latest knowledge of structures and glycosyltransferase activities. In view of these limitations it is intended that the cartoon library is an open source file that can be easily modified by the end-user in order to allow the prototype algorithm to ‘learn’ from progressive analysis.

While numerous programs for annotating peptide spectra exist, for example [18], only a few programs aimed at facilitating mass spectrometric analysis of glycans have been reported. Almost all are based on MS/MS fragmentation analysis, and none perform the careful calibration and precision analysis that we feel are key to automated high-throughput analysis that can reliably detect when high-mass, low-intensity peaks represent complex, possibly uncharacterized glycans. Progress in this area first manifested itself as a web-based tool called the saccharide topology analysis tool (STAT) that is based on an algorithm that provides sequence information from glycan fragment MSn spectra [19]. The user manually inputs molecular and fragment ion masses to generate a list of all possible compositions with every possible structure. This tool can be used to obtain sequence information from MSn spectra of glycans with a maximum of ten monosaccharide residues exclusive of fucose. These restrictions prevent the analysis of fucosylated glycans and are a drawback when analysing N-glycans, which are inherently large due to the trimannosyl core. Furthermore, bisected N-glycans cannot be sequenced by STAT. The sequence information obtained using STAT is based on glycosidic cleavages and cannot be used to identify cross-ring cleavages or “internal” cleavage fragments which can provide important information. To our knowledge this tool has not been extended to analysis of permethylated glycans which are known to fragment in a highly predictable manner, and provide a means of reliable high sensitivity analysis. Another web-based tool is GlycoMod [20], which can create a list of biologically plausible monosaccharide compositions of a given mass. It does not consider topologies, and thus is analogous to the composition mode of Cartoonist shown in Fig. 9. A third web-based tool called GLYCO-FRAGMENT has been created to assist manual interpretation of glycan fragment MSn spectra [21]. This tool generates all possible fragment ions from a user-defined IUPAC imported structure. This information includes fragments derived from glycosidic cleavages and cross-ring cleavages, which can provide linkage information. Although this is a useful tool it does not predict the structure of a glycan from raw mass spectral data or quantitative ion distributions, an absolute requirement if automated glycan analysis is to be fully realized. The web-tools described in [21] provide a means to analyse glycan fragment MSn data by the user manually inputting mass data or structural data. However, due to the lack of constraints such as biosynthetic knowledge these algorithms tend to produce overwhelming amounts of information that requires considerable amounts of interpretation for N-glycans.
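As an illustration of what a composition-only search involves, the sketch below enumerates monosaccharide compositions whose permethylated, sodiated mass falls near an observed peak. It is a brute-force toy, not the procedure used by GlycoMod or by the composition mode of Cartoonist. The residue masses are commonly tabulated monoisotopic increments for permethylated residues, rounded to four decimals; they, the count limits, the tolerance, and the requirement of at least three hexoses and two HexNAcs (the N-glycan core) are all assumptions of this example.

```python
from itertools import product

# Assumed monoisotopic residue masses (Da) for permethylated monosaccharides.
RESIDUE_MASS = {
    "Hex": 204.0998,
    "HexNAc": 245.1263,
    "Fuc": 174.0892,
    "NeuAc": 361.1737,
}
END_GROUPS = 46.0419  # methyl plus methoxy capping the two ends of the glycan
SODIUM = 22.9898      # spectra are assumed to show [M+Na]+ ions


def theoretical_mass(composition):
    """[M+Na]+ mass of a permethylated glycan with the given residue counts."""
    residues = sum(RESIDUE_MASS[name] * n for name, n in composition.items())
    return residues + END_GROUPS + SODIUM


def compositions_near(mass, tol=0.3, max_counts=(10, 8, 4, 4)):
    """All compositions containing the N-glycan core within `tol` Da of `mass`."""
    hits = []
    for counts in product(*(range(m + 1) for m in max_counts)):
        composition = dict(zip(RESIDUE_MASS, counts))
        if composition["Hex"] < 3 or composition["HexNAc"] < 2:
            continue  # every N-glycan carries the Man3GlcNAc2 core
        candidate = theoretical_mass(composition)
        if abs(candidate - mass) <= tol:
            hits.append((composition, candidate))
    return hits


hits = compositions_near(2244.13, tol=0.05)
```

Without topological or biosynthetic constraints the number of hits grows quickly with mass and tolerance, which is the motivation for restricting candidates to a curated cartoon library.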

Yet another algorithm called StrOligo has been created to analyze MS/MS data of N-glycans tagged at their reducing ends with 1-phenyl-3-methyl-5-pyrazolone (PMP) [22]. This algorithm uses some biosynthetic constraints to analyze MS/MS spectra of PMP-derivatized N-glycans acquired by MALDI-quadrupole TOF-MS/MS. Analogous to the GLYCO-FRAGMENT algorithm it predicts spectral peaks from user-input structures. Its use has been restricted to some relatively small bi-antennary complex N-glycans and its application to complex biological materials rather than standards has not been explored. Another approach, GlycosidIQ [23], is based on a commercial database of glycans. GlycosidIQ uses a modest-sized database of 1674 glycans, and compares an experimental MS/MS spectrum to the theoretically computed spectra of all the glycans in the database, returning a ranked list of matches.

Cartoonist is the first comprehensive attempt to automate analysis of released intact glycan mixtures in an experimental setting using MALDI-TOF-MS. The idea is to use this information as a leverage point to direct further analysis of a glycoprotein using a coherent sequential mass spectrometric strategy. Due to the speed of the program, on-the-fly experiments may potentially be performed in the future. For instance, initial MALDI-TOF mass spectrometric screening will provide information on likely molecular ions that contain a possible immunogenic epitope, such as Sialyl-Lewis X. This information may be fed into a second algorithm that adjusts the MS software and corresponding hardware for specific MSn experiments. It is envisaged that such tools will allow the multidimensional problem of solving glycan and glycoprotein structures to eventually become an automated task.

This work was supported in part by the Biotechnology and Biological Sciences Research Council (BBSRC), the Wellcome Trust, and resources provided by the Consortium for Functional Glycomics funded by grants from the NIGMS (GM62116) and the NCRR. AD is a BBSRC Professorial Research Fellow.

5 References

[1] Dell, A., Morris, H. R., Science 2001, 291, 2351–2356.

[2] Harvey, D. J., Proteomics 2001, 1, 311–328.

[3] Dell, A., Chalabi, S., Easton, R. L., Haslam, S. M. et al., Proc. Natl. Acad. Sci. USA 2003, 100, 15631–15636.

[4] Sutton-Smith, M., Morris, H. R., Dell, A., Tetrahedron: Asymmetry 2000, 11, 363–369.

[5] Dell, A., Sutton-Smith, M., in: Celis, J. E. (Ed.), Cell Biology: A Laboratory Handbook, Vol. 4, Academic Press, San Diego, CA, in press.

[6] Kornfeld, R., Kornfeld, S., Annu. Rev. Biochem. 1985, 54, 631–664.

[7] Schachter, H., Glycobiology 1991, 1, 453–461.

[8] Lowe, J. B., Marth, J. D., Annu. Rev. Biochem. 2003, 72, 643–691.

[9] Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M., Nucleic Acids Res. 2004, 32, D277–D280.

[10] Kawano, S., Okuno, Y., Hashimoto, K., Yamamoto, H. et al., Genome Informatics 2003, 145, 304–305.

[11] Fischler, M. A., Bolles, R. C., CACM 1981, 24, 381–395.

[12] Abeijon, C., Hirschberg, C. B., Trends Biochem. Sci. 1992, 17, 32–36.

[13] Herscovics, A., Biochim. Biophys. Acta 1999, 1473, 96–107.

[14] Chui, D., Sellakumar, G., Green, R. S., Sutton-Smith, M. et al., Proc. Natl. Acad. Sci. USA 2001, 98, 1142–1147.

[15] Wang, Y., Tan, J., Sutton-Smith, M., Ditto, D. et al., Glycobiology 2001, 11, 1051–1070.

[16] Wong, K. N., Easton, R. L., Patankar, M. S., Sutton-Smith, M. et al., J. Biol. Chem. 2003, 278, 28619–28634.

[17] Taylor, J. A., Johnson, R. S., Anal. Chem. 2001, 73, 2594–2604.

[18] Field, H. I., Fenyo, D., Beavis, R. C., Proteomics 2002, 2, 36–47.

[19] Gaucher, S. P., Morrow, J., Leary, J. A., Anal. Chem. 2000, 72, 2331–2336.

[20] Cooper, C. A., Gasteiger, E., Packer, N. H., Proteomics 2001, 1, 340–349.

[21] Lohmann, K. K., von der Lieth, C.-W., Proteomics 2003, 3, 2028–2035.

[22] Ethier, M., Saba, J. A., Ens, W., Standing, K. G., Perreault, H., Rapid Commun. Mass Spectrom. 2002, 16, 1743–1754.

[23] Joshi, H. J., Harrison, M. J., Schulz, B. L., Cooper, C. A. et al., Proteomics 2004, 4, 1650–1664.

[24] Zamfir, A., Vakhrushev, S., Sterling, A., Niebel, H. J. et al., Anal. Chem. 2004, 76, 2046–2054.
