How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra...

46
How We Handle Mass Spectra NIST Mass Spectrometry Data Center

Transcript of How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra...

Page 1: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

How We Handle Mass Spectra

NIST Mass Spectrometry Data Center

Page 2: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8
Page 3: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Numbers of Spectra

020,00040,00060,00080,000

100,000120,000140,000160,000180,000200,000

'78 '80 '83 '86 '88 '90 '93 '98 '02

ReplicatesCompounds

Red Books EPA NIST

NIST/EPA/NIH Mass Spectral L ibrary

Page 4: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Libraries Dis tributed/Year

0500

10001500200025003000350040004500

'88 '89 '90 '91 '92 '93 '94 '95 '96 '97 '98 '99 '00 '01 '02 '03 '04

Page 5: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

The Data

Cl

11

17

10

7

5

9

8

NH20

6

15

12

14

4

19

13

1

16

2

3

H

H

H

H

H

N

NH

NH

N

NH2

O

Page 6: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Connection Table

Cl

1 2

3

4

S

SSS

SD

SD

4

3

2

1

1 2 3 4

Page 7: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

From Structure to Spectrum:A Mass “Fragmentogram”

P

F

O

CH3OCH

CH3

CH3

P

F

O

CH3OCH

CH3

CH3

+ ++

e- 2e-

mass = 140 u

P

F

OH

CH3OH P

F

O

CH3OCH

CH3+

+

mass = 99 u

+

mass = 125 u

+ CH3

CH2

CHCH2

Page 8: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Molecular Fingerprints

VX

HD

GB

Page 9: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

I will discuss

• Library Searching– Full and Partial Spectra

• Spectrum Purification

• Chemical Structure Representation

• Peptide Spectra Libraries

Page 10: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

0 50 100 150 200 250 3000

200

400

600

800

1000

Instrument ‘Noise Signature’250 Hexachlorobenzene Spectrasame instrument, calibration mix

Bars show quartiles

Page 11: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Instrument Effects

Page 12: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Library Search

unknown

MF=93

MF=68

sarin

Page 13: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Spectral Similarity

• M = f(Abundance) Peak in Measured Spectrum

• R = f(Abundance) Peak in Reference Spectrum

• Sum over all peaks

• f(Abundance) – Abundance

– Abundance * m/z

– Certainty

� �

�RM

MR

Page 14: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Top Hit Top 2 Hits Top 3 Hits

Correlation – Weighted 74.9 86.9 91.7

Correlation 72.9 85.9 90.8

Euclidean Distance 71.9 83.9 88.9

Absolute Distance 67.9 80.3 85.5

PBM - Published 64.7 78.4 84.8

Hites/Hertz/Biemann 64.4 77.2 83.2

Algorithm Performance12,592 Replicate Spectra against NIST Library

Percent CorrectModel

Page 15: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

0 20 40 60 80 1000.0

0.2

0.4

0.6

0.8

1.0

False Positives(108,000 compounds)

False Negatives(21,000 replicate spectra)

Match Factor Threshold

FractionRecovered

FP/FP Above Given Match Factor for NIST Library Spectra

Page 16: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

80 85 90 95 100

0.0

0.2

0.4

0.6

0.8

m/z weighting

no weighting

Match Factor

FractionRecovered

FP/FNExpanded View

FN

FP x 10,000

Page 17: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

0 20 40 60 80 100

0

50

100

150

200 decanedecalinTMBHCB

malathion

DMPB

sarin

HCB = hexachlorobenzeneDMPB = dimethylpenobarbitalTMB = 1,2,3-trimethylbenzene

FP Depends on Spectrum Uniqueness

Match Factor

FP

Page 18: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Multiple Ion Monitoring

• What is is?– Use 2-5 Major Peaks in Spectrum of Target

• 10 – 100 more sensitive

• What’s the problem? – Can match major Target peaks with Minor Sample

Peaks

• What we have done:– Examine risk using library as source of potential false

positive IDs

Page 19: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

False Positive Risk vs Number of Peaks Used

Figure 1. Median FPP vs. NP

0.00001

0.0001

0.001

0.01

0.1

1

1 2 3 4 5

Number of Peaks

1

1/2

1/4

1/8

1/16

1/32

1/64

1/128

BMA

Figure 1. Median FPP vs. NP

0.00001

0.0001

0.001

0.01

0.1

1

1 2 3 4 5

Number of Peaks

11

1/21/2

1/41/4

1/81/8

1/161/16

1/321/32

1/641/64

1/1281/128

BMA

Number of Peaks

FP/spectrum

(median)

Abundance Ratio:Biggest Search Peak/Matching Peak in FP

Page 20: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

m/z Difference

Rel

ativ

e P

rob

abili

ties

s

0

10

20

30

40

50

60

70

80

90

100

0 10 20 30 40 50 60 70 80 90 100

m/z Difference

Rel

ativ

e P

rob

abili

ties

s

Mass Spectral Peak Occurrences are Correlated

Difference in Peak Position (m/z)

Joint Occurrence

Prob.

Big Peaks

SmallPeaks

MediumPeaks

Page 21: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

FP Observed and Computed(from individual peak probabilities)

0.1

1

10

100

1000

10000

1 2 3 4 5 6 7 8 9 10

Observed FPP Percentile

0.1

1

10

100

1000

10000

1 2 3 4 5 6 7 8 9 10

Observed FPP Percentile

FP Percentile/10

FP

Actual

No Peak Correlation

Page 22: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Search Results Depend on Search Spectrum Quality

AMDIS: http://chemdata.nist.gov

Page 23: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Real DataTotal ion

chromatogram

A mass spectrum

(scan)

Page 24: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Chromatogram with single ion

Page 25: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

AMDIS Analysis of Data

AMDIS Match = 81O P

O

F

Page 26: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Order of Analysis

• Noise Analysis – find ‘Noise Factor’

• Find and quantify maximizing ions

• Combine to create ‘Model Peak’

• Use Model Peak shape (intensity vs time) to purify spectra

• Find best matching library spectrum

Page 27: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

IntensityKNoise noise=

Intensity

Noise

Derive Noise Factor

Page 28: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Finding Possible Peaks for Each m/z

Maxim um rate

Scan number n

Page 29: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Find Possible Compounds:Do Ions Maximize at Same Time?

1036

.0 .1 .7.6.5.4.2 .3 .9.8

1 2

Page 30: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Separate the Components

1036

147

11141

168511

422

103

818

4

508

.3 yes

.2

82

751

75

16

15

13

30514 19

22

8237147

6

.0 .1 .7.6.5.4.2 .3 .9.8

1 2

264

.3

yes.2

NO .6

5781 23

96

7

Page 31: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

A ‘Model Peak’ Provides Shape

1036

147

11141

168511

422

103

818

4

508

.3 yes

.2

82

751

75

16

15

13

30514 19

22

8237147

6

.0 .1 .7.6.5.4.2 .3 .9.8

1 2

264

.3

yes.2

NO .6

5781 23

96

7

The model shape is defined as the sum of all of the ion chromatograms that maximize within the range and have a sharpness value within 75% of the maximum.

Page 32: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

AMDIS Testing – Closely Eluting Components

Page 33: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Representing Chemical Identity

• Visual: 2D Structure

• Text: IUPAC Name

• Digital: No Accepted, Open Method

• Solution:

The IUPAC/NIST Chemical Identifier

Page 34: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Connection Table

Cl

1 2

3

4

S

SSS

SD

SD

4

3

2

1

1 2 3 4

Page 35: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Chemical Identity Problems

CH3CH3

CH3CH3

Registry Number possible for each exact form, mixture, unknown, unspecified

Experts required

Expensive, ambiguous and error prone

Page 36: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Requirements

• Different compounds have different identifiers– Keep all distinguishing structural information

IChI - 1 IChI - 2=

=

Page 37: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Requirements

• One compound has only one identifier– Omit unnecessary information

NOO

NOO

N+ OO

NOO

Same INChI

= ==

Page 38: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

3 Steps to INChI

• Chemistry– ‘Normalize’ Input Structure

• Implement chemical rules

• Math– ‘Canonicalize’ (label the atoms)

• Equivalent atoms get the same label

• Format– ‘Serialize’ Labeled Structure

• Output as character string (‘name’)

Page 39: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

formula

connectivity

stereo

isotope

Chemical Substances“Layers”

Page 40: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

NitrobenzeneCH5

CH3

CH1

CH2

CH4

C6

N+7 O

8O

9

Canonical numbering

Descr iption Layers

formula C6H5NO2

connectivity 8-7(9)6-4-2-1-3-5-6

H-atoms 1-5H

charges

Page 41: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

MSG

C4 C5

O8

CH2

2

O9

CH21

CH3O10OH

7

NH26

Na+1

Canonical numbering

Descr iption Layers

formula C5H8NO4.Na

connectivity 6-3(5(9)10)1-2-4(7)8;

H-atoms 1-2H2,3H,6H2(H-,7,8,9,10);

stereo sp3 3-;

charges -1;+1

C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1

Page 42: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

INChITest

Version

Input/Result

Mobile H On/Off

Include Org-Metal Bonds

Page 43: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Peptide Mass Spectra:Libraries for Organisms

• Proteins are linear sequences of amino acids– characteristic of Genome (organism)

• Peptides are ‘digested’ fragments of proteins

• MS ‘sequences’ peptides to reveal source Protein

• Peptides fragmentation spectra are not quite predictable

• Peptide fragmentation spectra for a ‘genome’ can be contained in one Library.

Page 44: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Spectrum Prediction Programs

Page 45: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

Peptide Spectra Reference Library (multiple measurements each of 10,000 peptides)

HLQLAIR/2+

Page 46: How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra against NIST Library Model Percent Correct 0 20 40 60 80 100 0.0 0.2 0.4 0.6 0.8

MS Mapped to the Genome

From Eric Deutsch, ISB, 6/2004