How We Handle Mass Spectra - NIST...2004/10/26  · Algorithm Performance 12,592 Replicate Spectra...

Post on 11-Jul-2020


How We Handle Mass Spectra

NIST Mass Spectrometry Data Center

Numbers of Spectra

[Figure: number of spectra (0 to 200,000) by year, '78 to '02; series: Replicates, Compounds; sources: Red Books, EPA, NIST]

NIST/EPA/NIH Mass Spectral Library

Libraries Distributed/Year

[Figure: libraries distributed per year (0 to 4,500), '88 to '04]

The Data

[Figure: chemical structure diagram with numbered atoms and atom labels Cl, N, NH, NH2, O and H]

Connection Table

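[Figure: a four-atom structure containing Cl, atoms numbered 1 to 4, shown with its connection table indexed 1 to 4 and bond entries S and D]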

From Structure to Spectrum: A Mass “Fragmentogram”

[Figure: ionization (e- in, 2e- out) of an organophosphorus molecule gives a molecular ion of mass = 140 u, which fragments to ions of mass = 125 u and mass = 99 u]

Molecular Fingerprints

[Figure: panels labeled VX, HD and GB]

I will discuss

• Library Searching
  – Full and Partial Spectra

• Spectrum Purification

• Chemical Structure Representation

• Peptide Spectra Libraries

Instrument Effects

Instrument ‘Noise Signature’: 250 hexachlorobenzene spectra, same instrument, calibration mix

[Figure: horizontal axis 0 to 300, vertical axis 0 to 1000; bars show quartiles]

Library Search

[Figure: an unknown spectrum compared with library spectra; match factors MF=93 and MF=68 are shown, and sarin is labeled]

Spectral Similarity

• M = f(Abundance) of a peak in the measured spectrum

• R = f(Abundance) of a peak in the reference spectrum

• Sum over all peaks

• f(Abundance)
  – Abundance
  – Abundance * m/z
  – Certainty

$$\text{Similarity} = \frac{\sum M R}{\sqrt{\sum M^2 \, \sum R^2}}$$
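To make the comparison concrete, here is a minimal sketch of a similarity score built from M and R as defined above. It assumes the score is a normalized dot product summed over all peaks; the slide's exact formula is not fully recoverable, so treat that form as an assumption. The function names, the `mode` strings and the `{m/z: abundance}` representation are illustrative, and the "Certainty" weighting is not modeled.

```python
import math

def weight(peaks, mode="abundance"):
    """Apply f(Abundance) to a spectrum given as {m/z: abundance}."""
    if mode == "abundance":
        return dict(peaks)
    if mode == "abundance_mz":
        # Weight each abundance by its m/z value.
        return {mz: a * mz for mz, a in peaks.items()}
    raise ValueError(f"unknown mode: {mode}")

def similarity(measured, reference, mode="abundance"):
    """Normalized dot product of two weighted spectra (assumed form), 0..1."""
    m = weight(measured, mode)
    r = weight(reference, mode)
    # Only m/z values present in both spectra contribute to the numerator.
    dot = sum(m[mz] * r[mz] for mz in m.keys() & r.keys())
    norm = math.sqrt(sum(v * v for v in m.values()) *
                     sum(v * v for v in r.values()))
    return dot / norm if norm else 0.0

measured = {81: 15, 99: 100, 125: 30}
reference = {81: 12, 99: 100, 125: 35, 43: 5}
print(round(similarity(measured, reference), 3))
print(round(similarity(measured, reference, mode="abundance_mz"), 3))
```

Scaling either spectrum by a constant leaves the score unchanged, which is why normalization matters when spectra come from different instruments.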

Algorithm Performance: 12,592 Replicate Spectra against NIST Library

Percent Correct

Model                     Top Hit   Top 2 Hits   Top 3 Hits
Correlation – Weighted    74.9      86.9         91.7
Correlation               72.9      85.9         90.8
Euclidean Distance        71.9      83.9         88.9
Absolute Distance         67.9      80.3         85.5
PBM - Published           64.7      78.4         84.8
Hites/Hertz/Biemann       64.4      77.2         83.2

FP/FN Above Given Match Factor for NIST Library Spectra

[Figure: fraction recovered (0.0 to 1.0) vs. match factor threshold (0 to 100); curves: False Positives (108,000 compounds) and False Negatives (21,000 replicate spectra)]
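A curve of this kind can be tabulated from two lists of match factors. The sketch below assumes one reading of the plot: a false positive is a wrong compound scoring at or above the threshold, and a false negative is a replicate of the correct compound scoring below it. The function names and the sample scores are made up for illustration.

```python
def fraction_at_or_above(scores, threshold):
    """Fraction of match factors that are >= threshold."""
    if not scores:
        return 0.0
    return sum(s >= threshold for s in scores) / len(scores)

def fp_fn_curve(wrong_scores, replicate_scores, thresholds):
    """Return (threshold, FP fraction, FN fraction) for each threshold."""
    rows = []
    for t in thresholds:
        # Wrong compounds that would still be accepted at this threshold.
        fp = fraction_at_or_above(wrong_scores, t)
        # Correct replicates that would be rejected at this threshold.
        fn = 1.0 - fraction_at_or_above(replicate_scores, t)
        rows.append((t, fp, fn))
    return rows

wrong = [10, 20, 85, 95]        # hypothetical scores against other compounds
replicates = [70, 90, 92, 99]   # hypothetical scores against the right compound
for row in fp_fn_curve(wrong, replicates, [60, 80, 90]):
    print(row)
```

Raising the threshold trades false positives for false negatives, which is the trade-off the two curves display.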

FP/FN Expanded View

[Figure: fraction recovered (0.0 to 0.8) vs. match factor (80 to 100); curves labeled FN, FP x 10,000, m/z weighting and no weighting]

FP Depends on Spectrum Uniqueness

[Figure: FP (0 to 200) vs. match factor (0 to 100) for decane, decalin, TMB, HCB, malathion, DMPB and sarin]

HCB = hexachlorobenzene; DMPB = dimethylphenobarbital; TMB = 1,2,3-trimethylbenzene

Multiple Ion Monitoring

• What is it?
  – Use 2-5 major peaks in the spectrum of the target

• 10 – 100 times more sensitive

• What’s the problem?
  – Major target peaks can match minor sample peaks

• What we have done:
  – Examine risk using the library as a source of potential false positive IDs
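The risk study described above can be sketched as a count of library spectra that would pass a few-peak test for a given target. This is only an illustration: the acceptance rule (each monitored peak present, with its ratio to the largest monitored peak within a relative tolerance), the `ratio_tol` parameter and the tiny library are all assumptions, not the procedure used in the study.

```python
def top_peaks(spectrum, n):
    """The n most abundant (m/z, abundance) pairs of a {m/z: abundance} spectrum."""
    return sorted(spectrum.items(), key=lambda kv: kv[1], reverse=True)[:n]

def is_potential_false_positive(target, candidate, n_peaks=3, ratio_tol=0.3):
    """True if candidate shows the target's top peaks in similar proportions."""
    peaks = top_peaks(target, n_peaks)
    base_mz, base_ab = peaks[0]
    if candidate.get(base_mz, 0) <= 0:
        return False
    for mz, ab in peaks[1:]:
        expected = ab / base_ab
        observed = candidate.get(mz, 0) / candidate[base_mz]
        if abs(observed - expected) > ratio_tol * expected:
            return False
    return True

def false_positive_count(target, library, n_peaks=3, ratio_tol=0.3):
    """Number of library spectra that pass the few-peak test for this target."""
    return sum(is_potential_false_positive(target, s, n_peaks, ratio_tol)
               for s in library.values())

target = {99: 100, 125: 30, 81: 15, 43: 10}
library = {
    "a": {99: 10, 125: 3, 81: 1.5, 200: 100},  # minor peaks mimic the target
    "b": {99: 100, 125: 90},                   # wrong 125/99 ratio
    "c": {57: 100},                            # no peak at 99
}
for n in (1, 2, 3):
    print(n, false_positive_count(target, library, n_peaks=n))
```

Spectrum "a" shows the problem named in the bullets: its peaks at 99, 125 and 81 are minor relative to its own base peak at m/z 200, yet they reproduce the target's major-peak pattern.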

False Positive Risk vs Number of Peaks Used

[Figure 1. Median FPP vs. NP: median FP/spectrum (log scale, 0.00001 to 1) vs. number of peaks (1 to 5); one curve per abundance ratio (biggest search peak/matching peak in FP): 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64, 1/128; label: BMA]

Mass Spectral Peak Occurrences are Correlated

[Figure: joint occurrence probability (relative probabilities, 0 to 100) vs. difference in peak position, m/z (0 to 100); curves: Big Peaks, Medium Peaks, Small Peaks]

FP Observed and Computed (from individual peak probabilities)

[Figure: FP (log scale, 0.1 to 10,000) vs. observed FPP percentile/10 (1 to 10); curves: Actual and No Peak Correlation]

Search Results Depend on Search Spectrum Quality

AMDIS: http://chemdata.nist.gov

Real Data

[Figure: total ion chromatogram; a mass spectrum (scan); chromatogram with single ion]

AMDIS Analysis of Data

AMDIS Match = 81

[Figure: structure of the matched compound, containing P, O and F]

Order of Analysis

• Noise Analysis – find ‘Noise Factor’

• Find and quantify maximizing ions

• Combine to create ‘Model Peak’

• Use Model Peak shape (intensity vs time) to purify spectra

• Find best matching library spectrum

Derive Noise Factor

$$\text{Noise} = K_{\text{noise}} \sqrt{\text{Intensity}}$$

[Figure: noise vs. intensity]
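One way to estimate a noise factor from data is sketched below. It assumes that noise scales with the square root of intensity and that the factor can be taken as the median, over short segments of a single ion trace, of the median absolute deviation divided by the square root of the segment mean. The segment length and the function name are illustrative choices, not AMDIS parameters.

```python
import math
import statistics

def noise_factor(trace, segment_len=13):
    """Estimate K in noise = K * sqrt(intensity) from one ion trace."""
    ratios = []
    for start in range(0, len(trace) - segment_len + 1, segment_len):
        seg = trace[start:start + segment_len]
        mean = statistics.fmean(seg)
        if mean <= 0:
            continue  # no signal in this segment, nothing to scale against
        # Median absolute deviation from the segment mean.
        dev = statistics.median(abs(x - mean) for x in seg)
        ratios.append(dev / math.sqrt(mean))
    if not ratios:
        raise ValueError("trace too short or without signal")
    return statistics.median(ratios)

trace = [96, 104] * 4            # flat signal near 100 with +/-4 jitter
k = noise_factor(trace, segment_len=4)
print(k)                         # noise factor for this trace
print(k * math.sqrt(400))        # expected noise at intensity 400
```

Using medians keeps a real chromatographic peak inside one segment from inflating the estimate.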

Finding Possible Peaks for Each m/z

[Figure: ion signal vs. scan number n, with the maximum rate marked]

Find Possible Compounds: Do Ions Maximize at Same Time?

[Figure: ion abundances (for example 1036) with maxima located to tenths of a scan (.0 to .9) between scans 1 and 2]
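The question in the slide title can be sketched as a grouping step: find the scan at which each ion chromatogram peaks, then collect ions whose maxima coincide. The version below works on whole scans and a hypothetical `window` tolerance; the figure suggests maxima are actually located to fractions of a scan, which this sketch does not attempt.

```python
def apex_scan(trace):
    """Index of the largest value in one ion chromatogram."""
    return max(range(len(trace)), key=trace.__getitem__)

def group_by_apex(ion_traces, window=0):
    """Group m/z values whose chromatograms maximize within `window` scans."""
    apexes = sorted((apex_scan(t), mz) for mz, t in ion_traces.items())
    groups = []
    for scan, mz in apexes:
        # Extend the current group if this apex is close to the previous one.
        if groups and scan - groups[-1]["last_scan"] <= window:
            groups[-1]["mz"].append(mz)
            groups[-1]["last_scan"] = scan
        else:
            groups.append({"mz": [mz], "last_scan": scan})
    return [sorted(g["mz"]) for g in groups]

ion_traces = {
    99:  [1, 5, 9, 5, 1, 0, 0],
    125: [0, 2, 4, 2, 0, 0, 0],
    57:  [0, 0, 1, 3, 8, 3, 1],
    43:  [0, 0, 0, 2, 6, 2, 0],
}
print(group_by_apex(ion_traces))
```

Each group is a candidate component: ions from one compound rise and fall together, so they share an apex.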

Separate the Components

[Figure: ion abundances across scans 1 and 2, with maxima located to tenths of a scan (.0 to .9); ions maximizing at .2 and .3 are marked "yes", an ion maximizing at .6 is marked "NO"]

A ‘Model Peak’ Provides Shape

[Figure: the component-separation diagram from the previous slide]

The model shape is defined as the sum of all of the ion chromatograms that maximize within the range and have a sharpness value within 75% of the maximum.
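The definition above can be sketched as follows. Two points are assumptions, because the slide does not define them: the `sharpness` measure used here (average drop per scan from the apex to points a fixed distance away) is hypothetical, and "within 75% of the maximum" is read as "at least 0.75 times the largest sharpness among the candidate ions".

```python
def apex_scan(trace):
    """Index of the largest value in one ion chromatogram."""
    return max(range(len(trace)), key=trace.__getitem__)

def sharpness(trace, half_width=2):
    """Hypothetical sharpness: mean drop per scan from the apex outward."""
    apex = apex_scan(trace)
    left = trace[max(apex - half_width, 0)]
    right = trace[min(apex + half_width, len(trace) - 1)]
    return (trace[apex] - (left + right) / 2) / half_width

def model_peak(ion_traces, scan_range, min_fraction=0.75):
    """Sum the ion chromatograms that maximize in range and are sharp enough."""
    lo, hi = scan_range
    candidates = {mz: t for mz, t in ion_traces.items()
                  if lo <= apex_scan(t) <= hi}
    if not candidates:
        raise ValueError("no ion maximizes within the scan range")
    sharp = {mz: sharpness(t) for mz, t in candidates.items()}
    cutoff = min_fraction * max(sharp.values())
    chosen = [t for mz, t in candidates.items() if sharp[mz] >= cutoff]
    # Point-by-point sum; all traces are assumed to cover the same scans.
    return [sum(values) for values in zip(*chosen)]

ion_traces = {
    99:  [0, 2, 10, 2, 0],   # sharp, apex in range
    125: [0, 1, 8, 1, 0],    # sharp enough, apex in range
    57:  [1, 2, 3, 2, 1],    # apex in range but too broad
    43:  [0, 0, 0, 5, 9],    # apex outside the range
}
print(model_peak(ion_traces, scan_range=(1, 3)))
```

The resulting shape (intensity vs. time) is what the later purification step compares each ion against.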

AMDIS Testing – Closely Eluting Components

Representing Chemical Identity

• Visual: 2D Structure

• Text: IUPAC Name

• Digital: No Accepted, Open Method

• Solution:

The IUPAC/NIST Chemical Identifier

Connection Table

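[Figure: a four-atom structure containing Cl, atoms numbered 1 to 4, shown with its connection table indexed 1 to 4 and bond entries S and D]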

Chemical Identity Problems

[Figure: structure drawings with CH3 groups]

Registry Number possible for each exact form, mixture, unknown, unspecified

Experts required

Expensive, ambiguous and error prone

Requirements

• Different compounds have different identifiers
  – Keep all distinguishing structural information

[Figure: two different structures (≠) give different identifiers: IChI-1 ≠ IChI-2]

Requirements

• One compound has only one identifier
  – Omit unnecessary information

[Figure: several drawings of the same nitro group (N with two O), including a charged N+ form, all give the same INChI]

3 Steps to INChI

• Chemistry
  – ‘Normalize’ Input Structure
    • Implement chemical rules

• Math
  – ‘Canonicalize’ (label the atoms)
    • Equivalent atoms get the same label

• Format
  – ‘Serialize’ Labeled Structure
    • Output as character string (‘name’)
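The three steps can be illustrated with a toy pipeline. This is not the INChI algorithm: normalization here simply drops bond orders, canonicalization is a brute-force search over atom orderings that only works for very small structures, and the output format is invented for the example. It does show the intended property that different drawings of one structure yield one string.

```python
from itertools import permutations

def normalize(atoms, bonds):
    """Toy normalization: keep elements and connectivity, drop bond orders."""
    skeleton = {tuple(sorted((a, b))) for a, b, _order in bonds}
    return list(atoms), skeleton

def canonicalize(atoms, skeleton):
    """Brute force: try every atom ordering, keep the smallest labeling."""
    best = None
    for perm in permutations(range(len(atoms))):
        # perm[new_label] is the old index placed at that new label.
        new_of_old = {old: new for new, old in enumerate(perm)}
        labeled_atoms = tuple(atoms[old] for old in perm)
        labeled_bonds = tuple(sorted(
            tuple(sorted((new_of_old[a], new_of_old[b]))) for a, b in skeleton))
        key = (labeled_atoms, labeled_bonds)
        if best is None or key < best:
            best = key
    return best

def serialize(canonical):
    """Write the labeled structure as 'atoms/bonds' with 1-based labels."""
    atoms, bonds = canonical
    return "".join(atoms) + "/" + ",".join(f"{a + 1}-{b + 1}" for a, b in bonds)

def identifier(atoms, bonds):
    return serialize(canonicalize(*normalize(atoms, bonds)))

# Heavy atoms of nitromethane, drawn twice with different atom orders
# and different bond orders on the nitro group.
drawing_1 = (["C", "N", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 2)])
drawing_2 = (["O", "N", "C", "O"], [(1, 0, 2), (1, 3, 1), (2, 1, 1)])
print(identifier(*drawing_1))
print(identifier(*drawing_2))
```

Both drawings print the same string, while a different connectivity (for example C-O-N-O) gives a different one, matching the two requirements stated earlier.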

Chemical Substances: “Layers”

[Figure: a chemical substance described in layers: formula, connectivity, stereo, isotope]

Nitrobenzene

[Figure: nitrobenzene with canonical numbering: CH 1 to 5, C 6, N+ 7, O 8, O 9]

Description     Layers
formula         C6H5NO2
connectivity    8-7(9)6-4-2-1-3-5-6
H-atoms         1-5H
charges

MSG

[Figure: MSG with canonical numbering: CH2 1, CH2 2, CH 3, C 4, C 5, NH2 6, OH 7, O 8, O 9, O 10, and Na+]

Description     Layers
formula         C5H8NO4.Na
connectivity    6-3(5(9)10)1-2-4(7)8;
H-atoms         1-2H2,3H,6H2(H-,7,8,9,10);
stereo sp3      3-;
charges         -1;+1

C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1
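A string like the one above can be split back into its layers by cutting at each "/" and reading the first letter of every segment. The sketch below assumes the prefix letters carry the meanings used in published InChI strings (c, h, q, p, t, m, s); the dictionary keys and function name are illustrative.

```python
LAYER_NAMES = {
    "c": "connectivity",
    "h": "H-atoms",
    "q": "charges",
    "p": "protons",
    "t": "stereo sp3",
    "m": "stereo inversion flag",
    "s": "stereo type",
}

def split_layers(identifier):
    """Split an identifier string into {layer name: layer text}."""
    formula, *rest = identifier.split("/")
    layers = {"formula": formula}
    for part in rest:
        if not part:
            continue  # tolerate an empty segment between slashes
        prefix, body = part[0], part[1:]
        # Unknown prefixes are kept under their own letter.
        layers[LAYER_NAMES.get(prefix, prefix)] = body
    return layers

msg = ("C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);"
       "/q;+1/p-1/t3-;/m1./s1")
for name, text in split_layers(msg).items():
    print(f"{name:22} {text}")
```

Within a layer, ";" separates the components of the substance, which is why the sodium ion contributes an empty entry to the connectivity layer and "+1" to the charges.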

INChI Test

[Screenshot: test program with fields Version and Input/Result, and options Mobile H On/Off and Include Org-Metal Bonds]

Peptide Mass Spectra: Libraries for Organisms

• Proteins are linear sequences of amino acids
  – characteristic of Genome (organism)

• Peptides are ‘digested’ fragments of proteins

• MS ‘sequences’ peptides to reveal source Protein

• Peptide fragmentation spectra are not quite predictable

• Peptide fragmentation spectra for a ‘genome’ can be contained in one Library.

Spectrum Prediction Programs

Peptide Spectra Reference Library (multiple measurements each of 10,000 peptides)

HLQLAIR/2+
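As a worked example of what such a library entry label implies, the sketch below computes the precursor m/z for a label of the form sequence/charge. It assumes "/2+" denotes a doubly protonated peptide and uses standard monoisotopic residue masses; the function name and label parsing are illustrative.

```python
# Monoisotopic residue masses in u (amino acid minus water).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565    # H2O added for the free peptide termini
PROTON = 1.007276    # mass of each charge-carrying proton

def precursor_mz(label):
    """m/z of a protonated peptide given a label such as 'HLQLAIR/2+'."""
    sequence, charge_text = label.split("/")
    charge = int(charge_text.rstrip("+"))
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge

print(round(precursor_mz("HLQLAIR/2+"), 4))
print(round(precursor_mz("HLQLAIR/1+"), 4))
```

The same peptide appears at different m/z for each charge state, which is why the charge is part of the library label.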

MS Mapped to the Genome

From Eric Deutsch, ISB, 6/2004