Post on 11-Jul-2020
How We Handle Mass Spectra
NIST Mass Spectrometry Data Center
Numbers of Spectra
[Figure: numbers of spectra by year, '78 to '02; y-axis 0 to 200,000; series: Replicates, Compounds; sources: Red Books, EPA, NIST]
NIST/EPA/NIH Mass Spectral Library
[Figure: Libraries Distributed/Year, '88 to '04; y-axis 0 to 4,500]
The Data
[Figure: 2D chemical structure drawing with numbered atoms; atom labels include Cl, N, NH, NH2, O and H]
Connection Table
[Figure: connection table for an example containing Cl; atoms numbered 1 to 4, table entries marked S and D]
From Structure to Spectrum: A Mass “Fragmentogram”
[Figure: ionization (e- in, 2e- out) of a molecule drawn with P, F, O, CH3 and OCH(CH3)2 groups to a molecular ion of mass = 140 u, which fragments to ions of mass = 125 u (with loss of CH3) and mass = 99 u]
Molecular Fingerprints
VX
HD
GB
I will discuss
• Library Searching: Full and Partial Spectra
• Spectrum Purification
• Chemical Structure Representation
• Peptide Spectra Libraries
Instrument Effects
[Figure: Instrument ‘Noise Signature’: 250 hexachlorobenzene spectra from the same instrument (calibration mix); x-axis 0 to 300, y-axis 0 to 1000; bars show quartiles]
Library Search
[Figure: library search of an unknown spectrum; labels shown: MF=93, MF=68, sarin]
Spectral Similarity
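• M = f(Abundance) of a peak in the measured spectrum
• R = f(Abundance) of a peak in the reference spectrum
• Sum over all peaks
• Choices for f(Abundance):
  – Abundance
  – Abundance * m/z
  – Certainty
[Equation: similarity score built from sums over all peaks involving M·R, M and R]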
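One common way to turn the M and R definitions above into a single score is a normalized dot product (cosine) of the transformed abundances. The sketch below is an illustration under that assumption and is not necessarily the exact formula used in the NIST search software; the function names, the 0 to 100 scaling and the `mz_power` weighting parameter are hypothetical.

```python
import math

def transform(spectrum, mz_power=0.0):
    """Apply f(Abundance) = abundance * m/z**mz_power to every peak.

    spectrum: dict mapping integer m/z to abundance.
    mz_power=0 gives plain abundance; mz_power=1 gives Abundance * m/z.
    """
    return {mz: ab * (mz ** mz_power) for mz, ab in spectrum.items()}

def match_factor(measured, reference, mz_power=0.0):
    """Cosine similarity between two spectra, scaled to 0..100 (assumed scale)."""
    m = transform(measured, mz_power)
    r = transform(reference, mz_power)
    # Only peaks present in both spectra contribute to the sum of M*R.
    dot = sum(m[mz] * r[mz] for mz in m.keys() & r.keys())
    norm = math.sqrt(sum(v * v for v in m.values()) *
                     sum(v * v for v in r.values()))
    return 100.0 * dot / norm if norm else 0.0

# Hypothetical spectra for illustration only.
unknown = {99: 100.0, 125: 35.0, 81: 20.0}
candidate = {99: 100.0, 125: 30.0, 81: 25.0, 43: 5.0}
score = match_factor(unknown, candidate, mz_power=1.0)
```

Raising `mz_power` gives more influence to high-mass peaks, which corresponds to the "Abundance * m/z" option in the list above.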
Algorithm Performance: 12,592 Replicate Spectra against NIST Library

Percent Correct
Model                    Top Hit   Top 2 Hits   Top 3 Hits
Correlation – Weighted   74.9      86.9         91.7
Correlation              72.9      85.9         90.8
Euclidean Distance       71.9      83.9         88.9
Absolute Distance        67.9      80.3         85.5
PBM - Published          64.7      78.4         84.8
Hites/Hertz/Biemann      64.4      77.2         83.2
FP/FN Above Given Match Factor for NIST Library Spectra
[Figure: Fraction Recovered (0.0 to 1.0) vs Match Factor Threshold (0 to 100); curves: False Positives (108,000 compounds), False Negatives (21,000 replicate spectra)]
FP/FN Expanded View
[Figure: Fraction Recovered (0.0 to 0.8) vs Match Factor (80 to 100); curves: FN, FP x 10,000; legend: m/z weighting, no weighting]
FP Depends on Spectrum Uniqueness
[Figure: FP (0 to 200) vs Match Factor (0 to 100) for decane, decalin, TMB, HCB, malathion, DMPB and sarin]
HCB = hexachlorobenzene, DMPB = dimethylphenobarbital, TMB = 1,2,3-trimethylbenzene
Multiple Ion Monitoring
• What is it? Use 2-5 major peaks in the spectrum of the target
• 10 to 100 times more sensitive
• What’s the problem? Major target peaks can match minor sample peaks
• What we have done: examined the risk using the library as a source of potential false positive IDs
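The library-based risk estimate described above can be sketched as follows: take the N largest peaks of a target spectrum and count how many other library spectra contain all of them. This is a simplified illustration, not the procedure actually used; the function names, the `min_rel` cutoff (smallest relative abundance counted as a matching peak) and all spectra are hypothetical.

```python
def top_peaks(spectrum, n):
    """Return the m/z values of the n most abundant peaks."""
    ranked = sorted(spectrum.items(), key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in ranked[:n]]

def potential_false_positives(target, library, n_peaks, min_rel=0.01):
    """Names of library spectra containing every monitored target peak.

    A peak counts as present when its abundance relative to that
    spectrum's base peak is at least min_rel (an assumed criterion).
    """
    wanted = top_peaks(target, n_peaks)
    hits = []
    for name, spectrum in library.items():
        base = max(spectrum.values())
        if all(spectrum.get(mz, 0.0) / base >= min_rel for mz in wanted):
            hits.append(name)
    return hits

# Hypothetical target and library, for illustration only.
target = {99: 100.0, 125: 40.0, 81: 25.0}
library = {
    "cmpd_a": {99: 5.0, 125: 3.0, 57: 100.0},   # minor peaks match the target
    "cmpd_b": {99: 80.0, 81: 60.0, 41: 100.0},
    "cmpd_c": {43: 100.0, 57: 80.0},
}
for n in (1, 2, 3):
    print(n, potential_false_positives(target, library, n))
```

In this toy library the number of potential false positives falls as more peaks are monitored, and `cmpd_a` shows how minor peaks in another compound can match the major peaks of the target.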
False Positive Risk vs Number of Peaks Used
[Figure 1: Median FPP vs. NP. x-axis: Number of Peaks (1 to 5); y-axis: FP/spectrum (median), log scale from 0.00001 to 1; curves labeled 1, 1/2, 1/4, 1/8, 1/16, 1/32, 1/64 and 1/128 for the Abundance Ratio (Biggest Search Peak/Matching Peak in FP); additional label: BMA]
Mass Spectral Peak Occurrences are Correlated
[Figure: Relative Probabilities of joint occurrence (0 to 100) vs Difference in Peak Position (m/z difference, 0 to 100); series: Big Peaks, Medium Peaks, Small Peaks]
FP Observed and Computed (from individual peak probabilities)
[Figure: FP (log scale, 0.1 to 10,000) vs Observed FPP Percentile/10 (1 to 10); curves: Actual, No Peak Correlation]
Search Results Depend on Search Spectrum Quality
AMDIS: http://chemdata.nist.gov
Real Data
[Figure: total ion chromatogram; a mass spectrum (scan); chromatogram with single ion]
AMDIS Analysis of Data
AMDIS Match = 81
[Figure: structure drawing of the matched compound, with O, P and F atom labels]
Order of Analysis
• Noise Analysis – find ‘Noise Factor’
• Find and quantify maximizing ions
• Combine to create ‘Model Peak’
• Use Model Peak shape (intensity vs time) to purify spectra
• Find best matching library spectrum
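The purification step in the list above can be illustrated by fitting each ion chromatogram to the model peak shape: ions that follow the shape receive a large amplitude, and ions from background or a neighboring component receive little or none. This is a minimal sketch assuming a least-squares fit of amplitude times model plus a constant background; it is not AMDIS's actual implementation, and all names and numbers are hypothetical.

```python
def fit_amplitude(trace, model):
    """Least-squares amplitude a in trace ~ a * model + b (b = flat background).

    Negative amplitudes are clamped to zero, since an abundance cannot be negative.
    """
    n = len(model)
    sx, sy = sum(model), sum(trace)
    sxy = sum(x * y for x, y in zip(model, trace))
    sxx = sum(x * x for x in model)
    denom = n * sxx - sx * sx
    if denom == 0:  # flat model shape: nothing to fit
        return 0.0
    return max(0.0, (n * sxy - sx * sy) / denom)

def purify(chromatograms, model):
    """Build a 'purified' spectrum: m/z -> amplitude of the model shape."""
    spectrum = {}
    for mz, trace in chromatograms.items():
        amplitude = fit_amplitude(trace, model)
        if amplitude > 0.0:
            spectrum[mz] = amplitude
    return spectrum

# Hypothetical data: intensity vs scan for three ions.
model = [0, 1, 2, 1, 0]
chromatograms = {
    99: [5, 8, 11, 8, 5],    # 3 * model on a background of 5
    125: [0, 1, 2, 1, 0],    # 1 * model
    43: [9, 7, 5, 3, 1],     # steadily falling: belongs to something else
}
print(purify(chromatograms, model))
```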
Derive Noise Factor
$\mathrm{Noise} = K_{\mathrm{noise}} \sqrt{\mathrm{Intensity}}$
[Figure: Noise plotted against Intensity]
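A noise factor of this kind can be estimated from signal-free stretches of the data. The sketch below assumes noise grows with the square root of intensity and takes, for each flat segment, the mean absolute deviation divided by the square root of the mean, then the median over segments. The choice of segments, the use of a median and the function name are assumptions for illustration, not the AMDIS procedure itself.

```python
import math
import statistics

def noise_factor(segments):
    """Estimate K_noise in Noise = K_noise * sqrt(Intensity).

    segments: list of lists of intensities, each taken from a region
    with no chromatographic peak (how these are chosen is left open).
    """
    ratios = []
    for segment in segments:
        mean = statistics.fmean(segment)
        if mean <= 0:
            continue  # skip empty-signal segments
        deviation = statistics.fmean([abs(x - mean) for x in segment])
        ratios.append(deviation / math.sqrt(mean))
    # The median keeps one unusually noisy segment from dominating.
    return statistics.median(ratios)

# Hypothetical flat segments at three intensity levels.
segments = [
    [96, 104, 96, 104],
    [391, 409, 391, 409],
    [897, 903, 897, 903],
]
k_noise = noise_factor(segments)
```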
Finding Possible Peaks for Each m/z
[Figure: ion signal vs Scan number n, with the Maximum rate marked]
Find Possible Compounds: Do Ions Maximize at Same Time?
[Figure: positions at which individual ions maximize, on a scale of tenths of a scan (.0 to .9) between scans 1 and 2]
Separate the Components
[Figure: ion abundances grouped by the tenth-of-a-scan position (.0 to .9, between scans 1 and 2) at which each ion maximizes; annotations: ‘.3 yes’, ‘.2 yes’, ‘.6 NO’]
A ‘Model Peak’ Provides Shape
[Figure: same plot as on the ‘Separate the Components’ slide]
The model shape is defined as the sum of all of the ion chromatograms that maximize within the range and have a sharpness value within 75% of the maximum.
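The definition above can be sketched directly in code. The example assumes sharpness values have already been computed for each ion (how is not specified here) and reads "within 75% of the maximum" as "at least 0.75 times the largest sharpness among the candidate ions"; both points, and all names and data, are assumptions for illustration.

```python
def apex_index(trace):
    """Index of the scan at which an ion chromatogram is largest."""
    return max(range(len(trace)), key=trace.__getitem__)

def model_peak(chromatograms, sharpness, apex_range, fraction=0.75):
    """Sum the ion chromatograms that define the model peak shape.

    chromatograms: m/z -> list of intensities over the same scans
    sharpness:     m/z -> precomputed sharpness value (assumed given)
    apex_range:    (first, last) scan indices in which an ion must maximize
    """
    first, last = apex_range
    candidates = [mz for mz, trace in chromatograms.items()
                  if first <= apex_index(trace) <= last]
    if not candidates:
        return []
    cutoff = fraction * max(sharpness[mz] for mz in candidates)
    chosen = [mz for mz in candidates if sharpness[mz] >= cutoff]
    n_scans = len(chromatograms[chosen[0]])
    return [sum(chromatograms[mz][i] for mz in chosen) for i in range(n_scans)]

# Hypothetical ion chromatograms and sharpness values.
chromatograms = {
    99: [0, 5, 10, 5, 0],
    125: [0, 2, 4, 2, 0],
    81: [0, 1, 3, 1, 0],     # maximizes in range but is not sharp enough
    43: [8, 6, 4, 2, 0],     # maximizes outside the range
}
sharpness = {99: 1.0, 125: 0.8, 81: 0.5, 43: 0.9}
shape = model_peak(chromatograms, sharpness, apex_range=(1, 3))
```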
AMDIS Testing – Closely Eluting Components
Representing Chemical Identity
• Visual: 2D Structure
• Text: IUPAC Name
• Digital: No Accepted, Open Method
• Solution:
The IUPAC/NIST Chemical Identifier
Connection Table
[Figure: connection table for an example containing Cl; atoms numbered 1 to 4, table entries marked S and D]
Chemical Identity Problems
[Figure: structure drawings with CH3 groups]
Registry Number possible for each exact form, mixture, unknown, unspecified
Experts required
Expensive, ambiguous and error prone
Requirements
• Different compounds have different identifiers
  – Keep all distinguishing structural information
[Figure: two different structures; IChI-1 ≠ IChI-2]
Requirements
• One compound has only one identifier
  – Omit unnecessary information
[Figure: four drawings of the same NO2 group, including a charged N+ form, all marked as equal: Same INChI]
3 Steps to INChI
• Chemistry: ‘Normalize’ Input Structure
  – Implement chemical rules
• Math: ‘Canonicalize’ (label the atoms)
  – Equivalent atoms get the same label
• Format: ‘Serialize’ Labeled Structure
  – Output as character string (‘name’)
Chemical Substances “Layers”
• formula
• connectivity
• stereo
• isotope
Nitrobenzene
[Figure: nitrobenzene structure with canonical numbering of atoms 1 to 9]

Description    Layers
formula        C6H5NO2
connectivity   8-7(9)6-4-2-1-3-5-6
H-atoms        1-5H
charges
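The connectivity layer is a compact encoding of a connection table. As an illustration, the sketch below expands the nitrobenzene string into a list of bonded atom pairs. It is a simplified reader written for this example, not the official identifier software: it assumes a single component, that consecutive atom numbers are bonded, that parentheses open a branch from the preceding atom, and that commas separate sibling branches. The function name is hypothetical.

```python
import re

def parse_connectivity(layer):
    """Expand a connectivity string into sorted (atom, atom) bond pairs."""
    tokens = re.findall(r"\d+|[(),]", layer)  # '-' carries no extra meaning here
    bonds = set()
    branch_points = []   # atoms to return to when a branch closes
    previous = None
    for token in tokens:
        if token == "(":
            branch_points.append(previous)
        elif token == ",":
            previous = branch_points[-1]      # next sibling branch
        elif token == ")":
            previous = branch_points.pop()    # resume the main chain
        else:
            atom = int(token)
            if previous is not None:
                bonds.add((min(previous, atom), max(previous, atom)))
            previous = atom
    return sorted(bonds)

bonds = parse_connectivity("8-7(9)6-4-2-1-3-5-6")
print(bonds)
```

For nitrobenzene this yields nine bonds among the nine non-hydrogen atoms: six in the ring (the chain returns to atom 6), one from the ring to N (atom 7), and two from N to the oxygens (atoms 8 and 9).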
MSG
[Figure: monosodium glutamate structure with canonical numbering of atoms 1 to 10, plus Na+]

Description    Layers
formula        C5H8NO4.Na
connectivity   6-3(5(9)10)1-2-4(7)8;
H-atoms        1-2H2,3H,6H2(H-,7,8,9,10);
stereo         sp3 3-;
charges        -1;+1

C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1
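Because the serialized identifier is just the layers joined by "/", it can be taken apart again with simple string handling. The sketch below splits the MSG string shown above into named layers. The mapping from prefix letters to readable names is an assumption made for this example, and the function is hypothetical rather than part of any official toolkit.

```python
# Assumed meanings of the one-letter layer prefixes used in the string above.
LAYER_NAMES = {
    "c": "connectivity",
    "h": "H-atoms",
    "q": "charges",
    "p": "protons",
    "t": "stereo (sp3)",
    "m": "stereo (inversion)",
    "s": "stereo (type)",
}

def split_layers(identifier):
    """Split an identifier string into a dict of layer name -> layer text."""
    parts = identifier.split("/")
    layers = {"formula": parts[0]}   # the first field has no prefix letter
    for part in parts[1:]:
        if not part:
            continue
        name = LAYER_NAMES.get(part[0], part[0])  # unknown prefixes kept as-is
        layers[name] = part[1:]
    return layers

msg = ("C5H9NO4.Na/c6-3(5(9)10)1-2-4(7)8;"
       "/h1-2H2,3H,6H2,(H,7,8)(H,9,10);/q;+1/p-1/t3-;/m1./s1")
for name, text in split_layers(msg).items():
    print(f"{name:20s} {text}")
```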
INChITest
Version
Input/Result
Mobile H On/Off
Include Org-Metal Bonds
Peptide Mass Spectra: Libraries for Organisms
• Proteins are linear sequences of amino acids, characteristic of the genome (organism)
• Peptides are ‘digested’ fragments of proteins
• MS ‘sequences’ peptides to reveal the source protein
• Peptide fragmentation spectra are not quite predictable
• Peptide fragmentation spectra for a ‘genome’ can be contained in one library
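To make the ‘digestion’ step concrete, the sketch below cuts a protein sequence into peptides in silico. It assumes trypsin-like cleavage (after K or R, but not when the next residue is P), which is a common choice and is not stated on the slide; the function name and the example sequence are hypothetical.

```python
import re

def digest(sequence):
    """Split a protein sequence after K or R unless followed by P (assumed rule)."""
    # Zero-width split points: just after K/R and not just before P.
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if p]  # drop the empty piece after a final K/R

# Hypothetical protein sequence, for illustration only.
peptides = digest("MKHLQLAIRGSPK")
print(peptides)
```

Since every protein encoded by a genome yields a finite set of such peptides, one reference spectrum per peptide is enough to cover an organism, which is the point made in the last bullet.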
Spectrum Prediction Programs
Peptide Spectra Reference Library (multiple measurements each of 10,000 peptides)
HLQLAIR/2+
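A library entry such as HLQLAIR/2+ is located in a search by its precursor m/z. The sketch below computes that value, reading "/2+" as a doubly protonated ion (an assumption about the notation). The table contains standard monoisotopic residue masses for only the amino acids in this peptide, and the function name is hypothetical.

```python
# Monoisotopic residue masses (u) for the residues in HLQLAIR only.
RESIDUE_MASS = {
    "H": 137.05891,
    "L": 113.08406,
    "Q": 128.05858,
    "A": 71.03711,
    "I": 113.08406,
    "R": 156.10111,
}
WATER = 18.01056    # added once per peptide for the free termini
PROTON = 1.00728    # mass of each charge-carrying proton

def precursor_mz(sequence, charge):
    """m/z of a peptide carrying `charge` extra protons."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER
    return (neutral + charge * PROTON) / charge

mz = precursor_mz("HLQLAIR", 2)
print(f"HLQLAIR/2+ precursor m/z = {mz:.4f}")
```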
MS Mapped to the Genome
From Eric Deutsch, ISB, 6/2004