Page: 1 / 25 Knorr / 2nd European GCxGC Symposium, 2011

High-Throughput Computer-Assisted Identification Platform of Small Molecules

ACD/Labs European Users Meeting 2012

June 12-13, 2012

A. Knorr, A. Monge, D. Arndt, E. Martin, and P. Pospisil

Pospisil et al, ACD/Labs European Users’ Meeting 2012


Aim of CASI

Goal

• To accelerate and standardize the identification of small molecules with the highest possible confidence

• To increase the throughput of identified compound structures through a fully automatic process

• To standardize the identification process

CASI is a Computer-Assisted Structure Identification platform. It automatically identifies compound structures in highly complex matrices.



Example

Smoke of a conventional cigarette, measured by GCxGC-TOF: which compound does a given peak correspond to?



CASI Automated Workflow

1. GCxGC-TOF
   Data passed on: multi JDX MS file + KI experimental values + relative second retention time

2. Search in mass spectra databases (NIST MS Search)
   Result: hits

3. CASI Score, from three matching modules:
   • KI matching
   • 2nd column relative retention time matching
   • Boiling point matching

4. Ranking
   Result: sorted hits

5. Submission to UCSD database
   Result: confirmed hits



CASI Software Architecture

Input data

· JDX file (mass spectra file)

· Experimental KI

· Experimental 2D retention index

Components

· CASI Core Engine (Java + Tomcat)

· NIST MS Search

· MS databases (NIST format)

· Query mass spectra

· RapidMiner (modeling)

· Dragon (modeling)

· Pipeline Pilot (chemistry)

· ACD PhysChem, ACD Batch, ACD name-to-structure

· Oracle 11gR2

· Oracle Application Express (web application)

Servers and environments

· CASI Server

· ACD/Labs + Pipeline Pilot Server

· ORACLE Server

· HPC environment

Connections shown between components

· Java API

· SOAP

· Hibernate

· CL (command line)




Latest spectral databases acquired (2010 versions)

More than 650’000 mass spectra available in CASI


Computational modeling:

Kovats Indices

• Several good models have been published.

• We model Kovats Indices for selected instruments.

GCxGC-TOF 2D retention

• Never published.

BP for unknown compounds

• Boiling point is known to correlate with Kovats Indices.

• BP can also be predicted or retrieved by structure (ACD/Labs).



Predictive QSPR Model for KI, 2DrelRT and BP

Selected models

• GA – Support Vector Regression, 8 molecular descriptors

• GA – Linear Regression, 15 molecular descriptors

• Linear Regression: BP calc. by ACD/PhysChem vs. BP calc. by KI

Validation

• Kovats Index: r2 = 0.981

• 2DrelRT: r2 = 0.855

• Boiling Point: r2 = 0.942




Experimental method to determine 2D relative retention time (2DrelRT)

Never published



Experimental 2D Relative RT Approach for GCxGC-TOF

• Currently, no 2D relative RT (2DrelRT) approach has been published

• Advantage of a 2DrelRT approach over a 2D absolute RT (2DabsRT) approach: correction of systematic GC fluctuations, enhanced reproducibility

• The challenge is the definition of a reference system accessible to all 2D peaks

• Our approach is to refer 2D peaks to hypothetical reference points based on linear regression on deuterated n-alkanes (patented)

[Figure: 2nd dimension separation time [seconds] vs. 1st dimension separation time [seconds]. Shown are deuterated n-alkanes 1, 2 and 3, an example compound, the hypothetical n-alkane retention (for 1D RT range < 1D RTdA2), the 2D RT of the hypothetical n-alkane, and the absolute 2D RT of the compound (abs 2D RTComp).]
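
The idea of referring a 2D peak to a hypothetical reference point obtained by linear regression on deuterated n-alkanes can be illustrated with a minimal Python sketch. The slides do not give the actual (patented) formula, so everything here is an assumption made for illustration: the function names, the use of one straight line over all reference alkanes (the figure suggests the regression may instead be applied per 1D RT range), the definition of the relative value as a ratio, and the numbers in the usage lines.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def hypothetical_2d_rt(alkanes, rt1_compound):
    """2D RT a hypothetical n-alkane would have at the compound's 1D RT.

    alkanes: list of (1D RT, 2D RT) pairs of the deuterated n-alkanes.
    """
    slope, intercept = fit_line([a[0] for a in alkanes],
                                [a[1] for a in alkanes])
    return slope * rt1_compound + intercept


def relative_2d_rt(alkanes, rt1_compound, rt2_compound):
    # ASSUMPTION: the relative value is taken as the ratio of the compound's
    # absolute 2D RT to the hypothetical n-alkane 2D RT. The real definition
    # used in CASI is not stated on the slides.
    return rt2_compound / hypothetical_2d_rt(alkanes, rt1_compound)


# Made-up retention times in seconds, for illustration only.
reference = [(600.0, 1.0), (1200.0, 1.5), (1800.0, 2.0)]
rel = relative_2d_rt(reference, rt1_compound=900.0, rt2_compound=2.5)
```

Because the reference points are measured in the same run as the compound, a drift that shifts all 2D retention times together would largely cancel in such a relative value, which is the advantage named in the bullets above.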



2DrelRT for GCxGC-TOF Data - Reproducibility

• Comparison of 3 different studies showed significantly better reproducibility using relative than absolute 2D RT.

• 1219 compounds found (signal/noise > 250), no outlier correction

• 90th percentile of the RSD (1097 compounds): 2.5% for 2D-relRT vs. 4.3% for 2D-absRT

[Figure: histogram of the number of compounds (0 to 400) vs. relative standard deviation of absolute / relative retention time for the second dimension of GCxGC-TOF (RSD, %; bins from 0.3 to >10.0), with one series for 2D-relRT and one for 2D-absRT.]
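
The reproducibility measure used on this slide, the relative standard deviation (RSD) per compound and its 90th percentile over all compounds, could be computed along the following lines. This is only a sketch: the function names, the nearest-rank percentile definition, and the sample numbers are assumptions, and the slides do not say which percentile convention was used.

```python
from statistics import mean, stdev


def rsd_percent(values):
    """Relative standard deviation in percent: 100 * sample stdev / mean."""
    return 100.0 * stdev(values) / mean(values)


def percentile_nearest_rank(values, p):
    """Nearest-rank percentile for an integer p between 1 and 100."""
    ordered = sorted(values)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p * n / 100
    return ordered[rank - 1]


# Made-up 2D retention values of one compound in three studies.
one_compound = [2.0, 2.1, 1.9]
example_rsd = rsd_percent(one_compound)

# Made-up per-compound RSD values; the slide reports the 90th percentile.
all_rsd = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
p90 = percentile_nearest_rank(all_rsd, 90)
```

Applying the same two functions once to absolute and once to relative 2D retention times would give the two values compared on the slide.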




In-silico fragmentation tools

Not included in CASI Score



Evaluation of in-silico fragmentation tools for CASI

• In-silico tool “AutoAssignment” by ACD/Labs was used to evaluate the rate of assignment of molecular substructures to fragments

• Results by “AutoAssignment” for 90 MS of confirmed structures and the corresponding 50 hits for each proposal from NIST MS Search (in total 4500 structures) (AutoAssignment parameters optimized by ACD/Labs)

27%

• Reject approach for low resolution mass-spectrometry by GCxGC-TOF

• Evaluate approach for accurate mass-spectrometry by GC-APCI-TOF (selectivity of substructure assignment will increase by accurate mass of fragments/isotopic pattern)

Statement from ACD/Labs (Graham McGibbon, ACD/Labs MS Product Manager):

“The AutoAssignment tool was designed primarily to facilitate fragmentation interpretation rather than for definitive structure differentiation, which currently needs comparisons of specific characteristic ions and authentic spectra.”




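CASI Score is a combination of the NIST Match Factor (from step 2) and of the three components of step 3:

$$\text{CASI Score} = f\big(\text{NIST MF},\ \mathrm{hyp}(SEP_{train,KI},\, KI_{query},\, KI_{pred}),\ \mathrm{hyp}(SEP_{train,2DRT},\, 2DRT_{query},\, 2DRT_{pred}),\ \mathrm{hyp}(SEP_{train,BP},\, BP_{query},\, BP_{pred})\big)$$

Here SEP is the standard error of prediction of the respective model on the training set, "query" marks the experimental value of the query compound, and "pred" marks the value predicted for the candidate structure.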

CASI Score function fitting – KI example

Predicted KI: 2000.0

Experimental KI: 2200.0

Standard Error of Prediction: 65.8

Curve fitting (variable): 10.0

Score: 0.908

Example: 8x50 structures with highest NIST Fit, scored by MS similarity and predicted KI

- red squares: score of the correct structure

- blue dots: scores of wrong proposals from NIST

[Figure panels: (1) CASI Score (0 to 1000) vs. example compound no.; (2) "Good Hits": hit ranking of correct structures, as counts per hit no. (#1, #2, #3, #4, #5-50); (3) visualization of curve fitting: score of KI module (0.0 to 1.0) vs. Kovats Index (500 to 3500).]

The following slides repeat these panels for a series of fits of the same example (predicted KI 2000.0, standard error of prediction 65.8). For each step, the score of the KI module and the number of the 8 correct structures ranked as hit #1:

Score    Correct structures at #1
1.000    5
1.000    6
0.999    6
0.996    6
0.990    7
0.977    7
0.908    8
0.630    8
0.000    7
0.000    5
0.000    3

Selected fit: score 0.630, with all 8 correct structures at #1.

• Training (fitting) of the CASI Score is done over all predicted modules (KI, BP, 2DrelRT) simultaneously and is based on #1-Hits.

• All fittings are done automatically (no manual intervention).
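
The KI example above (predicted KI 2000.0, experimental KI 2200.0, standard error of prediction 65.8, curve-fitting variable 10.0, score 0.908) can be reproduced with a small Python sketch. The slides do not state the functional form of the module score, so the clipped parabola below is purely an assumption, chosen because it returns the 0.908 shown for this example; the function and parameter names are hypothetical as well.

```python
def module_score(predicted, experimental, sep, curve_variable):
    """Score between 0 and 1 for one predicted property (here KI).

    ASSUMPTION: the score falls off quadratically with the deviation,
    scaled by curve_variable * SEP, and is clipped at 0. This is not the
    documented CASI function, only a form consistent with the example.
    """
    scaled = (experimental - predicted) / (curve_variable * sep)
    return max(0.0, 1.0 - scaled ** 2)


score = module_score(predicted=2000.0, experimental=2200.0,
                     sep=65.8, curve_variable=10.0)
print(f"{score:.3f}")  # prints 0.908
```

In this sketch a larger curve-fitting variable makes the module more tolerant of prediction errors and a smaller one makes it stricter, so fitting amounts to choosing that variable for each module such that as many correct structures as possible end up as hit #1.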




Ranking of suggested hits by CASI score



Analysis page example

[Screenshot: candidates Nr. 1 to Nr. 4 as ranked by CASI, next to candidates Nr. 1 to Nr. 4 as ranked by NIST.]


Performance Indicators of CASI Platform

Discrimination between identified and unknowns:

• CASI score < 795

• NIST score < 825

Below this threshold: high probability to be true negatives (false proposals given by mass-spectral databases)

[Figure: two histograms of frequency (0 to 50) vs. score (300 to 1000). One panel shows the CASI Score, with the series "Correct hits of the validation set" and "Selected hit of unknown compounds". The other shows the NIST MS Search Match Factor, with the series "Correct hits of the validation set" and "Hit with highest score for unknowns".]

Distribution of the CASI scores or NIST Match Factors for the correct hits (blue) of the validation set and of the hits of unknown compounds selected by default (1st highest score, pink) for a set of 176 unknown compounds


Dataset for CASI Platform

• Comparing the chemical composition of different smoke samples

• Application of non-targeted differential screening using GCxGC-EI-TOF-MS

• GCxGC-EI-TOF-MS data provide:

– structural proposals for the most relevant differences

– unknowns (no structural proposal available)

• 218 structural proposals were confirmed by reference standards

• 176 unknowns were additionally included in the dataset



Range and Structural Diversity of CASI dataset

Comparison of CASI dataset against smoke related organic compounds (>6000)

• CASI compounds are distributed across a broad range of structural features of the in-house database of smoke related organic compounds.

• PCA based on ECFP6 fingerprints (circular topological fingerprints, used e.g. for similarity searching)

• The dataset we used covers the common ranges of GC-MS

[Figure: molecular weight (50 to 550) vs. Kovats Index (500 to 3500).]



Overall results for current dataset (Identified + Unknowns)

• Evaluation of results on combined dataset (247 mass-spectra) by contingency table (threshold used for being identified: 825 for NIST MS Search, 795 for CASI Score)

• CASI Platform shows significantly fewer false positives (11 vs. 29) than NIST MS Search.

                CASI score          NIST MF
                true    false       true    false
positive          46       11         40       29
negative         165       14        147       20
total (%)        89%      11%        79%      21%
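
The "total (%)" row follows from the four counts in each half of the contingency table. A minimal Python sketch of that arithmetic, with a hypothetical helper name and with the counts copied from the table:

```python
def true_false_percent(true_pos, false_pos, true_neg, false_neg):
    """Share of true and of false classifications, in whole percent."""
    total = true_pos + false_pos + true_neg + false_neg
    true_pct = round(100 * (true_pos + true_neg) / total)
    false_pct = round(100 * (false_pos + false_neg) / total)
    return true_pct, false_pct


casi = true_false_percent(true_pos=46, false_pos=11, true_neg=165, false_neg=14)
nist = true_false_percent(true_pos=40, false_pos=29, true_neg=147, false_neg=20)
print(casi)  # (89, 11)
print(nist)  # (79, 21)
```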




Automatic submission into PMI unique chemical database



User Web Interface

User and admin web interface

• Job queue

• Results summary

• Candidates for each query

• Export to .sdf or publish in in-house database



Conclusion

• The CASI (for GC) platform is functioning.

• CASI can be expanded to other GC systems, with the possibility to adapt it dynamically to other instruments and analytical columns.

• The innovative 2D relative retention time concept is proven to work.

• Robust models for KI, BP and 2DrelRT were established.

• CASI Score performs better than NIST MS Search.

• Patent application submitted (>20 claims).

• Very good feedback from the scientific community.

• The platform will be extended to accurate mass and LC-MS and to other GC systems.



Acknowledgment

Bioanalytical chemistry

Chemoinformatics

Arno Knorr Markus Stueber

Andre Stratmann Daniel Arndt Manuel Peitsch Pavel Pospisil

Elyette Martin Aurelien Monge



End

Thank you for your attention.

