The significance of the structure of data on PLS predictions of protein involving both natural and human experimental design

Åsmund Rinnan, Lars Munck


Page 2

Three data sets of barley

Data set   Type        Number of samples
A          Natural     31
B          Simulated   31
C          DoE         54

B + C: The major substances (protein, starch, cellulose, beta-glucan, fat and water) are weighted to represent biological composition.

All measured on a NIR 6500 from 1100 to 2498 nm at 2 nm intervals.

Sample classes: normal barley, protein mutants, carbohydrate mutants.

Page 3

Pre-processing of spectra

Moving-window SNV with a 130 nm window

The 1580-2498 nm spectral region shows the smallest differences between the three data sets.
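The moving-window SNV step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the truncation of the window at the spectrum edges, the use of the sample standard deviation and the synthetic spectrum are all assumptions. With 2 nm sampling, a 130 nm window corresponds to 65 points.

```python
import math
from statistics import mean, stdev


def moving_window_snv(spectrum, window_nm=130.0, step_nm=2.0):
    """Standardize each point by the mean and std of its local window.

    Assumption: the window is centred on the point and simply truncated
    at both ends of the spectrum.
    """
    points = int(round(window_nm / step_nm))  # 130 nm / 2 nm = 65 points
    half = points // 2
    n = len(spectrum)
    corrected = []
    for i in range(n):
        window = spectrum[max(0, i - half):min(n, i + half + 1)]
        if len(window) < 2:
            corrected.append(0.0)
            continue
        m, s = mean(window), stdev(window)
        corrected.append((spectrum[i] - m) / s if s > 0 else 0.0)
    return corrected


# Synthetic spectrum on the 1100-2498 nm grid (hypothetical values).
wavelengths = [1100 + 2 * k for k in range(700)]
raw = [0.3 + 0.0002 * (w - 1100) + 0.05 * math.sin(w / 40.0) for w in wavelengths]

snv = moving_window_snv(raw)
# An additive offset and a multiplicative gain leave the result unchanged.
scaled = moving_window_snv([2.0 * x + 3.0 for x in raw])
```

Because each point is centred and scaled within its own window, offset and gain differences between spectra are removed locally rather than over the whole spectrum.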

Page 4

PCA 1100-2500nm

Page 5

Interval PCA selects 1804-2060 nm, giving the least differences between data sets.
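One way to sketch the interval PCA idea: split the wavelength axis into intervals, run a PCA on each, and score how far apart the data sets lie in the score space. The separation criterion below (mean distance between group centroids divided by the pooled within-group spread) is an assumption chosen for illustration, as are all function names and the synthetic data; the slides do not state which criterion was used.

```python
import numpy as np


def pca_scores(X, n_components=2):
    """PCA scores of mean-centred data via SVD."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def group_separation(scores, labels):
    """Hypothetical criterion: centroid distance relative to within-group spread."""
    groups = np.unique(labels)
    centroids = np.array([scores[labels == g].mean(axis=0) for g in groups])
    within = np.sqrt(np.mean([scores[labels == g].var(axis=0).sum() for g in groups]))
    dists = [np.linalg.norm(centroids[i] - centroids[j])
             for i in range(len(groups)) for j in range(i + 1, len(groups))]
    return float(np.mean(dists) / within)


def interval_pca(X, labels, n_intervals):
    """Return (start, stop, separation) for equal-width variable intervals."""
    edges = np.linspace(0, X.shape[1], n_intervals + 1).astype(int)
    results = []
    for start, stop in zip(edges[:-1], edges[1:]):
        scores = pca_scores(X[:, start:stop])
        results.append((int(start), int(stop), group_separation(scores, labels)))
    return results


# Synthetic data: three groups of 31, 31 and 54 samples, 100 variables.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], [31, 31, 54])
latent = rng.normal(size=(labels.size, 2))
loadings = rng.normal(size=(2, 100))
X = latent @ loadings + 0.05 * rng.normal(size=(labels.size, 100))

# A group-dependent offset is added everywhere except variables 60-79.
offset_cols = np.r_[0:60, 80:100]
X[:, offset_cols] += 3.0 * labels[:, None]

results = interval_pca(X, labels, n_intervals=5)
best = min(results, key=lambda r: r[2])  # interval with the least group difference
```

In this toy example the interval without the group offset gets the lowest separation value, which mirrors the way an interval with the least differences between data sets would be selected.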

Page 6

Predicting protein using the three data sets

             Nat     Sim     DoE
RMSE         0.71    1.08    0.69
r2           0.9     0.84    0.96
nLV          5       2       5
intercept    1.09    2.12    0.48
slope        0.93    0.86    0.97
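For readers unfamiliar with these diagnostics, the sketch below shows one common way to compute them from measured and predicted protein values. It assumes that slope and intercept come from regressing predicted on measured values and that r2 is the squared Pearson correlation; the slides do not specify either. The function name and the numbers are hypothetical.

```python
import math


def prediction_diagnostics(measured, predicted):
    """RMSE, r2, slope and intercept of predicted vs. measured values."""
    n = len(measured)
    mx = sum(measured) / n
    my = sum(predicted) / n
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    rmse = math.sqrt(sum((y - x) ** 2 for x, y in zip(measured, predicted)) / n)
    slope = sxy / sxx
    return {
        "RMSE": rmse,
        "r2": sxy ** 2 / (sxx * syy),
        "slope": slope,
        "intercept": my - slope * mx,
    }


# Hypothetical protein values (%), with predictions lying on a straight line.
measured = [9.0, 10.5, 12.0, 13.5, 15.0]
predicted = [0.9 * x + 1.2 for x in measured]
diag = prediction_diagnostics(measured, predicted)
```

A slope below 1 combined with a positive intercept, as in the table, means that low protein values are overestimated and high values underestimated.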

Figure: regression coefficients.

Page 7

PLS diagnostics (to protein)

A. Simple correlation coefficients: wavelength absorption to protein content.

B. PLS regression coefficients
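Diagnostic A can be sketched as a Pearson correlation between the absorbance at each wavelength and the protein content. The function names and the tiny three-channel data set are hypothetical and only illustrate the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either variable is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def correlation_spectrum(spectra, protein):
    """One correlation coefficient per wavelength channel.

    spectra: list of samples, each a list of absorbances.
    """
    n_channels = len(spectra[0])
    return [pearson([s[j] for s in spectra], protein) for j in range(n_channels)]


# Hypothetical data: 5 samples, 3 wavelength channels.
protein = [9.0, 10.0, 11.0, 12.0, 13.0]
unrelated = [0.5, 0.2, 0.6, 0.2, 0.5]
spectra = [[0.1 * p, 1.0 - 0.05 * p, u] for p, u in zip(protein, unrelated)]

r = correlation_spectrum(spectra, protein)
```

Here the first channel rises with protein, the second falls with it and the third is unrelated, so the correlation spectrum is close to +1, -1 and 0.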

Legend: Natural, Simulated, DoE.

Page 8

Isolating the chemical and biological components of the data sets

Data set   Type                Number of samples   Components
A          Natural             31                  Chemistry + SimBiology + RestBiology
B          Simulated natural   31                  Chemistry + SimBiology
C          DoE                 54                  Chemistry

SimBiology = B - C
RestBiology = (A - C) - (B - C)

Page 9

Predicting protein by PLS: chemistry and non-simulated (rest) biology show high contributions, while that of simulated biology is low.

             Chemistry   SimBio   RestBio
RMSE         0.94        2.53     1.31
R2           0.87        0.13     0.76
nLV          3           1        3
intercept    1.58        12.9     3.15
slope        0.90        0.17     0.80

Page 10

Normalized regression coefficients

Page 11

Back to data, selected wavelengths


Column headings: Full PLS; Correlation-PLS; Wavelengths abs to protein

Assignment: PLS; Phil Williams

Page 12

Quick comparison

Page 13

Results: Summary

Page 14

Interpretation: We are working by "permutation science":

• 1. By mathematical validation of models: permutation of data in chemometrics, i.e. cross-validation.

Page 15

"Permutation science":

• 2. Design of Experiments (DoE): permutation of data through experiments by human design.

Page 16

"Permutation science":

• 1. By mathematical validation of models: permutation of data in chemometrics, i.e. cross-validation.

• 2. Design of Experiments (DoE): permutation of data through experiments by human design.

• 3. Natural design: permutation by selection of unique natural states where nature reveals its principles in data.

Question: In chemometrics, why not combine them all rather than focusing on mathematical permutation alone?

All three permutation approaches are at the heart of chemometric validation of models! Why not use them together, as we have done here? They are complementary.
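The first, purely mathematical kind of permutation can be illustrated with leave-one-out cross-validation. The sketch below uses a univariate least-squares line instead of PLS to stay short; the function names and the numbers are hypothetical. As a sanity check, the same procedure is repeated after the y-values have been put in a fixed scrambled order.

```python
import math


def fit_line(xs, ys):
    """Least-squares slope and intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def loo_rmsecv(xs, ys):
    """Leave each sample out once, refit, and predict the left-out sample."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_line(train_x, train_y)
        errors.append(ys[i] - (slope * xs[i] + intercept))
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# Hypothetical data: a linear relation with small deviations.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
noise = [0.1, -0.1, 0.05, -0.05, 0.1, -0.1, 0.05, -0.05]
y = [2.0 * xi + 1.0 + e for xi, e in zip(x, noise)]

rmsecv = loo_rmsecv(x, y)

# Same y-values in a fixed scrambled order: the x-y relation is broken.
order = [3, 0, 6, 1, 7, 2, 5, 4]
rmsecv_scrambled = loo_rmsecv(x, [y[k] for k in order])
```

The cross-validated error stays small while the structure linking x and y is intact and grows sharply once it is scrambled, which is the point of validating a model against the structure of the data.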

Page 17

Principles of natural processes are reflected in data

• The solar eclipse reveals solar eruptions *)

• The NIR barley endosperm mutant model, developed since 1965 with expression control of genetics and environment. Two types of mutants: regulative protein mutants (P) and carbohydrate (starch) mutants (C); normal barley is denoted N.

*) http://science.nationalgeographic.com/science/enlarge/solar-eclipse-moon.html

Figure labels: Mutant 5.f and Bomi control (two panels).

J.Chemometrics 24: 481-495 (2010)

Page 18

How were the mutants found? By a bivariate plot of % protein against mmol DBC (dye-binding capacity, measured with acilane orange).

The dye-binding capacity (DBC) instrument for basic amino acids (lysine).

Background: Development of screening methods for improving lysine and nutritional quality in barley.

LM at the nutritional laboratory of the Swedish Seed Association, Svalöf, in 1967.

Figure: plot with axes DBC and % protein. Groups: high lysine mutation, mutation recombinants, normal recombinants.

Page 19

Selecting endosperm mutants

J.Chemometrics 24: 481-495 (2010)

Figure panels: Vitamin E profile; A/P vs. β-glucan (A/P = amide nitrogen to protein). Point labels include the mutants 3a, 3b, 3c, 3m, 4d, 5f, 5g, 16, 449, w1, w2 and piggy, and normal barley (N). Legend: High β-glucan, Normal, High lysine, No data.

Conclusion: Each mutant produces a unique chemical fingerprint for each individual gene in a controlled genetic background (Bomi). The fingerprint is summarized at the level of chemical bonds by NIR spectroscopy. Cellular computation is soft, like a PCA.


Any chemical (bi-)plot can select any mutant.

Page 20

There are deterministic differential NIR spectra for each mutant relative to the gene background Bomi. They reveal a spectral absorption reproducibility as high as 10^-5 MSC log 1/R for the P mutant lys3.a (blue) and the C mutant lys5.g (brown).
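A differential spectrum of this kind can be sketched as MSC followed by a subtraction of the background spectrum. The code below is a minimal illustration under stated assumptions: the function names are hypothetical, the reference spectrum is taken as given (in practice it is often the mean spectrum of the set), and the spectra are synthetic.

```python
import math


def msc(spectrum, reference):
    """Multiplicative scatter correction against a reference spectrum.

    Fits spectrum = a + b * reference by least squares and returns
    (spectrum - a) / b.
    """
    n = len(spectrum)
    mr = sum(reference) / n
    ms = sum(spectrum) / n
    srr = sum((r - mr) ** 2 for r in reference)
    srs = sum((r - mr) * (s - ms) for r, s in zip(reference, spectrum))
    b = srs / srr
    a = ms - b * mr
    return [(s - a) / b for s in spectrum]


# Synthetic log(1/R) shape shared by both samples (hypothetical values).
reference = [0.5 + 0.3 * math.sin(i / 8.0) for i in range(50)]

# Background (Bomi): same shape, different offset and gain.
bomi_raw = [0.1 + 1.2 * r for r in reference]

# Mutant: same shape plus a small local band at index 30, other offset and gain.
mutant_raw = [-0.05 + 0.8 * (r + (0.02 if i == 30 else 0.0))
              for i, r in enumerate(reference)]

bomi_msc = msc(bomi_raw, reference)
mutant_msc = msc(mutant_raw, reference)

# Differential spectrum: mutant minus genetic background after MSC.
diff = [m - b for m, b in zip(mutant_msc, bomi_msc)]
peak_index = max(range(len(diff)), key=lambda i: abs(diff[i]))
```

After MSC the offset and gain differences are gone, so the subtraction leaves essentially only the band that distinguishes the mutant from the background.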

Page 21

Data structure is super-ordinate to chemometric analysis

Figure: PCA score plot (PC1 vs. PC2). Point labels: 3a, 3b, 3c, 3m, 4d, 5f, 5g, 16, 95, 449, w1, w2, Bomi, CAII, Minerva, Nordal, Triumph, Lysiba, Lysimax. Groups: C, N, P. Annotations: BG = 12.3; BG = 3.7; 3.2. Inset labels: 3c, 3a.

The 3a and 3c P mutants are differentiated in this PCA.

However, spectral differences in the region 2450-2500 nm represent a much more finely tuned and informative change in β-glucan, from 3.1% in 3a to 6.4% in 3c.

Page 22

How is the chemical composition of the cell decided?

Through soft modeling of intercellular dynamics of the whole cell by quantum and chemical cross-talk, as revealed by the movements of chromosomes at mitosis (click on the left figure).

Cell emergence is like music as directed by the whole chemical orchestra of the cell

Page 23

Conclusion

• Biological macro data are basically deterministic, calculated in situ by “set probability” controlled by the whole cell

• Holistic analysis is limited by uncertainty, specified as irreducibility “top down” and indeterminacy “bottom up”

• The structure of data is the king that rules mathematical modeling by data inspection

• Because of the determinism demonstrated here, the development of gentle data models (such as MSC) and data inspection software is of essential importance in avoiding a reduction of information.

• Chemometrics is excellent for overviews, but the results have to be checked by data inspection,
