29/11/2016 v 1.0.0
Data processing
Filters and normalisation
Mélanie Pétéra
W4M Core Team
Presentation map
1) Processing the data
• W4M table format for Galaxy
2) A generic tool to filter in Galaxy
a) Generic Filter
• How does it work?
b) Examples
• Filtering MS data according to retention time
• Using blanks to filter your data
3) Signal drift and batch effect correction for MS data
a) How does that work?
b) One Galaxy tool, various possibilities
• What’s different?
• How to use this tool?
4) Checking for quality
• Using your pools to check your data
PROCESSING THE DATA
W4M Galaxy tools: a standard format
• A variety of tools to process extracted data
– filters
– normalisation
– statistics…
• A common way to handle data
– Easier to follow from one tool to another
– Fewer format switches in the analysis pipeline
– A standardised input file format, to easily find the information needed or obtained
W4M table format for Galaxy
• 3 tables gathering all the information
– the data matrix: intensities of ions or buckets
– the sample metadata file: information concerning your samples
– the variable metadata file: information concerning your ions or buckets
• Note that this three-table structure is already generated by the XCMS or bucketing modules
– /!\ you must complete the sample metadata file with your samples’ information (for example, technical information about your samples, or factors of interest)
W4M table format for Galaxy
Data matrix
• First row: samples’ IDs
• First column: variables’ IDs
• Top-left cell (name): any name you want (just avoid starting it with "ID" if you plan to open the file with Excel later)
• Other cells: ONLY intensities (no other information)
Note: missing values should be coded NA
W4M table format for Galaxy
Sample metadata
• First row: column names
• First column: samples’ IDs
• Top-left cell (name): any name you want (just avoid starting it with "ID" if you plan to open the file with Excel later)
• Other cells: information about your samples (study factors for example)
• You can add as many columns to this table as you want or need
• Samples’ IDs must absolutely match those in the data matrix file
Note: some modules may need specific columns with particular names (e.g. ‘sampleType’, ‘injectionOrder’ or ‘batch’ for the Batch Correction module). Refer to the module’s help section for more information.
W4M table format for Galaxy
Variable metadata
• First row: column names
• First column: variables’ IDs
• Top-left cell (name): any name you want (just avoid starting it with "ID" if you plan to open the file with Excel later)
• Other cells: information about your variables
• You can add as many columns to this table as you want or need
• Variables’ IDs must absolutely match those in the data matrix file
W4M table format for Galaxy
• The files must be tab-separated
– TSV files
– TXT files with tabulation as separator
• Convention for identifiers and column names
– They should not contain any duplicates
– Preferably use only alphanumeric characters, dots (.) and underscores (_)
Some tools include preliminary tests of your table format, but if you want to make sure everything is all right you can use the Check Format module. It can also help when you encounter errors you do not understand.
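The conventions above can be checked by hand with a few lines of Python. This is only a minimal sketch of that kind of check, not the implementation of the Check Format module; the function names, table contents and identifiers below are all hypothetical.

```python
import csv
import io
import re

# Alphanumeric characters, dots and underscores only (the recommended convention).
ALLOWED = re.compile(r"^[A-Za-z0-9._]+$")

def read_tsv(text):
    """Parse tab-separated text into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text), delimiter="\t"))

def check_ids(ids, label):
    """Return a list of problems found in a list of identifiers."""
    problems = []
    if len(set(ids)) != len(ids):
        problems.append(f"{label}: duplicate identifiers")
    unusual = [i for i in ids if not ALLOWED.match(i)]
    if unusual:
        problems.append(f"{label}: unusual characters in {unusual}")
    return problems

def check_tables(data_matrix, sample_metadata, variable_metadata):
    """Check identifier conventions and ID matching across the three tables."""
    dm = read_tsv(data_matrix)
    sm = read_tsv(sample_metadata)
    vm = read_tsv(variable_metadata)
    sample_ids = dm[0][1:]                     # first row of the data matrix
    variable_ids = [row[0] for row in dm[1:]]  # first column of the data matrix
    problems = check_ids(sample_ids, "samples") + check_ids(variable_ids, "variables")
    if sorted(sample_ids) != sorted(row[0] for row in sm[1:]):
        problems.append("sample IDs differ between data matrix and sample metadata")
    if sorted(variable_ids) != sorted(row[0] for row in vm[1:]):
        problems.append("variable IDs differ between data matrix and variable metadata")
    return problems

# Hypothetical example tables (two samples, two ions).
data_matrix = "dataMatrix\tS1\tS2\nM100T50\t1200\tNA\nM200T90\t530\t610\n"
sample_metadata = "sampleMetadata\tsampleType\nS1\tsample\nS2\tpool\n"
variable_metadata = "variableMetadata\tmz\trt\nM100T50\t100.0\t50\nM200T90\t200.0\t90\n"

print(check_tables(data_matrix, sample_metadata, variable_metadata))
```

An empty list means that no problem was found. Renaming a sample in only one of the tables would be reported as a mismatch.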
A GENERIC TOOL TO FILTER IN GALAXY
Generic Filter
A generic tool to filter in Galaxy
• Extracted data often contain more than what should be used
– There are various ways to filter, depending on your protocol and objectives
• You must know what you are filtering
– A generic tool invites you to specify exactly what you want to filter => this is your choice
• Where is the information to filter?
– It must be contained in the sample metadata or variable metadata file (depending on the filter)
Galaxy filtering module: "Generic Filter"
Input files: 3 tables
Output files: 3 tables corresponding to the input files, filtered according to the specified parameters
Galaxy filtering module: "Generic Filter"
Example 1: filtering according to retention time
• When using a chromatography column for MS analysis, you may want to exclude some time ranges, for example to:
– Exclude the dead volume
– Exclude a calibration zone at the beginning or the end
– Exclude a column flush
– …
Example for LC-QTOF with dead volume between 0 and 0.4 min and column flush from 16.5 min
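Outside Galaxy, the same retention-time filter can be sketched in a few lines of Python. This only illustrates the logic and is not the Generic Filter tool itself. The ion identifiers, the `rt` field (assumed here to be in minutes) and the exclusion of the boundary values are assumptions.

```python
# Hypothetical variable metadata: one entry per ion, with a retention time "rt".
# Assumption: rt is expressed in minutes; convert first if yours is in seconds.
variable_metadata = [
    {"id": "M85T12", "rt": 0.2},      # inside the dead volume
    {"id": "M130T95", "rt": 1.6},
    {"id": "M301T610", "rt": 10.2},
    {"id": "M415T1002", "rt": 16.7},  # inside the column flush
]

# Hypothetical data matrix: variable ID -> intensities (one value per sample).
data_matrix = {
    "M85T12": [120.0, 98.0],
    "M130T95": [5400.0, 6100.0],
    "M301T610": [880.0, 910.0],
    "M415T1002": [40.0, 35.0],
}

RT_MIN, RT_MAX = 0.4, 16.5  # limits taken from the LC-QTOF example

# Keep only the ions eluting strictly between the two limits.
kept_ids = [v["id"] for v in variable_metadata if RT_MIN < v["rt"] < RT_MAX]

# The variable metadata and the data matrix must be filtered together,
# so that variables' IDs still match between the two tables.
filtered_variable_metadata = [v for v in variable_metadata if v["id"] in kept_ids]
filtered_data_matrix = {i: data_matrix[i] for i in kept_ids}

print(kept_ids)
```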
Example 2: using blanks to filter MS data
• Why?
– Noise in the signal is unavoidable in mass spectrometry data
– Some ways of reducing its impact on the data can be too radical (for example, filtering out all intensities below a given threshold)
– One possible alternative is to use blanks as a reference to estimate the noise
• How?
– The idea is to compare blanks’ intensities with other samples’ intensities (biological samples and/or pools)
– Ideally blanks are your injection solvent
– One common way to compare may be to set a minimum difference (by ratio) between means or medians, or to test for significant difference with a statistical test
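As an illustration of the ratio approach, the sketch below keeps an ion only when its mean intensity in the samples is at least a given number of times its mean intensity in the blanks. The function name, the threshold of 3 and the intensity values are assumptions chosen for the example, not recommended settings.

```python
from statistics import mean

def passes_blank_filter(blank_values, sample_values, min_ratio=3.0):
    """Return True if the sample mean is at least min_ratio times the blank mean."""
    blank_mean = mean(blank_values)
    sample_mean = mean(sample_values)
    if blank_mean == 0:
        # No signal at all in the blanks: keep the ion if the samples show any signal.
        return sample_mean > 0
    return sample_mean / blank_mean >= min_ratio

# Hypothetical intensities for two ions: (blank intensities, sample intensities).
ions = {
    "M130T95": ([100.0, 120.0, 110.0], [5400.0, 6100.0, 5800.0]),  # well above the blanks
    "M301T610": ([800.0, 850.0, 820.0], [880.0, 910.0, 870.0]),    # close to the blanks
}

kept = [ion for ion, (blanks, samples) in ions.items()
        if passes_blank_filter(blanks, samples)]
print(kept)
```

Using medians instead of means only requires swapping `mean` for `statistics.median`.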
• Example with Galaxy
If you separate blanks and other samples into two distinct folders for data extraction, your variable metadata table will contain (as its last columns):
– a fold column: mean fold change (always greater than 1, see tstat for which set of sample classes was higher)
– a tstat column: Welch’s two sample t-statistic, positive for analytes having greater intensity in class2, negative for analytes having greater intensity in class1
– a pvalue column: p-value of the t-statistic
Use the Generic Filter tool to filter!
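The sketch below shows one way these three columns might be combined into a filtering rule. It assumes that blanks are class1 and the other samples class2 (so a positive tstat means a higher intensity in the samples). The thresholds and the variable IDs are hypothetical.

```python
# Hypothetical rows of the variable metadata table, reduced to the columns used here.
variable_metadata = [
    {"id": "M130T95", "fold": 12.4, "tstat": 6.1, "pvalue": 0.0004},
    {"id": "M301T610", "fold": 1.3, "tstat": 0.9, "pvalue": 0.41},
    {"id": "M415T880", "fold": 8.0, "tstat": -5.2, "pvalue": 0.001},
]

MIN_FOLD = 3.0     # assumed minimum fold change between samples and blanks
MAX_PVALUE = 0.05  # assumed significance threshold

def keep(variable):
    """Keep ions that are significantly and sufficiently higher in samples than in blanks."""
    return (variable["fold"] >= MIN_FOLD
            and variable["tstat"] > 0  # assumption: class2 = samples, class1 = blanks
            and variable["pvalue"] <= MAX_PVALUE)

kept_ids = [v["id"] for v in variable_metadata if keep(v)]
print(kept_ids)
```

The third ion has a large fold change and a small p-value, but its negative tstat shows that it is higher in the blanks, so it is discarded.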
SIGNAL DRIFT AND BATCH EFFECT CORRECTION FOR MS DATA
How does that work?
• A normalisation process first established by Van Der Kloet et al.
– F.M. Van Der Kloet, I. Bobeldijk, E.R. Verheij, R.H. Jellema (2009). “Analytical error reduction using single point calibration for accurate and precise metabolomic phenotyping.” Journal of Proteome Research, p. 5132-5141
• which has made its way into current procedures
– Dunn et al. (2011). “Procedures for large-scale metabolic profiling of serum and plasma using gas chromatography and liquid chromatography coupled to mass spectrometry.” Nature Protocols, 6:1060-1083
How does that work?
• Principle
– What we have: distinct analytical batches, with particular intra-batch analytical effects
– What we want: comparable intensities
[Figure: intensity for 1 ion plotted against injection order, before and after correction]
How does that work?
• Technically speaking
– Correction is made for each ion independently
– For each ion:
• An intra-batch correction is made for each batch independently
– The analytical effect is modelled using the pools’ intensities according to the injection order
– Each sample intensity is divided by the estimated analytical effect at the corresponding injection number
– Sample values are then multiplied by a reference value (to keep the original ion scale)
• Inter-batch effect is thus automatically corrected
$$\text{normalised value}(x) = \frac{\text{observed sample value}(x)}{\text{estimated value}(x)} \times \text{reference value}$$
where x is the injection number and the estimated value is read from the regression curve of the analytical effect model.
[Figure: observed pool values and observed sample values plotted against injection number x, with the regression curve of the analytical effect model giving the estimated value y for injection number x]
Pools = quality-control pooled samples, all identical, injected regularly all through an analytical sequence
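To make the calculation concrete, here is a minimal sketch of the intra-batch correction for one ion in one batch, using a linear model. It is not the code of the Batch Correction tool. The function names and values are hypothetical, and the reference value is assumed to be the mean of the pool intensities, which the slides do not specify.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares line through (xs, ys); returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar

def correct_batch(injections, reference):
    """Correct one ion in one batch.

    injections: list of (injection_order, sample_type, intensity) tuples.
    The analytical effect is modelled on the pools only, then every intensity
    is divided by the estimate at its injection order and multiplied by
    the reference value.
    """
    pools = [(order, value) for order, kind, value in injections if kind == "pool"]
    slope, intercept = fit_line([o for o, _ in pools], [v for _, v in pools])
    return [(order, kind, value / (slope * order + intercept) * reference)
            for order, kind, value in injections]

# Hypothetical batch: the pool signal drifts downwards during the sequence.
batch = [
    (1, "pool", 1000.0),
    (3, "sample", 1900.0),
    (5, "pool", 900.0),
    (7, "sample", 425.0),
    (9, "pool", 800.0),
]

# Assumption: the reference value is the mean of the pool intensities.
reference = mean(value for _, kind, value in batch if kind == "pool")

for order, kind, value in correct_batch(batch, reference):
    print(order, kind, round(value, 1))
```

After correction the pools all sit at the reference value. Applying the same function to each batch with a shared reference brings all batches onto the same scale, which is why the inter-batch effect is corrected at the same time.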
How does that work?
• What you need to make it go smoothly
– Pools should be injected regularly all through your sequences
– Pools should be identical, preferably a mix of all your biological samples to be representative of molecule diversity
– Pools should be numerous enough in each batch for the regression to be reliable (at the very least 3 per batch for linear methods and 8 for non-linear ones)
– It is recommended that your biological samples be randomised for injection order
– Your data must contain:
• the injection order
• the analytical batches
• the sample type (pool or sample)
One Galaxy tool, various possibilities
You can choose between different approaches by selecting a type of regression model
Various options are available depending on your model choice
What’s different?
• Two strategies implemented
– linear / lowess / loess
– all loess pool / all loess sample
• Distinct graphical output for each strategy
– Different variations of before/after overview
Don’t forget the help section is your friend
● choice of regression model type
● intra-batch correction is conditional on internal quality metrics
● possibility to apply correction based on sample intensities only
How to use this tool
• Mandatory columns in the sample metadata table
– injectionOrder: numerical column of injection order
– sampleType: specifies whether the injection is a pool or a sample (coded “pool” or “sample”)
– batch: categorical column indicating the analytical batches (if there is only one batch, the column must be constant)
• In the data matrix (containing intensities), missing values are allowed only for the all loess methods
• If you want to use the linear / lowess / loess strategy, you can use the “Determine batch correction” tool to help you choose a regression type
This module computes graphics and indicators, but the user remains the only judge of which model is the most appropriate for their data.
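Before launching the correction, the sample metadata can be checked with a short script. The sketch below tests for the three mandatory columns and for the minimum number of pools per batch quoted earlier in this presentation (3 for linear methods, 8 for non-linear ones). The function and the example rows are hypothetical, and the tool itself may perform different checks.

```python
from collections import Counter

REQUIRED_COLUMNS = ("injectionOrder", "sampleType", "batch")
# Minimum number of pools per batch, as quoted in this presentation.
MIN_POOLS = {"linear": 3, "lowess": 8, "loess": 8}

def check_sample_metadata(rows, method="linear"):
    """Return a list of problems found in the sample metadata rows (list of dicts)."""
    missing = [c for c in REQUIRED_COLUMNS if any(c not in row for row in rows)]
    if missing:
        return [f"missing columns: {missing}"]
    problems = []
    pools_per_batch = Counter(row["batch"] for row in rows if row["sampleType"] == "pool")
    for batch in sorted({row["batch"] for row in rows}):
        if pools_per_batch[batch] < MIN_POOLS[method]:
            problems.append(
                f"batch {batch}: {pools_per_batch[batch]} pools, "
                f"at least {MIN_POOLS[method]} needed for the {method} method"
            )
    return problems

# Hypothetical single-batch sequence: a pool is injected every third injection.
rows = [
    {"injectionOrder": i, "sampleType": "pool" if i % 3 == 1 else "sample", "batch": "B1"}
    for i in range(1, 8)
]

print(check_sample_metadata(rows, method="linear"))
print(check_sample_metadata(rows, method="loess"))
```

With three pools in the batch, the linear method passes and the loess method is flagged.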
How to use this tool
• Parameters
– Span (not available for ‘linear’ method): smoothing parameter for lo(w)ess regression
[Figures: quite a smooth curve (span=1); not smooth at all (span=0.3)]
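The role of the span can be illustrated with a simplified local regression: at each injection number, a straight line is fitted to the nearest fraction `span` of the points, weighted by their distance. This sketch is a bare-bones variant written for illustration (tricube weights, no robustness iterations, distinct x values assumed). It is not the routine used by the Batch Correction tool.

```python
import math

def loess_point(xs, ys, x0, span):
    """Locally weighted linear estimate at x0, using the nearest span * n points."""
    n = len(xs)
    k = min(n, max(3, math.ceil(span * n)))     # number of neighbours used
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # distance to the k-th nearest point
    # Tricube weights: 1 at x0, decreasing to 0 at distance h and beyond.
    weights = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    total = sum(weights)
    x_bar = sum(w * x for w, x in zip(weights, xs)) / total
    y_bar = sum(w * y for w, y in zip(weights, ys)) / total
    if sum(1 for w in weights if w > 0) < 2:
        return y_bar  # a single point carries all the weight: no slope to estimate
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(weights, xs, ys))
    return y_bar + (sxy / sxx) * (x0 - x_bar)

# Hypothetical pool intensities for one ion, with one outlying injection (order 5).
orders = list(range(1, 11))
pools = [10.0, 10.0, 10.0, 10.0, 20.0, 10.0, 10.0, 10.0, 10.0, 10.0]

for span in (1.0, 0.3):
    curve = [round(loess_point(orders, pools, x, span), 2) for x in orders]
    print(span, curve)
```

With a small span the curve follows the individual pool values closely, outlier included. With span=1 each estimate draws on almost all the pools, so the outlier is largely smoothed out.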
How to use this tool
• Parameters
– Null values (not available for ‘all loess’ strategy): what to do when negative or infinite intensity values are generated during calculations
– Factor of interest (not available for ‘all loess’ strategy): a categorical column in sample metadata table, used to have a quick graphical overview of the effect of normalisation on this variable in the data; this does not affect correction calculation
– Level of details for plots (not available for ‘all loess’ strategy): simply to choose the amount of graphical output to produce in the pdf file
Plot colouring depends on one of three factors: batch, sample type, or factor of interest
Graphical output: linear/lowess/loess
Graphical output: all_loess
CHECKING FOR QUALITY
Using your pools to check your data
• What to check
– Coefficient of variation: $CV = \sigma / \mu$, where $\sigma$ = standard deviation and $\mu$ = mean
• used individually or as a ratio
• e.g. pools’ CV is often considered too high if above 0.3
• e.g. the ratio between pools’ CV and samples’ CV may be considered too high if above 1 (pools are more variable than samples)
• global boxplot available in Batch Correction output with linear/loess/lowess methods
– Correlation with pool dilutions: “Does intensity evolve according to dilution?”
• Pearson’s correlation coefficient
• Requires pool dilutions to have been injected
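Both indicators are easy to compute by hand for a single ion, as in the sketch below. The intensity values are hypothetical, and the sample standard deviation (n - 1 denominator) is used, which is an assumption about the exact definition. The Quality Metrics module may compute these indicators differently.

```python
from statistics import mean, stdev

def cv(values):
    """Coefficient of variation: standard deviation divided by mean."""
    return stdev(values) / mean(values)

def pearson(xs, ys):
    """Pearson's correlation coefficient between two series of equal length."""
    x_bar, y_bar = mean(xs), mean(ys)
    num = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    den = (sum((x - x_bar) ** 2 for x in xs) * sum((y - y_bar) ** 2 for y in ys)) ** 0.5
    return num / den

# Hypothetical intensities of one ion.
pool_values = [980.0, 1010.0, 1000.0, 1020.0, 990.0]
sample_values = [500.0, 1500.0, 900.0, 2100.0, 1200.0]

pool_cv = cv(pool_values)
cv_ratio = pool_cv / cv(sample_values)
print("pool CV:", round(pool_cv, 3), "too high" if pool_cv > 0.3 else "acceptable")
print("pool CV / sample CV:", round(cv_ratio, 3), "too high" if cv_ratio > 1 else "acceptable")

# Hypothetical pool dilutions (dilution factor, intensity): does intensity follow dilution?
dilution_factors = [0.25, 0.5, 1.0]
dilution_values = [250.0, 510.0, 1000.0]
print("correlation with dilution:", round(pearson(dilution_factors, dilution_values), 3))
```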
Using your pools to check your data
Use the Quality Metrics module to compute your indicators
See the module Help section or the corresponding HowTo for more information
Note: this module can be used even without pools, since it computes other interesting quality information and graphics