LC-MS/MS Raw Spectral Data Processing

Zhiqiang Pang, Jeff Xia

2023-10-02

1. Overview of LC-MS/MS raw data processing workflow

Global or untargeted lipidomics is increasingly used to investigate changes of lipids in various biological or environmental systems in an unbiased manner. Liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS) has become the main workhorse for global metabolomics. The typical LC-HRMS metabolomics workflow involves spectra collection, raw data processing, statistical and functional analysis.

MetaboAnalystR is a R package, synchronized with the popular MetaboAnalyst website, designed for comprehensive metabolomic data analysis, visualization, and interpretation.

Since the version 2, MetaboAnalystR has supported raw spectral data processing, and linked the MS data processing results directly for functional analysis. The raw spectral data procesing was depending on XCMS. Given that there are many parameters to be optimized for centWave algorithm, which is implemented in XCMS. In version 3, MetaboAnalystR implemented an auto-optimized workflow to optimize all critical parameters to obtain an optimal results. Meanwhile, all functions in MetaboAnalystR for raw spectral processing have been well-organized into a companion R package, OptiLCMS. In the latest version (4.0), MetaboAnalystR starts supporting data deconvolution on MS2 spectral data, including data-dependent acquisition (DDA) and SWATH-data independent acquisition (DIA).

The aim of this vignette is to provide an overview on how to use MetaboAnalystR (together with OptiLCMS) to perform LC-MS/MS raw spectral metabolomics data processing. This vignette mainly includes the installation of OptiLCMS, MS data import, Parameters’ optimization, MS/MS data import, data deconvolution, spectral concensus of replicates, MS/MS spectral database searching and results export.

The general workflow of MetaboAnalystR for raw LC-MS/MS data processing is depicted below. The general data processing steps include two parts: MS feature detection and MS/MS data based chemical identification. The whole workflow are summarized in Figure 1.

Figure 1. Workflow for LC-MS/MS data processing.

Figure 1. Workflow for LC-MS/MS data processing.

1.0 Loading the package

After following the installation instructions on the OptiLCMS Github page, you will be ready to use the package. Use the library() function to load the package into R. Please make sure MetaboAnalystR has already been installed. If not, please refer to the installation guidance.

# Load MetaboAnalystR
library(MetaboAnalystR)
# Load OptiLCMS
library(OptiLCMS)

1.1 Tips for using the MetaboAnalystR package for raw spectral processing

  1. Different from other statistic or functional analysis moduel, the first function that you will use for raw spectral data processing is NOT the InitDataObjects function;

  2. You can follow our latest protocols to practice the example data or process your own dataset.

1.2 Data Format for Raw LC-MS/MS spectral data

For LC-MS/MS based global (untargeted) metabolomics raw data processing, MetaboAnalystR only accepts open-source formats (.mzML/.mzXML/.mzData/.cdf).

All vendor formats (.raw/.wiff/.D/.d/.RAW etc.) must be converted and centroided into a open-source format (e.g. *.mzML, highly recommended) at first. In addition, different from MetaboAnalyst website, there is no limitation on file size and file number.

User could process unlimited raw spectral data as long as the RAM of their local PC permits.

All MS data are converted and centroided by using MSconvert (Proteowizard, version 3.0.23046) into mzML format. All empty scans are removed in this step.

2. Example raw spectral data processing (MS1)

2.0 Download example spectra

In this example, we are going to use an experimental Malaria metabolomics dataset (UPLC-Q/E-ESI+, HILIC) between two immune status (Native vs. Semi-immune) from Li et al. There are 12 samples (2 groups, 6 samples for each) and 3 QCs included in this dataset. In this tutorial, we are going to use the auto-optimized pipeline to process MS1 data step by step.

download.file("https://www.xialab.ca/api/download/metaboanalyst/malaria_r_example.zip",
              destfile = "malaria_raw.zip",
              method = "curl")
unzip("malaria_raw.zip", exdir = "upload")

MS1 raw spectral processing include the following four steps: 1. (optional) automatic parameters’ optimization; 2. Raw spectral import; 3. MS1 data processing (peak picking, alignment and gap filling); 4. Peak annotation (as adducts and isotopes).

2.1 Region of Interest (ROI) extraction

The steps to extract an Region of Interest (ROI) for parameters optimization. In this step, you need to specify data files which are used to extract ROIs and optimize parameters. As for the mechanism, in brief, the algorithm first scans the whole spectra across m/z and retention time dimensions to select several ROIs that are enriched for real peaks. Second, these ROIs are then extracted as new synthetic spectra. Finally, a DoE model is used to optimize peak picking parameters based on the synthetic spectra. More technical details can be found in MetaboAnalystR 3.0 paper.

rt.idx refers to the retained percentage of chromatogram in retention time dimension. Users can also adjust the percentage based on their needs, but they are suggested to use more than 50% (0.5) to include larger ROI and improve the accuracy of following parameters’ optimization.

rmConts means “remove contamination” before ROI extraction. This option allows users to exclude strong instrumental noise or potential contaminants present during the chromatographic run for parameters’ optimization. User could enable this option (as TRUE) if their data contains significant noise/contamination.

It is noted that step 2.1.1 and 2.1.2 is only used for auto-optimized workflow. If user would like to configure the parameters themselved based on their expertise and experience, they can skip these two sections and jump into 2.1.4 directly.

# Load MetaboAnalystR and OptiLCMS
library(MetaboAnalystR)
library(OptiLCMS)

# Here, we extract ROIs from 3 QC samples.
DataFiles <- list.files("upload/QC/", full.names = TRUE)
mSet <- PerformROIExtraction(datapath = DataFiles, rt.idx = 0.9, rmConts = TRUE);
## Step 0/6: Scanning ROIs for parameters optimization...
## QC_001.mzML
## QC_003.mzML
## QC_005.mzML
## Data Loading...
## Raw file import begin...
## Import data: QC_001.mzML finished!
## Import data: QC_003.mzML finished!
## Import data: QC_005.mzML finished!
## Data Loaded !
## Empty Scan detecting...
## No Empty scan found !
## Identifying regions of interest (ROI)...
## Identifying regions of potential contaminants........Done!
## 70 potential contaminamts will not be used for parameters optimization !
## Going to the next step...
## MS data Preparing...
## MS Data ready !
## Identifying ROIs in m/z dimensions...
## Identifying ROIs in m/z dimensions Done !
## Identifying ROIs in RT dimensions...
## Chromatogram Plotting Begin...
## Identification on ROIs Finished!
Figure 2. Extracted ROI for parameters’ optimization.
Figure 2. Extracted ROI for parameters’ optimization.

To import binned spectral data, users will use the same Read.TextData functions in the above two examples. However, binned spectral data is typically of lower quality than compound concentration data and therefore requires a further step of data preprocessing to convert the data into a usable data matrix. For instance, binned spectral data often contains background noise and requires filtering of such unwanted variations. You can find an example here.

2.2 Auto-optimization of parameters

In this example, we will execute PerformParamsOptimization function to optimize the critical parameters of peak picking and alignment for the following data processing in an automatic way. It utilizes the extracted ROI data and the internal instrument-specific parameters. Parallel computing can be enabled. The number of cores user want to use could be specified. More RAM will be consumed if the parallel core is set high.

Parameters optimization with a design of experiment (DoE) strategy. Initial parameters are from the optimized parameters above or the internal default parameters from embedded SetPeakParam function.

Here is some recommendations for parallel core setting, 64GB RAM ~ 6 cores; 32GB RAM ~ 4 cores; 16GB RAM ~ 2 cores. Minimum 16GB RAM is required for raw spectral processing.

# Here we use PerformParamsOptimization to optimize parameters based on 
# the extracted ROI (stored in 'mSet') before process the entire dataset

best_params <- PerformParamsOptimization(mSet, param = SetPeakParam(platform = "UPLC-Q/E"), ncore = 4);
## 
## Step 1/6: Start to optimize parameters! 
##                 
## This step may take a long time...
## DoE Optimization Starting Now... (2023-10-02 13:42:23)
## Evaluating Noise level...
## ppm is estimated as : 4.3
## Done!
## Preparing Parameters for optimization finished !
## Data Spliting Finished !
## Peak Preparing Begin...
## Peak Preparing Done !
## Round:1
## DoE Running Begin...
## 100% |
## Round 1 Finished !
## Model Parsing...
## Gaussian peak ratio (%): 39.3
## Model Parsing Done !
## Round:2
## DoE Running Begin...
## 100% |
## Round 2 Finished !
## Model Parsing...
## Gaussian peak ratio (%): 42.6
## Model Parsing Done !
## Round:3
## DoE Running Begin...
## 100% |
## Round 3 Finished !
## Model Parsing...
## Gaussian peak ratio (%): 34.3
## Model Parsing Done !
## No Increase Stopping !
## best parameter settings:
## min_peakwidth:  5
##  max_peakwidth:  15.75
##  mzdiff:  0.036
##  snthresh:  7.1
##  bw:  2
##  Peak_method:  centWave
##  RT_method:  loess
##  ppm:  4.3
##  noise:  6772
##  prefilter:  2
##  value_of_prefilter:  11556.95
##  minFraction:  0.8
##  minSamples:  1
##  maxFeatures:  100
##  fitgauss:  FALSE
##  mzCenterFun:  wMean
##  integrate:  1
##  extra:  1
##  span:  0.4
##  smooth:  loess
##  family:  gaussian
##  verbose.columns:  FALSE
##  polarity:  negative
##  perc_fwhm:  0.6
##  mz_abs_iso:  0.005
##  max_charge:  2
##  max_iso:  2
##  corr_eic_th:  0.85
##  mz_abs_add:  0.001
##  rmConts:  TRUE
##  BlankSub:  TRUE
## Step 1/6: Parameters Optimization Finished ! (2023-10-02 14:27:32)
## Time Spent In Total:45.2mins

2.3 Importing example data

For raw spectral processing, MetaboAnalystR accepts open-source formats (including mzML, NetCDF, mzDATA, mzXML) data. There are two options for users to input the metadata information (grouping information, e.g. diseases/healthy groups) on all files:

1). Save all files of the same group into a sub-folder. There must be at least 2 sub folders (excluding QC sample);

2). Use a meta data file for parameter, metadata. A phenotype data frame or a absolute path of the metadata file (.txt) for all samples. In the option, first column should be the sample name, while second column is the corresponding group name. If ommited, all samples in the same sub-folder will be considered as one group.

# "path" is used to specify the path to the folder containing the raw MS spectra to be processed.
# BPI and TIC plotting can be enabled with parameter, 
# "plotSettings = SetPlotParam(Plot = T)", or disabled by changing "T" into "F";

mSet <- ImportRawMSData(path = c("upload"), plotSettings = SetPlotParam(Plot = T))
## No initialized mSet found, will initialize one automatically !
## Warning in ImportRawMSData(path = c("upload"), plotSettings = SetPlotParam(Plot
## = T)): No metadata provided, all files in one sub-folder will be considered a
## group!
## 
## Step 2/6: Start to import the spectrum! 
## This step will take a short time...
## Raw file import begin...
## To reduce memory usage BPIS and TICS plots will be created using only 10 samples per group.
## Plotting BPIS and TICS.
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).
## Step 2/6: Successfully imported raw MS data! (2023-10-02 14:27:43) 
## Going to the next step...
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).

2.4 Raw spectral data processing

The next step is to detect peaks (also known as peak picking) from the centroid data. Multiple algorithms have been developed to identify peaks in different dimensions such as retention time. Among them, the centWave algorithm implemented in XCMS has been shown to perform well in processing LC–HRMS spectra. After peak detection, peak alignment is performed to address retention time variations across spectra. These aligned peaks form a peak intensity table with varying proportions of missing values. These missing values indicate that either peak detection failed or the corresponding feature is absent from the respective sample. Therefore, the final step is ‘gap filling’ by reperforming direct peak extraction on corresponding regions in the raw spectra.

The PerformPeakProfiling function is wrapper of raw spectral processing functions that performs peak detection, alignment, and grouping in an automatical step. The parameters from optimization step above can be used directly in this step.

The function also generates two diagnostic plots including statistics on the total intensity of peaks in different samples, a retention time adjustment map, and a PCA plot showing the overall sample clustering prior to data cleaning and statistical analysis.

# "mSet" include complete raw MS spectra to be processed.
# "params" is using the "best_params" generated above
# Plotting functions can be enabled with parameter, 
# "plotSettings = SetPlotParam(Plot = T)", or disabled by changing "T" into "F";
mSet <- PerformPeakProfiling(mSet, Params = best_params, plotSettings = SetPlotParam(Plot=TRUE))
## 6 CPU Threads will be used for peak profiling !
## 
## Step 3/6: Started peak picking! This step will take some time...
## Step 3/6: Peak picking finished ! (2023-10-02 14:28:39)
## 
## Step 4/6: Started peak alignment! This step is running...
## Total of 9517 slices detected for processing... Done ! 
## Going to the next step...
## No blank sample included based on grouping info. Skipped!
## PeakGroup -- loess is used for retention time correction.....
## Performing retention time correction using 1321 peak groups.
## Applying retention time adjustment to the identified chromatographic peaks ... Done !
## Total of 9517 slices detected for processing... Done ! 
## Going to the next step...
## Step 4/6: Peak alignment finished ! (2023-10-02 14:29:28)
## 
## Step 5/6: Started peak filling! This step may take some time...
## Starting peak filling!
## Defining peak areas for filling-in....OK
## Start integrating peak areas from original files...
## Step 5/6: Peak filing finished ! (2023-10-02 14:29:58)
## Peak Profiling finished successfully !
## Begin to plotting figures...

Figure 4. Raw spectral profiling results (Feature stats; Retention time adjustment; BPI after RT adjustment).Figure 4. Raw spectral profiling results (Feature stats; Retention time adjustment; BPI after RT adjustment).

## Done !
Figure 4. Raw spectral profiling results (Feature stats; Retention time adjustment; BPI after RT adjustment).
Figure 4. Raw spectral profiling results (Feature stats; Retention time adjustment; BPI after RT adjustment).

2.5 Feature annotation

A typical LC–HRMS spectrum of common biofluids (such as blood or urine samples) can often produce >10,000 peaks. However, this number is not equivalent to the number of compounds detected. The correspondence between the number of peaks and the number of actual metabolites remains elusive38. Multiple peaks can be derived from the same compounds. These are real, biologically relevant peaks and might result from the formation of adducts, incorporation of isotopes, or fragmentation during sample preparation or LC–MS analysis. The remaining peaks are now considered to be largely from background noise (artifacts or noise peaks)39. Therefore, the first step in peak annotation aims to identify real peaks, and to clarify the relationships among them. Many empirical and statistical rules have been developed to address this problem, including CAMERA40 and CliqueMS41, which are two popular R packages. Peak annotation in MetaboAnalyst 5.0 is currently based on CAMERA.

The PerformPeakAnnotation function annotates isotope and adduct peaks with the method from CAMERA (PMID: 22111785). It outputs the result as a CSV file (“annotated_peaklist.csv”) and saves the annotated peaks to mSet object. Finally, the peak list is formatted to the correct structure for MetaboAnalystR and filtered based upon user’s specifications using the FormatPeakList function. This function permits the filtering of adducts (i.e. removal of all adducts except for [M+H]+/[M-H]-) and filtering of isotopes (i.e. removal of all isotopes except for monoisotopic peaks). The goal of filtering peaks is to remove degenerative signals and reduce the file size.

# We firstly define the parameters for feature annotation

# 'polarity' is required, can be either 'negative' or 'positive';
# 'perc_fwhm' is used to set the percentage of the width of the FWHM for peak grouping. 
#              Default is set to 0.6;
# 'mz_abs_iso' is used to set the allowed variance for the search (for isotope annotation). 
#              The default is set to 0.005;
# 'max_charge' is set the maximum number of the isotope charge. 
#              For example, the default is 2, therefore the max isotope charge is 2+/-;
# 'max_iso' is used to set the maximum number of isotope peaks.
#              For example, the default is 2, therefore the max number of isotopes per peaks is 2;
# 'corr_eic_th' is used to set the threshold for intensity correlations across samples. 
#              Default is set to 0.85.
# 'mz_abs_add' is used to set the allowed variance for the search (for adduct annotation). 
#              Default is set to 0.001.
# 'adducts' is used to specify the adducts based on your instrument settings.

annParams <- SetAnnotationParam(polarity = 'positive', mz_abs_add = 0.015);
 
# "mSet" include processed raw MS spectra to be processed.
# "annParams" is the parameters used for annotation

mSet <- PerformPeakAnnotation(mSet, annParams)
## 
## Step 6/6: Starting Peak Annotation...
## Start grouping after retention time.
## Created 109 pseudospectra.
## Generating peak matrix...
## Run isotope peak annotation..
## Found isotopes:716
## Start grouping after correlation...
## Generating EIC's....
## Detecting  Naive_007.mzML  ... 477 peaks found!
## Detecting  Naive_027.mzML  ... 24 peaks found!
## Detecting  Naive_071.mzML  ... 50 peaks found!
## Detecting  Naive_109.mzML  ... 60 peaks found!
## Detecting  Naive_127.mzML  ... 159 peaks found!
## Detecting  Naive_139.mzML  ... 0 peaks found!
## Detecting  QC_001.mzML  ... 1299 peaks found!
## Detecting  QC_003.mzML  ... 2 peaks found!
## Detecting  QC_005.mzML  ... 25 peaks found!
## Detecting  Semi_025.mzML  ... 651 peaks found!
## Detecting  Semi_045.mzML  ... 404 peaks found!
## Detecting  Semi_061.mzML  ... 708 peaks found!
## Detecting  Semi_091.mzML  ... 553 peaks found!
## Detecting  Semi_143.mzML  ... 687 peaks found!
## Detecting  Semi_157.mzML  ... 88 peaks found!
## Warning: Found NA peaks in selected sample.
## Calculating peak correlations in 109 Groups... 
## Calculating graph cross linking in 109 Groups...
## New number of ps-groups:  2589
## mSet has now 2589 groups, instead of 109 !
## Generating peak matrix for peak annotation!
## Polarity is set in annotaParam: positive
## Ruleset could not read from object! Recalculating...
## Calculating possible adducts in 2589 Groups... 
## Step 6/6: Successfully performed peak annotation! (2023-10-02 14:31:02) 
## Going to the final step...

2.6 Feature table generation

# Here we format and filter the peak list for following analysis with MetaboAnalystR

# Parameters are explained as below,
# annParams, is the object created using the SetAnnotationParam function above;
# filtIso, is used to decide to filter out all isotopes (TRUE) or not (FALSE);
# filtAdducts, is used to decide to filter out all adducts (TRUE) or not (FALSE);
# missPercent, specify the threshold to remove features missing in a certain percentage
#              of all samples in a group.

mSet <- FormatPeakList(mSet, annParams, filtIso = FALSE, filtAdducts = FALSE, missPercent = 1);
Figure 5. 2D PCA score plot of all features (colored based on group information)
Figure 5. 2D PCA score plot of all features (colored based on group information)

Here, we need to export all results for following analysis, such as statistic analysis, functional analysis etc.

# Export annotation results, the annotation will be save as "annotated_peaklist.csv";
Export.Annotation(mSet);

# Export complete feature table. It will be saved as "metaboanalyst_input.csv";
# This table can be used for statistic analysis, functional analysis, biomarker analysis module directly.
Export.PeakTable(mSet);
## Done!
## 
## Everything has been finished Successfully ! (2023-10-02 14:31:02)
# Export a summary table (peak_result_summary.txt) to summarize the information of all peaks in
# different samples. The columns are sample names, groups, retention time range, m/z range of all peaks,
# number of all peaks and percentage of missing features.
Export.PeakSummary(mSet)

3. Example of DDA MS/MS data processing

The wide application of high-resolution MS has greatly improved the accuracy of mass measurement, but it is still inadequate to identify all ions at the MS1 level. Therefore, LC-MS is typically performed with MS1 full scan coupled with MS/MS to achieve high-throughput quantification and compound annotation. MS/MS spectra can be generated using data-dependent acquisition (DDA) or data-independent acquisition (DIA) methods. DDA acquires MS/MS spectra by fragmentation of precursor ions selected using a relatively narrow MS/MS isolation window (e.g., 1 m/z). Although DDA spectra are directly linked to precursors, recent studies show that > 50% of them are ‘chimeric’ and need to be deconvolved before searching any reference database.

In this case study, we are implementing a small DDA dataset from a COVID-19 metabolomics study.

3.0 Download example DDA MS/MS spectra

In this example, we are including a small sub-dataset from a COVID-19 metabolomics study (Peng W. et al). This small dataset includes 6 samples (3 Mild samples and 3 fetal samples) for quick learning and demo purpose. User could download the compelete dataset.

# There are 6 raw spectral files (.mzML) and 1 MS feature table included in the downloaded folder
# The MS feature table was generated with the MS1 data processing pipeline (as described in section 2)

download.file("https://www.xialab.ca/api/download/metaboanalyst/ms2_dda_example.zip",
              destfile = "ms2_dda_example.zip",
              method = "curl")
unzip("ms2_dda_example.zip", exdir = "ms2_dda")

3.1 Download MS/MS spectra reference database

MeaboAnalystR includes comprehensive MS/MS reference database. The reference spectra database is not contained in the R package itself. But it is fully accessible and can be downloaded from MetaboAnalyst server. MetaboAnalystR offers comprehensive database options to facilitate high-throughput MS/MS spectra processing and compound annotation. A total of five databases are provided, including pathway compound database, biological compound database, lipids database, exposomics database and the complete database. All these databases are curated from public MS and MS/MS data, including HMDB, MoNA Series, LipidBlast, MassBank, GNPS, LipidBank, MINEs, LipidMAPs, KEGG.

All public MS/MS reference databases have been curated and organized into different options for MetaboAnalystR:

1). Complete Database, including all MS/MS reference records (~7.2GB) Download, Neutral Loss;

2). Biology Database, including compounds have been reported to participate biological mechanism (744MB) Download, Neutral Loss;

3). Pathway Database, including KEGG pathways related compounds from 120 common species (138MB) Download, Neutral Loss;

4). Lipids Database, including Lipids and lipid-like compounds (1.9GB) Download, Neutral Loss;

5). Exposomics Database, including currently reported exposome related compounds (1.5GB) Download, Neutral Loss;

6). HMDB experimental Database, including only HMDB experimental database (for learning purpose, 13MB) Download, Neutral Loss;

## Here we are downloading biology MS/MS reference database.

download.file("https://www.xialab.ca/api/download/metaboanalyst/MS2ID_Bio_v09102023.zip",
              destfile = "MS2ID_Bio.zip",
              method = "curl")
unzip("MS2ID_Bio.zip", exdir = "ms2_db")

3.2 Importing DDA MS/MS spectra

## We load MetaboAnalystR and OptiLCMS
library(MetaboAnalystR)
library(OptiLCMS)

## Clean the environment first
rm(list = ls())

## Read the MS1 feature table as the target features for MS/MS data processing
## This table include four columns (mzmin, mzmax, rtmin, rtmax)
## mzmin and mzmax is the minimum and the maximum value of m/z for the feature;
## rtmin and rtmax is the minimum and the maximum value of retention time for the feature;
ft_dt <- qs::qread("ms2_dda/ms2_dda_example/ft_dt.qs")

## Here we use function PerformMSnImport to read MS/MS data
## This step may take seconds to minutes depending on the size of your dataset
mSet <- PerformMSnImport(filesPath = c(list.files("ms2_dda/ms2_dda_example/",
                                                  pattern = ".mzML",
                                                  full.names = T, recursive = T)),
                         targetFeatures = ft_dt,
                         acquisitionMode = "DDA")
## Importing raw MS data .. done.

3.3 DDA MS/MS spectra Deconvolution

For DDA spectral data deconvolution, MetaboAnalystR assigns all MS/MS spectra of an individual spectral data into different feature groups based on precursors’ information (m/z, retention time), and chimeric status is evaluated based on the nearest MS scan.

The MS/MS spectra of all ions (including main precursor and other contaminating ions within the isolation window and above the intensity threshold) are extracted from reference libraries as candidate spectra. If any reference spectrum is missing, a predicted spectrum will be generated using a similarity-network model. All candidates are then used to obtain the deconvolved spectrum based on elastic-net regression with extra penalties to the predicted spectra.

# In this step, we are processing the DDA spectra deconvolution;

# The deconvolution is based on the Biology MS/MS spectral database;
# Parallel computing is supported, users are encouraged to use multiple cores to speed up;
# This step may take minutes to hours to finish, depending on the size of datasets

system.time(mSet <- PerformDDADeconvolution(mSet,
                                ppm1 = 5,
                                ppm2 = 10,
                                sn = 12,
                                filtering = 0,
                                window_size = 1.5,
                                intensity_thresh = 1.6e5,
                                database_path = "ms2_db/MS2ID_Bio_v09102023.sqlite",
                                ncores = 6L))
## Loading required package: parallel
## 
## Deconvolution executed successfully!
##    user  system elapsed 
##   0.888   0.172 163.956

3.4 Spectrum consensus of replicates

MS/MS data acquisition with multiple replicates is common. In such cases, all deconvolved M/MS spectra corresponding to the same MS1 peak must be processed to generate a single consensus spectrum. If there are no replicates, this step is skipped. All MS/MS fragments across different replicates are initially summarized by count.

If the frequency of an individual fragment is above a user-defined threshold (e.g., 50%), it is kept; otherwise, the fragment is removed. Optionally, a database-assisted spectrum consensus in MetaboAnalystR can be used to avoid potential over-deletion. If database-assisted option is enabled by users, all spectra of the precursor are extracted as referent list (L) from the reference library. All fragments not meeting the (user-defined) frequency threshold are then searched against L.

If this fragment can be found from L and the frequency of the fragment across the replicates is over 2, the fragment is kept; otherwise, it is discarded. All kept fragments are normalized and merged to generate a consensus spectrum for database searching in the next step. The spectrum consensus can be achieved with the function, PerformSpectrumConsenus.

mSet <- PerformSpectrumConsenus (mSet,
                                 ppm2 = 15,
                                 concensus_fraction = 0.2,
                                 database_path = "",
                                 use_rt = FALSE,
                                 user_dbCorrection = FALSE)

3.5 Reference database searching

Reference library searching in MetaboAnalystR is based on the m/z (and optionally, the information of retention time) of precursors. All matches are extracted from database for formula prediction. MetaboAnalystR uses the same scoring rule as MS-DIAL. The matching score is calculated using the following formula:

Similarity_score = (MS/MS similarity + m/z similarity + RT similarity + Isotope Similarity x 0.5)/3.5.

where the MS/MS similarity (ranging from 0 to 1) is calculated using the popular dot product similarity or spectral entropy similarity algorithms. MS1 similarity and retention time (RT) similarity (both ranging from 0 to 1) are calculated with exponential distribution method based on deviation. Isotope similarity is also calculated using a similar method as implemented in MS-DIAL. However, the calculation of isotope distribution similarity only considers [M+n] (n<3), as the intensity of isotopes [M+n] (n≥3) is very low and highly variant to be considered. Briefly, the isotope distribution similarity evaluation is performed based on the experimental isotope distribution and the theoretical distribution of formulas extracted from the reference library. Isotope elements considered here include carbon (C13), hydrogen (H2), nitrogen (N15), oxygen (O17, O18) and sulfur (S33, S34). Other elements are not considered due to their extremely low abundance in nature. Retention time is optionally used based on users’ request. If retention time is disabled (by default), the retention time similarity is not calculated, and denominator is modified to 2.5.

The database searching is performed based on SQLite query, and can be achieved with the function, PerformDBSearchingBatch.

# PerformDBSearchingBatch is used to seatching MS/MS reference library
# Results will be scored based on the similarity rules above
# Parallel computing is allowed. CPU cores used are controlled by argument "ncores".

mSet <- PerformDBSearchingBatch (mSet,
                                 ppm1 = 10,
                                 ppm2 = 25,
                                 rt_tol = 5,
                                 database_path = "ms2_db/MS2ID_Bio_v09102023.sqlite",
                                 use_rt = FALSE,
                                 enableNL = FALSE,
                                 ncores = 6L)
## ==== Database searching against MS2ID_Bio_v09102023.sqlite started ====
## 
## ============ Searching finished successfully! ============

3.6 Results export

After the reference library searching, all compound identification results can be exported as a data frame and saved as a .csv or .txt file. The exported information includes compound names, chemical formula, InChIKeys, and matching scores. If the reference library is lipid library, the exported information also includes lipid classifications (super class, main class and sub-class). The database searching results can be exported using PerformResultsExport function and formatted as a data frame table using FormatMSnAnnotation function.

mSet <- PerformResultsExport (mSet,
                             type = 0L,
                             topN = 10L,
                             ncores = 3L)

======= Converting started. See progress below =======

============ Exporting finished successfully! ============

# 2nd argument, TopN, is the argument used to specifiy the number of compounds to export
# If the dataset is a lipidomics dataset, please set 3rd argument as "TRUE"
# to extract the lipid classification information
# All identified compounds have been summarized as a data.frame (dtx) here

dtx <- FormatMSnAnnotation(mSet, 5L, F)

All identified compounds have been summarized as a data.frame. The results are shown in Table 1 below. Here are the explication of different columns:

1). mzmin, minimum m/z value for this feature;

2). mzmax, maximum m/z value for this feature;

3). rtmin, minimum retention time value for this feature;

4). rtmax, maximum retention time value for this feature;

5). Compound_N, Compound name of the Nth identification. The identification results are sorted based on the scores (decreasing);

6). InChiKey_N, InChiKey of the Nth identification;

7). Formula_N, Formula of the Nth identification;

8). Score_N, Similarity score of the Nth identification. 100 is the perfect match. 0 is the negative match;

9). Database_N, the MS/MS reference library source of the Nth identification;

10). SuperClass_N, Superclass of the Nth identification (only for lipidomics database).

11). MainClass_N, Mainclass of the Nth identification (only for lipidomics database).

12). SubClass_N, Subclass of the Nth identification (only for lipidomics database).

4. Example of SWATH-DIA MS/MS data processing

Compared to DDA MS/MS spectra, Data independent acquisition (DIA) MS/MS provides more coverage on the metabolome. SWATH-DIA is an emerging technique of DIA data acquisition by splitting the acquisition range as sequential windows, and further reduce the complexities of the data processing. Different from the DDA MS/MS data, there is no clear association between precursors and MS/MS spectra. Therefore, the deconvolution step of SWATH-DIA is more important for chemical identification.

The SWATH-DIA MS/MS data processing module of MetaboAnalystR has been developed based on the DecoMetDIA. The entire deconvolution workflow was written using Rcpp/C++ framework and further optimized to enable high-throughput processing. In addition, MetaboAnalystR supports multi-threaded data processing to further speed up the analysis through parallel computing.

In this case study, we are implementing another COVID-19 metabolomcis SWATH-DIA data.

4.0 Download example SWATH-DIA spectra

Similar to the example dataset of DDA, in this example, we are including another small sub-dataset from another COVID-19 metabolomics study (Mahmoud S. et al). This small dataset includes 6 samples (3 COVID-19 samples and 3 healthy controls) for quick learning and demo purpose.

# There are 6 raw spectral files (.mzML) and 1 text file, which include the design of SWATH windows

download.file("https://www.xialab.ca/api/download/metaboanalyst/ms2_dia_example.zip",
              destfile = "ms2_dia_example.zip",
              method = "curl")

unzip("ms2_dia_example.zip", exdir = "ms2_dia")
## Load OptiLCMS
library(OptiLCMS)

## Construct meta data
meta_dt <- data.frame(samples = c("210210-SWATH-NEG-Covid-Cov-16.mzML", 
                                  "210210-SWATH-NEG-Covid-Cov-17.mzML", 
                                  "210210-SWATH-NEG-Covid-Cov-18.mzML",
                                  "210210-SWATH-NEG-Covid-Ct-1.mzML",
                                  "210210-SWATH-NEG-Covid-Ct-2.mzML",
                                  "210210-SWATH-NEG-Covid-Ct-3.mzML"),
                      groups = c(rep("COVID",3), rep("Control",3)))

## Import raw MS1 data
mSet1 <- ImportRawMSData(path = "ms2_dia/ms2_dia_example/", metadata = meta_dt)
## No initialized mSet found, will initialize one automatically !
## 
## Step 1/6: Start to import the spectrum! 
## This step will take a short time...
## Raw file import begin...
## Step 1/6: Successfully imported raw MS data! (2023-10-02 14:41:40) 
## Going to the next step...
## Process MS1 peaks by using customized parameters
mSet1 <- PerformPeakProfiling(mSet1, 
                              Params = SetPeakParam(ppm = 25,
                                                    bw = 3,
                                                    mzdiff = 0.001,
                                                    max_peakwidth = 35,
                                                    min_peakwidth = 5,
                                                    noise = 200, 
                                                    minFraction = 0.2),
                              ncore = 6,
                              plotSettings = SetPlotParam(Plot = F))
## 6 CPU Threads will be used for peak profiling !
## 
## Step 3/6: Started peak picking! This step will take some time...
## Step 3/6: Peak picking finished ! (2023-10-02 14:41:53)
## 
## Step 4/6: Started peak alignment! This step is running...
## Total of 7535 slices detected for processing... Done ! 
## Going to the next step...
## No blank sample included based on grouping info. Skipped!
## PeakGroup -- loess is used for retention time correction.....
## Performing retention time correction using 1306 peak groups.
## Applying retention time adjustment to the identified chromatographic peaks ... Done !
## Total of 7535 slices detected for processing... Done ! 
## Going to the next step...
## Step 4/6: Peak alignment finished ! (2023-10-02 14:42:00)
## 
## Step 5/6: Started peak filling! This step may take some time...
## Starting peak filling!
## Defining peak areas for filling-in....OK
## Start integrating peak areas from original files...
## Step 5/6: Peak filing finished ! (2023-10-02 14:42:25)
## Peak Profiling finished successfully !
## Begin to plotting figures...
## Annotation
annParams <- SetAnnotationParam(polarity = 'negative',
                                mz_abs_add = 0.035);

mSet1 <- PerformPeakAnnotation(mSet = mSet1,
                               annotaParam = annParams,
                               ncore =1)
## 
## Step 6/6: Starting Peak Annotation...
## Start grouping after retention time.
## Created 66 pseudospectra.
## Generating peak matrix...
## Run isotope peak annotation..
## Found isotopes:98
## Start grouping after correlation...
## Generating EIC's....
## Detecting  210210-SWATH-NEG-Covid-Cov-16.mzML  ... 334 peaks found!
## Detecting  210210-SWATH-NEG-Covid-Cov-17.mzML  ... 1366 peaks found!
## Detecting  210210-SWATH-NEG-Covid-Cov-18.mzML  ... 646 peaks found!
## Detecting  210210-SWATH-NEG-Covid-Ct-1.mzML  ... 637 peaks found!
## Detecting  210210-SWATH-NEG-Covid-Ct-2.mzML  ... 142 peaks found!
## Detecting  210210-SWATH-NEG-Covid-Ct-3.mzML  ... 56 peaks found!
## Warning: Found NA peaks in selected sample.
## Calculating peak correlations in 66 Groups... 
## Calculating graph cross linking in 66 Groups...
## New number of ps-groups:  2241
## mSet has now 2241 groups, instead of 66 !
## Generating peak matrix for peak annotation!
## Polarity is set in annotaParam: negative
## Ruleset could not read from object! Recalculating...
## Calculating possible adducts in 2241 Groups... 
## Step 6/6: Successfully performed peak annotation! (2023-10-02 14:42:47) 
## Going to the final step...
## Format MS1 feature table
mSet1 <- FormatPeakList(mSet = mSet1,
                        annParams,
                        filtIso =FALSE,
                        filtAdducts = FALSE,
                        missPercent = 1)
Figure 6. 2D PCA score plot of all features (colored based on group information)
Figure 6. 2D PCA score plot of all features (colored based on group information)
mSet <- PerformMSnImport(mSet = mSet1,
                         filesPath = c(list.files("ms2_dia/ms2_dia_example/", pattern = ".mzML", full.names = T, recursive = T)),
                         acquisitionMode = "DIA",
                         SWATH_file = "ms2_dia/ms2_dia_example/DIA_SWATH_MS_experiment_file_neg.txt")
## Importing raw MS data .. done.
mSet <- PerformDIADeconvolution(mSet,
                                min_width = 5,
                                span = 0.3,
                                ppm2 = 30,
                                sn = 12,
                                filtering = 0,
                                ncores = 6L)

mSet <- PerformSpectrumConsenus (mSet,
                                 ppm2 = 30,
                                 concensus_fraction = 0.25,
                                 database_path = "",
                                 use_rt = FALSE,
                                 user_dbCorrection = FALSE)

mSet <- PerformDBSearchingBatch (mSet,
                                 ppm1 = 15,
                                 ppm2 = 30,
                                 rt_tol = 5,
                                 database_path = "ms2_db/MS2ID_Bio_v09102023.sqlite",
                                 use_rt = FALSE,
                                 enableNL = FALSE,
                                 ncores = 4L)
## ==== Database searching against MS2ID_Bio_v09102023.sqlite started ====
## 
## ============ Searching finished successfully! ============
mSet <- PerformResultsExport (mSet,
                              type = 0L,
                              topN = 10L,
                              ncores = 4L)
## ======= Converting started. See progress below ======= 
## 
## ============ Exporting finished successfully! ============
dtx2 <- FormatMSnAnnotation(mSet, 5L, F)

5. Further analysis

Despite MS/MS spectra data provides more accurate chemical identification results, it is still impossible to find out the exact chemical identities for most MS features, unless chemical standards are utilized. However, MS/MS spectra could efficiently reduce the number of putative identifications on a feature. Integration of MS/MS results into functional analysis could improve the accuracy of pathway prediction. See more details on the Chapter Functional Analysis.