LC-MS/MS Raw Spectral Data Processing

Zhiqiang Pang, Jeff Xia

2023-10-02

1. Overview of LC-MS/MS raw data processing workflow

Global or untargeted lipidomics is increasingly used to investigate changes of lipids in various biological or environmental systems in an unbiased manner. Liquid chromatography coupled to high-resolution mass spectrometry (LC-HRMS) has become the main workhorse for global metabolomics. The typical LC-HRMS metabolomics workflow involves spectra collection, raw data processing, statistical and functional analysis.

MetaboAnalystR is a R package, synchronized with the popular MetaboAnalyst website, designed for comprehensive metabolomic data analysis, visualization, and interpretation.

Since the version 2, MetaboAnalystR has supported raw spectral data processing, and linked the MS data processing results directly for functional analysis. The raw spectral data procesing was depending on XCMS. Given that there are many parameters to be optimized for centWave algorithm, which is implemented in XCMS. In version 3, MetaboAnalystR implemented an auto-optimized workflow to optimize all critical parameters to obtain an optimal results. Meanwhile, all functions in MetaboAnalystR for raw spectral processing have been well-organized into a companion R package, OptiLCMS. In the latest version (4.0), MetaboAnalystR starts supporting data deconvolution on MS2 spectral data, including data-dependent acquisition (DDA) and SWATH-data independent acquisition (DIA).

The aim of this vignette is to provide an overview on how to use MetaboAnalystR (together with OptiLCMS) to perform LC-MS/MS raw spectral metabolomics data processing. This vignette mainly includes the installation of OptiLCMS, MS data import, Parameters’ optimization, MS/MS data import, data deconvolution, spectral concensus of replicates, MS/MS spectral database searching and results export.

The general workflow of MetaboAnalystR for raw LC-MS/MS data processing is depicted below. The general data processing steps include two parts: MS feature detection and MS/MS data based chemical identification. The whole workflow are summarized in Figure 1.

Figure 1. Workflow for LC-MS/MS data processing.

Figure 1. Workflow for LC-MS/MS data processing.

1.0 Loading the package

After following the installation instructions on the OptiLCMS Github page, you will be ready to use the package. Use the library() function to load the package into R. Please make sure MetaboAnalystR has already been installed. If not, please refer to the installation guidance.

# Load MetaboAnalystR
library(MetaboAnalystR)
# Load OptiLCMS
library(OptiLCMS)

1.1 Tips for using the MetaboAnalystR package for raw spectral processing

  1. Different from other statistic or functional analysis moduel, the first function that you will use for raw spectral data processing is NOT the InitDataObjects function;

  2. You can follow our latest protocols to practice the example data or process your own dataset.

1.2 Data Format for Raw LC-MS/MS spectral data

For LC-MS/MS based global (untargeted) metabolomics raw data processing, MetaboAnalystR only accepts open-source formats (.mzML/.mzXML/.mzData/.cdf).

All vendor formats (.raw/.wiff/.D/.d/.RAW etc.) must be converted and centroided into a open-source format (e.g. *.mzML, highly recommended) at first. In addition, different from MetaboAnalyst website, there is no limitation on file size and file number.

User could process unlimited raw spectral data as long as the RAM of their local PC permits.

All MS data are converted and centroided by using MSconvert (Proteowizard, version 3.0.23046) into mzML format. All empty scans are removed in this step.

2. Example raw spectral data processing (MS1)

2.0 Download example spectra

In this example, we are going to use an experimental Malaria metabolomics dataset (UPLC-Q/E-ESI+, HILIC) between two immune status (Native vs. Semi-immune) from Li et al. There are 12 samples (2 groups, 6 samples for each) and 3 QCs included in this dataset. In this tutorial, we are going to use the auto-optimized pipeline to process MS1 data step by step.

download.file("https://www.xialab.ca/api/download/metaboanalyst/malaria_r_example.zip",
              destfile = "malaria_raw.zip",
              method = "curl")
unzip("malaria_raw.zip", exdir = "upload")

MS1 raw spectral processing include the following four steps: 1. (optional) automatic parameters’ optimization; 2. Raw spectral import; 3. MS1 data processing (peak picking, alignment and gap filling); 4. Peak annotation (as adducts and isotopes).

2.1 Region of Interest (ROI) extraction

The steps to extract an Region of Interest (ROI) for parameters optimization. In this step, you need to specify data files which are used to extract ROIs and optimize parameters. As for the mechanism, in brief, the algorithm first scans the whole spectra across m/z and retention time dimensions to select several ROIs that are enriched for real peaks. Second, these ROIs are then extracted as new synthetic spectra. Finally, a DoE model is used to optimize peak picking parameters based on the synthetic spectra. More technical details can be found in MetaboAnalystR 3.0 paper.

rt.idx refers to the retained percentage of chromatogram in retention time dimension. Users can also adjust the percentage based on their needs, but they are suggested to use more than 50% (0.5) to include larger ROI and improve the accuracy of following parameters’ optimization.

rmConts means “remove contamination” before ROI extraction. This option allows users to exclude strong instrumental noise or potential contaminants present during the chromatographic run for parameters’ optimization. User could enable this option (as TRUE) if their data contains significant noise/contamination.

It is noted that step 2.1.1 and 2.1.2 is only used for auto-optimized workflow. If user would like to configure the parameters themselved based on their expertise and experience, they can skip these two sections and jump into 2.1.4 directly.

# Load MetaboAnalystR and OptiLCMS
library(MetaboAnalystR)
library(OptiLCMS)

# Here, we extract ROIs from 3 QC samples.
DataFiles <- list.files("upload/QC/", full.names = TRUE)
mSet <- PerformROIExtraction(datapath = DataFiles, rt.idx = 0.9, rmConts = TRUE);
## Step 0/6: Scanning ROIs for parameters optimization...
## QC_001.mzML
## QC_003.mzML
## QC_005.mzML
## Data Loading...
## Raw file import begin...
## Import data: QC_001.mzML finished!
## Import data: QC_003.mzML finished!
## Import data: QC_005.mzML finished!
## Data Loaded !
## Empty Scan detecting...
## No Empty scan found !
## Identifying regions of interest (ROI)...
## Identifying regions of potential contaminants........Done!
## 70 potential contaminamts will not be used for parameters optimization !
## Going to the next step...
## MS data Preparing...
## MS Data ready !
## Identifying ROIs in m/z dimensions...
## Identifying ROIs in m/z dimensions Done !
## Identifying ROIs in RT dimensions...
## Chromatogram Plotting Begin...
## Identification on ROIs Finished!
Figure 2. Extracted ROI for parameters’ optimization.
Figure 2. Extracted ROI for parameters’ optimization.

To import binned spectral data, users will use the same Read.TextData functions in the above two examples. However, binned spectral data is typically of lower quality than compound concentration data and therefore requires a further step of data preprocessing to convert the data into a usable data matrix. For instance, binned spectral data often contains background noise and requires filtering of such unwanted variations. You can find an example here.

2.2 Auto-optimization of parameters

In this example, we will execute PerformParamsOptimization function to optimize the critical parameters of peak picking and alignment for the following data processing in an automatic way. It utilizes the extracted ROI data and the internal instrument-specific parameters. Parallel computing can be enabled. The number of cores user want to use could be specified. More RAM will be consumed if the parallel core is set high.

Parameters optimization with a design of experiment (DoE) strategy. Initial parameters are from the optimized parameters above or the internal default parameters from embedded SetPeakParam function.

Here is some recommendations for parallel core setting, 64GB RAM ~ 6 cores; 32GB RAM ~ 4 cores; 16GB RAM ~ 2 cores. Minimum 16GB RAM is required for raw spectral processing.

# Here we use PerformParamsOptimization to optimize parameters based on 
# the extracted ROI (stored in 'mSet') before process the entire dataset

best_params <- PerformParamsOptimization(mSet, param = SetPeakParam(platform = "UPLC-Q/E"), ncore = 4);
## 
## Step 1/6: Start to optimize parameters! 
##                 
## This step may take a long time...
## DoE Optimization Starting Now... (2023-10-02 13:42:23)
## Evaluating Noise level...
## ppm is estimated as : 4.3
## Done!
## Preparing Parameters for optimization finished !
## Data Spliting Finished !
## Peak Preparing Begin...
## Peak Preparing Done !
## Round:1
## DoE Running Begin...
## 100% |
## Round 1 Finished !
## Model Parsing...
## Gaussian peak ratio (%): 39.3
## Model Parsing Done !
## Round:2
## DoE Running Begin...
## 100% |
## Round 2 Finished !
## Model Parsing...
## Gaussian peak ratio (%): 42.6
## Model Parsing Done !
## Round:3
## DoE Running Begin...
## 100% |
## Round 3 Finished !
## Model Parsing...
## Gaussian peak ratio (%): 34.3
## Model Parsing Done !
## No Increase Stopping !
## best parameter settings:
## min_peakwidth:  5
##  max_peakwidth:  15.75
##  mzdiff:  0.036
##  snthresh:  7.1
##  bw:  2
##  Peak_method:  centWave
##  RT_method:  loess
##  ppm:  4.3
##  noise:  6772
##  prefilter:  2
##  value_of_prefilter:  11556.95
##  minFraction:  0.8
##  minSamples:  1
##  maxFeatures:  100
##  fitgauss:  FALSE
##  mzCenterFun:  wMean
##  integrate:  1
##  extra:  1
##  span:  0.4
##  smooth:  loess
##  family:  gaussian
##  verbose.columns:  FALSE
##  polarity:  negative
##  perc_fwhm:  0.6
##  mz_abs_iso:  0.005
##  max_charge:  2
##  max_iso:  2
##  corr_eic_th:  0.85
##  mz_abs_add:  0.001
##  rmConts:  TRUE
##  BlankSub:  TRUE
## Step 1/6: Parameters Optimization Finished ! (2023-10-02 14:27:32)
## Time Spent In Total:45.2mins

2.3 Importing example data

For raw spectral processing, MetaboAnalystR accepts open-source formats (including mzML, NetCDF, mzDATA, mzXML) data. There are two options for users to input the metadata information (grouping information, e.g. diseases/healthy groups) on all files:

1). Save all files of the same group into a sub-folder. There must be at least 2 sub folders (excluding QC sample);

2). Use a meta data file for parameter, metadata. A phenotype data frame or a absolute path of the metadata file (.txt) for all samples. In the option, first column should be the sample name, while second column is the corresponding group name. If ommited, all samples in the same sub-folder will be considered as one group.

# "path" is used to specify the path to the folder containing the raw MS spectra to be processed.
# BPI and TIC plotting can be enabled with parameter, 
# "plotSettings = SetPlotParam(Plot = T)", or disabled by changing "T" into "F";

mSet <- ImportRawMSData(path = c("upload"), plotSettings = SetPlotParam(Plot = T))
## No initialized mSet found, will initialize one automatically !
## Warning in ImportRawMSData(path = c("upload"), plotSettings = SetPlotParam(Plot
## = T)): No metadata provided, all files in one sub-folder will be considered a
## group!
## 
## Step 2/6: Start to import the spectrum! 
## This step will take a short time...
## Raw file import begin...
## To reduce memory usage BPIS and TICS plots will be created using only 10 samples per group.
## Plotting BPIS and TICS.
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).
## Step 2/6: Successfully imported raw MS data! (2023-10-02 14:27:43) 
## Going to the next step...
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).
Figure 3. Base peak ion (BPI) intensity chromatogram and Total ion chromatogram (TIC) of all samples (ESI+).

2.4 Raw spectral data processing

The next step is to detect peaks (also known as peak picking) from the centroid data. Multiple algorithms have been developed to identify peaks in different dimensions such as retention time. Among them, the centWave algorithm implemented in XCMS has been shown to perform well in processing LC–HRMS spectra. After peak detection, peak alignment is performed to address retention time variations across spectra. These aligned peaks form a peak intensity table with varying proportions of missing values. These missing values indicate that either peak detection failed or the corresponding feature is absent from the respective sample. Therefore, the final step is ‘gap filling’ by reperforming direct peak extraction on corresponding regions in the raw spectra.

The PerformPeakProfiling function is wrapper of raw spectral processing functions that performs peak detection, alignment, and grouping in an automatical step. The parameters from optimization step above can be used directly in this step.

The function also generates two diagnostic plots including statistics on the total intensity of peaks in different samples, a retention time adjustment map, and a PCA plot showing the overall sample clustering prior to data cleaning and statistical analysis.

# "mSet" include complete raw MS spectra to be processed.
# "params" is using the "best_params" generated above
# Plotting functions can be enabled with parameter, 
# "plotSettings = SetPlotParam(Plot = T)", or disabled by changing "T" into "F";
mSet <- PerformPeakProfiling(mSet, Params = best_params, plotSettings = SetPlotParam(Plot=TRUE))
## 6 CPU Threads will be used for peak profiling !
## 
## Step 3/6: Started peak picking! This step will take some time...
## Step 3/6: Peak picking finished ! (2023-10-02 14:28:39)
## 
## Step 4/6: Started peak alignment! This step is running...
## Total of 9517 slices detected for processing... Done ! 
## Going to the next step...
## No blank sample included based on grouping info. Skipped!
## PeakGroup -- loess is used for retention time correction.....
## Performing retention time correction using 1321 peak groups.
## Applying retention time adjustment to the identified chromatographic peaks ... Done !
## Total of 9517 slices detected for processing... Done ! 
## Going to the next step...
## Step 4/6: Peak alignment finished ! (2023-10-02 14:29:28)
## 
## Step 5/6: Started peak filling! This step may take some time...
## Starting peak filling!
## Defining peak areas for filling-in....OK
## Start integrating peak areas from original files...
## Step 5/6: Peak filing finished ! (2023-10-02 14:29:58)
## Peak Profiling finished successfully !
## Begin to plotting figures...

Figure 4. Raw spectral profiling results (Feature stats; Retention time adjustment; BPI after RT adjustment).Figure 4. Raw spectral profiling results (Feature stats; Retention time adjustment; BPI after RT adjustment).

## Done !