A quick introduction to MetaboCrates
MetaboCrates.Rmd
Introduction
The MetaboCrates package provides tools for processing and analyzing targeted metabolomics data (e.g., generated using Biocrates® kits). You can easily upload data and perform automated preprocessing, such as data cleaning and missing value imputation, followed by visualizations and descriptive statistics for data exploration. The package also includes tools for quality control and outlier detection.
Getting started
MetaboCrates can be easily installed and loaded into the environment using the following commands.
devtools::install_github("BioGenies/MetaboCrates")
library(MetaboCrates)
Next, metabolomics data can be imported with the
read_data()
function. In this guide, an example dataset
included in the package will be used.
path <- get_example_data("two_sets_example.xlsx")
dat <- read_data(path)
get_info(dat)
#> [1] "Data contains 13 sample types and 2 NA types."
The class of the dat
object is both
data.frame
and raw_data
. It has the following
attributes:
-
LOD_table
- table containing the limits of detection (LOD), lower limit of quantification (LLOQ), and upper limit of quantification (ULOQ). -
NA_info
-
counts
- types of missing values found in the data with their counts, -
NA_ratios_type
- fractions of missing values of each type for every metabolite, -
NA_ratios_group
- fractions of missing values with respect to group levels for every metabolite,
-
-
metabolites
- names of metabolites in the data, -
samples
- names of sample types with counts, -
group
- names of grouping columns (appears after using theadd_group()
function), -
removed
- names of metabolites removed based on the:-
LOD
- proportion of missing values, -
QC
- coefficient of variation (CV),
-
-
completed
- completed data (appears after imputation), -
cv
- coefficients of variation based on QC sample type for each metabolite (appears after using thecalculate_CV()
function).
Non-null attributes at this stage are shown below.
attr(dat, "LOD_table")[,1:5]
#> plate bar code C0 C2 C3 C3-DC (C4-OH)
#> 1 LOD (calc.) 1052125775-1 [µM] NA NA NA NA
#> 2 LOD (calc.) 1052125780-1 [µM] 0.7702 0.0545 0.0000 0.0206
#> 3 LOD (calc.) 1052126590-1 [µM] NA NA NA NA
#> 4 LOD (calc.) 1052126601-1 [µM] 1.4720 0.0269 0.0000 0.0002
#> 5 LOD (from OP) 1052125775-1 [µM] NA NA NA NA
#> 6 LOD (from OP) 1052125780-1 [µM] 1.6670 0.1333 0.1222 0.0900
#> 7 LOD (from OP) 1052126590-1 [µM] NA NA NA NA
#> 8 LOD (from OP) 1052126601-1 [µM] 1.6670 0.1333 0.1222 0.0900
attr(dat, "NA_info")[["NA_ratios"]][5:10,]
#> NULL
attr(dat, "NA_info")[["counts"]]
#> # A tibble: 2 × 2
#> type n
#> <chr> <dbl>
#> 1 < LOD 16240
#> 2 NA 316
attr(dat, "metabolites")[1:10]
#> [1] "C0" "C2" "C3" "C3-DC (C4-OH)"
#> [5] "C3-OH" "C3:1" "C4" "C4:1"
#> [9] "C5" "C5-DC (C6-OH)"
attr(dat, "samples")[1:5,]
#> # A tibble: 5 × 2
#> `sample type` count
#> <chr> <int>
#> 1 Blank 1
#> 2 QC Level 1 2
#> 3 QC Level 2 6
#> 4 QC Level 3 2
#> 5 Sample 159
Now, information about missing values can be displayed.
attr(dat, "NA_info")[["counts"]]
#> # A tibble: 2 × 2
#> type n
#> <chr> <dbl>
#> 1 < LOD 16240
#> 2 NA 316
plot_mv_types(dat)
Group selection
Using the add_group()
function, you can specify columns
(which must not contain any missing values) to group the data by.
Although this step is optional, some functions require grouped data.
After grouping,the NA_info
attribute contains the
proportion of missing values in group levels for each metabolite.
Moreover, the grouping columns names are added to the group
attribute.
attr(grouped_dat, "NA_info")[["NA_ratios_group"]][5:10,]
#> # A tibble: 6 × 3
#> metabolite grouping_column NA_frac
#> <chr> <chr> <dbl>
#> 1 3-IPA 2023LS_s2, human 0
#> 2 3-IPA 2023_Lukasz_S_s1, human 0
#> 3 3-Met-His 2023LS_s2, human 0
#> 4 3-Met-His 2023_Lukasz_S_s1, human 0
#> 5 5-AVA 2023LS_s2, human 0.114
#> 6 5-AVA 2023_Lukasz_S_s1, human 0.238
attr(grouped_dat, "group")
#> [1] "submission name" "species"
Below, the information about grouped data are displayed.
cat(get_info(grouped_dat))
#> Data contains 13 sample types and 2 NA types.
#> Groupping by: "submission name" (4 levels). Data contains 13 sample types and 2 NA types.
#> Groupping by: "species" (4 levels).
plot_groups(grouped_dat)
Compounds filtering
To investigate the missing values ratios in each metabolite, the
plot_NA_percent()
function can be used.
example_metabolites <- c("Cys", "TG 18:3_35:2", "Cortisone", "CDCA", "AbsAcid")
plot_NA_percent(grouped_dat, type = "joint", interactive = FALSE) +
ylim(example_metabolites)
plot_NA_percent(grouped_dat, type = "NA_type", interactive = FALSE) +
ylim(example_metabolites)
plot_NA_percent(grouped_dat, type = "group", interactive = FALSE) +
ylim(example_metabolites)
When not grouped data passed, function below returns the names of metabolites which have more than the given threshold of missing values. If the grouping is specified, then it returns only metabolites for which the threshold is exceeded in each group level. In the examples below the threshold is 20%.
get_LOD_to_remove(dat, 0.2)[1:5]
#> [1] "AbsAcid" "Ac-Orn" "Anserine" "BABA" "C10:1"
get_LOD_to_remove(grouped_dat, 0.2)[1:5]
#> [1] "AbsAcid" "Ac-Orn" "Anserine" "C10:1" "C10:2"
If the grouping with up to 4 levels is specified, then the Venn diagram can be created, which shows the counts of metabolites having ratios of missing values larger than the given threshold for each group level.
create_venn_diagram(grouped_dat, 0.2)
Finally, we can remove the chosen metabolites using the
remove_metabolites()
function with type
specified as LOD
. Instead of actually removing them from
data, this function adds provided metabolites to the
removed
attribute.
rm_dat <- remove_metabolites(grouped_dat,
metabolites_to_remove = c("AbsAcid", "Ac-Orn"),
type = "LOD")
attr(rm_dat, "removed")
#> $LOD
#> [1] "AbsAcid" "Ac-Orn"
#>
#> $QC
#> NULL
Then, the data and missing values ratios without removed metabolites can be displayed. Below, the first 5 rows of both tibbles and chosen columns of the first are visible.
show_data(rm_dat)[1:5, c(1:3, 19:21)]
#> plate bar code sample bar code sample type C2 C3
#> 1 1052126590-1 | 1052126601-1 10000001 Blank NA NA
#> 2 1052125775-1 | 1052125780-1 1020527268 QC Level 1 1.999 0.7022
#> 3 1052126590-1 | 1052126601-1 1020527268 QC Level 1 2.083 0.7166
#> 4 1052125775-1 | 1052125780-1 1020527272 QC Level 2 3.689 1.461
#> 5 1052125775-1 | 1052125780-1 1020527272 QC Level 2 3.088 1.369
#> C3-DC (C4-OH)
#> 1 NA
#> 2 < LOD
#> 3 0.0134
#> 4 0.0346
#> 5 0.0344
show_ratios(rm_dat)[1:5,]
#> # A tibble: 5 × 2
#> metabolite NA_frac
#> <chr> <dbl>
#> 1 1-Met-His 0
#> 2 3-IAA 0
#> 3 3-IPA 0
#> 4 3-Met-His 0
#> 5 5-AVA 0.176
There are two options for restoring metabolites - either specific ones can be selected or all metabolites from the specified type can be restored.
attr(unremove_metabolites(rm_dat, "AbsAcid"), "removed")
#> $LOD
#> [1] "Ac-Orn"
#>
#> $QC
#> NULL
attr(unremove_all(rm_dat, "LOD"), "removed")
#> $LOD
#> NULL
#>
#> $QC
#> NULL
Gaps completing
Aside from removing metabolites, missing values can be also imputed
using the complete_data()
function. It completes missing
values related to the limits of quantification and detection. Imputation
of values below the limit of detection (< LOD
) can be
performed in one of six ways:
-
NULL
- skipping imputation, -
halfmin
- half of the minimum observed value, -
halflimit
- half of the detection limit, -
random
- random number not smaller than the limit of detection and not bigger than the minimum observed value, -
limit
- the detection limit, -
limit-0.2min
- the detection limit minus 0.2 times the minimum observed value.
Values below the quantification limit (< LLOQ
) can be
imputed using either NULL
or limit
methods.
Similarly, values above the upper limit of quantification
(> ULOQ
) can be imputed using the NULL
or
limit
methods, and additionally with the
third quartile
method, which assigns the third quartile of
the observed values.
Completed dataset is then stored as the completed
attribute.
comp_dat <- complete_data(rm_dat,
LOD_method = "halfmin",
LLOQ_method = "limit",
ULOQ_method = "third quartile")
To investigate missing values, you can use the
plot_heatmap()
function, which creates a heatmap indicating
whether a value is missing for each sample and metabolite, depending on
the plate bar code. You can specify a plate bar code to generate a
single plot; otherwise, if there are multiple plate bar codes in the
dataset, a faceted plot will be created. If
show_colors = TRUE
, each missing value type is shown in a
different color. Below, cropped plots are presented, showing the first
five metabolites.
plot_heatmap(comp_dat, plate_bar_code = "1052125775-1 | 1052125780-1") +
ylim(attr(comp_dat, "metabolites")[5:1])
plot_heatmap(comp_dat) +
ylim(attr(comp_dat, "metabolites")[5:1])
Next, you can compare the distributions of an individual metabolite
before and after imputation. The create_distribution_plot()
function can generate a histogram, density plot, and beeswarm plot
(either regular or interactive with tooltips). Additionally, for
histogram, you can choose to display either all values after imputation
or only those that were previously missing and then imputed.
create_distribution_plot(comp_dat, metabolite = "C9", bins = 40)
create_distribution_plot(comp_dat, metabolite = "C9", bins = 40,
histogram_type = "imputed")
create_distribution_plot(comp_dat, metabolite = "C9", type = "density")
create_distribution_plot(comp_dat, metabolite = "C9",
type = "beeswarm")
MetaboCrates also offers two types of Q-Q plots: a
theoretical Q-Q plot (using the create_qqplot()
function),
which compares the quantiles of a metabolite with the quantiles of the
normal distribution, and an empirical Q-Q plot (with
create_empirical_qqplot()
), which compares the quantiles of
metabolite values before and after imputation.
create_qqplot(comp_dat, metabolite = "C9")
create_empirical_qqplot(comp_dat, metabolite = "C9")
After imputation, you can perform quality control and outlier detection.
Quality control
To perform quality control, you must calculate the coefficients of variation (CV) for each quality control type using the completed data.
qc_dat <- calculate_CV(comp_dat)
The coefficients of variation can now be accessed in the
cv
attribute.
attr(qc_dat, "cv")[1:5,]
#> # A tibble: 5 × 3
#> `sample type` metabolite CV
#> <chr> <chr> <dbl>
#> 1 QC Level 1 1-Met-His 0.0392
#> 2 QC Level 1 3-IAA 0.00646
#> 3 QC Level 1 3-IPA 0.0252
#> 4 QC Level 1 3-Met-His 0.0322
#> 5 QC Level 1 5-AVA 0.0722
You can obtain the metabolites with CV values greater than a specified threshold using the following function.
get_CV_to_remove(qc_dat, 0.8)[1:5]
#> [1] "C14:1" "C3-DC (C4-OH)" "Cer d18:1/26:0" "Cer d18:2/18:0"
#> [5] "DG 14:1_18:1"
To remove metabolites from the data, use the
remove_metabolites()
function.
qc_dat <- remove_metabolites(qc_dat, c("C14:1", "C3-DC (C4-OH)"), "QC")
attr(qc_dat, "removed")
#> $LOD
#> [1] "AbsAcid" "Ac-Orn"
#>
#> $QC
#> [1] "C14:1" "C3-DC (C4-OH)"
Outlier detection
To detect outliers and patterns corresponding to variation in the data, you can use principal component analysis (PCA). MetaboCrates offers several PCA plots. The function below creates PCA scatterplots, which visualize the clustering of observations with respect to the sample type or group level.
create_PCA_plot(qc_dat,
types_to_display = c("Sample", "QC Level 1", "QC Level 2"))
create_PCA_plot(qc_dat, group_by = "group")
Using the same function, you can also generate biplots (based on data
with all sample types if group_by = "sample_type"
, or only
type Sample when group_by = "group"
), displaying
loadings above a given threshold. This plot type illustrates how each
metabolite contributes to the first two principal components and
highlights metabolites that capture similar information.
create_PCA_plot(qc_dat, type = "biplot", threshold = 0.05, interactive = FALSE)
create_PCA_plot(qc_dat, type = "biplot", group_by = "group", threshold = 0.05,
interactive = FALSE)
Finally, you can investigate the percentage of total variance in the data explained by each principal component. Additionally, cumulative variance - the sum of the percentage variance contributions from the first principal component up to a given component - can be displayed. Threshold determines which principal components are shown: all components for which the cumulative variance remains below the threshold are included, along with one additional component that brings the cumulative variance just above the threshold (if applicable). You can also specify the maximum number of principal components to display on the plot.
pca_variance(qc_dat, threshold = 0.7)
pca_variance(qc_dat, threshold = 0.7, group_by = "group", max_num = 6,
cumulative = FALSE)