k-mers (n-grams) are k-length substrings derived from longer sequences. They can be contiguous (a block of consecutive residues) or discontinuous (gapped, with wildcards standing for the gaps between residues). In biological applications, k-mers are used for various purposes, including genome assembly, sequence alignment, motif discovery, variant calling, phylogenetics, and CRISPR target identification. Because the number of possible k-mers grows rapidly with k and with the number of allowed gaps, a k-mer representation produces a very large number of variables, so tools for filtering them are needed.
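To make the distinction between contiguous and gapped k-mers concrete, here is a minimal base R sketch. The helper `extract_kmers` is hypothetical and is not part of kmerFilters; it only illustrates the definition, with `_` used as the wildcard symbol.

```r
# Extract all k-mers from a sequence given a vector of gap sizes.
# gaps[i] is the number of wildcard positions between letter i and letter i + 1;
# all zeros gives ordinary contiguous k-mers.
extract_kmers <- function(sequence, gaps) {
  residues <- strsplit(sequence, "")[[1]]
  # Offsets of the informative positions relative to the window start
  offsets <- c(0, cumsum(gaps + 1))
  span <- max(offsets) + 1
  n_windows <- length(residues) - span + 1
  if (n_windows < 1) return(character(0))
  vapply(seq_len(n_windows), function(start) {
    window <- rep("_", span)
    window[offsets + 1] <- residues[start + offsets]
    paste(window, collapse = "")
  }, character(1))
}

contiguous <- extract_kmers("abcdab", gaps = c(0, 0))  # "abc" "bcd" "cda" "dab"
gapped     <- extract_kmers("abcdab", gaps = 1)        # "a_c" "b_d" "c_a" "d_b"
```

Even for this six-letter sequence, each additional gap pattern adds a new family of features, which is why the feature space grows so quickly.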
In this package we provide tools for simulating k-mer data and benchmarking various filtering techniques. It is designed to assist researchers and practitioners in evaluating and comparing different approaches for preprocessing sequence data, particularly for applications such as protein function prediction.
Our package provides fast and efficient generation of synthetic k-mer datasets with customizable parameters, including sequence length, number of sequences, k-mer size, and the impact of motifs. The latter covers modelling the response variable in a discrete or continuous way, depending on a few biological scenarios, such as:
additive impact of each motif,
additive impact of interactions of motifs,
additive impact of custom interactions (logical expressions using motifs).
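The three scenarios above can be sketched in base R on a toy presence/absence matrix. This is only an illustration under assumed, made-up weights and motif names (`m1`, `m2`, `m3`); it does not reproduce how the package computes its response internally.

```r
# Toy presence/absence matrix: rows are sequences, columns are motifs.
# All values and weights below are made up for illustration.
motif_presence <- matrix(c(1, 1, 0,
                           1, 0, 1,
                           0, 0, 0,
                           1, 1, 1),
                         ncol = 3, byrow = TRUE,
                         dimnames = list(NULL, c("m1", "m2", "m3")))
weights <- c(m1 = 1.0, m2 = 0.5, m3 = 2.0)

# 1. Additive impact of each motif
eta_additive <- as.vector(motif_presence %*% weights)

# 2. Additive impact of an interaction (m1 and m2 present together)
m1_and_m2 <- motif_presence[, "m1"] * motif_presence[, "m2"]
eta_interaction <- eta_additive + 1.5 * m1_and_m2

# 3. Additive impact of a custom logical expression: (m1 AND m2) OR NOT m3
present <- motif_presence == 1
logic_term <- (present[, "m1"] & present[, "m2"]) | !present[, "m3"]
eta_logic <- 2 * as.numeric(logic_term)
```

Each `eta_*` vector could serve directly as a continuous response, or be passed through a link function and sampled to obtain a discrete one.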
To learn more, please visit: https://biogenies.info/kmerFilters/
Specific resources:
You can use the generate_motifs function to generate motifs. For example:
library(kmerFilters)
#> Warning: replacing previous import 'S4Arrays::makeNindexFromArrayViewport' by
#> 'DelayedArray::makeNindexFromArrayViewport' when loading 'SummarizedExperiment'
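# Alphabet the sequences and motifs are built from
alph <- letters[1:4]
# Number of injections, reused when generating the k-mer data
n_injections <- 4
# n and d control motif size (letters and gaps); see ?generate_motifs
motifs <- generate_motifs(alphabet = alph,
                          n_motifs = 4,
                          n_injections = n_injections,
                          n = 4,
                          d = 6)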
motifs
#> [[1]]
#> [1] "b" "_" "_" "d"
#>
#> [[2]]
#> [1] "c" "_" "c" "a" "c"
#>
#> [[3]]
#> [1] "a" "_" "d"
#>
#> [[4]]
#> [1] "c" "b" "_" "_" "_" "_" "b"
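In the output above, `_` marks a gap position that any residue can fill. As a rough illustration of what it means for a sequence to contain such a motif, the sketch below converts a motif vector into a regular expression. The helper `motif_to_regex` is hypothetical (not a kmerFilters function) and assumes the alphabet contains no regex metacharacters.

```r
# Turn a motif vector into a regular expression:
# "_" becomes ".", which matches any single residue.
motif_to_regex <- function(motif) {
  paste(ifelse(motif == "_", ".", motif), collapse = "")
}

motif <- c("b", "_", "_", "d")
pattern <- motif_to_regex(motif)  # "b..d"

# Made-up sequences over the alphabet a, b, c, d
sequences <- c("aabcadcc", "abdcabdc", "dddbaadb")
contains_motif <- grepl(pattern, sequences)
contains_motif
#> [1]  TRUE FALSE  TRUE
```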
Using the simulated motifs, we can generate positive and negative sequences and construct a k-mer feature space:
results <- generate_kmer_data(n_seq = 200,
                              sequence_length = 20,
                              alphabet = alph,
                              motifs = motifs,
                              n_injections = 4)
results[1:10, ]
#> 10 x 23406 sparse Matrix of class "dgCMatrix"
#> [[ suppressing 33 column names 'a_0', 'd_0', 'b_0' ... ]]
#>
#> [1,] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 . . . . . 1 1 1 1 1 1 1 1 1 1 1 1 1 ......
#> [2,] 1 1 1 1 . . 1 . . 1 . . 1 1 1 1 1 1 1 . . 1 1 1 . 1 . 1 1 . . 1 1 ......
#> [3,] 1 1 1 1 1 1 . 1 . 1 1 1 . . 1 . 1 1 1 1 1 1 . 1 1 . 1 1 . 1 1 . 1 ......
#> [4,] 1 1 1 1 1 1 . 1 . 1 1 1 . 1 1 . 1 1 1 . 1 . 1 1 1 . . 1 1 1 1 1 . ......
#> [5,] 1 1 1 1 1 . 1 1 . 1 . 1 1 1 1 1 1 1 1 1 . . . 1 1 . 1 1 1 1 . 1 1 ......
#> [6,] 1 1 1 1 . . 1 . . 1 1 . 1 1 1 1 1 . 1 1 1 1 1 1 . 1 . 1 1 . . 1 1 ......
#> [7,] 1 1 1 1 1 . . 1 1 1 . 1 . 1 1 1 . 1 1 . 1 1 . 1 1 1 . 1 . 1 1 . 1 ......
#> [8,] 1 1 1 1 . 1 . . . 1 1 . 1 1 1 . 1 . 1 . . . 1 1 1 1 . 1 1 1 1 1 . ......
#> [9,] 1 1 1 1 1 1 1 . 1 1 1 1 . 1 1 1 1 . 1 . . . 1 1 1 . 1 1 1 1 1 1 . ......
#> [10,] 1 1 1 1 1 1 . . . 1 1 1 . 1 1 1 1 1 1 . 1 . 1 1 1 1 1 1 1 1 1 1 1 ......
#>
#> .....suppressing 23373 columns in show(); maybe adjust options(max.print=, width=)
#> ..............................
Using the obtained data, you can choose how to generate the response variable. We provide three functions for that. For example, the following code:
target <- get_target_additive(results)
target
#> [1] 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 0 1 0 1 1 1 1 1 1 1
#> [38] 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 1 1 0
#> [75] 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 0 1 1 1 0 0 0 0 0 0 0 0 0 0 1
#> [112] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [149] 0 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
#> [186] 0 0 1 0 0 0 0 0 0 0 0 1 0 0 1
creates a binary response variable based on the assumptions of a logistic regression model.
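For readers unfamiliar with the logistic model, the general idea can be sketched in base R as follows. This is not the body of get_target_additive; the intercept, weights, and presence matrix are made-up assumptions used only to show the steps (linear predictor, inverse logit, Bernoulli draw).

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical presence/absence of three motifs in six sequences
motif_presence <- matrix(rbinom(18, size = 1, prob = 0.5), nrow = 6)
weights <- c(2, 2, 2)  # made-up effect sizes
intercept <- -3        # made-up baseline log-odds

eta <- intercept + as.vector(motif_presence %*% weights)  # linear predictor
prob <- plogis(eta)                                       # inverse logit
target_sketch <- rbinom(length(prob), size = 1, prob = prob)
```

Sequences with more motifs get a higher probability of a positive label, but the label itself remains random.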
We have implemented several filtering methods. You can list them using the list_filters function:
list_filters()
#> [1] "filter_chisq" "filter_fcbf" "filter_ic" "filter_ig"
#> [5] "filter_praznik" "filter_quipt"
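To give a sense of what a filter of this kind does, here is a minimal base R sketch of a chi-squared filter on a dense binary matrix. The function `chisq_filter_sketch` and the toy data are hypothetical; this is not the implementation behind filter_chisq, and it applies no multiple-testing correction.

```r
# Keep the columns of a binary feature matrix whose association with a
# binary target is significant according to a chi-squared test.
chisq_filter_sketch <- function(target, features, significance_level = 0.05) {
  p_values <- apply(features, 2, function(feature) {
    contingency <- table(factor(feature, levels = 0:1),
                         factor(target, levels = 0:1))
    # A constant column (or constant target) carries no information
    if (any(rowSums(contingency) == 0) || any(colSums(contingency) == 0)) {
      return(1)
    }
    suppressWarnings(chisq.test(contingency)$p.value)
  })
  colnames(features)[p_values < significance_level]
}

# Made-up data: one informative feature, one unrelated, one constant
target_toy <- rep(c(0, 1), each = 20)
features_toy <- cbind(
  informative = c(rep(0, 18), rep(1, 2), rep(1, 18), rep(0, 2)),
  noise       = rep(c(0, 1), times = 20),
  constant    = rep(1, 40)
)
kept <- chisq_filter_sketch(target_toy, features_toy)
kept
#> [1] "informative"
```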
Using the k-mer feature space and the target variable, you can filter k-mers as follows:
filter_quipt(target, results, significance_level = 0.05)
#> [1] "a.c_1" "b.d_2" "b.b_4" "c.b_5"
#> [5] "c.a.c_0.0" "c.c.a_1.0" "c.c.c_1.1" "b.c.c_3.1"
#> [9] "c.b.b_0.4" "c.c.a.c_1.0.0" "d.c.c.a_0.1.0" "c.b.b.b_0.3.0"
#> [13] "c.c.c.c_1.1.1" "b.c.c.c_1.1.1" "d.c.c.c_3.1.1" "b.c.b.b_0.2.1"
#> [17] "b.c.b.b_0.3.1" "c.b.c.b_0.0.3"
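The selected names can be read back as gapped patterns. The sketch below assumes a naming convention inferred from the output above: letters separated by `.`, followed by `_` and the gap sizes between consecutive letters. The helper `decode_kmer` is hypothetical and not part of the package.

```r
# Decode a k-mer name such as "c.b.b_0.4" into the gapped pattern "cb____b".
# Assumed convention: "<letters joined by .>_<gap sizes joined by .>".
decode_kmer <- function(kmer_name) {
  parts <- strsplit(kmer_name, "_", fixed = TRUE)[[1]]
  residues <- strsplit(parts[1], ".", fixed = TRUE)[[1]]
  gaps <- as.integer(strsplit(parts[2], ".", fixed = TRUE)[[1]])
  # A single-letter name such as "a_0" has no real gap to insert
  if (length(residues) == 1) return(residues)
  pieces <- character(0)
  for (i in seq_along(gaps)) {
    pieces <- c(pieces, residues[i], strrep("_", gaps[i]))
  }
  paste(c(pieces, residues[length(residues)]), collapse = "")
}

selected <- c("b.d_2", "c.b.b_0.4", "c.c.a.c_1.0.0")
decoded <- vapply(selected, decode_kmer, character(1), USE.NAMES = FALSE)
decoded
#> [1] "b__d"    "cb____b" "c_cac"
```

Under this reading, these three selected k-mers correspond to the first, fourth, and second simulated motifs shown earlier.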