Data sampling and data structure

What are k-mers?

k-mers, also known as n-grams, are subsequences of $k$ consecutive elements extracted from a longer sequence of letters. In bioinformatics, k-mers are short fragments of DNA, RNA, or protein sequences. Breaking biological sequences down into such fragments makes them easier to analyze and allows researchers to identify patterns, mutations, and other genetic features. k-mers are used in many applications, including sequence alignment, genome assembly, and gene prediction.
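
The definition above can be illustrated with a few lines of base R. This is only a minimal sketch: extract_kmers is a hypothetical helper written for this example, not a function from kmerFilters or any other package, and it handles contiguous k-mers only.

```r
# Hypothetical helper: list all contiguous k-mers of a single sequence.
extract_kmers <- function(sequence, k) {
  n_windows <- nchar(sequence) - k + 1
  if (n_windows < 1) return(character(0))   # sequence shorter than k
  starts <- seq_len(n_windows)
  substring(sequence, starts, starts + k - 1)
}

extract_kmers("GTGTA", k = 3)
# "GTG" "TGT" "GTA"

# Counting how often each 2-mer occurs
table(extract_kmers("GTGTA", k = 2))
```

A sequence of length L contains L - k + 1 overlapping windows of length k, which is why the sketch returns an empty vector when the sequence is shorter than k.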

In our package, we generate artificial sequences and count their k-mers using the seqR package. If you have your own data and do not wish to simulate sequences, you can count k-mers in them directly with seqR. In either case, you can filter the resulting k-mers with the methods provided in kmerFilters.

Data structure

To generate data, we first generate positive and negative sequences and then count their k-mers. This is done by two functions, generate_motifs and generate_kmer_data, which are controlled by the following parameters, among others (see the documentation for the full list):

alphabet - set of letters for sampling,
sequence_length - length of sequence,
n_injections - maximum number of motifs to inject into a positive sequence.

Let’s create an example set of motifs:

set.seed(111)

dna_alphabet <- c("A", "C", "T", "G")

motifs <- generate_motifs(alphabet = dna_alphabet,     # letters to sample from
                          n_motifs = 2,                # number of motifs
                          n_injections = 1,            # maximum number of motifs injected per sequence
                          n = 3,                       # maximum number of letters in a motif
                          d = 2)                       # maximum number of gaps in a motif

Here, the result is a list of motifs that can be injected into positive sequences:

motifs

## [[1]]
## [1] "T" "G" "T"
## 
## [[2]]
## [1] "A" "_" "T"
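
To make the idea of injecting a gapped motif concrete, here is a minimal base-R sketch. It assumes that a gap ("_") leaves the underlying letter of the sequence unchanged. inject_motif is a hypothetical helper for illustration and is not the implementation used by kmerFilters.

```r
# Hypothetical helper, not part of kmerFilters: overwrite part of a sequence
# with a motif, keeping the original letter wherever the motif has a gap ("_").
inject_motif <- function(sequence, motif, start) {
  stopifnot(start >= 1, start + length(motif) - 1 <= length(sequence))
  positions <- start + seq_along(motif) - 1
  fixed <- motif != "_"                      # positions that carry a letter
  sequence[positions[fixed]] <- motif[fixed]
  sequence
}

background <- c("C", "C", "C", "C")

paste0(inject_motif(background, c("A", "_", "T"), start = 2), collapse = "")
# "CACT"

paste0(inject_motif(background, c("T", "G", "T"), start = 1), collapse = "")
# "TGTC"
```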

With the motifs in hand, we can generate sequences and create a k-mer feature space:

sequence_length <- 4

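kmer_dat <- generate_kmer_data(n_seq = 10,                         # number of sequences
                               sequence_length = sequence_length,  # length of each sequence
                               alph = dna_alphabet,                # letters to sample from
                               motifs = motifs,                    # motifs generated above
                               n_injections = 1,                   # maximum number of motifs injected per sequence
                               n = 2,                              # maximum number of letters in a counted k-mer
                               d = 1)                              # maximum number of gaps in a counted k-mer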

The output of the generate_kmer_data function is an object of the dgCMatrix class. It is a sparse matrix containing only 0s and 1s, where 1 means that a k-mer occurs in a sequence and 0 means that it does not. The dimensions below show that we obtained 10 rows (sequences) and 24 columns (the k-mers found, each with at most 2 letters and at most 1 gap):

dim(kmer_dat)

## [1] 10 24
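
For readers unfamiliar with the dgCMatrix class, the sketch below builds a tiny matrix of the same kind by hand using the Matrix package. The data are made up for illustration and do not come from generate_kmer_data.

```r
library(Matrix)

# Toy presence/absence data for two sequences and three k-mers (made-up values).
# Only the positions of the 1s are stored: (row 1, col 1), (row 1, col 3), (row 2, col 2).
toy <- sparseMatrix(i = c(1, 1, 2),
                    j = c(1, 3, 2),
                    x = 1,
                    dims = c(2, 3),
                    dimnames = list(NULL, c("G_0", "T_0", "G.T_0")))

class(toy)       # a dgCMatrix from the Matrix package
dim(toy)         # 2 rows (sequences), 3 columns (k-mers)
as.matrix(toy)   # dense view, convenient only for small examples
```

Because most k-mers are absent from most sequences, storing only the nonzero entries keeps the feature space compact as the number of k-mers grows.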

We can easily access the k-mers using the following command:

kmer_dat@Dimnames[[2]]

##  [1] "G_0"   "T_0"   "A_0"   "C_0"   "T.G_0" "G.T_0" "A.T_0" "T.A_0" "T.T_0"
## [10] "A.C_0" "C.T_0" "A.A_0" "T.C_0" "C.A_0" "C.C_0" "G.G_1" "T.T_1" "A.T_1"
## [19] "T.A_1" "G.A_1" "C.T_1" "A.A_1" "A.C_1" "T.C_1"

Each dot . between letters marks a position where a gap may occur, and the numbers following the _ symbol give the gap length at each dot. For example, A.C_1 denotes A_C, where _ stands for a single gap. You can decode the names using the biogram package as follows:

unname(biogram::decode_ngrams(kmer_dat@Dimnames[[2]]))

##  [1] "G"   "T"   "A"   "C"   "TG"  "GT"  "AT"  "TA"  "TT"  "AC"  "CT"  "AA" 
## [13] "TC"  "CA"  "CC"  "G_G" "T_T" "A_T" "T_A" "G_A" "C_T" "A_A" "A_C" "T_C"
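
The naming scheme can also be decoded by hand, which may help when biogram is not installed. The following is a sketch under the assumption that the alphabet contains neither "." nor "_". decode_kmer_name is a hypothetical helper, and biogram::decode_ngrams remains the reference implementation.

```r
# Hypothetical base-R decoder for names such as "A.C_1".
decode_kmer_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  letters_vec <- strsplit(parts[1], ".", fixed = TRUE)[[1]]      # letters before "_"
  gaps <- as.integer(strsplit(parts[2], ".", fixed = TRUE)[[1]]) # gap lengths after "_"
  decoded <- letters_vec[1]
  for (i in seq_along(letters_vec)[-1]) {
    # insert gaps[i - 1] gap symbols between consecutive letters
    decoded <- paste0(decoded, strrep("_", gaps[i - 1]), letters_vec[i])
  }
  decoded
}

vapply(c("G_0", "T.G_0", "A.C_1"), decode_kmer_name, character(1), USE.NAMES = FALSE)
# "G" "TG" "A_C"
```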

For a set of sequences, each k-mer is a binary variable indicating whether that k-mer occurs in the corresponding sequence. Here, we have 24 k-mers:

as.matrix(kmer_dat)

##       G_0 T_0 A_0 C_0 T.G_0 G.T_0 A.T_0 T.A_0 T.T_0 A.C_0 C.T_0 A.A_0 T.C_0
##  [1,]   1   1   0   0     1     1     0     0     0     0     0     0     0
##  [2,]   0   1   1   0     0     0     1     1     1     0     0     0     0
##  [3,]   1   1   1   0     1     1     0     1     0     0     0     0     0
##  [4,]   1   1   0   0     1     1     0     0     0     0     0     0     0
##  [5,]   0   1   1   1     0     0     0     0     1     1     1     0     0
##  [6,]   0   1   1   0     0     0     0     1     0     0     0     1     0
##  [7,]   0   1   1   1     0     0     1     0     0     0     0     0     1
##  [8,]   1   1   1   0     0     1     0     1     0     0     0     1     0
##  [9,]   0   1   1   1     0     0     1     0     0     0     0     0     1
## [10,]   0   1   1   1     0     0     1     0     0     0     0     0     1
##       C.A_0 C.C_0 G.G_1 T.T_1 A.T_1 T.A_1 G.A_1 C.T_1 A.A_1 A.C_1 T.C_1
##  [1,]     0     0     1     1     0     0     0     0     0     0     0
##  [2,]     0     0     0     0     1     1     0     0     0     0     0
##  [3,]     0     0     0     1     0     0     1     0     0     0     0
##  [4,]     0     0     1     1     0     0     0     0     0     0     0
##  [5,]     0     0     0     0     1     0     0     1     0     0     0
##  [6,]     0     0     0     0     0     1     0     0     1     0     0
##  [7,]     1     0     0     0     0     1     0     1     0     0     0
##  [8,]     0     0     0     0     0     1     1     0     0     0     0
##  [9,]     0     1     0     0     0     0     0     0     0     1     1
## [10,]     0     1     0     0     0     0     0     0     0     1     1

corresponding to the sequences:

apply(attr(kmer_dat, "sequences"), 1, paste0, collapse = "")

##  [1] "GTGT" "ATTA" "TGTA" "GTGT" "ACTT" "TAAA" "TCAT" "GTAA" "ATCC" "ATCC"
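
A single row of the matrix can be checked by hand. The sketch below lists every k-mer with at most two letters and at most one gap that occurs in a sequence, using the same naming scheme. present_kmers is a hypothetical helper for illustration only; the package itself relies on dedicated k-mer counting code.

```r
# Hypothetical helper: names of all k-mers with at most two letters and at most
# max_gap gaps that occur in a sequence.
present_kmers <- function(sequence, max_gap = 1) {
  s <- strsplit(sequence, "")[[1]]
  found <- paste0(unique(s), "_0")                 # single letters
  for (gap in 0:max_gap) {
    first <- seq_len(max(length(s) - gap - 1, 0))  # start positions of letter pairs
    if (length(first) == 0) next                   # sequence too short for this gap
    pairs <- paste0(s[first], ".", s[first + gap + 1], "_", gap)
    found <- c(found, unique(pairs))
  }
  found
}

present_kmers("GTGT")
# "G_0" "T_0" "G.T_0" "T.G_0" "G.G_1" "T.T_1"
```

These six names are exactly the columns set to 1 in the first row of the matrix above, which corresponds to the sequence GTGT.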

Target variable

Here we briefly describe the response variable. For the set of generated motifs

attr(kmer_dat, "motifs_set")

## [[1]]
## [1] "T" "G" "T"
## 
## [[2]]
## [1] "A" "_" "T"

we create an object motifs_map, which records which motif was injected into which sequence. In our example

attr(kmer_dat, "motifs_map")

##       [,1] [,2]
##  [1,]    1    0
##  [2,]    0    1
##  [3,]    1    0
##  [4,]    1    0
##  [5,]    0    1
##  [6,]    0    0
##  [7,]    0    0
##  [8,]    0    0
##  [9,]    0    0
## [10,]    0    0

we can see that the motif TGT was injected into 3 sequences and A_T into 2 sequences. The target attribute indicates whether a given sequence contains at least one injected motif:

attr(kmer_dat, "target")

##  [1]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
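
The relation between motifs_map and target can be reproduced in base R. The sketch below rebuilds the map shown above by hand, so it runs without kmerFilters. The variable names are chosen for this example only.

```r
# Rebuild the motifs_map printed above: rows are sequences, columns are motifs.
motifs_map <- matrix(0, nrow = 10, ncol = 2)
motifs_map[c(1, 3, 4), 1] <- 1   # sequences carrying TGT
motifs_map[c(2, 5), 2] <- 1      # sequences carrying A_T

colSums(motifs_map)
# 3 2  (number of sequences per motif)

target <- rowSums(motifs_map) > 0
target
# TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
```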

The target attribute indicates whether a sequence is positive or negative. However, we might also want to introduce some noise, that is, sequences with false positive or false negative labels. To do so, we use the functions:

get_target_additive,
get_target_interactions,
get_target_logic.

For example:

get_target_additive(kmer_dat)

##  [1] 1 0 1 1 0 0 0 0 0 1
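
To illustrate what label noise means in general, the sketch below flips each label independently with a fixed probability. This is a deliberately simple stand-in and is not how get_target_additive, get_target_interactions, or get_target_logic compute their output. add_label_noise and flip_prob are hypothetical names.

```r
# Hypothetical helper: flip each label independently with probability flip_prob.
add_label_noise <- function(target, flip_prob) {
  flip <- runif(length(target)) < flip_prob   # TRUE where a label gets flipped
  xor(target, flip)
}

set.seed(1)
clean <- c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE)
noisy <- add_label_noise(clean, flip_prob = 0.2)

sum(noisy != clean)   # number of mislabeled sequences
```

With flip_prob = 0 the labels are returned unchanged, and with flip_prob = 1 every label is inverted, which gives two simple sanity checks for the helper.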

For more details about target sampling, see the vignette.