about_kmers.Rmd
k-mers, or n-grams, are sequences of consecutive elements extracted from a larger sequence of letters. In the context of bioinformatics, k-mers refer to short DNA, RNA, or protein sequences. They are used to analyze biological sequences by breaking them down into manageable fragments, allowing researchers to identify patterns, mutations, and other genetic features. k-mers are valuable in various applications, including sequence alignment, genome assembly, and gene prediction.
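To make the definition concrete, here is a minimal base R sketch that lists the contiguous k-mers of a single sequence with a sliding window. The helper name extract_kmers is hypothetical and is not part of seqr or kmerFilters; it only illustrates the idea.

```r
# Hypothetical helper (not part of seqr or kmerFilters): list all contiguous
# k-mers of one sequence by sliding a window of width k along it.
extract_kmers <- function(sequence, k) {
  n <- nchar(sequence)
  if (k > n) return(character(0))   # sequence too short for any k-mer
  starts <- seq_len(n - k + 1)      # start position of every window
  substring(sequence, starts, starts + k - 1)
}

kmers <- extract_kmers("GTGT", k = 2)
kmers          # "GT" "TG" "GT"
unique(kmers)  # "GT" "TG"
```

Counting tools typically report either how often each distinct k-mer occurs or, as in the rest of this document, only whether it occurs at all.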
In our package we generate artificial sequences and count their k-mers using the seqr package. If you have your own data and do not wish to simulate any, you can use the seqr package to count k-mers in your sequences directly. In either case, you can filter the k-mers with the methods contained in kmerFilters.
To generate data, we first generate both positive and negative sequences and then count k-mers for them. Everything is done by two functions, controlled by the parameters shown below (for more, see the documentation).
Let’s create an example set of motifs:
set.seed(111)
dna_alphabet <- c("A", "C", "T", "G")
motifs <- generate_motifs(alphabet = dna_alphabet, # alphabet
                          n_motifs = 2,            # number of motifs
                          n_injections = 1,        # number of injections
                          n = 3,                   # number of letters
                          d = 2)                   # number of possible gaps
Here, the result is a list of motifs for potential injection into positive sequences:
motifs
## [[1]]
## [1] "T" "G" "T"
##
## [[2]]
## [1] "A" "_" "T"
Having the motifs, we can generate sequences and create a k-mer feature space:
sequence_length <- 4
kmer_dat <- generate_kmer_data(n_seq = 10,
                               sequence_length = sequence_length,
                               alph = dna_alphabet,
                               motifs = motifs,
                               n_injections = 1,
                               n = 2,
                               d = 1)
The output of the generate_kmer_data function is an object of the dgCMatrix class. It is a sparse matrix containing 0’s and 1’s, where 1 denotes that a k-mer occurs in a sequence and 0 denotes that it does not. The dimensions below show that we obtained 10 rows (sequences) and 24 columns (the k-mers found, with at most 2 letters and 1 gap):
dim(kmer_dat)
## [1] 10 24
We can easily access the k-mers using the following command:
kmer_dat@Dimnames[[2]]
## [1] "G_0" "T_0" "A_0" "C_0" "T.G_0" "G.T_0" "A.T_0" "T.A_0" "T.T_0"
## [10] "A.C_0" "C.T_0" "A.A_0" "T.C_0" "C.A_0" "C.C_0" "G.G_1" "T.T_1" "A.T_1"
## [19] "T.A_1" "G.A_1" "C.T_1" "A.A_1" "A.C_1" "T.C_1"
Each dot . between letters represents a potential gap. The numbers following the _ symbol indicate the number of gaps associated with each dot. For example, A.C_1 means exactly A_C, where _ means a gap. You can decode the names using the biogram package as follows:
unname(biogram::decode_ngrams(kmer_dat@Dimnames[[2]]))
## [1] "G" "T" "A" "C" "TG" "GT" "AT" "TA" "TT" "AC" "CT" "AA"
## [13] "TC" "CA" "CC" "G_G" "T_T" "A_T" "T_A" "G_A" "C_T" "A_A" "A_C" "T_C"
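If biogram is not installed, the naming scheme can be decoded by hand. The sketch below is a hypothetical base R helper, decode_kmer_name, that reproduces the decoded names shown above. It is not biogram's implementation, and it assumes the alphabet itself contains neither . nor _.

```r
# Hypothetical base R decoder for names such as "A.C_1". It mimics the output
# of biogram::decode_ngrams shown above but is not biogram's implementation.
decode_kmer_name <- function(name) {
  parts <- strsplit(name, "_", fixed = TRUE)[[1]]
  kmer_letters <- strsplit(parts[1], ".", fixed = TRUE)[[1]]
  # A unigram such as "G_0" has no dots, so its gap count carries no information
  if (length(kmer_letters) == 1) return(kmer_letters)
  gaps <- as.integer(strsplit(parts[2], ".", fixed = TRUE)[[1]])
  # Put gaps[i] underscores after the i-th letter, then append the last letter
  fillers <- strrep("_", gaps)
  paste0(paste0(head(kmer_letters, -1), fillers, collapse = ""),
         tail(kmer_letters, 1))
}

vapply(c("G_0", "T.G_0", "A.C_1"), decode_kmer_name,
       character(1), USE.NAMES = FALSE)
# "G" "TG" "A_C"
```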
For a set of sequences, each k-mer is a binary variable indicating whether that k-mer occurs in the corresponding sequence. Here, we have 24 k-mers:
as.matrix(kmer_dat)
## G_0 T_0 A_0 C_0 T.G_0 G.T_0 A.T_0 T.A_0 T.T_0 A.C_0 C.T_0 A.A_0 T.C_0
## [1,] 1 1 0 0 1 1 0 0 0 0 0 0 0
## [2,] 0 1 1 0 0 0 1 1 1 0 0 0 0
## [3,] 1 1 1 0 1 1 0 1 0 0 0 0 0
## [4,] 1 1 0 0 1 1 0 0 0 0 0 0 0
## [5,] 0 1 1 1 0 0 0 0 1 1 1 0 0
## [6,] 0 1 1 0 0 0 0 1 0 0 0 1 0
## [7,] 0 1 1 1 0 0 1 0 0 0 0 0 1
## [8,] 1 1 1 0 0 1 0 1 0 0 0 1 0
## [9,] 0 1 1 1 0 0 1 0 0 0 0 0 1
## [10,] 0 1 1 1 0 0 1 0 0 0 0 0 1
## C.A_0 C.C_0 G.G_1 T.T_1 A.T_1 T.A_1 G.A_1 C.T_1 A.A_1 A.C_1 T.C_1
## [1,] 0 0 1 1 0 0 0 0 0 0 0
## [2,] 0 0 0 0 1 1 0 0 0 0 0
## [3,] 0 0 0 1 0 0 1 0 0 0 0
## [4,] 0 0 1 1 0 0 0 0 0 0 0
## [5,] 0 0 0 0 1 0 0 1 0 0 0
## [6,] 0 0 0 0 0 1 0 0 1 0 0
## [7,] 1 0 0 0 0 1 0 1 0 0 0
## [8,] 0 0 0 0 0 1 1 0 0 0 0
## [9,] 0 1 0 0 0 0 0 0 0 1 1
## [10,] 0 1 0 0 0 0 0 0 0 1 1
corresponding to the sequences:
## [1] "GTGT" "ATTA" "TGTA" "GTGT" "ACTT" "TAAA" "TCAT" "GTAA" "ATCC" "ATCC"
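Individual columns of the matrix can be checked against these sequences by hand. The sketch below uses a hypothetical helper, has_kmer, that converts a decoded k-mer into a regular expression in which every _ matches any single letter. It only illustrates what a presence/absence column means and is not how seqr counts k-mers.

```r
sequences <- c("GTGT", "ATTA", "TGTA", "GTGT", "ACTT",
               "TAAA", "TCAT", "GTAA", "ATCC", "ATCC")

# Hypothetical helper: does a decoded k-mer such as "A_T" occur in a sequence?
# Each "_" becomes the regex wildcard "." (any single letter).
has_kmer <- function(kmer, sequences) {
  pattern <- gsub("_", ".", kmer, fixed = TRUE)
  as.integer(grepl(pattern, sequences))
}

has_kmer("A_T", sequences)  # column A.T_1: 0 1 0 0 1 0 0 0 0 0
has_kmer("TG", sequences)   # column T.G_0: 1 0 1 1 0 0 0 0 0 0
```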
Here we briefly describe the response variable. For the set of generated motifs
attr(kmer_dat, "motifs_set")
## [[1]]
## [1] "T" "G" "T"
##
## [[2]]
## [1] "A" "_" "T"
we create an object motifs_map, which records which motif was injected into which sequence. In our example
attr(kmer_dat, "motifs_map")
## [,1] [,2]
## [1,] 1 0
## [2,] 0 1
## [3,] 1 0
## [4,] 1 0
## [5,] 0 1
## [6,] 0 0
## [7,] 0 0
## [8,] 0 0
## [9,] 0 0
## [10,] 0 0
we can see that the motif TGT occurred 3 times and A_T occurred 2 times. The target attribute says whether a given sequence contains at least one motif:
attr(kmer_dat, "target")
## [1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE
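The relationship between motifs_map and target can be reproduced in base R. The sketch below retypes the motifs_map shown above by hand and assumes, in line with the description, that a sequence is positive exactly when its row contains at least one injected motif. It illustrates the relationship and is not the package's internal code.

```r
# motifs_map retyped from the output above: rows are sequences,
# columns are the motifs TGT and A_T
motifs_map <- rbind(c(1, 0), c(0, 1), c(1, 0), c(1, 0), c(0, 1),
                    matrix(0, nrow = 5, ncol = 2))

# Number of sequences each motif was injected into
colSums(motifs_map)      # 3 2

# Assumption: a sequence is positive if at least one motif was injected
target <- rowSums(motifs_map) > 0
target                   # TRUE for sequences 1 to 5, FALSE for 6 to 10
```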
The target variable says whether a sequence is positive or negative. However, we might also want to introduce some noise (sequences falsely labeled as positive or negative). To do so, we use the functions get_target_additive, get_target_interactions, and get_target_logic. For example:
get_target_additive(kmer_dat)
## [1] 1 0 1 1 0 0 0 0 0 1
For more details about target sampling, see the vignette.