tidysq
package is meant to store and conduct operations on biological sequences. This vignette provides a guide to basic usage of tidysq
, i.e. reading, manipulating and writing sequences to file.
The most recent version of tidysq
can be installed with install_github()
function from devtools
.
# devtools::install_github("BioGenies/tidysq")
library(tidysq)
#>
#> Attaching package: 'tidysq'
#> The following object is masked from 'package:base':
#>
#> paste
Biological sequences can be and often are represented as strings – sequences of letters. For example, a DNA sequence can take the form of "TAGGCCCTAGACCTG"
, where A
means adenine, C
– cytosine, G
– guanine and T
– thymine. Exact IUPAC recommendations for one-letter codes can be found here.
Within tidysq
package sequence data is stored in sq
objects, that is, vectors of biological sequences. They can be created from string vectors as above:
sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))
sq_dna
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG <15>
#> [2] TAGGCCCTGGGCATG <15>
There are several thing to note. First, each sequence is an element of sq
object. Many operations are vectorized — they are applied to all sequences of a vector — and sq
objects are no different in this regard. Second, the first line of output says: basic DNA sequences list
. This means that all sequences of this object are of DNA type and do not use ambiguous letters (more about that in “Advanced alphabet techniques” vignette).
Manipulating sequence objects is an integral part of tidysq
. sq
objects can be easily subsetted using usual R syntax:
sq_dna[1]
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG <15>
Extracting subsequences is a bit more complicated than that — because it uses designated function bite()
. Its syntax, however, closely resembles that of base R — indexing starts with one and negative indices are interpreted as “anything except that”. It returns an sq
object with all sequences subsetted:
bite(sq_dna, 5:10)
#> basic DNA sequences list:
#> [1] CCCTAG <6>
#> [2] CCCTGG <6>
bite(sq_dna, c(-9, -11, -13))
#> basic DNA sequences list:
#> [1] TAGGCCCTGCTG <12>
#> [2] TAGGCCCTGCTG <12>
It’s possible to reverse sequences using this function:
# Don't do it like that!
bite(sq_dna, 15:1)
#> basic DNA sequences list:
#> [1] GTCCAGATCCCGGAT <15>
#> [2] GTACGGGTCCCGGAT <15>
However, this usage is strongly discouraged, because it’s both ineffective and works badly with sequences of different lengths. Instead, there is a designated function reverse()
:
reverse(sq_dna)
#> basic DNA sequences list:
#> [1] GTCCAGATCCCGGAT <15>
#> [2] GTACGGGTCCCGGAT <15>
Note that it is very different to base rev()
, which reverses only the order of sequences, not letters:
rev(sq_dna)
#> basic DNA sequences list:
#> [1] TAGGCCCTGGGCATG <15>
#> [2] TAGGCCCTAGACCTG <15>
We can combine two or more sq
objects using base c()
function:
tidysq
offers two functions specific to DNA/RNA sequences, namely complement()
and translate()
. The former creates sequences with complementary bases, that is, replaces A
with T
, C
with G
and vice versa. The latter translates input to amino acid sequences using the translation table with three-letter codons.
These functions can be called as shown below:
complement(sq_dna)
#> basic DNA sequences list:
#> [1] ATCCGGGATCTGGAC <15>
#> [2] ATCCGGGACCCGTAC <15>
#> [3] CAGGTCTAGGGCCTA <15>
#> [4] CATGCCCAGGGCCTA <15>
translate(sq_dna)
#> basic amino acid sequences list:
#> [1] *ALDL <5>
#> [2] *ALGM <5>
#> [3] VQIPD <5>
#> [4] VRVPD <5>
One noteworthy feature here is that translation can be done with any genetic code table of those listed on this Wikipedia page:
translate(sq_dna, table = 6)
#> basic amino acid sequences list:
#> [1] QALDL <5>
#> [2] QALGM <5>
#> [3] VQIPD <5>
#> [4] VRVPD <5>
Motifs are short subsequences. These are often searched for in biological sequences. tidysq
has two distinct functions that allow the user to perform such search.
One of them is a %has%
operator that takes sq
object and character vector as parameters respectively. It returns a logical vector of the same length as sq
object, where each element says whether all motifs passed as strings were found in given sequence:
sq_dna %has% "ATC"
#> [1] FALSE FALSE TRUE FALSE
# It can be used to subset sq
sq_dna[sq_dna %has% c("AG", "CC")]
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG <15>
#> [2] TAGGCCCTGGGCATG <15>
#> [3] GTCCAGATCCCGGAT <15>
It says nothing about motif placement within sequence nor it exact form, however. In this case, there is find_motifs()
function that returns a whole tibble
(from tibble
package; basically improved version of data.frame
) with various info about found motifs. Important thing to note here is that the second argument is a character vector of sequence names to avoid embedding potentially long sequences in resulting tibble
potentially many times:
find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG"))
#> # A tibble: 4 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 rev1 ATC <3> ATC 7 9
#> 2 seq1 TAG <3> TAG 1 3
#> 3 seq1 TAG <3> TAG 8 10
#> 4 seq2 TAG <3> TAG 1 3
You can also provide this function with a data.frame
(or, what we recommend, tibble
) containing one column called sq
, containing the sequences and the other colum name
containing the names.
sqibble <- tibble::tibble(sq = sq_dna,
name = c("seq1", "seq2", "rev1", "rev2"))
# does the same as the call from previous chunk of code
find_motifs(sqibble, c("ATC", "TAG"))
#> # A tibble: 4 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 rev1 ATC <3> ATC 7 9
#> 2 seq1 TAG <3> TAG 1 3
#> 3 seq1 TAG <3> TAG 8 10
#> 4 seq2 TAG <3> TAG 1 3
There are ambiguous DNA bases in IUPAC codes and these can be used in motifs. One of them is "N"
— its meaning is "any of A
, C
, G
or T
:
find_motifs(sqibble, "GNCC")
#> # A tibble: 7 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 seq1 GGCC <4> GNCC 3 6
#> 2 seq1 GCCC <4> GNCC 4 7
#> 3 seq1 GACC <4> GNCC 10 13
#> 4 seq2 GGCC <4> GNCC 3 6
#> 5 seq2 GCCC <4> GNCC 4 7
#> 6 rev1 GTCC <4> GNCC 1 4
#> 7 rev2 GTCC <4> GNCC 7 10
This example displays the difference between "sought"
and "found"
columns. The former contains the string representation of motif that the user was looking for, while the latter contains a tidysq
-encoded sequence with an “instance” of motif.
Two additional characters are reserved because of their special meaning in motifs. "^"
means that this motif must be found at the start of a sequence, while "$"
means the same, but with the end instead. They can be mixed with ambiguous letters, of course:
find_motifs(sqibble, c("^TAG", "ATN$"))
#> # A tibble: 3 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 seq1 TAG <3> ^TAG 1 3
#> 2 seq2 TAG <3> ^TAG 1 3
#> 3 seq2 ATG <3> ATN$ 13 15
After doing computations the user might wish to save their sequences for future use. One of the most popular formats for storing biological sequences is FASTA. tidysq
allows the user to write sequences to FASTA file with write_fasta()
function. Important thing to remember here that the arguments for the function are analogous to those used in find_motifs()
– either sq
object and a vector of names or a tibble
with columns of sequences and names:
write_fasta(sq_dna,
c("seq1", "seq2", "rev1", "rev2"),
"just_your_ordinary_fasta_file.fasta")
# or
write_fasta(sqibble,
"just_your_ordinary_fasta_file.fasta")