An object of class sq represents a list of biological sequences. It is the main internal format of the tidysq package and most functions operate on it. The storage method is memory-optimized so that objects require as little memory as possible (details below).
There are multiple ways of obtaining sq objects:
constructing from another object with the as.sq method,
reading from a FASTA file with read_fasta,
importing from a format of another package, such as ape or Biostrings, with import_sq.
Important note: manually assigning the class sq to an object is strongly discouraged. Because low-level functions are used for bit packing, such an assignment may cause one of those functions to be called while operating on the object, or even while printing it, which can crash the R session and, in consequence, lose data.
There are multiple ways of saving sq objects or converting them into other formats:
converting into a character vector with the as.character method,
converting into a character matrix with the as.matrix method,
saving as a FASTA file with write_fasta,
exporting into a format of another package, such as ape or Biostrings, with export_sq.
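The construction and conversion routes listed above can be sketched as follows. This is a minimal illustration rather than an excerpt from the package examples; the sequences, the names seq_1 and seq_2, and the temporary file path are made up for the purpose.

```r
library(tidysq)

# Construct an sq object from a character vector, stating the type explicitly.
dna <- sq(c("ACTG", "GGTTA"), alphabet = "dna_bsc")

# as.sq() on a character vector is an alternative route; the type is guessed.
also_dna <- as.sq(c("ACTG", "GGTTA"))

# Convert back to plain R structures.
as_strings <- as.character(dna)
as_letters <- as.matrix(dna)

# Save to a FASTA file and read it back (illustrative names and path).
fasta_path <- tempfile(fileext = ".fasta")
write_fasta(dna, name = c("seq_1", "seq_2"), file = fasta_path)
from_file <- read_fasta(fasta_path)
```

Passing the alphabet explicitly avoids relying on type guessing, which matters when a short sequence could be read as either nucleotides or amino acids.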
This package is meant to handle amino acid, DNA and RNA sequences. The IUPAC
standard for one-letter codes includes ambiguous letters, each of which
stands for more than one standard letter. For example, "B" in the
context of DNA code means "any of C, G or T". As there are operations that
make sense only for unambiguous letters (like translate), this
package has separate types for sequences with "basic" and "extended"
alphabets.
There is a need to differentiate sq objects that keep different types
of sequences (DNA, RNA, amino acid), as they use different alphabets.
Furthermore, there are special types for handling non-standard sequence
formats.
Each sq object has exactly one of the following types:
ami_bsc - (amino acids) represents a list of sequences of amino acids (peptides or proteins),
ami_ext - same as above, but with possible usage of ambiguous letters,
dna_bsc - (DNA) represents a list of DNA sequences,
dna_ext - same as above, but with possible usage of ambiguous letters,
rna_bsc - (RNA) represents a list of RNA sequences (together with DNA above often collectively called "nucleotide sequences"),
rna_ext - same as above, but with possible usage of ambiguous letters,
unt - (untyped) represents a list of sequences that do not have a specified type. They are mainly the result of reading sequences from a file that contains some letters that are not in the standard nucleotide or amino acid alphabets and that the user has not specified explicitly. They should be converted to other sq types (using functions like substitute_letters or typify),
atp - (atypical) represents sequences that have an alphabet different from the standard alphabets - similarly to unt, but the user has been explicitly informed about it. They are the result of constructing sequences or reading them from a file with a provided custom alphabet (for details see read_fasta and the sq function). They are also the result of using the substitute_letters function - users can use it, for example, to simplify an alphabet by replacing several letters with one.
For clarity, ami_bsc and ami_ext types are often referred to collectively as ami when there is no need to explicitly specify every possible type. The same applies to dna and rna.
The type of an sq object is printed by the overloaded print method. It can also be checked and obtained as a value (which may be passed as an argument to a function) by using sq_type.
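As a short sketch of checking a type and reusing it as a value, the example below builds objects with explicitly stated alphabets and passes the result of sq_type on to typify. The sequences are invented for illustration.

```r
library(tidysq)

dna_basic <- sq(c("ACTG", "GGTTA"), alphabet = "dna_bsc")
dna_extended <- sq(c("ACTGN", "GGBTA"), alphabet = "dna_ext")

# sq_type() returns the type as a character string.
basic_type <- sq_type(dna_basic)
extended_type <- sq_type(dna_extended)

# The returned value can be passed on as an argument, here as the
# destination type for an untyped object.
untyped <- sq(c("ACTGN", "GGBTA"), alphabet = "unt")
retyped <- typify(untyped, extended_type)
```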
See alphabet.
The user can obtain the alphabet of an sq object using the alphabet function. The user can check which letters are invalid (i.e. not represented in the standard amino acid or nucleotide alphabet) in each sequence of a given sq object by using find_invalid_letters. To substitute one letter with another, use substitute_letters.
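The three functions named above might be used together as in the following sketch. The sequences and the chosen substitutions are arbitrary illustrations, not recommendations.

```r
library(tidysq)

# An untyped object whose letters do not fit the basic DNA alphabet.
untyped <- sq(c("ACGPOIUATTAGACG", "GGATFGHA"), alphabet = "unt")

# Letters actually used by this object.
used_letters <- alphabet(untyped)

# Letters in each sequence that are not part of the dna_bsc alphabet;
# the result has one element per sequence.
invalid <- find_invalid_letters(untyped, "dna_bsc")

# Replace some letters; names are the letters to replace and
# values are their replacements.
simplified <- substitute_letters(untyped, c(P = "A", O = "A"))
```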
It is possible to introduce NA values into sequences. An NA value does not represent a gap (which is represented by "-") or a wildcard element ("N" in the case of nucleotides and "X" in the case of amino acids); it represents an empty position or an invalid letter (one not present in the nucleotide or amino acid alphabet).
NA does not belong to any alphabet. It is printed as "!" and, thus, using "!" as a special letter in atp sequences is strongly discouraged (but the print character can be changed in options, see tidysq-options).
NA might be introduced by:
reading a FASTA file with non-standard letters with read_fasta with the safe_mode argument set to TRUE,
replacing a letter with an NA value with substitute_letters,
subsetting sequences beyond their lengths with bite.
The user can convert sequences that contain NA values into NULL sequences with remove_na.
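A minimal sketch of how NA values can appear and then be removed is given below. It assumes that bite emits a warning when indices exceed a sequence's length (hence the suppressWarnings call) and that remove_na, with its default arguments, replaces every sequence containing NA with a NULL sequence.

```r
library(tidysq)

dna <- sq(c("ACTG", "GGTTAC"), alphabet = "dna_bsc")

# Positions 5 and 6 do not exist in the first sequence, so they become NA.
bitten <- suppressWarnings(bite(dna, 1:6))

# Sequences that contain NA are turned into NULL (zero-length) sequences.
cleaned <- remove_na(bitten)
cleaned_strings <- as.character(cleaned)
```

Only the first sequence is affected here, because the second one is long enough for every requested index.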
A NULL sequence is a sequence of length 0.
NULL sequences might be introduced by:
constructing an sq object from a character string of length zero,
using the remove_ambiguous function,
using the remove_na function,
subsetting an sq object with the bite function and negative indices that span at least -1:-length(sequence).
An sq object is, in fact, a list of raw vectors. The fact that it is a list implies that the user can concatenate sq objects using the c method and subset them using the extract operator. The alphabet is kept as an attribute of the object.
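Because an sq object behaves like a list, concatenation and subsetting look the same as for other R vectors. The sketch below uses invented sequences of the same type.

```r
library(tidysq)

first <- sq(c("ACTG", "GGTTA"), alphabet = "dna_bsc")
second <- sq("TTGCA", alphabet = "dna_bsc")

# Concatenate two sq objects into one.
both <- c(first, second)

# Subset with the extract operator; the result is still an sq object.
last_two <- both[2:3]
```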
Raw vectors are the most efficient way of storage - each letter of a
sequence is assigned an integer (its index in the alphabet of the sq object).
Those integers fit in fewer than 8 bits in binary format, but are normally
stored using at least a full byte each. Thanks to bit packing, it is possible
to remove the unused bits and store the numbers more tightly. This means that
all operations must either be implemented with this packing in mind or accept
a small time overhead induced by unpacking and repacking sequences. However,
this cost is relatively low in comparison to the amount of memory saved.
For example, the dna_bsc alphabet consists of 5 values: ACGT-. They are assigned the numbers 0 to 4, respectively. In binary format those numbers take the form: 000, 001, 010, 011, 100. Each of these letters can be coded with just 3 bits instead of the 8 demanded by char - this saves more than 60% of the memory spent on storing basic nucleotide sequences (3 bits out of 8 is a saving of 62.5%).
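The arithmetic above can be reproduced with a small base R sketch. It only illustrates the idea of bit packing and is not the package's internal implementation; pack_sequence and unpack_sequence are hypothetical helper names, and the bit order inside each byte is an arbitrary choice made for this example.

```r
# Letters of the basic DNA alphabet, coded 0 to 4 in this order.
alphabet_letters <- c("A", "C", "G", "T", "-")

# Number of bits needed to tell apart all letters of the alphabet.
bits_per_letter <- ceiling(log2(length(alphabet_letters)))

# Pack a sequence: keep only the lowest `width` bits of every letter code.
pack_sequence <- function(sequence, alphabet_letters, width) {
  codes <- match(strsplit(sequence, "")[[1]], alphabet_letters) - 1L
  stopifnot(!anyNA(codes))
  # intToBits() returns 32 bits per integer, least significant bit first.
  bits <- unlist(lapply(codes, function(code) intToBits(code)[seq_len(width)]))
  # packBits() needs a multiple of 8 bits, so pad the tail with zero bits.
  padding <- (-length(bits)) %% 8
  packBits(c(bits, raw(padding)), type = "raw")
}

# Unpack: read `width` bits per letter and map the codes back to letters.
unpack_sequence <- function(packed, n_letters, alphabet_letters, width) {
  bits <- rawToBits(packed)[seq_len(n_letters * width)]
  codes <- vapply(seq_len(n_letters), function(i) {
    chunk <- bits[((i - 1) * width + 1):(i * width)]
    sum(as.integer(chunk) * 2^(seq_len(width) - 1))
  }, numeric(1))
  paste(alphabet_letters[codes + 1], collapse = "")
}

sequence <- "ACGTACGT-A"
packed <- pack_sequence(sequence, alphabet_letters, bits_per_letter)
restored <- unpack_sequence(packed, nchar(sequence), alphabet_letters,
                            bits_per_letter)

# Bytes used when packed versus one byte per letter.
packed_bytes <- length(packed)
plain_bytes <- nchar(sequence)
```

Ten letters need 30 bits, which fit into 4 bytes instead of 10. Note that the number of letters has to be known to unpack correctly, since the padding bits are indistinguishable from a trailing "A" in this scheme.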
sq objects are compatible with the tibble class - that means one can have an sq object as a column of a tibble. There are overloaded print methods, so that such a column is printed in a pretty format.
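Storing sequences next to their metadata could look like the sketch below. The column names and sequence names are made up for illustration.

```r
library(tidysq)
library(tibble)

sequences <- tibble(
  name = c("seq_1", "seq_2"),
  sq = sq(c("ACTG", "GGTTA"), alphabet = "dna_bsc")
)

# The column keeps its sq class, so tidysq functions can be applied to it.
column_type <- sq_type(sequences$sq)
```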