sq: class for keeping biological sequences tidy

An object of class sq represents a list of biological sequences. It is the main internal format of the tidysq package and most functions operate on it. The storage method is memory-optimized so that objects require as little memory as possible (details below).

Construction/reading/import of sq objects

There are multiple ways of obtaining sq objects:

constructing from a character vector with the sq function,
constructing from another object with the as.sq method,
reading from a FASTA file with read_fasta,
importing from a format of another package, like ape or Biostrings, with import_sq.
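A minimal sketch of the first two routes, assuming the tidysq package is installed and attached. The sequences are made up for illustration, and the type is passed explicitly through the alphabet argument in the first call, while the second call leaves the type to be guessed:

```r
library(tidysq)

# Construct from a character vector, stating the type explicitly.
sq_dna <- sq(c("ACTGCTG", "CTTAGA"), alphabet = "dna_bsc")

# Coerce another object (here, again a character vector) with as.sq;
# the type is left for the package to guess.
sq_ami <- as.sq(c("MATEGIL", "MIPADHI"))
```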

Important note: Manually assigning the class sq to an object is strongly discouraged. Because the package uses low-level functions for bit packing, such an assignment may cause one of those functions to be called while operating on the object, or even while printing it, which can crash the R session and, in consequence, lose data.

Export/writing of sq objects

There are multiple ways of saving sq objects or converting them into other formats:

converting into a character vector with the as.character method,
converting into a character matrix with the as.matrix method,
saving as a FASTA file with write_fasta,
exporting into a format of another package, like ape or Biostrings, with export_sq.
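The first three routes can be sketched as follows, assuming tidysq is installed. The sequence names and the temporary file path are illustrative choices, not anything the package prescribes:

```r
library(tidysq)

sq_dna <- sq(c("ACTGCTG", "CTTAGA"), alphabet = "dna_bsc")

# Back to a plain character vector, one string per sequence.
seq_strings <- as.character(sq_dna)

# One row per sequence, one column per position.
seq_matrix <- as.matrix(sq_dna)

# Write to a FASTA file; the names here are made up for the example.
fasta_path <- tempfile(fileext = ".fasta")
write_fasta(sq_dna, name = c("seq_1", "seq_2"), file = fasta_path)
```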

Ambiguous letters

This package is meant to handle amino acid, DNA and RNA sequences. The IUPAC standard for one-letter codes includes ambiguous letters, each of which stands for more than one standard base. For example, "B" in the context of DNA code means "any of C, G or T". As there are operations that make sense only for unambiguous bases (like translate), this package has separate types for sequences with the "basic" and the "extended" alphabet.

Types of sq

There is a need to differentiate sq objects that keep different types of sequences (DNA, RNA, amino acid), as they use different alphabets. Furthermore, there are special types for handling non-standard sequence formats.

Each sq object has exactly one of the following types:

ami_bsc - (amino acids) represents a list of sequences of amino acids (peptides or proteins),
ami_ext - same as above, but with possible usage of ambiguous letters,
dna_bsc - (DNA) represents a list of DNA sequences,
dna_ext - same as above, but with possible usage of ambiguous letters,
rna_bsc - (RNA) represents a list of RNA sequences (together with DNA above often collectively called "nucleotide sequences"),
rna_ext - same as above, but with possible usage of ambiguous letters,
unt - (untyped) represents a list of sequences that do not have a specified type. They are mainly the result of reading sequences from a file that contains letters outside the standard nucleotide or amino acid alphabets when the user has not specified those letters explicitly. They should be converted to other sq types (using functions like substitute_letters or typify),
atp - (atypical) represents sequences that have an alphabet different from the standard alphabets - similarly to unt, but the user has been explicitly informed about it. They are the result of constructing sequences or reading from a file with a custom alphabet provided (for details see read_fasta and the sq function). They are also the result of using the substitute_letters function - users can use it, for example, to simplify an alphabet by replacing several letters with one.

For clarity, ami_bsc and ami_ext types are often referred to collectively as ami when there is no need to explicitly specify every possible type. The same applies to dna and rna.

The type of an sq object is printed by the overloaded print method. It can also be checked and obtained as a value (that may be passed as an argument to a function) by using sq_type.
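A short sketch of both ways of inspecting the type, assuming tidysq is installed; the RNA sequences are made up for illustration:

```r
library(tidysq)

sq_rna <- sq(c("ACUGCUG", "CUUAGA"), alphabet = "rna_bsc")

# The print method shows the type alongside the sequences.
print(sq_rna)

# sq_type returns the type as a value that can be stored or passed on.
rna_type <- sq_type(sq_rna)
```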

Alphabet

See alphabet.

The user can obtain the alphabet of an sq object using the alphabet function. The user can check which letters are invalid (i.e. not present in the standard amino acid or nucleotide alphabet) in each sequence of a given sq object by using find_invalid_letters. To substitute one letter with another, use substitute_letters.

Missing/Not Available values

NA values can be introduced into sequences. An NA value does not represent a gap (gaps are represented by "-") or a wildcard element ("N" in the case of nucleotides and "X" in the case of amino acids); it represents an empty position or an invalid letter (one not present in the nucleotide or amino acid alphabet).

NA does not belong to any alphabet. It is printed as "!", so using "!" as a special letter in atp sequences is strongly discouraged (the print character can be changed in options, see tidysq-options).

NA might be introduced by:

reading a FASTA file that contains non-standard letters using read_fasta with the safe_mode argument set to TRUE,
replacing a letter with NA value with substitute_letters,
subsetting sequences beyond their lengths with bite.

The user can convert sequences that contain NA values into NULL sequences with remove_na.
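As an illustration of the bite route and of remove_na, here is a sketch assuming tidysq is installed. The indices are chosen so that they run past the end of both example sequences, and the warning that is expected in that case is suppressed only to keep the example quiet:

```r
library(tidysq)

sq_dna <- sq(c("ACTGCTG", "CTTAGA"), alphabet = "dna_bsc")

# Positions 6 to 8 exceed the length of both sequences (7 and 6 letters),
# so the missing positions are filled with NA.
sq_bitten <- suppressWarnings(bite(sq_dna, 6:8))

# Sequences that contain NA are turned into NULL (empty) sequences.
sq_clean <- remove_na(sq_bitten)
seq_lengths <- get_sq_lengths(sq_clean)
```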

NULL (empty) sequences

A NULL sequence is a sequence of length 0.

NULL sequences might be introduced by:

constructing an sq object from a character string of length zero,
using the remove_ambiguous function,
using the remove_na function,
subsetting an sq object with the bite function using negative indices that span at least -1:-length(sequence).

Storage format

An sq object is, in fact, a list of raw vectors. Because it is a list, the user can concatenate sq objects using the c method and subset them using the extract operator. The alphabet is kept as an attribute of the object.

Raw vectors are the most efficient way of storage - each letter of a sequence is assigned an integer (its index in the alphabet of the sq object). In binary form those integers fit in fewer than 8 bits, but normally they are stored on 16 bits. Thanks to bit packing, the unused bits can be removed and the numbers stored more tightly. This means that every operation must either be implemented with the packing in mind or accept a small time overhead from unpacking and repacking sequences. This cost is relatively low compared to the amount of memory saved.

For example, the dna_bsc alphabet consists of 5 values: ACGT-. They are assigned the numbers 0 to 4 respectively, which in binary are 000, 001, 010, 011 and 100. Each letter can therefore be coded with just 3 bits instead of the 8 demanded by char, which saves more than 60% (1 - 3/8 = 62.5%) of the memory spent on storing basic nucleotide sequences.
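The idea can be sketched in base R alone. The helper pack_codes below is hypothetical and is not part of tidysq; the package does its packing in low-level code, and its actual bit layout may differ from this illustration:

```r
# Hypothetical helper, not part of tidysq: pack small integer codes
# into raw bytes using a fixed number of bits per letter.
pack_codes <- function(codes, bits_per_letter = 3L) {
  # Keep only the lowest bits of each code (least significant bit first).
  bits <- unlist(lapply(codes, function(code) {
    intToBits(code)[seq_len(bits_per_letter)]
  }))
  # Pad with zero bits so the total is a multiple of 8, as packBits requires.
  padding <- (8L - length(bits) %% 8L) %% 8L
  packBits(c(bits, raw(padding)), type = "raw")
}

dna_alphabet <- c("A", "C", "G", "T", "-")
letters_in_seq <- strsplit("ACGTACGT", "")[[1]]

# Index in the alphabet, shifted to start at 0: A = 0, C = 1, G = 2, T = 3.
codes <- match(letters_in_seq, dna_alphabet) - 1L

packed <- pack_codes(codes)

# 8 letters * 3 bits = 24 bits, so 3 bytes instead of 8.
length(packed)
```

With 8 letters the packed form takes 3 bytes, which matches the 3/8 ratio from the paragraph above.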

tibble compatibility

sq objects are compatible with the tibble class - that means one can have an sq object as a column of a tibble. There are overloaded print methods, so that the column is printed in a readable format.
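A minimal sketch of an sq column, assuming both tidysq and tibble are installed; the column names and sequence names are arbitrary choices made for this example:

```r
library(tidysq)
library(tibble)

sequences <- tibble(
  name = c("seq_1", "seq_2"),
  sq = sq(c("ACTGCTG", "CTTAGA"), alphabet = "dna_bsc")
)

# The sq column is printed with its own formatting.
print(sequences)
```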