Advanced alphabet techniques • tidysq

library(tidysq)
#> 
#> Attaching package: 'tidysq'
#> The following object is masked from 'package:base':
#> 
#>     paste

Sequences in sq objects are compressed to take up less storage space. To achieve that, sq objects store an alphabet attribute that serves as a dictionary of possible symbols. This attribute can be accessed by its namesake function:

sq_dna <- sq(c("CTGAATGCAGT", "ATGCCGT", "CAGACCATT"))
alphabet(sq_dna)
#> <tidysq alphabet[5]>
#> [1] A C G T -

Manually assigning a different alphabet is strongly discouraged, as it may result in undesirable behavior.

Standard alphabets

Alphabets can be divided into standard and non-standard types. Both groups behave similarly, but standard alphabets offer additional functionality thanks to their biological interpretation.

Standard alphabets can be subdivided into basic and extended alphabets, and the two groups are closely linked. Every standard alphabet corresponds to an sq type: an sq object of that type has that alphabet as the value of its alphabet attribute.
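The link between type and alphabet can be checked directly. The sketch below assumes that sq_type() returns the type as a short string that get_standard_alphabet() accepts; the comparison uses setequal() so that only the letters themselves are compared, not any attributes of the alphabet object.

```r
library(tidysq)

sq_dna <- sq(c("CTGAATGCAGT", "ATGCCGT", "CAGACCATT"))

# sq_type() reports the type of the object as a short string.
dna_type <- sq_type(sq_dna)
dna_type

# The alphabet stored in the object should hold the same letters as the
# standard alphabet registered for that type.
same_letters <- setequal(alphabet(sq_dna), get_standard_alphabet(dna_type))
same_letters
```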

Basic alphabets

There are three predefined basic alphabets: for DNA, RNA and amino acid sequences. Each consists of all letter codes used for bases or residues of the given type, as well as the gap letter “-” and (in the amino acid case) the stop letter “*”. Alphabets are stored as character vectors with an added sq_alphabet class that provides additional methods. For instance, the basic amino acid alphabet contains the following letters: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y, -, *.

A basic DNA or RNA alphabet is required for the translate() operation.
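As a minimal sketch of that requirement, the example below builds a basic DNA object explicitly and translates it. The sequences are made up for illustration, and the call relies on the default arguments of translate(), which are assumed here to select the standard genetic code.

```r
library(tidysq)

# Two codons per sequence; every letter belongs to the basic DNA alphabet.
sq_coding <- sq(c("ATGGCC", "TTTAAA"), alphabet = "dna_bsc")

# Translate with default settings; the result is an amino acid sq object.
sq_protein <- translate(sq_coding)
sq_protein
sq_type(sq_protein)
```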

Extended alphabets

For each basic alphabet there is an extended counterpart. These three extended alphabets contain all letters from the respective basic ones and, additionally, ambiguous letters (that is, letters that mean “X-or-Y-or-Z base”, where X, Y and Z are chosen from the corresponding basic alphabet).

Both basic and extended alphabets can be obtained with the get_standard_alphabet() function. It interprets the type argument flexibly, so the user does not have to remember the exact type name (although consistent naming benefits code readability):

get_standard_alphabet("ami_ext")
#> <tidysq alphabet[28]>
#>  [1] A B C D E F G H I J K L M N O P Q R S T U V W X Y Z - *
get_standard_alphabet("rna_bsc")
#> <tidysq alphabet[5]>
#> [1] A C G U -
get_standard_alphabet("DNA extended")
#> <tidysq alphabet[16]>
#>  [1] A C G T W S M K R Y B D H V N -

Removing ambiguous elements

When an sq object has an extended type, it can be converted to the basic one with the remove_ambiguous() function. It removes either every sequence that contains an ambiguous element or just the ambiguous elements themselves, depending on the value of the by_letter parameter. In the example below N is such an element:

sq_rna <- sq(c("UCGGNNCAGNN", "AUUCGGUGA", "CNCUUANNNCNU"))
sq_rna
#> extended RNA sequences list:
#> [1] UCGGNNCAGNN                                                             <11>
#> [2] AUUCGGUGA                                                                <9>
#> [3] CNCUUANNNCNU                                                            <12>
remove_ambiguous(sq_rna)
#> basic RNA sequences list:
#> [1] <NULL>                                                                   <0>
#> [2] AUUCGGUGA                                                                <9>
#> [3] <NULL>                                                                   <0>
remove_ambiguous(sq_rna, by_letter = TRUE)
#> basic RNA sequences list:
#> [1] UCGGCAG                                                                  <7>
#> [2] AUUCGGUGA                                                                <9>
#> [3] CCUUACU                                                                  <7>

Should the user wish to keep the original sequence lengths unchanged, the substitute_letters() function is more appropriate. The most obvious replacement is the gap letter “-”, present in all standard alphabets:

substitute_letters(sq_rna, c(N = "-"))
#> atp (atypical alphabet) sequences list:
#> [1] UCGG--CAG--                                                             <11>
#> [2] AUUCGGUGA                                                                <9>
#> [3] C-CUUA---C-U                                                            <12>

Notice, however, that the returned object has the atp type instead. Handling this is covered in the Type manipulation section below.

Non-standard alphabets

The non-standard group consists of two types: untyped (unt) and atypical (atp). The former results when no alphabet is specified and no standard alphabet contains all letters appearing in the sequences. The latter is used whenever the user specifies the alphabet explicitly. The difference is best shown with calls to the sq() constructor:

sq(c("PFN&I&VO*&P", "&IO*&PVO"))
#> unt (unspecified type) sequences list:
#> [1] PFN&I&VO*&P                                                             <11>
#> [2] &IO*&PVO                                                                 <8>
sq(c("PFN&I&VO*&P", "&IO*&PVO"),
   alphabet = c("F", "I", "N", "O", "P", "V", "&", "*"))
#> atp (atypical alphabet) sequences list:
#> [1] PFN&I&VO*&P                                                             <11>
#> [2] &IO*&PVO                                                                 <8>

As with standard alphabets, atypical ones can also contain more letters than actually appear in the sequences:

sq(c("PFN&I&VO*&P", "&IO*&PVO"),
   alphabet = c("E", "F", "I", "N", "O", "P", "Q", "V", "&", "*", ":"))
#> atp (atypical alphabet) sequences list:
#> [1] PFN&I&VO*&P                                                             <11>
#> [2] &IO*&PVO                                                                 <8>

Multicharacter alphabets

The main use of atypical alphabets is to let the user handle data with multicharacter letters. For example, amino acid sequences are sometimes described using three-character codes. These can be handled as shown below (although in practice all codes should be specified, not only a handful):

sq_multichar <- sq(c("TyrGlyArgArgAsp", "AspGlyArgGly", "CysGluGlyTyrProArg"),
                   alphabet = c("Arg", "Asp", "Cys", "Glu", "Gly", "Pro", "Tyr"))
sq_multichar
#> atp (atypical alphabet) sequences list:
#> [1] Tyr Gly Arg Arg Asp                                                      <5>
#> [2] Asp Gly Arg Gly                                                          <4>
#> [3] Cys Glu Gly Tyr Pro Arg                                                  <6>

These letters are treated as indivisible wholes. This can be observed during a letter replacement operation:

substitute_letters(sq_multichar, c(Arg = "X", Glu = "His", Pro = "X"))
#> atp (atypical alphabet) sequences list:
#> [1] Tyr Gly X X Asp                                                          <5>
#> [2] Asp Gly X Gly                                                            <4>
#> [3] Cys His Gly Tyr X X                                                      <6>
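The same indivisibility shows up in sequence lengths, which match the bracketed counts in the printed output. The sketch below uses get_sq_lengths() and compares its result with the raw character counts of the input strings.

```r
library(tidysq)

raw <- c("TyrGlyArgArgAsp", "AspGlyArgGly", "CysGluGlyTyrProArg")
sq_multichar <- sq(raw,
                   alphabet = c("Arg", "Asp", "Cys", "Glu", "Gly", "Pro", "Tyr"))

# Lengths are counted in letters of the alphabet...
letter_lengths <- get_sq_lengths(sq_multichar)
letter_lengths

# ...not in characters of the original strings.
char_lengths <- nchar(raw)
char_lengths
```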

Type manipulation

As shown in previous sections, substitute_letters() returns an sq object of atp type. If this type is unsatisfactory, the user can call the typify() function, which creates a new sq object with the desired type (backticks are necessary when the substituted letter isn’t a valid variable name):

sq_unt <- sq(c("UCGG&&CAG&&", "AUUCGGUGA", "C&CUUA&&&C&U"))
sq_sub <- substitute_letters(sq_unt, c(`&` = "-"))
sq_sub
#> atp (atypical alphabet) sequences list:
#> [1] UCGG--CAG--                                                             <11>
#> [2] AUUCGGUGA                                                                <9>
#> [3] C-CUUA---C-U                                                            <12>
typify(sq_sub, "rna_bsc")
#> basic RNA sequences list:
#> [1] UCGG--CAG--                                                             <11>
#> [2] AUUCGGUGA                                                                <9>
#> [3] C-CUUA---C-U                                                            <12>

However, typify() has one requirement: the sq object being typified must not contain any letters absent from the target alphabet. For instance, the following call won’t work:

typify(sq_sub, "dna_bsc")
#> Error in CPP_typify(x, dest_type, NA_letter): sq object contains letters that do not appear in the alphabet of target type

The user isn’t left to guess whether a sequence has invalid letters. The find_invalid_letters() function returns a list of character vectors, where each vector contains the invalid letters of the corresponding sequence:

find_invalid_letters(sq_sub, "dna_bsc")
#> [[1]]
#> [1] "U"
#> 
#> [[2]]
#> [1] "U"
#> 
#> [[3]]
#> [1] "U"
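The check and the conversion can be combined so that typify() is only called when it is expected to succeed. The helper below, typify_if_valid(), is hypothetical and not part of tidysq; it is a sketch that assumes find_invalid_letters() returns an empty element for each sequence without invalid letters.

```r
library(tidysq)

sq_unt <- sq(c("UCGG&&CAG&&", "AUUCGGUGA", "C&CUUA&&&C&U"))
sq_sub <- substitute_letters(sq_unt, c(`&` = "-"))

# Hypothetical helper (not part of tidysq): typify only when no sequence
# contains a letter foreign to the target type; otherwise report the
# offending letters and return the object unchanged.
typify_if_valid <- function(x, dest_type) {
  offending <- unique(unlist(find_invalid_letters(x, dest_type)))
  if (length(offending) > 0) {
    message("Cannot typify to ", dest_type,
            "; invalid letters: ", toString(offending))
    return(x)
  }
  typify(x, dest_type)
}

as_rna <- typify_if_valid(sq_sub, "rna_bsc")  # conversion goes through
as_dna <- typify_if_valid(sq_sub, "dna_bsc")  # U is invalid, object kept as is
```

Returning the input unchanged is only one possible design; raising an error with stop() instead would suit pipelines that must not continue with an unconverted object.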

All invalid letters have to be substituted before the object is passed to typify(). A more complicated call that replaces all ambiguous letters with the gap letter “-” can be constructed as follows:

ambiguous_letters <- setdiff(
  get_standard_alphabet("rna_ext"),
  get_standard_alphabet("rna_bsc")
)
encoding <- rep("-", length(ambiguous_letters))
names(encoding) <- ambiguous_letters
encoding
#>   W   S   M   K   R   Y   B   D   H   V   N 
#> "-" "-" "-" "-" "-" "-" "-" "-" "-" "-" "-"

sq_rna_sub <- substitute_letters(sq_rna, encoding)
typify(sq_rna_sub, "rna_bsc")
#> basic RNA sequences list:
#> [1] UCGG--CAG--                                                             <11>
#> [2] AUUCGGUGA                                                                <9>
#> [3] C-CUUA---C-U                                                            <12>
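The encoding vector can also be built in a single expression. The sketch below is an alternative formulation of the same steps using base R's setNames(); it is offered as a stylistic variant, not as the recommended way.

```r
library(tidysq)

sq_rna <- sq(c("UCGGNNCAGNN", "AUUCGGUGA", "CNCUUANNNCNU"))

ambiguous_letters <- setdiff(
  get_standard_alphabet("rna_ext"),
  get_standard_alphabet("rna_bsc")
)

# setNames() attaches the names in the same expression that creates the vector.
encoding <- setNames(rep("-", length(ambiguous_letters)), ambiguous_letters)

sq_rna_bsc <- typify(substitute_letters(sq_rna, encoding), "rna_bsc")
sq_type(sq_rna_bsc)
```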