Construct sq object from character vector

This function allows the user to construct objects of class sq from a character vector.

sq(
  x,
  alphabet = NULL,
  NA_letter = getOption("tidysq_NA_letter"),
  safe_mode = getOption("tidysq_safe_mode"),
  on_warning = getOption("tidysq_on_warning"),
  ignore_case = FALSE
)

Arguments

x: [character]
Vector to construct sq object from.
alphabet: [character]
If the provided value is a single string, it is interpreted as a type (see Details). If it has length greater than one, it is treated as an atypical alphabet for the sq object and the sq type will be atp. If it is NULL, the type is guessed (see Details).
NA_letter: [character(1)]
A string that is used to interpret and display NA values in the context of the sq class. The default value is "!".
safe_mode: [logical(1)]
Default value is FALSE. When turned on, safe mode guarantees that NA appears within a sequence if and only if the input sequence contains the value passed as NA_letter. This means that the resulting type might differ from the one passed as an argument if a sequence contains letters that do not appear in the original alphabet.
on_warning: ["silent" || "message" || "warning" || "error"]
Determines how warning messages are handled. Default value is "warning".
ignore_case: [logical(1)]
If turned on, lowercase letters are turned into respective uppercase ones and interpreted as such. If not, either sq object must be of type unt or all lowercase letters are interpreted as NA values. Default value is FALSE. Ignoring case does not work with atp alphabets.

Value

An object of class sq with appropriate type.

Details

The sq function covers all possibilities of standard and non-standard types and alphabets. The exact meaning of 'type' and 'alphabet' is explained in the sq class documentation. The guide below describes how the function operates and how it behaves depending on the arguments passed and the letters in the sequences.

The x parameter should be a character vector, each element of which is a biological sequence. If x has length 0, an object of class sq with 0 sequences is created (if no type is specified, it will have the dna_bsc type, as a result of the rules described below). If x contains sequences of length 0, NULL sequences are introduced (see the NULL (empty) sequences section in the sq class documentation).

Important note: in all cases below, the word 'letter' stands for an element of an alphabet. A letter might consist of more than one character; for example, "Ala" might be a single letter. However, to construct or read sequences with multi-character letters, the user has to specify all letters in the alphabet parameter. Details of letters, alphabets and types can be found in the sq class documentation.

Simple guide to construct

In many cases, just the x parameter needs to be specified - type of sequences will be guessed according to rules described below. The user needs to pay attention, however, because for short sequences type may be guessed incorrectly - in this case they should specify type in alphabet parameter.
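
As a small illustration of the caveat above, the sketch below constructs the same short string twice: once letting sq guess the type and once passing the type explicitly. It assumes the tidysq package is installed and uses sq_type() only to inspect the result; the input string is made up for illustration.

```r
library(tidysq)

# "ACGT" is a valid DNA sequence, but it could just as well be
# a short peptide (Ala-Cys-Gly-Thr)
short_seq <- "ACGT"

# Type guessed from the letters alone (only ACGT- letters are present)
guessed <- sq(short_seq)
sq_type(guessed)

# Type stated explicitly through the alphabet parameter
explicit <- sq(short_seq, alphabet = "ami_bsc")
sq_type(explicit)
```

When the intended type is known in advance, passing it explicitly avoids relying on the guess.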

If your sequences contain non-standard letters, where each non-standard letter is one character long (that is, any character that is not an uppercase letter), you also don't need to specify any parameter. Optionally, you can explicitly do it by setting alphabet to "unt".

In safe mode it is guaranteed that only letters which are equal to NA_letter argument are interpreted as NA values. Due to that, resulting alphabet might be different from the alphabet argument.

Detailed guide to construct

Below are listed all possibilities that can occur during the construction of a sq object:

- If you don't specify any parameter other than x, the function will try to guess the sequence type (it checks in exactly this order):
  1. If it contains only ACGT- letters, type will be set to dna_bsc.
  2. If it contains only ACGU- letters, type will be set to rna_bsc.
  3. If it contains any letters from 1. and 2. and additionally letters DEFHIKLMNPQRSVWY*, type will be set to ami_bsc.
  4. If it contains any letters from 1. and additionally letters WSMKRYBDHVN, type will be set to dna_ext.
  5. If it contains any letters from 2. and additionally letters WSMKRYBDHVN, type will be set to rna_ext.
  6. If it contains any letters from previous points and additionally letters JOUXZ, type will be set to ami_ext.
  7. If it contains any letters that exceed all groups mentioned above, type will be set to unt.
- If you specify the alphabet parameter as any of "dna_bsc", "dna_ext", "rna_bsc", "rna_ext", "ami_bsc", "ami_ext", then:
  - If safe_mode is FALSE, then sequences will be built with the standard alphabet for the given type.
  - If safe_mode is TRUE, then sequences will be scanned for letters not in the standard alphabet:
    - If no such letters are found, then sequences will be built with the standard alphabet for the given type.
    - If at least one such letter is found, then sequences are built with the real alphabet and with type set to unt.
- If you specify the alphabet parameter as "unt", then sequences are scanned for their alphabet and subsequently built with the obtained alphabet and type unt.
- If you specify the alphabet parameter as a character vector longer than 1, then type is set to atp and the alphabet is equal to the letters in said parameter.
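
The guessing order listed above can be sketched in base R as follows. This is an illustration only, not the implementation used by tidysq: guess_type_sketch is a hypothetical helper, and the alphabets are assumptions read off the rules (in particular, ami_bsc is assumed to consist of the 20 standard amino acids plus "-" and "*", so U is not part of it). The sketch always scans every letter of the input.

```r
# Hypothetical helper, NOT part of tidysq: mimics the guessing order in base R.
# The alphabets below are assumptions reconstructed from the documented rules.
guess_type_sketch <- function(x) {
  split_chars <- function(s) strsplit(s, "")[[1]]
  nuc_ext <- split_chars("WSMKRYBDHVN")

  # Candidate alphabets, listed in the order in which they are checked
  alphabets <- list(
    dna_bsc = split_chars("ACGT-"),
    rna_bsc = split_chars("ACGU-"),
    ami_bsc = split_chars("ACDEFGHIKLMNPQRSTVWY-*"),
    dna_ext = c(split_chars("ACGT-"), nuc_ext),
    rna_ext = c(split_chars("ACGU-"), nuc_ext),
    ami_ext = c(LETTERS, "-", "*")
  )

  # All distinct single-character letters found in the input
  letters_found <- unique(unlist(strsplit(x, "")))

  # The first alphabet that covers every letter wins
  for (type in names(alphabets)) {
    if (all(letters_found %in% alphabets[[type]])) {
      return(type)
    }
  }
  # Nothing fits: unspecified type
  "unt"
}

guess_type_sketch(c("ATGC", "TCGTTA", "TT--AG"))
guess_type_sketch(c("TMVCCDA", "BASDT-CNN"))
guess_type_sketch(c("%%YAPLAA", "PLAA"))
```

Under these assumptions the three calls return "dna_bsc", "dna_ext" and "unt", matching the types shown for the same inputs in the Examples section. An input of length 0 falls through to the first candidate, dna_bsc, which agrees with the behavior described for empty x.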

If ignore_case is set to TRUE, then lowercase letters are turned into uppercase during their interpretation, unless type is set to atp.

Handling unt and atp types and NA values

You can convert letters into other letters using substitute_letters and then use typify or the sq_type<- function to set the type of the sq object to dna_bsc, dna_ext, rna_bsc, rna_ext, ami_bsc or ami_ext. If your sequences contain NA values, use remove_na.
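
A minimal sketch of that workflow, assuming the tidysq package is installed. The input sequences and the choice of "X" as the replacement letter are made up for illustration. substitute_letters() is assumed here to take a named character vector that maps old letters to new ones, and remove_na() is assumed to accept a by_letter argument; check the documentation of both functions before relying on this.

```r
library(tidysq)

# "%" is not part of any standard alphabet, so the type is guessed as unt
sq_unt <- sq(c("%%YAPLAA", "PLAA"))
sq_type(sq_unt)

# Replace "%" with "X", a letter that belongs to the extended
# amino acid alphabet
sq_sub <- substitute_letters(sq_unt, c("%" = "X"))

# All letters now fit ami_ext, so the type can be set explicitly
sq_ami <- typify(sq_sub, "ami_ext")
sq_type(sq_ami)

# Without safe mode, "U" does not fit dna_bsc and is read as NA
sq_with_na <- sq(c("CUUAC", "GCA-ACGU"), alphabet = "dna_bsc")

# Drop only the NA letters instead of whole sequences
# (by_letter is an assumption, see the lead-in above)
remove_na(sq_with_na, by_letter = TRUE)
```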

Examples

# constructing sq without specifying alphabet:
# Correct sq type will be guessed from appearing letters
## dna_bsc
sq(c("ATGC", "TCGTTA", "TT--AG"))
#> basic DNA sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>

## rna_bsc
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"))
#> basic RNA sequences list:
#> [1] CUUAC                                                                    <5>
#> [2] UACCGGC                                                                  <7>
#> [3] GCA-ACGU                                                                 <8>

## ami_bsc
sq(c("YQQPAVVM", "PQCFL"))
#> basic amino acid sequences list:
#> [1] YQQPAVVM                                                                 <8>
#> [2] PQCFL                                                                    <5>

## ami_bsc sq can contain "*" - a letter meaning end of translation:
sq(c("MMDF*", "SYIHR*", "MGG*"))
#> basic amino acid sequences list:
#> [1] MMDF*                                                                    <5>
#> [2] SYIHR*                                                                   <6>
#> [3] MGG*                                                                     <4>

## dna_ext
sq(c("TMVCCDA", "BASDT-CNN"))
#> extended DNA sequences list:
#> [1] TMVCCDA                                                                  <7>
#> [2] BASDT-CNN                                                                <9>

## rna_ext
sq(c("WHDHKYN", "GCYVCYU"))
#> extended RNA sequences list:
#> [1] WHDHKYN                                                                  <7>
#> [2] GCYVCYU                                                                  <7>

## ami_ext
sq(c("XYOQWWKCNJLO"))
#> extended amino acid sequences list:
#> [1] XYOQWWKCNJLO                                                            <12>

## unt - assume that one wants to mark some special element in sequence with "%"
sq(c("%%YAPLAA", "PLAA"))
#> unt (unspecified type) sequences list:
#> [1] %%YAPLAA                                                                 <8>
#> [2] PLAA                                                                     <4>

# passing type as alphabet parameter:
# All above examples yield an identical result if type specified is the same as guessed
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_bsc")
#> basic DNA sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), "rna_bsc")
#> basic RNA sequences list:
#> [1] CUUAC                                                                    <5>
#> [2] UACCGGC                                                                  <7>
#> [3] GCA-ACGU                                                                 <8>
sq(c("YQQPAVVM", "PQCFL"), "ami_bsc")
#> basic amino acid sequences list:
#> [1] YQQPAVVM                                                                 <8>
#> [2] PQCFL                                                                    <5>
sq(c("MMDF*", "SYIHR*", "MGG*"), "ami_bsc")
#> basic amino acid sequences list:
#> [1] MMDF*                                                                    <5>
#> [2] SYIHR*                                                                   <6>
#> [3] MGG*                                                                     <4>
sq(c("TMVCCDA", "BASDT-CNN"), "dna_ext")
#> extended DNA sequences list:
#> [1] TMVCCDA                                                                  <7>
#> [2] BASDT-CNN                                                                <9>
sq(c("WHDHKYN", "GCYVCYU"), "rna_ext")
#> extended RNA sequences list:
#> [1] WHDHKYN                                                                  <7>
#> [2] GCYVCYU                                                                  <7>
sq(c("XYOQWWKCNJLO"), "ami_ext")
#> extended amino acid sequences list:
#> [1] XYOQWWKCNJLO                                                            <12>
sq(c("%%YAPLAA", "PLAA"), "unt")
#> unt (unspecified type) sequences list:
#> [1] %%YAPLAA                                                                 <8>
#> [2] PLAA                                                                     <4>

# Type doesn't have to be the same as the guessed one if letters fit in the destination alphabet
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_ext")
#> extended DNA sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_bsc")
#> basic amino acid sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_ext")
#> extended amino acid sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>
sq(c("ATGC", "TCGTTA", "TT--AG"), "unt")
#> unt (unspecified type) sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>

# constructing sq with specified letters of alphabet:
# In sequences below "mA" denotes methylated alanine - two characters are treated as a single letter
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c("mA", LETTERS))
#> atp (atypical alphabet) sequences list:
#> [1] L mA Q Y mA S S R                                                        <8>
#> [2] L mA S M K L K F mA mA                                                  <10>
# The order of alphabet letters is not meaningful in most cases
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c(LETTERS, "mA"))
#> atp (atypical alphabet) sequences list:
#> [1] L mA Q Y mA S S R                                                        <8>
#> [2] L mA S M K L K F mA mA                                                  <10>

# reading sequences with three-letter names:
sq(c("ProProGlyAlaMetAlaCys"), alphabet = c("Pro", "Gly", "Ala", "Met", "Cys"))
#> atp (atypical alphabet) sequences list:
#> [1] Pro Pro Gly Ala Met Ala Cys                                              <7>

# using safe mode:
# Safe mode guarantees that no element is read as NA
# But resulting alphabet might be different to the passed one (albeit with warning/error)
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc", safe_mode = TRUE)
#> Warning: Detected letters that do not match specified type!
#> unt (unspecified type) sequences list:
#> [1] CUUAC                                                                    <5>
#> [2] UACCGGC                                                                  <7>
#> [3] GCA-ACGU                                                                 <8>
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc")
#> basic DNA sequences list:
#> [1] C!!AC                                                                    <5>
#> [2] !ACCGGC                                                                  <7>
#> [3] GCA-ACG!                                                                 <8>

# Safe mode guesses alphabet based on whole sequence
long_sequence <- paste0(paste0(rep("A", 4500), collapse = ""), "N")
sq(long_sequence, safe_mode = TRUE)
#> basic amino acid sequences list:
#> [1] AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA... <4501>
sq(long_sequence)
#> basic DNA sequences list:
#> [1] AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA... <4501>

# ignoring case:
# By default, lower- and uppercase letters are treated separately
# This behavior can be changed by setting ignore_case = TRUE
sq(c("aTGc", "tcgTTA", "tt--AG"), ignore_case = TRUE)
#> basic DNA sequences list:
#> [1] ATGC                                                                     <4>
#> [2] TCGTTA                                                                   <6>
#> [3] TT--AG                                                                   <6>
sq(c("XYOqwwKCNJLo"), ignore_case = TRUE)
#> extended amino acid sequences list:
#> [1] XYOQWWKCNJLO                                                            <12>

# It is possible to construct sq with length 0
sq(character())
#> basic DNA sequences list of length 0

# As well as sq with empty sequences
sq(c("AGTGGC", "", "CATGA", ""))
#> basic DNA sequences list:
#> [1] AGTGGC                                                                   <6>
#> [2] <NULL>                                                                   <0>
#> [3] CATGA                                                                    <5>
#> [4] <NULL>                                                                   <0>