This function allows the user to construct objects of class sq from a character vector.
x: [character] Vector to construct the sq object from.
alphabet: [character] If the provided value is a single string, it is interpreted as a type (see details). If the provided value has length greater than one, it is treated as an atypical alphabet for the sq object and the sq type will be atp. If the provided value is NULL, type guessing is performed (see details).
NA_letter: [character(1)] A string used to interpret and display NA values in the context of the sq class. The default value is "!".
safe_mode: [logical(1)] Default value is FALSE. When turned on, safe mode guarantees that NA appears within a sequence if and only if the input sequence contains the value passed as NA_letter. This means that the resulting type might be different from the one passed as an argument if a sequence contains letters that do not appear in the original alphabet.
on_warning: ["silent" || "message" || "warning" || "error"] Determines how warning messages are handled. The default value is "warning".
ignore_case: [logical(1)] If turned on, lowercase letters are converted to their uppercase counterparts and interpreted as such. If not, either the sq object must be of type unt or all lowercase letters are interpreted as NA values. Default value is FALSE. Ignoring case does not work with atp alphabets.
An object of class sq with appropriate type.
Function sq covers all possibilities of standard and non-standard types and alphabets. The exact meaning of 'type' and 'alphabet' is described in the sq class documentation. Below is a guide on how the function operates and how it behaves depending on the arguments passed and the letters in the sequences.
The x parameter should be a character vector. Each element of this vector is a biological sequence. If this parameter has length 0, an object of class sq with 0 sequences is created (unless specified otherwise, it has the dna_bsc type, which follows from the rules below). If it contains sequences of length 0, NULL sequences are introduced (see the NULL (empty) sequences section in the sq class documentation).
Important note: in all cases below, the word 'letter' stands for an element of an alphabet. A letter might consist of more than one character; for example, "Ala" might be a single letter. However, to construct or read sequences with multi-character letters, the user has to specify all letters in the alphabet parameter. Details on letters, alphabets and types can be found in the sq class documentation.
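How a sequence can be cut into multi-character letters may be easier to see in code. The following base R sketch is only an illustration under stated assumptions: split_into_letters is a hypothetical helper, not a function of the package, and the greedy longest-match strategy is an assumption about how such parsing could work, not a description of the package internals.

```r
# Hypothetical helper, not part of the package: split one sequence into
# letters of a user-supplied alphabet using greedy longest-match.
split_into_letters <- function(sequence, alphabet) {
  # Try longer letters first so that "mA" wins over single characters
  alphabet <- alphabet[order(nchar(alphabet), decreasing = TRUE)]
  letters_out <- character()
  position <- 1
  while (position <= nchar(sequence)) {
    rest <- substring(sequence, position)
    hit <- alphabet[startsWith(rest, alphabet)]
    if (length(hit) == 0) {
      # Assumption: an unmatched character is recorded as NA and skipped
      letters_out <- c(letters_out, NA_character_)
      position <- position + 1
    } else {
      letters_out <- c(letters_out, hit[1])
      position <- position + nchar(hit[1])
    }
  }
  letters_out
}

paste(split_into_letters("LmAQYmASSR", c("mA", LETTERS)), collapse = " ")
#> [1] "L mA Q Y mA S S R"
```

Sorting the alphabet by letter length is what makes the order of letters in the alphabet argument irrelevant in this sketch.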
In many cases, only the x parameter needs to be specified; the type of sequences is then guessed according to the rules described below. The user needs to pay attention, however, because the type of short sequences may be guessed incorrectly. In this case the type should be specified in the alphabet parameter.
If your sequences contain non-standard letters, where each non-standard letter is one character long (that is, any character that is not an uppercase letter), you also don't need to specify any parameter. Optionally, you can do it explicitly by setting alphabet to "unt".
In safe mode it is guaranteed that only letters equal to the NA_letter argument are interpreted as NA values. Because of that, the resulting alphabet might be different from the one given in the alphabet argument.
All possibilities that can occur during the construction of a sq object are listed below:
If you don't specify any parameter other than x, the function will try to guess the sequence type (it checks in exactly this order):
1. If it contains only ACGT- letters, type will be set to dna_bsc.
2. If it contains only ACGU- letters, type will be set to rna_bsc.
3. If it contains any letters from 1. and additionally letters DEFHIKLMNPQRSVWY*, type will be set to ami_bsc.
4. If it contains any letters from 1. and additionally letters WSMKRYBDHVN, type will be set to dna_ext.
5. If it contains any letters from 2. and additionally letters WSMKRYBDHVN, type will be set to rna_ext.
6. If it contains any letters from previous points and additionally letters JOUXZ, type will be set to ami_ext.
7. If it contains any letters that exceed all groups mentioned above, type will be set to unt.
If you specify the alphabet parameter as any of "dna_bsc", "dna_ext", "rna_bsc", "rna_ext", "ami_bsc", "ami_ext", then:
- If safe_mode is FALSE, then sequences are built with the standard alphabet for the given type.
- If safe_mode is TRUE, then sequences are scanned for letters not in the standard alphabet:
  - If no such letters are found, then sequences are built with the standard alphabet for the given type.
  - If at least one such letter is found, then sequences are built with the real alphabet and with type set to unt.
If you specify the alphabet parameter as "unt", then sequences are scanned for the alphabet and subsequently built with the obtained alphabet and type unt.
If you specify the alphabet parameter as a character vector longer than 1, then the type is set to atp and the alphabet is equal to the letters in said parameter.
If ignore_case is set to TRUE, then lowercase letters are turned into uppercase during their interpretation, unless the type is set to atp.
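The guessing order used when only x is given can be sketched in base R. This is a minimal illustration, not the package's implementation: chars, alphabets and guess_type_sketch are hypothetical names, and the alphabets are assembled from the letter groups named in the rules. Two assumptions are made: the basic amino acid alphabet does not contain U (which is what the rna_ext example output on this page implies), and every letter of the input is inspected, so the sketch cannot reproduce wrong guesses for inputs where the real function does not look at the whole sequence.

```r
# Hypothetical helper: split a string into single characters
chars <- function(s) strsplit(s, "")[[1]]

# Alphabets built from the letter groups in the rules; the order of this
# list is the order in which the types are checked.
# Assumption: "U" is not part of the basic amino acid alphabet.
alphabets <- list(
  dna_bsc = chars("ACGT-"),
  rna_bsc = chars("ACGU-"),
  ami_bsc = chars("ACGT-DEFHIKLMNPQRSVWY*"),
  dna_ext = chars("ACGT-WSMKRYBDHVN"),
  rna_ext = chars("ACGU-WSMKRYBDHVN"),
  ami_ext = c(LETTERS, "-", "*")
)

guess_type_sketch <- function(x) {
  # Collect the distinct single-character letters of all sequences
  found <- unique(unlist(strsplit(x, "")))
  found <- found[nzchar(found)]
  # Return the first type whose alphabet covers every letter found
  for (type in names(alphabets)) {
    if (all(found %in% alphabets[[type]])) return(type)
  }
  "unt"
}

guess_type_sketch(c("TMVCCDA", "BASDT-CNN"))
#> [1] "dna_ext"
guess_type_sketch(c("%%YAPLAA", "PLAA"))
#> [1] "unt"
```

An input with no letters at all passes the very first check, which is consistent with a zero-length x getting the dna_bsc type.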
NA values
You can convert letters into other letters using substitute_letters and then use typify or the sq_type<- function to set the type of sq to dna_bsc, dna_ext, rna_bsc, rna_ext, ami_bsc or ami_ext. If your sequences contain NA values, use remove_na.
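The workflow described above might look like the following sketch. It assumes that the tidysq package is installed, that substitute_letters takes a named vector whose names are the letters to replace, and that typify takes the destination type as its second argument; check each function's own documentation before relying on these details. The variable names are illustrative only.

```r
if (requireNamespace("tidysq", quietly = TRUE)) {
  library(tidysq)

  # "%" is outside every standard alphabet, so the type is guessed as unt
  sq_unt <- sq(c("%%YAPLAA", "PLAA"))

  # Assumption: names are the old letters, values are their replacements
  sq_sub <- substitute_letters(sq_unt, c("%" = "A"))

  # All remaining letters fit the basic amino acid alphabet
  sq_ami <- typify(sq_sub, "ami_bsc")

  # Reading RNA with the dna_bsc alphabet turns every "U" into NA
  sq_na <- sq(c("CUUAC", "UACCGGC"), alphabet = "dna_bsc")

  # Drop the NA values; see the remove_na documentation for exact semantics
  sq_clean <- remove_na(sq_na)
}
```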
Functions from input module: import_sq(), random_sq(), read_fasta()
# constructing sq without specifying alphabet:
# Correct sq type will be guessed from appearing letters
## dna_bsc
sq(c("ATGC", "TCGTTA", "TT--AG"))
#> basic DNA sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
## rna_bsc
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"))
#> basic RNA sequences list:
#> [1] CUUAC <5>
#> [2] UACCGGC <7>
#> [3] GCA-ACGU <8>
## ami_bsc
sq(c("YQQPAVVM", "PQCFL"))
#> basic amino acid sequences list:
#> [1] YQQPAVVM <8>
#> [2] PQCFL <5>
## ami_bsc sq can contain "*" - a letter meaning end of translation:
sq(c("MMDF*", "SYIHR*", "MGG*"))
#> basic amino acid sequences list:
#> [1] MMDF* <5>
#> [2] SYIHR* <6>
#> [3] MGG* <4>
## dna_ext
sq(c("TMVCCDA", "BASDT-CNN"))
#> extended DNA sequences list:
#> [1] TMVCCDA <7>
#> [2] BASDT-CNN <9>
## rna_ext
sq(c("WHDHKYN", "GCYVCYU"))
#> extended RNA sequences list:
#> [1] WHDHKYN <7>
#> [2] GCYVCYU <7>
## ami_ext
sq(c("XYOQWWKCNJLO"))
#> extended amino acid sequences list:
#> [1] XYOQWWKCNJLO <12>
## unt - assume that one wants to mark some special element in sequence with "%"
sq(c("%%YAPLAA", "PLAA"))
#> unt (unspecified type) sequences list:
#> [1] %%YAPLAA <8>
#> [2] PLAA <4>
# passing type as alphabet parameter:
# All above examples yield an identical result if type specified is the same as guessed
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_bsc")
#> basic DNA sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), "rna_bsc")
#> basic RNA sequences list:
#> [1] CUUAC <5>
#> [2] UACCGGC <7>
#> [3] GCA-ACGU <8>
sq(c("YQQPAVVM", "PQCFL"), "ami_bsc")
#> basic amino acid sequences list:
#> [1] YQQPAVVM <8>
#> [2] PQCFL <5>
sq(c("MMDF*", "SYIHR*", "MGG*"), "ami_bsc")
#> basic amino acid sequences list:
#> [1] MMDF* <5>
#> [2] SYIHR* <6>
#> [3] MGG* <4>
sq(c("TMVCCDA", "BASDT-CNN"), "dna_ext")
#> extended DNA sequences list:
#> [1] TMVCCDA <7>
#> [2] BASDT-CNN <9>
sq(c("WHDHKYN", "GCYVCYU"), "rna_ext")
#> extended RNA sequences list:
#> [1] WHDHKYN <7>
#> [2] GCYVCYU <7>
sq(c("XYOQWWKCNJLO"), "ami_ext")
#> extended amino acid sequences list:
#> [1] XYOQWWKCNJLO <12>
sq(c("%%YAPLAA", "PLAA"), "unt")
#> unt (unspecified type) sequences list:
#> [1] %%YAPLAA <8>
#> [2] PLAA <4>
# Type doesn't have to be the same as the guessed one if letters fit in the destination alphabet
sq(c("ATGC", "TCGTTA", "TT--AG"), "dna_ext")
#> extended DNA sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_bsc")
#> basic amino acid sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
sq(c("ATGC", "TCGTTA", "TT--AG"), "ami_ext")
#> extended amino acid sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
sq(c("ATGC", "TCGTTA", "TT--AG"), "unt")
#> unt (unspecified type) sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
# constructing sq with specified letters of alphabet:
# In sequences below "mA" denotes methylated alanine - two characters are treated as a single letter
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c("mA", LETTERS))
#> atp (atypical alphabet) sequences list:
#> [1] L mA Q Y mA S S R <8>
#> [2] L mA S M K L K F mA mA <10>
# The order of alphabet letters is not meaningful in most cases
sq(c("LmAQYmASSR", "LmASMKLKFmAmA"), alphabet = c(LETTERS, "mA"))
#> atp (atypical alphabet) sequences list:
#> [1] L mA Q Y mA S S R <8>
#> [2] L mA S M K L K F mA mA <10>
# reading sequences with three-letter names:
sq(c("ProProGlyAlaMetAlaCys"), alphabet = c("Pro", "Gly", "Ala", "Met", "Cys"))
#> atp (atypical alphabet) sequences list:
#> [1] Pro Pro Gly Ala Met Ala Cys <7>
# using safe mode:
# Safe mode guarantees that no element is read as NA
# But the resulting alphabet might be different from the passed one (albeit with a warning/error)
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc", safe_mode = TRUE)
#> Warning: Detected letters that do not match specified type!
#> unt (unspecified type) sequences list:
#> [1] CUUAC <5>
#> [2] UACCGGC <7>
#> [3] GCA-ACGU <8>
sq(c("CUUAC", "UACCGGC", "GCA-ACGU"), alphabet = "dna_bsc")
#> basic DNA sequences list:
#> [1] C!!AC <5>
#> [2] !ACCGGC <7>
#> [3] GCA-ACG! <8>
# Safe mode guesses alphabet based on whole sequence
long_sequence <- paste0(paste0(rep("A", 4500), collapse = ""), "N")
sq(long_sequence, safe_mode = TRUE)
#> basic amino acid sequences list:
#> [1] AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA... <4501>
sq(long_sequence)
#> basic DNA sequences list:
#> [1] AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA... <4501>
# ignoring case:
# By default, lower- and uppercase letters are treated separately
# This behavior can be changed by setting ignore_case = TRUE
sq(c("aTGc", "tcgTTA", "tt--AG"), ignore_case = TRUE)
#> basic DNA sequences list:
#> [1] ATGC <4>
#> [2] TCGTTA <6>
#> [3] TT--AG <6>
sq(c("XYOqwwKCNJLo"), ignore_case = TRUE)
#> extended amino acid sequences list:
#> [1] XYOQWWKCNJLO <12>
# It is possible to construct sq with length 0
sq(character())
#> basic DNA sequences list of length 0
# As well as sq with empty sequences
sq(c("AGTGGC", "", "CATGA", ""))
#> basic DNA sequences list:
#> [1] AGTGGC <6>
#> [2] <NULL> <0>
#> [3] CATGA <5>
#> [4] <NULL> <0>