Returns the alphabet attribute of an object.
alphabet(x)
x [sq] An object to extract the alphabet from.
A character vector of letters of the alphabet.
Each sq object has an alphabet associated with it. The alphabet is the set
of letters that can appear in the sequences contained in the object. It is
kept mostly as a character vector, where each element represents one letter.
sq objects of type ami, dna or rna have fixed, predefined alphabets. In other
words, if two sq objects have exactly the same type (ami_bsc, dna_ext,
rna_bsc or any other combination), they are guaranteed to have the same
alphabet. The alphabets for these types are listed below:
ami_bsc - ACDEFGHIKLMNPQRSTVWY-*
ami_ext - ABCDEFGHIJKLMNOPQRSTUVWXYZ-*
dna_bsc - ACGT-
dna_ext - ACGTWSMKRYBDHVN-
rna_bsc - ACGU-
rna_ext - ACGUWSMKRYBDHVN-
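The relation between a standard type and its alphabet can be checked with a short sketch like the one below. It assumes that tidysq is installed and that a single type string such as "dna_bsc" can be passed to the alphabet argument of sq(); the example sequences are made up for illustration.

```r
library(tidysq)

# Build a basic DNA sq object; the type is given explicitly (illustrative data)
dna <- sq(c("ACGT", "TTGA-C"), alphabet = "dna_bsc")

# Extract the alphabet attribute of the object
alphabet(dna)

# The predefined alphabet for the same type
get_standard_alphabet("dna_bsc")

# Both should contain the same letters
same_letters <- identical(
  as.character(alphabet(dna)),
  as.character(get_standard_alphabet("dna_bsc"))
)
```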
Other types of sq objects are allowed to have different alphabets.
Furthermore, having an alphabet identical to one of those above does not
automatically mean that the sequence has that type; e.g., there might be an
atp sq whose alphabet is identical to the ami_bsc alphabet. To set the type,
use the typify or `sq_type<-` function.
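A minimal sketch of this distinction follows. It assumes that passing a vector of letters as the alphabet argument of sq() yields an atp object and that sq_type() returns the type as a string; the sequences and the chosen subset of letters are illustrative only.

```r
library(tidysq)

# An atypical (atp) object whose letters all happen to be valid ami_bsc letters
atp_seq <- sq(c("ACDE", "FGHIK"),
              alphabet = c("A", "C", "D", "E", "F", "G", "H", "I", "K"))
type_before <- sq_type(atp_seq)

# Explicitly convert the object to the basic amino acid type
ami_seq <- typify(atp_seq, "ami_bsc")
type_after <- sq_type(ami_seq)
```

The alphabet alone does not decide the type here; the type only changes once typify() is called.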
The unt and atp alphabets exist because, although there is a standard for
the format of fasta files, some files contain symbols that do not match the
standard. Thanks to these types, tidysq can import files with customized
alphabets. Moreover, the user may want to group amino acids with similar
properties (e.g., for machine learning) and replace the standard alphabet
with symbols for whole groups. For details, see read_fasta, sq and
substitute_letters.
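Grouping letters could look roughly like the sketch below. It assumes that substitute_letters() accepts a named character vector mapping old letters to new ones; the grouping itself ("p" for positively charged, "n" for negatively charged residues) is a hypothetical encoding chosen only for illustration.

```r
library(tidysq)

# Illustrative amino acid sequences
ami <- sq(c("KRDE", "HEDK"), alphabet = "ami_bsc")

# Hypothetical encoding: names are original letters, values are group symbols
grouping <- c(K = "p", R = "p", H = "p", D = "n", E = "n")

# Replace single amino acids with symbols for whole groups
grouped <- substitute_letters(ami, grouping)

# The result no longer uses the standard alphabet
alphabet(grouped)
grouped_type <- sq_type(grouped)
```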
Important note: atp alphabets may contain letters that consist of more than
one character. This functionality is provided to handle situations like
post-translational modifications (e.g., using "mA" to indicate methylated
alanine).
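A sketch of an atp object with multi-character letters is shown below. The letters other than "mA" are invented labels for this example, and the code assumes that sq() splits each sequence according to the supplied alphabet.

```r
library(tidysq)

# Hypothetical alphabet of modified residues; only "mA" comes from the text above
mod_alphabet <- c("mA", "mY", "nbA", "nsA")

# Each sequence is read as a series of multi-character letters
mod_seq <- sq(c("mAmYmY", "nbAnsAmA"), alphabet = mod_alphabet)

# The alphabet keeps every letter as one element of the character vector
alphabet(mod_seq)
```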
Important note: alphabets of atp and unt sq objects are case sensitive. Thus,
both lowercase and uppercase characters can appear in their alphabets at the
same time, and they are treated as different letters. Alphabets of dna, rna
and ami types are always uppercase, and all functions convert other
parameters to uppercase when working with dna, rna or ami objects; e.g., the
%has% operator converts lowercase letters to uppercase when searching for
motifs in a dna, rna or ami object.
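The difference in case handling can be illustrated with the sketch below. The sequences are made up, and the comments describe the behaviour stated above rather than output verified here.

```r
library(tidysq)

# atp alphabets are case sensitive: "a" and "A" are distinct letters
mixed <- sq("aAbB", alphabet = c("a", "A", "b", "B"))
alphabet(mixed)

# dna alphabets are uppercase only
dna <- sq(c("ACGTT", "TTTT"), alphabet = "dna_bsc")

# A lowercase motif should be converted to uppercase before searching
has_motif <- dna %has% "acg"
```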
Important note: the maximum length of an alphabet is 30 letters. The user is
not allowed to read fasta files or construct sq objects from character
vectors that have more than 30 distinct characters in sequences (unless
creating ami, dna or rna objects with the ignore_case parameter set to TRUE).
Functions from alphabet module:
get_standard_alphabet()