This function either replaces sequences that contain ambiguous elements with empty (NULL) sequences, or removes only the ambiguous elements from each sequence in an sq object.

remove_ambiguous(x, by_letter = FALSE, ...)

# S3 method for sq
remove_ambiguous(
  x,
  by_letter = FALSE,
  ...,
  NA_letter = getOption("tidysq_NA_letter")
)

Arguments

x

[sq_dna_bsc || sq_rna_bsc || sq_dna_ext || sq_rna_ext || sq_ami_bsc || sq_ami_ext]
An object this function is applied to.

by_letter

[logical(1)]
If FALSE, the filter condition is applied to each sequence as a whole. If TRUE, the filter is applied to each letter separately.

...

Further arguments to be passed from or to other methods.

NA_letter

[character(1)]
A string used to interpret and display NA values in the context of the sq class. The default value is "!".

Value

An sq object with the _bsc version of the input type.

Details

Biological sequences, whether of DNA, RNA or amino acid elements, are not always exactly determined. Sometimes the only information the user has about an element is that it is one of a given set of possible elements. In this case the element is described with one of the special letters, here called ambiguous.

The inclusion of these letters is the difference between extended and basic alphabets (and, consequently, types). For the amino acid alphabet these letters are: B, J, O, U, X, Z; for DNA and RNA they are: W, S, M, K, R, Y, B, D, H, V, N.

remove_ambiguous() is used to create sequences without any of the elements above. Depending on the value of the by_letter argument, the function either replaces "ambiguous" sequences with empty sequences (if by_letter is FALSE, the default) or shortens the original sequences by retaining only unambiguous letters (if by_letter is TRUE).
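
The two modes can be pictured with a minimal base R sketch. It is an illustration only: plain character vectors stand in for an sq object, an empty string stands in for an empty (NULL) sequence, and the variable names (dna_ambiguous, pattern, whole, per_letter) are hypothetical. It does not reflect how remove_ambiguous() is implemented internally.

```r
# Illustration only: plain character vectors stand in for an sq object.
# Ambiguous DNA/RNA letters as listed in the Details section.
dna_ambiguous <- c("W", "S", "M", "K", "R", "Y", "B", "D", "H", "V", "N")

# Build a character class such as "[WSMKRYBDHVN]" matching any ambiguous letter.
pattern <- paste0("[", paste(dna_ambiguous, collapse = ""), "]")

seqs <- c("ATGCAGGA", "GACCGAACGAN", "TGACGAGCTTA", "ACTNNAGCN")

# Analogue of by_letter = FALSE: a sequence containing any ambiguous
# letter is replaced by an empty sequence (represented here as "").
whole <- ifelse(grepl(pattern, seqs), "", seqs)

# Analogue of by_letter = TRUE: only the ambiguous letters are dropped,
# so the affected sequences become shorter.
per_letter <- gsub(pattern, "", seqs)
```

With this input, the second and fourth elements of whole are empty, while per_letter keeps all four sequences with every N removed, mirroring the DNA output in the Examples section.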

See also

Functions that clean sequences: is_empty_sq(), remove_na()

Examples

# Creating objects to work on:
sq_ami <- sq(c("MIAANYTWIL", "TIAALGNIIYRAIE", "NYERTGHLI", "MAYXXXIALN"),
             alphabet = "ami_ext")
sq_dna <- sq(c("ATGCAGGA", "GACCGAACGAN", "TGACGAGCTTA", "ACTNNAGCN"),
             alphabet = "dna_ext")

# Removing whole sequences with ambiguous elements:
remove_ambiguous(sq_ami)
#> basic amino acid sequences list:
#> [1] MIAANYTWIL                                                              <10>
#> [2] TIAALGNIIYRAIE                                                          <14>
#> [3] NYERTGHLI                                                                <9>
#> [4] <NULL>                                                                   <0>
remove_ambiguous(sq_dna)
#> basic DNA sequences list:
#> [1] ATGCAGGA                                                                 <8>
#> [2] <NULL>                                                                   <0>
#> [3] TGACGAGCTTA                                                             <11>
#> [4] <NULL>                                                                   <0>

# Removing ambiguous elements from sequences:
remove_ambiguous(sq_ami, by_letter = TRUE)
#> basic amino acid sequences list:
#> [1] MIAANYTWIL                                                              <10>
#> [2] TIAALGNIIYRAIE                                                          <14>
#> [3] NYERTGHLI                                                                <9>
#> [4] MAYIALN                                                                  <7>
remove_ambiguous(sq_dna, by_letter = TRUE)
#> basic DNA sequences list:
#> [1] ATGCAGGA                                                                 <8>
#> [2] GACCGAACGA                                                              <10>
#> [3] TGACGAGCTTA                                                             <11>
#> [4] ACTAGC                                                                   <6>

# Analysis of the result:
sq_clean <- remove_ambiguous(sq_ami)
is_empty_sq(sq_clean)
#> [1] FALSE FALSE FALSE  TRUE
sq_type(sq_clean)
#> [1] "ami_bsc"