Finds all given motifs in sequences and returns their positions.

find_motifs(x, ...)

# S3 method for sq
find_motifs(x, name, motifs, ..., NA_letter = getOption("tidysq_NA_letter"))

# S3 method for data.frame
find_motifs(
  x,
  motifs,
  ...,
  .sq = "sq",
  .name = "name",
  NA_letter = getOption("tidysq_NA_letter")
)

Arguments

x

[sq]
An object this function is applied to.

...

further arguments to be passed from or to other methods.

name

[character]
Vector of sequence names. Must be of the same length as the sq object.

motifs

[character]
Motifs to be searched for.

NA_letter

[character(1)]
A string used to interpret and display NA values in the context of the sq class. Defaults to the "tidysq_NA_letter" option, which is "!" unless changed.

.sq

[character(1)]
Name of a column that stores sequences.

.name

[character(1)]
Name of a column that stores names (unique identifiers).

Value

A tibble with the following columns:

names

name of the sequence in which a motif was found

sought

sought motif

found

found subsequence; may differ from the sought motif if the motif contains ambiguous letters

start

position of first element of found motif

end

position of last element of found motif

Details

This function searches the sequences of an sq object for one or more motifs. It returns every match found, together with its start and end positions within the sequence.

Motif capabilities and restrictions

A motif does not have to be a literal string representation of the sought subsequence. When this function is used with any of the standard types, i.e. ami, dna or rna, a motif can contain ambiguous letters. In this case the engine tries to match every possible meaning of each such letter. For example, take "B" from the extended DNA alphabet. It means "not A", so it matches "C", "G" and "T", but also "B", "Y" (either "C" or "T"), "K" (either "G" or "T") and "S" (either "C" or "G").

The full list of ambiguous letters and their meanings can be found on the IUPAC site.
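
The matching rule for ambiguous letters can be illustrated in plain R. The sketch below is not how tidysq implements the search; it only encodes the standard IUPAC DNA table and asks which sequence letters a given motif letter covers. The names iupac_dna and matching_letters() are hypothetical helpers introduced for this illustration.

```r
# Standard IUPAC DNA codes mapped to the bases they stand for.
# (Hypothetical lookup table for illustration; not a tidysq object.)
iupac_dna <- list(
  A = "A", C = "C", G = "G", T = "T",
  R = c("A", "G"), Y = c("C", "T"), S = c("C", "G"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper: returns every letter whose meaning is fully
# covered by the meaning of the motif letter.
matching_letters <- function(motif_letter) {
  allowed <- iupac_dna[[motif_letter]]
  covered <- vapply(
    iupac_dna,
    function(bases) all(bases %in% allowed),
    logical(1)
  )
  names(iupac_dna)[covered]
}

# Letters that a "B" in a motif would cover under this rule.
matching_letters("B")
```

Under this rule a letter is covered when each of its possible bases is also a possible base of the motif letter, which is why "Y", "K" and "S" qualify for "B" while "N" does not.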

Motifs also impose a restriction: the alphabets of the sq objects being searched cannot contain the "^" and "$" symbols. These two have a special meaning: they indicate the beginning and the end of a sequence, respectively, and can be used to constrain the position of matched subsequences.
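
The two anchors behave much like "^" and "$" in ordinary regular expressions. As a rough analogy in base R (an assumption made for illustration, not a description of the tidysq engine), an anchored motif such as "^NNNC" can be written by hand as a regular expression in which each "N" becomes the class [ACGT]. The helper locate() and the plain character vector below are hypothetical.

```r
# Plain character sequences standing in for an sq object (illustration only).
sequences <- c(sq1 = "ATGCAGGA", sq2 = "GACCGNBAACGAN", sq3 = "TGACGAGCTTAG")

# Hypothetical helper: first match of a regular expression in each string,
# reported as 1-based start and end positions; strings without a match
# are dropped.
locate <- function(pattern, x) {
  hits   <- regexpr(pattern, x)
  starts <- as.integer(hits)               # -1 where nothing matched
  lens   <- attr(hits, "match.length")
  found  <- starts > 0
  data.frame(
    name  = names(x)[found],
    start = starts[found],
    end   = (starts + lens - 1L)[found]
  )
}

# "^" pins the match to the beginning: any three bases, then "C".
at_start <- locate("^[ACGT]{3}C", sequences)

# "$" pins the match to the end: the sequence must finish with "GA".
at_end <- locate("GA$", sequences)

print(at_start)
print(at_end)
```

Without the anchors the same patterns could match anywhere in a sequence; with them, a match is reported only if it sits at the required end of the sequence.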

See also

Functions interpreting sq in biological context: %has%(), complement(), translate()

Examples

# Creating objects to work on:
sq_dna <- sq(c("ATGCAGGA", "GACCGNBAACGAN", "TGACGAGCTTAG"),
             alphabet = "dna_bsc")
sq_ami <- sq(c("AGNTYIKFGGAYTI", "MATEGILIAADGYTWIL", "MIPADHICAANGIENAGIK"),
             alphabet = "ami_bsc")
sq_atp <- sq(c("mAmYmY", "nbAnsAmA", ""),
             alphabet = c("mA", "mY", "nbA", "nsA"))
sq_names <- c("sq1", "sq2", "sq3")

# Finding a motif of two alanines followed by aspartic acid or asparagine
# (the "AAB" motif matches "AAB", "AAD" and "AAN"):
find_motifs(sq_ami, sq_names, "AAB")
#> # A tibble: 2 × 5
#>   names found     sought start   end
#>   <chr> <ami_bsc> <chr>  <int> <int>
#> 1 sq2   AAD <3>   AAB        9    11
#> 2 sq3   AAN <3>   AAB        9    11

# Finding "C" at the fourth position:
find_motifs(sq_dna, sq_names, "^NNNC")
#> # A tibble: 3 × 5
#>   names found     sought start   end
#>   <chr> <dna_bsc> <chr>  <int> <int>
#> 1 sq1   ATGC <4>  ^NNNC      1     4
#> 2 sq2   GACC <4>  ^NNNC      1     4
#> 3 sq3   TGAC <4>  ^NNNC      1     4

# Finding "I" at the second-to-last position:
find_motifs(sq_ami, sq_names, "IX$")
#> # A tibble: 2 × 5
#>   names found     sought start   end
#>   <chr> <ami_bsc> <chr>  <int> <int>
#> 1 sq2   IL <2>    IX$       16    17
#> 2 sq3   IK <2>    IX$       18    19

# Finding multiple motifs:
find_motifs(sq_dna, sq_names, c("^ABN", "ANCBY", "BAN$"))
#> # A tibble: 5 × 5
#>   names found     sought start   end
#>   <chr> <dna_bsc> <chr>  <int> <int>
#> 1 sq1   ATG   <3> ^ABN       1     3
#> 2 sq2   ACCG! <5> ANCBY      2     6
#> 3 sq3   AGCTT <5> ANCBY      6    10
#> 4 sq2   GA!   <3> BAN$      11    13
#> 5 sq3   TAG   <3> BAN$      10    12

# Finding multicharacter motifs:
find_motifs(sq_atp, sq_names, c("nsA", "mYmY$"))
#> # A tibble: 2 × 5
#>   names found    sought start   end
#>   <chr> <atp>    <chr>  <int> <int>
#> 1 sq2   nsA  <1> nsA        2     2
#> 2 sq1   mYmY <2> mYmY$      2     3

# find_motifs() can be part of a tidyverse pipeline:
library(dplyr)
#> 
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:tidysq’:
#> 
#>     collapse
#> The following objects are masked from ‘package:stats’:
#> 
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#> 
#>     intersect, setdiff, setequal, union
fasta_file <- system.file(package = "tidysq", "examples/example_aa.fasta")
read_fasta(fasta_file) %>%
  mutate(name = toupper(name)) %>%
  find_motifs("TXG")
#> # A tibble: 4 × 5
#>   names                                     found     sought start   end
#>   <chr>                                     <ami_bsc> <chr>  <int> <int>
#> 1 AMY446|UNTITLED|CHORION B                 TAG <3>   TXG        8    10
#> 2 AMY456|UNTITLED|LYSOZYME (HEN)            TPG <3>   TXG       21    23
#> 3 AMY465|UNTITLED|APOLIPOPROTEIN A-I (G26R) TEG <3>   TXG       79    81
#> 4 AMY608|UNTITLED|DE NOVO                   TIG <3>   TXG        4     6