Finds all given motifs in sequences and returns their positions.
[sq] An object this function is applied to.
Further arguments to be passed from or to other methods.
[character] Vector of sequence names. Must be of the same length as the sq object.
[character] Motifs to be searched for.
[character(1)] A string used to interpret and display NA values in the context of the sq class. Defaults to "!".
[character(1)] Name of a column that stores sequences.
[character(1)] Name of a column that stores names (unique identifiers).
A tibble with the following columns:
names: name of the sequence in which a motif was found
sought: sought motif
found: found subsequence; may differ from sought if the motif contained ambiguous letters
start: position of the first element of the found motif
end: position of the last element of the found motif
This function searches for a given motif or motifs in an sq object. It returns every match found, together with its start and end positions within the sequence.
A motif does not have to be a literal string representation of the sought subsequence. When this function is used with any of the standard types, i.e. ami, dna or rna, a motif may contain ambiguous letters, in which case the engine tries to match any of the possible meanings of such a letter. For example, take "B" from the extended DNA alphabet. It means "not A", so it matches "C", "G" and "T", but also "B", "Y" (either "C" or "T"), "K" (either "G" or "T") and "S" (either "C" or "G").
The full list of ambiguous letters and their meanings can be found on the IUPAC site.
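The matching rule for "B" described above can be sketched in base R. This is only an illustration of the rule, not the tidysq engine: the iupac_dna table and the matched_by() helper are hypothetical names introduced here, and the sketch assumes that a motif letter matches every letter whose possible meanings all fall within its own.

```r
# Standard IUPAC DNA codes, written out by hand for this sketch
# (not taken from tidysq internals).
iupac_dna <- list(
  A = "A", C = "C", G = "G", T = "T",
  R = c("A", "G"), Y = c("C", "T"), S = c("C", "G"), W = c("A", "T"),
  K = c("G", "T"), M = c("A", "C"),
  B = c("C", "G", "T"), D = c("A", "G", "T"),
  H = c("A", "C", "T"), V = c("A", "C", "G"),
  N = c("A", "C", "G", "T")
)

# Hypothetical helper: returns the letters whose meanings are all
# contained in the meanings of the given motif letter.
matched_by <- function(motif_letter) {
  allowed <- iupac_dna[[motif_letter]]
  is_subset <- vapply(iupac_dna, function(m) all(m %in% allowed), logical(1))
  names(iupac_dna)[is_subset]
}

matched_by("B")
#> [1] "C" "G" "T" "Y" "S" "K" "B"
```

Letters such as "N" or "D" are absent from the result because they could also stand for "A", which "B" excludes.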
Motifs are also restricted in that the alphabets of the sq objects being searched cannot contain the "^" and "$" symbols. These two have a special meaning: they indicate the beginning and the end of a sequence respectively, and can be used to constrain the position of matched subsequences.
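The two symbols behave much like regular-expression anchors. As a rough analogy only, the base R sketch below runs regexpr() on plain character strings rather than sq objects, so it does not show how tidysq performs the search. The locate() helper is a hypothetical name, and "." stands in here for an any-letter symbol.

```r
# Plain strings used for illustration; these are not sq objects.
sequences <- c(sq1 = "ATGCAGGA", sq2 = "GACCGNBAACGAN", sq3 = "TGACGAGCTTAG")

# Hypothetical helper: first match of a regular expression in each string,
# reported with start and end positions; strings without a match are dropped.
locate <- function(pattern, seqs) {
  hits <- regexpr(pattern, seqs)
  start <- as.integer(hits)
  len <- attr(hits, "match.length")
  keep <- start > 0
  data.frame(
    names = names(seqs)[keep],
    start = start[keep],
    end = (start + len - 1L)[keep]
  )
}

# "^" pins the match to the beginning, so "C" must sit at the fourth position.
# All three strings match at positions 1 to 4.
locate("^...C", sequences)

# "$" pins the match to the end, so "A" must sit at the second-to-last position.
# Only sq2 (12 to 13) and sq3 (11 to 12) match.
locate("A.$", sequences)
```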
Functions interpreting sq in biological context: %has%(), complement(), translate()
# Creating objects to work on:
sq_dna <- sq(c("ATGCAGGA", "GACCGNBAACGAN", "TGACGAGCTTAG"),
             alphabet = "dna_bsc")
sq_ami <- sq(c("AGNTYIKFGGAYTI", "MATEGILIAADGYTWIL", "MIPADHICAANGIENAGIK"),
             alphabet = "ami_bsc")
sq_atp <- sq(c("mAmYmY", "nbAnsAmA", ""),
             alphabet = c("mA", "mY", "nbA", "nsA"))
sq_names <- c("sq1", "sq2", "sq3")
# Finding motif of two alanines followed by aspartic acid or asparagine
# ("AAB" motif matches "AAB", "AAD" and "AAN"):
find_motifs(sq_ami, sq_names, "AAB")
#> # A tibble: 2 × 5
#> names found sought start end
#> <chr> <ami_bsc> <chr> <int> <int>
#> 1 sq2 AAD <3> AAB 9 11
#> 2 sq3 AAN <3> AAB 9 11
# Finding "C" at fourth position:
find_motifs(sq_dna, sq_names, "^NNNC")
#> # A tibble: 3 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 sq1 ATGC <4> ^NNNC 1 4
#> 2 sq2 GACC <4> ^NNNC 1 4
#> 3 sq3 TGAC <4> ^NNNC 1 4
# Finding motif "I" at second-to-last position:
find_motifs(sq_ami, sq_names, "IX$")
#> # A tibble: 2 × 5
#> names found sought start end
#> <chr> <ami_bsc> <chr> <int> <int>
#> 1 sq2 IL <2> IX$ 16 17
#> 2 sq3 IK <2> IX$ 18 19
# Finding multiple motifs:
find_motifs(sq_dna, sq_names, c("^ABN", "ANCBY", "BAN$"))
#> # A tibble: 5 × 5
#> names found sought start end
#> <chr> <dna_bsc> <chr> <int> <int>
#> 1 sq1 ATG <3> ^ABN 1 3
#> 2 sq2 ACCG! <5> ANCBY 2 6
#> 3 sq3 AGCTT <5> ANCBY 6 10
#> 4 sq2 GA! <3> BAN$ 11 13
#> 5 sq3 TAG <3> BAN$ 10 12
# Finding multicharacter motifs:
find_motifs(sq_atp, sq_names, c("nsA", "mYmY$"))
#> # A tibble: 2 × 5
#> names found sought start end
#> <chr> <atp> <chr> <int> <int>
#> 1 sq2 nsA <1> nsA 2 2
#> 2 sq1 mYmY <2> mYmY$ 2 3
# It can be a part of tidyverse pipeline:
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following object is masked from ‘package:tidysq’:
#>
#> collapse
#> The following objects are masked from ‘package:stats’:
#>
#> filter, lag
#> The following objects are masked from ‘package:base’:
#>
#> intersect, setdiff, setequal, union
fasta_file <- system.file(package = "tidysq", "examples/example_aa.fasta")
read_fasta(fasta_file) %>%
  mutate(name = toupper(name)) %>%
  find_motifs("TXG")
#> # A tibble: 4 × 5
#> names found sought start end
#> <chr> <ami_bsc> <chr> <int> <int>
#> 1 AMY446|UNTITLED|CHORION B TAG <3> TXG 8 10
#> 2 AMY456|UNTITLED|LYSOZYME (HEN) TPG <3> TXG 21 23
#> 3 AMY465|UNTITLED|APOLIPOPROTEIN A-I (G26R) TEG <3> TXG 79 81
#> 4 AMY608|UNTITLED|DE NOVO TIG <3> TXG 4 6