Sequence data simulation

We generate $n$ sequences $s_1, \ldots, s_n$ of length $N$ based on real frequencies of amino acids over the full alphabet.

Example sequences of length 10.
A L A V P H G K T F
S L Q W E P V L D T
R I F N N V Q G A A
G C S D G Y D Q T R
Y L R R S R P D A V
N V S M M T R G D I
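
Below is a minimal sketch of this sampling step in Python (not the code used to produce the examples above); the background frequencies are illustrative placeholders, not the empirical values used here.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Hypothetical background frequencies; a real run would use empirical ones.
FREQS = [0.08, 0.02, 0.05, 0.06, 0.04, 0.07, 0.02, 0.06, 0.06, 0.09,
         0.02, 0.04, 0.05, 0.04, 0.05, 0.07, 0.06, 0.07, 0.01, 0.04]

def simulate_sequences(n, N, rng=random):
    """Draw n sequences of length N from the background distribution."""
    return ["".join(rng.choices(AMINO_ACIDS, weights=FREQS, k=N))
            for _ in range(n)]

print(simulate_sequences(n=6, N=10))
```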

Motif generation

We generate a set of $m$ motifs ($m_1, \ldots, m_m$), where _ denotes a gap position that matches any residue.

Example motifs.
A _ _ B
Y L _ G _ _ D
N _ _ _ _ M
A B _ T
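
A possible sketch of this step, assuming motifs are built from a few residues separated by random-length gaps; the parameter names `n_residues` and `max_gap` are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simulate_motif(n_residues=3, max_gap=2, rng=random):
    """Residues separated by 0..max_gap wildcard ('_') positions."""
    parts = [rng.choice(AMINO_ACIDS)]
    for _ in range(n_residues - 1):
        parts.append("_" * rng.randint(0, max_gap))
        parts.append(rng.choice(AMINO_ACIDS))
    return "".join(parts)

motifs = [simulate_motif() for _ in range(4)]
print(motifs)  # e.g. ['A__B', 'N_M', ...]
```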

Motif injection

We inject a motif by replacing a randomly selected fragment of a sequence with that motif. For example:

Example sequences of length 10 after injection of the motif A B _ T.
A L A V P H G K T F
S L Q W E P V L D T
R I F N N V Q G A A
G A B D T Y D Q T R
Y L R R S R A B A T
N V A B M T R G D I

We inject from $0$ to $k$ motifs into a single sequence according to the following procedure:

[Figure: Simulation scheme.]
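
A minimal sketch of the injection procedure, assuming the window for a motif is chosen uniformly at random and gap positions (`_`) keep the original residue; `inject_up_to_k` is a hypothetical helper mirroring the scheme above (it assumes `k` does not exceed the number of motifs).

```python
import random

def inject_motif(seq, motif, rng=random):
    """Replace a random window of seq with motif; '_' keeps the old residue."""
    start = rng.randrange(len(seq) - len(motif) + 1)
    chars = list(seq)
    for offset, symbol in enumerate(motif):
        if symbol != "_":
            chars[start + offset] = symbol
    return "".join(chars)

def inject_up_to_k(sequences, motifs, k, rng=random):
    """Inject between 0 and k randomly chosen motifs into each sequence."""
    out = []
    for seq in sequences:
        for motif in rng.sample(motifs, rng.randint(0, k)):
            seq = inject_motif(seq, motif, rng)
        out.append(seq)
    return out

print(inject_motif("YLRRSRPDAV", "AB_T"))  # e.g. 'YLRRSRABAT'
```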

Target variable sampling

Let us define a random variable $X_K$ on the set of sequences $s_1, \ldots, s_n$ which describes whether a sequence $K$ is a subsequence of $s_i$. Namely, for any sequence $s_i$:

$$X_K(s_i) = \left\{\begin{array}{rl} 1, & K \subseteq s_i\\ 0, & \text{otherwise} \end{array}\right.$$

where $K \subseteq s_i$ means that $K$ is a subsequence of $s_i$. For example,

$$X_{AB}(ABCD) = 1, \qquad X_{A\_B}(ABCD) = 0.$$
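
A small sketch of $X_K$, assuming (as the examples indicate) that "subsequence" here means a contiguous match in which `_` matches any residue:

```python
def x(motif, seq):
    """Return 1 if motif occurs in seq ('_' is a wildcard), else 0."""
    for start in range(len(seq) - len(motif) + 1):
        if all(m == "_" or m == s
               for m, s in zip(motif, seq[start:start + len(motif)])):
            return 1
    return 0

print(x("AB", "ABCD"))   # 1
print(x("A_B", "ABCD"))  # 0
```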

Logistic regression

In this case we consider a standard logistic regression model where the joint effect of all motifs is the sum of their individual effects. Let $w_1, \ldots, w_m$ be the weights related to motifs $m_1, \ldots, m_m$, and let $w_0$ be the baseline effect for sequences without motifs. Then, we can define an additive logistic model as follows:

$$g(EY) = w_0 + w_1 X_{m_1} + w_2 X_{m_2} + \ldots + w_m X_{m_m}$$

We assume some particular values of $w_1, \ldots, w_m$ and calculate the vector of probabilities $p$ as follows:

$$p = g^{-1}(w_0 + w_1 X_{m_1} + w_2 X_{m_2} + \ldots + w_m X_{m_m}).$$

Having $p$, we simulate $y_i$ from the binomial distribution $B(1, p_i)$.
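
As a sketch, with $g^{-1}$ taken to be the inverse logit and purely illustrative weights ($X$ is the $n \times m$ matrix of indicators $X_{m_j}(s_i)$):

```python
import math
import random

def simulate_y(X, w0, w, rng=random):
    """Compute p_i = g^{-1}(w0 + sum_j w_j * X_ij) and draw y_i ~ B(1, p_i)."""
    y = []
    for row in X:
        eta = w0 + sum(wj * xj for wj, xj in zip(w, row))
        p = 1.0 / (1.0 + math.exp(-eta))  # inverse logit
        y.append(1 if rng.random() < p else 0)
    return y

X = [[1, 0], [0, 0], [1, 1]]  # indicators for two motifs in three sequences
print(simulate_y(X, w0=-1.0, w=[2.0, 1.5]))
```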

Logistic regression with interactions

Another approach is based on interactions, meaning that the effect of one predictor depends on the value of another predictor. Let us define the maximum number of motifs per sequence $k = \max\lbrace k_i, i = 1, \ldots, n\rbrace$, where $k_i$ is the number of motifs injected into $s_i$. Let $w_1, \ldots, w_k$ denote the weights of single effects. Namely:

$$g(EY) = w_0 + \sum_{i = 1}^{k} w_{i} X_{m_i} + \left(\sum_{i = 1}^{k-1}\sum_{j = i + 1}^{k} w_{ij} X_{m_i}X_{m_j}\right) + \ldots + w_{1\ldots k} X_{m_1}\cdots X_{m_k}$$
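
A sketch of the interaction predictor for one sequence; the mapping `w_pair` is a hypothetical representation of the pairwise weights $w_{ij}$ (higher-order terms would extend the same pattern):

```python
def eta_with_interactions(x, w0, w_single, w_pair):
    """w_pair maps an index pair (i, j), i < j, to its weight w_ij."""
    eta = w0 + sum(wi * xi for wi, xi in zip(w_single, x))
    for (i, j), wij in w_pair.items():
        eta += wij * x[i] * x[j]
    return eta

x = [1, 0, 1]  # indicators X_{m_1}, X_{m_2}, X_{m_3} for one sequence
print(eta_with_interactions(x, w0=-1.0,
                            w_single=[2.0, 1.5, 0.5],
                            w_pair={(0, 2): -1.0, (1, 2): 0.75}))
```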

Logic regression

Here, we consider new variables $L_1, \ldots, L_l$, where each of them is a logic expression based on a subset of the motifs $m_1, \ldots, m_m$. For example, $L_1(m_1, m_2, m_3) = (X_{m_1} \land X_{m_2}) \lor X_{m_3}$. Each variable $L_i$ obtains its own weight in the model. Our model is the following:

$$g(EY) = w_0 + \sum_{i = 1}^{l} w_i L_i.$$
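
A sketch of this predictor, with each logic variable $L_i$ expressed as a hypothetical callable over the vector of motif indicators:

```python
import math

def logic_model(x, w0, weights, exprs):
    """eta = w0 + sum_i w_i * L_i(x), with L_i given as callables on x."""
    eta = w0 + sum(w * int(L(x)) for w, L in zip(weights, exprs))
    return 1.0 / (1.0 + math.exp(-eta))  # g^{-1}(eta)

# L_1 = (X_{m_1} and X_{m_2}) or X_{m_3}, as in the example above
L1 = lambda x: (x[0] and x[1]) or x[2]
print(logic_model([1, 0, 1], w0=-1.0, weights=[2.0], exprs=[L1]))
```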