Documentation
StruM: Structural Motifs
This package provides functionality for computing structural representations of DNA sequence motifs. Estimates for DNA structure come from the DiNucleotide Property Database (http://diprodb.leibniz-fli.de/).
This version relies on Cython for speed. The speed improvements are most noticeable when scoring longer sequences.
strum.rev_comp(str seq) → str
Reverse complement a DNA sequence.
Parameters: seq (str.) – A DNA sequence.
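For intuition, reverse complementing can be sketched in pure Python as below. This is an illustrative stand-in, not the package's Cython implementation; the name `rev_comp_sketch` is hypothetical, and mapping N to N is an assumption.

```python
# Illustrative stand-in; not the package's Cython implementation.
# Assumption: N is left as N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rev_comp_sketch(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    # Complement each base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


print(rev_comp_sketch("AACGN"))  # NCGTT
```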
class strum.StruM
StruM Object: Train and work with Structural Motifs
StruMs can be learned via maximum likelihood (train()) from a known set of aligned binding sites, or via expectation maximization (train_EM()) from a set of broader regions. For speed, this package will trim the sequences to the same length. The length can be specified by the user; otherwise the shortest sequence is used as a guide.
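The trimming step can be pictured with the sketch below. It assumes the trimming is symmetric, as the seqlength parameter of train_EM() describes. The helper `trim_symmetric` is hypothetical, and how the package handles an odd number of excess bases is a guess.

```python
def trim_symmetric(seq: str, length: int) -> str:
    """Trim seq to `length` characters, removing equally from both ends."""
    if length > len(seq):
        raise ValueError("target length exceeds sequence length")
    excess = len(seq) - length
    start = excess // 2  # any odd leftover base is removed from the right end
    return seq[start:start + length]


seqs = ["AACCGGTT", "ACGT", "TTACGTAA"]
target = min(len(s) for s in seqs)  # shortest sequence as the guide
trimmed = [trim_symmetric(s, target) for s in seqs]
print(trimmed)  # ['CCGG', 'ACGT', 'ACGT']
```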
Additional features other than those from DiProDB can be incorporated by using the update() method.
__init__(self, str mode='full', int n_process=1, custom_filter=[])
Create a FastStruM object.
Parameters: - mode (str.) – Defines which subset of available features in the DiProDB table to use. Choose from: [‘basic’, ‘protein’, ‘groove’, ‘proteingroove’, ‘unique’, ‘full’, ‘nucs’, ‘custom’]
MODE            Features
basic           Twist, Rise, Bend.
protein         (for DNA-protein complex) Roll, Twist, Tilt, Slide, Shift, Rise.
groove          Major Groove Width, Major Groove Depth, Major Groove Size, Major Groove Distance, Minor Groove Width, Minor Groove Depth, Minor Groove Size, Minor Groove Distance.
proteingroove   The union of the “protein” and “groove” filters.
unique          Filters the table for the first occurrence of each type of feature.
full            All available double stranded DNA features.
nucs            Adenine content, Guanine content, Cytosine content, Thymine content.
custom          Manually select desired features.
- custom_filter (list of ints.) – Specifies the indices of desired features from the DiProDB table.
- n_process (int.) – Number of threads to use. -1 uses all processors.
define_PWM(self, list seqs, weights=None)
Computes a position weight matrix from sequences used to train the StruM.
Parameters: - seqs (list of str.) – Training set, composed of a gapless alignment of binding sites of equal length.
- weights (1D array of floats.) – Weights to associate with each of the sequences in seqs to use in learning the motif.
Returns: None. Sets the position weight matrix self.PWM based on the weighted sequences.
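As a rough illustration of what a weighted position weight matrix looks like, the sketch below accumulates per-sequence weights into base frequencies for each column. It is not the package's code: the function `weighted_pwm` is hypothetical, and the row order (A, C, G, T), the absence of pseudocounts, and the skipping of N are all assumptions.

```python
def weighted_pwm(seqs, weights=None):
    """Return a 4 x L matrix of weighted base frequencies (rows A, C, G, T)."""
    if weights is None:
        weights = [1.0] * len(seqs)
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be of equal length")
    rows = {base: [0.0] * length for base in "ACGT"}
    for seq, w in zip(seqs, weights):
        for pos, base in enumerate(seq):
            if base in rows:  # bases such as N are skipped in this sketch
                rows[base][pos] += w
    # Normalize each column so that it sums to 1 (no pseudocounts here).
    for pos in range(length):
        total = sum(rows[b][pos] for b in "ACGT")
        if total > 0:
            for b in "ACGT":
                rows[b][pos] /= total
    return [rows[b] for b in "ACGT"]


pwm = weighted_pwm(["AC", "AG", "AG"], weights=[2.0, 1.0, 1.0])
print(pwm)  # [[1.0, 0.0], [0.0, 0.5], [0.0, 0.5], [0.0, 0.0]]
```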
filter(self)
Update StruM to ignore uninformative position-specific features.
Features with a high variance, i.e. non-specific features, do not contribute to the specificity of the StruM model. Filtering them out may increase the signal-to-noise ratio. The position-specific features are rank ordered by their variance, and a univariate spline is fit to the distribution. The point of inflection is used as the threshold for masking less specific features.
Once this method is run and the attribute self.filter_mask is generated, the method score_seq_filt() will become available.
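The ranking and masking idea can be sketched as follows. Note the substitution: the spline fit and inflection-point search are not reproduced here, so the threshold is simply passed in. The function `variance_mask` is hypothetical, and the mask polarity (True means keep) is an assumption.

```python
def variance_mask(variances, threshold):
    """Return (ranked_indices, mask); mask[i] is True for features to keep."""
    # Rank position-specific features from lowest to highest variance.
    ranked = sorted(range(len(variances)), key=lambda i: variances[i])
    # Keep only features whose variance does not exceed the threshold.
    # Assumption: the threshold is supplied by the caller rather than
    # derived from a spline's point of inflection.
    mask = [v <= threshold for v in variances]
    return ranked, mask


ranked, mask = variance_mask([0.1, 5.0, 0.2, 4.0], threshold=1.0)
print(ranked)  # [0, 2, 3, 1]
print(mask)    # [True, False, True, False]
```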
plot(self, save_path)
Save a graphical representation of the StruM.
Generates an image displaying the trends of each of the features in the StruM. The line indicates the average (scaled) value for that feature at each position. Shading represents +/- 1 standard deviation.
Note
This generates one row per feature. If you are including many features, your plot may be excessively tall.
Parameters: save_path (str.) – Filename to use when saving the image.
print_PWM(self, labels=False)
Pretty prints the PWM to std_out.
Parameters: labels (bool.) – Flag indicating whether to print the PWM with labels indicating the position associated with each column, and the nucleotide associated with each row.
Returns: Formatted position weight matrix suitable for display or for use in, e.g., the MEME suite. Also prints the PWM to std_out.
Return type: str.
read_FASTA(self, fasta_file)
Reads a FASTA formatted file for headers and sequences.
Parameters: fasta_file (file object.) – FASTA formatted file containing DNA sequences.
Returns: The headers and sequences from the FASTA file, as two separate lists.
Return type: (list, list)
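To show the expected input and output shapes, here is a minimal FASTA reader written from scratch. It is a sketch, not the method's source. The name `read_fasta_sketch` is hypothetical, and whether the package strips the leading '>' from headers or joins wrapped sequence lines is an assumption.

```python
import io


def read_fasta_sketch(handle):
    """Return (headers, sequences) from an open FASTA file object."""
    headers, seqs = [], []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            headers.append(line[1:])  # drop the leading '>'
            seqs.append("")
        elif seqs:
            seqs[-1] += line  # join sequences wrapped over several lines
    return headers, seqs


example = io.StringIO(">site1\nACGT\nAC\n>site2\nTTGA\n")
headers, seqs = read_fasta_sketch(example)
print(headers)  # ['site1', 'site2']
print(seqs)     # ['ACGTAC', 'TTGA']
```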
score_seq(self, seq, **kwargs)
Scores a sequence using pre-calculated StruM.
Note
This only scores the sequence in the given orientation. If you want to score both strands, use rev_comp() on seq, and run score_seq() again.
Parameters: seq (str.) – DNA sequence, all uppercase characters, composed of letters from set ACGTN. Follows format of the appropriate translate function: (translate_base(), translate_func())
Returns: Vector of scores for similarity of each kmer in seq to the StruM.
Return type: 1D array.
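The two-strand procedure from the note can be sketched as below. To keep the example self-contained, `score_fn` stands in for a trained model's score_seq, and `toy_score` is a hypothetical scorer. Flipping the reverse-strand scores and taking the per-position maximum are illustrative choices, not something the package prescribes.

```python
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rev_comp_sketch(seq):
    """Pure-Python stand-in for strum.rev_comp."""
    return seq.translate(_COMPLEMENT)[::-1]


def score_both_strands(score_fn, seq):
    """Score seq in both orientations and keep the better score per position."""
    forward = list(score_fn(seq))
    reverse = list(score_fn(rev_comp_sketch(seq)))
    # k-mer i on the reverse strand covers the same bases as k-mer
    # (n_kmers - 1 - i) on the forward strand, so flip to align positions.
    reverse_aligned = reverse[::-1]
    return [max(f, r) for f, r in zip(forward, reverse_aligned)]


def toy_score(seq, k=2):
    # Hypothetical scorer: number of A's in each k-mer.
    return [seq[i:i + k].count("A") for i in range(len(seq) - k + 1)]


print(score_both_strands(toy_score, "TTCG"))  # [2, 1, 0]
```

With a trained model, the bound method score_seq could presumably be passed in place of `toy_score`.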
score_seq_filt(self, seq, **kwargs)
A variation on score_seq() that masks non-specific features.
Once the self.filter_mask is generated, this method becomes available. This scores a sequence with the precomputed StruM, masking non-specific features.
Refer to score_seq() for more information about the arguments.
text_strum(self, avg, std, avg_range=None, std_range=None, colorbar=False)
Generates an ANSI colored Unicode text representation of a StruM.
Parameters: - avg (Array of floats.) – The average values of the StruM to plot.
- std (Array of floats.) – The standard deviation values of the StruM to plot.
- avg_range (tuple (low_value, high_value).) – Seven bins of even width will be generated across this range, and the avg values will be assigned to one of the bins.
- std_range (tuple (low_value, high_value).) – Eight bins of even width will be generated across this range, and the std values will be assigned to one of the bins.
- colorbar (bool.) – Whether to include a colorbar.
Returns: Returns a tuple of strings. The first element is the formatted StruM representation. If colorbar=True, the second element is the colorbar string.
Return type: tuple of strings.
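The even-width binning described for avg_range and std_range might look like the sketch below. The helper `assign_bin` is hypothetical, and clamping out-of-range values into the first or last bin is an assumption about edge handling.

```python
def assign_bin(value, low, high, n_bins):
    """Return the index (0 .. n_bins - 1) of the even-width bin holding value."""
    if high <= low:
        raise ValueError("high must be greater than low")
    width = (high - low) / n_bins
    index = int((value - low) // width)
    # Clamp values outside the range into the first or last bin.
    return max(0, min(n_bins - 1, index))


# Seven bins for average values, eight bins for standard deviations.
avg_bins = [assign_bin(v, 0.0, 7.0, 7) for v in (0.0, 3.5, 7.0, -1.0)]
std_bins = [assign_bin(v, 0.0, 4.0, 8) for v in (0.0, 1.2, 4.0)]
print(avg_bins)  # [0, 3, 6, 0]
print(std_bins)  # [0, 2, 7]
```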
train(self, training_sequences, weights=None, lim=None, **kwargs)
Learn structural motif from a set of known binding sites.
Parameters: - training_sequences (list of str. Or if updated, list of tuples: (str, [args])) – Training set, composed of a gapless alignment of binding sites of equal length.
- weights (1D array of floats.) – Weights to associate with each of the sequences in training_sequences to use in learning the motif.
- lim (float) – Minimum value allowed for variation in a given position-specific feature. Useful to prevent any deviation at that position from resulting in a probability of 0.
Returns: None. Defines the structural motif self.strum and the corresponding position weight matrix self.PWM, and sets attribute self.fit = True.
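One way to picture maximum-likelihood training with a floor on the variation is sketched below. It assumes each position-specific feature is summarized by a weighted mean and a weighted (population) standard deviation, and that lim is applied to the standard deviation. Neither assumption is confirmed by this page, and the function `fit_columns` is hypothetical.

```python
import math


def fit_columns(rows, weights=None, lim=None):
    """Return (means, stds) for each column of `rows`, one row per sequence."""
    if weights is None:
        weights = [1.0] * len(rows)
    total = sum(weights)
    n_cols = len(rows[0])
    means, stds = [], []
    for j in range(n_cols):
        mean = sum(w * r[j] for w, r in zip(weights, rows)) / total
        var = sum(w * (r[j] - mean) ** 2 for w, r in zip(weights, rows)) / total
        std = math.sqrt(var)
        if lim is not None:
            std = max(std, lim)  # floor the spread so it is never zero
        means.append(mean)
        stds.append(std)
    return means, stds


# Each row holds the structural values of one training sequence.
means, stds = fit_columns([[1.0, 2.0], [3.0, 2.0]], lim=0.1)
print(means)  # [2.0, 2.0]
print(stds)   # [1.0, 0.1]
```

Without the floor, the second column would have a spread of exactly zero, so any deviation at that position would be assigned a probability of 0.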
train_EM(self, data, fasta=True, params=None, int k=10, int max_iter=1000, double convergence_criterion=0.001, random_seed=0, int n_init=1, lim=None, seqlength=None, background=None, seed_motif=None, bool verbose=False)
Performs Expectation-Maximization on a set of sequences to find motif.
Parameters: - data (list of str, open file object referring to a FASTA file.) – A set of sequences to use for training the model. Assumed to have one occurrence of the binding site per sequence.
- fasta (bool.) – Flag indicating whether data points to an open file object containing a FASTA formatted file with DNA sequences.
- params (*args, **kwargs.) – Additional parameters to pass to self.func, if defined.
- k (int.) – Size of binding site to consider. Since dinucleotides are considered, in sequence-space the size of the binding site will be k + 1.
- max_iter (int.) – Maximum number of iterations of Expectation Maximization to perform if convergence is not attained.
- convergence_criterion (float.) – If the change in the likelihood between two iterations is less than this value, the model is considered to have converged.
- random_seed (int.) – Seed for the random number generator used in the EM algorithm for initialization.
- n_init (int.) – Number of random restarts of the EM algorithm to perform.
- lim (float) – Minimum value allowed for variation in a given position-specific feature. Useful to prevent any deviation at that position from resulting in a probability of 0.
- seqlength (int.) – If set, the sequences in the training data will be trimmed symmetrically to this length.
Note: This must be longer than the shortest sequence.
- background (str.) – Method to use for computing the background model. Default is to assume equal probability of each dinucleotide. Passing "compute" adapts the background model to represent the dinucleotide frequencies in the training sequences. Otherwise this can be the path to a tab delimited file specifying the representation of each dinucleotide. E.g.
AA 0.0625
AC 0.0625
...
- seed_motif (StruM.strum.) – Optional. A StruM.strum to use for initializing the Expectation-Maximization algorithm. If set, k will be replaced by the corresponding value from the seed_motif’s shape. n_init will also be reset to 1.
- verbose – Specifies whether to print a text version of the converging StruM at each iteration. Default False.
Returns: None. Defines the structural motif self.strum and the corresponding position weight matrix self.PWM.
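To make the background options concrete, the sketch below estimates dinucleotide frequencies from a set of sequences, which is roughly what the "compute" setting is described as doing. The function `dinucleotide_background` is hypothetical, and skipping pairs that contain N and falling back to the uniform 1/16 = 0.0625 when nothing is counted are assumptions.

```python
from collections import Counter
from itertools import product


def dinucleotide_background(seqs):
    """Return {dinucleotide: frequency} over all 16 ACGT dinucleotides."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - 1):
            pair = seq[i:i + 2]
            if set(pair) <= set("ACGT"):  # skip pairs containing N
                counts[pair] += 1
    total = sum(counts.values())
    pairs = ["".join(p) for p in product("ACGT", repeat=2)]
    if total == 0:
        return {p: 1.0 / 16 for p in pairs}  # fall back to uniform
    return {p: counts[p] / total for p in pairs}


background = dinucleotide_background(["ACGT", "ACGA"])
# Print a few entries in the tab delimited layout described above.
for pair in ("AC", "CG", "GT", "GA", "AA"):
    print(pair, background[pair], sep="\t")
```

Printing all sixteen entries this way would give text in the same two-column layout as the background file described above.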
translate(self, seq, **kwargs)
Convert sequence from string to structural representation.
Depending on the value of self.updated, this calls either translate_base(), which uses just the DiProDB values, or translate_func(), which augments those with additional values.
Parameters: seq – Necessary data for the sequence to be translated. See translate_base() and translate_func() for more information.
Returns: Sequence in structural representation.
Return type: 1D numpy array of floats.
translate_base(self, str seq, **kwargs)
Convert sequence from string to structural representation.
Parameters: - seq (str.) – DNA sequence, all uppercase characters, composed of letters from set ACGTN.
- **kwargs – Ignored
Returns: Sequence in structural representation.
Return type: 1D numpy array of floats.
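Conceptually, the translation is a lookup of each overlapping dinucleotide in a feature table. The sketch below uses a toy table with made-up placeholder numbers, not DiProDB values. The function `translate_sketch` is hypothetical, and the position-major ordering of the flat output is an assumption.

```python
# Placeholder values for two made-up features; these are NOT DiProDB values.
TOY_TABLE = {
    "AC": [1.0, 2.0],
    "CG": [3.0, 4.0],
    "GT": [5.0, 6.0],
}


def translate_sketch(seq, table):
    """Return a flat list of feature values, one group per dinucleotide."""
    values = []
    for i in range(len(seq) - 1):
        values.extend(table[seq[i:i + 2]])  # KeyError if the pair is missing
    return values


print(translate_sketch("ACGT", TOY_TABLE))  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

A sequence of length l yields l - 1 dinucleotides, so the flat result here has (l - 1) times the number of features entries.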
translate_func(self, f_seq, **kwargs)
Convert sequence from string to structural representation, with additional features.
Parameters: - seq ((str, [args])) – DNA sequence, all uppercase characters, composed of letters from set ACGTN, with additional data for passing to the extra function, if necessary.
- **kwargs – Additional keyword arguments required by self.func.
Returns: Sequence in structural representation.
Return type: 1D numpy array of floats.
update(self, features, func, data=None)
Update the StruM to incorporate additional features.
Using this method will change the behavior of other methods; in particular, the translate() method is replaced by translate_func().
Parameters: - features (list of str.) – Text description or label of the feature(s) being added into the model.
- func (function.) – The scoring function that produces the additional features. These may be computed on sequence alone, or by incorporating additional data. The output must be an array of shape [n, l-1], where n is the number of additional features (len(features)) and l is the length of the sequence being scored. The first argument of the function must be the sequence being scored.
- data – Additional data that is used by the new function. May be a lookup table, for example, or a reference to an outside file.
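As an example of a function with the required output shape, the sketch below computes one extra feature, the GC fraction of each dinucleotide, giving n = 1 row of l - 1 values. The function `gc_content` and its feature label are hypothetical. The result is a nested list to keep the example dependency-free; since the documentation asks for an array, it would presumably need wrapping (for instance with numpy.array) before use.

```python
def gc_content(seq):
    """Return one feature row: the GC fraction of each dinucleotide."""
    row = [sum(base in "GC" for base in seq[i:i + 2]) / 2
           for i in range(len(seq) - 1)]
    return [row]  # shape [n, l-1] with n = 1


features = ["GC content"]  # one label per row returned by gc_content
result = gc_content("ACGT")
print(result)  # [[0.5, 1.0, 0.5]]
```

The sequence is the first argument, as the parameter description requires, and the number of rows matches len(features).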