Documentation

StruM: Structural Motifs

This package provides functionality for computing structural representations of DNA sequence motifs. Estimates for DNA structure come from the DiNucleotide Property Database (http://diprodb.leibniz-fli.de/).

This version relies on Cython for speed. The speedup is most noticeable when scoring longer sequences.

strum.rev_comp(str seq) → str

Reverse complement a DNA sequence.

Parameters:seq (str.) – A DNA sequence.
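
For intuition, reverse complementing can be sketched in pure Python as below. This is an illustrative stand-in, not the package's Cython implementation; the name rev_comp_sketch and the choice to map N to itself are assumptions.

```python
# Illustrative stand-in for strum.rev_comp(); not the package's own code.
# Assumption: input is uppercase and N is its own complement.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rev_comp_sketch(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    # Complement every base, then reverse the resulting string.
    return seq.translate(_COMPLEMENT)[::-1]


print(rev_comp_sketch("AACGTN"))  # NACGTT
```
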
class strum.StruM

StruM Object: Train and work with Structural Motifs

StruMs can be learned via maximum likelihood (train()) from a known set of aligned binding sites, or via expectation maximization (train_EM()) from a set of broader regions.

For speed, this package trims the sequences to the same length. The length can be specified by the user; otherwise the shortest sequence is used as a guide.

Additional features other than those from DiProDB can be incorporated by using the update() method.

__init__(self, str mode='full', int n_process=1, custom_filter=[])

Create a StruM object.

Parameters:
  • mode (str.) –

    Defines which subset of available features in the DiProDB table to use. Choose from: [‘basic’, ‘protein’, ‘groove’, ‘proteingroove’, ‘unique’, ‘full’, ‘nucs’, ‘custom’]

    MODE           Features
    basic          Twist, Rise, Bend.
    protein        (for DNA-protein complex) Roll, Twist, Tilt, Slide, Shift, Rise.
    groove         Major Groove Width, Major Groove Depth, Major Groove Size, Major Groove Distance, Minor Groove Width, Minor Groove Depth, Minor Groove Size, Minor Groove Distance.
    proteingroove  The union of the “protein” and “groove” filters.
    unique         Filters the table for the first occurrence of each type of feature.
    full           All available double stranded DNA features.
    nucs           Adenine content, Guanine content, Cytosine content, Thymine content.
    custom         Manually select desired features.
  • custom_filter (list of ints.) – Specifies the indices of desired features from the DiProDB table.
  • n_process (int.) – Number of threads to use. -1 uses all processors.
define_PWM(self, list seqs, weights=None)

Computes a position weight matrix from sequences used to train the StruM.

Parameters:
  • seqs (list of str.) – Training set, composed of gapless alignment of binding sites of equal length.
  • weights (1D array of floats.) – Weights to associate with each of the sequences in seqs to use in learning the motif.
Returns:

None. Sets the position weight matrix self.PWM based on the weighted sequences.
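
As a rough picture of what a weighted position weight matrix looks like, the sketch below accumulates per-position base weights and normalizes each column. It is not the package's implementation: the function name weighted_pwm, the row order A, C, G, T, the skipping of N, and the absence of pseudocounts are all assumptions made for illustration.

```python
def weighted_pwm(seqs, weights=None):
    """Return a 4 x L matrix (rows A, C, G, T) of weighted base frequencies."""
    if weights is None:
        weights = [1.0] * len(seqs)  # unweighted case: every site counts once
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("all sequences must have the same length")

    nucs = "ACGT"
    counts = {n: [0.0] * length for n in nucs}
    for seq, w in zip(seqs, weights):
        for i, base in enumerate(seq):
            if base in counts:  # assumption: ambiguous bases such as N are skipped
                counts[base][i] += w

    pwm = []
    for n in nucs:
        row = []
        for i in range(length):
            col_total = sum(counts[m][i] for m in nucs)
            # Fall back to a uniform column if no unambiguous base was seen.
            row.append(counts[n][i] / col_total if col_total else 0.25)
        pwm.append(row)
    return pwm


pwm = weighted_pwm(["ACGT", "ACGA"], weights=[1.0, 3.0])
print(pwm[0])  # row for A: [1.0, 0.0, 0.0, 0.75]
```
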

filter(self)

Update StruM to ignore uninformative position-specific features.

Features with a high variance, i.e. non-specific features, do not contribute to the specificity of the StruM model. Filtering them out may increase the signal-to-noise ratio. The position-specific features are rank ordered by their variance, and a univariate spline is fit to the distribution. The point of inflection is used as the threshold for masking less specific features.

Once this method is run and the attribute self.filter_mask is generated, the method score_seq_filt() will become available.

plot(self, save_path)

Save a graphical representation of the StruM.

Generates an image displaying the trends of each of the features in the StruM. The line indicates the average (scaled) value for that feature at each position. Shading represents +/- 1 standard deviation.

Note

This generates one row per feature. If you are including many features, your plot may be excessively tall.

Parameters:save_path (str.) – Filename to use when saving the image.
print_PWM(self, labels=False)

Pretty prints the PWM to stdout.

Parameters:labels (bool.) – Flag indicating whether to print the PWM with labels indicating the position associated with each column, and the nucleotide associated with each row.
Returns:Formatted position weight matrix suitable for display or for use in other tools, e.g. the MEME suite. Also prints the PWM to stdout.
Return type:str.
read_FASTA(self, fasta_file)

Reads a FASTA formatted file for headers and sequences.

Parameters:fasta_file (file object.) – FASTA formatted file containing DNA sequences.
Returns:The headers and sequences from the FASTA file, as two separate lists.
Return type:(list, list)
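
The two-list return value can be pictured with the minimal parser below. It is a sketch, not the method's actual code: the name read_fasta_sketch is hypothetical, and stripping the leading ">" from headers and uppercasing the sequences are assumptions.

```python
import io


def read_fasta_sketch(fasta_file):
    """Return (headers, sequences) from an open FASTA file object."""
    headers, sequences = [], []
    for line in fasta_file:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            headers.append(line[1:])  # assumption: drop the leading '>'
            sequences.append("")
        elif sequences:
            # Sequence lines are concatenated until the next header.
            sequences[-1] += line.upper()
    return headers, sequences


example = io.StringIO(">site1\nACGT\nAC\n>site2\nggta\n")
headers, sequences = read_fasta_sketch(example)
print(headers)    # ['site1', 'site2']
print(sequences)  # ['ACGTAC', 'GGTA']
```
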
score_seq(self, seq, **kwargs)

Scores a sequence using pre-calculated StruM.

Note

This only scores the sequence in the given orientation. If you want to score both strands, use rev_comp() on seq, and run score_seq() again.

Parameters:seq (str.) – DNA sequence, all uppercase characters, composed of letters from set ACGTN. Follows format of the appropriate translate function: (translate_base(), translate_func())
Returns:Vector of scores for similarity of each kmer in seq to the StruM.
Return type:1D array.
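
The note above about scoring both strands could be wrapped in a small helper like the one below. The helper takes the scoring and reverse-complement functions as arguments so that it runs on its own; toy_score and toy_rev_comp are hypothetical stand-ins, and both the re-alignment of the reverse-strand scores and the choice to keep the per-window maximum are assumptions rather than documented behavior.

```python
def score_both_strands(score_seq, rev_comp, seq):
    """Score seq on both strands and keep the better score per window."""
    forward = list(score_seq(seq))
    reverse = list(score_seq(rev_comp(seq)))
    # Window j on the reverse complement covers the same bases as window
    # (n - 1 - j) on the forward strand, so flip it to align the two vectors.
    reverse.reverse()
    return [max(f, r) for f, r in zip(forward, reverse)]


def toy_rev_comp(seq):
    # Stand-in for rev_comp(); not the package's implementation.
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def toy_score(seq, k=2):
    # Stand-in scorer: number of G's in each window of width k.
    return [seq[i:i + k].count("G") for i in range(len(seq) - k + 1)]


print(score_both_strands(toy_score, toy_rev_comp, "GGAC"))  # [2, 1, 1]
```

In real use, score_seq would be the bound method of a trained StruM and rev_comp the module-level function documented above.
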
score_seq_filt(self, seq, **kwargs)

A variation on score_seq() that masks non-specific features.

Once the self.filter_mask is generated, this method becomes available. This scores a sequence with the precomputed StruM, masking non-specific features.

Refer to score_seq() for more information about the arguments.

text_strum(self, avg, std, avg_range=None, std_range=None, colorbar=False)

Generates an ANSI-colored Unicode text representation of a StruM.

Parameters:
  • avg (Array of floats.) – The average values of the StruM to plot.
  • std (Array of floats.) – The standard deviation values of the StruM to plot.
  • avg_range (tuple (low_value, high_value).) – Seven bins of even width will be generated across this range, and the avg values will be assigned to one of the bins.
  • std_range (tuple (low_value, high_value).) – Eight bins of even width will be generated across this range, and the std values will be assigned to one of the bins.
  • colorbar (bool.) – Whether to include a colorbar
Returns:

Returns a tuple of strings. The first element is the formatted StruM representation. If colorbar=True, the second element is the colorbar string.

Return type:

tuple of strings.
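
The even-width binning described for avg_range and std_range can be sketched as follows. The function name assign_bin is hypothetical, and clipping out-of-range values into the first or last bin is an assumption; the source only states that bins of even width span the given range.

```python
def assign_bin(value, value_range, n_bins):
    """Map value to one of n_bins even-width bins spanning value_range."""
    low, high = value_range
    fraction = (value - low) / (high - low)  # 0.0 at low, 1.0 at high
    index = int(fraction * n_bins)
    # Assumption: values at or beyond the edges fall into the outermost bins.
    return max(0, min(n_bins - 1, index))


# Seven bins for averages, eight for standard deviations, as described above.
print([assign_bin(v, (-1.0, 1.0), 7) for v in (-1.0, 0.0, 1.0)])  # [0, 3, 6]
print(assign_bin(5.0, (0.0, 8.0), 8))  # 5
```
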

train(self, training_sequences, weights=None, lim=None, **kwargs)

Learn structural motif from a set of known binding sites.

Parameters:
  • training_sequences (list of str. Or if updated, list of tuples: (str, [args])) – Training set, composed of gapless alignment of binding sites of equal length.
  • weights (1D array of floats.) – Weights to associate with each of the sequences in training_sequences to use in learning the motif.
  • lim (float) – Minimum value allowed for variation in a given position-specific-feature. Useful to prevent any deviation at that position from resulting in a probability of 0.
Returns:

None. Defines the structural motif self.strum and the corresponding position weight matrix self.PWM, sets attribute self.fit = True.
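
One way to picture maximum likelihood training with the lim floor is the sketch below. It assumes that each position-specific feature is summarized by a weighted mean and standard deviation (the quantities plot() displays) and that lim is applied as a lower bound on the standard deviation; both points, and the name fit_columns, are assumptions rather than the package's confirmed behavior.

```python
import math


def fit_columns(vectors, weights=None, lim=None):
    """Return (means, stds) per column of equal-length feature vectors."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    means, stds = [], []
    for j in range(len(vectors[0])):
        mean = sum(w * v[j] for v, w in zip(vectors, weights)) / total
        var = sum(w * (v[j] - mean) ** 2 for v, w in zip(vectors, weights)) / total
        std = math.sqrt(var)
        if lim is not None:
            # Assumption: lim floors the spread so that a column with no
            # variation cannot assign probability 0 to every deviation.
            std = max(std, lim)
        means.append(mean)
        stds.append(std)
    return means, stds


means, stds = fit_columns([[1.0, 2.0], [3.0, 2.0]], lim=0.1)
print(means)  # [2.0, 2.0]
print(stds)   # [1.0, 0.1]
```
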

train_EM(self, data, fasta=True, params=None, int k=10, int max_iter=1000, double convergence_criterion=0.001, random_seed=0, int n_init=1, lim=None, seqlength=None, background=None, seed_motif=None, bool verbose=False)

Performs Expectation-Maximization on a set of sequences to find motif.

Parameters:
  • data (list of str, or open file object referring to a FASTA file.) – A set of sequences to use for training the model. Assumed to have one occurrence of the binding site per sequence.
  • fasta (bool.) – Flag indicating whether data points to an open file object containing a FASTA formatted file with DNA sequences.
  • params (*args, **kwargs.) – Additional parameters to pass to self.func, if defined.
  • k (int.) – Size of binding site to consider. Since dinucleotides are considered, in sequence-space the size of the binding site will be k + 1.
  • max_iter (int.) – Maximum number of iterations of Expectation-Maximization to perform if convergence is not attained.
  • convergence_criterion (float.) – If the change in the likelihood between two iterations is less than this value, the model is considered to have converged.
  • random_seed (int.) – Seed for the random number generator used in the EM algorithm for initialization.
  • n_init (int.) – Number of random restarts of the EM algorithm to perform.
  • lim (float) – Minimum value allowed for variation in a given position-specific-feature. Useful to prevent any deviation at that position from resulting in a probability of 0.
  • seqlength (int.) – If set, the sequences in the training data will be trimmed symmetrically to this length. Note: This must be longer than the shortest sequence.
  • background (str.) –

    Method to use for computing the background model. Default is to assume equal probability of each dinucleotide. Passing "compute" adapts the background model to represent the dinucleotide frequencies in the training sequences. Otherwise this can be the path to a tab-delimited file specifying the representation of each dinucleotide. E.g.

    AA  0.0625
    AC  0.0625
    ...
    
  • seed_motif (StruM.strum.) – Optional. A StruM.strum to use for initializing the Expectation-Maximization algorithm. If set, k will be replaced by the corresponding value from the seed_motif’s shape. n_init will also be reset to 1.
  • verbose (bool.) – Specifies whether to print a text version of the converging StruM at each iteration. Default False.
Returns:

None. Defines the structural motif self.strum and the corresponding position weight matrix self.PWM.
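
The two non-default background options can be illustrated with the sketch below: one helper reads a dinucleotide table in the format shown for background, and the other counts dinucleotide frequencies in a set of sequences, as "compute" is described to do. Both function names are hypothetical, and splitting on any whitespace (rather than strictly on tabs) and counting overlapping dinucleotides are assumptions.

```python
import io
from collections import Counter


def read_background(handle):
    """Parse lines such as 'AA<TAB>0.0625' into {dinucleotide: frequency}."""
    background = {}
    for line in handle:
        fields = line.split()  # assumption: any whitespace separates the fields
        if len(fields) == 2:
            background[fields[0]] = float(fields[1])
    return background


def compute_background(seqs):
    """Estimate dinucleotide frequencies from overlapping pairs in seqs."""
    counts = Counter(
        seq[i:i + 2] for seq in seqs for i in range(len(seq) - 1)
    )
    total = sum(counts.values())
    return {dinuc: n / total for dinuc, n in counts.items()}


print(read_background(io.StringIO("AA\t0.0625\nAC\t0.0625\n")))
# {'AA': 0.0625, 'AC': 0.0625}
print(compute_background(["AAC"]))
# {'AA': 0.5, 'AC': 0.5}
```
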

translate(self, seq, **kwargs)

Convert sequence from string to structural representation.

Based on the value of self.updated will either call the translate_base() function to use just the DiProDB values, or the translate_func() to augment those with additional values.

Parameters:seq – Necessary data for the sequence to be translated. See translate_base() and translate_func() for more information.
Returns:Sequence in structural representation.
Return type:1D numpy array of floats.
translate_base(self, str seq, **kwargs)

Convert sequence from string to structural representation.

Parameters:
  • seq (str.) – DNA sequence, all uppercase characters, composed of letters from set ACGTN.
  • **kwargs – Ignored
Returns:

Sequence in structural representation.

Return type:

1D numpy array of floats.
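
The idea of translating a sequence into a flat structural vector can be sketched as below. TOY_TABLE holds placeholder numbers, not DiProDB values, and the layout of the output (all features of the first dinucleotide, then the second, and so on) is an assumption, as is the name translate_sketch. The sketch returns a plain list where the method returns a numpy array, and it does not handle N.

```python
# Placeholder lookup: two made-up "features" per dinucleotide.
# These numbers are NOT DiProDB values; they only make the sketch runnable.
TOY_TABLE = {
    a + b: (float(i), float(j))
    for i, a in enumerate("ACGT")
    for j, b in enumerate("ACGT")
}


def translate_sketch(seq, table=TOY_TABLE):
    """Flatten per-dinucleotide feature values into one list of floats."""
    values = []
    # A sequence of length l contains l - 1 overlapping dinucleotides.
    for i in range(len(seq) - 1):
        values.extend(table[seq[i:i + 2]])
    return values


print(translate_sketch("ACG"))  # [0.0, 1.0, 1.0, 2.0]
```

The l - 1 relationship is the same one behind the k parameter of train_EM(), where a motif of k dinucleotide positions spans k + 1 bases.
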

translate_func(self, f_seq, **kwargs)

Convert sequence from string to structural representation, with additional features.

Parameters:
  • f_seq ((str, [args])) – DNA sequence, all uppercase characters, composed of letters from set ACGTN, with additional data for passing to the extra function, if necessary.
  • **kwargs – Additional keyword arguments required by self.func.
Returns:

Sequence in structural representation.

Return type:

1D numpy array of floats.

update(self, features, func, data=None)

Update the StruM to incorporate additional features.

Using this method changes the behavior of other methods. In particular, translate() then calls translate_func() instead of translate_base().

Parameters:
  • features (list of str.) – Text description or label of the feature(s) being added into the model.
  • func (function.) – The scoring function that produces the additional features. These may be computed on sequence alone, or by incorporating additional data. The output must be an array of shape [n, l-1], where n is the number of additional features (len(features)) and l is the length of the sequence being scored. The first argument of the function must be the sequence being scored.
  • data – Additional data that is used by the new function. May be a lookup table, for example, or a reference to an outside file.
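
A function suitable for func might look like the sketch below, which computes one extra feature (GC content per dinucleotide) and returns it with shape [n, l-1] where n = 1. The function name gc_content is hypothetical, and a nested list is used here only to keep the example free of dependencies; the documented requirement is an array of that shape.

```python
def gc_content(seq):
    """Return a [1, l-1] nested list: GC fraction of each dinucleotide."""
    row = [
        sum(base in "GC" for base in seq[i:i + 2]) / 2
        for i in range(len(seq) - 1)
    ]
    # One row per added feature; here there is a single feature.
    return [row]


print(gc_content("ACGT"))  # [[0.5, 1.0, 0.5]]
```

Following the signature documented above, such a function would presumably be registered with something like update(features=["GC content"], func=gc_content); how data is forwarded to func is not specified here, so this sketch leaves it out.
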