skbio.alignment.StripedSmithWaterman

class skbio.alignment.StripedSmithWaterman

Performs a striped (banded) Smith-Waterman alignment.

First, a StripedSmithWaterman object must be instantiated with a query sequence. The resulting object is then callable with a target sequence and may be reused on a large collection of target sequences.
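The instantiate-once, call-many pattern described above can be sketched with a toy stand-in. The class below is hypothetical (ToyLocalAligner is not part of skbio): it is a plain, unstriped Smith-Waterman scorer in pure Python that returns only the best local score. It assumes that a gap of length k costs gap_open_penalty + (k - 1) * gap_extend_penalty. Its parameter names and defaults mirror the ones documented under Parameters.

```python
class ToyLocalAligner:
    """Hypothetical pure-Python stand-in; NOT the skbio implementation."""

    def __init__(self, query_sequence, gap_open_penalty=5,
                 gap_extend_penalty=2, match_score=2, mismatch_score=-3):
        self.query = query_sequence.upper()
        self.gap_open = gap_open_penalty
        self.gap_extend = gap_extend_penalty
        self.match = match_score
        self.mismatch = mismatch_score

    def __call__(self, target_sequence):
        """Return the best local alignment score against one target."""
        target = target_sequence.upper()
        n = len(self.query)
        neg_inf = float("-inf")
        h_prev = [0] * (n + 1)        # best scores for the previous target base
        f_prev = [neg_inf] * (n + 1)  # scores ending in a gap that skips target bases
        best = 0
        for t in target:
            h_cur = [0] * (n + 1)
            f_cur = [neg_inf] * (n + 1)
            e = neg_inf               # score ending in a gap that skips query bases
            for j in range(1, n + 1):
                # Assumed gap model: opening costs gap_open, each further base gap_extend.
                e = max(e - self.gap_extend, h_cur[j - 1] - self.gap_open)
                f_cur[j] = max(f_prev[j] - self.gap_extend,
                               h_prev[j] - self.gap_open)
                s = self.match if self.query[j - 1] == t else self.mismatch
                # Local alignment: a score never drops below 0.
                h_cur[j] = max(0, h_prev[j - 1] + s, e, f_cur[j])
                best = max(best, h_cur[j])
            h_prev, f_prev = h_cur, f_cur
        return best


# Build the aligner once for the query, then reuse it across many targets.
aligner = ToyLocalAligner("ACGT")
scores = [aligner(target) for target in ["ACGT", "AGGT", "TTTT"]]
print(scores)
```

The design point is the same as in the real class: work that depends only on the query is done once in the constructor, and each call handles a single target.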

Parameters:

query_sequence : string

The query sequence. It may be upper- or lowercase and must be drawn from the set {A, C, G, T, N} (nucleotide) or from the set {A, R, N, D, C, Q, E, G, H, I, L, K, M, F, P, S, T, W, Y, V, B, Z, X, *} (protein).

gap_open_penalty : int, optional

The penalty applied to creating a gap in the alignment. This CANNOT be 0. Default is 5.

gap_extend_penalty : int, optional

The penalty applied to extending a gap in the alignment. This CANNOT be 0. Default is 2.

score_size : int, optional

If your estimated best alignment score is < 255, this should be 0. If it is >= 255, this should be 1. If you don’t know, this should be 2. Default is 2.
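As a hedged illustration of the rule above, the helper below maps an estimated best score to a score_size value. The name choose_score_size is hypothetical and not part of skbio; passing None stands for "no estimate available".

```python
def choose_score_size(estimated_best_score=None):
    """Pick score_size from an estimated best alignment score.

    Hypothetical helper that only restates the documented rule.
    """
    if estimated_best_score is None:
        return 2  # unknown: use the documented default
    if estimated_best_score < 255:
        return 0
    return 1      # estimated score is >= 255


print(choose_score_size(100), choose_score_size(255), choose_score_size())
```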

mask_length : int, optional

The minimum distance between the ending positions of the optimal and suboptimal alignments; that is, the distance between them is >= mask_length. We suggest using len(query_sequence)/2 if you don’t have special concerns. Detailed description of mask_length: after locating the optimal alignment ending position, the suboptimal alignment score is found heuristically by taking the second largest score in the array that contains the maximal score of each column of the SW matrix. To avoid picking scores that belong to alignments sharing part of the best alignment, the SSW C library masks the reference loci near the best alignment ending position (mask length = mask_length) and locates the second largest score among the unmasked elements. Default is 15.

mask_auto : bool, optional

If True, the mask length actually used is set automatically to max(int(len(query_sequence)/2), mask_length). Default is True.
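The interaction between mask_auto and mask_length can be written out as a short function. This is a minimal sketch of the formula stated above; effective_mask_length is a hypothetical name, not an skbio function, and the behavior when mask_auto is False (use mask_length unchanged) is an assumption.

```python
def effective_mask_length(query_sequence, mask_length=15, mask_auto=True):
    """Return the mask length that would be used for this query."""
    if mask_auto:
        # Formula from the mask_auto description.
        return max(int(len(query_sequence) / 2), mask_length)
    # Assumption: without mask_auto, mask_length is used as given.
    return mask_length


print(effective_mask_length("A" * 10))    # short query: mask_length wins
print(effective_mask_length("A" * 100))   # long query: half the query length wins
```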

score_only : bool, optional

If True, the best alignment beginning positions (BABP) and the cigar will not be returned in the result. This overrides any setting of score_filter, distance_filter, and override_skip_babp; it has the highest precedence. Default is False.

score_filter : int, optional

If set, this will prevent the cigar and best alignment beginning positions (BABP) from being returned if the optimal alignment score is less than score_filter, saving some computation time. This filter may be overridden by score_only (prevents BABP and cigar, regardless of other arguments), distance_filter (may prevent cigar, but will cause BABP to be calculated), and override_skip_babp (ensures BABP is returned). Default is None.

distance_filter : int, optional

If set, this will prevent the cigar from being returned if the length of the query_sequence or the target_sequence is less than distance_filter, saving some computation time. The results of this filter may be overridden by score_only (prevents BABP and cigar, regardless of other arguments) and score_filter (may prevent cigar). override_skip_babp has no effect when this filter is applied, as BABP must be calculated to perform the filter. Default is None.

override_skip_babp : bool, optional

When True, the best alignment beginning positions (BABP) will always be returned unless score_only is set to True. Default is False.

protein : bool, optional

When True, the query_sequence and target_sequence will be read as protein sequences. When False, they will be read as nucleotide sequences. If True, a substitution_matrix must be supplied. Default is False.

match_score : int, optional

When using a nucleotide sequence, the match_score is the score added when a match occurs. This is ignored if substitution_matrix is provided. Default is 2.

mismatch_score : int, optional

When using a nucleotide sequence, the mismatch_score is the score added when a mismatch occurs. This should be a negative integer, so a mismatch lowers the alignment score. This is ignored if substitution_matrix is provided. Default is -3.

substitution_matrix : 2D dict, optional

Provides the score for each possible substitution of sequence characters. This may be used for protein or nucleotide sequences. The entire set of possible combinations for the relevant sequence type MUST be enumerated in the dict of dicts. This will override match_score and mismatch_score. Required when protein is True. Default is None.
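A fully enumerated dict of dicts for the nucleotide alphabet can be generated instead of typed by hand. The sketch below is illustrative only: build_nucleotide_matrix is a hypothetical helper, and scoring any pair that involves N as 0 is an assumption made here, not something this page specifies.

```python
def build_nucleotide_matrix(match_score=2, mismatch_score=-3, alphabet="ACGTN"):
    """Enumerate every pair of characters in `alphabet` as a dict of dicts."""
    matrix = {}
    for row in alphabet:
        matrix[row] = {}
        for col in alphabet:
            if "N" in (row, col):
                # Assumption: the ambiguous base N is scored neutrally.
                matrix[row][col] = 0
            elif row == col:
                matrix[row][col] = match_score
            else:
                matrix[row][col] = mismatch_score
    return matrix


matrix = build_nucleotide_matrix()
print(matrix["A"]["A"], matrix["A"]["C"], matrix["N"]["G"])
```

Every row contains every column, which is the "entire set of possible combinations" requirement stated for substitution_matrix.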

suppress_sequences : bool, optional

If True, the query and target sequences, which are otherwise included in the result for convenience, will not be returned. Default is False.

zero_index : bool, optional

If True, all indices will start at 0. If False, all indices will start at 1. Default is True.

Notes

This is a wrapper for the SSW package [R95].

mask_length must be >= 15; otherwise, the suboptimal alignment information will NOT be returned.

match_score is a positive integer and mismatch_score is a negative integer.

match_score and mismatch_score are only meaningful in the context of nucleotide sequences.

A substitution matrix must be provided when working with protein sequences.

References

[R95] Zhao, Mengyao, Wan-Ping Lee, Erik P. Garrison, & Gabor T. Marth. “SSW Library: An SIMD Smith-Waterman C/C++ Library for Use in Genomic Applications”. PLOS ONE (2013). Web. 11 July 2014. http://www.plosone.org/article/info:doi/10.1371/journal.pone.0082138