skbio.sequence.DNA.iter_contiguous

DNA.iter_contiguous(included, min_length=1, invert=False)

Yield contiguous subsequences based on included.

State: Stable as of 0.4.0.

Parameters:

included : 1D array_like (bool) or iterable (slices or ints)

included is transformed into a flat boolean vector in which each position is either included or skipped. Each run of contiguous included positions is yielded as a single region.

min_length : int, optional

The minimum length of a subsequence for it to be yielded. Default is 1.

invert : bool, optional

Whether to invert included such that it describes what should be skipped instead of included. Default is False.

Yields:

Sequence

Contiguous subsequence as indicated by included.

Notes

If slices provide adjacent ranges, then they will be considered the same contiguous subsequence.
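
The merging of adjacent slices can be illustrated with a small plain-Python sketch. This is not scikit-bio's implementation; the helper names flatten_included and contiguous_regions are hypothetical, and the slices are hand-written for illustration. It only shows why two slices that touch end up in one region once they are flattened into a boolean vector.

```python
from itertools import groupby


def flatten_included(length, included):
    """Turn an iterable of slices or ints into a flat boolean mask (hypothetical helper)."""
    mask = [False] * length
    for item in included:
        if isinstance(item, slice):
            # slice.indices resolves None and negative bounds for this length.
            for i in range(*item.indices(length)):
                mask[i] = True
        else:
            mask[item] = True
    return mask


def contiguous_regions(mask):
    """Return (start, stop) pairs for each run of True values (hypothetical helper)."""
    regions = []
    position = 0
    for value, run in groupby(mask):
        run_length = len(list(run))
        if value:
            regions.append((position, position + run_length))
        position += run_length
    return regions


sequence = "ACDFNASANFTACGNPNRTESL"
# Two adjacent slices (4:8 and 8:12) plus one separate slice (16:20).
included = [slice(4, 8), slice(8, 12), slice(16, 20)]
mask = flatten_included(len(sequence), included)
regions = contiguous_regions(mask)
subsequences = [sequence[start:stop] for start, stop in regions]
print(regions)       # [(4, 12), (16, 20)]
print(subsequences)  # ['NASANFTA', 'NRTE']
```

Because the mask no longer records where one slice ended and the next began, the boundary at position 8 disappears and the first two slices come out as a single region.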

Examples

Here we use iter_contiguous to find all of the contiguous ungapped sequences using a boolean vector derived from our DNA sequence.

>>> from skbio import DNA
>>> s = DNA('AAA--TT-CCCC-G-')
>>> no_gaps = ~s.gaps()
>>> for ungapped_subsequence in s.iter_contiguous(no_gaps,
...                                               min_length=2):
...     print(ungapped_subsequence)
AAA
TT
CCCC

Note that the last potential subsequence, G, was skipped because its length (1) is less than min_length, which was set to 2.
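
How min_length and invert interact can be sketched in plain Python as well. The function iter_runs below is a hypothetical stand-in written for illustration, not the scikit-bio method: it assumes included is already a flat list of booleans, flips it when invert is true, and drops runs shorter than min_length.

```python
from itertools import groupby


def iter_runs(sequence, included, min_length=1, invert=False):
    """Yield runs of included positions from sequence (illustrative sketch only)."""
    if invert:
        # With invert, the mask describes what to skip rather than what to keep.
        included = [not flag for flag in included]
    position = 0
    for flag, run in groupby(included):
        run_length = len(list(run))
        if flag and run_length >= min_length:
            yield sequence[position:position + run_length]
        position += run_length


sequence = "AAA--TT-CCCC-G-"
# Mask of gap positions; inverting it selects the ungapped positions.
is_gap = [char == "-" for char in sequence]

ungapped = list(iter_runs(sequence, is_gap, min_length=2, invert=True))
print(ungapped)      # ['AAA', 'TT', 'CCCC']

all_ungapped = list(iter_runs(sequence, is_gap, invert=True))
print(all_ungapped)  # ['AAA', 'TT', 'CCCC', 'G']
```

In this sketch, passing the gap mask with invert=True selects the same positions as passing the non-gap mask directly, and the single-character run G only appears once min_length is left at its default of 1.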

We can also use iter_contiguous on a generator of slices, such as the one produced by find_motifs (or find_with_regex).

>>> from skbio import Protein
>>> s = Protein('ACDFNASANFTACGNPNRTESL')
>>> for subseq in s.iter_contiguous(s.find_motifs('N-glycosylation')):
...     print(subseq)
NASANFTA
NRTE

Note that the first subsequence contains two N-glycosylation sites. The two sites are adjacent, so they were merged into a single contiguous subsequence.