skbio.sequence.BiologicalSequence

class skbio.sequence.BiologicalSequence(sequence, id='', description='', quality=None, validate=False)

Base class for biological sequences.

Parameters:

sequence : Python Sequence (e.g., str, list or tuple)

The biological sequence.

id : str, optional

The sequence id (e.g., an accession number).

description : str, optional

A description or comment about the sequence (e.g., “green fluorescent protein”).

quality : 1-D array_like, int, optional

Phred quality scores stored as nonnegative integers, one per sequence character. If provided, must be the same length as the biological sequence. Can be a 1-D numpy.ndarray of integers, or a structure that can be converted to this representation using numpy.asarray. A copy will not be made if quality is already a 1-D numpy.ndarray with an int dtype. The array will be made read-only (i.e., its WRITEABLE flag will be set to False).

validate : bool, optional

If True, runs the is_valid method after construction and raises BiologicalSequenceError if it returns False.

Raises:

skbio.sequence.BiologicalSequenceError

If validate is True and is_valid returns False, or if quality is not the correct shape.
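The handling of quality described above can be sketched with plain NumPy. This is a minimal illustration under stated assumptions, not the library's implementation: prepare_quality is a hypothetical helper, and it raises ValueError where the class itself is documented to raise BiologicalSequenceError.

```python
import numpy as np


def prepare_quality(quality, sequence_length):
    """Hypothetical helper: normalize quality scores as the text describes."""
    if quality is None:
        return None
    # np.asarray returns the input object itself when it is already an
    # ndarray, so no copy is made in that case.
    arr = np.asarray(quality)
    if arr.ndim != 1:
        raise ValueError("quality must be 1-D")
    if not np.issubdtype(arr.dtype, np.integer):
        raise ValueError("quality must contain integers")
    if arr.shape[0] != sequence_length:
        raise ValueError("quality must have one score per sequence character")
    if (arr < 0).any():
        raise ValueError("quality scores must be nonnegative")
    # Clear the WRITEABLE flag so the scores cannot be modified in place.
    arr.flags.writeable = False
    return arr


scores = np.array([30, 32, 28, 40])
prepared = prepare_quality(scores, sequence_length=4)

# The input was already a 1-D integer ndarray, so the same object comes back.
same_object = prepared is scores

# Writing to a read-only array raises ValueError.
try:
    prepared[0] = 0
    write_blocked = False
except ValueError:
    write_blocked = True
```

One consequence in this sketch is that, because no copy is made, the caller's own array also becomes read-only. Pass a copy if the original must stay writable.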

Notes

BiologicalSequence objects are immutable. Where applicable, methods return a new object of the same class. Subclasses are typically defined by the methods relevant to a specific type of biological sequence, and by restricting their characters to the IUPAC standard character set [R171] for that molecule type.
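To make the idea of an IUPAC character set concrete, the following sketch uses plain Python strings rather than the class itself. The names DEGENERACIES, STANDARD, is_valid_dna and expand are hypothetical and chosen only for illustration. The mapping holds the IUPAC degenerate nucleotide codes for DNA, with gap characters left out.

```python
from itertools import product

# IUPAC degenerate nucleotide codes (DNA) mapped to the bases they stand for.
DEGENERACIES = {
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
STANDARD = set("ACGT")


def is_valid_dna(seq):
    """Return True if every character is a standard or degenerate DNA code."""
    return all(c in STANDARD or c in DEGENERACIES for c in seq)


def expand(seq):
    """Yield every nondegenerate string that the input could represent."""
    # A standard base maps to itself; a degenerate code maps to its options.
    choices = [DEGENERACIES.get(c, c) for c in seq]
    for combo in product(*choices):
        # Each result is a new string; the input is never modified.
        yield "".join(combo)


expanded = list(expand("ARY"))
# expanded == ['AAC', 'AAT', 'AGC', 'AGT']
```

Returning new strings instead of editing the input mirrors the immutability described above, where methods hand back a new object.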

References

[R171] Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030.

Examples

>>> from skbio.sequence import BiologicalSequence
>>> s = BiologicalSequence('GGUCGUGAAGGA')
>>> t = BiologicalSequence('GGUCCUGAAGGU')

Attributes

sequence String containing underlying biological sequence characters.
id ID of the biological sequence.
description Description of the biological sequence.
quality Quality scores of the characters in the biological sequence.

Methods

__contains__(other) The in operator.
__eq__(other) The equality operator.
__getitem__(i) The indexing operator.
__hash__() The hash operator.
__iter__() The iter operator.
__len__() The len operator.
__ne__(other) The inequality operator.
__repr__() The repr method.
__reversed__() The reversed operator.
__str__() The str operator.
alphabet() Return the set of characters allowed in a BiologicalSequence.
copy(**kwargs) Return a copy of the current biological sequence.
count(subsequence) Returns the number of occurrences of subsequence.
degap() Returns a new BiologicalSequence with gap characters removed.
distance(other[, distance_fn]) Returns the distance to other
equals(other[, ignore]) Compare two biological sequences for equality.
find_features(feature_type[, min_length, ...]) Search the sequence for features
fraction_diff(other) Return fraction of positions that differ relative to other
fraction_same(other) Return fraction of positions that are the same relative to other
gap_alphabet() Return the set of characters defined as gaps.
gap_maps() Return tuples mapping between gapped and ungapped positions
gap_vector() Return list indicating positions containing gaps
has_quality() Return bool indicating presence of quality scores in the sequence.
has_unsupported_characters() Return bool indicating presence/absence of unsupported characters
index(subsequence) Return the position where subsequence first occurs
is_gap(char) Return True if char is in the gap_alphabet set
is_gapped() Return True if char(s) in gap_alphabet are present
is_valid() Return True if the sequence is valid
iupac_characters() Return the non-degenerate and degenerate characters.
iupac_degeneracies() Return the mapping of degenerate to non-degenerate characters.
iupac_degenerate_characters() Return the degenerate IUPAC characters.
iupac_standard_characters() Return the non-degenerate IUPAC characters.
k_word_counts(k[, overlapping]) Get the counts of words of length k
k_word_frequencies(k[, overlapping]) Get the frequencies of words of length k
k_words(k[, overlapping]) Get the list of words of length k
lower() Convert the BiologicalSequence to lowercase
nondegenerates() Yield all nondegenerate versions of the sequence.
read(fp[, format]) Create a new BiologicalSequence instance from a file.
regex_iter(regex[, retrieve_group_0]) Find patterns specified by regular expression
to_fasta([field_delimiter, terminal_character]) Return the sequence as a fasta-formatted string
unsupported_characters() Return the set of unsupported characters in the BiologicalSequence
upper() Convert the BiologicalSequence to uppercase
write(fp[, format]) Write an instance of BiologicalSequence to a file.
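The one-line summaries for k_words and fraction_diff leave the underlying computation implicit. The sketch below shows one plausible reading using plain Python strings. It does not call the library, the function bodies are assumptions rather than the library's code, and details such as how trailing characters are treated in the non-overlapping case may differ from the real methods.

```python
from collections import Counter


def k_words(seq, k, overlapping=True):
    """Return the words of length k in seq (sketch, not the library code)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Overlapping words advance one position at a time; non-overlapping
    # words advance k positions. Trailing characters that do not fill a
    # whole word are dropped in this sketch.
    step = 1 if overlapping else k
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, step)]


def fraction_diff(a, b):
    """Return the fraction of aligned positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    if not a:
        return 0.0
    return sum(x != y for x, y in zip(a, b)) / len(a)


s = "GGUCGUGAAGGA"
t = "GGUCCUGAAGGU"

words = k_words(s, 3, overlapping=False)
# words == ['GGU', 'CGU', 'GAA', 'GGA']

counts = Counter(k_words(s, 2))
# counts['GG'] == 2

diff = fraction_diff(s, t)
# s and t differ at 2 of 12 positions, so diff == 2 / 12
```

The two strings are the ones used in the Examples section, so the result can be checked by eye: they differ only at the fifth and the last character.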