class skbio.sequence.BiologicalSequence(sequence, id='', description='', quality=None, validate=False)

Base class for biological sequences.


sequence : python Sequence (e.g., str, list or tuple)

The biological sequence.

id : str, optional

The sequence id (e.g., an accession number).

description : str, optional

A description or comment about the sequence (e.g., “green fluorescent protein”).

quality : 1-D array_like, int, optional

Phred quality scores stored as nonnegative integers, one per sequence character. If provided, must be the same length as the biological sequence. Can be a 1-D numpy.ndarray of integers, or a structure that can be converted to this representation using numpy.asarray. A copy will not be made if quality is already a 1-D numpy.ndarray with an int dtype. The array will be made read-only (i.e., its WRITEABLE flag will be set to False).

validate : bool, optional

If True, runs the is_valid method after construction and raises BiologicalSequenceError if it returns False.



Raises BiologicalSequenceError if validate == True and is_valid returns False, or if quality is not the correct shape.
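The handling of quality described above can be approximated with plain NumPy. The sketch below is not scikit-bio's implementation: prepare_quality is a hypothetical helper, and its use of ValueError is an assumption made for illustration (the class itself is documented to raise BiologicalSequenceError).

```python
import numpy as np


def prepare_quality(quality, sequence_length):
    """Hypothetical helper: normalize quality scores as the docs describe."""
    if quality is None:
        return None
    # np.asarray returns the input object itself when it is already an
    # ndarray, so no copy is made in that case.
    scores = np.asarray(quality)
    if scores.ndim != 1:
        raise ValueError("quality must be 1-D")
    if not np.issubdtype(scores.dtype, np.integer):
        raise ValueError("quality must contain integers")
    if scores.size != sequence_length:
        raise ValueError("quality must match the sequence length")
    if (scores < 0).any():
        raise ValueError("quality scores must be nonnegative")
    # Mark the array read-only (clears its WRITEABLE flag).
    scores.flags.writeable = False
    return scores


existing = np.array([40, 38, 12, 30])
prepared = prepare_quality(existing, sequence_length=4)
from_list = prepare_quality([40, 38, 12, 30], sequence_length=4)
```

In this sketch prepared and existing are the same object, so the caller's array becomes read-only as well; passing a copy avoids that side effect when the original must stay writable.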


BiologicalSequence objects are immutable. Where applicable, methods return a new object of the same class. Subclasses typically add methods relevant to only one type of biological sequence and restrict their characters to the IUPAC standard character set [R171] for that molecule type.
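The immutability contract in this note can be illustrated with a small stand-in class. ImmutableSeq and ToyRNA below are hypothetical names, not part of scikit-bio, and the mechanism (blocking attribute assignment) is an assumption chosen for brevity rather than a description of how the library enforces immutability.

```python
class ImmutableSeq:
    """Hypothetical stand-in, not skbio's BiologicalSequence."""

    __slots__ = ("_sequence",)

    def __init__(self, sequence):
        # Bypass our own __setattr__ guard exactly once, at construction.
        object.__setattr__(self, "_sequence", "".join(sequence))

    def __setattr__(self, name, value):
        raise AttributeError("ImmutableSeq objects are immutable")

    @property
    def sequence(self):
        return self._sequence

    def upper(self):
        # Build a new object of the caller's own class, so subclasses
        # get instances of themselves back.
        return type(self)(self._sequence.upper())


class ToyRNA(ImmutableSeq):
    """Hypothetical subclass used only to show the return type."""

    __slots__ = ()


original = ToyRNA("gguc")
shouted = original.upper()
```

Calling upper() leaves original untouched and hands back a separate ToyRNA, which mirrors the "new object of the same class" behavior described above.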


[R171] A. Cornish-Bowden. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 1985 May 10; 13(9): 3021-3030.


>>> from skbio.sequence import BiologicalSequence
>>> s = BiologicalSequence('GGUCGUGAAGGA')
>>> t = BiologicalSequence('GGUCCUGAAGGU')
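To make the relationship between these two example sequences concrete, the sketch below counts mismatching positions in plain Python. fraction_diff_sketch is a hypothetical helper, not the library method; it assumes equal-length inputs and a simple per-position comparison, which is what the fraction_diff summary suggests.

```python
def fraction_diff_sketch(a, b):
    """Hypothetical helper: fraction of positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    if len(a) == 0:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / len(a)


s = "GGUCGUGAAGGA"
t = "GGUCCUGAAGGU"
diff = fraction_diff_sketch(s, t)  # 2 mismatches out of 12 positions
```

The two strings differ at index 4 (G versus C) and index 11 (A versus U), so the sketch yields 2/12. With scikit-bio installed, the corresponding call on the objects above would be s.fraction_diff(t).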


sequence String containing underlying biological sequence characters.
id ID of the biological sequence.
description Description of the biological sequence.
quality Quality scores of the characters in the biological sequence.


__contains__(other) The in operator.
__eq__(other) The equality operator.
__getitem__(i) The indexing operator.
__hash__() The hash operator.
__iter__() The iter operator.
__len__() The len operator.
__ne__(other) The inequality operator.
__repr__() The repr method.
__reversed__() The reversed operator.
__str__() The str operator.
alphabet() Return the set of characters allowed in a BiologicalSequence.
copy(**kwargs) Return a copy of the current biological sequence.
count(subsequence) Returns the number of occurrences of subsequence.
degap() Returns a new BiologicalSequence with gap characters removed.
distance(other[, distance_fn]) Returns the distance to other
equals(other[, ignore]) Compare two biological sequences for equality.
fraction_diff(other) Return fraction of positions that differ relative to other
fraction_same(other) Return fraction of positions that are the same relative to other
gap_alphabet() Return the set of characters defined as gaps.
gap_maps() Return tuples mapping between gapped and ungapped positions.
gap_vector() Return list indicating positions containing gaps
has_quality() Return bool indicating presence of quality scores in the sequence.
has_unsupported_characters() Return bool indicating presence/absence of unsupported characters
index(subsequence) Return the position where subsequence first occurs
is_gap(char) Return True if char is in the gap_alphabet set
is_gapped() Return True if char(s) in gap_alphabet are present
is_valid() Return True if the sequence is valid
iupac_characters() Return the non-degenerate and degenerate characters.
iupac_degeneracies() Return the mapping of degenerate to non-degenerate characters.
iupac_degenerate_characters() Return the degenerate IUPAC characters.
iupac_standard_characters() Return the non-degenerate IUPAC characters.
k_word_counts(k[, overlapping]) Get the counts of words of length k
k_word_frequencies(k[, overlapping]) Get the frequencies of words of length k
k_words(k[, overlapping]) Get the list of words of length k
lower() Convert the BiologicalSequence to lowercase
nondegenerates() Yield all nondegenerate versions of the sequence.
read(fp[, format]) Create a new BiologicalSequence instance from a file.
regex_iter(regex[, retrieve_group_0]) Find patterns specified by regular expression
to_fasta([field_delimiter, terminal_character]) Return the sequence as a fasta-formatted string
unsupported_characters() Return the set of unsupported characters in the BiologicalSequence
upper() Convert the BiologicalSequence to uppercase
write(fp[, format]) Write an instance of BiologicalSequence to a file.
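The k-word methods listed above take an overlapping flag. The sketch below shows one plausible reading of that flag in plain Python: overlapping words advance one position at a time, non-overlapping words advance k positions, and a trailing fragment shorter than k is dropped. k_words_sketch and k_word_frequencies_sketch are hypothetical helpers, and these semantics are assumptions rather than a statement of the library's behavior.

```python
from collections import Counter


def k_words_sketch(sequence, k, overlapping=True):
    """Hypothetical helper: list the words of length k in sequence."""
    if k < 1:
        raise ValueError("k must be at least 1")
    step = 1 if overlapping else k
    # Stop early enough that every word has exactly k characters.
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, step)]


def k_word_frequencies_sketch(sequence, k, overlapping=True):
    """Hypothetical helper: relative frequency of each word of length k."""
    counts = Counter(k_words_sketch(sequence, k, overlapping))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


overlapping_words = k_words_sketch("GGUCGU", 3)
disjoint_words = k_words_sketch("GGUCGU", 3, overlapping=False)
counts = Counter(k_words_sketch("GGUCGU", 2))
frequencies = k_word_frequencies_sketch("GGUCGU", 2)
```

For "GGUCGU" with k = 3, the overlapping list is GGU, GUC, UCG, CGU and the non-overlapping list is GGU, CGU. With k = 2, the word GU appears twice among five overlapping words.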