skbio.sequence.GrammaredSequence

class skbio.sequence.GrammaredSequence(sequence, metadata=None, positional_metadata=None, interval_metadata=None, lowercase=False, validate=True)

Store sequence data conforming to a character set.

This is an abstract base class (ABC) and cannot be instantiated directly.

Subclass it to create grammared sequences with custom alphabets.

Raises

ValueError – If sequence characters are not in the character set [1].

See also

DNA, RNA, Protein

References

[1] Cornish-Bowden A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030.

Examples

Note that in the example below, the properties must either be static or use skbio’s classproperty decorator.

>>> from skbio.sequence import GrammaredSequence
>>> from skbio.util import classproperty
>>> class CustomSequence(GrammaredSequence):
...     @classproperty
...     def degenerate_map(cls):
...         return {"X": set("AB")}
...
...     @classproperty
...     def definite_chars(cls):
...         return set("ABC")
...
...     @classproperty
...     def default_gap_char(cls):
...         return '-'
...
...     @classproperty
...     def gap_chars(cls):
...         return set('-.')
>>> seq = CustomSequence('ABABACAC')
>>> seq
CustomSequence
--------------------------
Stats:
    length: 8
    has gaps: False
    has degenerates: False
    has definites: True
--------------------------
0 ABABACAC
>>> seq = CustomSequence('XXXXXX')
>>> seq
CustomSequence
-------------------------
Stats:
    length: 6
    has gaps: False
    has degenerates: True
    has definites: False
-------------------------
0 XXXXXX
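
To illustrate why the properties above are declared with classproperty, here is a minimal plain-Python sketch of a property that is evaluated on the class rather than on an instance. The names classproperty_sketch and Alphabet are hypothetical and exist only for this illustration. This is not skbio's implementation, which may differ in detail.

```python
class classproperty_sketch:
    """Hypothetical stand-in for a property that is evaluated on the class."""

    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        # Ignore the instance and always call the function with the class.
        return self.func(owner)


class Alphabet:
    @classproperty_sketch
    def definite_chars(cls):
        return set("ABC")


# The value can be read from the class without creating an instance.
print(sorted(Alphabet.definite_chars))    # ['A', 'B', 'C']
# Instances see the same value.
print(sorted(Alphabet().definite_chars))  # ['A', 'B', 'C']
```

Grammar properties such as the character sets describe the sequence type rather than any single sequence, so being able to read them from the class itself is convenient.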

Attributes

alphabet

Return valid characters.

default_gap_char

Gap character to use when constructing a new gapped sequence.

default_write_format

definite_chars

Return definite characters.

degenerate_chars

Return degenerate characters.

degenerate_map

Return mapping of degenerate to definite characters.

gap_chars

Return characters defined as gaps.

interval_metadata

IntervalMetadata object containing info about interval features.

metadata

dict containing metadata which applies to the entire object.

nondegenerate_chars

Return non-degenerate characters.

observed_chars

Set of observed characters in the sequence.

positional_metadata

pd.DataFrame containing metadata along an axis.

values

Array containing underlying sequence characters.
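
The character-set attributes listed above are related to one another. The sketch below shows one way to picture that relationship, assuming the alphabet is the union of the definite, degenerate, and gap characters, and that the degenerate characters are the keys of the degenerate map. It reuses the values from the CustomSequence example and is plain Python rather than skbio code. The validate helper is hypothetical and only mimics the ValueError behavior described under Raises.

```python
# Values copied from the CustomSequence example above.
DEFINITE_CHARS = set("ABC")
DEGENERATE_MAP = {"X": set("AB")}
GAP_CHARS = set("-.")

# Assumption: degenerate characters are the keys of the degenerate map.
DEGENERATE_CHARS = set(DEGENERATE_MAP)

# Assumption: the alphabet is the union of the three character sets.
ALPHABET = DEFINITE_CHARS | DEGENERATE_CHARS | GAP_CHARS


def validate(sequence):
    """Hypothetical helper: raise ValueError for characters outside ALPHABET."""
    invalid = sorted(set(sequence) - ALPHABET)
    if invalid:
        raise ValueError(f"Invalid characters in sequence: {invalid}")
    return sequence


print(sorted(ALPHABET))   # ['-', '.', 'A', 'B', 'C', 'X']
print(validate("AXC-"))   # AXC-
```

Calling validate("ABZ") in this sketch raises ValueError, because "Z" is not in any of the three character sets.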

Built-ins

bool(gs)

Returns truth value (truthiness) of sequence.

x in gs

Determine if a subsequence is contained in this sequence.

copy.copy(gs)

Return a shallow copy of this sequence.

copy.deepcopy(gs)

Return a deep copy of this sequence.

gs1 == gs2

Determine if this sequence is equal to another.

gs[x]

Slice this sequence.

iter(gs)

Iterate over positions in this sequence.

len(gs)

Return the number of characters in this sequence.

gs1 != gs2

Determine if this sequence is not equal to another.

reversed(gs)

Iterate over positions in this sequence in reverse order.

str(gs)

Return sequence characters as a string.

Methods

concat(sequences[, how])

Concatenate an iterable of Sequence objects.

count(subsequence[, start, end])

Count occurrences of a subsequence in this sequence.

definites()

Find positions containing definite characters in the sequence.

degap()

Return a new sequence with gap characters removed.

degenerates()

Find positions containing degenerate characters in the sequence.

distance(other[, metric])

Compute the distance to another sequence.

expand_degenerates()

Yield all possible definite versions of the sequence.

find_motifs(motif_type[, min_length, ignore])

Search the biological sequence for motifs.

find_with_regex(regex[, ignore])

Generate slices for patterns matched by a regular expression.

frequencies([chars, relative])

Compute frequencies of characters in the sequence.

gaps()

Find positions containing gaps in the biological sequence.

has_definites()

Determine if sequence contains one or more definite characters.

has_degenerates()

Determine if sequence contains one or more degenerate characters.

has_gaps()

Determine if the sequence contains one or more gap characters.

has_interval_metadata()

Determine if the object has interval metadata.

has_metadata()

Determine if the object has metadata.

has_nondegenerates()

Determine if sequence contains one or more non-degenerate characters.

has_positional_metadata()

Determine if the object has positional metadata.

index(subsequence[, start, end])

Find position where subsequence first occurs in the sequence.

iter_contiguous(included[, min_length, invert])

Yield contiguous subsequences based on included.

iter_kmers(k[, overlap])

Generate kmers of length k from this sequence.

kmer_frequencies(k[, overlap, relative])

Return counts of words of length k from this sequence.

lowercase(lowercase)

Return a case-sensitive string representation of the sequence.

match_frequency(other[, relative])

Return count of positions that are the same between two sequences.

matches(other)

Find positions that match with another sequence.

mismatch_frequency(other[, relative])

Return count of positions that differ between two sequences.

mismatches(other)

Find positions that do not match with another sequence.

nondegenerates()

Find positions containing non-degenerate characters in the sequence.

read(file[, format])

Create a new Sequence instance from a file.

replace(where, character)

Replace values in this sequence with a different character.

to_regex([within_capture])

Return regular expression object that accounts for degenerate chars.

write(file[, format])

Write an instance of Sequence to a file.
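
Several of the methods above follow directly from the grammar. The sketch below is a plain-Python illustration of what degap, expand_degenerates, to_regex, and iter_kmers compute conceptually, operating on ordinary strings with the alphabet from the CustomSequence example. These functions are hypothetical stand-ins, not skbio's implementations, and the real methods return skbio objects and accept additional parameters.

```python
import itertools
import re

# Values copied from the CustomSequence example above.
DEGENERATE_MAP = {"X": set("AB")}
GAP_CHARS = set("-.")


def degap(sequence):
    """Return the string with gap characters removed."""
    return "".join(char for char in sequence if char not in GAP_CHARS)


def expand_degenerates(sequence):
    """Yield every string obtained by resolving each degenerate character."""
    # Non-degenerate characters contribute a single choice: themselves.
    choices = [sorted(DEGENERATE_MAP.get(char, char)) for char in sequence]
    for combination in itertools.product(*choices):
        yield "".join(combination)


def to_regex(sequence):
    """Compile a pattern in which degenerate characters become character classes."""
    parts = []
    for char in sequence:
        if char in DEGENERATE_MAP:
            parts.append("[" + "".join(sorted(DEGENERATE_MAP[char])) + "]")
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts))


def iter_kmers(sequence, k, overlap=True):
    """Yield substrings of length k, stepping by 1 (overlap) or by k."""
    step = 1 if overlap else k
    for start in range(0, len(sequence) - k + 1, step):
        yield sequence[start:start + k]


print(degap("A-B.C"))                                # ABC
print(list(expand_degenerates("AXC")))               # ['AAC', 'ABC']
print(to_regex("AXC").pattern)                       # A[AB]C
print(list(iter_kmers("ABABAC", 3)))                 # ['ABA', 'BAB', 'ABA', 'BAC']
print(list(iter_kmers("ABABAC", 3, overlap=False)))  # ['ABA', 'BAC']
```

In this sketch, trailing characters that do not fill a complete k-mer are dropped when overlap is False. That is an assumption of the sketch, not a statement about skbio's behavior.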