Biological sequences (skbio.sequence)

This module provides functionality for working with biological sequences, including generic sequences, nucleotide sequences, DNA sequences, RNA sequences, and protein sequences. Class methods and attributes are also available to obtain valid character sets, complement maps for different sequence types, and degenerate character definitions. Additionally, this module defines the GeneticCode class, which represents an immutable object that translates RNA or DNA strings to amino acid sequences.


BiologicalSequence(sequence[, id, ...]) Base class for biological sequences.
NucleotideSequence(sequence[, id, ...]) Base class for nucleotide sequences.
DNASequence(sequence[, id, description, ...]) Base class for DNA sequences.
RNASequence(sequence[, id, description, ...]) Base class for RNA sequences.
ProteinSequence(sequence[, id, description, ...]) Base class for protein sequences.
GeneticCode(code_sequence[, id, name, ...]) Class to hold codon to amino acid mapping, and vice versa.


genetic_code(*id) skbio.sequence.GeneticCode factory given an optional id.


BiologicalSequenceError General error for biological sequence validation failures.
GeneticCodeError Base class exception used by the GeneticCode class.
GeneticCodeInitError Exception raised by the GeneticCode class upon a bad initialization.
InvalidCodonError Exception raised by the GeneticCode class if __getitem__ fails.


>>> from skbio.sequence import DNASequence, RNASequence

New sequences are created with optional id and description fields.

>>> d1 = DNASequence('ACC--G-GGTA..')
>>> d1 = DNASequence('ACC--G-GGTA..', id="seq1")
>>> d1 = DNASequence('ACC--G-GGTA..', id="seq1", description="GFP")

New sequences can also be created from existing sequences, for example as their reverse complement or degapped (i.e., unaligned) version.

>>> d2 = d1.degap()
>>> d1
<DNASequence: ACC--G-GGT... (length: 13)>
>>> d2
<DNASequence: ACCGGGTA (length: 8)>
>>> d3 = d2.reverse_complement()
>>> d3
<DNASequence: TACCCGGT (length: 8)>
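
To make the two operations above concrete, here is a minimal pure-Python sketch of degapping and reverse complementing a plain string. The helper names (`degap`, `reverse_complement`, `GAP_CHARS`, `DNA_COMPLEMENT`) are hypothetical and are not skbio functions. The sketch assumes Python 3, uppercase unambiguous DNA, and `-` and `.` as the only gap characters.

```python
# Hypothetical helpers for illustration only; these are not part of skbio.
GAP_CHARS = "-."  # assumed gap characters, taken from the example sequence
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")  # unambiguous bases only


def degap(seq):
    """Return seq with every gap character removed."""
    return "".join(base for base in seq if base not in GAP_CHARS)


def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(DNA_COMPLEMENT)[::-1]


ungapped = degap("ACC--G-GGTA..")
revcomp = reverse_complement(ungapped)
print(ungapped)
print(revcomp)
```

For the example sequence this gives the same strings as `d2` and `d3` above.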

It’s also straightforward to compute distances between sequences for use in sequence clustering, phylogenetic reconstruction, etc. The default metric is Hamming distance, and a user-defined distance metric can optionally be supplied.

>>> d4 = DNASequence('GACCCGCT')
>>> d5 = DNASequence('GACCCCCT')
>>> d3.distance(d4)
0.25
>>> d3.distance(d5)
0.375
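
As a rough illustration of the default metric, the sketch below computes a Hamming distance on plain strings. It assumes the normalized form, that is, the fraction of positions that differ, and it assumes Python 3 division. The function name `hamming_fraction` is hypothetical and is not an skbio call.

```python
def hamming_fraction(seq_a, seq_b):
    """Return the fraction of aligned positions at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("Hamming distance needs sequences of equal length")
    if not seq_a:
        return 0.0  # two empty sequences are treated as identical
    mismatches = sum(a != b for a, b in zip(seq_a, seq_b))
    return mismatches / len(seq_a)


# The same strings as d3, d4 and d5 in the example above.
dist_d4 = hamming_fraction("TACCCGGT", "GACCCGCT")
dist_d5 = hamming_fraction("TACCCGGT", "GACCCCCT")
print(dist_d4, dist_d5)
```

Here the first pair differs at 2 of 8 positions and the second at 3 of 8, so the sketch returns 0.25 and 0.375.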

Class-level methods contain information about the molecule types.

>>> DNASequence.iupac_degeneracies()['B']
set(['C', 'T', 'G'])
>>> RNASequence.iupac_degeneracies()['B']
set(['C', 'U', 'G'])
>>> DNASequence.is_gap('-')
True
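
The degeneracy lookups above can be pictured as a plain mapping from each IUPAC ambiguity code to the bases it stands for. The sketch below hard-codes the standard IUPAC nucleotide codes for DNA and derives the RNA set for `B` by swapping T for U. The names `DNA_DEGENERACIES` and `expand_degenerate` are hypothetical and are not part of skbio.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes for DNA (illustrative table).
DNA_DEGENERACIES = {
    "R": set("AG"), "Y": set("CT"), "M": set("AC"), "K": set("GT"),
    "W": set("AT"), "S": set("CG"), "B": set("CGT"), "D": set("AGT"),
    "H": set("ACT"), "V": set("ACG"), "N": set("ACGT"),
}


def expand_degenerate(seq):
    """Return every unambiguous sequence that seq could represent."""
    # Characters without an entry in the table stand only for themselves.
    choices = [sorted(DNA_DEGENERACIES.get(base, {base})) for base in seq]
    return ["".join(combo) for combo in product(*choices)]


# The RNA definition of 'B' is the DNA one with T replaced by U.
rna_b = {"U" if base == "T" else base for base in DNA_DEGENERACIES["B"]}
expanded = expand_degenerate("ABT")
print(sorted(rna_b))
print(expanded)
```

Expanding `ABT` yields one sequence per base that `B` can stand for.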

Creating and using a GeneticCode object

>>> from skbio.sequence import genetic_code
>>> from pprint import pprint
>>> sgc = genetic_code(1)
>>> sgc
GeneticCode(FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG)
>>> sgc['UUU'] == 'F'
True
>>> sgc['TTT'] == 'F'
True
>>> sgc['F'] == ['TTT', 'TTC']          # in arbitrary order
True
>>> sgc['*'] == ['TAA', 'TAG', 'TGA']   # in arbitrary order
True
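
One way to picture these lookups is as a pair of dictionaries built from a 64-character code string. The sketch below uses the NCBI standard code (table 1), written with codons enumerated in T, C, A, G order. How GeneticCode stores its data internally is not shown here, so treat this layout and the names `codon_to_aa`, `aa_to_codons`, and `lookup_codon` as assumptions for illustration.

```python
from collections import defaultdict

BASES = "TCAG"  # NCBI ordering of bases within the code string
# NCBI translation table 1 (standard code), one amino acid per codon.
STANDARD_CODE = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)

# Enumerate codons TTT, TTC, TTA, TTG, TCT, ... to match the code string.
codons = [a + b + c for a in BASES for b in BASES for c in BASES]
codon_to_aa = dict(zip(codons, STANDARD_CODE))

# Invert the mapping: amino acid (or '*' for stop) -> list of codons.
aa_to_codons = defaultdict(list)
for codon, amino_acid in codon_to_aa.items():
    aa_to_codons[amino_acid].append(codon)


def lookup_codon(codon):
    """Look up a DNA or RNA codon by normalizing U to T first."""
    return codon_to_aa[codon.upper().replace("U", "T")]


print(lookup_codon("UUU"), lookup_codon("TTT"))
print(sorted(aa_to_codons["*"]))
```

Comparing sorted lists, as in the last line, sidesteps the arbitrary ordering noted in the example above.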

Retrieving the anticodons of the object

>>> pprint(sgc.anticodons)
{'*': ['TTA', 'CTA', 'TCA'],
 'A': ['AGC', 'GGC', 'TGC', 'CGC'],
 'C': ['ACA', 'GCA'],
 'D': ['ATC', 'GTC'],
 'E': ['TTC', 'CTC'],
 'F': ['AAA', 'GAA'],
 'G': ['ACC', 'GCC', 'TCC', 'CCC'],
 'H': ['ATG', 'GTG'],
 'I': ['AAT', 'GAT', 'TAT'],
 'K': ['TTT', 'CTT'],
 'L': ['TAA', 'CAA', 'AAG', 'GAG', 'TAG', 'CAG'],
 'M': ['CAT'],
 'N': ['ATT', 'GTT'],
 'P': ['AGG', 'GGG', 'TGG', 'CGG'],
 'Q': ['TTG', 'CTG'],
 'R': ['ACG', 'GCG', 'TCG', 'CCG', 'TCT', 'CCT'],
 'S': ['AGA', 'GGA', 'TGA', 'CGA', 'ACT', 'GCT'],
 'T': ['AGT', 'GGT', 'TGT', 'CGT'],
 'V': ['AAC', 'GAC', 'TAC', 'CAC'],
 'W': ['CCA'],
 'Y': ['ATA', 'GTA']}
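
Each anticodon in the listing above is the reverse complement of the corresponding codon. The sketch below reproduces three of the entries from a small hand-written codon mapping. The helper `anticodon` and the dictionary `codons_by_aa` are hypothetical names, and Python 3 is assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def anticodon(codon):
    """Return the reverse complement of a DNA codon."""
    return codon.translate(COMPLEMENT)[::-1]


# A small excerpt of the standard code, written out by hand for illustration.
codons_by_aa = {
    "F": ["TTT", "TTC"],
    "M": ["ATG"],
    "*": ["TAA", "TAG", "TGA"],
}

# Keep each anticodon in the same position as the codon it pairs with.
anticodons_by_aa = {
    amino_acid: [anticodon(codon) for codon in codon_list]
    for amino_acid, codon_list in codons_by_aa.items()
}
print(anticodons_by_aa)
```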

NucleotideSequences can be translated using a GeneticCode object.

>>> d6 = DNASequence('ATGTCTAAATGA')
>>> from skbio.sequence import genetic_code
>>> gc = genetic_code(11)
>>> gc.translate(d6)
<ProteinSequence: MSK* (length: 4)>
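
Conceptually, translation reads the sequence three bases at a time and looks each codon up in the code table. The sketch below does this on a plain string. It uses the NCBI table 1 code string, whose codon assignments NCBI table 11 shares, and it silently drops a trailing partial codon, which is a choice made for this sketch rather than documented skbio behavior. The names `CODON_TABLE` and `translate` are hypothetical.

```python
from itertools import product

BASES = "TCAG"  # NCBI ordering of bases within the code string
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest: TTT, TTC, TTA, TTG, TCT, ...
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), CODE)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring a trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


protein = translate("ATGTCTAAATGA")
print(protein)
```

For the example sequence the codons ATG, TCT, AAA, and TGA map to M, S, K, and a stop, matching the output above.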