Biological sequences (skbio.sequence)

This module provides functionality for working with biological sequences, including generic sequences, nucleotide sequences, DNA sequences, and RNA sequences. Class methods and attributes are also available for obtaining valid character sets, complement maps for different sequence types, and degenerate character definitions. Additionally, this module defines the GeneticCode class, an immutable object that translates RNA or DNA strings to amino acid sequences.

Classes

BiologicalSequence(sequence[, id, ...]) Base class for biological sequences.
NucleotideSequence(sequence[, id, ...]) Base class for nucleotide sequences.
DNASequence(sequence[, id, description, ...]) Base class for DNA sequences.
RNASequence(sequence[, id, description, ...]) Base class for RNA sequences.
ProteinSequence(sequence[, id, description, ...]) Base class for protein sequences.
GeneticCode(code_sequence[, id, name, ...]) Class to hold codon to amino acid mapping, and vice versa.

Functions

genetic_code(*id) skbio.sequence.GeneticCode factory given an optional id.

Exceptions

BiologicalSequenceError General error for biological sequence validation failures.
GeneticCodeError Base class exception used by the GeneticCode class.
GeneticCodeInitError Exception raised by the GeneticCode class upon a bad initialization.
InvalidCodonError Exception raised by the GeneticCode class if __getitem__ fails.

Examples

>>> from skbio.sequence import DNASequence, RNASequence

New sequences are created with optional id and description fields.

>>> d1 = DNASequence('ACC--G-GGTA..')
>>> d1 = DNASequence('ACC--G-GGTA..', id="seq1")
>>> d1 = DNASequence('ACC--G-GGTA..', id="seq1", description="GFP")

New sequences can also be created from existing sequences, for example as their reverse complement or degapped (i.e., unaligned) version.

>>> d2 = d1.degap()
>>> d1
<DNASequence: ACC--G-GGT... (length: 13)>
>>> d2
<DNASequence: ACCGGGTA (length: 8)>
>>> d3 = d2.reverse_complement()
>>> d3
<DNASequence: TACCCGGT (length: 8)>
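
To make the two operations above concrete, here is a minimal plain-Python sketch of what degapping and reverse-complementing do to the underlying string. The helper names `degap` and `reverse_complement` and the two lookup tables are hypothetical stand-ins written for illustration; they are not skbio's implementation and handle only unambiguous DNA characters.

```python
# Illustrative sketch only: hypothetical helpers, not part of skbio.
GAP_CHARS = set("-.")  # the two gap characters used in the example above
DNA_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def degap(seq):
    """Return seq with every gap character removed."""
    return "".join(ch for ch in seq if ch not in GAP_CHARS)


def reverse_complement(seq):
    """Walk the sequence backwards and complement each base."""
    return "".join(DNA_COMPLEMENT[base] for base in reversed(seq))


ungapped = degap("ACC--G-GGTA..")
print(ungapped)                      # ACCGGGTA
print(reverse_complement(ungapped))  # TACCCGGT
```

The printed strings match the sequences shown for d2 and d3 above.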

It’s also straightforward to compute distances between sequences for use in sequence clustering, phylogenetic reconstruction, etc. The default metric is Hamming distance; a user-defined distance metric can optionally be supplied instead.

>>> d4 = DNASequence('GACCCGCT')
>>> d5 = DNASequence('GACCCCCT')
>>> d3.distance(d4)
0.25
>>> d3.distance(d5)
0.375
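
The values above are consistent with a Hamming distance expressed as the fraction of positions that differ. The sketch below reproduces them on plain strings; `hamming_fraction` is a hypothetical name chosen for illustration and should not be read as the library's own function.

```python
# Illustrative sketch only: a normalized Hamming distance on plain strings.
def hamming_fraction(a, b):
    """Fraction of aligned positions at which a and b differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    if not a:
        return 0.0
    mismatches = sum(x != y for x, y in zip(a, b))
    return mismatches / float(len(a))


print(hamming_fraction("TACCCGGT", "GACCCGCT"))  # 0.25  (2 of 8 positions differ)
print(hamming_fraction("TACCCGGT", "GACCCCCT"))  # 0.375 (3 of 8 positions differ)
```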

Class-level methods provide information about each molecule type.

>>> DNASequence.iupac_degeneracies()['B']
set(['C', 'T', 'G'])
>>> RNASequence.iupac_degeneracies()['B']
set(['C', 'U', 'G'])
>>> DNASequence.is_gap('-')
True
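
As a rough illustration of what a degeneracy map is for, the following sketch expands a degenerate DNA string into every concrete sequence it can represent. The table holds the standard IUPAC nucleotide codes; the names `DNA_DEGENERACIES`, `RNA_DEGENERACIES`, and `expand` are hypothetical and are not taken from skbio.

```python
# Illustrative sketch only: standard IUPAC nucleotide codes, hypothetical names.
from itertools import product

DNA_DEGENERACIES = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# The RNA table is the same with U in place of T.
RNA_DEGENERACIES = {
    code: bases.replace("T", "U") for code, bases in DNA_DEGENERACIES.items()
}


def expand(seq, degeneracies=DNA_DEGENERACIES):
    """List every non-degenerate sequence that seq can stand for."""
    choices = [degeneracies.get(ch, ch) for ch in seq]
    return sorted("".join(combo) for combo in product(*choices))


print(expand("AB"))                   # ['AC', 'AG', 'AT']
print(sorted(RNA_DEGENERACIES["B"]))  # ['C', 'G', 'U']
```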

Creating and using a GeneticCode object

>>> from skbio.sequence import genetic_code
>>> from pprint import pprint
>>> sgc = genetic_code(1)
>>> sgc
GeneticCode(FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG)
>>> sgc['UUU'] == 'F'
True
>>> sgc['TTT'] == 'F'
True
>>> sgc['F'] == ['TTT', 'TTC']          # in arbitrary order
True
>>> sgc['*'] == ['TAA', 'TAG', 'TGA']   # in arbitrary order
True
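
The 64-character string in the GeneticCode repr above can be read as one amino acid per codon. The sketch below assumes the codons are enumerated with the bases in the order T, C, A, G (the ordering used by the NCBI genetic code tables), which is consistent with the lookups shown above; the variable names are hypothetical and this is not the class's actual implementation.

```python
# Illustrative sketch only: decode a 64-character code string into lookup tables.
from itertools import product

CODE_SEQUENCE = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
BASES = "TCAG"  # assumed base order: TTT, TTC, TTA, TTG, TCT, ...

# Forward table: codon -> amino acid.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
codon_to_aa = dict(zip(codons, CODE_SEQUENCE))

# Reverse table: amino acid -> list of codons.
aa_to_codons = {}
for codon in codons:
    aa_to_codons.setdefault(codon_to_aa[codon], []).append(codon)

print(codon_to_aa["TTT"])         # F
print(sorted(aa_to_codons["F"]))  # ['TTC', 'TTT']
print(sorted(aa_to_codons["*"]))  # ['TAA', 'TAG', 'TGA']
```

Sorting the reverse lookups sidesteps the "arbitrary order" caveat noted in the examples above.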

Retrieving the anticodons of the object

>>> pprint(sgc.anticodons)
{'*': ['TTA', 'CTA', 'TCA'],
 'A': ['AGC', 'GGC', 'TGC', 'CGC'],
 'C': ['ACA', 'GCA'],
 'D': ['ATC', 'GTC'],
 'E': ['TTC', 'CTC'],
 'F': ['AAA', 'GAA'],
 'G': ['ACC', 'GCC', 'TCC', 'CCC'],
 'H': ['ATG', 'GTG'],
 'I': ['AAT', 'GAT', 'TAT'],
 'K': ['TTT', 'CTT'],
 'L': ['TAA', 'CAA', 'AAG', 'GAG', 'TAG', 'CAG'],
 'M': ['CAT'],
 'N': ['ATT', 'GTT'],
 'P': ['AGG', 'GGG', 'TGG', 'CGG'],
 'Q': ['TTG', 'CTG'],
 'R': ['ACG', 'GCG', 'TCG', 'CCG', 'TCT', 'CCT'],
 'S': ['AGA', 'GGA', 'TGA', 'CGA', 'ACT', 'GCT'],
 'T': ['AGT', 'GGT', 'TGT', 'CGT'],
 'V': ['AAC', 'GAC', 'TAC', 'CAC'],
 'W': ['CCA'],
 'Y': ['ATA', 'GTA']}
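
Each anticodon listed above is the reverse complement of a codon for the same amino acid. A minimal sketch of that relationship, using a hypothetical `anticodon` helper on plain DNA strings rather than anything from skbio:

```python
# Illustrative sketch only: an anticodon as the reverse complement of a codon.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def anticodon(codon):
    """Reverse the codon and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(codon))


stop_codons = ["TAA", "TAG", "TGA"]
print([anticodon(c) for c in stop_codons])  # ['TTA', 'CTA', 'TCA']
print(anticodon("ATG"))                     # CAT
```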

NucleotideSequence objects can be translated using a GeneticCode object.

>>> d6 = DNASequence('ATGTCTAAATGA')
>>> from skbio.sequence import genetic_code
>>> gc = genetic_code(11)
>>> gc.translate(d6)
<ProteinSequence: MSK* (length: 4)>
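
Conceptually, translation reads the sequence three bases at a time and looks each codon up in the code table. The sketch below does this on a plain string using the standard code string shown earlier (NCBI table 11 assigns the same amino acids to codons as table 1). The `translate` function and `CODON_TABLE` name are hypothetical, the T, C, A, G codon ordering is an assumption, and dropping trailing partial codons is a choice made for this sketch, not a statement about how GeneticCode.translate behaves.

```python
# Illustrative sketch only: codon-by-codon translation of a plain DNA string.
from itertools import product

CODE_SEQUENCE = (
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
)
# Assumed codon order: bases enumerated as T, C, A, G at each position.
CODON_TABLE = dict(
    zip(("".join(t) for t in product("TCAG", repeat=3)), CODE_SEQUENCE)
)


def translate(seq):
    """Translate a DNA or RNA string; trailing partial codons are ignored."""
    dna = seq.upper().replace("U", "T")
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


print(translate("ATGTCTAAATGA"))  # MSK*
```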