This module provides functionality for working with biological sequence collections and alignments. These can be composed of generic sequences, nucelotide sequences, DNA sequences, and RNA sequences. By default, input is not validated, except that sequence ids must be unique, but all contructor methods take a validate option which checks different features of the input based on SequenceCollection type.
SequenceCollection(seqs[, validate]) | Class for storing collections of biological sequences. |
Alignment(seqs[, validate, score, ...]) | Class for storing alignments of biological sequences. |
StockholmAlignment(seqs[, gf, gs, gr, gc, ...]) | Contains the metadata information in a Stockholm file alignment |
StripedSmithWaterman | Performs a striped (banded) Smith Waterman Alignment. |
AlignmentStructure | Wraps the result of an alignment c struct so it is accessible to Python |
local_pairwise_align_ssw | Align query and target sequences with Striped Smith-Waterman. |
pairwise.global_pairwise_align_nucleotide(...) | Globally align exactly two nucleotide seqs with Needleman-Wunsch |
pairwise.global_pairwise_align_protein(seq1, ...) | Globally align exactly two protein seqs with Needleman-Wunsch |
pairwise.global_pairwise_align(seq1, seq2, ...) | Globally align exactly two seqs with Needleman-Wunsch |
pairwise.local_pairwise_align_nucleotide(...) | Locally align exactly two nucleotide seqs with Smith-Waterman |
pairwise.local_pairwise_align_protein(seq1, seq2) | Locally align exactly two protein seqs with Smith-Waterman |
pairwise.local_pairwise_align(seq1, seq2, ...) | Locally align exactly two seqs with Smith-Waterman |
>>> from StringIO import StringIO
>>> from skbio.core.alignment import SequenceCollection, Alignment
>>> from skbio.core.sequence import DNA
>>> seqs = [DNA("ACC--G-GGTA..", id="seq1"),
... DNA("TCC--G-GGCA..", id="seqs2")]
>>> a1 = Alignment(seqs)
>>> a1
<Alignment: n=2; mean +/- std length=13.00 +/- 0.00>
>>> seqs = [DNA("ACCGGG", id="seq1"),
... DNA("TCCGGGCA", id="seq2")]
>>> s1 = SequenceCollection(seqs)
>>> s1
<SequenceCollection: n=2; mean +/- std length=7.00 +/- 1.00>
>>> from skbio.parse.sequences import parse_fasta
>>> fasta_f = StringIO('>seq1\n'
... 'CGATGTCGATCGATCGATCGATCAG\n'
... '>seq2\n'
... 'CATCGATCGATCGATGCATGCATGCATG\n')
>>> s1 = SequenceCollection.from_fasta_records(parse_fasta(fasta_f), DNA)
>>> s1
<SequenceCollection: n=2; mean +/- std length=26.50 +/- 1.50>
>>> from skbio.core.sequence import RNA
>>> from skbio.core.alignment import StockholmAlignment
>>> seqs = [RNA("ACC--G-GGGU", id="seq1"),
... RNA("TCC--G-GGGA", id="seq2")]
>>> gc = {'SS_cons': '(((.....)))'}
>>> sto = StockholmAlignment(seqs, gc=gc)
>>> print(sto)
# STOCKHOLM 1.0
seq1 ACC--G-GGGU
seq2 TCC--G-GGGA
#=GC SS_cons (((.....)))
//
>>> sto.gc
{'SS_cons': '(((.....)))'}
Using the convenient local_pairwise_align_ssw function:
>>> from skbio.core.alignment import local_pairwise_align_ssw
>>> alignment = local_pairwise_align_ssw(
... "ACTAAGGCTCTCTACCCCTCTCAGAGA",
... "ACTAAGGCTCCTAACCCCCTTTTCTCAGA"
... )
>>> print alignment
>query
ACTAAGGCTCTC-TACCC----CTCTCAGA
>target
ACTAAGGCTC-CTAACCCCCTTTTCTCAGA
Using the StripedSmithWaterman object:
>>> from skbio.core.alignment import StripedSmithWaterman
>>> query = StripedSmithWaterman("ACTAAGGCTCTCTACCCCTCTCAGAGA")
>>> alignment = query("AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA")
>>> print alignment
ACTAAGGCTC...
ACTAAGGCTC...
Score: 49
Length: 28
Using the StripedSmithWaterman object for multiple targets in an efficient way and finding the aligned sequence representations:
>>> from skbio.core.alignment import StripedSmithWaterman
>>> alignments = []
>>> target_sequences = [
... "GCTAACTAGGCTCCCTTCTACCCCTCTCAGAGA",
... "GCCCAGTAGCTTCCCAATATGAGAGCATCAATTGTAGATCGGGCC",
... "TCTATAAGATTCCGCATGCGTTACTTATAAGATGTCTCAACGG",
... "TAGAGATTAATTGCCACTGCCAAAATTCTG"
... ]
>>> query_sequence = "ACTAAGGCTCTCTACCCCTCTCAGAGA"
>>> query = StripedSmithWaterman(query_sequence)
>>> for target_sequence in target_sequences:
... alignment = query(target_sequence)
... alignments.append(alignment)
...
>>> print alignments[0]
ACTAAGGCT-...
ACT-AGGCTC...
Score: 38
Length: 30
>>> print alignments[0].aligned_query_sequence
ACTAAGGCT---CTCTACCCCTCTCAGAGA
>>> print alignments[0].aligned_target_sequence
ACT-AGGCTCCCTTCTACCCCTCTCAGAGA
scikit-bio also provides pure-Python implementations of Smith-Waterman and Needleman-Wunsch alignment. These are much slower than the methods described above, but serve as useful educational examples as they’re simpler to experiment with. Functions are provided for local and global alignment of protein and nucleotide sequences. The global* and local* functions differ in the underlying algorithm that is applied (global* uses Needleman- Wunsch while local* uses Smith-Waterman), and *protein and *nucleotide differ in their default scoring of matches, mismatches, and gaps.
Here we locally align a pair of protein sequences using gap open penalty of 11 and a gap extend penalty of 1 (in other words, it is much more costly to open a new gap than extend an existing one).
>>> from skbio.core.alignment.pairwise import local_pairwise_align_protein
>>> s1 = "HEAGAWGHEE"
>>> s2 = "PAWHEAE"
>>> r = local_pairwise_align_protein(s1, s2, 11, 1)
This returns an skbio.Alignment object. We can look at the aligned sequences:
>>> print(str(r[0]))
AWGHE
>>> print(str(r[1]))
AW-HE
We can identify the start and end positions of each aligned sequence as follows:
>>> r.start_end_positions()
[(4, 8), (1, 4)]
And we can view the score of the alignment using the score method:
>>> r.score()
25.0
Similarly, we can perform global alignment of nucleotide sequences, and print the resulting alignment as fasta records:
>>> from skbio.core.alignment.pairwise import global_pairwise_align_nucleotide
>>> s1 = "GCGTGCCTAAGGTATGCAAG"
>>> s2 = "ACGTGCCTAGGTACGCAAG"
>>> r = global_pairwise_align_nucleotide(s1, s2)
>>> print(r.to_fasta())
>0
GCGTGCCTAAGGTATGCAAG
>1
ACGTGCCTA-GGTACGCAAG