skbio.parse.sequences.parse_clustal

skbio.parse.sequences.parse_clustal(record, strict=True)

Yields labels and sequences from a Clustal alignment file.

Parameters:

record : open file object

An open Clustal file.

strict : boolean

Whether to raise a RecordError when no labels are found (default: True).

Yields:

label : str

Label of the sequence.

seq : str

Full sequence for that label, with its segments from every block of the file concatenated.

Notes

Currently, parse_clustal does not check whether the sequences are the same length or whether they are in order. It skips any line that starts with a blank (whitespace) character.
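Because lengths are not validated, a caller who needs a rectangular alignment can check this after parsing. The following is a minimal sketch, assuming the parser's output has been collected as (label, seq) pairs; check_equal_lengths is a hypothetical helper written for illustration and is not part of scikit-bio.

```python
def check_equal_lengths(pairs):
    """Return the common sequence length, or raise ValueError if lengths differ.

    `pairs` is any iterable of (label, seq) tuples, such as the pairs
    yielded by parse_clustal. Hypothetical helper, for illustration only.
    """
    lengths = {label: len(seq) for label, seq in pairs}
    distinct = set(lengths.values())
    if len(distinct) > 1:
        raise ValueError("sequences differ in length: %r" % (lengths,))
    # An empty input has no common length; this sketch reports it as 0.
    return distinct.pop() if distinct else 0
```

It could be applied as check_equal_lengths(parse_clustal(clustal_f)), which consumes the generator and raises if any sequence is shorter or longer than the others.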

parse_clustal preserves the order of the sequences from the original file. However, it does use a dict as an intermediate, so two sequences can’t have the same label. This is probably OK since Clustal will refuse to run on a FASTA file in which two sequences have the same label, but could potentially cause trouble with manually edited files (all the segments of the conflicting sequences would be interleaved, possibly in an unpredictable way).
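For manually edited files, duplicate labels can be detected before parsing. The sketch below is one possible approach, not something scikit-bio provides: find_duplicate_labels is a hypothetical helper, and it assumes that the label is the first whitespace-separated token on a line, that header lines start with "CLUSTAL", and that blank lines or lines starting with whitespace separate blocks.

```python
def find_duplicate_labels(lines):
    """Return the set of labels that occur more than once within one block.

    Hypothetical helper, for illustration only. `lines` is an iterable of
    text lines, for example an open Clustal file.
    """
    duplicates = set()
    seen = set()  # labels seen so far in the current block
    for line in lines:
        if not line.strip() or line[0].isspace():
            # A blank line or a conservation line ends the current block.
            seen = set()
            continue
        if line.startswith("CLUSTAL"):
            # Assumed header line; it carries no label.
            continue
        label = line.split()[0]
        if label in seen:
            duplicates.add(label)
        seen.add(label)
    return duplicates
```

An empty result means no label was repeated inside a block, so the segments of different sequences should not be merged under one label.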

If the lines have trailing numbers (i.e. Clustal was run with -LINENOS=ON), they are silently deleted. The parser does not check that the numbers actually correspond to the number of characters in the sequence printed so far.
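The behaviour described in these notes can be illustrated with a small standalone sketch. This is not the scikit-bio implementation: merge_clustal_blocks is a hypothetical function that only mimics the documented behaviour (first-seen label order, per-label concatenation across blocks, trailing numbers dropped, lines starting with whitespace skipped), and it assumes that header lines start with "CLUSTAL".

```python
from collections import OrderedDict


def merge_clustal_blocks(lines):
    """Merge the blocks of a Clustal alignment into (label, seq) pairs.

    Hypothetical sketch of the behaviour described in the notes above;
    it is not the parser used by parse_clustal.
    """
    merged = OrderedDict()  # label -> list of segments, in first-seen order
    for line in lines:
        # Skip blank lines, lines starting with whitespace, and the header.
        if not line.strip() or line[0].isspace() or line.startswith("CLUSTAL"):
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # a label with no sequence; ignored in this sketch
        label, segment = fields[0], fields[1]
        # Any further field (such as a trailing number) is dropped unchecked.
        merged.setdefault(label, []).append(segment)
    return [(label, "".join(parts)) for label, parts in merged.items()]
```

Because segments are keyed by label, two sequences that share a label would be merged into a single entry here, which is the kind of interleaving the notes warn about.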

References

[R85] Thompson JD, Higgins DG, Gibson TJ, “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice”, Nucleic Acids Res. 1994 Nov 11;22(22):4673-80.

Examples

Assume we have a Clustal-formatted file with the following contents:

CLUSTAL W (1.82) multiple sequence alignment

abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA 60
def   -------------------------------
xyz   -------------------------------

abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC 11
def   ---------------CGUGCAUGCAU-CGAU 18
xyz   -----------CAUUCGUACGUACGCAUGAC 23

We can parse data in this format with the following code, which uses an in-memory file (StringIO) in place of a file on disk:

>>> from io import StringIO
>>> from skbio.parse.sequences import parse_clustal
>>> clustal_f = StringIO('abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA 60\n'
...                      'def   -------------------------------\n'
...                      'xyz   -------------------------------\n'
...                      '\n'
...                      'abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC 11\n'
...                      'def   ---------------CGUGCAUGCAU-CGAU 18\n'
...                      'xyz   -----------CAUUCGUACGUACGCAUGAC 23\n')
>>> for label, seq in parse_clustal(clustal_f):
...     print(label)
...     print(seq)
abc
GCAUGCAUCUGCAUACGUACGUACGCAUGCAGUCGAUACAUACGUACGUCGGUACGU-CGAC
def
----------------------------------------------CGUGCAUGCAU-CGAU
xyz
------------------------------------------CAUUCGUACGUACGCAUGAC