Clustal format (skbio.io.format.clustal)

Clustal format (clustal) stores multiple sequence alignments. This format was originally introduced in the Clustal package [R143].

Format Support

Has Sniffer: Yes

Reader  Writer  Object Class
Yes     Yes     skbio.alignment.Alignment

Format Specification

A clustal-formatted file is a plain text file. It can optionally begin with a header that states the clustal version number. The header is followed by the multiple sequence alignment and, optionally, by information about the degree of conservation at each position in the alignment [R144].

Alignment Section

Each sequence in the alignment is divided into subsequences of at most 60 characters each. The sequence identifier precedes every subsequence. Each subsequence can optionally be followed by the cumulative number of non-gap characters up to that point in the full sequence. A line containing conservation information about each position in the alignment can optionally follow all of the subsequences. Neither of these optional elements is included in the examples below.
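
As a rough illustration of the layout described above, the following pure-Python sketch splits each aligned sequence into blocks of at most 60 characters and prefixes every block line with the sequence identifier. The function name format_clustal_blocks and its parameters are hypothetical and are not part of scikit-bio; the header, the optional cumulative counts, and the conservation line are all omitted.

```python
def format_clustal_blocks(records, width=60):
    """Lay out aligned sequences in clustal-style blocks (illustrative only).

    records: list of (identifier, aligned_sequence) pairs of equal length.
    width: maximum number of alignment columns per block.
    """
    if not records:
        return ""
    length = len(records[0][1])
    if any(len(seq) != length for _, seq in records):
        raise ValueError("all aligned sequences must have the same length")

    # Pad identifiers to a common width so the subsequences line up.
    pad = max(len(seq_id) for seq_id, _ in records) + 2
    blocks = []
    for start in range(0, length, width):
        lines = [seq_id.ljust(pad) + seq[start:start + width]
                 for seq_id, seq in records]
        blocks.append("\n".join(lines))
    # Consecutive blocks are separated by a blank line.
    return "\n\n".join(blocks) + "\n"


# Hypothetical 70-column alignment: it needs one 60-column block
# followed by one 10-column block.
records = [("abc", "A" * 70), ("de", "-" * 10 + "C" * 60)]
text = format_clustal_blocks(records)
print(text)
```

Padding the identifiers to a shared width is a choice made for readability in this sketch; the format description above only requires that the identifier precede each subsequence.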

Note

scikit-bio does not support writing conservation information.

Note

scikit-bio will only write a clustal-formatted file if the alignment’s sequence characters are valid IUPAC characters, as defined in skbio.sequence. The specific lexicon used for validation depends on the type of sequences stored in the alignment.
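
The character check described in this note can be pictured with the sketch below. The IUPAC_DNA set, the GAP_CHARS set, and the invalid_characters helper are assumptions made for illustration (the IUPAC DNA nucleotide codes plus the gap characters "-" and "."); scikit-bio's own lexicons live in skbio.sequence and may differ in detail.

```python
# IUPAC DNA codes: the four bases plus the standard degenerate codes.
# These sets are illustrative and are not read from scikit-bio.
IUPAC_DNA = set("ACGT" "RYSWKMBDHVN")
GAP_CHARS = set("-.")


def invalid_characters(sequence, alphabet=IUPAC_DNA | GAP_CHARS):
    """Return the sorted characters of `sequence` that are not in `alphabet`."""
    return sorted(set(sequence) - alphabet)


ok = invalid_characters("ACCGTTGTA-GTAGCT")   # every character is allowed
bad = invalid_characters("ACGU-XZ")           # U, X and Z are not DNA codes
print(ok, bad)
```

A different sequence type (RNA or protein, for example) would be checked against a different alphabet, which is the dependence on sequence type that the note refers to.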

Examples

Assume we have a clustal-formatted file with the following contents:

CLUSTAL W (1.82) multiple sequence alignment

abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA
def   -------------------------------
xyz   -------------------------------

abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC
def   ---------------CGUGCAUGCAU-CGAU
xyz   -----------CAUUCGUACGUACGCAUGAC

We can use the following code to read a clustal file into an Alignment:

>>> from skbio import Alignment
>>> clustal_f = [u'CLUSTAL W (1.82) multiple sequence alignment\n',
...              u'\n',
...              u'abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA\n',
...              u'def   -------------------------------\n',
...              u'xyz   -------------------------------\n',
...              u'\n',
...              u'abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC\n',
...              u'def   ---------------CGUGCAUGCAU-CGAU\n',
...              u'xyz   -----------CAUUCGUACGUACGCAUGAC\n']
>>> Alignment.read(clustal_f, format="clustal")
<Alignment: n=3; mean +/- std length=62.00 +/- 0.00>
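
To make the block structure concrete, here is a minimal pure-Python sketch of how the interleaved lines could be stitched back into full-length sequences. It is not scikit-bio's reader: parse_clustal_lines is a hypothetical helper that assumes identifiers contain no whitespace, skips the optional header and any conservation lines, and ignores trailing cumulative counts.

```python
def parse_clustal_lines(lines):
    """Join interleaved clustal blocks into {identifier: full sequence}."""
    sequences = {}
    for line in lines:
        line = line.rstrip("\n")
        # Blank lines only separate blocks.
        if not line.strip():
            continue
        # The optional header is the first non-blank line and starts with CLUSTAL.
        if not sequences and line.startswith("CLUSTAL"):
            continue
        # Conservation lines carry no identifier, so they start with whitespace.
        if line[0].isspace():
            continue
        fields = line.split()
        if len(fields) < 2:
            raise ValueError("expected '<identifier> <subsequence>': %r" % line)
        # A third field, if present, is the cumulative count; it is ignored here.
        seq_id, chunk = fields[0], fields[1]
        sequences[seq_id] = sequences.get(seq_id, "") + chunk
    return sequences


clustal_lines = [
    "CLUSTAL W (1.82) multiple sequence alignment\n",
    "\n",
    "abc   GCAUGCAUCUGCAUACGUACGUACGCAUGCA\n",
    "def   -------------------------------\n",
    "xyz   -------------------------------\n",
    "\n",
    "abc   GUCGAUACAUACGUACGUCGGUACGU-CGAC\n",
    "def   ---------------CGUGCAUGCAU-CGAU\n",
    "xyz   -----------CAUUCGUACGUACGCAUGAC\n",
]
parsed = parse_clustal_lines(clustal_lines)
for seq_id, seq in parsed.items():
    print(seq_id, len(seq))
```

Each identifier appears once per block, so concatenating the two 31-character subsequences gives three sequences of length 62, in line with the mean length reported by Alignment.read above.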

We can use the following code to write an Alignment to a clustal-formatted file:

>>> from io import StringIO
>>> from skbio import DNA
>>> seqs = [DNA('ACCGTTGTA-GTAGCT', metadata={'id': 'seq1'}),
...         DNA('A--GTCGAA-GTACCT', metadata={'id': 'sequence-2'}),
...         DNA('AGAGTTGAAGGTATCT', metadata={'id': '3'})]
>>> aln = Alignment(seqs)
>>> fh = StringIO()
>>> _ = aln.write(fh, format='clustal')
>>> print(fh.getvalue()) 
CLUSTAL


seq1        ACCGTTGTA-GTAGCT
sequence-2  A--GTCGAA-GTACCT
3           AGAGTTGAAGGTATCT