PHYLIP multiple sequence alignment format (skbio.io.format.phylip)

The PHYLIP file format stores a multiple sequence alignment. The format was originally defined and used in Joe Felsenstein’s PHYLIP package [R176], and has since been supported by several other bioinformatics tools (e.g., RAxML [R177]). See [R178] for the original format description, and [R179] and [R180] for additional descriptions.

An example PHYLIP-formatted file taken from [R178]:

      5    42
Turkey    AAGCTNGGGC ATTTCAGGGT GAGCCCGGGC AATACAGGGT AT
Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT
H. SapiensACCGGTTGGC CGTTCAGGGT ACAGGTTGGC CGTTCAGGGT AA
Chimp     AAACCCTTGC CGTTACGCTT AAACCGAGGC CGGGACACTC AT
Gorilla   AAACCCTTGC CGGTACGCTT AAACCATTGC CGGTACGCTT AA

Note

Original copyright notice for the above PHYLIP file:

(c) Copyright 1986-2008 by The University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

Format Support

Has Sniffer: Yes

Reader  Writer  Object Class
Yes     Yes     skbio.alignment.TabularMSA

Format Specification

PHYLIP format is a plain text format containing exactly two sections: a header describing the dimensions of the alignment, followed by the multiple sequence alignment itself.

The format described here is “strict” PHYLIP, as described in [R179]. Strict PHYLIP requires that each sequence identifier is exactly 10 characters long (padded with spaces as necessary). Other bioinformatics tools (e.g., RAxML) may relax this rule to allow for longer sequence identifiers. See the Alignment Section below for more details.

The format described here is “sequential” format. The original PHYLIP format specification [R178] describes both sequential and interleaved formats.

Note

scikit-bio currently supports reading and writing strict, sequential PHYLIP-formatted files. Relaxed and/or interleaved PHYLIP formats are not supported.

Header Section

The header is a single line describing the dimensions of the alignment, and it must be the first line in the file. It consists of optional leading spaces, followed by two positive integers (n and m) separated by one or more spaces. The first integer (n) specifies the number of sequences (i.e., the number of rows) in the alignment. The second integer (m) specifies the length of the sequences (i.e., the number of columns) in the alignment. The smallest supported alignment dimensions are 1x1.
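
The header rules above can be sketched in a few lines of plain Python. This is only an illustration and not scikit-bio's reader: parse_phylip_header is a hypothetical helper name, and relying on str.split and int means it is slightly more permissive than the format (for example, it also accepts tabs between the two integers).

```python
def parse_phylip_header(line):
    """Return (n, m) from a PHYLIP header line (hypothetical helper)."""
    # str.split() with no arguments ignores leading whitespace and treats
    # any run of whitespace between the two integers as one separator.
    fields = line.split()
    if len(fields) != 2:
        raise ValueError("header must contain exactly two integers")
    # int() raises ValueError if a field is not an integer.
    n, m = (int(field) for field in fields)
    if n < 1 or m < 1:
        raise ValueError("alignment dimensions must be at least 1x1")
    return n, m


print(parse_phylip_header("      5    42"))  # (5, 42)
```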

Note

scikit-bio will write the PHYLIP format header without preceding spaces, and with only a single space between n and m.

PHYLIP format does not support blank lines between the header and the alignment.

Alignment Section

The alignment section immediately follows the header. It consists of n lines (rows), one for each sequence in the alignment. Each row consists of a sequence identifier (ID) and characters in the sequence, in fixed width format.

The sequence ID can be up to 10 characters long. IDs shorter than 10 characters must be padded with trailing spaces to reach the 10-character fixed width. Within an ID, all characters except newlines are supported, including spaces, underscores, and numbers.

Note

When reading a PHYLIP-formatted file into an skbio.alignment.TabularMSA object, sequence identifiers/labels are stored as TabularMSA index labels (index property).

When writing an skbio.alignment.TabularMSA object as a PHYLIP-formatted file, TabularMSA index labels will be converted to strings and written as sequence identifiers/labels.

scikit-bio supports the empty string ('') as a valid sequence ID. An empty ID will be padded with 10 spaces when writing.

Sequence characters immediately follow the sequence ID. They must start at the 11th character in the line, as the first 10 characters are reserved for the sequence ID. While PHYLIP format does not explicitly restrict the set of supported characters that may be used to represent a sequence, the original format description [R178] specifies the IUPAC nucleic acid lexicon for DNA or RNA sequences, and the IUPAC protein lexicon for protein sequences. The original PHYLIP specification uses the hyphen (-) as a gap character, though older versions also supported the period (.). The sequence characters may contain optional spaces (e.g., to improve readability), and both upper and lower case characters are supported.
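
As a rough illustration of the fixed-width layout described above, the following sketch splits one alignment row into its ID and its sequence characters. The helper name split_phylip_row is hypothetical, and stripping the trailing padding from the ID is an assumption of this sketch, not a statement about how scikit-bio handles it.

```python
ID_WIDTH = 10  # strict PHYLIP reserves the first 10 characters for the ID


def split_phylip_row(line):
    """Split one alignment row into (sequence ID, sequence characters)."""
    line = line.rstrip("\n")
    # Assumption: trailing spaces in the ID field are padding and are dropped.
    seq_id = line[:ID_WIDTH].rstrip(" ")
    # Spaces inside the sequence are only there for readability.
    chars = line[ID_WIDTH:].replace(" ", "")
    return seq_id, chars


row = "Salmo gairAAGCCTTGGC AGTGCAGGGT GAGCCGTGGC CGGGCACGGT AT"
seq_id, chars = split_phylip_row(row)
print(seq_id, len(chars))  # Salmo gair 42
```

Note that the space inside "Salmo gair" is kept, since only padding at the end of the ID field is removed.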

Note

scikit-bio will read/write a PHYLIP-formatted file as long as the alignment’s sequence characters are valid for the type of in-memory sequence object being read into or written from. This differs from the PHYLIP specification, which states that a PHYLIP-formatted file can only contain valid IUPAC characters. See the constructor format parameter below for details.

Since scikit-bio supports both - and . as gap characters (e.g., in DNA, RNA, and Protein sequence objects), both are supported when reading/writing a PHYLIP-formatted file.

When writing a PHYLIP-formatted file, scikit-bio will split up each sequence into chunks that are 10 characters long. Each chunk will be separated by a single space. The sequence will always appear on a single line (sequential format). It will not be wrapped across multiple lines. Sequences are chunked in this manner for improved readability, and because most example PHYLIP files are chunked in a similar way (e.g., see the example file above). Note that this chunking is not required when reading PHYLIP-formatted files, nor by the PHYLIP format specification itself.
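
The chunking behavior can be approximated with a short, self-contained sketch. The function name format_phylip_row and its chunk_size parameter are hypothetical and exist only for illustration; this is not scikit-bio's writer.

```python
ID_WIDTH = 10  # fixed width of the ID column in strict PHYLIP


def format_phylip_row(seq_id, sequence, chunk_size=10):
    """Format one row: padded ID, then space-separated sequence chunks."""
    if len(seq_id) > ID_WIDTH:
        raise ValueError("sequence ID exceeds 10 characters")
    # Slice the sequence into consecutive chunks of chunk_size characters;
    # the final chunk may be shorter.
    chunks = [sequence[i:i + chunk_size]
              for i in range(0, len(sequence), chunk_size)]
    # Pad the ID with trailing spaces so the sequence starts at column 11.
    return seq_id.ljust(ID_WIDTH) + " ".join(chunks)


print(format_phylip_row("seq1", "ACCGTTGTA-GTAGCT"))
# seq1      ACCGTTGTA- GTAGCT
```

An ID that is already 10 characters long, such as "sequence-2", receives no padding, so the first sequence character follows it directly.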

Format Parameters

The only supported format parameter is constructor, which specifies the type of in-memory sequence object to read each aligned sequence into. This must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein) and is a required format parameter. For example, if you know that the PHYLIP file you’re reading contains DNA sequences, you would pass constructor=DNA to the reader call.

Examples

Let’s create a TabularMSA with three DNA sequences:

>>> from skbio import TabularMSA, DNA
>>> seqs = [DNA('ACCGTTGTA-GTAGCT', metadata={'id':'seq1'}),
...         DNA('A--GTCGAA-GTACCT', metadata={'id':'sequence-2'}),
...         DNA('AGAGTTGAAGGTATCT', metadata={'id':'3'})]
>>> msa = TabularMSA(seqs, minter='id')
>>> msa
TabularMSA[DNA]
----------------------
Stats:
    sequence count: 3
    position count: 16
----------------------
ACCGTTGTA-GTAGCT
A--GTCGAA-GTACCT
AGAGTTGAAGGTATCT
>>> msa.index
Index(['seq1', 'sequence-2', '3'], dtype='object')

Now let’s write the TabularMSA to a file-like object in PHYLIP format and take a look at the output:

>>> from io import StringIO
>>> fh = StringIO()
>>> print(msa.write(fh, format='phylip').getvalue())
3 16
seq1      ACCGTTGTA- GTAGCT
sequence-2A--GTCGAA- GTACCT
3         AGAGTTGAAG GTATCT

>>> fh.close()

Notice that the 16-character sequences were split into two chunks, and that each sequence appears on a single line (sequential format). Also note that each sequence ID is padded with spaces to 10 characters in order to produce a fixed width column.

If any index label in a TabularMSA exceeds the 10-character limit, an error will be raised when writing:

>>> msa.index = ['seq1', 'long-sequence-2', 'seq3']
>>> fh = StringIO()
>>> msa.write(fh, format='phylip')
Traceback (most recent call last):
    ...
skbio.io._exception.PhylipFormatError: ``TabularMSA`` can only be written in PHYLIP format if all sequence index labels have 10 or fewer characters. Found sequence with index label 'long-sequence-2' that exceeds this limit. Use ``TabularMSA.reassign_index`` to assign shorter index labels.
>>> fh.close()
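
To find offending labels before attempting a write, a small check along these lines may help. It is a sketch in plain Python; labels_too_long is a hypothetical helper and not part of scikit-bio. Labels are converted with str first because index labels are written as strings.

```python
def labels_too_long(labels, limit=10):
    """Return the labels whose string form exceeds `limit` characters."""
    return [str(label) for label in labels if len(str(label)) > limit]


# Any iterable of labels works here, for example the labels assigned above.
print(labels_too_long(['seq1', 'long-sequence-2', 'seq3']))
# ['long-sequence-2']
```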

One way to work around this is to assign shorter index labels. The recommended way to do this is via TabularMSA.reassign_index. For example, to reassign default integer index labels:

>>> msa.reassign_index()
>>> msa.index
RangeIndex(start=0, stop=3, step=1)

We can now write the TabularMSA in PHYLIP format:

>>> fh = StringIO()
>>> print(msa.write(fh, format='phylip').getvalue())
3 16
0         ACCGTTGTA- GTAGCT
1         A--GTCGAA- GTACCT
2         AGAGTTGAAG GTATCT

>>> fh.close()