FASTA/QUAL format (skbio.io.fasta)

The FASTA file format (fasta) stores biological (i.e., nucleotide or protein) sequences in a simple plain text format that is both human-readable and easy to parse. The file format was first introduced and used in the FASTA software package [R153]. Additional descriptions of the file format can be found in [R154] and [R155].

An example of a FASTA-formatted file containing two DNA sequences:

>seq1 db-accession-149855
CGATGTCGATCGATCGATCGATCAG
>seq2 db-accession-34989
CATCGATCGATCGATGCATGCATGCATG

The QUAL file format is an additional format related to FASTA. A FASTA file is sometimes accompanied by a QUAL file, particuarly when the fasta file contains sequences generated on a high-throughput sequencing instrument. QUAL files store a Phred quality score (nonnegative integer) for each base in a sequence stored in FASTA format (see [R156] for more details). scikit-bio supports reading and writing FASTA (and optionally QUAL) file formats.

Format Support

Has Sniffer: Yes

Reader Writer Object Class
Yes Yes generator of skbio.sequence.BiologicalSequence objects
Yes Yes skbio.alignment.SequenceCollection
Yes Yes skbio.alignment.Alignment
Yes Yes skbio.sequence.BiologicalSequence
Yes Yes skbio.sequence.NucleotideSequence
Yes Yes skbio.sequence.DNASequence
Yes Yes skbio.sequence.RNASequence
Yes Yes skbio.sequence.ProteinSequence

Note

All readers and writers support an optional QUAL file via the qual parameter. If one is provided, quality scores will be read/written in addition to FASTA sequence data.

Format Specification

The following sections define the FASTA and QUAL file formats in detail.

FASTA Format

A FASTA file contains one or more biological sequences. The sequences are stored sequentially, with a record for each sequence (also referred to as a FASTA record). Each record consists of a single-line header (sometimes referred to as a defline, label, description, or comment) followed by the sequence data, optionally split over multiple lines. Blank or whitespace-only lines are not allowed anywhere in the FASTA file.

Note

scikit-bio does not currently support legacy FASTA format (i.e., headers/comments denoted with a semicolon). The format supported by scikit-bio (described below in detail) most closely resembles the description given in NCBI’s BLAST documentation [R155]. See [R154] for more details on legacy FASTA format. If you would like legacy FASTA format support added to scikit-bio, please consider submitting a feature request on the scikit-bio issue tracker (pull requests are also welcome!).

Sequence Header

Each sequence header consists of a single line beginning with a greater-than (>) symbol. Immediately following this is a sequence identifier (ID) and description separated by one or more whitespace characters. Both sequence ID and description are optional and are represented as the empty string ('') in scikit-bio’s objects if they are not present in the header.

A sequence ID consists of a single word: all characters after the greater- than symbol and before the first whitespace character (if any) are taken as the sequence ID. Unique sequence IDs are not strictly enforced by the FASTA format itself. A single standardized ID format is similarly not enforced by FASTA format, though it is often common to use a unique library accession number for a sequence ID (e.g., NCBI’s FASTA defline format [R157]).

Note

scikit-bio will enforce sequence ID uniqueness depending on the type of object that the FASTA file is read into. For example, reading a FASTA file as a generator of BiologicalSequence objects will not enforce unique IDs since it simply yields each sequence it finds in the FASTA file. However, if the FASTA file is read into a SequenceCollection object, ID uniqueness will be enforced because that is a requirement of a SequenceCollection.

If a description is present, it is taken as the remaining characters that follow the sequence ID and initial whitespace(s). The description is considered additional information about the sequence (e.g., comments about the source of the sequence or the molecule that it encodes).

For example, consider the following header:

>seq1 db-accession-149855

seq1 is the sequence ID and db-accession-149855 is the sequence description.

Note

scikit-bio’s readers will remove all leading and trailing whitespace from the description. If a header line begins with whitespace following the >, the ID is assumed to be missing and the remainder of the line is taken as the description.

Sequence Data

Biological sequence data follows the header, and can be split over multiple lines. The sequence data (i.e., nucleotides or amino acids) are stored using the standard IUPAC lexicon (single-letter codes).

Note

scikit-bio supports both upper and lower case characters. Both - and . are supported as gap characters. See skbio.sequence for more details on how scikit-bio interprets sequence data in its in-memory objects.

scikit-bio will remove leading and trailing whitespace from each line of sequence data before joining the sequence chunks into a single sequence. Whitespace characters are not removed from the middle of the sequence chunks. Likewise, other invalid IUPAC characters are not removed from the sequence data as it is read. Thus, it is possible to create an invalid in-memory sequence object (see warning below).

Warning

In an effort to maintain reasonable performance while reading FASTA files (which can be quite large), validation of sequence data is not performed during reading. It is the responsibility of the user to validate their in-memory representation of the data if desired (e.g., by calling is_valid on the returned object). Thus, it is possible to read invalid characters into objects (e.g. whitespace occurring in the middle of a sequence, or invalid IUPAC DNA characters in a DNA sequence).

QUAL Format

A QUAL file contains quality scores for one or more biological sequences stored in a corresponding FASTA file. QUAL format is very similar to FASTA format: it stores records sequentially, with each record beginning with a header line containing a sequence ID and description. The same rules apply to QUAL headers as FASTA headers (see the above sections for details). scikit-bio processes FASTA and QUAL headers in exactly the same way.

Instead of storing biological sequence data in each record, a QUAL file stores a Phred quality score for each base in the corresponding sequence. Quality scores are represented as nonnegative integers separated by whitespace (typically a single space or newline), and can span multiple lines.

Note

When reading FASTA and QUAL files, scikit-bio requires records to be in the same order in both files (i.e., each FASTA and QUAL record must have the same ID and description after being parsed). In addition to having the same order, the number of FASTA records must match the number of QUAL records (i.e., missing or additonal records are not allowed). scikit-bio also requires that the number of quality scores match the number of bases in the corresponding sequence.

When writing FASTA and QUAL files, scikit-bio will maintain the same ordering of records in both files (i.e., using the same ID and description in both records) to support future reading.

Format Parameters

The following parameters are available to change how FASTA/QUAL files are read or written in scikit-bio.

QUAL File Parameter (Readers and Writers)

The qual parameter is available to all FASTA format readers and writers. It can be any file-like type supported by scikit-bio’s I/O registry (e.g., file handle, file path, etc.). If qual is provided when reading, quality scores will be included in each in-memory BiologicalSequence object, in addition to sequence data stored in the FASTA file. When writing, quality scores will be written in QUAL format in addition to the sequence data being written in FASTA format.

Reader-specific Parameters

The available reader parameters differ depending on which reader is used.

Generator, SequenceCollection, and Alignment Reader Parameters

The constructor parameter can be used with the BiologicalSequence generator, SequenceCollection, and Alignment FASTA readers. constructor specifies the in-memory type of each sequence that is parsed, and defaults to BiologicalSequence. constructor should be a subclass of BiologicalSequence. For example, if you know that the FASTA file you’re reading contains protein sequences, you would pass constructor=ProteinSequence to the reader call.

Note

The FASTA sniffer will not attempt to guess the constructor parameter, so it will always default to BiologicalSequence if another type is not provided to the reader.

BiologicalSequence Reader Parameters

The seq_num parameter can be used with the BiologicalSequence, NucleotideSequence, DNASequence, RNASequence, and ProteinSequence FASTA readers. seq_num specifies which sequence to read from the FASTA file (and optional QUAL file), and defaults to 1 (i.e., such that the first sequence is read). For example, to read the 50th sequence from a FASTA file, you would pass seq_num=50 to the reader call.

Writer-specific Parameters

The following parameters are available to all FASTA format writers:

  • id_whitespace_replacement: string to replace each whitespace character in a sequence ID. This parameter is useful for cases where an in-memory sequence ID contains whitespace, which would result in an on-disk representation that would not be read back into memory as the same ID (since IDs in FASTA format cannot contain whitespace). Defaults to _. If None, no whitespace replacement is performed and IDs are written as they are stored in memory (this has the potential to create an invalid FASTA-formatted file; see note below). This parameter also applies to a QUAL file if one is provided.
  • description_newline_replacement: string to replace each newline character in a sequence description. Since a FASTA header must be a single line, newlines are not allowed in sequence descriptions and must be replaced in order to write a valid FASTA file. Defaults to a single space. If None, no newline replacement is performed and descriptions are written as they are stored in memory (this has the potential to create an invalid FASTA-formatted file; see note below). This parameter also applies to a QUAL file if one is provided.
  • max_width: integer specifying the maximum line width (i.e., number of characters) for sequence data and/or quality scores. If a sequence or its quality scores are longer than max_width, it will be split across multiple lines, each with a maximum width of max_width. Note that there are some caveats when splitting quality scores. A single quality score will never be split across multiple lines, otherwise it would become two different quality scores when read again. Thus, splitting only occurs between quality scores. This makes it possible to have a single long quality score written on its own line that exceeds max_width. For example, the quality score 12345 would not be split across multiple lines even if max_width=3. Thus, a 5-character line would be written. Default behavior is to not split sequence data or quality scores across multiple lines.

Note

The FASTA format writers will have noticeably better runtime performance if id_whitespace_replacement and/or description_newline_replacement are set to None so that whitespace replacement is not performed during writing. However, this can potentially create invalid FASTA files, especially if there are newline characters in the IDs or descriptions. For IDs with whitespace, this can also affect how the IDs are read into memory in a subsequent read operation. For example, if an in-memory sequence ID is 'seq 1' and id_whitespace_replacement=None, reading the FASTA file back into memory would result in an ID of 'seq', and '1' would be part of the sequence description.

Examples

Reading and Writing FASTA Files

Suppose we have the following FASTA file with five equal-length sequences (example modified from [R158]):

>seq1 Turkey
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>seq2 Salmo gair
AAGCCTTGGCAGTGCAGGGTGAGCCGTGG
CCGGGCACGGTAT
>seq3 H. Sapiens
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>seq4 Chimp
AAACCCTTGCCG
TTACGCTTAAAC
CGAGGCCGGGAC
ACTCAT
>seq5 Gorilla
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

Note

Original copyright notice for the above example file:

(c) Copyright 1986-2008 by The University of Washington. Written by Joseph Felsenstein. Permission is granted to copy this document provided that no fee is charged for it and that this copyright notice is not removed.

Note that the sequences are not required to be of equal length in order for the file to be a valid FASTA file (this depends on the object that you’re reading the file into). Also note that some of the sequences occur on a single line, while others are split across multiple lines.

Let’s define this file in-memory as a StringIO, though this could be a real file path, file handle, or anything that’s supported by scikit-bio’s I/O registry in practice:

>>> from StringIO import StringIO
>>> fs = (
...     ">seq1 Turkey\n"
...     "AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT\n"
...     ">seq2 Salmo gair\n"
...     "AAGCCTTGGCAGTGCAGGGTGAGCCGTGG\n"
...     "CCGGGCACGGTAT\n"
...     ">seq3 H. Sapiens\n"
...     "ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA\n"
...     ">seq4 Chimp\n"
...     "AAACCCTTGCCG\n"
...     "TTACGCTTAAAC\n"
...     "CGAGGCCGGGAC\n"
...     "ACTCAT\n"
...     ">seq5 Gorilla\n"
...     "AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA\n")
>>> fh = StringIO(fs)

Let’s read the FASTA file into a SequenceCollection:

>>> from skbio import SequenceCollection
>>> sc = SequenceCollection.read(fh)
>>> sc.sequence_lengths()
[42, 42, 42, 42, 42]
>>> sc.ids()
['seq1', 'seq2', 'seq3', 'seq4', 'seq5']

We see that all 5 sequences have 42 characters, and that each of the sequence IDs were successfully read into memory.

Since these sequences are of equal length (presumably because they’ve been aligned), let’s load the FASTA file into an Alignment object, which is a more appropriate data structure:

>>> from skbio import Alignment
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> aln = Alignment.read(fh)
>>> aln.sequence_length()
42

Note that we were able to read the FASTA file into two different data structures (SequenceCollection and Alignment) using the exact same read method call (and underlying reading/parsing logic). Also note that we didn’t specify a file format in the read call. The FASTA sniffer detected the correct file format for us!

Let’s inspect the type of sequences stored in the Alignment:

>>> aln[0]
<BiologicalSequence: AAGCTNGGGC... (length: 42)>

By default, sequences are loaded as BiologicalSequence objects. We can change the type of sequence via the constructor parameter:

>>> from skbio import DNASequence
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> aln = Alignment.read(fh, constructor=DNASequence)
>>> aln[0]
<DNASequence: AAGCTNGGGC... (length: 42)>

We now have an Alignment of DNASequence objects instead of BiologicalSequence objects. Validation of sequence character data is not performed during reading (see warning above for details). To verify that each of the sequences are valid DNA sequences:

>>> aln.is_valid()
True

To write the alignment in FASTA format:

>>> new_fh = StringIO()
>>> aln.write(new_fh)
>>> print(new_fh.getvalue())
>seq1 Turkey
AAGCTNGGGCATTTCAGGGTGAGCCCGGGCAATACAGGGTAT
>seq2 Salmo gair
AAGCCTTGGCAGTGCAGGGTGAGCCGTGGCCGGGCACGGTAT
>seq3 H. Sapiens
ACCGGTTGGCCGTTCAGGGTACAGGTTGGCCGTTCAGGGTAA
>seq4 Chimp
AAACCCTTGCCGTTACGCTTAAACCGAGGCCGGGACACTCAT
>seq5 Gorilla
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

>>> new_fh.close()

Both SequenceCollection and Alignment load all of the sequences from the FASTA file into memory at once. If the FASTA file is large (which is often the case), this may be infeasible if you don’t have enough memory. To work around this issue, you can stream the sequences using scikit-bio’s generator-based FASTA reader and writer. The generator-based reader yields BiologicalSequence objects (or subclasses if constructor is supplied) one at a time, instead of loading all sequences into memory. For example, let’s use the generator-based reader to process a single sequence at a time in a for loop:

>>> import skbio.io
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> for seq in skbio.io.read(fh, format='fasta'):
...     seq
<BiologicalSequence: AAGCTNGGGC... (length: 42)>
<BiologicalSequence: AAGCCTTGGC... (length: 42)>
<BiologicalSequence: ACCGGTTGGC... (length: 42)>
<BiologicalSequence: AAACCCTTGC... (length: 42)>
<BiologicalSequence: AAACCCTTGC... (length: 42)>

A single sequence can also be read into a BiologicalSequence (or subclass):

>>> from skbio import BiologicalSequence
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> BiologicalSequence.read(fh)
<BiologicalSequence: AAGCTNGGGC... (length: 42)>

By default, the first sequence in the FASTA file is read. This can be controlled with seq_num. For example, to read the fifth sequence:

>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> BiologicalSequence.read(fh, seq_num=5)
<BiologicalSequence: AAACCCTTGC... (length: 42)>

We can use the same API to read the fifth sequence into a DNASequence:

>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> dna_seq = DNASequence.read(fh, seq_num=5)
>>> dna_seq
<DNASequence: AAACCCTTGC... (length: 42)>

Individual sequence objects can also be written in FASTA format:

>>> new_fh = StringIO()
>>> dna_seq.write(new_fh)
>>> print(new_fh.getvalue())
>seq5 Gorilla
AAACCCTTGCCGGTACGCTTAAACCATTGCCGGTACGCTTAA

>>> new_fh.close()

Reading and Writing FASTA/QUAL Files

In addition to reading and writing standalone FASTA files, scikit-bio also supports reading and writing FASTA and QUAL files together. Suppose we have the following FASTA file:

>seq1 db-accession-149855
CGATGTC
>seq2 db-accession-34989
CATCG

Also suppose we have the following QUAL file:

>seq1 db-accession-149855
40 39 39 4
50 1 100
>seq2 db-accession-34989
3 3 10 42 80
>>> fasta_fs = (
...     ">seq1 db-accession-149855\n"
...     "CGATGTC\n"
...     ">seq2 db-accession-34989\n"
...     "CATCG\n")
>>> fasta_fh = StringIO(fasta_fs)
>>> qual_fs = (
...     ">seq1 db-accession-149855\n"
...     "40 39 39 4\n"
...     "50 1 100\n"
...     ">seq2 db-accession-34989\n"
...     "3 3 10 42 80\n")
>>> qual_fh = StringIO(qual_fs)

To read in a single BiologicalSequence at a time, we can use the generator-based reader as we did above, providing both FASTA and QUAL files:

>>> for seq in skbio.io.read(fasta_fh, qual=qual_fh, format='fasta'):
...     seq
...     seq.quality
<BiologicalSequence: CGATGTC (length: 7)>
array([ 40,  39,  39,   4,  50,   1, 100])
<BiologicalSequence: CATCG (length: 5)>
array([ 3,  3, 10, 42, 80])

Note that the sequence objects have quality scores since we provided a QUAL file. The other FASTA readers operate in a similar manner.

Now let’s load the sequences and their quality scores into a SequenceCollection:

>>> fasta_fh = StringIO(fasta_fs) # reload to read from the beginning again
>>> qual_fh = StringIO(qual_fs) # reload to read from the beginning again
>>> sc = SequenceCollection.read(fasta_fh, qual=qual_fh)
>>> sc
<SequenceCollection: n=2; mean +/- std length=6.00 +/- 1.00>

To write the sequence data and quality scores in the SequenceCollection to FASTA and QUAL files, respectively, we run:

>>> new_fasta_fh = StringIO()
>>> new_qual_fh = StringIO()
>>> sc.write(new_fasta_fh, qual=new_qual_fh)
>>> print(new_fasta_fh.getvalue())
>seq1 db-accession-149855
CGATGTC
>seq2 db-accession-34989
CATCG

>>> print(new_qual_fh.getvalue())
>seq1 db-accession-149855
40 39 39 4 50 1 100
>seq2 db-accession-34989
3 3 10 42 80

>>> new_fasta_fh.close()
>>> new_qual_fh.close()

References

[R153]Lipman, DJ; Pearson, WR (1985). “Rapid and sensitive protein similarity searches”. Science 227 (4693): 1435-41.
[R154](1, 2) http://en.wikipedia.org/wiki/FASTA_format
[R155](1, 2) http://blast.ncbi.nlm.nih.gov/blastcgihelp.shtml
[R156]https://www.broadinstitute.org/crd/wiki/index.php/Qual
[R157]Madden T. The BLAST Sequence Analysis Tool. 2002 Oct 9 [Updated 2003 Aug 13]. In: McEntyre J, Ostell J, editors. The NCBI Handbook [Internet]. Bethesda (MD): National Center for Biotechnology Information (US); 2002-. Chapter 16. Available from: http://www.ncbi.nlm.nih.gov/books/NBK21097/
[R158]http://evolution.genetics.washington.edu/phylip/doc/sequence.html