# FASTQ format (skbio.io.fastq)

The FASTQ file format (fastq) stores biological (e.g., nucleotide) sequences and their quality scores in a simple plain text format that is both human-readable and easy to parse. The file format was invented by Jim Mullikin at the Wellcome Trust Sanger Institute but wasn’t given a formal definition, though it has informally become a standard file format for storing high-throughput sequence data. More information about the format and its variants can be found in [R160] and [R161].

Conceptually, a FASTQ file is similar to paired FASTA and QUAL files in that it stores both biological sequences and their quality scores. FASTQ differs from FASTA/QUAL because the quality scores are stored in the same file as the biological sequence data.

An example FASTQ-formatted file containing two DNA sequences and their quality scores:

@seq1 description 1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
+
````Y^T]`]c^cabcacc`^Lb^ccYT\T\Y\WF
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^```ac^\\`bT``c`\aT``bbb


## Format Support

Has Sniffer: Yes

| Reader | Writer | Object Class |
|--------|--------|--------------|
| Yes | Yes | generator of skbio.sequence.BiologicalSequence objects |
| Yes | Yes | skbio.alignment.SequenceCollection |
| Yes | Yes | skbio.alignment.Alignment |
| Yes | Yes | skbio.sequence.BiologicalSequence |
| Yes | Yes | skbio.sequence.NucleotideSequence |
| Yes | Yes | skbio.sequence.DNASequence |
| Yes | Yes | skbio.sequence.RNASequence |
| Yes | Yes | skbio.sequence.ProteinSequence |

## Format Specification

A FASTQ file contains one or more biological sequences and their corresponding quality scores stored sequentially as records. Each record consists of four sections:

1. Sequence header line consisting of a sequence identifier (ID) and description (both optional)
2. Biological sequence data (typically stored using the standard IUPAC lexicon), optionally split over multiple lines
3. Quality header line separating sequence data from quality scores (optionally repeating the ID and description from the sequence header line)
4. Quality scores as printable ASCII characters, optionally split over multiple lines. Decoding of quality scores will depend on the specified FASTQ variant (see below for more details)
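The four-section layout can be sketched as a small pure-Python parser. This is an illustration only, not scikit-bio's reader: the name `iter_fastq_records` is hypothetical, and the sketch assumes non-empty sequences and performs no alphabet or quality-variant validation. Because `@` and `+` are themselves valid quality characters, it reads quality lines until their combined length reaches the sequence length instead of searching for the next header line.

```python
def iter_fastq_records(lines):
    """Yield (header, sequence, quality) for each record in an iterable of lines."""
    stream = (line.rstrip("\r\n") for line in lines)
    for header in stream:
        # Section 1: sequence header line, starting with '@'.
        if not header.startswith("@"):
            raise ValueError("expected a line starting with '@', got %r" % (header,))

        # Section 2: sequence data, possibly split over several lines,
        # terminated by section 3, the quality header line starting with '+'.
        chunks = []
        for line in stream:
            if line.startswith("+"):
                break
            chunks.append(line)
        else:
            raise ValueError("record %r has no quality header line" % (header,))
        sequence = "".join(chunks)

        # Section 4: quality characters, one per sequence character.
        quality = ""
        while len(quality) < len(sequence):
            line = next(stream, None)
            if line is None:
                raise ValueError("record %r is truncated" % (header,))
            quality += line
        if len(quality) != len(sequence):
            raise ValueError("record %r: sequence and quality lengths differ" % (header,))

        yield header[1:], sequence, quality


# Hypothetical two-record input; the first record is split over multiple lines.
example = [
    "@seq1 description 1",
    "AACACC",
    "AAACTT",
    "+seq1 description 1",
    "IIIIII",
    "@+@+@+",
    "@seq2",
    "TATG",
    "+",
    "5555",
]
for header, sequence, quality in iter_fastq_records(example):
    print(header, sequence, quality)
```

Note that the second quality line of the first record starts with `@`, which a naive "next line starting with `@` is a header" parser would misread.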

For the complete FASTQ format specification, see [R160]. scikit-bio’s FASTQ implementation follows the format specification described in this excellent publication, including validating the implementation against the FASTQ examples provided in the publication’s supplementary data.

Note

IDs and descriptions will be parsed from sequence header lines in exactly the same way as FASTA headers (skbio.io.fasta).

Whitespace is not allowed in sequence data or quality scores. Leading and trailing whitespace is not stripped from either, so any whitespace found there causes an error to be raised.

scikit-bio will write FASTQ files in a normalized format, with each record section on a single line. Thus, each record will be composed of exactly four lines. The quality header line won’t have the sequence ID and description repeated.
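The normalized layout described above can be sketched in a few lines of plain Python. This is a hedged illustration, not scikit-bio's writer: the function name `format_fastq_record` and its error messages are hypothetical.

```python
def format_fastq_record(seq_id, description, sequence, quality):
    """Return one FASTQ record as exactly four newline-terminated lines."""
    for name, value in (("sequence", sequence), ("quality", quality)):
        if any(ch.isspace() for ch in value):
            raise ValueError("whitespace is not allowed in %s data" % name)
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality scores differ in length")

    header = "@" + seq_id
    if description:
        header += " " + description
    # The quality header is a bare '+': the ID and description are not repeated.
    return "\n".join([header, sequence, "+", quality]) + "\n"


print(format_fastq_record("seq1", "description 1", "AACACCAAACTT", "IIIIII555555"), end="")
```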

### Quality Score Variants

FASTQ associates quality scores with sequence data, with each quality score encoded as a single printable ASCII character. In scikit-bio, all quality scores are decoded as Phred quality scores. This is the most common quality score metric, though there are others (e.g., Solexa quality scores). Unfortunately, different sequencers have different ways of encoding quality scores as ASCII characters, notably Sanger and Illumina. Below is a table highlighting the different encoding variants supported by scikit-bio, as well as listing the equivalent variant names used in the Open Bioinformatics Foundation (OBF) [R162] projects (e.g., Biopython, BioPerl, etc.).

| Variant | ASCII Range | Offset | Quality Range | Notes |
|---------|-------------|--------|---------------|-------|
| sanger | 33 to 126 | 33 | 0 to 93 | Equivalent to OBF’s fastq-sanger. |
| illumina1.3 | 64 to 126 | 64 | 0 to 62 | Equivalent to OBF’s fastq-illumina. Use this if your data was generated using Illumina 1.3-1.7 software. |
| illumina1.8 | 33 to 95 | 33 | 0 to 62 | Equivalent to sanger but with 0 to 62 quality score range check. Use this if your data was generated using Illumina 1.8 software or later. |
| solexa | 59 to 126 | 64 | -5 to 62 | Not currently implemented. |

Note

When writing, Phred quality scores will be truncated to the maximum value in the variant’s range and a warning will be issued. This is consistent with the OBF projects.

When reading, an error will be raised if a decoded quality score is outside the variant’s range.
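The offset arithmetic and the two range rules above can be sketched as follows. This is a minimal illustration under stated assumptions, not scikit-bio's implementation: `VARIANTS`, `decode_quality`, and `encode_quality` are hypothetical names, the solexa variant is omitted because it is not implemented, and raising an error for negative scores on write is an assumption the text does not cover.

```python
import warnings

# (ASCII offset, minimum score, maximum score), copied from the variant table.
VARIANTS = {
    "sanger": (33, 0, 93),
    "illumina1.3": (64, 0, 62),
    "illumina1.8": (33, 0, 62),
}


def decode_quality(quality, variant):
    """Decode quality characters to Phred scores, rejecting out-of-range values."""
    offset, lowest, highest = VARIANTS[variant]
    scores = [ord(ch) - offset for ch in quality]
    for score in scores:
        if not lowest <= score <= highest:
            raise ValueError(
                "decoded score %d is outside [%d, %d] for variant %r"
                % (score, lowest, highest, variant))
    return scores


def encode_quality(scores, variant):
    """Encode Phred scores, truncating values above the variant's maximum."""
    offset, lowest, highest = VARIANTS[variant]
    if any(score < lowest for score in scores):
        # Assumption: the source does not say how negative scores are handled.
        raise ValueError("scores below %d cannot be encoded" % lowest)
    if any(score > highest for score in scores):
        warnings.warn("scores above %d were truncated to %d" % (highest, highest))
    return "".join(chr(min(score, highest) + offset) for score in scores)


print(decode_quality("I5+", "sanger"))              # prints [40, 20, 10]
print(encode_quality([40, 20, 10], "illumina1.3"))  # prints hTJ
```

The same three scores produce different characters under each offset, which is why the variant must be known before a file can be decoded correctly.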

## Format Parameters

The following parameters are available to all FASTQ format readers and writers:

• variant: A string indicating the quality score variant used to decode/encode Phred quality scores. Must be one of sanger, illumina1.3, illumina1.8, or solexa (solexa is listed as not currently implemented in the variant table above). This parameter is preferred over phred_offset because it is more explicit and allows additional quality score range checks and conversions to be performed.
• phred_offset: An integer indicating the ASCII code offset used to decode/encode Phred quality scores. Must be in the range [33, 126]. All decoded scores will be assumed to be Phred scores (i.e., no additional conversions are performed). Prefer using variant over this parameter whenever possible.

Note

You must provide variant or phred_offset when reading or writing a FASTQ file. variant and phred_offset cannot both be provided at the same time.
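The "exactly one of variant or phred_offset" rule can be sketched as a small validation helper. This is a hypothetical illustration, not scikit-bio's code: the name `resolve_phred_offset`, the exception types, and the messages are all assumptions; only the allowed values come from the text above.

```python
# Offsets for the implemented variants, copied from the variant table.
VARIANT_OFFSETS = {"sanger": 33, "illumina1.3": 64, "illumina1.8": 33}


def resolve_phred_offset(variant=None, phred_offset=None):
    """Return the ASCII offset implied by exactly one of the two parameters."""
    if variant is not None and phred_offset is not None:
        raise ValueError("provide either variant or phred_offset, not both")
    if variant is None and phred_offset is None:
        raise ValueError("must provide variant or phred_offset")

    if phred_offset is not None:
        if not 33 <= phred_offset <= 126:
            raise ValueError("phred_offset must be in the range [33, 126]")
        return phred_offset

    if variant == "solexa":
        # Assumption: signal the unimplemented variant with NotImplementedError.
        raise NotImplementedError("the solexa variant is not currently implemented")
    if variant not in VARIANT_OFFSETS:
        raise ValueError("unrecognized variant %r" % (variant,))
    return VARIANT_OFFSETS[variant]


print(resolve_phred_offset(variant="illumina1.3"))  # prints 64
print(resolve_phred_offset(phred_offset=33))        # prints 33
```

Passing a bare offset skips the per-variant score range check, which is one reason the text recommends variant where possible.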

The following additional parameters are the same as in FASTA format (skbio.io.fasta):

• constructor: see constructor parameter in FASTA format
• seq_num: see seq_num parameter in FASTA format
• id_whitespace_replacement: see id_whitespace_replacement parameter in FASTA format
• description_newline_replacement: see description_newline_replacement parameter in FASTA format

## Examples

Suppose we have the following FASTQ file with two DNA sequences:

@seq1 description 1
AACACCAAACTTCTCCACC
ACGTGAGCTACAAAAGGGT
+seq1 description 1
''''Y^T]']C^CABCACC
'^LB^CCYT\T\Y\WF^^^
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^'''AC^\\'BT''C'\AT''BBB


Note that the first sequence and its quality scores are split across multiple lines, while the second sequence and its quality scores are each on a single line. Also note that the first sequence has a duplicate ID and description on the quality header line, while the second sequence does not.

Let’s define this file in-memory as a StringIO, though this could be a real file path, file handle, or anything that’s supported by scikit-bio’s I/O registry in practice:

>>> from StringIO import StringIO
>>> fs = '\n'.join([
...     r"@seq1 description 1",
...     r"AACACCAAACTTCTCCACC",
...     r"ACGTGAGCTACAAAAGGGT",
...     r"+seq1 description 1",
...     r"''''Y^T]']C^CABCACC",
...     r"'^LB^CCYT\T\Y\WF^^^",
...     r"@seq2 description 2",
...     r"TATGTATATATAACATATACATATATACATACATA",
...     r"+",
...     r"]KZ[PY]_[YY^'''AC^\\'BT''C'\AT''BBB"])
>>> fh = StringIO(fs)


To load the sequences into a SequenceCollection, we run:

>>> from skbio import SequenceCollection
>>> sc = SequenceCollection.read(fh, variant='sanger')
>>> sc
<SequenceCollection: n=2; mean +/- std length=36.50 +/- 1.50>


Note that quality scores are decoded from Sanger. To load the second sequence as a DNASequence:

>>> from skbio import DNASequence
>>> fh = StringIO(fs) # reload the StringIO to read from the beginning again
>>> DNASequence.read(fh, variant='sanger', seq_num=2)
<DNASequence: TATGTATATA... (length: 35)>


To write our SequenceCollection to a FASTQ file with quality scores encoded using the illumina1.3 variant:

>>> new_fh = StringIO()
>>> sc.write(new_fh, format='fastq', variant='illumina1.3')
>>> print(new_fh.getvalue())
@seq1 description 1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAGGGT
+
FFFFx}s|F|b}b`ab`bbF}ka}bbxs{s{x{ve}}}
@seq2 description 2
TATGTATATATAACATATACATATATACATACATA
+
|jyzox|~zxx}FFF`b}{{FasFFbF{`sFFaaa

>>> new_fh.close()


Note that the file has been written in normalized format: sequence and quality scores each only occur on a single line and the sequence header line is not repeated in the quality header line. Note also that the quality scores are different because they have been encoded using a different variant.
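The change in the quality characters can be checked by hand. Both variants encode the same Phred scores, so going from sanger (offset 33) to illumina1.3 (offset 64) shifts every character up by 31 code points, provided no score exceeds 62 and triggers truncation. A minimal sketch using only the standard library (the helper name `shift_offset` is hypothetical and not part of scikit-bio):

```python
def shift_offset(quality, old_offset, new_offset):
    """Re-encode quality characters from one ASCII offset to another."""
    return "".join(chr(ord(ch) - old_offset + new_offset) for ch in quality)


# First twelve quality characters of seq2 in the sanger-encoded input.
print(shift_offset("]KZ[PY]_[YY^", 33, 64))  # prints |jyzox|~zxx}
```

The result matches the start of the quality line written for seq2 above.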

## References

[R160] Peter J. A. Cock, Christopher J. Fields, Naohisa Goto, Michael L. Heuer, and Peter M. Rice. The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucl. Acids Res. (2010) 38 (6): 1767-1771. First published online December 16, 2009. doi:10.1093/nar/gkp1137 http://nar.oxfordjournals.org/content/38/6/1767