skbio.parse.sequences.parse_fastq

skbio.parse.sequences.parse_fastq(data, strict=False, phred_offset=33)

Yields the label, sequence, and quality scores for each record in a FASTQ file.

Parameters:

data : open file object or str

An open FASTQ file (opened in binary mode) or the path to one.

strict : bool

If strict is True, a FastqParse error is raised when the seq and qual labels don't match.

phred_offset : int or None

Force a Phred offset, currently restricted to either 33 or 64. Default behavior is to infer the Phred offset.

Returns:

label, seq, qual : (str, bytes, np.array)

Yields the label, sequence, and quality scores for each entry.

Examples

Assume we have a FASTQ-formatted file with the following contents:

@seq1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
+
````Y^T]`]c^cabcacc`^Lb^ccYT\T\Y\WF
@seq2
TATGTATATATAACATATACATATATACATACATA
+
]KZ[PY]_[YY^```ac^\\`bT``c`\aT``bbb

We can use the following code:

>>> from StringIO import StringIO
>>> from skbio.parse.sequences import parse_fastq
>>> fastq_f = StringIO('@seq1\n'
...                     'AACACCAAACTTCTCCACCACGTGAGCTACAAAAG\n'
...                     '+\n'
...                     '````Y^T]`]c^cabcacc`^Lb^ccYT\\T\\Y\\WF\n'
...                     '@seq2\n'
...                     'TATGTATATATAACATATACATATATACATACATA\n'
...                     '+\n'
...                     ']KZ[PY]_[YY^```ac^\\\\`bT``c`\\aT``bbb\n')
>>> for label, seq, qual in parse_fastq(fastq_f, phred_offset=64):
...     print label
...     print seq
...     print qual
seq1
AACACCAAACTTCTCCACCACGTGAGCTACAAAAG
[32 32 32 32 25 30 20 29 32 29 35 30 35 33 34 35 33 35 35 32 30 12 34 30 35
 35 25 20 28 20 28 25 28 23  6]
seq2
TATGTATATATAACATATACATATATACATACATA
[29 11 26 27 16 25 29 31 27 25 25 30 32 32 32 33 35 30 28 28 32 34 20 32 32
 35 32 28 33 20 32 32 34 34 34]
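
The quality arrays printed above can be reproduced by hand: each score is the ASCII code of the quality character minus the Phred offset. The sketch below illustrates that arithmetic for the seq1 quality line. The helper name decode_phred is hypothetical and is not part of skbio; it only shows the conversion and makes no claim about how parse_fastq is implemented internally.

```python
def decode_phred(quality_string, phred_offset=33):
    """Hypothetical helper (not part of skbio).

    Converts each quality character to an integer score by subtracting
    the Phred offset from its ASCII code.
    """
    if phred_offset not in (33, 64):
        raise ValueError("phred_offset must be 33 or 64")
    return [ord(char) - phred_offset for char in quality_string]


# Quality line of seq1 from the example above; backslashes are escaped
# explicitly so the literal holds exactly 35 characters.
seq1_quality = '````Y^T]`]c^cabcacc`^Lb^ccYT\\T\\Y\\WF'

# The example file uses an offset of 64, so '`' (ASCII 96) decodes to 32
# and 'F' (ASCII 70) decodes to 6.
seq1_scores = decode_phred(seq1_quality, phred_offset=64)
```

Decoding the same line with an offset of 33 would shift every score up by 31, which is why the offset has to match the encoding used by the file.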