QSeq format (skbio.io.format.qseq)
The QSeq format (qseq) is a record-based, plain text output format produced by some DNA sequencers for storing biological sequence data, quality scores, per-sequence filtering information, and run-specific metadata.
Has Sniffer: Yes
Reader | Writer | Object Class |
---|---|---|
Yes | No | generator of skbio.sequence.Sequence objects |
Yes | No | skbio.sequence.Sequence |
Yes | No | skbio.sequence.DNA |
Yes | No | skbio.sequence.RNA |
Yes | No | skbio.sequence.Protein |
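A QSeq file is composed of single-line records, delimited by tabs. There are 11 fields in a record, in this order: machine name, run number, lane number, tile number, x coordinate, y coordinate, index, read number, sequence data, quality scores, and filter flag (1 if the sequence passed filtering, 0 otherwise).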
For more details please refer to the CASAVA documentation [R181].
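As a rough illustration of the record layout, the sketch below splits one tab-delimited line into its 11 fields using only the standard library. It is not scikit-bio's parser. The names `sequence`, `quality`, and `filter`, and the helpers `split_qseq_record` and `make_id`, are hypothetical and chosen for this sketch; the remaining field names and the id pattern are taken from the metadata shown in the example output further down this page.

```python
# Field names follow the metadata keys printed in the example output on this
# page; 'sequence', 'quality' and 'filter' are names assumed for this sketch.
FIELD_NAMES = (
    'machine_name', 'run_number', 'lane_number', 'tile_number',
    'x', 'y', 'index', 'read_number', 'sequence', 'quality', 'filter',
)
# Fields that appear as integers in the example output, plus the 0/1 filter.
INT_FIELDS = (
    'run_number', 'lane_number', 'tile_number',
    'x', 'y', 'index', 'read_number', 'filter',
)


def split_qseq_record(line):
    """Split one tab-delimited QSeq line into a dict of its 11 fields."""
    values = line.rstrip('\n').split('\t')
    # This sketch insists on all 11 fields being present.
    if len(values) != len(FIELD_NAMES):
        raise ValueError('expected 11 fields, found %d' % len(values))
    record = dict(zip(FIELD_NAMES, values))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    return record


def make_id(record):
    """Build an id shaped like the 'id' metadata in the example output."""
    template = ('{machine_name}_{run_number}:{lane_number}:{tile_number}:'
                '{x}:{y}#{index}/{read_number}')
    return template.format(**record)


line = 'illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1'
record = split_qseq_record(line)
print(make_id(record))  # illumina_1:3:34:-30:30#0/1
```

Keeping the field order in a single tuple means the length check and the dictionary keys cannot drift apart.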
Note
When a QSeq file is read into a scikit-bio object, the object’s metadata attribute is automatically populated with data corresponding to the names above.
Note
lowercase functionality is supported when reading QSeq files. Refer to specific object constructor documentation for details.
Note
scikit-bio allows the filter field to be omitted, but it is not clear whether this is part of the original format specification.
The following parameters are the same as in FASTQ format (skbio.io.format.fastq):
variant: see variant parameter in FASTQ format
phred_offset: see phred_offset parameter in FASTQ format

The following additional parameters are the same as in FASTA format (skbio.io.format.fasta):
constructor: see constructor parameter in FASTA format
seq_num: see seq_num parameter in FASTA format

filter: If True, excludes sequences that did not pass filtering (i.e., filter field is 0). Default is True.

Suppose we have the following QSeq file:
illumina 1 3 34 -30 30 0 1 ACG....ACGTAC ruBBBBrBCEFGH 1
illumina 1 3 34 30 -30 0 1 CGGGCATTGCA CGGGCasdGCA 0
illumina 1 3 35 -30 30 0 2 ACGTA.AATAAAC geTaAafhwqAAf 1
illumina 1 3 35 30 -30 0 3 CATTTAGGA.TGCA tjflkAFnkKghvM 0
Let’s define this file in-memory as a StringIO, though in practice this could be a real file path, file handle, or anything else that’s supported by scikit-bio’s I/O registry:
>>> from io import StringIO
>>> fs = '\n'.join([
... 'illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1',
... 'illumina\t1\t3\t34\t30\t-30\t0\t1\tCGGGCATTGCA\tCGGGCasdGCA\t0',
... 'illumina\t1\t3\t35\t-30\t30\t0\t2\tACGTA.AATAAAC\tgeTaAafhwqAAf\t1',
... 'illumina\t1\t3\t35\t30\t-30\t0\t3\tCATTTAGGA.TGCA\ttjflkAFnkKghvM\t0'
... ])
>>> fh = StringIO(fs)
To iterate over the sequences using the generator reader, we run:
>>> import skbio.io
>>> for seq in skbio.io.read(fh, format='qseq', variant='illumina1.3'):
... seq
... print('')
Sequence
--------------------------------------
Metadata:
'id': 'illumina_1:3:34:-30:30#0/1'
'index': 0
'lane_number': 3
'machine_name': 'illumina'
'read_number': 1
'run_number': 1
'tile_number': 34
'x': -30
'y': 30
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 13
--------------------------------------
0 ACG....ACG TAC
Sequence
--------------------------------------
Metadata:
'id': 'illumina_1:3:35:-30:30#0/2'
'index': 0
'lane_number': 3
'machine_name': 'illumina'
'read_number': 2
'run_number': 1
'tile_number': 35
'x': -30
'y': 30
Positional metadata:
'quality': <dtype: uint8>
Stats:
length: 13
--------------------------------------
0 ACGTA.AATA AAC
Note that only two sequences were loaded because the QSeq reader filters out sequences whose filter field is 0 (unless filter=False is supplied).
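To make the filtering rule concrete, here is a minimal standard-library sketch that mimics it on the same four records. It is an illustration only, not the scikit-bio reader: `iter_records` and its `keep_failed` flag are hypothetical names invented for this sketch, and the field positions assume the filter flag is the last of the 11 fields.

```python
from io import StringIO

# The same four records used in the example above.
qseq_text = '\n'.join([
    'illumina\t1\t3\t34\t-30\t30\t0\t1\tACG....ACGTAC\truBBBBrBCEFGH\t1',
    'illumina\t1\t3\t34\t30\t-30\t0\t1\tCGGGCATTGCA\tCGGGCasdGCA\t0',
    'illumina\t1\t3\t35\t-30\t30\t0\t2\tACGTA.AATAAAC\tgeTaAafhwqAAf\t1',
    'illumina\t1\t3\t35\t30\t-30\t0\t3\tCATTTAGGA.TGCA\ttjflkAFnkKghvM\t0',
])


def iter_records(handle, keep_failed=False):
    """Yield (sequence, quality) pairs from tab-delimited QSeq lines.

    Records whose filter field is '0' are skipped unless keep_failed is True.
    """
    for line in handle:
        fields = line.rstrip('\n').split('\t')
        passed = fields[10] == '1'  # filter flag is the 11th field
        if passed or keep_failed:
            yield fields[8], fields[9]


passing = list(iter_records(StringIO(qseq_text)))
everything = list(iter_records(StringIO(qseq_text), keep_failed=True))
print(len(passing), len(everything))  # 2 4
```

With the default setting only the first and third records survive, matching the two sequences printed above; keeping failed records yields all four.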