This module provides functions for parsing sequence files in a variety of different formats. Two interfaces are provided for parsing sequence files: sequence iterators (high-level, recommended interface) and parsing functions (lower-level interface).
The sequence iterator interface is the recommended way to parse sequence files. The load function provides a standard, high-level interface to iterate over sequence files regardless of file type or whether they are compressed. The method accepts single or multiple file paths and employs the correct file handlers, iterator objects, and parsers for the user.
The benefit of the sequence iterator interface is that the type of the file and any file format details are abstracted away from the user. In this manner, the user does not need to worry about whether they’re operating on FASTA or FASTQ files or any differences in the returns from their respective parsers.
SequenceIterator(seq[, qual, transform, ...]) | Provide a standard API for interacting with sequence data |
FastaIterator(seq[, qual, transform, ...]) | Populate state based on fasta sequence and qual (if provided) |
FastqIterator(*args, **kwargs) | Populate state based on fastq sequence |
load(seqs[, qual, constructor]) | Construct the appropriate iterator for all your processing needs |
For the first set of sequence iterator examples, we’re going to use the load function. The load function is intended to operate on file paths, so let’s create two files for it to use. The first one will be a regular FASTA file, and the second will be a gzip’d FASTQ file:
>>> import os
>>> import gzip
>>> out = open('test_seqs.fna', 'w')
>>> out.write(">s1\nATGC\n>s2\nATGGC\n")
>>> out.close()
>>> outgz = gzip.open('test_seqs.fq.gz', 'w')
>>> _ = outgz.write("@s3\nAATTGG\n+\nghghgh\n@s4\nAAA\n+\nfgh\n")
>>> outgz.close()
Now let’s see what load can do:
>>> it = load(['test_seqs.fna', 'test_seqs.fq.gz'], phred_offset=64)
>>> for rec in it:
... print rec['SequenceID']
... print rec['Sequence']
... print rec['Qual']
s1
ATGC
None
s2
ATGGC
None
s3
AATTGG
[39 40 39 40 39 40]
s4
AAA
[38 39 40]
To be polite, let’s remove the files we just created:
>>> os.remove('test_seqs.fna')
>>> os.remove('test_seqs.fq.gz')
In the following examples, we’ll see how to use the sequence iterators directly instead of using load.
>>> from StringIO import StringIO
>>> from skbio.parse.sequences import FastaIterator, FastqIterator
In this first example, we’re going to construct a FASTA iterator that is also paired with quality scores (e.g., as in 454 fasta/qual files).
>>> seqs = StringIO(">seq1\n"
... "ATGC\n"
... ">seq2\n"
... "TTGGCC\n")
>>> qual = StringIO(">seq1\n"
... "10 20 30 40\n"
... ">seq2\n"
... "1 2 3 4 5 6\n")
>>> it = FastaIterator(seq=[seqs], qual=[qual])
>>> for record in it:
... print record['Sequence']
... print record['Qual']
ATGC
[10 20 30 40]
TTGGCC
[1 2 3 4 5 6]
In the next example, we’re going to iterate over multiple FASTQ files at once.
>>> seqs1 = StringIO("@seq1\n"
... "ATGC\n"
... "+\n"
... "hhhh\n")
>>> seqs2 = StringIO("@seq2\n"
... "AATTGGCC\n"
... ">seq2\n"
... "abcdefgh\n")
>>> it = FastqIterator(seq=[seqs1, seqs2], phred_offset=64)
>>> for record in it:
... print record['Sequence']
... print record['Qual']
ATGC
[40 40 40 40]
AATTGGCC
[33 34 35 36 37 38 39 40]
Finally, we can apply arbitrary transforms to the sequences during iteration.
>>> seqs1 = StringIO("@seq1\n"
... "ATGC\n"
... "+\n"
... "hhhh\n")
>>> seqs2 = StringIO("@seq2\n"
... "AATTGGCC\n"
... ">seq2\n"
... "abcdefgh\n")
>>> def rev_f(st):
... st['Sequence'] = st['Sequence'][::-1]
... st['Qual'] = st['Qual'][::-1] if st['Qual'] is not None else None
>>> it = FastqIterator(seq=[seqs1, seqs2], transform=rev_f, phred_offset=64)
>>> for record in it:
... print record['Sequence']
... print record['Qual']
CGTA
[40 40 40 40]
CCGGTTAA
[40 39 38 37 36 35 34 33]
Lower-level parsing functions are also made available in addition to the sequence iterator interface. These functions can be used to directly parse a single sequence file. They accept file paths, file handles, or file-like objects.
parse_clustal(record[, strict]) | yields labels and sequences |
parse_fasta(infile[, strict, label_to_name, ...]) | Generator of labels and sequences from a fasta file. |
parse_fastq(data[, strict, ...]) | yields label, seq, and qual from a fastq file. |
parse_qual(infile[, full_header]) | yields label and qual from a qual file. |