skbio.parse.sequences.parse_fasta

skbio.parse.sequences.parse_fasta(infile, strict=True, label_to_name=None, finder=<function parser at 0x29bac08>, label_characters='>', ignore_comment=False)[source]

Generator of labels and sequences from a fasta file.

Parameters:

infile : open file object or str

An open fasta file or a path to a fasta file.

strict : bool

If True a RecordError will be raised if no header line is found.

label_to_name : function

A function to apply to the sequence label (i.e., text on the header line) before yielding it. By default, the sequence label is returned with no processing. This function must take a single string as input and return a single string as output.

finder : function

The function to apply to find records in the fasta file. In general you should not have to change this.

label_characters : str

String used to indicate the beginning of a new record. In general you should not have to change this.

ignore_comment : bool

If True, split the sequence label on spaces, and return the label only as the first space separated field (i.e., the sequence identifier). Note: if both ignore_comment and label_to_name are passed, ignore_comment is ignored (both operate on the label, so there is potential for things to get messy otherwise).

Returns:

two-item tuple of str

yields the label and sequence for each entry.

Raises:

RecordError

If strict == True, raises a RecordError if there is a fasta label line with no associated sequence, or a sequence with no associated label line (in other words, if there is a partial record).

Examples

Assume we have a fasta-formatted file with the following contents:

>seq1 db-accession-149855
CGATGTCGATCGATCGATCGATCAG
>seq2 db-accession-34989
CATCGATCGATCGATGCATGCATGCATG
>>> from StringIO import StringIO
>>> fasta_f = StringIO('>seq1 db-accession-149855\n'
...                    'CGATGTCGATCGATCGATCGATCAG\n'
...                    '>seq2 db-accession-34989\n'
...                    'CATCGATCGATCGATGCATGCATGCATG\n')

We can parse this as follows:

>>> from skbio.parse.sequences import parse_fasta
>>> for label, seq in parse_fasta(fasta_f):
...     print(label, seq)
seq1 db-accession-149855 CGATGTCGATCGATCGATCGATCAG
seq2 db-accession-34989 CATCGATCGATCGATGCATGCATGCATG

The sequence label or header line in a fasta file is defined as containing two separate pieces of information, delimited by a space. The first space- separated entry is the sequence identifier, and everything following the first space is considered additional information (e.g., comments about the source of the sequence or the molecule that it encodes). Often we don’t care about that information within our code. If you want to just return the sequence identifier from that line, you can pass ignore_comment=True:

>>> from StringIO import StringIO
>>> fasta_f = StringIO('>seq1 db-accession-149855\n'
...                    'CGATGTCGATCGATCGATCGATCAG\n'
...                    '>seq2 db-accession-34989\n'
...                    'CATCGATCGATCGATGCATGCATGCATG\n')
>>> from skbio.parse.sequences import parse_fasta
>>> for label, seq in parse_fasta(fasta_f, ignore_comment=True):
...     print(label, seq)
seq1 CGATGTCGATCGATCGATCGATCAG
seq2 CATCGATCGATCGATGCATGCATGCATG