# skbio.parse.sequences.parse_fasta

skbio.parse.sequences.parse_fasta(infile, strict=True, label_to_name=None, finder=<function parser at 0x2afb599ea410>, label_characters='>', ignore_comment=False)

Generator of labels and sequences from a fasta file.

Note

Deprecated in scikit-bio 0.2.0-dev: parse_fasta will be removed in scikit-bio 0.3.0. It is replaced by read, a more general method for deserializing FASTA-formatted files. By taking advantage of scikit-bio’s I/O registry system, read supports multiple file formats, automatic file format detection, and more. See skbio.io for more details.

Parameters:

- infile : open file object or str
  An open fasta file or a path to a fasta file.
- strict : bool
  If True, a RecordError will be raised if there is a fasta label line with no associated sequence, or a sequence with no associated label line (in other words, if there is a partial record). If False, partial records will be skipped.
- label_to_name : function
  A function to apply to the sequence label (i.e., text on the header line) before yielding it. By default, the sequence label is returned with no processing. This function must take a single string as input and return a single string as output.
- finder : function
  The function to apply to find records in the fasta file. In general you should not have to change this.
- label_characters : str
  String used to indicate the beginning of a new record. In general you should not have to change this.
- ignore_comment : bool
  If True, split the sequence label on spaces, and return the label only as the first space-separated field (i.e., the sequence identifier). Note: if both ignore_comment and label_to_name are passed, ignore_comment is ignored (both operate on the label, so there is potential for things to get messy otherwise).

Returns:

- two-item tuple of str
  Yields the label and sequence for each entry.

Raises:

- RecordError
  If strict == True, raises a RecordError if there is a fasta label line with no associated sequence, or a sequence with no associated label line (in other words, if there is a partial record).
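The interaction between strict, label_to_name, and ignore_comment described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not scikit-bio's implementation: sketch_parse_fasta is a hypothetical name, it accepts an iterable of lines rather than a file path, it hard-codes '>' as the label character, and it raises ValueError as a stand-in for RecordError.

```python
def sketch_parse_fasta(lines, strict=True, label_to_name=None,
                       ignore_comment=False):
    """Yield (label, sequence) pairs from an iterable of FASTA lines.

    Hypothetical helper for illustration only; ValueError stands in
    for scikit-bio's RecordError.
    """
    def finish(label, chunks):
        # A record is partial if it lacks a label or a sequence.
        seq = ''.join(chunks)
        if label is None or not seq:
            if strict:
                raise ValueError('partial record: %r' % (label,))
            return None  # strict=False: skip the partial record
        if label_to_name is not None:
            # label_to_name wins; ignore_comment is ignored in this case.
            label = label_to_name(label)
        elif ignore_comment:
            # Keep only the first space-separated field (the identifier).
            label = label.split(' ', 1)[0]
        return label, seq

    label, chunks, started = None, [], False
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            # A new label line closes the record collected so far.
            if started:
                record = finish(label, chunks)
                if record is not None:
                    yield record
            label, chunks, started = line[1:].strip(), [], True
        else:
            chunks.append(line)
            started = True
    if started:
        record = finish(label, chunks)
        if record is not None:
            yield record


example = ['>seq1 db-accession-149855', 'CGATGTCGATCGATCGATCGATCAG',
           '>seq2', '>seq3 note', 'ACGT']
# seq2 has no sequence, so it is skipped when strict=False.
print(list(sketch_parse_fasta(example, strict=False, ignore_comment=True)))
```

With strict=True, the same input would raise as soon as the label line for seq3 is reached, because seq2 is then known to have no sequence.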

Examples

Assume we have a fasta-formatted file with the following contents:

>seq1 db-accession-149855
CGATGTCGATCGATCGATCGATCAG
>seq2 db-accession-34989
CATCGATCGATCGATGCATGCATGCATG

>>> try:
...     from StringIO import StringIO  # Python 2
... except ImportError:
...     from io import StringIO  # Python 3
>>> fasta_f = StringIO('>seq1 db-accession-149855\n'
...                    'CGATGTCGATCGATCGATCGATCAG\n'
...                    '>seq2 db-accession-34989\n'
...                    'CATCGATCGATCGATGCATGCATGCATG\n')


We can parse this as follows:

>>> from skbio.parse.sequences import parse_fasta
>>> for label, seq in parse_fasta(fasta_f):
...     print(label, seq)
seq1 db-accession-149855 CGATGTCGATCGATCGATCGATCAG
seq2 db-accession-34989 CATCGATCGATCGATGCATGCATGCATG


The sequence label or header line in a fasta file is defined as containing two separate pieces of information, delimited by a space. The first space-separated entry is the sequence identifier, and everything following the first space is considered additional information (e.g., comments about the source of the sequence or the molecule that it encodes). Often we don’t care about that information within our code. To return only the sequence identifier from that line, pass ignore_comment=True:

>>> try:
...     from StringIO import StringIO  # Python 2
... except ImportError:
...     from io import StringIO  # Python 3
>>> fasta_f = StringIO('>seq1 db-accession-149855\n'
...                    'CGATGTCGATCGATCGATCGATCAG\n'
...                    '>seq2 db-accession-34989\n'
...                    'CATCGATCGATCGATGCATGCATGCATG\n')

>>> from skbio.parse.sequences import parse_fasta
>>> for label, seq in parse_fasta(fasta_f, ignore_comment=True):
...     print(label, seq)
seq1 CGATGTCGATCGATCGATCGATCAG
seq2 CATCGATCGATCGATGCATGCATGCATG
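
For label processing beyond what ignore_comment offers, the label_to_name parameter accepts any function that maps one string to one string. The following is a small sketch of such a function; accession_only is a hypothetical name chosen for illustration and is not part of scikit-bio. It keeps the text after the first space and falls back to the identifier when the label has no comment.

```python
def accession_only(label):
    """Return the text after the first space, or the whole label if none.

    Hypothetical example of a function suitable for label_to_name:
    it takes a single string and returns a single string.
    """
    identifier, _, comment = label.partition(' ')
    return comment if comment else identifier


print(accession_only('seq1 db-accession-149855'))  # db-accession-149855
print(accession_only('seq3'))                      # seq3
```

A function like this could be passed as label_to_name=accession_only. As noted in the parameter description, ignore_comment would then have no effect.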