BLAST+7 format (skbio.io.format.blast7)

The BLAST+7 format (blast+7) stores the results of a BLAST [R160] database search. This format is produced by both BLAST+ output format 7 and legacy BLAST output format 9. The results are stored in a simple tabular format with headers. Values are separated by the tab character.

An example BLAST+7-formatted file comparing two nucleotide sequences, taken from [R161] (tab characters represented by <tab>):

# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Subject: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 5 hits found
AE000111<tab>AE000111<tab>0.0<tab>1<tab>10596<tab>1<tab>10596
AE000111<tab>AE000174<tab>8e-30<tab>5565<tab>5671<tab>6928<tab>6821
AE000111<tab>AE000394<tab>1e-27<tab>5587<tab>5671<tab>135<tab>219
AE000111<tab>AE000425<tab>6e-26<tab>5587<tab>5671<tab>8552<tab>8468
AE000111<tab>AE000171<tab>3e-24<tab>5587<tab>5671<tab>2214<tab>2130

Format Support

Has Sniffer: Yes

State: Experimental as of 0.4.1.

Reader Writer Object Class
Yes No pandas.DataFrame

Format Specification

There are two BLAST+7 file formats supported by scikit-bio: BLAST+ output format 7 (-outfmt 7) and legacy BLAST output format 9 (-m 9). Both file formats are structurally similar, with minor differences.

Example BLAST+ output format 7 file:

# BLASTP 2.2.31+
# Query: query1
# Subject: subject2
# Fields: q. start, q. end, s. start, s. end, identical, mismatches, sbjctframe, query acc.ver, subject acc.ver
# 2 hits found
1   8       3       10      8       0       1       query1  subject2
2   5       2       15      8       0       2       query1  subject2

Note

Database searches without hits may occur in BLAST+ output format 7 files. scikit-bio ignores these “empty” records:

# BLASTP 2.2.31+
# Query: query1
# Subject: subject1
# 0 hits found

Example legacy BLAST output format 9 file:

# BLASTN 2.2.3 [May-13-2002]
# Database: other_vertebrate
# Query: AF178033
# Fields:
Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score
AF178033    EMORG:AF178033  100.00  811 0   0   1   811 1   811 0.0 1566.6
AF178033    EMORG:AF031394  99.63   811 3   0   1   811 99  909 0.0 1542.8

Note

scikit-bio requires fields to be consistent within a file.

BLAST Column Types

The following column types are output by BLAST and supported by scikit-bio. For more information on these column types, see skbio.io.format.blast6.

Field Name DataFrame Column Name
query id qseqid
query gi qgi
query acc. qacc
query acc.ver qaccver
query length qlen
subject id sseqid
subject ids sallseqid
subject gi sgi
subject gis sallgi
subject acc. sacc
subject acc.ver saccver
subject accs sallacc
subject length slen
q. start qstart
q. end qend
s. start sstart
s. end send
query seq qseq
subject seq sseq
evalue evalue
bit score bitscore
score score
alignment length length
% identity pident
identical nident
mismatches mismatch
positives positive
gap opens gapopen
gaps gaps
% positives ppos
query/sbjct frames frames
query frame qframe
sbjct frame sframe
BTOP btop
subject tax ids staxids
subject sci names sscinames
subject com names scomnames
subject blast names sblastnames
subject super kingdoms sskingdoms
subject title stitle
subject strand sstrand
subject titles salltitles
% query coverage per subject qcovs
% query coverage per hsp qcovhsp

Examples

Suppose we have a BLAST+7 file:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '# BLASTN 2.2.18+',
...     '# Query: gi|1786181|gb|AE000111.1|AE000111',
...     '# Database: ecoli',
...     '# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end',
...     '# 5 hits found',
...     'AE000111\tAE000111\t0.0\t1\t10596\t1\t10596',
...     'AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821',
...     'AE000111\tAE000171\t3e-24\t5587\t5671\t2214\t2130',
...     'AE000111\tAE000425\t6e-26\t5587\t5671\t8552\t8468'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame:

>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df 
       qacc      sacc        evalue  qstart     qend  sstart     send
0  AE000111  AE000111  0.000000e+00     1.0  10596.0     1.0  10596.0
1  AE000111  AE000174  8.000000e-30  5565.0   5671.0  6928.0   6821.0
2  AE000111  AE000171  3.000000e-24  5587.0   5671.0  2214.0   2130.0
3  AE000111  AE000425  6.000000e-26  5587.0   5671.0  8552.0   8468.0

Suppose we have a legacy BLAST 9 file:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '# BLASTN 2.2.3 [May-13-2002]',
...     '# Database: other_vertebrate',
...     '# Query: AF178033',
...     '# Fields: ',
...     'Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score',
...     'AF178033\tEMORG:AF178033\t100.00\t811\t0\t0\t1\t811\t1\t811\t0.0\t1566.6',
...     'AF178033\tEMORG:AF178032\t94.57\t811\t44\t0\t1\t811\t1\t811\t0.0\t1217.7',
...     'AF178033\tEMORG:AF178031\t94.82\t811\t42\t0\t1\t811\t1\t811\t0.0\t1233.5'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame:

>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df 
     qseqid          sseqid  pident  length  mismatch  gapopen  qstart  qend \
0  AF178033  EMORG:AF178033  100.00   811.0       0.0      0.0     1.0  811.0
1  AF178033  EMORG:AF178032   94.57   811.0      44.0      0.0     1.0  811.0
2  AF178033  EMORG:AF178031   94.82   811.0      42.0      0.0     1.0  811.0

   sstart   send  evalue  bitscore
0     1.0  811.0     0.0    1566.6
1     1.0  811.0     0.0    1217.7
2     1.0  811.0     0.0    1233.5

References

[R160]Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
[R161]http://www.ncbi.nlm.nih.gov/books/NBK279682/