BLAST+7 format (skbio.io.format.blast7)

The BLAST+7 format (blast+7) stores the results of a BLAST [1] database search. This format is produced by both BLAST+ output format 7 and legacy BLAST output format 9. The results are stored in a simple tabular format with headers. Values are separated by the tab character.

An example BLAST+7-formatted file comparing two nucleotide sequences, taken from [2] (tab characters represented by <tab>):

# BLASTN 2.2.18+
# Query: gi|1786181|gb|AE000111.1|AE000111
# Database: ecoli
# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end
# 5 hits found
AE000111<tab>AE000111<tab>0.0<tab>1<tab>10596<tab>1<tab>10596
AE000111<tab>AE000174<tab>8e-30<tab>5565<tab>5671<tab>6928<tab>6821
AE000111<tab>AE000394<tab>1e-27<tab>5587<tab>5671<tab>135<tab>219
AE000111<tab>AE000425<tab>6e-26<tab>5587<tab>5671<tab>8552<tab>8468
AE000111<tab>AE000171<tab>3e-24<tab>5587<tab>5671<tab>2214<tab>2130
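To make the layout concrete, here is a minimal sketch in plain Python that separates the `#` comment lines from the tab-separated hit rows. It is not scikit-bio's reader; the function name `split_blast7` is hypothetical, and the sample text is a shortened version of the example above with the hit count adjusted to match.

```python
# Illustrative only: a hand-rolled split of BLAST+7 text into header comments
# and tab-separated hit rows. This is NOT how scikit-bio reads the format.
blast7_text = "\n".join([
    "# BLASTN 2.2.18+",
    "# Query: gi|1786181|gb|AE000111.1|AE000111",
    "# Database: ecoli",
    "# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end",
    "# 2 hits found",
    "AE000111\tAE000111\t0.0\t1\t10596\t1\t10596",
    "AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821",
])


def split_blast7(text):
    """Return (comments, rows); each row is a list of tab-separated strings."""
    comments, rows = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("#"):
            comments.append(line[1:].strip())  # drop the leading '#'
        else:
            rows.append(line.split("\t"))
    return comments, rows


comments, rows = split_blast7(blast7_text)
```

Every value in `rows` is still a string at this point; converting columns such as `evalue` to numbers is left to the caller.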

Format Support

Has Sniffer: Yes

State: Experimental as of 0.4.1.

Reader    Writer    Object Class
Yes       No        pandas.DataFrame

Format Specification

There are two BLAST+7 file formats supported by scikit-bio: BLAST+ output format 7 (-outfmt 7) and legacy BLAST output format 9 (-m 9). Both file formats are structurally similar, with minor differences.

Example BLAST+ output format 7 file:

# BLASTP 2.2.31+
# Query: query1
# Subject: subject2
# Fields: q. start, q. end, s. start, s. end, identical, mismatches, sbjct frame, query acc.ver, subject acc.ver
# 2 hits found
1   8       3       10      8       0       1       query1  subject2
2   5       2       15      8       0       2       query1  subject2

Note

Database searches without hits may occur in BLAST+ output format 7 files. scikit-bio ignores these “empty” records:

# BLASTP 2.2.31+
# Query: query1
# Subject: subject1
# 0 hits found
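As an illustration of what an "empty" record looks like to a parser, the following plain-Python sketch groups lines into records and shows that a record with no hits contributes no rows. It is not scikit-bio's implementation. The helper `iter_records` is hypothetical, and it assumes that every record begins with a comment line starting with `# BLAST`.

```python
# Illustrative only. Assumption: each record opens with a "# BLAST..." line.
text = "\n".join([
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject1",
    "# 0 hits found",
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject2",
    "# Fields: q. start, q. end, s. start, s. end, identical, mismatches, "
    "sbjct frame, query acc.ver, subject acc.ver",
    "# 2 hits found",
    "1\t8\t3\t10\t8\t0\t1\tquery1\tsubject2",
    "2\t5\t2\t15\t8\t0\t2\tquery1\tsubject2",
])


def iter_records(text):
    """Yield (header_lines, rows) for each record in the text."""
    header, rows = None, []
    for line in text.splitlines():
        if line.startswith("# BLAST"):
            if header is not None:
                yield header, rows  # close the previous record
            header, rows = [line], []
        elif header is None or not line.strip():
            continue  # skip blank lines and anything before the first record
        elif line.startswith("#"):
            header.append(line)
        else:
            rows.append(line.split("\t"))
    if header is not None:
        yield header, rows


records = list(iter_records(text))
# Flattening the rows drops the empty record without any special handling.
hits = [row for _, rows in records for row in rows]
```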

Example legacy BLAST output format 9 file:

# BLASTN 2.2.3 [May-13-2002]
# Database: other_vertebrate
# Query: AF178033
# Fields:
Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score
AF178033    EMORG:AF178033  100.00  811 0   0   1   811 1   811 0.0 1566.6
AF178033    EMORG:AF031394  99.63   811 3   0   1   811 99  909 0.0 1542.8
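Judging only from the two examples in this section, the legacy layout differs in that a bare `# Fields:` comment is followed by a separate, comma-separated line of column names. The sketch below uses that observation to tell the two layouts apart. The function name `looks_like_legacy_blast9` is hypothetical, and this heuristic is an assumption for illustration, not a description of scikit-bio's sniffer.

```python
# Illustrative heuristic only; the rule is inferred from the examples above.
def looks_like_legacy_blast9(text):
    """True if a bare '# Fields:' comment is followed by a non-comment line."""
    lines = text.splitlines()
    for current, following in zip(lines, lines[1:]):
        if current.rstrip() == "# Fields:" and not following.startswith("#"):
            return True
    return False


legacy_text = "\n".join([
    "# BLASTN 2.2.3 [May-13-2002]",
    "# Database: other_vertebrate",
    "# Query: AF178033",
    "# Fields: ",
    "Query id,Subject id,% identity,alignment length,mismatches,"
    "gap openings,q. start,q. end,s. start,s. end,e-value,bit score",
    "AF178033\tEMORG:AF178033\t100.00\t811\t0\t0\t1\t811\t1\t811\t0.0\t1566.6",
])

blast_plus_text = "\n".join([
    "# BLASTP 2.2.31+",
    "# Query: query1",
    "# Subject: subject2",
    "# Fields: q. start, q. end, s. start, s. end",
    "# 1 hits found",
    "1\t8\t3\t10",
])

is_legacy = looks_like_legacy_blast9(legacy_text)
is_plus_legacy = looks_like_legacy_blast9(blast_plus_text)
```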

Note

scikit-bio requires fields to be consistent within a file.
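The consistency requirement can be pictured with a short plain-Python check that compares every `# Fields: ` line in a BLAST+ output format 7 file against the first one. This is a sketch under stated assumptions: `check_consistent_fields` is a hypothetical helper, it only handles the single-line BLAST+ style of `# Fields: ` comment, and the error it raises is a plain `ValueError` rather than whatever scikit-bio raises.

```python
# Illustrative only: verify that all records declare the same fields.
def check_consistent_fields(text):
    """Return the shared field list, or raise ValueError if records disagree."""
    prefix = "# Fields: "
    seen = None
    for line in text.splitlines():
        if not line.startswith(prefix):
            continue
        fields = line[len(prefix):].strip().split(", ")
        if seen is None:
            seen = fields  # first record sets the expected fields
        elif fields != seen:
            raise ValueError(f"Fields {fields!r} do not equal fields {seen!r}")
    return seen


consistent = "\n".join([
    "# BLASTP 2.2.31+",
    "# Fields: q. start, q. end",
    "# 1 hits found",
    "1\t8",
    "# BLASTP 2.2.31+",
    "# Fields: q. start, q. end",
    "# 1 hits found",
    "2\t5",
])
# Second record switches to different fields, which the check rejects.
inconsistent = consistent.replace(
    "# Fields: q. start, q. end\n# 1 hits found\n2\t5",
    "# Fields: s. start, s. end\n# 1 hits found\n2\t5",
)

shared_fields = check_consistent_fields(consistent)
```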

BLAST Column Types

The following column types are output by BLAST and supported by scikit-bio. For more information on these column types, see skbio.io.format.blast6.

Field Name                      DataFrame Column Name
query id                        qseqid
query gi                        qgi
query acc.                      qacc
query acc.ver                   qaccver
query length                    qlen
subject id                      sseqid
subject ids                     sallseqid
subject gi                      sgi
subject gis                     sallgi
subject acc.                    sacc
subject acc.ver                 saccver
subject accs                    sallacc
subject length                  slen
q. start                        qstart
q. end                          qend
s. start                        sstart
s. end                          send
query seq                       qseq
subject seq                     sseq
evalue                          evalue
bit score                       bitscore
score                           score
alignment length                length
% identity                      pident
identical                       nident
mismatches                      mismatch
positives                       positive
gap opens                       gapopen
gaps                            gaps
% positives                     ppos
query/sbjct frames              frames
query frame                     qframe
sbjct frame                     sframe
BTOP                            btop
subject tax ids                 staxids
subject sci names               sscinames
subject com names               scomnames
subject blast names             sblastnames
subject super kingdoms          sskingdoms
subject title                   stitle
subject strand                  sstrand
subject titles                  salltitles
% query coverage per subject    qcovs
% query coverage per hsp        qcovhsp
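The mapping above can be applied to a `# Fields: ` comment line with a simple dictionary lookup. The sketch below is illustrative and is not scikit-bio's code: `FIELD_TO_COLUMN` holds only a subset of the table, and `columns_from_fields_line` is a hypothetical helper name.

```python
# Illustrative only. Subset of the table above; extend with the remaining
# rows as needed.
FIELD_TO_COLUMN = {
    "query acc.": "qacc",
    "subject acc.": "sacc",
    "evalue": "evalue",
    "q. start": "qstart",
    "q. end": "qend",
    "s. start": "sstart",
    "s. end": "send",
    "% identity": "pident",
    "alignment length": "length",
    "bit score": "bitscore",
}


def columns_from_fields_line(line):
    """Translate a BLAST+ '# Fields: ' line into DataFrame column names."""
    prefix = "# Fields: "
    if not line.startswith(prefix):
        raise ValueError("not a '# Fields: ' line")
    names = line[len(prefix):].strip().split(", ")
    unknown = [name for name in names if name not in FIELD_TO_COLUMN]
    if unknown:
        raise ValueError(f"unrecognized field name(s): {unknown}")
    return [FIELD_TO_COLUMN[name] for name in names]


columns = columns_from_fields_line(
    "# Fields: query acc., subject acc., evalue, "
    "q. start, q. end, s. start, s. end"
)
```

Splitting on a comma followed by a space matters here, because several field names (for example `q. start`) contain spaces and periods of their own.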

Examples

Suppose we have a BLAST+7 file:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '# BLASTN 2.2.18+',
...     '# Query: gi|1786181|gb|AE000111.1|AE000111',
...     '# Database: ecoli',
...     '# Fields: query acc., subject acc., evalue, q. start, q. end, s. start, s. end',
...     '# 4 hits found',
...     'AE000111\tAE000111\t0.0\t1\t10596\t1\t10596',
...     'AE000111\tAE000174\t8e-30\t5565\t5671\t6928\t6821',
...     'AE000111\tAE000171\t3e-24\t5587\t5671\t2214\t2130',
...     'AE000111\tAE000425\t6e-26\t5587\t5671\t8552\t8468'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame:

>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df 
       qacc      sacc        evalue  qstart     qend  sstart     send
0  AE000111  AE000111  0.000000e+00     1.0  10596.0     1.0  10596.0
1  AE000111  AE000174  8.000000e-30  5565.0   5671.0  6928.0   6821.0
2  AE000111  AE000171  3.000000e-24  5587.0   5671.0  2214.0   2130.0
3  AE000111  AE000425  6.000000e-26  5587.0   5671.0  8552.0   8468.0

Suppose we have a legacy BLAST 9 file:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     '# BLASTN 2.2.3 [May-13-2002]',
...     '# Database: other_vertebrate',
...     '# Query: AF178033',
...     '# Fields: ',
...     'Query id,Subject id,% identity,alignment length,mismatches,gap openings,q. start,q. end,s. start,s. end,e-value,bit score',
...     'AF178033\tEMORG:AF178033\t100.00\t811\t0\t0\t1\t811\t1\t811\t0.0\t1566.6',
...     'AF178033\tEMORG:AF178032\t94.57\t811\t44\t0\t1\t811\t1\t811\t0.0\t1217.7',
...     'AF178033\tEMORG:AF178031\t94.82\t811\t42\t0\t1\t811\t1\t811\t0.0\t1233.5'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame:

>>> df = skbio.io.read(fh, into=pd.DataFrame)
>>> df 
     qseqid          sseqid  pident  length  mismatch  gapopen  qstart  qend \
0  AF178033  EMORG:AF178033  100.00   811.0       0.0      0.0     1.0  811.0
1  AF178033  EMORG:AF178032   94.57   811.0      44.0      0.0     1.0  811.0
2  AF178033  EMORG:AF178031   94.82   811.0      42.0      0.0     1.0  811.0

   sstart   send  evalue  bitscore
0     1.0  811.0     0.0    1566.6
1     1.0  811.0     0.0    1217.7
2     1.0  811.0     0.0    1233.5

References

[1] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.

[2] http://www.ncbi.nlm.nih.gov/books/NBK279682/