BLAST+6 format (skbio.io.format.blast6)

The BLAST+6 format (blast+6) stores the results of a BLAST [R157] database search in a simple tabular format with no column headers. Values are separated by tab characters.

An example BLAST+6-formatted file comparing two protein sequences, taken from [R158] (tab characters represented by <tab>):

moaC<tab>gi|15800534|ref|NP_286546.1|<tab>100.00<tab>161<tab>0<tab>0<tab>1<tab>161<tab>1<tab>161<tab>3e-114<tab>330
moaC<tab>gi|170768970|ref|ZP_02903423.1|<tab>99.38<tab>161<tab>1<tab>0<tab>1<tab>161<tab>1<tab>161<tab>9e-114<tab>329

Format Support

Has Sniffer: No

State: Experimental as of 0.4.1.

Reader  Writer  Object Class
Yes     No      pandas.DataFrame

Format Specification

BLAST+6 is a tab-separated text format with no column headers. It is produced by both BLAST+ output format 6 (-outfmt 6) and legacy BLAST output format 8 (-m 8). With BLAST+, users who do not want the default columns can choose which columns appear in the output file by listing column names (e.g., -outfmt "6 qseqid sseqid bitscore qstart sstart").
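To make the layout concrete, here is a minimal sketch that parses BLAST+6 text with only the Python standard library, independent of scikit-bio. The helper name parse_blast6_lines is hypothetical and is not part of scikit-bio or BLAST. It pairs each tab-separated field with a caller-supplied column name, because the file itself carries no header.

```python
import csv
from io import StringIO


def parse_blast6_lines(text, columns):
    """Hypothetical helper: map each tab-separated row onto ``columns``."""
    records = []
    reader = csv.reader(StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(
                f"expected {len(columns)} fields, found {len(row)}"
            )
        # Values stay as strings here; type conversion is a separate step.
        records.append(dict(zip(columns, row)))
    return records


# Columns requested with: -outfmt "6 qseqid sseqid bitscore qstart sstart"
custom_columns = ["qseqid", "sseqid", "bitscore", "qstart", "sstart"]
text = "moaC\tgi|15800534|ref|NP_286546.1|\t330\t1\t1\n"
records = parse_blast6_lines(text, custom_columns)
print(records[0]["sseqid"], records[0]["bitscore"])
```

The column list must match the order used when BLAST wrote the file; nothing in the file allows a parser to check this beyond the number of fields per line.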

BLAST Column Types

The following column types are output by BLAST and supported by scikit-bio. This information is taken from [R159].

Name         Description                                                                  Type
qseqid       Query Seq-id                                                                 str
qgi          Query GI                                                                     int
qacc         Query accession                                                              str
qaccver      Query accession.version                                                      str
qlen         Query sequence length                                                        int
sseqid       Subject Seq-id                                                               str
sallseqid    All subject Seq-id(s), separated by a ‘;’                                    str
sgi          Subject GI                                                                   int
sallgi       All subject GIs                                                              int
sacc         Subject accession                                                            str
saccver      Subject accession.version                                                    str
sallacc      All subject accessions                                                       str
slen         Subject sequence length                                                      int
qstart       Start of alignment in query                                                  int
qend         End of alignment in query                                                    int
sstart       Start of alignment in subject                                                int
send         End of alignment in subject                                                  int
qseq         Aligned part of query sequence                                               str
sseq         Aligned part of subject sequence                                             str
evalue       Expect value                                                                 float
bitscore     Bit score                                                                    float
score        Raw score                                                                    int
length       Alignment length                                                             int
pident       Percent of identical matches                                                 float
nident       Number of identical matches                                                  int
mismatch     Number of mismatches                                                         int
positive     Number of positive-scoring matches                                           int
gapopen      Number of gap openings                                                       int
gaps         Total number of gaps                                                         int
ppos         Percentage of positive-scoring matches                                       float
frames       Query and subject frames separated by a ‘/’                                  str
qframe       Query frame                                                                  int
sframe       Subject frame                                                                int
btop         Blast traceback operations (BTOP)                                            int
staxids      Unique Subject Taxonomy ID(s), separated by a ‘;’ (in numerical order)       str
sscinames    Unique Subject Scientific Name(s), separated by a ‘;’                        str
scomnames    Unique Subject Common Name(s), separated by a ‘;’                            str
sblastnames  Unique Subject Blast Name(s), separated by a ‘;’ (in alphabetical order)     str
sskingdoms   Unique Subject Super Kingdom(s), separated by a ‘;’ (in alphabetical order)  str
stitle       Subject Title                                                                str
sstrand      Subject Strand                                                               str
salltitles   All Subject Title(s), separated by a ‘<>’                                    str
qcovs        Query Coverage Per Subject                                                   int
qcovhsp      Query Coverage Per HSP                                                       int

Note

When a BLAST+6-formatted file contains N/A values, scikit-bio will convert these values into np.nan, matching pandas’ convention for representing missing data.

Note

scikit-bio stores columns of type int as type float in the returned pd.DataFrame. This is necessary in order to allow N/A values in integer columns (this is currently a limitation of pandas).
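The two notes above can be illustrated with a small standard-library sketch. It is not scikit-bio's implementation: convert_field and the four-column COLUMN_TYPES mapping are hypothetical names chosen for this example, and math.nan stands in for np.nan. The point is only to show why an integer column ends up holding floats once missing values are allowed.

```python
import math

# Declared types for a few columns, following the column table.
COLUMN_TYPES = {"qseqid": str, "length": int, "pident": float, "qlen": int}


def convert_field(column, raw):
    """Hypothetical helper: convert one raw field to a Python value."""
    if raw == "N/A":
        return math.nan  # missing value, whatever the declared type
    if COLUMN_TYPES[column] is str:
        return raw
    # int and float columns are both stored as float so that a column
    # can hold NaN alongside its numeric values.
    return float(raw)


row = {"qseqid": "moaC", "length": "161", "pident": "99.38", "qlen": "N/A"}
converted = {name: convert_field(name, raw) for name, raw in row.items()}
print(converted)
```

Here length becomes 161.0 rather than 161, and qlen becomes NaN, which mirrors the float columns visible in the DataFrame output in the Examples section.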

Format Parameters

The following format parameters are available in blast+6 format:

  • default_columns: False by default. If True, the reader assumes the default columns output by BLAST, which are qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, and bitscore.

    Warning

    When reading legacy BLAST files, you must pass default_columns=True because legacy BLAST does not allow users to specify which columns are present in the output file.

  • columns: None by default. If provided, must be a list of column names in the order they will appear in the file.

Note

Either default_columns or columns must be provided, as blast+6 does not contain column headers.
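As a rough illustration of how these two parameters interact, the sketch below resolves them into a single list of column names. The function resolve_columns is hypothetical and is not scikit-bio's API. Treating the case where both parameters are supplied as an error is an assumption of this sketch; the text above only states that one of them is required.

```python
# Default BLAST columns, as listed under default_columns above.
DEFAULT_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def resolve_columns(default_columns=False, columns=None):
    """Hypothetical helper: choose column names for a headerless file."""
    if default_columns and columns is not None:
        # Assumption of this sketch: the two options are mutually exclusive.
        raise ValueError("pass default_columns=True or columns, not both")
    if default_columns:
        return list(DEFAULT_COLUMNS)
    if columns is None:
        raise ValueError("either default_columns or columns must be provided")
    return list(columns)


legacy = resolve_columns(default_columns=True)  # e.g. legacy BLAST -m 8 output
custom = resolve_columns(columns=["qseqid", "pident", "bitscore"])
print(len(legacy), custom)
```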

Examples

Suppose we have a blast+6 file with default columns:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     'moaC\tgi|15800534|ref|NP_286546.1|\t100.00\t161\t0\t0\t1\t161\t1\t161\t3e-114\t330',
...     'moaC\tgi|170768970|ref|ZP_02903423.1|\t99.38\t161\t1\t0\t1\t161\t1\t161\t9e-114\t329'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame and specify that default columns should be used:

>>> df = skbio.io.read(fh, format="blast+6", into=pd.DataFrame,
...                    default_columns=True)
>>> df 
  qseqid                           sseqid  pident  length  mismatch  gapopen \
0   moaC     gi|15800534|ref|NP_286546.1|  100.00   161.0       0.0      0.0
1   moaC  gi|170768970|ref|ZP_02903423.1|   99.38   161.0       1.0      0.0

   qstart   qend  sstart   send         evalue  bitscore
0     1.0  161.0     1.0  161.0  3.000000e-114     330.0
1     1.0  161.0     1.0  161.0  9.000000e-114     329.0

Suppose we have a blast+6 file with user-supplied (non-default) columns:

>>> from io import StringIO
>>> import skbio.io
>>> import pandas as pd
>>> fs = '\n'.join([
...     'moaC\t100.00\t0\t161\t0\t161\t330\t1',
...     'moaC\t99.38\t1\t161\t0\t161\t329\t1'
... ])
>>> fh = StringIO(fs)

Read the file into a pd.DataFrame and specify which columns are present in the file:

>>> df = skbio.io.read(fh, format="blast+6", into=pd.DataFrame,
...                    columns=['qseqid', 'pident', 'mismatch', 'length',
...                             'gapopen', 'qend', 'bitscore', 'sstart'])
>>> df 
  qseqid  pident  mismatch  length  gapopen   qend  bitscore  sstart
0   moaC  100.00       0.0   161.0      0.0  161.0     330.0     1.0
1   moaC   99.38       1.0   161.0      0.0  161.0     329.0     1.0

References

[R157] Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) “Basic local alignment search tool.” J. Mol. Biol. 215:403-410.
[R158] http://blastedbio.blogspot.com/2014/11/column-headers-in-blast-tabular-and-csv.html
[R159] http://www.ncbi.nlm.nih.gov/books/NBK279675/