GenBank format (skbio.io.format.genbank)

GenBank format (GenBank Flat File Format) stores a sequence and its annotation together. The annotation section starts with a line beginning with the word “LOCUS”. The sequence section starts with a line beginning with the word “ORIGIN” and ends with a line containing only “//”.

GenBank files usually end with .gb or sometimes .gbk. The GenBank format for protein has been renamed to GenPept. GenBank (for nucleotide) and GenPept are essentially the same format. An example of a GenBank file can be seen at [1].
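The record layout described above can be illustrated with a minimal sketch in plain Python. It is not the scikit-bio reader; `split_record` and the sample record are hypothetical names and data used only to show how the LOCUS, ORIGIN, and // markers delimit the two sections.

```python
def split_record(record_text):
    """Split one GenBank record into annotation lines and sequence lines."""
    annotation, sequence = [], []
    current = None
    for line in record_text.splitlines():
        if line.startswith('LOCUS'):
            current = annotation      # annotation section starts here
        elif line.startswith('ORIGIN'):
            current = sequence        # sequence section starts here
            continue                  # the ORIGIN marker carries no residues
        elif line.strip() == '//':
            break                     # end of the record
        if current is not None:
            current.append(line)
    return annotation, sequence


# Hypothetical two-line annotation followed by a 12-base sequence.
record = """LOCUS       EXAMPLE       12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record.
ORIGIN
        1 acgtacgtac gt
//
"""
annotation, sequence = split_record(record)
print(len(annotation))
print(sequence)
```

Lines that appear before LOCUS or after // are ignored by this sketch.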

Format Support

Has Sniffer: Yes

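======  ======  ============================================
Reader  Writer  Object Class
======  ======  ============================================
Yes     Yes     skbio.sequence.Sequence
Yes     Yes     skbio.sequence.DNA
Yes     Yes     skbio.sequence.RNA
Yes     Yes     skbio.sequence.Protein
Yes     Yes     generator of skbio.sequence.Sequence objects
======  ======  ============================================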

Format Specification

State: Experimental as of 0.4.1.

Sections before FEATURES

All the sections before FEATURES are read into the metadata attribute. The header of each section and its content are stored as a key-value pair in metadata. For the REFERENCE section, the value is stored as a list, as there are often multiple REFERENCE sections in one GenBank record.
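As a rough illustration of this key-value layout, the sketch below collects header sections into a dict and keeps REFERENCE as a list. It is an assumption-laden simplification, not the library's parser: `parse_headers` is a hypothetical helper, every value is kept as a raw string (the Examples section shows that the real reader stores LOCUS and SOURCE as dicts), and the second REFERENCE entry is made up for the demonstration.

```python
def parse_headers(lines):
    """Collect the sections before FEATURES into a metadata-like dict."""
    sections = []
    for line in lines:
        if line.startswith('FEATURES'):
            break                          # header sections end here
        if not line.strip():
            continue
        if not line[0].isspace():
            # A keyword at column 0 opens a new section.
            key, _, rest = line.partition(' ')
            sections.append((key, [rest.strip()]))
        elif sections:
            # Indented lines belong to the section opened last.
            sections[-1][1].append(line.strip())

    metadata = {}
    for key, content in sections:
        value = '\n'.join(content)
        if key == 'REFERENCE':
            metadata.setdefault(key, []).append(value)  # may repeat
        else:
            metadata[key] = value
    return metadata


header = [
    'DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.',
    'ACCESSION   3K1V_A',
    'REFERENCE   1  (bases 1 to 34)',
    '  TITLE     Cocrystal structure of a class I preQ1 riboswitch',
    'REFERENCE   2  (bases 1 to 34)',   # hypothetical second reference
    '  TITLE     Direct Submission',
    'FEATURES             Location/Qualifiers',
]
metadata = parse_headers(header)
print(metadata['ACCESSION'])
print(len(metadata['REFERENCE']))
```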

FEATURES section

The International Nucleotide Sequence Database Collaboration (INSDC [2]) is a joint effort among the DDBJ, EMBL, and GenBank. These organisations all use the same “Feature Table” layout in their plain text flat file formats, which is documented in detail in [3]. The feature keys and their qualifiers are also described in [4].

The FEATURES section is stored in the interval_metadata attribute of Sequence or its sub-class. Each sub-section (one feature) is stored as an Interval object in interval_metadata, and the metadata of each Interval object holds the information of that feature.

To normalize the vocabulary between multiple formats (currently only the INSDC Feature Table and GFF3) to store metadata of interval features, we rename some terms in some formats to the same common name when parsing them into memory, as described in this table:

===================  ==========================  ===========  =================  ====================================================
INSDC feature table  GFF3 columns or attributes  Key stored   Value type stored  Description
===================  ==========================  ===========  =================  ====================================================
inference            source (column 2)           source       str                the algorithm or experiment used to generate this
                                                                                 feature
feature key          type (column 3)             type         str                the type of the feature
N/A                  score (column 6)            score        float              the score of the feature
N/A                  strand (column 7)           strand       str                the strand of the feature. + for positive strand, -
                                                                                 for minus strand, and . for features that are not
                                                                                 stranded. In addition, ? can be used for features
                                                                                 whose strandedness is relevant, but unknown.
codon_start          phase (column 8)            phase        int                the offset at which the first complete codon of a
                                                                                 coding feature can be found, relative to the first
                                                                                 base of that feature. It is 0, 1, or 2 in GFF3 or
                                                                                 1, 2, or 3 in GenBank. The stored value is 0, 1, or
                                                                                 2, following the GFF3 convention.
db_xref              Dbxref                      db_xref      list of str        a database cross reference
N/A                  ID                          ID           str                feature ID
note                 Note                        note         str                any comment or additional information
translation          N/A                         translation  str                the protein sequence for CDS features
===================  ==========================  ===========  =================  ====================================================
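The renaming rules in the table can be sketched in a few lines of plain Python. This is only an illustration of the mapping, not the code scikit-bio uses: `normalize_qualifiers` and `INSDC_TO_STORED` are hypothetical names, and the CDS qualifiers in the example are invented.

```python
# INSDC qualifier names whose stored key differs (taken from the table).
INSDC_TO_STORED = {'inference': 'source', 'codon_start': 'phase'}


def normalize_qualifiers(feature_key, qualifiers):
    """Map an INSDC feature key and (name, value) qualifiers to stored keys."""
    stored = {'type': feature_key}          # feature key -> 'type'
    for name, value in qualifiers:
        key = INSDC_TO_STORED.get(name, name)
        if name == 'codon_start':
            # GenBank uses 1, 2, or 3; the stored phase is 0, 1, or 2.
            stored[key] = int(value) - 1
        elif name == 'db_xref':
            # Cross references are kept as a list of str.
            stored.setdefault(key, []).append(value)
        else:
            stored[key] = value
    return stored


# Invented qualifiers for a CDS feature.
cds = normalize_qualifiers('CDS', [
    ('codon_start', '1'),
    ('db_xref', 'taxon:32630'),
    ('db_xref', 'GI:260656459'),
    ('note', 'hypothetical protein'),
])
print(cds)
```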

Location string

The Feature Table defines 5 types of location descriptors. The following list explains how each is parsed into the bounds of an Interval object (note that the 1-based coordinates are converted to 0-based):

  1. a single base number, e.g. 67. This is parsed to (66, 67).

  2. a site between two neighboring bases, e.g. 67^68. This is parsed to (66, 67).

  3. a single base from inside a range, e.g. 67.89. This is parsed to (66, 89).

  4. a pair of base numbers defining a sequence span, e.g. 67..89. This is parsed to (66, 89).

  5. a remote sequence identifier followed by a location descriptor defined above, e.g. J00123.1:67..89. This is discarded because it is not on the current sequence. When it is combined with a local descriptor, as in J00123.1:67..89,200..209, only the local part is kept, giving (199, 209).
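The five rules above can be sketched as a small pure-Python function. This is an illustrative reimplementation, not the parser shipped with scikit-bio: `parse_descriptor` and `parse_location` are hypothetical names, and Feature Table operators such as join(...) and complement(...) are deliberately not handled.

```python
import re


def parse_descriptor(descriptor):
    """Return 0-based (start, end) bounds, or None for a remote descriptor."""
    if ':' in descriptor:
        return None                          # rule 5: remote, discarded
    if '^' in descriptor:
        first = int(descriptor.split('^')[0])
        return (first - 1, first)            # rule 2: 67^68 -> (66, 67)
    span = re.fullmatch(r'(\d+)\.{1,2}(\d+)', descriptor)
    if span:
        # rules 3 and 4: 67.89 and 67..89 -> (66, 89)
        return (int(span.group(1)) - 1, int(span.group(2)))
    position = int(descriptor)
    return (position - 1, position)          # rule 1: 67 -> (66, 67)


def parse_location(location):
    """Parse comma-separated descriptors, keeping only the local ones."""
    bounds = []
    for descriptor in location.split(','):
        parsed = parse_descriptor(descriptor.strip())
        if parsed is not None:
            bounds.append(parsed)
    return bounds


for example in ['67', '67^68', '67.89', '67..89',
                'J00123.1:67..89', 'J00123.1:67..89,200..209']:
    print(example, parse_location(example))
```

Returning a list of bounds mirrors the fact that one location string may contain several local descriptors.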

Note

The Location string is stored in full in Interval.metadata under the key __location. Keys starting with __ are “private” and should be modified with care.

ORIGIN section

The sequence in the ORIGIN section is always in lowercase for GenBank files downloaded from NCBI. For RNA molecules, t (thymine) is used in the sequence instead of u (uracil). All GenBank writers follow these conventions when writing GenBank files.
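A minimal sketch of these writing conventions follows. It is not the library's writer: `format_origin` is a hypothetical helper, and the layout (a position number right-aligned in 9 columns, then blocks of 10 residues, 6 blocks per line) is inferred from the example record in this document.

```python
def format_origin(sequence, block=10, blocks_per_line=6):
    """Format a sequence as a GenBank-style ORIGIN section."""
    # Lowercase the residues and write RNA with t instead of u.
    chars = str(sequence).lower().replace('u', 't')
    per_line = block * blocks_per_line
    lines = ['ORIGIN']
    for start in range(0, len(chars), per_line):
        chunk = chars[start:start + per_line]
        blocks = [chunk[i:i + block] for i in range(0, len(chunk), block)]
        # 1-based position of the first residue on this line.
        lines.append('{:>9} {}'.format(start + 1, ' '.join(blocks)))
    lines.append('//')
    return '\n'.join(lines)


origin = format_origin('AGAGGUUCUAGCACAUCCCUCUAUAAAAAACUAA')
print(origin)
```

Applied to the riboswitch RNA used in the Examples section, the sketch reproduces the ORIGIN line of that record.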

Format Parameters

Reader-specific Parameters

The constructor parameter can be used with the Sequence generator to specify the in-memory type of each GenBank record that is parsed. constructor should be Sequence or a sub-class of Sequence. If constructor is not set, the type is detected from the unit label on the LOCUS line: if it is bp, the record is read into DNA; if it is aa, it is read into Protein; otherwise, it is read into Sequence. Setting constructor overrides this default behavior.

lowercase is another parameter available for all GenBank readers. By default, it is set to True so that the lowercase letters of the ORIGIN sequence can be read in. This parameter is passed to the constructor of Sequence or its sub-class.

seq_num is a parameter used with the Sequence, DNA, RNA, and Protein GenBank readers. It specifies which GenBank record to read from a GenBank file with multiple records in it.
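The interplay of constructor, the LOCUS unit label, and seq_num can be sketched without scikit-bio installed. Everything here is illustrative: `iter_records`, `select_record`, and `choose_type` are hypothetical helpers, type names are returned as plain strings rather than classes, the two records are invented, and treating seq_num as 1-based is an assumption of this sketch.

```python
def iter_records(text):
    """Yield each record of a multi-record GenBank text (terminated by //)."""
    record = []
    for line in text.splitlines():
        if line.strip() == '//':
            yield '\n'.join(record)
            record = []
        elif line.strip():
            record.append(line)


def select_record(text, seq_num=1):
    """Return the seq_num-th record (assumed 1-based in this sketch)."""
    for number, record in enumerate(iter_records(text), start=1):
        if number == seq_num:
            return record
    raise ValueError('fewer than %d records found' % seq_num)


def choose_type(locus_line, constructor=None):
    """Pick a type name from the LOCUS unit label unless overridden."""
    if constructor is not None:
        return constructor            # an explicit constructor wins
    tokens = locus_line.split()
    if 'bp' in tokens:
        return 'DNA'
    if 'aa' in tokens:
        return 'Protein'
    return 'Sequence'


# Two invented, heavily abbreviated records.
text = ('LOCUS       A   4 bp\nORIGIN\n        1 acgt\n//\n'
        'LOCUS       B   3 aa\nORIGIN\n        1 mkv\n//\n')
second = select_record(text, seq_num=2)
locus_line = second.splitlines()[0]
print(choose_type(locus_line))
print(choose_type(locus_line, constructor='Sequence'))
```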

Examples

Reading and Writing GenBank Files

Suppose we have the following GenBank file example, modified from [5]:

>>> gb_str = '''
... LOCUS       3K1V_A       34 bp    RNA     linear   SYN 10-OCT-2012
... DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
... ACCESSION   3K1V_A
... VERSION     3K1V_A  GI:260656459
... KEYWORDS    .
... SOURCE      synthetic construct
...   ORGANISM  synthetic construct
...             other sequences; artificial sequences.
... REFERENCE   1  (bases 1 to 34)
...   AUTHORS   Klein,D.J., Edwards,T.E. and Ferre-D'Amare,A.R.
...   TITLE     Cocrystal structure of a class I preQ1 riboswitch
...   JOURNAL   Nat. Struct. Mol. Biol. 16 (3), 343-344 (2009)
...    PUBMED   19234468
... COMMENT     SEQRES.
... FEATURES             Location/Qualifiers
...      source          1..34
...                      /organism="synthetic construct"
...                      /mol_type="other RNA"
...                      /db_xref="taxon:32630"
...      misc_binding    1..30
...                      /note="Preq1 riboswitch"
...                      /bound_moiety="preQ1"
... ORIGIN
...         1 agaggttcta gcacatccct ctataaaaaa ctaa
... //
... '''

Now we can read it as a DNA object:

>>> import io
>>> from skbio import DNA, RNA, Sequence
>>> gb = io.StringIO(gb_str)
>>> dna_seq = DNA.read(gb)
>>> dna_seq
DNA
-----------------------------------------------------------------
Metadata:
    'ACCESSION': '3K1V_A'
    'COMMENT': 'SEQRES.'
    'DEFINITION': 'Chain A, Structure Of A Mutant Class-I Preq1.'
    'KEYWORDS': '.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': '3K1V_A  GI:260656459'
Interval metadata:
    2 interval features
Stats:
    length: 34
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.29%
-----------------------------------------------------------------
0 AGAGGTTCTA GCACATCCCT CTATAAAAAA CTAA

Since this is a riboswitch molecule, we may want to read it as RNA. Because GenBank files usually have t instead of u in the sequence, reading the record as RNA converts t to u:

>>> gb = io.StringIO(gb_str)
>>> rna_seq = RNA.read(gb)
>>> rna_seq
RNA
-----------------------------------------------------------------
Metadata:
    'ACCESSION': '3K1V_A'
    'COMMENT': 'SEQRES.'
    'DEFINITION': 'Chain A, Structure Of A Mutant Class-I Preq1.'
    'KEYWORDS': '.'
    'LOCUS': <class 'dict'>
    'REFERENCE': <class 'list'>
    'SOURCE': <class 'dict'>
    'VERSION': '3K1V_A  GI:260656459'
Interval metadata:
    2 interval features
Stats:
    length: 34
    has gaps: False
    has degenerates: False
    has definites: True
    GC-content: 35.29%
-----------------------------------------------------------------
0 AGAGGUUCUA GCACAUCCCU CUAUAAAAAA CUAA
>>> rna_seq == dna_seq.transcribe()
True

The DNA object can also be written back out in GenBank format:

>>> with io.StringIO() as fh:
...     print(dna_seq.write(fh, format='genbank').getvalue())
LOCUS       3K1V_A   34 bp   RNA   linear   SYN   10-OCT-2012
DEFINITION  Chain A, Structure Of A Mutant Class-I Preq1.
ACCESSION   3K1V_A
VERSION     3K1V_A  GI:260656459
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
            other sequences; artificial sequences.
REFERENCE   1  (bases 1 to 34)
  AUTHORS   Klein,D.J., Edwards,T.E. and Ferre-D'Amare,A.R.
  TITLE     Cocrystal structure of a class I preQ1 riboswitch
  JOURNAL   Nat. Struct. Mol. Biol. 16 (3), 343-344 (2009)
  PUBMED    19234468
COMMENT     SEQRES.
FEATURES             Location/Qualifiers
     source          1..34
                     /db_xref="taxon:32630"
                     /mol_type="other RNA"
                     /organism="synthetic construct"
     misc_binding    1..30
                     /bound_moiety="preQ1"
                     /note="Preq1 riboswitch"
ORIGIN
        1 agaggttcta gcacatccct ctataaaaaa ctaa
//

References

[1] http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

[2] http://www.insdc.org/

[3] http://www.insdc.org/files/feature_table.html

[4] http://www.ebi.ac.uk/ena/WebFeat/

[5] http://www.ncbi.nlm.nih.gov/nuccore/3K1V_A