Stockholm format (skbio.io.format.stockholm)

The Stockholm format is a multiple sequence alignment format (MSA) that optionally supports storing arbitrary alignment features (metadata). Features can be placed into four different categories: GF, GS, GR, and GC (described in more detail below).

An example Stockholm file, taken from [R182]:

# STOCKHOLM 1.0
#=GF ID    UPSK
#=GF SE    Predicted; Infernal
#=GF SS    Published; PMID 9223489
#=GF RN    [1]
#=GF RM    9223489
#=GF RT    The role of the pseudoknot at the 3' end of turnip yellow mosaic
#=GF RT    virus RNA in minus-strand synthesis by the viral RNA-dependent RNA
#=GF RT    polymerase.
#=GF RA    Deiman BA, Kortlever RM, Pleij CW;
#=GF RL    J Virol 1997;71:5990-5996.
AF035635.1/619-641             UGAGUUCUCGAUCUCUAAAAUCG
M24804.1/82-104                UGAGUUCUCUAUCUCUAAAAUCG
J04373.1/6212-6234             UAAGUUCUCGAUCUUUAAAAUCG
M24803.1/1-23                  UAAGUUCUCGAUCUCUAAAAUCG
#=GC SS_cons                   .AAA....<<<<aaa....>>>>
//

Format Support

Has Sniffer: Yes

State: Experimental as of 0.4.2.

Reader Writer Object Class
Yes Yes skbio.alignment.TabularMSA

Format Specification

The Stockholm format consists of a header, a multiple sequence alignment, associated metadata (features), and a footer.

Multiple Sequence Alignment

Sequence lines consist of a sequence name, followed by whitespace, followed by the aligned sequence. For example:

seq1 ACG-T-GGT
seq2 ACCGTTCG-

Sequence names (seq1, seq2) are stored in the TabularMSA index.

Note

scikit-bio currently supports reading Stockholm files where each sequence is contained on a single line. Interleaved/wrap-around Stockholm files are not supported. When writing, each sequence will be placed on its own line.

Warning

Sequence names must be unique in the Stockholm file. Likewise, when writing from a TabularMSA, index must be unique.

Metadata

Stockholm files support storing arbitrary metadata (features) about the MSA. All metadata described in the following sections are optional and may appear in any order. Metadata “mark-up” lines begin with either #=GF, #=GS, #=GR, or #=GC, and each line describes a single feature of the alignment.

Note

Stockholm format supports generic features. [R182] and [R183] provide a list of common features output by Pfam/Rfam. scikit-bio does not require that these features are present. These features are processed in the same way as any arbitrary feature would be, as a simple key-value pair of strings. When writing, feature names, feature data, and sequence names are converted to type str.

Note

When writing a Stockholm file, scikit-bio will place the metadata in the format’s recommended order:

  • GF: Above the alignment
  • GS: Above the alignment (after GF)
  • GR: Below corresponding sequence
  • GC: Below the alignment

GF metadata

Data relating to the multiple sequence alignment as a whole, such as authors or number of sequences in the alignment. Starts with #=GF followed by a feature name and data relating to the feature. Typically comes first in a Stockholm file.

For example (taken from [R183]):

#=GF DE CBS domain

Where DE is the feature name and CBS Domain is the feature data.

GF metadata is stored in the TabularMSA metadata dictionary.

Note

When reading, duplicate GF feature names will have their values concatenated in the order they appear in the file. Concatenation will also add a space between lines if one isn’t already there in order to avoid joining words together. When writing, each GF feature will be placed on its own line, regardless of length.

Note

Trees labelled with NH/TN are handled differently than other GF features. When reading a Stockholm file with these features, the reader follows the rules described in [R183]. Trees split over multiple lines will have their values concatenated. Unlike other GF features, trees will never have a space added when they are concatenated.

A single tree without an identifier will be stored as:

metadata = {
    'NH': 'tree in NHX format'
}

A single tree with an identifier will be stored as:

metadata = {
    'NH': {
        'tree-id': 'tree in NHX format'
    }
}

Multiple trees (which must have identifiers) will be stored as:

metadata = {
    'NH': {
        'tree-id-1': 'tree in NHX format',
        'tree-id-2': 'tree in NHX format'
    }
}

Note

References labelled with RN/RM/RT/RA/RL/RC are handled differently than other GF features. When reading a Stockholm file with these features, the reader populates a list of dictionaries, where each dictionary represents a single reference. The list contains references in the order they appear in the file, regardless of the value provided for RN. If a reference does not include all possible reference tags (e.g. RC is missing), the dictionary will only contain the reference tags present for that reference. When writing, the writer adds a reference number (RN) line before writing each reference, for example:

#=GF RN [1]
#=GF RA Kestrel Gorlick
...
#=GF RN [2]
...

References will be stored as:

metadata = {
    'RN': [{
        'RM': 'reference medline',
        'RT': 'reference title',
        'RA': 'reference author',
        'RL': 'reference location',
        'RC': 'reference comment'
    }, {
        'RM': 'reference medline',
        ...
    }]
}

GS metadata

Data relating to a specific sequence in the multiple sequence alignment. Starts with #=GS followed by the sequence name followed by a feature name and data relating to the feature. Typically comes after GF metadata in a Stockholm file.

For example (taken from [R183]):

#=GS O83071/259-312 AC O83071

Where O83071/259-312 is the sequence name, AC is the feature name, and 083071 is the feature data.

GS metadata is stored in the sequence-specific metadata dictionary.

Note

When reading, duplicate GS feature names will have their values concatenated in the order they appear in the file. Concatenation will also add a space between lines if one isn’t already there in order to avoid joining words together. When writing, each GS feature will be placed on its own line, regardless of length.

GR metadata

Data relating to the columns of a specific sequence in a multiple sequence alignment. Starts with #=GR followed by the sequence name followed by a feature name and data relating to the feature, one character per column. Typically comes after the sequence line it relates to.

For example (taken from [R183]):

#=GR O31698/18-71 SS    CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH

Where O31698/18-71 is the sequence name, SS is the feature name, and CCCHHHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEHHH is the feature data.

GR metadata is stored in sequence-specific positional_metadata.

Note

Duplicate GR feature names attributed to a single sequence are disallowed.

GC metadata

Data relating to the columns of the multiple sequence alignment as a whole. Starts with #=GC followed by a feature name and data relating to the feature, one character per column. Typically comes at the end of the multiple sequence alignment.

For example (taken from [R183]):

#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH

Where SS_cons is the feature name and CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEEEEEEEH is the feature data.

GC metadata is stored in TabularMSA positional_metadata.

Note

Duplicate GC feature names are disallowed.

Format Parameters

The only supported format parameter is constructor, which specifies the type of in-memory sequence object to read each aligned sequence into. This must be a subclass of GrammaredSequence (e.g., DNA, RNA, Protein) and is a required format parameter. For example, if you know that the Stockholm file you’re reading contains DNA sequences, you would pass constructor=DNA to the reader call.

Examples

Suppose we have a Stockholm file containing an MSA of protein sequences (modified from [R183]):

>>> import skbio.io
>>> from io import StringIO
>>> from skbio import Protein, TabularMSA
>>> fs = '\n'.join([
...         '# STOCKHOLM 1.0',
...         '#=GF CC CBS domains are small intracellular modules mostly'
...         ' found',
...         '#=GF CC in 2 or four copies within a protein.',
...         '#=GS O83071/192-246 AC O83071',
...         '#=GS O31698/88-139 OS Bacillus subtilis',
...         'O83071/192-246          MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV',
...         '#=GR O83071/192-246 SA  999887756453524252..55152525....36463',
...         'O83071/259-312          MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA',
...         'O31698/18-71            MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI',
...         'O31698/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFV',
...         'O31699/88-139           EVMLTDIPRLHINDPIMK..GFGMVINN......GFV',
...         '#=GR O31699/88-139 AS   ________________*____________________',
...         '#=GR O31699/88-139 IN   ____________1______________2_________',
...         '#=GC SS_cons            CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEE',
...         '//'
...     ])
>>> fh = StringIO(fs)
>>> msa = TabularMSA.read(fh, constructor=Protein)
>>> msa 
TabularMSA[Protein]
----------------------------------------------------------------------
Metadata:
    'CC': 'CBS domains are small intracellular modules mostly found in
           2 or four copies within a protein.'
Positional metadata:
    'SS_cons': <dtype: object>
Stats:
    sequence count: 5
    position count: 37
----------------------------------------------------------------------
MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV
MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA
MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI
EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
EVMLTDIPRLHINDPIMK..GFGMVINN......GFV

The sequence names are stored in the index:

>>> msa.index
Index(['O83071/192-246', 'O83071/259-312', 'O31698/18-71', 'O31698/88-139',
       'O31699/88-139'],
      dtype='object')

The TabularMSA has GF metadata stored in its metadata dictionary:

>>> msa.metadata
OrderedDict([('CC', 'CBS domains are small intracellular modules mostly found in 2 or four copies within a protein.')])

GC metadata is stored in the TabularMSA positional_metadata:

>>> msa.positional_metadata  
   SS_cons
0        C
1        C
2        C
3        C
4        C
5        H
6        H
7        H
8        H
9        H
...

GS metadata is stored in the sequence-specific metadata dictionary:

>>> msa[0].metadata
OrderedDict([('AC', 'O83071')])

GR metadata is stored in sequence-specific positional_metadata:

>>> msa[4].positional_metadata  
   AS IN
0   _  _
1   _  _
2   _  _
3   _  _
4   _  _
5   _  _
6   _  _
7   _  _
8   _  _
9   _  _
...

Let’s write this TabularMSA in Stockholm format:

>>> fh = StringIO()
>>> _ = msa.write(fh, format='stockholm')
>>> print(fh.getvalue())
# STOCKHOLM 1.0
#=GF CC CBS domains are small intracellular modules mostly found in 2 or four copies within a protein.
#=GS O83071/192-246 AC O83071
#=GS O31698/88-139 OS Bacillus subtilis
O83071/192-246         MTCRAQLIAVPRASSLAE..AIACAQKM....RVSRV
#=GR O83071/192-246 SA 999887756453524252..55152525....36463
O83071/259-312         MQHVSAPVFVFECTRLAY..VQHKLRAH....SRAVA
O31698/18-71           MIEADKVAHVQVGNNLEH..ALLVLTKT....GYTAI
O31698/88-139          EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
O31699/88-139          EVMLTDIPRLHINDPIMK..GFGMVINN......GFV
#=GR O31699/88-139 AS  ________________*____________________
#=GR O31699/88-139 IN  ____________1______________2_________
#=GC SS_cons           CCCCCHHHHHHHHHHHHH..EEEEEEEE....EEEEE
//

>>> fh.close()