I/O Registry (skbio.io.registry)

Classes

IORegistry() Create a registry of formats and implementations which map to classes.
Format(name[, encoding, newline]) Defines a format on which readers/writers/sniffer can be registered.

Functions

create_format(self, *args, **kwargs) A simple factory for creating new file formats.

Exceptions

DuplicateRegistrationError Raised when a function is already registered in skbio.io.
InvalidRegistrationError Raised if a function doesn’t meet the expected API of its registration.

Creating a new format for scikit-bio

scikit-bio makes it simple to add new file formats to its I/O registry. scikit-bio maintains a singleton of the IORegistry class called io_registry. This is where all scikit-bio file formats are registered. One could also instantiate their own IORegistry, but that is not the focus of this tutorial.

The first step to creating a new format is to add a submodule in skbio/io/format/ named after the file format you are implementing. For example, if the format you are implementing is called myformat then you would create a file called skbio/io/format/myformat.py.

The next step is to import the create_format() factory from skbio.io. This will allow you to create a new Format object that io_registry will know about.

Ideally, the object returned by create_format() should have the same name as your module file. For example:

from skbio.io import create_format

myformat = create_format('myformat')

The myformat object is what we will use to register our new functionality. At this point you should evaluate whether your format is binary or text. If your format is binary, your create_format() call should look like this:

myformat = create_format('myformat', encoding='binary')

Alternatively, if your format is text and has a specific encoding or newline handling, you can specify that as well:

myformat = create_format('myformat', encoding='ascii', newline='\n')

This will ensure that our registry will open files with a default encoding of ‘ascii’ for ‘myformat’ and expect all newlines to be ‘\n’ characters.
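To make the effect of these two arguments concrete, here is a minimal sketch that uses only the Python standard library. It assumes the registry applies `encoding` and `newline` with the same semantics as the standard text layer (`io.TextIOWrapper`); that is an assumption about the internals, not something this page states.

```python
import io

raw = b"first\r\nsecond\n"

# newline='\n': only '\n' terminates a line and nothing is translated,
# so the '\r' of a Windows-style line ending stays in the data.
strict = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline="\n")
strict_lines = strict.readlines()

# newline=None (the standard-library default): '\r\n' is translated to '\n'.
universal = io.TextIOWrapper(io.BytesIO(raw), encoding="ascii", newline=None)
universal_lines = universal.readlines()

# encoding='ascii' rejects any byte outside the 7-bit range.
try:
    io.TextIOWrapper(io.BytesIO(b"caf\xc3\xa9\n"), encoding="ascii").read()
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True
```

If your format must tolerate files produced on several platforms, leaving `newline` unset is probably the safer choice.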

Having worked out these details, we are ready to register the actual functionality of our format (e.g., sniffer, readers, and writers).

To create a sniffer, apply the following decorator to your sniffer function:

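@myformat.sniffer()
def _myformat_sniffer(fh):
    # do something with `fh` to determine the membership of the file,
    # then return a (bool, dict) tuple: whether the file matches, plus any
    # keyword arguments inferred for the readers
    return False, {}  # placeholder result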

For further details on sniffer functions see Format.sniffer().
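As an illustration of what the body of a sniffer might contain, here is a sketch for an invented file layout whose first line looks like `#myformat sep=,`. The header syntax, the function name `sniff_myformat`, and the `sep` keyword are all hypothetical. The `(bool, dict)` return shape is assumed from the Format.sniffer() contract, so check that reference before relying on it. The function is left undecorated so that it runs without scikit-bio.

```python
import io


def sniff_myformat(fh):
    """Return (is_member, inferred_kwargs) for a hypothetical format."""
    first_line = fh.readline()
    if not first_line.startswith("#myformat"):
        return False, {}

    # Hypothetical header: "#myformat sep=," announces the field separator.
    kwargs = {}
    for token in first_line.split()[1:]:
        key, _, value = token.partition("=")
        if key == "sep" and value:
            kwargs["sep"] = value
    return True, kwargs


match = sniff_myformat(io.StringIO("#myformat sep=,\na,b\n"))
no_match = sniff_myformat(io.StringIO("hello world\n"))
```

Reading only the first line keeps the sniffer cheap, which matters when it runs against every file whose format is not given explicitly.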

Creating a reader is very similar, but has one difference:

@myformat.reader(SomeSkbioClass)
def _myformat_to_some_skbio_class(fh, kwarg1='default', extra=FileSentinel):
    # parse `fh` and return a SomeSkbioClass instance here
    # `extra` will also be an open filehandle if provided else None

Here we bound a function to a specific class. We also demonstrated using our FileSentinel object to indicate to the registry that this reader can take auxiliary files that should be handled in the same way as the primary file. For further details on reader functions see Format.reader().
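The parsing logic inside a reader is ordinary Python. The sketch below fills in the template for an invented tab-separated layout. `Table` is a hypothetical stand-in for SomeSkbioClass, and `read_myformat` and `sep` are illustrative names. To stay runnable without scikit-bio it is undecorated and uses `extra=None`; a registered reader would use the decorator and the FileSentinel default shown in the template.

```python
import io


class Table:
    """Hypothetical stand-in for SomeSkbioClass."""

    def __init__(self, rows, notes=None):
        self.rows = rows
        self.notes = notes


def read_myformat(fh, sep="\t", extra=None):
    """Parse `fh` into a Table; `extra` is an optional auxiliary handle."""
    rows = []
    for line in fh:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        rows.append(tuple(line.split(sep)))
    # Mirror the template: `extra` is an open handle if provided, else None.
    notes = extra.read() if extra is not None else None
    return Table(rows, notes)


table = read_myformat(io.StringIO("#comment\na\t1\n\nb\t2\n"))
with_notes = read_myformat(io.StringIO("a\t1\n"), extra=io.StringIO("side data"))
```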

Creating a writer is about the same:

@myformat.writer(SomeSkbioClass)
def _some_skbio_class_to_myformat(obj, fh, kwarg1='whatever',
                                  extra=FileSentinel):
    # write the contents of `obj` into `fh` and whatever else into `extra`
    # do not return anything, it will be ignored

This mirrors the reader above, in reverse: the object being written is the first parameter and the file is the second. For further details on writer functions see Format.writer().
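A writer body can be sketched the same way. In the example below the object is simply a list of tuples and the output layout is invented; `write_myformat` and `sep` are hypothetical names, and `extra=None` again replaces the FileSentinel default so that the snippet runs without scikit-bio.

```python
import io


def write_myformat(obj, fh, sep="\t", extra=None):
    """Write each row of `obj` to `fh`; put a row count into `extra`."""
    for row in obj:
        fh.write(sep.join(str(field) for field in row) + "\n")
    if extra is not None:
        extra.write("rows=%d\n" % len(obj))
    # No return statement: a writer's return value is ignored.


rows = [("a", 1), ("b", 2)]
main, side = io.StringIO(), io.StringIO()
result = write_myformat(rows, main, extra=side)
```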

Note

When raising errors in readers and writers, the error should be a subclass of FileFormatError specific to your new format.
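A format-specific error might be declared as in the sketch below. `MyFormatError` and `parse_count` are hypothetical names. The fallback class exists only so that the snippet runs where scikit-bio is not installed; in a real submodule you would import FileFormatError directly.

```python
try:
    from skbio.io import FileFormatError
except ImportError:
    class FileFormatError(Exception):
        """Stand-in used only when scikit-bio is unavailable."""


class MyFormatError(FileFormatError):
    """Raised when a file does not conform to the hypothetical 'myformat'."""


def parse_count(line):
    """Parse an integer field, reporting failures as a format error."""
    try:
        return int(line)
    except ValueError:
        raise MyFormatError(
            "expected an integer count, got %r" % line) from None
```

Callers can then catch either the specific error or the general FileFormatError.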

Once you are satisfied with the functionality, you will need to ensure that skbio/io/__init__.py contains an import of your new submodule so the decorators are executed. Add the call import_module('skbio.io.format.myformat'), with your module name, to the existing list.
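The reason the import matters is that decorators run when a module is first imported, so a module that is never imported never registers anything. The toy example below shows this with a throwaway module written to a temporary directory. The module name `toy_format_module` and its `register` decorator are invented for illustration and are not part of scikit-bio.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Source of a toy module that registers a function via a decorator.
SOURCE = '''
REGISTERED = []

def register(func):
    REGISTERED.append(func.__name__)
    return func

@register
def _toy_sniffer(fh):
    return False, {}
'''

with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "toy_format_module.py").write_text(SOURCE)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # Importing executes the module body, which runs the decorator.
        module = importlib.import_module("toy_format_module")
    finally:
        sys.path.remove(tmp)

registered = list(module.REGISTERED)
```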

Note

Because scikit-bio handles all of the I/O boilerplate, you only need to unit-test the actual business logic of your readers, writers, and sniffers.
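A unit test for such logic can feed an in-memory handle straight to the function. The sketch below uses the standard `unittest` module; `_count_records` is a hypothetical piece of business logic and not a scikit-bio function.

```python
import io
import unittest


def _count_records(fh):
    """Hypothetical logic: count lines that are neither blank nor comments."""
    return sum(1 for line in fh if line.strip() and not line.startswith("#"))


class TestCountRecords(unittest.TestCase):
    def test_skips_comments_and_blanks(self):
        fh = io.StringIO("#header\nrec1\n\nrec2\n")
        self.assertEqual(_count_records(fh), 2)

    def test_empty_file(self):
        self.assertEqual(_count_records(io.StringIO("")), 0)


# Run the tests programmatically and keep the runner's report out of stdout.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestCountRecords)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
```

No files are opened and the registry is not involved, which keeps the tests fast and focused on the parsing rules.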

Reserved Keyword Arguments

The following keyword args may not be used when defining new readers or writers as they already have special meaning to the registry system:

  • format
  • into
  • verify
  • mode
  • encoding
  • errors
  • newline
  • compression
  • compresslevel

The following are not yet used but should be avoided as well:

  • auth
  • user
  • password
  • buffering
  • buffer_size
  • closefd
  • exclusive
  • append
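One way to guard against an accidental clash is to compare a function's parameter names with the two lists above. The helper below is a hypothetical convenience built on the standard `inspect` module; it is not part of scikit-bio, and the registry may perform its own checks.

```python
import inspect

# Names copied from the two lists above.
RESERVED = frozenset({
    "format", "into", "verify", "mode", "encoding", "errors",
    "newline", "compression", "compresslevel",
})
AVOID = frozenset({
    "auth", "user", "password", "buffering", "buffer_size",
    "closefd", "exclusive", "append",
})


def conflicting_kwargs(func):
    """Return the sorted parameter names of `func` that should not be used."""
    names = set(inspect.signature(func).parameters)
    return sorted(names & (RESERVED | AVOID))


def good_reader(fh, sep="\t"):
    """Example signature with no conflicts."""


def bad_reader(fh, mode="strict", append=False):
    """Example signature that uses one reserved and one discouraged name."""


good = conflicting_kwargs(good_reader)
bad = conflicting_kwargs(bad_reader)
```

A check like this could run in a test suite over every reader and writer in a new format module.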