As project size increases, consistency of the code base and documentation becomes more important. We therefore provide guidelines for code and documentation that is contributed to scikit-bio. Our goal is to create a consistent code base where:
As scikit-bio is in pre-alpha release stage, our coding guidelines are presented here as a working draft. These guidelines are requirements for all code submitted to scikit-bio, but at this stage the guidelines themselves are malleable. If you disagree with something, or have a suggestion for something new to include, you should create an issue to initiate a discussion.
We adhere to the PEP 8 Python coding guidelines for code and documentation standards. Before submitting any code to scikit-bio, you should read these carefully and apply the guidelines in your code.
The following abbreviations can be considered well known and used freely within mixed-name variables, but some should not be used by themselves because they would conflict with common functions or Python built-ins, or raise an exception (in is a reserved keyword, so using it as a name is a syntax error). Do not use the following by themselves as variable names: dir, exp (a common math module function), in, max, and min. They can, however, be used as part of a name, e.g., matrix_exp.
Full | Abbreviated |
---|---|
alignment | aln |
archaeal | arch |
auxiliary | aux |
bacterial | bact |
citation | cite |
current | curr |
database | db |
dictionary | dict |
directory | dir |
distance matrix | dm |
end of file | eof |
eukaryotic | euk |
filepath | fp |
frequency | freq |
expected | exp |
index | idx |
input | in |
maximum | max |
minimum | min |
mitochondrial | mt |
number | num |
observation | obs |
observed | obs |
original | orig |
output | out |
parameter | param |
phylogeny | phylo |
previous | prev |
probability | prob |
protein | prot |
record | rec |
reference | ref |
sequence | seq |
standard deviation | stdev |
statistics | stats |
string | str |
structure | struct |
temporary | temp |
taxa | tax |
taxon | tax |
taxonomic | tax |
taxonomy | tax |
variance | var |
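The naming rule above can be illustrated with a short sketch. All variable names and the file path below are made up for illustration; the point is only that abbreviations are safe inside a longer name, while using one such as max on its own hides the built-in.

```python
import keyword

# Abbreviations combined with other words are fine as variable names.
seq_idx = 3
max_freq = 0.75
dm_fp = "distances.txt"  # hypothetical file path, for illustration only

# `in` is a reserved keyword, so it can never be used as a variable name.
in_is_keyword = keyword.iskeyword("in")


def shadowing_demo():
    """Show what goes wrong when a local name shadows the built-in max."""
    max = 10  # shadows the built-in inside this function
    try:
        return max([1, 2, 3])
    except TypeError as error:
        # The int stored in `max` is not callable.
        return "TypeError: {}".format(error)


result = shadowing_demo()
```

Here result holds the text of a TypeError, because the local integer replaced the built-in function for the rest of the function body.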
import numpy as np
import numpy.testing as npt
import pandas as pd
from matplotlib import pyplot as plt
The structure of your module should be similar to the example below. scikit-bio uses the NumPy doc standard for documentation. Our doc/README.md explains how to write your docstrings using the NumPy doc standards for scikit-bio:
r"""
Numbers (:mod:`skbio.core.numbers`)
===================================
.. currentmodule:: skbio.core.numbers
Numbers holds a sequence of numbers, and defines several statistical
operations (mean, stdev, etc.) FrequencyDistribution holds a mapping from
items (not necessarily numbers) to counts, and defines operations such as
Shannon entropy and frequency normalization.
Classes
-------
.. autosummary::
   :toctree: generated/

   Numbers
"""
# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------
from __future__ import absolute_import, division, print_function
import numpy as np
from random import choice, random
from utils import indices
class Numbers(list):
    pass  # much code deleted


class FrequencyDistribution(dict):
    pass  # much code deleted
Always update the comments when the code changes. Incorrect comments are far worse than no comments, since they are actively misleading.
Comments should say more than the code itself. Examine your comments carefully: they may indicate that you’d be better off rewriting your code (especially if renaming your variables would allow you to get rid of the comment.) In particular, don’t scatter magic numbers and other constants that have to be explained through your code. It’s far better to use variables whose names are self-documenting, especially if you use the same constant more than once. Also, think about making constants into class or instance data, since it’s all too common for ‘constants’ to need to change or to be needed in several methods.
Wrong:
win_size -= 20  # decrement win_size by 20
OK:
win_size -= 20  # leave space for the scroll bar
Right:
self._scroll_bar_size = 20
win_size -= self._scroll_bar_size
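As a minimal sketch of the advice about turning constants into class data, the hypothetical Window class below (not part of scikit-bio) keeps the scroll bar size in one named place, so the comment explains intent rather than repeating the arithmetic.

```python
class Window(object):
    """Hypothetical window class, used only to illustrate the point."""

    # Class-level constant: the name documents the value, and there is
    # exactly one place to change it.
    scroll_bar_size = 20

    def __init__(self, width):
        self.width = width

    def usable_width(self):
        # leave space for the scroll bar
        return self.width - self.scroll_bar_size


narrow = Window(100)
```

A subclass or instance that needs a different scroll bar size can override scroll_bar_size without touching usable_width.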
Use comments starting with #, not strings, inside blocks of code.
Start each method, class and function with a docstring using triple double quotes ("""). Make sure the docstring follows the NumPy doc standard.
Always update the docstring when the code changes. Like outdated comments, outdated docstrings can waste a lot of time. “Correct examples are priceless, but incorrect examples are worse than worthless.” Jim Fulton.
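For orientation, here is a sketch of what a docstring in the NumPy doc style might look like. The function gc_content is a made-up example, not a scikit-bio function, and the section layout shown is only the commonly used subset of the standard.

```python
def gc_content(seq):
    """Compute the fraction of G and C characters in a sequence.

    Parameters
    ----------
    seq : str
        Nucleotide sequence; case is ignored.

    Returns
    -------
    float
        Fraction of the characters in `seq` that are G or C.

    Raises
    ------
    ValueError
        If `seq` is empty.

    Examples
    --------
    >>> gc_content("GGCA")
    0.75

    """
    if not seq:
        raise ValueError("seq must not be empty")
    seq = seq.upper()
    gc_count = seq.count("G") + seq.count("C")
    # float() keeps true division under both Python 2 and Python 3.
    return float(gc_count) / len(seq)
```

Because the Examples section is written as an interactive session, it can be checked with the standard doctest module, which helps keep the example correct when the code changes.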
There are several different approaches for testing code in Python: nose, unittest and numpy.testing. Their purpose is the same: to check that execution of code given some input produces a specified output. The cases to which the approaches lend themselves are different.
Whatever approach is employed, the general principle is that every line of code should be tested. It is critical that your code be fully tested before you draw conclusions from results it produces. For scientific work, bugs don’t just mean unhappy users who you’ll never actually meet: they may mean retracted publications.
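To make the "input produces a specified output" idea concrete, here is a small sketch using unittest from the standard library. The function observed_otus and the helper run_tests are hypothetical names chosen for this example; they are not scikit-bio APIs.

```python
import unittest


def observed_otus(counts):
    """Hypothetical helper: count the entries that are greater than zero."""
    if any(count < 0 for count in counts):
        raise ValueError("counts must be non-negative")
    return sum(1 for count in counts if count > 0)


class ObservedOtusTests(unittest.TestCase):
    def test_counts_nonzero_entries(self):
        # Known input, specified output.
        self.assertEqual(observed_otus([4, 0, 1, 0]), 2)

    def test_rejects_negative_counts(self):
        # Invalid input should fail loudly rather than return a number.
        with self.assertRaises(ValueError):
            observed_otus([1, -1])


def run_tests():
    """Run the test case above and return the unittest result object."""
    loader = unittest.defaultTestLoader
    suite = loader.loadTestsFromTestCase(ObservedOtusTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Calling run_tests() returns a result object whose wasSuccessful() method reports whether every test passed. In a real test module the tests would normally be discovered and run by the test runner instead.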
Tests are an opportunity to invent the interface(s) you want. Write the test for a method before you write the method: often, this helps you figure out what you would want to call it and what parameters it should take. It’s OK to write the tests a few methods at a time, and to change them as your ideas about the interface change. However, you shouldn’t change them once you’ve told other people what the interface is.
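A minimal sketch of writing the test first, assuming a hypothetical function named reverse_complement: the test is written before the implementation and fixes the name, the parameter, and the expected behaviour.

```python
# Step 1: write the test. It decides what the function is called,
# what it takes, and what it returns.
def test_reverse_complement():
    assert reverse_complement("ATGC") == "GCAT"
    assert reverse_complement("") == ""


# Step 2: write the implementation so that the test passes.
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq))
```

The test can be defined above the function because the name is only looked up when the test runs. Until step 2 exists, running the test fails with a NameError, which is the expected starting point.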
Never treat prototypes as production code. It’s fine to write prototype code without tests to try things out, but when you’ve figured out the algorithm and interfaces you must rewrite it with tests to consider it finished. Often, this helps you decide what interfaces and functionality you actually need and what you can get rid of.
“Code a little, test a little”. For production code, write a couple of tests, then a couple of methods, then a couple more tests, then a couple more methods, then maybe change some of the names or generalize some of the functionality. If you have a huge amount of code where ‘all you have to do is write the tests’, you’re probably closer to 30% done than 90%. Testing vastly reduces the time spent debugging, since whatever went wrong has to be in the code you wrote since the last test suite. And remember to use Python’s interactive interpreter for quick checks of syntax and ideas.
Run the test suite when you change anything. Even if a change seems trivial, it will only take a couple of seconds to run the tests and then you’ll be sure. This can eliminate long and frustrating debugging sessions where the change turned out to have been made long ago, but didn’t seem significant at the time. Note that tests are also executed using Travis CI; see the relevant section of this document for further discussion.
$ nosetests -v
skbio.maths.diversity.alpha.tests.test_ace.test_ace ... ok
test_berger_parker_d (skbio.maths.diversity.alpha.tests.test_base.BaseTests) ... ok
----------------------------------------------------------------------
Ran 2 tests in 0.1234s
OK
#!/usr/bin/env python
from __future__ import division

# ----------------------------------------------------------------------------
# Copyright (c) 2013--, scikit-bio development team.
#
# Distributed under the terms of the Modified BSD License.
#
# The full license is in the file COPYING.txt, distributed with this software.
# ----------------------------------------------------------------------------

import numpy as np
from nose.tools import assert_almost_equal, assert_raises

from skbio.math.diversity.alpha.ace import ace


def test_ace():
    assert_almost_equal(ace(np.array([2, 0])), 1.0)
    assert_almost_equal(ace(np.array([12, 0, 9])), 2.0)
    assert_almost_equal(ace(np.array([12, 2, 8])), 3.0)
    assert_almost_equal(ace(np.array([12, 2, 1])), 4.0)
    assert_almost_equal(ace(np.array([12, 1, 2, 1])), 7.0)
    assert_almost_equal(ace(np.array([12, 3, 2, 1])), 4.6)
    assert_almost_equal(ace(np.array([12, 3, 6, 1, 10])), 5.62749672)

    # Just returns the number of OTUs when all are abundant.
    assert_almost_equal(ace(np.array([12, 12, 13, 14])), 4.0)

    # Border case: only singletons and 10-tons, no abundant OTUs.
    assert_almost_equal(ace([0, 1, 1, 0, 0, 10, 10, 1, 0, 0]), 9.35681818182)


def test_ace_only_rare_singletons():
    with assert_raises(ValueError):
        ace([0, 0, 43, 0, 1, 0, 1, 42, 1, 43])


if __name__ == '__main__':
    import nose
    nose.runmodule()
Commit messages are a useful way to document the changes being made to a project. Each commit also records who made the change and when, all of which is relevant when tracing back problems.
The most important metadata in a commit message is (arguably) the author’s name and e-mail address. GitHub uses this information to attribute your contributions to a project; see, for example, the scikit-bio list of contributors.
Follow this guide to set up your system, and make sure the e-mail address you use in this step is the one associated with your GitHub account.
After doing this you should see your name and e-mail when you run the following commands:
$ git config --global user.name
Yoshiki Vázquez Baeza
$ git config --global user.email
yoshiki89@gmail.com
In general, commit messages should adhere to NumPy’s guidelines. Following them will help you structure your changes better, i.e., a bug fix goes in one commit, followed by a commit updating the test suite and a final commit that updates the documentation as needed.
GitHub provides a set of handy features that link a commit message to a ticket in the issue tracker. This is especially helpful because you can close an issue automatically when the change is merged into the main repository, which reduces the work needed to make sure outdated issues are not left open.