Beta diversity measures (skbio.diversity.beta)

This package contains helper functions for working with scipy’s pairwise distance (pdist) functions in scikit-bio, and will eventually be expanded to contain pairwise distance/dissimilarity methods that are not implemented (or planned to be implemented) in scipy.

The functions in this package currently support applying pdist functions to all pairs of samples in a sample by observation count or abundance matrix and returning an skbio.DistanceMatrix object. This application is illustrated below for a few different forms of input.

Functions

pw_distances(counts[, ids, metric]) Compute distances between all pairs of columns in a counts matrix
pw_distances_from_table(table[, metric]) Compute distances between all pairs of samples in table

Examples

Create a table containing 7 OTUs and 6 samples:

>>> from skbio.diversity.beta import pw_distances
>>> import numpy as np
>>> data = [[23, 64, 14, 0, 0, 3, 1],
...         [0, 3, 35, 42, 0, 12, 1],
...         [0, 5, 5, 0, 40, 40, 0],
...         [44, 35, 9, 0, 1, 0, 0],
...         [0, 2, 8, 0, 35, 45, 1],
...         [0, 0, 25, 35, 0, 19, 0]]
>>> ids = list('ABCDEF')

Compute Bray-Curtis distances between all pairs of samples and return a DistanceMatrix object:

>>> bc_dm = pw_distances(data, ids, "braycurtis")
>>> print(bc_dm)
6x6 distance matrix
IDs:
A, B, C, D, E, F
Data:
[[ 0.          0.78787879  0.86666667  0.30927835  0.85714286  0.81521739]
 [ 0.78787879  0.          0.78142077  0.86813187  0.75        0.1627907 ]
 [ 0.86666667  0.78142077  0.          0.87709497  0.09392265  0.71597633]
 [ 0.30927835  0.86813187  0.87709497  0.          0.87777778  0.89285714]
 [ 0.85714286  0.75        0.09392265  0.87777778  0.          0.68235294]
 [ 0.81521739  0.1627907   0.71597633  0.89285714  0.68235294  0.        ]]

Compute Jaccard distances between all pairs of samples and return a DistanceMatrix object:

>>> j_dm = pw_distances(data, ids, "jaccard")
>>> print(j_dm)
6x6 distance matrix
IDs:
A, B, C, D, E, F
Data:
[[ 0.          0.83333333  1.          1.          0.83333333  1.        ]
 [ 0.83333333  0.          1.          1.          0.83333333  1.        ]
 [ 1.          1.          0.          1.          1.          1.        ]
 [ 1.          1.          1.          0.          1.          1.        ]
 [ 0.83333333  0.83333333  1.          1.          0.          1.        ]
 [ 1.          1.          1.          1.          1.          0.        ]]

Determine if the resulting distance matrices are significantly correlated by computing the Mantel correlation between them. Then determine if the p-value is significant based on an alpha of 0.05:

>>> from skbio.stats.distance import mantel
>>> r, p_value = mantel(j_dm, bc_dm)
>>> print(r)
-0.209362157621
>>> print(p_value < 0.05)
False

Compute PCoA for both distance matrices, and then find the Procrustes M-squared value that results from comparing the coordinate matrices.

>>> from skbio.stats.ordination import PCoA
>>> bc_pc = PCoA(bc_dm).scores()
>>> j_pc = PCoA(j_dm).scores()
>>> from skbio.stats.spatial import procrustes
>>> print(procrustes(bc_pc.site, j_pc.site)[2])
0.466134984787

All of this only gets interesting in the context of sample metadata, so let’s define some:

>>> import pandas as pd
>>> try:
...     # not necessary for normal use
...     pd.set_option('show_dimensions', True)
... except KeyError:
...     pass
>>> sample_md = {
...    'A': {'body_site': 'gut', 'subject': '1'},
...    'B': {'body_site': 'skin', 'subject': '1'},
...    'C': {'body_site': 'tongue', 'subject': '1'},
...    'D': {'body_site': 'gut', 'subject': '2'},
...    'E': {'body_site': 'tongue', 'subject': '2'},
...    'F': {'body_site': 'skin', 'subject': '2'}}
>>> sample_md = pd.DataFrame.from_dict(sample_md, orient='index')
>>> sample_md
  subject body_site
A       1       gut
B       1      skin
C       1    tongue
D       2       gut
E       2    tongue
F       2      skin

[6 rows x 2 columns]

We’ll put a quick 3D plotting function together. This function is adapted from the matplotlib gallery [R140].

>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits.mplot3d import Axes3D
>>> def scatter_3d(ord_results, df, column, color_map, title='', axis1=0,
...                axis2=1, axis3=2):
...    coord_matrix = ord_results.site.T
...    ids = ord_results.site_ids
...    colors = [color_map[df[column][id_]] for id_ in ord_results.site_ids]
...
...    fig = plt.figure()
...    ax = fig.add_subplot(111, projection='3d')
...
...    xs = coord_matrix[axis1]
...    ys = coord_matrix[axis2]
...    zs = coord_matrix[axis3]
...    plot = ax.scatter(xs, ys, zs, c=colors)
...
...    ax.set_xlabel('PC %d' % (axis1 + 1))
...    ax.set_ylabel('PC %d' % (axis2 + 1))
...    ax.set_zlabel('PC %d' % (axis3 + 1))
...    ax.set_xticklabels([])
...    ax.set_yticklabels([])
...    ax.set_zticklabels([])
...    ax.set_title(title)
...    return fig

Now let’s plot our PCoA results, coloring each sample by the subject it was taken from:

>>> fig = scatter_3d(bc_pc, sample_md, 'subject', {'1': 'b', '2': 'r'},
...                  'Samples colored by subject')

(Source code, png, hires.png, pdf)

../_images/skbio-diversity-beta-1.png

We don’t see any clustering/grouping of samples. If we were to instead color the samples by the body site they were taken from, we see that the samples form three separate groups:

>>> plt.close('all') # not necessary for normal use
>>> fig = scatter_3d(bc_pc, sample_md, 'body_site',
...                  {'gut': 'b', 'skin': 'r', 'tongue': 'g'},
...                  'Samples colored by body site')

(Source code, png, hires.png, pdf)

../_images/skbio-diversity-beta-2.png

Ordination techniques, such as PCoA, are useful for exploratory analysis. The next step is to quantify the strength of the grouping/clustering that we see in ordination plots. There are many statistical methods available to accomplish this; many operate on distance matrices. Let’s use ANOSIM to quantify the strength of the clustering we see in the ordination plots above, using our Bray-Curtis distance matrix and sample metadata.

First test the grouping of samples by subject:

>>> from skbio.stats.distance import ANOSIM
>>> anosim = ANOSIM(bc_dm, sample_md, column='subject')
>>> results = anosim(999)
>>> results.statistic
-0.4074074074074075
>>> results.p_value < 0.05
False

The negative value of ANOSIM’s R statistic indicates anti-clustering and the p-value is insignificant at an alpha of 0.05.

Now let’s test the grouping of samples by body site:

>>> anosim = ANOSIM(bc_dm, sample_md, column='body_site')
>>> results = anosim(999)
>>> results.statistic
1.0
>>> results.p_value < 0.1
True

The R statistic of 1.0 indicates strong separation of samples based on body site. The p-value is significant at an alpha of 0.1.