Dissimilarity and distance matrices (skbio.core.distance)

This module provides functionality for serializing, deserializing, and manipulating dissimilarity and distance matrices in memory. There are two matrix classes available, DissimilarityMatrix and DistanceMatrix. Both classes can store measures of difference/distinction between objects. A dissimilarity/distance matrix includes both a matrix of dissimilarities/distances (floats) between objects and unique IDs (object labels; strings) identifying each object in the matrix.

DissimilarityMatrix stores measures of dissimilarity between objects and does not require the dissimilarities to be symmetric (e.g., dissimilarities obtained using the Gain in PD measure [R14]). It is a more general container for differences than DistanceMatrix.

DistanceMatrix has the additional requirement that the differences it stores are symmetric (e.g., Euclidean or Hamming distances).

Note

DissimilarityMatrix can be used to store distances, but it is recommended to use DistanceMatrix to store this type of data as it provides an additional check for symmetry. A distance matrix is a dissimilarity matrix; this is modeled in the class design by having DistanceMatrix as a subclass of DissimilarityMatrix.
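
The symmetry requirement that separates the two classes can be illustrated without the library. The following is a minimal sketch in plain Python, not the check that DistanceMatrix actually performs; the helper name is_symmetric and the sample data are hypothetical.

```python
def is_symmetric(matrix):
    """Return True if matrix[i][j] == matrix[j][i] for every pair (i, j)."""
    n = len(matrix)
    # A dissimilarity/distance matrix must be square to begin with.
    if any(len(row) != n for row in matrix):
        return False
    return all(matrix[i][j] == matrix[j][i]
               for i in range(n) for j in range(i + 1, n))


# Symmetric data: suitable for a distance matrix.
distances = [[0.0, 0.5, 1.0],
             [0.5, 0.0, 0.75],
             [1.0, 0.75, 0.0]]

# Asymmetric data (a -> b differs from b -> a): only the more general
# dissimilarity container is appropriate here.
dissimilarities = [[0.0, 0.5],
                   [0.2, 0.0]]

print(is_symmetric(distances))        # True
print(is_symmetric(dissimilarities))  # False
```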

Classes

DissimilarityMatrix(data[, ids]) Store dissimilarities between objects.
DistanceMatrix(data[, ids]) Store distances between objects.

Functions

randdm(num_objects[, ids, constructor, ...]) Generate a distance matrix populated with random distances.

References

[R14] Faith, D. P. (1992). “Conservation evaluation and phylogenetic diversity”.

Examples

Assume we have the following delimited text file storing distances between three objects with IDs a, b, and c:

\ta\tb\tc
a\t0.0\t0.5\t1.0
b\t0.5\t0.0\t0.75
c\t1.0\t0.75\t0.0

Load a distance matrix from the file:

>>> from StringIO import StringIO
>>> from skbio.core.distance import DistanceMatrix
>>> dm_f = StringIO("\ta\tb\tc\n"
...                 "a\t0.0\t0.5\t1.0\n"
...                 "b\t0.5\t0.0\t0.75\n"
...                 "c\t1.0\t0.75\t0.0\n")
>>> dm = DistanceMatrix.from_file(dm_f)
>>> print(dm)
3x3 distance matrix
IDs:
a, b, c
Data:
[[ 0.    0.5   1.  ]
 [ 0.5   0.    0.75]
 [ 1.    0.75  0.  ]]

Access the distance (scalar) between objects 'a' and 'c':

>>> dm['a', 'c']
1.0

Get a row vector of distances between object 'b' and all other objects:

>>> dm['b']
array([ 0.5 ,  0.  ,  0.75])

NumPy indexing/slicing also works as expected. Extract the third column:

>>> dm[:, 2]
array([ 1.  ,  0.75,  0.  ])
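
ID-based lookups such as dm['a', 'c'] amount to translating each ID into a row or column position. A rough sketch of that idea in plain Python follows; the names ids, data, index_of, and lookup are illustrative and are not meant to reflect how the library stores or indexes its data.

```python
ids = ("a", "b", "c")
data = [[0.0, 0.5, 1.0],
        [0.5, 0.0, 0.75],
        [1.0, 0.75, 0.0]]

# Map each unique ID to its position along both axes.
index_of = {id_: i for i, id_ in enumerate(ids)}


def lookup(row_id, col_id=None):
    """Return one value for a pair of IDs, or a whole row for a single ID."""
    row = data[index_of[row_id]]
    if col_id is None:
        return list(row)
    return row[index_of[col_id]]


print(lookup("a", "c"))  # 1.0
print(lookup("b"))       # [0.5, 0.0, 0.75]
```

Because the IDs are unique, a dictionary gives an unambiguous position for each one.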

Serialize the distance matrix to a delimited text file:

>>> out_f = StringIO()
>>> dm.to_file(out_f)
>>> out_f.getvalue()
'\ta\tb\tc\na\t0.0\t0.5\t1.0\nb\t0.5\t0.0\t0.75\nc\t1.0\t0.75\t0.0\n'
>>> out_f.getvalue() == dm_f.getvalue()
True
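
The delimited layout itself is simple enough to handle by hand: a header row with an empty first cell followed by the IDs, then one row per object that starts with its ID. The sketch below parses and rewrites that layout in plain Python. It is an illustration under those assumptions rather than the code behind from_file and to_file, and the function names parse_dm and format_dm are hypothetical.

```python
def parse_dm(text):
    """Split tab-delimited text into (ids, rows of floats)."""
    lines = text.rstrip("\n").split("\n")
    # The header starts with an empty cell, followed by the IDs.
    ids = lines[0].split("\t")[1:]
    rows = []
    for line, expected_id in zip(lines[1:], ids):
        cells = line.split("\t")
        # Each row repeats its ID in the first column.
        if cells[0] != expected_id:
            raise ValueError("row ID %r does not match header ID %r"
                             % (cells[0], expected_id))
        rows.append([float(cell) for cell in cells[1:]])
    return ids, rows


def format_dm(ids, rows):
    """Write IDs and rows back out in the same tab-delimited layout."""
    out = ["\t" + "\t".join(ids)]
    for id_, row in zip(ids, rows):
        out.append(id_ + "\t" + "\t".join(str(value) for value in row))
    return "\n".join(out) + "\n"


text = ("\ta\tb\tc\n"
        "a\t0.0\t0.5\t1.0\n"
        "b\t0.5\t0.0\t0.75\n"
        "c\t1.0\t0.75\t0.0\n")
ids, rows = parse_dm(text)
print(ids)                           # ['a', 'b', 'c']
print(format_dm(ids, rows) == text)  # True
```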

A distance matrix object can also be created from an existing numpy.array (or an array-like object, such as a nested Python list):

>>> import numpy as np
>>> data = np.array([[0.0, 0.5, 1.0],
...                  [0.5, 0.0, 0.75],
...                  [1.0, 0.75, 0.0]])
>>> ids = ["a", "b", "c"]
>>> dm_from_np = DistanceMatrix(data, ids)
>>> print(dm_from_np)
3x3 distance matrix
IDs:
a, b, c
Data:
[[ 0.    0.5   1.  ]
 [ 0.5   0.    0.75]
 [ 1.    0.75  0.  ]]
>>> dm_from_np == dm
True

IDs may be omitted when constructing a dissimilarity/distance matrix. Monotonically increasing integers (cast as strings) are then used automatically:

>>> dm = DistanceMatrix(data)
>>> dm.ids
('0', '1', '2')
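
The default labels shown above can be reproduced with a one-line generator. This is only a sketch of the naming scheme, with a hypothetical helper default_ids; it is not taken from the library's source.

```python
def default_ids(num_objects):
    """Return '0', '1', ... as a tuple of strings, one per object."""
    return tuple(str(i) for i in range(num_objects))


print(default_ids(3))  # ('0', '1', '2')
```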