# Distance-based statistics (skbio.math.stats.distance)

This package contains various statistical methods that operate on distance matrices, often relating distances (e.g., community distances) to categorical and/or continuous variables of interest (e.g., gender or age). Methods are also provided for comparing distance matrices (e.g., computing the correlation between two or more distance matrices using the Mantel test).

## Categorical Variable Stats

| Class | Description |
| --- | --- |
| ANOSIM(distance_matrix, grouping[, column]) | ANOSIM statistical method executor. |
| PERMANOVA(distance_matrix, grouping[, column]) | PERMANOVA statistical method executor. |
| CategoricalStatsResults(short_method_name, ...) | Statistical method results container. |

## Continuous Variable Stats

| Function | Description |
| --- | --- |
| bioenv(distance_matrix, data_frame[, columns]) | Find subset of variables maximally correlated with distances. |

## Distance Matrix Comparisons

| Function | Description |
| --- | --- |
| mantel(x, y[, method, permutations, alternative]) | Compute correlation between distance matrices using the Mantel test. |
| pwmantel(dms[, labels, strict, lookup, ...]) | Run Mantel tests for every pair of distance matrices. |
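
As a rough illustration of what a Mantel test computes, the sketch below correlates the upper triangles of two distance matrices and estimates a p-value by permuting the rows and columns of one matrix together. It is not the code behind mantel: the helper names (upper_triangle, pearson, mantel_sketch), the choice of Pearson correlation, the one-sided "greater" alternative, and the (hits + 1) / (permutations + 1) p-value convention are all assumptions made for the example. The second matrix y is toy data built by adding 1 to every off-diagonal entry of x.

```python
import math
import random


def upper_triangle(dm):
    """Flatten the strict upper triangle of a square matrix."""
    n = len(dm)
    return [dm[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    mean_a = sum(a) / float(len(a))
    mean_b = sum(b) / float(len(b))
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def mantel_sketch(x, y, permutations=999, seed=0):
    """Illustrative one-sided Mantel test (hypothetical helper, not skbio's)."""
    rng = random.Random(seed)
    y_flat = upper_triangle(y)
    observed = pearson(upper_triangle(x), y_flat)
    n = len(x)
    hits = 0
    for _ in range(permutations):
        order = list(range(n))
        rng.shuffle(order)
        # Permute rows and columns of x with the same ordering.
        x_perm = [[x[i][j] for j in order] for i in order]
        # Small tolerance guards against floating-point noise in ties.
        if pearson(upper_triangle(x_perm), y_flat) >= observed - 1e-12:
            hits += 1
    # Assumed convention: count the observed arrangement as one permutation.
    p_value = (hits + 1) / float(permutations + 1)
    return observed, p_value


x = [[0, 1, 1, 4],
     [1, 0, 3, 2],
     [1, 3, 0, 3],
     [4, 2, 3, 0]]
# Toy second matrix: every off-diagonal distance of x shifted by 1.
y = [[0, 2, 2, 5],
     [2, 0, 4, 3],
     [2, 4, 0, 4],
     [5, 3, 4, 0]]

statistic, p_value = mantel_sketch(x, y, permutations=99)
print(round(statistic, 6))
```

Because y is a shifted copy of x, the correlation is 1 up to floating-point rounding; the p-value depends on the random permutations drawn.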

## Examples

Load a 4x4 distance matrix and a grouping vector that assigns the four objects to two groups. Note that these statistical methods require symmetric distances:

>>> from skbio.core.distance import DistanceMatrix
>>> dm = DistanceMatrix([[0, 1, 1, 4],
...                      [1, 0, 3, 2],
...                      [1, 3, 0, 3],
...                      [4, 2, 3, 0]],
...                     ['s1', 's2', 's3', 's4'])
>>> grouping = ['Group1', 'Group1', 'Group2', 'Group2']


Create an ANOSIM instance and run the method with 99 permutations:

>>> import numpy as np
>>> np.random.seed(0) # Make output deterministic; not necessary for normal use
>>> from skbio.math.stats.distance import ANOSIM
>>> anosim = ANOSIM(dm, grouping)
>>> results = anosim(99)
>>> print(results)
Method name  Sample size  Number of groups  R statistic  p-value  Number of permutations
ANOSIM            4                 2         0.25     0.67                      99
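
To see where the R statistic of 0.25 comes from, the following standard-library sketch ranks the six pairwise distances (tied distances share the mean of their ranks) and applies the usual ANOSIM definition, R = (mean between-group rank - mean within-group rank) / (M / 2), where M is the number of pairs. The function names average_ranks and anosim_r are hypothetical helpers written for this illustration; they are not part of skbio, and the library may organize the computation differently.

```python
def average_ranks(values):
    """Rank values starting at 1; tied values get the mean of their ranks."""
    ranks = []
    for x in values:
        smaller = sum(1 for v in values if v < x)
        equal = sum(1 for v in values if v == x)
        ranks.append(smaller + (equal + 1) / 2.0)
    return ranks


def anosim_r(dm, grouping):
    """Hypothetical helper: ANOSIM R from a square matrix and group labels."""
    n = len(dm)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    ranks = average_ranks([dm[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs)
              if grouping[i] == grouping[j]]
    between = [r for r, (i, j) in zip(ranks, pairs)
               if grouping[i] != grouping[j]]
    mean_within = sum(within) / float(len(within))
    mean_between = sum(between) / float(len(between))
    # Divide by M / 2, where M = n * (n - 1) / 2 is the number of pairs.
    return (mean_between - mean_within) / (len(pairs) / 2.0)


dm = [[0, 1, 1, 4],
      [1, 0, 3, 2],
      [1, 3, 0, 3],
      [4, 2, 3, 0]]
grouping = ['Group1', 'Group1', 'Group2', 'Group2']

r_stat = anosim_r(dm, grouping)
print(r_stat)  # 0.25
```

The within-group ranks are 1.5 and 4.5 (mean 3.0), the between-group ranks are 1.5, 6, 4.5 and 3 (mean 3.75), so R = 0.75 / 3 = 0.25.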


An existing instance can be rerun. Here ANOSIM is rerun with 999 permutations. The R statistic is the same as before because it is computed from the original, unpermuted grouping; only the p-value depends on the permutations:

>>> results = anosim(999)
>>> print(results)
Method name  Sample size  Number of groups  R statistic  p-value  Number of permutations
ANOSIM            4                 2         0.25    0.667                     999
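
The two p-values above are close to 2/3, and for a data set this small the reason can be checked exhaustively. The sketch below is an illustration using only the standard library, not skbio's implementation: it recomputes R for every ordering of the four group labels and counts how many are at least as large as the observed value. The helper r_statistic is a hypothetical name, and the comparison "greater than or equal to the observed R" is an assumed convention for the permutation test.

```python
from itertools import permutations

dm = [[0, 1, 1, 4],
      [1, 0, 3, 2],
      [1, 3, 0, 3],
      [4, 2, 3, 0]]
grouping = ['Group1', 'Group1', 'Group2', 'Group2']

n = len(dm)
pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
dists = [dm[i][j] for i, j in pairs]
# Average rank of each distance; tied distances share the mean of their ranks.
ranks = [sum(1 for d in dists if d < x)
         + (sum(1 for d in dists if d == x) + 1) / 2.0
         for x in dists]


def r_statistic(labels):
    """Hypothetical helper: ANOSIM R for one assignment of labels."""
    within = [r for r, (i, j) in zip(ranks, pairs) if labels[i] == labels[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if labels[i] != labels[j]]
    mean_within = sum(within) / float(len(within))
    mean_between = sum(between) / float(len(between))
    return (mean_between - mean_within) / (len(pairs) / 2.0)


observed = r_statistic(grouping)
all_stats = [r_statistic(labels) for labels in permutations(grouping)]
as_extreme = sum(1 for s in all_stats if s >= observed)
fraction = as_extreme / float(len(all_stats))

print(sorted(set(all_stats)))  # [-0.875, 0.25, 0.625]
print(as_extreme, len(all_stats))  # 16 24
```

Only three distinct R values are possible, and two of them are at least 0.25, so 16 of the 24 label orderings (2/3) are as extreme as the observed one. A random permutation test therefore converges toward a p-value of about 0.667, consistent with the outputs shown for 99 and 999 permutations.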


To suppress calculation of the p-value and only obtain the R statistic, specify zero permutations:

>>> results = anosim(0)
>>> print(results)
Method name  Sample size  Number of groups  R statistic  p-value  Number of permutations
ANOSIM            4                 2         0.25      N/A                       0


A statistical results object can also format its results as delimited text. This is useful, for example, if you want to view the results in a spreadsheet program such as Excel:

>>> print(results.summary(delimiter=','))
Method name,Sample size,Number of groups,R statistic,p-value,Number of permutations
ANOSIM,4,2,0.25,N/A,0
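
The delimited form is also easy to consume programmatically. The following sketch uses only the standard library and works on the summary text copied from the output above, rather than on a live results object, so it runs without skbio installed. The variable names are chosen for the example.

```python
import csv
import io

# Summary text copied verbatim from the comma-delimited output above.
summary = (
    "Method name,Sample size,Number of groups,R statistic,p-value,"
    "Number of permutations\n"
    "ANOSIM,4,2,0.25,N/A,0\n"
)

rows = list(csv.reader(io.StringIO(summary)))
header, values = rows[0], rows[1]
# Map each column name to its (string) value.
record = dict(zip(header, values))

print(record["R statistic"])  # 0.25
print(record["p-value"])      # N/A
```

All values come back as strings, so a field such as the p-value needs an explicit check for "N/A" before converting it to a number.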


Individual values can be accessed as attributes of the CategoricalStatsResults object:

>>> results.statistic
0.25
>>> print(results.p_value)
None
>>> results.permutations
0


You can also provide a pandas.DataFrame and a column denoting the grouping instead of a grouping vector. The following data frame’s Group column specifies the same grouping as the vector we used in all of the previous examples:

>>> np.random.seed(0) # Make output deterministic; not necessary for normal use
>>> import pandas as pd
>>> df = pd.DataFrame.from_dict(
...     {'Group': {'s2': 'Group1', 's3': 'Group2', 's4': 'Group2',
...                's5': 'Group3', 's1': 'Group1'}})
>>> anosim = ANOSIM(dm, df, column='Group')
>>> results = anosim(99)
>>> print(results)
Method name  Sample size  Number of groups  R statistic  p-value  Number of permutations
ANOSIM            4                 2         0.25     0.67                      99


These results match those of the first example above.

Note that when providing a data frame, the ordering of rows and/or columns does not affect the grouping vector that is extracted. The data frame must be indexed by the distance matrix IDs (i.e., the row labels must be distance matrix IDs).

If IDs (rows) are present in the data frame but not in the distance matrix, they are ignored (the previous example’s s5 ID illustrates this behavior). Thus, the data frame can be a superset of the distance matrix IDs. Note that the reverse is not true: IDs in the distance matrix must be present in the data frame or an error will be raised.
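
The alignment rules above can be sketched with plain pandas. The helper extract_grouping below is hypothetical and written only to illustrate the described behavior (extra data frame rows are ignored, missing distance matrix IDs are an error); it is not the function skbio uses, and the exception type chosen here is an assumption.

```python
import pandas as pd


def extract_grouping(ids, df, column):
    """Hypothetical helper: one label per distance matrix ID, in ID order."""
    missing = [i for i in ids if i not in df.index]
    if missing:
        # Assumed error type; the library may raise something different.
        raise ValueError("IDs missing from data frame: %r" % (missing,))
    # Selecting by the ID list reorders rows and drops extras such as 's5'.
    return df.loc[list(ids), column].tolist()


df = pd.DataFrame.from_dict(
    {'Group': {'s2': 'Group1', 's3': 'Group2', 's4': 'Group2',
               's5': 'Group3', 's1': 'Group1'}})
ids = ['s1', 's2', 's3', 's4']

grouping = extract_grouping(ids, df, 'Group')
print(grouping)  # ['Group1', 'Group1', 'Group2', 'Group2']
```

Calling extract_grouping with an ID that is absent from the data frame, for example ['s1', 's9'], raises the ValueError instead of silently producing an incomplete grouping.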