Find subset of variables maximally correlated with distances.
Finds subsets of variables whose Euclidean distances (after scaling the variables; see the Notes section below for details) are maximally rank-correlated with the distance matrix. For example, the distance matrix might contain distances between communities, and the variables might be numeric environmental variables (e.g., pH). Correlation between the community distance matrix and the Euclidean environmental distance matrix is computed using Spearman’s rank correlation coefficient (\(\rho\)).
Subsets of environmental variables range in size from 1 to the total number of variables (inclusive). For example, if there are 3 variables, the “best” variable subsets will be computed for subset sizes 1, 2, and 3.
The “best” subset is chosen by computing the correlation between the community distance matrix and all possible Euclidean environmental distance matrices at the given subset size. The combination of environmental variables with maximum correlation is chosen as the “best” subset.
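The exhaustive search described above can be sketched as follows. This is an illustrative outline, not the library's implementation: the function name `best_subsets_sketch` and its arguments are hypothetical, the environmental data are assumed to be scaled already, and the community distances are assumed to be supplied in condensed form, in the same pair order that `scipy.spatial.distance.pdist` produces.

```python
from itertools import combinations

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def best_subsets_sketch(community_condensed, env, names):
    """Return (size, variables, rho) for the best subset at each size.

    Hypothetical helper for illustration only.

    community_condensed : 1-D array of pairwise community distances,
        ordered like the output of scipy's pdist.
    env : 2-D array (objects x variables), assumed already scaled.
    names : one variable name per column of env.
    """
    env = np.asarray(env, dtype=float)
    n_vars = env.shape[1]
    results = []
    for size in range(1, n_vars + 1):
        best_rho, best_vars = -np.inf, None
        # Evaluate every combination of variables at this subset size.
        for cols in combinations(range(n_vars), size):
            env_dists = pdist(env[:, list(cols)], metric="euclidean")
            rho, _ = spearmanr(community_condensed, env_dists)
            # A NaN rho (e.g. constant distances) never replaces the best.
            if rho > best_rho:
                best_rho, best_vars = rho, [names[c] for c in cols]
        results.append((size, best_vars, best_rho))
    return results
```

Applied to the scaled four-object data from the Examples section, this sketch selects pH at subset size 1, which agrees with the result reported there.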
Parameters:  distance_matrix : DistanceMatrix
data_frame : pandas.DataFrame
columns : iterable of strs, optional


Returns:  pandas.DataFrame

Raises:  TypeError
ValueError

See also
scipy.stats.spearmanr
Notes
See [R70] for the original method reference (originally called BIOENV). The general algorithm and interface are similar to vegan::bioenv, available in R’s vegan package [R71]. This method can also be found in PRIMER-E [R72] (originally called BIOENV, but now called BEST).
Warning
This method can take a long time to run if a large number of variables are specified, as all possible subsets are evaluated at each subset size.
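To make the cost concrete: evaluating every subset at every size from 1 to n means evaluating every non-empty subset of the n variables, which is \(2^n - 1\) subsets in total. The small helper below (a hypothetical name, not part of the library) simply counts them.

```python
from math import comb


def n_subsets_evaluated(n_vars):
    """Count the non-empty subsets of n_vars variables (hypothetical helper)."""
    # Sum of "n choose k" over k = 1..n, which equals 2**n_vars - 1.
    return sum(comb(n_vars, k) for k in range(1, n_vars + 1))
```

Each additional variable roughly doubles the number of distance matrices and correlations that must be computed.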
The variables are scaled before computing the Euclidean distance: each column is centered and then scaled by its standard deviation.
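A minimal sketch of this scaling step with pandas is shown below. The function name `scale_columns` is hypothetical, and the use of the sample standard deviation (pandas' default, `ddof=1`) is an assumption, since the text above does not say which estimator is used.

```python
import pandas as pd


def scale_columns(df):
    """Center each column, then divide by its standard deviation.

    Hypothetical helper for illustration only.
    """
    centered = df - df.mean()
    # Assumption: sample standard deviation (pandas default, ddof=1).
    return centered / df.std()
```

When every column has the same number of observations, choosing the population standard deviation instead would rescale all columns by the same constant, so all Euclidean distances would change by a common factor and the rank correlation would be unaffected.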
References
[R70]  Clarke, K. R. & Ainsworth, M. 1993. “A method of linking multivariate community structure to environmental variables”. Marine Ecology Progress Series, 92, 205-219.
[R71]  http://cran.r-project.org/web/packages/vegan/index.html
[R72]  http://www.primer-e.com/primer.htm
Examples
Import the functionality we’ll use in the following examples. The call to pd.set_option ensures consistent data frame formatting across different versions of pandas. This call is not necessary for normal use; it is only included here so that the doctests will pass.
>>> import pandas as pd
>>> from skbio.core.distance import DistanceMatrix
>>> from skbio.math.stats.distance import bioenv
>>> try:
...     # not necessary for normal use
...     pd.set_option('show_dimensions', True)
... except KeyError:
...     pass
Load a 4x4 community distance matrix:
>>> dm = DistanceMatrix([[0.0, 0.5, 0.25, 0.75],
...                      [0.5, 0.0, 0.1, 0.42],
...                      [0.25, 0.1, 0.0, 0.33],
...                      [0.75, 0.42, 0.33, 0.0]],
...                     ['A', 'B', 'C', 'D'])
Load a pandas.DataFrame with two environmental variables, pH and elevation:
>>> df = pd.DataFrame([[7.0, 400],
...                    [8.0, 530],
...                    [7.5, 450],
...                    [8.5, 810]],
...                   index=['A', 'B', 'C', 'D'],
...                   columns=['pH', 'Elevation'])
Note that the data frame is indexed with the same IDs ('A', 'B', 'C', and 'D') that are in the distance matrix. This is necessary in order to link the environmental variables (metadata) to each of the objects in the distance matrix. In this example, the IDs appear in the same order in both the distance matrix and data frame, but this is not necessary.
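The idea of linking by ID rather than by position can be illustrated with plain pandas. This sketch shows only label-based alignment; how bioenv performs the matching internally is not described here, and the names `ids`, `shuffled`, and `aligned` are made up for the illustration.

```python
import pandas as pd

# IDs in the order they appear in the distance matrix.
ids = ('A', 'B', 'C', 'D')

# The same metadata as above, but with rows in a different order.
shuffled = pd.DataFrame([[8.5, 810],
                         [7.0, 400],
                         [7.5, 450],
                         [8.0, 530]],
                        index=['D', 'A', 'C', 'B'],
                        columns=['pH', 'Elevation'])

# Selecting rows by label puts them in the distance matrix's ID order.
aligned = shuffled.loc[list(ids)]
```

Because rows are looked up by label, each object keeps its own pH and elevation values regardless of the original row order.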
Find the best subsets of environmental variables that are correlated with community distances:
>>> bioenv(dm, df)
               size  correlation
vars
pH                1     0.771517
pH, Elevation     2     0.714286

[2 rows x 2 columns]
We see that in this simple example, pH alone is maximally rank-correlated with the community distances (\(\rho=0.771517\)).