skbio.stats.isubsample

skbio.stats.isubsample(items, maximum, minimum=1, buf_size=1000, bin_f=None)

Randomly subsample items from bins, without replacement.

State: Experimental as of 0.4.0.

Randomly subsample items without replacement from an unknown number of input items, which may fall into an unknown number of bins. This method is intended for data that either a) cannot fit into memory or b) consists of collections of arbitrary datatypes.

Parameters:

items : Iterable

The items to evaluate.

maximum : unsigned int

The maximum number of items per bin.

minimum : unsigned int, optional

The minimum number of items per bin. The default is 1.

buf_size : unsigned int, optional

The size of the random value buffer. This buffer holds the random values assigned to each item from items. In practice, it is unlikely that this value will need to change. Increasing it will require more resident memory, but potentially reduce the number of function calls made to the PRNG, whereas decreasing it will result in more function calls and lower memory overhead. The default is 1000.

bin_f : function, optional

Method to determine what bin an item is associated with. If None (the default), then all items are considered to be part of the same bin. This function will be provided with each entry in items, and must return a hashable value indicating the bin that that entry should be placed in.
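As a rough illustration of what a bin function can look like, the sketch below defines two candidates and counts the keys they produce. The names by_sample and by_length and the records list are hypothetical, chosen only for this illustration; any callable that returns a hashable value should work the same way.

```python
from collections import Counter

# Hypothetical records: (sample_id, sequence) pairs.
records = [('sampleA', 'AATTGG'),
           ('sampleB', 'ATATATAT'),
           ('sampleC', 'ATGGCC')]


def by_sample(item):
    """Bin on the sample identifier (first tuple element)."""
    return item[0]


def by_length(item):
    """Bin on sequence length, so equal-length sequences share a bin."""
    return len(item[1])


# Counting the returned keys shows how many items each function
# would place in each bin.
sample_bins = Counter(by_sample(r) for r in records)
length_bins = Counter(by_length(r) for r in records)
```

Either function could then be passed as bin_f; the returned keys (strings in one case, integers in the other) only need to be hashable.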

Returns:

generator

(bin, item)

Raises:

ValueError

If minimum > maximum.

ValueError

If minimum < 1 or if maximum < 1.

See also

subsample_counts

Notes

Randomly get up to maximum items for each bin. A bin with fewer than maximum items is returned in full, provided it has >= minimum items; bins with fewer than minimum items are dropped.

This method holds at most maximum * N items in memory, where N is the number of bins.

All items associated with a bin have an equal probability of being retained.
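The notes above can be illustrated with a small pure-Python sketch. This is not the scikit-bio implementation: it assumes one possible approach, in which every item receives a random key and each bin keeps the maximum items with the largest keys in a bounded heap. The name subsample_sketch and its seed parameter are hypothetical, and the buf_size buffering is omitted.

```python
import heapq
import itertools
import random
from collections import defaultdict


def subsample_sketch(items, maximum, minimum=1, bin_f=None, seed=None):
    """Yield (bin, item) pairs, keeping at most `maximum` items per bin."""
    if minimum > maximum:
        raise ValueError("minimum cannot be greater than maximum")
    if minimum < 1 or maximum < 1:
        raise ValueError("minimum and maximum must be >= 1")

    # With no bin function, every item falls into a single bin.
    key_f = bin_f if bin_f is not None else (lambda item: None)

    rng = random.Random(seed)
    tie_breaker = itertools.count()  # avoids comparing items on equal keys
    heaps = defaultdict(list)

    for item in items:
        heap = heaps[key_f(item)]
        entry = (rng.random(), next(tie_breaker), item)
        if len(heap) < maximum:
            heapq.heappush(heap, entry)
        else:
            # Push the new entry and drop the smallest key in one step,
            # so no bin ever holds more than `maximum` entries.
            heapq.heappushpop(heap, entry)

    # Only bins that reached `minimum` items are reported.
    for bin_, heap in heaps.items():
        if len(heap) >= minimum:
            for _, _, item in heap:
                yield bin_, item
```

Because the keys are drawn independently of the items, every item in a bin is equally likely to be among the retained ones, and memory use is bounded by maximum entries per bin regardless of how many items are streamed in.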

Examples

Randomly keep up to 2 sequences per sample from a set of demultiplexed sequences:

>>> from skbio.stats import isubsample
>>> import numpy as np
>>> np.random.seed(123)
>>> seqs = [('sampleA', 'AATTGG'),
...         ('sampleB', 'ATATATAT'),
...         ('sampleC', 'ATGGCC'),
...         ('sampleB', 'ATGGCT'),
...         ('sampleB', 'ATGGCG'),
...         ('sampleA', 'ATGGCA')]
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, bin_f=bin_f)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATATATAT
sampleB ATGGCG
sampleC ATGGCC

Now, let’s set the minimum to 2, which drops sampleC because it has only one sequence:

>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, 2, bin_f=bin_f)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATATATAT
sampleB ATGGCG