skbio.stats.isubsample

skbio.stats.isubsample(items, maximum, minimum=1, buf_size=1000, bin_f=None)
Randomly subsample items from bins, without replacement.
State: Experimental as of 0.4.0.
Randomly subsample items without replacement from an unknown number of input items that may fall into an unknown number of bins. This method is intended for a) data that cannot fit into memory or b) subsampling collections of arbitrary datatypes.
 Parameters
items (Iterable) – The items to evaluate.
maximum (unsigned int) – The maximum number of items per bin.
minimum (unsigned int, optional) – The minimum number of items per bin. The default is 1.
buf_size (unsigned int, optional) – The size of the random value buffer. This buffer holds the random values assigned to each item from items. In practice, it is unlikely that this value will need to change. Increasing it will require more resident memory, but potentially reduce the number of function calls made to the PRNG, whereas decreasing it will result in more function calls and lower memory overhead. The default is 1000.
bin_f (function, optional) – Method to determine what bin an item is associated with. If None (the default), then all items are considered to be part of the same bin. This function will be provided with each entry in items, and must return a hashable value indicating the bin that that entry should be placed in.
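As an illustration of the bin_f contract described above, here are two possible bin functions for (sample_id, sequence) tuples. The function names and the length threshold are hypothetical, chosen only for this sketch; the only requirement taken from the documentation is that the return value is hashable.

```python
def bin_by_sample(item):
    """Bin a (sample_id, sequence) tuple by its sample ID (hypothetical helper)."""
    return item[0]


def bin_by_sample_and_length(item):
    """Bin by sample ID and by whether the sequence has at least 7 characters.

    The threshold of 7 is an arbitrary value for illustration.
    """
    sample_id, seq = item
    # A tuple of hashable values is itself hashable, so it is a valid bin key.
    return (sample_id, len(seq) >= 7)
```

Either function could be passed as bin_f; the second one splits each sample into up to two bins.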
 Returns
(bin, item)
 Return type
generator
 Raises
ValueError – If minimum is > maximum.
ValueError – If minimum < 1 or if maximum < 1.
Notes
Randomly get up to maximum items for each bin. A bin with fewer than maximum items is returned only if it has >= minimum items.
This method will hold at most maximum * N items in memory, where N is the number of bins.
All items associated with a bin have an equal probability of being retained.
Examples
Randomly keep up to 2 sequences per sample from a set of demultiplexed sequences:
>>> from skbio.stats import isubsample
>>> import numpy as np
>>> np.random.seed(123)
>>> seqs = [('sampleA', 'AATTGG'),
...         ('sampleB', 'ATATATAT'),
...         ('sampleC', 'ATGGCC'),
...         ('sampleB', 'ATGGCT'),
...         ('sampleB', 'ATGGCG'),
...         ('sampleA', 'ATGGCA')]
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, bin_f=bin_f)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATATATAT
sampleB ATGGCG
sampleC ATGGCC
Now, let’s set the minimum to 2:
>>> bin_f = lambda item: item[0]
>>> for bin_, item in sorted(isubsample(seqs, 2, 2, bin_f=bin_f)):
...     print(bin_, item[1])
sampleA AATTGG
sampleA ATGGCA
sampleB ATATATAT
sampleB ATGGCG