skbio.sequence.DNA.frequencies

DNA.frequencies(chars=None, relative=False)

Compute frequencies of characters in the sequence.

State: Experimental as of 0.4.1.

Parameters:

chars : str or set of str, optional

Characters to compute the frequencies of. May be a str containing a single character or a set of single-character strings. If None, frequencies will be computed for all characters present in the sequence.

relative : bool, optional

If True, return the relative frequency of each character instead of its count. If chars is provided, relative frequencies will be computed with respect to the number of characters in the sequence, not the total count of characters observed in chars. Thus, the relative frequencies will not necessarily sum to 1.0 if chars is provided.

Returns:

dict

Frequencies of characters in the sequence.

Raises:

TypeError

If chars is not a str or set of str.

ValueError

If chars is not a single-character str or a set of single-character strings.

ValueError

If chars contains characters outside the allowable range of characters in a Sequence object.

Notes

If the sequence is empty (i.e., length zero), relative=True, and chars is provided, the relative frequency of each specified character will be np.nan.
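The empty-sequence case can be illustrated with a small pure-Python model. This is a sketch only: `relative_frequencies` is a hypothetical stand-in written for this illustration, not scikit-bio code, and it uses `math.nan` where the library is documented to return `np.nan`.

```python
import math
from collections import Counter


def relative_frequencies(sequence, chars):
    """Hypothetical model of frequencies(chars=..., relative=True).

    Not the scikit-bio implementation; it only mimics the documented result.
    """
    counts = Counter(sequence)
    length = len(sequence)
    if length == 0:
        # Dividing by a sequence length of zero is undefined, so every
        # requested character maps to NaN, as described in the note above.
        return {char: math.nan for char in chars}
    return {char: counts[char] / length for char in chars}


empty = relative_frequencies("", {"A", "C"})
print(all(math.isnan(value) for value in empty.values()))
```

Because NaN never compares equal to itself, check such results with `math.isnan` (or `np.isnan`) rather than `==`.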

If chars is not provided, this method is equivalent to, but faster than, seq.kmer_frequencies(k=1).

If chars is not provided, this method is also equivalent to, but faster than, passing chars=seq.observed_chars.
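The two equivalences above can be modelled on a plain Python string. The functions `frequencies` and `kmer_frequencies` below are hypothetical stand-ins that share only their names with the scikit-bio methods; they show why the three calls return the same counts and say nothing about the library's implementation or speed.

```python
from collections import Counter


def frequencies(sequence, chars=None):
    """Hypothetical model: count all characters, or only those in chars."""
    counts = Counter(sequence)
    if chars is None:
        return dict(counts)
    return {char: counts[char] for char in chars}


def kmer_frequencies(sequence, k):
    """Hypothetical model: count every overlapping window of length k."""
    windows = (sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return dict(Counter(windows))


seq = "AGAAGACC"
observed_chars = set(seq)  # models seq.observed_chars

all_chars = frequencies(seq)
as_1mers = kmer_frequencies(seq, k=1)
as_observed = frequencies(seq, chars=observed_chars)

# All three dicts hold the same counts.
print(all_chars == as_1mers == as_observed)
```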

Examples

Compute character frequencies of a sequence:

>>> from pprint import pprint
>>> from skbio import Sequence
>>> seq = Sequence('AGAAGACC')
>>> freqs = seq.frequencies()
>>> pprint(freqs) # using pprint to display dict in sorted order
{'A': 4, 'C': 2, 'G': 2}

Compute relative character frequencies:

>>> freqs = seq.frequencies(relative=True)
>>> pprint(freqs)
{'A': 0.5, 'C': 0.25, 'G': 0.25}

Compute relative frequencies of characters A, C, and T:

>>> freqs = seq.frequencies(chars={'A', 'C', 'T'}, relative=True)
>>> pprint(freqs)
{'A': 0.5, 'C': 0.25, 'T': 0.0}

Note that since character T is not in the sequence we receive a relative frequency of 0.0. The relative frequencies of A and C are relative to the number of characters in the sequence (8), not the number of A and C characters (4 + 2 = 6).
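The arithmetic behind this note can be checked without scikit-bio. The sketch below works on a plain string, and the names `by_length` and `by_selected` are illustrative only; `by_length` corresponds to what the method is documented to return, while `by_selected` is the alternative normalization it does not use.

```python
import math

seq = "AGAAGACC"
chars = {"A", "C", "T"}

# Raw counts of the requested characters; T does not occur in seq.
counts = {char: seq.count(char) for char in chars}

# Documented behaviour: divide by the sequence length (8).
by_length = {char: n / len(seq) for char, n in counts.items()}

# Alternative NOT used by the method: divide by the number of
# requested characters that were observed (4 + 2 + 0 = 6).
selected_total = sum(counts.values())
by_selected = {char: n / selected_total for char, n in counts.items()}

print(sum(by_length.values()))                         # less than 1.0, since G is excluded
print(math.isclose(sum(by_selected.values()), 1.0))    # sums to 1.0 by construction
```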