TabularMSA.conservation(metric='inverse_shannon_uncertainty', degenerate_mode='error', gap_mode='nan')
Apply metric to compute conservation for all alignment positions.
State: Experimental as of 0.4.1.
Parameters:
    metric : {'inverse_shannon_uncertainty'}, optional
    degenerate_mode : {'nan', 'error'}, optional
    gap_mode : {'nan', 'ignore', 'error', 'include'}, optional
Returns:
    np.array of floats
Raises:
    ValueError
Notes
Users should be careful interpreting results when gap_mode = "include",
as the results may be misleading. For example, as pointed out in [R98],
a protein alignment position composed of 90% gaps and 10% tryptophans
would score as more highly conserved than a position composed of alanine
and glycine in equal frequencies with the "inverse_shannon_uncertainty"
metric.
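
The comparison above can be checked numerically. What follows is a minimal sketch, not the library's implementation: the helper name shannon_uncertainty is hypothetical, and the logarithm base of 21 (20 standard amino acids plus one gap symbol) is an assumption chosen so that scores fall between 0 and 1.

```python
import math


def shannon_uncertainty(freqs, base):
    """Shannon's uncertainty of a frequency distribution (hypothetical helper).

    Zero frequencies are skipped because they contribute nothing to the sum.
    """
    return -sum(p * math.log(p, base) for p in freqs if p > 0)


# Assumption: 20 standard amino acids plus a single gap symbol.
BASE = 21

# Position A: 90% gaps, 10% tryptophan (gaps counted as an ordinary symbol).
gappy = 1 - shannon_uncertainty([0.9, 0.1], BASE)

# Position B: alanine and glycine in equal frequencies, no gaps.
mixed = 1 - shannon_uncertainty([0.5, 0.5], BASE)

print("90% gap / 10% W:", gappy)
print("50% A / 50% G:  ", mixed)
print("gap-dominated position scores higher:", gappy > mixed)
```

The ordering does not depend on the chosen base. The score is computed from symbol frequencies alone, so a skewed distribution scores as conserved even when the dominant symbol is a gap.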

gap_mode = "include" will result in all gap characters being recoded to
TabularMSA.dtype.default_gap_char. Because no conservation metrics that
we are aware of consider different gap characters differently (e.g., none
of the metrics described in [R98]), they are all treated the same within
this method.
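
The effect of this recoding can be illustrated with a small sketch. The gap alphabet ("-" and ".") and the default gap character ("-") used here are assumptions standing in for the values the sequence type would supply through TabularMSA.dtype; recode_gaps is a hypothetical helper, not part of the library.

```python
# Assumed gap alphabet and default gap character (stand-ins for the
# values provided by the alignment's sequence type).
GAP_CHARS = {"-", "."}
DEFAULT_GAP_CHAR = "-"


def recode_gaps(column):
    """Return the column with every gap character replaced by the default one."""
    return "".join(DEFAULT_GAP_CHAR if c in GAP_CHARS else c for c in column)


column = "W-.-.W"
recoded = recode_gaps(column)

print(recoded)
# Before recoding the column holds three distinct symbols; afterwards two,
# so both gap characters contribute to a single frequency.
print(len(set(column)), "->", len(set(recoded)))
```

Without this step, a metric would count each gap character as a separate symbol and report the position as less conserved than it should.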

The inverse_shannon_uncertainty metric is simply one minus Shannon’s
uncertainty metric. This method uses the inverse of Shannon’s uncertainty
so that larger values imply higher conservation. Shannon’s uncertainty is
also referred to as Shannon’s entropy, but when making computations from
symbols, as is done here, “uncertainty” is the preferred term ([R99]).
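
As a rough illustration of "one minus Shannon's uncertainty", the sketch below scores a single alignment column. It is not the library's code: the function signature is hypothetical, and using the alphabet size as the logarithm base (so that the uncertainty lies between 0 and 1) is an assumption.

```python
from collections import Counter
import math


def inverse_shannon_uncertainty(column, alphabet_size):
    """Score one alignment column as 1 - H, with H in [0, 1].

    column        : string (or sequence) of symbols at one position
    alphabet_size : number of possible symbols; used as the log base
                    (assumption), must be greater than 1
    """
    if len(column) == 0:
        raise ValueError("column must contain at least one symbol")
    if alphabet_size <= 1:
        raise ValueError("alphabet_size must be greater than 1")

    total = len(column)
    # Relative frequency of each symbol observed in the column.
    freqs = [n / total for n in Counter(column).values()]
    uncertainty = -sum(p * math.log(p, alphabet_size) for p in freqs)
    return 1.0 - uncertainty


# Nucleotide-like alphabet of four symbols (assumption for the example).
print(inverse_shannon_uncertainty("AAAA", 4))  # fully conserved column
print(inverse_shannon_uncertainty("AACC", 4))  # two symbols, equal frequency
print(inverse_shannon_uncertainty("ACGT", 4))  # every symbol equally likely
```

A fully conserved column has zero uncertainty and so receives the maximum score, while a column in which every symbol of the alphabet is equally frequent receives the minimum.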
References
[R98] Valdar WS. Scoring residue conservation. Proteins. (2002)
[R99] Schneider T. Pitfalls in information theory (website, ca. 2015). https://schneider.ncifcrf.gov/glossary.html#Shannon_entropy