skbio.sequence.Protein

class skbio.sequence.Protein(sequence, metadata=None, positional_metadata=None, lowercase=False, validate=True)

Store protein sequence data and optional associated metadata.

Only characters in the IUPAC protein character set [R215] are supported.

Parameters:

sequence : str, Sequence, or 1D np.ndarray (np.uint8 or '|S1')

Characters representing the protein sequence itself.

metadata : dict, optional

Arbitrary metadata which applies to the entire sequence.

positional_metadata : Pandas DataFrame consumable, optional

Arbitrary per-character metadata, for example quality data from sequencing reads. Must be able to be passed directly to the Pandas DataFrame constructor.

lowercase : bool or str, optional

If True, lowercase sequence characters will be converted to uppercase characters in order to be valid IUPAC protein characters. If False, no characters will be converted. If a str, it will be treated as a key into the positional metadata of the object: all lowercase characters will be converted to uppercase, and a boolean array will be stored in the positional metadata under that key, with True at each position that was originally lowercase.

validate : bool, optional

If True, validation will be performed to ensure that all sequence characters are in the IUPAC protein character set. If False, validation will not be performed. Turning off validation will improve runtime performance. If invalid characters are present, however, there is no guarantee that operations performed on the resulting object will work or behave as expected. Only turn off validation if you are certain that the sequence characters are valid. To store sequence data that is not IUPAC-compliant, use Sequence.

Notes

Subclassing is disabled for Protein, because subclassing makes it possible to change the alphabet, and certain methods rely on the IUPAC alphabet. If a custom sequence alphabet is needed, inherit directly from GrammaredSequence.

References

[R215] Cornish-Bowden, A. Nomenclature for incompletely specified bases in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. May 10, 1985; 13(9): 3021-3030.

Examples

>>> from skbio import Protein
>>> Protein('PAW')
Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 PAW

Convert lowercase characters to uppercase:

>>> Protein('paW', lowercase=True)
Protein
--------------------------
Stats:
    length: 3
    has gaps: False
    has degenerates: False
    has definites: True
    has stops: False
--------------------------
0 PAW

Attributes

values Array containing underlying sequence characters.
metadata dict containing metadata which applies to the entire object.
positional_metadata pd.DataFrame containing metadata along an axis.
alphabet Return valid characters.
gap_chars Return characters defined as gaps.
default_gap_char Gap character to use when constructing a new gapped sequence.
stop_chars Return characters representing translation stop codons.
definite_chars Return definite characters.
degenerate_chars Return degenerate characters.
degenerate_map Return mapping of degenerate to definite characters.

Methods

bool(protein) Return truth value (truthiness) of sequence.
x in protein Determine if a subsequence is contained in this sequence.
copy.copy(protein) Return a shallow copy of this sequence.
copy.deepcopy(protein) Return a deep copy of this sequence.
protein1 == protein2 Determine if this sequence is equal to another.
protein[x] Slice this sequence.
iter(protein) Iterate over positions in this sequence.
len(protein) Return the number of characters in this sequence.
protein1 != protein2 Determine if this sequence is not equal to another.
reversed(protein) Iterate over positions in this sequence in reverse order.
str(protein) Return sequence characters as a string.
concat(sequences[, how]) Concatenate an iterable of Sequence objects.
copy([deep]) Return a copy of this sequence.
count(subsequence[, start, end]) Count occurrences of a subsequence in this sequence.
definites() Find positions containing definite characters in the sequence.
degap() Return a new sequence with gap characters removed.
degenerates() Find positions containing degenerate characters in the sequence.
distance(other[, metric]) Compute the distance to another sequence.
expand_degenerates() Yield all possible definite versions of the sequence.
find_motifs(motif_type[, min_length, ignore]) Search the biological sequence for motifs.
find_with_regex(regex[, ignore]) Generate slices for patterns matched by a regular expression.
frequencies([chars, relative]) Compute frequencies of characters in the sequence.
gaps() Find positions containing gaps in the biological sequence.
has_definites() Determine if sequence contains one or more definite characters.
has_degenerates() Determine if sequence contains one or more degenerate characters.
has_gaps() Determine if the sequence contains one or more gap characters.
has_metadata() Determine if the object has metadata.
has_nondegenerates() Determine if sequence contains one or more non-degenerate characters.
has_positional_metadata() Determine if the object has positional metadata.
has_stops() Determine if the sequence contains one or more stop characters.
index(subsequence[, start, end]) Find position where subsequence first occurs in the sequence.
iter_contiguous(included[, min_length, invert]) Yield contiguous subsequences based on included.
iter_kmers(k[, overlap]) Generate kmers of length k from this sequence.
kmer_frequencies(k[, overlap, relative]) Return counts of words of length k from this sequence.
lowercase(lowercase) Return a case-sensitive string representation of the sequence.
match_frequency(other[, relative]) Return count of positions that are the same between two sequences.
matches(other) Find positions that match with another sequence.
mismatch_frequency(other[, relative]) Return count of positions that differ between two sequences.
mismatches(other) Find positions that do not match with another sequence.
nondegenerates() Find positions containing non-degenerate characters in the sequence.
read(file[, format]) Create a new Protein instance from a file.
replace(where, character) Replace values in this sequence with a different character.
stops() Find positions containing stop characters in the protein sequence.
to_regex() Return regular expression object that accounts for degenerate chars.
write(file[, format]) Write an instance of Protein to a file.