J Gregory Caporaso
GitHub/Twitter: @gregcaporaso
from __future__ import division, print_function
from IPython.core import page
page.page = print
import warnings
warnings.simplefilter('always')
Bioinformatics, as I see it, is the application of the tools of computer science (things like programming languages, algorithms, and databases) to address biological problems (for example, inferring the evolutionary relationship between a group of organisms based on fragments of their genomes, or understanding if or how the community of microorganisms that live in my gut changes if I modify my diet). Bioinformatics is a rapidly growing field, largely in response to the vast increase in the quantity of data that biologists now grapple with. Students from varied disciplines (e.g., biology, computer science, statistics, and biochemistry) and stages of their educational careers (undergraduate, graduate, or postdoctoral) are becoming interested in bioinformatics.
I teach bioinformatics at the undergraduate and graduate levels at Northern Arizona University. This repository contains some of the materials that I've developed in these courses, and represents an initial attempt to organize these materials in a standalone way. If you'd like to read a little more about the project, see my blog post on microbe.net.
This project is in very early development stage. It's not ready for prime-time by any means, but I fall firmly into the "publish early, publish often" mindset, hence its public availability. I am very interested in feedback in the form of email (gregcaporaso@gmail.com) or pull requests.
The code in the iab module is not sufficiently tested, documented, or optimized for production use. As code reaches those quality standards it will be ported to scikit-bio. I do not recommend using the code in the iab module outside of these notebooks. In other words, don't import iab
outside of the notebooks - if you want access to the functionality in your own code, you should import skbio
.
Currently, the best example of where I'm hoping to go with these materials is the multiple sequence alignment chapter.
from skbio.core.alignment.pairwise import local_pairwise_align_nucleotide
local_pairwise_align_nucleotide?
Type: function String form: <function local_pairwise_align_nucleotide at 0x1045585f0> File: /Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py Definition: local_pairwise_align_nucleotide(seq1, seq2, gap_open_penalty=5, gap_extend_penalty=2, match_score=2, mismatch_score=-3, substitution_matrix=None) Docstring: Locally align exactly two nucleotide seqs with Smith-Waterman Parameters ---------- seq1 : str or BiologicalSequence The first unaligned sequence. seq2 : str or BiologicalSequence The second unaligned sequence. gap_open_penalty : int or float, optional Penalty for opening a gap (this is substracted from previous best alignment score, so is typically positive). gap_extend_penalty : int or float, optional Penalty for extending a gap (this is substracted from previous best alignment score, so is typically positive). match_score : int or float, optional The score to add for a match between a pair of bases (this is added to the previous best alignment score, so is typically positive). mismatch_score : int or float, optional The score to add for a mismatch between a pair of bases (this is added to the previous best alignment score, so is typically negative). substitution_matrix: 2D dict (or similar) Lookup for substitution scores (these values are added to the previous best alignment score). If provided, this overrides ``match_score`` and ``mismatch_score``. Returns ------- skbio.Alignment ``Alignment`` object containing the aligned sequences as well as details about the alignment. See Also -------- local_pairwise_align local_pairwise_align_protein skbio.core.alignment.local_pairwise_align_ssw global_pairwise_align global_pairwise_align_protein global_pairwise_align_nucelotide Notes ----- Default ``match_score``, ``mismatch_score``, ``gap_open_penalty`` and ``gap_extend_penalty`` parameters are derived from the NCBI BLAST Server [1]_. References ---------- .. [1] http://blast.ncbi.nlm.nih.gov/Blast.cgi
from skbio.core.alignment.pairwise import local_pairwise_align_nucleotide
s1 = "ACTAAGGCTCTCTACCCCTCTCAGAGA"
s2 = "AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA"
r = local_pairwise_align_nucleotide(s1, s2)
print(type(r))
print(r)
<class 'skbio.core.alignment.alignment.Alignment'> >0 ACTAAGGCTCTCTACCCCTC-TCAGAGA >1 ACTAAGGCTCTCTACCCCTCTTCAGAGA
/Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py:300: EfficiencyWarning: You're using skbio's python implementation of Smith-Waterman alignment. This will be very slow (e.g., thousands of times slower) than skbio.core.alignment.local_pairwise_align_ssw. EfficiencyWarning)
from skbio import DNA
s1 = DNA("ACTAAGGCTCTCTACCCCTCTCAGAGA", "query")
s2 = DNA("AAAAAACTCTCTAAACTCACTAAGGCTCTCTACCCCTCTTCAGAGAAGTCGA", "target")
r = local_pairwise_align_nucleotide(s1, s2)
print(type(r))
print(r)
<class 'skbio.core.alignment.alignment.Alignment'> >query ACTAAGGCTCTCTACCCCTC-TCAGAGA >target ACTAAGGCTCTCTACCCCTCTTCAGAGA
/Users/caporaso/code/skbio/skbio/core/alignment/pairwise.py:300: EfficiencyWarning: You're using skbio's python implementation of Smith-Waterman alignment. This will be very slow (e.g., thousands of times slower) than skbio.core.alignment.local_pairwise_align_ssw. EfficiencyWarning)
from skbio.core.alignment import local_pairwise_align_ssw
local_pairwise_align_ssw?
Type: builtin_function_or_method String form: <built-in function local_pairwise_align_ssw> Docstring: Align query and target sequences with Striped Smith-Waterman. Parameters ---------- sequence1 : str or BiologicalSequence The first unaligned sequence sequence2 : str or BiologicalSequence The second unaligned sequence Returns ------- ``skbio.core.alignment.Alignment`` The resulting alignment as an Alignment object Notes ----- For a complete list of optional keyword-arguments that can be provided, see ``skbio.core.alignment.StripedSmithWaterman``. The following kwargs will not have any effect: `suppress_sequences` and `zero_index` If an alignment does not meet a provided filter, `None` will be returned. See Also -------- skbio.core.alignment.StripedSmithWaterman
r = local_pairwise_align_ssw(s1, s2)
print(type(r))
print(r)
<class 'skbio.core.alignment.alignment.Alignment'> >query ACTAAGGCTCTCTACCCCTC-TCAGAGA >target ACTAAGGCTCTCTACCCCTCTTCAGAGA
from skbio.parse.sequences import parse_fasta
from skbio import SequenceCollection
from random import choice
gg_path = "/Users/caporaso/data/gg_13_8_otus/rep_set/73_otus.fasta"
s = SequenceCollection.from_fasta_records([(i, s) for i, s in parse_fasta(gg_path) if set(s) == set('ACGT')], DNA)
%timeit local_pairwise_align_ssw(choice(s), choice(s), gap_open_penalty=5,\
gap_extend_penalty=2, match_score=2, mismatch_score=-3)
100 loops, best of 3: 4.25 ms per loop
warnings.simplefilter('ignore')
%timeit local_pairwise_align_nucleotide(choice(s), choice(s), gap_open_penalty=5,\
gap_extend_penalty=2, match_score=2, mismatch_score=-3)
1 loops, best of 3: 18.8 s per loop
import pandas as pd
sample_md = {
'A': {'body_site': 'gut', 'subject': '1'},
'B': {'body_site': 'skin', 'subject': '1'},
'C': {'body_site': 'tongue', 'subject': '1'},
'D': {'body_site': 'gut', 'subject': '2'},
'E': {'body_site': 'tongue', 'subject': '2'},
'F': {'body_site': 'skin', 'subject': '2'}}
sample_md = pd.DataFrame.from_dict(sample_md, orient='index')
sample_md
subject | body_site | |
---|---|---|
A | 1 | gut |
B | 1 | skin |
C | 1 | tongue |
D | 2 | gut |
E | 2 | tongue |
F | 2 | skin |
6 rows × 2 columns
data = [[23, 64, 14, 0, 0, 3, 1],
[0, 3, 35, 42, 0, 12, 1],
[0, 5, 5, 0, 40, 40, 0],
[44, 35, 9, 0, 1, 0, 0],
[0, 2, 8, 0, 35, 45, 1],
[0, 0, 25, 35, 0, 19, 0]]
table = pd.DataFrame(data,
columns=['Species 1', 'Species 2', 'Species 3', 'Species 4', 'Species 5', 'Species 6', 'Species 7'],
index=list('ABCDEF'))
table
Species 1 | Species 2 | Species 3 | Species 4 | Species 5 | Species 6 | Species 7 | |
---|---|---|---|---|---|---|---|
A | 23 | 64 | 14 | 0 | 0 | 3 | 1 |
B | 0 | 3 | 35 | 42 | 0 | 12 | 1 |
C | 0 | 5 | 5 | 0 | 40 | 40 | 0 |
D | 44 | 35 | 9 | 0 | 1 | 0 | 0 |
E | 0 | 2 | 8 | 0 | 35 | 45 | 1 |
F | 0 | 0 | 25 | 35 | 0 | 19 | 0 |
6 rows × 7 columns
from skbio.math.diversity.beta import pw_distances
bc_dm = pw_distances(table, table.index, "braycurtis")
print(bc_dm)
6x6 distance matrix IDs: A, B, C, D, E, F Data: [[ 0. 0.78787879 0.86666667 0.30927835 0.85714286 0.81521739] [ 0.78787879 0. 0.78142077 0.86813187 0.75 0.1627907 ] [ 0.86666667 0.78142077 0. 0.87709497 0.09392265 0.71597633] [ 0.30927835 0.86813187 0.87709497 0. 0.87777778 0.89285714] [ 0.85714286 0.75 0.09392265 0.87777778 0. 0.68235294] [ 0.81521739 0.1627907 0.71597633 0.89285714 0.68235294 0. ]]
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
def scatter_3d(ord_results, df, column, color_map, title='', axis1=0,
axis2=1, axis3=2):
"""Adapted from Matplotlib Gallery:
http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
"""
coord_matrix = ord_results.site.T
ids = ord_results.site_ids
colors = [color_map[df[column][id_]] for id_ in ord_results.site_ids]
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
xs = coord_matrix[axis1]
ys = coord_matrix[axis2]
zs = coord_matrix[axis3]
plot = ax.scatter(xs, ys, zs, c=colors, s=150)
ax.set_xlabel('PC %d' % (axis1 + 1))
ax.set_ylabel('PC %d' % (axis2 + 1))
ax.set_zlabel('PC %d' % (axis3 + 1))
ax.set_xticklabels([])
ax.set_yticklabels([])
ax.set_zticklabels([])
ax.set_title(title)
return fig
from skbio.math.stats.ordination import PCoA
bc_pc = PCoA(bc_dm).scores()
# This function adapted from Matplotlib Gallery's:
# http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
fig = scatter_3d(bc_pc, sample_md, 'subject', {'1': 'yellow', '2': 'purple'},
'Samples colored by subject')
# This function adapted from Matplotlib Gallery's:
# http://matplotlib.org/examples/mplot3d/scatter3d_demo.html
fig = scatter_3d(bc_pc, sample_md, 'body_site',
{'gut': 'b', 'skin': 'r', 'tongue': 'g'},
'Samples colored by body site')
from skbio.math.stats.distance import ANOSIM
anosim = ANOSIM(bc_dm, sample_md, column='subject')
results = anosim(999)
print(results.statistic)
print(results.p_value < 0.05)
-0.407407407407 False
anosim = ANOSIM(bc_dm, sample_md, column='body_site')
results = anosim(999)
print(results.statistic)
print(results.p_value < 0.1)
1.0 True
(name of language, spelling of phrase for ordering a beer, phonetic phrase for ordering a beer)
[Source]
languages = [("Afrikaans", "'n Bier, asseblief", "A beer ah-suh-bleef"),
("Basque", "Garagardo bat, mesedez", "Gara-gardo bat mese-des"),
("Breton", "Ur banne bier am bo, mar plij", "Oor bah-ne beer am boh mar pleezh"),
("Catalan", "Una cervesa, si us plau", "Oona servayzeh see oos plow"),
("Croatian", "Jedno pivo, molim", "Yed-no pee-vo, mo-lim"),
("Czech / Slovak", "Pivo, prosím", "Pee-vo, pro-seem"),
("Danish", "Jeg vil gerne have en øl", "Yay vil geh-neh heh en url"),
("Dutch", "Een bier, alsjeblieft", "Un beer, ahls-yer-bleeft"),
("English", "One beer, please", "Wun beer, pleez"),
("Esperanto", "Unu bieron, mi petas", "Oo-noo bee-airon, mee peh-tahs"),
("Estonian", "Üks õlu, palun", "Ooks ur-loo, pah-lun"),
("Finnish", "Olut mulle, kiitos", "O-loot moolek kee-tos"),
("French", "Une bière, s'il vous plaît", "Oon bee-air, seel voo pleh"),
("German", "Ein Bier, bitte", "Ine beer, bitt-uh"),
("Hungarian", "Egy pohár sört kérek", "Edj pohar shurt kayrek"),
("Icelandic", "Einn bjór, takk", "Ay-dn byohr tahk"),
("Irish", "Beoir amháin, le do thoil", "Byohr awoyn, lyeh doh hull"),
("Italian", "Una birra, per favore", "Oo-na beer-ra, pair fa-vo-re"),
("Latin", "Cervisiam, sodes", "Ker-wi-see-am, soh-dehs"),
("Latvian", "Vienu alu, lū-dzu", "Vyeh-noo ah-loo, loo dzoo"),
("Lithuanian", "Prašau viena alaus", "Pra-shau vie-na al-lows"),
("Maltese", "Wiehed birra, jekk jghogbok", "Wee-het bir-ra yek yoh-dzbok"),
("Norwegian", "En øl, takk", "Ehn url tahk"),
("Occitan", "Una cervesa, se vos plai", "Oo-no serbeh-zo se bus ply"),
("Polish", "Jedno piwo, proszę", "Yed-no peevo proshe"),
("Portuguese", "Uma cerveja, por favor", "Oo-ma ser-vay-ja, poor fa-vohr"),
("Romansch Ladina", "Üna biera, per plaschair.", "Oo-nuh bee-air-uh per plah-chair"),
("Sardinian", "Una birra, po piaghere", "Oo-na beer-ra po pia-gehre"),
("Scots Gaelic", "Leann, mas e do thoil e", "Lyawn mahs eh doh hawl eh"),
("Slovene", "Eno pivo, prosim", "Eno pee-vo pro-seem"),
("Spanish (Lat. Am.)", "Una cerveza, por favor", "Oo-na ser-veh-sa, por fa-vor"),
("Spanish (Spain)", "Una cerveza, por favor", "Oo-na thair-veh-tha, por fa-vor"),
("Strine", "Foster's, mate", "Faw-stuhz, mayt"),
("Swedish", "En öl, tack", "Ehn irl, tahk"),
("Twi", "Mame beer baako, mi pawokyew", "Mah-me bee-ye bah-ko mee pow-che-oo"),
("Turkish", "Bir bira, lütfen", "Beer beer-ah luht-fen"),
("Welsh", "Cwrw os gwelwch in dda", "Koo-roh ohs gwel-ookh-un-thah")]
language_to_pron = {e[0]: e[2] for e in languages}
all_pron_chars = []
for e in language_to_pron.values():
all_pron_chars.extend(e)
all_pron_chars = set(all_pron_chars)
pron_substitution_matrix = {}
for c in all_pron_chars:
row = {}.fromkeys(all_pron_chars, -2.0)
row[c] = 5.0
pron_substitution_matrix[c] = row
from skbio.core.alignment.pairwise import global_pairwise_align
alignment = global_pairwise_align(language_to_pron["Swedish"],
language_to_pron["Norwegian"],
gap_open_penalty=5, gap_extend_penalty=2,
substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])
>0 Ehn irl, tahk >1 Ehn url- tahk Hamming distance: 0.154
alignment = global_pairwise_align(language_to_pron["Swedish"],
language_to_pron["Icelandic"],
gap_open_penalty=5, gap_extend_penalty=2,
substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])
>0 --Ehn ---irl, tahk >1 Ay-dn byohr-- tahk Hamming distance: 0.556
alignment = global_pairwise_align(language_to_pron["Spanish (Spain)"],
language_to_pron["Italian"],
gap_open_penalty=5, gap_extend_penalty=2,
substitution_matrix=pron_substitution_matrix)
print(alignment.to_fasta())
print("Hamming distance: %1.3f" % alignment.distances()[0,1])
>0 Oo-na thair-veh-tha, p-or fa-vo-r- >1 Oo-na be-----er-r-a, pair fa-vo-re Hamming distance: 0.353
warnings.simplefilter('ignore')
from skbio.core.alignment.pairwise import global_pairwise_align
from skbio import DistanceMatrix
languages = language_to_pron.keys()
distances = np.zeros((len(languages), len(languages)))
for i, language1 in enumerate(languages):
language1_phrase = language_to_pron[language1]
for j in range(i):
language2 = languages[j]
language2_phrase = language_to_pron[language2]
alignment = global_pairwise_align(language1_phrase, language2_phrase,
gap_open_penalty=5, gap_extend_penalty=2,
substitution_matrix=pron_substitution_matrix)
distances[i, j] = distances[j, i] = alignment.distances()[0,1]
dm = DistanceMatrix(distances, languages)
print(dm)
37x37 distance matrix IDs: Swedish, Icelandic, Estonian, Turkish, Twi, Sardinian, Romansch Ladina, Dutch, ... Data: [[ 0. 0.5 0.6 ..., 0.6 0.76923077 0.76470588] [ 0.5 0. 0.8 ..., 0.66666667 0.76923077 0.76470588] [ 0.6 0.8 0. ..., 0.66666667 0.73076923 0.72727273] ..., [ 0.6 0.66666667 0.66666667 ..., 0. 0.59375 0.61111111] [ 0.76923077 0.76923077 0.73076923 ..., 0.59375 0. 0.65714286] [ 0.76470588 0.76470588 0.72727273 ..., 0.61111111 0.65714286 0. ]]
from scipy.cluster.hierarchy import average, dendrogram, to_tree
lm = average(dm.condensed_form())
def format_dendrogram(tip_count):
import matplotlib.pylab as plt
ax = plt.gca()
fig = plt.gcf()
height = tip_count * 0.4
if height < 3:
height = 3
fig.set_size_inches(7, height)
font = {'family' : 'normal',
'weight' : 'normal',
'size' : 18}
matplotlib.rc('font', **font)
return ax
format_dendrogram(dm.shape[0])
d = dendrogram(lm, labels=dm.ids, orientation='right',
link_color_func=lambda x: 'black')
Adam Robbins-Pianka (@adamrp) | Antonio Gonzalez (@antgonza) | Daniel McDonald (@wasade) | Evan Bolyen (@ebolyen) | Greg Caporaso (@gregcaporaso) | Jai Ram Rideout (@ElBrogrammer) | Jens Reeder (@jensreeder) | Jorge Cañardo Alastuey (@Jorge-C) | Jose Antonio Navas Molina (@josenavas) | Joshua Shorenstein (@squirrelo) | Yoshiki Vázquez Baeza (@ElDeveloper) | @charudatta-navare | John Chase (@johnchase) | Karen Schwarzberg (@karenschwarzberg) | Emily TerAvest (@teravest) | Will Van Treuren (@wdwvt1) | Zech Xu (@RNAer)
Rob Knight (@rob-knight) | Gavin Huttley (@gavin-huttley) | Micah Hamady | Sandra Smit | Cathy Lozupone (@clozupone) | Mike Robeson (@mikerobeson) | Marcin Cieslik | Peter Maxwell | Jeremy Widmann | Zongzhi Liu | Michael Dwan | Logan Knecht (@loganknecht) | Andrew Cochran | Jose Carlos Clemente (@cleme) | Damien Coy | Levi McCracken | Andrew Butterfield | Justin Kuczynski (@justin212k) | Matthew Wakefield (@genomematt)
import skbio
print(skbio.title)
print(skbio.art)
_ _ _ _ _ _ (_) | (_) | | | (_) ___ ___ _| | ___| |_ ______| |__ _ ___ / __|/ __| | |/ / | __|______| '_ \| |/ _ \ \__ \ (__| | <| | |_ | |_) | | (_) | |___/\___|_|_|\_\_|\__| |_.__/|_|\___/ Opisthokonta \ Amoebozoa \ / * Euryarchaeota \ |_ Crenarchaeota \ * \ / * / / / * / \ / \ Proteobacteria \ Cyanobacteria