These are notebooks that support a blog post, describe how to accomplish some data science task, or apply a methodology.
Link to Notebook
What’s in this notebook? This is an updated version of my TSNE to Bokeh Scatterplot workflow. I found that UMAP is faster and is able to handle larger datasets where TSNE would previously fail, so I’ve switched over to UMAP as my dimensionality reduction default. I’ve been building a lot of dashboards and visualizations in the plotly ecosystem, so I’ve also switched from Bokeh to Plotly....
Links Link to Notebook
What’s in this notebook? This is the notebook behind my blog post Holy NLP! Understanding Part of Speech Tags, Dependency Parsing, and Named Entity Recognition. It’s an exploration of three common NLP tasks applied to the Bible as a corpus, with some pandas aggregations and seaborn plotting to round things out.
Links Link to Notebook
What’s in this notebook? This is the notebook behind my blog post An Exploration in Earth & Word Movers Distance. It’s an exploration of the Earth/Word Mover’s Distance algorithm that includes a lot of great matplotlib plots and some pandas-fu.
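To give a feel for what Earth Mover’s Distance measures, here is a minimal sketch of the one-dimensional special case, not code from the notebook. It assumes two histograms defined over the same unit-spaced bins with equal total mass; in that setting the distance equals the sum of the absolute running differences in mass. The function name `emd_1d` and the toy histograms are hypothetical, and Word Mover’s Distance itself works on word embeddings and needs a general transport solver rather than this shortcut.

```python
from itertools import accumulate


def emd_1d(p, q):
    """Earth Mover's Distance between two histograms on the same unit-spaced bins.

    Assumes both histograms have the same length and the same total mass.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have the same total mass")

    # Running surplus of "earth" that has to be carried across each bin boundary.
    carried = accumulate(a - b for a, b in zip(p, q))

    # Each unit of carried mass costs one unit of distance per boundary crossed.
    return sum(abs(c) for c in carried)


# Toy example: half the mass has to travel two bins to the right.
p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
distance = emd_1d(p, q)
print(distance)
```

Identical histograms give a distance of zero, and the result does not depend on which histogram is treated as the source.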
Links Link to Notebook
What’s in this notebook? This is a worked example derived from my blog post on Making the Most of spaCy’s Rule-Based Matcher. It works through developing a matching algorithm to identify reasons people purchase products from a dataset of Amazon product reviews. For example, you can automatically generate a list like this for any product:
Customers buy this product ... ... as a replacement for a Salton model....
Links Link to Notebook
What’s in this notebook? This is the notebook behind my blog post The Impact of Model Output Transformations on ROC. It contains some pandas-fu with method chaining, a simulation of analyzing model results, and some plots with seaborn.
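As a small illustration of the topic, and not the simulation from the notebook, the sketch below checks one well-known property: AUC depends only on how the scores rank the examples, so a strictly increasing transformation (here the logit) leaves it unchanged, while a strictly decreasing one flips it to one minus the original value. The helper `pairwise_auc` and the toy labels and scores are hypothetical; the helper computes AUC as the fraction of positive/negative pairs in which the positive example gets the higher score, counting ties as half.

```python
import math


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # A pair counts 1 if the positive scores higher, 0.5 on a tie, 0 otherwise.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Toy data: scores are probabilities strictly between 0 and 1.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.10, 0.40, 0.35, 0.80, 0.65, 0.90]

auc_raw = pairwise_auc(labels, scores)

# Strictly increasing transformation: ranking, and therefore AUC, is unchanged.
auc_logit = pairwise_auc(labels, [math.log(s / (1 - s)) for s in scores])

# Strictly decreasing transformation: ranking is reversed, so AUC becomes 1 - AUC.
auc_flipped = pairwise_auc(labels, [1 - s for s in scores])

print(auc_raw, auc_logit, auc_flipped)
```

Transformations that are not monotonic can reorder the scores, and in that case the AUC can change.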
Links Link to Notebook
What’s in this notebook? The Receiver Operating Characteristic (ROC) curve is helpful in evaluating model performance, especially since Area Under the Curve (AUC ROC) has several friendly interpretations. I use ROC curves when evaluating models whose performance I have to explain to non-technical folks. I was reading through Machine Learning: The Art and Science of Algorithms that Make Sense of Data and stumbled upon this nice visual and interpretation of ROC (tied to AUC):...
Links Link to Notebook
What’s in this notebook? Biclustering is an unsupervised learning technique that clusters the rows and columns of a data matrix at the same time. This notebook contains a workflow for completing biclustering in a network analysis context as well as a clean export of the data to an Excel spreadsheet.
Link to Notebook
What’s in this notebook? This is a workflow I use often in data exploration. TSNE gives a good representation of high-dimensional data, and Bokeh is helpful in creating simple interactive plots with contextual info given by colors and tooltips.
This workflow has been extremely helpful for:
text analytics/NLP tasks, if text data is passed through a TfidfVectorizer or similar from scikit-learn
understanding word2vec or doc2vec vectors by passing them to TSNE
getting an idea of separability in prediction/classification by passing the outcome variable to Bokeh
This example uses the Australian athletes data set, which contains 11 numeric variables....