Basic overview of the rmongodb package for R

I have been playing around with MongoDB quite a bit over the last few months.  Because I am much better at coding in R, I decided to write up my notes on how to use the rmongodb package.

This is not a comprehensive tutorial by any stretch, but I wanted to share my notes as I walked through the various tasks that I would need for my projects.

The tutorial and R Markdown file are located here:

If I missed anything, please let me know.  The package has its quirks, but just like R, once you work through the functions, it’s pretty powerful.

NOTE:  The tutorial is embedded below.  It doesn’t render that well, so I would recommend hopping over to the gist at the link above.


Mimic Bigger is not Better Post by Scannel & Kurz

The following GitHub repo has the R code I put together so that we can follow along with, and modify, the great idea from Scannel & Kurz on app growth and its impact on yield.

Take a look at the repo and feel free to clone it if you want to learn R or add to this idea.

Happy Coding!

Basic Proof of Concept: Save an R dataframe as a Tableau Data Extract

I love Tableau, as it is a huge part of my data workflow. Not only is it super easy to slice and dice data, but the company recently released an API for developers.  In short, we can use a few languages (Python and C++/Java) to build Data Extracts, the super fast back-end that makes using Tableau on large datasets a breeze.

With that said, it would have been nice to have an API for R.  I am not the only person who wants to see this.

As noted in my GitHub repo, my ideal workflow would allow me to save an R data frame as a Tableau Data Extract.  This way, I could do a lot of the heavy lifting (data cleaning, recodes, modeling) in R prior to my reporting in Tableau.

In the end, I am sure it’s possible for someone with C++ skills to write an R package, but my skills are not there just yet.

The IPython notebook in the repo represents a very crude and naive way to get data from R into a Tableau Data Extract.  To follow along:

  1. Clone the repo
  2. Run the code in Model-in-R.R.  This will save an R data file in your current directory
  3. Fire up ipython notebook in the same directory
  4. Run the code
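
For a rough idea of what the R half of that workflow involves, here is a minimal sketch.  It is not the actual Model-in-R.R script from the repo: the built-in mtcars data, the model formula, the mpg_pred column and the file name are all stand-ins I am assuming for illustration.  The post's script writes to the current directory; the sketch writes to a temporary directory so it can run anywhere.

```r
# Hypothetical stand-in for Model-in-R.R: do the "heavy lifting" in R,
# then save the resulting data frame so another tool can pick it up.

# Built-in example data; the real script uses its own data set.
df <- mtcars

# Fit a simple linear model (assumed formula, for illustration only).
fit <- lm(mpg ~ wt + hp, data = df)

# Attach the model predictions to the data frame we want to report on.
df$mpg_pred <- as.numeric(predict(fit, newdata = df))

# Save the data frame as an R data file. The file name is made up;
# swap tempdir() for getwd() to write to the current directory instead.
out_file <- file.path(tempdir(), "model-data.RData")
save(df, file = out_file)
```

From there, the Python side would load the saved file (the repo uses rpy2 for this) and write the rows into a Data Extract.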

I am trying to grow my Python skills, so I make no claims that the code presented in the repo is elegant or free of bugs.  Also, it requires the rpy2 package for Python, which, admittedly, was a pain to get up and running on Windows.

Lastly, it’s worth noting that one major limitation (and oversight) in my eyes is that Tableau only works on Windows.  It seems like the API requires the .dll files that are installed with the software.  While the option of an API is awesome, I have to believe that the majority of developers who would leverage it are not writing code in a Windows environment.


Cluster NHL Teams Based on 2012/13 Regular Season Performance

Since tonight kicks off Game 1 of the Stanley Cup Finals, I thought it would be fun to do a very quick and dirty cluster analysis of the league based on regular season performance.

Tonight, the Chicago Blackhawks square off against my hometown team, the Boston Bruins.  Even though it was a lockout-shortened season, the Blackhawks started off by playing 24 consecutive games without a loss.  Given this incredible start, I was eager to see how statistically similar the Bruins were relative to their opponent and other teams they faced in the playoffs.

The process is as follows:

  • Crawl the 2012-13 regular season data for each team
  • Normalize the statistics and create a distance matrix
  • Use hierarchical clustering to group the teams

Of course, all of this will be completed in my language of choice, R.
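
The normalize, distance and cluster steps can be sketched in base R as follows.  This is only an illustration under stated assumptions: the team labels and statistics are made up (the real analysis crawls the 2012-13 data), and apart from average linkage, which I mention below, the choice of complete and single linkage is an assumption rather than the methods used in the original code.

```r
# Made-up statistics for six hypothetical teams (NOT real NHL data).
stats <- data.frame(
  goals_for     = c(150, 160, 130, 145, 110, 115),
  goals_against = c(105, 120, 110, 125, 170, 150),
  points        = c(75, 70, 62, 60, 36, 40),
  row.names     = paste("Team", LETTERS[1:6])
)

# Normalize: center each column to mean 0 and scale to sd 1 so that
# no single statistic dominates the distance calculation.
scaled <- scale(stats)

# Distance matrix between teams (Euclidean distance on the scaled stats).
d <- dist(scaled, method = "euclidean")

# Hierarchical clustering with three linkage methods (assumed choices).
methods <- c("complete", "average", "single")
trees <- lapply(methods, function(m) hclust(d, method = m))
names(trees) <- methods

# Cut the average-linkage tree into two groups to get cluster labels.
groups <- cutree(trees[["average"]], k = 2)

# Helper to draw the dendrograms side by side.
plot_dendrograms <- function(trees) {
  op <- par(mfrow = c(1, length(trees)))
  on.exit(par(op))
  for (m in names(trees)) {
    plot(trees[[m]], main = paste(m, "linkage"), xlab = "", sub = "")
  }
}
```

Calling plot_dendrograms(trees) draws one tree per method, and groups holds the cluster assignment for each team.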


The image above shows 3 dendrograms using 3 different methods.

I will let you draw your own conclusions, but I find it interesting that:

  • Chicago and Pittsburgh (the team Boston defeated to go to the Stanley Cup Finals) are basically isolated in 2 of the trees
  • Using Average linkage, Chicago/Pittsburgh stand alone from the pack, but so does Boston from the group of other playoff teams
  • By and large, the techniques were able to isolate the majority of teams that did not make the playoffs

Just in case you are trying to learn R, here is the code.

US News National Rankings – Yield Rates and R

Today I saw a post about the yield rates at institutions considered in the “National” Rankings for US News.  Above all else, I was interested in the data table contained within the post.  I wrote a quick R script to grab the table and plot the reported yield rates, highlighting where a particular school falls within the distribution.

If you want to run the R code, all you need to do is ensure that you have the XML package and change the BM variable to a value of interest.  The image below highlights what one institution might see.
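
To show the idea without depending on the original web page, here is a minimal sketch of the highlighting step in base R.  The schools and yield rates are made up, and treating BM as a school name is my assumption for this example.  In the real script the table is scraped with the XML package (for instance with its readHTMLTable function) rather than typed in by hand.

```r
# Made-up yield rates in percent (NOT the US News figures).
yield <- data.frame(
  school     = paste("School", LETTERS[1:10]),
  yield_rate = c(78, 64, 55, 47, 41, 38, 33, 29, 24, 19),
  stringsAsFactors = FALSE
)

# The school to highlight; change BM to a value of interest.
BM <- "School E"

# Look up the highlighted school's yield rate.
bm_rate <- yield$yield_rate[yield$school == BM]

# Share of schools (in percent) with a yield rate at or below BM's.
bm_pct <- mean(yield$yield_rate <= bm_rate) * 100

# Plot the distribution and mark where the highlighted school falls.
plot_yield <- function(yield, BM) {
  rate <- yield$yield_rate[yield$school == BM]
  hist(yield$yield_rate,
       main = "Yield rates (made-up data)",
       xlab = "Yield rate (%)")
  abline(v = rate, col = "red", lwd = 2)  # vertical line at BM's rate
  invisible(rate)
}
```

Calling plot_yield(yield, BM) draws the histogram with a red line at the chosen school's yield rate.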

The code for this analysis can be found below: