Use R to view and manipulate the File System

One of the best ways to learn how to code in R is to view sample scripts that people share.

I recently came across this post where Michael uses R to scrape Twitter and collect all sorts of great data on the current unrest in the Middle East and northern Africa.  While the post is worth a read in its own right, what I found most useful was the function ?file.exists.  The usage is obvious, but the help file exposed a collection of related functions that I was not previously aware of.

For instance, I have used ?unlink before, but I had no idea that you could use R to create directories, walk file paths, etc.  I can’t speak for SAS, but I am quite certain you can’t do this natively in SPSS.
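Here is a minimal sketch of the kind of thing that help page opens up.  The folder and file names (fs_demo, example.csv, etc.) are made up for illustration, and everything happens inside R's temporary directory so nothing important gets touched.

[sourcecode language="r"]
# Scratch location inside R's session temp directory (hypothetical names)
root <- file.path(tempdir(), "fs_demo")
unlink(root, recursive = TRUE)  # start clean if the sketch is re-run

# Create nested directories in one call
dir.create(file.path(root, "data", "raw"), recursive = TRUE)

# Only write the file if it is not already there
csv_path <- file.path(root, "data", "raw", "example.csv")
if (!file.exists(csv_path)) {
  write.csv(data.frame(x = 1:3), csv_path, row.names = FALSE)
}

# Copy and rename files
copy_path <- file.path(root, "data", "copy.csv")
file.copy(csv_path, copy_path)
file.rename(copy_path, file.path(root, "data", "renamed.csv"))

# Walk the directory tree and inspect a file
found <- list.files(root, recursive = TRUE)
print(found)
print(file.info(csv_path)$size)
print(basename(csv_path))
print(dirname(csv_path))

# Remove the whole tree again
unlink(root, recursive = TRUE)
print(file.exists(root))
[/sourcecode]

All of these functions ship with base R, so there is nothing extra to install.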

I have to admit that I probably should have already known about this, but the ability to programmatically manipulate files and directories is the sort of thing that keeps affirming that time spent learning R is time well spent.



Visualize NHL Play-by-Play using Tableau Public and R

Nothing like a little Sunday morning data hacking before a big game!  I have been wanting to play with the NHL play-by-play event files for some time now.  The JSON datasets provide a wealth of information about each event in the game including the location, as defined by the fields xcoord and ycoord.

I am pretty excited about today’s rematch between the Boston Bruins and the Detroit Red Wings.  It’s a national game, and the Bruins got destroyed on Friday, so it should make for an interesting contest.  I have never used Tableau Public before, but I do use the Professional version at work.  It is great, and in my humble opinion, it is the best BI tool for small to medium-size businesses that I have seen.  The public version is geared towards bloggers and is somewhat limited in features, but still very robust.  I am going to attempt to update the event dataset during each intermission and update this post with the published workbook.  Needless to say, this is doomed to fail, but I am going to give it a shot.

One of Tableau’s strengths is that it is fairly straightforward to use, especially if you have crunched data in Excel, SPSS, etc.  The link below outlines how to use a background image.  Don’t have the image on your computer?  Not a problem, as Tableau can grab an image from the web, just as I did.

http://www.tableausoftware.com/support/knowledge-base/background-image-coordinates

From the event files, it seems that a reasonable setting for the image is x (-100, 100) and y (40, -40).  Yes, the positive value should come first for y, as I believe the y-coordinates are inverted relative to what you see displayed on the gamecenter page for a given game.  I could be wrong, though.

NOTE:  I am not certain of my settings for the image, because when you do a summary on the min/max of the coordinates, it isn’t exactly 100, but it seems good enough to my eye. Not to mention, most of the hits should take place along the boards, and even with this setting, some take place “outside” the boards on my image.

I am using my tool of choice, R, to grab and parse the dataset into a CSV file for Tableau to read.  The code to grab the data can be found here.

The image below is an example of 100 randomly selected games from this season.  I plotted the shot and goal events on top of an image of an NHL rink.  Pretty cool, huh?  You can set the washout of the image so you can focus on the plotted data.  I am sure that you can do this easily in R as well, but since I can barely debug my own code, any help on that front would be more than appreciated.
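For what it is worth, here is a rough sketch of how the same kind of plot might be started in base R.  The events are simulated placeholders, NOT real play-by-play data, and the type column with its SHOT/GOAL values is my assumption about how the events would be labeled; only xcoord and ycoord come from the event files.  A plain rectangle stands in for the rink image.

[sourcecode language="r"]
set.seed(1)

# Simulated placeholder events (NOT real NHL data); 'type' is an assumed column
events <- data.frame(
  xcoord = runif(200, -100, 100),
  ycoord = runif(200, -40, 40),
  type   = sample(c("SHOT", "GOAL"), 200, replace = TRUE, prob = c(0.9, 0.1))
)

# Quick min/max check on the coordinates
print(sapply(events[c("xcoord", "ycoord")], range))

is_goal <- events$type == "GOAL"

# ylim is given as (40, -40) so the y-axis is inverted, as in the Tableau setup
plot(events$xcoord, events$ycoord,
     xlim = c(-100, 100), ylim = c(40, -40),
     col  = ifelse(is_goal, "red", "grey40"),
     pch  = ifelse(is_goal, 19, 1),
     xlab = "xcoord", ylab = "ycoord",
     main = "Shots and goals (simulated placeholder data)")

# Crude stand-in for the boards
rect(-100, -40, 100, 40, border = "black")

legend("topright", legend = c("Shot", "Goal"),
       col = c("grey40", "red"), pch = c(1, 19))
[/sourcecode]

Swapping the simulated data frame for the parsed CSV should be the only change needed to try this on real events, assuming the column names match.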

And because I can’t get this up and running and it’s close to game time, here is the link for the 100 random games below. It looks far better when they host it.

UPDATE:  Here is the link to the event data for the entire game.  You should be able to filter the data by period.  For the slideshow, it appears as if you have to manually page through each play.


 

Create a Web Crawler in R

Admittedly I am not the best R coder, and I certainly have a lot to learn, but the code at the link below should provide you with an example of how easy it is to create a very (repeat: very) basic web crawler in R.  If you wanted to do this in SPSS, and I have, you would have to step outside of the normal syntax language and leverage the Python plugin.

I simply use a function to do my data manipulation, and access that function within a loop before saving out a CSV of the dataset.  Granted this only works when you know the structure of the URLs that you want to scrape, but this is pretty easy and hopefully straightforward.
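To give a feel for that pattern without opening the link, here is a stripped-down sketch.  It is not the linked code: the scrape_page function, the page file names, and the idea of pulling out the page title are all made up for illustration.  To keep it runnable offline, a few local files stand in for web pages; readLines() also accepts http:// URLs, so the same loop would work against real pages with a known URL structure.

[sourcecode language="r"]
# Stand-in "pages" written to a temp folder (hypothetical; replaces real URLs)
pages_dir <- file.path(tempdir(), "crawler_demo")
dir.create(pages_dir, showWarnings = FALSE)
ids <- 1:3
for (id in ids) {
  writeLines(c("<html>", sprintf("<title>Page %d</title>", id), "</html>"),
             file.path(pages_dir, sprintf("page%d.html", id)))
}

# The known structure of the addresses: a fixed base plus an id
urls <- file.path(pages_dir, sprintf("page%d.html", ids))

# Hypothetical helper: read one page and return a one-row data frame
scrape_page <- function(url) {
  html <- readLines(url, warn = FALSE)
  title_line <- grep("<title>", html, value = TRUE)[1]
  data.frame(url   = url,
             title = gsub("</?title>", "", title_line),
             stringsAsFactors = FALSE)
}

# Call the function inside a loop, collecting one result per page
results <- vector("list", length(urls))
for (i in seq_along(urls)) {
  results[[i]] <- scrape_page(urls[i])
}

# Stack the pieces and save out a CSV
crawl <- do.call(rbind, results)
out_file <- file.path(pages_dir, "crawl.csv")
write.csv(crawl, out_file, row.names = FALSE)
print(crawl$title)
[/sourcecode]

Collecting the results in a list and binding them once at the end keeps the loop simple and avoids growing a data frame row by row.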

The code is located here.

Clustering NHL Skaters

I have been sitting on this post for some time now and wanted to get it out there.  The goal is to simply show how easy it is to pull live data from the web into R, massage it, and perform some analytics on it.  I am not sure how useful this analysis really is in practice, but the larger point is to show you how powerful R is for very quick analysis.

I admit that I am a somewhat sloppy coder, but hopefully my comments may help you out, especially if you are new to R and are interested in things like:

  • How to sample data (both rows and columns)
  • Recode values
  • Re-order factors
  • Reduce the data using principal components
  • Cluster the data using these components
  • Basic plotting and how you can control everything you want on the plot

The code can be found here.  The plots below show you some of the output.
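If you just want a feel for those steps before opening the linked code, here is a compact sketch.  It is not the analysis from this post: it uses the built-in mtcars dataset as a stand-in for the skater data, and k-means with 3 clusters is my assumption for the clustering step.

[sourcecode language="r"]
set.seed(42)

# Sample rows and pick a subset of columns (mtcars stands in for skater data)
rows <- sample(nrow(mtcars), 20)
cols <- c("mpg", "hp", "wt", "qsec", "cyl")
d <- mtcars[rows, cols]

# Recode a numeric value into labels
d$size <- ifelse(d$cyl <= 4, "small",
                 ifelse(d$cyl == 6, "medium", "large"))

# Re-order the factor levels so they are not alphabetical
d$size <- factor(d$size, levels = c("small", "medium", "large"))

# Reduce the numeric columns with principal components (centered and scaled)
pca <- prcomp(d[, c("mpg", "hp", "wt", "qsec")], center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Cluster on the first two components (k-means with 3 groups is an assumption)
km <- kmeans(scores, centers = 3, nstart = 25)
d$cluster <- factor(km$cluster)

# Basic plot with a few of the many things you can control
plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "Clusters on the first two principal components")
legend("topright", legend = levels(d$cluster),
       col = seq_along(levels(d$cluster)), pch = 19, title = "Cluster")
[/sourcecode]

Scaling before prcomp matters here because the columns are on very different scales; without it, the largest-valued column would dominate the components.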

As mentioned above, this wasn’t aimed at being an in-depth review of team performance or skater ability, but I think you can see where this analysis could go.  The aim of the team distribution plot is to show how each team’s skaters are distributed across the clusters, with reference lines that would break up the teams into 4 equal-size groups.

If you follow the NHL, take a look at New Jersey or Toronto.  These two teams are not having the best seasons, and according to this plot, more than half of each team’s skaters fall into the 2 lowest-performing clusters.  In addition, look at Philadelphia, one of the better teams in the league.  More than 25% of their team was clustered into the top-performing group.

Teach Yourself How to Create Functions in R

As you can tell from my previous posts, I am diving head first into learning how to program (and simplify) my analytical life using R.  I have always learned by example and have never really prospered from the “learn from scratch” school of thought.  As I follow along with some fellow R programmers, I find that they often use functions.  Intuitively I understand what they are and why they are awesome, yet I rarely find myself ever thinking to employ them.

One reason is that I simply don’t yet fully grasp how to catch errors and debug them efficiently.  In addition, I am a very sloppy coder and don’t want to create anything “too complicated” since I am just starting out.  However, I recently was introduced to a neat little trick and I wanted to share.

Say there is a function that you often use but find yourself wanting to make a personal change to.  Simply type the function name into the command line without any parentheses or arguments, and R will print its source code!

For example:

[sourcecode language="r"]
library(gmodels)
CrossTable
[/sourcecode]

I am currently trying to find a quick way to automate professional-looking crosstabs that I would feel comfortable using in the appendix of a survey research project on campus.  I really like the CrossTable function inside the gmodels package, but when Sweaving the output, the size is ridiculous!  Even when it prints to the console, you can see how large it is compared to other output.  Although I am not capable of figuring out how to change this quite yet, it is still good as a newbie to R to see how the function was programmed, and presumably, how to improve my own R coding skills.
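The same trick works on any function, and a personal tweak can be as small as changing a default.  Here is a minimal sketch using sd from base R so that no extra package is needed; the name my_sd is made up, and this is just one way to go about it, not a recommendation for how to modify CrossTable.

[sourcecode language="r"]
# Typing the name alone prints the source
sd

# Look at just the arguments, or just the body
args(sd)
body(sd)

# Make a personal copy and change one default (my_sd is a hypothetical name)
my_sd <- sd
formals(my_sd)$na.rm <- TRUE

x <- c(1, 2, NA, 4)
print(sd(x))     # NA, because the original keeps missing values by default
print(my_sd(x))  # the copy drops missing values by default
[/sourcecode]

Working on a copy leaves the original function untouched, so there is little risk in experimenting.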

As an aside, I recently added my blog posts to the R-bloggers feed.  If you stumble across my posts and are also trying to learn R, you would do well to subscribe to the RSS feed.  The link to the feed is http://www.r-bloggers.com/.