US News National Rankings – Yield Rates and R

Today I saw a post about the yield rates at institutions considered in the “National” Rankings for US News.  Above all else, I was interested in the data table contained within the post.  I wrote a quick R script to grab the table and plot where a particular school might find themselves given the data reported.  The script generates a plot and highlights where the school would fall within the distribution.

If you want to run the R code, all you need to do is ensure that you have the XML package and change the BM variable to a value of interest.  The image below highlights what one institution might see.

The code for this analysis can be found below:
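As a minimal sketch of the plotting step only (not the original script), the idea can be shown with base R. The yield rates here are simulated stand-ins rather than the US News table, and the value assigned to `BM` is hypothetical; I am assuming `BM` holds the yield rate of the school of interest.

```r
# Stand-in data: simulated yield rates (enrolled / admitted), NOT the US News table
set.seed(42)
yield <- round(runif(200, min = 0.10, max = 0.80), 3)

# BM: yield rate of the school of interest (hypothetical value)
BM <- 0.35

# Share of schools with a yield rate at or below BM
pct_below <- mean(yield <= BM)

# Histogram of the distribution with the school's position marked in red
hist(yield, breaks = 20, col = "grey80", border = "white",
     main = "Yield rates (simulated data)", xlab = "Yield rate")
abline(v = BM, col = "red", lwd = 2)
legend("topright", bty = "n", col = "red", lwd = 2,
       legend = sprintf("BM = %.2f (%.0f%% of schools at or below)",
                        BM, 100 * pct_below))
```

With the real table in a data frame, only the `yield` vector and `BM` would need to change.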

EPS Market Map in R

There are a few minor tweaks remaining on this map before it is complete, but I wanted to share the EPS Market Map I put together.  It can be downloaded using this link.

This file is meant to be used with R and divides the lower 48 states into the College Board’s Enrollment Planning Service (EPS) markets. To build the territories, I used the crosswalk file provided on the EPS search site (in the appendix) and ‘dissolved’ the zip codes into markets. Help on how I performed this task can be found here and here.

As you will see below, there are still a few gaps in the map that I need to fill in.  Ideally, the College Board would provide the necessary GIS files to us, but currently that is not an option.

My end game is to use this file to geocode Lat/Long data to EPS territories in addition to basic choropleth mapping for enrollment planning.  If you want to contribute to this project, please don’t hesitate to reach out!

The Rdata file currently includes 3 objects.  This will change as I finalize the map files.

  1. eps.missing, which is a data frame of zip codes that still need to be associated with an EPS territory
  2. myzip, which is a SpatialPolygonsDataFrame object.  It is the map of the lower 48 by zip code.  To plot, simply use the command plot(myzip), but note it will take a minute or so depending on your machine
  3. eps.markets, which is the working draft of the EPS markets map and is the same type as myzip

Here are two quick plots of the map.  The image on the top simply plots each market as red, which helps in finding the gaps.  The image on the bottom uses a random color for each market.
> plot(eps.markets, col="red")
> plot(eps.markets, col=sample(colors(), 301, replace=TRUE))


 

Using FAFSA Data to study Competitors – Part 2

I wanted to build upon my previous post and dive a little deeper into the sorts of questions we can answer using the FAFSA data supplied to us by applicants.

As a quick overview, students completing the FAFSA for student aid can list up to ten institutions on the form. I consider this the student’s consideration set. When aggregating these data, we can start to get a sense of the most frequently listed schools and how these institutions may be related.

With these data, you can manipulate the structure to answer a wide range of questions. One approach is to coerce the data into a network. For this task, I am going to use the statistical programming language R and the igraph library. The resulting network includes all schools listed (excluding the host institution), with weighted edges representing the number of co-occurrences.
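To make the co-occurrence idea concrete, here is a minimal base R sketch. The student IDs and school names are made up, and I am assuming the FAFSA lists have already been reshaped into one row per student and school, with the host institution removed.

```r
# Toy FAFSA lists: one row per (student, school) pair; all values hypothetical
fafsa <- data.frame(
  student = c(1, 1, 1, 2, 2, 3, 3, 3, 4),
  school  = c("A", "B", "C", "A", "B", "B", "C", "D", "A"),
  stringsAsFactors = FALSE
)

# Student-by-school incidence matrix (1 = the student listed the school)
inc <- 1 * (unclass(table(fafsa$student, fafsa$school)) > 0)

# School-by-school co-occurrence counts: how many students listed both schools
cooc <- crossprod(inc)

# A school trivially co-occurs with itself, so zero out the diagonal
diag(cooc) <- 0

print(cooc)
```

From here, igraph's `graph_from_adjacency_matrix(cooc, mode = "undirected", weighted = TRUE, diag = FALSE)` is one way to turn the matrix into a weighted network.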

Listed below are some quick stats on my undirected network from the last few years:

  • Graph density: 0.05108093
  • Diameter: 5
  • Average Path Length: 2.418751
  • Transitivity (clustering coefficient): 0.3390529

Graph density is the ratio of the number of edges to the total number of possible edges. For context, an edge is a connection between two schools. If you think of Facebook, you and your friends are connected by an edge. Diameter is the number of steps (edges) required to connect the two farthest nodes in the network, following the shortest path between them. The Average Path Length is the average number of steps along the shortest path between every pair of schools. The clustering coefficient measures how strongly the nodes tend to cluster together (listed on the same FAFSA form): it is the share of connected triples of schools that close into a triangle.
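For anyone who wants to see the arithmetic behind these four numbers, here is a small base R sketch on a made-up network of four schools. It is only an illustration of the definitions, not how igraph computes them internally, and the schools and edges are hypothetical.

```r
# Toy undirected network: four hypothetical schools, five edges
schools <- c("A", "B", "C", "D")
adj <- matrix(0, 4, 4, dimnames = list(schools, schools))
edge_list <- rbind(c("A", "B"), c("A", "C"), c("B", "C"),
                   c("B", "D"), c("C", "D"))
adj[edge_list] <- 1          # fill one triangle of the matrix
adj[edge_list[, 2:1]] <- 1   # mirror it, since the network is undirected

n <- nrow(adj)
m <- sum(adj[upper.tri(adj)])   # number of edges

# Density: actual edges over possible edges
dens <- m / choose(n, 2)

# Shortest path lengths between all pairs (Floyd-Warshall)
d <- ifelse(adj > 0, 1, Inf)
diag(d) <- 0
for (k in seq_len(n)) d <- pmin(d, outer(d[, k], d[k, ], "+"))

diam     <- max(d[upper.tri(d)])    # longest shortest path
avg_path <- mean(d[upper.tri(d)])   # mean shortest path over all pairs

# Transitivity: 3 * triangles / connected triples
triangles <- sum(diag(adj %*% adj %*% adj)) / 6
triples   <- sum(choose(rowSums(adj), 2))
trans     <- 3 * triangles / triples

c(density = dens, diameter = diam, avg_path = avg_path, transitivity = trans)
```

On an igraph object built from the same edges, `graph.density`, `diameter`, `average.path.length`, and `transitivity` should report the same quantities.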

Shown below is a plot of the graph, with each school sized by its PageRank score (a function included in igraph).

It’s easy to see that there are a few key players in the FAFSA network; I consider these “core” competitors. More interesting to me, however, are the schools at the outer edge, as they are less common and speak to the choice set of an applicant.

In summary, this post was intended to be a quick overview of how one might employ network analysis to study the schools commonly listed on the FAFSA form for your institution. In the future, I will take the same data and use association rules to find common patterns of school listings.

EDIT: Here are the code snippets that I used to generate the data and plot above:

## assumes g is the undirected igraph network described above
library(igraph)
## basic stats:
## density (graph.density)
graph.density(g)
## diameter
diameter(g, directed=FALSE)
## average path length (shortest.paths)
average.path.length(g, directed=FALSE)
## transitivity (clustering coefficient)
transitivity(g)
## radius
radius(g)
## degree distribution
plot(1-degree.distribution(g, cumulative=TRUE), type="l",
     xlab="degree", ylab="Cume Distribution", main="FAFSA Network")
## layout stored on the graph (right-hand side missing in the source)
## g$layout <- ...
## pagerank score for each school, used to size the vertices
pagerank <- page.rank(g)$vector
plot(g,
     vertex.size=pagerank*150,
     vertex.label=NA,
     vertex.color="red",
     vertex.frame.color="black",
     edge.arrow.size=0,
     edge.color=colors()[239],
     edge.width=.5,
     edge.curved=TRUE,
     layout=layout.auto(g))

 

Using FAFSA Data to Define Competitor Density

I have been thinking a lot about how to define and discuss competition at the undergraduate level.   I will save the chat on which dataset is better (ASQ, Student Clearinghouse, social media, etc.) for another day.

One common question I get as an analyst in Enrollment Management is how to “define” competition. While it’s never an easy question, from a marketing perspective we often have to subset competition into a few levels: core, secondary, aspirant, regional, etc. Even before this, though, I believe it is critically important to understand “Competitor Density.”

Through a statistical lens, Competitor Density is rather straightforward. Simply put, it is the cumulative share of students covered by the top “N” schools.  For illustration, refer to the chart below, which is filtered on domestic + admitted students over the last 3 applicant pools.

The plot above reveals two very interesting facts:

  1. A small set of competitors represents a large share of the “core” competition.  While the plot above assumes that a student was admitted at every institution they listed on the form, this basic assumption allows us to broadly define the consideration set for an applicant.
  2. After appending other (aggregated) information from our student information systems, we can start to answer some pretty complex questions about how students finalize the list of schools to which they eventually apply.

In a future post, I intend to highlight how analysts in higher ed can manipulate FAFSA data using association rules and network theory.

 

In the interim, I will leave you with some basic stats on the plot above.  If you stumble across this post and you work in higher ed, feel free to comment and post comparable stats.  I would love to see how these data vary across different institutions.

Please remember that the “host” institution was removed.  Only competitor schools were included in the plot.

  • 652 distinct schools were included over 3 applicant terms (fall only)
  • Top 2 schools = 10.6% of all admitted students
  • Top 10 schools = 34.3%
  • Top 25 schools = 52.1%
  • Top 50 schools = 67%
  • Top 75 schools = 76%
  • Top 100 schools = 82%
  • Top 228 schools = 95%

Stepping back, 72 schools account for 75% of the competition.  That’s a pretty “easy” way to define a set of schools, considering that there are over 3,000 higher ed institutions listed in IPEDS.
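Here is a rough sketch of how a curve like this can be computed in base R. The listings are toy data, and I am assuming coverage is measured as each school's share of all competitor listings; if you count distinct students instead, the aggregation step would change.

```r
# Toy competitor listings (host institution removed); school names are hypothetical
listings <- c(rep("A", 5), rep("B", 3), "C", "D")

# Number of listings per school, most frequently listed first
counts <- sort(table(listings), decreasing = TRUE)

# Cumulative share of listings covered by the top N schools
cum_share <- cumsum(as.vector(counts)) / sum(counts)
names(cum_share) <- names(counts)

# Smallest N whose schools cover at least 75% of the listings
n_for_75 <- unname(which(cum_share >= 0.75)[1])

print(round(cum_share, 3))
print(n_for_75)
```

Plotting `cum_share` against `seq_along(cum_share)` gives the density curve, and the same `which()` call with a different threshold returns the other cut points.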

 

 

ACT to SAT M+V Concordance Chart in R

For those of you who work in Enrollment Management and routinely analyze higher ed data, I wanted to share an easy way to convert ACT to equivalent SAT M+V scores in R. I am dynamically building a dataset that uses the concordance chart located here. Simply use this data frame and merge it onto your existing data (?merge) to calculate the “best standardized test score” for a given recruit, applicant, etc.
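As a minimal sketch of the merge step: the concordance rows below are placeholder values for illustration only (substitute the published chart), and the applicant data frame and all column names are hypothetical.

```r
# Placeholder concordance rows, NOT the published chart
concordance <- data.frame(
  act       = c(30, 31, 32),
  sat_equiv = c(1340, 1380, 1420)
)

# Hypothetical applicants; some have only one of the two test scores
applicants <- data.frame(
  id     = 1:4,
  act    = c(31, NA, 30, 32),
  sat_mv = c(1300, 1250, NA, 1500)
)

# Left join: keep every applicant, attach the SAT equivalent of their ACT score
scored <- merge(applicants, concordance, by = "act", all.x = TRUE)

# Best standardized score: the higher of the actual SAT and the ACT equivalent
scored$best_score <- pmax(scored$sat_mv, scored$sat_equiv, na.rm = TRUE)

# merge() sorts by the key, so restore the original applicant order
scored <- scored[order(scored$id), ]
print(scored)
```

Using `all.x = TRUE` keeps applicants with no ACT score, and `na.rm = TRUE` in `pmax()` lets a single available score stand in as the best score.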

If you aren’t using R, give it a shot. It’s worth the effort, and you will undoubtedly begin to find that SAS and SPSS are too much work.

Let’s save the debate on the validity of standardized tests for another day…