R is for Running
Running, stats, pretty plots and some R learnin'!!
I recently ran a small 5k race in Ann Arbor, MI called the UA Plumbers and Pipefitters 5k. It raised money for the Semper Fi Fund, which is a great cause. It also had an amazing logo of a running U-shaped trap pipe, and I really wanted the t-shirt and medal with that logo on it.
This race had a 6:50pm start, which is unusual, but sort of a nice time, if you ask me, and it was a nice evening—warm, but not too hot, and humid, but not too humid.
I ended up having a nice race, which prompted me to look up my past times:
|Plumbers and Pipefitters||8/12/2013||19:19|
Noticing that every time was faster, I thought I'd make a plot, since it would show a trend, and a trend that I liked, since I got faster every time. This might be a different blog post if there were other trends (one with "sample data").
Figure 1: 5k Times Trend
That's a nice graph, if I do say so myself.
And my interpretation of the trend and spread is that I ran faster than expected, which means I can run slower in my next 5k and still maintain the trend. Yay for running slower!
Making Graphs of Running Times with R
Because I like to make plots with the R software for statistical computing and graphics, that's what I used to make that plot, and because this would be an even more self-centered blog post if I didn't share something with you, following are the steps to make that plot with your own running times.
The one non-standard library and our data
The first step is to load the one library we need, ggplot2, and then load the data:
library(ggplot2)
r<-read.table("file.txt", header=TRUE, sep="|")
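For reference, the read.table call above expects a pipe-separated text file with a header row. Here is a minimal sketch of that layout, assuming the columns are named Event, Date and Time (the names the rest of the script relies on). Only the Plumbers and Pipefitters row comes from this post; the other two races are made up for illustration.

```r
# Stand-in for file.txt. Only the last row is real;
# the first two races are hypothetical sample data.
sample_lines <- "Event|Date|Time
Made-Up Spring 5k|5/5/2012|21:30
Made-Up Fall 5k|9/15/2012|20:09
Plumbers and Pipefitters|8/12/2013|19:19"

# text= lets read.table parse a string instead of a file on disk
r <- read.table(text = sample_lines, header = TRUE, sep = "|",
                stringsAsFactors = FALSE)
str(r)
```

Using text= is just a convenient way to try the script without creating a file; with a real file.txt the call from the post works as written.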
Representing dates in R is pretty simple, but representing times is a little trickier.
Getting the data just so
The next two lines convert the dates in the table into dates that R understands and convert the times to seconds for the sake of the plot.
r$Date<-as.Date(r$Date,format='%m/%d/%Y')
r$Times<-(as.numeric(as.POSIXct(strptime(r$Time, format="%M:%OS"))) -
          as.numeric(as.POSIXct(strptime("0", format="%S"))))
The second line is the result of some Google searching and
StackExchange finding, but in the end it converts the MM:SS
formatted times into seconds and stores them in
Times (note the trailing
s to denote seconds).
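If the strptime route feels opaque, the same conversion can be sketched with plain arithmetic: split each time on the colon and compute 60 times the minutes plus the seconds. The helper name mmss_to_secs is my own invention for this sketch, not something from base R or from the script in this post.

```r
# Hypothetical helper: convert "MM:SS" strings to seconds
mmss_to_secs <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts,
         function(p) 60 * as.numeric(p[1]) + as.numeric(p[2]),
         numeric(1))
}

mmss_to_secs(c("19:19", "20:09"))
# 1159 1209
```

It assumes every value has exactly one colon; times over an hour (H:MM:SS) would need another term.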
Setting up the y-labels
We want the y-labels back in our
MM:SS format, and it would be
nice, for a small amount of data, to label the y-axis at every
race time:
secs<-c(r$Times,seq(from=18*60, to=max(r$Times)+120, by=60*1))
labels<-paste((as.integer(secs/60)),
              formatC(round((secs/60 - as.integer(secs/60)) * 60), width=2, flag="0"),
              sep=":")
First we make a vector called
secs that has my run times,
converted to seconds, and then some "normal" times (19:00,
20:00, etc.) converted to seconds. The line:
seq(from=18*60, to=max(r$Times)+120, by=60*1)
makes a sequence of numbers that starts at eighteen minutes (because I'm
confident I'll never run an 18:00 5k) and ends at two minutes more
than my slowest time (this leaves room on the plot for labels and
frames the times). The labels will be every one minute
(by=60*1). That sequence defines the y-axis points, but would
make for non-intuitive labels.
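To see what that sequence looks like, here is a small sketch with a stand-in for max(r$Times). The variable slowest is hypothetical, set to 1159 seconds (19:19) purely for illustration.

```r
# Hypothetical slowest time in seconds (19:19), standing in for max(r$Times)
slowest <- 1159

# One tick per minute from 18:00 up to at most two minutes past the slowest time
ticks <- seq(from = 18 * 60, to = slowest + 120, by = 60 * 1)
ticks
# 1080 1140 1200 1260
```

Note that seq stops at the last whole step that does not pass to, so the final tick is 1260 (21:00) rather than 1279.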
The next line creates a vector called
labels that converts the
seconds into the format
MM:SS by paste-ing together minutes
and seconds separated by a colon (sep=":"). To get minutes, we
simply take the integer part of
secs divided by 60, and that's
the first half of our paste. The second half of the paste also
needs to be padded with leading zeros if it isn't long enough
(otherwise your time might be 20:9 instead of 20:09), so we use the
formatC function with the options: our number, width=2 (pad
to two characters), and flag="0" (pad with 0s). Our number is the
decimal part of (
secs divided by 60), multiplied by 60 to get
seconds and rounded to the nearest integer.
At this point we have two vectors,
secs and
labels, that match
each other: one has seconds and one has
MM:SS, each in the same
location in the vector.
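The same label-building idea can be sketched with integer division and sprintf, which does the zero padding in one step. The function name secs_to_mmss is made up for this sketch. Rounding the total seconds first also sidesteps an edge case in the paste version, where a fractional time like 1199.6 seconds would, as far as I can tell, come out as 19:60.

```r
# Hypothetical helper: convert seconds to "MM:SS" labels
secs_to_mmss <- function(secs) {
  total <- round(secs)                 # round once, before splitting
  sprintf("%d:%02d",
          as.integer(total %/% 60),    # whole minutes
          as.integer(total %% 60))     # leftover seconds, zero padded
}

secs_to_mmss(c(1159, 1209, 1199.6))
# "19:19" "20:09" "20:00"
```

For whole-second race times both approaches give the same labels, so this is a matter of taste more than a fix.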
Using the data to make a pretty plot
At this point, we have all the data we need in the R data frame (a
data frame is like one sheet in an Excel spreadsheet) called
r,
some labels in
labels, and all we have left to do is use
ggplot2 to plot it.
ggplot2 builds a plot piece by piece, which is nice for making
incremental changes, and also nice for explaining, since each piece
stands on its own.
plot = ggplot(r, aes(x=Date,y=Times,label=Event)) + geom_step()
This line creates the
plot object (although you can call it
whatever you want, it's a normal R variable) and starts the
ggplot process by telling it "We're using the
r data frame and
aesthetically we are going to use
Date for the x-data and
Times
(our Time converted to seconds) for the y-data and we're going to
label it with the
Event name." To draw a line, we want steps, not
a series of slopes, so we add
geom_step() to the plot.
Next we add the text for the
label= we specified above and set a
size (3) and a vertical adjustment so the labels sit above the points:
plot = plot + geom_text(size=3,vjust=-0.5)
The x-axis can be a little too short to leave room for the long
race names in the labels, so we add a little on each end, by
subtracting from the minimum date (
min(r$Date)) and adding to the
maximum date (
max(r$Date)). The amount added is a guess based on
the size of the labels of the first and latest races.
plot = plot + xlim((min(r$Date)-60),max(r$Date)+90)
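The padding works because adding or subtracting a number from an R Date shifts it by that many days. A quick sketch using the one race date given in this post:

```r
# Parse the race date using the same format string as the script
race_day <- as.Date("8/12/2013", format = "%m/%d/%Y")

race_day - 60   # 60 days earlier: "2013-06-13"
race_day + 90   # 90 days later:   "2013-11-10"
```

So the 60 and 90 in the xlim call are days of extra room on the left and right of the plot.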
Then we add points to each race along the step line and also a smoothing range (the gray area in the plot) to get some sort of prediction of the range.
plot = plot + geom_point() + stat_smooth(method="glm")
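As I understand it, stat_smooth(method="glm") with no other arguments fits a glm of y on x with the default gaussian family, which amounts to a straight-line fit, and the gray band is its confidence interval. Here is a rough sketch of the same idea outside the plot, using base R only. The first two rows of races are hypothetical, and the Days column and the 2014 date are my own additions for illustration.

```r
# Only the last row is from this post; the first two are made-up sample data
races <- data.frame(
  Date  = as.Date(c("2012-05-05", "2012-09-15", "2013-08-12")),
  Times = c(1290, 1209, 1159)
)

# Use days since the first race as a plain numeric predictor
races$Days <- as.numeric(races$Date - min(races$Date))

# Default family is gaussian, so this is an ordinary linear fit
fit <- glm(Times ~ Days, data = races)

# Predicted time, in seconds, for a hypothetical future race date
next_race <- data.frame(
  Days = as.numeric(as.Date("2014-05-01") - min(races$Date))
)
predict(fit, newdata = next_race)
```

With only a handful of races the fit is more of a conversation piece than a forecast, which suits the spirit of the plot.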
Lastly, we use the
labels from above to make y-axis
labels, and set the range of the y-axis to be between 18 minutes
(60*18) (since I don't think I'll break an 18-minute 5k) and the
slowest time (
max(r$Times)); turn off the x-axis label, since the
fact that they are dates is pretty evident; and then call
plot to
draw the plot.
plot = plot + scale_y_continuous(breaks=secs, labels=labels, limits=c(60*18,max(r$Times)))
plot = plot + xlab("")
plot
The final R script
library(ggplot2)

# Load the pipe-separated race data
r<-read.table("file.txt", header=TRUE, sep="|")

# Convert dates to Date objects and MM:SS times to seconds
r$Date<-as.Date(r$Date,format='%m/%d/%Y')
r$Times<-(as.numeric(as.POSIXct(strptime(r$Time, format="%M:%OS"))) -
          as.numeric(as.POSIXct(strptime("0", format="%S"))))

# Build the y-axis breaks (in seconds) and matching MM:SS labels
secs<-c(r$Times,seq(from=18*60, to=max(r$Times)+120, by=60*1))
labels<-paste((as.integer(secs/60)),
              formatC(round((secs/60 - as.integer(secs/60)) * 60), width=2, flag="0"),
              sep=":")

# Build the plot piece by piece, then draw it
plot = ggplot(r, aes(x=Date,y=Times,label=Event)) + geom_step()
plot = plot + geom_text(size=3,vjust=-0.5)
plot = plot + xlim((min(r$Date)-60),max(r$Date)+90)
plot = plot + geom_point() + stat_smooth(method="glm")
plot = plot + scale_y_continuous(breaks=secs, labels=labels, limits=c(60*18,max(r$Times)))
plot = plot + xlab("")
plot