16 August 2013

I recently ran a small 5k race in Ann Arbor, MI called the UA Plumbers and Pipefitters 5k. It raised money for the Semper Fi Fund, which is a great cause. It also had an amazing logo of a running U-shaped trap pipe, and I really wanted the t-shirt and medal with that logo on it.

## 5k

This race had a 6:50pm start, which is unusual, but sort of a nice time, if you ask me, and it was a nice evening—warm, but not too hot, and humid, but not too humid.

I ended up having a nice race, which prompted me to look up my past times:

Table 1: 5k Times

| Event | Date | Time |
|---|---|---|
| Turkey Trot | 11/25/2010 | 23:09 |
| Turkey Trot | 11/24/2011 | 22:46 |
| Turkey Trot | 11/22/2012 | 21:09 |
| Gallup Gallop | 7/14/2013 | 20:37 |
| Plumbers and Pipefitters | 8/12/2013 | 19:19 |

Noticing that every time was faster, I thought I'd make a plot, since it would show a trend, and a trend that I liked, since I got faster every time. This might be a different blog post if there were other trends (one with "sample data").

Figure 1: 5k Times Trend

That's a nice graph, if I do say so myself.

And my interpretation of the trend and spread is that I ran faster than expected, which means I can run slower in my next 5k and still maintain the trend. Yay for running slower!

## Making Graphs of Running Times with R

Because I like to make plots with the R software for statistical computing and graphics, that's what I used to make that plot, and because this would be an even more self-centered blog post if I didn't share something with you, following are the steps to make that plot with your own running times.

### The one non-standard library and our data

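The rest of this post assumes the data in Table 1 has been loaded into a data frame called `r` with `Event`, `Date`, and `Time` columns. The first step is then to load the one library we need, `ggplot2`:

```
library(ggplot2)
```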
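The code that loads the data is not shown in this post. One way to do it, as a sketch that assumes the values in Table 1 are simply typed in by hand, might look like this:

```r
# Assumption: the Table 1 values are entered by hand. Reading them from a
# CSV file with read.csv() would work just as well.
r <- data.frame(
  Event = c("Turkey Trot", "Turkey Trot", "Turkey Trot",
            "Gallup Gallop", "Plumbers and Pipefitters"),
  Date  = c("11/25/2010", "11/24/2011", "11/22/2012",
            "7/14/2013", "8/12/2013"),
  Time  = c("23:09", "22:46", "21:09", "20:37", "19:19"),
  stringsAsFactors = FALSE  # keep the columns as plain character strings
)
str(r)
```

Keeping `Date` and `Time` as character strings at this stage matters, because the conversion steps that follow parse them from text.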

Representing dates in R is pretty simple, but representing times is a little trickier.

### Getting the data just so

The next two lines convert the dates in the table into dates that R understands and convert the times to seconds for the sake of the plot.

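```
# Parse the dates in Table 1 (month/day/year) into R Date objects
r$Date <- as.Date(r$Date, format='%m/%d/%Y')
# Parse each MM:SS time as a clock time today, then subtract midnight
# today to leave the number of elapsed seconds
r$Times <- (as.numeric(as.POSIXct(strptime(r$Time, format="%M:%OS"))) -
            as.numeric(as.POSIXct(strptime("0", format="%S"))))
```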

The second line is the result of some Google searching and StackExchange finding, but in the end it converts the `MM:SS` formatted times into seconds and stores it in `Times` (note the extra `s` to denote seconds).
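If the `strptime` arithmetic feels opaque, the same conversion can be sketched by splitting each time on the colon. This is an alternative, not the approach used above, and `mmss_to_secs` is a name made up for this example:

```r
# Hypothetical helper: convert "MM:SS" strings to seconds by splitting
# on the colon and computing minutes * 60 + seconds.
mmss_to_secs <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  sapply(parts, function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]))
}

mmss_to_secs(c("23:09", "19:19"))
# [1] 1389 1159
```

With a data frame like `r`, this would be used as `r$Times <- mmss_to_secs(r$Time)`.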

### Setting up the y-labels

We want the y-labels back in our `MM:SS` format, and it would be nice, for a small amount of data, to put a y-axis label at every point.

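```
# Break positions: my own times plus whole minutes from 18:00 upward
secs <- c(r$Times, seq(from=18*60, to=max(r$Times)+120, by=60*1))
# Matching MM:SS labels, with the seconds zero-padded to two digits
labels <- paste(as.integer(secs/60),
                formatC(round((secs/60 - as.integer(secs/60)) * 60),
                        width=2,
                        flag="0"),
                sep=":")
```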

First we make a vector called `secs` that has my run times, converted to seconds, and then some "normal" times (19:00, 20:00, etc.) converted to seconds. The line:

```
seq(from=18*60, to=max(r$Times)+120, by=60*1)
```

makes a sequence of numbers that starts at eighteen minutes (because I'm confident I'll never run an 18:00 5k) and ends no later than two minutes past my slowest time (this leaves room on the plot for labels and frames the times). The labels will be one minute apart (`by=60*1`). That sequence defines the y-axis points, but would make for non-intuitive labels.
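To make that sequence concrete, here is a small sketch using my slowest time from Table 1 (23:09, or 1389 seconds). The variable names `slowest` and `ticks` are only for this illustration:

```r
slowest <- 23*60 + 9   # 23:09 in seconds = 1389
# Whole minutes from 18:00 up to, at most, two minutes past the slowest time
ticks <- seq(from = 18*60, to = slowest + 120, by = 60*1)
ticks
# [1] 1080 1140 1200 1260 1320 1380 1440 1500
```

The last value is 1500 (25:00) rather than 1509, because `seq` stops at the last whole step that does not go past `to`.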

The next line creates a vector called `labels` that converts the seconds into the format `MM:SS` by `paste`-ing together minutes and seconds separated by a colon (`sep=":"`). To get minutes, we simply take the integer part of `secs` divided by 60, and that's the first half of our paste. The second half of the paste needs to be padded with a leading zero if it isn't long enough (otherwise your time might be 20:9 instead of 20:09), so we use the `formatC` function with the options: our number, `width=2` (pad to two characters), and `flag="0"` (pad with 0s). Our number is the decimal part of (`secs` divided by 60), multiplied by 60 to get seconds and rounded to the nearest integer.

At this point we have two vectors: `secs` and `labels` that match each other—one has seconds and one has `MM:SS`, each in the same location in the vector.
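The same seconds-to-`MM:SS` labelling can also be sketched with integer division and `sprintf`. This is an alternative to the `paste`/`formatC` version above, and `secs_to_mmss` is a hypothetical name for this example:

```r
# Hypothetical helper: format a number of seconds as "MM:SS".
secs_to_mmss <- function(secs) {
  secs <- round(secs)                     # round first so we never get ":60"
  sprintf("%d:%02d",
          as.integer(secs %/% 60),        # whole minutes
          as.integer(secs %% 60))         # leftover seconds, zero-padded
}

secs_to_mmss(c(1389, 1209, 1080))
# [1] "23:09" "20:09" "18:00"
```

Rounding before splitting into minutes and seconds is a small design choice: a fractional value such as 1199.6 becomes 20:00 instead of 19:60.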

### Using the data to make a pretty plot

At this point, we have all the data we need in the R data frame (a data frame is like one sheet in an Excel spreadsheet) called `r`, some labels in `secs` and `labels` and all we have left to do is use `ggplot2` to plot it.

`ggplot2` builds a plot piece by piece, which is nice for making incremental changes, and also nice for explaining, since each piece stands on its own.

```
plot = ggplot(r, aes(x=Date, y=Times, label=Event)) + geom_step()
```

This line creates the `plot` object (you can call it whatever you want; it's a normal R variable) and starts the `ggplot` process by telling it: "We're using the `r` data frame, and aesthetically we are going to use `Date` for the x-data and `Times` (our `Time` converted to seconds) for the y-data, and we're going to label it with the `Event` name." To draw a line, we want steps, not a series of slopes, so we add `geom_step()` to the plot.

Next we add the text for the `label=` we specified above and set a size (3) and a vertical adjustment so they are above the point (`vjust=-0.5`):

```
plot = plot + geom_text(size=3, vjust=-0.5)
```

The x-axis can be a little too short to leave room for the long race names in the labels, so we add a little on each end, by subtracting from the minimum date (`min(r$Date)`) and adding to the maximum date (`max(r$Date)`). The amount added is a guess based on the size of the labels of the first and latest races.

```
plot = plot + xlim(min(r$Date)-60, max(r$Date)+90)
```
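Adding or subtracting a number from an R `Date` shifts it by that many days, so the padding above is 60 days before the first race and 90 days after the latest one. A quick sketch with the two end dates from Table 1 (the variable names are only for this example):

```r
first_race  <- as.Date("11/25/2010", format = "%m/%d/%Y")
latest_race <- as.Date("8/12/2013",  format = "%m/%d/%Y")

first_race - 60    # 60 days earlier
# [1] "2010-09-26"
latest_race + 90   # 90 days later
# [1] "2013-11-10"
```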

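Then we add a point for each race along the step line, and also a smoothed trend with a shaded band (the gray area in the plot) to get some sort of prediction of the range. With `method="glm"` and no other options, the trend is a straight-line fit and the band is its 95% confidence interval.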

```
plot = plot + geom_point() + stat_smooth(method="glm")
```
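To see the numbers behind that kind of straight-line trend, a similar fit can be sketched outside of `ggplot2` with `lm()`. This is an illustration under assumptions: the data is retyped here with the times already in seconds, the data frame is called `races` only for this example, and the next race date (`2013-11-28`) is made up, not a race from this post.

```r
# Times in seconds, retyped from Table 1 (e.g. 23:09 is 23*60 + 9 = 1389)
races <- data.frame(
  Date  = as.Date(c("2010-11-25", "2011-11-24", "2012-11-22",
                    "2013-07-14", "2013-08-12")),
  Times = c(1389, 1366, 1269, 1237, 1159)
)

# Straight-line fit of time (seconds) against date (days since 1970-01-01)
fit <- lm(Times ~ as.numeric(Date), data = races)
coef(fit)  # the slope is in seconds per day; negative means getting faster

# Predicted time, with a confidence interval, for a hypothetical next race
next_race <- data.frame(Date = as.Date("2013-11-28"))
predict(fit, newdata = next_race, interval = "confidence")
```

The `lwr` and `upr` columns returned by `predict` play the same role as the edges of the gray band in the plot.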

Lastly, we use the `secs` and `labels` from above to make y-axis labels, and set the range of the y-axis to be between 18 minutes (`60*18`) (since I don't think I'll break an 18-minute 5k) and the slowest time (`max(r$Times)`). We turn off the x-axis label, since the fact that it is dates is pretty evident, and then call `plot` to draw the plot.

```
plot = plot + scale_y_continuous(breaks=secs,
                                 labels=labels,
                                 limits=c(60*18, max(r$Times)))
plot = plot + xlab("")
plot
```

### The final R script

```
library(ggplot2)

# Assumes `r` is a data frame with Event, Date and Time columns (Table 1)

# Convert dates to Date objects and MM:SS times to seconds
r$Date <- as.Date(r$Date, format='%m/%d/%Y')
r$Times <- (as.numeric(as.POSIXct(strptime(r$Time, format="%M:%OS"))) -
            as.numeric(as.POSIXct(strptime("0", format="%S"))))

# y-axis break positions (in seconds) and their MM:SS labels
secs <- c(r$Times, seq(from=18*60, to=max(r$Times)+120, by=60*1))
labels <- paste(as.integer(secs/60),
                formatC(round((secs/60 - as.integer(secs/60)) * 60),
                        width=2,
                        flag="0"),
                sep=":")

# Build the plot piece by piece
plot = ggplot(r, aes(x=Date, y=Times, label=Event)) + geom_step()
plot = plot + geom_text(size=3, vjust=-0.5)
plot = plot + xlim(min(r$Date)-60, max(r$Date)+90)
plot = plot + geom_point() + stat_smooth(method="glm")
plot = plot + scale_y_continuous(breaks=secs,
                                 labels=labels,
                                 limits=c(60*18, max(r$Times)))
plot = plot + xlab("")
plot
```