# R is for Running: Running, stats, pretty plots and some R learnin'!!

I recently ran a small 5k race in Ann Arbor, MI called the UA Plumbers and Pipefitters 5k. It raised money for the Semper Fi Fund, which is a great cause. It also had an amazing logo of a running U-shaped trap pipe, and I really wanted the t-shirt and medal with that logo on it.

## 5k

This race had a 6:50pm start, which is unusual, but sort of a nice time, if you ask me, and it was a nice evening—warm, but not too hot, and humid, but not too humid.

I ended up having a nice race, which prompted me to look up my past times:

Event | Date | Time |
---|---|---|
Turkey Trot | 11/25/2010 | 23:09 |
Turkey Trot | 11/24/2011 | 22:46 |
Turkey Trot | 11/22/2012 | 21:09 |
Gallup Gallop | 7/14/2013 | 20:37 |
Plumbers and Pipefitters | 8/12/2013 | 19:19 |

Noticing that every time was faster than the one before, I thought I'd make a plot, since it would show a trend, and a trend that I liked. This might be a different blog post if there were other trends: one with "sample data".

Figure 1: 5k Times Trend

That's a nice graph, if I do say so myself.

And my interpretation of the trend and spread is that I ran faster than expected, which means I can run slower in my next 5k and still maintain the trend. Yay for running slower!

## Making Graphs of Running Times with R

Because I like to make plots with the R software for statistical
computing and graphics, that's what I used to make that plot, and
because this would be an *even more* self-centered blog post if I
didn't share something with you, following are the steps to make
that plot with your own running times.

### The one non-standard library and our data

The first step is to get the library we need, `ggplot2`, and
load the data:

    library(ggplot2)
    r <- read.table("file.txt", header=TRUE, sep="|")
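If you want to try this without creating a file first, here is a minimal sketch that reads the same table from a string. The exact layout of `file.txt` isn't shown in this post, so the layout below (a header row, pipe separators, no trailing pipe) is an assumption, and `raw` is just an illustrative name.

```r
# Assumed layout of file.txt: header row, "|" separators, no trailing "|".
raw <- "Event|Date|Time
Turkey Trot|11/25/2010|23:09
Turkey Trot|11/24/2011|22:46
Turkey Trot|11/22/2012|21:09
Gallup Gallop|7/14/2013|20:37
Plumbers and Pipefitters|8/12/2013|19:19"

# text= reads from a string instead of a file. strip.white trims stray
# spaces around each field, and stringsAsFactors=FALSE keeps the text
# columns as plain character vectors on older versions of R.
r <- read.table(text = raw, header = TRUE, sep = "|",
                strip.white = TRUE, stringsAsFactors = FALSE)

str(r)
```

At this stage `Date` and `Time` are still plain text, which is why the next step converts them.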

Representing dates in R is pretty simple, but representing times is a little trickier.

### Getting the data just so

The next two lines convert the dates in the table into dates that R understands and convert the times to seconds for the sake of the plot.

    r$Date <- as.Date(r$Date, format='%m/%d/%Y')
    r$Times <- (as.numeric(as.POSIXct(strptime(r$Time, format="%M:%OS"))) - as.numeric(as.POSIXct(strptime("0", format="%S"))))

The second line is the result of some Google searching and
StackExchange finding, but in the end it converts the `MM:SS`
formatted times into seconds and stores them in `Times` (note the
extra `s` to denote seconds).
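If the date-time subtraction feels like a lot of machinery, the same conversion can be sketched with plain string splitting. This is an alternative to the line above, not what the plot script uses, and `mmss_to_seconds` is a made-up helper name.

```r
# Hypothetical helper: convert "MM:SS" strings to seconds by splitting
# on the colon, without going through R's date-time classes.
mmss_to_seconds <- function(x) {
  parts <- strsplit(as.character(x), ":", fixed = TRUE)
  vapply(parts,
         function(p) as.numeric(p[1]) * 60 + as.numeric(p[2]),
         numeric(1))
}

# 23 minutes 9 seconds and 19 minutes 19 seconds, as seconds.
mmss_to_seconds(c("23:09", "19:19"))
```

This version assumes every time really is in `MM:SS` form; anything with hours in it would need a third piece.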

### Setting up the y-labels

We want the y-labels back in our `MM:SS` format, and it would be
nice, for a small amount of data, to label the y-axis of every
point.

    secs <- c(r$Times, seq(from=18*60, to=max(r$Times)+120, by=60*1))
    labels <- paste((as.integer(secs/60)), formatC(round((secs/60 - as.integer(secs/60)) * 60), width=2, flag="0"), sep=":")

First we make a vector called `secs` that has my run times,
converted to seconds, and then some "normal" times (19:00,
20:00, etc.) converted to seconds. The call:

    seq(from=18*60, to=max(r$Times)+120, by=60*1)

makes a sequence of numbers that starts at eighteen minutes (because
I'm confident I'll never run an 18:00 5k) and runs up to two minutes
more than my slowest time (this leaves room on the plot for labels
and frames the times). The labels will be one minute apart
(`by=60*1`). That sequence defines the y-axis points, but would
make for non-intuitive labels.
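To make that concrete, here is a small sketch of the same `seq` call with my slowest time from the table (23:09) typed in by hand. The names `slowest` and `ticks` are only for illustration.

```r
# 23:09 is the slowest time in the table above, written as seconds.
slowest <- 23 * 60 + 9

# One tick per minute, starting at 18:00. seq() stops at the last whole
# step that does not go past `to`, so the final tick can land a little
# short of slowest + 120.
ticks <- seq(from = 18 * 60, to = slowest + 120, by = 60)

ticks
```

Every value is a whole number of minutes, which is exactly why these make good "normal" axis points next to the odd-looking race times.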

The next line creates a vector called `labels` that converts the
seconds into the format `MM:SS` by `paste`-ing together minutes
and seconds separated by a colon (`sep=":"`). To get minutes, we
simply take the integer part of `secs` divided by 60, and that's
the first half of our paste. The second half of the paste also
needs to be padded with leading zeros if it isn't long enough
(otherwise your time might be 20:9 instead of 20:09), so we use
the `formatC` function with the options: our number, `width=2` (pad
to two characters), and `flag="0"` (pad with 0s). Our number is the
decimal part of (`secs` divided by 60), multiplied by 60 to get
seconds and rounded to the nearest integer.

At this point we have two vectors, `secs` and `labels`, that match
each other: one has seconds and one has `MM:SS`, each in the same
location in the vector.
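For comparison, the same labels can be sketched with integer division and remainder instead of pulling apart the decimal. This is an alternative I'm offering, not what the script uses, and `seconds_to_mmss` is a made-up helper name.

```r
# Hypothetical helper: format seconds as "M:SS".
seconds_to_mmss <- function(secs) {
  secs <- round(secs)            # work in whole seconds
  sprintf("%d:%02d",
          secs %/% 60,           # whole minutes
          secs %% 60)            # leftover seconds, zero-padded by %02d
}

seconds_to_mmss(c(1159, 1269, 1080))
```

Rounding before splitting means a value like 1199.6 seconds becomes 20:00 rather than an odd-looking 19:60, which the decimal approach could produce for fractional input.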

### Using the data to make a pretty plot

At this point, we have all the data we need in the R data frame (a
data frame is like one sheet in an Excel spreadsheet) called `r`,
some labels in `secs` and `labels`, and all we have left to do is use
`ggplot2` to plot it.

`ggplot2` builds a plot piece by piece, which is nice for making
incremental changes, and also nice for explaining, since each piece
stands on its own.

    plot = ggplot(r, aes(x=Date, y=Times, label=Event)) + geom_step()

This line creates the `plot` object (although you can call it
whatever you want, it's a normal R variable) and starts the
`ggplot` process by telling it "We're using the `r` data frame and
aesthetically we are going to use `Date` for the x-data and `Times`
(our Time converted to seconds) for the y-data and we're going to
label it with the `Event` name." To draw a line, we want steps, not
a series of slopes, so we add `geom_step()` to the plot.
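To see what "steps, not slopes" means, here is a minimal sketch that builds both versions from the same base plot. It assumes `ggplot2` is installed, and it uses a small hand-typed copy of my times in seconds (`races`, `base`, `steps` and `slopes` are illustrative names, not part of the script).

```r
library(ggplot2)

# Hand-typed copy of the race table, with times already in seconds.
races <- data.frame(
  Date  = as.Date(c("2010-11-25", "2011-11-24", "2012-11-22",
                    "2013-07-14", "2013-08-12")),
  Times = c(1389, 1366, 1269, 1237, 1159)
)

# The shared starting point: data plus aesthetics, no layers yet.
base <- ggplot(races, aes(x = Date, y = Times))

steps  <- base + geom_step()   # flat until the next race, then a jump
slopes <- base + geom_line()   # straight segments between races
```

Printing `steps` or `slopes` at the console draws each one. The step version reads better to me here because my time doesn't really change between races.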

Next we add the text for the `label=` we specified above and set a
size (3) and a vertical adjustment so they are above the point
(`vjust=-0.5`):

    plot = plot + geom_text(size=3, vjust=-0.5)

The x-axis can be a little too short to leave room for the long
race names in the labels, so we add a little on each end, by
subtracting from the minimum date (`min(r$Date)`) and adding to the
maximum date (`max(r$Date)`). The amount added is a guess based on
the size of the labels of the first and latest races.

    plot = plot + xlim((min(r$Date)-60), max(r$Date)+90)
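The padding works because adding or subtracting a number from an R `Date` moves it by that many days. A quick sketch with my first and latest race dates (the names `dates` and `padded` are only for illustration):

```r
# First and latest race dates, parsed the same way as in the script.
dates <- as.Date(c("11/25/2010", "8/12/2013"), format = "%m/%d/%Y")

# Date arithmetic is in days: 60 days earlier and 90 days later.
padded <- c(min(dates) - 60, max(dates) + 90)

padded
```

So the `-60` and `+90` in the `xlim` call buy roughly two months of room on the left and three on the right.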

Then we add a point for each race along the step line and also a smoothed trend line with its confidence band (the gray area in the plot) to get some sort of prediction of the range.

    plot = plot + geom_point() + stat_smooth(method="glm")
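As a rough idea of what that smoothing layer is doing, here is a sketch of a comparable fit in base R. With its default settings, `glm` fits an ordinary straight line, so this should be close to the line in the plot, though I'm assuming the defaults here rather than reproducing `ggplot2` exactly. The `races` data frame and the `Days` column are illustrative, typed in from the table.

```r
# Hand-typed copy of the race table, with times already in seconds.
races <- data.frame(
  Date  = as.Date(c("2010-11-25", "2011-11-24", "2012-11-22",
                    "2013-07-14", "2013-08-12")),
  Times = c(1389, 1366, 1269, 1237, 1159)
)

# Days since the first race, as a plain number for the model.
races$Days <- as.numeric(races$Date - min(races$Date))

# glm() with its default gaussian family is a least-squares line.
fit <- glm(Times ~ Days, data = races)

coef(fit)        # intercept, and slope in seconds per day
residuals(fit)   # negative means faster than the fitted line
```

A negative slope is the "getting faster" trend, and a negative residual on the last race is the "faster than expected" part.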

Lastly, we use the `secs` and `labels` from above to make y-axis
labels, and set the range of the y-axis to be between 18 minutes
(`60*18`) (since I don't think I'll break an 18-minute 5k) and the
slowest time (`max(r$Times)`); turn off the x-axis label, since the
fact that it shows dates is pretty evident; and then call `plot` to
draw the plot.

    plot = plot + scale_y_continuous(breaks=secs, labels=labels, limits=c(60*18, max(r$Times)))
    plot = plot + xlab("")
    plot

### The final R script

    library(ggplot2)
    r <- read.table("file.txt", header=TRUE, sep="|")
    r$Date <- as.Date(r$Date, format='%m/%d/%Y')
    r$Times <- (as.numeric(as.POSIXct(strptime(r$Time, format="%M:%OS"))) - as.numeric(as.POSIXct(strptime("0", format="%S"))))
    secs <- c(r$Times, seq(from=18*60, to=max(r$Times)+120, by=60*1))
    labels <- paste((as.integer(secs/60)), formatC(round((secs/60 - as.integer(secs/60)) * 60), width=2, flag="0"), sep=":")
    plot = ggplot(r, aes(x=Date, y=Times, label=Event)) + geom_step()
    plot = plot + geom_text(size=3, vjust=-0.5)
    plot = plot + xlim((min(r$Date)-60), max(r$Date)+90)
    plot = plot + geom_point() + stat_smooth(method="glm")
    plot = plot + scale_y_continuous(breaks=secs, labels=labels, limits=c(60*18, max(r$Times)))
    plot = plot + xlab("")
    plot