18 June 2016

I want to visualize how many concurrent events exist in a time period, along with how frequently they start and end. I don’t need to read numbers off the visualization; I just want a relative sense of how many events are starting, ongoing, and ending over a time period, at some resolution. Something that looks like this:

[Figure: example event density plot]

Looking at the plot, you can immediately see when:

  • the most events were starting (about in the middle of the time range)
  • the most events were happening (about in the first third of the time range)
  • the most events were ending (about at the end of the first third of the time range).

With that information the reader can ask the next questions in more useful ways:

  • “why did we stop starting events about half way through the time range?”
  • “why did we stop so many events after the first third of the time range?”
  • “why was nothing at all happening for the last 5–10% of the time range?”

Those questions aren’t about the data directly but about the application of the data, which is what data are for (even though people sometimes love data for their own sake). They also aren’t obvious from the input data (Table 1).

Practice Data

To start, I create some fake data with this Python script, where all times are between 1 and 100, there are 19 events (range(1, 20) runs 19 times), and the longest event duration is 30. If it helps, you can think of these numbers as seconds after 4:15am on Thursday, June 16th, 2016, or as days after January 1st, 2000. It doesn’t matter.

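import random
from tabulate import tabulate

data = []
for m in range(1, 20):  # range(1, 20) runs 19 times
    start = random.randint(1, 70)        # starts fall in 1..70
    end = start + random.randint(1, 30)  # durations fall in 1..30
    data.append((start, end))

data.sort()
# print() with a single argument works in both Python 2 and Python 3
print(tabulate(data, tablefmt="orgtbl", headers=["Start", "End"]))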
Table 1: Sample Event Start/End Data
Start End
6 11
7 27
8 35
10 11
13 37
14 35
22 34
24 36
28 51
31 59
33 34
36 47
36 58
42 51
42 51
44 66
53 74
69 95
69 96

Organizing the Data

The next step is to count how many events are active, starting, and ending at each time step across the whole range (1–100 in our case).

This next bit of Python simply bins the data from the table above into our 100 example time bins. You don’t need to read through it closely, but you’ll need to bin your data in a similar way. The format of the data is:

Time | Number of events ending at this time | Number of events ongoing at this time | Number of events starting at this time

For example, if the frequency of your events is a few every minute, your binned data might look like:

Time Ending Ongoing Starting
13:50 4 10 3
13:51 2 11 1
13:52 0 12 4
13:53 8 8 2
13:54 1 9 4

Since no data is displayed on the x-axis (the time), though, it is a lot easier to convert the time into relative time. In this example, the times could be seconds since midnight: 49800, 49860, 49920, etc. Or, if you have a date, using the Unix epoch time (seconds since 00:00:00 UTC, 1 January 1970) makes things easy.
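As a small illustration of the relative-time idea, here is a sketch that turns clock labels such as 13:50 into seconds since midnight, and a date into Unix epoch seconds. The helper names seconds_since_midnight and epoch_seconds are made up for this example, and the epoch conversion assumes the datetime is already in UTC.

```python
import calendar
from datetime import datetime


def seconds_since_midnight(label):
    """Convert an 'HH:MM' label to seconds since midnight."""
    hours, minutes = label.split(":")
    return int(hours) * 3600 + int(minutes) * 60


def epoch_seconds(moment):
    """Convert a naive datetime, assumed to be UTC, to Unix epoch seconds."""
    return calendar.timegm(moment.timetuple())


labels = ["13:50", "13:51", "13:52"]
relative = [seconds_since_midnight(label) for label in labels]
print(relative)  # [49800, 49860, 49920]
print(epoch_seconds(datetime(2016, 6, 16, 4, 15)))
```

Either form gives plain integers, which are easier to bin than formatted timestamps.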

# timedata is the list of (start, end) pairs from Table 1
timebin = dict()   # events ongoing at each time
startbin = dict()  # events starting at each time
endbin = dict()    # events ending at each time
for timeincr in range(1, 101):
    timebin[timeincr] = 0
    startbin[timeincr] = 0
    endbin[timeincr] = 0
    for s, e in timedata:
        if s == timeincr:
            startbin[timeincr] += 1
        if e == timeincr:
            endbin[timeincr] += 1
        # an event is ongoing from its start through its end, inclusive
        if s <= timeincr <= e:
            timebin[timeincr] += 1
for m in sorted(timebin):
    print("|{} | {} | {} | {}".format(m, endbin[m],
                                      timebin[m], startbin[m]))
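
The plotting code further down reads a list called bins whose rows follow the column order above (time, ending, ongoing, starting). One way to build that list directly in Python, rather than printing a table, is sketched here. The function name bin_events and its first/last parameters are assumptions for illustration and are not part of the script above.

```python
def bin_events(events, first=1, last=100):
    """Return rows of (time, ending, ongoing, starting) for each time bin.

    events is a list of (start, end) pairs; an event counts as ongoing
    from its start time through its end time, inclusive.
    """
    rows = []
    for t in range(first, last + 1):
        starting = sum(1 for s, e in events if s == t)
        ending = sum(1 for s, e in events if e == t)
        ongoing = sum(1 for s, e in events if s <= t <= e)
        rows.append((t, ending, ongoing, starting))
    return rows


# The first four events from Table 1, binned over times 6 through 11.
sample = [(6, 11), (7, 27), (8, 35), (10, 11)]
bins = bin_events(sample, first=6, last=11)
for row in bins:
    print(row)
```

This rescans every event for every bin, just like the loop above, which is fine for a few thousand events and bins but would want a smarter approach for much larger data.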

Plotting the Density of the Bins

Once we have our bins, it’s a matter of making a density plot over time for each of the three event states (starting, ongoing, and ending).

import matplotlib.pyplot as plt


def makebarplot(bins):
    # bins is a list of rows: (time, ending, ongoing, starting)
    time = [b[0] for b in bins]   # extract the x-axis data
    fig = plt.figure()            # get the matplotlib plot figure
    fig.set_size_inches(8, 1)     # set the size of the plot
    ax = fig.add_subplot(1, 1, 1) # add a plot to the figure; Subplot
    # is confusing, though.  The magical "(1, 1, 1)" here means there
    # will be one row, one column, and we are working with plot number
    # 1, all of which is the same as just one plot.  There is a little
    # more documentation on this at:
    # http://matplotlib.org/api/pyplot_api.html#matplotlib.pyplot.subplot
    fig.patch.set_visible(False)  # make the background transparent
    # turn off the borders (called spines)
    ax.spines['top'].set_visible(False)
    ax.spines['bottom'].set_visible(False)
    ax.spines['right'].set_visible(False)
    ax.spines['left'].set_visible(False)
    # set all of the ticks to 0 length
    ax.tick_params(axis=u'both', which=u'both',length=0)
    # hide everything about the x-axis
    ax.axes.get_xaxis().set_visible(False)

    barwidth = 1                  # remove gaps between bars
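    # set the colors for the three rows: ending, ongoing, starting
    color = ["red", "blue", "green"]
    for row in range(1, len(color)+1): # make as many rows as colors
        # extract the correct column (1 = ending, 2 = ongoing, 3 = starting)
        counts = [b[row] for b in bins]
        # scale the data to the maximum (fall back to 1.0 so an
        # all-zero column does not divide by zero)
        peak = float(max(counts)) or 1.0
        ongoing = [c/peak for c in counts]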

        # draw a black line at the left end
        left = 10
        border_width = 20
        d = border_width
        # align="edge" keeps each bar between row and row+1, matching
        # the label positions set below; matplotlib 2.0 and later
        # default to align="center"
        ax.barh(row, d, barwidth, color="black",
                left=left, edgecolor="none",
                linewidth=0, align="edge")
        left += d
        # fill in the horizontal bar with the right color density
        # (alpha); note that each bar's width d is the time value itself
        for d, c in zip(time, ongoing):
            ax.barh(row, d, barwidth,
                    alpha=0.9*c+.01,
                    color=color[row-1],
                    left=left,
                    edgecolor="none",
                    linewidth=0, align="edge")
            left += d

        # draw a black line at the right end
        d = border_width
        ax.barh(row, d, barwidth,
                color="black",
                left=left, edgecolor="none",
                linewidth=0, align="edge")
    # label the rows
    plt.yticks([1.5, 2.5, 3.5], ['stopping', 'ongoing', 'starting'], size=10)
    # return the plot to __main__
    return plt

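# do some housekeeping that makes it all go in OrgMode (and hence PDF
# and HTML)
if __name__ == "__main__":
    # bins: list of (time, ending, ongoing, starting) rows from the
    # binning step
    plt = makebarplot(bins)
    # The file extension controls the output format; .png and .pdf are
    # good choices along with .svg
    filename = "edplot.svg"
    plt.savefig(filename)
    # A top-level "return filename" only works when OrgMode wraps this
    # block in a function; as a plain script it is a SyntaxError, so
    # print the name instead
    print(filename)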

[Figure: the resulting event density plot, edplot.svg]

And now you can see the number of starting events in green, the number of ongoing events in blue, and the number of ending events in red. The darker the color, the more events of that type are happening at that time, hence the name, event density plot.
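
To make the “darker means more” mapping concrete: makebarplot divides each count by the largest count in its row and turns the result into an opacity with 0.9*c + .01. The sketch below isolates that one step. The function name count_to_alpha is hypothetical, and drawing an all-zero row at the minimum opacity is a defensive assumption.

```python
def count_to_alpha(counts):
    """Scale counts to opacities between 0.01 (none) and 0.91 (row maximum)."""
    peak = max(counts)
    if peak == 0:
        # Assumption: an empty row is drawn at the minimum opacity.
        return [0.01 for _ in counts]
    return [0.9 * (c / float(peak)) + 0.01 for c in counts]


ongoing = [0, 1, 2, 4]
alphas = count_to_alpha(ongoing)
print(alphas)
```

Because each row is scaled to its own maximum, the darkest red, blue, and green bars do not represent the same number of events; the plot compares times within a row, not counts across rows.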

The Future

This could pretty readily be a Python class, and may become one someday, but for now the makebarplot function is sufficient and, hopefully, easy to understand and translate to the language of your choice.

I would also like to include more examples, but thought that would be as likely to add confusion as clarity.