### Introduction

A few weeks ago, some of my friends ran in the Boston Marathon. While following their progress through the race, I found an interesting data set on the marathon website and wanted to explore it to see whether I could learn more about how elite runners run the race. This post explains the data I collected and the visualization I created to map the kinetics of the top finishers in the marathon. The full data visualization can be found here.

### Description

The Boston Marathon follows the same route every year. Starting in the suburb of Hopkinton, the race winds through several neighborhoods before climbing over Heartbreak Hill and then descending into Boston to finish in Copley Square. Since the course is the same from year to year, I wanted to see how runners from different years ran the marathon compared to one another.

The Boston Athletic Association (BAA) provides data on the finishing time and split times for participants in the race from 2010–2014.<sup>1</sup> The split times of each runner are recorded at multiples of 5 kilometers up to 40k, as well as at the halfway split of 21.1k. Using this data, I could infer the place of each runner at each split by ordering the runners by their elapsed time at that split. This allowed me to visualize the race kinetics of runners from different marathons across years. Here’s the resulting visualization I created of the top 10 finishers in the women’s Boston Marathon from 2010–2014. Click the Race! button to animate the viz:
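The place inference described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code behind the visualization: the function names, the data layout, and the three runners are all made up, and ties are broken by input order, which is an assumption.

```python
def to_seconds(clock: str) -> int:
    """Convert an 'H:MM:SS' clock string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def places_at_each_split(splits: dict[str, dict[str, str]]) -> dict[str, dict[str, int]]:
    """Rank runners at every split by cumulative time (1 = leader)."""
    split_names = list(next(iter(splits.values())).keys())
    places: dict[str, dict[str, int]] = {runner: {} for runner in splits}
    for name in split_names:
        # sorted() is stable, so runners with equal times keep their input order.
        ordered = sorted(splits, key=lambda runner: to_seconds(splits[runner][name]))
        for place, runner in enumerate(ordered, start=1):
            places[runner][name] = place
    return places


# Hypothetical split times for three made-up runners (not real BAA data).
example = {
    "runner_a": {"5k": "0:16:40", "10k": "0:33:30", "finish": "2:20:00"},
    "runner_b": {"5k": "0:16:45", "10k": "0:33:20", "finish": "2:21:30"},
    "runner_c": {"5k": "0:16:50", "10k": "0:33:40", "finish": "2:19:10"},
}
places = places_at_each_split(example)
```

In this toy example, `runner_c` sits in third place at both early splits and still finishes first, which is exactly the kind of movement the chart is meant to expose.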

Each line in the visualization above corresponds to one runner, and the color of the line denotes the year in which that runner ran the race. The vertical position of each line represents the runner’s overall place at a specific split compared to other runners. Hovering over a line with the mouse reveals metadata about the runner, including their final overall place, which is shown next to the finish line. The kinetics of the marathon are captured in the animation of the race: the speed at which each line is drawn is proportional to the speed at which that runner ran the race.
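One way to make animation speed proportional to running speed is to scale every split-to-split interval by the same factor, so that a chosen reference finishing time maps onto a fixed animation length. The sketch below assumes that approach; the function name, parameters, and numbers are hypothetical and are not taken from the project’s code.

```python
def segment_durations(
    cumulative_seconds: list[float],
    animation_seconds: float,
    reference_finish: float,
) -> list[float]:
    """Return how long each segment of a runner's line should take to draw.

    `reference_finish` race seconds are mapped onto `animation_seconds`,
    so slower runners take proportionally longer to reach the finish.
    """
    durations = []
    previous = 0.0
    for elapsed in cumulative_seconds:
        # Time spent between the previous split and this one, rescaled.
        durations.append((elapsed - previous) * animation_seconds / reference_finish)
        previous = elapsed
    return durations


# Hypothetical cumulative split times in seconds for two made-up runners.
fast = segment_durations([1000, 2000, 4000], animation_seconds=10, reference_finish=4000)
slow = segment_durations([1000, 2500, 5000], animation_seconds=10, reference_finish=4000)
```

Because both runners share the same scale factor, the faster line reaches the finish in 10 animation seconds while the slower one needs 12.5, which preserves the relative timing of the race.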

To give some feel for the data, consider the Kenyan runner Rita Jeptoo. This year, she set a course record for the Boston Marathon with a time of 2:18:57. Although Jeptoo won the race, she fell as low as fifth place at the 10k split before slowly moving into first place at 25k, where she held the lead for the remainder of the race. Her race is depicted below as a bright blue line. Jeptoo’s 2014 result contrasts with her 2012 and 2013 marathons, in which she ran comparatively slowly, finishing 46th and 21st overall, respectively (6th and 1st among the women). In each of the last three years she has improved her time by a substantial margin:

### Results

One of the interesting features apparent in the visualization was the way in which runners from the same year clustered together through the race. Runners from the same year ran the race in a similar pattern, crossing each split and the finish line in similar times compared to runners from other years. For example, in 2014, the year Jeptoo set the course record, all of the top 10 runners ran extremely fast races compared to the rest of the field. In contrast, 2012 was a very slow year overall; all of the women ran comparatively slow races. The slowest of the top 10 finishers in 2014 ran almost 8 minutes faster than the fastest runner in 2012.
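The clustering described above can be put on a numeric footing with an intraclass correlation coefficient. The post does not compute one, so the following is only a sketch of one common estimator, the one-way ANOVA form often written ICC(1), under the assumption that every year contributes the same number of runners. The finishing times are invented for illustration.

```python
from statistics import mean


def intraclass_correlation(groups: list[list[float]]) -> float:
    """One-way ANOVA ICC(1) for equally sized groups (e.g. one group per year)."""
    k = len(groups[0])  # runners per year, assumed equal across years
    n = len(groups)     # number of years
    grand_mean = mean(value for group in groups for value in group)
    # Mean square between groups: how far each year's mean sits from the grand mean.
    ms_between = k * sum((mean(g) - grand_mean) ** 2 for g in groups) / (n - 1)
    # Mean square within groups: spread of runners around their own year's mean.
    ms_within = sum((v - mean(g)) ** 2 for g in groups for v in g) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical finishing times in seconds: two tightly clustered "years".
clustered = intraclass_correlation([[8400, 8460, 8520], [8900, 8960, 9020]])
# Two "years" with identical distributions show no year effect at all.
unclustered = intraclass_correlation([[8400, 8460, 8520], [8400, 8460, 8520]])
```

A value near 1 means most of the variation in times lies between years rather than within them; a value at or below 0 means the year a runner raced in says little about their time.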

I was curious to learn why the data exhibited such strong intraclass correlation. My first hypothesis was weather. If it was unusually warm, rainy, or windy on race day, it might affect the performance of the racers in the marathon. I examined the role of weather in the race by visualizing the analogous data from the men’s marathons. If the weather influenced the race from year to year, I reasoned that the clustering pattern apparent in the women’s data should also be visible in the men’s data. Furthermore, the weather for a given year should affect both the men’s and women’s races from that year in a similar way. Here’s the visualization I created for the top 10 finishers of the men’s Boston Marathon from 2010–2014:

The men’s data also showed an intra-year clustering pattern, but the data did not seem to support weather as the causal factor. For instance, the fastest year for the women’s race was 2014, but 2014 was a slow year for the men’s race. If weather had an appreciable influence on the speed of the race, it should affect both races similarly. The weather likely has some influence on race performance, but alone, it did not seem to explain the intra-year clustering pattern. Since marathons are tactical races, perhaps the tempo of the marathon differs each year because of strategic choices played out by the runners in the field.

Another interesting feature of the data was the disparity between the kinetics of the first and second halves of the race. In the first half of the marathons, the intra-year clusters of racers were very tight, with few changes in place. However, after around the 25k mark, the runners started to diverge from one another, and many of the racers changed places dramatically between splits. This place shuffling coincides with the point at which the runners encounter a series of hills culminating in the ascent of Heartbreak Hill at around the 32k mark. For example, in the men’s race, Mathew Kisorio ran through the hills at a blistering pace, rising from 45th place overall at the 25k split to 32nd at 35k, only to crash at the end of the race, falling from 32nd place to finish in 50th, all within only 7 kilometers.
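The amount of place shuffling between consecutive splits can be summarized by adding up how many places every runner gained or lost over each interval. The sketch below assumes that simple measure; it is not something the post itself computes, and the function name and the places used are hypothetical.

```python
def place_changes(places_by_runner: dict[str, list[int]]) -> list[int]:
    """Total absolute change in place, summed over runners, for each split interval."""
    n_splits = len(next(iter(places_by_runner.values())))
    return [
        sum(abs(places[i + 1] - places[i]) for places in places_by_runner.values())
        for i in range(n_splits - 1)
    ]


# Hypothetical places for three made-up runners at three consecutive splits.
shuffling = place_changes({
    "runner_a": [1, 1, 3],
    "runner_b": [2, 2, 1],
    "runner_c": [3, 3, 2],
})
```

Here the first interval scores 0 because nobody changes place, and the second scores 4, so a jump in this number after 25k would match the late-race divergence described above.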

### Visualization

The visualizations I created for the Boston marathon are inspired by one of my favorite quantitative graphics:

The image above comes from one of the great data visualization compendiums of all time: the 1890 United States Statistical Atlas. The early Statistical Atlases were authored by an array of scientific luminaries and brought pioneering quantitative visualization techniques to the general public. One of these techniques is depicted in Gannett’s visualization of US population change. This style of visualization is commonly called a bumps chart due to its popularity in depicting bumps races such as the Oxford Torpids and Summer Eights. Bumps charts are multi-attribute slope graphs typically used to convey interval or ratio ranks. Gannett used the technique to show how the populations of states and territories changed relative to one another over time. Since the marathon data was composed of similar variables (the rank of each runner at a sequence of splits), a bumps chart was an obvious choice for modeling it.

The full visualization for this post can be found here. The code for this project can be found on GitHub.

1. Unfortunately, prior to 2010, split time data for the Boston marathon was not available through the BAA website.