Astronomical winter starts in a few weeks here in the Northern Hemisphere. In the coming months, almost every region in the United States will experience its coldest weather of the year. As I watched winter approach, I started to wonder: what day is the coldest day of the year? This post outlines my attempt to answer that question and the surprising results I found in my analysis. An interactive data visualization summarizing my results can be found here.

The largest and most comprehensive climatological data source I could find to estimate the coldest day of the year was the North American Land Data Assimilation System (NLDAS). The data contained temperature measurements in every county in the contiguous United States for every day from 1979 to 2011. The NLDAS website provided a simple interface for small data queries over HTTP, but I was unable to find an option to bulk download the entire data set through their website. To get around this limitation, I wrote a simple scraping utility using Phantom to automate form submission for small data queries and download the resulting data in pieces to assemble the full raw NLDAS data set.

Once I had the raw data, I needed to decide how to estimate the coldest day of the year. With one minimum temperature estimate for each county, for each day, for each year, I had $\approx 3.8 \times 10^7$ estimates across the United States at county-level resolution. I assumed statistical independence and stationarity across years: the temperature in a given location in one year did not affect the temperature in the same location in other years, and the sample measurements across all 32 years were assumed to come from a stationary stochastic process.1 I further refined these estimates by bootstrapping the mean minimum temperature for each day in each county using a similar approach to what I’ve previously discussed, marginalized by day.
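To make the bootstrap step concrete, here is a minimal sketch in Python for a single county and a single calendar day. It is an illustration under assumptions, not the actual pipeline code: the function name `bootstrap_mean`, the number of resamples, the fixed seed, and the percentile-based confidence interval are all hypothetical choices.

```python
import random
import statistics


def bootstrap_mean(samples, n_resamples=1000, alpha=0.05, seed=0):
    """Bootstrap the mean of `samples` (e.g. one minimum temperature per year
    for a single county and calendar day).

    Returns (estimate, ci_low, ci_high), where the interval is a simple
    percentile interval at level 1 - alpha.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)

    # Resample the years with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)
    )

    # Percentile interval read straight off the sorted bootstrap means.
    low_index = int((alpha / 2) * n_resamples)
    high_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return statistics.fmean(means), means[low_index], means[high_index]
```

Resampling whole years with replacement is what leans on the independence assumption above: each year's measurement is treated as an interchangeable draw from the same distribution.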

The assumption of independence across years makes the laborious process of bootstrapping embarrassingly parallel. I originally wrote the analysis pipeline for Hadoop using Amazon’s Elastic MapReduce, but after some prototyping, I realized this approach would be prohibitively expensive. As an alternative, I parallelized my reducer across all of the cores in a spare computer and ran the analysis pipeline locally. This procedure took 27 hours running on 8 cores.2 Once the pipeline had completed, I was left with a 3.5 MB JSON blob containing county data and bootstrap statistics.
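The local parallelization can be sketched roughly as follows. This is not the original reducer, only an assumed outline in Python: the names `bootstrap_task` and `run_all`, the task layout of `(key, samples, seed)`, the resample count, and the chunk size are hypothetical. Because every county and day is bootstrapped independently, the tasks can simply be mapped across worker processes.

```python
import random
import statistics
from concurrent.futures import ProcessPoolExecutor


def bootstrap_task(task, n_resamples=200):
    """Bootstrap the mean for one (key, samples, seed) task.

    `key` is a hypothetical identifier such as (county_id, day_offset).
    """
    key, samples, seed = task
    rng = random.Random(seed)  # per-task seed keeps results reproducible
    n = len(samples)
    means = [statistics.fmean(rng.choices(samples, k=n)) for _ in range(n_resamples)]
    return key, statistics.fmean(means)


def run_all(tasks, workers=None):
    """Run every task and return {key: bootstrap estimate}.

    workers=None uses one process per available core; workers=1 runs
    serially in the current process, which is handy for debugging.
    """
    if workers == 1:
        return dict(map(bootstrap_task, tasks))
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # Tasks share no state, so they can be handed out in any order.
        return dict(pool.map(bootstrap_task, tasks, chunksize=64))
```

When using more than one worker, `run_all` should be called from under an `if __name__ == "__main__":` guard so that worker processes can import the module safely on every platform.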

Using the generated JSON data, I overlaid two separate choropleth maps onto an Albers projection of the contiguous United States. In the interactive version of the data visualization, the two map states can be toggled by clicking one of the corresponding radio buttons next to the map. Hovering over any county in the visualization with the mouse reveals additional data, including an inset time series of the sampling distribution statistics I calculated through the bootstrap. The plots show estimated minimum temperatures and confidence intervals from November 15 to April 15 for each selected county.

The first choropleth map (Figure 1) shows, for each county, the day of the year with the lowest estimated minimum temperature. From the map, it’s clear that counties toward the West Coast tend to experience their coldest day relatively early in the winter (more orange colors), while counties toward the East Coast tend to experience their coldest day closer to spring (greener colors). The distribution is markedly bimodal, with a very clear diagonal line bisecting the United States from Southern Washington State to Western Florida.
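Picking the coldest day for a county amounts to taking the minimum over its bootstrapped daily means. The sketch below shows one way to do that in Python; the function name `coldest_day` and the input layout (one mean per day, starting at November 15) are assumptions for illustration, and the reference year 2001 is arbitrary, chosen only so the season does not include a leap day.

```python
from datetime import date, timedelta


def coldest_day(daily_means, season_start=date(2001, 11, 15)):
    """Return ((month, day), temperature) for the coldest day of the season.

    `daily_means[i]` is the bootstrapped mean minimum temperature for the
    day `i` days after `season_start`. Ties resolve to the earliest day.
    """
    # Index of the smallest mean; min() keeps the first index on ties.
    offset = min(range(len(daily_means)), key=daily_means.__getitem__)
    day = season_start + timedelta(days=offset)
    return (day.month, day.day), daily_means[offset]
```

Working with offsets from the season start, rather than day-of-year numbers, avoids the discontinuity at January 1 when the cold season spans two calendar years.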

At first, I thought the bisection might be a quantization artifact from using an interval scale to bin consecutive dates into common colors on the map, but that was not the case; the bimodality is a real property of the data. Mousing around the dividing line shows that adjacent counties differ markedly from one side of the line to the other with respect to their coldest days.3

The second map (Figure 2) colors each county by its minimum estimated temperature. From this map, it is easy to see the influence of the Rocky Mountains and the temperature-buffering properties of the oceans and Great Lakes. One thing I found interesting in this map is the lateral banding of temperature extending from the East Coast into the Midwest. These bands are remarkably uniform across counties with similar latitudes. For example, look at the bottom edge of the clearly defined 13–19°F band extending westward from Southern Pennsylvania, through Ohio, Indiana, Illinois, Missouri, and Nebraska toward the Rocky Mountains in Colorado.

1. Independence and stationarity are reasonable assumptions in my view, but they do elide some of the intricacies of weather patterns. Volcanoes, polar vortices, global warming, sea surface temperatures, and other climate phenomena can violate these assumptions; for example, in the Year Without a Summer, a volcanic eruption lowered temperatures across Europe for multiple years.

2. My computer has 4 physical cores, but with hyper-threading there are 8 virtual cores.

3. If there are any climatologists out there who can explain why this phenomenon occurs, please contact me. I’ll try to update this post if I find an explanation for the bimodal pattern of the coldest day.