Coffee Experiments
statistics data review beverage

Introduction

A few years ago, I started using house guests as subjects in an experiment.1 My experiment was designed to test which variables in the coffee brewing process produce a perceptible improvement in coffee flavor. A frequent assertion is that numerous variables must be carefully considered to brew a good cup of coffee. I wanted to know if this premise was true, as humans are really good at creating their own reality distortion fields.2 My main motivation for this experiment was to determine how I could brew the best coffee with minimal time and monetary investment. I didn’t want to buy an $11,000 Blossom One if I could avoid it. This post outlines the results of my coffee experiments and the brewing setup I now use based on the results.

Experimental Design

For analyzing beer and whisky flavor preferences, I turned to several comprehensive datasets. For coffee, I couldn’t find an equivalent resource, so I generated my own data. My experimental design was simple.3 I’d give guests two cups of coffee prepared in exactly the same way except for one important difference. I’d then ask guests to choose their favorite of the two cups without them knowing which cup was which. Then, I’d record the results.

To avoid groupthink, I gave one of the two cups, blinded and at random, to each subject. I also didn’t inform the subjects what the independent variable was until after the experiment was completed. These steps helped avoid introducing bias that could have distorted the results.

In total, 3 rounds of randomized experiments were conducted. Two of the three rounds were actual experiments, while the other round served as a control. As a control, house guests were given two cups of the exact same coffee and asked to choose their favorite. The control served as a baseline to calibrate the experiment. In all control experiments, subjects’ preferences between the two cups of coffee were consistent with a random coin flip.

Grinders

My longest-running experiment centers on coffee grinders. A common belief among coffee pundits is that good coffee depends on good grinding. Specifically, coffee ground with a burr grinder purportedly tastes better because it grinds the beans more uniformly and doesn’t overheat the grounds like traditional blade grinders. The experiment that I set up tested this claim by brewing coffee using a burr grinder or a blade grinder and scoring which of the two cups subjects preferred.

I don’t own a burr grinder, so I conducted these experiments using the burr grinders provided by my house guests. I’ve done this experiment on several occasions using several different burr grinders: Baratza Virtuoso grinders, several Breville grinders, and a Cuisinart grinder. All of these burr grinders were pitted against my Mr. Coffee blade grinder. The aggregated results of these experiments are plotted in the stacked bar plot below:

In total, 24 data samples were collected in these experiments. Each of the 3 burr grinder models performed comparably. Surprisingly, $\frac{13}{24}$, or $\approx 54\%$, of subjects actually preferred the blade grinder. These data suggested that blade grinders might actually produce better tasting coffee than burr grinders. To test this claim, the following analysis was conducted.

Let $X$ denote the number of successes in a sequence of $n$ i.i.d. Bernoulli trials, where a success is a subject who preferred the blade grinder to the burr grinder. The probability of success, $p$, was assumed to be fixed across trials, so $X$ follows a Binomial distribution, and the probability that at least 13 of the 24 subjects prefer the blade grinder is:

$$ \begin{eqnarray} \Pr(X \ge 13) &=& \Pr(X = 13) + \Pr(X = 14) + \dots + \Pr(X = 24) \\ &=& \sum_{k=13}^{n} {{n} \choose {k}} p^k (1 - p)^{n-k} \\ &=& 0.27 \end{eqnarray} $$

From the control experiments, I calculated the probability of selecting the blade grinder to be 46%, so the Binomial sum above is evaluated with $n = 24$ and a success probability of $p = 0.46$. The resulting p-value is 0.27. The null and alternative hypotheses for a one-sided Binomial test are:

$$ \begin{eqnarray} \mathcal{H}_0: p = 0.46 \\ \mathcal{H}_1: p > 0.46 \end{eqnarray} $$

The p-value represents the probability of obtaining a result that is at least as extreme as what was observed, given that the null hypothesis is true. The decision rule for the test can be stated in terms of the p-value and its relation to the significance level, which I set to $\alpha = 0.05$ before the experiment was conducted. If $\mathcal{H}_0$ were true, that is, if $p = 0.46$, obtaining 13 or more successes in a sample of size 24 would happen about 27% of the time. Although the proportion of successes, $\frac{13}{24}$, provided evidence that pointed in the direction of the alternative hypothesis, it was not statistically significant at $\alpha = 0.05$, and I failed to reject the null hypothesis.
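The tail probability above can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library, not the code used for the original analysis; the helper name `binomial_tail` is illustrative.

```python
from math import comb


def binomial_tail(k_min: int, n: int, p: float) -> float:
    """Return Pr(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


# Values taken from the text: 13 of 24 subjects preferred the blade grinder,
# and the null success probability from the control rounds is 0.46.
p_value = binomial_tail(13, 24, 0.46)
alpha = 0.05

print(f"p-value: {p_value:.2f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

Since the computed p-value is well above 0.05, the sketch reaches the same decision as the text: the null hypothesis is not rejected.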

Based on this analysis, subjects did not show a statistically significant preference between coffee ground with a blade grinder and coffee ground with a burr grinder. It’s important to point out that this analysis had low statistical power and would have had difficulty detecting subtle differences in grinder preferences because the number of subjects in the experiment was relatively small. If there were a slight preference for one grinder or the other, this analysis might not detect the effect. However, the results still failed to produce evidence that coffee from a burr grinder was superior to coffee from a blade grinder. For this reason, I opted to save a few hundred dollars and not invest in a burr grinder until the data provide a compelling reason to do so.
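To make the low-power caveat concrete, the sketch below estimates the power of the one-sided test for a single hypothetical effect size. The alternative value `p_alt = 0.60` is an assumption chosen purely for illustration; it does not come from the experiments. The helper names are likewise illustrative.

```python
from math import comb


def binomial_tail(k_min: int, n: int, p: float) -> float:
    """Return Pr(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def critical_count(n: int, p_null: float, alpha: float) -> int:
    """Smallest count c such that Pr(X >= c | p_null) <= alpha."""
    for c in range(n + 2):  # c = n + 1 gives an empty tail of 0, so this always returns
        if binomial_tail(c, n, p_null) <= alpha:
            return c
    raise AssertionError("unreachable")


n, p_null, alpha = 24, 0.46, 0.05  # values from the text
p_alt = 0.60  # ASSUMPTION: hypothetical true preference rate, for illustration only

c = critical_count(n, p_null, alpha)
power = binomial_tail(c, n, p_alt)  # chance of rejecting H0 if p_alt were true

print(f"reject H0 when at least {c} of {n} subjects prefer the blade grinder")
print(f"power against p = {p_alt}: {power:.2f}")
```

Under these assumptions, even a fairly large true preference would be detected only about a third of the time with 24 subjects, which is why a non-significant result here says little about small effects.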

Brewing Methods

Another variable I was interested in understanding was the impact of different brewing methods on flavor preferences. I wanted to know if house guests preferred coffee brewed via drip-extraction or coffee brewed with the Aeropress. I set up my experiment as previously described and scored the number of house guests who preferred each brewing method. The objective of this analysis was similar to the burr grinder experiment, but this time I wanted to compare preferences between two different brewing methods.

I chose to analyze this problem using a different, Bayesian approach. In this experiment, I selected a Uniform prior on $p$, the probability that a guest prefers the Aeropress, to model my belief about the brew method preferences of my guests. A main benefit of using the Uniform prior is that it leads to a discrete distribution for the predicted number of successes in $n$ future trials. For a series of independent samples $U_1, U_2, \dots, U_n$ from $\mathcal{U}(0, 1)$ and a fixed probability $p$, the events $U_1 \le p$, $U_2 \le p$, $\dots$, $U_n \le p$ are like $n$ independent trials, and the number of these events that occur follows a Binomial distribution. In a series of these samples, the distribution of the $k$th order statistic is given by the Beta integral. This trick allows Binomial probabilities to be re-expressed as integrals. Framed in this experiment, successes are events where the Aeropress is selected over drip-brew coffee. The hypothesis test for this experiment is presented below, where $H_1$ denotes preference for the Aeropress ($p > 0.5$), $H_2$ denotes preference for drip-extraction ($p \le 0.5$), and $O$ represents the posterior odds ratio:

$$ \begin{eqnarray} O &=& \frac{\Pr(H_1 \,|\, D)}{\Pr(H_2 \,|\, D)} \\ &=& \frac{\Pr(D \,|\, H_1)\Pr(H_1)}{\Pr(D \,|\, H_2)\Pr(H_2)} \\ &=& \frac{\int_{0.5}^1 p^k(1-p)^{n-k} \, \mathrm{d}p}{\int_{0}^{0.5} p^k(1-p)^{n-k} \, \mathrm{d}p} \\ &=& \frac{I_{0.5}(n - k + 1, k +1)}{1 - I_{0.5}(n - k + 1, k + 1)} \\ &=& 74.17 \end{eqnarray} $$

I observed that $\frac{15}{20}$, or $75\%$, of subjects preferred coffee from the Aeropress, so $k = 15$ and $n = 20$. Since the prior probabilities $\Pr(H_1)$ and $\Pr(H_2)$ are equal, they cancel, and the posterior odds ratio reduces to the ratio of the likelihoods of the data under each hypothesis. In this calculation, $I_x(\alpha, \beta)$ is the regularized incomplete beta function. A ratio of $\approx 74$ is considered strong evidence in favor of $H_1$ over $H_2$. This conclusion is also supported by the exact Binomial test, where the null hypothesis would have been rejected. Thus, there is reason to believe that the Aeropress may, in fact, produce better tasting coffee than drip-extraction.
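The posterior odds can be checked numerically without a statistics library. The sketch below relies on a standard identity: for positive integers $a$ and $b$, $I_x(a, b)$ equals the probability that a Binomial$(a + b - 1, x)$ variable is at least $a$. It is an illustrative reconstruction, not the original analysis code, and the function names are hypothetical. The exact Binomial test at the end assumes a null value of $p = 0.5$, which the text does not state explicitly.

```python
from math import comb


def binomial_tail(k_min: int, n: int, p: float) -> float:
    """Return Pr(X >= k_min) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


def regularized_incomplete_beta(x: float, a: int, b: int) -> float:
    """I_x(a, b) for positive integers a and b, via the binomial-sum identity."""
    return binomial_tail(a, a + b - 1, x)


n, k = 20, 15  # from the text: 15 of 20 subjects preferred the Aeropress

# Posterior probability that p > 0.5 under a Uniform prior on p.
posterior_h1 = regularized_incomplete_beta(0.5, n - k + 1, k + 1)
odds = posterior_h1 / (1 - posterior_h1)

# ASSUMPTION: one-sided exact Binomial test against a null of p = 0.5.
p_exact = binomial_tail(k, n, 0.5)

print(f"posterior odds: {odds:.1f}")
print(f"exact Binomial p-value: {p_exact:.3f}")
```

The computed odds come out at roughly 74, in line with the value reported above, and the exact test's p-value falls below the 0.05 significance level.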

Setup

I’ve examined other coffee variables using a similar experimental approach and found only a few factors that had any measurable effect on coffee flavor in isolation. For example, variables like bean freshness or bean purveyor have little effect on flavor. As a result of these experiments, my brewing setup is simple, quick, and inexpensive. I buy the cheapest whole-bean shade-grown coffee I can find in my preferred roast. To brew a cup of coffee, I grind the beans with a blade grinder and brew with the Aeropress. The Aeropress and blade grinder can be found on Amazon for around $25. The entire brewing process takes about 5 minutes and produces great coffee. Until I obtain convincing evidence that supports investing additional money in brewing accouterments, I see little reason to deviate from this system. In the words of Carl Sagan, extraordinary claims require extraordinary evidence.

Update

A few people asked for some additional details regarding my experiments. I’ve addressed these questions below:

  • What Aeropress technique did you use in your brewing experiments?

    I used the conventional methods outlined in the Aeropress instruction manual and adopted a consensus brew method from the World Aeropress Championship. I used 18 grams of coffee ground a little coarser than filter and ~250-270mL of water for brewing sans inversion.

  • Were subjects allowed to supplement their coffee with additives like sugar, milk, or cream?

    No. All the coffee in my experiments was prepared with no additives. Allowing guests to randomly add their own supplements would have obfuscated the results. When possible, I tried to eliminate extraneous and confounding variables in my experiments.

  • Were your guests simpleton coffee dilettantes incapable of differentiating good coffee from bad? If so, their inexperience may have distorted the results and made it appear as if the blade and burr grinder were equivalent when, in fact, they were not.

    I expected my guests would actually prefer the burr grinder. Most of the subjects in the grinder experiment owned burr grinders, thus they were likely pre-conditioned to favor this grinding method. Despite the presupposed bias toward burr grinders, the empirical data showed that guests did not prefer this method. It’s fun to generate various hypotheses post hoc as to why these experiments could have generated the outcomes that they did; however, in most cases, it is good practice to adopt Occam’s razor until evidence suggests otherwise.


  1. House guests are great fun because they’re indebted to you for letting them stay in your home; this means you can cajole them into doing all sorts of things they wouldn’t otherwise do.

  2. A hat tip to Bud Tribble

  3. I use the term experiments loosely throughout this post. I didn’t have the resources to set up a more thorough experimental design. Ideally, I would have liked to use better control conditions, larger sample sizes, more thorough subject randomization, and a more consistent testing environment.