I’ve noticed in the last few years an increasing range of beer choices available at the store. With the success of microbreweries and now the rise of nanobrewing, the breadth of beer options is overwhelming. When I go to the store, I want to make an informed buying decision, but I don’t want to spend 30 minutes choosing a beer. My method for selecting beer relies on an iOS app called BeerBuddy that I first learned about on The Zymurgist’s Dilemma. The app lets me scan the barcode of a beer and examine reviews for that beer on RateBeer. The app is helpful, but barcode-scanner apps don’t scale. My local store carries over 200 beers, and I don’t want to manually scan and read reviews for dozens of labels. Consequently, I overlook some beers simply because their labels do not look appealing and I never look up the reviews.^{1}
A few weeks ago I found a paper detailing two datasets of beer reviews from BeerAdvocate and RateBeer. BeerAdvocate and RateBeer are communities of beer zealots who review, rate, and compare tasting notes on craft beers. The combined dataset contained close to 5 million reviews on 170,000 different beers. Although the reviews are not conducted under rigorous double-blinded conditions, I generally find the user reviews to be accurate and in concordance with my own beer preferences. I decided to explore the data to learn what characteristics of beer are associated with the highest review scores, and then use this information to pre-screen beers and reduce the number I needed to look up manually in BeerBuddy. The goal of this analysis was to learn what features of beer are associated with the highest and lowest reviews so I could make better beer selections and get in and out of the store faster.
The review data I analyzed was initially on different scales. Some reviews were on a 5 point scale while others were on a 10 or 20 point scale. I normalized all ratings to a 5 point scale, with 5 being the most favorable review. I then removed incomplete records and loaded the data into a Pandas DataFrame. Here is the structure of the first few data records after the data was parsed, cleaned, and normalized:
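The scaling step can be sketched as follows. This is a simplified, hypothetical version of the normalization logic (the function name is mine), which rescales fractional rating strings and snaps the result to the nearest half point:

```python
def normalize_rating(rating, target_max=5.0):
    """Rescale a rating string such as '8/20' (or a bare score already
    on the target scale) to a 5-point scale, rounded to the nearest 0.5."""
    if '/' in rating:
        num, denom = rating.split('/')
        score = float(num) * target_max / float(denom)
    else:
        score = float(rating)
    # snap to the 0.5-step grid used by the reviews
    return round(score / 0.5) * 0.5

print(normalize_rating('8/20'))  # 8/20 rescales to 2.0 out of 5
print(normalize_rating('3.4'))   # bare scores round to 3.5
```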
```
   Beer_id  Brewer_id  Abv  Style                Appear.  Aroma  Palate  Taste  Overall  Time        Profile_name
0  47986    10325      5.0  Hefeweizen           2.5      2.0    1.5     1.5    1.5      1234817823  stcules
1  47969    10325      5.0  German Pilsener      3.5      3.0    2.5     3.0    3.0      1234725145  stcules
2  64883    1075       7.7  Imperial IPA         4.0      4.5    4.0     4.5    4.0      1293735206  johnmichaelsen
3  10789    1075       7.2  Oatmeal Stout        2.5      1.5    2.5     2.0    2.0      1103502339  RedDiamond
4  12386    1075       5.6  American Pale Lager  4.0      3.0    4.0     4.0    4.0      1062311123  beerguy101
5  58046    1075       7.4  Rauchbier            3.5      4.5    4.0     4.5    4.5      1316405909  Klym
6  49669    6102       4.5  Dortmunder/Helles    2.0      3.0    3.0     3.5    3.5      1132790400  tiggmtl
7  31035    1028       4.8  Premium Bitter/ESB   3.0      2.5    3.0     2.5    3.0      1076544000  imdownthepub
8  12240    1028       6.3  Stout                3.0      3.0    2.0     3.0    2.5      1019520000  OlJuntan64
```
Each beer is rated on several qualities—appearance, aroma, palate, taste, and overall. I suspected that there was redundant information in these reviews. It was hard to envision instances where a beer review would get a low aroma or palate score and also receive a high overall score. I plotted the correlation matrix of these features to see how they were related. All the features were positively correlated, albeit to differing degrees:^{2}
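A quick way to see these relationships is pandas’ built-in Spearman correlation. The toy DataFrame below uses a hypothetical handful of reviews adapted from the sample records above rather than the full dataset:

```python
import pandas as pd

# A few illustrative reviews (values adapted from the sample table)
df = pd.DataFrame({
    'Aroma':   [2.0, 3.0, 4.5, 1.5, 3.0],
    'Palate':  [1.5, 2.5, 4.0, 2.5, 4.0],
    'Taste':   [1.5, 3.0, 4.5, 2.0, 4.0],
    'Overall': [1.5, 3.0, 4.0, 2.0, 4.0],
})

corr = df.corr(method='spearman')
print(corr.round(2))  # all pairwise correlations are positive
```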
I decided to further explore the distributions of taste and overall ratings to get a better sense of the reviews. I suspected that both sets of ratings would look roughly Gaussian, like a Binomial distribution centered on the middle of the scale: if I reviewed an average beer, I would rate it 2.5 out of 5. Interestingly, both the overall and taste rating distributions exhibited a characteristic negative skew:
I wanted to learn a set of rules that would allow me to make better and faster beer selections. Although I had a large sample of user reviews, I wanted to make inferences about an infinitely large population of reviewers based on this sample data. I focused on three variables in the data: alcohol concentration, style, and overall rating. I quantized alcohol concentration into groups to create a dataset, $X$, of categorical variables composed of $k$ groups of observed reviews, where each group corresponds to a given alcohol concentration range and beer style.
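The grouping step can be sketched like this. The bucket boundaries match those used in the analysis, while the sample reviews are drawn from the table above (function and variable names are mine):

```python
def abv_bucket(abv):
    """Quantize an ABV percentage into one of five coarse categories."""
    for lo, hi, label in [(0, 5, '0-5%'), (5, 10, '5-10%'),
                          (10, 15, '10-15%'), (15, 20, '15-20%')]:
        if lo <= abv < hi:
            return label
    return '>20%'

# Each (ABV bucket, style) pair defines one categorical group of reviews
reviews = [(7.7, 'Imperial IPA', 4.0),
           (7.2, 'Oatmeal Stout', 2.0),
           (7.4, 'Rauchbier', 4.5)]
groups = {}
for abv, style, overall in reviews:
    groups.setdefault((abv_bucket(abv), style), []).append(overall)

print(groups[('5-10%', 'Imperial IPA')])  # [4.0]
```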
For each group, $x_1, x_2, \ldots, x_k$, the likelihood is a function of the parameter vectors, $\theta_1^{x_1}, \theta_2^{x_2}, \ldots, \theta_k^{x_k}$, where $\theta_i^{x_i}$ represents the unknown parameters for a fixed set of data, $x_i$. Using the observed reviews in each group, the objective was to find the parameters, $\theta_i^{x_i}$, that best explain the evidence, $x_i$. In a Bayesian setting, $\Pr(\theta)$ is a probability function that represents belief. Using Bayes’ Rule, a prior probability distribution can be assigned over the parameters and then used to compute the posterior distribution and fit the parameters. This links the prior, through the likelihood $\Pr(X \mid \theta)$, to the updated posterior probabilities, $\Pr(\theta \mid X)$.
The posterior probability is proportional^{3} to the product of the prior and the likelihood. I used a Dirichlet prior to compute the posterior distribution because it’s the conjugate prior of the multinomial distribution. If the prior is a Dirichlet distribution, for any set of sample values, the posterior distribution of $\theta$ will also be a Dirichlet distribution:
$$ \Pr( \theta \mid X) \propto \Pr(X \mid \theta) \Pr(\theta) $$
$$ \mathrm{Dir} \propto \mathrm{Multi} \times \mathrm{Dir} $$
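Concretely, with a flat Dirichlet prior the posterior concentration parameters are just the prior pseudo-counts plus the observed rating counts, and the posterior probability of a 5/5 rating can then be estimated by sampling. The counts below are invented for illustration; the real analysis uses the observed counts for each (ABV, style) group:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical counts for the ten 0.5-step rating categories (0.5 .. 5.0)
counts = np.array([2, 3, 5, 10, 20, 40, 80, 120, 90, 30])

# Posterior = Dirichlet(prior + counts); a flat prior adds 1 to each category
alphas = 1 + counts
samples = rng.dirichlet(alphas, size=5000)

# Monte Carlo estimate of the probability mass on a perfect 5/5 rating
p5 = samples[:, -1].mean()
print(round(p5, 3))  # close to the analytic posterior mean, 31/410 ~ 0.076
```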
Below I’ve plotted the bottom 10 and top 10 beer groups, sorted by posterior probability of receiving a 5/5 rating. An interesting observation from this analysis is the positive correlation between alcohol concentration and posterior probability:
RateBeer and BeerAdvocate do not standardize style names. A given beer can be labeled with slightly different style names on each site. For example, Delirium Tremens is a Belgian Strong Pale Ale on BeerAdvocate and a Belgian Strong Ale on RateBeer. Consequently, some of the groups are essentially identical, but with slightly different names. Variants of Wild Ale, Imperial Stout, and Quadrupel appear multiple times in the top 10. Finding the same groups bubbling to the top and bottom of the plot suggested that RateBeer and BeerAdvocate have similar views of beer ratings.
When I go to the store I now focus on beer groups in the top 10 and avoid those in the bottom 10.^{4} Even though I know very little about beer, paying close attention to style and alcohol concentration has greatly improved my selections. These simple heuristics let me choose better beer at the store.
I’ve placed the code used to generate the data and results below and also on GitHub. To generate the figures, just run the Makefile:^{5}
```make
SHELL := /bin/bash

BA = 'http://snap.stanford.edu/data/beeradvocate.txt.gz'
RB = 'http://snap.stanford.edu/data/ratebeer.txt.gz'
DATA := beer-data.txt
SCRIPT := beer-selection.py
FIGURES := *.svg

.PHONY: all clean distclean

all: $(FIGURES)

$(FIGURES): $(SCRIPT) $(DATA)
	@echo Running analysis ...
	python $(SCRIPT) < $(DATA)

$(DATA):
	@echo Downloading data ...
	wget -qO- $(BA) $(RB) | gunzip -c > $@
	touch $@

clean:
	@- $(RM) $(DATA) $(FIGURES)

distclean: clean
```
Here’s the Python script, `beer-selection.py`. Note that this code requires Python 2.7x and several dependencies: Pandas, SciPy, NumPy, and Matplotlib:
```python
#!/usr/bin/env python
# -*- encoding: utf-8 -*-
""" Author: Seth Brown
    Description: Beer Selection Analysis
    Date: 2013-03-24
    Python 2.7x
"""
from __future__ import division
import sys
from itertools import groupby, chain
from collections import Counter, namedtuple

import pandas as pd
import numpy as np
from numpy.random.mtrand import dirichlet
import scipy as sp
import matplotlib.pyplot as plt


def maybe_float(s):
    """ Attempt to convert a string to a float
    """
    try:
        val = float(s)
    except ValueError:
        val = s

    return val


def normalize_scale(s):
    """ Normalize a rating string, s to a 5 point scale, (0-5]

    Parameters:
    --------------
    s: rating string, eg. '8/20', '10', etc.

    Output:
    -----------
    normalized score: float, (0-5]
    """
    if '/' in s:
        num, denom = map(int, s.split('/'))
        fctr = float(denom) / 5
        score = num / fctr
    else:
        score = s

    return 0.5 * round(float(score) / 0.5)


def abv_group(abv):
    """ Convert ABV into common categories
    """
    if 0 <= abv < 5:
        abv_cat = '0-5%'
    elif 5 <= abv < 10:
        abv_cat = '5-10%'
    elif 10 <= abv < 15:
        abv_cat = '10-15%'
    elif 15 <= abv < 20:
        abv_cat = '15-20%'
    else:
        abv_cat = '>20%'

    return abv_cat


def ratings_pmf(df, col_key):
    """ Construct a PMF from a Pandas DataFrame column
    """
    pmf = lambda count, total: count / float(total)
    values = df.groupby(col_key).size()
    prs = values.apply(pmf, total=values.sum())

    return prs


def beer_df(lines):
    """ Process beer data into a Pandas dataframe

    Parameters
    ------------------
    lines: generator of lines from beer data file

    Output
    --------------
    dataframe: cleaned and normalized data.
        cols: Beer_id, Brewer_id, Abv, Style, Appear., Aroma,
              Palate, Taste, Overall, Time, Profile_name
    """
    cols = ('Beer_id', 'Brewer_id', 'Abv', 'Style', 'Appear.', 'Aroma',
            'Palate', 'Taste', 'Overall', 'Time', 'Profile_name')
    data = []
    txt = 'review/text'
    for (key, group) in groupby(lines, lambda t: t.startswith('beer/name')):
        if key:
            next(group).strip()
        else:
            rating_data = [i.strip() for i in group if not i.startswith(txt)]
            abv = rating_data[2]
            # some abv fields are malformed--ensure abv contains decimal pt.
            if '.' in abv:
                # remove blank lines
                rating_data = filter(lambda l: l != '', rating_data)
                rating_data = [i.split(':') for i in rating_data]
                rating_data = [i[1].strip() for i in rating_data]
                scale = map(normalize_scale, rating_data[4:9])
                rating_data = rating_data[:4] + scale + rating_data[9:]
                rating_data = [maybe_float(s) for s in rating_data]
                if len(rating_data) == 11:
                    abv = abv_group(rating_data[2])
                    data.append(rating_data)

    return pd.DataFrame(data, columns=cols)


def correlation_matrix(dataframe, filename, bar_label=''):
    """ Plot a correlation matrix from a Pandas DataFrame
    """
    plt.figure(figsize=(7, 9))
    col_labels = dataframe.columns.tolist()
    tick_indices = np.arange(0.5, len(col_labels) + 0.5)
    plt.pcolormesh(dataframe.values, cmap='RdBu', vmin=-1, vmax=1)
    cbar = plt.colorbar(drawedges=False, orientation='horizontal')
    cbar.solids.set_edgecolor('face')
    cbar.set_label('Spearman\'s ' + r'$\rho$')
    plt.xticks(tick_indices, col_labels)
    plt.yticks(tick_indices, col_labels)
    plt.savefig(filename, bbox_inches='tight')
    plt.clf()


def pmfs(df, filename):
    """ Plot overall and taste distributions
    """
    plt.figure(figsize=(10, 10))
    ax1 = plt.subplot(211)
    ax2 = plt.subplot(212, sharex=ax1, sharey=ax1)
    idx = sp.arange(0, 5.5, 0.5)[::-1]
    taste_prs = ratings_pmf(df, 'Taste')
    ax1.bar(idx, [0] + list(taste_prs.values), width=0.5, align='center')
    ax1.set_xlim([-0.5, 5.5])
    ax1.set_xticklabels([6, 5, 4, 3, 2, 1, 0])
    ax1.set_xlabel('Taste Ratings')
    ax1.set_ylabel('Probability\n')
    overall_prs = ratings_pmf(df, 'Overall')
    ax2.bar(idx, overall_prs.values, width=0.5, align='center')
    ax2.set_xlabel('Overall Ratings')
    ax2.set_ylabel('Probability\n')
    plt.subplots_adjust(right=1.2)
    plt.tight_layout()
    plt.savefig(filename)
    plt.clf()


def filter_groups(df):
    data = {}
    for k, v in df.iterrows():
        key = (v['Abv'], v['Style'])
        data.setdefault(key, []).append(v['Overall'])

    # exclude groups with < 30 representative beers & < 2000 reviews
    data = {k: [len(v)] + list(chain(v)) for k, v in data.items()}
    data = {k: v[1:] for k, v in data.items() if v[0] > 30}
    data = {k: v for k, v in data.items() if len(v) > 2000}
    data = {k: map(round, v) for k, v in data.items()}

    return data


def prob5(data, iters=5000):
    post = []
    post_temp = namedtuple('post', 'style, abv, post_prob')
    for label, vals in data.items():
        counts = Counter(vals)
        freq_data = [(i, j) for i, j in counts.items()]
        keys = [i[0] for i in freq_data]
        obs = [i[1] for i in freq_data]
        results = []
        for n in xrange(iters):
            samp = dirichlet([x for x in obs])
            results.append(samp)
        results = np.array(results)
        probs = np.mean(results, axis=0)
        n_data = zip(keys, probs)
        for key, prob in n_data:
            if prob is not None and key == 5:
                datum = post_temp(style=label[1], abv=label[0], post_prob=prob)
                post.append(datum)

    return post


def terminal_plot(data, filename, n=10):
    """ Plot the terminal (highest/lowest) posterior probability ratings
    """
    label_strs = [' '.join((i.style, i.abv)) for i in data]
    labels = [unicode(i.decode('latin-1')) for i in label_strs]
    probs = [i.post_prob for i in data]
    idx = range(n)
    fig, ((ax1, ax2)) = plt.subplots(2, 1, sharex=True, figsize=(10, 10))
    ax1.yaxis.get_majorticklocs()
    ax1.yaxis.set_ticks(idx)
    ax1.set_yticklabels(labels[:n], fontsize=15)
    ax2.yaxis.get_majorticklocs()
    ax2.yaxis.set_ticks(idx)
    ax2.set_yticklabels(labels[-n:], fontsize=15)
    ax2.set_xlabel('Probability')
    ax1.barh(idx, probs[:n], height=0.8, align='center')
    ax2.barh(idx, probs[-n:], height=0.8, align='center')
    plt.subplots_adjust(hspace=100)
    plt.tight_layout()
    plt.savefig(filename)
    plt.clf()


def main():
    lines = (line for line in sys.stdin)
    df = beer_df(lines)
    corr = df.corr(method='spearman')
    corr_df = corr.ix[2:-1, 2:-1]
    correlation_matrix(corr_df, 'beer-attributes.svg')
    pmfs(df, 'overall-ratings-pmf.svg')
    df['Abv'] = df['Abv'].map(lambda t: abv_group(t))
    data = filter_groups(df)
    post = prob5(data)
    groups = sorted([i for i in post], key=lambda t: t.post_prob)
    terminal_plot(groups, 'abv-probs.svg')

if __name__ == '__main__':
    main()
```
I overlooked one of my all-time favorite beers, Dortmunder Gold, for years until a friend recommended I try it. ↩
Using PCA, the first two dimensions captured ~ 85% of the explained variance in the data. ↩
The term proportional is used to indicate that the product of the likelihood function and the prior must be scaled to integrate to one over the range of plausible values of $\theta$. ↩
Although this analysis helps in choosing good beer, there is no substitute for individual reviews. There are clear outliers like Dortmunder Gold, which has a relatively low ABV and scores as a poor beer style—Dortmunder/Helles. ↩
Note that the dataset is several gigabytes. ↩