This post is about gambling and how to reason about the odds of winning wagers in multiplayer games.

### Background

Last week my mom found herself in an intriguing gambling situation during a game of Mahjong. There were 3 other players in the game, whom I’ll call Alice, Betty, and Clara. At one point, Alice proposed a wager: if she won the next game, each player had to pay her \$20; otherwise, she would pay each player \$20. When Alice proposed the wager, the number of games won by each player was as follows:

Alice’s bet seems reasonable; she has won over half of the total games played. Her perceived dominance likely prompted her to propose the wager, believing that she had a good chance of winning the next game. The question is, should the other players take Alice’s bet? Whom does this bet favor, Alice or the other players?

I was interested in this problem because it resembled a famous gambling puzzle called the problem of points, which was worked out in the 17th century through a series of letters between Pascal and Fermat. Pascal and Fermat were interested in determining how to divide a pot of winnings between two players if the game was suddenly stopped. Their work on this problem is widely regarded as the birth of modern probability theory.

Alice’s bet is a more complex variant of the problem of points with an added twist. When the Mahjong game is suddenly terminated after the next game, instead of splitting a pot of earnings, each player is interested in estimating the probability that Alice will win the game, so that they can decide whether to take the bet. If I were trying to maximize my chances of winning this wager, I would model the situation as follows.

### Model

With the data in Figure 1, I had one realization of a sample drawn from a discrete distribution, $X \sim Discrete(\theta)$, where $X \in \{1, \dots, K\}$ is a categorical random variable indicating which of the $K$ players wins a game. Let $N_j$ denote the number of games won by the $j$th player, $N = \sum_{j=1}^{K} N_j$ the total number of games played, and $\theta_j$ the probability of the $j$th player winning a game. The maximum likelihood estimator for $\theta_j$ is simply $\hat{\theta}_j = \frac{N_j}{N}$.

The number of games won by each player can be thought of as an outcome from a multinomial distribution. The multinomial is a generalization of the binomial distribution and models the counts of $K$ mutually exclusive outcomes; in this case, the number of Mahjong games won by each player. The multinomial distribution is composed of two components: the number of ways in which a fixed number of wins can be assigned to $K$ players, and the probability of any one of these assignments. Assuming the games are independent, the probability of an outcome is given by the product of these two components, which is the probability mass function of the multinomial:

$$f(n_1, n_2, \dots, n_K | \theta_1, \theta_2, \dots, \theta_K) = \frac{\left(\sum_{j=1}^{K}n_j\right)!}{\prod_{j=1}^{K}n_j!} \prod_{j=1}^K \theta_j^{n_j}$$
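As a quick sanity check on the mass function, here is a minimal Python sketch. The function name `multinomial_pmf` and the example counts are illustrative assumptions and are not taken from the game data.

```python
from math import factorial, prod

def multinomial_pmf(counts, thetas):
    """Probability of observing `counts` wins given win probabilities `thetas`."""
    if len(counts) != len(thetas):
        raise ValueError("counts and thetas must have the same length")
    # Number of ways to assign the wins to players: (sum n_j)! / prod(n_j!)
    coefficient = factorial(sum(counts))
    for n in counts:
        coefficient //= factorial(n)
    # Probability of any single assignment: prod(theta_j ** n_j)
    probability = prod(t ** n for t, n in zip(thetas, counts))
    return coefficient * probability

# Hypothetical example: 4 equally skilled players, 4 games, one win each.
print(multinomial_pmf([1, 1, 1, 1], [0.25] * 4))  # 0.09375
```

With two players the function reduces to the binomial, which is an easy way to convince yourself that the two components are combined correctly.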

The multinomial is a member of the exponential family and, accordingly, has a conjugate prior, the Dirichlet. Together, this conjugate pair forms the Dirichlet-multinomial model. This relationship is desirable for modeling in a Bayesian framework because the product of the likelihood and the prior produces a recognizable posterior kernel that is integrable. The posterior distribution of $\theta$, when normalized, is itself a Dirichlet (see note 1):

$$\begin{eqnarray} \Pr(\theta | N) &\propto& Multi(N|\theta)Dir(\theta|\alpha) \\ &\propto& \left( \prod_{j=1}^K \theta_j^{n_j} \right) \left( \prod_{j=1}^K \theta_j^{\alpha_j-1} \right) \\ &\propto& \prod_{j=1}^K \theta_j^{n_j+\alpha_j - 1} \\ &\propto& Dir(\theta | N+\alpha) \end{eqnarray}$$
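In code, the conjugate update amounts to adding the observed counts to the prior concentration parameters. The sketch below is only an illustration: the function names and the example numbers are hypothetical.

```python
def dirichlet_posterior(counts, alphas):
    """Conjugate update: Dirichlet(alpha) prior plus multinomial counts."""
    if len(counts) != len(alphas):
        raise ValueError("counts and alphas must have the same length")
    # Posterior concentration parameters are alpha_j + n_j.
    return [a + n for a, n in zip(alphas, counts)]

def dirichlet_mean(alphas):
    """Expected value of each theta_j under Dirichlet(alphas)."""
    total = sum(alphas)
    return [a / total for a in alphas]

# Hypothetical example: one pseudo-count per player and made-up win counts.
posterior = dirichlet_posterior([3, 1, 0, 0], [1, 1, 1, 1])
print(posterior)                  # [4, 2, 1, 1]
print(dirichlet_mean(posterior))  # [0.5, 0.25, 0.125, 0.125]
```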

The Dirichlet distribution is a generalization of the beta distribution and is defined over multinomial parameter vectors $\theta$. Its own parameters, $\alpha_1, \alpha_2, \dots, \alpha_K$, are sometimes called concentration parameters. They can be thought of as pseudo-counts representing the number of games won by each player before any data is observed. Writing $\alpha = \sum_{j=1}^K \alpha_j$ for the total number of pseudo-counts and $N = \sum_{j=1}^K N_j$ for the total number of games, the pseudo-counts regularize the estimate: the posterior mean is a weighted average of the prior mean, $\frac{\alpha_j}{\alpha}$, and the maximum likelihood estimate, $\hat{\theta}_j = \frac{N_j}{N}$. The expected value of $\theta_j$ is then given by:

$$E[\theta_j | N] = \frac{\alpha}{N + \alpha}\frac{\alpha_j}{\alpha} + \frac{N}{N+\alpha} \hat{\theta}_j = \frac{\alpha_j + N_j}{\alpha + N}$$

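Because the posterior is a Dirichlet, the marginal posterior of each $\theta_j$ is a beta distribution, which gives the full probability distribution of each player’s win probability rather than only its expected value. With $\alpha = \sum_j \alpha_j$ and $N = \sum_j N_j$ denoting the totals, the marginal is:

$$\theta_j | N \sim Beta(\alpha_j + N_j, (\alpha - \alpha_j) + (N - N_j))$$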
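A credible interval can be read off a beta marginal numerically. The following is a rough sketch using only the standard library: it approximates the CDF with a midpoint rule rather than calling a statistics package, and the function names are hypothetical. The example assumes Alice won 8 of the 15 games (implied by the 53% figure quoted later in the post) and no prior pseudo-counts, which gives a $Beta(8, 7)$ marginal.

```python
from math import exp, lgamma, log

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x, for 0 < x < 1."""
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(x) + (b - 1) * log(1 - x))

def beta_credible_interval(a, b, mass=0.95, steps=20000):
    """Equal-tailed interval from a midpoint-rule CDF (adequate for a, b >= 1)."""
    lower_tail = (1 - mass) / 2
    upper_tail = 1 - lower_tail
    width = 1 / steps
    cumulative = 0.0
    lower = upper = None
    for i in range(steps):
        x = (i + 0.5) * width  # midpoints avoid log(0) at the endpoints
        cumulative += beta_pdf(x, a, b) * width
        if lower is None and cumulative >= lower_tail:
            lower = x
        if cumulative >= upper_tail:
            upper = x
            break
    return lower, upper

# ASSUMPTION: 8 wins for Alice, 7 for everyone else, zero pseudo-counts.
low, high = beta_credible_interval(8, 7)
print(f"95% credible interval: [{low:.2f}, {high:.2f}]")
```

Under these assumptions the interval comes out at roughly 0.29 to 0.77, which is wide enough to show how little 15 games say about Alice’s true win probability.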

With the full posterior of each player in hand, it is possible to calculate the variance, the maximum a posteriori estimate, credible intervals, or almost anything else of interest in this model. The only remaining step was to choose a prior. What is my prior belief that each player will win the next game? To answer this, I consulted my mom, as she has played Mahjong with the other players often enough to form a reasonable estimate. She estimated the prior probabilities to be 0.20 for Alice, 0.35 for Betty, 0.30 for herself, and 0.15 for Clara. The remaining question was how to weight this prior in the model. How strongly do I believe that my mom’s estimates are accurate?

### Visualization

In most situations, Bayesian reasoning proceeds by building a model, gathering data, and conditioning the probabilities of interest on that data. As more data is introduced, the model is updated to reflect the new evidence. This recalibration is sometimes referred to as Bayesian updating.

In my model, I had to work in reverse. The data was fixed; I had only observed the outcome in Figure 1. In this situation, the data was held constant and the choice of prior was the sole variable affecting inference. As the size of the prior grows, reflecting greater confidence in it, the expected values of the posterior are pulled toward the prior probabilities. To understand how my model was affected by the choice of the prior, I built a visualization:

The above visualization shows the posterior distributions of the expected value for each player winning the next game. The x-axis shows the relative probability that a given player will win the next game and the y-axis shows relative density. The size of the prior is adjustable by moving the slider below the main visualization. Hovering over a player’s name in the legend shows the 95% credible interval for the specified player at the selected prior.

### Conclusions

When the slider is used to set $N=0$, there is no prior information and the resulting probabilities of each player winning the next game are simply the maximum likelihood estimates derived from the data. This is probably the reasoning Alice used to conclude that she would likely win the wager. From the data alone, Alice has a $53\%$ chance of winning the next game. Under these conditions, she is making a reasonable bet. However, the 95% credible interval is $0.29 < \theta_{Alice} < 0.77$, indicating a large degree of uncertainty around this estimate.

I was skeptical that the observed data alone would be a good estimate of the true underlying probability that Alice would win the next game. In a small sample of only 15 games, the observed probabilities could deviate wildly from the true odds that each player would win the next game. The wide credible intervals confirm this notion. I was willing to place more confidence in my mom’s ability to generate reasonable estimates; however, the observed data was still valuable information. Thus, I wanted a model that balanced the influence of the data and the prior in a way that reflected this logic.

I weighted my mom’s estimate at 4x the weight of the data by setting the prior size to $N=60$, four times the 15 observed games. With the prior set to this value, the model suggested that Alice would win the next game $26.6\%$ of the time, with a credible interval of $[17\%, 37\%]$. Interestingly, this model also suggests that each player’s probability of winning a game is closer to the 0.25 expected from random chance: 0.27 for Alice, 0.33 for Betty, 0.27 for my mom, and 0.13 for Clara.
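The arithmetic behind these numbers can be sketched in a few lines of Python. The prior probabilities are my mom’s estimates quoted earlier. The win counts are an assumption: they are not listed in the text, and the values below are inferred from the reported 53% for Alice over 15 games and from the posterior probabilities reported above. The function name `posterior_means` is likewise hypothetical.

```python
# Mom's prior win probabilities, as quoted in the post.
prior_probs = {"Alice": 0.20, "Betty": 0.35, "Mom": 0.30, "Clara": 0.15}

# ASSUMPTION: inferred win counts (15 games total), not stated in the text.
wins = {"Alice": 8, "Betty": 4, "Mom": 2, "Clara": 1}

def posterior_means(wins, prior_probs, prior_size):
    """Posterior mean win probability per player for a given prior size."""
    total_games = sum(wins.values())
    means = {}
    for player, n_j in wins.items():
        # The prior contributes prior_size * prior probability pseudo-counts.
        pseudo_count = prior_size * prior_probs[player]
        means[player] = (n_j + pseudo_count) / (total_games + prior_size)
    return means

for size in (0, 60):
    rounded = {p: round(m, 3) for p, m in posterior_means(wins, prior_probs, size).items()}
    print(f"prior size {size}: {rounded}")
```

With a prior size of 0 this returns the maximum likelihood estimates, and with a prior size of 60 the assumed counts give values matching the reported probabilities to two decimals.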

In the actual game, the other players took Alice’s bet and subsequently won the wager. My mom won the next game and Alice had to pay each player \$20. With $\approx 3:1$ odds against Alice winning, I would also have taken this bet if I were one of the players in the game. Based on my model, I would conclude that Alice likely committed a base rate fallacy. She placed too much credence in her short-term winning streak and too little weight on her past performance.
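Because the wager pays even money, it favors the other players whenever Alice’s chance of winning the next game is below one half. A minimal sketch of the expected payoffs follows, using the two probabilities quoted above (0.53 from the data alone and 0.266 with my mom’s prior); the function name and the treatment of the next game as a single independent draw are assumptions.

```python
def wager_expected_values(p_alice_wins, stake=20.0, opponents=3):
    """Expected payoff of Alice's wager for one opponent and for Alice."""
    # An opponent gains the stake if Alice loses and pays it if she wins.
    opponent_ev = stake * (1 - p_alice_wins) - stake * p_alice_wins
    # Alice is on the other side of the same bet with every opponent.
    alice_ev = -opponents * opponent_ev
    return opponent_ev, alice_ev

for label, p in [("data only", 0.53), ("with mom's prior", 0.266)]:
    opponent_ev, alice_ev = wager_expected_values(p)
    print(f"{label}: opponent {opponent_ev:+.2f}, Alice {alice_ev:+.2f}")
```

At 0.266, each opponent expects to gain about \$9.36 from the bet, while at 0.53 the bet is slightly in Alice’s favor, which is the gap between the two views of the same 15 games.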

1. The full derivation of the conditional distribution is presented on the corresponding Wikipedia page