### December 4, 2019

Beta distributions, Dirichlet distributions and Dirichlet processes

Today I am writing about some of my favorite distributions: the beta distribution, the Dirchlet distribution, and the Dirichlet process. All three are handy options to have at your disposal in specifying priors in certain Bayesian models, and they're also rather lovely in their own right.

One thing that makes these distributions really neat is that they can be thought of as distributions of distributions. That is, if we take a draw from one of these distributions, we get a result that itself fully describes a probability distribution. If we take another draw, we get another probability distribution. If we take a lot of draws, we get a whole bunch of distributions, with our beta distribution/Dirichlet distribution/Dirichlet process governing what kinds of distributions we might expect to see more or less commonly represented in the bunch.

My goal today is to give some visual intuition for how these distributions of distributions work. Along the way, we'll also see examples of simple conjugate Bayesian analysis and hyperparameter interpretation for these distributions. We will begin with beta and Dirichlet distributions and build our way up to Dirichlet processes and beyond.

The probability distribution of a beta distribution is as follows:

The beta distribution has support on the interval from 0 to 1, meaning a draw from a beta distribution will be a real number between 0 and 1. The parameters a and b control the shape of the distribution. The mean of a beta distribution is . For high values of a and b, the probability distribution will have a bump shape. If a and b are below 1, it will have a U shape. If a = b = 1, the beta distribution reduces to a uniform distribution.

Because a sample drawn from the beta distribution will be between 0 and 1, we can think of each draw as the probability of success in a trial or sequence of trials with two possible outcomes: success and failure. For example, it could represent the probability of heads (success) on each independent toss of biased coin. So, once we have a draw from a beta distribution, we can draw from the new distribution that our beta draw specifies. As we generate a number of draws, we'll have success or failure, with the probability of success on each independent draw equal to the sample from the beta distribution that we began with.

Let's get a feel for that visually and interactively. Here is a graph of the probability distribution function of a beta distribution (with a few different options for hyperparameter values. Try sampling from the beta distribution, then using that sample to draw from the distribution it represents. Here, the green draws on the left represent success, the blue failure.

.

The beta distribution is useful as a prior over the probability of for each trial in a bernoulli or binomial experiment. In this case the parameters a and b have a very elegant interpretation.

Note that we can also refer to a and b in this context as hyperparameters, because they are parameters of the prior distribution.

Let's say we are going to flip a coin that might be biased. We'll count a 'success' as flipping heads. We'd like to represent our prior knowledge about the probability of heads (success), a probability we'll denote x. For example, if we think that most coins are fair and so have some idea that ours will be too, we might select a prior that puts more weight on values near 0.5, for example by choosing the following beta distribution with a = b = 5. If we have reason to believe that the coin will be biased towards heads, we could choose a beta distribution that reflects that; for example we could let a = 2 and b = 5. So different values of a and b help us encode our prior beliefs.

Our plan is to flip this coin a few times, to observe some data. Then, we'll use Bayes rule to compute the distribution that gives new idea of what x. That is, we'll combine our prior distribution with the likelihood of some observed data, to get a posterior distribution for x, the parameter of interest. We flip the coin n times and see k heads, remembering that the probability for success on any given flip is our unknown x. The probability of seeing k heads comes from the binomial distribution and is:

This probability of seeing certain data (here, the data is that we saw k successes in n trials) given the assumptions of our model is the likelihood. Bayes' rule tells us that the posterior distribution for x should be proportional to the product of the likelihood and the prior. That is:

Multiplying the binomial expression for likelihood with the beta prior for x, we obtain:

If we include the normalization constant, we have as the posterior distribution for x:

"Wow, that's another beta distribution!" you say, looking at the posterior distribution we just derived. And you're right!

We say that the beta distribution is the conjugate prior for the binomial distribution, because when we combine the likelihood and prior, we get a posterior that has the form of the prior again: another beta distribition.

Not only that, we see that playing the role of the first parameter of the beta distribution in the posterior is a + k, that is a plus the number of successes we saw in the data. Playing the role of the second parameter is b + n - k, hyperparameter b plus the number of failures in the data. If we increased a by one in setting the prior, that would affect the posterior in the same way as seeing one more success in the data would have. If we increased b by one in setting the prior, that would affect the posterior in the same way as seeing one more failure in the data would have.

Thus we can think of a and b as pseudocounts of successes and failures, respectively. We can think of our prior idea about the fairness of the coin is the same as the idea we would have had if our prior came from observing a binomial trial where we saw a successes and b failures. In light of this interpretation, those hyperparameters a and b are not so mysterious after all. Pretty neat!

To tie these ideas back to a concrete example, let's say we put a beta prior over the probability of heads in our coin toss with a = b = 5. Our prior idea of the fairness of the coin is as if all we knew about it was that on 10 independent flips, we had seen 5 heads and 5 tails - the pseudocounts. Then, suppose we toss the coin n = 10 times and see k = 9 heads. Now we might have a little more suspicion that x, the probability of heads is more than 0.5. In fact our uncertainty about x after seeing the data is captured by a beta distribution with parameters a + k = 14 and b+ n - k = 6. We plot these example prior (green) and posterior (blue) distributions below.

A draw from beta distribution is great for describing the probability of outcomes when there are only two possible outcomes (e.g. success/failure or heads/tails). If we'd like to have a distribution whose draws specify probability distributions over more than two outcomes, we can use the Dirichlet distribution, which is a generalization of the beta distribution. In Bayesian modeling, Dirichlet distributions are handy for priors over multinomial or categorical variables, for instance, you might use them to describe some prior idea above the relative weights of different mixing components in a mixture model.

Each draw from a Dirichlet distribution is a vector of prespecified length M that sums to one. For example, (0.2, 0.3, 0.4, 0.1) could be a draw from a Dirichlet distribution with M = 4. We can view this vector a probability distribution over a discrete random variable with M = 4 possible outcomes, the first occurring with probability 0.2, the second with probability 0.3, and so on. In place of the two parameters a and b, we have a vector of M parameters that control the shape of a Dirchlet distribution, and thus what kinds of vectors of probabilities are likely or unlikeley to be drawn from it.

The probability density function for a Dirichlet distribution is a follows:

Here are some illustrations of Dirichlet probability distributions for M = 3. (Higher values of M get difficult to visualize on a screen!)

We can play the same game as before, sampling from a Dirichlet distribution and then using that sample to draw new samples from a discrete distribution, this time with more than two possible outcomes.

.

Once again we have conjugacy, and a helpful interpretation of hyperparameters as pseudocounts. Just as the beta distribution is the conjugate prior for the Bernoulli and binomial distributions, the Dirichlet distribution is the conjugate prior for the categorical and multinomial distributions.

Suppose we have n independent trials, where the probability of each of the M possible outcomes occuring in each trial is given by the corresponding element of the x, where x is still M-dimensional and its elements sum to one. We use a Dirichlet distribution to represent our prior knowledge about x. We will then observe data and update our idea of x. We represent a specific set of outcomes by the M-dimensional vector k, where each element of k is a whole number that gives the number of appearances of the corresponding outcomes in the n trials (so that the sum of all elements in k is n). The probability of observing k follows a multinomial distribution (which generalizes the binomial distribution):

If we set a Dirichlet prior over the vector of probabilities x, and then combine it with some vector k representing data about the outcomes of n trials, we can multiply the two to get the following posterior distribution:

Including the constant of normalization, we have:

So the posterior distribution is a Dirichlet distribution, just like the prior - we have conjugacy. Moreover, if we look at how the hyperparameters of a add with counterparts in the vector of counts k from the data, we see the elements of a playing the same role of pseudocounts as the hyperparameters a and b did in the beta distribution. Delightful!

The theme of this post is distributions of distributions. A sample from a Dirichlet process itself almost surely (in the technical sense of that expression) specifies a discrete probability distribution. However, while a Dirichlet distribution yielded distributions over M outcomes, a Dirichlet process (DP) can yield distributions over infinitely many outcomes. As we did with beta distributions and Dirichlet distributions, we want to break down how we get to this sample that itself specifies a distribution, and then how we generate samples from that distribution.

A Dirchlet process is specified using two parameters, a positive real number α, and a probability distribution H. "Whoa," you say, "one of the 'parameters' is a whole probability distribution?" That's right. It's called the base distribution.

H can be continuous or discrete. To generate the visualization below, we've used a standard normal distribution for H, but you could just as easily use something else. You get to choose α, though:

α =

One way to conceptualize a draw from a Dirichlet process is as a stick-breaking process. We start with a stick of length one:

Next, we draw a value from H. We're going to pair that value with some probability mass. The probability mass comes from a portion of the stick. What portion of the stick? A random portion, determined by a draw from a Beta(1, α) distribution.

Draw from Beta(1, α):

Now we draw another value from H. We pair it with a portion of the probability that is not yet allocated (the portion of the stick that is not yet colored in, above). What portion of what's left? A random portion, determined by a draw from a Beta(1, α) distribution.

Draw from Beta(1, α): .
The probability mass that gets associated with the second draw from H is (1-(first beta draw)(second beta draw) .

We continue in this way for infinitely many discrete steps, drawing a value from H and associating it with a certain probability mass. The result is a discrete distribution over our (countably infinite) draws from H. The support is given by set of draws we made from H, which we'll denote by {}, and the and the probability for each element in the support comes from the corresponding portion of the broken stick, which we will denote {} (noting that they sum to one). We can thus represent the result of this procedure, which represents one draw from the Dirichlet process DP(α,H) with the following mathematical expression:

Here δ is the dirac delta. Because each of the portions of the stick were sampled from Beta(1, α), we can see that decreasing α will cause us to allocate larger chunks of the stick to begin with. Thus, we can get a sense for how α controls the concentration of a draw from the Dirichlet process: smaller values of α lead to higher concentrations of probability mass on a few elements.

Incidentally, the distribution over sequences of β generated by this stick breaking procedure, parametrized by α, is called the GEM distribution. The name GEM comes from the initials of the authors Griffiths, Engen, and McCloskey.

During our discussions of the the beta and Dirichlet distributions, we used a sample from the beta or Dirichlet distribution to draw from the distribution it specified. We could imagine doing that with the stick breaking process - using the stick breaking process to tell us the probability given to each of the infinitely many dicrete draws from H, and generating samples from there. But, we can actually generate samples the distribution given by a draw from a Dirichlet process without ever specifying what that draw was. Bypassing an explicit representation of the initial draw can help us deepen our understanding, and it is also a useful view for a number of computations (e.g. using Gibbs sampling for inference).

To begin a metaphor, imagine an empty restaurant. A person walks in, and sits at a table. They find out that everyone at that table is to be served a random "dish". It's not a real dish, but actually just a random draw from the distribution H. Yum.

More people walk in, one at a time, and settle in at more tables, each of which is associated with its own independently drawn sample from H. The key is in how each person decides where to sit. With probability proportional to α, a person will boldly strike out on their own and choose a new table. Alternatively, they will choose to sit at an occupied table with probability proportional to the number of people who are already there (we imagine them reasoning that if other people chose a table, it must be good, so joining an already popular table seems appealing). Normalizing these probabilities properly, the nth person choose a new table (associated with a new draw from H) with a probability of and choose an existing table with probability , where is the number of people already at table t. You'll see this process described as the "Chinese Restaurant Process" in some of the literature.

Below, we visualize this generation of samples based on a single draw from a DP. The top graph shows H. We'll mark up the graph of H as we make independent draws from it to associate with each new table. Below, we see the simulation of people choosing a table, where each table is associated with an independent draw from H. The number of tables will continue to grow, and we can't show all of them, so here we're only picturing the first few. (Any samples from the base distribution that are marked in gray are from tables that are not shown.)

What about conjugacy? Like the beta and Dirichlet distributions, Dirichlet processes also have some nice conjugacy properties. In particular, if G ~ DP(α, H), and we have a set of observations ~ G (i.i.d.), then the postierior distribution for G is also a Dirichlet process:

We can see that each observation updates the base distribution with additional probability mass according to the observation, in the same way that we updated the psuedocount parameters of beta and Dirichlet distributions by adding in counts of observations.

One can futher derive the predictive distribution, which should remind us of the table selection process above:

When might you use a DP? To touch on the practical utility of Dirichlet processes in modeling, it can be helpful to note the limitations of a Dirichlet distribution. We noted that Dirichlet distributions are useful in specifying priors for, e.g. the weights of different components in a mixture model, when we know in advance the number of components (M, in the notation above). However, it is very often the case that we do not know in advance how many mixture components we have, and that in fact as we observe more data, we expect to continue to accumulate more components. For example, suppose we are modeling mixtures of topics in some natural language corpus; we do not know how many topics to expect, and as we see more data, we might need to add more topics to adequately descibe it. Or, perhaps we are looking for subtypes of a disease based on clusters in the disease characteristics, but we do not know how many subtypes there might be. In such cases, a Dirichlet process may fit nicely with our intuition that as we observe more data, we may also accumulate more mixture components (tables, in the restaurant metaphor), and the draws from base distribution may be used in parameterizing the components themselves. It's a handy tool to have at your disposal!

Phew! That's it on DPs for now. But in thinking about DPs as priors, we just dipped our toe into Bayesian nonparametrics - there is a whole world out there to explore if you want!

Look, it took a real battle against my instincts as a mathematician not to just start with the formal definition and assume everything was clear from there. But I'm so glad you asked.

Let (X, B) be a measurable space, with H a probability measure on that space. Then the stochastic process G indexed by the sets in B is a Dirichlet process distributed as DP(α, H) if for any finite measurable partition of X, the random vector has a Dirichlet distribution with parameter vector .

"Just writing down a definition like that doesn't guarantee that G exists," you might say. Fantastic point! The existence of the Dirichlet process was proven in [Ferguson, 1973].

The explicit connection to the Dirichlet distribution given by the definition can help us better understand how both the stick-breaking process and the Chinese restaurant process represent a Dirichlet process. A full proof is beyond the scope of this post, but to point to some rough intuition, we note that the relationship to the Dirichlet distribution can be used to derive the conjugacy relations and predictive distribution in the previous section, which gives rise to the table-choosing proportions of the restaurant process. For the stick-breaking construction, note that since a Dirichlet process corresponds to a Dirichlet distribution over any finite partition of X, then if we partition X into, say, the set containing our first draw from H and its complement, that should follow a Dirichlet distribution with M=2, i.e. a beta distribution. For rigor and a deeper understanding, we refer the reader to [Sethuraman 1994], which establishes the stick-breaking construction of a beta process, [Blackwell and MacQueen, 1973], which presents a Polya urn scheme closely related to the Chinese Restaurant process, and the exposition in Yee Why Teh's excellent [Tutorial].

Another foundational paper in the realm of DPs is [Antoniak, 1974], which explored theory and applications of mixtures of Dirichlet processes.

For a somewhat more recent exposition about Dirichlet processes, see, for example [Teh et al., 2006]. This paper introduces hierarchical Dirichlet process, which will be the topic of an upcoming blogpost.

Acknowledgements: Thank you to George Ho for helpful feedback on this post.

These distributions are cool in their own right, and I've found them useful in applied Bayesian modeling. The interpretation of the hyperparameters in terms of pseudocounts has a lovely logic to it. I know I found this interpretation soothingly concrete when I was getting started with Bayesian modeling, back when choosing hyperparameters seemed quite mysterious. I hope you find some elegant and useful applications of these distributions too.

Lastly, stay tuned. An additional motivation for writing this post was to lay the groundwork for a discussion of a related but more complex topic: hierarchical Dirichlet processes. HDPs have some interesting applications in topic modeling and natural language processing, as well as time series modeling with hidden Markov model variants. If you like beta distributions, Dirichlet distributions, and Dirichlet processes, you'll love hierarchical Dirichlet processes, so get psyched!