### January 5, 2020

Poisson distributions, gamma distributions, and Poisson processes
This post is meant to introduce some friendly and useful distributions — the *Poisson distribution* and the *gamma distribution* —
and to get some intuition for the related *Poisson process*. As a bonus, we'll see
the *exponential* and *negative binomial distributions* appear along the way, with the negative binomial appearing in the role of *prior predictive distribution*.

Poisson distributions are useful when you have whole number counts of events. Gamma distributions are a flexible family of distributions that generalize some other distributions that might be familiar to you: the exponential and chi-squared distributions. Poisson processes can be used to model events occuring with certain independence properties in space and/or time. Exponential distributions show up in modeling wait times between events, among other applications. If these sound like a useful bunch of distributions for modeling in many settings, they are.

As we work with these distributions, we will take advantage of the opportunities that arise along the way to become more familiar with ideas about working with priors, likelihoods, and posterior distributions, recognizing and using conjugacy properties when it is reasonable to do so, and looking at the role of hyperparameters. When we encounter the negative binomial distribution it will be as a prior predictive distribution in our model, which is a useful-to-understand role for a distribution. We'll also take an optional detour to look at different parametrizations of the gamma distribution in more detail, which might sound boring until you — say — spend a bunch of time that you will never get back on failing to replicate someone's methods, only to realize that you and they were parametrizing the gamma distribution differently.

Ready? Let's get started, beginning with the Poisson distribution.

The Poisson distribution is a distribution over the space of whole number outcomes: 0, 1, 2, 3,... . Because a draw from a Poisson distribution can be any whole number, the Poisson distribution is a useful option when you want to model a variable that represents the count of something: car accidents in a month, cases of a disease in a region, customer visits to a store. I first encountered the Poisson distribution in the context of modeling the number of times a neuron fires in a given time period.

The probability mass function for the Poisson distribution gives probability to whole number outcomes 0, 1, 2, 3,... as follows:

The Poisson distribution is governed by one parameter, a positive number denoted λ in our notation. Below you can see the probability mass function for different values of λ.

λ =

Notice that as we increase λ, the center of the distribution takes on higher values and so does its spread. In fact, the mean and variance of a Poisson distribution are both exactly λ.

**Side note:** Did you notice a bell shaped distribution emerging as λ increases? Nice! For large values of λ, we can approximate
a Poisson distribution with a normal (Gaussian) distribution, taking a little care due to the fact that a Poisson distribution is discrete and
the normal distribution is continuous. We can justify the appropriateness of this approximation using the central limit theorem.

Knowing that the parameter λ is both the mean and the variance of a Poisson distribution, we can easily imagine wanting to infer the value of λ.
Suppose, for example, that we have a number of observed count values that we assume are drawn independently and identically distributed (i.i.d.)
from a Poisson distribution. Perhaps the *y* are counts of occurrances of a disease in *N* regions with similar population size and characteristics.
We might like to know λ, the expected count. Multiplying together the Poisson probabilities for each of the *N* independent observation in **y**, we get:

Suppose that we also have some prior information, perhaps due to previous epidemiological studies,
about what might be a reasonable value of λ. We can represent this prior knowledge (and associated uncertainty) of λ with
a *prior distribution* over λ.

What kind of distribution should we use to represent the prior over λ? It would be convenient if the prior distribution p(λ) played nicely with the
Poisson distributed *likelihood* p( **y**| λ), in the sense that if we multiplied the two, we got another recognizable and reasonably friendly distribution over λ.
We recall from Bayes theorem that, as a function of λ, the *posterior* distribution of λ, p(λ | **y** ) is proportional to the product of the
prior and likelihood:

A family of prior distributions is called *conjugate* to a likelihood if, when you combine the two, you obtain a posterior distribution that is in the
same family as the prior. Since our expected counts shouldn't be negative, we would like to choose a prior that gives probability only to positive values of λ.
Moreover, we see that our Poisson choice of likelihood has both a power of λ and an exponentiated λ, which would combine nicely with a prior distribution
that also includes those components. We may conclude that a gamma distribution would
fit the bill nicely as a conjugate prior, as a gamma distribution has the following probability density function (pdf):

The parameters α and β are positive real numbers. Since for us they will play the role of parameters of a prior distribution, we can also call them

*hyperparameters*. Multiplying the likelihood and prior we see that the posterior distribution p( λ |

**y**) is proportional to: Comparing to the pdf of a gamma distribution, we see that:

Looking at how incorporating the data **y** updates the prior distribution also gives us insight into one possible interpretation of the hyperparameters
α and β. Going from the prior to the posterior distribution, α is updated by adding to it the total number of observed events (the sum of the *y*), and β is added to
*N*, the number
of observations (counts) *y* , we can interpret a Gamma(α, β) prior for the parameter of our Poisson distributed observations as encoding
prior information equivalent to having observed α total events over β previous trials. This interpretation helps us understand the relative
strength of a particular choice of Gamma prior in the context of a Poisson likelihood.

**Side note: ** Because they can yield friendly posterior distributions, conjugate priors can ease computation and interpretation in Bayesian modeling.
However, we are lucky enough now to live in an age of computational power that is often sufficient to help us deal effectively with decidedly less friendly
posterior distributions, by, for example, using Markov Chain Monte Carlo (MCMC) methods to sample from them.
So, don't feel like you need to limit yourself to a conjugate prior if it is not appropriate! If opting for conjugacy is appropriate, though, enjoy the ease it provides.

Great, let's look at it. Consider our previous setup, where we assume that each observation *y* is Poisson distributed with Poisson parameter λ, and that
our prior distribution for λ follows a Gamma(α, β) distribution, and we have chosen some values for α and β. Suppose we have made those assumptions but not yet observed any data (i.e., we do
not have any observations *y* yet). What do these assumptions tell us about the probability that our first observation will be *y*?

We don't know what λ is exactly, but we can *marginalize* it out - we can look at the probability of *y* when we incorporate all that prior uncertainty about λ.
This will give us the *prior predictive distribution* for an observation *y*.

To see the marginalization we are performing, where we are "integrating out" λ, and to make explicit the role of the hyperparameters, we can write down the prior predictive distribution as:

To calculate the desired quantity, we can use the following rearrangement of Bayes' rule (where we have dropped the hyperparameters from our notation for simplicity):

Substituting in a Gamma (α + y , β +1) distribution for p(λ |y ), as derived in the previous section, a Gamma(α, β) pdf for p(λ), and a Poisson(λ) distribution for p(y| λ) we obtain:

This distribution for *y* is another friendly neighborhood discrete distribution that is common enough to have a name: it is a *negative binomial*
distribution.

The prior predictive distribution is a useful concept to get your head around: it is the distribution of the data we expect to see just given our
model assumptions and prior, and before we have seen any data. Exploring this distribution can be helpful in checking your model. In
**[Gabry et al. 2019]**, for example, the authors include a discussion of the role of visualizing a prior predictive distribution within a broader Bayesian workflow.

**Optional negative binomial note:** The negative binomial distribution can be characterized as the distribution for the
count of successes \(y\) observed before some specified number of failures \( \alpha \) is reached in a series of independent
experiments each with probability of success \(p\). We can parameterize in terms of \(\alpha, p\) instead of \(\alpha, \beta\) by translating with \(p = \frac{1}{\beta + 1}\), so that
$$p(y) = \binom{\alpha + y - 1}{y} (1-p)^\alpha p^y.$$
The negative binomial distribution has several parametrizations in common use, though, so
exercise caution.

First, note that even if the the gamma distributions are very new to you, they still might be more familiar than they seem: the exponential distribution and chi-squared distribution are special cases of gamma distributions. The pdf for an exponential distribution with parameter θ > 0 is:

We note that this is the same as the above pdf of a gamma distribution with α=1, β=θ.

The pdf for a chi-squared distribution with *k* degrees of freedom is given by:

We note that this is the same as the above pdf of a gamma distribution with α=k/2, β=1/2.

Second, let's take some time to get more familiar with how the parameters of a gamma distribution adjust its shape:

α = β =

Third, when dealing with the gamma distribution, you should be aware that they are a number of different parametrizations of the gamma distribution in use, including those that yield the following pdfs:

The parameter α or *k* is called the *shape* parameter. θ is called the *scale* parameter,
and β is called the *inverse scale* or *rate* parameter. μ is the mean of the distribution, and σ is the standard deviation.
We can translate between the parametrizations with, e.g., the following relationships:

*Bayesian Data Analysis*[Gelman et al. 2013] and

*Pattern Recognition and Machine Learning*[Bishop 2006] use the first parametrization listed above. The JavaScript statistical library

**jStat**uses the second. The Python probabilistic programming and Bayesian statistics library

**PyMC3**gives options to parameterize using the first or third. The Python library

**SciPy**'s

`stats.gamma`

adapts the second parametrization above.
In SciPy, functions like `stats.gamma.pdf`

require a shape parameter
(α or *k*above - called

*a*in the scipy documentation), and by default set θ = 1. You can set θ with the optional

`scale`

argument, and you can also replace the gamma distributed random variable *x*with a shifted version of itself by setting the

`loc`

parameter. Thus, the value of `stats.gamma.pdf(x, 2, loc=3, scale=4)`

is given by second equation above with α = 2, θ = 4, evaluated at λ = x - 3.
The take-home message here is to carefully to note *which* parametrization of the gamma distribution you
are working with. If you are reading or trying to reproduce someone else's work, and they just say, "we use a Gamma(3, 2)
prior", then you might not be able to recreate their work exactly without some detective work into which
parametrization they are using.

On the flip side, if *you* are the one sharing statistical work or writing statistical code that
includes the gamma distribution, I hope you'll make sure it is clear to your readers or users how you are parameterizing.

We can use the Poisson distribution to model the count of events in a certain time period or region of space.
What if we would like to describe not just the counts of these events, but their precise locations in space or time?
For example, suppose we want to simulating the timing of a firing (spiking) neuron, where the spikes happen randomly in time.
A *point process* generates a sequence of events like this, and the most common example of a point process is a *Poisson process*,
which is closely related to the Poisson distribution.

A Poisson process with rate *r* has independent increments between
events, and the count of events on any finite interval *T* follows a Poisson distribution with mean and variance *λ = rT *.
(If you are familiar, compare this relationship of Poisson processes to Poisson distributions to the relationship of Dirichlet Processes to
Dirichlet distributions, where a Dirichlet process over a finite partition of the sample space follows a Dirichlet distribution.
Dirichlet processes were the topic of **another post** on this blog.)

Below, choose a rate *r* to correspond to the expected number of events per unit time,
to visualize a set of points generated from Poisson process with rate *r*. Each event is an orange dot, located at some point on the time axis
The count of the total number of events *y*
on this interval will follow a Poisson(*rT*) distribution.

r =

T = 10

We can imagine the time interval
extending infinitely forward, but here we only visualize a finite interval of size *T* = 10. Moreover, while we could
work with time or space, for simplicity, we'll continue to assume that we're dealing with time.

We won't prove it here, but another way to characterize the Poisson process is to specify that intervals are independently drawn from an exponential distribution
with mean of 1/*r*. Thus, one easy way to generate a series of points following a Poisson process is to generate the points sequentially, drawing
independent exponential random variables for each of the intervals between them. If you replace the exponential distribution with some
other choice of distribution, you won't have a Poisson process anymore, but you will have a generalization that is called a
*renewal process*.

To get more intuition for the independence of the events in a Poisson process, we'll derive the Poisson distribution that characterizes a Poisson process from
the following setup. Break up a time interval of length *T* into subintervals of length Δ*t*. The probability of an event
per unit time is *r*, so the probability of an event happening in a subinterval is given by *rΔt*. Take Δ*t* to be small enough
relative to *r* that we can assume that there is no more than one event happening in each subinterval, and assume that whether or not an event occurs in a subinterval is independent of what happens in all the other subintervals. Now consider the probability of seeing *y*
total events in the whole interval *T*

In the preceding paragraph, we created a setup with M = T/Δt independent trials, each with a probability of success rΔ*t* (where success for a subinterval means containing an event).
The count of successes thus obtained follows a binomial distribution. The binomial distribution approximates the true distribution for the count of events *y*
over the interval of length *T* governed by a Poisson process (with
better approximations as we take the bin size Δ*t* to be smaller.)

Now consider the limiting case as we let the number of subintervals *M = T*/Δ*t* increase towards infinity and the
length Δ*t* of subintervals shrinks towards zero.

We are considering the following (recalling, yet again, that *M = T*/Δ*t* ):

As *M* gets large relative to y, we can approximate M - y by M, and we can approximate M!/(M-y)! = M(M-1)...(M-y+1) by which leaves us with:

On a good day, we also recall the following identity for *e*:

Combining this observation with the fact that , we arrive at our conclusion:

Thus, we showed that the count of events in a finite interval of length T has a Poisson distribution, which is the main characterizing property of a Poisson process. This derivation shows how we can arrive at a Poisson process from assumptions about the independence of events.

We have so far described a homogeneous Poisson process, where the rate *r* is constant. Note that we could also imagine a more general, inhomogenous case where *r*
as varying in time, so that *r = r(t)*. In our example of using a Poisson process to describe a firing (spiking) neuron, we could perhaps imagine *r*, the base rate of spikes,
per unit time, as being a function of some sensory input that the neuron is reacting to, so that different inputs cause higher or lower firing rates.
This is a powerful modeling tool.

We should celebrate the power and elegance of Poisson processes, but we should also note that the independence properties that give the Poisson process its simplicity can be a crucial limitation. Because each event is independent from the others, the Poisson process may be inadequate in situations where you might want interactions and dependencies between events. For example, the Poisson process might provide an approximate description of a neuron firing, but by itself it does not capture something like bursting, where one firing event begins a cluster of closely spaced firings. So use the Poisson process with enthusiasm, but also care and attention to whether it is appropriate.

Thanks for reading, and I wish you much happy modeling of count data (and more) with Poisson distributions and their friends!

I encountered the derivation in the previous section of the Poisson distribution as the limit of a binomial in [Dayan and Abbot 2001]. The textbooks [Gelman et al. 2013] and [Bishop 2006] explain uses of Poisson distributions in Bayesian contexts — and are two of my favorite resources for learning Bayesian statistics in general.

- Bishop, C.M., 2006. Pattern recognition and machine learning. Springer.
- Dayan, P. and Abbott, L.F., 2001. Theoretical neuroscience (Vol. 806). Cambridge, MA: MIT Press.
- Gabry, J., Simpson, D., Vehtari, A., Betancourt, M. and Gelman, A., 2019. Visualization in Bayesian workflow. Journal of the Royal Statistical Society: Series A (Statistics in Society), 182(2), pp.389-402.
- Gelman, A., Carlin, J.B., Stern, H.S., Dunson, D.B., Vehtari, A. and Rubin, D.B., 2013. Bayesian data analysis. Chapman and Hall/CRC.