August 27, 2021

A little bit of ARIMA (AutoRegressive Integrated Moving Average models)

The goal of this post is to give basic background about how ARIMA models for time series work. Why? 1) For fun, and 2) because ARIMA models are useful — both on their own and as building blocks that show up in larger or fancier models.

We'll start with the picture this time. You can play around and see what kinds of time series you can get by generating from ARIMA models of various sorts and with various parameters.

[Interactive plot: choose the autoregressive, integration, and moving average orders and the coefficients to generate a series \(y_t = \epsilon_t + a_1 y_{t-1} + a_2 y_{t-2}\), with noise \(\epsilon_t \sim N(0, \sigma)\) and adjustable \(\sigma\).]

We'll look at each piece of the ARIMA model in turn: the AR part, the I part, and the MA part.

The autoregressive (AR) part of ARIMA involves modeling each value in the time series as a linear combination of some of its previous values, plus some random noise (or "error" or "random shocks" in other terminology). An AR model of order p, which for short we call an AR(p) model, is given by: \[y_t = \epsilon_t + a_1 y_{t-1} + a_2 y_{t-2} + ... + a_p y_{t-p}\] Sometimes you'll see a constant tacked onto the sum as well.

The noise term \(\epsilon_t\) most often is assumed to be normally distributed with mean zero, but other choices are possible.
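
To make this concrete, here's a minimal sketch of generating from an AR(p) process (in Python with numpy, which this post doesn't otherwise depend on; the helper name simulate_ar, the coefficients, and the default \(\sigma\) are all illustrative choices):

```python
import numpy as np

def simulate_ar(a, sigma=1.0, n=200, seed=0):
    """Generate n values from y_t = eps_t + a_1 y_{t-1} + ... + a_p y_{t-p}."""
    rng = np.random.default_rng(seed)
    p = len(a)
    y = np.zeros(n + p)                        # pad with p zeros as start-up values
    eps = rng.normal(0.0, sigma, size=n + p)
    for t in range(p, n + p):
        prev = y[t - p : t][::-1]              # y_{t-1}, y_{t-2}, ..., y_{t-p}
        y[t] = eps[t] + np.dot(a, prev)
    return y[p:]                               # drop the start-up values

y = simulate_ar(a=[0.6, -0.2])                 # an AR(2) example
```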

An ARIMA model has an order for each of its three parts (AR, I, and MA), and we write these as an ordered triple as in the title of the graph above, e.g. ARIMA(2,1,0). So, an AR(p) model is an ARIMA(p, 0, 0) model.

We can write the equation for an AR process in a slightly different way if we define the lag operator L (a.k.a. the backshift operator) that maps an element of a time series to the previous element: \(L^i y_t = y_{t-i}\). We rearrange the previous equation to get \(y_t - a_1 y_{t-1} - a_2 y_{t-2} - ... - a_p y_{t-p} = \epsilon_t\), and then rewrite with L as: \[(1-\sum_{i=1}^pa_iL^i)y_t = \epsilon_t\] This new form with L may seem opaque, but it actually has some advantages. The \(1-\sum_{i=1}^pa_iL^i\) term that appears on the left hand side looks like a polynomial in L (one that's so important it's called the characteristic polynomial of the AR process), and if we treat it that way, we can say some nifty things. For instance, we can say that the AR process is only wide-sense stationary if all of the real and complex roots of the polynomial have modulus greater than 1. (This fact is easiest to convince yourself of in the case of an AR(1) process, where the characteristic polynomial is \(1-a_1L\), which has a root of \(1/a_1\). Consider what the model does when \(|a_1| > 1\).)

It might be interesting to experiment with what kinds of AR(2) processes you can make in the plot above. If you want to keep the roots of the characteristic polynomial at modulus greater than 1, you'll need \( |a_2| \lt 1 \) and \(a_2 - 1 \lt a_1 \lt 1 - a_2\). Check out the different behavior between coefficients that give the characteristic polynomial two real roots, when \(a_1^2 + 4a_2 \gt 0 \), versus coefficients that give it two complex roots, when \(a_1^2 + 4a_2 \lt 0\).
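
If you'd rather check the stationarity condition numerically than by hand, here's a small sketch (assuming numpy; the function names are made up for illustration) that finds the roots of the characteristic polynomial and checks their moduli:

```python
import numpy as np

def ar_char_roots(a):
    """Roots of the characteristic polynomial 1 - a_1 L - ... - a_p L^p."""
    coeffs = np.r_[[-c for c in a[::-1]], 1.0]  # np.roots wants the highest power first
    return np.roots(coeffs)

def is_stationary(a):
    """Check the condition above: every root outside the unit circle."""
    return bool(np.all(np.abs(ar_char_roots(a)) > 1.0))

print(is_stationary([0.5, 0.3]))  # True: inside the AR(2) region described above
print(is_stationary([1.1]))       # False: the AR(1) root 1/1.1 is inside the unit circle
```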

Confession: I ignored the I part of ARIMA models for a pretty long time, assuming that it must be something complicated and a long leap from friendly old ARMA models. Turns out it's actually the simplest part of the model.

The idea is that sometimes it happens that an ARMA model isn't a suitable model for the sequence \(\{y_t\}\), but is a suitable model for the differences between subsequent values of \(y_t\). That is, it may make more sense to fit an ARMA model to the sequence \(\{y_t' := y_t - y_{t-1} \}\) than to \(\{y_t\}\). If we fit an ARMA(p, q) model to the differenced sequence \(\{y'_t\}\), we can say that we're fitting an ARIMA(p, 1, q) model to the original data \(\{y_t\}\).

If taking the differences doesn't get us something that is well described by an ARMA model (e.g., it has a consistent trend), then sometimes it might help to take the differences of the differences — \(\{y_t''= y'_t - y'_{t-1} \}\)— and in doing so fit an ARIMA(p,2,q) model. The number of times we take differences is the I order of the ARIMA model.
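
In code, the differencing step is just repeated first differences; for instance, with numpy (an assumption, since the post isn't tied to any library), np.diff with n=d applies the difference operator d times:

```python
import numpy as np

y = np.array([2.0, 3.0, 5.0, 8.0, 12.0, 17.0])  # a small made-up series
y_d1 = np.diff(y, n=1)  # y'_t  = y_t  - y_{t-1}   -> [1. 2. 3. 4. 5.]
y_d2 = np.diff(y, n=2)  # y''_t = y'_t - y'_{t-1}  -> [1. 1. 1. 1.]
```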

If you enjoyed the lag operator from the previous section, you might note that we can use it to write \(y_t' = y_t - y_{t-1} = (1-L)y_t\). Similarly, \(y_t'' = (1-L)^2 y_t \). So, we can rewrite the equation for an ARIMA(p, d, q) process as: \[(1-\sum_{i=1}^pa_iL^i)(1-L)^d y_t = (1+\sum_{i=1}^q b_i L^i) \epsilon_t\] Considering the left hand side of this equation helps us see that we could consider this ARIMA(p, d, q) model an ARMA(p+d, q) model (we have a polynomial in L of degree \(p+d\) on the left hand side).
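
To see that degree claim concretely, you can multiply the coefficient sequences of the AR polynomial and \((1-L)^d\) (a quick sketch assuming numpy, with made-up coefficients; np.convolve on coefficient arrays is just polynomial multiplication):

```python
import numpy as np

# Coefficients in increasing powers of L; the values here are illustrative.
ar_poly = np.array([1.0, -0.6, 0.2])   # 1 - 0.6 L + 0.2 L^2  (p = 2)
diff_poly = np.array([1.0, -1.0])      # 1 - L
d = 1

combined = ar_poly
for _ in range(d):
    combined = np.convolve(combined, diff_poly)  # polynomial multiplication

print(len(combined) - 1)  # degree of the combined polynomial: p + d = 3
```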

Note 1: Why is it I for integrated when I've been talking about differencing? Well, if we take differences to get from our original data to something we can model well with an ARMA process, then we'd integrate (sum) to get from the quantities we model (the differences) back to the quantities of interest. Hence autoregressive integrated moving average model.

Note 2: Differencing certainly doesn't always yield something that you can model with an ARIMA process. For example, if you have a sequence where the variance is changing in time, that's an obstacle to ARIMA being a good fit, and differencing won't change that. (Although it's possible that some other transformation, such as taking the logarithm of each term, might help.)

Note 3: Seasonal differencing is another kind of differencing that works differently from the I part of the ARIMA model. Many time series that we care about show seasonal effects, e.g. where weather or production or people's behavior may consistently vary at different times of the year, or some other periodic behavior. An ARIMA model on its own can't handle that seasonality, but it might be appropriate to take the seasonal differences and then fit an ARIMA model to those differences. For instance, if each time point is one month apart and we have a pattern in the data \(y_t\) that repeats every twelve months, we can try using an ARIMA process to model not \(y_t\) itself but another sequence \(z_t := y_t - y_{t-12} = (1-L^{12})y_t \).
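
As a tiny sketch of that seasonal difference (numpy again, with a placeholder monthly series standing in for real data):

```python
import numpy as np

season = 12                       # monthly data with a yearly pattern
y = np.arange(48, dtype=float)    # placeholder monthly series
z = y[season:] - y[:-season]      # z_t = y_t - y_{t-12}
```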

An MA model of order \(q\) involves modeling the error as a linear combination of the previous \(q\) error terms — a bit of a different idea than the dependence on previous time series values in AR models, particularly since the error terms \( \epsilon_t \) are unobserved. You might imagine that the \(\epsilon_t\), as "random shocks", have a direct influence beyond the current time point, and MA models allow you to incorporate that influence in a linear way. Concretely, an MA(q) process is given by: \[y_t = \epsilon_t + b_1 \epsilon_{t-1} + ... + b_q \epsilon_{t-q}\]
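
And here's a minimal sketch of generating from an MA(q) process, analogous to the AR sketch above (numpy; the name simulate_ma and the coefficients are illustrative):

```python
import numpy as np

def simulate_ma(b, sigma=1.0, n=200, seed=0):
    """Generate n values from y_t = eps_t + b_1 eps_{t-1} + ... + b_q eps_{t-q}."""
    rng = np.random.default_rng(seed)
    q = len(b)
    eps = rng.normal(0.0, sigma, size=n + q)   # q extra shocks as pre-sample values
    y = np.empty(n)
    for t in range(n):
        past = eps[t : t + q][::-1]            # eps_{t-1}, ..., eps_{t-q}
        y[t] = eps[t + q] + np.dot(b, past)    # eps[t + q] is the current shock
    return y

y = simulate_ma(b=[0.4, 0.3])                  # an MA(2) example
```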

The ARMA(p, q) model is given by: \[y_t = a_1 y_{t-1} + ... + a_p y_{t-p} + \epsilon_t + b_1 \epsilon_{t-1} + ... + b_q \epsilon_{t-q}\] We can write the full ARIMA(p,d,q) process, using the lag operator, as: \[(1-\sum_{i=1}^pa_iL^i)(1-L)^d y_t = (1+\sum_{i=1}^q b_i L^i) \epsilon_t\]
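
Putting the pieces together, here's a sketch of generating an ARMA(p, q) series (illustrative coefficients; for an ARIMA(p, d, q) series you would cumulatively sum the result d times, e.g. with np.cumsum):

```python
import numpy as np

def simulate_arma(a, b, sigma=1.0, n=300, seed=0):
    """y_t = a_1 y_{t-1} + ... + a_p y_{t-p} + eps_t + b_1 eps_{t-1} + ... + b_q eps_{t-q}"""
    rng = np.random.default_rng(seed)
    p, q = len(a), len(b)
    pad = max(p, q)
    eps = rng.normal(0.0, sigma, size=n + pad)
    y = np.zeros(n + pad)
    for t in range(pad, n + pad):
        ar_part = np.dot(a, y[t - p : t][::-1]) if p else 0.0
        ma_part = np.dot(b, eps[t - q : t][::-1]) if q else 0.0
        y[t] = ar_part + eps[t] + ma_part
    return y[pad:]                              # drop the padded start-up values

y = simulate_arma(a=[0.6, -0.2], b=[0.4])       # an ARMA(2, 1) example
```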

Note: Don't let the name fool you into thinking the coefficients \(b_1,..., b_q\) need to sum to one; they don't. Also, don't let the name make you think this is the same thing as smoothing with a moving average; it's not.

Time series themselves are everywhere (the news is really backing me up on this right now...). And, when the model is appropriate for the situation and data, fitting ARIMA models to data can help you on a couple of fronts. The first is forecasting. We so, so often want to know what's going to happen next, and a fitted ARIMA model gives a relationship between past and future events that can help us predict. The second is understanding the structure of the underlying system. Having a better idea about the influence that past values (or error terms) have on future ones might lead us to understand something about whatever real thing in the world is generating the data.
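
In practice you usually don't fit these by hand. As one example of fitting and forecasting, the statsmodels package (an assumption about your environment, not something this post depends on) provides an ARIMA class that can be used roughly like this:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.cumsum(np.random.default_rng(0).normal(size=200))  # placeholder random-walk series
model = ARIMA(y, order=(2, 1, 0))   # p=2, d=1, q=0
result = model.fit()
print(result.summary())             # fitted coefficients, standard errors, AIC, etc.
print(result.forecast(steps=10))    # point forecasts for the next 10 time steps
```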

So, we're after prediction and/or discovering meaningful structure. I'd argue that one or both of these is pretty much always the goal when you're modeling anything.

ARIMA models are a bit of a time series workhorse — simple enough to deal with, but adaptable to a reasonable range of situations. We can also use ARIMA models as components of larger models. I find Markov switching models an interesting and useful example of this. Markov switching models involve time series that alternately evolve according to one of several models. One place where I personally have found such a model useful is in working with wearable sensor data in the context of neurodegenerative diseases. We can arrive at a useful description of certain kinds of motion if we allow it to be described by different sets of AR parameters at different times; which sets of AR parameters tend to fit the submovements relates to the movement someone is making, and the status/severity of their neurodegenerative disease.


There's plenty more to be said that isn't in this post, from how to choose the orders \(p, d \) and \(q\) of the ARIMA process, to fitting the coefficients, to more about what coefficients yield what kind of behavior. We also have not touched on the idea of invertibility (roughly speaking, when you can write an AR process as an infinite-order MA process or vice versa), or the somewhat related notion of causality. Hopefully this post has given you some basic information about ARIMA models if they were new to you. References where you can read more are below.

I'm partial to the Prado and West time series text, perhaps because it's what I encountered first, and because its overall Bayesian approach fits my interests. If you want a readable, practically-oriented (and light on theory) introduction to time series and/or ARIMA, you might enjoy the Hyndman and Athanasopoulos text. And, for tons and tons on ARIMA and the classic (but still relatively recently updated) source, check out Box, Jenkins, Reinsel, and Ljung.