Model Mis-specification: All The Ways Things Can Go Wrong…

March 19, 2020

This article is cross-posted with Ecology for the Masses.
By Sara Stoudt


Image Credit: Grand Velas Riviera Maya, CC BY-SA 2.0, Image Cropped

In ecological studies, the quality of the data we use is often a concern. For example, individual animals may be cryptic and hard to detect. Certain sites that we should really be sampling might be hard to reach, so we end up sampling more accessible, less relevant ones. Or it could even be something as simple as recording a raven when we’re really seeing a crow (check out #CrowOrNo if you have problems with that last one). Modeling approaches aim to reduce the effect that these shortcomings in data collection have on our results.

However, even if we had perfect data, when we decide how to model that data, we have to make choices that may not match the reality of the scenario we are trying to understand. Model mis-specification is a generic term for when our model doesn’t match the processes that generated the data we are trying to understand. It can lead to biased estimates of covariate effects and incorrect uncertainty quantification.

To walk through different types of model mis-specification, let’s pick a study species that doesn’t move and would be pretty hard to miss: palm trees. If we want to estimate the abundance of palm trees we would visit many sites, count the number of palm trees, and collect covariates about the sites (temperature, surrounding vegetation, distance to nearest tourist sipping a pina colada). Let’s suppose we have collected our data using the perfect protocol and nothing went wrong. Victory!

However, now we are at the modeling stage of our research, and we are faced with a lot of choices. What covariates could be related to abundance? Are they related linearly? Quadratically? Do some of the covariates interact? What functional form should we use to model the response? The list of ways we can make a mis-step really adds up! We’ll tackle each one in turn.

Missing a Covariate

We have collected a lot of information about each site, but what if we missed some property of the site that actually is related to abundance? If we missed an important covariate, we lose some predictive power. Additionally, the spread of our errors may vary with the missing covariate, which violates the often-made assumption of homoscedasticity (constant error variance). This means that the uncertainty surrounding our prediction is potentially much higher than it could have been.

Alternatively, if we missed a confounding covariate, one that is correlated with both the abundance of palm trees and one or more of our other covariates, we may be attributing predictive power to covariates that may not deserve it. For instance, palm tree abundance could be driven by soil moisture, which is in turn driven by rainfall. If rainfall also affects palm trees directly and we only account for soil moisture, we may attribute too much predictive power to soil moisture. This isn’t necessarily bad for prediction purposes, but if we want to interpret the coefficients on our covariates, we may misjudge the size or significance of certain effects.
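To make the confounding scenario concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the sites are simulated, the effect sizes (0.5 for soil moisture, 1.0 for rainfall) are made up, rainfall is assumed to affect the response directly as well as through soil moisture, and the response is a continuous stand-in fitted by ordinary least squares rather than a count model.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    """Sample covariance; cov(a, a) is the sample variance."""
    mean_a, mean_b = mean(a), mean(b)
    return sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (len(a) - 1)

def simulate_sites(n_sites=5000, seed=1):
    """Hypothetical sites where rainfall drives both soil moisture and the response."""
    rng = random.Random(seed)
    rainfall, moisture, response = [], [], []
    for _ in range(n_sites):
        rain = rng.gauss(0, 1)
        moist = 0.8 * rain + rng.gauss(0, 0.6)   # moisture depends on rainfall
        # Assumed truth: moisture effect 0.5, direct rainfall effect 1.0.
        resp = 0.5 * moist + 1.0 * rain + rng.gauss(0, 1)
        rainfall.append(rain)
        moisture.append(moist)
        response.append(resp)
    return rainfall, moisture, response

def slope_one_covariate(x, y):
    """Least-squares slope of y on a single covariate x."""
    return cov(x, y) / cov(x, x)

def slopes_two_covariates(x1, x2, y):
    """Least-squares slopes of y on x1 and x2 (solves the 2x2 normal equations)."""
    s11, s22, s12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    s1y, s2y = cov(x1, y), cov(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b1, b2

rainfall, moisture, response = simulate_sites()
naive = slope_one_covariate(moisture, response)
adjusted, rain_slope = slopes_two_covariates(moisture, rainfall, response)
print(f"moisture slope, rainfall omitted:  {naive:.2f}")
print(f"moisture slope, rainfall included: {adjusted:.2f}")
```

In this simulated setup, leaving rainfall out inflates the soil moisture slope well above the 0.5 used to generate the data, while including rainfall brings the estimate back close to it.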

Wrong Functional Form 

Functional form just means how the response or the covariate enters into the model. For example, in the case of abundance, we may use Poisson regression. In this case our collected counts y enter the model through the log of their expected value, log(E[y]), making the assumption that the covariates are linearly related to log(E[y]). In the case of the covariates, we may think log(E[y]) is related to x or x^2, or the relationship between log(E[y]) and x might depend on the value of z, requiring the interaction between x and z to appear in the model.
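Written out, one version of the Poisson regression described here might look like the following. The coefficient names β_0 through β_4 and the site index i are notation introduced for this sketch, and μ_i stands for the expected count at site i. Dropping the x^2 term or the interaction term gives the simpler functional forms.

```latex
y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 z_i + \beta_4 x_i z_i
```

The log link acts on the expected count μ_i rather than on the observed count itself, which is why sites with zero palm trees cause no trouble.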

Like missing a covariate, having the wrong functional form of a covariate can make residuals heteroscedastic and impact uncertainty quantification. The wrong functional form of the response can lead to poor fit of the imposed model.

Bayesian Prior Mismatch

The Bayesian approach is a type of modelling that lets us insert our assumptions about certain parameters (called ‘priors’) into a model. There are a variety of resources that talk about choice of priors and their appropriateness for certain situations, but for the purpose of this example, a “bad” prior choice means that it excludes the true value of the parameter of interest. For example, if the true relationship between soil moisture and abundance of palm trees is negative, and we put a prior on the coefficient for soil moisture that only puts positive probability on positive values, then we are in trouble.

This may seem like an extreme example that you may not ever encounter, but a similar situation could arise in a part of the model that is less easily interpreted. For instance, we could also make a choice of prior on a variance term that restricts it from being large enough to match reality. This can be harder to diagnose.
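A small grid approximation can show what a prior that excludes the truth does to the posterior. This is a sketch under stated assumptions: the data are simulated with a true slope of -0.5, the model is a simple Gaussian regression with no intercept and known unit error variance (not a count model), and the two priors, a Normal(0, 1) and its positive-only half-normal version, are chosen purely for illustration.

```python
import math
import random

def simulate(n=200, true_slope=-0.5, seed=3):
    """Simulated covariate and response with a negative true slope."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
    return x, y

def log_likelihood(slope, x, y):
    """Gaussian log-likelihood with unit error variance, up to a constant."""
    return -0.5 * sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))

def log_prior_wide(slope):
    """Normal(0, 1) prior: allows both negative and positive slopes."""
    return -0.5 * slope ** 2

def log_prior_positive_only(slope):
    """Half-normal prior: zero probability on slopes at or below zero."""
    return -0.5 * slope ** 2 if slope > 0 else float("-inf")

def posterior_mean(x, y, log_prior, grid):
    """Posterior mean of the slope, approximated on a grid of candidate values."""
    log_post = [log_likelihood(b, x, y) + log_prior(b) for b in grid]
    peak = max(log_post)                       # subtract the peak to avoid underflow
    weights = [math.exp(lp - peak) for lp in log_post]
    total = sum(weights)
    return sum(b * w for b, w in zip(grid, weights)) / total

grid = [i / 100 for i in range(-200, 201)]     # candidate slopes from -2 to 2
x, y = simulate()
print(f"wide prior:          {posterior_mean(x, y, log_prior_wide, grid):.2f}")
print(f"positive-only prior: {posterior_mean(x, y, log_prior_positive_only, grid):.2f}")
```

With the positive-only prior, no amount of data can pull the posterior below zero, so it piles up just above zero instead of near the negative value that generated the data.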

Whilst lake area is often easy to calculate when modelling freshwater species abundances, lake depth is often harder to calculate. This can lead to higher uncertainty in models (Image Credit: Sergey Ashmarin, CC BY-SA 3.0)

Lack of Independence

Your model may assume independence between palm trees, but you may have reason to believe that they are clustered. We talk about spatial autocorrelation in another post, so I won’t get too into the weeds here. The good news is that in many of these cases our estimates will be unbiased. The bad news is that our estimated uncertainty will be incorrect, typically too small, because clustered observations carry less independent information than the model assumes.
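The effect of clustering on uncertainty can be sketched with a repeated simulation. The setup is hypothetical: 20 clusters of 10 sites each, a shared cluster-level effect, a continuous response instead of counts, and a naive standard error that treats all 200 sites as independent.

```python
import random
import statistics

def simulate_survey(rng, n_clusters=20, sites_per_cluster=10):
    """Site values that share a random cluster-level effect (true mean is 0)."""
    values = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0, 1)
        for _ in range(sites_per_cluster):
            values.append(cluster_effect + rng.gauss(0, 1))
    return values

def naive_standard_error(values):
    """Standard error of the mean if every site were independent."""
    return statistics.stdev(values) / len(values) ** 0.5

def compare(n_replicates=500, seed=7):
    """Repeat the survey many times and compare actual vs. naive uncertainty."""
    rng = random.Random(seed)
    estimates, naive_ses = [], []
    for _ in range(n_replicates):
        values = simulate_survey(rng)
        estimates.append(statistics.fmean(values))
        naive_ses.append(naive_standard_error(values))
    return (
        statistics.fmean(estimates),   # close to 0 if the estimate is unbiased
        statistics.stdev(estimates),   # how much the estimate really varies
        statistics.fmean(naive_ses),   # how much the naive model says it varies
    )

avg_estimate, actual_sd, avg_naive_se = compare()
print(f"average estimate:        {avg_estimate:.3f}")
print(f"actual spread of mean:   {actual_sd:.3f}")
print(f"average naive std error: {avg_naive_se:.3f}")
```

In this setup the estimated mean stays centered on the truth, but the naive standard error understates how much the estimate varies from survey to survey.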

How to Deal

Goodness-of-fit checks and diagnostic plots can help alert us to these model mis-specifications. Looking for patterns in the residuals is a good way to see if we need a different functional form of a covariate.
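As a rough numerical stand-in for a residual plot, the sketch below fits a straight line to simulated data and averages the residuals within three bins of the covariate. The data, the curvature value, and the choice of three bins are all assumptions for illustration; in practice you would plot the residuals rather than bin them.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def binned_residual_means(x, residuals, n_bins=3):
    """Average residual within equal-width bins of x."""
    low, high = min(x), max(x)
    width = (high - low) / n_bins
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for xi, ri in zip(x, residuals):
        idx = min(int((xi - low) / width), n_bins - 1)
        sums[idx] += ri
        counts[idx] += 1
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

def residual_pattern(curvature, n_sites=600, seed=5):
    """Fit a straight line to data whose true x^2 coefficient is `curvature`."""
    rng = random.Random(seed)
    x = [rng.uniform(-2, 2) for _ in range(n_sites)]
    y = [1 + 0.5 * xi + curvature * xi ** 2 + rng.gauss(0, 0.5) for xi in x]
    intercept, slope = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return binned_residual_means(x, residuals)

print("truth is linear:   ", [round(m, 2) for m in residual_pattern(0.0)])
print("truth is quadratic:", [round(m, 2) for m in residual_pattern(0.8)])
```

When the true relationship is quadratic, the residuals from the straight-line fit are positive at both ends of the covariate range and negative in the middle, which is the kind of pattern that suggests adding an x^2 term.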

Model mis-specification due to missing covariates is especially challenging; it can be hard to convince ourselves that we have information on all possible covariates related to abundance. A diagnostic plot may reveal the existence of a missing covariate, but it won’t tell us what information to collect to mitigate the problem.

In the specific case of the Poisson regression example, checking for overdispersion, i.e. higher variance than expected given the model, is an important check that the assumed functional form of the response is appropriate.
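One common way to check for overdispersion is the Pearson dispersion statistic: the sum of squared Pearson residuals divided by the residual degrees of freedom, which should be near 1 if the Poisson assumption holds. The sketch below computes it for an intercept-only model on simulated counts. The mean count of 4, the gamma-distributed extra variation between sites, and the hand-rolled Poisson sampler are all assumptions made to keep the example self-contained.

```python
import math
import random

def poisson_draw(rng, rate):
    """Knuth's multiplication method; adequate for the small rates used here."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def simulate_counts(n_sites, mean_count, shape=None, seed=11):
    """shape=None gives Poisson counts; a finite shape adds gamma site-level variation."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_sites):
        rate = mean_count
        if shape is not None:
            # Multiplier with mean 1 and variance 1 / shape.
            rate *= rng.gammavariate(shape, 1 / shape)
        counts.append(poisson_draw(rng, rate))
    return counts

def dispersion_ratio(counts):
    """Pearson chi-square / degrees of freedom for an intercept-only Poisson model."""
    n = len(counts)
    fitted = sum(counts) / n                   # fitted mean under the model
    pearson = sum((c - fitted) ** 2 / fitted for c in counts)
    return pearson / (n - 1)

poisson_counts = simulate_counts(2000, mean_count=4)
clumped_counts = simulate_counts(2000, mean_count=4, shape=2)
print(f"Poisson counts:       {dispersion_ratio(poisson_counts):.2f}")
print(f"overdispersed counts: {dispersion_ratio(clumped_counts):.2f}")
```

A ratio well above 1, as for the second set of simulated counts, is a sign that the counts vary more than a Poisson model allows. With covariates in the model, the same statistic would use each site's fitted mean and subtract the number of estimated coefficients from the degrees of freedom.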

Let’s close with some hope. None of these problems are insurmountable. The most important thing is that we detect model mis-specification. Then we can take action to mitigate the negative outcomes of the mis-specification. We should carefully interrogate each assumption we make and take advantage of some of the newer model checking frameworks for more complicated ecological scenarios.


Featured Fellows

Sara Stoudt

Statistics, UC Berkeley
Alumni - BIDS Data Science Fellow