First, load the libraries that we will use.

`## Loaded gbm 2.1.8.1`

`## Loading required package: terra`

`## terra 1.7.29`

Models are representations of a system of interest, used for inference or prediction. To say ‘model’ in ecology is a bit too broad to be useful, and the odd delineation between “modelers” and field researchers is a bit ridiculous. To start, we can delineate between statistical models and theoretical or phenomenological models. Statistical models (e.g., ANOVA, generalized linear models) are what we will cover here; they attempt inference or prediction about some response \(y\) given some number of inputs \(x_1...x_n\). Phenomenological models are constructed independent of theory, but attempt to build a system that can explain some ecological process, often in a dynamic way (think of the SIR model in epidemiology). Theoretical models arise directly from first principles, independent of empirical data. The main distinction here is between statistical models and “other” models; both are neat, but we’ll focus on simple statistical models here.

When you create a model of a system, you have two things you don’t understand: the system and the model.

The response variable should be the thing that you want to predict, or the thing whose variability you want to understand (also sometimes referred to as the dependent variable). Often, it is something that you think is the effect produced by some other cause. For example, in examining the relationship between gas usage and outdoor temperature, it seems clear that gas usage should be the response: temperature is a major determinant of gas usage. But suppose that the modeler wanted to be able to measure outdoor temperature from the amount of gas used. Then it would make sense to take temperature as the response variable.

The explanatory variables are used to explain or predict the response variable (also sometimes referred to as independent variables).
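As a toy sketch of the gas-usage example above (the numbers here are made up, not from the original text), the same data can be fit with either variable as the response, depending on the goal:

```r
# Made-up values for the gas-usage example above
temp <- c(0, 5, 10, 15, 20)          # outdoor temperature
gas  <- c(5.8, 4.9, 3.1, 2.4, 1.6)   # gas usage
m.usage <- lm(gas ~ temp)   # gas usage as the response
m.temp  <- lm(temp ~ gas)   # temperature as the response
coef(m.usage)
coef(m.temp)
```

Note that the second model is not simply the algebraic inverse of the first: least squares minimizes error in the response direction, so the choice of response changes the fit.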

```
y <- runif(100)
x <- y ^ runif(100, 1, 4)
plot(y=y, x=x, pch=16,
     ylab='y', xlab='x',
     col='dodgerblue', las=1)
m1 <- lm(y~x)
abline(m1, lwd=2)
```

But to minimize these residuals (or the sum of squares of the
residuals), we don’t necessarily *need* a model, right?

```
par(mar=c(4,4,0.5,0.5))
plot(y=y, x=x, pch=16,
     ylab='y', xlab='x',
     col=adjustcolor(1,0.5), las=1)
abline(m1, lwd=2)
lines(smooth.spline(y=y, x=x), col='red', lwd=2)
```

And if we accept that we need a model, what’s stopping us from making the model incredibly complex to capture all the possible variation in the data?

```
par(mar=c(4,4,0.5,0.5))
plot(y=y, x=x, pch=16,
     ylab='y', xlab='x',
     col=adjustcolor(1,0.5), las=1)
abline(m1, lwd=2)
m3 <- lm(y~poly(x,3))
lines(x=sort(x), y=fitted(m3)[order(x)], col='green', lwd=2)
m4 <- lm(y~poly(x,15))
lines(x=sort(x), y=fitted(m4)[order(x)], col='purple', lwd=2)
```

The model would quickly become uninformative. That is, the model could fit every data point perfectly but tell us nothing about the general relationship between the two variables, making it useless for prediction. Adding this level of complexity minimizes the error of the fit to the data at hand, but in turn inflates the error when extrapolating to new data. Such a model is overfit, and the process of adding variables that reduce the fit error while making the model useless for other datasets is called overfitting. Not ideal.
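A quick way to see this (not from the original text; the 70/30 split and the 15-degree polynomial are arbitrary choices) is to hold out some data and compare in-sample versus out-of-sample error for a simple and an overflexible fit:

```r
# Compare in-sample vs out-of-sample error for a simple and a flexible model
set.seed(42)
y <- runif(100)
x <- y ^ runif(100, 1, 4)
train <- 1:70
test  <- 71:100
m.lin  <- lm(y ~ x, subset=train)
m.poly <- lm(y ~ poly(x, 15), subset=train)   # deliberately overflexible
rmse <- function(m, idx)
  sqrt(mean((y[idx] - predict(m, newdata=data.frame(x=x[idx])))^2))
# in-sample, the flexible model always does at least as well;
# out-of-sample, it typically does worse
round(c(lin.train = rmse(m.lin, train), poly.train = rmse(m.poly, train),
        lin.test  = rmse(m.lin, test),  poly.test  = rmse(m.poly, test)), 3)
```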

Linear models have the standard assumptions of independent and identically distributed (i.i.d.) normal random variables.

More explicitly, these standard assumptions are:

- independence: samples are not autocorrelated and do not suffer from some other form of non-independence
- identically distributed: samples come from the same probability distribution
- normality: samples come from a normal probability distribution (i.e., the error distribution is Gaussian)

We’ll focus on the last assumption here, since it’s the easiest and most common to have to work around. For instance, if the response variable is a binary outcome (e.g., dead or alive, present or absent), the errors cannot plausibly be normally distributed.

To handle these issues, a common approach is to use generalized linear models (GLMs), which account for non-normally distributed response data. This is done by using a link function to provide an appropriate *link* between the expected value of the response (a linear model would assume normally distributed errors) and the linear combination of predictors.

| Probability distribution | Typical uses | Link name |
|---|---|---|
| Normal | linear-response data | Identity |
| Poisson | count of occurrences in fixed amount of time/space | Log |
| Bernoulli | outcome of single yes/no occurrence | Logit |
| Binomial | count of # of “yes” occurrences out of N yes/no occurrences | Logit |
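In R, these pairings are encoded as family objects passed to `glm()`; each family carries a default (canonical) link, which can be inspected directly:

```r
# Family objects bundle an error distribution with its canonical link
gaussian()$link   # "identity"
poisson()$link    # "log"
binomial()$link   # "logit"
# the default can be overridden, e.g. binomial(link='probit')
```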

**Logistic regression**

Random component: The distribution of Y is assumed to be Binomial(n, \(\pi\)), where \(\pi\) is a probability of “success”.

Systematic component: X’s are explanatory variables (can be continuous, discrete, or both) and are linear in the parameters, e.g., \(\beta_0\) + \(\beta_1 x_{i}\) + ….

Again, transformations of the X’s themselves are allowed, like in linear regression; this holds for any GLM.

**Link function**: Logit link

\[ \eta = \text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)\]
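The summary below was presumably produced by a call like the following; the construction of the binary response `y2` is not shown in the text, so the data here is a stand-in with arbitrary coefficients:

```r
# Stand-in reconstruction: y2 is a hypothetical binary (0/1) response
set.seed(1)
x  <- runif(100)
y2 <- rbinom(100, 1, plogis(-0.4 + 0.4*x))  # invented true coefficients
m2 <- glm(y2 ~ x, family=binomial)
summary(m2)
```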

```
## 
## Call:
## glm(formula = y2 ~ x, family = binomial)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -0.4376     0.2833  -1.545    0.122
## x             0.3739     0.6387   0.586    0.558
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 136.06  on 99  degrees of freedom
## Residual deviance: 135.72  on 98  degrees of freedom
## AIC: 139.72
## 
## Number of Fisher Scoring iterations: 4
```

`glm` is incredibly flexible and is the basis for more complicated models like hierarchical models, ridge regressions, generalized additive models, and so on. Some other forms of generalized regression are:

Generalized additive models, e.g. mgcv::gam(), extend generalized linear models to incorporate arbitrary smooth functions. That means you can write a formula like y ~ s(x), which becomes an equation like y = f(x), and let gam() estimate what that function is (subject to some smoothness constraints to make the problem tractable).

Generalized linear mixed effects models, e.g., lme4::glmer(), extend generalized linear models to incorporate both fixed and random effects.

Trees, e.g. rpart::rpart(), attack the problem in a completely different way than linear models. They fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees aren’t terribly effective by themselves, but they are very powerful when used in aggregate by models like random forests (e.g. randomForest::randomForest()) or gradient boosting machines (e.g. xgboost::xgboost()).
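A minimal sketch of two of these alternatives on simulated data (`mgcv` and `rpart` are recommended packages that ship with most R installations; the sine-shaped relationship is invented for illustration):

```r
library(mgcv)   # generalized additive models
library(rpart)  # regression/classification trees
set.seed(7)
d <- data.frame(x = runif(200))
d$y <- sin(2*pi*d$x) + rnorm(200, sd=0.2)
m.lm   <- lm(y ~ x, data=d)       # straight line: misses the curvature
m.gam  <- gam(y ~ s(x), data=d)   # smooth function of x, estimated from data
m.tree <- rpart(y ~ x, data=d)    # piece-wise constant splits
```

Plotting the fitted values from each model against `x` makes the difference in flexibility obvious: the line, a smooth curve, and a staircase.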

So we’ve gone over modeling largely in terms of linear models of some arbitrary number of predictors, though our examples included only a single predictor. When designing a model, the researcher may want to incorporate known variable associations or interactions to estimate the support for a hypothesized effect given the data and model. This can be done by using the `*` operator in the formula interface.

```
y <- runif(100)
x1 <- y ^ runif(100, 1, 4)
x2 <- x1 ^ runif(100, 0.5, 1.5)
dat <- data.frame(y, x1, x2)
m6 <- lm(y~x1+x2, data=dat)
m7 <- lm(y~x1*x2, data=dat)
plot(y=y, x=x1, pch=16,
     ylab='y', xlab='x',
     col=viridis::viridis(100)[as.numeric(cut(x2,100))], las=1)
```

```
## 
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.239245 -0.085943 -0.009243  0.097867  0.236612 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.24861    0.01764  14.094  < 2e-16 ***
## x1           0.83780    0.16622   5.040 2.16e-06 ***
## x2           0.01662    0.16325   0.102    0.919    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1237 on 97 degrees of freedom
## Multiple R-squared:  0.8206, Adjusted R-squared:  0.8169 
## F-statistic: 221.9 on 2 and 97 DF,  p-value: < 2.2e-16
```

```
## 
## Call:
## lm(formula = y ~ x1 * x2, data = dat)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24721 -0.08444  0.01473  0.05492  0.21657 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.19035    0.01856  10.256  < 2e-16 ***
## x1           1.31059    0.16767   7.816 6.94e-12 ***
## x2           0.24239    0.14796   1.638    0.105    
## x1:x2       -0.82866    0.14763  -5.613 1.92e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1079 on 96 degrees of freedom
## Multiple R-squared:  0.8649, Adjusted R-squared:  0.8607 
## F-statistic: 204.9 on 3 and 96 DF,  p-value: < 2.2e-16
```

What if we don’t necessarily know the direct interaction, but still have some idea that a combination of variables is likely associated with some response \(y\)? We can do what is called feature creation (feature is just another word people use to mean variable, most often applied to creating informative predictors). For instance, if we were trying to predict the time until a bottled beer goes skunky, we could include a measure of bottle opacity (which controls the amount of light that can get in and affect the beer) and aspects of bottle dimensions (height, width). What may be important is some combination of these variables, perhaps getting at volume differences or the surface-area-to-volume ratio, which may influence light attenuation. So we can create a variable that represents total beer volume.
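For the beer example, the created features might look like the following (the data is made up, and treating the bottle as a cylinder is itself a modeling assumption):

```r
# Hypothetical bottle data; dimensions in cm
bottles <- data.frame(height  = c(20, 25, 30),
                      width   = c(6, 6, 7),
                      opacity = c(0.9, 0.5, 0.2))
# feature creation: approximate volume as a cylinder
bottles$volume <- pi * (bottles$width/2)^2 * bottles$height
# and surface-area-to-volume ratio (sides plus two ends)
bottles$sa.vol <- (pi*bottles$width*bottles$height +
                   2*pi*(bottles$width/2)^2) / bottles$volume
```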

Another example that is often used is to take many predictors which may be highly collinear and to engineer features from this set. This is often done through what is referred to as ‘feature reduction’, a form of feature creation that attempts to make a small set of often-independent predictor variables from a large set of collinear variables. Think of climate data, where we have temperature on each day of the year as a set of 365 predictor variables explaining ice cream sales (our response). This model would likely not fit well with so many predictors, but we could compress the information using feature reduction approaches like principal component analysis (PCA), or just some ad hoc idea about temperature (e.g., monthly mean temperature is the most important, reducing our feature set to 12 variables).
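A sketch of the PCA route on simulated daily temperatures (all numbers here are invented; each row is a site or year, each column a day):

```r
# 50 observations x 365 daily-temperature predictors,
# highly collinear by construction (a shared seasonal cycle)
set.seed(3)
n <- 50
season <- sin(outer(rep(1, n), 2*pi*(1:365)/365))  # same cycle in every row
temps  <- season * rnorm(n, 10, 2) +               # row-specific amplitude
          matrix(rnorm(n*365), n, 365)             # daily noise
pc <- prcomp(temps, scale.=TRUE)
scores <- pc$x[, 1:5]   # keep the first 5 components as new features
dim(scores)
```

These five scores could then replace the 365 raw columns as predictors in a regression on ice cream sales.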

In the previous example, we had two continuous predictors (\(x1\) and \(x2\)), but variables can also be categorical or factor variables.

```
y.1 <- runif(1000)
x.1 <- y.1 * runif(1000, 0.6, 1.4) + runif(1000)
x.2 <- c('a', 'b')[1+rbinom(length(x.1), 1, prob=x.1/max(x.1))]
dat2 <- data.frame(y.1, x.1, x.2)
m8 <- lm(y.1~x.1+x.2, data=dat2)
plot(y=y.1, x=x.1, pch=16,
     ylab='y', xlab='x',
     col=c(1, 'dodgerblue')[as.factor(x.2)],
     las=1)
summary(m8)
```

```
## 
## Call:
## lm(formula = y.1 ~ x.1 + x.2, data = dat2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.52063 -0.25312  0.00271  0.24198  0.51571 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.51645    0.02291  22.542   <2e-16 ***
## x.1         -0.02878    0.02312  -1.245    0.213    
## x.2b         0.02954    0.02027   1.458    0.145    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2864 on 997 degrees of freedom
## Multiple R-squared:  0.002613, Adjusted R-squared:  0.000612 
## F-statistic: 1.306 on 2 and 997 DF,  p-value: 0.2714
```

Here, we run into a cautionary tale of variable inclusion, and of when model building takes a bit more knowledge of a system. In this simple example, borrowing most of the code from this user (https://commons.wikimedia.org/wiki/File:Simpson%27s_paradox_continuous.svg), we see that not including a categorical variable can result in a complete misinterpretation of the overall pattern. That is, both categorical groups have a positive relationship between \(y\) and \(x\), but ignoring the categorical variable results in an overall negative relationship between \(y\) and \(x\). This is referred to as “Simpson’s paradox”, and there are plenty of ecological systems where this is definitely a thing.

```
xs1 <- c(1,2,3,4)
ys1 <- xs1 + runif(4, 3, 5)
xs2 <- xs1 + runif(4, 5, 7)
ys2 <- xs2 - runif(4, 5, 7)
xs <- c(xs1, xs2)
ys <- c(ys1, ys2)
plot(xs, ys, cex=2, pch=16, las=1,
     col=rep(c(1, 'dodgerblue'), each=4),
     xlim=range(xs)+c(-2,2), ylim=range(ys)+c(-2,2))
abline(lm(ys1 ~ xs1), col=1, lwd=2)
abline(lm(ys2 ~ xs2), col='dodgerblue', lwd=2)
abline(lm(ys ~ xs), lwd=2, lty=2)
```