First, load the libraries that we will use.
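Something along these lines will do (loading geodata is an assumption here, based on the geodata:: calls later in the lesson; it pulls in terra as a required package):

library(gbm)      # boosted regression trees, used later for comparison with glm
library(geodata)  # climate data download helpers; loads terra as a required package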
## Loaded gbm 2.1.8.1
## Loading required package: terra
## terra 1.7.29
Models are representations of a system of interest, used for the purposes of inference or prediction. To say ‘model’ in ecology is a bit too broad to be useful, and the odd delineation between “modelers” and field researchers is a bit ridiculous. To start, we can distinguish between statistical models and theoretical or phenomenological models. Statistical models (e.g., ANOVA, generalized linear models) are what we will cover here; they attempt to gain inference or prediction about some response \(y\) given some number of inputs \(x_1...x_n\). Phenomenological models are constructed somewhat independently of theory, in an attempt to describe a system that can explain some ecological process, often in a dynamic way (think of the SIR model in epidemiology). Theoretical models arise directly from first principles, independent of empirical data. The main distinction here is between statistical models and “other” models; both are neat, but we’ll focus on simple statistical models here.
When you create a model of a system, you have two things you don’t understand: the system and the model.
The response variable should be the thing that you want to predict, or the thing whose variability you want to understand (also sometimes referred to as the dependent variable). Often, it is something that you think is the effect produced by some other cause. For example, in examining the relationship between gas usage and outdoor temperature, it seems clear that gas usage should be the response: temperature is a major determinant of gas usage. But suppose that the modeler wanted to be able to measure outdoor temperature from the amount of gas used. Then it would make sense to take temperature as the response variable.
The explanatory variables are used to explain or predict the response variable (also sometimes referred to as independent variables).
y <- runif(100)
x <- y ** runif(100, 1,4)
plot(y=y,x=x, pch=16,
ylab='y', xlab='x',
col='dodgerblue', las=1)
m1 <- lm(y~x)
abline(m1, lwd=2)
But to minimize these residuals (the differences between the observed values and the fitted line), or the sum of their squares, we don’t necessarily need a model, right?
par(mar=c(4,4,0.5,0.5))
plot(y=y,x=x, pch=16,
ylab='y', xlab='x',
col=adjustcolor(1,0.5), las=1)
abline(m1, lwd=2)
lines(smooth.spline(y=y,x=x), col='red', lwd=2)
And if we accept that we need a model, what’s stopping us from making the model incredibly complex to capture all the possible variation in the data?
par(mar=c(4,4,0.5,0.5))
plot(y=y,x=x, pch=16,
ylab='y', xlab='x',
col=adjustcolor(1,0.5), las=1)
abline(m1, lwd=2)
m3 <- lm(y~poly(x,3))
lines(x=sort(x), fitted(m3)[order(x)], col='green', type='l', lwd=2)
m4 <- lm(y~poly(x,15))
lines(x=sort(x), fitted(m4)[order(x)], col='purple', type='l', lwd=2)
The model would quickly become uninformative. That is, the model could fit every data point perfectly, but tell us nothing about the general relationship between the two variables. This makes it impossible to use the model to predict. Adding this level of complexity essentially minimizes the error of the fit, but in turn basically maximizes the error when extrapolating to new data. Such a model is overfit, and the process of adding variables that minimize model-fit error while making the model useless for other datasets is called overfitting. Not ideal.
Linear models have the standard assumptions of independent and identically distributed (i.i.d.) normal random variables.
More explicitly, these standard assumptions are:
1. The relationship between the response and the predictors is linear in the parameters.
2. The errors are independent of one another.
3. The errors have constant variance (homoscedasticity).
4. The errors (and hence the response, conditional on the predictors) are normally distributed.
We’ll focus on the last assumption here, since it’s the easiest and most common one to have to work around. For instance, the response variable may be a binary outcome (e.g., dead or alive, present or absent), which cannot be normally distributed.
To handle these issues, a common approach is to use generalized linear models (GLMs), which account for response data that are not normally distributed. This is done by using a link function to connect the random component (the error distribution, which a linear model would assume to be normal) to the linear combination of predictors.
Probability distribution | Typical uses | Link name |
---|---|---|
Normal | linear-response data | Identity |
Poisson | count of occurrences in fixed amount of time/space | Log |
Bernoulli | outcome of single yes/no occurrence | Logit |
Binomial | count of # of “yes” occurrences out of N yes/no occurrences | Logit |
Logistic regression
Random component: The distribution of Y is assumed to be Binomial(n, \(\pi\)), where \(\pi\) is a probability of “success”.
Systematic component: X’s are explanatory variables (can be continuous, discrete, or both) and are linear in the parameters, e.g., \(\beta_0 + \beta_1 x_{i} + \dots\).
Again, transformations of the X’s themselves are allowed, like in linear regression; this holds for any GLM.
Link function: Logit link
\[ \eta = \text{logit}(\pi) = \log\left(\frac{\pi}{1-\pi}\right)\]
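The summary below comes from fitting such a model with glm(); the binary response y2 is not created in the code shown above, so here is one plausible construction (the rbinom() step and the object name m2 are assumptions):

y2 <- rbinom(100, 1, prob=y)        # hypothetical binary outcome with success probability y
m2 <- glm(y2 ~ x, family=binomial)  # Bernoulli response modeled with the logit link
summary(m2)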
##
## Call:
## glm(formula = y2 ~ x, family = binomial)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.4376 0.2833 -1.545 0.122
## x 0.3739 0.6387 0.586 0.558
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 136.06 on 99 degrees of freedom
## Residual deviance: 135.72 on 98 degrees of freedom
## AIC: 139.72
##
## Number of Fisher Scoring iterations: 4
glm() is incredibly flexible and is the basis for more complicated models like hierarchical models, ridge regressions, generalized additive models, and so on. Some other forms of generalized regression are:
Generalized additive models, e.g., mgcv::gam(), extend generalized linear models to incorporate arbitrary smooth functions. That means you can write a formula like y ~ s(x), which becomes an equation like y = f(x), and let gam() estimate what that function is (subject to some smoothness constraints to make the problem tractable).
Generalized linear mixed effects models, e.g., lme4::glmer(), extend generalized linear models to incorporate both fixed and random effects.
Trees, e.g., rpart::rpart(), attack the problem in a completely different way than linear models. They fit a piece-wise constant model, splitting the data into progressively smaller and smaller pieces. Trees aren’t terribly effective by themselves, but they are very powerful when used in aggregate by models like random forests (e.g., randomForest::randomForest()) or gradient boosting machines (e.g., xgboost::xgboost()).
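For instance, a quick sketch of a smooth fit to the simulated data from above might look like this (assuming the mgcv package is installed; the object name mgam is just a placeholder):

library(mgcv)  # generalized additive models
# fit a smooth function of x rather than a straight line
mgam <- gam(y ~ s(x))
# overlay the smooth fit on the scatterplot
plot(y=y, x=x, pch=16, col=adjustcolor(1,0.5), las=1, ylab='y', xlab='x')
lines(x=sort(x), y=fitted(mgam)[order(x)], col='orange', lwd=2)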
So we’ve gone over modeling largely in terms of linear models with some arbitrary number of predictors, though our examples have only included a single predictor. When designing a model, the researcher may want to incorporate known variable associations or interactions to estimate the support for a hypothesized effect given the data and model. This can be done by using the * operator in the formula interface.
y <- runif(100)
x1 <- y ** runif(100, 1,4)
x2 <- x1 ** runif(100, 0.5,1.5)
dat <- data.frame(y,x1,x2)
m6 <- lm(y~x1+x2, data=dat)
m7 <- lm(y~x1*x2, data=dat)
plot(y=y,x=x1, pch=16,
ylab='y', xlab='x',
col=viridis::viridis(100)[as.numeric(cut(x2,100))], las=1)
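The two model summaries below (the additive model first, then the interaction model) come from simply printing:

summary(m6)  # y ~ x1 + x2 (additive)
summary(m7)  # y ~ x1 * x2 (includes the x1:x2 interaction)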
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.239245 -0.085943 -0.009243 0.097867 0.236612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24861 0.01764 14.094 < 2e-16 ***
## x1 0.83780 0.16622 5.040 2.16e-06 ***
## x2 0.01662 0.16325 0.102 0.919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1237 on 97 degrees of freedom
## Multiple R-squared: 0.8206, Adjusted R-squared: 0.8169
## F-statistic: 221.9 on 2 and 97 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = y ~ x1 * x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.24721 -0.08444 0.01473 0.05492 0.21657
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.19035 0.01856 10.256 < 2e-16 ***
## x1 1.31059 0.16767 7.816 6.94e-12 ***
## x2 0.24239 0.14796 1.638 0.105
## x1:x2 -0.82866 0.14763 -5.613 1.92e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1079 on 96 degrees of freedom
## Multiple R-squared: 0.8649, Adjusted R-squared: 0.8607
## F-statistic: 204.9 on 3 and 96 DF, p-value: < 2.2e-16
What if we don’t necessarily know the direct interaction, but still have some idea that a combination of variables is likely associated with some response \(y\)? We can do what is called feature creation (feature is just another word people use to mean variable, most often applied to creating informative predictors). For instance, if we were trying to predict the time until a bottled beer goes skunky, we could include a measure of bottle opacity (which controls the amount of light that can get in and affect the beer) and aspects of bottle dimensions (height, width). What may be important is some combination of these variables, perhaps getting at volume differences or the surface area to volume ratio, which may influence light attenuation. So we can create a variable that represents total beer volume.
Another example that is often used is to take many predictors which may be highly collinear and to engineer features from this set. This is often done through what is referred to as ‘feature reduction’, a form of feature creation that attempts to make a small set of (often independent) predictor variables from a large set of collinear variables. Think of climate data, where we have the temperature on each day of the year as a set of 365 predictor variables explaining ice cream sales (our response). This model would likely not fit so well with so many predictors. But we could compress the information using feature reduction approaches like principal component analysis (PCA), or just some ad hoc idea about temperature (e.g., monthly mean temperature is the most important, reducing our feature set to 12 variables).
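As a toy sketch of that ice cream example (entirely simulated; the object names dailyTemps and sales are hypothetical), PCA can compress many collinear temperature columns into a few components:

set.seed(11)
n <- 200
baseTemp <- runif(n, 0, 30)                                          # underlying 'true' temperature signal
dailyTemps <- sapply(1:365, function(d) baseTemp + rnorm(n, sd=2))   # 365 highly collinear daily temperatures
sales <- 2*baseTemp + rnorm(n, sd=3)                                 # ice cream sales driven by temperature
pca <- prcomp(dailyTemps, scale.=TRUE)                               # feature reduction via PCA
summary(lm(sales ~ pca$x[,1]))                                       # the first component carries most of the signal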
In the previous example, we had two continuous predictors (\(x1\) and \(x2\)), but variables can also be categorical or factor variables.
y.1 <- runif(1000)
x.1 <- y * runif(1000, 0.6,1.4) + runif(1000)
x.2 <- c('a', 'b')[1+rbinom(length(x.1), 1, prob=x.1/max(x.1))]
dat2 <- data.frame(y.1,x.1,x.2)
m8 <- lm(y.1~x.1+x.2, data=dat2)
plot(y=y.1, x=x.1, pch=16,
ylab='y', xlab='x',
col=c(1, 'dodgerblue')[as.factor(x.2)],
las=1)
summary(m8)
##
## Call:
## lm(formula = y.1 ~ x.1 + x.2, data = dat2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.52063 -0.25312 0.00271 0.24198 0.51571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.51645 0.02291 22.542 <2e-16 ***
## x.1 -0.02878 0.02312 -1.245 0.213
## x.2b 0.02954 0.02027 1.458 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2864 on 997 degrees of freedom
## Multiple R-squared: 0.002613, Adjusted R-squared: 0.000612
## F-statistic: 1.306 on 2 and 997 DF, p-value: 0.2714
Here, we run into a cautionary tale of variable inclusion, and of when model building takes a bit more knowledge of a system. In this simple example, borrowing most of the code from this user (https://commons.wikimedia.org/wiki/File:Simpson%27s_paradox_continuous.svg), we see that not including a categorical variable can result in a complete misinterpretation of the overall pattern. That is, both categorical groups have a positive relationship between \(y\) and \(x\), but ignoring the categorical variable results in an overall negative relationship between \(y\) and \(x\). This is referred to as “Simpson’s paradox”, and there are plenty of ecological systems where this is definitely a thing.
xs1 <- c(1,2,3,4)
ys1 <- xs1 + runif(4, 3, 5)
xs2 <- xs1 + runif(4, 5, 7)
ys2 <- xs2 - runif(4, 5, 7)
xs <- c(xs1,xs2)
ys <- c(ys1,ys2)
plot(xs,ys, cex=2, pch=16, las=1,
col=rep(c(1, 'dodgerblue'), each=4),
bg=rep(c("lightblue", "pink"), each=4),
xlim=range(xs)+c(-2,2), ylim=range(ys)+c(-2,2))
abline(lm(ys1 ~ xs1), col=1, lwd=2)
abline(lm(ys2 ~ xs2), col='dodgerblue', lwd=2)
abline(lm(ys ~ xs), lwd=2, lty=2)
Now that we have the ability to construct models, we need to be able to assess how well they fit. We can use the summary() function to get information on both the significance of individual predictors and overall model performance. For the linear models above, overall performance is summarized by R-squared. This measure is bounded between 0 and 1, and corresponds to the proportion of the variance of our y that is explained by our predictors \(x_1...x_n\).
##
## Call:
## lm(formula = y ~ x1 + x2, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.239245 -0.085943 -0.009243 0.097867 0.236612
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.24861 0.01764 14.094 < 2e-16 ***
## x1 0.83780 0.16622 5.040 2.16e-06 ***
## x2 0.01662 0.16325 0.102 0.919
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1237 on 97 degrees of freedom
## Multiple R-squared: 0.8206, Adjusted R-squared: 0.8169
## F-statistic: 221.9 on 2 and 97 DF, p-value: < 2.2e-16
But R-squared only tells us so much. What we may want to look at is how the model is truly fitting. That is, are there regions of the predictor data where the model does not capture the variation in y equally well? This would show up in the distribution of the residuals (model residuals are the differences between the model-predicted, or fitted, values and the observed values). To explore this, we can use the base plot() command on the model object to produce a standard set of diagnostic plots.
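A minimal call (here applied to m6, though any lm or glm object works) produces the four diagnostic plots described below:

par(mfrow=c(2,2))  # show all four diagnostic plots at once
plot(m6)
par(mfrow=c(1,1))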
Residuals vs Fitted - there should be no strong patterns and no outliers; the residuals should be randomly scattered around zero.
Normal Q-Q - the residuals should fall along the diagonal line, i.e., they should be approximately normally distributed. This plot helps check whether that is the case.
Scale-Location - the y-axis again shows residuals (as in the Residuals vs Fitted plot), but standardized and square-root transformed; it is similar to the first plot, but in some cases makes non-constant variance easier to spot.
Residuals vs Leverage - helps diagnose influential cases. As in the previous plots, unusual cases are numbered, but here any cases that are very different from the rest of the data fall beyond the dashed Cook’s distance lines (check the Wikipedia entry on Cook’s distance).
The measures of model performance discussed above are based on how well the predictors can explain the response, but, as we noted in the \(n\)-order polynomial case, a model could have an R-squared of 1 and be utterly useless. A non-useless model would arguably be able to predict to new data. That is, if more \(x\) data were added, the response \(y\) could be predicted. We can extract predictions from our model object using the predict function. We can do this for the entire data used to fit the models, or a new data object.
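The table of predictions below comes from predict() with interval='confidence'; the newData object is not defined in the code shown, so here is one plausible construction (assuming the first 10 rows of dat):

newData <- dat[1:10, ]                       # assumed: first 10 rows treated as 'new' observations
predict(m6, newData, interval='confidence')  # fitted values with 95% confidence intervals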
## fit lwr upr
## 1 0.7625539 0.7258079 0.7992999
## 2 0.3194017 0.2669924 0.3718110
## 3 0.3466137 0.2861279 0.4070995
## 4 0.2670109 0.2318178 0.3022040
## 5 0.9828930 0.9295321 1.0362539
## 6 1.0220861 0.9669989 1.0771732
## 7 0.4368018 0.4043727 0.4692310
## 8 0.5155674 0.4851393 0.5459955
## 9 1.0449817 0.9879433 1.1020201
## 10 0.8880765 0.8434301 0.9327230
preds <- as.data.frame(predict(m6, newData, interval='confidence'))
plot(y=y, x=x1, col=1, pch=16)
points(y=preds$fit, x=x1[1:10], pch=16, col='red')
segments(y0=preds$lwr, y1=preds$upr, x0=x1[1:10], col='red')
But we may wish to predict on new data, or on a held-out subset of data not used to train our models. This way, we can assess the performance of the model on data that the fitted model was not conditioned on.
trainMod <- function(dat, holdOut=0.2){
  # randomly select a proportion (holdOut) of rows to hold back for testing
  inds <- sample(1:nrow(dat), round(nrow(dat)*holdOut, 0))
  train <- dat[-inds, ]
  test <- dat[inds, ]
  # fit the model to the training data only
  mod <- glm(y ~ x1+x2, data=train)
  # return the observed and predicted values for the held-out test data
  ret <- data.frame(truth=test$y, pred=predict(mod, newdata=test))
  return(ret)
}
preds2 <- trainMod(dat)
This idea of assessing model performance on a holdout set is not traditional in ecological model building. This could be because ecologists want to use all of their data to get at which variables are most associated with, or explain the most variance in, some response variable. This is fair. But the model will always be conditioned on the data available (I will not go into Bayesian modeling here). The book I recommend above lays much of the foundation for learning more about machine learning. Machine learning methods are typically designed to bypass some of the assumptions of more traditional models like those described above. Often, this results in methods that handle NAs well, are flexible to non-linear relationships between response and predictor variables, and tend to be focused on prediction over inference (though this is not inherently true). In fact, the book on statistical learning spends the majority of its time going over the inner workings of linear models for regression and classification. It isn’t until about halfway through the book that more “machine learning-esque” methods for classification and regression, such as support vector machines and tree-based approaches, are introduced.
# install.packages('gbm')
trainMod2 <- function(dat, holdOut=0.2){
  # randomly hold back a proportion (holdOut) of rows for testing
  inds <- sample(1:nrow(dat), round(nrow(dat)*holdOut, 0))
  train <- dat[-inds, ]
  test <- dat[inds, ]
  # fit a glm and a boosted regression tree (gbm) to the same training data
  mod <- glm(y ~ x1+x2, data=train)
  mod2 <- gbm::gbm(y ~ x1+x2, data=train, n.trees=100,
    interaction.depth=2, cv.folds=5, n.cores=4)
  # estimate the best number of trees from the out-of-bag (OOB) error
  brt.best.iter <- gbm::gbm.perf(mod2, method='OOB', plot=FALSE)
  # return observed values alongside predictions from both models
  ret <- data.frame(truth=test$y,
    predGLM=predict(mod, newdata=test),
    predGBM=predict(mod2, newdata=test, type='response', n.trees=brt.best.iter))
  return(ret)
}
preds3 <- trainMod2(dat)
## Distribution not specified, assuming gaussian ...
## OOB generally underestimates the optimal number of iterations although predictive performance is reasonably competitive. Using cv_folds>1 when calling gbm usually results in improved predictive performance.
# root mean squared error (RMSE) for each model on the held-out data
glmAcc <- sqrt(mean((preds3$truth - preds3$predGLM)**2))
gbmAcc <- sqrt(mean((preds3$truth - preds3$predGBM)**2))
It is important to note that the model accuracy will change each time the function is called, since a random 20% of the data is held out each time. We can use that to our advantage when comparing models, as we can train the models \(n\) times and see whether the distributions of accuracy values are distinct between models. It is also important to note, though, that we have a lot of control over the outcome of any statistical test for model differences here: increasing the number of times we train the models shrinks the error around the distributions of accuracy values.
glmA <- gbmA <- c()
for(i in 1:10){
set.seed(i)
preds3 <- trainMod2(dat)
glmA[i] <- sqrt(mean((preds3$truth - preds3$predGLM)**2))
gbmA[i] <- sqrt(mean((preds3$truth - preds3$predGBM)**2))
}
And we can use a t-test to see if the distributions of accuracy values are different between the glm and the gbm.
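The output below is just the default two-sample comparison:

t.test(glmA, gbmA)  # Welch two-sample t-test on the RMSE values from the 10 runs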
##
## Welch Two Sample t-test
##
## data: glmA and gbmA
## t = 1.7174, df = 17.84, p-value = 0.1032
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.002713131 0.026923419
## sample estimates:
## mean of x mean of y
## 0.1204547 0.1083495
On average, using the boosted regression tree resulted in a change in root mean squared error of mean(gbmA-glmA) when training 10 models. What happens if we train 1000 models?
glmA2 <- gbmA2 <- c()
for(i in 1:1000){
set.seed(i)
preds3 <- trainMod2(dat)
glmA2[i] <- sqrt(mean((preds3$truth - preds3$predGLM)**2))
gbmA2[i] <- sqrt(mean((preds3$truth - preds3$predGBM)**2))
}
This may not really influence the mean difference between the models in terms of root mean squared error (mean(gbmA2-glmA2)), but it will influence the resulting statistical test that we inappropriately apply to prove a point.
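The corresponding test on the 1000-run accuracy vectors is simply:

t.test(glmA2, gbmA2)  # same comparison, now with 1000 training runs per model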
##
## Welch Two Sample t-test
##
## data: glmA2 and gbmA2
## t = 22.468, df = 1997.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.01346294 0.01603792
## sample estimates:
## mean of x mean of y
## 0.1252865 0.1105361
Recall from last lecture on querying APIs, where we pulled occurrence records for red pandas from GBIF? We may want to develop predictive models of habitat suitability, in order to estimate where a species could potentially occur. These models are often called ‘species distribution models’, and are essentially just a form of logistic regression. That is, we have binary data, in that species occurrences are either known (1), or we have no data, so we can sample points randomly and treat them as absences (0). These are generally called pseudo-absences; the point is not to treat them as true absences per se, but to characterize the distribution of your predictor values relative to the known occurrences (raise your hand if that makes no sense).
Alright. So we have the positive occurrence records (for giraffes, in the code below). To build a species distribution model, we need background points as well as environmental data.
# download the 19 bioclim layers at 10 arc-minute resolution
bc <- geodata::worldclim_global(var="bio", res=10, path='.')
# sample random background (pseudo-absence) points and drop cells with no data
bcN <- terra::spatSample(bc, 5000, 'random')
bcN <- bcN[which(!is.na(bcN[,1])), ]
bcN <- bcN[1:1000, ]
bcN$prez <- 0
# extract climate values at the occurrence coordinates (terra::extract expects longitude, then latitude)
bcP <- terra::extract(bc, giraffe[, c('decimalLongitude', 'decimalLatitude')])
bcP <- bcP[which(!is.na(bcP[,2])), ]
bcP$prez <- 1
bcP$ID <- NULL
# combine background (0) and presence (1) points into one data.frame
dat <- rbind(bcN, bcP)
So we used the geodata package to get the bioclim data and extracted values from these raster layers for each sampled point (both background and presence points). We are now ready to construct a model. We can start with the simplest model, a logistic regression. This assumes a linear response to environmental conditions on the link scale, with the binary data modeled through the logit link.
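The summary below was produced by regressing presence/absence on all of the bioclim layers; the model object name (sdmGLM) is just a placeholder:

sdmGLM <- glm(prez ~ ., family='binomial', data=dat)  # '.' uses every bioclim layer as a predictor
summary(sdmGLM)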
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = prez ~ ., family = "binomial", data = dat)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.240e+01 2.460e+00 -5.040 4.66e-07 ***
## wc2.1_10m_bio_1 -3.200e+00 5.166e-01 -6.193 5.90e-10 ***
## wc2.1_10m_bio_2 -8.997e-01 2.888e-01 -3.115 0.00184 **
## wc2.1_10m_bio_3 -2.491e-02 3.635e-02 -0.685 0.49319
## wc2.1_10m_bio_4 1.646e-01 2.128e-02 7.739 1.01e-14 ***
## wc2.1_10m_bio_5 -1.019e+05 1.490e+05 -0.684 0.49412
## wc2.1_10m_bio_6 1.019e+05 1.490e+05 0.684 0.49411
## wc2.1_10m_bio_7 1.019e+05 1.490e+05 0.684 0.49411
## wc2.1_10m_bio_8 -3.849e-02 4.417e-02 -0.871 0.38349
## wc2.1_10m_bio_9 6.815e-02 4.902e-02 1.390 0.16446
## wc2.1_10m_bio_10 -6.909e+00 9.692e-01 -7.129 1.01e-12 ***
## wc2.1_10m_bio_11 9.140e+00 9.892e-01 9.240 < 2e-16 ***
## wc2.1_10m_bio_12 7.209e-03 2.858e-03 2.522 0.01166 *
## wc2.1_10m_bio_13 1.306e-01 1.539e-02 8.483 < 2e-16 ***
## wc2.1_10m_bio_14 3.378e-02 5.616e-02 0.601 0.54759
## wc2.1_10m_bio_15 -3.098e-02 7.145e-03 -4.336 1.45e-05 ***
## wc2.1_10m_bio_16 -7.426e-02 8.684e-03 -8.552 < 2e-16 ***
## wc2.1_10m_bio_17 -2.559e-02 1.873e-02 -1.366 0.17186
## wc2.1_10m_bio_18 6.473e-03 2.750e-03 2.354 0.01858 *
## wc2.1_10m_bio_19 1.740e-03 1.853e-03 0.939 0.34770
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2203.24 on 1644 degrees of freedom
## Residual deviance: 532.07 on 1625 degrees of freedom
## AIC: 572.07
##
## Number of Fisher Scoring iterations: 9
This indicates that many of the bioclim variables have a significant effect on occurrence, but it does not really give us an idea of how well the model performs (we will go over model performance in more detail very soon). In the meantime, we can at least visualize the predicted values versus the true values, keeping in mind that this is not a great measure of performance, for reasons we will also discuss soon.