So we fit a model. We have some attempt at understanding a system, or at least at explaining some response variable given our set of predictors. But what we want to know is “is this model doing a good job?”. Significant predictor variables do not inherently mean that those predictors are informative or predictive of our response. Estimates of model performance attempt to quantify the amount of variance explained (e.g., \(R^2\)) or the relationship between predicted and empirical values of the response variable.
So we have predictions that were made on data the model hasn’t seen, which reduces the chance that we are rewarding overfitting. Now we want to assess the performance of the model, and we can do so in a whole bunch of ways. A helpful starting point is to think about how we define model performance. Do we want a model whose predictions most closely match the data? Do we want a model that is quite certain in its predictions (whether those predictions are right or wrong)?
Calibration: Calibration refers to the statistical consistency between distributional predictions and true values. In other words, is the range and distribution of the predictions in alignment with the range and distribution of the empirical data?
This one is tricky to understand, but the idea is that in a well-calibrated model, the distribution of the predicted values should roughly match the distribution of the true values. So if a model assigns suitability values between 0.7 and 0.8 to 10 out of 100 points, then for a well-calibrated model roughly 70% to 80% of those 10 points should be true occurrence points.
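To make this concrete, here is a minimal sketch of a simple calibration check; the vectors preds and truth are made up for illustration and are not part of the lecture data.
# illustrative sketch with made-up data: bin predicted probabilities and compare
# the mean prediction in each bin to the observed frequency of 1s
preds <- runif(500)                      # hypothetical predicted probabilities
truth <- rbinom(500, 1, prob=preds)      # hypothetical 0/1 outcomes generated to be well-calibrated
bins <- cut(preds, breaks=seq(0, 1, by=0.1), include.lowest=TRUE)
cbind(meanPredicted=tapply(preds, bins, mean),
  observedFrequency=tapply(truth, bins, mean))
# in a well-calibrated model, the two columns are roughly equal within each bin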
Discrimination: discrimination refers to the ability of the model to rank values correctly. This makes the most sense when thinking about binary data, where the truth is 0/1 and the prediction takes on some continuous value.
One example of this would be AUC (the area under the receiver operating characteristic curve). This gets at the balance between the true positive rate (1’s correctly predicted as 1’s) and the false positive rate (0’s falsely predicted to be 1’s). The true positive rate is commonly referred to as sensitivity, and the false positive rate as 1-specificity. We won’t go into the details here, but the Wikipedia page for AUC is pretty solid, as is this explainer.
Precision: precision (also called sharpness) measures the uncertainty around a given model prediction.
In a traditional view of precision, we can imagine the classic accuracy v. precision dartboard example, where accurate points tend to be in the center of the board, but precise points will be grouped together anywhere on the board. Precision in this case relates to the ability of the model to predict values that coincide with 0 or 1, such that an estimate of 0.5 is the least precise.
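One crude way to put a number on this idea (an illustration on made-up values, not a measure used later in these notes) is the rescaled mean distance of the predicted probabilities from 0.5:
# illustrative sketch: 0 means every prediction is 0.5 (least precise),
# 1 means every prediction is exactly 0 or 1 (most precise)
preds <- c(0.05, 0.10, 0.50, 0.90, 0.95)   # hypothetical predicted probabilities
2 * mean(abs(preds - 0.5))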
Accuracy: accuracy estimates the distance between predicted values and true values, typically ignoring the uncertainty in predicted values.
So one measure of accuracy for our simple example above would simply be the squared difference between prediction and truth (squaring makes all values positive).
This results in a vector that captures the relative error in each predicted point. We can also calculate a single measure of performance for the entire model with the root mean squared error (RMSE): the square root of the mean of the squared differences defined above.
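The code chunk that produced the value below isn’t shown in these notes; a minimal sketch of the calculation, using hypothetical pred and truth vectors, would be:
pred <- c(0.2, 0.5, 0.9, 0.4)     # hypothetical predictions
truth <- c(0, 1, 1, 0)            # hypothetical true values
accuracy <- (pred - truth)^2      # squared difference for each point
sqrt(mean(accuracy))              # root mean squared error (RMSE)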
## [1] 0.3847824
What is interesting here, though, is that these measures are not comparable across models fit to different responses, and they don’t have clear bounds or thresholds that would let us say a model is good once RMSE falls below a certain value. This is why we may prefer a measure of model performance that is either relative to an uninformed model (or some baseline model containing a few variables) or bounded between certain values (e.g., a correlation between predicted and actual values will always be bound between -1 and 1).
This was obviously a simple example of how we could calculate one aspect of model performance, but there are plenty of others. For binary response data, there is a good discussion of model performance measures in https://doi.org/10.1002/ecm.1370. Otherwise, a good resource for thinking about model building and testing is Hastie, Tibshirani, and Friedman’s Elements of Statistical Learning (https://web.stanford.edu/~hastie/ElemStatLearn/).
Another consideration is that sometimes the response variable we are attempting to model is not a continuous variable (meaning that some model performance measures like the correlation coefficient above would not be appropriate). In many cases, the goal of a regression or classification model is to predict a binary output (e.g., survival models are a good example of this, as are species distribution models). The next section will explore model performance measures for binary data, while also discussing the importance of cross-validation and model transference.
Oftentimes when we want to identify important variables, we use all of the data to get the best-fit model, and then craft the narrative around the effects of certain variables in the model (i.e., which variables were significantly related to the response variable?). Fitting models in this way can sometimes tell us how much collective variation the predictors explained (e.g., \(R^2\) measures), but it will not tell us about the predictive ability of the model. The reason is that model performance estimated on data the model has already ‘seen’ is biased: a model could be incredibly overfit and look like a really good model when it actually has no predictive worth. Overfitting occurs when a complex model greedily tries to capture as much variation in the response variable as possible, but in doing so loses the ability to extrapolate to new data. We talked about overfitting previously in the modeling lectures.
To avoid overfitting in predictive models, we can simply split the data into two chunks, using one chunk for model fitting and the other as an independent test case. Let’s work through some examples of how you can approach this train/test split, first with a continuous response and then with a logistic model.
Let’s continue to explore model performance and the importance of the train/test split for a model with a continuous response variable. Here, our goal is to develop a predictive model of the price of the fare for people on the Titanic. I know it’s a bit of a dark dataset to use, but it’s weirdly common as a beginner dataset. We’ll explore some biological data as well.
# read in the Titanic passenger data and impute missing age and fare values with the column means
titanic <- read.csv('titanic3.csv')
titanic$age[which(is.na(titanic$age))] <- mean(titanic$age, na.rm=TRUE)
titanic$fare[which(is.na(titanic$fare))] <- mean(titanic$fare, na.rm=TRUE)

# a Gaussian GLM (equivalent to a linear model here) of fare as a function of sex, age, and passenger class
titanicMod0 <- glm(fare ~ sex+age+pclass, data=titanic, family='gaussian')
summary(titanicMod0)
##
## Call:
## glm(formula = fare ~ sex + age + pclass, family = "gaussian",
## data = titanic)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.63133 5.46893 22.423 < 2e-16 ***
## sexmale -12.43795 2.48950 -4.996 6.64e-07 ***
## age -0.09844 0.09870 -0.997 0.319
## pclass -34.15610 1.52710 -22.367 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1808.245)
##
## Null deviance: 3501400 on 1308 degrees of freedom
## Residual deviance: 2359760 on 1305 degrees of freedom
## AIC: 13538
##
## Number of Fisher Scoring iterations: 2
plot(x=predict(titanicMod0, type='response'), y=titanic$fare,
ylab='Actual fare', xlab='Predicted fare')
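# correlation between predicted and actual fare (the call below was reconstructed
# from the printed output that follows)
cor.test(predict(titanicMod0), titanic$fare)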
##
## Pearson's product-moment correlation
##
## data: predict(titanicMod0) and titanic$fare
## t = 25.146, df = 1307, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5333289 0.6064298
## sample estimates:
## cor
## 0.5710102
Is this a good model?
Let’s explore how it does on a held-out test set versus the training data.
# randomly assign 75% of rows to the training set; the remaining rows form the test set
inds <- sample(1:nrow(titanic), round(0.75 * nrow(titanic)), replace=FALSE)
train <- titanic[inds, ]
test <- titanic[-inds, ]

# refit the fare model using only the training data
titanicMod1 <- glm(fare ~ sex+age+pclass, data=train, family='gaussian')
par(mar=c(4,4,0.5,0.5))
plot(predict(titanicMod1, type='response'), train$fare,
xlab='Predictions', ylab='Actual', pch=16, col=adjustcolor(1,0.25))
# predicting to the test set
par(mar=c(4,4,0.5,0.5))
plot(predict(titanicMod1, type='response', newdata=test), test$fare,
xlab='Predictions', ylab='Actual', pch=16, col=adjustcolor(1,0.25))
# correlation coefficients for train and test
cor.test(predict(titanicMod1, type='response'), train$fare)
##
## Pearson's product-moment correlation
##
## data: predict(titanicMod1, type = "response") and train$fare
## t = 22.836, df = 980, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.5469330 0.6287076
## sample estimates:
## cor
## 0.5893277
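# test-set correlation (the call below was reconstructed from the printed output that follows)
cor.test(predict(titanicMod1, newdata=test, type='response'), test$fare)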
##
## Pearson's product-moment correlation
##
## data: predict(titanicMod1, newdata = test, type = "response") and test$fare
## t = 11.009, df = 325, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.4374409 0.5959452
## sample estimates:
## cor
## 0.5211729
I’m focusing on correlation coefficients of predicted versus actual values as a good measure of model performance, but there are plenty of others. Many of them may be more appropriate when the user only really cares about how much variance the model explains. That is, we become less worried about the train/test split and the ability of the model to predict, and care more about the ability of the model to capture the variance (or deviance) in the data.
One way we can do this is to explore the relationship between residual deviance and null deviance. Here, we define deviance as a measure of error, where lower deviance means a better fit to the data. Null deviance is the error of an intercept-only model, and it should always be at least as large as the residual deviance.
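The chunk that produced the value below isn’t shown; the calculation is just one minus the ratio of residual to null deviance, which for the training-set fare model titanicMod1 would look like the sketch here (the exact value depends on the random train/test split):
# proportion of null deviance explained by the model (a deviance-based R-squared)
1 - (titanicMod1$deviance / titanicMod1$null.deviance)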
## [1] 0.3473072
This measure is bounded between 0 and 1, with larger values corresponding to more variance explained. This comparison of residual to null deviance is incredibly common; the measure above is a deviance-based version of \(R^2\) (and for a Gaussian model like ours it equals the familiar \(R^2\)).
Let’s think about how to make it better. Work in small groups to develop your own predictive model of fare. The goal here is to maximize model performance (10-15 minutes).
In this section, we will build a model of the probability of surviving the sinking of the titanic. The predictors include things like sex, age, fare, ticket number, class, and cabin.
# a fresh random train/test split, then a logistic (binomial) model of survival fit to the training data
inds <- sample(1:nrow(titanic), round(0.75 * nrow(titanic)), replace=FALSE)
train <- titanic[inds, ]
test <- titanic[-inds, ]
titanicMod <- glm(survived ~ sex+age+fare+pclass, data=train, family='binomial')
summary(titanicMod)
##
## Call:
## glm(formula = survived ~ sex + age + fare + pclass, family = "binomial",
## data = train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 4.4659684 0.4821359 9.263 < 2e-16 ***
## sexmale -2.4640256 0.1732470 -14.223 < 2e-16 ***
## age -0.0372791 0.0071435 -5.219 1.8e-07 ***
## fare 0.0003564 0.0018854 0.189 0.85
## pclass -1.0400564 0.1265223 -8.220 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1304.07 on 981 degrees of freedom
## Residual deviance: 921.31 on 977 degrees of freedom
## AIC: 931.31
##
## Number of Fisher Scoring iterations: 4
# visualizing predictions on the training set
par(mar=c(4,4,0.5,0.5))
plot(predict(titanicMod, type='response'), train$survived,
xlab='Predictions', ylab='Actual', pch=16, col=adjustcolor(1,0.25))
# predicting to the test set
par(mar=c(4,4,0.5,0.5))
plot(predict(titanicMod, type='response', newdata=test), test$survived,
xlab='Predictions', ylab='Actual', pch=16, col=adjustcolor(1,0.25))
So let’s put some numbers behind these plots. We’ll use the ROCR package to calculate some performance measures.
library(ROCR)

# ROCR prediction objects pair the continuous predictions with the observed 0/1 labels
predTrain <- prediction(predictions=predict(titanicMod, type='response'),
  labels=train$survived)
predTest <- prediction(predictions=predict(titanicMod, type='response', newdata=test),
  labels=test$survived)
## discrimination (AUC)
as.numeric(performance(predTrain, 'auc')@y.values)
## [1] 0.8315966
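# test-set AUC (the call below was reconstructed; only its output appears on the next line)
as.numeric(performance(predTest, 'auc')@y.values)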
## [1] 0.8417913
# ROC curves: true positive rate vs false positive rate for the training data (black)
# and the test data (red); the 1:1 line is the expectation for a no-skill model
par(mar=c(4,4,0.5,0.5))
plot(unlist(performance(predTrain, 'tpr')@y.values)~
  unlist(performance(predTrain, 'fpr')@y.values),
  pch=16, type='l', xlab='FPR (1-specificity)', ylab='TPR (sensitivity)')
lines(unlist(performance(predTest, 'tpr')@y.values)~
  unlist(performance(predTest, 'fpr')@y.values),
  col='red', lwd=2)
abline(a=0, b=1, lwd=2)
This is actually not a terribly overfit model; in this particular random split the test-set AUC even comes out slightly higher than the training AUC, although we generally expect model performance to be higher when estimated on the data used to train the model than on the holdout test set.
Can you calculate differences in accuracy between the two predictions above (i.e., predTrain and predTest)?
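One possible approach, sketched out using the squared-difference notion of accuracy from earlier (RMSE between the predicted probabilities and the 0/1 outcomes):
# lower RMSE = predictions closer to the 0/1 truth
sqrt(mean((predict(titanicMod, type='response') - train$survived)^2))
sqrt(mean((predict(titanicMod, type='response', newdata=test) - test$survived)^2))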
Another useful visualization is the confusion matrix. This can start to tell you whether the model struggles more with false positives or with false negatives. The data fall into one of four quadrants, where the rows represent your predictions and the columns the actual values. The top left box corresponds to the case where you predicted a positive and the true value was positive (good job!). The same goes for the lower right, except there you correctly predicted a negative. The other two cells form the basis for many measures of model performance. The top right cell corresponds to the case where you predicted a positive but the value was negative (a false positive; also called Type I error). The lower left cell is where you predicted a negative but the value was actually positive (a false negative; also called Type II error).
Rows are predictions, columns are actual values.
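Here is a minimal sketch of building such a confusion matrix for the survival model; the 0.5 cutoff is an assumption on my part, not a value from the lecture.
# classify test-set predictions at an assumed threshold of 0.5
predClass <- ifelse(predict(titanicMod, newdata=test, type='response') > 0.5, 1, 0)
# order the levels so positives (1) come first, matching the layout described above
table(predicted=factor(predClass, levels=c(1, 0)),
  actual=factor(test$survived, levels=c(1, 0)))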
But the thing about our predictions currently is that we have binary truth and continuous predictions, right? That is, our model does not output either 0 or 1, but a value in between 0 and 1. This is the fundamental difference between this being a regression task (continuous output) versus a classification task (binary output). This also speaks to the underlying structure of many measures of model performance (e.g., AUC) that are based on the confusion matrix.
But given that measures like AUC are built from the confusion matrix, how do we get AUC values when our model predictions are continuous? The answer is thresholding: we turn the continuous predictions into 0/1 calls at every possible cutoff, build a confusion matrix at each one, and summarize performance across all of those thresholds.
# extract the ROC curve (true positive rate vs false positive rate) for the train and test predictions
aucCurveTrain <- performance(predTrain, measure = "tpr", x.measure = "fpr")
aucCurveTest <- performance(predTest, measure = "tpr", x.measure = "fpr")

# the thresholds (cutoffs) used to binarize the continuous predictions;
# the first value is Inf, so we set it to 1 for plotting
thresh <- unlist(aucCurveTrain@alpha.values)
thresh[1] <- 1

# plot the training ROC curve, coloring each point by its threshold
plot(unlist(aucCurveTrain@x.values),
  unlist(aucCurveTrain@y.values),
  ylab='True positive rate',
  xlab='False positive rate',
  col=rainbow(100)[cut(thresh, 100)])
In this space of true positive and false positive rates, what would a poorly performing model look like?
## R version 4.3.0 (2023-04-21)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.2 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.10.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ROCR_1.0-11 stringr_1.5.0 rmarkdown_2.22 sf_1.0-13 raster_3.6-20
## [6] sp_2.0-0 rgbif_3.7.7
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.3 bslib_0.5.0 xfun_0.39 ggplot2_3.4.2
## [5] lattice_0.21-8 tzdb_0.4.0 vctrs_0.6.3 tools_4.3.0
## [9] generics_0.1.3 curl_5.0.1 parallel_4.3.0 tibble_3.2.1
## [13] proxy_0.4-27 fansi_1.0.4 highr_0.10 pkgconfig_2.0.3
## [17] KernSmooth_2.23-20 data.table_1.14.8 lifecycle_1.0.3 compiler_4.3.0
## [21] tinytex_0.45 munsell_0.5.0 terra_1.7-37 codetools_0.2-19
## [25] sass_0.4.6 htmltools_0.5.5 class_7.3-21 yaml_2.3.7
## [29] lazyeval_0.2.2 jquerylib_0.1.4 pillar_1.9.0 crayon_1.5.2
## [33] whisker_0.4.1 classInt_0.4-9 cachem_1.0.8 tidyselect_1.2.0
## [37] digest_0.6.32 stringi_1.7.12 dplyr_1.1.2 fastmap_1.1.1
## [41] grid_4.3.0 colorspace_2.1-0 cli_3.6.1 magrittr_2.0.3
## [45] triebeard_0.4.1 crul_1.4.0 utf8_1.2.3 e1071_1.7-13
## [49] withr_2.5.0 scales_1.2.1 bit64_4.0.5 oai_0.4.0
## [53] httr_1.4.6 bit_4.0.5 evaluate_0.21 knitr_1.43
## [57] rlang_1.1.1 urltools_1.7.3 Rcpp_1.0.10 glue_1.6.2
## [61] DBI_1.1.3 httpcode_0.3.0 xml2_1.3.4 vroom_1.6.3
## [65] jsonlite_1.8.5 R6_2.5.1 plyr_1.8.8 units_0.8-2