What do I mean by “probabilistic processes”?

We have already touched on probabilistic processes in a roundabout way. Throughout this course, we have generated random data from probability distributions. For instance, we pulled values from a uniform distribution, in which all values between two bounds are equally likely to be drawn (runif(100, 1, 10)).

Probabilistic processes are simply those processes whose outcome is determined by some probability (e.g., every time I flip a coin, I don’t know the outcome, right?). We can code this up, right? Each trial of a coin flip results in 1 of 2 outcomes (i.e., heads or tails), with a probability we assume to be 0.5 (both outcomes are equally likely). The probability distribution corresponding to this situation is the binomial distribution.

For instance, the following code would simulate flipping a coin 100 times.

coin <- rbinom(100, 1, prob=0.5)
sum(coin == 1)
## [1] 48

As a biological example, consider an animal that disperses with some probability \(p\) and, if it disperses, moves a distance (in km) drawn from a Poisson distribution with mean 10. How do we code this up?
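One possible sketch (the number of animals and the value of \(p\) below are arbitrary choices for illustration):

n <- 50                               # number of animals (arbitrary)
p <- 0.3                              # dispersal probability (arbitrary)
disperses <- rbinom(n, 1, p)          # 1 = disperses, 0 = stays put
distance <- disperses * rpois(n, 10)  # dispersers move a Poisson(10) distance in km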

This starts to explore how you might go about setting up a simulation model, but it is already getting spatial and a bit more complicated, so let’s dial it back.

The above incorporation of probabilistic processes is phenomenological modeling, but many of us will also want to do some statistical modeling. We will do both in this lecture: we first go into probability distributions as they relate to gaining statistical insight from your data, and we end with some simulation modeling to show how probabilistic processes influence population dynamics.

Notation in R

“d” returns the height of the probability density function
“p” returns the cumulative distribution function
“q” returns the inverse cumulative distribution function (quantiles)
“r” returns randomly generated numbers

Generate random draws from probability distributions

rnorm(100, 1, 1)
##   [1] -0.27833555  1.66387576 -0.07138748  1.81889947  0.27234259  0.19401427
##   [7]  0.80840265  1.42687900  0.68245371  0.67556281  1.32074091  0.02780968
##  [13]  1.54451864  0.92171909  1.08758204 -0.19176644  2.65881690 -0.01467829
##  [19]  0.79819202  1.00317683  1.24317796  1.08143637 -0.82252336  1.90387317
##  [25]  2.96265530  1.99319811  1.21914762  1.22331422  1.47903282  1.83565757
##  [31]  1.87482136  1.07123061  0.37282197  2.79734352  3.68432899  1.16198820
##  [37]  3.50937107  0.19993932 -0.91335398  0.08623995  0.84354643  1.74179126
##  [43]  0.87242126  1.41820597  1.76758856  1.93042036  1.73089937  2.67968720
##  [49]  0.35461655  0.19576248  0.16301967  1.94429980  2.57540243  0.22730127
##  [55]  1.61117233  2.06928864  1.38221460  0.94336439 -0.28758968  1.18444623
##  [61]  2.55508519  0.40393389  1.20668910 -0.16880693  3.16525607  1.91259746
##  [67]  2.46689099  1.04022657  0.54658834  0.43561486 -0.57938259  1.55250266
##  [73]  2.60056613 -0.36863535  3.22576788  1.46945405  1.00967178  1.43590999
##  [79]  0.92038290  1.82388110 -0.14582746 -1.32623947  1.11754529  1.14688610
##  [85] -0.39308594 -2.26916939  1.43090517  1.41060810  1.18426272  1.29737311
##  [91]  2.24751809 -1.14249708  0.06161899  2.30841570  0.41371622 -0.87233856
##  [97]  1.23666502  0.06823881 -1.51405126 -0.30401855
rbinom(100, 1, 0.25)
##   [1] 0 0 0 0 1 1 1 0 0 0 1 1 0 0 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 1 1 0 0 0 1
##  [38] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0
##  [75] 0 0 0 0 0 0 0 0 1 0 1 1 0 1 0 0 0 0 1 0 0 0 0 0 1 0
rpois(100, 1)
##   [1] 1 1 1 1 0 2 1 0 2 0 2 1 0 0 2 1 0 2 0 0 0 1 1 3 0 0 0 2 0 1 0 0 3 2 0 1 0
##  [38] 0 3 0 2 1 1 1 0 1 0 1 1 3 0 1 1 0 1 1 1 0 1 3 2 0 0 0 1 0 1 0 1 2 1 0 2 0
##  [75] 3 2 1 0 2 1 0 1 0 1 0 1 1 3 0 2 2 1 0 3 1 0 0 0 2 1
runif(100, 1, 20)
##   [1] 18.934203 17.093164 19.989258 16.661609 15.838980  8.545819 13.743518
##   [8]  2.758775  5.929221  7.338371 15.580463  2.177961  6.231099 18.839952
##  [15]  5.998731 15.773331 16.224287 18.783163  4.611437 19.543696 11.719480
##  [22]  1.979433 12.484805 12.764715 16.983632  7.293576  9.772898 14.052177
##  [29] 15.534509 14.779542  3.267762  3.086773 15.251756 16.060709  2.171911
##  [36] 11.218273 17.961800 14.827658 18.301455 19.526855  2.352490 17.067504
##  [43] 16.198982  4.269017 14.058052  1.507930 19.844265 12.536722  4.823331
##  [50] 17.400807 19.501036  3.311858  5.016037  3.937873 14.740602 18.060506
##  [57] 19.999935 12.526119  9.535645  6.091207  4.959408  4.199898  9.589381
##  [64] 13.204895  9.730261 15.165183 13.255038 15.967018 13.691425 14.637707
##  [71]  6.940825 12.763927 11.494865  8.975066  8.424139 13.536074 11.378376
##  [78]  3.773158 13.474706 13.635168 17.096550  5.771050  8.299611 12.817094
##  [85]  5.114253 15.691075  2.588399 15.603388 16.229751 17.092321  9.885063
##  [92]  1.093568  6.616522  2.211237 19.928411 11.633847  9.416057  2.037772
##  [99]  7.972221  2.967428
rgamma(100, 1, 1)
##   [1] 1.88086837 0.26599412 0.06584626 1.10304061 1.87770988 1.27862891
##   [7] 0.86391196 1.00397265 0.17831943 0.07943037 0.42272279 0.56624874
##  [13] 1.78924424 0.08911141 1.35461386 0.01269588 0.68763576 0.60908572
##  [19] 0.04720405 1.32706778 2.21666515 1.68378999 0.85603054 0.43549608
##  [25] 1.43788519 2.12923072 1.14341121 0.43643145 2.93819892 0.52785270
##  [31] 0.37789720 1.31361293 0.51839133 0.29761619 0.51848144 0.99663316
##  [37] 0.37300835 1.12413984 1.61643975 0.23875145 1.70246787 1.81139974
##  [43] 0.66001083 0.43293303 1.55380198 0.21926377 0.10661738 1.50946079
##  [49] 0.74535170 0.31886360 0.64609038 0.64738183 0.35618540 1.02258602
##  [55] 2.59985677 0.04032825 0.45261346 1.30796511 0.63181714 0.61706148
##  [61] 0.20117679 0.52802194 0.18675220 1.35868676 1.61707920 0.82737322
##  [67] 0.58942291 1.41036021 0.18556237 0.14943125 0.15976139 0.81745296
##  [73] 0.03008043 0.39262621 1.20280626 0.95358900 0.17070903 0.04526196
##  [79] 0.25631812 0.75881373 0.58811128 0.96461645 0.41698923 0.13610339
##  [85] 0.88127440 0.63377946 0.33799256 3.53546230 0.93990425 2.34408776
##  [91] 0.42655073 0.38497857 1.47364359 0.85487197 0.19681840 1.31195022
##  [97] 0.39384344 0.04054786 0.28269990 0.47813659

Given a number (or a vector of numbers), the p---- functions compute the probability that a random draw from the distribution will be less than or equal to that number (the cumulative probability).

pnorm(-1, mean=0, sd=1)
## [1] 0.1586553
pnorm(1, mean=0, sd=1)
## [1] 0.8413447
pnorm(1, mean=0, sd=1, lower.tail=FALSE)
## [1] 0.1586553
pbinom(0, 1, 0.15)
## [1] 0.85
ppois(1, lambda=2)
## [1] 0.4060058
punif(0.25, 0, 1)
## [1] 0.25

The next function we look at is q----, which is the inverse of p----. The idea behind q---- is that you give it a probability, and it returns the value (quantile) whose cumulative probability matches that probability.

pnorm(0.25, mean=0, sd=1)
## [1] 0.5987063
qnorm(0.25, mean=0, sd=1)
## [1] -0.6744898
pnorm(-0.6744898, mean=0, sd=1)
## [1] 0.25

So this means that to the left of -0.6745, we should have 25% of the probability mass, right?

# shade the area under a density curve between `from` and `to`
colorArea <- function(from, to, density, ..., col="dodgerblue", dens=NULL){
    y_seq <- seq(from, to, length.out=500)
    d <- c(0, density(y_seq, ...), 0)  # pad with zeros so the polygon closes
    polygon(c(from, y_seq, to), d, col=col, density=dens)
}

curve(dnorm(x), from=-4, to=4, 
  ylab = "Probability Density",
  xlab = "My random normal variate")

colorArea(from=-4, to=qnorm(0.25), dnorm)

curve(dnorm(x), from=-4, to=4, 
  ylab = "Probability Density",
  xlab = "My random normal variate")

colorArea(from=-4, to=qnorm(0.5), dnorm)

So this means that if we take random draws from a distribution, we should be able to build up something that looks like the results of dnorm, right? But building a distribution from the random variates requires that we have enough samples to accurately capture the true underlying distribution. Let’s explore this.

plot(density(rnorm(10, mean=0, sd=1)), col='firebrick', 
  xlim=c(-4,4), ylim=c(0,0.4))
curve(dnorm(x), from=-4, to=4, add=TRUE, lwd=2)

plot(density(rnorm(1000, mean=0, sd=1)), col='firebrick', 
  xlim=c(-4,4), ylim=c(0,0.4))
curve(dnorm(x), from=-4, to=4, add=TRUE, lwd=2)

plot(density(rnorm(100000, mean=0, sd=1)), col='firebrick', 
  xlim=c(-4,4), ylim=c(0,0.4))
curve(dnorm(x), from=-4, to=4, add=TRUE, lwd=2)

The above code uses the d--- function, where the d stands for density. This gives the expected density of values at a certain point given the probability distribution. The d here corresponds to the probability density function, so while p--- gave us cumulative probabilities (areas under the curve) and q--- gave us quantiles, d--- gives us the actual value of the density function.

plot(density(rnorm(100000, mean=0, sd=1)), col='firebrick', 
  xlim=c(-4,4), ylim=c(0,0.4))
curve(dnorm(x), from=-4, to=4, add=TRUE, lwd=2)
points(x=0, y=dnorm(0), pch=16, cex=3)
points(x=1, y=dnorm(1), pch=16, cex=3)

But how does the above relate to statistical tests? We will explore this by considering an example of the t-test, but think about other statistical tests that may (and likely do) follow a similar structure in terms of how the test statistic is calculated and how we think about p-values.

A case study of the t-test

The t-test can be used to test for differences between two groups of data, or between one group of data and some mean \(\mu\). It is a comparison of means, so we implicitly assume the groups share the same variance and distributional shape.

One sample

\[ t = \dfrac{\bar{X} - \mu}{\mathrm{sd}(X) / \sqrt{n}} \]

This tests if the population mean is different from some value that you provide. In the example below, we generate random data from a normal distribution with mean 1 and standard deviation 2. We want to know if the population mean of the random data is significantly different from a value of 2 (\(\mu\) = 2).

x <- rnorm(100, 1, 2)
mu <- 2

hist(x)
abline(v=mu, col='dodgerblue', lwd=2)

t.test(x, mu=mu)
## 
##  One Sample t-test
## 
## data:  x
## t = -5.9076, df = 99, p-value = 4.92e-08
## alternative hypothesis: true mean is not equal to 2
## 95 percent confidence interval:
##  0.439426 1.224170
## sample estimates:
## mean of x 
## 0.8317982
tt <- (mean(x) - mu) / (sd(x) / sqrt(length(x)))
pt(tt, df=length(x)-1)
## [1] 2.459947e-08
# why is the above value different from the p-value reported by `t.test`?

pt(tt, df=length(x)-1)*2
## [1] 4.919893e-08
hist(rt(10000, df=99), xlim=c(-7,7)) 
abline(v=tt, lwd=3)
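One small caution: multiplying pt(tt, df) by 2 only gave the correct two-tailed p-value above because tt happened to be negative. A version that works regardless of the sign of the test statistic is:

2 * pt(-abs(tt), df=length(x)-1)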

Two-sample

\[ t = \dfrac{\bar{X}_1 - \bar{X}_2}{s_p \sqrt{2/n}} \]

\[ s_p = \sqrt{\dfrac{s_{X_1}^2 + s_{X_2}^2}{2}} \]

where \(s_p\) is the pooled standard deviation and \(n\) is the (equal) size of each sample.

x1 <- rnorm(100, 1, 2)
x2 <- rnorm(100, 2, 1)

t.test(x1, x2, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  x1 and x2
## t = -3.619, df = 198, p-value = 0.0003755
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.3679365 -0.4029605
## sample estimates:
## mean of x mean of y 
##  1.068594  1.954042
sp <- sqrt((var(x1)+var(x2)) / 2)
tt2 <- (mean(x1)-mean(x2)) / (sp*sqrt(2/length(x1)))

df <- (2*length(x1)) - 2

pt(tt2, df=df) * 2
## [1] 0.0003755012
hist(rt(10000, df=df), xlim=c(-5,5)) 
abline(v=tt2, lwd=3)

## at larger sample sizes, the t-distribution is approximately equal to the normal distribution (the z-test is the equivalent test, which assumes the population variance is known). The normal distribution is commonly used for significance testing (which we won't really go into here much beyond this t-test bit). 

pt(tt2, df=df)
## [1] 0.0001877506
pnorm(tt2)
## [1] 0.0001478766

I flip a coin 100 times and get the following outcomes. What is the probability of an outcome at least this extreme if this is a fair coin (\(p\) = 0.5)? (There are many ways to solve this problem. One approach would be to simulate many fair coins, calculate the number of heads, and compare this to the ‘test’ coin; a sketch of this follows the exact test below.)

outcomes <- c(0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 
1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 
0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 
1, 0, 0, 0, 0)



# the cheat way (binomial exact test) 
binom.test(sum(outcomes), n=100, p=0.5)
## 
##  Exact binomial test
## 
## data:  sum(outcomes) and 100
## number of successes = 21, number of trials = 100, p-value = 4.337e-09
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1349437 0.3029154
## sample estimates:
## probability of success 
##                   0.21
pbinom(sum(outcomes), 100, 0.5) * 2
## [1] 4.337367e-09
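The simulation approach mentioned above might look something like this sketch (the choice of 10000 replicate experiments is arbitrary):

# simulate 10000 experiments, each flipping a fair coin 100 times
simHeads <- rbinom(10000, 100, 0.5)
# how often do we see 21 or fewer heads? essentially never,
# consistent with the tiny p-value from the exact test
mean(simHeads <= sum(outcomes))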

Effect size

One thing you may be thinking to yourself at this stage is that the outcomes of these functions seem pretty tied to the number of samples you have. You’re right. The same proportion of heads can be far from ‘significant’ in a small sample yet extreme in a large one. For instance, the probability of getting 3 or fewer heads out of 10 flips of a fair coin is

pbinom(3, 10, 0.5)
## [1] 0.171875

but the probability of getting 30 or fewer heads out of 100 flips is

pbinom(30, 100, 0.5)
## [1] 3.92507e-05

But a p-value alone does not report how large the effect actually is. One way of capturing this is to calculate effect sizes. Cohen’s \(d\) is a pretty classic measure of effect size (though certainly others exist). And the best part is that it is pretty simple. In the case above, we could compute the effect size as

\[ d = \dfrac{\bar{X} - \mu_{0}}{\hat{\sigma}} \]

or the mean of the sample minus the expected value divided by the standard deviation of the sample.
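Applied to the one-sample t-test example from earlier (reusing x and mu from above), a quick sketch:

# Cohen's d for the one-sample case
d <- (mean(x) - mu) / sd(x)
# by convention, |d| near 0.2 is 'small', near 0.5 'medium', near 0.8 'large'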

Some practice problems

You flip a fair coin 10 times. What is the probability that you will have 2 heads and 8 tails outcomes?
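One way to check your answer (this computes the probability of exactly 2 heads out of 10 flips of a fair coin):

dbinom(2, 10, 0.5)
## [1] 0.04394531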

An individual produces \(n\) offspring per year following a Poisson distribution with lambda = 2. Calculate the probability that a mother has more than 5 offspring in a year.

1-ppois(5, 2)
## [1] 0.01656361

How about the probability of having fewer than 2 offspring, 2 years in a row? (we may not do this one, as it gets us a bit further into probability territory where some are quite uncomfortable)
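If we assume the two years are independent draws, one sketch:

# P(fewer than 2 offspring in a year), squared for two independent years
ppois(1, 2)^2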

What’s the probability of observing a sentence with fewer than 25 characters given the sentences data?

library(stringr)
data(sentences)
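A sketch of one possible solution, using a normal approximation to the distribution of characters per sentence (keep this choice in mind for the question that follows):

chars <- nchar(sentences)
pnorm(25, mean=mean(chars), sd=sd(chars))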

What is the above solution assuming about the distribution of the number of characters per sentence from the sentences data?

The logistic model of population growth

Incorporating probabilistic processes into simulation models allows researchers not only to see one realization of the dynamics given the parameters, but also to quantify and explore the variability in outcomes that arises solely from random chance. We’ll explore this a bit using a simple model of population dynamics.

In deterministic logistic growth, a population grows with growth rate \(r\) until it reaches a carrying capacity \(k\). I code a discrete-time version of this model (the Ricker formulation) below and we can explore the dynamics.

logisticGrowth <- function(n, r, k){
  # discrete-time (Ricker) logistic growth: population at the next time step
  n*exp(r*(1-(n / k)))
}


logisticDynamics <- function(n, r, k, steps=100){
  ret <- c()
  ret[1] <- n
  # allow either a single r or a vector of per-step r values
  if(length(r) == 1){
    r <- rep(r, steps)
  }
  for(i in 1:(steps-1)){
    ret[i+1] <- logisticGrowth(ret[i], r[i], k)
  }
  return(ret)
}


plot(logisticDynamics(n=2, r=0.25, k=100, steps=100), 
  type='b', las=1, ylab='population size', xlab='time',
  col='dodgerblue', pch=16)

But this is not how populations change over time, right?

Why not?

But perhaps most importantly …

Birth and death are probabilistic processes

This process is often called ‘demographic stochasticity’, and is especially important when thinking about small population sizes. As a thought experiment, imagine flipping a fair coin 5 times. The probability of landing on ‘heads’ is 0.5, but with only 5 trials, the resulting number of heads is far more variable than if we flipped that same coin 100 times. Considering birth and death as probabilistic processes, we can start to understand how we might expect population dynamics to be more variable at small population sizes. That is, the effects of ‘demographic stochasticity’ are dependent on population density.
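A quick sketch of that thought experiment (the 10000 replicates are an arbitrary choice): the spread in the proportion of heads shrinks as the number of flips grows.

# standard deviation of the proportion of heads across 10000 replicates
sd(rbinom(10000, 5, 0.5) / 5)     # roughly 0.22
sd(rbinom(10000, 100, 0.5) / 100) # roughly 0.05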

Treating birth and death this way does take us a bit away from the logistic model, as the code below does not consider the influence of carrying capacity \(k\).

What if we treated birth as a number of offspring drawn from a Poisson distribution for each individual, with some mean number of offspring \(\lambda\)?

logisticP1 <- function(Nt, b=0.1, d=0.1) {
  # each of the Nt individuals produces a Poisson-distributed number of
  # offspring with mean b (playing the role of lambda)
  births <- sum(rpois(Nt, b))
  deaths <- 0
  pop <- (Nt + births - deaths) 
  return(pop)
}

So the model above treats birth as drawn from a Poisson distribution (with \(b\) playing the role of \(\lambda\)), but for now ignores death entirely.

What if we treated death as binomial, with some probability \(p\)?

logisticP2 <- function(Nt, b=0.1, d=0.1) {
  births <- sum(rpois(Nt, b))
  # each individual dies with probability d
  deaths <- sum(rbinom(Nt, 1, d))
  pop <- (Nt + births - deaths) 
  return(pop)
}
# iterate a stochastic birth-death model (`mod`) over a number of time steps
logisticPDynamics <- function(Nt, b, d, steps=1000, mod=logisticP1){
  ret <- c()
  ret[1] <- Nt
  # allow either single values or per-step vectors of b and d
  if(length(b) == 1){
    b <- rep(b, steps)
  }
  if(length(d) == 1){
    d <- rep(d, steps)
  }
  for(i in 1:(steps-1)){
    ret[i+1] <- mod(ret[i], b=b[i], d=d[i])
  }
  return(ret)
}


plot(logisticPDynamics(10, b=0.1, d=0.1, mod=logisticP1, steps=100), 
  type='l', las=1, ylab='population size', 
  xlab='time', col='dodgerblue')

# birth death equal
plot(logisticPDynamics(10, b=0.1, d=0.1, mod=logisticP2, steps=100), 
  type='l', ylim=c(0,30), 
  las=1, ylab='population size', xlab='time',
  col='dodgerblue')
lapply(1:100, function(x){
  lines(logisticPDynamics(10, b=0.1, d=0.1, mod=logisticP2, steps=100), col='dodgerblue')
})

# birth 2x death
plot(logisticPDynamics(10, b=0.2, d=0.1, mod=logisticP2, steps=100), 
  type='l', ylim=c(0,500),
  las=1, ylab='population size', xlab='time',
  col='dodgerblue')
lapply(1:100, function(x){
  lines(logisticPDynamics(10, b=0.2, d=0.1, mod=logisticP2, steps=100), col='dodgerblue')
})

# 1000 time steps
plot(logisticPDynamics(10, b=0.1, d=0.1, mod=logisticP2, steps=1000), 
  type='l', ylim=c(0,200),
  las=1, ylab='population size', xlab='time',
  col='dodgerblue')
lapply(1:100, function(x){
  lines(logisticPDynamics(10, b=0.1, d=0.1, mod=logisticP2, steps=1000), col=rainbow(100)[x])
})

How would you incorporate the carrying capacity \(k\) into the logisticP2 function? Show how it influences population dynamics.
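One possible sketch (the name logisticP3 and the linear decline in the birth rate are my own choices, not the only reasonable ones): let the mean per-capita birth rate shrink toward zero as the population approaches \(k\).

logisticP3 <- function(Nt, b=0.1, d=0.1, k=100) {
  # mean per-capita birth rate declines linearly to 0 as Nt approaches k
  bEff <- max(0, b * (1 - Nt/k))
  births <- sum(rpois(Nt, bEff))
  deaths <- sum(rbinom(Nt, 1, d))
  return(Nt + births - deaths)
}

# k uses its default of 100 here, since logisticPDynamics only passes b and d
plot(logisticPDynamics(10, b=0.2, d=0.1, mod=logisticP3, steps=100),
  type='l', las=1, ylab='population size', xlab='time',
  col='dodgerblue')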

Infectious disease modeling

The SIR model is a compartmental model of infectious disease, where the goal is to track the fraction of the population that is in each state (susceptible, infected, or recovered). Individuals move between these different states based on some probability of becoming infected (\(\beta\)) and some probability of recovery (\(\gamma\)).

\[ \dfrac{dS}{dt} = -\beta SI \] \[ \dfrac{dI}{dt} = \beta SI - \gamma I \] \[ \dfrac{dR}{dt} = \gamma I \]

sirModel <- function(init=c(100,1,0), beta=0.1, gamma=0.09, 
  steps=500, determ=TRUE){
  s <- i <- r <- c()
  n <- sum(init)

  if(determ){
    # deterministic version tracks the fraction of the population in each state
    s[1] <- init[1]/n
    i[1] <- init[2]/n
    r[1] <- init[3]/n
    for(z in 2:steps){
      s[z] <- s[z-1] - (beta*s[z-1]*i[z-1])
      i[z] <- i[z-1] + (beta*s[z-1]*i[z-1]) - (gamma*i[z-1])
      r[z] <- r[z-1] + (gamma*i[z-1])
    }
  }
  if(determ==FALSE){
    # stochastic version tracks whole individuals, since rbinom needs an
    # integer number of trials; the per-individual infection probability
    # depends on the fraction of the population currently infected
    s[1] <- init[1]
    i[1] <- init[2]
    r[1] <- init[3]
    for(z in 2:steps){
      infection <- sum(rbinom(s[z-1], 1, beta*i[z-1]/n))
      recovery <- sum(rbinom(i[z-1], 1, gamma))
      s[z] <- s[z-1] - infection
      i[z] <- i[z-1] + infection - recovery
      r[z] <- r[z-1] + recovery
    }
    # rescale counts to fractions so both versions return comparable output
    s <- s/n
    i <- i/n
    r <- r/n
  }
  return(data.frame(s,i,r))
}
plot(sirModel()$i)

plot(sirModel(determ=FALSE)$i)

sessionInfo

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] stringr_1.5.0
## 
## loaded via a namespace (and not attached):
##  [1] digest_0.6.33   R6_2.5.1        fastmap_1.1.1   xfun_0.39      
##  [5] magrittr_2.0.3  cachem_1.0.8    glue_1.6.2      knitr_1.43     
##  [9] htmltools_0.5.5 rmarkdown_2.23  tinytex_0.45    lifecycle_1.0.3
## [13] cli_3.6.1       sass_0.4.7      jquerylib_0.1.4 compiler_4.3.1 
## [17] highr_0.10      tools_4.3.1     bslib_0.5.0     evaluate_0.21  
## [21] yaml_2.3.7      jsonlite_1.8.7  rlang_1.1.1     stringi_1.7.12