Iteration

What is iteration?

Iteration refers to the process of doing the same task to a bunch of different objects. Consider a toy example of the actions required by a cashier at a grocery store. They scan each item, where items can be different sizes/shapes/prices. This is an iterative task, as it uses the same motions (essentially) across a variety of different objects (groceries) which may vary in many ways, but have some commonalities (e.g., most items have a barcode).

Why is iteration important?

Up until this point, we have dealt with single data.frame objects (or vectors, the building blocks of data.frames). However, we also introduced the concept of lists in one of the first lectures, and will go into more detail about lists soon. For now, we’ll talk about iteration independent of list objects, but keep in mind that iteration is important for lists.

Essentially, iteration allows us to process a large amount of data without the need to repeat ourselves. Recall the gapminder data.

dat <- read.delim(file = "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")

We discussed the gapminder data when introducing some tools around data subsetting and summarising. We ended that lecture by discussing dplyr, a useful package for data processing.

library(plyr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Recall that towards the end of that lecture, we introduce piping commands with dplyr to summarise data. For instance, the code below calculates mean life expectancy (lifeExp) by country.

tmp3 <- dat |>
    dplyr::group_by(country) |>
    dplyr::summarise(mnLifeExp=mean(lifeExp))

Approaching this with dplyr offers us a powerful way to summarise our data, but you will inevitably hit the limits of dplyr and thinking about how to do this in base R is difficult, right? In base R, we discussed subsetting, but to do what the above code does, we would have to subset by every country and then calculate the mean lifeExp for each subset. This is a good jumping off point for iteration, starting with the idea of the for loop (some folks use ‘looping’ and ‘iteration’ to mean the same thing). So we want a way to subset the dat data.frame by country, and then calculate mean lifeExp.

To start, we need to get a vector of the countries in the data.

countries <- unique(dat$country)

Then we need to get the overall structure of the loop in place. To do this, we use the structure for(i in range){ do something}. Essentially, we need to first define the range of what we want the loop to do, and then within the curly brackets, we need to do the thing. The power of this comes from the i in the for loop call. This is essentially saying to temporally treat i as one of the values in range, do something considering that, and then set i to the next value. This sequential process means that at the end of the loop, we will have cycled through all the entries in range.

for(i in countries){
  print(i)
}

## [1] "Afghanistan"
## [1] "Albania"
## [1] "Algeria"
## [1] "Angola"
## [1] "Argentina"
## [1] "Australia"
## [1] "Austria"
## [1] "Bahrain"
## [1] "Bangladesh"
## [1] "Belgium"
## [1] "Benin"
## [1] "Bolivia"
## [1] "Bosnia and Herzegovina"
## [1] "Botswana"
## [1] "Brazil"
## [1] "Bulgaria"
## [1] "Burkina Faso"
## [1] "Burundi"
## [1] "Cambodia"
## [1] "Cameroon"
## [1] "Canada"
## [1] "Central African Republic"
## [1] "Chad"
## [1] "Chile"
## [1] "China"
## [1] "Colombia"
## [1] "Comoros"
## [1] "Congo, Dem. Rep."
## [1] "Congo, Rep."
## [1] "Costa Rica"
## [1] "Cote d'Ivoire"
## [1] "Croatia"
## [1] "Cuba"
## [1] "Czech Republic"
## [1] "Denmark"
## [1] "Djibouti"
## [1] "Dominican Republic"
## [1] "Ecuador"
## [1] "Egypt"
## [1] "El Salvador"
## [1] "Equatorial Guinea"
## [1] "Eritrea"
## [1] "Ethiopia"
## [1] "Finland"
## [1] "France"
## [1] "Gabon"
## [1] "Gambia"
## [1] "Germany"
## [1] "Ghana"
## [1] "Greece"
## [1] "Guatemala"
## [1] "Guinea"
## [1] "Guinea-Bissau"
## [1] "Haiti"
## [1] "Honduras"
## [1] "Hong Kong, China"
## [1] "Hungary"
## [1] "Iceland"
## [1] "India"
## [1] "Indonesia"
## [1] "Iran"
## [1] "Iraq"
## [1] "Ireland"
## [1] "Israel"
## [1] "Italy"
## [1] "Jamaica"
## [1] "Japan"
## [1] "Jordan"
## [1] "Kenya"
## [1] "Korea, Dem. Rep."
## [1] "Korea, Rep."
## [1] "Kuwait"
## [1] "Lebanon"
## [1] "Lesotho"
## [1] "Liberia"
## [1] "Libya"
## [1] "Madagascar"
## [1] "Malawi"
## [1] "Malaysia"
## [1] "Mali"
## [1] "Mauritania"
## [1] "Mauritius"
## [1] "Mexico"
## [1] "Mongolia"
## [1] "Montenegro"
## [1] "Morocco"
## [1] "Mozambique"
## [1] "Myanmar"
## [1] "Namibia"
## [1] "Nepal"
## [1] "Netherlands"
## [1] "New Zealand"
## [1] "Nicaragua"
## [1] "Niger"
## [1] "Nigeria"
## [1] "Norway"
## [1] "Oman"
## [1] "Pakistan"
## [1] "Panama"
## [1] "Paraguay"
## [1] "Peru"
## [1] "Philippines"
## [1] "Poland"
## [1] "Portugal"
## [1] "Puerto Rico"
## [1] "Reunion"
## [1] "Romania"
## [1] "Rwanda"
## [1] "Sao Tome and Principe"
## [1] "Saudi Arabia"
## [1] "Senegal"
## [1] "Serbia"
## [1] "Sierra Leone"
## [1] "Singapore"
## [1] "Slovak Republic"
## [1] "Slovenia"
## [1] "Somalia"
## [1] "South Africa"
## [1] "Spain"
## [1] "Sri Lanka"
## [1] "Sudan"
## [1] "Swaziland"
## [1] "Sweden"
## [1] "Switzerland"
## [1] "Syria"
## [1] "Taiwan"
## [1] "Tanzania"
## [1] "Thailand"
## [1] "Togo"
## [1] "Trinidad and Tobago"
## [1] "Tunisia"
## [1] "Turkey"
## [1] "Uganda"
## [1] "United Kingdom"
## [1] "United States"
## [1] "Uruguay"
## [1] "Venezuela"
## [1] "Vietnam"
## [1] "West Bank and Gaza"
## [1] "Yemen, Rep."
## [1] "Zambia"
## [1] "Zimbabwe"

So what did the above code do?

Alright. So we have a way to sequentially work through all of the countries and we know how to subset the data based on country. So we can now subset the data for each of the countries, using the i iterator as a stand-in for each of the country names. But this does not actually do anything with the data, such that tmp will just be the subset data for the last country in the countries vector.

for(i in countries){
  tmp <- dat[which(dat$country == i), ]
}

So let’s now compute the mean lifeExp for each country.

meanLifeExp <- c()
for(i in countries){
  tmp <- dat[which(dat$country == i), ]
  meanLifeExp <- c(meanLifeExp, mean(tmp$lifeExp))
}

Here, we first create a vector to hold the output data (meanLifeExp) and then append the value for each mean onto the vector. That is, we essentially re-write the meanLifeExp vector at every step of the iteration. This is bad practice for a number of reasons (e.g., no memory efficient, writing over objects where the object itself is in the call is bad practice, etc.). So how can we get around doing this? for loops can be handed a vector of character values (as we have done above) or they can be handed a numeric range. This is often useful, as it eases indexing and can be a bit clearer in the code.

meanLifeExp <- c()
for(i in 1:length(countries)){
  tmp <- dat[which(dat$country == countries[i]), ]
  meanLifeExp[i] <- mean(tmp$lifeExp)
}

And the results of this code should be the same as the other for loop. We now have a vector of mean life expectancy values for each country in countries. But that was a fair bit of work to get the same thing we could have gotten with dplyr, right? Let’s explore a situation where it would be a bit tougher to get the same thing out of dplyr (at least with our current knowledge, as the example I’ll give below can be solved using dplyr::do).

Let’s say that we want to explore the relationship between year and lifeExp for each country. That is, we want to know how life expectancy is changing over time across the different countries. To do this, we can use the cor.test function in R to calculate Pearson’s correlation coefficients (assumes linear structure between the two variables) or Spearman’s rank correlation (assumes monotonic, but not linear response). The output of cor.test is a object, such that dplyr::summarise would fail.

tmp3 <- dat |>
    dplyr::group_by(country) |>
    dplyr::summarise(cor.test(year, lifeExp))

So summarise expects the output to be a vector (note that there are ways around this, by pulling out the information we want from the cor.test)

tmp3 <- dat |>
    dplyr::group_by(country) |>
    dplyr::summarise(cor.test(year, lifeExp)$estimate)

But how we do pull out multiple values from the same test? And how do we handle and diagnose potential errors when we don’t work through each test sequentially?

lifeExpTime <- matrix(0, ncol=4, nrow=length(countries))

for(i in 1:length(countries)){
  tmp <- dat[which(dat$country == countries[i]), ]
  crP <- cor.test(tmp$year, tmp$lifeExp)
  crS <- cor.test(tmp$year, tmp$lifeExp, method='spearman')
  lifeExpTime[i, ] <- c(crP$estimate, crP$p.value, 
    crS$estimate, crS$p.value)
}

## Warning in cor.test.default(tmp$year, tmp$lifeExp, method = "spearman"): Cannot
## compute exact p-value with ties

colnames(lifeExpTime) <- c('pearsonEst', 'pearsonP', 
  'spearmanEst', 'spearmanP')

lifeExpTime <- as.data.frame(lifeExpTime)
lifeExpTime$country <- countries

And we can explore these data, to determine which countries have increasing or decreasing life expectancy values as a function of time.

lifeExpTime[which.min(lifeExpTime$pearsonEst),]

##     pearsonEst  pearsonP spearmanEst spearmanP country
## 141 -0.2446149 0.4435318  -0.1888112 0.5578278  Zambia

This may seem like a lot of work when we could have done a bit less using dplyr syntax. The real power of for loops will be in working with lists, simulating data, and plotting. For instace, let’s say we don’t have data directly to work with, but want to generate data. We could generate a bunch of data, mash it all together in a data.frame, and then feed it into dplyr, the data generation step would require a for loop already, so why not keep things all contained in the for loop.

Let’s say we want to create a Fibonacci sequence. This is a vector of numbers in which each number is the sum of the two preceding numbers in the vector. For the example, we will limit the length of the vector to be length 1000.

fib <- c(0,1)
for(i in 3:1000){
  fib[i] <- sum(fib[(i-2):(i-1)])
}

And now we have a Fibonacci sequence starting with c(0,1).

Why do I start the for loop above at 3, and how else could you approach this same problem (there are many ways)?

Apply statements

apply statements exist in many types, depending on the data.structure you wish to do the action on: e.g. apply, sapply, lapply, vapply, tapply. We will focus on apply and lapply, but realize that these other options may be better suited for your use case (especially vapply, which gives you a bit more control over output format). In the loop above, we wanted to find the mean of each entry in a list. We used a for loop to loop over elements, and stored the resulting means in a vector called out. Instead, we could use lapply…the l in it means it performs some action on a list object.

lapply(X=testList2, FUN=mean)

## $a
## [1] 0.5149336
## 
## $b
## [1] 0.7264648
## 
## $d
## [1] 0.5308854

The output of lapply will always be a list, which is nice in some instances and not nice in others. sapply is a wrapper for lapply which always returns a vector of values.

sapply(X=testList2, FUN=mean)

##         a         b         d 
## 0.5149336 0.7264648 0.5308854

Now that we have an idea of what the apply family of functions do, we can look specifically at apply, which operates on matrices or data.frames. What if we wanted to calculate the mean of every column or row in a data.frame? We could loop over each column or row…

testDF <- data.frame(a=runif(100), b=rpois(100,2), d=rbinom(100,1,0.5))


# over columns
ret <- c()
for(i in 1:ncol(testDF)){
    ret[i] <- mean(testDF[,i])
}


# over rows
ret <- c()
for(i in 1:nrow(testDF)){
    ret[i] <- mean(unlist(testDF[i, ]))
}

Or we could use apply statements

apply(X=testDF, MARGIN=2, FUN=mean)

##        a        b        d 
## 0.469083 2.000000 0.490000

apply(X=testDF, MARGIN=1, FUN=mean)

##   [1] 1.12152229 1.49631144 0.73707709 1.68469032 1.26639734 0.82555550
##   [7] 0.87387685 1.19143617 0.71481739 2.35170235 0.38762291 0.70784483
##  [13] 0.79960014 0.80506469 2.79985396 0.93868430 1.03796539 0.85367128
##  [19] 0.97654511 0.00320723 1.48774981 1.47519670 0.96997310 0.84182996
##  [25] 1.49191477 0.58143265 1.03971350 0.42935628 0.90874980 0.62243689
##  [31] 0.65866024 1.35179540 1.71779201 0.79101765 1.30958502 0.81788804
##  [37] 0.80596578 0.77268310 0.43195416 0.42865734 1.00974948 0.52750960
##  [43] 0.73019151 0.27450089 1.27699876 1.04488856 1.92626858 1.08529232
##  [49] 1.25809380 0.90327604 1.08262631 1.34647928 1.72408794 0.57147791
##  [55] 0.44251309 0.66276009 1.95583199 1.68507013 1.05234152 0.38790862
##  [61] 0.71660575 0.48574734 1.25181866 0.70360726 0.66633188 0.84078321
##  [67] 1.36613855 1.23309436 0.47210247 1.40138343 0.84533758 0.93075785
##  [73] 0.46337942 0.86157207 0.69433379 1.07107694 0.22762355 1.26310464
##  [79] 0.28622747 1.83946655 0.60442356 0.90493334 1.63832061 1.28150308
##  [85] 0.58820865 1.36576519 0.60564570 2.17036041 2.76127076 1.04491107
##  [91] 1.30175586 0.56756594 0.01324793 0.36053165 0.39450862 1.31327433
##  [97] 1.22232270 0.63959893 0.66841202 0.11338135

One advantage is that indexing rows of a data.frame is a pain, which is why we had to unlist each row in the for loop over rows above. If we do not do this, we get a vector of NA values. This is because a data.frame is a list of vectors. This is why column-wise operations on data.frames can also be performed using lapply (if we wanted list output) or sapply (if we wanted vector output).

lapply(X=testDF, FUN=mean)

## $a
## [1] 0.469083
## 
## $b
## [1] 2
## 
## $d
## [1] 0.49

sapply(X=testDF, FUN=mean)

##        a        b        d 
## 0.469083 2.000000 0.490000

A fun class exercise

You are creating a game of rock-paper-scissors. In the game, each player can select their strategy, and the strategy can be different in each trial (where there can be 100s of trials).

I think that the outcome is random, so as a player, I already have decided what I’m going to play before the game starts.

strat <- sample(c('rock','paper', 'scissors'), 100, replace=TRUE)

Write a for loop to simulate rock-paper-scissors game of 500 trials between two players, where my strategy above is one of the players.

How would you go about changing the strategy of the other player to beat my strategy?

How would you modify your strategy to be adaptive? For instance, if your opponent selects ‘rock’ twice in a row, it may be unlikely that they’ll select ‘rock’ again. How do you incorporate this into the code?

sessionInfo

sessionInfo()