Iteration refers to the process of doing the same task to a bunch of different objects. Consider a toy example of the actions required by a cashier at a grocery store. They scan each item, where items can be different sizes/shapes/prices. This is an iterative task, as it uses the same motions (essentially) across a variety of different objects (groceries) which may vary in many ways, but have some commonalities (e.g., most items have a barcode).
Up until this point, we have dealt with single data.frame objects (or vectors, the building blocks of data.frames). However, we also introduced the concept of lists in one of the first lectures, and will go into more detail about lists soon. For now, we’ll talk about iteration independent of list objects, but keep in mind that iteration is important for lists.
Essentially, iteration allows us to process a large amount of data without the need to repeat ourselves. Recall the gapminder data.
dat <- read.delim(file = "http://www.stat.ubc.ca/~jenny/notOcto/STAT545A/examples/gapminder/data/gapminderDataFiveYear.txt")
We discussed the gapminder data when introducing some tools around data subsetting and summarising. We ended that lecture by discussing dplyr, a useful package for data processing.
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Recall that towards the end of that lecture, we introduced piping commands with dplyr to summarise data. For instance, the code below calculates mean life expectancy (lifeExp) by country.
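As a minimal sketch of that kind of pipeline (assuming dplyr is installed, and using a small made-up data.frame called toy as a stand-in for the gapminder data), grouping and summarising might look like this:

```r
library(dplyr)

# Hypothetical stand-in for the gapminder data (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  lifeExp = c(50, 54, 70, 74),
  stringsAsFactors = FALSE
)

# Group the rows by country, then take the mean of lifeExp within each group
meanByCountry <- toy %>%
  group_by(country) %>%
  summarise(meanLifeExp = mean(lifeExp))

meanByCountry
```

With the real data, the same pattern would start from dat instead of toy.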
Approaching this with dplyr offers us a powerful way to summarise our data, but you will inevitably hit the limits of dplyr, and thinking about how to do this in base R is difficult, right? In base R, we discussed subsetting, but to do what the above code does, we would have to subset by every country and then calculate the mean lifeExp for each subset. This is a good jumping off point for iteration, starting with the idea of the for loop (some folks use ‘looping’ and ‘iteration’ to mean the same thing). So we want a way to subset the dat data.frame by country, and then calculate mean lifeExp.
To start, we need to get a vector of the countries in the data.
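One way to get such a vector is unique(), sketched here on a small made-up data.frame (toy is hypothetical; with the real data the call would be applied to dat$country):

```r
# Hypothetical stand-in for the gapminder data (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B", "C"),
  lifeExp = c(50, 54, 70, 74, 61),
  stringsAsFactors = FALSE
)

# unique() drops repeated entries and keeps the order of first appearance
countries <- unique(toy$country)
countries
```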
Then we need to get the overall structure of the loop in place. To do this, we use the structure for(i in range){ do something }. Essentially, we first define the range of values we want the loop to run over, and then, within the curly brackets, we do the thing. The power of this comes from the i in the for loop call. This is essentially saying to temporarily treat i as one of the values in range, do something considering that, and then set i to the next value. This sequential process means that at the end of the loop, we will have cycled through all the entries in range.
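The loop that produced the country names printed below is not shown in these notes. A minimal sketch of the same pattern, assuming a short character vector in place of the full countries vector, could be:

```r
# Three names taken from the printed output; the real vector is much longer
countries <- c("Afghanistan", "Albania", "Algeria")

# On each pass, i holds the next value in countries
for (i in countries) {
  print(i)
}

# After the loop finishes, i still holds the last value it was given
i
```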
## [1] "Afghanistan"
## [1] "Albania"
## [1] "Algeria"
## [1] "Angola"
## [1] "Argentina"
## [1] "Australia"
## [1] "Austria"
## [1] "Bahrain"
## [1] "Bangladesh"
## [1] "Belgium"
## [1] "Benin"
## [1] "Bolivia"
## [1] "Bosnia and Herzegovina"
## [1] "Botswana"
## [1] "Brazil"
## [1] "Bulgaria"
## [1] "Burkina Faso"
## [1] "Burundi"
## [1] "Cambodia"
## [1] "Cameroon"
## [1] "Canada"
## [1] "Central African Republic"
## [1] "Chad"
## [1] "Chile"
## [1] "China"
## [1] "Colombia"
## [1] "Comoros"
## [1] "Congo, Dem. Rep."
## [1] "Congo, Rep."
## [1] "Costa Rica"
## [1] "Cote d'Ivoire"
## [1] "Croatia"
## [1] "Cuba"
## [1] "Czech Republic"
## [1] "Denmark"
## [1] "Djibouti"
## [1] "Dominican Republic"
## [1] "Ecuador"
## [1] "Egypt"
## [1] "El Salvador"
## [1] "Equatorial Guinea"
## [1] "Eritrea"
## [1] "Ethiopia"
## [1] "Finland"
## [1] "France"
## [1] "Gabon"
## [1] "Gambia"
## [1] "Germany"
## [1] "Ghana"
## [1] "Greece"
## [1] "Guatemala"
## [1] "Guinea"
## [1] "Guinea-Bissau"
## [1] "Haiti"
## [1] "Honduras"
## [1] "Hong Kong, China"
## [1] "Hungary"
## [1] "Iceland"
## [1] "India"
## [1] "Indonesia"
## [1] "Iran"
## [1] "Iraq"
## [1] "Ireland"
## [1] "Israel"
## [1] "Italy"
## [1] "Jamaica"
## [1] "Japan"
## [1] "Jordan"
## [1] "Kenya"
## [1] "Korea, Dem. Rep."
## [1] "Korea, Rep."
## [1] "Kuwait"
## [1] "Lebanon"
## [1] "Lesotho"
## [1] "Liberia"
## [1] "Libya"
## [1] "Madagascar"
## [1] "Malawi"
## [1] "Malaysia"
## [1] "Mali"
## [1] "Mauritania"
## [1] "Mauritius"
## [1] "Mexico"
## [1] "Mongolia"
## [1] "Montenegro"
## [1] "Morocco"
## [1] "Mozambique"
## [1] "Myanmar"
## [1] "Namibia"
## [1] "Nepal"
## [1] "Netherlands"
## [1] "New Zealand"
## [1] "Nicaragua"
## [1] "Niger"
## [1] "Nigeria"
## [1] "Norway"
## [1] "Oman"
## [1] "Pakistan"
## [1] "Panama"
## [1] "Paraguay"
## [1] "Peru"
## [1] "Philippines"
## [1] "Poland"
## [1] "Portugal"
## [1] "Puerto Rico"
## [1] "Reunion"
## [1] "Romania"
## [1] "Rwanda"
## [1] "Sao Tome and Principe"
## [1] "Saudi Arabia"
## [1] "Senegal"
## [1] "Serbia"
## [1] "Sierra Leone"
## [1] "Singapore"
## [1] "Slovak Republic"
## [1] "Slovenia"
## [1] "Somalia"
## [1] "South Africa"
## [1] "Spain"
## [1] "Sri Lanka"
## [1] "Sudan"
## [1] "Swaziland"
## [1] "Sweden"
## [1] "Switzerland"
## [1] "Syria"
## [1] "Taiwan"
## [1] "Tanzania"
## [1] "Thailand"
## [1] "Togo"
## [1] "Trinidad and Tobago"
## [1] "Tunisia"
## [1] "Turkey"
## [1] "Uganda"
## [1] "United Kingdom"
## [1] "United States"
## [1] "Uruguay"
## [1] "Venezuela"
## [1] "Vietnam"
## [1] "West Bank and Gaza"
## [1] "Yemen, Rep."
## [1] "Zambia"
## [1] "Zimbabwe"
So what did the above code do?
Alright. So we have a way to sequentially work through all of the countries and we know how to subset the data based on country. So we can now subset the data for each of the countries, using the i iterator as a stand-in for each of the country names.
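A sketch of that subsetting step, assuming a small made-up data.frame (toy) in place of dat:

```r
# Hypothetical stand-in for the gapminder data (values are made up)
toy <- data.frame(
  country = c("A", "A", "B", "B", "C", "C"),
  lifeExp = c(50, 54, 70, 74, 61, 63),
  stringsAsFactors = FALSE
)
countries <- unique(toy$country)

for (i in countries) {
  # keep only the rows whose country matches the current value of i
  tmp <- toy[which(toy$country == i), ]
}

tmp
```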
But this does not actually do anything with the data, such that tmp will just be the subset data for the last country in the countries vector.
So let’s now compute the mean lifeExp for each country.
# start with an empty vector and append one mean per country
meanLifeExp <- c()
for(i in countries){
  tmp <- dat[which(dat$country == i), ]
  meanLifeExp <- c(meanLifeExp, mean(tmp$lifeExp))
}
Here, we first create a vector to hold the output data (meanLifeExp) and then append the value for each mean onto the vector. That is, we essentially re-write the meanLifeExp vector at every step of the iteration. This is bad practice for a number of reasons (e.g., it is not memory efficient, and overwriting an object in a call that also uses that object is error-prone). So how can we get around doing this? for loops can be handed a vector of character values (as we have done above) or they can be handed a numeric range. This is often useful, as it eases indexing and can be a bit clearer in the code.
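# create the full-length output vector once, then fill it in by position
meanLifeExp <- numeric(length(countries))
# seq_along(countries) is 1:length(countries), but safe if countries is empty
for(i in seq_along(countries)){
  tmp <- dat[which(dat$country == countries[i]), ]
  meanLifeExp[i] <- mean(tmp$lifeExp)
}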
And the results of this code should be the same as the other for loop. We now have a vector of mean life expectancy values for each country in countries. But that was a fair bit of work to get the same thing we could have gotten with dplyr, right? Let’s explore a situation where it would be a bit tougher to get the same thing out of dplyr (at least with our current knowledge, as the example I’ll give below can be solved using dplyr::do).
Let’s say that we want to explore the relationship between year and lifeExp for each country. That is, we want to know how life expectancy is changing over time across the different countries. To do this, we can use the cor.test function in R to calculate Pearson’s correlation coefficients (assumes a linear relationship between the two variables) or Spearman’s rank correlation (assumes a monotonic, but not necessarily linear, relationship). The output of cor.test is a list-like object (of class htest) rather than a single value, such that dplyr::summarise would fail.
So summarise expects the output to be a vector (note that there are ways around this, by pulling out the information we want from the cor.test output).
But how do we pull out multiple values from the same test? And how do we handle and diagnose potential errors when we don’t work through each test sequentially?
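To make the shape of that output concrete, here is a small sketch on made-up numbers (x and y are hypothetical, not taken from the gapminder data) showing the class of the object and how single values can be pulled out of it:

```r
# Made-up values for illustration only
x <- c(1952, 1957, 1962, 1967, 1972)
y <- c(30.1, 31.9, 33.8, 36.0, 37.5)

cr <- cor.test(x, y)

class(cr)   # the test result is an object of class "htest"
names(cr)   # the named pieces stored inside it

# Individual pieces are plain numbers and can be extracted with $
est  <- unname(cr$estimate)
pval <- cr$p.value
```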
# one row per country: Pearson estimate and p-value, Spearman estimate and p-value
lifeExpTime <- matrix(0, ncol=4, nrow=length(countries))
for(i in 1:length(countries)){
  tmp <- dat[which(dat$country == countries[i]), ]
  crP <- cor.test(tmp$year, tmp$lifeExp)
  crS <- cor.test(tmp$year, tmp$lifeExp, method='spearman')
  lifeExpTime[i, ] <- c(crP$estimate, crP$p.value,
                        crS$estimate, crS$p.value)
}
## Warning in cor.test.default(tmp$year, tmp$lifeExp, method = "spearman"): Cannot
## compute exact p-value with ties
colnames(lifeExpTime) <- c('pearsonEst', 'pearsonP',
'spearmanEst', 'spearmanP')
lifeExpTime <- as.data.frame(lifeExpTime)
lifeExpTime$country <- countries
And we can explore these data, to determine which countries have increasing or decreasing life expectancy values as a function of time.
## pearsonEst pearsonP spearmanEst spearmanP country
## 141 -0.2446149 0.4435318 -0.1888112 0.5578278 Zambia
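One way to explore a results table like this is to filter or sort on the estimated correlation. The sketch below rebuilds a tiny version of the workflow on made-up data (toy, res and declining are hypothetical names), so its numbers will not match the output above:

```r
# Hypothetical data: A and C rise over time, B falls
toy <- data.frame(
  country = rep(c("A", "B", "C"), each = 4),
  year    = rep(c(1992, 1997, 2002, 2007), times = 3),
  lifeExp = c(50, 52, 55, 57,
              60, 58, 55, 51,
              70, 71, 73, 76),
  stringsAsFactors = FALSE
)
countries <- unique(toy$country)

# Preallocated results table, filled in one country at a time
res <- data.frame(country = countries, pearsonEst = NA_real_,
                  pearsonP = NA_real_, stringsAsFactors = FALSE)
for (i in seq_along(countries)) {
  tmp <- toy[toy$country == countries[i], ]
  cr <- cor.test(tmp$year, tmp$lifeExp)
  res$pearsonEst[i] <- cr$estimate
  res$pearsonP[i]   <- cr$p.value
}

# Countries with a negative correlation between year and lifeExp
declining <- res[res$pearsonEst < 0, ]

# All countries, ordered from most negative to most positive estimate
res[order(res$pearsonEst), ]
```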
This may seem like a lot of work when we could have done a bit less using dplyr syntax. The real power of for loops will be in working with lists, simulating data, and plotting. For instance, let’s say we don’t have data directly to work with, but want to generate data. We could generate a bunch of data, mash it all together in a data.frame, and then feed it into dplyr, but the data generation step would already require a for loop, so why not keep everything contained in the for loop?
Let’s say we want to create a Fibonacci sequence. This is a vector of numbers in which each number is the sum of the two preceding numbers in the vector. For the example, we will limit the vector to length 1000.
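A minimal sketch of one way to build it (the names n and fib are assumptions, not necessarily those used in the lecture):

```r
n <- 1000

# Preallocate the full vector, then set the two seed values
fib <- numeric(n)
fib[1:2] <- c(0, 1)

# Each remaining entry is the sum of the two entries before it
for (i in 3:n) {
  fib[i] <- fib[i - 1] + fib[i - 2]
}

head(fib, 10)
## [1]  0  1  1  2  3  5  8 13 21 34
```

Note that the values are stored as doubles, so the very large entries later in the vector are approximate rather than exact integers.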
And now we have a Fibonacci sequence starting with c(0,1).
Why do I start the for loop above at 3, and how else could you approach this same problem (there are many ways)?
apply statements exist in many types, depending on the data structure you wish to do the action on: e.g. apply, sapply, lapply, vapply, tapply. We will focus on apply and lapply, but realize that these other options may be better suited for your use case (especially vapply, which gives you a bit more control over output format). In the loop above, we wanted to find the mean of each entry in a list. We used a for loop to loop over elements, and stored the resulting means in a vector called out. Instead, we could use lapply. The l stands for list: it applies a function to each element of a list and returns the results as a list.
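As a sketch of both versions side by side, assuming a hypothetical list lst of three random vectors (the seed and contents are made up, so the means will not match the output shown in these notes):

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable
lst <- list(a = runif(100), b = runif(100), d = runif(100))

# for loop version: store each mean in a preallocated vector called out
out <- numeric(length(lst))
for (i in seq_along(lst)) {
  out[i] <- mean(lst[[i]])
}

# lapply version: apply mean to every element, get a list back
outList <- lapply(lst, mean)
outList
```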
## $a
## [1] 0.5149336
##
## $b
## [1] 0.7264648
##
## $d
## [1] 0.5308854
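The output of lapply will always be a list, which is nice in some instances and not nice in others. sapply is a wrapper for lapply which tries to simplify the result, returning a vector of values when each element of the output has length one (and a matrix or list otherwise).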
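A small sketch of that simplification, using a made-up list (lst, means, ranges and meansChecked are hypothetical names):

```r
lst <- list(a = 1:4, b = c(2, 4, 6), d = c(10, 20))

# One value per element, so sapply returns a named numeric vector
means <- sapply(lst, mean)
means

# Two values per element, so sapply returns a matrix with one column per element
ranges <- sapply(lst, range)
ranges

# vapply makes the expected output type explicit and errors if it is violated
meansChecked <- vapply(lst, mean, FUN.VALUE = numeric(1))
```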
## a b d
## 0.5149336 0.7264648 0.5308854
Now that we have an idea of what the apply family of functions do, we can look specifically at apply, which operates on matrices or data.frames. What if we wanted to calculate the mean of every column or row in a data.frame? We could loop over each column or row…
testDF <- data.frame(a=runif(100), b=rpois(100,2), d=rbinom(100,1,0.5))
# over columns
ret <- c()
for(i in 1:ncol(testDF)){
  ret[i] <- mean(testDF[,i])
}
# over rows
ret <- c()
for(i in 1:nrow(testDF)){
  ret[i] <- mean(unlist(testDF[i, ]))
}
Or we could use apply statements.
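A sketch of the apply calls, where the second argument selects the margin to work over (1 for rows, 2 for columns). The data are random, so the numbers will differ from the output shown in these notes, and the names colMeansApply and rowMeansApply are assumptions:

```r
testDF <- data.frame(a = runif(100), b = rpois(100, 2), d = rbinom(100, 1, 0.5))

# over columns (margin 2): one mean per column
colMeansApply <- apply(testDF, 2, mean)
colMeansApply

# over rows (margin 1): one mean per row
rowMeansApply <- apply(testDF, 1, mean)
head(rowMeansApply)
```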
## a b d
## 0.469083 2.000000 0.490000
## [1] 1.12152229 1.49631144 0.73707709 1.68469032 1.26639734 0.82555550
## [7] 0.87387685 1.19143617 0.71481739 2.35170235 0.38762291 0.70784483
## [13] 0.79960014 0.80506469 2.79985396 0.93868430 1.03796539 0.85367128
## [19] 0.97654511 0.00320723 1.48774981 1.47519670 0.96997310 0.84182996
## [25] 1.49191477 0.58143265 1.03971350 0.42935628 0.90874980 0.62243689
## [31] 0.65866024 1.35179540 1.71779201 0.79101765 1.30958502 0.81788804
## [37] 0.80596578 0.77268310 0.43195416 0.42865734 1.00974948 0.52750960
## [43] 0.73019151 0.27450089 1.27699876 1.04488856 1.92626858 1.08529232
## [49] 1.25809380 0.90327604 1.08262631 1.34647928 1.72408794 0.57147791
## [55] 0.44251309 0.66276009 1.95583199 1.68507013 1.05234152 0.38790862
## [61] 0.71660575 0.48574734 1.25181866 0.70360726 0.66633188 0.84078321
## [67] 1.36613855 1.23309436 0.47210247 1.40138343 0.84533758 0.93075785
## [73] 0.46337942 0.86157207 0.69433379 1.07107694 0.22762355 1.26310464
## [79] 0.28622747 1.83946655 0.60442356 0.90493334 1.63832061 1.28150308
## [85] 0.58820865 1.36576519 0.60564570 2.17036041 2.76127076 1.04491107
## [91] 1.30175586 0.56756594 0.01324793 0.36053165 0.39450862 1.31327433
## [97] 1.22232270 0.63959893 0.66841202 0.11338135
One advantage of apply is that indexing rows of a data.frame is a pain, which is why we had to unlist each row in the for loop over rows above. If we do not do this, we get a vector of NA values. This is because a data.frame is a list of vectors, so a single row is itself a one-row data.frame rather than a numeric vector. This is also why column-wise operations on data.frames can be performed using lapply (if we wanted list output) or sapply (if we wanted vector output).
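A short sketch of both points, on freshly generated random data (so values will differ from the output in these notes; colList, colVec, naResult and okResult are hypothetical names):

```r
testDF <- data.frame(a = runif(100), b = rpois(100, 2), d = rbinom(100, 1, 0.5))

# Column-wise means: list output with lapply, vector output with sapply
colList <- lapply(testDF, mean)
colVec  <- sapply(testDF, mean)

# A single row is still a data.frame, so mean() returns NA (with a warning)
naResult <- suppressWarnings(mean(testDF[1, ]))

# unlist() flattens the row into a numeric vector first
okResult <- mean(unlist(testDF[1, ]))
```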
## $a
## [1] 0.469083
##
## $b
## [1] 2
##
## $d
## [1] 0.49
## a b d
## 0.469083 2.000000 0.490000
You are creating a game of rock-paper-scissors. In the game, each player can select their strategy, and the strategy can be different in each trial (where there can be 100s of trials).
I think that the outcome is random, so as a player, I already have decided what I’m going to play before the game starts.
Write a for loop to simulate a rock-paper-scissors game of 500 trials between two players, where my strategy above is the strategy of one of the players.
How would you go about changing the strategy of the other player to beat my strategy?
How would you modify your strategy to be adaptive? For instance, if your opponent selects ‘rock’ twice in a row, it may be unlikely that they’ll select ‘rock’ again. How do you incorporate this into the code?
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.2 plyr_1.8.8
##
## loaded via a namespace (and not attached):
## [1] vctrs_0.6.3 cli_3.6.1 knitr_1.43 rlang_1.1.1
## [5] xfun_0.39 generics_0.1.3 jsonlite_1.8.7 glue_1.6.2
## [9] htmltools_0.5.5 tinytex_0.45 sass_0.4.7 fansi_1.0.4
## [13] rmarkdown_2.23 evaluate_0.21 jquerylib_0.1.4 tibble_3.2.1
## [17] fastmap_1.1.1 yaml_2.3.7 lifecycle_1.0.3 compiler_4.3.1
## [21] pkgconfig_2.0.3 Rcpp_1.0.11 digest_0.6.33 R6_2.5.1
## [25] tidyselect_1.2.0 utf8_1.2.3 pillar_1.9.0 magrittr_2.0.3
## [29] bslib_0.5.0 tools_4.3.1 cachem_1.0.8