Visualizations help identify interesting features of the data: how the data are distributed, which values are outliers, and potential mistakes in data entry.
Visualizations are often more compelling than tables of statistical tests at showing relationships (and interactions) between variables, and they convey more information than the simple means and standard deviations that would be reported in a table.
Visualizations can be striking ways to show relationships. Nearly all academic manuscripts will contain at least one or two figures, and knowing how to make data-rich figures is an important skill.
Visualizations in R all start by identifying the plotting area (the regions of space where the actual figure goes relative to the margins and other bits). This is done using the par function before a plot is created. Even if you do not explicitly use par, there are default arguments to par which are used whenever you call a plotting function (e.g., plot).
par takes a number of arguments. I will only go over a couple of common ones that are important to making figures without unnecessary white space. The most important of these is mar, which sets up the exterior plotting margins; that is, the amount of whitespace on each side of the plotting area. Running this code should open up a blank plotting window, but with margins present. The input to mar is a vector of 4 values, identifying the amount of space (in lines of text) for the bottom, left, top, and right margins, respectively.
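The chunk itself isn't echoed here; a minimal sketch consistent with the question below (4 lines of margin on the bottom and left, 0.5 lines on the top and right) would be:

par(mar=c(4, 4, 0.5, 0.5))   # bottom, left, top, right margins, in lines
plot.new()                   # opens a blank plotting window
box()                        # draw the plot region border so the margins are visible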
What does the above code identify the margins as? Why do we opt for values of 4 for the first two margins and 0.5 for the last two?
Histograms are useful for starting to explore the distribution of your data. The goal is to show frequencies: the number of times your data fall into a given bin or range of values.
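The chunk that generated the histogram isn't echoed here; a minimal sketch, assuming 100 draws (the text below expects roughly 10 observations in each of 10 bins):

uniformRV <- runif(100)   # 100 draws from a uniform distribution on [0, 1]
hist(uniformRV)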
What we have done above is to create a vector of pulls from a uniform distribution (a distribution where every value between two bounds is equally likely to be sampled). The default range is between 0 and 1, so we should expect the histogram to look fairly ‘flat’, in that each bin in the histogram should have roughly the same frequency. Here we have 10 bins, so we should see about 10 observations in each bin. We can control the number of bins as well, using the breaks argument. This is useful because it also allows us more fine-scale control of how the data are binned.
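For instance (the exact breakpoints in the original chunk aren't shown, so these values are illustrative):

hist(uniformRV, breaks=20)                        # request roughly 20 equal-width bins
hist(uniformRV, breaks=c(0, 0.1, 0.25, 0.5, 1))   # explicit, unequal bin edges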
Unequal bin sizes are sometimes useful for displaying the data, but notice that the y-axis switched in terms of what is actually being represented: it went from “frequency” (the count of observations falling into each bin) to “density”. For now, we won't go into this change or why it's important. Instead, let's make this plot a bit less ugly.
Most of the time, when we are thinking about plotting, we want to show the relationship between two or more variables. To do this, we can use the base plot function (which can handle data of many different types, e.g., numeric, factor, etc.). So let's create a second variable (uniformRV2) and explore the relationship between the two continuous variables.
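The chunk isn't echoed here; a sketch, where the bounds of uniformRV2 are an assumption (chosen so the threshold of 50 used later actually splits the points):

uniformRV2 <- runif(100, min=0, max=100)   # assumed bounds; 100 values to match uniformRV
plot(uniformRV, uniformRV2)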
And it looks like there's no relationship between the variables (because there isn't, right?). Another thing to note is that the order matters (x is the first argument to the plot function, so we could be more explicit about how we hand variables to the plotting function). And we can also start making this prettier. hist had a smaller list of arguments that we could use to manipulate the display of the data. plot has many more arguments, as plot is the main function for visualization in R. In fact, packages written in R that deal with different types of data work with the plot function to display data well beyond what plot was initially designed for, as we will see when we start visualizing maps and other fun stuff. Take some time and explore what arguments plot can take by issuing the command ?plot in the R console.
plot(x=uniformRV, y=uniformRV2,
  pch=16, cex=2, las=1,   # filled circles, doubled point size, horizontal axis labels
  xlab='Our first uniform variable',
  ylab='Our second uniform variable',
  col='firebrick')
It is important to note here that (as in many other areas of programming) the plot function will try to work with whatever information you give it, and this can sometimes lead you down weird paths. For instance, the col, cex, and pch arguments can accept vectors, where it expects you to provide a vector of the same length as the number of data points you have. This can be useful, as in the example below, where we give all points above a threshold a different color or point shape (pch).
plot(x=uniformRV, y=uniformRV2,
  pch=c(16,17)[1 + (uniformRV2 > 50)],   # triangles (17) for points above 50, circles (16) otherwise
  cex=2, las=1,
  xlab='Our first uniform variable',
  ylab='Our second uniform variable',
  col='firebrick')
This builds on what we previously learned about conditionals, as well as what we've learned about indexing vectors. Let's break this down. For the pch argument above, we create a vector of two values (c(16,17)) and then index it based on the output of a conditional.
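The output below is that index vector itself:

1 + (uniformRV2 > 50)   # FALSE/TRUE coerce to 0/1, so this yields indices 1 or 2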
## [1] 1 1 1 1 2 1 2 1 1 2 2 1 1 2 2 2 2 1 2 1 1 1 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2
## [38] 2 1 2 1 1 1 1 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 2 1 1 2 2 2 1 2
## [75] 2 2 2 2 2 1 2 1 1 2 1 2 2 2 1 1 1 1 2 2 2 2 1 2 2 2
Do the same as above, but index different colors for points based on their position along the x-axis (uniformRV).
A common visualization is to show the mean and standard deviation of some continuous values across treatments. For instance, let’s say we have an experiment where we expose plants to different combinations of nitrogen and phosphorous. If we have a single level of each of N and P, then it means we have 4 treatments (control, N, P, N+P), right?
What is the importance of the control in this experiment?
# 100 draws per treatment: rnorm(n, mean, sd)
control <- rnorm(100, 2, 0.1)
N <- rnorm(100, 3, 0.25)
P <- rnorm(100, 3.5, 0.5)
NP <- rnorm(100, 6, 0.4)
How would you first start to visualize the data?
There are many different ways we could visualize the differences among treatments, depending on the amount of data we want to show. I’ll start with the simplest, and then we can spend some time thinking about other ways to display the data.
So let’s say I only want to see the means and standard deviations of the treatments. This means my x-axis will be treatment, and the y-axis will be the values obtained from the experiment in terms of plant performance.
plantDF <- data.frame(performance=c(control, N, P, NP),
  treatment=c(rep('control', 100), rep('N', 100),
    rep('P', 100), rep('NP', 100)))
plot(as.factor(plantDF$treatment), plantDF$performance)
Wait, what have we plotted? And why are there two ways to get the exact same plot?
This has to do with the plot function being the R multi-tool of visualization. I hand it a categorical variable and a continuous variable, and it defaults to visualizing these as a boxplot. There is also the boxplot function, which does the same thing but uses the formula interface. The nice part about boxplot is that you can also hand it as many vectors as you want and it'll add more levels to the plot.
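A sketch of the two equivalent routes (the names argument in the multi-vector form is my addition, to label the boxes):

boxplot(performance ~ treatment, data=plantDF)                   # formula interface
boxplot(control, N, P, NP, names=c('control', 'N', 'P', 'NP'))   # one box per vector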
But wait…something else interesting happened here, right? There was a shift between the variables in the third and fourth positions. What’s going on there?
R is ordering the x-axis alphabetically, such that our NP treatment is in the third position for the first two plots, but last when we explicitly hand the boxplot function the order we want.
plantDF$trtFactor <- factor(plantDF$treatment, levels=unique(plantDF$treatment))
plot(plantDF$trtFactor, plantDF$performance)
This was a bit of a hacky solution, as I knew the order of the levels when I called unique. So unique provided a vector of the unique treatment levels, but it did so sequentially (the way I had the data formatted forced it to be control, N, P, NP). We can also specify the order of the levels ourselves.
plantDF$trtFactor <- factor(plantDF$treatment, levels=c('control', 'N', 'P', 'NP'))
plot(plantDF$trtFactor, plantDF$performance)
So there’s one way to visualize these data. Let’s spend the next 10 minutes seeing what you can do to improve this visualization, either aesthetically, or by plotting it differently (are we seeing too much detail? not enough detail?)
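The summary output below came from a linear model; the chunk isn't echoed here, but the Call line (and the later reference to plantMod) indicates it was:

plantMod <- lm(plantDF$performance ~ plantDF$trtFactor)
summary(plantMod)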
##
## Call:
## lm(formula = plantDF$performance ~ plantDF$trtFactor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.14935 -0.16576 0.00278 0.18125 1.03590
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.99999 0.03444 58.06 <2e-16 ***
## plantDF$trtFactorN 0.97439 0.04871 20.00 <2e-16 ***
## plantDF$trtFactorP 1.50148 0.04871 30.82 <2e-16 ***
## plantDF$trtFactorNP 3.97553 0.04871 81.61 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3444 on 396 degrees of freedom
## Multiple R-squared: 0.9482, Adjusted R-squared: 0.9478
## F-statistic: 2417 on 3 and 396 DF, p-value: < 2.2e-16
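We can also test for differences between specific treatment pairs; for instance, a Welch two-sample t-test comparing the control to the N addition: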
t.test(plantDF$performance[which(plantDF$trt=='control')],
plantDF$performance[which(plantDF$trt=='N')])
##
## Welch Two Sample t-test
##
## data: plantDF$performance[which(plantDF$trt == "control")] and plantDF$performance[which(plantDF$trt == "N")]
## t = -36.571, df = 143.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -1.0270506 -0.9217221
## sample estimates:
## mean of x mean of y
## 1.999986 2.974372
There are often things that cannot be done within a single plotting call; that is, we need to add something to the plotting area (e.g., a legend). R has numerous functions that add things to existing plots, allowing complete control over what gets plotted.
plot(plantDF$trtFactor, plantDF$performance)
title('Look at the differences when we add N and P!', line=1)
points(y=plantDF$performance, x=plantDF$trtFactor,
col=adjustcolor(1,0.25), pch=16)
legend('topleft', legend=c('control', 'N', 'P', 'NP'),
col=1:4, pch=16, bty='n')
Make the colors of the additional points (or of the boxes in the boxplot) match the colors we have supplied in the legend.
In practice, this is not a great visualization. That is, we don't need the legend, as the x-axis already contains all of that information. The goal with plotting is to keep the output as simple as possible while layering on the necessary and important information. This gets at some of the underlying ideas about what makes compelling visualizations. One of the central figures in this field is Edward Tufte, who emphasizes some core principles of visualization design:
The representation of numbers, as physically measured on the surface of the graph itself, should be directly proportional to the numerical quantities represented
Clear, detailed and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graph itself. Label important events in the data.
Show data variation, not design variation.
In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.
The number of information carrying (variable) dimensions depicted should not exceed the number of dimensions in the data. Graphics must not quote data out of context.
Above we have gone over some pretty typical types of visualizations, but there are likely many more (e.g., we'll go over spatial visualizations later in the semester). Recall that model we fit above for the plant data (plantMod)? The model is a linear model object, which we called summary on to see some of the results. summary is a base function that can work with many different variable classes and types of data. To emphasize this, look at the help files for ?summary and for ?summary.lm. It's the same function, but summary.lm is designed to report the results of the linear model. plot is also like this.
We won't go over all of what these mean, but they are essentially diagnostic checks on the assumptions of the underlying linear model. These plots are useful for visually checking the assumptions of your analyses.
But there are other types of plots we may be interested in. How about heatmaps, which are really useful for showing the interacting effects of two variables on a response?
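The original heatmap chunk isn't echoed here; a minimal sketch using the built-in volcano matrix, first in base R and then in lattice (where the color argument is col.regions rather than col):

image(volcano, col=hcl.colors(25))                # base R heatmap of a matrix

library(lattice)
levelplot(volcano, col.regions=hcl.colors(25))    # lattice equivalent; note the different argument name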
Note that many of the arguments are shared across base R plotting (e.g., col, xlab, axes, las, pch, etc.). This common language is a strength of base R plotting. Some of these argument names are used in other packages (e.g., xlab/ylab in lattice), but try to edit the levelplot code and get ready for some confusion (try to change the color ramp).
Static visualizations are most of what we do as scientists, because often the figures are used in presentations and publications. However, interactive graphics that allow users to view tooltips (data that is shown on hover or click) can be really nice for communicating data.
library(plotly)
fig <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, color = ~Species)
fig
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
This is useful, but the tooltips just tell us the x and y positions. Below, we modify this to show custom hover text: the price and cut of each diamond.
d <- diamonds[sample(nrow(diamonds), 1000), ]
fig <- plot_ly(
d, x = ~carat, y = ~price,
# Hover text:
text = ~paste("Price: ", price, '$<br>Cut:', cut),
color = ~carat, size = ~carat
)
fig
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning: `line.width` does not currently support multiple values.
Hopefully that worked. It was storing the plot in a temporary folder and not showing me the html on my computer, so fingers crossed that's just my machine. If not, no worries; this is supposed to just be a quick overview. A final note is that the above plots differ from base plot in another way, apart from being interactive: they follow a layered ‘grammar of graphics’ approach, the same idea behind the ggplot2 package (the ‘gg’ is short for grammar of graphics). ggplot2 was developed by Hadley Wickham (Posit) and is now supported by a team of developers and used by a lot of folks. It relies on the use of linked functions to modify an existing plot. For instance,
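The chunk isn't echoed above; based on the description that follows, it was presumably:

library(ggplot2)
ggplot(mpg, aes(displ, hwy))   # sets up axes from the mpg columns, but draws no data yet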
This only sets up the plotting window. It sets up the axes to correspond to our chosen variables (aes(displ, hwy, ...) corresponds to those columns in the mpg data.frame). To actually get it to plot the data, though, we have to specify a layer such as geom_point(), as sketched below.
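A sketch of adding layers (the violin version is my guess at what produced the warning below; a violin on a continuous x like displ makes little sense and triggers the overlapping-intervals complaint):

ggplot(mpg, aes(displ, hwy)) +
  geom_point()    # the point layer actually draws the data

# assumed reconstruction of the ill-advised violin plot referenced below
ggplot(mpg, aes(displ, hwy)) +
  geom_violin()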
## Warning: `position_dodge()` requires non-overlapping x intervals
This provides a lot of flexibility, and lots and lots of room for fun errors (e.g., the violin plot from above makes zero sense).
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3; LAPACK version 3.10.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: America/New_York
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plotly_4.10.2 ggplot2_3.4.2 dplyr_1.1.2 plyr_1.8.8
##
## loaded via a namespace (and not attached):
## [1] viridis_0.6.3 sass_0.4.7 utf8_1.2.3 generics_0.1.3
## [5] tidyr_1.3.0 lattice_0.21-8 digest_0.6.33 magrittr_2.0.3
## [9] evaluate_0.21 grid_4.3.1 RColorBrewer_1.1-3 fastmap_1.1.1
## [13] jsonlite_1.8.7 tinytex_0.45 gridExtra_2.3 httr_1.4.6
## [17] purrr_1.0.1 fansi_1.0.4 crosstalk_1.2.0 viridisLite_0.4.2
## [21] scales_1.2.1 lazyeval_0.2.2 jquerylib_0.1.4 cli_3.6.1
## [25] rlang_1.1.1 ellipsis_0.3.2 munsell_0.5.0 withr_2.5.0
## [29] cachem_1.0.8 yaml_2.3.7 tools_4.3.1 colorspace_2.1-0
## [33] vctrs_0.6.3 R6_2.5.1 lifecycle_1.0.3 htmlwidgets_1.6.2
## [37] pkgconfig_2.0.3 hexbin_1.28.3 pillar_1.9.0 bslib_0.5.0
## [41] gtable_0.3.3 glue_1.6.2 data.table_1.14.8 Rcpp_1.0.11
## [45] xfun_0.39 tibble_3.2.1 tidyselect_1.2.0 highr_0.10
## [49] knitr_1.43 farver_2.1.1 htmltools_0.5.5 labeling_0.4.2
## [53] rmarkdown_2.23 compiler_4.3.1