Why is data visualization important?

exploratory data analysis

Visualizations help identify interesting things about the data. This includes how the data are distributed, values that are outliers, and potential mistakes in data entry.

clear demonstration of results relative to tables or statistics

Visualizations are often more compelling at showing relationships between variables (and interactions between variables) than tables of statistical tests, and show more information than simple mean and standard deviations which would be reported in a table.

pretty

Visualizations can be striking ways to show relationships. Nearly all academic manuscripts will contain at least one or two figures, and knowing how to make data-rich figures is an important skill.

Visualizations in R all start by identifying the plotting area (the regions of space where the actual figure goes relative to the margins and other bits). This is done using the par function before a plot is created. Even if you do not explicitly use par, there are default arguments to par which are used whenever you call a plotting function (e.g., plot).

setting up the plotting window

par takes a number of arguments. I will only go over a couple of common ones that are important to making figures which eliminate unnecessary white space. The most important of these is mar, which sets up the exterior plotting margins. This is the amount of whitespace on either side of the plotting area.

par(mar=c(4,4,0.5,0.5))

Running this code should open up a blank plotting window, but with margins present. The input to mar is a vector of 4 values, identifying the amount of space (in lines, I believe) for the bottom, left, top, and right plotting areas, respectively.

What does the above code identify the margins as? Why do we opt to have values of 4 for the first 2 and then values of 0.5 for the second 2?

histograms

Histograms are useful to explore the distribution of your data. The goal is to show frequencies (the number of times your data falls into a given bin or range of values). This is useful to start to explore the distribution of your data.

uniformRV <- runif(100)
hist(uniformRV)

What we have done above is to create a vector of pulls from a uniform distribution (a distribution where every value between two bounds is equally likely of being sampled). The default range is between 0 and 1, so we should expect the histogram to look fairly ‘flat’, in that we expect that each bin in the histogram should have roughly the same frequency. Here we have 10 bins, so we should see about 10 observations in each bin. We can control the number of bins as well, using the breaks argument. This is useful because it also allows us more fine-scale control of how the data are binned.

hist(uniformRV, breaks=50)

hist(uniformRV, breaks=c(0,0.25,0.5,0.75,1))

hist(uniformRV, breaks=c(0,0.25,0.5,1))

Unequal bin sizes are sometimes useful for displaying the data, but notice that there was a switch on the y-axis in terms of what is actually being represented? It went from “frequency” (count of number of observations falling into that bin) to “density”. For now, we won’t go into this change or why it’s important. Instead, let’s make this plot a bit less ugly.

par(mar=c(4,4,0.5,0.5))
hist(uniformRV, breaks=10, 
  xlab='Our variable', ylab='Frequency', 
  main='', las=1, 
  col=adjustcolor('dodgerblue', 0.5), 
  border='dodgerblue')

scatterplots

Most of the time, when we are thinking about plotting, we want to show the relationship between two or more variables. To do this, we can use the base plot function (which can powerfully handle many data of different types e.g., numeric, factor, etc.). So let’s create a second variable (uniformRV2) and explore the relationship between the two continuous variables.

uniformRV2 <- runif(100, 10, 100)

plot(uniformRV, uniformRV2)

And it looks like there’s no relationship between the variables (because there isn’t, right?). Another thing to note is that the order matters (x is the first argument to the plot function, so we could be more explicit about how we hand variables to the plotting function). And we can also start making this prettier. hist had a smaller list of arguments that we could hand to manipulate the display of the data. plot has many more arguments, as plot is the main function for visualization in R. In fact, packages written in R that deal with different types of data work with the plot argument to display data well beyond what the plot function was initially designed for, as we will see when we start visualizing maps and other fun stuff. Take some time and explore what arguments plot can take by issuing the command ?plot into the R console.

plot(x=uniformRV, y=uniformRV2, 
  pch=16, cex=2, las=1, 
  xlab = 'Our first uniform variable', 
  ylab='Our second uniform variable', 
  col='firebrick')

It is important to note here that (as in many other areas of programming), the plot function will try to work with whatever information you give it, and this can sometimes lead you down weird paths. For instance, the col, cex, and pch arguments can accept vectors, where it expects you to provide a vector of the same length as the number of data points you have. This can be useful, as in the example below where we can color all points above a threshold a different color or point shape (pch).

plot(x=uniformRV, y=uniformRV2, 
  pch=c(16,17)[1+(uniformRV2 > 50)], 
  cex=2, las=1, 
  xlab = 'Our first uniform variable', 
  ylab='Our second uniform variable', 
  col='firebrick')

This builds on what we previously learned about conditionals, as well as what we’ve learned about indexing vectors. Let’s break this down. For the pch argument above, we create a vector of two values (c(16,17)) and then index these based on the output of a conditional.

1+(uniformRV2 > 50)
##   [1] 1 1 1 1 2 1 2 1 1 2 2 1 1 2 2 2 2 1 2 1 1 1 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2
##  [38] 2 1 2 1 1 1 1 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 2 1 2 1 2 1 2 2 1 1 2 2 2 1 2
##  [75] 2 2 2 2 2 1 2 1 1 2 1 2 2 2 1 1 1 1 2 2 2 2 1 2 2 2

Do the same as above, but indexing different colors for points on the x-axis (uniformRV).

Showing mean and standard deviation

A common visualization is to show the mean and standard deviation of some continuous values across treatments. For instance, let’s say we have an experiment where we expose plants to different combinations of nitrogen and phosphorous. If we have a single level of each of N and P, then it means we have 4 treatments (control, N, P, N+P), right?

What is the importance of the control in this experiment?

control <- rnorm(100, 2, 0.1)
N <- rnorm(100, 3, 0.25)
P <- rnorm(100, 3.5, 0.5)
NP <- rnorm(100, 6, 0.4)

How would you first start to visualize the data?

There are many different ways we could visualize the differences among treatments, depending on the amount of data we want to show. I’ll start with the simplest, and then we can spend some time thinking about other ways to display the data.

So let’s say I only want to see the means and standard deviations of the treatments. This means my x-axis will be treatment, and the y-axis will be the values obtained from the experiment in terms of plant performance.

plantDF <- data.frame(performance=c(control, N, P, NP), 
  treatment=c(
    rep('control',100), rep('N', 100), 
    rep('P', 100), rep('NP',100))
)

plot(as.factor(plantDF$treatment), plantDF$performance)

boxplot(plantDF$performance ~ plantDF$treatment)

Wait, what have we plotted? And why are there two ways to get the exact same plot?

This has to do with the plotting function being the R multi-tool in terms of visualization. I hand it a categorical variable and a continuous variable, and it defaults to visualizing these as a boxplot. There is also the boxplot function, which does the same thing, but needs the formula interface. The nice part about boxplot is that you can hand it as many vectors as you want and it’ll create more levels to the plot.

boxplot(control, N, P, NP)

But wait…something else interesting happened here, right? There was a shift between the variables in the third and fourth positions. What’s going on there?

R is ordering the x-axis alphabetically, such that our NP treatment is in the third position for the first two plots, but last when we explicitly hand the boxplot function the order we want.

plantDF$trtFactor <- factor(plantDF$treatment, levels=unique(plantDF$treatment))

plot(plantDF$trtFactor, plantDF$performance)

This was a bit of a hacky solution, as I knew the order of the levels when I called unique. So unique provided a vector of the unique treatment levels, but it did so sequentially (so the way I had the data formatted forced it to be control, N, P, NP). We can also specify the order of the levels ourselves.

plantDF$trtFactor <- factor(plantDF$treatment, levels=c('control', 'N', 'P', 'NP'))

plot(plantDF$trtFactor, plantDF$performance)

So there’s one way to visualize these data. Let’s spend the next 10 minutes seeing what you can do to improve this visualization, either aesthetically, or by plotting it differently (are we seeing too much detail? not enough detail?)

plantMod <- lm(plantDF$performance ~ plantDF$trtFactor)
summary(plantMod)
## 
## Call:
## lm(formula = plantDF$performance ~ plantDF$trtFactor)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.14935 -0.16576  0.00278  0.18125  1.03590 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          1.99999    0.03444   58.06   <2e-16 ***
## plantDF$trtFactorN   0.97439    0.04871   20.00   <2e-16 ***
## plantDF$trtFactorP   1.50148    0.04871   30.82   <2e-16 ***
## plantDF$trtFactorNP  3.97553    0.04871   81.61   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3444 on 396 degrees of freedom
## Multiple R-squared:  0.9482, Adjusted R-squared:  0.9478 
## F-statistic:  2417 on 3 and 396 DF,  p-value: < 2.2e-16
t.test(plantDF$performance[which(plantDF$trt=='control')], 
  plantDF$performance[which(plantDF$trt=='N')])
## 
##  Welch Two Sample t-test
## 
## data:  plantDF$performance[which(plantDF$trt == "control")] and plantDF$performance[which(plantDF$trt == "N")]
## t = -36.571, df = 143.85, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.0270506 -0.9217221
## sample estimates:
## mean of x mean of y 
##  1.999986  2.974372

Adding things on top of visualizations

There are often many things that cannot be done with a single plotting call. That is, we need to add something to the plotting area (e.g., a legend). R has numerous functions that add things to existing plots, allowing complete controllability of what gets plotted.

plot(plantDF$trtFactor, plantDF$performance)
title('Look at the differences when we add N and P!', line=1)
points(y=plantDF$performance, x=plantDF$trtFactor,
  col=adjustcolor(1,0.25), pch=16)

legend('topleft', legend=c('control', 'N', 'P', 'NP'), 
  col=1:4, pch=16, bty='n')

Make the colors of the additional points (or of the boxes in the boxplot) match the colors that we have supplied in the legend?

In practice, this is not a great visualization. That is, we don’t need the legend, as the x-axis already contains all that of that information. The goal with plotting is to keep the output as simple as possible while layering on the necessary and important information. This gets at some of the underlying ideas around what makes compelling visualizations. One of the figures in this field is Edward Tufte, who emphasizes some core principles of visualization design:

  • The representation of numbers, as physically measured on the surface of the graph itself, should be directly proportional to the numerical quantities represented

  • Clear, detailed and thorough labeling should be used to defeat graphical distortion and ambiguity. Write out explanations of the data on the graph itself. Label important events in the data.

  • Show data variation, not design variation.

  • In time-series displays of money, deflated and standardized units of monetary measurement are nearly always better than nominal units.

  • The number of information carrying (variable) dimensions depicted should not exceed the number of dimensions in the data. Graphics must not quote data out of context.

What other types of plotting do we do?

Above we have gone over some pretty typical types of visualizations, but there are likely many more (e.g., we’ll go over spatial visualizations later in the semester). Recall that model we fit above for the plant data (plantMod)? The model is a linear model object, which we called summary on to see some of the results. summary is a base function that can work with many different variable classes and types of data. To emphasize this, look at the help files for ?summary and for ?summary.lm. It’s the same function, but summary.lm is designed to report the results of the linear model. plot is also like this.

plot(plantMod)

We won’t go over all of what these mean, but they are essentially checks on the data to test the assumptions of the underlying linear model. This is useful plotting for visually checking the assumptions of your analyses.

But there are other types of plotting we may be interested in. How about heatmaps, which are really useful at showing the interacting effects of two variables on a response.

tmp <- 1:10 %*% t(1:10)

# a base R version 
image(tmp)

image(tmp, col=viridis::viridis(100), 
  xlab='Variable 1', ylab='Variable 2', 
  las=1)

# a lattice version 
lattice::levelplot(tmp)

Note that many of the arguments are shared across base R plotting (e.g., col, xlab, axes, las, pch, etc. etc. etc.). This common language is a strength of base R plotting. Some of these variable names are used in other packages (e.g., xlab/ylab in lattice), but try to edit the levelplot code and get ready for some confusion (try to change the color ramp).

Interactive visualizations

Static visualizations are most of what we do as scientists, because often the figures are used in presentations and publications. However, interactive graphics that allow users to view tooltips (data that is shown on hover or click) can be really nice for communicating data.

library(plotly)

fig <- plot_ly(data = iris, x = ~Sepal.Length, y = ~Petal.Length, color = ~Species)
fig
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode

This is useful, but the tooltips just tell us the x and y positions. Below, we modify this to

d <- diamonds[sample(nrow(diamonds), 1000), ]

fig <- plot_ly(
  d, x = ~carat, y = ~price,
  # Hover text:
  text = ~paste("Price: ", price, '$<br>Cut:', cut),
  color = ~carat, size = ~carat
)

fig
## No trace type specified:
##   Based on info supplied, a 'scatter' trace seems appropriate.
##   Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
##   Setting the mode to markers
##   Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
## Warning: `line.width` does not currently support multiple values.

Hopefully that worked. It was storing the plot in a temporary folder and not showing me the html on my computer, so fingers crossed that’s just my machine. If not, no worries. This is supposed to just be a quick overview. A final note is that the above plots are different from base plot in another way apart from being interactive. They use a plotting system called ggplot2, short for the grammar of graphics. This package was developed by Hadley Wickham (Posit) and is now supported by a team of developers and used by a lot of folks. It relies on the use of linked functions to modify an existing plot. For instance,

library(ggplot2)

ggplot(mpg, aes(displ, hwy, colour = class)) 

This only sets up the plotting window. It sets up the axes to correspond to our chosen variables (aes(displ, hwy... corresponds to those columns in the mpg data.frame). To actually get it to plot the data though, we have to specify a layer such as geom_point().

ggplot(mpg, aes(displ, hwy, colour = class)) + geom_point()

ggplot(mpg, aes(displ, hwy, colour = class)) + geom_hex()

ggplot(mpg, aes(displ, hwy, colour = class)) + geom_violin()
## Warning: `position_dodge()` requires non-overlapping x intervals

This provides a lot of flexibility, and lots and lots of room for fun errors (e.g., the violin plot from above makes zero sense).

sessionInfo

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 22.04.3 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/atlas/libblas.so.3.10.3 
## LAPACK: /usr/lib/x86_64-linux-gnu/atlas/liblapack.so.3.10.3;  LAPACK version 3.10.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/New_York
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plotly_4.10.2 ggplot2_3.4.2 dplyr_1.1.2   plyr_1.8.8   
## 
## loaded via a namespace (and not attached):
##  [1] viridis_0.6.3      sass_0.4.7         utf8_1.2.3         generics_0.1.3    
##  [5] tidyr_1.3.0        lattice_0.21-8     digest_0.6.33      magrittr_2.0.3    
##  [9] evaluate_0.21      grid_4.3.1         RColorBrewer_1.1-3 fastmap_1.1.1     
## [13] jsonlite_1.8.7     tinytex_0.45       gridExtra_2.3      httr_1.4.6        
## [17] purrr_1.0.1        fansi_1.0.4        crosstalk_1.2.0    viridisLite_0.4.2 
## [21] scales_1.2.1       lazyeval_0.2.2     jquerylib_0.1.4    cli_3.6.1         
## [25] rlang_1.1.1        ellipsis_0.3.2     munsell_0.5.0      withr_2.5.0       
## [29] cachem_1.0.8       yaml_2.3.7         tools_4.3.1        colorspace_2.1-0  
## [33] vctrs_0.6.3        R6_2.5.1           lifecycle_1.0.3    htmlwidgets_1.6.2 
## [37] pkgconfig_2.0.3    hexbin_1.28.3      pillar_1.9.0       bslib_0.5.0       
## [41] gtable_0.3.3       glue_1.6.2         data.table_1.14.8  Rcpp_1.0.11       
## [45] xfun_0.39          tibble_3.2.1       tidyselect_1.2.0   highr_0.10        
## [49] knitr_1.43         farver_2.1.1       htmltools_0.5.5    labeling_0.4.2    
## [53] rmarkdown_2.23     compiler_4.3.1