Chapter 4 Effective data visualization

4.1 Overview

This chapter will introduce concepts and tools relating to data visualization beyond what we have seen and practiced so far. We will focus on guiding principles for effective data visualization and explaining visualizations independent of any particular tool or programming language. In the process, we will cover some specifics of creating visualizations (scatter plots, bar charts, line graphs, and histograms) for data using R. There are external references that contain a wealth of additional information on the topic of data visualization:

4.2 Chapter learning objectives

  • Describe when to use the following kinds of visualizations to answer specific questions using a dataset:
    • scatter plots
    • line plots
    • bar plots
    • histogram plots
  • Given a data set and a question, select from the above plot types and use R to create a visualization that best answers the question
  • Given a visualization and a question, evaluate the effectiveness of the visualization and suggest improvements to better answer the question
  • Referring to the visualization, communicate the conclusions in nontechnical terms
  • Identify rules of thumb for creating effective visualizations
  • Define the three key aspects of ggplot objects:
    • aesthetic mappings
    • geometric objects
    • scales
  • Use the ggplot2 package in R to create and refine the above visualizations using:
    • geometric objects: geom_point, geom_line, geom_histogram, geom_bar, geom_vline, geom_hline
    • scales: scale_x_continuous, scale_y_continuous
    • aesthetic mappings: x, y, fill, colour, shape
    • labelling: xlab, ylab, labs
    • font control and legend positioning: theme
    • flipping axes: coord_flip
    • subplots: facet_grid
  • Describe the difference in raster and vector output formats
  • Use ggsave to save visualizations in .png and .svg format

4.3 Choosing the visualization

4.3.0.1 Ask a question, and answer it

The purpose of a visualization is to answer a question about a data set of interest. So naturally, the first thing to do before creating a visualization is to formulate the question about the data that you are trying to answer. A good visualization will clearly answer your question without distraction; a great visualization will suggest even what the question was itself without additional explanation. Imagine your visualization as part of a poster presentation for a project; even if you aren’t standing at the poster explaining things, an effective visualization will be able to convey your message to the audience.

Recall the different data analysis questions from the first chapter of this book. With the visualizations we will cover in this chapter, we will be able to answer only descriptive and exploratory questions. Be careful not to try to answer any predictive, inferential, causal or mechanistic questions, as we have not learned the tools necessary to do that properly just yet.

As with most coding tasks, it is totally fine (and quite common) to make mistakes and iterate a few times before you find the right visualization for your data and question. There are many different kinds of plotting graphics available to use. For the kinds we will introduce in this course, the general rules of thumb are:

  • line plots visualize trends with respect to an independent, ordered quantity (e.g. time)
  • histograms visualize the distribution of one quantitative variable (i.e., all its possible values and how often they occur)
  • scatter plots visualize the distribution / relationship of two quantitative variables
  • bar plots visualize comparisons of amounts

All types of visualization have their (mis)uses, but three kinds are usually hard to understand or are easily replaced with an oft-better alternative. In particular, you should avoid pie charts; it is generally better to use bars, as it is easier to compare bar heights than pie slice sizes. You should also not use 3-D visualizations, as they are typically hard to understand when converted to a static 2-D image format. Finally, do not use tables to make numerical comparisons; humans are much better at quickly processing visual information than text and math. Bar plots are again typically a better alternative.

4.4 Refining the visualization

4.4.0.1 Convey the message, minimize noise

Just being able to make a visualization in R with ggplot2 (or any other tool for that matter) doesn’t mean that it effectively communicates your message to others. Once you have selected a broad type of visualization to use, you will have to refine it to suit your particular need. Some rules of thumb for doing this are listed below. They generally fall into two classes: you want to make your visualization convey your message, and you want to reduce visual noise as much as possible. Humans have limited cognitive ability to process information; both of these types of refinement aim to reduce the mental load on your audience when viewing your visualization, making it easier for them to understand and remember your message quickly.

Convey the message

  • Make sure the visualization answers the question you have asked most simply and plainly as possible.
  • Use legends and labels so that your visualization is understandable without reading the surrounding text.
  • Ensure the text, symbols, lines, etc., on your visualization are big enough to be easily read.
  • Ensure the data are clearly visible; don’t hide the shape/distribution of the data behind other objects (e.g. a bar).
  • Make sure to use colour schemes that are understandable by those with colourblindness (a surprisingly large fraction of the overall population). For example, colorbrewer.org and the RColorBrewer R package provide the ability to pick such colour schemes, and you can check your visualizations after you have created them by uploading to online tools such as the colour blindness simulator.
  • Redundancy can be helpful; sometimes conveying the same message in multiple ways reinforces it for the audience.

Minimize noise

  • Use colours sparingly. Too many different colours can be distracting, create false patterns, and detract from the message.
  • Be wary of overplotting. If your plot has too many dots or lines and starts to look like a mess, you need to do something different.
  • Only make the plot area (where the dots, lines, bars are) as big as needed. Simple plots can be made small.
  • Don’t adjust the axes to zoom in on small differences. If the difference is small, show that it’s small!

4.5 Creating visualizations with ggplot2

4.5.0.1 Build the visualization iteratively

This section will cover examples of how to choose and refine a visualization given a data set and a question that you want to answer, and then how to create the visualization in R using ggplot2. To use the ggplot2 package, we need to load the tidyverse metapackage.

library(tidyverse)

4.5.1 The Mauna Loa CO2 data set

The Mauna Loa CO2 data set, curated by Dr. Pieter Tans, NOAA/GML and Dr. Ralph Keeling, Scripps Institution of Oceanography records the atmospheric concentration of carbon dioxide (CO2, in parts per million) at the Mauna Loa research station in Hawaii from 1959 onwards.

Question: Does the concentration of atmospheric CO2 change over time, and are there any interesting patterns to note?

# mauna loa carbon dioxide data
co2_df <- read_csv("data/mauna_loa.csv") %>%
  filter(ppm > 0, date_decimal < 2000)
co2_df
## # A tibble: 495 x 4
##     year month date_decimal   ppm
##    <dbl> <dbl>        <dbl> <dbl>
##  1  1958     3        1958.  316.
##  2  1958     4        1958.  317.
##  3  1958     5        1958.  318.
##  4  1958     7        1959.  316.
##  5  1958     8        1959.  315.
##  6  1958     9        1959.  313.
##  7  1958    11        1959.  313.
##  8  1958    12        1959.  315.
##  9  1959     1        1959.  316.
## 10  1959     2        1959.  316.
## # … with 485 more rows

Since we are investigating a relationship between two variables (CO2 concentration and date), a scatter plot is a good place to start. Scatter plots show the data as individual points with x (horizontal axis) and y (vertical axis) coordinates. Here, we will use the decimal date as the x coordinate and CO2 concentration as the y coordinate. When using the ggplot2 package, we create the plot object with the ggplot function; there are a few basic aspects of a plot that we need to specify:

  • the data: the name of the data frame object that we would like to visualize
    • here, we specify the co2_df data frame
  • the aesthetic mapping: tells ggplot how the columns in the data frame map to properties of the visualization
    • to create an aesthetic mapping, we use the aes function
    • here, we set the plot x axis to the date_decimal variable, and the plot y axis to the ppm variable
  • the geometric object: specifies how the mapped data should be displayed
    • to create a geometric object, we use a geom_* function (see the ggplot reference for a list of geometric objects)
    • here, we use the geom_point function to visualize our data as a scatterplot

We could pass many other possible arguments to the aesthetic mapping and geometric object to change how the plot looks. For the purposes of quickly testing things out to see what they look like, though, we can just go with the default settings:

co2_scatter <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
  geom_point()
co2_scatter
Scatter plot of atmospheric concentration of CO2 over time

Figure 4.1: Scatter plot of atmospheric concentration of CO2 over time

Certainly, the visualization shows a clear upward trend in the atmospheric concentration of CO2 over time. This plot answers the first part of our question in the affirmative, but that appears to be the only conclusion one can make from the scatter visualization. However, since time is an ordered quantity, we can try using a line plot instead using the geom_line function. Line plots require that their x coordinate orders the data, and connect the sequence of x and y coordinates with line segments. Let’s again try this with just the default arguments:

co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
  geom_line()
co2_line
Line plot of atmospheric concentration of CO2 over time

Figure 4.2: Line plot of atmospheric concentration of CO2 over time

Aha! There is another interesting phenomenon in the data: in addition to increasing over time, the concentration seems to oscillate as well. Given the visualization as it is now, it is still hard to tell how fast the oscillation is, but nevertheless, the line seems to be a better choice for answering the question than the scatter plot was. The comparison between these two visualizations illustrates a common issue with scatter plots: often, the points are shown too close together or even on top of one another, muddling information that would otherwise be clear (overplotting).

Now that we have settled on the rough details of the visualization, it is time to refine things. This plot is fairly straightforward, and there is not much visual noise to remove. But there are a few things we must do to improve clarity, such as adding informative axis labels and making the font a more readable size. To add axis labels, we use the xlab and ylab functions. To change the font size, we use the theme function with the text argument:

co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
  geom_line() +
  xlab("Year") +
  ylab("Atmospheric CO2 (ppm)") +
  theme(text = element_text(size = 18))
co2_line
Line plot of atmospheric concentration of CO2 over time with clearer axes and labels

Figure 4.3: Line plot of atmospheric concentration of CO2 over time with clearer axes and labels

Finally, let’s see if we can better understand the oscillation by changing the visualization slightly. Note that it is totally fine to use a small number of visualizations to answer different aspects of the question you are trying to answer. We will accomplish this by using scales, another important feature of ggplot2 that easily transforms the different variables and set limits. We scale the horizontal axis using the scale_x_continuous function, and the vertical axis with the scale_y_continuous function. We can transform the axis by passing the trans argument, and set limits by passing the limits argument. In particular, here, we will use the scale_x_continuous function with the limits argument to zoom in on just five years of data (say, 1990-1995):

co2_line <- ggplot(co2_df, aes(x = date_decimal, y = ppm)) +
  geom_line() +
  xlab("Year") +
  ylab("Atmospheric CO2 (ppm)") +
  scale_x_continuous(limits = c(1990, 1995)) +
  theme(text = element_text(size = 18))
co2_line
Line plot of atmospheric concentration of CO2 from 1990 to 1995 only

Figure 4.4: Line plot of atmospheric concentration of CO2 from 1990 to 1995 only

Interesting! It seems that each year, the atmospheric CO2 increases until it reaches its peak somewhere around April, decreases until around late September, and finally increases again until the end of the year. In Hawaii, there are two seasons: summer from May through October, and winter from November through April. Therefore, the oscillating pattern in CO2 matches up fairly closely with the two seasons.

4.5.2 The island landmass data set

The islands.csv data set contains a list of Earth’s landmasses as well as their area (in thousands of square miles).

Question: Are the continents (North / South America, Africa, Europe, Asia, Australia, Antarctica) Earth’s seven largest landmasses? If so, what are the next few largest landmasses after those?

# islands data
islands_df <- read_csv("data/islands.csv")
islands_df
## # A tibble: 48 x 2
##    landmass      size
##    <chr>        <dbl>
##  1 Africa       11506
##  2 Antarctica    5500
##  3 Asia         16988
##  4 Australia     2968
##  5 Axel Heiberg    16
##  6 Baffin         184
##  7 Banks           23
##  8 Borneo         280
##  9 Britain         84
## 10 Celebes         73
## # … with 38 more rows

Here, we have a list of Earth’s landmasses, and are trying to compare their sizes. The right type of visualization to answer this question is a bar plot, specified by the geom_bar function in ggplot2. However, by default, geom_bar sets the heights of bars to the number of times a value appears in a data frame (its count); here, we want to plot exactly the values in the data frame, i.e., the landmass sizes. So we have to pass the stat = "identity" argument to geom_bar:

islands_bar <- ggplot(islands_df, aes(x = landmass, y = size)) +
  geom_bar(stat = "identity")
islands_bar
Bar plot of all Earth's landmasses' size with squished labels

Figure 4.5: Bar plot of all Earth’s landmasses’ size with squished labels

Alright, not bad! This plot is definitely the right kind of visualization, as we can clearly see and compare sizes of landmasses. The major issues are that the smaller landmasses’ sizes are hard to distinguish, and the names of the landmasses are obscuring each other as they have been squished into too little space. But remember that the question we asked was only about the largest landmasses; let’s make the plot a little bit clearer by keeping only the largest 12 landmasses. We do this using the top_n function. Then to help us make sure the labels have enough space, we’ll use horizontal bars instead of vertical ones. We do this using the coord_flip function, which swaps the x and y coordinate axes:

islands_top12 <- top_n(islands_df, 12, size)
islands_bar <- ggplot(islands_top12, aes(x = landmass, y = size)) +
  geom_bar(stat = "identity") +
  coord_flip()
islands_bar
Bar plot of size for Earth's largest 12 landmasses

Figure 4.6: Bar plot of size for Earth’s largest 12 landmasses

This plot is definitely clearer now, and allows us to answer our question (“are the top 7 largest landmasses continents?”) in the affirmative. But the question could be made clearer from the plot by organizing the bars not by alphabetical order but by size, and to colour them based on whether they are a continent. To do this, we use mutate to add a column to the data regarding whether or not the landmass is a continent:

islands_top12 <- top_n(islands_df, 12, size)
continents <- c("Africa", "Antarctica", "Asia", "Australia", "Europe", "North America", "South America")
islands_ct <- mutate(islands_top12, is_continent = ifelse(landmass %in% continents, "Continent", "Other"))
islands_ct
## # A tibble: 12 x 3
##    landmass       size is_continent
##    <chr>         <dbl> <chr>       
##  1 Africa        11506 Continent   
##  2 Antarctica     5500 Continent   
##  3 Asia          16988 Continent   
##  4 Australia      2968 Continent   
##  5 Baffin          184 Other       
##  6 Borneo          280 Other       
##  7 Europe         3745 Continent   
##  8 Greenland       840 Other       
##  9 Madagascar      227 Other       
## 10 New Guinea      306 Other       
## 11 North America  9390 Continent   
## 12 South America  6795 Continent

In order to colour the bars, we add the fill argument to the aesthetic mapping. Then we use the reorder function in the aesthetic mapping to organize the landmasses by their size variable. Finally, we use the labs and theme functions to add labels, change the font size, and position the legend:

islands_bar <- ggplot(islands_ct, aes(x = reorder(landmass, size), y = size, fill = is_continent)) +
  geom_bar(stat = "identity") +
  labs(x = "Landmass", y = "Size (1000 square mi)", fill = "Type") +
  coord_flip() +
  theme(text = element_text(size = 18), legend.position = c(0.75, 0.45))
islands_bar
Bar plot of size for Earth's largest 12 landmasses coloured by whether its a continent with clearer axes and labels

Figure 4.7: Bar plot of size for Earth’s largest 12 landmasses coloured by whether its a continent with clearer axes and labels

This is now a very effective visualization for answering our original questions. Landmasses are organized by their size, and continents are coloured differently than other landmasses, making it quite clear that continents are the largest seven landmasses.

4.5.3 The Old Faithful eruption/waiting time data set

The faithful data set contains measurements of the waiting time between eruptions and the subsequent eruption duration (in minutes). The faithful data set is available in base R under the name faithful so it does not need to be loaded.

Question: Is there a relationship between the waiting time before an eruption to the duration of the eruption?

# old faithful eruption time / wait time data
faithful
## # A tibble: 272 x 2
##    eruptions waiting
##        <dbl>   <dbl>
##  1      3.6       79
##  2      1.8       54
##  3      3.33      74
##  4      2.28      62
##  5      4.53      85
##  6      2.88      55
##  7      4.7       88
##  8      3.6       85
##  9      1.95      51
## 10      4.35      85
## # … with 262 more rows

Here again, we investigate the relationship between two quantitative variables (waiting time and eruption time). But if you look at the output of the data frame, you’ll notice that neither of the columns are ordered. So, in this case, let’s start again with a scatter plot:

faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point()
faithful_scatter
Scatter plot of waiting time and eruption time

Figure 4.8: Scatter plot of waiting time and eruption time

We can see that the data tend to fall into two groups: one with short waiting and eruption times, and one with long waiting and eruption times. Note that in this case, there is no overplotting: the points are generally nicely visually separated, and the pattern they form is clear. In order to refine the visualization, we need only to add axis labels and make the font more readable:

faithful_scatter <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  labs(x = "Waiting Time (mins)", y = "Eruption Duration (mins)") +
  theme(text = element_text(size = 18))
faithful_scatter
Scatter plot of waiting time and eruption time with clearer axes and labels

Figure 4.9: Scatter plot of waiting time and eruption time with clearer axes and labels

4.5.4 The Michelson speed of light data set

The morley data set contains measurements of the speed of light (in kilometres per second with 299,000 subtracted) from the year 1879 for five experiments, each with 20 consecutive runs. This data set is available in base R under the name morley so it does not need to be loaded.

Question: Given what we know now about the speed of light (299,792.458 kilometres per second), how accurate were each of the experiments?

# michelson morley experimental data
morley
## # A tibble: 100 x 3
##     Expt   Run Speed
##    <int> <int> <int>
##  1     1     1   850
##  2     1     2   740
##  3     1     3   900
##  4     1     4  1070
##  5     1     5   930
##  6     1     6   850
##  7     1     7   950
##  8     1     8   980
##  9     1     9   980
## 10     1    10   880
## # … with 90 more rows

In this experimental data, Michelson was trying to measure just a single quantitative number (the speed of light). The data set contains many measurements of this single quantity. To tell how accurate the experiments were, we need to visualize the distribution of the measurements (i.e., all their possible values and how often each occurs). We can do this using a histogram. A histogram helps us visualize how a particular variable is distributed in a data set by separating the data into bins, and then using vertical bars to show how many data points fell in each bin. To create a histogram in ggplot2 we will use the geom_histogram geometric object, setting the x axis to the Speed measurement variable; and as we did before, let’s use the default arguments just to see how things look:

morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram()
morley_hist
Histogram of Michelson's speed of light data

Figure 4.10: Histogram of Michelson’s speed of light data

This is a great start. However, we cannot tell how accurate the measurements are using this visualization unless we can see what the true value is. In order to visualize the true speed of light, we will add a vertical line with the geom_vline function, setting the xintercept argument to the true value. There is a similar function, geom_hline, that is used for plotting horizontal lines. Note that vertical lines are used to denote quantities on the horizontal axis, while horizontal lines are used to denote quantities on the vertical axis.

morley_hist <- ggplot(morley, aes(x = Speed)) +
  geom_histogram() +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist
Histogram of Michelson's speed of light data with vertical line indicating true speed of light

Figure 4.11: Histogram of Michelson’s speed of light data with vertical line indicating true speed of light

We also still cannot tell which experiments (denoted in the Expt column) led to which measurements; perhaps some experiments were more accurate than others. To fully answer our question, we need to separate the measurements from each other visually. We can try to do this using a coloured histogram, where counts from different experiments are stacked on top of each other in different colours. We create a histogram coloured by the Expt variable by adding it to the fill aesthetic mapping. We make sure the different colours can be seen (despite them all sitting on top of each other) by setting the alpha argument in geom_histogram to 0.5 to make the bars slightly translucent:

morley_hist <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) +
  geom_histogram(position = "identity", alpha = 0.5) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist
Histogram of Michelson's speed of light data coloured by experiment

Figure 4.12: Histogram of Michelson’s speed of light data coloured by experiment

Unfortunately, the attempt to separate out the experiment number visually has created a bit of a mess. All of the colours are blending together, and although it is possible to derive some insight from this (e.g., experiments 1 and 3 had some of the most incorrect measurements), it isn’t the clearest way to convey our message and answer the question. Let’s try a different strategy of creating multiple separate histograms on top of one another.

In order to create a plot in ggplot2 that has multiple subplots arranged in a grid, we use the facet_grid function. The argument to facet_grid specifies the variable(s) used to split the plot into subplots. It has the syntax vertical_variable ~ horizontal_variable, where veritcal_variable is used to split the plot vertically, horizontal_variable is used to split horizontally, and . is used if there should be no split along that axis. In our case, we only want to split vertically along the Expt variable, so we use Expt ~ . as the argument to facet_grid.

morley_hist <- ggplot(morley, aes(x = Speed, fill = factor(Expt))) +
  geom_histogram(position = "identity") +
  facet_grid(Expt ~ .) +
  geom_vline(xintercept = 792.458, linetype = "dashed", size = 1.0)
morley_hist
Histogram of Michelson's speed of light data split vertically by experiment

Figure 4.13: Histogram of Michelson’s speed of light data split vertically by experiment

The visualization now makes it quite clear how accurate the different experiments were with respect to one another. There are two finishing touches to make this visualization even clearer. First and foremost, we need to add informative axis labels using the labs function, and increase the font size to make it readable using the theme function. Second, and perhaps more subtly, even though it is easy to compare the experiments on this plot to one another, it is hard to get a sense for just how accurate all the experiments were overall. For example, how accurate is the value 800 on the plot, relative to the true speed of light? To answer this question, we’ll use the mutate function to transform our data into a relative measure of accuracy rather than absolute measurements:

morley_rel <- mutate(morley, relative_accuracy = 100 * ((299000 + Speed) - 299792.458) / (299792.458))
morley_hist <- ggplot(morley_rel, aes(x = relative_accuracy, fill = factor(Expt))) +
  geom_histogram(position = "identity") +
  facet_grid(Expt ~ .) +
  geom_vline(xintercept = 0, linetype = "dashed", size = 1.0) +
  labs(x = "Relative Accuracy (%)", y = "# Measurements", fill = "Experiment ID") +
  theme(text = element_text(size = 18))
morley_hist
Histogram of relative accuracy split vertically by experiment with clearer axes and labels

Figure 4.14: Histogram of relative accuracy split vertically by experiment with clearer axes and labels

Wow, impressive! These measurements of the speed of light from 1879 had errors around 0.05% of the true speed. This shows you that even though experiments 2 and 5 were perhaps the most accurate, all of the experiments did quite an admirable job given the technology available at the time.

4.6 Explaining the visualization

4.6.0.1 Tell a story

Typically, your visualization will not be shown entirely on its own, but rather it will be part of a larger presentation. Further, visualizations can provide supporting information for any aspect of a presentation, from opening to conclusion. For example, you could use an exploratory visualization in the opening of the presentation to motivate your choice of a more detailed data analysis / model, a visualization of the results of your analysis to show what your analysis has uncovered, or even one at the end of a presentation to help suggest directions for future work.

Regardless of where it appears, a good way to discuss your visualization is as a story:

  1. Establish the setting and scope, and motivate why you did what you did.
  2. Pose the question that your visualization answers. Justify why the question is important to answer.
  3. Answer the question using your visualization. Make sure you describe all aspects of the visualization (including describing the axes). But you can emphasize different aspects based on what is important to answer your question:
    • trends (lines): Does a line describe the trend well? If so, the trend is linear, and if not, the trend is nonlinear. Is the trend increasing, decreasing, or neither? Is there a periodic oscillation (wiggle) in the trend? Is the trend noisy (does the line “jump around” a lot) or smooth?
    • distributions (scatters, histograms): How spread out are the data? Where are they centered, roughly? Are there any obvious “clusters” or “subgroups”, which would be visible as multiple bumps in the histogram?
    • distributions of two variables (scatters): is there a clear / strong relationship between the variables (points fall in a distinct pattern), a weak one (points fall in a pattern but there is some noise), or no discernible relationship (the data are too noisy to make any conclusion)?
    • amounts (bars): How large are the bars relative to one another? Are there patterns in different groups of bars?
  4. Summarize your findings, and use them to motivate whatever you will discuss next.

Below are two examples of how one might take these four steps in describing the example visualizations that appeared earlier in this chapter. Each of the steps is denoted by its numeral in parentheses, e.g. (3).

Mauna Loa Atmospheric CO2 Measurements: (1) Many current forms of energy generation and conversion—from automotive engines to natural gas power plants—rely on burning fossil fuels and produce greenhouse gases, typically primarily carbon dioxide (CO2), as a byproduct. Too much of these gases in the Earth’s atmosphere will cause it to trap more heat from the sun, leading to global warming. (2) In order to assess how quickly the atmospheric concentration of CO2 is increasing over time, we (3) used a data set from the Mauna Loa observatory from Hawaii, consisting of CO2 measurements from 1959 to the present. We plotted the measured concentration of CO2 (on the vertical axis) over time (on the horizontal axis). From this plot, you can see a clear, increasing, and generally linear trend over time. There is also a periodic oscillation that occurs once per year and aligns with Hawaii’s seasons, with an amplitude that is small relative to the growth in the overall trend. This shows that atmospheric CO2 is clearly increasing over time, and (4) it is perhaps worth investigating more into the causes.

Michelson Light Speed Experiments: (1) Our modern understanding of the physics of light has advanced significantly from the late 1800s when Michelson and Morley’s experiments first demonstrated that it had a finite speed. We now know based on modern experiments that it moves at roughly 299792.458 kilometres per second. (2) But how accurately were we first able to measure this fundamental physical constant, and did certain experiments produce more accurate results than others? (3) To better understand this we plotted data from 5 experiments by Michelson in 1879, each with 20 trials, as histograms stacked on top of one another. The horizontal axis shows the accuracy of the measurements relative to the true speed of light as we know it today, expressed as a percentage. From this visualization, you can see that most results had relative errors of at most 0.05%. You can also see that experiments 1 and 3 had measurements that were the farthest from the true value, and experiment 5 tended to provide the most consistently accurate result. (4) It would be worth further investigating the differences between these experiments to see why they produced different results.

4.7 Saving the visualization

4.7.0.1 Choose the right output format for your needs

Just as there are many ways to store data sets, there are many ways to store visualizations and images. Which one you choose can depend on several factors, such as file size/type limitations (e.g., if you are submitting your visualization as part of a conference paper or to a poster printing shop) and where it will be displayed (e.g., online, in a paper, on a poster, on a billboard, in talk slides). Generally speaking, images come in two flavours: bitmap (or raster) formats and vector (or scalable graphics) formats.

Bitmap / Raster images are represented as a 2-D grid of square pixels, each with its own colour. Raster images are often compressed before storing so they take up less space. A compressed format is lossy if the image cannot be perfectly recreated when loading and displaying, with the hope that the change is not noticeable. Lossless formats, on the other hand, allow a perfect display of the original image.

  • Common file types:
    • JPEG (.jpg, .jpeg): lossy, usually used for photographs
    • PNG (.png): lossless, usually used for plots / line drawings
    • BMP (.bmp): lossless, raw image data, no compression (rarely used)
    • TIFF (.tif, .tiff): typically lossless, no compression, used mostly in graphic arts, publishing
  • Open-source software: GIMP

Vector / Scalable Graphics images are represented as a collection of mathematical objects (lines, surfaces, shapes, curves). When the computer displays the image, it redraws all of the elements using their mathematical formulas.

  • Common file types:
    • SVG (.svg): general-purpose use
    • EPS (.eps), general-purpose use (rarely used)
  • Open-source software: Inkscape

Raster and vector images have opposing advantages and disadvantages. A raster image of a fixed width / height takes the same amount of space and time to load regardless of what the image shows (caveat: the compression algorithms may shrink the image more or run faster for certain images). A vector image takes space and time to load corresponding to how complex the image is, since the computer has to draw all the elements each time it is displayed. For example, if you have a scatter plot with 1 million points stored as an SVG file, it may take your computer some time to open the image. On the other hand, you can zoom into / scale up vector graphics as much as you like without the image looking bad, while raster images eventually start to look “pixellated.”

PDF files: The portable document format PDF (.pdf) is commonly used to store both raster and vector graphics formats. If you try to open a PDF and it’s taking a long time to load, it may be because there is a complicated vector graphics image that your computer is rendering.

Let’s investigate how different image file formats behave with a scatter plot of the Old Faithful data set:

library(svglite) # we need this to save SVG files
faithful_plot <- ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
  geom_point()

faithful_plot
Scatter plot of waiting time and eruption time

Figure 4.15: Scatter plot of waiting time and eruption time


ggsave("faithful_plot.png", faithful_plot)
ggsave("faithful_plot.jpg", faithful_plot)
ggsave("faithful_plot.bmp", faithful_plot)
ggsave("faithful_plot.tiff", faithful_plot)
ggsave("faithful_plot.svg", faithful_plot)

print(paste("PNG filesize: ", file.info("faithful_plot.png")["size"] / 1000000, "MB"))
## [1] "PNG filesize:  0.07861 MB"
print(paste("JPG filesize: ", file.info("faithful_plot.jpg")["size"] / 1000000, "MB"))
## [1] "JPG filesize:  0.139187 MB"
print(paste("BMP filesize: ", file.info("faithful_plot.bmp")["size"] / 1000000, "MB"))
## [1] "BMP filesize:  3.148978 MB"
print(paste("TIFF filesize: ", file.info("faithful_plot.tiff")["size"] / 1000000, "MB"))
## [1] "TIFF filesize:  9.443892 MB"
print(paste("SVG filesize: ", file.info("faithful_plot.svg")["size"] / 1000000, "MB"))
## [1] "SVG filesize:  0.046145 MB"

Wow, that’s quite a difference! Notice that for such a simple plot with few graphical elements (points), the vector graphics format (SVG) is over 100 times smaller than the uncompressed raster images (BMP, TIFF). Also, note that the JPG format is twice as large as the PNG format since the JPG compression algorithm is designed for natural images (not plots). Below, we also show what the images look like when we zoom in to a rectangle with only 3 data points. You can see why vector graphics formats are so useful: because they’re just based on mathematical formulas, vector graphics can be scaled up to arbitrary sizes. This makes them great for presentation media of all sizes, from papers to posters to billboards.

Zoomed in `faithful`, raster (PNG, left) and vector (SVG, right) formatsZoomed in `faithful`, raster (PNG, left) and vector (SVG, right) formats

Figure 4.16: Zoomed in faithful, raster (PNG, left) and vector (SVG, right) formats