Lecture 4: Data Visualization

DSCI 100

Data visualization in R

Attribution: images in these slides that are not accompanied by code mostly come from
The Fundamentals of Data Visualization by Claus O. Wilke

Artwork by @allison_horst

Today: Visualization

image source: R for Data Science by Grolemund & Wickham

Designing a Visualization

Ask a question, then answer it

  • The purpose of a visualization is to answer a question about a dataset of interest.

  • A good visualization answers the question clearly. A great visualization also hints at the question itself.

Designing a Visualization

Visualizations alone help us answer only the first two of these six types of questions:

  • descriptive: What are the largest 7 landmasses on Earth?
  • exploratory: Is there a relationship between penguin body mass and bill length?
  • inferential
  • predictive
  • causal
  • mechanistic

(we need more tools in addition to visualizations to answer the other four)

Creating Visualizations in R with ggplot

  • It’s an iterative procedure. Try things, make mistakes, and refine!
  • We will use ggplot2. There are three key aspects of plots in ggplot2:
    1. aesthetic mappings: map dataframe columns to visual properties
    2. geometric objects: encode how to display those visual properties
    3. scales: transform variables, set limits
  • Add these one by one using +
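
The three aspects above can be sketched as one plot built up layer by layer. This is a minimal illustration, assuming the tidyverse is installed; it uses R's built-in `faithful` data frame, and the name `layered_plot` and the axis limits are arbitrary choices for the example.

```r
library(tidyverse)

# 1. aesthetic mappings: map columns of the built-in `faithful` data frame to x and y
layered_plot <- ggplot(faithful, aes(x = waiting, y = eruptions))

# 2. geometric object: draw each observation as a point
layered_plot <- layered_plot + geom_point()

# 3. scale: set the limits of the x-axis (limits chosen only for illustration)
layered_plot <- layered_plot + scale_x_continuous(limits = c(40, 100))

# display the finished plot
layered_plot
```

Each `+` adds one piece, so it is easy to rerun the plot after every step and refine as you go.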

Creating Visualizations in R with ggplot

ggplot2 is loaded along with the tidyverse package in R, or it can be loaded on its own! We need a number of functions from various tidyverse packages (including dplyr), so we’ll load the tidyverse.
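
A minimal sketch of that loading step, assuming the tidyverse has already been installed (for example with `install.packages("tidyverse")`):

```r
# Loads ggplot2, dplyr, and the other core tidyverse packages in one call
library(tidyverse)

# Alternative: load only the plotting package
# library(ggplot2)
```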

Types of Variables

A variable refers to a characteristic of interest and can be:

  1. Categorical: can be divided into groups (categories) e.g. marital status
  2. Quantitative: measured on a numeric scale (usually units are attached) e.g. height


The types of variables we have (along with the question we wish to answer or explore) may determine the type of data visualization we should use.
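
One way to check which type a variable is in R is to look at how its column is stored. The sketch below uses the built-in `iris` data frame as a stand-in example; it is not a dataset from these slides.

```r
# `iris` ships with R: Species is categorical, Sepal.Length is quantitative
species_is_categorical <- is.factor(iris$Species)         # categorical variables are often stored as factors
length_is_quantitative <- is.numeric(iris$Sepal.Length)   # quantitative variables are stored as numbers

# the groups (categories) of the categorical variable
species_categories <- levels(iris$Species)

print(species_is_categorical)
print(length_is_quantitative)
print(species_categories)
```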

Scatter Plots

Scatter plots are used to visualize the relationship between two quantitative variables.

  • Example: Is there a relationship between horsepower and fuel economy of an engine? Does the number of cylinders affect that relationship?
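
A possible sketch of this example, assuming the built-in `mtcars` data frame is an acceptable source (its `hp`, `mpg`, and `cyl` columns hold horsepower, miles per gallon, and number of cylinders). The name `scatter_plot` is just for illustration.

```r
library(tidyverse)

# Map horsepower to x, fuel economy to y, and cylinder count to colour.
# factor() makes ggplot2 treat cyl as categorical instead of a continuous number.
scatter_plot <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  labs(x = "Horsepower",
       y = "Fuel economy (miles per gallon)",
       colour = "Cylinders")

scatter_plot
```

Colouring by cylinder count lets a single scatter plot address both questions at once.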

Line Plots

Line plots are used to visualize trends with respect to an independent quantity

  • Example: How has atmospheric carbon dioxide changed over the last 40 years?

Not coding in these slides? You can find co2_df as a csv file here
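
A rough sketch of the line plot follows. The course's `co2_df` file is not reproduced here, so this rebuilds a stand-in from R's built-in `co2` time series (monthly Mauna Loa readings, 1959 to 1997); the column names `year_decimal` and `ppm` are assumptions and may not match the real file.

```r
library(tidyverse)

# Assumption: stand-in for co2_df built from the built-in `co2` time series;
# the course's co2_df may cover different years and use different column names.
co2_df <- tibble(
  year_decimal = as.numeric(time(co2)),  # time of each monthly reading, in decimal years
  ppm = as.numeric(co2)                  # CO2 concentration in parts per million
)

# Time goes on the x-axis; geom_line() connects consecutive readings
co2_line <- ggplot(co2_df, aes(x = year_decimal, y = ppm)) +
  geom_line() +
  labs(x = "Year", y = "Atmospheric CO2 (ppm)")

co2_line
```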

Bar Plots

Bar plots are used to visualize the comparison of amounts.

  • Example: Which are the largest 12 island landmasses on Earth? Are they all continents or are there some other islands with large landmasses as well?

Not coding in these slides? You can find islands_df as a csv file here
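
One way this bar plot could be sketched, assuming a stand-in for `islands_df` built from R's built-in `islands` named vector (areas in thousands of square miles). The column names `landmass` and `size` are hypothetical and may differ from the course file.

```r
library(tidyverse)

# Assumption: stand-in for islands_df built from the built-in `islands` vector
islands_df <- tibble(landmass = names(islands), size = as.numeric(islands))

# Keep only the 12 largest landmasses
islands_top12 <- islands_df |>
  slice_max(size, n = 12)

# Put the names on the y-axis so they stay readable, ordered by size
islands_bar <- ggplot(islands_top12, aes(x = size, y = fct_reorder(landmass, size))) +
  geom_bar(stat = "identity") +
  labs(x = "Area (thousand square miles)", y = "Landmass")

islands_bar
```

Here `stat = "identity"` tells `geom_bar()` to use the values in `size` as the bar lengths instead of counting rows.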

Histograms

Histograms are used to visualize the distribution of a single quantitative variable

  • Example: Was there a difference in life expectancy across different continents in 2016?

Not coding in these slides? You can find gapminder_2016 data as a csv file here
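
The `gapminder_2016` data is not reproduced here, so the sketch below substitutes the built-in `iris` data frame to show the same pattern: the distribution of one quantitative variable, split by a categorical one. With the real data, life expectancy would presumably take the place of `Sepal.Length` and continent the place of `Species`; the number of bins is an arbitrary choice.

```r
library(tidyverse)

# Stand-in data: `iris` replaces gapminder_2016 for illustration only
hist_plot <- ggplot(iris, aes(x = Sepal.Length)) +
  geom_histogram(bins = 20) +            # bin count chosen arbitrarily
  facet_grid(rows = vars(Species)) +     # one stacked panel per group
  labs(x = "Sepal length (cm)", y = "Count")

hist_plot
```

Stacking the panels vertically keeps the x-axes aligned, which makes the distributions easier to compare across groups.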

Rules of Thumb for Visualizations

1) No tables / pie charts

Which one is easier to interpret?

Rules of Thumb for Visualizations

2) No 3D visualizations

Notes:

  • the third dimension does not improve the reading of the data

  • these plots are difficult to interpret because of the distorted effect of perspective associated with the third dimension.

Rules of Thumb for Visualizations

3) Use simple, colourblind-friendly colour palettes
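
As a hedged sketch of this rule, ggplot2's `scale_colour_brewer()` can switch a plot to a ColorBrewer palette. "Dark2" is one of the qualitative palettes that the RColorBrewer package flags as colourblind-friendly. The `mtcars` data is used only as a convenient built-in example.

```r
library(tidyverse)

# Same kind of scatter plot as before, with a colourblind-friendly palette
palette_plot <- ggplot(mtcars, aes(x = hp, y = mpg, colour = factor(cyl))) +
  geom_point() +
  scale_colour_brewer(palette = "Dark2") +
  labs(x = "Horsepower", y = "Fuel economy (miles per gallon)", colour = "Cylinders")

palette_plot
```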

Rules of Thumb for Visualizations

4) Include labels and legends, make them legible

Remember: a great visualization tells its own story without needing you to be there explaining things
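
A small sketch of how labels can be set and enlarged, using the built-in `faithful` data frame (both columns are measured in minutes). The font size of 14 is an arbitrary choice for the example.

```r
library(tidyverse)

labelled_plot <- ggplot(faithful, aes(x = waiting, y = eruptions)) +
  geom_point() +
  # human-readable axis labels with units instead of raw column names
  labs(x = "Waiting time to next eruption (minutes)",
       y = "Eruption duration (minutes)") +
  # increase the size of all text in the plot so it is legible
  theme(text = element_text(size = 14))

labelled_plot
```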

Rules of Thumb for Visualizations

5) Avoid overplotting

Generally, we need to use an alternative geometric object.

For scatter plots, we can also add alpha = 0.2 to geom_point()

  • the alpha (transparency) value must be between 0 and 1: 0 is fully transparent, 1 is fully opaque
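
To illustrate the transparency setting, here is a sketch using the `diamonds` data frame that ships with ggplot2; it has tens of thousands of rows, so the points overlap heavily. The dataset choice is an assumption made for the example, not one taken from these slides.

```r
library(tidyverse)

# With alpha = 0.2 each point is mostly transparent,
# so densely packed regions appear darker than sparse ones
alpha_plot <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.2) +
  labs(x = "Carat", y = "Price (US dollars)")

alpha_plot
```

Note that `alpha` is set outside `aes()` because it is a fixed value for every point, not a mapping from a column.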

Wrap-up

Time to work on our worksheet!


Before Next Class: Please register for a free GitHub account (this will help you follow along!) https://github.com/signup


Need a data-viz refresh? Check out this optional video.