# Chapter 1 R, Jupyter, and the tidyverse

This is an open source textbook aimed at introducing undergraduate students to data science. It was originally written for the University of British Columbia’s DSCI 100 - Introduction to Data Science course. In this book, we define data science as the study and development of reproducible, auditable processes to obtain value (i.e., insight) from data.

The book is structured so that learners spend the first four chapters learning how to use the R programming language and Jupyter notebooks to load, wrangle/clean, and visualize data, while answering descriptive and exploratory data analysis questions. The remaining chapters illustrate how to solve four common problems in data science, which are useful for answering predictive and inferential data analysis questions:

1. Predicting a class/category for a new observation/measurement (e.g., cancerous or benign tumour).
2. Predicting a value for a new observation/measurement (e.g., 10 km race time for 20-year-old females with a BMI of 25).
3. Finding previously unknown/unlabelled subgroups in your data (e.g., products commonly bought together on Amazon).
4. Estimating an average or a proportion from a representative sample (group of people or units) and using that estimate to generalize to the broader population (e.g., the proportion of undergraduate students that own an iPhone).

We map each of these problems to the type of data analysis question being asked and discuss what kinds of data are needed to answer such questions (Leek and Peng 2015; Peng and Matsui 2015). More advanced (e.g., causal or mechanistic) data analysis questions are beyond the scope of this text.

Types of data analysis questions

| Question type | Description | Example |
|---|---|---|
| Descriptive | A question that asks about summarized characteristics of a data set without interpretation (i.e., report a fact). | How many people live in each province or territory in Canada? |
| Exploratory | A question that asks if there are patterns, trends, or relationships within a single data set. Often used to propose hypotheses for future study. | Does political party voting change with indicators of wealth in a set of data collected on 2,000 people living in Canada? |
| Inferential | A question that looks for patterns, trends, or relationships in a single data set and also asks for quantification of how applicable these findings are to the wider population. | Does political party voting change with indicators of wealth for all people living in Canada? |
| Predictive | A question that asks about predicting measurements or labels for individuals (people or things). The focus is on what things predict some outcome, but not what causes the outcome. | What political party will someone vote for in the next Canadian election? |
| Causal | A question that asks about whether changing one factor will lead to a change in another factor, on average, in the wider population. | Does wealth lead to voting for a certain political party in Canadian elections? |
| Mechanistic | A question that asks about the underlying mechanism of the observed patterns, trends, or relationship (i.e., how does it happen?). | How does wealth lead to voting for a certain political party in Canadian elections? |

Source: What is the question? by Jeffrey T. Leek and Roger D. Peng, and The Art of Data Science by Roger D. Peng and Elizabeth Matsui

## 1.1 Chapter learning objectives

By the end of the chapter, students will be able to:

• use a Jupyter notebook to execute provided R code
• edit code and markdown cells in a Jupyter notebook
• create new code and markdown cells in a Jupyter notebook
• load the tidyverse package into R
• create new variables and objects in R using the assignment symbol
• use the help and documentation tools in R
• match the names of the following functions from the tidyverse package to their documentation descriptions:
  • read_csv
  • select
  • filter
  • mutate
  • ggplot
  • aes

## 1.2 Jupyter notebooks

Jupyter notebooks are documents that contain a mix of computer code (and its output) and formattable text. Given that they are able to combine these two in a single document—code is not separate from the output or written report—notebooks are one of the leading tools to create reproducible data analyses. A reproducible data analysis is one where you can reliably and easily recreate the same results when analyzing the same data. Although this sounds like something that should always be true of any data analysis, in reality this is not often the case; one needs to make a conscious effort to perform data analysis in a reproducible manner.

The name Jupyter came from combining the names of the three programming languages that it was initially targeted for (Julia, Python, and R), and now many other languages can be used with Jupyter notebooks.

A notebook looks like this:

We have included a short demo video here to help you get started and to introduce you to R and Jupyter. However, the best way to learn how to write and run code and formattable text in a Jupyter notebook is to do it yourself! Here is a worksheet that provides a step-by-step guide through the basics.

Often, the first thing we need to do in data analysis is to load a dataset into R. When we bring spreadsheet-like (think Microsoft Excel tables) data, generally shaped like a rectangle, into R it is represented as what we call a data frame object. It is very similar to a spreadsheet where the rows are the collected observations and the columns are the variables.

The first kind of data we will learn how to load into R (as a data frame) is the spreadsheet-like comma-separated values format (.csv for short). These files have names ending in .csv, and can be opened and saved from common spreadsheet programs like Microsoft Excel and Google Sheets. For example, a .csv file named can_lang.csv is included with the code for this book. This file, originally from the {canlang} R data package, has language data collected in the 2016 Canadian census (Canada 2016). If we were to open this data in a plain text editor, we would see each row on its own line, and each entry in the table separated by a comma:

category,language,mother_tongue,most_at_home,most_at_work,lang_known
Aboriginal languages,"Aboriginal languages, n.o.s.",590,235,30,665
Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415
Non-Official & Non-Aboriginal languages,"Afro-Asiatic languages, n.i.e.",1150,445,10,2775
Non-Official & Non-Aboriginal languages,Akan (Twi),13460,5985,25,22150
Non-Official & Non-Aboriginal languages,Albanian,26895,13135,345,31930
Aboriginal languages,"Algonquian languages, n.i.e.",45,10,0,120
Aboriginal languages,Algonquin,1260,370,40,2480
Non-Official & Non-Aboriginal languages,American Sign Language,2685,3020,1145,21930
Non-Official & Non-Aboriginal languages,Amharic,22465,12785,200,33670

To load this data into R, and then to do anything else with it afterwards, we will need to use something called a function. A function is a special word in R that takes in instructions (we call these arguments) and does something. The function we will use to read a .csv file into R is called read_csv.

In its most basic use-case, read_csv expects that the data file:

• has column names (or headers),
• uses a comma (,) to separate the columns, and
• does not have row names.

Below you’ll see the code used to load the data into R using the read_csv function. But there is one extra step we need to do first. Since read_csv is not included in the base installation of R, to be able to use it we have to load it from somewhere else: a collection of useful functions known as a package. The read_csv function in particular is in the tidyverse package (more on this later), which we load using the library function.

Next, we call the read_csv function and pass it a single argument: the path to the file, "data/can_lang.csv" (that is, the file can_lang.csv inside the data folder). We have to put quotes around filenames and other letters and words that we use in our code to distinguish them from the special words that make up the R programming language. This is the only argument we need to provide for this file, because our file satisfies everything else the read_csv function expects in the default use-case (which we just discussed). Later in the course, we’ll learn more about how to deal with more complicated files where the default arguments are not appropriate, for example, files that use spaces or tabs to separate the columns, or that have no column names.

library(tidyverse)
read_csv("data/can_lang.csv")
## # A tibble: 214 x 6
##    category                  language            mother_tongue most_at_home most_at_work lang_known
##    <chr>                     <chr>                       <dbl>        <dbl>        <dbl>      <dbl>
##  1 Aboriginal languages      Aboriginal languag…           590          235           30        665
##  2 Non-Official & Non-Abori… Afrikaans                   10260         4785           85      23415
##  3 Non-Official & Non-Abori… Afro-Asiatic langu…          1150          445           10       2775
##  4 Non-Official & Non-Abori… Akan (Twi)                  13460         5985           25      22150
##  5 Non-Official & Non-Abori… Albanian                    26895        13135          345      31930
##  6 Aboriginal languages      Algonquian languag…            45           10            0        120
##  7 Aboriginal languages      Algonquin                    1260          370           40       2480
##  8 Non-Official & Non-Abori… American Sign Lang…          2685         3020         1145      21930
##  9 Non-Official & Non-Abori… Amharic                     22465        12785          200      33670
## 10 Non-Official & Non-Abori… Arabic                     419890       223535         5585     629055
## # … with 204 more rows

Above you can also see something neat that Jupyter does to help us understand our code: it colours text depending on its meaning in R. For example, you’ll note that functions get bold green text, while letters and words surrounded by quotation marks (like filenames) get blue text.

In case you want to know more (optional): We use the read_csv function from the tidyverse instead of the base R function read.csv because it’s faster and it creates a nicer variant of the base R data frame called a tibble. This has several benefits that we’ll discuss in further detail later in the course.
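As a rough, self-contained illustration of the default use-case described above, the sketch below writes a few lines from the can_lang.csv excerpt shown earlier to a temporary file and reads them back with both read_csv and the base R function read.csv. The temporary file and the names csv_lines, tmp_file, tidy_version, and base_version are assumptions made for this illustration only; they are not part of the code for this book.

```r
library(tidyverse)

# A few lines copied from the can_lang.csv excerpt shown earlier
csv_lines <- c(
  "category,language,mother_tongue,most_at_home,most_at_work,lang_known",
  "Aboriginal languages,\"Aboriginal languages, n.o.s.\",590,235,30,665",
  "Non-Official & Non-Aboriginal languages,Afrikaans,10260,4785,85,23415",
  "Aboriginal languages,Algonquin,1260,370,40,2480"
)

# Write them to a temporary file (hypothetical stand-in for data/can_lang.csv)
tmp_file <- tempfile(fileext = ".csv")
writeLines(csv_lines, tmp_file)

# read_csv returns a tibble; read.csv returns a plain base R data frame
tidy_version <- read_csv(tmp_file)
base_version <- read.csv(tmp_file)

class(tidy_version)
class(base_version)

# The quoted entry keeps its comma and stays in a single column
tidy_version
```

Note that the entry "Aboriginal languages, n.o.s." contains a comma of its own. Because it is wrapped in quotes inside the file, read_csv treats it as one entry rather than splitting it into two columns.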

## 1.4 Assigning value to a data frame

When we loaded the language data collected in the 2016 Canadian census in R above using read_csv, we did not give this data frame a name, so it was just printed to the screen and we cannot do anything else with it. That isn’t very useful; what we would like to do is give a name to the data frame that read_csv outputs so that we can use it later for analysis and visualization.

To assign a name to something in R, there are two possible ways: using either the assignment symbol (<-) or the equals symbol (=). From a style perspective, the assignment symbol is preferred and is what we will use in this course. When we name something in R using the assignment symbol, <-, we do not need to surround it with quotes like the filename. This is because we are formally telling R about this word and giving it a value. Only characters and words that act as values need to be surrounded by quotes.

Let’s now use the assignment symbol to give the name can_lang to the data frame of 2016 Canadian census language data that we get from read_csv.

can_lang <- read_csv("data/can_lang.csv")

Wait a minute! Nothing happened this time! Or at least it looks like that. But actually, something did happen: the data was read in and now has the name can_lang associated with it. And we can use that name to access the data frame and do things with it. First we will type the name of the data frame to print it to the screen.

can_lang
## # A tibble: 214 x 6
##    category                  language            mother_tongue most_at_home most_at_work lang_known
##    <chr>                     <chr>                       <dbl>        <dbl>        <dbl>      <dbl>
##  1 Aboriginal languages      Aboriginal languag…           590          235           30        665
##  2 Non-Official & Non-Abori… Afrikaans                   10260         4785           85      23415
##  3 Non-Official & Non-Abori… Afro-Asiatic langu…          1150          445           10       2775
##  4 Non-Official & Non-Abori… Akan (Twi)                  13460         5985           25      22150
##  5 Non-Official & Non-Abori… Albanian                    26895        13135          345      31930
##  6 Aboriginal languages      Algonquian languag…            45           10            0        120
##  7 Aboriginal languages      Algonquin                    1260          370           40       2480
##  8 Non-Official & Non-Abori… American Sign Lang…          2685         3020         1145      21930
##  9 Non-Official & Non-Abori… Amharic                     22465        12785          200      33670
## 10 Non-Official & Non-Abori… Arabic                     419890       223535         5585     629055
## # … with 204 more rows
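The same naming pattern works for any value, not only data frames. The short sketch below is meant only as an illustration; the names mandarin_at_home, file_name, and other_name are hypothetical and are not used elsewhere in this book.

```r
# Give the name mandarin_at_home to a number (no quotes around the name)
mandarin_at_home <- 462890

# Give a name to a character value: the value needs quotes, the name does not
file_name <- "data/can_lang.csv"

# Typing a name prints the value stored under it
mandarin_at_home
file_name

# The equals symbol also works, although <- is the preferred style
other_name = 10
other_name
```

As with can_lang, the lines that assign a name print nothing; the value only appears once we type the name on its own.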

## 1.5 Creating subsets of data frames with select & filter

Now, we are going to learn how to obtain subsets of data from a data frame in R using two other tidyverse functions: select and filter. The select function allows you to create a subset of the columns of a data frame, while the filter function allows you to obtain a subset of the rows with specific values.

Before we start using select and filter, let’s take a look at the language data collected in the 2016 Canadian census again to familiarize ourselves with it. We will do this by printing the data we loaded earlier in the chapter to the screen.

can_lang
## # A tibble: 214 x 6
##    category                  language            mother_tongue most_at_home most_at_work lang_known
##    <chr>                     <chr>                       <dbl>        <dbl>        <dbl>      <dbl>
##  1 Aboriginal languages      Aboriginal languag…           590          235           30        665
##  2 Non-Official & Non-Abori… Afrikaans                   10260         4785           85      23415
##  3 Non-Official & Non-Abori… Afro-Asiatic langu…          1150          445           10       2775
##  4 Non-Official & Non-Abori… Akan (Twi)                  13460         5985           25      22150
##  5 Non-Official & Non-Abori… Albanian                    26895        13135          345      31930
##  6 Aboriginal languages      Algonquian languag…            45           10            0        120
##  7 Aboriginal languages      Algonquin                    1260          370           40       2480
##  8 Non-Official & Non-Abori… American Sign Lang…          2685         3020         1145      21930
##  9 Non-Official & Non-Abori… Amharic                     22465        12785          200      33670
## 10 Non-Official & Non-Abori… Arabic                     419890       223535         5585     629055
## # … with 204 more rows

In this data frame there are 214 rows (corresponding to the 214 languages recorded on the 2016 Canadian census) and 6 columns:

1. category: Higher level language category (describing whether the language is an Official Canadian language, an Aboriginal language, or a Non-Official and Non-Aboriginal language).
2. language: Language queried about on the Canadian Census.
3. mother_tongue: Total count of Canadians from the Census who reported the language as their mother tongue. Mother tongue is generally defined as the language someone was exposed to since birth.
4. most_at_home: Total count of Canadians from the Census who reported the language as spoken most often at home.
5. most_at_work: Total count of Canadians from the Census who reported the language as used most often at work for the population.
6. lang_known: Total count of Canadians from the Census who reported knowledge of language for the population in private households.

Now let’s use select to extract the language column from this data frame. To do this, we need to provide the select function with two arguments. The first argument is the name of the data frame object, which in this example is can_lang. The second argument is the column name that we want to select, here language. After passing these two arguments, the select function returns a single column (the language column that we asked for) as a data frame.

language_column <- select(can_lang, language)
language_column
## # A tibble: 214 x 1
##    language
##    <chr>
##  1 Aboriginal languages, n.o.s.
##  2 Afrikaans
##  3 Afro-Asiatic languages, n.i.e.
##  4 Akan (Twi)
##  5 Albanian
##  6 Algonquian languages, n.i.e.
##  7 Algonquin
##  8 American Sign Language
##  9 Amharic
## 10 Arabic
## # … with 204 more rows

### 1.5.1 Using select to extract multiple columns

We can also use select to obtain a subset of the data frame with multiple columns. Again, the first argument is the name of the data frame. Then we list all the columns we want as arguments separated by commas. Here we create a subset of three columns: language, mother tongue, and language spoken most often at home.

three_columns <- select(can_lang, language, mother_tongue, most_at_home)
three_columns
## # A tibble: 214 x 3
##    language                       mother_tongue most_at_home
##    <chr>                                  <dbl>        <dbl>
##  1 Aboriginal languages, n.o.s.             590          235
##  2 Afrikaans                              10260         4785
##  3 Afro-Asiatic languages, n.i.e.          1150          445
##  4 Akan (Twi)                             13460         5985
##  5 Albanian                               26895        13135
##  6 Algonquian languages, n.i.e.              45           10
##  7 Algonquin                               1260          370
##  8 American Sign Language                  2685         3020
##  9 Amharic                                22465        12785
## 10 Arabic                                419890       223535
## # … with 204 more rows

### 1.5.2 Using select to extract a range of columns

We can also use select to obtain a subset of the data frame constructed from a range of columns. To do this we use the colon (:) operator to denote the range. For example, to get all the columns in the data frame from language to most_at_home we pass language:most_at_home as the second argument to the select function.

column_range <- select(can_lang, language:most_at_home)
column_range
## # A tibble: 214 x 3
##    language                       mother_tongue most_at_home
##    <chr>                                  <dbl>        <dbl>
##  1 Aboriginal languages, n.o.s.             590          235
##  2 Afrikaans                              10260         4785
##  3 Afro-Asiatic languages, n.i.e.          1150          445
##  4 Akan (Twi)                             13460         5985
##  5 Albanian                               26895        13135
##  6 Algonquian languages, n.i.e.              45           10
##  7 Algonquin                               1260          370
##  8 American Sign Language                  2685         3020
##  9 Amharic                                22465        12785
## 10 Arabic                                419890       223535
## # … with 204 more rows
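Listing the columns by name and giving a range are two routes to the same subset. The sketch below checks this on a small stand-in data frame built from three rows of the output shown earlier; small_lang, by_name, and by_range are hypothetical names chosen for this illustration, and tibble is the tidyverse function used here to type a data frame in by hand.

```r
library(tidyverse)

# A small stand-in for can_lang, built from rows shown in the output above
small_lang <- tibble(
  category = c("Aboriginal languages",
               "Non-Official & Non-Aboriginal languages",
               "Aboriginal languages"),
  language = c("Aboriginal languages, n.o.s.", "Afrikaans", "Algonquin"),
  mother_tongue = c(590, 10260, 1260),
  most_at_home = c(235, 4785, 370),
  most_at_work = c(30, 85, 40),
  lang_known = c(665, 23415, 2480)
)

# Select the same three columns in two different ways
by_name <- select(small_lang, language, mother_tongue, most_at_home)
by_range <- select(small_lang, language:most_at_home)

# Compare the two subsets (same columns, same values)
identical(by_name, by_range)
```

A range is convenient when the columns we want sit next to each other, but it depends on the column order in the data frame, whereas listing names does not.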

### 1.5.3 Using filter to extract a single row

We can use the filter function to obtain the subset of rows with desired values from a data frame. Again, our first argument is the name of the data frame object, can_lang. The second argument is a logical statement to use when filtering the rows. Here, for example, we’ll say that we are interested in rows where the language is Mandarin. To make this comparison, we use the equality operator == to compare the values of the language column with the value "Mandarin". Similar to when we loaded the data file and put quotes around the filename, here we need to put quotes around "Mandarin" to tell R that this is a character value and not one of the special words that make up the R programming language, nor one of the names we have given to data frames in the code we have already written.

With these arguments, filter returns a data frame that has all the columns of the input data frame but only the rows we asked for in our logical filter statement.

mandarin <- filter(can_lang, language == "Mandarin")
mandarin
## # A tibble: 1 x 6
##   category                              language mother_tongue most_at_home most_at_work lang_known
##   <chr>                                 <chr>            <dbl>        <dbl>        <dbl>      <dbl>
## 1 Non-Official & Non-Aboriginal langua… Mandarin        592040       462890        60090     814450
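To see what the logical statement does on its own, the sketch below evaluates the comparison outside of filter on a three-row stand-in data frame (values taken from the outputs in this chapter). The names small_lang and is_mandarin are hypothetical, and the $ symbol, which pulls a single column out of a data frame, goes slightly beyond what has been introduced so far.

```r
library(tidyverse)

# A small stand-in for can_lang with values from this chapter
small_lang <- tibble(
  language = c("English", "French", "Mandarin"),
  most_at_home = c(22162865, 6943800, 462890)
)

# The comparison is made for every row, giving one TRUE or FALSE per row
is_mandarin <- small_lang$language == "Mandarin"
is_mandarin

# filter keeps only the rows where the logical statement is TRUE
filter(small_lang, language == "Mandarin")
```

Only the third row gives TRUE, so filter returns a data frame with that single row and all of the original columns.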

### 1.5.4 Using filter to extract rows with values above a threshold

If we are interested in finding information about the languages that more people speak most often at home than Mandarin (which is reported to have 462890 people speaking it as the primary language in their home), then we can create a filter to obtain rows where the value of most_at_home is greater than 462890. In this case, we see that filter returns a data frame with 2 rows; this indicates that there are two languages that are spoken more often at home than Mandarin.

spoke_often_at_home <- filter(can_lang, most_at_home > 462890)
spoke_often_at_home
## # A tibble: 2 x 6
##   category           language mother_tongue most_at_home most_at_work lang_known
##   <chr>              <chr>            <dbl>        <dbl>        <dbl>      <dbl>
## 1 Official languages English       19460850     22162865     15265335   29748265
## 2 Official languages French         7166700      6943800      3825215   10242945
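Typing the number 462890 by hand works, but the threshold can also be taken from the data itself. The following is a minimal sketch of that idea on a four-row stand-in data frame (values taken from the outputs in this chapter); small_lang, threshold, and above_mandarin are hypothetical names, and $ is used to pull out a single column.

```r
library(tidyverse)

# A small stand-in for can_lang with values from this chapter
small_lang <- tibble(
  language = c("English", "French", "Mandarin", "Arabic"),
  most_at_home = c(22162865, 6943800, 462890, 223535)
)

# Look up Mandarin's value instead of typing the number by hand
mandarin <- filter(small_lang, language == "Mandarin")
threshold <- mandarin$most_at_home

# Keep rows strictly above the threshold (Mandarin itself is excluded)
above_mandarin <- filter(small_lang, most_at_home > threshold)
above_mandarin

# Count how many rows were kept
nrow(above_mandarin)
```

Because > is a strict comparison, the Mandarin row is not returned; only languages with a larger most_at_home value are kept.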

## 1.6 Exploring data with visualizations

Creating effective data visualizations is an essential part of any data analysis. For the remainder of Chapter 1, we will learn how to use functions from the tidyverse to make visualizations that let us explore relationships in data. In particular, we’ll develop a visualization of the language data collected in the 2016 Canadian census we’ve been working with that will help us understand two potential relationships in the data: first, the relationship between the number of people who speak a language as their mother tongue and the number of people who speak that language as their primary spoken language at home, and second, whether there is a pattern in the strength of this relationship in the higher level language categories (Official languages, Aboriginal languages, or non-official and non-Aboriginal languages). This is an example of an exploratory data analysis question: we are looking for relationships and patterns within the data set we have, but are not trying to generalize what we find beyond this data set.

### 1.6.1 Using ggplot to create a scatter plot

Taking another look at our data set below, we can immediately see that the three variables we are interested in visualizing (mother tongue, language spoken most at home, and higher level language category) are all in separate columns. In addition, there is a single row (or observation) for each language. The data are therefore in what we call a tidy data format. Tidy data is a particularly important concept and will be a major focus in the remainder of this course: many of the functions from tidyverse require tidy data, including the ggplot function that we will use shortly for our visualization. We will formally introduce this concept in chapter 3.

can_lang
## # A tibble: 214 x 6
##    category                  language            mother_tongue most_at_home most_at_work lang_known
##    <chr>                     <chr>                       <dbl>        <dbl>        <dbl>      <dbl>
##  1 Aboriginal languages      Aboriginal languag…           590          235           30        665
##  2 Non-Official & Non-Abori… Afrikaans                   10260         4785           85      23415
##  3 Non-Official & Non-Abori… Afro-Asiatic langu…          1150          445           10       2775
##  4 Non-Official & Non-Abori… Akan (Twi)                  13460         5985           25      22150
##  5 Non-Official & Non-Abori… Albanian                    26895        13135          345      31930
##  6 Aboriginal languages      Algonquian languag…            45           10            0        120
##  7 Aboriginal languages      Algonquin                    1260          370           40       2480
##  8 Non-Official & Non-Abori… American Sign Lang…          2685         3020         1145      21930
##  9 Non-Official & Non-Abori… Amharic                     22465        12785          200      33670
## 10 Non-Official & Non-Abori… Arabic                     419890       223535         5585     629055
## # … with 204 more rows

We will begin with a scatter plot of the mother_tongue and most_at_home columns from our data frame. To create a scatter plot of these two variables using the ggplot function, we do the following:

1. call the ggplot function
2. provide the name of the data frame as the first argument
3. call the aesthetic function, aes, to specify which column will correspond to the x-axis and which will correspond to the y-axis
4. add a + symbol at the end of the ggplot call to add a layer to the plot
5. call the geom_point function to tell R that we want to represent the data points as dots/points to create a scatter plot.

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point()

In case you have used R before and are curious: There are a small number of situations in which you can have a single R expression span multiple lines. Here, the + symbol at the end of the first line tells R that the expression isn’t done yet and to continue reading on the next line. While not strictly necessary, this sort of pattern will appear a lot when using ggplot as it keeps things more readable.
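The five steps listed earlier can also be carried out one piece at a time by giving names to the intermediate plot objects. The sketch below does this on a four-row stand-in data frame (values taken from the can_lang output above); small_lang, base_plot, and scatter_plot are hypothetical names used only for this illustration.

```r
library(tidyverse)

# A small stand-in for can_lang, built from rows shown in the output above
small_lang <- tibble(
  language = c("Afrikaans", "Albanian", "Amharic", "Arabic"),
  mother_tongue = c(10260, 26895, 22465, 419890),
  most_at_home = c(4785, 13135, 12785, 223535)
)

# Steps 1 to 3: call ggplot with the data frame and the aesthetic mapping
base_plot <- ggplot(small_lang, aes(x = most_at_home, y = mother_tongue))

# Steps 4 and 5: use + to add a layer of points
scatter_plot <- base_plot + geom_point()

# Typing the name of a plot object displays the plot
scatter_plot
```

Displaying base_plot on its own would show only empty axes, since no layer has been added yet to say how the data points should be drawn.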

### 1.6.2 Formatting ggplot objects

It is motivating and exciting that we have already been able to visualize our data to help answer our question, but we are not done yet! There is more we can (and should) do to improve the interpretability of the data visualization that we created. For example, by default, R uses the column names as the axis labels. Usually, however, these column names do not carry enough information about the variable in the column, so we really should replace this default with a more informative label. For the example above, the column name mother_tongue is used as the label for the y-axis, but most people will not know what that is. And even if they did, they would not know how we are measuring mother tongue, nor which group of people the measurements were taken from. An axis label that read “Mother tongue (number of Canadian residents)” would be much more informative.

Adding additional layers to our visualizations that we create in ggplot is one common and easy way to improve and refine our data visualizations. New layers are added to ggplot objects using the + symbol. For example, we can use the xlab and ylab functions to add layers where we specify meaningful and informative labels for the x and y axes. Again, since we are specifying words (e.g. "Mother tongue (number of Canadian residents)") as arguments to xlab and ylab, we surround them with double quotes. There are many more layers we can add to format the plot further, and we will explore these in later chapters.

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home (number of Canadian residents)") +
  ylab("Mother tongue (number of Canadian residents)")

Most of the data points from the 214 observations in this data set are bunched up in the lower left-hand side of this visualization. This is because many more people in Canada speak the two official languages (English and French) than any other language, which stretches both axes. Thus, to answer our question, we will need to adjust the scale of the x and y axes so that they are on a log scale. We can again add additional layers to the plot object using the + symbol to do this:

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home (number of Canadian residents)") +
  ylab("Mother tongue (number of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)

From this visualization we see that for the 214 languages in this data set, as the number of people who have a language as their mother tongue increases, so does the number of people who speak that language at home. When we see two variables do this, we call this a positive relationship. Because the points are fairly close together, we can say that the relationship is strong. Because drawing a straight line through these points would fit the pattern we observe quite well, we say that it’s linear.

Learning how to describe data visualizations is a very useful skill. We will provide descriptions for you in this course (as we did above) until we get to Chapter 4, which focuses on data visualization. Then, we will explicitly teach you how to do this yourself, and how to not over-state or over-interpret the results from a visualization.
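To get a rough feel for why the log scale used above spreads out the bunched-up points, the sketch below applies the base R function log10 to a handful of mother tongue counts taken from the outputs in this chapter. The name counts is hypothetical, and this is only an arithmetic illustration of the idea, not a description of how ggplot draws its axes internally.

```r
# Mother tongue counts taken from the outputs shown in this chapter
# (Algonquian languages n.i.e., Aboriginal languages n.o.s., Afrikaans,
#  Mandarin, English)
counts <- c(45, 590, 10260, 592040, 19460850)

# On the original scale, the largest count dwarfs all the others
counts / max(counts)

# log10 turns every factor of 10 into one equal-sized step
log10(c(10, 100, 1000))

# After the transformation the counts are spread far more evenly
log10(counts)
```

On the original scale every count except the largest is a tiny fraction of the axis, whereas after taking log10 the same counts are spaced between roughly 1.7 and 7.3.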

### 1.6.3 Changing the units

What does it mean that 19,460,850 people reported that their mother tongue was English in the 2016 Canadian census? To really understand this number, we need context. In particular, how many people were in Canada when this data was collected? From the 2016 Canadian census profile, we can see that the population was reported to be 35,151,728 people. The count of the number of people who report that English is their mother tongue is much more meaningful when we report it in this context. We can even go a step further and transform this count to a relative frequency, or proportion, so that we can represent this as a single meaningful number in our data visualizations. We can do this by dividing the number of people reporting a given language as their mother tongue by the number of people who live in Canada. For example, the proportion of people who reported that their mother tongue was English in the 2016 Canadian census was 0.55.
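The proportion quoted above can be reproduced with a single division. In the sketch below, the names english_mother_tongue, canada_population, and english_proportion are hypothetical; the two numbers are the ones given in the text.

```r
# Number of people reporting English as their mother tongue (2016 census)
english_mother_tongue <- 19460850

# Population of Canada reported in the 2016 census profile
canada_population <- 35151728

# Relative frequency (proportion), rounded to two decimal places
english_proportion <- english_mother_tongue / canada_population
round(english_proportion, 2)
```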

We can use the mutate function in R to do this for all of the languages in the 2016 Canadian census data set. mutate is useful for creating new columns in a data frame, as well as transforming existing columns. Its general syntax is: mutate(dataframe, column_to_create/transform = value/how_to_transform). Below we use mutate to calculate the proportion of people reporting a given language as their mother tongue for all the languages in the can_lang data set:

can_lang <- mutate(can_lang, mother_tongue = mother_tongue / 35151728)
can_lang
## # A tibble: 214 x 6
##    category                  language            mother_tongue most_at_home most_at_work lang_known
##    <chr>                     <chr>                       <dbl>        <dbl>        <dbl>      <dbl>
##  1 Aboriginal languages      Aboriginal languag…    0.0000168           235           30        665
##  2 Non-Official & Non-Abori… Afrikaans              0.000292           4785           85      23415
##  3 Non-Official & Non-Abori… Afro-Asiatic langu…    0.0000327           445           10       2775
##  4 Non-Official & Non-Abori… Akan (Twi)             0.000383           5985           25      22150
##  5 Non-Official & Non-Abori… Albanian               0.000765          13135          345      31930
##  6 Aboriginal languages      Algonquian languag…    0.00000128           10            0        120
##  7 Aboriginal languages      Algonquin              0.0000358           370           40       2480
##  8 Non-Official & Non-Abori… American Sign Lang…    0.0000764          3020         1145      21930
##  9 Non-Official & Non-Abori… Amharic                0.000639          12785          200      33670
## 10 Non-Official & Non-Abori… Arabic                 0.0119           223535         5585     629055
## # … with 204 more rows

Let’s also do this for the counts of the number of people who report that they speak a given language most often at home:

can_lang <- mutate(can_lang, most_at_home = most_at_home / 35151728)
can_lang
## # A tibble: 214 x 6
##    category                  language            mother_tongue most_at_home most_at_work lang_known
##    <chr>                     <chr>                       <dbl>        <dbl>        <dbl>      <dbl>
##  1 Aboriginal languages      Aboriginal languag…    0.0000168   0.00000669            30        665
##  2 Non-Official & Non-Abori… Afrikaans              0.000292    0.000136              85      23415
##  3 Non-Official & Non-Abori… Afro-Asiatic langu…    0.0000327   0.0000127             10       2775
##  4 Non-Official & Non-Abori… Akan (Twi)             0.000383    0.000170              25      22150
##  5 Non-Official & Non-Abori… Albanian               0.000765    0.000374             345      31930
##  6 Aboriginal languages      Algonquian languag…    0.00000128  0.000000284            0        120
##  7 Aboriginal languages      Algonquin              0.0000358   0.0000105             40       2480
##  8 Non-Official & Non-Abori… American Sign Lang…    0.0000764   0.0000859           1145      21930
##  9 Non-Official & Non-Abori… Amharic                0.000639    0.000364             200      33670
## 10 Non-Official & Non-Abori… Arabic                 0.0119      0.00636             5585     629055
## # … with 204 more rows
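
The two conversions above were done in two separate calls, but mutate also accepts several column_name = value pairs in a single call. The sketch below shows this on a small made-up data frame. The names toy_lang and toy_population and all of the counts are hypothetical.

```r
library(dplyr)

# Hypothetical toy data: the counts are made up for illustration
toy_lang <- tibble(
  language = c("Language A", "Language B"),
  mother_tongue = c(600, 400),
  most_at_home = c(700, 300)
)
toy_population <- 1000

# Transform both count columns to proportions in one call to mutate
toy_lang <- mutate(
  toy_lang,
  mother_tongue = mother_tongue / toy_population,
  most_at_home = most_at_home / toy_population
)
toy_lang
```

Storing the population in a named object, rather than repeating the number in each expression, also reduces the chance of a typo in one of the divisors.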

Finally, let’s visualize the data now that we have represented it as proportions (and change our axis labels to reflect this change in units!):

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue)) +
  geom_point() +
  xlab("Language spoken most at home (proportion of Canadian residents)") +
  ylab("Mother tongue (proportion of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)

From the visualization above, we can now clearly see that English was reported by not just many Canadians but the majority of them, both as their mother tongue and as the language they speak most often at home. Changing the units to include this context increases our understanding and allows us to better interpret the numbers in our data set.
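
A "majority" here means a proportion greater than 0.5, and this can also be checked numerically rather than read off the plot. The sketch below assumes dplyr's filter function for selecting rows. The small data frame lang_props is built only for illustration: the English value is computed from the counts quoted earlier, and the other two values are copied from the printed output above.

```r
library(dplyr)

# Proportions of residents reporting each language as their mother tongue
lang_props <- tibble(
  language = c("English", "Arabic", "Afrikaans"),
  mother_tongue = c(19460850 / 35151728, 0.0119, 0.000292)
)

# Keep only languages reported by more than half of residents
majority <- filter(lang_props, mother_tongue > 0.5)
majority
```

Of these three languages, only English passes the 0.5 threshold.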

### 1.6.4 Colouring points by group

Now we’ll move on to the second part of our exploratory data analysis question: when considering the relationship between the number of people who have a language as their mother tongue and the number of people who speak that language at home, is there a pattern in the strength of this relationship in the higher-level language categories (Official languages, Aboriginal languages, or non-official and non-Aboriginal languages)? One common way to explore this is to colour the data points on the scatter plot we have already created by group/category. For example, given that we have the higher-level language category for each language recorded in the 2016 Canadian census, we can colour the points in our previous scatter plot to represent each language’s higher-level language category.

To do this, we modify our scatter plot code above. Specifically, we will add an argument to the aes function, specifying that the points should be coloured by the category column:

ggplot(can_lang, aes(x = most_at_home, y = mother_tongue, color = category)) +
  geom_point() +
  xlab("Language spoken most at home (proportion of Canadian residents)") +
  ylab("Mother tongue (proportion of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma)

What do we see when considering the second part of our exploratory question? Do we see a difference in the pattern of the relationship between the number of people who speak a language as their mother tongue and the number of people who speak a language as their primary spoken language at home between higher-level language categories? Probably not!

For each higher-level language category there appears to be a positive relationship between the number of people who speak a language as their mother tongue and the number of people who speak a language as their primary spoken language at home. This relationship looks similar, regardless of the category.
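
One way to put a rough number on "looks similar" is to compute a correlation within each category. The sketch below is only an illustration under several assumptions: it uses dplyr's group_by and summarize functions, which are not otherwise used in this section, it takes the Pearson correlation of the log10-transformed values to match the log-scaled axes of the plot, and the data frame toy_lang contains made-up values rather than census data.

```r
library(dplyr)

# Hypothetical toy data: made-up proportions for two made-up categories
toy_lang <- tibble(
  category = c("Category A", "Category A", "Category A",
               "Category B", "Category B", "Category B"),
  most_at_home = c(0.0001, 0.001, 0.01, 0.0002, 0.002, 0.02),
  mother_tongue = c(0.0003, 0.002, 0.03, 0.0005, 0.006, 0.04)
)

# Group the rows by category, then compute one correlation per group
grouped <- group_by(toy_lang, category)
category_cor <- summarize(
  grouped,
  log_cor = cor(log10(most_at_home), log10(mother_tongue))
)
category_cor
```

A positive value of log_cor in every row would agree with what the coloured scatter plot suggests. A summary like this complements the visualization but does not replace it.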

Does this mean that this relationship is positive for all languages in the world? Can we use this data visualization on its own to predict how many people have a given language as their mother tongue if we know how many people speak it as their primary language at home?

The answer to both of these questions is “no.” However, with this exploratory data analysis, we can create new hypotheses, ideas, and questions (like the two posed above). Answering those questions would likely involve gathering additional data and doing more complex analyses, which we will see more of later in this course.

### 1.6.5 Putting it all together

Below, we put everything from this chapter together in one code chunk. We have added a few more layers to make the data visualization even more effective. Specifically, we have improved the visualization’s accessibility by choosing colours that are easier to distinguish and by also mapping category to shape, we have handled the problem of overlapping data points by making them slightly transparent, and we have changed the background from grey to white to improve the contrast. This demonstrates the power of R: in relatively few lines of code, we are able to create an entire data science workflow with a highly effective data visualization.

library(tidyverse)

# can_lang is assumed to still hold the raw counts loaded earlier in the chapter;
# re-running these two lines on already-converted data would divide twice
can_lang <- mutate(can_lang, mother_tongue = mother_tongue / 35151728)
can_lang <- mutate(can_lang, most_at_home = most_at_home / 35151728)

ggplot(can_lang, aes(
  x = most_at_home,
  y = mother_tongue,
  colour = category,
  shape = category # map categories to different shapes
)) +
  geom_point(alpha = 0.6) + # set the transparency of the points
  scale_color_manual(values = c("deepskyblue2", "firebrick1", "black")) + # choose point colours
  xlab("Language spoken most at home (proportion of Canadian residents)") +
  ylab("Mother tongue (proportion of Canadian residents)") +
  scale_x_log10(labels = scales::comma) +
  scale_y_log10(labels = scales::comma) +
  theme_bw() # use a theme to have a white background

### References

Statistics Canada. 2016. “Census of Population.” Reproduced and distributed on an "as is" basis with the permission of Statistics Canada.

Leek, Jeffrey T, and Roger D Peng. 2015. “What Is the Question?” Science 347 (6228): 1314–1315.

Peng, Roger D, and Elizabeth Matsui. 2015. “The Art of Data Science.” A Guide for Anyone Who Works with Data. Skybrude Consulting, LLC.