Case Study I: From Exploration to Design

DSCI 200

Intro

Welcome to the first case study in DSCI 200! This case study is the first in a series of hands-on projects designed to give you practical experience working with real-world data. Each case study reinforces key concepts from the course and helps you develop your analytical thinking and data science workflow.

In this case study, you’ll explore topics including exploratory data analysis (EDA), sampling and study design. The project is divided into milestones and tasks that align with the course material, allowing you to work on them progressively as you advance through the lectures.

Each question will have a specific point value assigned to it. Points will be distributed across different components of the question depending on the question type. For each question, you will see the rubric items associated with that question along with their corresponding point values (for example, Reasoning: 5 points). We will be using publicly available rubrics from the UBC MDS program, so for more details about each rubric type please visit https://github.com/UBC-MDS/public/tree/master/rubric.

Setup

Before diving into the case study, let’s make sure all the required packages are available. If any are missing, you can install them by uncommenting the relevant lines below (i.e., remove the # symbol before the install command).

#install.packages('tidyverse')

In addition to tidyverse, this case study uses the hecedsm package. This package includes datasets used in the HEC Montréal course MATH 80667A: Experimental Design and Statistical Methods. You can read more about it here. To install it directly from GitHub:

#devtools::install_github("lbelzile/hecedsm")

Once all the necessary packages are installed, load them with:

library(tidyverse)
library(hecedsm)

Submission Instructions

Each submission will be made on Gradescope (accessible via Canvas) and require the use of Jupyter to communicate the question asked, the analysis performed and the conclusions reached. Your submission must include:

Jupyter notebook (.ipynb file)
Rendered final document (.html or .pdf file)

Rubrics

Each question will have a specific point value assigned to it. Points will be distributed across different components of the question depending on the question type. After each question, you will see the rubric items associated with that question along with their corresponding point values (for example, Mechanics: 5 points). We will be using publicly available rubrics from the UBC MDS program, so for more details about each rubric type please visit https://github.com/UBC-MDS/public/tree/master/rubric.

The Data

In this case study, we will use published research and corresponding datasets from the hecedsm package to gain hands-on experience with real-world data.

Below are three dataset options you can choose from for this case study. Each one includes a research paper and a dataset, both accessible through the provided links.

Virtual communication curbs creative idea generation

The published paper of this study is available here.
The data is available as BL22_E in the hecedsm package. You can read more about it here.

Smartwatches are more distracting than mobile phones while driving: Results from an experimental study

The published paper of this study is available here.
The data is available as BRLS21_EDA in the hecedsm package. You can read more about it here.

Trivially informative semantic context inflates people’s confidence they can perform a highly complex skill

The published paper of this study is available here.
The data is available as JZBJG22_E2 in the hecedsm package. You can read more about it here.
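As a minimal sketch of how one of these datasets can be brought into an R session, the snippet below loads BL22_E by name and prints its structure. It assumes hecedsm was installed as described in the Setup section; the requireNamespace() guard is an addition for illustration so the code does not fail when the package is missing. Swap in BRLS21_EDA or JZBJG22_E2 if you chose a different study.

```r
library(tidyverse)

# Only attempt the load if hecedsm is installed (guard added for illustration)
if (requireNamespace("hecedsm", quietly = TRUE)) {
  # Load the dataset object by name into the current session
  data("BL22_E", package = "hecedsm")
  # Print the number of rows and columns and the type of each column
  glimpse(BL22_E)
} else {
  message("hecedsm is not installed; see the Setup section.")
}
```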

🏁 Milestone 1: Exploratory Data Analysis (EDA) (60 points)

We will begin by applying some of the EDA concepts we learned in class in the tasks below.

Task 1: Choose your favourite dataset (15 points)

Before moving forward, it is important to understand the context behind your data. In this task, you will choose one of the studies listed above, read the associated research paper, and load the dataset into R. This will give you a clearer picture of the study’s goals and what the data look like. Answer all questions in this task.

Choose one of the studies from the options above. Download the paper from the provided link and load the data into R. (Mechanics: 5 points)
Read the paper and summarize the goals of the study. Write your summary in no more than 200 words and be sure to clearly explain: (Reasoning: 10 points)

The primary research questions or objectives the authors aimed to address
The importance or relevance of these goals in the context of the field
Any specific hypotheses or outcomes the study sought to explore

Task 2: Explore data (8 points)

Exploring the data is a crucial step before producing any plots. It helps us detect missing values and better understand the types of variables we are working with. Answer all questions in this task.

Use the glimpse() or head() function to examine your dataset. How many observations and variables are there in the dataset? (Writing: 1 point)
What types of variables are in the dataset (e.g., numerical vs. categorical)? Explain briefly. (Writing: 1 point)
Would you need to convert any variable in the dataset before starting your analysis? Explain briefly why or why not. (Reasoning: 2 points)
How many missing values are there for each variable in the dataset? Are there any variables with a large proportion of missing data (e.g., more than 20%)? (Writing: 2 points)
Is there evidence of class imbalance in any categorical variable in the dataset (e.g., one category makes up a large majority of the observations)? (Reasoning: 2 points)
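The checks in this task can be sketched as follows. The tibble toy and its columns (group, score, items) are made up for illustration and are not one of the hecedsm datasets; the same calls should carry over to whichever dataset you chose, but treat this as one possible approach rather than the expected answer.

```r
library(tidyverse)

# Hypothetical toy data, invented for this sketch
toy <- tibble(
  group = c("a", "a", "a", "a", "b", "b"),
  score = c(3.1, 4.5, NA, 5.2, 2.8, 4.0),
  items = c(10L, 12L, 9L, NA, 11L, 13L)
)

# Structure: number of rows and columns, and the type of each column
glimpse(toy)

# Convert a character column to a factor so it is treated as categorical
toy <- toy %>% mutate(group = factor(group))

# Missing values per variable, as counts and as proportions
missing_counts <- toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))
missing_props <- toy %>%
  summarise(across(everything(), ~ mean(is.na(.x))))

# Share of observations in each category, to check for class imbalance
group_balance <- toy %>%
  count(group) %>%
  mutate(prop = n / sum(n))

missing_counts
missing_props
group_balance
```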

Task 3: Data visualization (15 points)

Alright! Now we are ready to start making some plots to gain more insight into our dataset. For this case study, please avoid base R plotting functions (such as plot, lines, etc.). You are welcome to use them later in your tasks, but we want you to practise using the functions from ggplot2 and dplyr that we learned in class. For this task, answer three out of four questions of your choice.

Select two numerical variables from the dataset and name them here. What type of plot would be appropriate to explore the relationship between these two variables? Write the R code to produce this plot. Describe any pattern you observe. (Reasoning: 2.5 points + Coding: 2.5 points)
Create an appropriate plot to show the distribution of one numerical variable in the dataset. What does the distribution tell you about the variable? What is a typical observation from this distribution? (Reasoning: 2.5 points + Coding: 2.5 points)
Pick one numerical and one categorical variable from your dataset. Using an appropriate plot, what insights can you draw about the relationship between these two variables? (Reasoning: 2.5 points + Coding: 2.5 points)
Create an appropriate plot to show the distribution of one categorical variable in the dataset. What does the distribution tell you about the variable? Is there a “typical” value? (Reasoning: 2.5 points + Coding: 2.5 points)
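A minimal ggplot2 sketch of the four plot types these questions point towards is shown below. The data frame toy and its columns (group, x, y) are invented for illustration, and the geoms chosen here are common defaults, not the only appropriate choices for your dataset.

```r
library(tidyverse)

# Hypothetical toy data, invented for this sketch
toy <- tibble(
  group = rep(c("a", "b"), each = 5),
  x = c(1.0, 2.1, 2.9, 4.2, 5.1, 1.5, 2.4, 3.6, 4.1, 5.5),
  y = c(2.2, 3.9, 6.1, 8.3, 9.8, 2.0, 3.1, 4.2, 4.8, 6.0)
)

# Two numerical variables: scatter plot
p_scatter <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  labs(x = "x (units)", y = "y (units)")

# One numerical variable: histogram of its distribution
p_hist <- ggplot(toy, aes(x = y)) +
  geom_histogram(bins = 5)

# One numerical and one categorical variable: side-by-side boxplots
p_box <- ggplot(toy, aes(x = group, y = y)) +
  geom_boxplot()

# One categorical variable: bar chart of counts per category
p_bar <- ggplot(toy, aes(x = group)) +
  geom_bar()

# Display one of the plots
p_scatter
```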

Task 4: Find summary statistics (10 points)

In this task, you will work with summary statistics. These numbers, such as means, medians, and standard deviations, can help us better understand patterns in the data, especially when used alongside plots. While we should not rely on summary statistics alone to draw conclusions, combining them with visualizations often gives a clearer picture. Choose and answer any two out of the three questions in this task.

Choose one numerical variable from your dataset. Report an appropriate measure of central tendency and explain why this measure is appropriate for the variable you selected. (Reasoning: 3 points + Coding: 2 points)
Choose one categorical variable from your dataset. Report a relevant summary statistic and explain why it is informative. (Reasoning: 3 points + Coding: 2 points)
Choose two variables from your dataset and compute an appropriate measure of association to quantify their relationship. Interpret the value in plain language. (Reasoning: 3 points + Coding: 2 points)
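The kinds of summaries mentioned in this task can be sketched with dplyr as below. The data frame toy is invented for illustration, and Pearson correlation is only one possible measure of association; which statistics are appropriate depends on the variables you pick.

```r
library(tidyverse)

# Hypothetical toy data, invented for this sketch
toy <- tibble(
  group = rep(c("a", "b"), each = 5),
  x = c(1.0, 2.1, 2.9, 4.2, 5.1, 1.5, 2.4, 3.6, 4.1, 5.5),
  y = c(2.2, 3.9, 6.1, 8.3, 9.8, 2.0, 3.1, 4.2, 4.8, 6.0)
)

# Numerical variable: centre and spread (na.rm = TRUE drops missing values)
center <- toy %>%
  summarise(
    mean_y = mean(y, na.rm = TRUE),
    median_y = median(y, na.rm = TRUE),
    sd_y = sd(y, na.rm = TRUE)
  )

# Categorical variable: count and proportion of each category
group_props <- toy %>%
  count(group) %>%
  mutate(prop = n / sum(n))

# Two numerical variables: Pearson correlation, a value between -1 and 1
assoc <- cor(toy$x, toy$y, use = "complete.obs")

center
group_props
assoc
```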

Task 5: Develop a question (12 points)

By now, you have spent some time exploring your dataset and have a better understanding of its structure and context. In addition to answering research questions, data scientists often play a key role in developing them. This task gives you a chance to practice that skill. Before attempting the questions, make sure to read the study’s paper carefully. Answer all of the questions in this task.

Think of a research question you would like to explore using this dataset if you were the author of this paper. Write your question in one or two sentences in simple and plain language. (Reasoning: 4 points)
Based on the question you wrote, do you think you need to split your data into training and test sets? Answer in two or three sentences. (Reasoning: 4 points)
Considering your dataset and research question, create a new feature derived from the existing variables in your dataset that could help you answer your question. Describe how you constructed this variable and explain why you think it will be useful. (Reasoning: 2 points + Coding: 2 points)
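As an illustration of what deriving a feature can look like, the sketch below builds two new columns with mutate(). The data frame toy and every column name in it (n_correct, n_items, accuracy, high_accuracy) are hypothetical, as is the 0.75 cut-off; your own feature should follow from your research question.

```r
library(tidyverse)

# Hypothetical toy data, invented for this sketch
toy <- tibble(
  id = 1:4,
  n_correct = c(8, 5, 9, 6),
  n_items = c(10, 10, 12, 8)
)

toy <- toy %>%
  mutate(
    # Ratio feature: share of items answered correctly
    accuracy = n_correct / n_items,
    # Binary feature: flag participants at or above an arbitrary cut-off
    high_accuracy = accuracy >= 0.75
  )

toy
```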

🏁 Milestone 2: Sampling and Study Design (35 points)

In this part of the case study, we will practise some of the sampling and study design techniques that we learned in class. Your goal is to recognize the sampling method and the study design that the authors used.

Task 6: Describe design (15 points)

Understanding the design of a study is important. You can often find information about the design in the methodology section of a paper or in its introduction. Spend some time skimming through the text and answer all questions in this task.

What type of data analysis questions are the authors of the paper hoping to answer? Recall the different types of data analysis questions you learned in DSCI 100 (e.g., descriptive, inferential, predictive, causal, exploratory). Which category or categories best describe the authors’ goals? (Reasoning: 5 points)
Describe the population and sample used in this study. Clearly define both the population of interest and the sample that was actually observed in the study. (Reasoning: 5 points)
Was this an observational study or a controlled experiment? Explain your reasoning using relevant terminology. (Reasoning: 5 points)

Task 7: Dive into the design! (10 points)

As we discussed in the lecture notes, it is important to think about the presence of confounding variables in a study design. The way we choose to sample from the population can also introduce sampling bias. Answer two out of three questions in this task.

Did the researchers control for any confounding variables in this study? If yes, explain how. If no, can you think of any potential confounding variable that could be adjusted for? Explain your reasoning. (Reasoning: 5 points)
What type of sampling method did the authors use? Cite the section of the paper where the sampling method is described or from which it can be inferred. Why do you think the authors chose this method over others? (Reasoning: 5 points)
Were there any potential sources of sampling bias that may have affected the study’s conclusions? Explain your reasoning. (Reasoning: 5 points)
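To make the idea of a sampling method concrete, the sketch below contrasts a simple random sample with a stratified sample using dplyr. The population, its stratum labels, and the sample sizes are all invented for illustration and have nothing to do with the three studies.

```r
library(tidyverse)

set.seed(200)  # make the random draws reproducible

# Hypothetical population of 100 units in two unequal strata
population <- tibble(
  id = 1:100,
  stratum = rep(c("urban", "rural"), times = c(80, 20))
)

# Simple random sample: every unit has the same chance of selection,
# so the smaller stratum may end up with few or no units
srs <- population %>%
  slice_sample(n = 10)

# Stratified sample: draw a fixed number of units within each stratum
stratified <- population %>%
  group_by(stratum) %>%
  slice_sample(n = 5) %>%
  ungroup()

# Compare how many units each approach drew per stratum
count(srs, stratum)
count(stratified, stratum)
```

Comparing the two count() tables across different seeds shows how a simple random sample can under-represent a small group, which is one way sampling choices can shape a study's conclusions.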

Task 8: Now it is your turn! (10 points)

Having done the previous questions in different tasks, you now have a better understanding of the study that you chose. Take a moment to think about all you have explored here and use that to answer the following question:

If you were to design this study yourself, would you use a different sampling and design method? Why or why not? Reflect on what you might do differently and how your choice of sampling method might affect the results. Additionally, consider why the authors might have chosen not to use alternative sampling or design methods. What are the potential limitations or challenges of your proposed approach? (Reasoning: 10 points)

Reproducibility and Organization Checklist (5 points)

Here are the criteria we’re looking for:

The document should read sensibly from top to bottom, with no major continuity errors.
All output is recent and relevant.
All required files are present in the submission and viewable without errors. Examples of errors: missing plots, “Sorry about that, but we can’t show files that are this big right now” messages, error messages from broken R code.
All of these output files are up-to-date and there should be no relic output files. For example, if you were exporting a notebook to html, but then changed the output to be only a pdf file, then the html file is a relic and should be deleted.

Attribution

This case study incorporates elements from the projects in UBC’s STAT 545: Exploratory Data Analysis.