DSCI 200
Katie Burak, Gabriela V. Cohen Freue
Last modified – 13 March 2026
This material is adapted from the following sources:
By the end of today’s lesson, you should be able to:
- Use the `rvest` package (part of the tidyverse, but not core) to extract data from text and HTML attributes into R.
- Use the `robotstxt` package to verify whether we can scrape data from a particular URL.
Many websites present their content (including data) using HyperText Markup Language (HTML).


<html>
<head>
<meta charset="UTF-8">
<title>Sample Webpage</title>
</head>
<body>
<h2 class="main-title">Welcome to My Site!</h2>
<p>Here is some sample text with <em>emphasis</em> included.</p>
<img src="sample-image.png" alt="A descriptive image" width="200">
</body>
</html>

Key elements:

- `<html>`: the root element.
- `<head>`: metadata like the title and styles.
- `<body>`: the visible content.

Common tags include:

- `<h1>` – headings
- `<p>` – paragraphs
- `<section>` – sections
- `<ol>` – ordered lists
- `<b>` – bold
- `<i>` – italics
- `<a>` – links

In a fragment like `<p><b>name</b></p>`, `<p>` is the parent and `<b>` is the child; `<b>` has no children, but it has the content "name".

Attributes provide extra information about elements.
Example syntax:

<a href="https://ubc-stat.github.io/dsci-200/">Click here</a>
<a> is the tag (HTML element) for a link.
href is the attribute of the <a> tag, which specifies the URL the link points to.
“https://ubc-stat.github.io/dsci-200/” is the value of the href attribute, indicating the destination of the link.
Click here is the content inside the tag, which the user will see and click.
Common attributes include:

- `id` – a unique identifier for an element (e.g., `id='header'`)
- `class` – a class name shared by related elements (e.g., `class='nav-item'`)
- `href` – the destination of a link (e.g., `<a href='example.com'>`)
- `src` – the source of an image (e.g., `<img src='photo.jpg'>`)

`id` and `class` are used with CSS (Cascading Style Sheets) to control the appearance of a page and are often useful when web scraping data!
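As a quick sketch of why `id` and `class` matter when scraping (using the `rvest` package introduced below; the page and selector values here are illustrative only, not from a real site):

```r
library(rvest)

page <- read_html("
  <div id='header'>My Site</div>
  <a class='nav-item' href='https://example.com/about'>About</a>
  <a class='nav-item' href='https://example.com/contact'>Contact</a>")

# '#header' targets the single element with that id
page |> html_element("#header") |> html_text2()
#> [1] "My Site"

# '.nav-item' targets every element with that class
page |> html_elements(".nav-item") |> html_attr("href")
#> [1] "https://example.com/about"   "https://example.com/contact"
```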
Key `rvest` functions:

- `read_html()`: loads HTML content from a webpage URL or a character string.
- `html_element()`: retrieves a single matching element based on a CSS selector.
- `html_elements()`: retrieves all matching elements using CSS selectors.
- `html_table()`: converts HTML tables into data frames.
- `html_text()` / `html_text2()`: gets the text content from within HTML tags.
- `html_name()`: returns the tag name(s) of HTML elements.
- `html_attr()`: retrieves a single attribute.
- `html_attrs()`: retrieves all attributes.
html <-
'<html>
<head>
<title>Sample Webpage</title>
</head>
<body>
<h2 class="sub-title">Welcome to My Site!</h2>
<p>Here is some sample text.</p>
<p> Some useful information... </p>
<img src="sample-image.png" alt="A descriptive image" width="200">
</body>
</html>'
read_html(html)

{html_document}
<html>
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n <h2 class="sub-title">Welcome to My Site!</h2>\n <p>Here i ...
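For example, reusing the sample page (redefined here so the block is self-contained), we can grab elements, their text, and their attributes:

```r
library(rvest)

html <- read_html('
  <html>
  <body>
    <h2 class="sub-title">Welcome to My Site!</h2>
    <p>Here is some sample text.</p>
    <p> Some useful information... </p>
    <img src="sample-image.png" alt="A descriptive image" width="200">
  </body>
  </html>')

# First matching element only
html |> html_element("p") |> html_text2()
#> [1] "Here is some sample text."

# All matching elements
html |> html_elements("p") |> html_text2()
#> [1] "Here is some sample text." "Some useful information..."

# A single attribute
html |> html_element("img") |> html_attr("src")
#> [1] "sample-image.png"
```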
`html_element()` always returns the same number of outputs as inputs. If you apply it to a whole document, it gives you the first match.

Use `html_text()` or `html_text2()` to get text content: `html_text()` returns the raw text, while `html_text2()` tries to return text as it would appear in a web browser.

Use `html_attr()` to extract attributes like links.

When scraping structured data, use `html_elements()` and `html_element()` together: `html_elements()` identifies the elements that will become observations, and `html_element()` extracts variables from those elements. Here's an example using a simple HTML list of coffee shops around UBC:

html <- read_html("
<ul>
<li><b>Loafe Cafe</b> serves <i>coffee and pastries</i> and has <span class='seating'>indoor & outdoor seating</span></li>
<li><b>Bean Around the World</b> serves <i>great coffee</i></li>
<li><b>JJ Bean</b> serves <i>strong espresso</i> and has <span class='seating'>cozy indoor seating</span></li>
<li><b>The Great Dane</b> has <span class='seating'>a dog-friendly patio</span></li>
</ul>
")We use html_elements("li") to create an object of class xml_nodeset where each element represents a different coffee shop:
{xml_nodeset (4)}
[1] <li>\n<b>Loafe Cafe</b> serves <i>coffee and pastries</i> and has <span c ...
[2] <li>\n<b>Bean Around the World</b> serves <i>great coffee</i>\n</li>
[3] <li>\n<b>JJ Bean</b> serves <i>strong espresso</i> and has <span class="s ...
[4] <li>\n<b>The Great Dane</b> has <span class="seating">a dog-friendly pati ...
To extract the name of each shop, we use html_element("b"):
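Putting the two steps together (a sketch; the coffee-shop HTML is redefined here so the block is self-contained):

```r
library(rvest)

html <- read_html("
  <ul>
    <li><b>Loafe Cafe</b> serves <i>coffee and pastries</i> and has <span class='seating'>indoor & outdoor seating</span></li>
    <li><b>Bean Around the World</b> serves <i>great coffee</i></li>
    <li><b>JJ Bean</b> serves <i>strong espresso</i> and has <span class='seating'>cozy indoor seating</span></li>
    <li><b>The Great Dane</b> has <span class='seating'>a dog-friendly patio</span></li>
  </ul>")

# One <b> per <li>, so we get exactly one name per shop
html |> html_elements("li") |> html_element("b") |> html_text2()
#> [1] "Loafe Cafe"            "Bean Around the World" "JJ Bean"
#> [4] "The Great Dane"
```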
Suppose we want one seating description for each shop, even if some don’t have seating information.
What if we used html_elements(".seating")?
{xml_nodeset (3)}
[1] <span class="seating">indoor & outdoor seating</span>
[2] <span class="seating">cozy indoor seating</span>
[3] <span class="seating">a dog-friendly patio</span>
`html_element()` returns one result per input node, whereas `html_elements()` returns all matches (dropping missing ones).
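Because `html_element()` returns exactly one result per `<li>`, the columns stay aligned and shops without seating info get an `NA`. A sketch (the tibble assembly is our addition; the HTML is redefined so the block is self-contained):

```r
library(rvest)
library(tibble)

html <- read_html("
  <ul>
    <li><b>Loafe Cafe</b> has <span class='seating'>indoor & outdoor seating</span></li>
    <li><b>Bean Around the World</b> serves <i>great coffee</i></li>
    <li><b>JJ Bean</b> has <span class='seating'>cozy indoor seating</span></li>
    <li><b>The Great Dane</b> has <span class='seating'>a dog-friendly patio</span></li>
  </ul>")

shops <- html |> html_elements("li")

# One row per shop; seating is NA where no .seating span exists
tibble(
  name    = shops |> html_element("b") |> html_text2(),
  seating = shops |> html_element(".seating") |> html_text2()
)
#> # A tibble: 4 × 2 -- seating is NA for Bean Around the World
```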
HTML tables are often used to store tabular data on web pages, and you can easily extract that data. Key HTML table elements:

- `<table>`: defines the table.
- `<tr>`: table row.
- `<th>`: table heading.
- `<td>`: table data.

Here's a simple HTML table with two columns and three rows:
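The HTML source for this example is not shown above; a reconstruction consistent with the `html_table()` output below would be:

```r
library(rvest)

# Reconstructed sample table: a header row plus three data rows
html <- read_html("
  <table>
    <tr><th>x</th><th>y</th></tr>
    <tr><td>1.5</td><td>2.7</td></tr>
    <tr><td>4.9</td><td>1.3</td></tr>
    <tr><td>7.2</td><td>8.1</td></tr>
  </table>")
```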
We can use the `html_table()` function from the `rvest` package to extract tables from HTML pages:

# A tibble: 3 × 2
x y
<dbl> <dbl>
1 1.5 2.7
2 4.9 1.3
3 7.2 8.1
The `html_table()` function automatically converts the values in columns like `x` and `y` to numbers.
If you want to prevent this automatic conversion and handle it yourself, use the convert = FALSE argument.
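A sketch of the difference on a minimal table (with `convert = FALSE`, the columns come back as character rather than numeric):

```r
library(rvest)

html <- read_html("
  <table>
    <tr><th>x</th><th>y</th></tr>
    <tr><td>1.5</td><td>2.7</td></tr>
  </table>")

# Default: values parsed as numbers (<dbl> columns)
html |> html_element("table") |> html_table()

# convert = FALSE: values left as strings (<chr> columns)
html |> html_element("table") |> html_table(convert = FALSE)
```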
CSS (Cascading Style Sheets) helps define page structure and styling. To extract elements efficiently, we use selectors such as:
- `p` → selects all `<p>` elements.
- `.title` → selects elements with class "title".
- `#title` → selects the element with ID "title".

Finding the right CSS selector is typically the hardest part of web scraping: you want a selector specific enough that you aren't capturing unnecessary information, but not so narrow that you miss important information.
Instead of inspecting raw HTML code to find the right CSS selector, we are going to use a tool called SelectorGadget.
SelectorGadget works best with Chrome browsers. If you haven’t done so already, install the Chrome extension here.
Once added, the icon should appear to the right of the search bar.

You’re planning an adventurous getaway to the stunning wilderness of Lake Louise in Banff National Park, Alberta. To make the most of your trip, you want to gather up-to-date information on the best hiking trails in the area with details like trail distance and elevation gain.
In this exercise, you’ll explore a real-world data collection task by scraping hiking trail information from Parks Canada.
https://parks.canada.ca/pn-np/ab/banff/activ/randonnee-hiking/lakelouise
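A sketch of the permission check (requires internet access; the result reflects the site's current robots.txt):

```r
library(robotstxt)

# Returns TRUE if the site's robots.txt permits bots to access this path
paths_allowed("https://parks.canada.ca/pn-np/ab/banff/activ/randonnee-hiking/lakelouise")
```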
To check whether scraping is permitted, we use the `paths_allowed()` function from the `robotstxt` package. This function checks if a bot has permission to access page(s) and returns `TRUE` if allowed based on the site's robots.txt file.

We then use `rvest` to read HTML content from the Parks Canada site (https://parks.canada.ca/pn-np/ab/banff/activ/randonnee-hiking/lakelouise):

{html_document}
<html class="no-js" lang="en" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="" vocab="http://schema.org/" typeof="WebPage">\r\n \r\n\r ...
# A tibble: 6 × 3
trail distance elevation
<chr> <chr> <chr>
1 Lake Louise Lakeshore 2.3 km minimal
2 Fairview Lookout 1.2 km 100 m
3 Bow River Up to 5.7 km minimal
4 Rockpile 0.7 km loop 35 m
5 Moraine Lake Lakeshore 1.3 km minimal
6 Consolation Lakes 2.9 km 135 m
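The code that produced `hikes_raw` is not shown above. A sketch of how such a table could be assembled is below; the CSS selectors are hypothetical placeholders, and the real ones must be found with SelectorGadget on the live page:

```r
library(rvest)
library(tibble)

page <- read_html("https://parks.canada.ca/pn-np/ab/banff/activ/randonnee-hiking/lakelouise")

# ".trail", ".distance", and ".elevation" are made-up selectors --
# substitute the selectors SelectorGadget reports for this page
hikes_raw <- tibble(
  trail     = page |> html_elements(".trail")     |> html_text2(),
  distance  = page |> html_elements(".distance")  |> html_text2(),
  elevation = page |> html_elements(".elevation") |> html_text2()
)
```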
To clean these columns, we can use the `stringr` package. This is not a specific learning objective of this course, but some string manipulation may be needed depending on the data. Please consult your instructor or a TA for any additional guidance on working with strings and regular expressions in R.

library(stringr)
hikes_clean <- hikes_raw |>
mutate(
distance_km = str_extract(distance, "\\d+(\\.\\d+)?") |> as.numeric(),
elevation_m = case_when(
str_detect(elevation, "minimal") ~ 0,
TRUE ~ str_extract(elevation, "\\d+") |> as.numeric()
)
) |>
select(-distance, -elevation)
head(hikes_clean)

# A tibble: 6 × 3
trail distance_km elevation_m
<chr> <dbl> <dbl>
1 Lake Louise Lakeshore 2.3 0
2 Fairview Lookout 1.2 100
3 Bow River 5.7 0
4 Rockpile 0.7 35
5 Moraine Lake Lakeshore 1.3 0
6 Consolation Lakes 2.9 135
Now that the data is in a tidy format, we are free to explore! What are some of the longest day hikes in the Lake Louise area?
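One way to answer this (a sketch; `hikes_clean` is rebuilt inline from the table above so the block runs on its own):

```r
library(dplyr)
library(tibble)

hikes_clean <- tribble(
  ~trail,                   ~distance_km, ~elevation_m,
  "Lake Louise Lakeshore",  2.3,          0,
  "Fairview Lookout",       1.2,          100,
  "Bow River",              5.7,          0,
  "Rockpile",               0.7,          35,
  "Moraine Lake Lakeshore", 1.3,          0,
  "Consolation Lakes",      2.9,          135
)

# Longest hikes first
hikes_clean |> arrange(desc(distance_km))
#> Bow River (5.7 km) leads, followed by Consolation Lakes (2.9 km)
```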
It is good practice to save your scraped data to disk with `write_csv()` and use the saved data in your analysis.

Web scraping has both ethical and legal considerations, which can vary depending on your location.
We will discuss data ownership in more detail later in the course, but for now we will discuss some general guidelines.
Note: While we are providing general guidelines and best practices for web scraping, please note that we are not lawyers. If you are unsure about the legal or ethical implications of a specific scraping project, you should consult a qualified legal professional. Always prioritize respecting website terms, data privacy, and intellectual property rights.
General Guidelines:
In this exercise, we’ll explore the Terms of Service on a website. Your task:
We will share our findings and discuss how these terms might affect web scraping.
As a class, let’s take a look at Spotify’s User Guidelines.

`rvest` package for scraping and parsing HTML content