DSCI 100 function reference sheet for R

This reference sheet contains the key objects that we use in DSCI 100, and a brief syntax example for each of the main packages. During the closed book exams, you will still have access to this page, so get familiar with it already now. There is no guarantee that every function or parameter in the textbook is covered here, but if you think there is something missing, please let us know and we can consider adding it.

Note that we have only described use cases relevant to DSCI 100. Sometimes we have included the exact parameter name of a function, e.g. print(x), other times we have opted to included a more descriptive name, e.g. mean(column).

Base R Operations

Function Description
abs(x) Convert numeric value(s) to absolute value
as.data.frame(x) Converts an object to a data frame
as.numeric(x) Converts a variable to a numeric data type
c(1,2,3) Combines values into a vector or list in R
is.na(column) Detect missing (NA) values in a vector or data frame
dim(column) Returns dimensions (rows and columns) of an R object
max(column) Returns maximum value in a numeric vector
mean(column) Returns average value in a numeric vector
median(column) Returns the median value in a numeric vector
min(column) Returns minimum value in a numeric vector
n() Counts the number of rows in a table’s group
names(tbl) Assigns or retrieves names of elements in an R object
ncol(tbl) Returns the number of columns in a matrix/data frame
nrow(tbl) Returns the number of rows in a matrix/data frame
print(x) Displays specified object’s value
round(num, digits) Rounds a number to specified decimals
sd(column) Calculates standard deviation for numeric data
seq(from, to, by) Generates a sequence of numbers
sum(column) Calculates the sum of numeric values in a vector or matrix
sort(df) Sorts a vector or data frame in ascending order
sqrt(num) Computes the square root of a numeric value

Operators

Function Description
== Compares two values and returns TRUE if they are equal
%in% Checks if elements on the left side are in the right
! Negates a logical value (!TRUE is FALSE)
& Performs element-wise logical AND operations
| Represents the OR logical operator
|> Pipe operator, which passes data from left to right

Data Reading

Function Description
download.file(url, destfile) Download a file from the web
read_csv(filepath) Reads comma-separated values into a data frame
read_csv2(filepath) Reads CSV files with semicolon delimiter
read_delim(filepath, delim, skip, col_names) Reads data from a delimited text file
read_excel(filepath) Reads Excel files into R data frames
read_html(filepath) Reads and parses HTML web pages
read_tsv(filepath) Reads tab-separated values into a data frame
write_csv(tbl, filepath) Writes data to a CSV file

Database functions:

Function Description
collect(database_table) Convert a database table to a tibble
dbConnect(database, dbname) Establishes a connection to a database
dbListTables(dbConnect_object) Lists tables in a database connection
RPostgres::Postgres() Connects to and interacts with PostgreSQL databases
RSQLite::SQLite() Access and manage SQLite database connections
tbl(dbConnect_object, table_name) Creates a data frame from a data source

Data Wrangling

Function Description
across(column_range, function) Apply the given function to each column in the specified column range
arrange(tibble, columns_as_arguments) Order rows by the values of the given columns (default is increasing)
colnames(tbl) Get a list of column names from a tibble
desc(column) Sort a column (or numeric vector) in descending order
everything() Select all variables (used in other functions)
filter(tbl, condition) Keep rows that match a condition
fct_reorder(factor_column, ordering_column, .desc = FALSE) Reorder a column by sorting according to another column
group_by(tbl, columns_as_arguments) Group a tibble by the list of columns provided
map(tbl, function) Apply the given function to each column, creating a list
map_chr(tbl, function) Apply the given function to each column, creating a character vector
map_df(tbl, function) Apply the given function to each column, creating a data frame
mutate(tbl, column_name = ...) Create or modify a column in a tibble
pivot_longer(tbl, column_range, names_to = ..., values_to = ...) Move values from column names to cells
pivot_wider(tbl, names_from = ..., values_from = ...) Move variables from cells to column names
pull(tbl, variable) Extract a single variable from a tibble
rowwise(tbl) Organize a tibble row-by-row for other functions
select(tbl, columns_as_arguments) Keep the given columns
semi_join(tbl, joining_tbl) Keep rows that have matching values in joining_tbl
separate(tbl, column, into, sep) Split values in a column into new columns based on a separator
summarize(tbl, summaries_as_arguments) Compute summary statistics on columns
ungroup(tbl) Undo the effect of group_by()

Functions used to convert one type to another:

Function Description
as_datetime(formatted_string) Convert a string to a Date object
as_factor(column) Convert a column to a factor / categorical variable
as_tibble(object) Convert an object to a tibble

Slicing functions:

Function Description
head(tbl) Get the first 6 rows of a tibble
slice(tbl, row_range) Keep rows in the given range
slice_max(tbl, ordering_column, n) Keep the n rows with the largest values of a variable
slice_min(tbl, ordering_column, n) Keep the n rows with the smallest values of a variable
unique(tbl) Delete duplicate rows
tail(tbl) Get the last 6 rows of a tibble

Functions used to manipulate strings:

Function Description
str_extract(string, pattern) Extract the first substring matching the given pattern
str_replace_all(string, pattern, replacement) Replace all substrings matching the given pattern
tolower(string) Convert a string to all-lowercase
toupper(string) Convert a string to all-uppercase

Visualization

A typical ggplot2 syntax for creating a new plot looks something like this:

library(tidyverse)

my_data |> ggplot(aes(x = column1, y = column2)) +
  geom_point()
Function Description
aes(x, y, ...) Specifies how variables in the data are mapped to properties of the plot
element_text(size, colour) Used with theme system to control text size, colour, etc.
facet_grid(rows, cols) Creates matrix panels with plots based on specified rows or cols variable
facet_wrap(facets) Creates a ribbon of panels wrapped in 2d using specified facets
ggmap(map) Used to display visual maps from Google Maps or Stamen Maps
ggpairs(tbl) Plots each variables against all the other variables in a scatterplot matrix
ggplot(tbl, mapping) Initialize a ggplot object, specifying the data and aesthetic mapping for the plot
ggsave(filename, plot) Saves specified plot with given filename to device
ggtitle(title) Adds specified title to the plot
labs(x, y, fill, colour, ...) Modifies labels on the plot, specifying what the new labels should be
scale_color_manual(values) Manually change the colour for plots by specifying the values
scale_fill_brewer(palette) Changes the fill colour palette to the specified palette
scale_fill_distiller(palette) Changes the fill colour palette for continuous scales
scale_x_continuous(limits) Customize x-axis scales for continuous x variables with specified options
scale_x_date(limits, breaks) Customize the x-axis scales for date or time variables in a plot
scale_y_continuous(limits) Customize y-axis scales for continuous y variables with specified options
theme(text) Used to modify the non-data components of the plot with specified options
xlab(label) Modifies the x-axis label to the specified label
xlim(lo, hi) Displays only the specified range on the x-axis of the plot
ylab(label) Modifies the y-axis label to the specified label
ylim(lo, hi) Displays only the specified range on the y-axis of the plot
vars(columns_as_arguments) Choose variables to split a plot on in facet_grid()

Commonly used geometric objects are listed below.

Function Description
geom_abline(slope, intercept) Adds a diagonal line to the plot with specified intercept and slope
geom_bar(stat) Used to create bar graphs with specified stat (often “identity” or “count”)
geom_density() Used to create a smoothened line version of a histogram
geom_freqpoly() Used to create a lined (not smooth) version of a histogram
geom_histogram(bins, binwidth) Creates histogram plots with a specified number of bins and bin width
geom_line() Adds lines to connect data points in the order of the x-axis
geom_point() Used to create a scatterplot graphs
geom_segment(x, y, xend, yend) Draws a straight line on plot connecting (x, y) to (xend, yend)
geom_vline(xintercept) Adds a vertical line to the plot at the specified x-intercept

Modeling

A typical tidymodels workflow looks something like this:

library(tidymodels)

knn_fit <- workflow() |>
  add_recipe(my_recipe) |>
  add_model(knn_spec) |>
  fit(data = my_data)

pred <- predict(knn_fit, new_data)

The functions below are relevant for Week 7 (classification1) and beyond.

Function Description
add_model(workflow, model_spec) Add a model to a workflow
add_recipe(workflow, model_recipe) Add a recipe to a workflow
add_row(data, col1, col2) Add rows to a dataframe
all_predictors() Select all predictors
bake(recipe, data) Applies the results of prep() into the data
bind_cols(df1, df2) Combine multiple dataframes together
dist(data, method) Computes and returns the distance matrix
as_factor(data, variable) Converts a variable to a factor type
fit(model, data) Add data to a workflow to build a fitted model
predict(fitted_model, new_obs) Predict values based on model and data
prep(recipe) Prepares data for preprocessing
nearest_neighbor(weight_func, neighbors) Specify that the model is K-Nearest-Neighbor
recipe(formula, data) Prepares data for modelling
set_engine(engine) Specify package to fit the model
set_mode(mode) Specify modelling context used
set.seed(n) Make randomization reproducible
step_center(recipe) Center variables in recipe
step_rm(recipe) Removes specified variables
step_scale(recipe) Scale variables in recipe
workflow() Create workflow

The functions below extend the above table for material in Week 8 (classification2) and beyond.

Function Description
apparent(data) Sampling for the apparent error rate
augment(fit, data) Add predictions/residuals/cluster assignments to dataframe
collect_metrics(fitted_model) Aggregate the mean and standard error of the model’s accuracy across the folds
conf_mat(data, truth, estimate) Computes and returns the confusion matrix
fit_resamples(model, resamples) Runs cross-validation on each train/validation split to build a fitted model
glance(fitted_model) Obtain total WSSD of a cluster model
initial_split(data, prop, strata) Splits the data
k_means(num_clusters) Specify that the model is kmeans clustering
kmeans(data, centers, nstart) Runs k-means clustering on the given data for the specified number of clusters and starts
list(objects) Create a list of elements of different types
linear_reg() Specify that the model is linear regression
metrics(data, truth, estimate) Returns the model’s accuracy metrics
testing(data) extract testing data
training(data) extract training data
tune() Tune neighbors
tune_cluster(model, resamples, grid) Run kmeans on multimple resamples of data
tune_grid(model, resamples, grid) Fit the model for each value in a range of parameter values
unlist(list) Convert a list to a vector
unnest(tbl, list_column) Expand a column containing a list of tibbles into rows and columns
vfold_cv(data, v, strata) Perform cross validation

Inference

Function Description
quantile(data, percentiles) Finds the specified percentiles in the given data
rep_sample_n(tbl, size, reps, replace) Takes samples of the table according to the size, reps, and replace options
sample_n(tbl, num) Random selects the specified number of rows from the table