DSCI 200
Katie Burak, Gabriela V. Cohen Freue
Last modified – 19 January 2026
This material is based on content adapted from
In a random phenomenon there is uncertainty about which of several potential outcomes will take place.
Probability of events: a number in \([0, 1]\) that measures the uncertainty of an event.
A random variable assigns numbers to outcomes of a random process
The distribution of a random variable is a function that describes its variability
Randomness can come from:
Which of these random processes can we easily repeat thousands of times? Which are difficult or impossible to reproduce in practice?
Definition
Simulation is a way of recreating a random process with code so that we can repeatedly observe its outcomes and reproduce the same process under identical conditions.
Each simulation run randomly results in one possible outcome of the process.
Many runs reveal patterns of variability in the outcomes of the process.
However,
a real process may be difficult to recreate with code!
we may need to make many (and some unrealistic) assumptions to recreate a process.
even when a process is easy to describe, simulating it many times may be computationally costly.
We often use simulation to:
approximate a probability that is hard to compute analytically
approximate a distribution that is unknown or difficult to derive
understand variability (e.g., a standard error)
study properties of estimators
To simulate a process, we follow a clear set of steps.
A simulation always starts with a clearly defined process.
One of the oldest documented problems in probability asks the following question:
If three fair six-sided dice are rolled, what is more likely: a sum of 9 or a sum of 10?
What is the random process we would repeat many times using code?
In the three-dice problem, how does randomness enter the process?
Every time we run this code, we’re simulating one roll of three fair dice! The outcome varies randomly, but the probability model is fixed.
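A minimal sketch of one run, assuming the dice are simulated with base R's `sample()` as in the replication code later in these slides. The name `one_roll` is only illustrative.

```r
# One run of the random process: roll three fair six-sided dice.
# replace = TRUE lets the same face appear on more than one die.
one_roll <- sample(1:6, size = 3, replace = TRUE)
one_roll
```

Each call returns three values between 1 and 6. The values change from run to run, while the model (three independent fair dice) stays the same.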
What do we record from each run?
If three fair six-sided dice are rolled, what is more likely: a sum of 9 or a sum of 10?
You played with the random generator app and already made a decision of what to record. What did you record from each run?
We may record different results from the same random process!
If the goal is to study dice behavior → record all dice
If the goal is to compare sums → record the sum
If the goal is a probability → record an indicator
We may also record results in anticipation of future questions.
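The three recording choices above can be sketched for a single run as follows. The object names (`dice`, `all_dice`, `dice_sum`, `is_sum_9`) are illustrative assumptions, not names used elsewhere in the course.

```r
# One run of the process: roll three fair dice.
dice <- sample(1:6, size = 3, replace = TRUE)

# Three possible records of the same run, from most to least detailed.
all_dice <- dice             # goal: study dice behavior
dice_sum <- sum(dice)        # goal: compare sums
is_sum_9 <- sum(dice) == 9   # goal: estimate P(sum = 9) via an indicator

list(all_dice = all_dice, dice_sum = dice_sum, is_sum_9 = is_sum_9)
```

Recording all dice keeps the most information: the sum and the indicator can always be recomputed from it later, but not the other way around.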
Later, we’ll see how we can make simulations reproducible by controlling the random number generator.
Write code to record the following outcomes from one run
Compare the result you obtained with others. Did you get the same result?
Everyone simulated the same process
Everyone followed the same instructions
Everyone most probably got different answers
So… which answer is correct?
All results are correct, but a single run tells us almost nothing about a probability.
You already repeated the process a few times by hand. Code lets us repeat it thousands of times, consistently!
Let’s repeat it \(10000\) times!
Once we have many runs, we summarize the results to answer our question.
We approximate each probability by the proportion of runs in which that sum occurs.
We can compute the proportion of sums equal to 9 using:
mean()
sum()
table()
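# Repeat the process 10000 times, recording the sum of the three dice
sums <- replicate(10000,
  sum(sample(1:6, size = 3, replace = TRUE))
)
# Proportion of runs with each sum
c(prob_9_hat = mean(sums == 9),
  prob_10_hat = mean(sums == 10))
 prob_9_hat prob_10_hat
     0.1109      0.1250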
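As a sketch of how `mean()`, `sum()`, and `table()` can each give the same proportion, the block below recomputes the estimate for a sum of 9 in three ways. It simulates its own vector, `demo_sums` (an illustrative name), so that it runs on its own.

```r
# Simulate 10000 runs, recording the sum of three fair dice in each run.
demo_sums <- replicate(10000, sum(sample(1:6, size = 3, replace = TRUE)))

# 1. mean() of a logical vector is the proportion of TRUE values.
p_mean <- mean(demo_sums == 9)

# 2. sum() counts the TRUE values; divide by the number of runs.
p_sum <- sum(demo_sums == 9) / length(demo_sums)

# 3. table() counts every observed sum; prop.table() converts counts to proportions.
p_table <- as.numeric(prop.table(table(demo_sums))["9"])

c(p_mean = p_mean, p_sum = p_sum, p_table = p_table)
```

All three values agree. `table()` is convenient when we want the proportions for every possible sum at once.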
Caution
Can you reproduce your result? Run the code again!
R generates random outcomes using a random number generator.
If we fix the starting point of that generator, R can reproduce the same random outcomes!
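A minimal sketch of fixing the starting point with `set.seed()`. The seed value 200 is an arbitrary choice made for this illustration, and `first_run` and `second_run` are illustrative names.

```r
# Fix the starting point (seed) of the random number generator.
set.seed(200)
first_run <- sample(1:6, size = 3, replace = TRUE)

# Resetting the same seed replays the same sequence of random outcomes.
set.seed(200)
second_run <- sample(1:6, size = 3, replace = TRUE)

identical(first_run, second_run)  # TRUE
```

Without the second `set.seed()` call, the generator would continue from where it left off and the two rolls would usually differ.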
library(ggplot2)
library(tidyr)

# Running estimates: proportion of runs so far with a sum of 9 (or 10)
three_dice_df <- data.frame(
  runs = seq_along(sums),
  prob_9_hat = cumsum(sums == 9) / seq_along(sums),
  prob_10_hat = cumsum(sums == 10) / seq_along(sums)
)

# Reshape to long format: one row per (run, event) pair
three_dice_long <- tidyr::pivot_longer(
  three_dice_df,
  -runs,
  names_to = "event",
  values_to = "estimate"
)

p_long_run <- ggplot(three_dice_long, aes(x = runs, y = estimate, color = event)) +
  geom_line() +
  # Exact probabilities: 25/216 for a sum of 9, 27/216 for a sum of 10
  geom_hline(yintercept = c(25, 27)/216,
             linetype = "dashed",
             color = c("#1b9e77", "#d95f02")) +
  labs(
    x = "Number of simulation runs",
    y = "Estimated probability",
    color = "",
    title = "Approximating probabilities by simulation"
  ) +
  theme(
    text = element_text(size = 10),
    axis.title = element_text(size = 10),
    axis.text = element_text(size = 9),
    legend.title = element_text(size = 9),
    legend.text = element_text(size = 9),
    plot.title = element_text(size = 11)
  )

In this example, the probabilities are not that difficult to compute analytically (dashed horizontal lines).
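One way to obtain the exact values is to enumerate all equally likely outcomes of three dice. This is a sketch using base R's `expand.grid()`; the names `outcomes` and `totals` are illustrative.

```r
# Enumerate all 6^3 = 216 equally likely outcomes of three fair dice.
outcomes <- expand.grid(die1 = 1:6, die2 = 1:6, die3 = 1:6)
totals <- rowSums(outcomes)

# Exact probabilities are proportions over the full outcome space.
c(prob_9 = mean(totals == 9), prob_10 = mean(totals == 10))
# 25/216 and 27/216, roughly 0.1157 and 0.1250
```

A sum of 10 is slightly more likely than a sum of 9, which matches what the simulated proportions suggest.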
Consider the following 2 codes:
and
A population mean can be estimated with a sample mean. But is this a good estimator?
Next class, we will use the wildfire data as an example of a population to answer this question. What is the random process we would simulate?
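As a rough preview of this idea, the sketch below treats a made-up vector as the population. It does not use the wildfire data: the population, the sample size of 50, the 2000 runs, and all object names are assumptions chosen only for illustration.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical population (NOT the wildfire data): 5000 right-skewed values.
population <- rexp(5000, rate = 1 / 10)
pop_mean <- mean(population)

# One run = draw a random sample of 50 units and record its sample mean.
sample_means <- replicate(2000, mean(sample(population, size = 50)))

# Compare the center and spread of the sample means with the population mean.
c(population_mean = pop_mean,
  mean_of_sample_means = mean(sample_means),
  sd_of_sample_means = sd(sample_means))
```

Here the random process is the sampling itself: each run draws a new sample, and the recorded result is the estimate computed from it.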
A simulation starts by defining a random process
What we record from each run depends on the question
A single run is not informative about probability
Repeating the same process many times is needed to study patterns.
Setting the seed of the random number generator makes a simulation reproducible.
UBC DSCI 200