Simulations - part I

DSCI 200

Katie Burak, Gabriela V. Cohen Freue

Last modified – 19 January 2026


Attribution



This material is based on content adapted from



Learning Objectives



  • Understand the mechanisms for generating and simulating data.
  • Contrast empirical and theoretical distributions.
  • Use simulations to approximate probability of events or distribution functions.
  • Use simulations to assess theoretical properties of random variables.
  • Explore the Central Limit Theorem (CLT) and Law of Large Numbers (LLN).
  • Write reproducible simulation code.
  • Interpret and reflect on simulation results using plots and summary statistics.

Last time: Randomness and distributions

  • In a random phenomenon there is uncertainty about which of several potential outcomes will take place.

    • Outcomes or events may not be equally likely.
    • Randomness may be the result of physical events, selection, assignment, or complex processes, among others.
  • Probability of events: a number in \([0, 1]\) that measures the uncertainty of an event.

  • A random variable assigns numbers to outcomes of a random process

    • discrete (countable values) or continuous (range of values)
  • The distribution of a random variable is a function that describes its variability

    • some of them have formulas and special names (e.g., Binomial, Uniform, Normal).
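In R, each of these named distributions has a built-in random generator; the parameter values below are arbitrary, chosen only for illustration:

```r
set.seed(1)  # fix the generator so the output is repeatable

binom_draws <- rbinom(5, size = 10, prob = 0.5)  # 5 draws from a Binomial(10, 0.5)
unif_draws  <- runif(3, min = 0, max = 1)        # 3 draws from a Uniform(0, 1)
norm_draws  <- rnorm(3, mean = 0, sd = 1)        # 3 draws from a standard Normal

binom_draws
unif_draws
norm_draws
```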

Today, we’ll use computer code to simulate randomness and recreate random processes, under our full control, many times!!

Recall…

Randomness can come from:

  • physical events (e.g., flipping a coin)
  • selection mechanisms (e.g., selecting individuals from a population)
  • experimental designs (random assignment in a controlled experiment)
  • complex systems (e.g., stock market fluctuations)
  • simulations or algorithms (e.g., random number generator in software)
  • biological processes (e.g., random assortment of chromosomes)

Which of these random processes can we easily repeat thousands of times? Which are difficult or impossible to reproduce in practice?

Simulation lets us recreate random processes in a controlled and repeatable way.

Simulations

Definition

Simulation is a way of recreating a random process with code so that we can repeatedly observe its outcomes and reproduce the same process under identical conditions.

  • Each simulation run randomly results in one possible outcome of the process.

  • Many runs reveal patterns of variability in the outcomes of the process.

However,

  • a real process may be difficult to recreate with code!

  • we may need to make many (and some unrealistic) assumptions to recreate a process.

  • even when a process is easy to describe, simulating it many times may be computationally costly.

Simulation trades analytical difficulty or lack of feasibility for computational effort

What do we use simulation for?


We often use simulation to:

  • approximate a probability that is hard to compute analytically

  • approximate a distribution that is unknown or difficult to derive

  • understand variability (e.g., a standard error)

  • study properties of estimators

Simulation answers questions that cannot be answered from a single realization of a random process.

How do we build a simulation?

To simulate a process, we follow a clear set of steps.


    1. Define the process we want to simulate
    2. Specify how randomness enters the process
    3. Decide what to record from each run
    4. Repeat the process many times
    5. Summarize the results across runs
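The five steps above can be sketched in R with a simple hypothetical process, flipping one fair coin (the function name `one_flip` is ours, for illustration):

```r
# 1. Define the process: one flip of a fair coin
# 2. Specify randomness: sample() picks "H" or "T" with equal probability
one_flip <- function() sample(c("H", "T"), size = 1)

# 3. Decide what to record: whether the flip lands heads
# 4. Repeat the process many times
flips <- replicate(1000, one_flip() == "H")

# 5. Summarize the results across runs
mean(flips)  # proportion of heads, close to 0.5
```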

Photo by James Lee on Unsplash


Simulation is structured repetition, not trial and error.

1. Define the process


A simulation always starts with a clearly defined process.

  • What is the real or hypothetical process we want to mimic?
  • What assumptions are needed to describe this process?
  • What aspects of the process are held fixed across repetitions?


We simulate the mechanism that generates data, not a single observed outcome.

Example: approximating a probability by simulation


One of the oldest documented problems in probability asks the following question:

If three fair six-sided dice are rolled, what is more likely: a sum of 9 or a sum of 10?

  • use this link to virtually roll 3 dice
  • record if you get a sum of 9, of 10, or neither
  • repeat 15 times
  • use this link to share what you found


From Probability and Simulation

iClicker 1: define the random process


What is the random process we would repeat many times using code?

    1. Rolling three physical dice in the classroom
    2. Generating three independent outcomes from \(\{1, \ldots, 6\}\) with equal probability
    3. Counting how often sums of 9 and 10 appear in our observed rolls
    4. Computing the probabilities using a formula

iClicker 2: add randomness


In the three-dice problem, how does randomness enter the process?

    1. The probabilities of each sum are random
    2. The simulation randomly decides whether the sum is 9 or 10
    3. Each die outcome is random, but the chances of each outcome stay the same run to run
    4. The probability of each outcome changes run to run

in R

three_dice <- sample(1:6, size = 3, replace = TRUE)

three_dice
[1] 6 3 2
sum(three_dice)
[1] 11

Every time we run this code, we’re simulating one roll of three fair dice! The outcome varies randomly, but the probability model is fixed.

The code defines the random process by specifying:

  • what varies
  • what is fixed
  • which function generates the random outcomes (e.g., sample())
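The probability model is part of what stays fixed: changing `sample()`'s `prob` argument defines a different process. A sketch with a hypothetical loaded die (the weights are made up, for illustration only):

```r
# Same generating function, but a different fixed probability model:
# a loaded die that lands on 6 three quarters of the time
loaded_dice <- sample(1:6, size = 3, replace = TRUE,
                      prob = c(0.05, 0.05, 0.05, 0.05, 0.05, 0.75))
loaded_dice
```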

Record from each run

Every time we run this code, we’re simulating one roll of three fair dice! What should we record from each run?


Back to the 3-dice example

If three fair six-sided dice are rolled, what is more likely: a sum of 9 or a sum of 10?

You played with the random generator app and already made a decision of what to record. What did you record from each run?

Enter your answer here

All choices are valid, but some are more efficient for the question we want to answer.

Choice of what to record

We may record different results from the same random process!

What we record from each simulation run depends on the question we are trying to answer.

  • If the goal is to study dice behavior → record all dice

  • If the goal is to compare sums → record the sum

  • If the goal is a probability → record an indicator

We may also record results in anticipation of future questions.

Later, we’ll see how we can make simulations reproducible by controlling the random number generator.

in R

Write code to record the following outcomes from one run

  • individual outcomes of the 3 dice
  • the sum of the 3 dice
  • indicator variables to record whether the sum is 9 or whether it is 10
# One run of the process (3 fair dice)
sample(1:6, size = 3, replace = TRUE)
[1] 4 2 5
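One possible solution sketch, recording all three quantities from a single run (the variable names are ours):

```r
# One run of the process (3 fair dice)
three_dice <- sample(1:6, size = 3, replace = TRUE)  # individual outcomes

dice_sum <- sum(three_dice)   # the sum of the 3 dice
is_9  <- dice_sum == 9        # indicator: is the sum 9?
is_10 <- dice_sum == 10       # indicator: is the sum 10?

c(sum = dice_sum, is_9 = is_9, is_10 = is_10)
```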

Repeat

Compare the result you obtained with others. Did you get the same result?

  • Everyone simulated the same process

  • Everyone followed the same instructions

  • Everyone most likely got different answers

So… which answer is correct?

All results are correct, but a single run tells us almost nothing about a probability.

To approximate probabilities we need the proportion of times events occur across many runs.

in R

You already repeated the process a few times by hand. Code lets us repeat it thousands of times, consistently!

Let’s repeat it \(10000\) times!

#sums <- ...(10000, ...(sample(1:6, ..., replace = TRUE)))

#head(sums)

Summarize

Once we have many runs, we summarize the results to answer our question.

We approximate each probability by the proportion of runs in which each sum occurs.

But what we compute depends on what we recorded.

We can compute the proportion of sums equal to 9 using:

  • mean()

  • sum()

  • table()
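All three give the same answer; a sketch, assuming `sums` holds the simulated totals:

```r
# Simulate many totals of 3 fair dice
sums <- replicate(10000, sum(sample(1:6, size = 3, replace = TRUE)))

p_mean  <- mean(sums == 9)                          # mean() of an indicator
p_sum   <- sum(sums == 9) / length(sums)            # count / number of runs
p_table <- unname(table(sums)["9"]) / length(sums)  # from a frequency table

c(p_mean, p_sum, p_table)  # all three agree
```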

in R

sums <- replicate(10000,
  sum(sample(1:6, size = 3, replace = TRUE))
)

c( prob_9_hat = mean(sums==9),
prob_10_hat = mean(sums==10))
 prob_9_hat prob_10_hat 
     0.1109      0.1250 


According to our approximation, it seems that the probability of getting a sum of 10 is higher than that of getting 9!

Caution

Can you reproduce your result?? Run the code again!

Reproducibility

  • R generates random outcomes using a random number generator.

  • If we fix the starting point of that generator, R can reproduce the same random outcomes!

To make a simulation reproducible, we set the seed to any fixed value!

set.seed(200)

sums <- replicate(10000,
  sum(sample(1:6, size = 3, replace = TRUE))
)

c(prob_9_hat = mean(sums==9),
prob_10_hat = mean(sums==10))
 prob_9_hat prob_10_hat 
     0.1159      0.1213 
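We can check reproducibility directly: resetting the seed before each run makes the runs identical. A quick sketch:

```r
set.seed(200)
first_run <- replicate(10000, sum(sample(1:6, size = 3, replace = TRUE)))

set.seed(200)  # same seed, same starting point for the generator
second_run <- replicate(10000, sum(sample(1:6, size = 3, replace = TRUE)))

identical(first_run, second_run)  # TRUE: the outcomes match exactly
```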

library(ggplot2)
library(tidyr)

three_dice_df <- data.frame(
  runs = seq_along(sums),
  `prob_9_hat`  = cumsum(sums == 9)  / seq_along(sums),
  `prob_10_hat` = cumsum(sums == 10) / seq_along(sums)
)

three_dice_long <- tidyr::pivot_longer(
  three_dice_df,
  -runs,
  names_to = "event",
  values_to = "estimate"
)

p_long_run <- ggplot(three_dice_long, aes(x = runs, y = estimate, color = event)) +
  geom_line() +
  geom_hline(yintercept = c(25, 27)/216,
             linetype = "dashed",
             color = c("#1b9e77", "#d95f02")) +
  labs(
    x = "Number of simulation runs",
    y = "Estimated probability",
    color = "",
    title = "Approximating probabilities by simulation"
  ) +
  theme(
  text = element_text(size = 10),
  axis.title = element_text(size = 10),
  axis.text  = element_text(size = 9),
  legend.title = element_text(size = 9),
  legend.text  = element_text(size = 9),
  plot.title = element_text(size = 11)
)

In this example, the probabilities are not that difficult to compute analytically (horizontal lines).

p_long_run

iClicker 3: reproducibility

Consider the following two code snippets:

answer1 <- replicate(10, sum(sample(1:6, size = 3, replace = TRUE)))

set.seed(200) 

replicate(1000, sum(sample(1:6, size = 3, replace = TRUE))) 

replicate(100, sum(sample(1:6, size = 3, replace = TRUE)))

replicate(10, sum(sample(1:6, size = 3, replace = TRUE)))

and

answer1 <- replicate(10, sum(sample(1:6, size = 3, replace = TRUE)))

set.seed(200)  

replicate(1000, sum(sample(1:6, size = 3, replace = TRUE))) 

replicate(100, sum(sample(1:6, size = 3, replace = TRUE))) 

answer1

Both snippets give the same result. A: TRUE or B: FALSE

iClicker 4

A population mean can be estimated with a sample mean. But is this a good estimator?

Next class, we will use the wildfire data as an example of a population to answer this question. What is the random process we would simulate?

    1. Calculating the mean temperature from the full wildfire dataset
    2. Randomly sampling wildfire observations and computing their mean temperature
    3. Generating temperatures from a theoretical distribution
    4. Repeating the same wildfire observation many times

Key takeaways

  • A simulation starts by defining a random process

    • We must define what can vary, what is fixed, and how randomness is generated.
  • What we record from each run depends on the question

    • The same random process can produce many valid results
    • Choosing what to record is part of the modeling decision.
  • A single run is not informative about probability

    • Probabilities are approximated using frequencies across many runs, not individual outcomes.
  • Repeating the same process many times is needed to study patterns.

  • Setting the seed of the random number generator makes simulations reproducible.