Simulations-II

DSCI 200

Katie Burak, Gabriela V. Cohen Freue

Last modified – 21 January 2026

\[ \DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \DeclareMathOperator*{\minimize}{minimize} \DeclareMathOperator*{\maximize}{maximize} \DeclareMathOperator*{\find}{find} \DeclareMathOperator{\st}{subject\,\,to} \newcommand{\E}{E} \newcommand{\Expect}[1]{\E\left[ #1 \right]} \newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]} \newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]} \newcommand{\given}{\ \vert\ } \newcommand{\X}{\mathbf{X}} \newcommand{\x}{\mathbf{x}} \newcommand{\y}{\mathbf{y}} \newcommand{\P}{\mathcal{P}} \newcommand{\R}{\mathbb{R}} \newcommand{\norm}[1]{\left\lVert #1 \right\rVert} \newcommand{\snorm}[1]{\lVert #1 \rVert} \newcommand{\tr}[1]{\mbox{tr}(#1)} \newcommand{\brt}{\widehat{\beta}^R_{s}} \newcommand{\brl}{\widehat{\beta}^R_{\lambda}} \newcommand{\bls}{\widehat{\beta}_{ols}} \newcommand{\blt}{\widehat{\beta}^L_{s}} \newcommand{\bll}{\widehat{\beta}^L_{\lambda}} \newcommand{\U}{\mathbf{U}} \newcommand{\D}{\mathbf{D}} \newcommand{\V}{\mathbf{V}} \]

Attribution



Some of the material is based on content adapted from



Learning Objectives



  • Understand the mechanisms for generating and simulating data.
  • Contrast empirical and theoretical distributions.
  • Use simulations to approximate probability of events or distribution functions.
  • Use simulations to assess theoretical properties of random variables.
  • Explore the Central Limit Theorem (CLT) and Law of Large Numbers (LLN).
  • Write reproducible simulation code.
  • Interpret and reflect on simulation results using plots and summary statistics.

Last time: Simulations

Define the main steps of a simulation:

    1. Define the process we want to simulate
    2. Specify how randomness enters the process
    3. Decide what to record from each run
    4. Repeat the process many times
    5. Summarize the results across runs

We used simulation to approximate two probabilities when randomness came from a physical event.

Today, we’ll use simulations to study properties of estimators, generating randomness from sampling and from probability distributions!

Sampling from a finite population

In lecture 3, we sampled from the wildfire population and compared population parameters with sample estimators.

  • Population parameters are unknown but fixed

  • Sample estimators are random since they are based on a random sample.

Every time we select a new random sample we get a different estimate.
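This is easy to see in code. The population below is a hypothetical stand-in (the slides use the wildfire data instead); two samples of the same size yield two different estimates of the same fixed population mean:

```r
# A quick illustration that sample estimates change from sample to sample.
set.seed(1)
pop <- rnorm(10000, mean = 18, sd = 8)  # hypothetical population of "temperatures"

m1 <- mean(sample(pop, 100))
m2 <- mean(sample(pop, 100))
c(m1, m2)  # two different estimates of the same fixed population mean
```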

The Wildfire Population

library(tidyverse)
library(infer)
library(readr)

wildfire <- read_csv("data/wildfire.csv")

head(wildfire)
# A tibble: 6 × 35
   year fire_number current_size size_class latitude longitude fire_origin    
  <dbl> <chr>              <dbl> <chr>         <dbl>     <dbl> <chr>          
1  2006 PWF001              0.1  A              56.2     -117. Land Owner     
2  2006 EWF002              0.2  B              53.6     -116. Fire Department
3  2006 EWF001              0.5  B              53.6     -116. Fire Department
4  2006 EWF003              0.01 A              53.6     -116. Industry       
5  2006 PWF002              0.1  A              56.2     -117. Fire Department
6  2006 CWF001              0.2  B              51.2     -115. Fire Department
# ℹ 28 more variables: general_cause <chr>, responsible_group <chr>,
#   activity_class <chr>, true_cause <chr>, fire_start_date <dttm>,
#   detection_agent_type <chr>, detection_agent <chr>,
#   assessment_hectares <dbl>, fire_spread_rate <dbl>, fire_type <chr>,
#   fire_position_on_slope <chr>, weather_conditions_over_fire <chr>,
#   temperature <dbl>, relative_humidity <dbl>, wind_direction <chr>,
#   wind_speed <dbl>, fuel_type <chr>, initial_action_by <chr>, …
dim(wildfire)
[1] 26551    35

We’ll focus on the numerical variable temperature

Population parameters for temperature

wildfire |>
  summarise(
    mean_temp = mean(temperature, na.rm = TRUE),
    med_temp = median(temperature, na.rm = TRUE),
    var_temp = var(temperature, na.rm = TRUE),
    sd_temp = sd(temperature, na.rm = TRUE),
    min_temp = min(temperature, na.rm = TRUE),
    max_temp = max(temperature, na.rm = TRUE)
  )
# A tibble: 1 × 6
  mean_temp med_temp var_temp sd_temp min_temp max_temp
      <dbl>    <dbl>    <dbl>   <dbl>    <dbl>    <dbl>
1      17.9       19     60.1    7.75      -39       45

A random sample and sample estimates

set.seed(200)

wildfire_sample <- wildfire |> rep_sample_n(size = 100)

temperature_rv <- wildfire_sample$temperature
  
c(x_bar = mean(temperature_rv, na.rm = TRUE), 
  s = sd(temperature_rv, na.rm = TRUE))  
    x_bar         s 
17.118681  7.236496 

Temperature as a random variable

Every time we take a new sample, the values of temperature change.

temperature is a random variable and randomness comes from sampling!

Any statistic of temperature becomes a random variable as well.

For example, the mean temperature (\(\bar{x}\)) is a random variable.

The variation in point estimates across samples is called sampling variability.

Standard errors

The standard error of an estimator measures how much the sample estimate (like the mean) is expected to vary across different random samples.

In lecture 3 we said that the standard error of the sample mean is given by:

\[SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}\] and the population parameter \(\sigma\) can be estimated with the sample standard deviation \(s\).
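The plug-in estimate \(s/\sqrt{n}\) is a one-liner in R. The sample below is synthetic and only illustrative, not drawn from the wildfire data:

```r
# Plug-in estimate of SE(x_bar) = s / sqrt(n), using a synthetic sample
# of "temperatures"; the values here are illustrative only.
set.seed(123)
x <- rnorm(100, mean = 18, sd = 8)

n <- length(x)
se_hat <- sd(x) / sqrt(n)
se_hat
```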

Let’s build a simulation


Let’s use simulation to study the variability of the sample mean (standard error) when sampling from a finite population.

    1. Define the process we want to simulate
    2. Specify how randomness enters the process
    3. Decide what to record from each run
    4. Repeat the process many times
    5. Summarize the results across runs

iClicker 1: simulation design

Which of the following options best describes a simulation study?

  • A. Drawing a single sample from the population and computing the mean of temperature in the sample

  • B. Drawing many samples from the population and computing the mean of temperature in each sample

  • C. Computing the mean of temperature from the full population

  • D. Generating temperature values from a Normal distribution and computing its mean

in R

  1. Take a random sample of size \(n = 7000\)
  2. Compute the sample mean temperature
  3. Repeat this 1000 times
set.seed(200)

pop_temp_complete <- wildfire |>
 select(temperature)  |>
 drop_na()  

n <- 7000
N <- nrow(pop_temp_complete)  # number of rows; length() on a tibble counts columns

# sample_mean_reps <- pop_temp_complete |>
# ...(size = ..., reps = ...) |> 
#   group_by(...) |> 
#   summarise(temp_mean = ...(..., na.rm=TRUE))

We now have a list of 1000 sample mean temperatures!
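A self-contained sketch of the same simulation, with a synthetic Normal population standing in for the wildfire temperatures (the sizes and moments mimic the slides; on the real data, `rep_sample_n()` from infer does the repeated sampling):

```r
# Sketch of the simulation above with a synthetic finite population;
# the Normal population is a stand-in for the wildfire temperatures.
set.seed(200)
pop_temp <- rnorm(26551, mean = 17.9, sd = 7.75)

n <- 7000
reps <- 1000
# sample() draws without replacement by default, as from a finite population
sample_means <- replicate(reps, mean(sample(pop_temp, size = n)))

length(sample_means)  # 1000 sample mean temperatures
```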

Summarize

We can use the list of sample estimates to study the sampling variability.

Let’s approximate the SE computing the standard deviation of the sample estimates

# se_simulation <- ...(sampling_dist$sample_mean)
# se_formula <- ...(pop_temp_complete$...)/...

Something seems off…

Sampling from a finite population

  • When sampling without replacement from a finite population, observations are not independent.

  • The variability of the sample mean is smaller than in model-based sampling.

  • This reduction in variability is captured by the finite population correction (FPC):

\[ SE(\bar{x})= \frac{\sigma}{\sqrt{n}}\times\sqrt{1-\frac{n}{N}} \]

where \(N\) and \(n\) are the population and the sample sizes, respectively.
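We can check the FPC directly: when sampling without replacement, the SD of the simulated sample means tracks the corrected formula, not the uncorrected one. The population below is synthetic and its sizes are illustrative:

```r
# Checking the finite population correction with a synthetic population.
set.seed(200)
N <- 20000
n <- 7000
pop <- rnorm(N, mean = 18, sd = 8)

# sample() without replacement, as in finite-population sampling
means <- replicate(2000, mean(sample(pop, size = n)))

se_sim <- sd(means)
se_uncorrected <- sd(pop) / sqrt(n)
se_fpc <- se_uncorrected * sqrt(1 - n / N)

c(simulation = se_sim, uncorrected = se_uncorrected, fpc = se_fpc)
```

The simulated SD lands close to the FPC-corrected value and noticeably below the uncorrected \(\sigma/\sqrt{n}\).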

library(ggplot2)
library(tidyr)

cum_sd <- sapply(
  seq_along(sample_mean_reps$temp_mean),
  function(k) sd(sample_mean_reps$temp_mean[1:k])
)

mean_temp_df <- data.frame(
  runs = seq_along(sample_mean_reps$temp_mean),
  se_mean_temp  = cum_sd
)

mean_temp_long <- tidyr::pivot_longer(
  mean_temp_df,
  -runs,
  names_to = "event",
  values_to = "estimate"
)

fpc <- sqrt(1 - (n / N))
se_corrected_fpc <- se_formula * fpc

ref_lines <- data.frame(
  se = c(se_simulation, se_corrected_fpc, se_formula),
  method = c("Simulation SD", "FPC-corrected formula", "Uncorrected formula")
)

t_long_run <- ggplot(mean_temp_long, aes(x = runs, y = estimate, color = event)) +
  geom_line() +
  geom_hline(
    data = ref_lines,
    aes(yintercept = se, color = method),
    linetype = "dashed"
  ) +
  scale_color_manual(
    values = c(
      "Simulation SD" = "#F8766D",        
      "FPC-corrected formula" = "#00BA38", 
      "Uncorrected formula" = "#619CFF" 
    )
  ) +
  labs(
    x = "Number of simulation runs",
    y = "Estimated SE",
    color = "",
    title = "Estimated SE of mean temperature"
  )

Generating values from a distribution

Another way of generating random numbers is to use well-known probability distributions such as the Normal, Poisson, and Binomial.

in R

  • rnorm: generate random Normal variates
    • arguments: n (sample size), mean and sd
  • rpois: generate random Poisson variates
    • arguments: n, lambda (rate)
  • runif: generate random Uniform variates
    • arguments: n, min, max (interval)
  • rbinom: generate random binomial variates
    • arguments: n, size, prob

Example

x <- runif(10, 2, 5) 
x
 [1] 3.601317 3.751295 3.768735 4.073120 4.001994 4.517881 4.134800 2.289504
 [9] 3.571474 2.706052
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.290   3.579   3.760   3.642   4.055   4.518 

Every time we run this code, it generates new values.

Setting the seed

set.seed(200)

runif(5, 2, 5) 
[1] 3.601317 3.751295 3.768735 4.073120 4.001994
rnorm(2, 0, 2)
[1]  1.983119 -2.603497
rbinom(3, 1, 0.5)
[1] 0 0 1
# changing the sequence

set.seed(200)

runif(5, 2, 5)
[1] 3.601317 3.751295 3.768735 4.073120 4.001994
rnorm(2, 0, 2)
[1]  1.983119 -2.603497
rpois(4,10)
[1] 7 6 8 6
rbinom(3, 1, 0.5)
[1] 0 0 0
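The behaviour above can be stated compactly: the same seed followed by the same sequence of calls reproduces the draws exactly, while inserting an extra draw (like the `rpois()` call) shifts everything that follows. A minimal check:

```r
# Same seed + same call sequence => identical draws.
set.seed(200)
a <- runif(5, 2, 5)

set.seed(200)
b <- runif(5, 2, 5)

identical(a, b)  # the two streams match exactly
```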

Simulate a simple linear regression (SLR)

    1. Simulate a dataset from the following model:

\[y = \beta_0 + \beta_1 \times x + \varepsilon\] where \(\varepsilon \sim \mathcal{N}(0,1)\) and \(x \sim \text{Unif}(1,3)\).

In your simulation, set the population coefficients \(\beta_0 = 1\) and \(\beta_1 = 2\). In real data, these are unknown and you use a sample to estimate them.

    2. Plot the data
    3. Use the function lm() to estimate the parameters of the model using least squares estimation

Check this application to recall LS estimation
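A sketch of the exercise (steps 1 and 3; the sample size here is an arbitrary choice): simulate from \(y = 1 + 2x + \varepsilon\) and refit with `lm()`. The estimates should land near the population coefficients.

```r
# Simulate one dataset from the SLR model with beta0 = 1, beta1 = 2,
# then recover the coefficients by least squares.
set.seed(200)
n <- 500
x <- runif(n, 1, 3)
eps <- rnorm(n, 0, 1)
y <- 1 + 2 * x + eps

fit <- lm(y ~ x)
coef(fit)  # estimates should be close to (1, 2)
```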

Key takeaways

  • Simulations can be used to:
    • approximate probabilities,
    • study properties of estimators,
    • approximate distributions
  • Randomness can be introduced using code that:
    • mimics physical processes
    • randomly selects outcomes from a finite population
    • generates random values from a distribution