Simulations-II
DSCI 200
Katie Burak, Gabriela V. Cohen Freue
Last modified – 21 January 2026
\[
\DeclareMathOperator*{\argmin}{argmin}
\DeclareMathOperator*{\argmax}{argmax}
\DeclareMathOperator*{\minimize}{minimize}
\DeclareMathOperator*{\maximize}{maximize}
\DeclareMathOperator*{\find}{find}
\DeclareMathOperator{\st}{subject\,\,to}
\newcommand{\E}{E}
\newcommand{\Expect}[1]{\E\left[ #1 \right]}
\newcommand{\Var}[1]{\mathrm{Var}\left[ #1 \right]}
\newcommand{\Cov}[2]{\mathrm{Cov}\left[#1,\ #2\right]}
\newcommand{\given}{\ \vert\ }
\newcommand{\X}{\mathbf{X}}
\newcommand{\x}{\mathbf{x}}
\newcommand{\y}{\mathbf{y}}
\newcommand{\P}{\mathcal{P}}
\newcommand{\R}{\mathbb{R}}
\newcommand{\norm}[1]{\left\lVert #1 \right\rVert}
\newcommand{\snorm}[1]{\lVert #1 \rVert}
\newcommand{\tr}[1]{\mbox{tr}(#1)}
\newcommand{\brt}{\widehat{\beta}^R_{s}}
\newcommand{\brl}{\widehat{\beta}^R_{\lambda}}
\newcommand{\bls}{\widehat{\beta}_{ols}}
\newcommand{\blt}{\widehat{\beta}^L_{s}}
\newcommand{\bll}{\widehat{\beta}^L_{\lambda}}
\newcommand{\U}{\mathbf{U}}
\newcommand{\D}{\mathbf{D}}
\newcommand{\V}{\mathbf{V}}
\]
Attribution
Some of the material is based on content adapted from
Learning Objectives
Understand the mechanisms for generating and simulating data.
Contrast empirical and theoretical distributions.
Use simulations to approximate probabilities of events or distribution functions.
Use simulations to assess theoretical properties of random variables.
Explore the Central Limit Theorem (CLT) and Law of Large Numbers (LLN).
Write reproducible simulation code.
Interpret and reflect on simulation results using plots and summary statistics.
Last time: Simulations
Define the main steps of a simulation:
Define the process we want to simulate
Specify how randomness enters the process
Decide what to record from each run
Repeat the process many times
Summarize the results across runs
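The five steps above can be sketched as a reusable template. A minimal base-R example, estimating the chance that two dice sum to 7 (an illustrative physical process, not from the lecture):

```r
set.seed(200)  # reproducibility

# 1. Define the process: one run of the experiment
one_run <- function() {
  # 2. Randomness enters here: roll two fair dice
  rolls <- sample(1:6, size = 2, replace = TRUE)
  # 3. Record one quantity from the run
  sum(rolls) == 7
}

# 4. Repeat the process many times
results <- replicate(10000, one_run())

# 5. Summarize across runs: approximate P(sum = 7) = 6/36
mean(results)
```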
We used simulation to approximate two probabilities when randomness came from a physical event.
Today, we’ll use simulations to study properties of estimators, generating randomness both by sampling from a finite population and by drawing values from probability distributions!
Sampling from a finite population
In lecture 3, we sampled from the wildfire population and compared population parameters with sample estimators.
Every time we select a new random sample we get a different estimate.
The Wildfire Population
library(tidyverse)
library(infer)
library(readr)

wildfire <- read_csv("data/wildfire.csv")
head(wildfire)
# A tibble: 6 × 35
year fire_number current_size size_class latitude longitude fire_origin
<dbl> <chr> <dbl> <chr> <dbl> <dbl> <chr>
1 2006 PWF001 0.1 A 56.2 -117. Land Owner
2 2006 EWF002 0.2 B 53.6 -116. Fire Department
3 2006 EWF001 0.5 B 53.6 -116. Fire Department
4 2006 EWF003 0.01 A 53.6 -116. Industry
5 2006 PWF002 0.1 A 56.2 -117. Fire Department
6 2006 CWF001 0.2 B 51.2 -115. Fire Department
# ℹ 28 more variables: general_cause <chr>, responsible_group <chr>,
# activity_class <chr>, true_cause <chr>, fire_start_date <dttm>,
# detection_agent_type <chr>, detection_agent <chr>,
# assessment_hectares <dbl>, fire_spread_rate <dbl>, fire_type <chr>,
# fire_position_on_slope <chr>, weather_conditions_over_fire <chr>,
# temperature <dbl>, relative_humidity <dbl>, wind_direction <chr>,
# wind_speed <dbl>, fuel_type <chr>, initial_action_by <chr>, …
We’ll focus on the numerical variable temperature
Population parameters for temperature
wildfire |>
  summarise(
    mean_temp = mean(temperature, na.rm = TRUE),
    med_temp = median(temperature, na.rm = TRUE),
    var_temp = var(temperature, na.rm = TRUE),
    sd_temp = sd(temperature, na.rm = TRUE),
    min_temp = min(temperature, na.rm = TRUE),
    max_temp = max(temperature, na.rm = TRUE)
  )
# A tibble: 1 × 6
mean_temp med_temp var_temp sd_temp min_temp max_temp
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 17.9 19 60.1 7.75 -39 45
A random sample and sample estimates
set.seed(200)
wildfire_sample <- wildfire |> rep_sample_n(size = 100)
temperature_rv <- wildfire_sample$temperature

c(x_bar = mean(temperature_rv, na.rm = TRUE),
  s = sd(temperature_rv, na.rm = TRUE))
x_bar s
17.118681 7.236496
Temperature as a random variable
Every time we take a new sample, the values of temperature change.
temperature is a random variable and randomness comes from sampling!
Any statistic of temperature becomes a random variable as well.
for example, the mean temperature (\(\bar{x}\)) is a random variable
The variation in point estimates across samples is called sampling variability
Standard errors
The standard error of an estimator measures how much the sample estimate (like the mean) is expected to vary across different random samples.
In lecture 3 we said that the standard error of the sample mean is given by:
\[SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}\] and the population parameter \(\sigma\) can be estimated with the sample standard deviation \(s\) .
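A quick numerical check of this formula under independent sampling, drawing from a Normal distribution (the mean 18 and sd 7.75 echo the temperature summaries above; the sample size and number of repetitions are illustrative):

```r
set.seed(200)

sigma <- 7.75
n <- 100

# Empirical SE: sd of the sample mean across 5000 independent samples of size n
xbars <- replicate(5000, mean(rnorm(n, mean = 18, sd = sigma)))

sd(xbars)        # empirical SE of the sample mean
sigma / sqrt(n)  # theoretical SE: 0.775
```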
Let’s build a simulation
Let’s use simulation to study the variability of the sample mean (standard error) when sampling from a finite population.
Define the process we want to simulate
Specify how randomness enters the process
Decide what to record from each run
Repeat the process many times
Summarize the results across runs
iClicker 1: simulation design
Which of the following options best describes a simulation study?
A. Drawing a single sample from the population and computing the mean of temperature in the sample
B. Drawing many samples from the population and computing the mean of temperature in each sample
C. Computing the mean of temperature from the full population
D. Generating temperature values from a Normal distribution and computing its mean
in R
Take a random sample of size \(n = 7000\)
Compute the sample mean temperature
Repeat this 1000 times
set.seed(200)
pop_temp_complete <- wildfire |>
  select(temperature) |>
  drop_na()

n <- 7000
N <- nrow(pop_temp_complete)  # number of rows; length() would count columns

# sample_mean_reps <- pop_temp_complete |>
#   ...(size = ..., reps = ...) |>
#   group_by(...) |>
#   summarise(temp_mean = ...(..., na.rm = TRUE))
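One way to fill in the blanks, sketched in base R with a synthetic population standing in for pop_temp_complete$temperature so it runs without the wildfire file; in the lecture pipeline, the same loop would be infer’s rep_sample_n(size = n, reps = 1000) followed by group_by(replicate) and summarise():

```r
set.seed(200)

# Synthetic stand-in for pop_temp_complete$temperature (illustrative values)
temp_pop <- rnorm(20000, mean = 18, sd = 7.75)
n <- 7000

# 1000 runs: draw n values without replacement, record each sample mean
sample_mean_reps <- data.frame(
  replicate = 1:1000,
  temp_mean = replicate(1000, mean(sample(temp_pop, size = n)))
)

head(sample_mean_reps$temp_mean)
```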
We now have a list of 1000 sample mean temperatures!
Summarize
We can use the list of sample estimates to study the sampling variability.
Let’s approximate the SE by computing the standard deviation of the sample estimates
# se_simulation <- ...(sample_mean_reps$temp_mean)
# se_formula <- ...(pop_temp_complete$temperature) / ...
Something seems off…
Sampling from a finite population
When sampling without replacement from a finite population, observations are not independent.
The variability of the sample mean is smaller than in model-based sampling.
This reduction in variability is captured by the finite population correction (FPC):
\[ SE(\bar{x}) = \frac{\sigma}{\sqrt{n}}\sqrt{1-\frac{n}{N}} \]
where \(N\) and \(n\) are the population and the sample sizes, respectively.
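A self-contained check of this correction, using an illustrative synthetic population rather than the wildfire data: the empirical SE of without-replacement sample means should track the FPC-corrected formula, not the uncorrected one.

```r
set.seed(200)

# Illustrative finite population (not the wildfire data)
pop <- rnorm(20000, mean = 18, sd = 7.75)
N <- length(pop)
n <- 7000

# Empirical SE: sd of 1000 without-replacement sample means
means <- replicate(1000, mean(sample(pop, size = n)))
se_sim <- sd(means)

se_plain <- sd(pop) / sqrt(n)         # ignores the finite population
se_fpc <- se_plain * sqrt(1 - n / N)  # finite population correction

round(c(simulation = se_sim, uncorrected = se_plain, corrected = se_fpc), 4)
```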
library(ggplot2)
library(tidyr)

cum_sd <- sapply(
  seq_along(sample_mean_reps$temp_mean),
  function(k) sd(sample_mean_reps$temp_mean[1:k])
)

mean_temp_df <- data.frame(
  runs = seq_along(sample_mean_reps$temp_mean),
  se_mean_temp = cum_sd
)

mean_temp_long <- tidyr::pivot_longer(
  mean_temp_df,
  -runs,
  names_to = "event",
  values_to = "estimate"
)

fpc <- sqrt(1 - (n / N))
se_corrected_fpc <- se_formula * fpc

ref_lines <- data.frame(
  se = c(se_simulation, se_corrected_fpc, se_formula),
  method = c("Simulation SD", "FPC-corrected formula", "Uncorrected formula")
)

t_long_run <- ggplot(mean_temp_long, aes(x = runs, y = estimate, color = event)) +
  geom_line() +
  geom_hline(
    data = ref_lines,
    aes(yintercept = se, color = method),
    linetype = "dashed"
  ) +
  scale_color_manual(
    values = c(
      "Simulation SD" = "#F8766D",
      "FPC-corrected formula" = "#00BA38",
      "Uncorrected formula" = "#619CFF"
    )
  ) +
  labs(
    x = "Number of simulation runs",
    y = "Estimated SE",
    color = "",
    title = "Estimated SE of mean temperature"
  )
Generating values from a distribution
Another way of generating random numbers is by using well-known probability distributions such as the Normal, Poisson, and Binomial.
in R
rnorm: generate random Normal variates
arguments: n (sample size), mean and sd
rpois: generate random Poisson variates
arguments: n, lambda (rate)
runif: generate random Uniform variates
arguments: n, min, max (interval)
rbinom: generate random Binomial variates
arguments: n, size (number of trials), prob (success probability)
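A quick look at each generator (the argument values are illustrative; outputs change on every run unless a seed is set):

```r
set.seed(200)

z <- rnorm(3, mean = 0, sd = 1)        # 3 standard Normal values
k <- rpois(3, lambda = 4)              # 3 Poisson counts with rate 4
u <- runif(3, min = 2, max = 5)        # 3 Uniform values on [2, 5]
b <- rbinom(3, size = 10, prob = 0.5)  # 3 Binomial counts out of 10 trials
```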
Example
x <- runif(10, 2, 5)
x
[1] 3.601317 3.751295 3.768735 4.073120 4.001994 4.517881 4.134800 2.289504
[9] 3.571474 2.706052
summary(x)
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.290 3.579 3.760 3.642 4.055 4.518
Every time we run this code, it generates new values
Setting the seed
set.seed(200)
runif(5, 2, 5)
[1] 3.601317 3.751295 3.768735 4.073120 4.001994
# resetting the seed reproduces the same sequence
set.seed(200)
runif(5, 2, 5)
[1] 3.601317 3.751295 3.768735 4.073120 4.001994
Simulate a simple linear regression (SLR)
Simulate a dataset from the following model:
\[y = \beta_0 + \beta_1 \times x + \varepsilon\] where \(\varepsilon \sim \mathcal{N}(0,1)\) and \(x \sim \text{Unif}(1,3)\) .
In your simulation, set the population coefficients \(\beta_0 = 1\) and \(\beta_1 = 2\) . In real data, these are unknown and you use a sample to estimate them.
Plot the data
Use the function lm() to estimate the parameters of the model using least squares estimation
Check this application to recall LS estimation
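A possible base-R implementation of this exercise, assuming \(n = 100\) observations (any reasonably large sample size works; base plot() stands in for ggplot2 to keep the sketch self-contained):

```r
set.seed(200)

# Population coefficients (known here because we simulate the data)
beta0 <- 1
beta1 <- 2
n <- 100

# Generate the predictor and the noise, then the response
x <- runif(n, min = 1, max = 3)
eps <- rnorm(n, mean = 0, sd = 1)
y <- beta0 + beta1 * x + eps

# Plot the simulated data
plot(x, y, main = "Simulated SLR data")

# Estimate beta0 and beta1 by least squares
fit <- lm(y ~ x)
coef(fit)  # estimates should be close to 1 and 2
```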
Key takeaways
Simulations can be used to:
approximate probabilities,
study properties of estimators,
approximate distributions
Randomness can be introduced using code that:
mimics physical processes
randomly selects outcomes from a finite population
generates random values from a distribution