Objective

The example shows how to determine the sample size required to accurately distinguish between a one-dimensional and a two-dimensional Graded Response Model (GRM). The two models are estimated and compared using a likelihood ratio test.

I. Determining the data generation for the complete dataset

Number and distribution of factors (unidimensional vs. multidimensional)
Number of items and item parameters (discriminations, difficulties)
Item type (dichotomous, polytomous)

The example describes a fictitious 10-item questionnaire that measures attitudes towards IRT models on a four point scale, ranging from 0 (Strongly not agree) to 3 (Strongly agree). Half of the items are formulated positively (open-mindedness towards IRT), the other half negatively (skepticism towards IRT). The two dimensions are highly negatively correlated ( $ρ = - .90$ ).

# Clear work space and load necessary packages
rm(list = ls())
library(tidyverse)   # General data handling and manipulation
library(mirt)        # Estimate item response model
library(mice)        # Induce missingness at random to the data

a_i <- data.frame(
  a1 = c(1.72, 0, 1.47, 0.77, 0, 0, 1.54, 0, 0, 1.30),
  a2 = c(0, 1.98, 0, 0, 1.21, 1.73, 0, 1.45, 1.20, 0)) %>% as.matrix
b_i <- data.frame(
  b1 = c(-1.12, -0.89, -1.70, -2.26, -0.98, -0.64, -1.29, -1.33, -0.92, -0.72),
  b2 = c(0.04, -0.18, -0.45, -0.85, 0.14, -0.10, -0.44, 0.12, 0.30, -0.11),
  b3 = c(0.84, 0.83, 0.55, 1.00, 0.78, 0.75, 0.56, 1.23, 1.39, 1.13)) %>% as.matrix

The function generate_data_grm uses mirt::simdata() to simulate ordered categorical data given the item discriminations ( $a$ ), the item difficulties ( $b$ ), and the number of observations ( $n$ ). For polytomous items (i.e., multiple ordered categories) in the GRM, the probability of obtaining a category score is given:

$P (X_{p i} \geq k | a_{i 1}, a_{i 2}, b_{i k}, θ_{p 1}, θ_{p 2}) = \frac{1}{1 + \exp (- (a_{i 1} θ_{p 1} + a_{i 2} θ_{p 2} - b_{i k}))}$

where $b_{i k}$ is the difficulty parameter for modeling the probability of scoring at or above category $k$ on item $i$ . Note that in the mirt package the transformation $d = - a * b$ is used to match the intercept parameterization of the logistic model.

# Simulate dichotomous item responses based on the generated theta values
  # - 'a' denotes the item discriminations
  # - 'b' denotes the item difficulties
  # - 'n' denotes the number of observations
# The remaining arguments are preset
  # - 'mu' denotes the mean vector
  # - 'sigma' denotes the covariance matrix 
  # - 'itemtype' specifies the type of item (i.e., graded)

generate_data_grm <- function(a, b, n) {
  resp <- 
    simdata(a = a, 
            d = -(a_i[, 1]*b_i + a_i[, 2]*b_i), 
            N = n,
            mu = c(0, 0), 
            sigma = matrix(c(1, -0.90, -0.90, 1), nrow = 2),
            itemtype = "graded") %>% 
    as.data.frame()
  
  return(resp)
}

II. Defining the test design and the process of missing values

Pattern of missingness (e.g., type of missingness, linking design)
Amount of missing data

20% of the values are missing completely at random (MCAR).

# Induce missingness completely at random (MCAR) to the complete simulated data set
  # - 'resp' denotes the complete data set
  # - 'miss_rates' denotes the amount of missingness [0;1]

data_MCAR_design <- function(resp, miss_rates = 0) {
  resp <- t(apply(resp, 1, function(row) {
    row[sample(seq_along(row), length(row) * miss_rates)] <- NA;
    return(row)
  }))
  
  return(resp)
}

III. Selecting the IRT model and the parameter of interest

Underlying IRT model (e.g., 1PL, 2PL)
IRT modeling software and estimation method
Parameters to extract

# Estimate item response model
# - 'resp' denotes the data set

estimate_irt <- function(resp) {

  # Estimate a uni-dimensional GRM
  mod1 <- tryCatch(
    mirt(data = resp,
         1,
         itemtype = "graded",
         verbose = FALSE),
    error = function(e) NULL
  )
  
  # Estimate a two-dimensional model
  twofac <- ' F1 = 1, 3-4, 7, 10
              F2 = 2, 5-6, 8-9
              COV = F1 * F2 '
  mod2 <- tryCatch(
    mirt(data = resp,
         model = twofac,
         itemtype = "graded",
         verbose = TRUE),
    error = function(e) NULL
  )
    
  # Return p-value
  est <- if (!is.null(mod1) & !is.null(mod2)) {
    anova(mod1, mod2)$p[2]
    print(anova(mod1, mod2)$p[2])
  } else {
    NA
  }
  return(est)
}

IV. Setting up the Monte Carlo simulation

Number of iterations
Sample sizes to evaluate

The Monte Carlo simulation runs n_iterations times, including the previous steps of (i) determining the data generation for the complete dataset, (ii) defining the test design and the process of missing values, (iii) selecting the IRT model and the parameter of interest. The simulation is run for different sample sizes between 300 and 900 (in increments of 100).

Robey and Barcikowski (1992) list the sample sizes for different nominal Type I error rates. For example, a liberal deviation of $α \pm 1 / 2 α$ from a nominal Type I error rate of .05 requires a total of 729 iterations for an a priori Type I error rate of $ω = .05$ and a power of $(1 - β) = .80$ . The same logic for determining the number of replications can also be used for Type II errors.

# Number of iterations
n_iterations <- 729

# Set the seed for random number generation to ensure reproducibility
set.seed(2024)

# Create data frame for results (res)
res <- data.frame()

# Check if result file already exists
if (file.exists("example_5_res.rds")) {
  res <- readRDS("example_5_res.rds") 
} else {
  # Loop over different sample sizes (from 300 to 900, in steps of 100)
  for (n_persons in seq(300, 900, 100)) {
  
    # Nested loop, running 'n_iterations' times
    for (iter in 1:n_iterations) {
      dat <- generate_data_grm(a_i, b_i, n_persons) %>% 
        data_MCAR_design(., miss_rates = .20)
    
      # results
      res <- bind_rows(res, 
                       data.frame(iteration = iter,
                                  n_persons = n_persons,
                                  p_val = estimate_irt(dat)))
    }
  }
 
  # Save the results
  saveRDS(res, file = "example_5_res.rds")
}

Results

The figure shows the statistical power (= 1 - Type II error) as a function of sample size. Note that due to the high correlation between the dimensions, a certain proportion of the models lead to convergence problems. The average of all converged models is shown here.

# Preparation and aggregation of results
res_plot <- res %>%
  group_by(n_persons) %>%
  summarise(p_count = mean(p_val < 0.05, na.rm=TRUE))

# Plot the number of significant differences
ggplot(data=res_plot, aes(x=n_persons, y = p_count)) +
  geom_line(linewidth = 0.8, alpha = 0.8, col="darkcyan") +
  geom_point(size = 1.5, alpha = 0.8, col="darkcyan") +
  geom_abline(intercept = .95, slope = 0, col="red", lty = "twodash") +
  labs(
    x = "Sample size",
    y = "Percentage of sig. decisions"
  ) +
  ylim(0, 1.05) +
  xlim(200, 1000) +
  theme_bw()  +
  theme(
    axis.title = element_text(size = 14),
    axis.text = element_text(size = 12),
    legend.title = element_text(size = 14),
    legend.text = element_text(size = 12),
    legend.position = "inside", 
    legend.position.inside =  c(.90, .80)
  )

# Documentation for transparency and reproducibility
print(sessionInfo(), locale=FALSE)

R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26100)

Matrix products: default


attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] mice_3.17.0     mirt_1.42.6     lattice_0.22-6  lubridate_1.9.3
 [5] forcats_1.0.0   stringr_1.5.1   dplyr_1.1.4     purrr_1.0.2    
 [9] readr_2.1.5     tidyr_1.3.1     tibble_3.2.1    ggplot2_3.5.1  
[13] tidyverse_2.0.0

loaded via a namespace (and not attached):
 [1] tidyselect_1.2.1     farver_2.1.2         R.utils_2.12.3      
 [4] fastmap_1.2.0        rpart_4.1.23         digest_0.6.37       
 [7] timechange_0.3.0     lifecycle_1.0.4      Deriv_4.1.6         
[10] dcurver_0.9.2        cluster_2.1.6        survival_3.6-4      
[13] magrittr_2.0.3       compiler_4.4.1       rlang_1.1.4         
[16] tools_4.4.1          utf8_1.2.4           yaml_2.3.10         
[19] knitr_1.49           labeling_0.4.3       htmlwidgets_1.6.4   
[22] curl_5.2.2           withr_3.0.2          R.oo_1.27.0         
[25] nnet_7.3-19          grid_4.4.1           fansi_1.0.6         
[28] jomo_2.7-6           colorspace_2.1-1     future_1.34.0       
[31] progressr_0.15.0     GPArotation_2024.3-1 iterators_1.0.14    
[34] globals_0.16.3       scales_1.3.0         MASS_7.3-60.2       
[37] cli_3.6.3            rmarkdown_2.29       vegan_2.6-8         
[40] generics_0.1.3       rstudioapi_0.16.0    future.apply_1.11.3 
[43] SimDesign_2.17.1     tzdb_0.4.0           sessioninfo_1.2.2   
[46] minqa_1.2.8          pbapply_1.7-2        audio_0.1-11        
[49] splines_4.4.1        parallel_4.4.1       beepr_2.0           
[52] vctrs_0.6.5          boot_1.3-30          glmnet_4.1-8        
[55] Matrix_1.7-0         jsonlite_1.8.9       hms_1.1.3           
[58] mitml_0.4-5          listenv_0.9.1        testthat_3.2.1.1    
[61] foreach_1.5.2        snow_0.4-4           glue_1.8.0          
[64] parallelly_1.38.0    pan_1.9              nloptr_2.1.1        
[67] codetools_0.2-20     shape_1.4.6.1        stringi_1.8.4       
[70] gtable_0.3.6         lme4_1.1-35.5        munsell_0.5.1       
[73] pillar_1.9.0         htmltools_0.5.8.1    brio_1.1.5          
[76] R6_2.5.1             evaluate_1.0.1       R.methodsS3_1.8.2   
[79] backports_1.5.0      RPushbullet_0.3.4    broom_1.0.7         
[82] Rcpp_1.0.13-1        gridExtra_2.3        nlme_3.1-164        
[85] permute_0.9-7        mgcv_1.9-1           xfun_0.49           
[88] pkgconfig_2.0.3