Ib — Theory of Mind (MASC) & Empathy (IRI): descriptive analysis

Descriptive Analyses · GTEMO Experiment

Author

Eric Guerci

Published

March 22, 2026

1 Background

1.1 Theory of Mind and the MASC

Theory of Mind (ToM) is the ability to attribute mental states — beliefs, intentions, desires, emotions — to others and to understand that these may differ from one’s own. It is a core dimension of social cognition and underlies strategic behaviour in interactive settings: anticipating what others know, want, and believe is a prerequisite for effective communication, negotiation, and cooperation.

The Movie for the Assessment of Social Cognition (MASC) is a validated film-based instrument developed by Dziobek et al. (2006). Participants watch short video clips of social interactions and answer multiple-choice questions about the characters’ thoughts and feelings. The MASC is designed to capture ecological ToM by embedding mental-state inference in naturalistic, dynamic social scenes — closer to real-world interaction than classic vignette-based tasks.

The instrument yields five scores:

Variable Description Scale
MASC_ToM_score Total correct ToM responses 0 – 45
MASC_dimToM_score Diminishing errors — under-mentalising 0 – 36
MASC_excToM_score Exceeding errors — over-mentalising 0 – 36
MASC_noToM_score No ToM errors — no mental-state attribution 0 – 36
MASC_attention_score Correct attention-check items (control) 0 – 15

Items are further classified as affective (emotion inference) or cognitive (belief/intention inference), yielding two proportion scores (MASC_affective_perc_score, MASC_cognitive_perc_score) that allow dissociation of the two ToM components.

1.2 Interpersonal Reactivity Index (IRI)

The IRI (Davis, 1983) is the standard multi-dimensional self-report measure of empathy. It distinguishes between cognitive and affective aspects of empathy across four subscales (each 0–28):

Subscale Description
IRI_perspectiveTaking Cognitive: spontaneous tendency to adopt others’ point of view
IRI_empathicConcern Affective: other-oriented feelings of warmth and concern
IRI_fantasy Tendency to imaginatively transpose into fictional characters
IRI_personalDistress Self-oriented distress in response to others’ suffering

The IRI is analysed here alongside the MASC because both instruments tap social-cognitive ability (albeit via different channels — implicit film-based behaviour vs explicit self-report), and both may moderate strategic behaviour in the GTEMO games.

2 Data overview

df |>
  select(game_id,
         MASC_ToM_score, MASC_dimToM_score, MASC_excToM_score,
         MASC_noToM_score, MASC_attention_score,
         MASC_affective_perc_score, MASC_cognitive_perc_score) |>
  skim()
Data summary
Name select(…)
Number of rows 122
Number of columns 8
_______________________
Column type frequency:
factor 1
numeric 7
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
game_id 0 1 FALSE 4 BS: 32, SH: 32, MP: 30, PD: 28

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
MASC_ToM_score 0 1 31.90 3.35 21.00 30.00 32.00 34.00 39 ▁▂▅▇▂
MASC_dimToM_score 0 1 5.96 2.60 0.00 4.00 6.00 7.00 14 ▂▆▇▂▁
MASC_excToM_score 0 1 5.59 2.48 0.00 4.00 5.00 7.00 15 ▂▇▃▁▁
MASC_noToM_score 0 1 1.55 1.44 0.00 0.00 1.00 2.00 7 ▇▃▂▁▁
MASC_attention_score 0 1 4.28 1.05 1.00 4.00 4.00 5.00 6 ▁▃▆▇▂
MASC_affective_perc_score 0 1 0.64 0.09 0.39 0.61 0.67 0.67 1 ▂▇▇▂▁
MASC_cognitive_perc_score 0 1 0.58 0.08 0.41 0.52 0.56 0.63 1 ▅▇▂▁▁

Descriptive skim of MASC variables including the attention control score.

3 MASC analysis

3.1 Descriptive statistics by game

tab_masc
Characteristic Overall (N = 122)¹ BS (N = 32)¹ MP (N = 30)¹ PD (N = 28)¹ SH (N = 32)¹ p-value² Effect size³
Correct ToM (0–45) 32.000 (30.000, 34.000) 31.500 (29.500, 34.500) 31.000 (29.000, 34.000) 33.000 (30.000, 34.000) 33.500 (31.000, 35.000) 0.126 η² = 0.023 (small)
Diminishing (under-mentalising) 6.000 (4.000, 7.000) 6.000 (4.500, 7.000) 7.000 (6.000, 8.000) 5.000 (4.000, 6.000) 6.000 (4.000, 7.000) 0.026 η² = 0.053 (small)
Exceeding (over-mentalising) 5.000 (4.000, 7.000) 5.000 (4.000, 7.000) 6.000 (4.000, 7.000) 5.500 (5.000, 7.000) 5.000 (4.000, 6.000) 0.479 η² = -0.004 (small)
No ToM (wrong), p-value² = 0.360
    0 33 (27%) 9 (28%) 6 (20%) 7 (25%) 11 (34%)
    1 33 (27%) 7 (22%) 9 (30%) 7 (25%) 10 (31%)
    2 31 (25%) 7 (22%) 6 (20%) 9 (32%) 9 (28%)
    3 14 (11%) 3 (9.4%) 6 (20%) 3 (11%) 2 (6.3%)
    4 8 (6.6%) 4 (13%) 3 (10%) 1 (3.6%) 0 (0%)
    6 1 (0.8%) 0 (0%) 0 (0%) 1 (3.6%) 0 (0%)
    7 2 (1.6%) 2 (6.3%) 0 (0%) 0 (0%) 0 (0%)
Attention checks correct, p-value² = 0.916
    1 1 (0.8%) 0 (0%) 0 (0%) 0 (0%) 1 (3.1%)
    2 5 (4.1%) 1 (3.1%) 2 (6.7%) 1 (3.6%) 1 (3.1%)
    3 22 (18%) 4 (13%) 5 (17%) 7 (25%) 6 (19%)
    4 37 (30%) 9 (28%) 7 (23%) 11 (39%) 10 (31%)
    5 45 (37%) 13 (41%) 13 (43%) 8 (29%) 11 (34%)
    6 12 (9.8%) 5 (16%) 3 (10%) 1 (3.6%) 3 (9.4%)
Affective ToM (proportion correct) 0.667 (0.611, 0.667) 0.611 (0.556, 0.667) 0.611 (0.556, 0.667) 0.667 (0.611, 0.667) 0.667 (0.611, 0.722) 0.407 η² = -0.001 (small)
Cognitive ToM (proportion correct) 0.556 (0.519, 0.630) 0.593 (0.519, 0.630) 0.593 (0.556, 0.630) 0.556 (0.500, 0.611) 0.556 (0.519, 0.611) 0.465 η² = -0.004 (small)
¹ Median (Q1, Q3); n (%)
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test with simulated p-value (based on 2000 replicates)
³ η² (Kruskal-Wallis). Small / medium / large: η² ≥ 0.01 / 0.06 / 0.14.
Note

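Statistics are median (Q1, Q3). The Kruskal-Wallis test checks whether distributions differ across the 4 games; η² quantifies the effect size (small ≥ 0.01, medium ≥ 0.06, large ≥ 0.14). The reported values are consistent with η² = (H − k + 1)/(n − k), which turns negative whenever H < k − 1; negative values should be read as no detectable effect, and the "(small)" label printed next to them is an artefact of the labelling rule rather than evidence of an effect. A significant p indicates heterogeneity in ToM profiles across games, which is relevant for interpreting group-level strategic differences in Parts II–III. Here only diminishing errors differ across games at the 5% level (p = 0.026, η² = 0.053).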
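As a rough illustration of how the η² column can be read, the sketch below derives η² from the Kruskal-Wallis H statistic using (H − k + 1)/(n − k), which appears to reproduce the values in the table. The data frame `toy` and the helper `eta2_from_H()` are hypothetical stand-ins, not objects from code.R, and the toy scores are simulated.

```r
# Hypothetical helper: eta-squared based on the Kruskal-Wallis H statistic
eta2_from_H <- function(H, k, n) (H - k + 1) / (n - k)

# Toy data with the same group sizes as the report (values are simulated)
set.seed(1)
toy <- data.frame(
  game  = factor(rep(c("BS", "MP", "PD", "SH"), times = c(32, 30, 28, 32))),
  score = rpois(122, lambda = 6)
)

kw <- kruskal.test(score ~ game, data = toy)
eta2_toy <- eta2_from_H(H = unname(kw$statistic),
                        k = nlevels(toy$game),
                        n = nrow(toy))

# Back out H from a reported p-value (df = k - 1 = 3) and recompute eta-squared
H_dim    <- qchisq(0.026, df = 3, lower.tail = FALSE)
eta2_dim <- eta2_from_H(H_dim, k = 4, n = 122)   # close to the reported 0.053

# H below k - 1 gives a negative estimate, i.e. no detectable effect
eta2_neg <- eta2_from_H(H = 2, k = 4, n = 122)
```

Because the estimate is a linear function of H, any test statistic below its null expectation (k − 1) yields a negative value, which is why several rows show η² slightly below zero.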

3.2 Affective vs cognitive ToM comparison

df |>
  select(`Affective ToM` = MASC_affective_perc_score,
         `Cognitive ToM` = MASC_cognitive_perc_score) |>
  pivot_longer(everything(), names_to = "Dimension", values_to = "score") |>
  group_by(Dimension) |>
  summarise(
    Median = median(score, na.rm = TRUE),
    Q1     = quantile(score, 0.25, na.rm = TRUE),
    Q3     = quantile(score, 0.75, na.rm = TRUE),
    .groups = "drop"
  ) |>
  gt() |>
  fmt_number(columns = c(Median, Q1, Q3), decimals = 3) |>
  tab_header(title = "Affective vs Cognitive ToM: sample-level summary (median, IQR)")
Affective vs Cognitive ToM: sample-level summary (median, IQR)
Dimension Median Q1 Q3
Affective ToM 0.667 0.611 0.667
Cognitive ToM 0.556 0.519 0.630

The following tests whether, at the sample level, affective and cognitive ToM accuracy differ within individuals (paired Wilcoxon signed-rank, as scores are bounded proportions).

# Results computed in code.R: wilcox_res (Hodges-Lehmann CI) + wilcox_es (effect size r)
tibble(
  Statistic      = c("V (Wilcoxon)", "p-value", "Pseudo-median diff. (H-L)",
                     "95% CI lower", "95% CI upper",
                     "Effect size r", "Magnitude"),
  Value          = c(
    round(wilcox_res$statistic,  1),
    signif(wilcox_res$p.value,   3),
    round(wilcox_res$estimate,   4),
    round(wilcox_res$conf.int[1],4),
    round(wilcox_res$conf.int[2],4),
    round(wilcox_es$effsize,     3),
    as.character(wilcox_es$magnitude)
  )
) |>
  gt() |>
  tab_header(
    title    = "Wilcoxon signed-rank: Affective vs Cognitive ToM",
    subtitle = "Pseudo-median difference = Hodges-Lehmann estimator (Affective − Cognitive)"
  ) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels())
Wilcoxon signed-rank: Affective vs Cognitive ToM
Pseudo-median difference = Hodges-Lehmann estimator (Affective − Cognitive)
Statistic Value
V (Wilcoxon) 5451
p-value 5.29e-08
Pseudo-median diff. (H-L) 0.0649
95% CI lower 0.0463
95% CI upper 0.0834
Effect size r 0.494
Magnitude moderate
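
The objects wilcox_res and wilcox_es are computed in code.R, which is not shown here. A minimal base-R sketch of how comparable quantities might be obtained follows; the simulated vectors and the names `wilcox_sketch`, `z_abs`, and `r_es` are assumptions for illustration, and the effect size uses the approximation r = |Z| / √n with |Z| recovered from the two-sided p-value, which may differ slightly from the estimator used in code.R.

```r
# Simulated paired proportions (hypothetical, not the GTEMO data)
set.seed(42)
n         <- 122
cognitive <- rbeta(n, 14, 10)
affective <- pmin(cognitive + rnorm(n, mean = 0.06, sd = 0.10), 1)

# Paired signed-rank test; conf.int = TRUE adds the Hodges-Lehmann
# pseudo-median of the paired differences and its confidence interval
wilcox_sketch <- wilcox.test(affective, cognitive,
                             paired = TRUE, conf.int = TRUE,
                             conf.level = 0.95, exact = FALSE)

# Approximate effect size r = |Z| / sqrt(n), with |Z| recovered from the
# two-sided p-value of the normal approximation
z_abs <- qnorm(wilcox_sketch$p.value / 2, lower.tail = FALSE)
r_es  <- z_abs / sqrt(n)

c(V  = unname(wilcox_sketch$statistic),
  HL = unname(wilcox_sketch$estimate),
  r  = r_es)
```

The order of the arguments fixes the sign of the estimate: with `affective` first, a positive pseudo-median means higher affective than cognitive accuracy.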

4 Figures — MASC

4.1 Response type distribution

p_stacked
Figure 1: Average proportion of the 4 MASC response types per experimental condition. Correct responses dominate; diminishing (under-mentalising) is the most frequent error type, consistent with non-clinical samples.

4.2 ToM score distribution by game

p_violin_tom
Figure 2: Distribution of total correct ToM score (0–45) by experimental condition.

4.3 MASC dimensions heatmap

p_heat_masc
Figure 3: Within-variable standardised means (z-scores) across games. Colour encodes relative position within each dimension, making incompatible scales (0–45 count vs 0–1 proportion) comparable. Cell labels show raw means; Affective (%) and Cognitive (%) labels are multiplied ×100 for readability.
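
One plausible reading of the standardisation behind this heatmap is sketched below: compute the mean of each variable per game, then z-score those means across the four games. This is an assumption about how p_heat_masc is built (code.R is not shown); the data frame `toy` is simulated and `game_means` / `z_means` are hypothetical names.

```r
# Simulated stand-in for two MASC variables on different scales
set.seed(7)
toy <- data.frame(
  game = rep(c("BS", "MP", "PD", "SH"), times = c(32, 30, 28, 32)),
  MASC_ToM_score            = rnorm(122, mean = 32,   sd = 3.4),
  MASC_affective_perc_score = rnorm(122, mean = 0.64, sd = 0.09)
)

# Raw game means (these would be the cell labels)
game_means <- aggregate(
  cbind(MASC_ToM_score, MASC_affective_perc_score) ~ game,
  data = toy, FUN = mean
)

# z-score each column of means across the four games (this would drive the colour)
z_means <- game_means
z_means[-1] <- lapply(game_means[-1], function(m) (m - mean(m)) / sd(m))

z_means
```

After this step every column has mean 0 and standard deviation 1 across games, so a count and a proportion can share one colour scale.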

4.4 Affective vs Cognitive ToM by game

p_aff_cog
Figure 4: Distributions of affective and cognitive ToM accuracy by experimental condition. Accuracy displayed as proportion (0–1).

4.5 Cognitive vs affective ToM scatter (pooled sample)

p_scatter
Figure 5: Pooled scatter of cognitive vs affective ToM accuracy with a single OLS regression line (grey). The top-left label reports the slope (β), variance explained (R²), and significance of the linear fit. Points are coloured by game.

4.6 Attention control vs ToM scores

The MASC includes attention-check items that do not require mental-state inference. Correlating the attention score with ToM scores helps assess whether performance differences are driven by general task engagement / comprehension rather than ToM ability per se.

p_attention_panel
Figure 6: Scatter plots of the MASC attention-check score against overall ToM score (left), affective ToM accuracy (centre), and cognitive ToM accuracy (right). Each panel shows an OLS line (grey band = 95% CI) and a top-left label reporting β, R², and p-value. Points are coloured by game.
Note

A strong positive association between attention score and ToM scores would indicate that overall task engagement (rather than ToM specifically) drives performance. A weak or absent association is more consistent with ToM scores reflecting the construct of interest.
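
The β, R², and p-value shown in each panel label can be extracted from a simple linear fit. The sketch below uses simulated data; `toy`, `fit`, and `panel_label` are hypothetical names rather than objects from code.R.

```r
# Simulated attention scores (1 to 6) and ToM scores
set.seed(3)
toy <- data.frame(attention = sample(1:6, 122, replace = TRUE))
toy$tom <- 30 + 0.4 * toy$attention + rnorm(122, sd = 3)

fit <- lm(tom ~ attention, data = toy)
sm  <- summary(fit)

beta <- coef(sm)["attention", "Estimate"]   # slope of the OLS line
r2   <- sm$r.squared                        # variance explained
pval <- coef(sm)["attention", "Pr(>|t|)"]   # p-value of the slope

panel_label <- sprintf("beta = %.2f, R2 = %.3f, p = %.3f", beta, r2, pval)
panel_label
```

With a single predictor, R² equals the squared Pearson correlation, so a weak association shows up as both a flat slope and an R² near zero.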

5 IRI — Interpersonal Reactivity Index

5.1 Descriptive statistics by game

tab_iri
Characteristic Overall (N = 122)¹ BS (N = 32)¹ MP (N = 30)¹ PD (N = 28)¹ SH (N = 32)¹ p-value² Effect size³
IRI – Empathic Concern (0–28) 19.0 (16.0, 22.0) 18.5 (16.0, 22.0) 19.5 (15.0, 22.0) 20.0 (16.0, 22.0) 19.5 (15.5, 21.0) 0.884 η² = -0.02 (small)
IRI – Perspective Taking (0–28) 19.0 (17.0, 23.0) 19.0 (16.0, 22.0) 21.0 (17.0, 24.0) 19.0 (15.5, 23.0) 20.0 (17.0, 23.0) 0.548 η² = -0.007 (small)
IRI – Fantasy (0–28) 18.0 (14.0, 23.0) 19.5 (13.0, 23.5) 17.5 (13.0, 22.0) 21.0 (14.5, 24.0) 17.0 (14.5, 20.5) 0.245 η² = 0.01 (small)
IRI – Personal Distress (0–28) 11.0 (8.0, 15.0) 12.5 (7.5, 15.5) 11.0 (6.0, 14.0) 11.0 (8.5, 13.5) 11.5 (7.5, 15.0) 0.794 η² = -0.017 (small)
¹ Median (Q1, Q3)
² Kruskal-Wallis rank sum test
³ η² (Kruskal-Wallis). Small / medium / large: η² ≥ 0.01 / 0.06 / 0.14.
Note

Statistics are median (Q1, Q3). Kruskal-Wallis tests between games; η² effect sizes reported. Random assignment should yield comparable IRI profiles across conditions — any significant differences are relevant as potential confounders in subsequent analyses.

5.2 IRI subscale distributions

p_iri_violin
Figure 7: Distribution of the four IRI subscales across experimental conditions. Scores range from 0 to 28 per subscale.

5.3 Internal consistency

Cronbach’s α for the four-subscale block:

# psych::alpha() computed in code.R
print(iri_alpha, digits = 3)

Reliability analysis   
Call: psych::alpha(x = iri_items)

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
     0.539     0.545   0.518     0.231 1.2 0.0672 16.9 3.13    0.281

    95% confidence boundaries 
         lower alpha upper
Feldt    0.389 0.539 0.659
Duhachek 0.407 0.539 0.671

 Reliability if an item is dropped:
                      raw_alpha std.alpha G6(smc) average_r   S/N alpha se
IRI_empathicConcern       0.349     0.330   0.300     0.141 0.493   0.0987
IRI_perspectiveTaking     0.568     0.577   0.480     0.313 1.366   0.0669
IRI_fantasy               0.354     0.378   0.356     0.168 0.608   0.1029
IRI_personalDistress      0.553     0.563   0.470     0.300 1.288   0.0690
                        var.r med.r
IRI_empathicConcern   0.03737 0.217
IRI_perspectiveTaking 0.00308 0.285
IRI_fantasy           0.04605 0.277
IRI_personalDistress  0.00640 0.307

 Item statistics 
                        n raw.r std.r r.cor r.drop mean   sd
IRI_empathicConcern   122 0.719 0.754 0.642  0.486 18.9 4.17
IRI_perspectiveTaking 122 0.505 0.556 0.315  0.187 19.7 4.27
IRI_fantasy           122 0.756 0.722 0.575  0.439 18.1 5.47
IRI_personalDistress  122 0.611 0.570 0.340  0.233 11.0 5.28
Note

Cronbach’s α across the four IRI subscales reflects the internal consistency of the battery as a whole (treating the four subscales as items). High α indicates overlap between subscales; low α, as observed here (raw α = 0.54), is expected and appropriate when the subscales capture distinct facets of empathy (the IRI was designed as a multi-dimensional instrument). Per-subscale reliability would typically be assessed at the item level.
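
To make the link between α and subscale overlap concrete, the sketch below computes raw α from the variance formula and standardised α from the average inter-subscale correlation. The helpers `cronbach_alpha()` and `std_alpha()` and the simulated data are hypothetical illustrations; the reported values come from psych::alpha() in code.R.

```r
# Raw alpha: k/(k-1) * (1 - sum of item variances / variance of the total score)
cronbach_alpha <- function(x) {
  x <- as.matrix(x)
  k <- ncol(x)
  item_var  <- apply(x, 2, var)
  total_var <- var(rowSums(x))
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

# Standardised alpha from the number of items k and the mean correlation r_bar
std_alpha <- function(k, r_bar) k * r_bar / (1 + (k - 1) * r_bar)

# Simulated four-subscale block sharing a weak common factor
set.seed(13)
shared <- rnorm(122)
toy <- data.frame(
  s1 = shared + rnorm(122, sd = 2),
  s2 = shared + rnorm(122, sd = 2),
  s3 = shared + rnorm(122, sd = 2),
  s4 = shared + rnorm(122, sd = 2)
)
alpha_toy <- cronbach_alpha(toy)

# Plugging in the reported average_r of 0.231 with k = 4 subscales
std_alpha(4, 0.231)   # close to the std.alpha of 0.545 printed above
```

The second helper shows why α stays moderate here: with only four "items" and an average correlation near 0.23, α cannot be high unless the subscales overlap much more strongly.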

6 MASC × IRI: correlations and regressions

This section tests whether self-reported empathy (IRI) is associated with film-based ToM performance (MASC). Two complementary analyses are reported: targeted Spearman correlations to check whether matching pairs (affective ToM ↔︎ affective empathy; cognitive ToM ↔︎ cognitive empathy) are stronger than crossing ones, followed by binomial GLMs predicting MASC accuracy from the four IRI subscales simultaneously.

6.1 Level A — Spearman correlations

tab_spearman |>
  gt() |>
  cols_label(
    MASC_dim = "MASC dimension",
    Pair     = "Pair",
    rho      = "\u03c1",
    p_fmt    = "p",
    sig      = "Sig."
  ) |>
  tab_header(
    title    = "Spearman correlations: MASC \u00d7 IRI",
    subtitle = paste0("N = ", n_mi,
                      " complete cases. Exact = FALSE (ties present).")
  ) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(columns = sig, rows = sig != "ns")) |>
  tab_footnote("* p < .05  ** p < .01  *** p < .001  ns = not significant")
Spearman correlations: MASC × IRI
N = 122 complete cases. Exact = FALSE (ties present).
MASC dimension Pair ρ p Sig.
Affective ToM Affective ToM × Empathic Concern -0.002 0.982 ns
Affective ToM Affective ToM × Personal Distress -0.047 0.605 ns
Affective ToM Affective ToM × Perspective Taking -0.148 0.103 ns
Cognitive ToM Cognitive ToM × Perspective Taking 0.041 0.652 ns
Cognitive ToM Cognitive ToM × Empathic Concern -0.030 0.742 ns
Cognitive ToM Cognitive ToM × Fantasy -0.045 0.622 ns
Cognitive ToM Cognitive ToM × Personal Distress 0.025 0.787 ns
* p < .05 ** p < .01 *** p < .001 ns = not significant
Note

Matching vs crossing hypothesis. Affective ToM (emotion inference from film clips) is theorised to align more strongly with affective empathy (Empathic Concern, Personal Distress). Cognitive ToM (belief/intention inference) should align more with cognitive empathy (Perspective Taking). Pairs that cross the affective/cognitive boundary serve as a discriminant validity check — weaker or non-significant ρ there supports construct differentiation.
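
A table like tab_spearman could be assembled by looping over the matching and crossing pairs and calling cor.test() with exact = FALSE, as the subtitle indicates. The sketch below is an assumption about that workflow; `toy`, `var_pairs`, and `spearman_sketch` are hypothetical names and the data are simulated.

```r
# Simulated scores; item counts follow those stated in the text
set.seed(11)
n <- 122
toy <- data.frame(
  MASC_affective_perc_score = rbinom(n, 17, 0.64) / 17,
  MASC_cognitive_perc_score = rbinom(n, 28, 0.58) / 28,
  IRI_empathicConcern       = sample(0:28, n, replace = TRUE),
  IRI_perspectiveTaking     = sample(0:28, n, replace = TRUE)
)

var_pairs <- list(
  c("MASC_affective_perc_score", "IRI_empathicConcern"),    # matching
  c("MASC_affective_perc_score", "IRI_perspectiveTaking"),  # crossing
  c("MASC_cognitive_perc_score", "IRI_perspectiveTaking"),  # matching
  c("MASC_cognitive_perc_score", "IRI_empathicConcern")     # crossing
)

# One Spearman test per pair; exact = FALSE because ties are present
spearman_sketch <- do.call(rbind, lapply(var_pairs, function(v) {
  ct <- cor.test(toy[[v[1]]], toy[[v[2]]], method = "spearman", exact = FALSE)
  data.frame(MASC = v[1], IRI = v[2],
             rho = unname(ct$estimate), p = ct$p.value)
}))

spearman_sketch
```

Listing the pairs explicitly keeps the matching versus crossing design visible in the code rather than buried in a full correlation matrix.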

6.2 Correlation heatmap

p_cor_heat
Figure 8: Spearman ρ between the two MASC dimensions (rows) and the four IRI subscales (columns). Red = positive association, blue = negative. Significance stars: * p < .05 ** p < .01 *** p < .001.

6.3 Level B — Binomial GLMs

IRI subscales entered simultaneously as predictors of MASC accuracy. The response is modelled as a binomial count of correct answers (17 affective items; 28 cognitive items, total = 45). Coefficients are on the log-odds scale; the forest plot shows exponentiated odds ratios (OR) with 95% Wald CIs.

tab_glm |>
  select(Outcome, Predictor, beta, SE, OR, OR_lo, OR_hi, stat, p_fmt, sig) |>
  gt() |>
  tab_header(
    title    = "Binomial GLM: IRI subscales predicting MASC accuracy",
    subtitle = "Family: binomial (logit link). Wald 95% CI."
  ) |>
  cols_label(beta = "\u03b2", SE = "SE", OR = "OR",
             OR_lo = "95% CI lo", OR_hi = "95% CI hi",
             stat = "z", p_fmt = "p", sig = "Sig.") |>
  tab_row_group(label = "Outcome: Cognitive ToM (28 items)",
                rows = Outcome == "Cognitive ToM") |>
  tab_row_group(label = "Outcome: Affective ToM (17 items)",
                rows = Outcome == "Affective ToM") |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(columns = sig, rows = sig != "")) |>
  tab_style(style = cell_text(weight = "bold", color = "#2d7a3a"),
            locations = cells_row_groups()) |>
  tab_footnote("\u03b2 = log-odds coefficient. OR = exp(\u03b2). Wald 95% CI. * p < .05  ** p < .01  *** p < .001.")
Binomial GLM: IRI subscales predicting MASC accuracy
Family: binomial (logit link). Wald 95% CI.
Outcome Predictor β SE OR 95% CI lo 95% CI hi z p Sig.
Outcome: Affective ToM (17 items)
Affective ToM Empathic Concern 0.0061 0.0124 1.006 0.982 1.031 0.488 0.626
Affective ToM Perspective Taking -0.0160 0.0116 0.984 0.962 1.007 -1.380 0.167
Affective ToM Fantasy 0.0001 0.0093 1.000 0.982 1.018 0.015 0.988
Affective ToM Personal Distress -0.0041 0.0094 0.996 0.978 1.014 -0.444 0.657
Outcome: Cognitive ToM (28 items)
Cognitive ToM Empathic Concern 0.0001 0.0096 1.000 0.981 1.019 0.014 0.989
Cognitive ToM Perspective Taking 0.0076 0.0088 1.008 0.990 1.025 0.867 0.386
Cognitive ToM Fantasy -0.0081 0.0071 0.992 0.978 1.006 -1.133 0.257
Cognitive ToM Personal Distress 0.0062 0.0072 1.006 0.992 1.020 0.860 0.39
β = log-odds coefficient. OR = exp(β). Wald 95% CI. * p < .05 ** p < .01 *** p < .001.
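
The model behind this table is fitted in code.R and is not shown. A minimal sketch of a binomial GLM on a count of correct answers, with odds ratios and Wald intervals computed as exp(β ± z·SE), might look as follows; the simulated data and the names `toy`, `fit`, and `or_tab` are assumptions for illustration.

```r
set.seed(5)
n       <- 122
n_items <- 17   # affective item count as stated in the text above
toy <- data.frame(
  IRI_empathicConcern   = sample(0:28, n, replace = TRUE),
  IRI_perspectiveTaking = sample(0:28, n, replace = TRUE),
  IRI_fantasy           = sample(0:28, n, replace = TRUE),
  IRI_personalDistress  = sample(0:28, n, replace = TRUE)
)
toy$correct <- rbinom(n, size = n_items, prob = 0.64)

# Response as (successes, failures); logit link gives log-odds coefficients
fit <- glm(cbind(correct, n_items - correct) ~
             IRI_empathicConcern + IRI_perspectiveTaking +
             IRI_fantasy + IRI_personalDistress,
           family = binomial(link = "logit"), data = toy)

est  <- coef(summary(fit))
crit <- qnorm(0.975)
or_tab <- data.frame(
  Predictor = rownames(est),
  beta  = est[, "Estimate"],
  SE    = est[, "Std. Error"],
  OR    = exp(est[, "Estimate"]),
  OR_lo = exp(est[, "Estimate"] - crit * est[, "Std. Error"]),
  OR_hi = exp(est[, "Estimate"] + crit * est[, "Std. Error"]),
  row.names = NULL
)

# Check against a reported row: Perspective Taking, beta = -0.0160, SE = 0.0116
exp(-0.0160 + c(-1, 1) * crit * 0.0116)   # close to the tabled 0.962 and 1.007
```

The last line confirms that the tabled intervals are Wald intervals on the log-odds scale, exponentiated afterwards.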
Note

Overdispersion check. A binomial GLM assumes variance = μ(1−μ)/n; real data often show extra-binomial variation (overdispersion). The dispersion parameter φ is estimated by the quasi-binomial fit: φ(affective) = 0.615, φ(cognitive) = 0.817. φ ≈ 1 means the binomial assumption holds; φ >> 1 means SEs from the standard binomial are underestimated; φ < 1, as observed here for both outcomes, indicates underdispersion, so the standard binomial SEs are if anything conservative. The quasi-binomial robustness check below quantifies the difference.

p_glm_forest
Figure 9: Forest plot: odds ratios from the binomial GLMs. Error bars = 95% Wald CI. Dashed line = OR 1 (null effect).

6.4 Quasi-binomial robustness check

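The quasi-binomial model uses the same formula but estimates a free dispersion parameter φ, which rescales the standard errors by √φ (inflating them when φ > 1, shrinking them when φ < 1). Coefficients (β) and odds ratios are identical to the binomial fit; only SEs and p-values change. Since φ < 1 for both outcomes here, the quasi-binomial SEs are smaller than the binomial ones. The comparison table shows directly whether the dispersion adjustment changes inference.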
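
The relationship between φ and the SE ratio can be checked on simulated data. The sketch below is a hypothetical example with a single predictor `x`, not the models from code.R: it fits both families, estimates φ from the Pearson residuals, and compares the SE ratio with √φ.

```r
set.seed(8)
n       <- 122
n_items <- 28   # cognitive item count as stated in the text
toy <- data.frame(x = sample(0:28, n, replace = TRUE))
toy$correct <- rbinom(n, size = n_items, prob = 0.58)

f <- cbind(correct, n_items - correct) ~ x
fit_binom <- glm(f, family = binomial,      data = toy)
fit_quasi <- glm(f, family = quasibinomial, data = toy)

# Pearson-based dispersion estimate: sum of squared Pearson residuals / residual df
phi <- sum(residuals(fit_binom, type = "pearson")^2) / df.residual(fit_binom)

se_binom <- coef(summary(fit_binom))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

se_ratio <- se_quasi / se_binom   # equals sqrt(phi) for every coefficient

# Applying the same rule to the dispersion values reported in this section
sqrt(c(affective = 0.615, cognitive = 0.817))   # about 0.78 and 0.90
```

The last line matches the SE ratio column of the comparison table to within rounding, which is why the quasi-binomial intervals are tighter rather than wider in this sample.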

tab_glm_compare |>
  select(Outcome, Predictor, beta, OR,
         SE_binom, SE_quasi, SE_ratio,
         p_binom, sig_binom, p_quasi, sig_quasi) |>
  gt() |>
  tab_header(
    title    = "Binomial vs quasi-binomial: SE and p-value comparison",
    subtitle = paste0("φ (dispersion): Affective = ", disp_aff,
                      ", Cognitive = ", disp_cog,
                      ". SE ratio \u2248 \u221a\u03c6.")
  ) |>
  cols_label(
    beta      = "\u03b2", OR = "OR",
    SE_binom  = "SE (binom)", SE_quasi = "SE (quasi)", SE_ratio = "SE ratio",
    p_binom   = "p (binom)",  sig_binom = "Sig. (binom)",
    p_quasi   = "p (quasi)",  sig_quasi = "Sig. (quasi)"
  ) |>
  tab_row_group(label = "Outcome: Cognitive ToM",
                rows = Outcome == "Cognitive ToM") |>
  tab_row_group(label = "Outcome: Affective ToM",
                rows = Outcome == "Affective ToM") |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) |>
  tab_style(style = cell_text(weight = "bold", color = "#2d7a3a"),
            locations = cells_row_groups()) |>
  tab_style(
    style = cell_fill(color = "#fff3cd"),
    locations = cells_body(
      columns = c(sig_binom, sig_quasi),
      rows = sig_binom != sig_quasi
    )
  ) |>
  tab_footnote("Yellow highlight = significance changes between models. SE ratio = SE(quasi) / SE(binom).")
Binomial vs quasi-binomial: SE and p-value comparison
φ (dispersion): Affective = 0.615, Cognitive = 0.817. SE ratio ≈ √φ.
Outcome Predictor β OR SE (binom) SE (quasi) SE ratio p (binom) Sig. (binom) p (quasi) Sig. (quasi)
Outcome: Affective ToM
Affective ToM Empathic Concern 0.0061 1.006 0.0124 0.0098 0.79 0.626 0.535
Affective ToM Perspective Taking -0.0160 0.984 0.0116 0.0091 0.78 0.167 0.081
Affective ToM Fantasy 0.0001 1.000 0.0093 0.0073 0.78 0.988 0.985
Affective ToM Personal Distress -0.0041 0.996 0.0094 0.0073 0.78 0.657 0.573
Outcome: Cognitive ToM
Cognitive ToM Empathic Concern 0.0001 1.000 0.0096 0.0087 0.91 0.989 0.988
Cognitive ToM Perspective Taking 0.0076 1.008 0.0088 0.0080 0.91 0.386 0.34
Cognitive ToM Fantasy -0.0081 0.992 0.0071 0.0065 0.92 0.257 0.213
Cognitive ToM Personal Distress 0.0062 1.006 0.0072 0.0065 0.90 0.39 0.344
Yellow highlight = significance changes between models. SE ratio = SE(quasi) / SE(binom).
p_glm_forest_quasi
Figure 10: Forest plot: odds ratios from the quasi-binomial GLMs. CIs are narrower than in the binomial fit because SEs are scaled by √φ and φ < 1 for both outcomes. Compare with the binomial forest plot above.

7 Conditioning on gender and role

The preceding analyses compare MASC and IRI scores across experimental games without accounting for sample composition. Since participants were not stratified by demographics at assignment, observed game-level differences in ToM and empathy scores may be confounded by gender composition or by the structural difference between experimental sites (P1 = LEEN laboratory; P2 = CoCoLab). This section (i) visualises distributions stratified by gender and role, and (ii) fits OLS models with game, gender, and role entered simultaneously as predictors. The reference category for all models is: game = BS, gender = Female (the gender coefficient, labelled genderMale, is the Male vs Female contrast), role = P1 (LEEN).

7.1 MASC by gender

p_masc_gender
Figure 11: MASC overall ToM score by gender within each game condition. Violin + box plot; no legend (Male = blue, Female = orange).
p_masc_dim_gender
Figure 12: MASC affective and cognitive ToM accuracy by gender, faceted by game (columns) and dimension (rows).

7.2 MASC by role

p_masc_role
Figure 13: MASC overall ToM score by experimental role (P1 LEEN vs P2 CoCoLab) within each game. Violin + box plot.
p_masc_dim_role
Figure 14: MASC affective and cognitive ToM accuracy by role, faceted by game (columns) and dimension (rows).

7.3 IRI by gender

p_iri_gender
Figure 15: IRI four subscales by gender (pooled sample). All subscales on the same y-axis (0–28) for comparability.

7.4 IRI by role

p_iri_role
Figure 16: IRI four subscales by experimental role: P1 (LEEN) vs P2 (CoCoLab), pooled across games.

7.5 OLS regressions with demographic controls

MASC models: outcome variables are overall ToM score (0–45) and the two proportion scores (affective, cognitive), each regressed on game condition, gender, and role simultaneously.

gt_ols_masc
OLS: MASC accuracy ~ game + gender + role
OLS. Reference: game = BS, gender = Female, role = P1 (LEEN). 95% CI from confint().
Outcome Predictor β SE 95% CI lo 95% CI hi t p Sig.
Outcome: Cognitive ToM (%)
Cognitive ToM (%) Game: MP vs BS 0.006 0.020 -0.034 0.046 0.306 0.76
Cognitive ToM (%) Game: PD vs BS -0.021 0.021 -0.062 0.020 -1.030 0.305
Cognitive ToM (%) Game: SH vs BS 0.001 0.020 -0.038 0.040 0.058 0.954
Cognitive ToM (%) Gender: Male vs Female 0.014 0.014 -0.015 0.042 0.962 0.338
Cognitive ToM (%) Role: CoCoLab vs LEEN 0.001 0.014 -0.028 0.029 0.042 0.966
Outcome: Affective ToM (%)
Affective ToM (%) Game: MP vs BS -0.011 0.023 -0.057 0.035 -0.473 0.637
Affective ToM (%) Game: PD vs BS 0.016 0.024 -0.030 0.063 0.690 0.492
Affective ToM (%) Game: SH vs BS 0.030 0.023 -0.015 0.075 1.299 0.197
Affective ToM (%) Gender: Male vs Female 0.026 0.017 -0.007 0.059 1.575 0.118
Affective ToM (%) Role: CoCoLab vs LEEN -0.019 0.016 -0.052 0.013 -1.162 0.248
Outcome: Overall ToM (0–45)
Overall ToM (0–45) Game: MP vs BS -0.788 0.854 -2.479 0.903 -0.923 0.358
Overall ToM (0–45) Game: PD vs BS 0.633 0.870 -1.091 2.356 0.727 0.469
Overall ToM (0–45) Game: SH vs BS 0.750 0.840 -0.913 2.413 0.893 0.374
Overall ToM (0–45) Gender: Male vs Female 0.140 0.610 -1.068 1.348 0.230 0.819
Overall ToM (0–45) Role: CoCoLab vs LEEN 0.131 0.608 -1.073 1.336 0.216 0.83
β = OLS coefficient. * p < .05 ** p < .01 *** p < .001.
p_forest_masc8
Figure 17: Forest plot: OLS β coefficients with 95% CI for MASC outcomes. Dashed line = 0 (null effect). All three outcomes shown simultaneously; note that scales differ (0–45 vs proportion).

IRI models — each of the four subscales (0–28) regressed on game, gender, and role.

gt_ols_iri
OLS: IRI subscales ~ game + gender + role
OLS. Reference: game = BS, gender = Female, role = P1 (LEEN). 95% CI from confint().
Outcome Predictor β SE 95% CI lo 95% CI hi t p Sig.
Outcome: Personal Distress
Personal Distress Game: MP vs BS -1.243 1.243 -3.705 1.218 -1.000 0.319
Personal Distress Game: PD vs BS -0.664 1.267 -3.173 1.845 -0.524 0.601
Personal Distress Game: SH vs BS -0.688 1.222 -3.109 1.734 -0.562 0.575
Personal Distress Gender: Male vs Female -4.265 0.888 -6.024 -2.506 -4.804 < 0.001 ***
Personal Distress Role: CoCoLab vs LEEN -0.738 0.885 -2.491 1.016 -0.833 0.406
Outcome: Fantasy
Fantasy Game: MP vs BS -1.310 1.273 -3.832 1.212 -1.029 0.306
Fantasy Game: PD vs BS 1.086 1.298 -1.484 3.656 0.837 0.404
Fantasy Game: SH vs BS -1.375 1.252 -3.855 1.105 -1.098 0.274
Fantasy Gender: Male vs Female -4.450 0.909 -6.252 -2.649 -4.893 < 0.001 ***
Fantasy Role: CoCoLab vs LEEN 0.262 0.907 -1.534 2.058 0.289 0.773
Outcome: Perspective Taking
Perspective Taking Game: MP vs BS 1.577 1.087 -0.576 3.729 1.451 0.149
Perspective Taking Game: PD vs BS 0.307 1.108 -1.886 2.501 0.277 0.782
Perspective Taking Game: SH vs BS 0.781 1.069 -1.335 2.898 0.731 0.466
Perspective Taking Gender: Male vs Female -1.114 0.776 -2.651 0.424 -1.435 0.154
Perspective Taking Role: CoCoLab vs LEEN 0.443 0.774 -1.090 1.976 0.572 0.569
Outcome: Empathic Concern
Empathic Concern Game: MP vs BS -0.144 1.056 -2.235 1.947 -0.136 0.892
Empathic Concern Game: PD vs BS 0.549 1.076 -1.582 2.680 0.510 0.611
Empathic Concern Game: SH vs BS -0.719 1.038 -2.775 1.338 -0.692 0.49
Empathic Concern Gender: Male vs Female -1.625 0.754 -3.118 -0.131 -2.154 0.033 *
Empathic Concern Role: CoCoLab vs LEEN 0.377 0.752 -1.112 1.866 0.501 0.617
β = OLS coefficient. * p < .05 ** p < .01 *** p < .001.
p_forest_iri8
Figure 18: Forest plot: OLS β coefficients with 95% CI for IRI subscales. Game effects (vs BS), gender effect (Male vs Female), and role effect (CoCoLab vs LEEN) shown side by side.
Note

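Interpretation note. Game coefficients in these models represent the conditional effect of game assignment given equal gender and role composition. A game coefficient that is significant unconditionally (Kruskal-Wallis in sections 3.1 and 5.1) but non-significant here suggests partial confounding by demographics. Conversely, a gender or role coefficient reveals systematic differences in MASC/IRI scores attributable to those characteristics independently of game. In the tables above, gender is the only predictor with significant coefficients: the genderMale estimates are negative for Personal Distress (β = −4.27), Fantasy (β = −4.45), and Empathic Concern (β = −1.63), while no MASC outcome shows a significant gender, role, or game effect.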
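
Because the sign of a factor coefficient depends on which level is the reference, it helps to check the coding explicitly. The sketch below uses simulated data and hypothetical names (`toy`, `fit_female_ref`, `fit_male_ref`) to show that a term labelled genderMale implies Female as the reference level, and that switching the reference with relevel() flips both the label and the sign.

```r
set.seed(9)
toy <- data.frame(
  game   = factor(rep(c("BS", "MP", "PD", "SH"), times = c(32, 30, 28, 32))),
  gender = factor(sample(c("Female", "Male"), 122, replace = TRUE)),
  role   = factor(rep(c("P1", "P2"), length.out = 122))
)
toy$IRI_personalDistress <- 13 - 4 * (toy$gender == "Male") + rnorm(122, sd = 5)

# Default factor coding: levels are sorted alphabetically, so "Female" is the
# reference and the coefficient is labelled "genderMale" (Male vs Female)
fit_female_ref <- lm(IRI_personalDistress ~ game + gender + role, data = toy)

# Making "Male" the reference flips the label and the sign of the estimate
toy$gender_m <- relevel(toy$gender, ref = "Male")
fit_male_ref <- lm(IRI_personalDistress ~ game + gender_m + role, data = toy)

coef(fit_female_ref)["genderMale"]
coef(fit_male_ref)["gender_mFemale"]
```

The two estimates are equal in magnitude and opposite in sign, so the direction of a reported gender effect can only be read once the reference level is known.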

8 Response times & processing speed

Cognitive and affective tasks vary in the time required for deliberation and response. This section examines whether speed of processing correlates with accuracy across the MASC (ToM), IRI (empathy), and CRT (reflection), and whether games differ in time investment.

8.1 MASC response times by dimension

tab_resp_times
Characteristic Overall (N = 122)¹ BS (N = 32)¹ MP (N = 30)¹ PD (N = 28)¹ SH (N = 32)¹ p-value² Effect size³
MASC – avg response time (all items) 11.4 (9.6, 13.7) 11.6 (9.6, 14.0) 11.1 (9.4, 13.9) 11.1 (9.5, 13.3) 11.7 (10.1, 14.0) 0.913 η² = -0.021 (small)
MASC – avg response time (affective) 11.5 (9.9, 14.1) 11.4 (9.7, 14.3) 11.3 (9.2, 14.1) 11.4 (9.7, 13.5) 11.6 (10.3, 14.3) 0.842 η² = -0.018 (small)
MASC – avg response time (cognitive) 11.4 (9.5, 13.7) 11.7 (9.2, 14.5) 11.3 (9.2, 14.4) 11.1 (9.4, 13.3) 11.6 (9.7, 13.4) 0.955 η² = -0.023 (small)
IRI – total time (28 items) 190.0 (158.0, 237.0) 184.0 (149.0, 237.5) 175.5 (159.0, 233.0) 192.0 (157.0, 230.0) 196.5 (165.5, 238.5) 0.740 η² = -0.015 (small)
CRT – total time (4 items) 58.0 (46.0, 76.0) 51.5 (42.0, 73.0) 57.0 (45.0, 76.0) 62.0 (51.5, 73.5) 58.0 (52.0, 83.0) 0.365 η² = 0.002 (small)
¹ Median (Q1, Q3)
² Kruskal-Wallis rank sum test
³ η² (Kruskal-Wallis). Small / medium / large: η² ≥ 0.01 / 0.06 / 0.14.

8.2 MASC: speed–accuracy trade-off

p_masc_rt
Figure 19: Distribution of MASC response times by dimension (overall / affective / cognitive) across games. Violin width = density; box plot = quartiles. Shorter response times may reflect overconfidence or heuristic use; longer times suggest deliberative mentalising.
p_masc_speed_accuracy
Figure 20: MASC: average response time vs overall ToM accuracy. Does faster responding predict worse accuracy (speed–accuracy trade-off)? OLS line fitted on pooled sample; top-left label reports β, R², p-value.
Note

Speed–accuracy trade-off in ToM. If participants who respond faster are less accurate (a positive correlation between response time and accuracy), this suggests a speed–accuracy trade-off: quick responses may rely on superficial heuristics rather than genuine mentalising. Conversely, a negative correlation between response time and accuracy (faster = more accurate) would indicate fluent, confident mentalising. A near-zero correlation indicates that speed and accuracy are unrelated in this sample; response time may then reflect differences in responding style (e.g. impulsivity) rather than mentalising ability.

8.3 IRI: time spent vs all subscales

p_iri_speed_panel
Figure 21: IRI total completion time vs each of the four subscales (Empathic Concern, Perspective Taking, Fantasy, Personal Distress). Each panel shows an OLS line with 95% CI and a top-left annotation reporting β, R², and p-value. All subscales share the same y-axis scale (0–28). Points coloured by game condition.

9 Preliminary interpretation

The sample shows a median MASC ToM score of 32 (IQR = 4) out of 45 items, consistent with adequate mentalising ability in a non-clinical adult population. The affective component (median 66.7%) and the cognitive component (median 55.6%) are compared within individuals: the Wilcoxon signed-rank test yields p = 5.3 × 10⁻⁸, with a moderate effect size (r = 0.49), indicating that affective accuracy is significantly higher than cognitive accuracy at the sample level (Hodges-Lehmann difference = 0.065, 95% CI 0.046 to 0.083).

The attention-scatter plots provide a first check on whether task engagement confounds ToM performance — interpretation depends on the slope and confidence interval of the regression lines.

Differences in MASC profiles across games are informative to the extent that randomisation was imperfect or that participant sorting occurred. Among the MASC scores, only diminishing errors differ significantly across games (Kruskal-Wallis p = 0.026, η² = 0.053); this score is noted as a potential covariate for the inferential sections (Parts II–III).

For the IRI, randomly assigned groups should show comparable empathy profiles, and they do: no subscale differs across games (all Kruskal-Wallis p ≥ 0.245). Significant game differences would have flagged imbalance that warrants covariate adjustment in the main analyses.