Ib — Theory of Mind (MASC) & Empathy (IRI): descriptive analysis

Descriptive Analyses · GTEMO Experiment

Author

Eric Guerci

Published

March 22, 2026

1 Background

1.1 Theory of Mind and the MASC

Theory of Mind (ToM) is the ability to attribute mental states — beliefs, intentions, desires, emotions — to others and to understand that these may differ from one’s own. It is a core dimension of social cognition and underlies strategic behaviour in interactive settings: anticipating what others know, want, and believe is a prerequisite for effective communication, negotiation, and cooperation.

The Movie for the Assessment of Social Cognition (MASC) is a validated film-based instrument developed by Dziobek et al. (2006). Participants watch short video clips of social interactions and answer multiple-choice questions about the characters’ thoughts and feelings. The MASC is designed to capture ecological ToM by embedding mental-state inference in naturalistic, dynamic social scenes — closer to real-world interaction than classic vignette-based tasks.

The instrument yields five scores:

Variable	Description	Scale
`MASC_ToM_score`	Total correct ToM responses	0 – 45
`MASC_dimToM_score`	Diminishing errors — under-mentalising	0 – 36
`MASC_excToM_score`	Exceeding errors — over-mentalising	0 – 36
`MASC_noToM_score`	No ToM errors — no mental-state attribution	0 – 36
`MASC_attention_score`	Correct attention-check items (control)	0 – 15

Items are further classified as affective (emotion inference) or cognitive (belief/intention inference), yielding two proportion scores (`MASC_affective_perc_score`, `MASC_cognitive_perc_score`) that allow the two ToM components to be dissociated.

1.2 Interpersonal Reactivity Index (IRI)

The IRI (Davis, 1983) is a widely used multi-dimensional self-report measure of empathy. It distinguishes between cognitive and affective aspects of empathy across four subscales (each 0–28):

Subscale	Description
`IRI_perspectiveTaking`	Cognitive: spontaneous tendency to adopt others’ point of view
`IRI_empathicConcern`	Affective: other-oriented feelings of warmth and concern
`IRI_fantasy`	Tendency to imaginatively transpose into fictional characters
`IRI_personalDistress`	Self-oriented distress in response to others’ suffering
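The 0–28 range follows from the questionnaire layout: the IRI has 28 items, seven per subscale, each answered on a five-point scale coded 0–4. The sketch below shows that scoring logic on simulated answers. The item-to-subscale key and the reverse-scored item positions are placeholders chosen for illustration, not the published key, and `score_iri()` is a hypothetical helper.

```r
# Simulated answers to the 28 items, coded 0-4 (not study data)
set.seed(1)
responses <- sample(0:4, 28, replace = TRUE)

# ASSUMPTION: placeholder key (items cycle through the four subscales) and
# placeholder reverse-scored positions; swap in the published scoring key.
key <- rep(c("perspectiveTaking", "empathicConcern",
             "fantasy", "personalDistress"), times = 7)
reverse_items <- c(3, 4, 7)

score_iri <- function(resp, key, reverse_items) {
  stopifnot(length(resp) == 28, all(resp %in% 0:4))
  # Reverse-scored items are flipped on the 0-4 scale before summing
  resp[reverse_items] <- 4 - resp[reverse_items]
  # One sum per subscale: 7 items x (0..4) gives a 0-28 range
  tapply(resp, key, sum)
}

iri_scores <- score_iri(responses, key, reverse_items)
print(iri_scores)
```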

The IRI is analysed here alongside the MASC because both instruments tap social-cognitive ability (albeit via different channels — implicit film-based behaviour vs explicit self-report), and both may moderate strategic behaviour in the GTEMO games.

2 Data overview

Show code

df |>
  select(game_id,
         MASC_ToM_score, MASC_dimToM_score, MASC_excToM_score,
         MASC_noToM_score, MASC_attention_score,
         MASC_affective_perc_score, MASC_cognitive_perc_score) |>
  skim()

Data summary
Name	select(…)
Number of rows	122
Number of columns	8
_______________________
Column type frequency:
factor	1
numeric	7
________________________
Group variables	None

Variable type: factor

skim_variable	n_missing	complete_rate	ordered	n_unique	top_counts
game_id	0	1	FALSE	4	BS: 32, SH: 32, MP: 30, PD: 28

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
MASC_ToM_score	1	31.90	3.35	21.00	30.00	32.00	34.00	39	▁▂▅▇▂
MASC_dimToM_score	1	5.96	2.60	0.00	4.00	6.00	7.00	14	▂▆▇▂▁
MASC_excToM_score	1	5.59	2.48	0.00	4.00	5.00	7.00	15	▂▇▃▁▁
MASC_noToM_score	1	1.55	1.44	0.00	0.00	1.00	2.00	7	▇▃▂▁▁
MASC_attention_score	1	4.28	1.05	1.00	4.00	4.00	5.00	6	▁▃▆▇▂
MASC_affective_perc_score	1	0.64	0.09	0.39	0.61	0.67	0.67	1	▂▇▇▂▁
MASC_cognitive_perc_score	1	0.58	0.08	0.41	0.52	0.56	0.63	1	▅▇▂▁▁

Descriptive skim of MASC variables including the attention control score.

3 MASC analysis

3.1 Descriptive statistics by game

Show code

tab_masc

Characteristic	Overall N = 122¹	BS N = 32¹	MP N = 30¹	PD N = 28¹	SH N = 32¹	p-value²	Effect size³
Correct ToM (0–45)	32.000 (30.000, 34.000)	31.500 (29.500, 34.500)	31.000 (29.000, 34.000)	33.000 (30.000, 34.000)	33.500 (31.000, 35.000)	0.126	η² = 0.023 (small)
Diminishing — under-mentalising	6.000 (4.000, 7.000)	6.000 (4.500, 7.000)	7.000 (6.000, 8.000)	5.000 (4.000, 6.000)	6.000 (4.000, 7.000)	0.026	η² = 0.053 (small)
Exceeding — over-mentalising	5.000 (4.000, 7.000)	5.000 (4.000, 7.000)	6.000 (4.000, 7.000)	5.500 (5.000, 7.000)	5.000 (4.000, 6.000)	0.479	η² = -0.004 (small)
No ToM (wrong)						0.360
0	33 (27%)	9 (28%)	6 (20%)	7 (25%)	11 (34%)
1	33 (27%)	7 (22%)	9 (30%)	7 (25%)	10 (31%)
2	31 (25%)	7 (22%)	6 (20%)	9 (32%)	9 (28%)
3	14 (11%)	3 (9.4%)	6 (20%)	3 (11%)	2 (6.3%)
4	8 (6.6%)	4 (13%)	3 (10%)	1 (3.6%)	0 (0%)
6	1 (0.8%)	0 (0%)	0 (0%)	1 (3.6%)	0 (0%)
7	2 (1.6%)	2 (6.3%)	0 (0%)	0 (0%)	0 (0%)
Attention checks correct						0.916
1	1 (0.8%)	0 (0%)	0 (0%)	0 (0%)	1 (3.1%)
2	5 (4.1%)	1 (3.1%)	2 (6.7%)	1 (3.6%)	1 (3.1%)
3	22 (18%)	4 (13%)	5 (17%)	7 (25%)	6 (19%)
4	37 (30%)	9 (28%)	7 (23%)	11 (39%)	10 (31%)
5	45 (37%)	13 (41%)	13 (43%)	8 (29%)	11 (34%)
6	12 (9.8%)	5 (16%)	3 (10%)	1 (3.6%)	3 (9.4%)
Affective ToM (proportion correct)	0.667 (0.611, 0.667)	0.611 (0.556, 0.667)	0.611 (0.556, 0.667)	0.667 (0.611, 0.667)	0.667 (0.611, 0.722)	0.407	η² = -0.001 (small)
Cognitive ToM (proportion correct)	0.556 (0.519, 0.630)	0.593 (0.519, 0.630)	0.593 (0.556, 0.630)	0.556 (0.500, 0.611)	0.556 (0.519, 0.611)	0.465	η² = -0.004 (small)
¹ Median (Q1, Q3); n (%)
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test with simulated p-value (based on 2000 replicates)
³ η² (Kruskal-Wallis). Small / medium / large: η² ≥ 0.01 / 0.06 / 0.14.

Note

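Statistics are median (Q1, Q3). The Kruskal-Wallis test checks whether distributions differ across the 4 games; η² quantifies the effect size (small ≥ 0.01, medium ≥ 0.06, large ≥ 0.14). The reported values are consistent with the H-based estimator η² = (H − k + 1)/(n − k), which falls below zero whenever H < k − 1; negative values should be read as no detectable effect rather than as "small", whatever the automatic label says. A significant p indicates heterogeneity in ToM profiles across games, which matters when interpreting group-level strategic differences in Parts II–III.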
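As a rough illustration of where the η² column comes from, the sketch below computes the H-based estimator η² = (H − k + 1)/(n − k) on simulated data with the same group sizes. That this is the estimator behind the table is an assumption, although it reproduces the tabulated 0.053 from p = 0.026. The `label_eta2()` helper and its "negligible" category are hypothetical additions.

```r
set.seed(42)
# Simulated stand-in data (not the study data): no true game effect
game  <- factor(rep(c("BS", "MP", "PD", "SH"), times = c(32, 30, 28, 32)))
score <- rnorm(length(game), mean = 32, sd = 3.35)

kw <- kruskal.test(score ~ game)
H  <- unname(kw$statistic)   # Kruskal-Wallis H
k  <- nlevels(game)          # number of groups
n  <- length(score)          # total sample size

# H-based eta squared; negative whenever H < k - 1
eta2_H <- (H - k + 1) / (n - k)

# Hypothetical labelling helper: values below 0.01 are "negligible", not "small"
label_eta2 <- function(e) {
  as.character(cut(e, breaks = c(-Inf, 0.01, 0.06, 0.14, Inf),
                   labels = c("negligible", "small", "medium", "large"),
                   right = FALSE))
}

# Back-of-envelope check against the table: p = 0.026 with k = 4, n = 122
H_from_p    <- qchisq(0.026, df = 3, lower.tail = FALSE)
eta2_from_p <- (H_from_p - 3) / (122 - 4)

print(c(eta2_simulated = eta2_H, eta2_from_p = eta2_from_p))
print(label_eta2(c(-0.004, 0.023, 0.053)))
```

Under this labelling, the negative estimates in the table would be reported as negligible.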

3.2 Affective vs cognitive ToM comparison

Show code

df |>
  select(`Affective ToM` = MASC_affective_perc_score,
         `Cognitive ToM` = MASC_cognitive_perc_score) |>
  pivot_longer(everything(), names_to = "Dimension", values_to = "score") |>
  group_by(Dimension) |>
  summarise(
    Median = median(score, na.rm = TRUE),
    Q1     = quantile(score, 0.25, na.rm = TRUE),
    Q3     = quantile(score, 0.75, na.rm = TRUE),
    .groups = "drop"
  ) |>
  gt() |>
  fmt_number(columns = c(Median, Q1, Q3), decimals = 3) |>
  tab_header(title = "Affective vs Cognitive ToM: sample-level summary (median, IQR)")

Affective vs Cognitive ToM: sample-level summary (median, IQR)
Dimension	Median	Q1	Q3
Affective ToM	0.667	0.611	0.667
Cognitive ToM	0.556	0.519	0.630

The following tests whether, at the sample level, affective and cognitive ToM accuracy differ within individuals (paired Wilcoxon signed-rank, as scores are bounded proportions).

Show code

# Results computed in code.R: wilcox_res (Hodges-Lehmann CI) + wilcox_es (effect size r)
tibble(
  Statistic      = c("V (Wilcoxon)", "p-value", "Pseudo-median diff. (H-L)",
                     "95% CI lower", "95% CI upper",
                     "Effect size r", "Magnitude"),
  Value          = c(
    round(wilcox_res$statistic,  1),
    signif(wilcox_res$p.value,   3),
    round(wilcox_res$estimate,   4),
    round(wilcox_res$conf.int[1],4),
    round(wilcox_res$conf.int[2],4),
    round(wilcox_es$effsize,     3),
    as.character(wilcox_es$magnitude)
  )
) |>
  gt() |>
  tab_header(
    title    = "Wilcoxon signed-rank: Affective vs Cognitive ToM",
    subtitle = "Pseudo-median difference = Hodges-Lehmann estimator (Affective − Cognitive)"
  ) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels())

Wilcoxon signed-rank: Affective vs Cognitive ToM
Pseudo-median difference = Hodges-Lehmann estimator (Affective − Cognitive)
Statistic	Value
V (Wilcoxon)	5451
p-value	5.29e-08
Pseudo-median diff. (H-L)	0.0649
95% CI lower	0.0463
95% CI upper	0.0834
Effect size r	0.494
Magnitude	moderate
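For readers without access to code.R, the following sketch shows one way such a table can be produced with base R. The data are simulated. The effect size is recovered as r = Z/√N from the two-sided p-value under the normal approximation, which is an assumption about how `wilcox_es` was computed and may differ slightly from the original.

```r
set.seed(7)
n <- 122
# Simulated stand-in proportions (not the study data), clipped to [0, 1]
affective <- pmin(pmax(rnorm(n, 0.64, 0.09), 0), 1)
cognitive <- pmin(pmax(rnorm(n, 0.58, 0.08), 0), 1)

# Paired signed-rank test; conf.int = TRUE adds the Hodges-Lehmann
# pseudo-median of the differences and its confidence interval
res <- wilcox.test(affective, cognitive, paired = TRUE,
                   conf.int = TRUE, exact = FALSE)

# Effect size r = Z / sqrt(N), with Z recovered from the two-sided p-value
z <- qnorm(res$p.value / 2, lower.tail = FALSE)
r <- z / sqrt(n)

# Same conversion applied to the p-value reported in the table above
r_reported <- qnorm(5.29e-08 / 2, lower.tail = FALSE) / sqrt(122)

print(c(V = unname(res$statistic), p = res$p.value,
        HL = unname(res$estimate), r = r, r_reported = r_reported))
```

Applied to the reported p-value, this conversion gives roughly 0.49, in line with the tabulated r.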

4 Figures — MASC

4.1 Response type distribution

Show code

p_stacked

Figure 1: Average proportion of the 4 MASC response types per experimental condition. Correct responses dominate; diminishing (under-mentalising) is the most frequent error type, consistent with non-clinical samples.

4.2 ToM score distribution by game

Show code

p_violin_tom

Figure 2: Distribution of total correct ToM score (0–45) by experimental condition.

4.3 MASC dimensions heatmap

Show code

p_heat_masc

Figure 3: Within-variable standardised means (z-scores) across games. Colour encodes relative position within each dimension, making incompatible scales (0–45 count vs 0–1 proportion) comparable. Cell labels show raw means; Affective (%) and Cognitive (%) labels are multiplied ×100 for readability.
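The within-variable standardisation can be sketched as follows. The group means are made-up illustrative numbers, and the use of the sample standard deviation (as in `scale()`) is an assumption about how the figure was built.

```r
# Made-up group means for illustration only (rows = games, columns = dimensions)
means <- rbind(
  BS = c(tom = 31.6, affective = 0.62, cognitive = 0.59),
  MP = c(tom = 31.0, affective = 0.62, cognitive = 0.59),
  PD = c(tom = 32.3, affective = 0.65, cognitive = 0.56),
  SH = c(tom = 32.7, affective = 0.66, cognitive = 0.57)
)

# scale() centres and standardises each column separately, so colour can
# encode a game's relative position within each dimension
z <- scale(means)

print(round(z, 2))   # values used for the fill colour
print(means)         # raw means used for the cell labels
```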

4.4 Affective vs Cognitive ToM by game

Show code

p_aff_cog

Figure 4: Distributions of affective and cognitive ToM accuracy by experimental condition. Accuracy displayed as proportion (0–1).

4.5 Cognitive vs affective ToM scatter (pooled sample)

Show code

p_scatter

Figure 5: Pooled scatter of cognitive vs affective ToM accuracy with a single OLS regression line (grey). The top-left label reports the slope (β), variance explained (R²), and significance of the linear fit. Points are coloured by game.

4.6 Attention control vs ToM scores

The MASC includes attention-check items that do not require mental-state inference. Correlating the attention score with ToM scores helps assess whether performance differences are driven by general task engagement / comprehension rather than ToM ability per se.

Show code

p_attention_panel

Figure 6: Scatter plots of the MASC attention-check score against overall ToM score (left), affective ToM accuracy (centre), and cognitive ToM accuracy (right). Each panel shows an OLS line (grey band = 95% CI) and a top-left label reporting β, R², and p-value. Points are coloured by game.

Note

A strong positive association between attention score and ToM scores would indicate that overall task engagement (rather than ToM specifically) drives performance. A weak or absent association is more consistent with ToM scores reflecting the construct of interest.
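A minimal sketch of how the β, R², and p annotation in each panel can be obtained, using simulated values rather than the study data. The variable names are illustrative.

```r
set.seed(3)
n <- 122
# Simulated stand-ins: attention score and total ToM score, generated independently
attention <- sample(1:6, n, replace = TRUE)
tom       <- round(rnorm(n, mean = 32, sd = 3.35))

fit <- lm(tom ~ attention)
s   <- summary(fit)

beta <- coef(fit)[["attention"]]              # slope of the OLS line
r2   <- s$r.squared                           # variance explained
p    <- coef(s)["attention", "Pr(>|t|)"]      # p-value of the slope

# Text for the top-left panel label
panel_label <- sprintf("beta = %.2f, R2 = %.3f, p = %.3f", beta, r2, p)
print(panel_label)
```

A slope near zero with a small R² would point to ToM scores that are largely independent of general task engagement.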

5 IRI — Interpersonal Reactivity Index

5.1 Descriptive statistics by game

Show code

tab_iri

Characteristic	Overall N = 122¹	BS N = 32¹	MP N = 30¹	PD N = 28¹	SH N = 32¹	p-value²	Effect size³
IRI – Empathic Concern (0–28)	19.0 (16.0, 22.0)	18.5 (16.0, 22.0)	19.5 (15.0, 22.0)	20.0 (16.0, 22.0)	19.5 (15.5, 21.0)	0.884	η² = -0.02 (small)
IRI – Perspective Taking (0–28)	19.0 (17.0, 23.0)	19.0 (16.0, 22.0)	21.0 (17.0, 24.0)	19.0 (15.5, 23.0)	20.0 (17.0, 23.0)	0.548	η² = -0.007 (small)
IRI – Fantasy (0–28)	18.0 (14.0, 23.0)	19.5 (13.0, 23.5)	17.5 (13.0, 22.0)	21.0 (14.5, 24.0)	17.0 (14.5, 20.5)	0.245	η² = 0.01 (small)
IRI – Personal Distress (0–28)	11.0 (8.0, 15.0)	12.5 (7.5, 15.5)	11.0 (6.0, 14.0)	11.0 (8.5, 13.5)	11.5 (7.5, 15.0)	0.794	η² = -0.017 (small)
¹ Median (Q1, Q3)
² Kruskal-Wallis rank sum test
³ η² (Kruskal-Wallis). Small / medium / large: η² ≥ 0.01 / 0.06 / 0.14.

Note

Statistics are median (Q1, Q3). Kruskal-Wallis tests between games; η² effect sizes reported. Random assignment should yield comparable IRI profiles across conditions — any significant differences are relevant as potential confounders in subsequent analyses.

5.2 IRI subscale distributions

Show code

p_iri_violin

Figure 7: Distribution of the four IRI subscales across experimental conditions. Scores range from 0 to 28 per subscale.

5.3 Internal consistency

Cronbach’s α for the four-subscale block:

Show code

# psych::alpha() computed in code.R
print(iri_alpha, digits = 3)


Reliability analysis   
Call: psych::alpha(x = iri_items)

  raw_alpha std.alpha G6(smc) average_r S/N    ase mean   sd median_r
     0.539     0.545   0.518     0.231 1.2 0.0672 16.9 3.13    0.281

    95% confidence boundaries 
         lower alpha upper
Feldt    0.389 0.539 0.659
Duhachek 0.407 0.539 0.671

 Reliability if an item is dropped:
                      raw_alpha std.alpha G6(smc) average_r   S/N alpha se
IRI_empathicConcern       0.349     0.330   0.300     0.141 0.493   0.0987
IRI_perspectiveTaking     0.568     0.577   0.480     0.313 1.366   0.0669
IRI_fantasy               0.354     0.378   0.356     0.168 0.608   0.1029
IRI_personalDistress      0.553     0.563   0.470     0.300 1.288   0.0690
                        var.r med.r
IRI_empathicConcern   0.03737 0.217
IRI_perspectiveTaking 0.00308 0.285
IRI_fantasy           0.04605 0.277
IRI_personalDistress  0.00640 0.307

 Item statistics 
                        n raw.r std.r r.cor r.drop mean   sd
IRI_empathicConcern   122 0.719 0.754 0.642  0.486 18.9 4.17
IRI_perspectiveTaking 122 0.505 0.556 0.315  0.187 19.7 4.27
IRI_fantasy           122 0.756 0.722 0.575  0.439 18.1 5.47
IRI_personalDistress  122 0.611 0.570 0.340  0.233 11.0 5.28

Note

Cronbach’s α across the four IRI subscales reflects the internal consistency of the battery as a whole (treating the four subscales as items). High α indicates overlap between subscales; low α, such as the 0.539 observed here, is expected and appropriate when the subscales capture distinct facets of empathy (the IRI was designed as a multi-dimensional instrument). Per-subscale reliability would typically be assessed at the item level.
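To make the quantity concrete, here is a small sketch of the raw α formula, α = k/(k − 1) · (1 − Σ var(item) / var(total)), applied first to simulated data and then to the rounded standard deviations printed in the reliability output. The second step assumes the printed `sd` is that of the mean of the four subscales; `cronbach_alpha()` is a hypothetical helper.

```r
cronbach_alpha <- function(items) {
  items     <- as.matrix(items)
  k         <- ncol(items)
  item_var  <- apply(items, 2, var)      # variance of each "item" (subscale)
  total_var <- var(rowSums(items))       # variance of the sum score
  (k / (k - 1)) * (1 - sum(item_var) / total_var)
}

set.seed(8)
# Simulated stand-in: four weakly related subscale scores (not the study data)
shared <- rnorm(122)
sim <- sapply(1:4, function(j) 0.5 * shared + rnorm(122))
alpha_sim <- cronbach_alpha(sim)

# Consistency check from the printed output: item SDs and the SD of the
# four-subscale mean (3.13), so the SD of the sum is 4 * 3.13
item_sd  <- c(4.17, 4.27, 5.47, 5.28)
total_sd <- 4 * 3.13
alpha_from_output <- (4 / 3) * (1 - sum(item_sd^2) / total_sd^2)

print(c(alpha_sim = alpha_sim, alpha_from_output = alpha_from_output))
```

The value recomputed from the printed standard deviations agrees with the reported raw α to within rounding.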

6 MASC × IRI: correlations and regressions

This section tests whether self-reported empathy (IRI) is associated with film-based ToM performance (MASC). Two complementary analyses are reported: targeted Spearman correlations to check whether matching pairs (affective ToM ↔︎ affective empathy; cognitive ToM ↔︎ cognitive empathy) are stronger than crossing ones, followed by binomial GLMs predicting MASC accuracy from the four IRI subscales simultaneously.

6.1 Level A — Spearman correlations

Show code

tab_spearman |>
  gt() |>
  cols_label(
    MASC_dim = "MASC dimension",
    Pair     = "Pair",
    rho      = "\u03c1",
    p_fmt    = "p",
    sig      = "Sig."
  ) |>
  tab_header(
    title    = "Spearman correlations: MASC \u00d7 IRI",
    subtitle = paste0("N = ", n_mi,
                      " complete cases. Exact = FALSE (ties present).")
  ) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(columns = sig, rows = sig != "ns")) |>
  tab_footnote("* p < .05  ** p < .01  *** p < .001  ns = not significant")

Spearman correlations: MASC × IRI
N = 122 complete cases. Exact = FALSE (ties present).
MASC dimension	Pair	ρ	p	Sig.
Affective ToM	Affective ToM × Empathic Concern	-0.002	0.982	ns
Affective ToM	Affective ToM × Personal Distress	-0.047	0.605	ns
Affective ToM	Affective ToM × Perspective Taking	-0.148	0.103	ns
Cognitive ToM	Cognitive ToM × Perspective Taking	0.041	0.652	ns
Cognitive ToM	Cognitive ToM × Empathic Concern	-0.030	0.742	ns
Cognitive ToM	Cognitive ToM × Fantasy	-0.045	0.622	ns
Cognitive ToM	Cognitive ToM × Personal Distress	0.025	0.787	ns
* p < .05  ** p < .01  *** p < .001  ns = not significant

Note

Matching vs crossing hypothesis. Affective ToM (emotion inference from film clips) is theorised to align more strongly with affective empathy (Empathic Concern, Personal Distress). Cognitive ToM (belief/intention inference) should align more with cognitive empathy (Perspective Taking). Pairs that cross the affective/cognitive boundary serve as a discriminant validity check — weaker or non-significant ρ there supports construct differentiation.
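The matching-versus-crossing comparison can be sketched as below. The data are simulated and the column names are illustrative; only the `cor.test()` call with `exact = FALSE` mirrors what the table subtitle states.

```r
set.seed(11)
n <- 122
# Simulated stand-in data (not the study data)
dat <- data.frame(
  affective_tom      = round(rnorm(n, 0.64, 0.09), 2),
  cognitive_tom      = round(rnorm(n, 0.58, 0.08), 2),
  empathic_concern   = sample(0:28, n, replace = TRUE),
  perspective_taking = sample(0:28, n, replace = TRUE)
)

# Pairs to test, flagged as matching or crossing the affective/cognitive boundary
pairs <- data.frame(
  masc = c("affective_tom", "affective_tom", "cognitive_tom", "cognitive_tom"),
  iri  = c("empathic_concern", "perspective_taking",
           "perspective_taking", "empathic_concern"),
  type = c("matching", "crossing", "matching", "crossing"),
  stringsAsFactors = FALSE
)

sig_stars <- function(p) {
  ifelse(p < .001, "***", ifelse(p < .01, "**", ifelse(p < .05, "*", "ns")))
}

rows <- lapply(seq_len(nrow(pairs)), function(i) {
  # exact = FALSE because tied ranks rule out the exact null distribution
  ct <- cor.test(dat[[pairs$masc[i]]], dat[[pairs$iri[i]]],
                 method = "spearman", exact = FALSE)
  data.frame(pair = paste(pairs$masc[i], "x", pairs$iri[i]),
             type = pairs$type[i],
             rho  = unname(ct$estimate),
             p    = ct$p.value)
})
tab <- do.call(rbind, rows)
tab$sig <- sig_stars(tab$p)
print(tab)
```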

6.2 Correlation heatmap

Show code

p_cor_heat

Figure 8: Spearman ρ between the two MASC dimensions (rows) and the four IRI subscales (columns). Red = positive association, blue = negative. Significance stars: * p < .05 ** p < .01 *** p < .001.

6.3 Level B — Binomial GLMs

IRI subscales entered simultaneously as predictors of MASC accuracy. The response is modelled as a binomial count of correct answers (17 affective items; 28 cognitive items, total = 45). Coefficients are on the log-odds scale; the forest plot shows exponentiated odds ratios (OR) with 95% Wald CIs.

Show code

tab_glm |>
  select(Outcome, Predictor, beta, SE, OR, OR_lo, OR_hi, stat, p_fmt, sig) |>
  gt() |>
  tab_header(
    title    = "Binomial GLM: IRI subscales predicting MASC accuracy",
    subtitle = "Family: binomial (logit link). Wald 95% CI."
  ) |>
  cols_label(beta = "\u03b2", SE = "SE", OR = "OR",
             OR_lo = "95% CI lo", OR_hi = "95% CI hi",
             stat = "z", p_fmt = "p", sig = "Sig.") |>
  tab_row_group(label = "Outcome: Cognitive ToM (28 items)",
                rows = Outcome == "Cognitive ToM") |>
  tab_row_group(label = "Outcome: Affective ToM (17 items)",
                rows = Outcome == "Affective ToM") |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_body(columns = sig, rows = sig != "")) |>
  tab_style(style = cell_text(weight = "bold", color = "#2d7a3a"),
            locations = cells_row_groups()) |>
  tab_footnote("\u03b2 = log-odds coefficient. OR = exp(\u03b2). Wald 95% CI. * p < .05  ** p < .01  *** p < .001.")

Binomial GLM: IRI subscales predicting MASC accuracy
Family: binomial (logit link). Wald 95% CI.
Outcome	Predictor	β	SE	OR	95% CI lo	95% CI hi	z	p
Outcome: Affective ToM (17 items)
Affective ToM	Empathic Concern	0.0061	0.0124	1.006	0.982	1.031	0.488	0.626
Affective ToM	Perspective Taking	-0.0160	0.0116	0.984	0.962	1.007	-1.380	0.167
Affective ToM	Fantasy	0.0001	0.0093	1.000	0.982	1.018	0.015	0.988
Affective ToM	Personal Distress	-0.0041	0.0094	0.996	0.978	1.014	-0.444	0.657
Outcome: Cognitive ToM (28 items)
Cognitive ToM	Empathic Concern	0.0001	0.0096	1.000	0.981	1.019	0.014	0.989
Cognitive ToM	Perspective Taking	0.0076	0.0088	1.008	0.990	1.025	0.867	0.386
Cognitive ToM	Fantasy	-0.0081	0.0071	0.992	0.978	1.006	-1.133	0.257
Cognitive ToM	Personal Distress	0.0062	0.0072	1.006	0.992	1.020	0.860	0.39
β = log-odds coefficient. OR = exp(β). Wald 95% CI. * p < .05  ** p < .01  *** p < .001.
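The table above can be read against a sketch of the model fit. The code below is an assumption about the general approach (a `cbind(successes, failures)` response with a logit link and Wald intervals on the log-odds scale), run on simulated data. It is not the code from code.R.

```r
set.seed(5)
n       <- 122
n_items <- 17    # affective item count as stated in the text

# Simulated stand-in data (not the study data)
dat <- data.frame(
  IRI_empathicConcern   = sample(0:28, n, replace = TRUE),
  IRI_perspectiveTaking = sample(0:28, n, replace = TRUE),
  IRI_fantasy           = sample(0:28, n, replace = TRUE),
  IRI_personalDistress  = sample(0:28, n, replace = TRUE)
)
dat$n_correct <- rbinom(n, size = n_items, prob = 0.64)

# Binomial count response: (correct, incorrect) per participant
fit <- glm(cbind(n_correct, n_items - n_correct) ~
             IRI_empathicConcern + IRI_perspectiveTaking +
             IRI_fantasy + IRI_personalDistress,
           family = binomial(link = "logit"), data = dat)

co   <- coef(summary(fit))
beta <- co[, "Estimate"]
se   <- co[, "Std. Error"]
crit <- qnorm(0.975)

# Wald CI built on the log-odds scale, then exponentiated
or_table <- data.frame(
  beta  = beta,
  SE    = se,
  OR    = exp(beta),
  OR_lo = exp(beta - crit * se),
  OR_hi = exp(beta + crit * se),
  z     = co[, "z value"],
  p     = co[, "Pr(>|z|)"]
)
print(round(or_table[-1, ], 4))   # drop the intercept row

# Check against the first table row: beta = 0.0061, SE = 0.0124
ci_check <- exp(0.0061 + c(-1, 1) * crit * 0.0124)
print(round(ci_check, 3))
```

The final check reproduces the tabulated interval for Empathic Concern from its β and SE, which supports the Wald construction.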

Note

Overdispersion check. A binomial GLM assumes that the observed proportion has variance μ(1−μ)/n; real data often depart from this. The dispersion parameter φ is estimated by the quasi-binomial fit: φ(affective) = 0.615, φ(cognitive) = 0.817. φ ≈ 1 means the binomial assumption holds; φ ≫ 1 (overdispersion) means the standard binomial SEs are underestimated; φ < 1 (underdispersion), as found for both outcomes here, means they are conservative. The quasi-binomial robustness check below quantifies the difference.

Show code

p_glm_forest

Figure 9: Forest plot: odds ratios from the binomial GLMs. Error bars = 95% Wald CI. Dashed line = OR 1 (null effect).

6.4 Quasi-binomial robustness check

The quasi-binomial model uses the same formula but estimates a free dispersion parameter φ and multiplies the standard errors by √φ. Coefficients (β) and odds ratios are identical to the binomial fit; only SEs and p-values change. Because φ < 1 for both outcomes, the quasi-binomial SEs are smaller than the binomial ones (SE ratio ≈ 0.78 and ≈ 0.91). The comparison table shows directly where the dispersion adjustment changes inference.

Show code

tab_glm_compare |>
  select(Outcome, Predictor, beta, OR,
         SE_binom, SE_quasi, SE_ratio,
         p_binom, sig_binom, p_quasi, sig_quasi) |>
  gt() |>
  tab_header(
    title    = "Binomial vs quasi-binomial: SE and p-value comparison",
    subtitle = paste0("φ (dispersion): Affective = ", disp_aff,
                      ", Cognitive = ", disp_cog,
                      ". SE ratio \u2248 \u221a\u03c6.")
  ) |>
  cols_label(
    beta      = "\u03b2", OR = "OR",
    SE_binom  = "SE (binom)", SE_quasi = "SE (quasi)", SE_ratio = "SE ratio",
    p_binom   = "p (binom)",  sig_binom = "Sig. (binom)",
    p_quasi   = "p (quasi)",  sig_quasi = "Sig. (quasi)"
  ) |>
  tab_row_group(label = "Outcome: Cognitive ToM",
                rows = Outcome == "Cognitive ToM") |>
  tab_row_group(label = "Outcome: Affective ToM",
                rows = Outcome == "Affective ToM") |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels()) |>
  tab_style(style = cell_text(weight = "bold", color = "#2d7a3a"),
            locations = cells_row_groups()) |>
  tab_style(
    style = cell_fill(color = "#fff3cd"),
    locations = cells_body(
      columns = c(sig_binom, sig_quasi),
      rows = sig_binom != sig_quasi
    )
  ) |>
  tab_footnote("Yellow highlight = significance changes between models. SE ratio = SE(quasi) / SE(binom).")

Binomial vs quasi-binomial: SE and p-value comparison
φ (dispersion): Affective = 0.615, Cognitive = 0.817. SE ratio ≈ √φ.
Outcome	Predictor	β	OR	SE (binom)	SE (quasi)	SE ratio	p (binom)	p (quasi)
Outcome: Affective ToM
Affective ToM	Empathic Concern	0.0061	1.006	0.0124	0.0098	0.79	0.626	0.535
Affective ToM	Perspective Taking	-0.0160	0.984	0.0116	0.0091	0.78	0.167	0.081
Affective ToM	Fantasy	0.0001	1.000	0.0093	0.0073	0.78	0.988	0.985
Affective ToM	Personal Distress	-0.0041	0.996	0.0094	0.0073	0.78	0.657	0.573
Outcome: Cognitive ToM
Cognitive ToM	Empathic Concern	0.0001	1.000	0.0096	0.0087	0.91	0.989	0.988
Cognitive ToM	Perspective Taking	0.0076	1.008	0.0088	0.0080	0.91	0.386	0.34
Cognitive ToM	Fantasy	-0.0081	0.992	0.0071	0.0065	0.92	0.257	0.213
Cognitive ToM	Personal Distress	0.0062	1.006	0.0072	0.0065	0.90	0.39	0.344
Yellow highlight = significance changes between models. SE ratio = SE(quasi) / SE(binom).
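The relation between φ and the SE ratio can be verified with a short sketch on simulated data. The two-predictor model is a simplification for illustration; the identities it checks (φ = Pearson χ² / residual df, SE ratio = √φ, identical coefficients) hold for any binomial versus quasi-binomial pair.

```r
set.seed(6)
n       <- 122
n_items <- 28    # cognitive item count as stated in the text

# Simulated stand-in data (not the study data)
dat <- data.frame(
  pt = sample(0:28, n, replace = TRUE),
  ec = sample(0:28, n, replace = TRUE)
)
dat$n_correct <- rbinom(n, size = n_items, prob = 0.58)

form      <- cbind(n_correct, n_items - n_correct) ~ pt + ec
fit_binom <- glm(form, family = binomial,      data = dat)
fit_quasi <- glm(form, family = quasibinomial, data = dat)

# Dispersion reported by the quasi-binomial fit ...
phi <- summary(fit_quasi)$dispersion
# ... equals the Pearson chi-square of the binomial fit over residual df
phi_pearson <- sum(residuals(fit_binom, type = "pearson")^2) /
  df.residual(fit_binom)

se_binom <- coef(summary(fit_binom))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]
se_ratio <- se_quasi / se_binom

print(c(phi = phi, sqrt_phi = sqrt(phi)))
print(round(se_ratio, 4))

# Reported dispersions imply these SE ratios
print(round(sqrt(c(affective = 0.615, cognitive = 0.817)), 2))
```

With the reported dispersions, √φ gives ratios of about 0.78 and 0.90, matching the SE ratio column up to rounding.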

Show code

p_glm_forest_quasi

Figure 10: Forest plot: odds ratios from the quasi-binomial GLMs. SEs are scaled by √φ; because φ < 1 for both outcomes, the CIs are slightly narrower than in the binomial forest plot above.

7 Conditioning on gender and role

The preceding analyses compare MASC and IRI scores across experimental games without accounting for sample composition. Since participants were not stratified by demographics at assignment, observed game-level differences in ToM and empathy scores may be confounded by gender composition or by the structural difference between experimental sites (P1 = LEEN laboratory; P2 = CoCoLab). This section (i) visualises distributions stratified by gender and role, and (ii) fits OLS models with game, gender, and role entered simultaneously as predictors. The reference category for all models is: game = BS, gender = Female (the gender coefficient is reported as genderMale, i.e. Male vs Female), role = P1 (LEEN).

7.1 MASC by gender

Show code

p_masc_gender

Figure 11: MASC overall ToM score by gender within each game condition. Violin + box plot; no legend (Male = blue, Female = orange).

Show code

p_masc_dim_gender

Figure 12: MASC affective and cognitive ToM accuracy by gender, faceted by game (columns) and dimension (rows).

7.2 MASC by role

Show code

p_masc_role

Figure 13: MASC overall ToM score by experimental role (P1 LEEN vs P2 CoCoLab) within each game. Violin + box plot.

Show code

p_masc_dim_role

Figure 14: MASC affective and cognitive ToM accuracy by role, faceted by game (columns) and dimension (rows).

7.3 IRI by gender

Show code

p_iri_gender

Figure 15: IRI four subscales by gender (pooled sample). All subscales on the same y-axis (0–28) for comparability.

7.4 IRI by role

Show code

p_iri_role

Figure 16: IRI four subscales by experimental role: P1 (LEEN) vs P2 (CoCoLab), pooled across games.

7.5 OLS regressions with demographic controls

MASC models: outcome variables are overall ToM score (0–45) and the two proportion scores (affective, cognitive), each regressed on game condition, gender, and role simultaneously.

Show code

gt_ols_masc

OLS: MASC accuracy ~ game + gender + role
OLS. Reference: game = BS, gender = Female, role = P1 (LEEN). 95% CI from confint().
Outcome	Predictor	β	SE	95% CI lo	95% CI hi	t	p
Outcome: Cognitive ToM (%)
Cognitive ToM (%)	Game: MP vs BS	0.006	0.020	-0.034	0.046	0.306	0.76
Cognitive ToM (%)	Game: PD vs BS	-0.021	0.021	-0.062	0.020	-1.030	0.305
Cognitive ToM (%)	Game: SH vs BS	0.001	0.020	-0.038	0.040	0.058	0.954
Cognitive ToM (%)	genderMale	0.014	0.014	-0.015	0.042	0.962	0.338
Cognitive ToM (%)	Role: CoCoLab vs LEEN	0.001	0.014	-0.028	0.029	0.042	0.966
Outcome: Affective ToM (%)
Affective ToM (%)	Game: MP vs BS	-0.011	0.023	-0.057	0.035	-0.473	0.637
Affective ToM (%)	Game: PD vs BS	0.016	0.024	-0.030	0.063	0.690	0.492
Affective ToM (%)	Game: SH vs BS	0.030	0.023	-0.015	0.075	1.299	0.197
Affective ToM (%)	genderMale	0.026	0.017	-0.007	0.059	1.575	0.118
Affective ToM (%)	Role: CoCoLab vs LEEN	-0.019	0.016	-0.052	0.013	-1.162	0.248
Outcome: Overall ToM (0–45)
Overall ToM (0–45)	Game: MP vs BS	-0.788	0.854	-2.479	0.903	-0.923	0.358
Overall ToM (0–45)	Game: PD vs BS	0.633	0.870	-1.091	2.356	0.727	0.469
Overall ToM (0–45)	Game: SH vs BS	0.750	0.840	-0.913	2.413	0.893	0.374
Overall ToM (0–45)	genderMale	0.140	0.610	-1.068	1.348	0.230	0.819
Overall ToM (0–45)	Role: CoCoLab vs LEEN	0.131	0.608	-1.073	1.336	0.216	0.83
β = OLS coefficient. * p < .05  ** p < .01  *** p < .001.

Show code

p_forest_masc8

Figure 17: Forest plot: OLS β coefficients with 95% CI for MASC outcomes. Dashed line = 0 (null effect). All three outcomes shown simultaneously; note that scales differ (0–45 vs proportion).

IRI models — each of the four subscales (0–28) regressed on game, gender, and role.

Show code

gt_ols_iri

OLS: IRI subscales ~ game + gender + role
OLS. Reference: game = BS, gender = Female, role = P1 (LEEN). 95% CI from confint().
Outcome	Predictor	β	SE	95% CI lo	95% CI hi	t	p	Sig.
Outcome: Personal Distress
Personal Distress	Game: MP vs BS	-1.243	1.243	-3.705	1.218	-1.000	0.319
Personal Distress	Game: PD vs BS	-0.664	1.267	-3.173	1.845	-0.524	0.601
Personal Distress	Game: SH vs BS	-0.688	1.222	-3.109	1.734	-0.562	0.575
Personal Distress	genderMale	-4.265	0.888	-6.024	-2.506	-4.804	< 0.001	***
Personal Distress	Role: CoCoLab vs LEEN	-0.738	0.885	-2.491	1.016	-0.833	0.406
Outcome: Fantasy
Fantasy	Game: MP vs BS	-1.310	1.273	-3.832	1.212	-1.029	0.306
Fantasy	Game: PD vs BS	1.086	1.298	-1.484	3.656	0.837	0.404
Fantasy	Game: SH vs BS	-1.375	1.252	-3.855	1.105	-1.098	0.274
Fantasy	genderMale	-4.450	0.909	-6.252	-2.649	-4.893	< 0.001	***
Fantasy	Role: CoCoLab vs LEEN	0.262	0.907	-1.534	2.058	0.289	0.773
Outcome: Perspective Taking
Perspective Taking	Game: MP vs BS	1.577	1.087	-0.576	3.729	1.451	0.149
Perspective Taking	Game: PD vs BS	0.307	1.108	-1.886	2.501	0.277	0.782
Perspective Taking	Game: SH vs BS	0.781	1.069	-1.335	2.898	0.731	0.466
Perspective Taking	genderMale	-1.114	0.776	-2.651	0.424	-1.435	0.154
Perspective Taking	Role: CoCoLab vs LEEN	0.443	0.774	-1.090	1.976	0.572	0.569
Outcome: Empathic Concern
Empathic Concern	Game: MP vs BS	-0.144	1.056	-2.235	1.947	-0.136	0.892
Empathic Concern	Game: PD vs BS	0.549	1.076	-1.582	2.680	0.510	0.611
Empathic Concern	Game: SH vs BS	-0.719	1.038	-2.775	1.338	-0.692	0.49
Empathic Concern	genderMale	-1.625	0.754	-3.118	-0.131	-2.154	0.033	*
Empathic Concern	Role: CoCoLab vs LEEN	0.377	0.752	-1.112	1.866	0.501	0.617
β = OLS coefficient. * p < .05  ** p < .01  *** p < .001.

Show code

p_forest_iri8

Figure 18: Forest plot: OLS β coefficients with 95% CI for IRI subscales. Game effects (vs BS), gender effect (Male vs Female), and role effect (CoCoLab vs LEEN) shown side by side.

Note

Interpretation note. Game coefficients in these models represent the conditional effect of game assignment, holding gender and role fixed. A game difference that is significant unconditionally (Kruskal-Wallis tests in sections 3.1 and 5.1) but non-significant here suggests partial confounding by demographics. Conversely, a gender or role coefficient reveals systematic differences in MASC/IRI scores attributable to those characteristics independently of game.
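How the reference categories determine coefficient names and signs can be seen in a small sketch on simulated data. The factor levels and variable names are illustrative. The point is that R names a coefficient after the non-reference level, so a row labelled `genderMale` compares Male against a Female baseline.

```r
set.seed(9)
n <- 122
# Simulated stand-in data (not the study data)
dat <- data.frame(
  game   = factor(sample(c("BS", "MP", "PD", "SH"), n, replace = TRUE)),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  role   = factor(sample(c("P1", "P2"), n, replace = TRUE)),
  tom    = rnorm(n, mean = 32, sd = 3.35)
)

# Default coding: first level alphabetically is the baseline (BS, Female, P1)
fit_default <- lm(tom ~ game + gender + role, data = dat)
print(names(coef(fit_default)))     # includes "genderMale"

# Making Male the baseline flips the sign and renames the coefficient
dat$gender_m <- relevel(dat$gender, ref = "Male")
fit_male_ref <- lm(tom ~ game + gender_m + role, data = dat)

b_male_vs_female <- coef(fit_default)[["genderMale"]]
b_female_vs_male <- coef(fit_male_ref)[["gender_mFemale"]]
print(c(male_vs_female = b_male_vs_female,
        female_vs_male = b_female_vs_male))

# 95% CIs from confint()
print(round(confint(fit_default), 3))
```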

8 Response times & processing speed

Cognitive and affective tasks vary in the time required for deliberation and response. This section examines whether speed of processing correlates with accuracy across the MASC (ToM), the IRI (empathy), and the Cognitive Reflection Test (CRT), and whether games differ in time investment.

8.1 MASC response times by dimension

Show code

tab_resp_times

Characteristic	Overall N = 122¹	BS N = 32¹	MP N = 30¹	PD N = 28¹	SH N = 32¹	p-value²	Effect size³
MASC – avg response time (all items)	11.4 (9.6, 13.7)	11.6 (9.6, 14.0)	11.1 (9.4, 13.9)	11.1 (9.5, 13.3)	11.7 (10.1, 14.0)	0.913	η² = -0.021 (small)
MASC – avg response time (affective)	11.5 (9.9, 14.1)	11.4 (9.7, 14.3)	11.3 (9.2, 14.1)	11.4 (9.7, 13.5)	11.6 (10.3, 14.3)	0.842	η² = -0.018 (small)
MASC – avg response time (cognitive)	11.4 (9.5, 13.7)	11.7 (9.2, 14.5)	11.3 (9.2, 14.4)	11.1 (9.4, 13.3)	11.6 (9.7, 13.4)	0.955	η² = -0.023 (small)
IRI – total time (28 items)	190.0 (158.0, 237.0)	184.0 (149.0, 237.5)	175.5 (159.0, 233.0)	192.0 (157.0, 230.0)	196.5 (165.5, 238.5)	0.740	η² = -0.015 (small)
CRT – total time (4 items)	58.0 (46.0, 76.0)	51.5 (42.0, 73.0)	57.0 (45.0, 76.0)	62.0 (51.5, 73.5)	58.0 (52.0, 83.0)	0.365	η² = 0.002 (small)
¹ Median (Q1, Q3)
² Kruskal-Wallis rank sum test
³ η² (Kruskal-Wallis). Small / medium / large: η² ≥ 0.01 / 0.06 / 0.14.

8.2 MASC: speed–accuracy trade-off

Show code

p_masc_rt

Figure 19: Distribution of MASC response times by dimension (overall / affective / cognitive) across games. Violin width = density; box plot = quartiles. Faster response times may reflect overconfidence or heuristic use; slower times suggest deliberative mentalising.

Show code

p_masc_speed_accuracy

Figure 20: MASC: average response time vs overall ToM accuracy. Does faster responding predict worse accuracy (speed–accuracy trade-off)? OLS line fitted on pooled sample; top-left label reports β, R², p-value.

Note

Speed–accuracy trade-off in ToM. If participants who respond faster are less accurate (a positive correlation between response time and accuracy), this suggests a speed–accuracy trade-off: quick responses may rely on superficial heuristics rather than genuine mentalising. Conversely, a negative correlation (faster = more accurate) would indicate fluent, confident mentalising. A near-zero correlation suggests no linear relationship between speed and accuracy; response speed may then reflect trait differences in responding style (e.g. impulsivity) rather than mentalising ability.
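Because the sign convention is easy to invert, the sketch below spells out the three readings in terms of the correlation between response time and accuracy. `interpret_rt_accuracy()` is a hypothetical helper. The choice of Spearman's ρ and the 0.05 cut-off are assumptions for illustration, whereas the figure itself uses an OLS line.

```r
interpret_rt_accuracy <- function(rt, accuracy, alpha = 0.05) {
  ct  <- cor.test(rt, accuracy, method = "spearman", exact = FALSE)
  rho <- unname(ct$estimate)
  if (ct$p.value >= alpha) {
    pattern <- "no detectable monotone association"
  } else if (rho > 0) {
    # longer response times go with higher accuracy
    pattern <- "slower responders more accurate (speed-accuracy trade-off)"
  } else {
    # shorter response times go with higher accuracy
    pattern <- "faster responders more accurate (fluent mentalising)"
  }
  list(rho = rho, p = ct$p.value, pattern = pattern)
}

# Toy inputs chosen so that each direction is unambiguous
rt       <- 1:30
acc_up   <- rt + rep(c(0, 2.5), 15)     # accuracy rises with response time
acc_down <- rev(acc_up)                 # accuracy falls with response time

res_up   <- interpret_rt_accuracy(rt, acc_up)
res_down <- interpret_rt_accuracy(rt, acc_down)
print(res_up$pattern)
print(res_down$pattern)
```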

8.3 IRI: time spent vs all subscales

Show code

p_iri_speed_panel

Figure 21: IRI total completion time vs each of the four subscales (Empathic Concern, Perspective Taking, Fantasy, Personal Distress). Each panel shows an OLS line with 95% CI and a top-left annotation reporting β, R², and p-value. All subscales share the same y-axis scale (0–28). Points coloured by game condition.

9 Preliminary interpretation

The sample shows a median MASC ToM score of 32 (IQR = 4) out of 45 items, consistent with adequate mentalising ability in a non-clinical adult population. The affective component (median 66.7%) and the cognitive component (median 55.6%) are compared within individuals: the Wilcoxon signed-rank test yields p = 5.3 × 10⁻⁸, with a moderate effect size (r = 0.49), indicating a statistically significant within-person advantage for affective over cognitive items at the sample level.

The attention-scatter plots provide a first check on whether task engagement confounds ToM performance — interpretation depends on the slope and confidence interval of the regression lines.

Differences in MASC profiles across games are informative to the extent that randomisation was imperfect or that participant sorting occurred. In the tests of section 3.1, only diminishing (under-mentalising) errors differ across games (p = 0.026, η² = 0.053); this score is therefore a candidate covariate for the inferential sections (Parts II–III).

For the IRI, randomly assigned groups should show comparable empathy profiles, and they do: none of the four subscales differs significantly across games (all p ≥ 0.245). Significant game differences would have flagged imbalance warranting covariate adjustment in the main analyses.