Ic — Task comprehension & Cognitive Reflection (CRT4)

Descriptive Analyses · GTEMO Experiment

Author

Eric Guerci

Published

March 22, 2026

1 Objective

Describe task comprehension (quiz errors) and cognitive reflection ability (CRT4), and assess whether higher cognitive reflection is associated with better rule comprehension during the experiment.

Variable Scale Description
GT_error_q1_to_6 ≥ 0 Errors on game-rules quiz (Q1–Q6)
GT_error_q7_8_9 ≥ 0 Errors on signal-comprehension quiz (Q7–Q9)
quiz_errors_total ≥ 0 Total cumulative errors across all questions
CRT_totCorrect_corrected 0–4 Correct answers on the CRT4
CRT_totIntuitive_corrected 0–4 Intuitive (wrong) answers on the CRT4
Note

CRT4. This experiment used a 4-item version of the Cognitive Reflection Test (Frederick, 2005). Each item presents a problem with an intuitively compelling but incorrect answer; the correct answer requires overriding the intuitive response through deliberate reasoning. It is not a general IQ test but a specific measure of reflective thinking. CRT_totCorrect_corrected and CRT_totIntuitive_corrected are the corrected variables — do not use CRT_totCorrect / CRT_totIntuitive, which are legacy estimates from a 3-item scoring.

2 Data overview

Show code
df |>
  select(GT_error_q1_to_6, GT_error_q7_8_9, quiz_errors_total,
         CRT_totCorrect_corrected, CRT_totIntuitive_corrected) |>
  skimr::skim()
Data summary
Name select(…)
Number of rows 122
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
GT_error_q1_to_6 0 1 3.16 7.45 0 0 1 3 59 ▇▁▁▁▁
GT_error_q7_8_9 0 1 0.20 0.56 0 0 0 0 4 ▇▁▁▁▁
quiz_errors_total 0 1 3.37 7.46 0 0 1 3 59 ▇▁▁▁▁
CRT_totCorrect_corrected 0 1 2.38 0.87 0 2 3 3 4 ▁▂▆▇▁
CRT_totIntuitive_corrected 0 1 0.97 0.99 0 0 1 2 4 ▇▇▅▁▁

3 Descriptive statistics by game

Show code
tab_quest
Characteristic | Overall (N = 122)¹ | BS (N = 32)¹ | MP (N = 30)¹ | PD (N = 28)¹ | SH (N = 32)¹ | p-value² | Effect size³
Quiz errors: rules (Q1–6) | 1.0 (0.0, 3.0) | 0.5 (0.0, 1.0) | 2.0 (0.0, 5.0) | 1.0 (0.0, 5.5) | 0.0 (0.0, 1.0) | 0.025 | η² = 0.053 (small)
Quiz errors: signals (Q7–9) | | | | | | 0.634 | V = 0.145
    0 | 103 (84%) | 24 (75%) | 26 (87%) | 25 (89%) | 28 (88%) | |
    1 | 15 (12%) | 5 (16%) | 4 (13%) | 2 (7.1%) | 4 (13%) | |
    2 | 3 (2.5%) | 2 (6.3%) | 0 (0%) | 1 (3.6%) | 0 (0%) | |
    4 | 1 (0.8%) | 1 (3.1%) | 0 (0%) | 0 (0%) | 0 (0%) | |
Quiz errors: total | 1.0 (0.0, 3.0) | 1.0 (0.0, 2.0) | 2.0 (0.0, 5.0) | 1.5 (0.0, 6.0) | 0.0 (0.0, 1.0) | 0.053 | η² = 0.04 (small)
CRT4 – Correct answers (0–4) | | | | | | 0.479 | V = 0.179
    0 | 5 (4.1%) | 2 (6.3%) | 0 (0%) | 3 (11%) | 0 (0%) | |
    1 | 12 (9.8%) | 3 (9.4%) | 2 (6.7%) | 3 (11%) | 4 (13%) | |
    2 | 41 (34%) | 8 (25%) | 12 (40%) | 9 (32%) | 12 (38%) | |
    3 | 60 (49%) | 17 (53%) | 16 (53%) | 13 (46%) | 14 (44%) | |
    4 | 4 (3.3%) | 2 (6.3%) | 0 (0%) | 0 (0%) | 2 (6.3%) | |
CRT4 – Intuitive (wrong) answers (0–4) | | | | | | 0.337 | V = 0.191
    0 | 47 (39%) | 11 (34%) | 12 (40%) | 11 (39%) | 13 (41%) | |
    1 | 43 (35%) | 15 (47%) | 12 (40%) | 7 (25%) | 9 (28%) | |
    2 | 24 (20%) | 4 (13%) | 6 (20%) | 5 (18%) | 9 (28%) | |
    3 | 5 (4.1%) | 1 (3.1%) | 0 (0%) | 3 (11%) | 1 (3.1%) | |
    4 | 3 (2.5%) | 1 (3.1%) | 0 (0%) | 2 (7.1%) | 0 (0%) | |

¹ Median (Q1, Q3); n (%)
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test with simulated p-value (based on 2000 replicates)
³ Continuous: η² (Kruskal-Wallis). Categorical: Cramér’s V (χ²). Small/medium/large: η² ≥ 0.01/0.06/0.14; V ≥ 0.10/0.30/0.50.
Note

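Median (Q1, Q3) for continuous; n (%) for categorical. Kruskal-Wallis + η² for continuous; χ² + Cramér’s V for categorical. CRT4 scores and signal-quiz errors do not differ significantly across games (p = 0.337 to 0.634), which is consistent with successful randomisation with respect to cognitive reflection. Rules-quiz errors do differ across games (p = 0.025, η² = 0.053, small) and total errors are borderline (p = 0.053), so task comprehension is not fully balanced across conditions.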
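The two effect sizes reported in the table can be computed by hand from the test statistics. The sketch below is illustrative only: the data frame `toy` and its columns are simulated and hypothetical, and the η² formula used, (H − k + 1)/(n − k), is the common rank-based estimator, which is assumed rather than confirmed to be the one behind the table.

```r
# Simulated stand-in for the real data (hypothetical names and values).
set.seed(1)
toy <- data.frame(
  game   = factor(rep(c("BS", "MP", "PD", "SH"), times = c(32, 30, 28, 32))),
  errors = rpois(122, lambda = 2),
  crt    = sample(0:4, 122, replace = TRUE)
)

# Continuous outcome: Kruskal-Wallis H, then eta-squared = (H - k + 1) / (n - k).
kw <- kruskal.test(errors ~ game, data = toy)
k  <- nlevels(toy$game)
n  <- nrow(toy)
eta2_h <- unname((kw$statistic - k + 1) / (n - k))

# Categorical outcome: chi-squared with a Monte Carlo p-value (2000 replicates),
# then Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1))).
tab <- table(toy$crt, toy$game)
chi <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
cramers_v <- unname(sqrt(chi$statistic / (n * (min(dim(tab)) - 1))))

round(c(eta2_H = eta2_h, cramers_V = cramers_v), 3)
```

On data with no true group differences, as simulated here, η² can come out slightly negative; it is usually truncated at zero when reported.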

4 Task comprehension (quiz errors)

4.1 Errors by quiz section — violin plots

4.1.1 Rules comprehension (Q1–Q6)

Show code
p_violin_rules
Figure 1: Distribution of errors on game-rules questions (Q1–Q6) by game (log1p y-axis). Violin width shows density; box plot overlay shows quartiles. The log(1+x) transformation compresses the right tail; tick labels show raw error counts.

4.1.2 Signal comprehension (Q7–Q9)

Show code
p_violin_signals
Figure 2: Distribution of errors on signal-comprehension questions (Q7–Q9) by game (log1p y-axis). Violin width shows density; box plot overlay shows quartiles. Tick labels show raw error counts.
Tip

Comparing the two. The rules violin (Q1–Q6) and the signals violin (Q7–Q9) show different difficulty profiles. In this sample, signal errors are concentrated at zero (84% of participants made none; maximum 4), so signal comprehension is close to uniform across participants. Rules errors are more dispersed (median 1, maximum 59), so comprehension of the basic game mechanics varies more, both across participants and across games.

4.2 Total error distribution (ridgeline)

Note

Log transformation for count data. Quiz errors are count data (0, 1, 2, …) with a strong right skew: most participants make few errors, but some make many. The log(1+x) transformation (log1p) is applied throughout: it compresses the right tail, maps zero exactly to zero (no offset needed), and is consistent with all other error plots in this section. Tick labels always show raw error counts.
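As a quick illustration of the transformation described in this note, the sketch below applies `log1p()` to a few raw counts spanning the observed range and inverts it with `expm1()`. The vector `errors` is made up for the example.

```r
# Raw error counts spanning the range seen in the data overview (0 to 59).
errors <- c(0, 1, 3, 10, 59)

# log1p(x) computes log(1 + x); zero maps exactly to zero.
transformed <- log1p(errors)

# expm1() is the inverse, which is how positions on the transformed
# scale can be labelled with raw counts.
back <- expm1(transformed)

data.frame(raw = errors, log1p = round(transformed, 3), back = back)
```

The gap between 0 and 1 error (about 0.69 on the transformed scale) stays clearly visible, while the gap between 10 and 59 errors shrinks to about 1.7 units.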

Show code
p_ridgeline
Figure 3: Ridgeline density of total quiz errors by game [log(1+x) scale]. Tick labels show raw error counts. Strong right skew is typical: most participants made few errors, with a long tail in every condition.

5 CRT4 performance

5.1 Distribution by game

Show code
p_crt_hist
Figure 4: Distribution of CRT4 correct answers (0–4) by game. Histogram with overlaid kernel density. Scores range 0–4 (4-item version).

5.2 Correct vs intuitive responses

Note

Statistical test. Because both variables are discrete (0–4) with a built-in constraint (correct + intuitive ≤ 4), a linear regression interaction would be biased by the bounded scale. Instead, we summarise each participant’s reflective performance with a single net CRT score (CRT_net = correct − intuitive, range −4 to +4) and compare groups with a Mann-Whitney U test (reported in the annotation). As a robustness check, a binomial GLM with interaction (correct/4 ~ intuitive × group, family = Binomial) tests whether the correct~intuitive slope differs by group; results are reported below.
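The net score and the group comparison can be sketched in a few lines. The data below are simulated and the object names (`correct`, `intuitive`, `group`, `crt_net`) are hypothetical; only the construction of the score and the choice of test follow the note above.

```r
set.seed(2)
n <- 122

# Hypothetical data respecting the constraint correct + intuitive <= 4.
correct   <- rbinom(n, size = 4, prob = 0.6)
intuitive <- rbinom(n, size = 4 - correct, prob = 0.5)
group     <- factor(sample(c("Female", "Male"), n, replace = TRUE))

# Net reflective score, range -4 to +4.
crt_net <- correct - intuitive

# Two-sided, unpaired Mann-Whitney U test (called the Wilcoxon rank-sum test in R).
# exact = FALSE because the discrete score produces many ties.
mw <- wilcox.test(crt_net ~ group, exact = FALSE)
mw$p.value
```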

Show code
p_crt_scatter
Figure 5: CRT4 correct vs intuitive answers, pooled across game conditions. Left: by gender; right: by role. Annotation: Mann-Whitney U test on CRT_net (= correct − intuitive). Bottom-right = high reflective performance; top-left = predominantly intuitive.
Show code
gt_binom_interaction
Binomial GLM robustness check: interaction term testing whether the correct ~ intuitive slope differs by group. A non-significant interaction (ns) means there is no evidence that the negative correct–intuitive relationship differs in strength between the two groups; it does not by itself establish that the slopes are equal.
Binomial GLM: interaction correct ~ intuitive × group
Tests whether the correct ∼ intuitive slope differs by group. Bounded outcome: correct/4 ~ Binomial. Interaction term only.
Group Term log(OR) OR SE z p Sig.
Gender (Female vs Male) CRT_totIntuitive_corrected:genderMale 0.115 1.122 0.225 0.512 0.608 ns
Role (CoCoLab vs LEEN) CRT_totIntuitive_corrected:roleP2 (CoCoLab) −0.196 0.822 0.227 −0.866 0.386 ns
OR = exp(log-OR) of the interaction term. ns = not significant at α = .05. Main test: MW on CRT_net.
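A minimal sketch of how an interaction row like the ones above could be obtained is shown below, on simulated data with hypothetical names. The outcome is written as `cbind(successes, failures)`, which is equivalent to modelling `correct/4` with `weights = 4`; which of the two forms the report uses is not shown, so this is an assumption.

```r
set.seed(3)
n <- 122
toy <- data.frame(
  correct = rbinom(n, size = 4, prob = 0.6),
  gender  = factor(sample(c("Female", "Male"), n, replace = TRUE))
)
toy$intuitive <- rbinom(n, size = 4 - toy$correct, prob = 0.5)

# Bounded outcome: successes = correct, failures = 4 - correct.
fit <- glm(cbind(correct, 4 - correct) ~ intuitive * gender,
           family = binomial, data = toy)

# Keep only the interaction row; exponentiate the log-odds to get the OR.
est <- coef(summary(fit))["intuitive:genderMale", ]
c(log_OR = est[["Estimate"]],
  OR     = exp(est[["Estimate"]]),
  SE     = est[["Std. Error"]],
  z      = est[["z value"]],
  p      = est[["Pr(>|z|)"]])
```

With R's default treatment coding, a term named `genderMale` compares Male against the reference level Female.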

6 CRT4 vs quiz comprehension

Does higher cognitive reflection predict fewer comprehension errors?

Show code
p_quiz_vs_crt
Figure 6: Total quiz errors [log(1+x) scale] against CRT4 correct answers. OLS line fitted on the log1p-transformed outcome; top-right label reports β, R², and p-value. Tick labels show raw error counts. Points coloured by game.
Show code
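# Spearman rank correlation between CRT4 correct answers and total quiz errors.
# exact = FALSE requests the asymptotic p-value, since both variables have many ties.
cor_res <- cor.test(df$CRT_totCorrect_corrected, df$quiz_errors_total,
                    method = "spearman", exact = FALSE)

# One-row summary table (requires tibble and gt to be loaded).
tibble(
  Test      = "Spearman \u03c1",
  Statistic = round(cor_res$statistic, 2),
  rho       = round(cor_res$estimate, 3),
  p_value   = signif(cor_res$p.value, 3)
) |>
  gt() |>
  tab_header(title = "Correlation: CRT4 correct answers vs total quiz errors") |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels())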
Correlation: CRT4 correct answers vs total quiz errors
Test Statistic rho p_value
Spearman ρ 313663 -0.036 0.69
Note

A negative ρ would indicate that participants who scored higher on the CRT4 tended to make fewer quiz errors. Here the estimate is negative but negligible (ρ = −0.036, p = 0.69), so the data provide no evidence that cognitive reflection is associated with better comprehension of the experimental instructions.

7 CRT time vs cognitive reflection: non-linear analysis

On the CRT, longer deliberation time is theoretically associated with successful override of intuitive errors. However, this relationship may be non-monotonic: while moderate reflection time helps participants override intuitive (wrong) answers, participants who spend very long times may be those who struggle with the task — extra time reflects difficulty rather than productive deliberation. We investigate this hypothesis progressively: linear baseline → non-parametric smooth → quadratic test → segmented regression with estimated breakpoint.

Note

Note on quiz timing. Quiz completion times (GT_time_quiz_total_sec) contain negative values due to clock drift in the oTree data collection and have been excluded from analysis. CRT response times are clean and retained.

7.1 OLS linear baseline

Show code
p_crt_ols
Figure 7: OLS linear fit: CRT4 time vs correct answers. A positive slope suggests that more time = more correct answers on average, but this model assumes monotonicity.

The linear model provides the simplest summary: does spending more time on the CRT predict better performance? However, it imposes a constant marginal effect of time across the entire range and cannot capture the hypothesised threshold.

7.2 GAM smooth — visual evidence of non-linearity

Show code
p_crt_gam
Figure 8: GAM non-parametric smooth (mgcv, thin plate spline). If the effective degrees of freedom (edf) exceed 1, the data suggest a non-linear relationship. The 95% confidence band shows uncertainty in the smooth.
Tip

Reading the GAM. The effective degrees of freedom (edf) reported in the annotation quantify the wiggliness of the smooth. edf ≈ 1 means the relationship is essentially linear; edf > 1 indicates curvature. If the smooth clearly bends — rising then flattening or dropping — this motivates a parametric model with a turning point.
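The edf can be read directly from a fitted `mgcv` model. The sketch below uses simulated data with hypothetical column names; the REML smoothing-parameter method is an assumption, since the report does not state which method was used.

```r
library(mgcv)  # recommended package, included in standard R installations

set.seed(4)
# Hypothetical data: CRT time in seconds and number of correct answers (0-4).
toy <- data.frame(time = runif(122, min = 20, max = 400))
toy$correct <- pmin(4, pmax(0, round(2 + 0.003 * toy$time + rnorm(122, sd = 0.8))))

# Thin plate regression spline (bs = "tp" is mgcv's default basis).
fit_gam <- gam(correct ~ s(time, bs = "tp"), data = toy, method = "REML")

# Effective degrees of freedom of the smooth: about 1 means essentially linear.
edf <- summary(fit_gam)$edf
edf
```

Because the simulated relationship is linear, the penalty should shrink the smooth towards a straight line and the edf should come out close to 1.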

7.3 Quadratic OLS — parametric test of concavity

Show code
p_crt_quad
Figure 9: Quadratic OLS: y = β₀ + β₁·time + β₂·time². A significant negative β₂ is consistent with an inverted-U shape, provided the vertex falls inside the observed time range. The dashed vertical line marks the vertex (estimated turning point, at time = −β₁ / (2β₂)).

The quadratic model adds a single parameter (β₂) to test for concavity. A significant negative β₂ supports the inverted-U hypothesis. The vertex of the parabola gives an estimate of the turning point, but the parabola is symmetric around it; symmetry is a strong assumption that the segmented model relaxes.
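The turning point follows from setting the derivative of the fitted parabola to zero. A small sketch on simulated data (hypothetical names, with a true peak placed at 200 s purely for illustration):

```r
set.seed(5)
toy <- data.frame(time = runif(122, min = 20, max = 400))
# Simulated inverted U peaking at 200 s (illustrative values only).
toy$correct <- 3 - 0.00005 * (toy$time - 200)^2 + rnorm(122, sd = 0.3)

fit_quad <- lm(correct ~ time + I(time^2), data = toy)
b <- coef(fit_quad)

# Turning point of y = b0 + b1*time + b2*time^2 is at time = -b1 / (2*b2).
vertex <- unname(-b["time"] / (2 * b["I(time^2)"]))

# The vertex is an interior peak only if b2 < 0 and it lies
# inside the observed range of time.
is_interior_peak <- b[["I(time^2)"]] < 0 &&
  vertex > min(toy$time) && vertex < max(toy$time)

c(vertex = vertex, interior_peak = is_interior_peak)
```

Checking `is_interior_peak` guards against reading a turning point into a curve that is merely flattening within the data.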

7.4 Segmented regression — threshold estimation

Show code
p_crt_seg
Figure 10: Segmented (piecewise) regression with endogenously estimated breakpoint. The vertical line marks the estimated threshold; shaded band shows the 95% CI. Slopes before and after the breakpoint are estimated independently.
Important

Segmented regression (Muggeo, 2003) estimates the breakpoint ψ endogenously and fits two separate linear slopes — one before and one after ψ. Unlike the quadratic model, it does not impose symmetry around the turning point. This makes it the most appropriate model if the mechanism differs qualitatively on either side of the threshold: productive reflection (slope 1) vs struggling without learning (slope 2).
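To make the idea concrete, the sketch below estimates a breakpoint by a plain grid search over candidate values of ψ. This is a simplified stand-in, not Muggeo's iterative algorithm (which is available in the R package `segmented`), and it gives no confidence interval for ψ. Data and names are simulated and hypothetical.

```r
set.seed(6)
toy <- data.frame(time = runif(122, min = 20, max = 400))
# Simulated broken-stick truth: slope +0.01 up to 150 s, then -0.004.
toy$correct <- 1 + 0.01 * pmin(toy$time, 150) -
  0.004 * pmax(toy$time - 150, 0) + rnorm(122, sd = 0.3)

# For a fixed breakpoint psi, the continuous piecewise-linear model is
# y = b0 + b1*time + b2*max(time - psi, 0); b2 is the change in slope.
fit_at <- function(psi) lm(correct ~ time + pmax(time - psi, 0), data = toy)

# Profile the residual sum of squares over candidate breakpoints.
qs <- unname(quantile(toy$time, c(0.1, 0.9)))
candidates <- seq(qs[1], qs[2], length.out = 200)
rss <- vapply(candidates, function(psi) deviance(fit_at(psi)), numeric(1))
psi_hat <- candidates[which.min(rss)]

best <- fit_at(psi_hat)
slope_before <- unname(coef(best)[2])
slope_after  <- unname(coef(best)[2] + coef(best)[3])

round(c(psi_hat = psi_hat, slope_before = slope_before, slope_after = slope_after), 4)
```

The two slopes are free to differ in both sign and magnitude, which is the asymmetry the quadratic model cannot represent.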

7.5 Model comparison

Show code
tab_model_comp
Model comparison: CRT time → accuracy
Model AIC BIC Adj. R²
OLS linear 310.9 319.3 0.0331
Quadratic OLS 312.6 323.8 0.0274
GAM (spline) 310.9 319.3 0.0331
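Segmented 312.5 326.5 0.0356

Lower AIC and BIC indicate a better trade-off between fit and complexity. The linear model has the lowest AIC and BIC; the GAM reproduces its values exactly, which suggests the smooth was penalised to a straight line (edf ≈ 1). The quadratic and segmented models do not reduce AIC or BIC, and the segmented model's small gain in adjusted R² (0.0356 vs 0.0331) does not offset its extra parameters. On these criteria there is little support for a non-linear relationship.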
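A comparison table of this kind can be assembled with base R. The sketch below covers only two of the four models and uses simulated data with hypothetical names; it is meant to show where each column comes from, not to reproduce the figures above.

```r
set.seed(7)
toy <- data.frame(time = runif(122, min = 20, max = 400))
toy$correct <- 2 + 0.002 * toy$time + rnorm(122, sd = 0.85)

fits <- list(
  "OLS linear"    = lm(correct ~ time, data = toy),
  "Quadratic OLS" = lm(correct ~ time + I(time^2), data = toy)
)

# Lower AIC/BIC is better; both penalise extra parameters, BIC more heavily
# (penalty per parameter: 2 for AIC, log(n) for BIC).
comp <- data.frame(
  Model  = names(fits),
  AIC    = vapply(fits, AIC, numeric(1)),
  BIC    = vapply(fits, BIC, numeric(1)),
  Adj_R2 = vapply(fits, function(m) summary(m)$adj.r.squared, numeric(1)),
  row.names = NULL
)
comp
```

With n = 122, each additional parameter costs about 4.8 BIC points against 2 AIC points, which is why the gap between BIC and AIC widens from the linear to the segmented row in the table.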

7.6 Summary panel (all four models)

Show code
p_crt_panel
Figure 11: Comparison of four approaches to modelling CRT time → accuracy. Top-left: OLS linear. Top-right: GAM smooth. Bottom-left: Quadratic OLS. Bottom-right: Segmented regression with estimated breakpoint.

8 Conditioning on gender and role

Note

The following analyses condition quiz errors and CRT4 performance on gender (Male / Female) and lab role (P1 LEEN / P2 CoCoLab) separately — pooled across games. One stratifying variable is shown per plot: no cross-tabulation with game_id.

Each violin/boxplot pair is annotated with a Mann-Whitney U test (two-sided, unpaired). The stacked-bar perfect-score charts show a χ² test (Monte Carlo p-value, 2000 replicates). Regression models include all three predictors simultaneously (game + gender + role).
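The two annotation tests can be sketched as follows, on simulated data with hypothetical names. The rank-biserial formula r = 2U/(n₁n₂) − 1 is one common convention; the sign convention used in the report's figures is not shown, so treat it as an assumption.

```r
set.seed(8)
n <- 122
toy <- data.frame(
  gender        = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  signal_errors = rpois(n, lambda = 0.2)
)

# Mann-Whitney U (two-sided, unpaired) on signal errors by gender.
mw <- wilcox.test(signal_errors ~ gender, data = toy, exact = FALSE)

# Rank-biserial r from U. R reports U for the first factor level,
# so r > 0 means that level tends to have larger values.
n1 <- sum(toy$gender == "Female")
n2 <- sum(toy$gender == "Male")
r_rb <- unname(2 * mw$statistic / (n1 * n2) - 1)

# Perfect score = zero errors; chi-squared test with Monte Carlo p-value.
toy$perfect <- toy$signal_errors == 0
chi <- chisq.test(table(toy$gender, toy$perfect),
                  simulate.p.value = TRUE, B = 2000)

round(c(mw_p = mw$p.value, rank_biserial = r_rb, chi_p = chi$p.value), 3)
```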

8.1 Quiz errors by gender

Show code
p_quiz_errors_gender
Figure 12: Distribution of quiz errors (rules Q1–6, signals Q7–9, total) by gender [log(1+x) y-axis]. Violin + boxplot overlay, pooled across all games. Tick labels show raw error counts. Mann-Whitney U test p-value shown at the top-left of each panel.
Show code
p_perf_gender
Figure 13: Proportion achieving a perfect signal-quiz score (Q7–9), by gender. χ² test (Monte Carlo) shown in subtitle.

8.2 Quiz errors by role

Show code
p_quiz_errors_role
Figure 14: Distribution of quiz errors (rules Q1–6, signals Q7–9, total) by lab role [log(1+x) y-axis]. Violin + boxplot overlay, pooled across all games. Tick labels show raw error counts. Mann-Whitney U test p-value shown at the top-left of each panel.
Show code
p_perf_role
Figure 15: Proportion achieving a perfect signal-quiz score, by lab role. χ² test (Monte Carlo) shown in subtitle.

8.3 CRT4 by gender and role

Show code
p_crt4_demo_panel
Figure 16: CRT4 score distribution by gender (left) and by lab role (right). Stacked proportion bars show the share of participants at each discrete score level (0–4): red = 0 correct → teal = 4 correct. Labels show % where ≥ 7%. Annotation: Mann-Whitney U on CRT_net = correct − intuitive with rank-biserial r.

8.4 Regressions with demographic controls

8.4.1 Quiz errors — OLS

The table reports OLS estimates for the three quiz-error outcomes, each regressed on game, gender, and role entered simultaneously. Reference categories: game = BS, gender = Female (the coefficient is labelled genderMale), role = P1 (LEEN).

Show code
gt_ols8_quiz
OLS: quiz errors ~ game + gender + role
OLS. Reference: game = BS, gender = Female, role = P1 (LEEN). 95% CI from confint().
Outcome Predictor β SE 95% CI lo 95% CI hi t p Sig.
Outcome: Signal errors (Q7–9)
Signal errors (Q7–9) Game: MP vs BS -0.275 0.141 -0.555 0.005 -1.946 0.054
Signal errors (Q7–9) Game: PD vs BS -0.268 0.144 -0.554 0.017 -1.861 0.065
Signal errors (Q7–9) Game: SH vs BS -0.281 0.139 -0.557 -0.006 -2.022 0.045 *
Signal errors (Q7–9) genderMale 0.069 0.101 -0.131 0.269 0.679 0.498
Signal errors (Q7–9) Role: CoCoLab vs LEEN 0.049 0.101 -0.150 0.249 0.488 0.626
Outcome: Rules errors (Q1–6)
Rules errors (Q1–6) Game: MP vs BS 2.545 1.869 -1.157 6.247 1.362 0.176
Rules errors (Q1–6) Game: PD vs BS 4.025 1.905 0.252 7.798 2.113 0.037 *
Rules errors (Q1–6) Game: SH vs BS -0.688 1.838 -4.328 2.953 -0.374 0.709
Rules errors (Q1–6) genderMale -0.722 1.335 -3.367 1.922 -0.541 0.59
Rules errors (Q1–6) Role: CoCoLab vs LEEN 0.590 1.331 -2.047 3.227 0.443 0.658
Outcome: Total errors
Total errors Game: MP vs BS 2.270 1.874 -1.442 5.982 1.211 0.228
Total errors Game: PD vs BS 3.757 1.910 -0.027 7.540 1.966 0.052
Total errors Game: SH vs BS -0.969 1.843 -4.619 2.682 -0.526 0.6
Total errors genderMale -0.654 1.339 -3.306 1.998 -0.488 0.626
Total errors Role: CoCoLab vs LEEN 0.639 1.335 -2.005 3.283 0.479 0.633
β = OLS coefficient. * p < .05 ** p < .01 *** p < .001.

8.4.2 CRT4 — binary logistic regression

Note

CRT4 correct answers take only 5 discrete values (0–4) on a bounded scale, which makes OLS inference unreliable. Responses are therefore dichotomised into high reflectors (correct ≥ 3) vs low/intuitive (correct ≤ 2), and a logistic regression is estimated. Odds ratios > 1 indicate higher odds of being a high reflector relative to the reference category (BS, Female, LEEN; the gender coefficient is labelled genderMale). Wald 95% CIs are reported.
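The dichotomisation and the Wald intervals can be sketched as below. Data and names are simulated and hypothetical; the explicit `levels` argument shows one way to fix LEEN as the role reference, since R would otherwise pick the alphabetically first level.

```r
set.seed(9)
n <- 122
toy <- data.frame(
  game    = factor(sample(c("BS", "MP", "PD", "SH"), n, replace = TRUE)),
  gender  = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  role    = factor(sample(c("LEEN", "CoCoLab"), n, replace = TRUE),
                   levels = c("LEEN", "CoCoLab")),
  correct = rbinom(n, size = 4, prob = 0.6)
)

# Dichotomise: high reflector if 3 or 4 correct answers.
toy$high <- as.integer(toy$correct >= 3)

fit <- glm(high ~ game + gender + role, family = binomial, data = toy)

# Wald interval on the log-odds scale (estimate +/- 1.96 * SE), then exponentiate.
# confint.default() gives Wald intervals; confint() on a glm profiles the likelihood.
or_tab <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_tab[-1, ], 3)  # drop the intercept row
```

Because the interval is built on the log-odds scale, it is asymmetric around the OR after exponentiation, as in the table that follows.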

Show code
gt_logit8_crt
Logistic regression: CRT4 high reflector ~ game + gender + role
Binary outcome: correct ≥ 3 (high reflector) vs ≤ 2 (low/intuitive). OR and 95% Wald CI. Reference: BS, Female, LEEN.
Predictor log(OR) OR 95% CI lo 95% CI hi z p Sig.
Game: MP vs BS −0.248 0.780 0.285 2.134 −0.484 0.629
Game: PD vs BS −0.528 0.590 0.211 1.646 −1.008 0.313
Game: SH vs BS −0.380 0.684 0.254 1.839 −0.752 0.452
genderMale 0.072 1.074 0.525 2.199 0.197 0.844
Role: CoCoLab vs LEEN 0.000 1.000 0.490 2.042 0.000 1
OR = odds ratio = exp(log-odds). Wald 95% CI. * p < .05 ** p < .01 *** p < .001.

8.5 Forest plots — partial effects

Show code
p_forest8_rules
Figure 17: OLS β ± 95% CI for each predictor on rules-quiz errors (Q1–6; game + gender + role). Reference: BS, Female, LEEN. Dashed line at zero.
Show code
p_forest8_signals
Figure 18: OLS β ± 95% CI for each predictor on signal-comprehension errors (Q7–9; game + gender + role). Reference: BS, Female, LEEN. Dashed line at zero.
Show code
p_forest8_crt
Figure 19: Logistic regression odds ratios (OR) with 95% Wald CI for CRT4 high-reflector outcome (correct ≥ 3). Log scale: values > 1 = higher odds of being a high reflector; values < 1 = lower odds. Reference: BS, Female, LEEN. Dashed line at OR = 1.

9 Preliminary interpretation

84.4% of participants answered all signal-comprehension questions correctly. Total errors show the expected right skew (median = 1): most participants made few mistakes, but a high-error tail warrants monitoring in robustness checks.

On the CRT4, the sample median is 3 correct answers (IQR = 1) out of 4. The Spearman correlation between CRT4 score and quiz errors is ρ = −0.04 (p = 0.69): close to zero and not statistically significant, so there is no evidence of a monotonic association between the two measures in this sample.