Ic — Task comprehension & Cognitive Reflection (CRT4)

Descriptive Analyses · GTEMO Experiment

Author

Eric Guerci

Published

March 22, 2026

1 Objective

Describe task comprehension (quiz errors) and cognitive reflection ability (CRT4), and assess whether higher cognitive reflection is associated with better rule comprehension during the experiment.

Variable	Scale	Description
`GT_error_q1_to_6`	≥ 0	Errors on game-rules quiz (Q1–Q6)
`GT_error_q7_8_9`	≥ 0	Errors on signal-comprehension quiz (Q7–Q9)
`quiz_errors_total`	≥ 0	Total cumulative errors across all questions
`CRT_totCorrect_corrected`	0–4	Correct answers on the CRT4
`CRT_totIntuitive_corrected`	0–4	Intuitive (wrong) answers on the CRT4

Note

CRT4. This experiment used a 4-item version of the Cognitive Reflection Test (Frederick, 2005). Each item presents a problem with an intuitively compelling but incorrect answer; the correct answer requires overriding the intuitive response through deliberate reasoning. It is not a general IQ test but a specific measure of reflective thinking. CRT_totCorrect_corrected and CRT_totIntuitive_corrected are the corrected variables — do not use CRT_totCorrect / CRT_totIntuitive, which are legacy estimates from a 3-item scoring.

2 Data overview

df |>
  select(GT_error_q1_to_6, GT_error_q7_8_9, quiz_errors_total,
         CRT_totCorrect_corrected, CRT_totIntuitive_corrected) |>
  skimr::skim()

Data summary
Name	select(…)
Number of rows	122
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p25	p50	p75	p100	hist
GT_error_q1_to_6	1	3.16	7.45	0	1	3	59	▇▁▁▁▁
GT_error_q7_8_9	1	0.20	0.56	0	0	0	4	▇▁▁▁▁
quiz_errors_total	1	3.37	7.46	0	1	3	59	▇▁▁▁▁
CRT_totCorrect_corrected	1	2.38	0.87	2	3	3	4	▁▂▆▇▁
CRT_totIntuitive_corrected	1	0.97	0.99	0	1	2	4	▇▇▅▁▁

3 Descriptive statistics by game

tab_quest

Characteristic	Overall N = 122¹	BS N = 32¹	MP N = 30¹	PD N = 28¹	SH N = 32¹	p-value²	Effect size³
Quiz errors: rules (Q1–6)	1.0 (0.0, 3.0)	0.5 (0.0, 1.0)	2.0 (0.0, 5.0)	1.0 (0.0, 5.5)	0.0 (0.0, 1.0)	0.025	η² = 0.053 (small)
Quiz errors: signals (Q7–9)						0.634	V = 0.145
0	103 (84%)	24 (75%)	26 (87%)	25 (89%)	28 (88%)
1	15 (12%)	5 (16%)	4 (13%)	2 (7.1%)	4 (13%)
2	3 (2.5%)	2 (6.3%)	0 (0%)	1 (3.6%)	0 (0%)
4	1 (0.8%)	1 (3.1%)	0 (0%)	0 (0%)	0 (0%)
Quiz errors: total	1.0 (0.0, 3.0)	1.0 (0.0, 2.0)	2.0 (0.0, 5.0)	1.5 (0.0, 6.0)	0.0 (0.0, 1.0)	0.053	η² = 0.04 (small)
CRT4 – Correct answers (0–4)						0.479	V = 0.179
0	5 (4.1%)	2 (6.3%)	0 (0%)	3 (11%)	0 (0%)
1	12 (9.8%)	3 (9.4%)	2 (6.7%)	3 (11%)	4 (13%)
2	41 (34%)	8 (25%)	12 (40%)	9 (32%)	12 (38%)
3	60 (49%)	17 (53%)	16 (53%)	13 (46%)	14 (44%)
4	4 (3.3%)	2 (6.3%)	0 (0%)	0 (0%)	2 (6.3%)
CRT4 – Intuitive (wrong) answers (0–4)						0.337	V = 0.191
0	47 (39%)	11 (34%)	12 (40%)	11 (39%)	13 (41%)
1	43 (35%)	15 (47%)	12 (40%)	7 (25%)	9 (28%)
2	24 (20%)	4 (13%)	6 (20%)	5 (18%)	9 (28%)
3	5 (4.1%)	1 (3.1%)	0 (0%)	3 (11%)	1 (3.1%)
4	3 (2.5%)	1 (3.1%)	0 (0%)	2 (7.1%)	0 (0%)
¹ Median (Q1, Q3); n (%)
² Kruskal-Wallis rank sum test; Pearson’s Chi-squared test with simulated p-value (based on 2000 replicates)
³ Continuous: η² (Kruskal-Wallis). Categorical: Cramér’s V (χ²). Small/medium/large: η² ≥ 0.01/0.06/0.14; V ≥ 0.10/0.30/0.50.

Note

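Median (Q1, Q3) for continuous variables; n (%) for categorical variables. Kruskal-Wallis + η² for continuous; χ² + Cramér’s V for categorical. CRT4 scores and signal-quiz errors do not differ significantly across games, which is consistent with successful randomisation with respect to cognitive reflection. Rules-quiz errors do differ (p = 0.025, small effect; medians range from 0.0 in SH to 2.0 in MP), so rule comprehension is not fully balanced across games; total errors are borderline (p = 0.053).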
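The tests and effect sizes named in the table footnotes can be sketched in base R as follows. This is a minimal illustration on simulated data, not the code behind `tab_quest`: the data frame `dat`, its columns, and the η² formula (H − k + 1) / (n − k) are assumptions made for the example.

```r
# Simulated stand-in for the experiment data (hypothetical values).
set.seed(1)
n_per_game <- 30
dat <- data.frame(
  game   = factor(rep(c("BS", "MP", "PD", "SH"), each = n_per_game)),
  errors = rpois(4 * n_per_game, lambda = rep(c(1, 3, 3, 1), each = n_per_game)),
  crt    = sample(0:4, 4 * n_per_game, replace = TRUE)
)

# Count outcome: Kruskal-Wallis H and an eta-squared derived from H.
kw <- kruskal.test(errors ~ game, data = dat)
H  <- unname(kw$statistic)
k  <- nlevels(dat$game)
n  <- nrow(dat)
eta2_H <- (H - k + 1) / (n - k)   # assumed formula for eta-squared

# Categorical outcome: chi-squared with Monte Carlo p-value (2000 replicates)
# and Cramer's V computed from the chi-squared statistic.
tab <- table(dat$crt, dat$game)
chi <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
cramers_v <- sqrt(unname(chi$statistic) / (n * (min(dim(tab)) - 1)))
```

Under this formula, with n = 122 and k = 4 as in the table, an η² of 0.053 would correspond to H ≈ 9.25.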

4 Signal comprehension

4.1 Errors by quiz section — violin plots

4.1.1 Rules comprehension (Q1–Q6)

p_violin_rules

Figure 1: Distribution of errors on game-rules questions (Q1–Q6) by game (log1p y-axis). Violin width shows density; box plot overlay shows quartiles. The log(1+x) transformation compresses the right tail; tick labels show raw error counts.

4.1.2 Signal comprehension (Q7–Q9)

p_violin_signals

Figure 2: Distribution of errors on signal-comprehension questions (Q7–Q9) by game (log1p y-axis). Violin width shows density; box plot overlay shows quartiles. Tick labels show raw error counts.

Tip

Comparing the two. The rules violin (Q1–Q6) and signals violin (Q7–Q9) show different difficulty profiles. If rules errors are concentrated near zero across all games, comprehension of basic mechanics is solid. If signals errors are more dispersed, interpreting signal information during the task is less uniform across participants.

4.2 Total error distribution (ridgeline)

Note

Log transformation for count data. Quiz errors are count data (0, 1, 2, …) with a strong right skew: most participants make few errors, but some make many. The log(1+x) transformation (log1p) is applied throughout: it compresses the right tail, maps zero exactly to zero (no offset needed), and is consistent with all other error plots in this section. Tick labels always show raw error counts.
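As a small illustration of the transformation described above, the following base R sketch shows how raw counts map to the plotting scale and back. The error counts and tick values are illustrative and are not taken from the plotting code.

```r
errors <- c(0, 1, 3, 9, 59)   # raw error counts (illustrative values)
y <- log1p(errors)            # log(1 + x); log1p(0) is exactly 0

# Ticks sit at transformed positions but are labelled with raw counts.
tick_labels    <- c(0, 1, 3, 10, 30, 60)
tick_positions <- log1p(tick_labels)

# expm1() inverts log1p(), recovering the raw counts.
back <- expm1(y)
```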

p_ridgeline

Figure 3: Ridgeline density of total quiz errors by game [log(1+x) scale]. Tick labels show raw error counts. Strong right skew is typical: most participants made few errors, with a long tail in every condition.

5 CRT4 performance

5.1 Distribution by game

p_crt_hist

Figure 4: Distribution of CRT4 correct answers (0–4) by game. Histogram with overlaid kernel density. Scores range 0–4 (4-item version).

5.2 Correct vs intuitive responses

Note

Statistical test. Because both variables are discrete (0–4) with a built-in constraint (correct + intuitive ≤ 4), a linear regression interaction would be biased by the bounded scale. Instead, we summarise each participant’s reflective performance with a single net CRT score (CRT_net = correct − intuitive, range −4 to +4) and compare groups with a Mann-Whitney U test (reported in the annotation). As a robustness check, a binomial GLM with interaction (correct/4 ~ intuitive × group, family = Binomial) tests whether the correct~intuitive slope differs by group; results are reported below.
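The two steps described in this note can be sketched in base R. The example below uses simulated responses; the data frame `dat`, the grouping factor `group` with levels A and B, and the response probabilities are hypothetical and only stand in for the gender or role comparison.

```r
set.seed(2)
n <- 120
dat <- data.frame(group = factor(rep(c("A", "B"), each = n / 2)))

# Each of the 4 items is answered correctly, intuitively, or otherwise wrong,
# so correct + intuitive <= 4 by construction.
counts <- t(rmultinom(n, size = 4, prob = c(0.60, 0.25, 0.15)))
dat$correct   <- counts[, 1]
dat$intuitive <- counts[, 2]

# Net CRT score, range -4 to +4.
dat$crt_net <- dat$correct - dat$intuitive

# Main test: two-sided Mann-Whitney U (Wilcoxon rank-sum) on the net score.
mw <- wilcox.test(crt_net ~ group, data = dat, exact = FALSE)

# Robustness check: binomial GLM on correct answers out of 4, with interaction.
fit <- glm(cbind(correct, 4 - correct) ~ intuitive * group,
           family = binomial, data = dat)
interaction_row <- coef(summary(fit))["intuitive:groupB", ]
```

`interaction_row` holds the estimate (log-odds), standard error, z value and p-value of the interaction term, which are the quantities shown in the robustness table.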

p_crt_scatter

Figure 5: CRT4 correct vs intuitive answers, pooled across game conditions. Left: by gender; right: by role. Annotation: Mann-Whitney U test on CRT_net (= correct − intuitive). Bottom-right = high reflective performance; top-left = predominantly intuitive.

gt_binom_interaction

Binomial GLM robustness check: interaction term testing whether the correct ~ intuitive slope differs by group. A non-significant interaction (ns) means there is no evidence that the negative correct–intuitive relationship differs between the two groups; it does not establish that the slopes are equal.
Binomial GLM: interaction correct ~ intuitive × group
Tests whether the correct ∼ intuitive slope differs by group. Bounded outcome: correct/4 ~ Binomial. Interaction term only.
Group	Term	log(OR)	OR	SE	z	p	Sig.
Gender (Female vs Male)	CRT_totIntuitive_corrected:genderMale	0.115	1.122	0.225	0.512	0.608	ns
Role (CoCoLab vs LEEN)	CRT_totIntuitive_corrected:roleP2 (CoCoLab)	−0.196	0.822	0.227	−0.866	0.386	ns
OR = exp(log-OR) of the interaction term. ns = not significant at α = .05. Main test: MW on CRT_net.

6 CRT4 vs quiz comprehension

Does higher cognitive reflection predict fewer comprehension errors?

p_quiz_vs_crt

Figure 6: Total quiz errors [log(1+x) scale] against CRT4 correct answers. OLS line fitted on the log1p-transformed outcome; top-right label reports β, R², and p-value. Tick labels show raw error counts. Points coloured by game.

cor_res <- cor.test(df$CRT_totCorrect_corrected, df$quiz_errors_total,
                    method = "spearman", exact = FALSE)

tibble(
  Test      = "Spearman \u03c1",
  Statistic = round(cor_res$statistic, 2),
  rho       = round(cor_res$estimate, 3),
  p_value   = signif(cor_res$p.value, 3)
) |>
  gt() |>
  tab_header(title = "Correlation: CRT4 correct answers vs total quiz errors") |>
  tab_style(style = cell_text(weight = "bold"),
            locations = cells_column_labels())

Correlation: CRT4 correct answers vs total quiz errors
Test	Statistic	rho	p_value
Spearman ρ	313663	-0.036	0.69

Note

A negative ρ would indicate that participants who scored higher on the CRT4 tended to make fewer quiz errors. Here the estimate is close to zero and not significant (ρ = −0.036, p = 0.69), so these data provide no evidence that cognitive reflection is associated with better comprehension of the experimental instructions.

7 CRT time vs cognitive reflection: non-linear analysis

On the CRT, longer deliberation time is theoretically associated with successful override of intuitive errors. However, this relationship may be non-monotonic: while moderate reflection time helps participants override intuitive (wrong) answers, participants who spend very long times may be those who struggle with the task — extra time reflects difficulty rather than productive deliberation. We investigate this hypothesis progressively: linear baseline → non-parametric smooth → quadratic test → segmented regression with estimated breakpoint.

Note

Note on quiz timing. Quiz completion times (GT_time_quiz_total_sec) contain negative values due to clock drift in the oTree data collection and have been excluded from analysis. CRT response times are clean and retained.

7.1 OLS linear baseline

p_crt_ols

Figure 7: OLS linear fit: CRT4 time vs correct answers. A positive slope suggests that more time is associated with more correct answers on average, but this model assumes a constant, monotonic effect of time.

The linear model provides the simplest summary: does spending more time on the CRT predict better performance? However, it imposes a constant marginal effect of time across the entire range and cannot capture the hypothesised threshold.

7.2 GAM smooth — visual evidence of non-linearity

p_crt_gam

Figure 8: GAM non-parametric smooth (mgcv, thin plate spline). If the effective degrees of freedom (edf) exceed 1, the data suggest a non-linear relationship. The 95% confidence band shows uncertainty in the smooth.

Tip

Reading the GAM. The effective degrees of freedom (edf) reported in the annotation quantify the wiggliness of the smooth. edf ≈ 1 means the relationship is essentially linear; edf > 1 indicates curvature. If the smooth clearly bends — rising then flattening or dropping — this motivates a parametric model with a turning point.
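A minimal sketch of how the edf can be read from a fitted smooth with mgcv, assuming simulated data in which the true relationship is linear. The variable names `time_sec` and `correct` and the choice of REML smoothing are assumptions for the example and may differ from the code behind `p_crt_gam`.

```r
library(mgcv)

set.seed(3)
n <- 150
time_sec <- runif(n, 10, 300)                           # hypothetical CRT time
correct  <- 2 + 0.004 * time_sec + rnorm(n, sd = 0.8)   # linear truth

# Thin plate regression spline; the penalty shrinks the smooth towards a line.
gam_fit <- gam(correct ~ s(time_sec, bs = "tp"), method = "REML")

# Effective degrees of freedom of the smooth term.
edf <- summary(gam_fit)$edf
```

When the underlying relationship is linear, as simulated here, the penalty typically pulls the edf close to 1; values well above 1 would point to curvature.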

7.3 Quadratic OLS — parametric test of concavity

p_crt_quad

Figure 9: Quadratic OLS: y = β₀ + β₁·time + β₂·time². A significant negative β₂ indicates concavity, consistent with an inverted-U shape. The dashed vertical line marks the vertex (estimated turning point).

The quadratic model adds a single parameter (β₂) to test for concavity. A significant negative β₂ supports the inverted-U hypothesis, provided the vertex falls within the observed range of response times. The vertex of the parabola, at time = −β₁ / (2β₂), provides a symmetric estimate of the turning point, but symmetry is a strong assumption that the segmented model relaxes.
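The quadratic test and the vertex calculation can be sketched as follows. The data are simulated with an inverted-U whose peak is placed at 100 seconds; the variable names and all numeric values are hypothetical.

```r
set.seed(4)
n <- 200
time_sec <- runif(n, 0, 200)
# Hypothetical inverted-U truth with its peak at 100 seconds.
correct <- 1 + 0.04 * time_sec - 0.0002 * time_sec^2 + rnorm(n, sd = 0.3)

quad_fit <- lm(correct ~ time_sec + I(time_sec^2))
b <- coef(quad_fit)

# Vertex of the fitted parabola: -beta1 / (2 * beta2).
vertex <- unname(-b["time_sec"] / (2 * b["I(time_sec^2)"]))

# p-value for the curvature term beta2.
p_b2 <- coef(summary(quad_fit))["I(time_sec^2)", "Pr(>|t|)"]
```

It is worth checking that `vertex` lies inside `range(time_sec)`; a vertex outside the observed range means the fitted curve is monotone over the data even when β₂ is negative.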

7.4 Segmented regression — threshold estimation

p_crt_seg

Figure 10: Segmented (piecewise) regression with endogenously estimated breakpoint. The vertical line marks the estimated threshold; shaded band shows the 95% CI. Slopes before and after the breakpoint are estimated independently.

Important

Segmented regression (Muggeo, 2003) estimates the breakpoint ψ endogenously and fits two separate linear slopes — one before and one after ψ. Unlike the quadratic model, it does not impose symmetry around the turning point. This makes it the most appropriate model if the mechanism differs qualitatively on either side of the threshold: productive reflection (slope 1) vs struggling without learning (slope 2).
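To make the idea of an estimated breakpoint concrete, the sketch below fits a continuous two-slope model by a plain grid search over candidate breakpoints in base R. This is a simplified stand-in, not Muggeo's iterative algorithm and not the code behind `p_crt_seg`; the simulated data, the candidate grid, and all names are assumptions.

```r
set.seed(5)
n <- 200
time_sec <- runif(n, 0, 200)
# Hypothetical truth: slope +0.03 up to 80 s, then slope -0.01.
correct <- 1 + 0.03 * pmin(time_sec, 80) - 0.01 * pmax(time_sec - 80, 0) +
  rnorm(n, sd = 0.3)

# For a fixed breakpoint psi, the two-slope model is linear in time and
# in the hinge term pmax(time - psi, 0).
rss_at <- function(psi) {
  fit <- lm(correct ~ time_sec + pmax(time_sec - psi, 0))
  sum(resid(fit)^2)
}

# Grid search: keep the breakpoint with the smallest residual sum of squares.
candidates <- seq(20, 180, by = 1)
rss <- vapply(candidates, rss_at, numeric(1))
psi_hat <- candidates[which.min(rss)]

final <- lm(correct ~ time_sec + pmax(time_sec - psi_hat, 0))
slope_before <- unname(coef(final)[2])
slope_after  <- unname(coef(final)[2] + coef(final)[3])
```

The coefficient on the hinge term is the change in slope at the breakpoint, so the slope after ψ is the sum of the two coefficients. A grid search gives no confidence interval for ψ, which is one reason to prefer a dedicated implementation for reported results.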

7.5 Model comparison

tab_model_comp

Model comparison: CRT time → accuracy
Model	AIC	BIC	Adj. R²
OLS linear	310.9	319.3	0.0331
Quadratic OLS	312.6	323.8	0.0274
GAM (spline)	310.9	319.3	0.0331
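Segmented	312.5	326.5	0.0356

The linear model has the lowest AIC and BIC. The GAM row is identical to the linear row, which suggests that the spline was penalised down to a straight line (edf ≈ 1). The quadratic and segmented models add parameters without improving fit enough to offset the penalty, and adjusted R² stays near 0.03 for every model, so these data give little support for a non-monotonic relationship between CRT time and accuracy.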
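A comparison table of this kind can be assembled along the following lines. The sketch uses simulated data and covers only three of the four models; the object names, the REML smoothing choice, and the helper `adj_r2` are hypothetical and are not taken from the code behind `tab_model_comp`.

```r
library(mgcv)

set.seed(6)
n <- 120
time_sec <- runif(n, 10, 300)                           # hypothetical CRT time
correct  <- 2 + 0.003 * time_sec + rnorm(n, sd = 0.85)

models <- list(
  "OLS linear"    = lm(correct ~ time_sec),
  "Quadratic OLS" = lm(correct ~ time_sec + I(time_sec^2)),
  "GAM (spline)"  = gam(correct ~ s(time_sec, bs = "tp"), method = "REML")
)

# summary.gam() stores adjusted R-squared in $r.sq; summary.lm() in $adj.r.squared.
adj_r2 <- function(m) {
  if (inherits(m, "gam")) summary(m)$r.sq else summary(m)$adj.r.squared
}

comparison <- data.frame(
  Model  = names(models),
  AIC    = sapply(models, AIC),
  BIC    = sapply(models, BIC),
  Adj_R2 = sapply(models, adj_r2),
  row.names = NULL
)
```

Lower AIC and BIC indicate a better trade-off between fit and complexity; BIC penalises extra parameters more heavily than AIC at this sample size.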

7.6 Summary panel (all four models)

p_crt_panel

Figure 11: Comparison of four approaches to modelling CRT time → accuracy. Top-left: OLS linear. Top-right: GAM smooth. Bottom-left: Quadratic OLS. Bottom-right: Segmented regression with estimated breakpoint.

8 Conditioning on gender and role

Note

The following analyses condition quiz errors and CRT4 performance on gender (Male / Female) and lab role (P1 LEEN / P2 CoCoLab) separately — pooled across games. One stratifying variable is shown per plot: no cross-tabulation with game_id.

Each violin/boxplot pair is annotated with a Mann-Whitney U test (two-sided, unpaired). The stacked-bar perfection charts show a χ² test (Monte Carlo p-value, 2000 replicates). Regression models include all three predictors simultaneously (game + gender + role).

8.1 Quiz errors by gender

p_quiz_errors_gender

Figure 12: Distribution of quiz errors (rules Q1–6, signals Q7–9, total) by gender [log(1+x) y-axis]. Violin + boxplot overlay, pooled across all games. Tick labels show raw error counts. Mann-Whitney U test p-value shown at the top-left of each panel.

p_perf_gender

Figure 13: Proportion achieving a perfect signal-quiz score (Q7–9), by gender. χ² test (Monte Carlo) shown in subtitle.

8.2 Quiz errors by role

p_quiz_errors_role

Figure 14: Distribution of quiz errors (rules Q1–6, signals Q7–9, total) by lab role [log(1+x) y-axis]. Violin + boxplot overlay, pooled across all games. Tick labels show raw error counts. Mann-Whitney U test p-value shown at the top-left of each panel.

p_perf_role

Figure 15: Proportion achieving a perfect signal-quiz score, by lab role. χ² test (Monte Carlo) shown in subtitle.

8.3 CRT4 by gender and role

p_crt4_demo_panel

Figure 16: CRT4 score distribution by gender (left) and by lab role (right). Stacked proportion bars show the share of participants at each discrete score level (0–4): red = 0 correct → teal = 4 correct. Labels show % where ≥ 7%. Annotation: Mann-Whitney U on CRT_net = correct − intuitive with rank-biserial r.

8.4 Regressions with demographic controls

8.4.1 Quiz errors — OLS

The table reports OLS estimates for each of the three quiz-error outcomes, regressed on game, gender, and role simultaneously. Reference categories: game = BS, gender = Female (the coefficient is labelled genderMale), role = P1 (LEEN).

gt_ols8_quiz

OLS: quiz errors ~ game + gender + role
OLS. Reference: game = BS, gender = Female, role = P1 (LEEN). 95% CI from confint().
Outcome	Predictor	β	SE	95% CI lo	95% CI hi	t	p	Sig.
Outcome: Signal errors (Q7–9)
Signal errors (Q7–9)	Game: MP vs BS	-0.275	0.141	-0.555	0.005	-1.946	0.054
Signal errors (Q7–9)	Game: PD vs BS	-0.268	0.144	-0.554	0.017	-1.861	0.065
Signal errors (Q7–9)	Game: SH vs BS	-0.281	0.139	-0.557	-0.006	-2.022	0.045	*
Signal errors (Q7–9)	genderMale	0.069	0.101	-0.131	0.269	0.679	0.498
Signal errors (Q7–9)	Role: CoCoLab vs LEEN	0.049	0.101	-0.150	0.249	0.488	0.626
Outcome: Rules errors (Q1–6)
Rules errors (Q1–6)	Game: MP vs BS	2.545	1.869	-1.157	6.247	1.362	0.176
Rules errors (Q1–6)	Game: PD vs BS	4.025	1.905	0.252	7.798	2.113	0.037	*
Rules errors (Q1–6)	Game: SH vs BS	-0.688	1.838	-4.328	2.953	-0.374	0.709
Rules errors (Q1–6)	genderMale	-0.722	1.335	-3.367	1.922	-0.541	0.59
Rules errors (Q1–6)	Role: CoCoLab vs LEEN	0.590	1.331	-2.047	3.227	0.443	0.658
Outcome: Total errors
Total errors	Game: MP vs BS	2.270	1.874	-1.442	5.982	1.211	0.228
Total errors	Game: PD vs BS	3.757	1.910	-0.027	7.540	1.966	0.052
Total errors	Game: SH vs BS	-0.969	1.843	-4.619	2.682	-0.526	0.6
Total errors	genderMale	-0.654	1.339	-3.306	1.998	-0.488	0.626
Total errors	Role: CoCoLab vs LEEN	0.639	1.335	-2.005	3.283	0.479	0.633
β = OLS coefficient. * p < .05, ** p < .01, *** p < .001.
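The structure of such a regression, and the way R names coefficients relative to reference categories, can be sketched as follows. The data are simulated and the column names are hypothetical; only the model formula mirrors the table title.

```r
set.seed(7)
n <- 120
dat <- data.frame(
  game   = factor(sample(c("BS", "MP", "PD", "SH"), n, replace = TRUE)),
  gender = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  role   = factor(sample(c("P1 (LEEN)", "P2 (CoCoLab)"), n, replace = TRUE)),
  errors = rpois(n, lambda = 3)   # hypothetical error counts
)

# The first factor level is the reference category; relevel() sets it explicitly.
dat$game <- relevel(dat$game, ref = "BS")

fit <- lm(errors ~ game + gender + role, data = dat)

# Estimates, SE, t, p and 95% CI from confint() in one matrix.
est <- cbind(coef(summary(fit)), confint(fit))

# Non-reference levels appear as <variable><level>, e.g. "gameMP", "genderMale".
rownames(est)
```

With treatment contrasts, a row named `genderMale` is the difference between Male and the reference level Female, and a row named `roleP2 (CoCoLab)` is the difference from `P1 (LEEN)`.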

8.4.2 CRT4 — binary logistic regression

Note

CRT4 correct answers take only five discrete values (0–4), so OLS is poorly suited to inference on this bounded outcome. Responses are dichotomised into high reflectors (correct ≥ 3) vs low/intuitive (correct ≤ 2), and a logistic regression is estimated. Odds ratios > 1 indicate higher odds of being a high reflector relative to the reference category (BS, Female, LEEN). Wald 95% CIs are reported.

gt_logit8_crt

Logistic regression: CRT4 high reflector ~ game + gender + role
Binary outcome: correct ≥ 3 (high reflector) vs ≤ 2 (low/intuitive). OR and 95% Wald CI. Reference: BS, Female, LEEN.
Predictor	log(OR)	OR	95% CI lo	95% CI hi	z	p
Game: MP vs BS	−0.248	0.780	0.285	2.134	−0.484	0.629
Game: PD vs BS	−0.528	0.590	0.211	1.646	−1.008	0.313
Game: SH vs BS	−0.380	0.684	0.254	1.839	−0.752	0.452
genderMale	0.072	1.074	0.525	2.199	0.197	0.844
Role: CoCoLab vs LEEN	0.000	1.000	0.490	2.042	0.000	1
OR = odds ratio = exp(log-odds). Wald 95% CI. * p < .05, ** p < .01, *** p < .001.
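The dichotomisation and the odds-ratio table can be sketched in base R. The example uses simulated scores and omits the role predictor for brevity; the data frame and column names are hypothetical.

```r
set.seed(8)
n <- 120
dat <- data.frame(
  game    = factor(sample(c("BS", "MP", "PD", "SH"), n, replace = TRUE)),
  gender  = factor(sample(c("Female", "Male"), n, replace = TRUE)),
  correct = sample(0:4, n, replace = TRUE,
                   prob = c(0.05, 0.10, 0.35, 0.45, 0.05))
)

# High reflector: 3 or 4 correct answers out of 4.
dat$high_reflector <- as.integer(dat$correct >= 3)

fit <- glm(high_reflector ~ game + gender, family = binomial, data = dat)

# Wald intervals on the log-odds scale, then exponentiated to odds ratios.
log_or   <- coef(fit)
wald_ci  <- confint.default(fit)
or_table <- cbind(OR = exp(log_or), exp(wald_ci))
```

`confint.default()` returns Wald intervals (estimate ± z × SE), which matches the interval type named in the table; calling `confint()` on a `glm` would give profile-likelihood intervals instead.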

8.5 Forest plots — partial effects

p_forest8_rules

Figure 17: OLS β ± 95% CI for each predictor on rules-quiz errors (Q1–6; game + gender + role). Reference: BS, Female, LEEN. Dashed line at zero.

p_forest8_signals

Figure 18: OLS β ± 95% CI for each predictor on signal-comprehension errors (Q7–9; game + gender + role). Reference: BS, Female, LEEN. Dashed line at zero.

p_forest8_crt

Figure 19: Logistic regression odds ratios (OR) with 95% Wald CI for CRT4 high-reflector outcome (correct ≥ 3). Log scale: values > 1 = higher odds of being a high reflector; values < 1 = lower odds. Reference: BS, Female, LEEN. Dashed line at OR = 1.

9 Preliminary interpretation

84.4% of participants answered all signal-comprehension questions correctly. Total errors show the expected right skew (median = 1): most participants made few mistakes, but a high-error tail warrants monitoring in robustness checks.

On the CRT4, the sample median is 3 correct answers (IQR = 1) out of 4. The Spearman correlation between CRT4 score and total quiz errors is ρ = −0.04 (p = 0.69): close to zero and not significant, so there is no evidence in this sample that higher cognitive reflection is associated with fewer comprehension errors.