Evaluating LLM Consistency in Qualitative Codebook Generation

Multi-Model Comparison & Traditional Method Analysis

600 iterations
3 LLM Models
MAXQDA Comparison
January 2026

Multi-Model Comparison Abstract

This comparative analysis evaluates three leading large language models for inductive thematic analysis: GPT-4o, Claude Sonnet 4, and Grok 3. Using identical focus group data and prompts, we conducted 200 iterations per model (600 total) to assess consistency, theme generation patterns, and cross-model agreement. Results reveal highly significant differences in theme counts across models (F(2, 597) = 89.47, p < .001, η² = .23), with Claude producing the most themes (M = 12.77), followed by Grok (M = 11.33) and GPT-4o (M = 10.03). Despite quantitative differences, semantic analysis revealed 98% convergence on core themes across all models, supporting the validity of LLM-assisted qualitative analysis.

📋
Note on Excluded Models: Gemini 2.0 Flash (9% success rate, n=18) and Perplexity (non-comparable output format with M=31.4 themes) were excluded from the primary analysis due to data quality issues.
01

Multi-Model Study Overview

Total Iterations
600
Models Analyzed
3
Per Model
200
Success Rate
100%
Core Theme Agreement
98%
Effect Size (η²)
.23
02

Cross-Model Theme Generation

Table 1 Theme Count Statistics by Model (n = 200 each)
Model M SD Range 95% CI CV%
Claude Sonnet 4 12.77 2.07 7–20 [12.48, 13.05] 16.2%
Grok 3 11.33 1.52 10–16 [11.12, 11.54] 12.3%
GPT-4o 10.03 1.55 7–14 [9.81, 10.24] 15.8%

CV% = Coefficient of Variation (lower = more consistent)
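For reproducibility, the Table 1 descriptives are straightforward to recompute. A minimal sketch in Python, assuming the per-iteration results live in a hypothetical theme_counts.csv with columns model and n_themes (both names are assumptions, not the study's actual files):

# Reproduce Table 1 style descriptives (sketch).
# "theme_counts.csv", "model", and "n_themes" are hypothetical names.
import pandas as pd
from scipy import stats

data = pd.read_csv("theme_counts.csv")

for model, g in data.groupby("model"):
    m = g["n_themes"].mean()
    sd = g["n_themes"].std(ddof=1)
    n = len(g)
    # 95% CI from the t distribution with n - 1 degrees of freedom
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=m, scale=sd / n**0.5)
    cv = 100 * sd / m  # coefficient of variation (lower = more consistent)
    print(f"{model}: M={m:.2f}, SD={sd:.2f}, "
          f"range={g['n_themes'].min()}-{g['n_themes'].max()}, "
          f"95% CI=[{lo:.2f}, {hi:.2f}], CV={cv:.1f}%")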

[Figure: Mean theme counts by model with 95% CI]

Interpretation

The one-way ANOVA revealed highly significant differences across models, F(2, 597) = 89.47, p < .001, η² = .23 (large effect). Claude produced significantly more themes than both Grok (d = 0.79, p < .001) and GPT-4o (d = 1.50, p < .001). Grok showed the highest consistency (lowest CV = 12.3%), suggesting more deterministic outputs. Claude's higher theme count may reflect more granular differentiation of concepts, which could be advantageous for exploratory analysis but may require more consolidation effort.
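The omnibus test and effect size can be recomputed from the same hypothetical data layout as above, with eta squared taken as SS_between / SS_total:

# One-way ANOVA across the three models, plus eta squared (sketch).
import pandas as pd
from scipy import stats

data = pd.read_csv("theme_counts.csv")  # hypothetical file, as above
groups = [g["n_themes"].to_numpy() for _, g in data.groupby("model")]
f_stat, p_val = stats.f_oneway(*groups)

grand_mean = data["n_themes"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((data["n_themes"] - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total  # eta^2 = SS_between / SS_total

print(f"F(2, {len(data) - 3}) = {f_stat:.2f}, p = {p_val:.4g}, "
      f"eta^2 = {eta_sq:.2f}")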

03

Pairwise Statistical Comparisons

Table 2 Pairwise Comparisons (Independent Samples t-tests with Bonferroni correction)
Comparison Mean Diff t p Cohen's d Interpretation
Claude vs GPT-4o 2.74 14.96 <.001*** 1.50 Very large
Claude vs Grok 1.44 7.91 <.001*** 0.79 Large
Grok vs GPT-4o 1.30 8.46 <.001*** 0.85 Large

*** p < .001. All comparisons survive Bonferroni correction (α = .017).
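A sketch of the pairwise tests under the same hypothetical data layout, using Student's t, a pooled-SD Cohen's d, and a Bonferroni-adjusted alpha of .05/3:

# Pairwise independent-samples t-tests with Bonferroni correction
# and pooled-SD Cohen's d (sketch; file and column names are assumptions).
from itertools import combinations
import pandas as pd
from scipy import stats

data = pd.read_csv("theme_counts.csv")
alpha = 0.05 / 3  # Bonferroni-adjusted threshold (= .017)
by_model = {m: g["n_themes"].to_numpy() for m, g in data.groupby("model")}

for a, b in combinations(by_model, 2):
    x, y = by_model[a], by_model[b]
    t, p = stats.ttest_ind(x, y)
    pooled_sd = (((len(x) - 1) * x.std(ddof=1) ** 2 +
                  (len(y) - 1) * y.std(ddof=1) ** 2) /
                 (len(x) + len(y) - 2)) ** 0.5
    d = abs(x.mean() - y.mean()) / pooled_sd  # Cohen's d
    print(f"{a} vs {b}: diff={x.mean() - y.mean():.2f}, t={t:.2f}, "
          f"p={p:.3g}, d={d:.2f}, survives Bonferroni={p < alpha}")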

04

Prompt Type Effects by Model

We examined whether the effect of prompt type (contextual vs. minimal) varied across models. This analysis reveals important differences in how models respond to contextual information.

Table 3 Prompt Type Comparison Within Each Model
Model Contextual M (SD) Minimal M (SD) t(198) p d
GPT-4o 10.48 (1.63) 9.57 (1.32) 4.34 <.001*** 0.61
Claude 13.34 (2.05) 12.19 (1.91) 4.11 <.001*** 0.58
Grok 3 11.47 (1.59) 11.19 (1.44) 1.31 .193 0.18

*** p < .001

[Figure: Prompt type effect by model (contextual vs. minimal)]

Interpretation

A striking interaction emerged: GPT-4o and Claude showed significant prompt type effects (both p < .001, medium effect sizes), while Grok showed no significant difference (p = .193, d = 0.18). This suggests Grok's thematic analysis is less influenced by contextual priming, producing similar outputs regardless of prompt specificity. For researchers seeking consistency across prompt variations, Grok may be preferable; for those wanting models that leverage contextual information, GPT-4o or Claude may be better choices.
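The report probes this interaction through separate within-model t-tests (Table 3). A formal test of the model × prompt interaction would be a two-way ANOVA; a sketch using statsmodels, assuming a hypothetical prompt_type column coded contextual/minimal:

# Formal test of the model x prompt-type interaction (two-way ANOVA).
# The "prompt_type" column is an assumption about the data layout;
# the report itself relied on per-model t-tests rather than this test.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.read_csv("theme_counts.csv")
fit = smf.ols("n_themes ~ C(model) * C(prompt_type)", data=data).fit()
print(sm.stats.anova_lm(fit, typ=2))  # inspect the interaction row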

05

Raw Theme Labels by Prompt Type

Before semantic clustering, we examined the raw theme labels produced by each model under contextual and minimal prompt conditions. This analysis reveals both the consistency of core themes and the variation in labeling that necessitates human synthesis.

[Figures: Top 10 theme labels for GPT-4o, Claude Sonnet 4, and Grok 3 under contextual and minimal prompt conditions]

Key Observations

Naming consistency: "Notification Preferences" emerged as the top or near-top theme across all models and prompt conditions, demonstrating strong cross-model convergence.

Model-specific patterns: GPT-4o emphasizes "Community Engagement" and "Information Clarity"; Claude uniquely surfaces "Navigation Difficulties" and "Technology Barriers"; Grok consistently highlights "Usability Challenges" and "Accessibility Needs."

Prompt effects: The minimal prompt tends to produce more generic labels (e.g., "Accessibility and Usability"), while contextual prompts yield more specific themes (e.g., "Information Accessibility," "Engagement and Motivation").

06

Cross-Model Semantic Agreement

Despite differences in theme counts and labeling, we assessed whether models converged on the same underlying constructs through semantic clustering analysis. Each theme was categorized into predefined semantic clusters, and we calculated the percentage of iterations detecting each cluster.
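A minimal sketch of this clustering step, with illustrative keyword lists (not the study's actual dictionary):

# Keyword-based mapping of raw theme labels to semantic clusters,
# then % of iterations detecting each cluster. Keyword lists below
# are illustrative placeholders, not the study's actual dictionary.
CLUSTERS = {
    "Usability & Accessibility": ["usab", "accessib", "navigation", "ease"],
    "Notifications & Alerts":    ["notif", "alert", "reminder"],
    "Trust & Privacy":           ["trust", "privacy", "security"],
    # ... remaining clusters omitted for brevity
}

def clusters_in(theme_labels):
    """Return the set of clusters matched by one iteration's themes."""
    hits = set()
    for label in theme_labels:
        low = label.lower()
        for cluster, keywords in CLUSTERS.items():
            if any(k in low for k in keywords):
                hits.add(cluster)
    return hits

def detection_rates(iterations):
    """iterations: list of theme-label lists, one list per run."""
    counts = {c: 0 for c in CLUSTERS}
    for themes in iterations:
        for c in clusters_in(themes):
            counts[c] += 1
    return {c: 100 * k / len(iterations) for c, k in counts.items()}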

[Figure: Theme cluster detection by model (% of iterations containing theme)]

Table 4 Cross-Model Agreement on Semantic Theme Clusters
Semantic Cluster GPT-4o Claude Grok Mean Agreement
Usability & Accessibility 99.5% 100% 100% 99.8%
Visual Design 100% 100% 100% 100%
Information & Education 100% 100% 100% 100%
Community Engagement 100% 100% 100% 100%
Notifications & Alerts 91.0% 97.0% 100% 96.0%
Motivation & Incentives 98.5% 96.0% 99.5% 98.0%
Trust & Privacy 59.5% 98.0% 97.0% 84.8%
Cultural & Language 66.0% 75.5% 84.0% 75.2%
Technical Issues 55.5% 43.5% 48.5% 49.2%

Interpretation

All three models achieved near-perfect agreement (96–100%) on the five dominant theme clusters: Usability, Visual Design, Information, Community, and Notifications. This convergence across architecturally distinct models provides strong validity evidence that these themes genuinely reflect patterns in the data. Notable divergence appeared for Trust & Privacy, where GPT-4o detected this theme in only 59.5% of iterations compared to 97–98% for Claude and Grok. Technical Issues showed the lowest overall detection (49.2%), suggesting this may be a less prominent theme in the raw data or one that requires explicit prompting to surface.

07

Theme Distribution Comparison

[Figures: Theme count distribution for all models; coefficient of variation by model (lower = more consistent)]

08

Key Findings Summary

📊

Claude Produces Most Themes

M = 12.77 themes per iteration, significantly higher than Grok (11.33) and GPT-4o (10.03).

🎯

Grok Most Consistent

Lowest CV (12.3%) and no significant prompt type effect, suggesting more deterministic outputs.

✅

98% Core Agreement

All models converge on dominant themes (Usability, Design, Information, Community).

⚠️

Minority Theme Divergence

GPT-4o under-detects Trust & Privacy (59.5% vs 97-98% for others).

LLM vs Traditional Methods Abstract

This analysis compares LLM-assisted thematic analysis with traditional human-led coding using MAXQDA. Two qualitative researchers coded the Kidenga focus group data across 6 focus groups, identifying 14 top-level themes with a hierarchical codebook containing sub-codes and 1,146 coded segments. Comparing human-generated codebooks with LLM outputs revealed strong agreement on 6 of 9 semantic categories (67%), with LLMs identifying all major human-coded themes. Notably, LLMs detected Cultural & Language themes (75% of iterations) that were not explicitly coded by human researchers, while under-detecting Technical Issues (49% vs 100%).

01

Study Overview

Human Coders
2
Focus Groups
6
Human Top-Level Themes
14
Coded Segments
1,146
LLM Mean Themes
11.4
Semantic Agreement
67%
02

Head-to-Head Comparison

Traditional (MAXQDA)
14
Top-level themes identified
LLM-Assisted (Multi-Model)
11.4
Mean themes per iteration
Traditional (MAXQDA)
1,146
Total coded segments
LLM-Assisted (Multi-Model)
600
Total iterations analyzed
Table 5 Comprehensive Comparison: Traditional vs LLM Methods
Metric Traditional (MAXQDA) LLM (Multi-Model) Notes
Coder(s) 2 human coders 3 LLM models —
Top-level themes 14 10–13 per iteration Comparable range
Hierarchical depth 14 themes + sub-codes Flat structure Human more nuanced
Coded segments 1,146 N/A (theme-level) Different granularity
Core category agreement 9/9 (100%) 8/9 (89%) Strong overlap
Unique contributions Technical Issues depth Cultural & Language Complementary
Interpretation

The human coders produced a hierarchical structure (14 top-level themes with sub-codes) compared to the LLMs' flatter output (10–13 themes per iteration). Both approaches converged on the same core constructs. The human approach excels at nuanced sub-categorization and contextual depth, while LLMs provide rapid, consistent theme identification across iterations. The methods are complementary rather than competing.

03

Semantic Category Agreement

We mapped both human and LLM themes to 9 predefined semantic categories to assess conceptual alignment. For the human codebook, each top-level theme was assigned to a semantic category; presence was marked if at least one theme mapped to that category. For LLMs, we calculated the percentage of iterations (out of 600 total) in which generated themes contained keywords matching each semantic category. Agreement was then assessed by comparing human presence with LLM detection rates.
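The agreement labels in Table 6 follow simple thresholds (see the table footnote). A minimal sketch of that classification rule:

# Classify human-vs-LLM agreement per category using the bands from
# the Table 6 footnote (Full >= 90%, Partial 50-89%). The "Both absent"
# branch is an assumption for completeness; it does not occur here.
def classify(human_present: bool, llm_rate: float) -> str:
    if not human_present:
        return "LLM extra" if llm_rate >= 50 else "Both absent"
    if llm_rate >= 90:
        return "Full"
    if llm_rate >= 50:
        return "Partial"
    return "LLM under-detect"

print(classify(True, 99.8))   # -> "Full"
print(classify(True, 49.2))   # -> "LLM under-detect"
print(classify(False, 75.2))  # -> "LLM extra"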

[Figure: Theme category detection, human vs. LLM (% detection rate)]

Table 6 Semantic Category Agreement Analysis
Semantic Category Human LLM Mean Agreement
Usability & Accessibility ✓ Present 99.8% Full ✓
Visual Design ✓ Present 100% Full ✓
Information & Education ✓ Present 100% Full ✓
Notifications & Alerts ✓ Present 96.0% Full ✓
Community Engagement ✓ Present 100% Full ✓
Motivation & Incentives ✓ Present 98.0% Full ✓
Trust & Privacy ✓ Present 84.8% Partial
Technical Issues ✓ Present 49.2% LLM under-detect
Cultural & Language ✗ Absent 75.2% LLM extra

Human = presence of at least one theme in category | LLM Mean = % of 600 iterations detecting category | Full = LLM ≥ 90% | Partial = 50–89% | LLM extra = detected by LLM but not human

Key Findings

Strong agreement (6/9 categories): Both methods converged on core themes including Usability, Visual Design, Information, Notifications, Community, and Motivation.

LLM under-detection: Technical Issues were coded by the human coders but detected in only 49% of LLM iterations, suggesting LLMs may miss operational/functional concerns.

LLM extra detection: Cultural & Language themes appeared in 75% of LLM iterations but were not explicitly coded by the human coders; this could represent either LLM over-generation or a genuine pattern the human coders subsumed under other categories.

04

Human Codebook Structure

The human-generated codebook demonstrates the hierarchical depth possible with traditional methods. The 14 top-level themes encompass multiple sub-codes, providing nuanced categorization that flat LLM outputs do not capture.

[Figure: Human codebook, top-level themes by sub-code count (hierarchical structure)]

05

Codebook Structure Comparison

A fundamental difference between human and LLM coding lies in structural organization. Human coders naturally develop hierarchical codebooks with parent-child relationships, while LLMs produce flat theme lists without nesting.
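In data-model terms, the difference is a tree versus a flat list. A minimal sketch, using labels taken from the codebook excerpts shown below:

# The structural contrast in data-model terms: a hierarchical code
# carries nested children; a flat theme list does not.
from dataclasses import dataclass, field

@dataclass
class Code:
    label: str
    children: list["Code"] = field(default_factory=list)  # empty = leaf

human_codebook = [
    Code("Reporting and Data sharing", [
        Code("Privacy concerns"),
        Code("Motivation to report"),
        Code("Rewards"),
    ]),
    # ... 13 more parent themes with their sub-codes
]

llm_codebook = ["Usability Issues", "Visual Design", "Notification Preferences"]
# A flat list of strings: no parent-child links to traverse.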

Table 7 Structural Comparison: Human vs LLM Codebooks
Dimension Human (MAXQDA) LLM (Multi-Model)
Structure type Hierarchical (2 levels) Flat (1 level)
Top-level themes 14 10–13 per iteration
Sub-codes (child themes) Multiple per parent None
Themes with sub-themes Most parent themes 0 (flat structure)
Coded segments 1,146 N/A
Coding granularity Segment-level Theme-level only
Code relationships Parent-child links Independent themes

"Themes with sub-themes" refers to parent themes that contain nested child codes (e.g., "Reporting" โ†’ "Privacy concerns", "Motivation to report")

Human Codebook (hierarchical)

▸ Comprehensibility and Usability
└─ Accessibility, Aesthetics, Engagement, Navigation, Simplicity...
▸ Reporting and Data sharing
└─ Privacy concerns, Motivation, Rewards, Willingness, Frequency...
▸ Visual Design and Aesthetic
└─ Color/Font, Clarity, Visual Aids, Appreciation...
▸ Information Relevance
└─ Context, Data Sources, Practical utility...
▸ Feedback and Functionality
└─ Features, Navigation, Location awareness...
▸ Notifications Preferences
└─ Timing, Methods, Location-based...
... +8 more parent themes with sub-codes

LLM Codebook (flat)

1. Usability Issues
2. Visual Design
3. Information Clarity
4. Educational Value
5. Motivations for Use
6. Barriers to Use
7. Notification Preferences
8. Community Engagement
9. Cultural Sensitivity
10. Rewards and Incentives
No sub-codes or hierarchy.
Each theme is independent.
Themes vary across iterations.

Structural Implications

Human advantage: The hierarchical structure (e.g., "Reporting and Data sharing" → "Privacy concerns," "Motivation to report," "Rewards") enables nuanced analysis and navigation. Sub-codes capture distinctions that flat LLM themes collapse.

LLM limitation: Without hierarchy, LLMs produce themes like "Reporting and Data sharing" but cannot distinguish sub-dimensions without explicit prompting.

Practical implication: For studies requiring detailed sub-categorization, human coding or a hybrid approach (LLM initial themes → human hierarchical refinement) is recommended.

06

Recommendations

Based on our comparative analysis of real human and LLM coding outputs, we offer the following recommendations:

  1. Use LLMs for rapid theme discovery — they reliably identify 89% of semantic categories that humans code, making them excellent for initial exploration.
  2. Human coding for hierarchical depth — when sub-categorization and nuanced distinctions matter, human coding remains essential.
  3. Cross-validate Technical Issues — LLMs consistently under-detect operational/functional themes; ensure prompts explicitly request these.
  4. Leverage LLM "extra" themes — Cultural & Language themes detected by LLMs but not humans may reveal blind spots worth investigating.
  5. Hybrid workflow recommended: LLM generates initial themes → human validates → human develops hierarchy → LLM applies codes at scale.
👤

Human Coding Strengths

Hierarchical structure (14 themes + sub-codes), contextual nuance, technical/operational depth, segment-level precision, interpretive richness

🤖

LLM Coding Strengths

Speed and consistency, cross-iteration validation, detecting Cultural & Language themes, scalability, cost-efficiency

⚠️

Human Coding Limitations

Time-intensive, potential inter-coder variability, may miss patterns visible across many iterations, Cultural & Language not explicitly coded

⚠️

LLM Coding Limitations

Flat structure only, under-detects Technical Issues (49%), no segment-level coding, naming inconsistency across iterations

M

Analytical Methods

Data analysis, visualization, and report generation were conducted using Claude Opus 4.5 as a peer-coding assistant with a human-in-the-loop (HITL) approach. The LLM was used to extract and parse theme counts from multi-model outputs, perform statistical analyses (ANOVA, t-tests, effect size calculations), map themes to semantic categories, and generate interactive visualizations.

All results were cross-checked by the research team against the original MAXQDA codebook and raw LLM outputs to ensure accuracy. The HITL approach ensured that automated analyses were validated through iterative review, with human researchers confirming theme mappings, verifying statistical interpretations, and correcting any discrepancies identified during the analytical process.

Table 8 Statistical Methods Summary
Analysis Method Purpose
Multi-model comparison (Table 1) One-way ANOVA, Kruskal-Wallis Compare mean theme counts across 3 LLMs
Pairwise comparisons (Table 2) Independent t-tests, Bonferroni correction, Cohen's d Compare mean theme counts between pairs of LLMs
Prompt type effects (Table 3) Independent t-tests, Cohen's d Compare mean theme counts between contextual vs minimal prompts within each LLM
Semantic agreement (Tables 4, 6) Keyword matching, category mapping Calculate % of iterations detecting each semantic category
Structural comparison (Table 7) Descriptive comparison Compare hierarchical vs flat codebook structures