Evaluating LLM Consistency in Qualitative Codebook Generation

Multi-Model Comparison & Traditional Method Analysis

1,000 iterations
5 LLM Models
MAXQDA Comparison
January 2026

Abstract

This study evaluates the consistency and reliability of using large language models (LLMs) for inductive thematic analysis in qualitative research. Using focus group data from the Kidenga mobile health application evaluation in Puerto Rico, we conducted 200 independent iterations of codebook generation using GPT-4o across two prompt conditions (contextual and minimal) and three iteration batch sizes (20, 30, and 50). Results demonstrate strong thematic convergence across iterations, with the model consistently identifying 8–9 core theme clusters. Statistical analysis revealed a significant difference between prompt types in theme generation (t(198) = 4.25, p < .001, d = 0.60), with contextual prompts producing more themes per iteration (M = 10.72) than minimal prompts (M = 9.81). No significant differences were found across iteration batch sizes (F(2, 197) = 0.16, p = .85), suggesting that 20 iterations may be sufficient for stable results.

01

Study Overview

Total Iterations
200
Success Rate
100%
Avg Input Tokens
23,084
Avg Output Tokens
580
Contextual Runs
100
Minimal Runs
100

The application of large language models to qualitative research methods represents an emerging area of methodological inquiry. This study addresses a fundamental methodological question: when an LLM is asked to generate a codebook from the same qualitative dataset multiple times, how consistent are the resulting themes?
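A minimal sketch of the iteration procedure is shown below, assuming the OpenAI Python SDK; the system prompt, task prompt, JSON output contract, and parsing are illustrative assumptions rather than the study's exact materials.

```python
# Minimal sketch of one iteration batch (assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment). The prompts and the JSON output contract
# are illustrative stand-ins for the study's contextual/minimal prompt conditions.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a qualitative researcher performing inductive thematic analysis."
TASK_PROMPT = (
    "Read the focus group transcript below and return a JSON array of themes, "
    "each an object with a short 'label' and a one-sentence 'definition'.\n\n{transcript}"
)

def generate_codebook(transcript: str, model: str = "gpt-4o") -> list[dict]:
    """One independent codebook-generation call."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": TASK_PROMPT.format(transcript=transcript)},
        ],
        temperature=1.0,  # default sampling, so every call is an independent draw
    )
    return json.loads(response.choices[0].message.content)  # assumes valid JSON output

def run_iterations(transcript: str, n_iterations: int = 20) -> list[list[dict]]:
    """Collect n independent codebooks for downstream convergence analysis."""
    return [generate_codebook(transcript) for _ in range(n_iterations)]
```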

02

Principal Findings

📊

Strong Thematic Convergence

Across 200 independent runs, the model consistently converged on 8–9 core theme clusters.

🎯

Prompt Type Matters

Contextual prompts produced significantly more themes (M = 10.7 vs 9.8, p < .001, d = 0.60).

🔄

20 Iterations Suffice

No significant differences across batch sizes (p = .85). Fewer iterations reduce costs.

03

Theme Count by Prompt Type

Table 1 Descriptive Statistics for Theme Counts by Prompt Type
Prompt Type M SD Range 95% CI
Contextual (n = 100) 10.72 1.65 7–15 [10.39, 11.05]
Minimal (n = 100) 9.81 1.36 7–15 [9.54, 10.08]
Interpretation

The significant difference between prompt types (t(198) = 4.25, p < .001, d = 0.60) suggests that providing study context enables the model to identify more nuanced themes. However, more themes are not necessarily better; the additional themes may represent over-splitting of constructs.
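For reference, a sketch of this comparison in Python (scipy) is shown below; random placeholder counts stand in for the real per-iteration theme counts, and the pooled-variance form reproduces the df = 198 reported above.

```python
# Sketch of the prompt-type comparison; placeholder data stand in for the
# real per-iteration theme counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
contextual = rng.normal(10.72, 1.65, 100)   # placeholder for 100 contextual counts
minimal = rng.normal(9.81, 1.36, 100)       # placeholder for 100 minimal counts

# Student's t-test (pooled variance), giving df = n1 + n2 - 2 = 198.
res = stats.ttest_ind(contextual, minimal, equal_var=True)

# Cohen's d using the pooled standard deviation.
pooled_sd = np.sqrt(((len(contextual) - 1) * contextual.var(ddof=1) +
                     (len(minimal) - 1) * minimal.var(ddof=1)) /
                    (len(contextual) + len(minimal) - 2))
cohens_d = (contextual.mean() - minimal.mean()) / pooled_sd

print(f"t(198) = {res.statistic:.2f}, p = {res.pvalue:.4g}, d = {cohens_d:.2f}")
```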

Theme Distribution contextual prompt

Theme Distribution minimal prompt

04

Thematic Convergence Analysis

Consolidated Theme Frequency n = 200 iterations

Interpretation

The semantic clustering reveals a clear hierarchy of theme salience. Four core clusters appeared in the majority of iterations: Information & Education (100.5%; frequencies can exceed 100% when a single run contributes more than one theme to the same cluster), Usability & Accessibility (89%), Notifications & Alerts (79%), and Community Engagement (79%). Trust & Privacy and Cultural & Language themes were identified in only 12–13% of iterations, suggesting they may require more explicit prompting.
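The report does not reproduce the exact consolidation procedure; the sketch below shows one plausible implementation using sentence embeddings and agglomerative clustering. The embedding model and distance threshold are illustrative choices, and dividing each cluster's total count by the number of iterations yields per-iteration frequencies like those reported above.

```python
# One plausible way to consolidate free-text theme labels into semantic clusters.
# Assumes sentence-transformers and scikit-learn; "all-MiniLM-L6-v2" and the
# distance threshold are illustrative choices, not the study's exact settings.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

def consolidate_themes(labels: list[str], distance_threshold: float = 0.8) -> Counter:
    """Group similar theme labels and count how often each cluster appears."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = normalize(embedder.encode(labels))  # unit-length vectors

    clustering = AgglomerativeClustering(
        n_clusters=None,                 # let the threshold decide the cluster count
        distance_threshold=distance_threshold,
        linkage="average",
    ).fit(embeddings)

    clusters: dict[int, list[str]] = {}
    for label, cluster_id in zip(labels, clustering.labels_):
        clusters.setdefault(int(cluster_id), []).append(label)

    # Name each cluster after its most frequent member label.
    return Counter({Counter(members).most_common(1)[0][0]: len(members)
                    for members in clusters.values()})
```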

⚠️
Projected Results: This section presents simulated data for demonstration purposes, showing how the final multi-model comparison report will be structured when all experiments are complete.

Multi-Model Comparison Abstract

This comparative analysis evaluates five leading large language models for inductive thematic analysis: GPT-4o, Claude Opus 4.5, Gemini 3, Perplexity, and xAI Grok. Using identical focus group data and prompts, we conducted 200 iterations per model (1,000 total) to assess consistency, theme generation patterns, and cross-model agreement. Results reveal significant differences in theme counts across models (F(4, 995) = 18.42, p < .001, η² = .07), with Claude Opus 4.5 producing the most themes (M = 11.34) and Perplexity the fewest (M = 9.12). Despite quantitative differences, semantic analysis revealed 85% convergence on core themes across all models, supporting the validity of LLM-assisted qualitative analysis.

01

Multi-Model Study Overview

Total Iterations
1,000
Models Tested
5
Iterations/Model
200
Overall Success
98.7%
Core Theme Agreement
85%
Total API Cost
$247
02

Cross-Model Theme Generation

Each model was evaluated using identical prompts (contextual condition) across 200 independent iterations. The following table presents descriptive statistics for theme counts by model.

Table 5 Theme Count Statistics by Model (Contextual Prompt, n = 200 each)
Model M SD Range 95% CI CV%
Claude Opus 4.5 11.34 1.42 8–15 [11.14, 11.54] 12.5%
GPT-4o 10.72 1.65 7–15 [10.49, 10.95] 15.4%
Gemini 3 10.18 1.89 6–15 [9.92, 10.44] 18.6%
xAI Grok 9.87 1.71 6–14 [9.63, 10.11] 17.3%
Perplexity 9.12 1.28 6–12 [8.94, 9.30] 14.0%

CV% = Coefficient of Variation (lower = more consistent)

Mean Theme Counts by Model with 95% CI

Interpretation

The one-way ANOVA revealed significant differences across models, F(4, 995) = 18.42, p < .001, η² = .07. Post-hoc Tukey HSD tests showed Claude Opus 4.5 produced significantly more themes than all other models (all p < .01), while Perplexity produced significantly fewer than GPT-4o and Claude (both p < .001). Notably, Perplexity showed the highest consistency (lowest CV), possibly due to its retrieval-augmented approach constraining output variation. Claude's higher theme count may reflect more granular differentiation of concepts.
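A sketch of the omnibus and post-hoc tests is shown below, assuming per-iteration theme counts in a long-format DataFrame; illustrative random data stand in for the 200 counts per model.

```python
# Sketch of the cross-model comparison; illustrative random data stand in for
# the real per-iteration theme counts.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
model_means = {"Claude Opus 4.5": 11.34, "GPT-4o": 10.72, "Gemini 3": 10.18,
               "xAI Grok": 9.87, "Perplexity": 9.12}
df = pd.DataFrame([{"model": name, "n_themes": rng.normal(mean, 1.6)}
                   for name, mean in model_means.items() for _ in range(200)])

# One-way ANOVA across the five models.
groups = [grp["n_themes"].to_numpy() for _, grp in df.groupby("model")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F({len(groups) - 1}, {len(df) - len(groups)}) = {f_stat:.2f}, p = {p_val:.3g}")

# Tukey HSD post-hoc pairwise comparisons (as in Table 6).
print(pairwise_tukeyhsd(endog=df["n_themes"], groups=df["model"], alpha=0.05))
```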

03

Pairwise Statistical Comparisons

Table 6 Pairwise Comparisons (Tukey HSD)
Comparison Mean Diff SE t p (adj) Cohen's d
Claude vs GPT-4o 0.62 0.15 4.13 .002** 0.40
Claude vs Gemini 1.16 0.15 7.73 <.001*** 0.69
Claude vs Grok 1.47 0.15 9.80 <.001*** 0.93
Claude vs Perplexity 2.22 0.15 14.80 <.001*** 1.64
GPT-4o vs Gemini 0.54 0.15 3.60 .012* 0.30
GPT-4o vs Grok 0.85 0.15 5.67 <.001*** 0.50
GPT-4o vs Perplexity 1.60 0.15 10.67 <.001*** 1.08
Gemini vs Grok 0.31 0.15 2.07 .236 0.17
Gemini vs Perplexity 1.06 0.15 7.07 <.001*** 0.66
Grok vs Perplexity 0.75 0.15 5.00 <.001*** 0.50

* p < .05, ** p < .01, *** p < .001. Effect sizes: small (0.2), medium (0.5), large (0.8)

04

Cross-Model Semantic Agreement

Despite differences in theme counts, we assessed whether models converged on the same underlying constructs through semantic clustering analysis. The following chart shows the percentage of iterations in which each model identified themes within each semantic cluster.
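A sketch of how the detection rates and the agreement column can be computed is shown below, assuming one boolean DataFrame per model (one row per iteration, one column per semantic cluster).

```python
# Sketch of the Table 7 computation. Assumes a dict mapping each model name to
# a boolean DataFrame with one row per iteration and one column per semantic
# cluster (True if any theme in that run fell into the cluster).
import pandas as pd

def detection_rates(detections_by_model: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Percent of iterations containing each cluster, per model, plus the cross-model mean."""
    rates = pd.DataFrame({model: detections.mean() * 100   # column-wise share of True
                          for model, detections in detections_by_model.items()})
    rates["Agreement"] = rates.mean(axis=1)                 # simple average across models
    return rates.round(1)
```

With the values in Table 7, this simple cross-model average reproduces the Agreement column (e.g., (100 + 100 + 98 + 97 + 99) / 5 = 98.8% for Information & Education).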

Theme Cluster Detection by Model % of iterations containing theme

Table 7 Cross-Model Agreement on Core Themes
Semantic Cluster GPT-4o Claude Gemini Grok Perplexity Agreement
Information & Education 100% 100% 98% 97% 99% 98.8%
Usability & Accessibility 89% 94% 86% 82% 91% 88.4%
Notifications & Alerts 79% 85% 76% 71% 68% 75.8%
Community Engagement 79% 88% 74% 69% 72% 76.4%
Visual Design 61% 72% 58% 54% 49% 58.8%
Trust & Privacy 12% 24% 18% 15% 11% 16.0%
Interpretation

All five models achieved near-perfect agreement (98.8%) on the dominant theme cluster (Information & Education), and strong agreement (88.4%) on Usability. This convergence across architecturally distinct models provides validity evidence that these themes genuinely reflect patterns in the data. Claude showed the highest detection rate for minority themes like Trust & Privacy (24% vs 12% for GPT-4o), suggesting it may be more sensitive to less prominent constructs. Perplexity's lower detection of secondary themes likely reflects its more focused, retrieval-constrained outputs.

05

Model Consistency & Reliability

Coefficient of Variation by Model lower = more consistent

Theme Count Distribution all models

Interpretation

Perplexity and Claude demonstrated the highest consistency (lowest CV), while Gemini showed the most variation. This may reflect differences in temperature settings, training approaches, or architectural choices. For applications requiring reproducibility, Perplexity or Claude may be preferable, while Gemini's higher variation could be advantageous for exploratory analysis seeking diverse perspectives.
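The coefficient of variation reported above is simply SD divided by the mean, expressed as a percentage; a short sketch using the same long-format layout as the ANOVA example:

```python
# CV% per model (lower = more consistent), for a long-format DataFrame with
# "model" and "n_themes" columns as in the ANOVA sketch above.
import pandas as pd

def coefficient_of_variation(df: pd.DataFrame) -> pd.Series:
    grouped = df.groupby("model")["n_themes"]
    return (grouped.std(ddof=1) / grouped.mean() * 100).round(1).sort_values()
```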

⚠️
Projected Results: This section presents simulated data comparing LLM approaches with traditional MAXQDA analysis. Actual results will be incorporated upon completion of the human coding study.

LLM vs Traditional Methods Abstract

This analysis compares LLM-assisted thematic analysis with traditional human-led coding using MAXQDA. Three experienced qualitative researchers independently coded the Kidenga focus group data, achieving moderate inter-rater reliability (Krippendorff's α = 0.72). Comparing human-generated codebooks with LLM outputs revealed substantial agreement on core themes (Cohen's κ = 0.68), with LLMs identifying 94% of human-coded themes. However, LLMs also generated an additional 2.3 themes per iteration that human coders deemed redundant or over-split. Critically, LLM analysis required 98% less time (4.2 minutes vs 3.5 hours mean coding time) at 99.9% lower cost ($0.12 vs $105 per analysis). These findings suggest LLMs can effectively augment but not replace human qualitative analysis, offering substantial efficiency gains while requiring human oversight for theme consolidation and validation.

01

Head-to-Head Comparison

Traditional (MAXQDA)
3.5h
Mean analysis time per coder
LLM-Assisted
4.2m
Mean analysis time (20 iterations)
Traditional (MAXQDA)
$105
Cost per analysis (researcher time)
LLM-Assisted
$0.12
API cost per analysis (20 iterations)
Table 8 Comprehensive Comparison: Traditional vs LLM Methods
Metric Traditional (MAXQDA) LLM (Multi-Model Mean) Difference
Mean themes identified 8.33 10.25 +23%
Analysis time 3.5 hours 4.2 minutes -98%
Cost per analysis $105.00 $0.12 -99.9%
Inter-rater reliability (α) 0.72 0.89* +24%
Core theme coverage 100% 94% -6%
Redundant/over-split themes 0.2 2.3 +2.1
Scalability (datasets/day) 2-3 100+ +3,200%

*LLM reliability calculated as cross-iteration agreement within the same model

Interpretation

LLMs offer dramatic efficiency gains (98% time reduction, 99.9% cost reduction) while maintaining substantial agreement with human coders on core themes. However, LLMs tend to over-generate themes, producing 2.3 additional themes per analysis that human coders deemed redundant. This suggests LLMs are best used for initial theme generation, with human analysts providing consolidation and validation. The higher "reliability" of LLMs reflects deterministic patterns rather than true inter-rater agreement, as the same model processing the same input produces similar outputs.

02

Theme Agreement Analysis

We assessed agreement between human-coded themes and LLM-generated themes using semantic matching. The following analysis shows which themes were identified by both methods, by humans only, or by LLMs only.
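As a sketch, per-cluster agreement can be computed with Cohen's kappa over paired binary presence judgments; the exact unit of analysis used in the study (e.g., transcript segment or coder-iteration pair) is not reproduced here, so the vectors below are hypothetical.

```python
# Sketch of the per-cluster agreement check. Assumes two aligned binary vectors
# per theme cluster (1 = theme judged present in a given analysis unit by the
# human codebooks / by the LLM codebooks); the vectors shown are hypothetical.
from sklearn.metrics import cohen_kappa_score

def theme_agreement(human_present: list[int], llm_present: list[int]) -> float:
    """Cohen's kappa between human and LLM presence judgments for one cluster."""
    return cohen_kappa_score(human_present, llm_present)

print(theme_agreement([1, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0]))  # illustrative call
```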

Theme Detection: Human vs LLM % agreement by theme cluster

Table 9 Theme-by-Theme Agreement Analysis
Theme Cluster Human Coders LLM (Mean) Agreement (κ) Status
Information & Education ✓ (3/3) ✓ (99%) 0.94 Full Agreement
Usability & Accessibility ✓ (3/3) ✓ (88%) 0.82 Full Agreement
Notifications & Alerts ✓ (3/3) ✓ (76%) 0.71 Full Agreement
Community Engagement ✓ (3/3) ✓ (76%) 0.69 Full Agreement
Visual Design ✓ (2/3) ✓ (59%) 0.54 Moderate Agreement
Technical Barriers ✓ (3/3) ✓ (28%) 0.38 LLM Under-detect
Cultural & Language ✓ (3/3) ○ (13%) 0.22 LLM Under-detect
Trust & Privacy ✓ (2/3) ○ (16%) 0.28 LLM Under-detect
Incentives & Gamification ○ (1/3) ✓ (35%) 0.31 LLM Over-detect

✓ = Present in majority, ○ = Present in minority. κ = Cohen's kappa

Interpretation

LLMs showed strong agreement with humans on dominant themes (κ > 0.65 for the top 4 clusters) but systematically under-detected culturally specific and contextually nuanced themes like Cultural & Language (κ = 0.22) and Trust & Privacy (κ = 0.28). These themes may require more explicit prompting or domain expertise. Interestingly, LLMs over-detected Incentives & Gamification, possibly due to training data biases toward app evaluation frameworks that emphasize gamification.

03

Statistical Comparison

Table 10 Statistical Tests: Traditional vs LLM Methods
Comparison Test Statistic p-value Effect Size
Theme count (Human vs LLM) Welch's t t(202) = 8.94 <.001*** d = 1.26 (large)
Core theme agreement McNemar's χ²(1) = 2.18 .140 φ = 0.10
Overall codebook similarity Cohen's κ κ = 0.68 <.001*** Substantial
Minority theme detection Fisher's exact OR = 0.31 .004** LLM under-detect
Analysis time Mann-Whitney U = 0 <.001*** r = 1.00 (complete separation)

* p < .05, ** p < .01, *** p < .001

Key Statistical Findings

LLMs generate significantly more themes than human coders (d = 1.26, large effect), but this reflects over-splitting rather than superior detection. Despite quantitative differences, the substantial kappa (κ = 0.68) indicates meaningful agreement on codebook content. The significant under-detection of minority themes (OR = 0.31, p = .004) represents the primary validity concern for LLM-only approaches. We recommend LLMs for initial theme generation with human review, particularly for culturally specific or contextually nuanced content.
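For reference, the tests in Table 10 map onto standard scipy/statsmodels calls; the arrays and contingency tables below are illustrative placeholders, not the study data.

```python
# Illustrative calls for the tests reported in Table 10; all inputs below are
# placeholders, not the study data.
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(2)
human_themes = rng.normal(8.33, 0.6, 3)     # theme counts per human coder
llm_themes = rng.normal(10.25, 1.6, 200)    # theme counts per LLM iteration

# Welch's t-test (unequal variances) for theme counts.
welch = stats.ttest_ind(llm_themes, human_themes, equal_var=False)

# McNemar's test on paired theme detection (rows/cols: detected by human / by LLM).
detection_table = np.array([[6, 1],
                            [3, 0]])
mcnemar_res = mcnemar(detection_table, exact=False, correction=True)

# Fisher's exact test for minority-theme detection (illustrative 2x2 table).
odds_ratio, fisher_p = stats.fisher_exact([[4, 9], [13, 9]])

# Mann-Whitney U for analysis time in minutes (no normality assumption).
mwu = stats.mannwhitneyu([210, 195, 225], [4.2] * 20)

print(welch.pvalue, mcnemar_res.pvalue, fisher_p, mwu.pvalue)
```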

04

Cost-Benefit Analysis

Time Investment Comparison log scale

Cost per Analysis USD

Quality-Efficiency Tradeoff by method

05

Recommendations

Based on our comparative analysis, we offer the following recommendations for researchers choosing between traditional and LLM-assisted thematic analysis:

  1. Use LLMs for initial theme generation when analyzing large volumes of qualitative data where efficiency is critical. The 98% time reduction enables analysis at scale that would be impractical with human-only approaches.
  2. Maintain human oversight for theme consolidation to address the naming problem and over-splitting tendency. A human analyst should review and merge semantically equivalent themes.
  3. Use traditional methods for culturally-sensitive research where contextual nuance is paramount. LLMs under-detect minority themes related to culture, language, and trust.
  4. Consider hybrid workflows: LLM generates candidate themes → human validates and consolidates → LLM applies codebook to additional data → human spot-checks.
  5. Report methodology transparently: disclose model, prompt text, iteration count, and human validation procedures when publishing LLM-assisted qualitative research.
✅

LLMs Excel At

Speed, cost-efficiency, consistency, detecting dominant themes, processing large datasets, iterative analysis

⚠️

LLMs Struggle With

Cultural nuance, minority themes, contextual sensitivity, avoiding over-splitting, domain expertise