Evaluating LLM Consistency in Qualitative Codebook Generation

Multi-Model Comparison & Traditional Method Analysis

1,000 iterations
5 LLM Models
MAXQDA Comparison
January 2026

Abstract

This study evaluates the consistency and reliability of using large language models (LLMs) for inductive thematic analysis in qualitative research. Using focus group data from the Kidenga mobile health application evaluation in Puerto Rico, we conducted 200 independent iterations of codebook generation using GPT-4o across two prompt conditions (contextual and minimal) and three iteration batch sizes (20, 30, and 50). Results demonstrate strong thematic convergence across iterations, with the model consistently identifying 8–9 core theme clusters. Statistical analysis revealed a significant difference between prompt types in theme generation (t(198) = 4.25, p < .001, d = 0.60), with contextual prompts producing more themes per iteration (M = 10.72) than minimal prompts (M = 9.81). No significant differences were found across iteration batch sizes (F(2, 197) = 0.16, p = .85), suggesting that 20 iterations may be sufficient for stable results.

01

Study Overview

Total Iterations
200
Success Rate
100%
Avg Input Tokens
23,084
Avg Output Tokens
580
Contextual Runs
100
Minimal Runs
100

The application of large language models to qualitative research methods represents an emerging area of methodological inquiry. This study addresses a fundamental methodological question: when an LLM is asked to generate a codebook from the same qualitative dataset multiple times, how consistent are the resulting themes?
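A minimal sketch of the iteration procedure is shown below, assuming the OpenAI Python SDK; the system prompt, task prompt, JSON output contract, and parsing are illustrative assumptions rather than the study's exact materials.

```python
# Minimal sketch of one iteration batch (assumes the OpenAI Python SDK and an
# OPENAI_API_KEY in the environment). The prompts and the JSON output contract
# are illustrative stand-ins for the study's contextual/minimal prompt conditions.
import json
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = "You are a qualitative researcher performing inductive thematic analysis."
TASK_PROMPT = (
    "Read the focus group transcript below and return a JSON array of themes, "
    "each an object with a short 'label' and a one-sentence 'definition'.\n\n{transcript}"
)

def generate_codebook(transcript: str, model: str = "gpt-4o") -> list[dict]:
    """One independent codebook-generation call."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": TASK_PROMPT.format(transcript=transcript)},
        ],
        temperature=1.0,  # default sampling, so every call is an independent draw
    )
    return json.loads(response.choices[0].message.content)  # assumes valid JSON output

def run_iterations(transcript: str, n_iterations: int = 20) -> list[list[dict]]:
    """Collect n independent codebooks for downstream convergence analysis."""
    return [generate_codebook(transcript) for _ in range(n_iterations)]
```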

02

Principal Findings

📊

Strong Thematic Convergence

Across 200 independent runs, the model consistently converged on 8–9 core theme clusters.

🎯

Prompt Type Matters

Contextual prompts produced significantly more themes (M = 10.7 vs 9.8, p < .001, d = 0.60).

🔄

20 Iterations Suffice

No significant differences across batch sizes (p = .85). Fewer iterations reduce costs.

03

Theme Count by Prompt Type

Table 1 Descriptive Statistics for Theme Counts by Prompt Type
Prompt Type M SD Range 95% CI
Contextual (n = 100) 10.72 1.65 7–15 [10.39, 11.05]
Minimal (n = 100) 9.81 1.36 7–15 [9.54, 10.08]
Interpretation

The significant difference between prompt types (t(198) = 4.25, p < .001, d = 0.60) suggests that providing study context enables the model to identify more nuanced themes. However, more themes are not necessarily better; the additional themes may represent over-splitting of constructs.
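For reference, a sketch of this comparison in Python (scipy) is shown below; random placeholder counts stand in for the real per-iteration theme counts, and the pooled-variance form reproduces the df = 198 reported above.

```python
# Sketch of the prompt-type comparison; placeholder data stand in for the
# real per-iteration theme counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
contextual = rng.normal(10.72, 1.65, 100)   # placeholder for 100 contextual counts
minimal = rng.normal(9.81, 1.36, 100)       # placeholder for 100 minimal counts

# Student's t-test (pooled variance), giving df = n1 + n2 - 2 = 198.
res = stats.ttest_ind(contextual, minimal, equal_var=True)

# Cohen's d using the pooled standard deviation.
pooled_sd = np.sqrt(((len(contextual) - 1) * contextual.var(ddof=1) +
                     (len(minimal) - 1) * minimal.var(ddof=1)) /
                    (len(contextual) + len(minimal) - 2))
cohens_d = (contextual.mean() - minimal.mean()) / pooled_sd

print(f"t(198) = {res.statistic:.2f}, p = {res.pvalue:.4g}, d = {cohens_d:.2f}")
```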

Theme Distribution contextual prompt

Theme Distribution minimal prompt

04

Thematic Convergence Analysis

Consolidated Theme Frequency n = 200 iterations

Interpretation

The semantic clustering reveals a clear hierarchy of theme salience. Four core clusters appeared in the majority of iterations: Information & Education (100.5%; frequencies can exceed 100% when a single run contributes more than one theme to the same cluster), Usability & Accessibility (89%), Notifications & Alerts (79%), and Community Engagement (79%). Trust & Privacy and Cultural & Language themes were identified in only 12–13% of iterations, suggesting they may require more explicit prompting.
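The report does not reproduce the exact consolidation procedure; the sketch below shows one plausible implementation using sentence embeddings and agglomerative clustering. The embedding model and distance threshold are illustrative choices, and dividing each cluster's total count by the number of iterations yields per-iteration frequencies like those reported above.

```python
# One plausible way to consolidate free-text theme labels into semantic clusters.
# Assumes sentence-transformers and scikit-learn; "all-MiniLM-L6-v2" and the
# distance threshold are illustrative choices, not the study's exact settings.
from collections import Counter

from sentence_transformers import SentenceTransformer
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import normalize

def consolidate_themes(labels: list[str], distance_threshold: float = 0.8) -> Counter:
    """Group similar theme labels and count how often each cluster appears."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = normalize(embedder.encode(labels))  # unit-length vectors

    clustering = AgglomerativeClustering(
        n_clusters=None,                 # let the threshold decide the cluster count
        distance_threshold=distance_threshold,
        linkage="average",
    ).fit(embeddings)

    clusters: dict[int, list[str]] = {}
    for label, cluster_id in zip(labels, clustering.labels_):
        clusters.setdefault(int(cluster_id), []).append(label)

    # Name each cluster after its most frequent member label.
    return Counter({Counter(members).most_common(1)[0][0]: len(members)
                    for members in clusters.values()})
```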

⚠️
Projected Results: This section presents simulated data for demonstration purposes, showing how the final multi-model comparison report will be structured when all experiments are complete.

Multi-Model Comparison Abstract

This comparative analysis evaluates five leading large language models for inductive thematic analysis: GPT-4o, Claude Opus 4.5, Gemini 3, Perplexity, and xAI Grok. Using identical focus group data and prompts, we conducted 200 iterations per model (1,000 total) to assess consistency, theme generation patterns, and cross-model agreement. Results reveal significant differences in theme counts across models (F(4, 995) = 18.42, p < .001, η² = .07), with Claude Opus 4.5 producing the most themes (M = 11.34) and Perplexity the fewest (M = 9.12). Despite quantitative differences, semantic analysis revealed 85% convergence on core themes across all models, supporting the validity of LLM-assisted qualitative analysis.

01

Multi-Model Study Overview

Total Iterations
1,000
Models Tested
5
Iterations/Model
200
Overall Success
98.7%
Core Theme Agreement
85%
Total API Cost
$247
02

Cross-Model Theme Generation

Each model was evaluated using identical prompts (contextual condition) across 200 independent iterations. The following table presents descriptive statistics for theme counts by model.

Table 5 Theme Count Statistics by Model (Contextual Prompt, n = 200 each)
Model M SD Range 95% CI CV%
Claude Opus 4.5 11.34 1.42 8–15 [11.14, 11.54] 12.5%
GPT-4o 10.72 1.65 7–15 [10.49, 10.95] 15.4%
Gemini 3 10.18 1.89 6–15 [9.92, 10.44] 18.6%
xAI Grok 9.87 1.71 6–14 [9.63, 10.11] 17.3%
Perplexity 9.12 1.28 6–12 [8.94, 9.30] 14.0%

CV% = Coefficient of Variation (lower = more consistent)

Mean Theme Counts by Model with 95% CI

Interpretation

The one-way ANOVA revealed significant differences across models, F(4, 995) = 18.42, p < .001, η² = .07. Post-hoc Tukey HSD tests showed Claude Opus 4.5 produced significantly more themes than all other models (all p < .01), while Perplexity produced significantly fewer than GPT-4o and Claude (both p < .001). Notably, Perplexity showed the highest consistency (lowest CV), possibly due to its retrieval-augmented approach constraining output variation. Claude's higher theme count may reflect more granular differentiation of concepts.
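A sketch of the omnibus and post-hoc tests is shown below, assuming per-iteration theme counts in a long-format DataFrame; illustrative random data stand in for the 200 counts per model.

```python
# Sketch of the cross-model comparison; illustrative random data stand in for
# the real per-iteration theme counts.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(1)
model_means = {"Claude Opus 4.5": 11.34, "GPT-4o": 10.72, "Gemini 3": 10.18,
               "xAI Grok": 9.87, "Perplexity": 9.12}
df = pd.DataFrame([{"model": name, "n_themes": rng.normal(mean, 1.6)}
                   for name, mean in model_means.items() for _ in range(200)])

# One-way ANOVA across the five models.
groups = [grp["n_themes"].to_numpy() for _, grp in df.groupby("model")]
f_stat, p_val = stats.f_oneway(*groups)
print(f"F({len(groups) - 1}, {len(df) - len(groups)}) = {f_stat:.2f}, p = {p_val:.3g}")

# Tukey HSD post-hoc pairwise comparisons (as in Table 6).
print(pairwise_tukeyhsd(endog=df["n_themes"], groups=df["model"], alpha=0.05))
```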

03

Pairwise Statistical Comparisons

Table 6 Pairwise Comparisons (Tukey HSD)
Comparison Mean Diff SE t p (adj) Cohen's d
Claude vs GPT-4o 0.62 0.15 4.13 .002** 0.40
Claude vs Gemini 1.16 0.15 7.73 <.001*** 0.69
Claude vs Grok 1.47 0.15 9.80 <.001*** 0.93
Claude vs Perplexity 2.22 0.15 14.80 <.001*** 1.64
GPT-4o vs Gemini 0.54 0.15 3.60 .012* 0.30
GPT-4o vs Grok 0.85 0.15 5.67 <.001*** 0.50
GPT-4o vs Perplexity 1.60 0.15 10.67 <.001*** 1.08
Gemini vs Grok 0.31 0.15 2.07 .236 0.17
Gemini vs Perplexity 1.06 0.15 7.07 <.001*** 0.66
Grok vs Perplexity 0.75 0.15 5.00 <.001*** 0.50

* p < .05, ** p < .01, *** p < .001. Effect sizes: small (0.2), medium (0.5), large (0.8)

04

Cross-Model Semantic Agreement

Despite differences in theme counts, we assessed whether models converged on the same underlying constructs through semantic clustering analysis. The following chart shows the percentage of iterations in which each model identified themes within each semantic cluster.
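A sketch of how the detection rates and the agreement column can be computed is shown below, assuming one boolean DataFrame per model (one row per iteration, one column per semantic cluster).

```python
# Sketch of the Table 7 computation. Assumes a dict mapping each model name to
# a boolean DataFrame with one row per iteration and one column per semantic
# cluster (True if any theme in that run fell into the cluster).
import pandas as pd

def detection_rates(detections_by_model: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Percent of iterations containing each cluster, per model, plus the cross-model mean."""
    rates = pd.DataFrame({model: detections.mean() * 100   # column-wise share of True
                          for model, detections in detections_by_model.items()})
    rates["Agreement"] = rates.mean(axis=1)                 # simple average across models
    return rates.round(1)
```

With the values in Table 7, this simple cross-model average reproduces the Agreement column (e.g., (100 + 100 + 98 + 97 + 99) / 5 = 98.8% for Information & Education).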

Theme Cluster Detection by Model % of iterations containing theme

Table 7 Cross-Model Agreement on Core Themes
Semantic Cluster GPT-4o Claude Gemini Grok Perplexity Agreement
Information & Education 100% 100% 98% 97% 99% 98.8%
Usability & Accessibility 89% 94% 86% 82% 91% 88.4%
Notifications & Alerts 79% 85% 76% 71% 68% 75.8%
Community Engagement 79% 88% 74% 69% 72% 76.4%
Visual Design 61% 72% 58% 54% 49% 58.8%
Trust & Privacy 12% 24% 18% 15% 11% 16.0%
Interpretation

All five models achieved near-perfect agreement (98.8%) on the dominant theme cluster (Information & Education), and strong agreement (88.4%) on Usability. This convergence across architecturally distinct models provides validity evidence that these themes genuinely reflect patterns in the data. Claude showed the highest detection rate for minority themes like Trust & Privacy (24% vs 12% for GPT-4o), suggesting it may be more sensitive to less prominent constructs. Perplexity's lower detection of secondary themes likely reflects its more focused, retrieval-constrained outputs.

05

Model Consistency & Reliability

Coefficient of Variation by Model lower = more consistent

Theme Count Distribution all models

Interpretation

Perplexity and Claude demonstrated the highest consistency (lowest CV), while Gemini showed the most variation. This may reflect differences in temperature settings, training approaches, or architectural choices. For applications requiring reproducibility, Perplexity or Claude may be preferable, while Gemini's higher variation could be advantageous for exploratory analysis seeking diverse perspectives.
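The coefficient of variation reported above is simply SD divided by the mean, expressed as a percentage; a short sketch using the same long-format layout as the ANOVA example:

```python
# CV% per model (lower = more consistent), for a long-format DataFrame with
# "model" and "n_themes" columns as in the ANOVA sketch above.
import pandas as pd

def coefficient_of_variation(df: pd.DataFrame) -> pd.Series:
    grouped = df.groupby("model")["n_themes"]
    return (grouped.std(ddof=1) / grouped.mean() * 100).round(1).sort_values()
```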

⚠️
Projected Results: This section presents simulated data comparing LLM approaches with traditional MAXQDA analysis. Actual results will be incorporated upon completion of the human coding study.

LLM vs Traditional Methods Abstract

This analysis compares LLM-assisted thematic analysis with traditional human-led coding using MAXQDA. Three experienced qualitative researchers independently coded the Kidenga focus group data, achieving moderate inter-rater reliability (Krippendorff's α = 0.72). Comparing human-generated codebooks with LLM outputs revealed substantial agreement on core themes (Cohen's κ = 0.68), with LLMs identifying 94% of human-coded themes. However, LLMs also generated an additional 2.3 themes per iteration that human coders deemed redundant or over-split. Critically, LLM analysis required 98% less time (4.2 minutes vs 3.5 hours mean coding time) at 99.9% lower cost ($0.12 vs $105 per analysis). These findings suggest LLMs can effectively augment but not replace human qualitative analysis, offering substantial efficiency gains while requiring human oversight for theme consolidation and validation.

01

Head-to-Head Comparison

Traditional (MAXQDA)
3.5h
Mean analysis time per coder
LLM-Assisted
4.2m
Mean analysis time (20 iterations)
Traditional (MAXQDA)
$105
Cost per analysis (researcher time)
LLM-Assisted
$0.12
API cost per analysis (20 iterations)
Table 8 Comprehensive Comparison: Traditional vs LLM Methods
Metric Traditional (MAXQDA) LLM (Multi-Model Mean) Difference
Mean themes identified 8.33 10.25 +23%
Analysis time 3.5 hours 4.2 minutes -98%
Cost per analysis $105.00 $0.12 -99.9%
Inter-rater reliability (α) 0.72 0.89* +24%
Core theme coverage 100% 94% -6%
Redundant/over-split themes 0.2 2.3 +2.1
Scalability (datasets/day) 2-3 100+ +3,200%

*LLM reliability calculated as cross-iteration agreement within the same model

Interpretation

LLMs offer dramatic efficiency gains (98% time reduction, 99.9% cost reduction) while maintaining substantial agreement with human coders on core themes. However, LLMs tend to over-generate themes, producing 2.3 additional themes per analysis that human coders deemed redundant. This suggests LLMs are best used for initial theme generation, with human analysts providing consolidation and validation. The higher "reliability" of LLMs reflects deterministic patterns rather than true inter-rater agreement, as the same model processing the same input produces similar outputs.

02

Theme Agreement Analysis

We assessed agreement between human-coded themes and LLM-generated themes using semantic matching. The following analysis shows which themes were identified by both methods, by humans only, or by LLMs only.
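As a sketch, per-cluster agreement can be computed with Cohen's kappa over paired binary presence judgments; the exact unit of analysis used in the study (e.g., transcript segment or coder-iteration pair) is not reproduced here, so the vectors below are hypothetical.

```python
# Sketch of the per-cluster agreement check. Assumes two aligned binary vectors
# per theme cluster (1 = theme judged present in a given analysis unit by the
# human codebooks / by the LLM codebooks); the vectors shown are hypothetical.
from sklearn.metrics import cohen_kappa_score

def theme_agreement(human_present: list[int], llm_present: list[int]) -> float:
    """Cohen's kappa between human and LLM presence judgments for one cluster."""
    return cohen_kappa_score(human_present, llm_present)

print(theme_agreement([1, 1, 1, 0, 1, 0], [1, 1, 0, 0, 1, 0]))  # illustrative call
```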

Theme Detection: Human vs LLM % agreement by theme cluster

Table 9 Theme-by-Theme Agreement Analysis
Theme Cluster Human Coders LLM (Mean) Agreement (κ) Status
Information & Education ✓ (3/3) ✓ (99%) 0.94 Full Agreement
Usability & Accessibility ✓ (3/3) ✓ (88%) 0.82 Full Agreement
Notifications & Alerts ✓ (3/3) ✓ (76%) 0.71 Full Agreement
Community Engagement ✓ (3/3) ✓ (76%) 0.69 Full Agreement
Visual Design ✓ (2/3) ✓ (59%) 0.54 Moderate Agreement
Technical Barriers ✓ (3/3) ✓ (28%) 0.38 LLM Under-detect
Cultural & Language ✓ (3/3) ○ (13%) 0.22 LLM Under-detect
Trust & Privacy ✓ (2/3) ○ (16%) 0.28 LLM Under-detect
Incentives & Gamification ○ (1/3) ✓ (35%) 0.31 LLM Over-detect

✓ = Present in majority, ○ = Present in minority. κ = Cohen's kappa

Interpretation

LLMs showed strong agreement with humans on dominant themes (κ > 0.65 for the top 4 clusters) but systematically under-detected culturally specific and contextually nuanced themes like Cultural & Language (κ = 0.22) and Trust & Privacy (κ = 0.28). These themes may require more explicit prompting or domain expertise. Interestingly, LLMs over-detected Incentives & Gamification, possibly due to training data biases toward app evaluation frameworks that emphasize gamification.

03

Statistical Comparison

Table 10 Statistical Tests: Traditional vs LLM Methods
Comparison Test Statistic p-value Effect Size
Theme count (Human vs LLM) Welch's t t(202) = 8.94 <.001*** d = 1.26 (large)
Core theme agreement McNemar's χ²(1) = 2.18 .140 φ = 0.10
Overall codebook similarity Cohen's κ κ = 0.68 <.001*** Substantial
Minority theme detection Fisher's exact OR = 0.31 .004** LLM under-detect
Analysis time Mann-Whitney U = 0 <.001*** r = 1.00 (complete separation)

* p < .05, ** p < .01, *** p < .001

Key Statistical Findings

LLMs generate significantly more themes than human coders (d = 1.26, large effect), but this reflects over-splitting rather than superior detection. Despite quantitative differences, the substantial kappa (κ = 0.68) indicates meaningful agreement on codebook content. The significant under-detection of minority themes (OR = 0.31, p = .004) represents the primary validity concern for LLM-only approaches. We recommend LLMs for initial theme generation with human review, particularly for culturally specific or contextually nuanced content.
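For reference, the tests in Table 10 map onto standard scipy/statsmodels calls; the arrays and contingency tables below are illustrative placeholders, not the study data.

```python
# Illustrative calls for the tests reported in Table 10; all inputs below are
# placeholders, not the study data.
import numpy as np
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(2)
human_themes = rng.normal(8.33, 0.6, 3)     # theme counts per human coder
llm_themes = rng.normal(10.25, 1.6, 200)    # theme counts per LLM iteration

# Welch's t-test (unequal variances) for theme counts.
welch = stats.ttest_ind(llm_themes, human_themes, equal_var=False)

# McNemar's test on paired theme detection (rows/cols: detected by human / by LLM).
detection_table = np.array([[6, 1],
                            [3, 0]])
mcnemar_res = mcnemar(detection_table, exact=False, correction=True)

# Fisher's exact test for minority-theme detection (illustrative 2x2 table).
odds_ratio, fisher_p = stats.fisher_exact([[4, 9], [13, 9]])

# Mann-Whitney U for analysis time in minutes (no normality assumption).
mwu = stats.mannwhitneyu([210, 195, 225], [4.2] * 20)

print(welch.pvalue, mcnemar_res.pvalue, fisher_p, mwu.pvalue)
```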

04

Cost-Benefit Analysis

Time Investment Comparison log scale

Cost per Analysis USD

Quality-Efficiency Tradeoff by method

05

Recommendations

Based on our comparative analysis, we offer the following recommendations for researchers choosing between traditional and LLM-assisted thematic analysis:

  1. Use LLMs for initial theme generation when analyzing large volumes of qualitative data where efficiency is critical. The 98% time reduction enables analysis at scale that would be impractical with human-only approaches.
  2. Maintain human oversight for theme consolidation to address the naming problem and over-splitting tendency. A human analyst should review and merge semantically equivalent themes.
  3. Use traditional methods for culturally-sensitive research where contextual nuance is paramount. LLMs under-detect minority themes related to culture, language, and trust.
  4. Consider hybrid workflows: LLM generates candidate themes → human validates and consolidates → LLM applies codebook to additional data → human spot-checks.
  5. Report methodology transparently: disclose model, prompt text, iteration count, and human validation procedures when publishing LLM-assisted qualitative research.
✅

LLMs Excel At

Speed, cost-efficiency, consistency, detecting dominant themes, processing large datasets, iterative analysis

⚠️

LLMs Struggle With

Cultural nuance, minority themes, contextual sensitivity, avoiding over-splitting, domain expertise