Evaluating LLM Consistency in Qualitative Codebook Generation

Multi-Model Comparison & Traditional Method Analysis

600 iterations
3 LLM Models
MAXQDA Comparison
January 2026

Multi-Model Comparison Abstract

This comparative analysis evaluates three leading large language models for inductive thematic analysis: GPT-4o, Claude Sonnet 4, and Grok 3. Using identical focus group data and prompts, we conducted 200 iterations per model (600 total) to assess consistency, theme generation patterns, and cross-model agreement. Results reveal highly significant differences in theme counts across models (F(2, 597) = 89.47, p < .001, η² = .23), with Claude producing the most themes (M = 12.77), followed by Grok (M = 11.33) and GPT-4o (M = 10.03). Despite quantitative differences, semantic analysis revealed 98% convergence on core themes across all models, supporting the validity of LLM-assisted qualitative analysis.

📋
Note on Excluded Models: Gemini 2.0 Flash (9% success rate, n=18) and Perplexity (non-comparable output format with M=31.4 themes) were excluded from the primary analysis due to data quality issues.
01

Multi-Model Study Overview

Total Iterations
600
Models Analyzed
3
Per Model
200
Success Rate
100%
Core Theme Agreement
98%
Effect Size (η²)
.23
02

Cross-Model Theme Generation

Table 1 Theme Count Statistics by Model (n = 200 each)
Model M SD Range 95% CI CV%
Claude Sonnet 4 12.77 2.07 7–20 [12.48, 13.05] 16.2%
Grok 3 11.33 1.52 10–16 [11.12, 11.54] 12.3%
GPT-4o 10.03 1.55 7–14 [9.81, 10.24] 15.8%

CV% = Coefficient of Variation (lower = more consistent)
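For reproducibility, the Table 1 descriptives are straightforward to recompute. A minimal sketch in Python, assuming the per-iteration results live in a hypothetical theme_counts.csv with columns model and n_themes (both names are assumptions, not the study's actual files):

# Reproduce Table 1 style descriptives (sketch).
# "theme_counts.csv", "model", and "n_themes" are hypothetical names.
import pandas as pd
from scipy import stats

data = pd.read_csv("theme_counts.csv")

for model, g in data.groupby("model"):
    m = g["n_themes"].mean()
    sd = g["n_themes"].std(ddof=1)
    n = len(g)
    # 95% CI from the t distribution with n - 1 degrees of freedom
    lo, hi = stats.t.interval(0.95, df=n - 1, loc=m, scale=sd / n**0.5)
    cv = 100 * sd / m  # coefficient of variation (lower = more consistent)
    print(f"{model}: M={m:.2f}, SD={sd:.2f}, "
          f"range={g['n_themes'].min()}-{g['n_themes'].max()}, "
          f"95% CI=[{lo:.2f}, {hi:.2f}], CV={cv:.1f}%")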

[Figure: Mean theme counts by model with 95% CI]

Interpretation

The one-way ANOVA revealed highly significant differences across models, F(2, 597) = 89.47, p < .001, η² = .23 (large effect). Claude produced significantly more themes than both Grok (d = 0.79, p < .001) and GPT-4o (d = 1.50, p < .001). Grok showed the highest consistency (lowest CV = 12.3%), suggesting more deterministic outputs. Claude's higher theme count may reflect more granular differentiation of concepts, which could be advantageous for exploratory analysis but may require more consolidation effort.
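The omnibus test and effect size can be recomputed from the same hypothetical data layout as above, with eta squared taken as SS_between / SS_total:

# One-way ANOVA across the three models, plus eta squared (sketch).
import pandas as pd
from scipy import stats

data = pd.read_csv("theme_counts.csv")  # hypothetical file, as above
groups = [g["n_themes"].to_numpy() for _, g in data.groupby("model")]
f_stat, p_val = stats.f_oneway(*groups)

grand_mean = data["n_themes"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((data["n_themes"] - grand_mean) ** 2).sum()
eta_sq = ss_between / ss_total  # eta^2 = SS_between / SS_total

print(f"F(2, {len(data) - 3}) = {f_stat:.2f}, p = {p_val:.4g}, "
      f"eta^2 = {eta_sq:.2f}")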

03

Pairwise Statistical Comparisons

Table 2 Pairwise Comparisons (Independent Samples t-tests with Bonferroni correction)
Comparison Mean Diff t p Cohen's d Interpretation
Claude vs GPT-4o 2.74 14.96 <.001*** 1.50 Very large
Claude vs Grok 1.44 7.91 <.001*** 0.79 Large
Grok vs GPT-4o 1.30 8.46 <.001*** 0.85 Large

*** p < .001. All comparisons survive Bonferroni correction (α = .017).
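A sketch of the pairwise tests under the same hypothetical data layout, using Student's t, a pooled-SD Cohen's d, and a Bonferroni-adjusted alpha of .05/3:

# Pairwise independent-samples t-tests with Bonferroni correction
# and pooled-SD Cohen's d (sketch; file and column names are assumptions).
from itertools import combinations
import pandas as pd
from scipy import stats

data = pd.read_csv("theme_counts.csv")
alpha = 0.05 / 3  # Bonferroni-adjusted threshold (= .017)
by_model = {m: g["n_themes"].to_numpy() for m, g in data.groupby("model")}

for a, b in combinations(by_model, 2):
    x, y = by_model[a], by_model[b]
    t, p = stats.ttest_ind(x, y)
    pooled_sd = (((len(x) - 1) * x.std(ddof=1) ** 2 +
                  (len(y) - 1) * y.std(ddof=1) ** 2) /
                 (len(x) + len(y) - 2)) ** 0.5
    d = abs(x.mean() - y.mean()) / pooled_sd  # Cohen's d
    print(f"{a} vs {b}: diff={x.mean() - y.mean():.2f}, t={t:.2f}, "
          f"p={p:.3g}, d={d:.2f}, survives Bonferroni={p < alpha}")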

04

Prompt Type Effects by Model

We examined whether the effect of prompt type (contextual vs. minimal) varied across models. This analysis reveals important differences in how models respond to contextual information.

Table 3 Prompt Type Comparison Within Each Model
Model Contextual M (SD) Minimal M (SD) t(198) p d
GPT-4o 10.48 (1.63) 9.57 (1.32) 4.34 <.001*** 0.61
Claude 13.34 (2.05) 12.19 (1.91) 4.11 <.001*** 0.58
Grok 3 11.47 (1.59) 11.19 (1.44) 1.31 .193 0.18

*** p < .001

[Figure: Prompt type effect by model (contextual vs. minimal)]

Interpretation

A striking interaction emerged: GPT-4o and Claude showed significant prompt type effects (both p < .001, medium effect sizes), while Grok showed no significant difference (p = .193, d = 0.18). This suggests Grok's thematic analysis is less influenced by contextual priming, producing similar outputs regardless of prompt specificity. For researchers seeking consistency across prompt variations, Grok may be preferable; for those wanting models that leverage contextual information, GPT-4o or Claude may be better choices.
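The report probes this interaction through separate within-model t-tests (Table 3). A formal test of the model × prompt interaction would be a two-way ANOVA; a sketch using statsmodels, assuming a hypothetical prompt_type column coded contextual/minimal:

# Formal test of the model x prompt-type interaction (two-way ANOVA).
# The "prompt_type" column is an assumption about the data layout;
# the report itself relied on per-model t-tests rather than this test.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

data = pd.read_csv("theme_counts.csv")
fit = smf.ols("n_themes ~ C(model) * C(prompt_type)", data=data).fit()
print(sm.stats.anova_lm(fit, typ=2))  # inspect the interaction row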

05

Raw Theme Labels by Prompt Type

Before semantic clustering, we examined the raw theme labels produced by each model under contextual and minimal prompt conditions. This analysis reveals both the consistency of core themes and the variation in labeling that necessitates human synthesis.

[Figures: Top 10 theme labels for GPT-4o, Claude Sonnet 4, and Grok 3 under contextual and minimal prompt conditions]

Key Observations

Naming consistency: "Notification Preferences" emerged as the top or near-top theme across all models and prompt conditions, demonstrating strong cross-model convergence.

Model-specific patterns: GPT-4o emphasizes "Community Engagement" and "Information Clarity"; Claude uniquely surfaces "Navigation Difficulties" and "Technology Barriers"; Grok consistently highlights "Usability Challenges" and "Accessibility Needs."

Prompt effects: The minimal prompt tends to produce more generic labels (e.g., "Accessibility and Usability"), while contextual prompts yield more specific themes (e.g., "Information Accessibility," "Engagement and Motivation").

06

Cross-Model Semantic Agreement

Despite differences in theme counts and labeling, we assessed whether models converged on the same underlying constructs through semantic clustering analysis. Each theme was categorized into predefined semantic clusters, and we calculated the percentage of iterations detecting each cluster.
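A minimal sketch of this clustering step, with illustrative keyword lists (not the study's actual dictionary):

# Keyword-based mapping of raw theme labels to semantic clusters,
# then % of iterations detecting each cluster. Keyword lists below
# are illustrative placeholders, not the study's actual dictionary.
CLUSTERS = {
    "Usability & Accessibility": ["usab", "accessib", "navigation", "ease"],
    "Notifications & Alerts":    ["notif", "alert", "reminder"],
    "Trust & Privacy":           ["trust", "privacy", "security"],
    # ... remaining clusters omitted for brevity
}

def clusters_in(theme_labels):
    """Return the set of clusters matched by one iteration's themes."""
    hits = set()
    for label in theme_labels:
        low = label.lower()
        for cluster, keywords in CLUSTERS.items():
            if any(k in low for k in keywords):
                hits.add(cluster)
    return hits

def detection_rates(iterations):
    """iterations: list of theme-label lists, one list per run."""
    counts = {c: 0 for c in CLUSTERS}
    for themes in iterations:
        for c in clusters_in(themes):
            counts[c] += 1
    return {c: 100 * k / len(iterations) for c, k in counts.items()}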

[Figure: Theme cluster detection by model (% of iterations containing theme)]

Table 4 Cross-Model Agreement on Semantic Theme Clusters
Semantic Cluster GPT-4o Claude Grok Mean Agreement
Usability & Accessibility 99.5% 100% 100% 99.8%
Visual Design 100% 100% 100% 100%
Information & Education 100% 100% 100% 100%
Community Engagement 100% 100% 100% 100%
Notifications & Alerts 91.0% 97.0% 100% 96.0%
Motivation & Incentives 98.5% 96.0% 99.5% 98.0%
Trust & Privacy 59.5% 98.0% 97.0% 84.8%
Cultural & Language 66.0% 75.5% 84.0% 75.2%
Technical Issues 55.5% 43.5% 48.5% 49.2%

Interpretation

All three models achieved near-perfect agreement (96–100%) on the five dominant theme clusters: Usability, Visual Design, Information, Community, and Notifications. This convergence across architecturally distinct models provides strong validity evidence that these themes genuinely reflect patterns in the data. Notable divergence appeared for Trust & Privacy, where GPT-4o detected this theme in only 59.5% of iterations compared to 97–98% for Claude and Grok. Technical Issues showed the lowest overall detection (49.2%), suggesting this may be a less prominent theme in the raw data or one that requires explicit prompting to surface.

07

Theme Distribution Comparison

[Figures: Theme count distribution for all models; coefficient of variation by model (lower = more consistent)]

08

Key Findings Summary

📊

Claude Produces Most Themes

M = 12.77 themes per iteration, significantly higher than Grok (11.33) and GPT-4o (10.03).

🎯

Grok Most Consistent

Lowest CV (12.3%) and no significant prompt type effect, suggesting more deterministic outputs.

✅

98% Core Agreement

All models converge on dominant themes (Usability, Design, Information, Community).

⚠️

Minority Theme Divergence

GPT-4o under-detects Trust & Privacy (59.5% vs 97-98% for others).

LLM vs Traditional Methods Abstract

This analysis compares LLM-assisted thematic analysis with traditional human-led coding using MAXQDA. Two qualitative researchers coded the Kidenga focus group data across 6 focus groups, identifying 14 top-level themes with a hierarchical codebook containing sub-codes and 1,146 coded segments. Comparing human-generated codebooks with LLM outputs revealed strong agreement on 6 of 9 semantic categories (67%), with LLMs identifying all major human-coded themes. Notably, LLMs detected Cultural & Language themes (75% of iterations) that were not explicitly coded by human researchers, while under-detecting Technical Issues (49% vs 100%).

01

Study Overview

Human Coders
2
Focus Groups
6
Human Top-Level Themes
14
Coded Segments
1,146
LLM Mean Themes
11.4
Semantic Agreement
67%
02

Head-to-Head Comparison

Traditional (MAXQDA)
14
Top-level themes identified
LLM-Assisted (Multi-Model)
11.4
Mean themes per iteration
Traditional (MAXQDA)
1,146
Total coded segments
LLM-Assisted (Multi-Model)
600
Total iterations analyzed
Table 5 Comprehensive Comparison: Traditional vs LLM Methods
Metric Traditional (MAXQDA) LLM (Multi-Model) Notes
Coder(s) 2 human coders 3 LLM models —
Top-level themes 14 10–13 per iteration Comparable range
Hierarchical depth 14 themes + sub-codes Flat structure Human more nuanced
Coded segments 1,146 N/A (theme-level) Different granularity
Core category agreement 9/9 (100%) 8/9 (89%) Strong overlap
Unique contributions Technical Issues depth Cultural & Language Complementary
Interpretation

The human coders produced a hierarchical structure (14 top-level themes with sub-codes) compared to the LLMs' flatter output (10–13 themes per iteration). Both approaches converged on the same core constructs. The human approach excels at nuanced sub-categorization and contextual depth, while LLMs provide rapid, consistent theme identification across iterations. The methods are complementary rather than competing.

03

Semantic Category Agreement

We mapped both human and LLM themes to 9 predefined semantic categories to assess conceptual alignment. For the human codebook, each top-level theme was assigned to a semantic category; presence was marked if at least one theme mapped to that category. For LLMs, we calculated the percentage of iterations (out of 600 total) in which generated themes contained keywords matching each semantic category. Agreement was then assessed by comparing human presence with LLM detection rates.
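The agreement labels in Table 6 follow simple thresholds (see the table footnote). A minimal sketch of that classification rule:

# Classify human-vs-LLM agreement per category using the bands from
# the Table 6 footnote (Full >= 90%, Partial 50-89%). The "Both absent"
# branch is an assumption for completeness; it does not occur here.
def classify(human_present: bool, llm_rate: float) -> str:
    if not human_present:
        return "LLM extra" if llm_rate >= 50 else "Both absent"
    if llm_rate >= 90:
        return "Full"
    if llm_rate >= 50:
        return "Partial"
    return "LLM under-detect"

print(classify(True, 99.8))   # -> "Full"
print(classify(True, 49.2))   # -> "LLM under-detect"
print(classify(False, 75.2))  # -> "LLM extra"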

[Figure: Theme category detection, human vs. LLM (% detection rate)]

Table 6 Semantic Category Agreement Analysis
Semantic Category Human LLM Mean Agreement
Usability & Accessibility ✓ Present 99.8% Full ✓
Visual Design ✓ Present 100% Full ✓
Information & Education ✓ Present 100% Full ✓
Notifications & Alerts ✓ Present 96.0% Full ✓
Community Engagement ✓ Present 100% Full ✓
Motivation & Incentives ✓ Present 98.0% Full ✓
Trust & Privacy ✓ Present 84.8% Partial
Technical Issues ✓ Present 49.2% LLM under-detect
Cultural & Language ✗ Absent 75.2% LLM extra

Human = presence of at least one theme in category | LLM Mean = % of 600 iterations detecting category | Full = LLM ≥ 90% | Partial = 50–89% | LLM extra = detected by LLM but not human

Key Findings

Strong agreement (6/9 categories): Both methods converged on core themes including Usability, Visual Design, Information, Notifications, Community, and Motivation.

LLM under-detection: Technical Issues were coded by the human coders but detected in only 49% of LLM iterations, suggesting LLMs may miss operational/functional concerns.

LLM extra detection: Cultural & Language themes appeared in 75% of LLM iterations but were not explicitly coded by the human coders; this could represent either LLM over-generation or a genuine pattern the human coders subsumed under other categories.

04

Human Codebook Structure

The human-generated codebook demonstrates the hierarchical depth possible with traditional methods. The 14 top-level themes encompass multiple sub-codes, providing nuanced categorization that flat LLM outputs do not capture.

[Figure: Human codebook, top-level themes by sub-code count (hierarchical structure)]

05

Codebook Structure Comparison

A fundamental difference between human and LLM coding lies in structural organization. Human coders naturally develop hierarchical codebooks with parent-child relationships, while LLMs produce flat theme lists without nesting.
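In data-model terms, the difference is a tree versus a flat list. A minimal sketch, using labels taken from the codebook excerpts shown below:

# The structural contrast in data-model terms: a hierarchical code
# carries nested children; a flat theme list does not.
from dataclasses import dataclass, field

@dataclass
class Code:
    label: str
    children: list["Code"] = field(default_factory=list)  # empty = leaf

human_codebook = [
    Code("Reporting and Data sharing", [
        Code("Privacy concerns"),
        Code("Motivation to report"),
        Code("Rewards"),
    ]),
    # ... 13 more parent themes with their sub-codes
]

llm_codebook = ["Usability Issues", "Visual Design", "Notification Preferences"]
# A flat list of strings: no parent-child links to traverse.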

Table 7 Structural Comparison: Human vs LLM Codebooks
Dimension Human (MAXQDA) LLM (Multi-Model)
Structure type Hierarchical (2 levels) Flat (1 level)
Top-level themes 14 10–13 per iteration
Sub-codes (child themes) Multiple per parent None
Themes with sub-themes Most parent themes 0 (flat structure)
Coded segments 1,146 N/A
Coding granularity Segment-level Theme-level only
Code relationships Parent-child links Independent themes

"Themes with sub-themes" refers to parent themes that contain nested child codes (e.g., "Reporting" โ†’ "Privacy concerns", "Motivation to report")

Human Codebook (hierarchical)

▸ Comprehensibility and Usability
└─ Accessibility, Aesthetics, Engagement, Navigation, Simplicity...
▸ Reporting and Data sharing
└─ Privacy concerns, Motivation, Rewards, Willingness, Frequency...
▸ Visual Design and Aesthetic
└─ Color/Font, Clarity, Visual Aids, Appreciation...
▸ Information Relevance
└─ Context, Data Sources, Practical utility...
▸ Feedback and Functionality
└─ Features, Navigation, Location awareness...
▸ Notifications Preferences
└─ Timing, Methods, Location-based...
... +8 more parent themes with sub-codes

LLM Codebook (flat)

1. Usability Issues
2. Visual Design
3. Information Clarity
4. Educational Value
5. Motivations for Use
6. Barriers to Use
7. Notification Preferences
8. Community Engagement
9. Cultural Sensitivity
10. Rewards and Incentives
No sub-codes or hierarchy.
Each theme is independent.
Themes vary across iterations.

Structural Implications

Human advantage: The hierarchical structure (e.g., "Reporting and Data sharing" → "Privacy concerns," "Motivation to report," "Rewards") enables nuanced analysis and navigation. Sub-codes capture distinctions that flat LLM themes collapse.

LLM limitation: Without hierarchy, LLMs produce themes like "Reporting and Data sharing" but cannot distinguish sub-dimensions without explicit prompting.

Practical implication: For studies requiring detailed sub-categorization, human coding or a hybrid approach (LLM initial themes → human hierarchical refinement) is recommended.

06

Recommendations

Based on our comparative analysis of real human and LLM coding outputs, we offer the following recommendations:

  1. Use LLMs for rapid theme discovery — they reliably identify 89% of semantic categories that humans code, making them excellent for initial exploration.
  2. Human coding for hierarchical depth — when sub-categorization and nuanced distinctions matter, human coding remains essential.
  3. Cross-validate Technical Issues — LLMs consistently under-detect operational/functional themes; ensure prompts explicitly request these.
  4. Leverage LLM "extra" themes — Cultural & Language themes detected by LLMs but not humans may reveal blind spots worth investigating.
  5. Hybrid workflow recommended: LLM generates initial themes → human validates → human develops hierarchy → LLM applies codes at scale.
👤

Human Coding Strengths

Hierarchical structure (14 themes + sub-codes), contextual nuance, technical/operational depth, segment-level precision, interpretive richness

🤖

LLM Coding Strengths

Speed and consistency, cross-iteration validation, detecting Cultural & Language themes, scalability, cost-efficiency

⚠️

Human Coding Limitations

Time-intensive, potential inter-coder variability, may miss patterns visible across many iterations, Cultural & Language not explicitly coded

⚠️

LLM Coding Limitations

Flat structure only, under-detects Technical Issues (49%), no segment-level coding, naming inconsistency across iterations

M

Analytical Methods

Data analysis, visualization, and report generation were conducted using Claude Opus 4.5 as a peer-coding assistant with a human-in-the-loop (HITL) approach. The LLM was used to extract and parse theme counts from multi-model outputs, perform statistical analyses (ANOVA, t-tests, effect size calculations), map themes to semantic categories, and generate interactive visualizations.

All results were cross-checked by the research team against the original MAXQDA codebook and raw LLM outputs to ensure accuracy. The HITL approach ensured that automated analyses were validated through iterative review, with human researchers confirming theme mappings, verifying statistical interpretations, and correcting any discrepancies identified during the analytical process.

Table 8 Statistical Methods Summary
Analysis Method Purpose
Multi-model comparison (Table 1) One-way ANOVA, Kruskal-Wallis Compare mean theme counts across 3 LLMs
Pairwise comparisons (Table 2) Independent t-tests, Bonferroni correction, Cohen's d Compare mean theme counts between pairs of LLMs
Prompt type effects (Table 3) Independent t-tests, Cohen's d Compare mean theme counts between contextual vs minimal prompts within each LLM
Semantic agreement (Tables 4, 6) Keyword matching, category mapping Calculate % of iterations detecting each semantic category
Structural comparison (Table 7) Descriptive comparison Compare hierarchical vs flat codebook structures