Multi-Model Comparison Abstract
This comparative analysis evaluates three leading large language models for inductive thematic analysis: GPT-4o, Claude Sonnet 4, and Grok 3. Using identical focus group data and prompts, we conducted 200 iterations per model (600 total) to assess consistency, theme generation patterns, and cross-model agreement. Results reveal highly significant differences in theme counts across models (F(2, 597) = 89.47, p < .001, η² = .23), with Claude producing the most themes (M = 12.77), followed by Grok (M = 11.33) and GPT-4o (M = 10.03). Despite these quantitative differences, semantic analysis revealed 98% convergence on core themes across all models, supporting the validity of LLM-assisted qualitative analysis.
Multi-Model Study Overview
Cross-Model Theme Generation
| Model | M | SD | Range | 95% CI | CV% |
|---|---|---|---|---|---|
| Claude Sonnet 4 | 12.77 | 2.07 | 7–20 | [12.48, 13.05] | 16.2% |
| Grok 3 | 11.33 | 1.52 | 10–16 | [11.12, 11.54] | 12.3% |
| GPT-4o | 10.03 | 1.55 | 7–14 | [9.81, 10.24] | 15.8% |
CV% = Coefficient of Variation (lower = more consistent)
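The CV column is the standard deviation expressed as a percentage of the mean, which makes consistency comparable across models with different mean theme counts. A minimal sketch, shown with Claude's reported summary statistics:

```python
def cv_percent(sd: float, mean: float) -> float:
    """Coefficient of variation as a percentage (lower = more consistent)."""
    return 100.0 * sd / mean

# Claude Sonnet 4: SD = 2.07, M = 12.77 (from the table above)
print(round(cv_percent(2.07, 12.77), 1))  # 16.2
```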
Figure: Mean theme counts by model with 95% CIs.
The one-way ANOVA revealed highly significant differences across models, F(2, 597) = 89.47, p < .001, η² = .23 (large effect). Claude produced significantly more themes than both Grok (d = 0.79, p < .001) and GPT-4o (d = 1.50, p < .001). Grok showed the highest consistency (lowest CV = 12.3%), suggesting more deterministic outputs. Claude's higher theme count may reflect more granular differentiation of concepts, which could be advantageous for exploratory analysis but may require more consolidation effort.
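The omnibus test can be reproduced from raw per-iteration theme counts. The sketch below computes F and η² by hand; the data are synthetic draws matched to the reported means and SDs (illustrative only, not the study's actual counts):

```python
import numpy as np

rng = np.random.default_rng(42)
# Synthetic theme counts per iteration, matched to reported M/SD (n = 200 each)
groups = [
    rng.normal(12.77, 2.07, 200),  # Claude Sonnet 4
    rng.normal(11.33, 1.52, 200),  # Grok 3
    rng.normal(10.03, 1.55, 200),  # GPT-4o
]

all_vals = np.concatenate(groups)
grand_mean = all_vals.mean()
k, n = len(groups), all_vals.size

# Between- and within-group sums of squares
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# One-way ANOVA: F(k-1, n-k) and eta-squared effect size
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
eta_sq = ss_between / (ss_between + ss_within)
print(f"F({k - 1}, {n - k}) = {f_stat:.2f}, eta^2 = {eta_sq:.2f}")
```

With group differences this large, any reasonable random seed yields an F statistic far past the critical value and an η² close to the reported .23.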
Pairwise Statistical Comparisons
| Comparison | Mean Diff | t | p | Cohen's d | Interpretation |
|---|---|---|---|---|---|
| Claude vs GPT-4o | 2.74 | 14.96 | <.001*** | 1.50 | Very large |
| Claude vs Grok | 1.44 | 7.91 | <.001*** | 0.79 | Large |
| Grok vs GPT-4o | 1.30 | 8.46 | <.001*** | 0.85 | Large |
*** p < .001. All comparisons survive Bonferroni correction (α = .017).
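The effect sizes in this table can be recovered directly from the summary statistics in the first table. A minimal sketch, assuming equal group sizes (n = 200 per model) and a pooled-SD Cohen's d:

```python
import math

def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d for two independent groups using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Reported per-model (M, SD), n = 200 each
claude, grok, gpt4o = (12.77, 2.07), (11.33, 1.52), (10.03, 1.55)

d_claude_gpt = cohens_d(*claude, 200, *gpt4o, 200)  # ~1.50
d_claude_grok = cohens_d(*claude, 200, *grok, 200)  # ~0.79
d_grok_gpt = cohens_d(*grok, 200, *gpt4o, 200)      # ~0.85

# Bonferroni-adjusted alpha for three pairwise comparisons
bonferroni_alpha = 0.05 / 3  # ~.0167, the alpha = .017 reported above
```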
Prompt Type Effects by Model
We examined whether the effect of prompt type (contextual vs. minimal) varied across models. This analysis reveals important differences in how models respond to contextual information.
| Model | Contextual M (SD) | Minimal M (SD) | t(198) | p | d |
|---|---|---|---|---|---|
| GPT-4o | 10.48 (1.63) | 9.57 (1.32) | 4.34 | <.001*** | 0.61 |
| Claude | 13.34 (2.05) | 12.19 (1.91) | 4.11 | <.001*** | 0.58 |
| Grok 3 | 11.47 (1.59) | 11.19 (1.44) | 1.31 | .193 | 0.18 |
*** p < .001
Figure: Prompt type effect by model (contextual vs. minimal).
A striking interaction emerged: GPT-4o and Claude showed significant prompt type effects (both p < .001, medium effect sizes), while Grok showed no significant difference (p = .193, d = 0.18). This suggests Grok's thematic analysis is less influenced by contextual priming, producing similar outputs regardless of prompt specificity. For researchers seeking consistency across prompt variations, Grok may be preferable; for those wanting models that leverage contextual information, GPT-4o or Claude may be better choices.
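The per-model contrasts above can be reproduced from the table's summary statistics. A minimal sketch, assuming a pooled-variance t-test with 100 iterations per prompt condition (the split implied by df = 198):

```python
import math

def two_sample_t(m1, sd1, n1, m2, sd2, n2):
    """Pooled-variance two-sample t statistic and Cohen's d (df = n1 + n2 - 2)."""
    sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
    return t, (m1 - m2) / sp

# Contextual vs. minimal prompt, per model (n = 100 per condition assumed)
t_gpt, d_gpt = two_sample_t(10.48, 1.63, 100, 9.57, 1.32, 100)     # ~4.34, ~0.61
t_grok, d_grok = two_sample_t(11.47, 1.59, 100, 11.19, 1.44, 100)  # ~1.31, ~0.18
```

Recomputing both contrasts shows the interaction directly: GPT-4o's t statistic clears any conventional threshold while Grok's does not.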
Raw Theme Labels by Prompt Type
Before semantic clustering, we examined the raw theme labels produced by each model under contextual and minimal prompt conditions. This analysis reveals both the consistency of core themes and the variation in labeling that necessitates human synthesis.
Figures: Top 10 theme labels for each model (GPT-4o, Claude, Grok) under contextual and minimal prompt conditions.
**Naming consistency:** "Notification Preferences" emerged as the top or near-top theme across all models and prompt conditions, demonstrating strong cross-model convergence.

**Model-specific patterns:** GPT-4o emphasizes "Community Engagement" and "Information Clarity"; Claude uniquely surfaces "Navigation Difficulties" and "Technology Barriers"; Grok consistently highlights "Usability Challenges" and "Accessibility Needs."

**Prompt effects:** The minimal prompt tends to produce more generic labels (e.g., "Accessibility and Usability") while contextual prompts yield more specific themes (e.g., "Information Accessibility," "Engagement and Motivation").
Cross-Model Semantic Agreement
Despite differences in theme counts and labeling, we assessed whether models converged on the same underlying constructs through semantic clustering analysis. Each theme was categorized into predefined semantic clusters, and we calculated the percentage of iterations detecting each cluster.
Figure: Theme cluster detection by model (% of iterations containing each theme cluster).
| Semantic Cluster | GPT-4o | Claude | Grok | Mean Agreement |
|---|---|---|---|---|
| Usability & Accessibility | 99.5% | 100% | 100% | 99.8% |
| Visual Design | 100% | 100% | 100% | 100% |
| Information & Education | 100% | 100% | 100% | 100% |
| Community Engagement | 100% | 100% | 100% | 100% |
| Notifications & Alerts | 91.0% | 97.0% | 100% | 96.0% |
| Motivation & Incentives | 98.5% | 96.0% | 99.5% | 98.0% |
| Trust & Privacy | 59.5% | 98.0% | 97.0% | 84.8% |
| Cultural & Language | 66.0% | 75.5% | 84.0% | 75.2% |
| Technical Issues | 55.5% | 43.5% | 48.5% | 49.2% |
All three models achieved near-perfect agreement (96–100% mean detection) on the five dominant theme clusters: Usability, Visual Design, Information, Community, and Notifications. This convergence across architecturally distinct models provides strong validity evidence that these themes genuinely reflect patterns in the data. Notable divergence appeared for Trust & Privacy, which GPT-4o detected in only 59.5% of iterations compared to 97–98% for Claude and Grok. Technical Issues showed the lowest overall detection (49.2%), suggesting it is either a less prominent theme in the raw data or one that requires explicit prompting to surface.
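Detection rates like those in the table come from mapping each iteration's raw theme labels onto the predefined semantic clusters and counting the iterations in which a cluster appears. A minimal keyword-matching sketch (the keyword map and toy labels below are illustrative, not the study's actual coding scheme):

```python
# Illustrative keyword map; the study's actual cluster definitions may differ
CLUSTER_KEYWORDS = {
    "Trust & Privacy": ("trust", "privacy", "security"),
    "Notifications & Alerts": ("notification", "alert", "reminder"),
}

def clusters_in_iteration(theme_labels):
    """Map one iteration's raw theme labels onto semantic clusters."""
    found = set()
    for label in theme_labels:
        text = label.lower()
        for cluster, keywords in CLUSTER_KEYWORDS.items():
            if any(kw in text for kw in keywords):
                found.add(cluster)
    return found

def detection_rate(all_iterations, cluster):
    """Percentage of iterations whose theme set contains the cluster."""
    hits = sum(cluster in clusters_in_iteration(labels) for labels in all_iterations)
    return 100.0 * hits / len(all_iterations)

# Toy example: raw labels from 4 hypothetical iterations
runs = [
    ["Notification Preferences", "Data Privacy Concerns"],
    ["Push Notification Timing"],
    ["Trust in Health Information"],
    ["Usability Challenges"],
]
print(detection_rate(runs, "Trust & Privacy"))  # 50.0
```

In practice the label-to-cluster mapping would be validated by human coders, since keyword matching alone can miss paraphrased labels.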
Theme Distribution Comparison
Figure: Theme count distributions (all models).

Figure: Coefficient of variation by model (lower = more consistent).
Key Findings Summary
Claude Produces Most Themes
M = 12.77 themes per iteration, significantly higher than Grok (11.33) and GPT-4o (10.03).
Grok Most Consistent
Lowest CV (12.3%) and no significant prompt type effect, suggesting more deterministic outputs.
98% Core Agreement
All models converge on dominant themes (Usability, Design, Information, Community).
Minority Theme Divergence
GPT-4o under-detects Trust & Privacy (59.5% vs. 97–98% for the other models).