Abstract
This study evaluates the consistency and reliability of using large language models (LLMs) for inductive thematic analysis in qualitative research. Using focus group data from the Kidenga mobile health application evaluation in Puerto Rico, we conducted 200 independent iterations of codebook generation using GPT-4o across two prompt conditions (contextual and minimal) and three iteration batch sizes (20, 30, and 50). Results demonstrate strong thematic convergence across iterations, with the model consistently identifying 8–9 core theme clusters. Statistical analysis revealed a significant difference between prompt types in theme generation (t(198) = 4.25, p < .001, d = 0.60), with contextual prompts producing more themes per iteration (M = 10.72) than minimal prompts (M = 9.81). No significant differences were found across iteration batch sizes (F(2, 197) = 0.16, p = .85), suggesting that 20 iterations may be sufficient for stable results.
Study Overview
The application of large language models to qualitative research methods represents an emerging area of methodological inquiry. This study addresses a fundamental methodological question: when an LLM is asked to generate a codebook from the same qualitative dataset multiple times, how consistent are the resulting themes?
Principal Findings
Strong Thematic Convergence
Across 200 independent runs, the model consistently identified 8–9 core theme clusters.
Prompt Type Matters
Contextual prompts produced significantly more themes (M = 10.7 vs 9.8, p < .001, d = 0.60).
20 Iterations Suffice
No significant differences across batch sizes (p = .85). Fewer iterations reduce costs.
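The batch-size comparison is a standard one-way ANOVA across the three conditions. A minimal sketch with SciPy, using hypothetical per-iteration theme counts (illustrative only; the study's raw counts are not reproduced here):

```python
from scipy import stats

# Hypothetical theme counts per batch-size condition (illustrative only;
# the study's actual per-iteration data are not shown in this report)
batch_20 = [10, 11, 9, 10, 12, 9, 11, 10]
batch_30 = [9, 10, 11, 10, 9, 12, 10, 11]
batch_50 = [11, 9, 10, 10, 11, 9, 10, 12]

# One-way ANOVA: tests whether mean theme count differs across batch sizes
f_stat, p_value = stats.f_oneway(batch_20, batch_30, batch_50)
print(f"F = {f_stat:.2f}, p = {p_value:.2f}")
```

A large p-value, as reported in the study, indicates no evidence that batch size affects the number of themes generated.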
Theme Count by Prompt Type
| Prompt Type | M | SD | Range | 95% CI |
|---|---|---|---|---|
| Contextual (n = 100) | 10.72 | 1.65 | 7–15 | [10.39, 11.05] |
| Minimal (n = 100) | 9.81 | 1.36 | 7–15 | [9.54, 10.08] |
The significant difference between prompt types (t(198) = 4.25, p < .001, d = 0.60) suggests that providing study context enables the model to identify more nuanced themes. However, more themes are not necessarily better; the additional themes may represent over-splitting of constructs.
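The reported test statistics can be reproduced from the summary table alone. A short Python check, assuming an independent-samples Student's t-test with pooled variance:

```python
import math
from scipy import stats

# Summary statistics from the table above
m1, sd1, n1 = 10.72, 1.65, 100   # contextual
m2, sd2, n2 = 9.81, 1.36, 100    # minimal

# Pooled SD, t statistic, and Cohen's d
df = n1 + n2 - 2
sp = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / df)
t = (m1 - m2) / (sp * math.sqrt(1 / n1 + 1 / n2))
d = (m1 - m2) / sp

# 95% CI for the contextual mean
tcrit = stats.t.ppf(0.975, n1 - 1)
ci = (m1 - tcrit * sd1 / math.sqrt(n1), m1 + tcrit * sd1 / math.sqrt(n1))

print(round(t, 2), round(d, 2))   # t ≈ 4.26 (reported 4.25; rounding), d ≈ 0.60
print([round(x, 2) for x in ci])  # [10.39, 11.05]
```

This confirms the table's confidence interval and effect size to two decimal places.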
Figure: Theme distribution, contextual prompt.
Figure: Theme distribution, minimal prompt.
Thematic Convergence Analysis
Figure: Consolidated theme frequency (n = 200 iterations).
The semantic clustering reveals a clear hierarchy of theme salience. Four core clusters appeared in the majority of iterations: Information & Education (100%), Usability & Accessibility (89%), Notifications & Alerts (79%), and Community Engagement (79%). By contrast, Trust & Privacy and Cultural & Language themes were identified in only 12–13% of iterations, suggesting they may require more explicit prompting to surface reliably.