Large language model (LLM) chatbots such as ChatGPT are increasingly used by the general public, including for advice and support. Their conversational style may lead users to attribute understanding and empathy to them, raising concerns for vulnerable populations such as individuals with psychosis. Prior reports suggest that chatbot interactions may reinforce delusional or abnormal beliefs by accepting inaccurate premises or mirroring user input. This study aimed to evaluate whether ChatGPT produces clinically appropriate responses when presented with psychotic content.
This cross-sectional study evaluated three ChatGPT versions: GPT-5 Auto, GPT-4o, and the free version. A total of 79 psychotic prompts were developed based on five symptom domains from the Structured Interview for Psychosis–Risk Syndromes: unusual thought content, suspiciousness, grandiosity, perceptual disturbances, and disorganized communication. Each psychotic prompt had a matched control prompt without psychotic content. Each prompt was submitted once to each chatbot version, generating 474 prompt-response pairs (79 psychotic and 79 control prompts across three versions). Two blinded clinicians independently rated the appropriateness of each response using a 3-point scale: 0 (completely appropriate), 1 (somewhat appropriate), and 2 (completely inappropriate). A subset underwent secondary rating. Statistical analysis used proportional odds regression to compare appropriateness between psychotic and control prompts across and within model versions.
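As a rough illustration of the design and the model family named above, the following Python sketch enumerates the prompt-response pairs and shows what the proportional odds assumption implies for a 3-point rating. It is not the study's analysis code. The threshold values are hypothetical, and the within-model odds ratio reported for GPT-5 Auto (9.08) is used only as an example coefficient.

```python
import math
from itertools import product

# Design as described in the text: 79 psychotic prompts, each with a matched
# control, submitted once to each of three ChatGPT versions.
N_PROMPTS = 79
CONDITIONS = ("psychotic", "control")
MODELS = ("GPT-5 Auto", "GPT-4o", "free version")

pairs = list(product(range(N_PROMPTS), CONDITIONS, MODELS))
n_pairs = len(pairs)  # 79 * 2 * 3


def rating_probabilities(thresholds, beta, is_psychotic):
    """Proportional odds model: logit P(Y <= k) = theta_k - beta * x.

    Returns [P(Y=0), P(Y=1), P(Y=2)] for the 3-point appropriateness scale,
    where x = 1 for a psychotic prompt and 0 for a control prompt.
    """
    x = 1.0 if is_psychotic else 0.0
    cumulative = [1.0 / (1.0 + math.exp(-(theta - beta * x))) for theta in thresholds]
    cumulative.append(1.0)  # P(Y <= 2) is always 1
    return [cumulative[0]] + [
        cumulative[k] - cumulative[k - 1] for k in range(1, len(cumulative))
    ]


def odds_of_worse(probs, cut):
    """Odds of a rating above `cut`: P(Y > cut) / P(Y <= cut)."""
    p_le = sum(probs[: cut + 1])
    return (1.0 - p_le) / p_le


# Hypothetical thresholds (assumption, not from the study); beta is the log of
# an odds ratio reported in the text, used purely for illustration.
THRESHOLDS = (1.5, 3.0)
BETA = math.log(9.08)

psychotic = rating_probabilities(THRESHOLDS, BETA, True)
control = rating_probabilities(THRESHOLDS, BETA, False)

# Under proportional odds, the same odds ratio applies at both cutpoints.
or_cut0 = odds_of_worse(psychotic, 0) / odds_of_worse(control, 0)
or_cut1 = odds_of_worse(psychotic, 1) / odds_of_worse(control, 1)
```

The design choice worth noting is that a single coefficient summarizes the shift toward less appropriate ratings at every cutpoint of the scale, which is why the results can be reported as one odds ratio per comparison.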
Across all models, responses to psychotic prompts were significantly less appropriate than responses to control prompts. For the free version, psychotic prompts had 25.84-fold higher odds of receiving a less appropriate rating compared with controls (95% CI, 12.45–53.66; P < .001). Performance improved modestly with newer models, but risk remained substantial. GPT-5 Auto showed reduced but still elevated odds of inappropriate responses (OR ≈ 8.5), while GPT-4o also demonstrated significantly increased risk. Within-model analyses showed that all versions produced significantly less appropriate responses to psychotic prompts, with the highest risk observed in the free version (OR 43.37), followed by GPT-4o (OR 14.15) and GPT-5 Auto (OR 9.08). Control prompts were predominantly rated as completely appropriate, whereas psychotic prompts generated a higher proportion of partially or completely inappropriate responses across all models.
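The reported interval for the free version can be checked for internal consistency with a short calculation. The Python sketch below assumes the 95% CI is a Wald interval that is symmetric on the log-odds scale with a multiplier of 1.96. The abstract does not state how the interval was computed, so this is an assumption, and the variable names are illustrative.

```python
import math

# Values reported in the text for the free version.
OR_POINT = 25.84
CI_LOW, CI_HIGH = 12.45, 53.66
Z_95 = 1.96  # assumed Wald multiplier for a 95% interval

# On the log scale a Wald interval is estimate +/- Z_95 * SE, so the
# interval width gives the standard error of the log odds ratio.
log_or = math.log(OR_POINT)
se_log_or = (math.log(CI_HIGH) - math.log(CI_LOW)) / (2 * Z_95)

# If the interval is symmetric on the log scale, the geometric mean of its
# bounds should be close to the point estimate.
geometric_mean = math.sqrt(CI_LOW * CI_HIGH)

# Wald z statistic and two-sided P value under a standard normal reference.
z_stat = log_or / se_log_or
p_value = math.erfc(z_stat / math.sqrt(2))

print(f"SE(log OR) = {se_log_or:.3f}")
print(f"geometric mean of CI bounds = {geometric_mean:.2f}")
print(f"z = {z_stat:.2f}, two-sided P < .001: {p_value < 0.001}")
```

Under these assumptions the geometric mean of the bounds falls close to the reported point estimate, and the implied test statistic is consistent with the reported P < .001.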
All evaluated ChatGPT versions frequently generated inappropriate or only partially appropriate responses to psychotic prompts, and responses were significantly less appropriate than those to non-psychotic content. Although newer models demonstrated some improvement, the risk of problematic responses remained clinically concerning. These findings highlight potential safety risks for individuals with psychosis interacting with LLM chatbots and underscore the need for clinician awareness, further research on conversational reinforcement effects, and consideration of regulatory oversight.