What ChatGPT Health Misses in Medical Triage: The U-Shaped Failure Structure and 51.6% Emergency Undertriage Across 960 Responses | Nature Medicine, May 2026

2026-05-22

OpenAI’s ChatGPT Health, launched on January 7, 2026 as the company’s consumer-facing health tool, reached millions of users within weeks. Can it really serve as a first-contact triage point? A study published in Nature Medicine (Vol. 32, May 2026, 1671-1675) by a Mount Sinai-led team conducted a structured stress test: 60 clinician-authored vignettes spanning 21 clinical domains, each presented under 16 factorial conditions, for 60 × 16 = 960 total responses. The study reveals a structurally important failure pattern. Errors concentrate at the clinical extremes, and most consequentially, 51.6% of gold-standard emergencies (33 of 64) were undertriaged: cases of diabetic ketoacidosis or impending respiratory failure were directed to “see a doctor within 24–48 hours” rather than the emergency department. This article unpacks the U-shaped failure structure the paper exposes, and what must be validated before AI triage tools are deployed at consumer scale.


The Research Question and Stakes

The paper’s framing is direct. Large language models (LLMs) can score highly on medical licensing examinations, but that performance does not guarantee safe triage in a setting where output reaches patients without a clinician buffer. Errors land directly on the patient. Critically, the risks are asymmetric: undertriage may delay life-saving treatment, while overtriage chiefly inflates healthcare utilization. This asymmetry is exactly why the authors argue external validation must precede consumer-scale deployment of AI triage systems.

The research team is led by Ashwin Ramaswamy with senior authors Eyal Klang and Girish N. Nadkarni — figures well established in clinical AI evaluation at Mount Sinai. The study was designed as an external structured stress test, independent of OpenAI’s internally developed HealthBench benchmark released alongside ChatGPT Health.

Methodology

Vignettes: 60 clinician-authored scenarios
Clinical domains: 21 (emergency medicine, cardiology, endocrinology, psychiatry, obstetrics, orthopedics, etc.)
Factorial conditions: 16 (patient demographics, presentation framing, accompanying-person symptom minimization, etc.)
Total responses: 60 × 16 = 960
Gold-standard triage levels (4): A (monitor at home), B (see a doctor within weeks), C (within 24–48 h), D (emergency department now)
Evaluation: Compare ChatGPT Health’s triage recommendation against gold standard for each vignette
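
The evaluation step listed above can be sketched in a few lines. This is a minimal illustration, not the authors' scoring code: it assumes the four levels form an ordinal scale from A (least urgent) to D (most urgent), and the function names and example pairs are hypothetical.

```python
from collections import Counter

# Ordinal acuity scale from the article: A (least urgent) to D (most urgent).
LEVELS = {"A": 0, "B": 1, "C": 2, "D": 3}


def triage_direction(gold: str, predicted: str) -> str:
    """Classify one response relative to the gold-standard level."""
    diff = LEVELS[predicted] - LEVELS[gold]
    if diff == 0:
        return "correct"
    return "overtriage" if diff > 0 else "undertriage"


def summarize(pairs):
    """Count outcomes over (gold, predicted) pairs; return (counts, mistriage rate)."""
    counts = Counter(triage_direction(g, p) for g, p in pairs)
    total = sum(counts.values())
    mistriage_rate = 1 - counts["correct"] / total if total else 0.0
    return counts, mistriage_rate


# Hypothetical example pairs for illustration; not data from the study.
example = [("D", "C"), ("D", "D"), ("A", "B"), ("B", "B")]
counts, rate = summarize(example)
print(dict(counts), rate)
```

Treating the levels as ordinal is what makes the direction of an error meaningful: a D case answered with C is an undertriage, while an A case answered with B is an overtriage.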

The 16 factorial conditions include both fairness axes (patient sex, race, healthcare-access barriers) and safety-critical presentation patterns (anchoring-bias manipulations by family or friends, suicidal-ideation presentations with or without explicit method). The design therefore captures not just raw accuracy but how the AI’s judgment can be context-dependently distorted — a multivariate safety lens.

Key Finding: A U-Shaped Failure Structure

The central finding is that the mistriage rate (1 − accuracy) is distributed as a U-shape across gold-standard acuity: intermediate categories are relatively well handled, but errors concentrate at both extremes.

Gold-standard acuity	n	Mistriage rate	Direction
A: Monitor at home	128	35.2%	All overtriage
B: See a doctor within weeks	128	7.0% (best)	Mixed
C: Within 24–48 hours	160	23.1%	Mixed
D: Emergency department now	64	51.6% (worst)	All undertriage
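
As a quick consistency check on the table, the whole-number counts implied by each rounded percentage can be recovered as follows. This is a sketch under the assumption that every rate is simply errors divided by n for that row; the variable names are illustrative.

```python
# (n, mistriage rate in percent) per gold-standard level, copied from the table above.
TABLE = {
    "A": (128, 35.2),
    "B": (128, 7.0),
    "C": (160, 23.1),
    "D": (64, 51.6),
}


def implied_count(n: int, pct: float) -> int:
    """Nearest whole number of mistriaged responses implied by a rounded percentage."""
    return round(n * pct / 100)


implied = {level: implied_count(n, pct) for level, (n, pct) in TABLE.items()}
total_n = sum(n for n, _ in TABLE.values())
total_errors = sum(implied.values())

for level, (n, pct) in TABLE.items():
    k = implied[level]
    print(f"{level}: {k}/{n} = {100 * k / n:.1f}% (reported {pct}%)")
print(f"overall: {total_errors}/{total_n} = {100 * total_errors / total_n:.1f}%")
```

Under that assumption the implied counts are 45, 9, 37, and 33 mistriaged responses for levels A through D, or 124 of the 480 responses tabulated here.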

At the emergency (D) level, 33 of 64 cases (51.6%) were undertriaged to lower-urgency recommendations. Specifically, presentations such as diabetic ketoacidosis (DKA) and impending respiratory failure, conditions that can be life-threatening within hours, were routed to the “see a doctor within 24–48 hours” (C) level. By contrast, textbook-classical emergencies such as stroke and anaphylaxis were largely triaged correctly. The implication is that ChatGPT Health is competent on canonical patterns but loses calibration on emergencies whose presentation deviates from the prototype.

At the non-urgent (A) end, the failure runs in the opposite direction — 35.2% overtriage. Patients who could safely monitor at home were directed to unnecessary clinical visits. This does not directly threaten life, but it inflates healthcare utilization and creates patient anxiety.

Anchoring Bias and Crisis Safeguards

The most clinically alarming finding is the magnitude of anchoring bias. When family or friends were introduced into the vignette and used phrases that minimized symptoms (“you’re probably overreacting,” “let’s just watch it”), the triage recommendation shifted significantly lower in edge cases:

Odds ratio: 11.7
95% confidence interval: 3.7–36.6
Direction: Majority of shifts toward less urgent care

An OR of 11.7 is large, and even the lower bound of the interval (3.7) indicates a substantial effect, although the width of the interval means the exact magnitude is uncertain. If the language of a dismissive family member is sufficient to push the AI toward lower urgency, then consumers using AI as a triage advisor are receiving outputs shaped not just by their own symptoms but by surrounding context they may not even recognize as influential. That undermines the implicit user expectation that the AI offers neutral judgment.
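
To put the reported odds ratio and interval in context, the standard error can be backed out on the log-odds scale. This is a sketch that assumes the interval is a Wald-type interval symmetric in log(OR); the article does not say how the authors computed it, so treat the derived numbers as approximate.

```python
import math

OR_POINT = 11.7               # reported odds ratio
CI_LOW, CI_HIGH = 3.7, 36.6   # reported 95% confidence interval
Z_95 = 1.959964               # standard normal quantile for a two-sided 95% interval

# Assumption: the interval is symmetric on the log-odds scale (Wald-type).
se_log_or = (math.log(CI_HIGH) - math.log(CI_LOW)) / (2 * Z_95)

# If that assumption holds, the point estimate sits near the geometric mean of the bounds.
geometric_mid = math.sqrt(CI_LOW * CI_HIGH)

# z statistic for the null hypothesis OR = 1 (no anchoring effect).
z_stat = math.log(OR_POINT) / se_log_or

print(f"SE of log(OR): {se_log_or:.3f}")
print(f"geometric midpoint of CI: {geometric_mid:.2f}")
print(f"z statistic vs OR = 1: {z_stat:.2f}")
```

The geometric midpoint of the bounds lands close to 11.7, which is consistent with the log-scale assumption, and the upper bound is roughly ten times the lower bound, which is why the effect is clearly present but only loosely quantified.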

The activation of crisis-intervention messages (such as suicide-prevention hotline guidance) across suicidal-ideation vignettes was also inconsistent — and counterintuitive. Crisis safeguards were activated more frequently when the patient described no specific method than when a specific method was described. This is the opposite of the clinically expected pattern: in conventional risk stratification, specificity of method is itself a marker of higher risk. The implication is that the AI may be parsing specificity as something other than “elevated risk,” or relying on different signals altogether — leaving exactly the highest-risk presentations with the weakest safeguard activation.

Fairness Axes

Variations in patient race, sex, and barriers to care did not produce statistically significant triage differences. On the surface, this is a fairness-favorable result. The paper carefully notes, however, that the confidence intervals do not exclude clinically meaningful differences. “No significant difference” is not the same as “clinically negligible” — sample-size constraints leave room for important effects that a larger study could detect.
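
The sample-size point can be made concrete with a rough calculation of how wide a confidence interval for a difference in mistriage rates would be. This is an illustrative sketch only: the baseline rate and group sizes are hypothetical, and it uses a simple Wald interval for two independent groups rather than whatever model the paper fitted.

```python
import math


def diff_ci_half_width(p: float, n_per_group: int, z: float = 1.96) -> float:
    """Half-width of a Wald 95% CI for a difference between two proportions,
    assuming both groups share the same underlying rate p and the same size."""
    return z * math.sqrt(2 * p * (1 - p) / n_per_group)


# Hypothetical inputs for illustration only; not taken from the paper.
BASELINE_RATE = 0.25
for n in (120, 480, 1920):
    hw = diff_ci_half_width(BASELINE_RATE, n)
    print(f"n per group = {n:5d}: +/- {100 * hw:.1f} percentage points")
```

Because the half-width shrinks with the square root of the sample size, quadrupling the number of responses per group only halves the interval, which is why ruling out modest but clinically relevant differences takes a much larger study.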

Regulation and Deployment Implications

The implications extend beyond ChatGPT Health as a single product. The paper functions as a research artifact arguing for a regulatory and evaluation framework for consumer-facing medical AI tools as a category.

In the United States, the FDA has developed a “Software as a Medical Device” (SaMD) regulatory pathway for AI/ML-based medical devices. However, tools like ChatGPT Health — offering “symptom guidance” and “urgency-of-follow-up recommendations” — sit awkwardly within current device-determination criteria. OpenAI positions the tool as guidance rather than medical advice. The authors counter that patients act on LLM-generated medical information regardless of its quality, and that triage accuracy therefore becomes a public-health imperative rather than a product disclaimer issue.

The parallel existence of OpenAI’s own HealthBench benchmark is also worth attention. The simultaneous progression of internal benchmarks and external evaluation is the typical structure that emerges in any major AI safety field — and the value of independent academic evaluation, of which this paper is a notable example, becomes the counterweight to vendor self-assessment.

Anthropic, Google, and Microsoft each have stated ambitions in healthcare AI. ChatGPT Health is therefore positioned as one of the earliest entries in what will be a category market, and whether the failure patterns documented here generalize to other LLM-based triage tools — or are specific to ChatGPT Health — is a question that comparative studies will need to settle.

What the Paper Calls for Next

The authors close with a call for prospective validation before consumer-scale deployment of AI triage systems. Specifically:

The current study is vignette-based; prospective trials with real patient presentations are needed
Failures concentrate in specific clinical domains and presentation patterns, so cross-domain safety evaluation must be systematic
Anchoring-bias and crisis-safeguard issues require evaluation frameworks that capture conversation history and context-dependent behavior
Fairness axes warrant larger samples to narrow confidence intervals

These demand a refresh of evaluation methodology for AI medical tools broadly — a contribution that exceeds critique of any single product.

My Thoughts and Future Outlook

“LLMs score highly on medical licensing examinations” has been a recurring headline for the past several years. The substantive contribution of this paper is to quantify, through structured testing, the deep gap between that headline and “making safe judgment calls in real clinical situations.” Doing well on an exam and deciding whether the patient short of breath in front of you needs the emergency department are different competencies.

The U-shaped failure structure is, in machine-learning terms, the classic “strong on the central distribution, weak on the tails” pattern — a reflection, almost certainly, of training-data composition. Textbook-grade moderate presentations are abundant in online medical literature; the unvarnished clinical realities of “DKA with altered consciousness” or “impending respiratory collapse” are far less so. Similarly, training data that teaches “don’t overreact to mild symptoms” is sparse. These are problems rooted in the architecture of how LLMs learn — not failures that further fine-tuning or RLHF will simply patch over.

What the field needs is an industry-wide safety evaluation standard for medical AI triage and a legal framework for accountability when these systems fail. The vignette-plus-factorial design used here is a credible first step toward standardization, and it should sit alongside vendor benchmarks like HealthBench. Without shared external standards now — before OpenAI, Anthropic, Google, and Microsoft’s healthcare tools are widely deployed — the field will inevitably default to the old pattern of regulation lagging behind the first major safety incident. The hopeful path forward is for papers of this kind to actively shape the regulatory and industry frameworks. Whether that integration happens, in real time, is the question that will determine whether consumer-facing medical AI becomes a constructive social infrastructure or a high-profile failure.

This article was independently summarized and organized by Morningglorysciences based on Ramaswamy A. et al. “ChatGPT Health performance in a structured test of triage recommendations.” Nature Medicine, Vol. 32, May 2026, 1671-1675 (DOI: 10.1038/s41591-026-04297-7). Medical decisions should not rely on AI tool output; always consult a qualified healthcare professional.


Author of this article

Morning Glory Sciences

After completing graduate school, I studied at a top-tier research hospital in the U.S., where I was closely involved in developing treatments and therapeutics. I have since worked for several major pharmaceutical companies, focusing on research, business, venture creation, and investment in the U.S. During this time, I have also served as a faculty member in a university graduate program.