Key Takeaways
- At the San Antonio Breast Cancer Symposium (SABCS 2025), Joseph Sparano and colleagues at Mount Sinai presented a multimodal AI model for recurrence prediction that outperformed the current standard, the 21-gene Oncotype DX test. Source: AACR Cancer Discovery News (DOI: 10.1158/2159-8290.CD-NW2025-0116).
- The multimodal AI model (ICM+: integrated clinical, imaging, and expanded molecular data) achieved a C-index of 0.733 for 15-year distant recurrence in HR+/HER2- breast cancer; Oncotype DX scored 0.631. The gap was even larger for late recurrence beyond 5 years: 0.705 vs 0.527.
- The data come from a re-analysis of TAILORx, the prospective phase III trial that enrolled 10,273 women. Oncotype DX’s known weakness — predicting late recurrence — appears to be addressable by combining imaging and molecular data.
- This is the final volume of “Reading Breast Cancer Diagnosis with AI.” Following detection (Vol. 1) and workflow (Vol. 2), we examine what it means for AI to step into the heart of treatment decision-making — recurrence risk assessment — including TAILORx’s clinical context and the limitations that still apply.
Introduction — How a 21-Gene Test Helped Doctors “Withhold” Chemotherapy
Volumes 1 and 2 of this series tracked AI’s contributions on the image-reading side: MASAI sharpening screening detection, and AITIC reshaping reading workflow with a 64% workload shift. By now, it should be clear that AI in the “viewing the image” stage is steadily maturing.
Volume 3 reframes the lens. The theme is “what to do once the cancer is found.” In breast cancer care, this is AI’s hardest territory. Treatment decisions, when wrong, are not easily reversed; they shape patients’ lives directly.
For HR+ (hormone receptor-positive)/HER2- breast cancer — the most common subtype — the standard post-surgical approach is endocrine therapy (tamoxifen, aromatase inhibitors) for 5 to 10 years. The challenge has been that the population breaks into two camps: women who do well on hormone therapy alone, and women who benefit from added chemotherapy. Chemotherapy carries heavy side effects and meaningfully harms quality of life. The clinical aspiration: deliver chemo only to those who truly need it, and spare the rest.
The most consequential tool for drawing that line has been the Oncotype DX 21-gene test (Genomic Health, now Exact Sciences). It assigns a recurrence risk score from a tumor’s expression of 21 genes and predicts the marginal benefit of adding chemotherapy. Its clinical utility was decisively established by the 2018 publication of TAILORx in The New England Journal of Medicine. With 10,273 enrolled patients, this phase III trial demonstrated that intermediate-risk women on hormone therapy alone fared no worse than those on hormone plus chemo. It became the global standard for “who can safely skip chemotherapy.”
Oncotype DX has known limits, however. Its predictive power deteriorates beyond 5 years. HR+ breast cancer can recur at 10, 15, or even 20 years post-diagnosis, and late recurrence has been a long-running clinical pain point.
In December 2025, Joseph Sparano, the very person who led TAILORx, presented at SABCS an approach aimed at breaking that ceiling. A multimodal AI model integrating clinical (C), imaging (I), and molecular (M) data showed accuracy exceeding that of Oncotype DX.
This piece walks through how the work was built — from the mechanics of Oncotype DX, to the construction of the multimodal model, to the hurdles facing clinical adoption.
Main
1. HR+/HER2- Breast Cancer and Recurrence Risk — Why It Matters
First, the basics of HR+/HER2- breast cancer.
Breast cancer divides into four major subtypes by HR status (positive vs. negative) and HER2 status (positive vs. negative). HR+/HER2- is the largest, accounting for 60-70% of all breast cancers. The cancer’s growth is driven by hormones such as estrogen and progesterone, which is why endocrine therapy — drugs that block hormonal signaling — is the backbone of treatment.
After surgery and radiotherapy, endocrine therapy is continued for 5 to 10 years to prevent recurrence. The decision point is whether to add chemotherapy on top.
For women who benefit from it, adding chemotherapy further reduces recurrence risk. But the side effect burden is heavy: hair loss, nausea, bone marrow suppression, and longer-term cognitive and cardiac toxicity risks. Quality of life takes a substantial hit; work, parenting, and daily routines are all affected.
Distinguishing “needs hormone therapy alone” from “also needs chemo” with as much accuracy as possible has been a central biomarker question in breast oncology for two decades.
2. Oncotype DX — How 21 Genes Become a Risk Score
Oncotype DX measures expression levels of 21 genes in the tumor and converts them to a quantitative score (0-100). The 21 genes span proliferation, estrogen-related, HER2-related, invasion, and reference (housekeeping) categories.
The score predicts both 10-year distant recurrence risk and the marginal benefit of adding chemotherapy. The standard interpretation:
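- Low risk (score 0-10 under the TAILORx cutoffs; below 18 under the test’s original cutoffs): hormone therapy alone
- Intermediate (11-25 under the TAILORx cutoffs): individualized decision (age, lymph node status, patient preference)
- High risk (26 or higher under the TAILORx cutoffs): add chemotherapy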
TAILORx (NEJM 2018;379:111-21) provided the noninferiority evidence for the intermediate group with hormone therapy alone, settling a long-standing clinical question.
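The score-to-category mapping can be written as a small lookup. This is a minimal sketch, assuming the 11 and 26 boundaries from the list above and leaving out the alternative low-risk cutoff of 18; the function name `tailorx_risk_group` is hypothetical, and the mapping is for illustration only, not a clinical tool.

```python
def tailorx_risk_group(recurrence_score: int) -> str:
    """Map a 21-gene recurrence score (0-100) to a risk group.

    Boundaries follow the list in the text (low below 11, high above 25).
    This is an illustrative sketch, not the vendor's reporting logic.
    """
    if not 0 <= recurrence_score <= 100:
        raise ValueError("recurrence score must be between 0 and 100")
    if recurrence_score <= 10:
        return "low"            # hormone therapy alone
    if recurrence_score <= 25:
        return "intermediate"   # individualized decision
    return "high"               # add chemotherapy


for score in (8, 18, 31):
    print(score, tailorx_risk_group(score))
```

In practice, the intermediate group is where age, lymph node status, and patient preference enter the decision, so a single label is only the starting point.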
Oncotype DX is now embedded in clinical guidelines worldwide. In the U.S., over 100,000 tests are run annually. Japan’s national insurance reimburses the test, and it is widely used to guide chemo decisions in early-stage HR+/HER2- breast cancer.
3. Where Oncotype DX Falls Short — Late Recurrence
Despite its central role, Oncotype DX has long-acknowledged weaknesses.
- Late recurrence (beyond 5 years) is hard to predict. HR+ breast cancer can recur at 10, 15, or 20 years. Oncotype DX was built to estimate 10-year risk, and its ability to rank patients weakens for recurrences that arrive after the first 5 years.
- It does not capture intratumoral heterogeneity. The test relies on bulk genomics. Clonal diversity within the tumor, stroma, and microenvironment effects are invisible to it.
- It tilts toward proliferation genes. The score’s construction over-weights proliferation indicators (including Ki-67-related signals), which limits its ability to capture dormancy and re-emergence biology.
This is exactly the limit Sparano set out to break at SABCS 2025: “Could we do better long-term prognostication than the 21-gene score that was used in TAILORx?”
4. Multimodal AI — What Was Combined
Sparano’s team built a multimodal AI model integrating three data streams.
- C (Clinical): age, lymph node status, tumor size, grade, hormone receptor and HER2 status, and so on.
- I (Imaging): tumor histopathology images (whole-slide hematoxylin-eosin stained images).
- M (Molecular): the 21 Oncotype DX genes plus an expanded set from whole-transcriptome sequencing (M+).
Of the 10,273 TAILORx participants, 4,462 had both tumor tissue and whole-transcriptome sequencing data. 63% (2,810) were used for training, 37% (1,652) for validation.
Multiple model configurations were tested: single-modality (C alone, I alone, M alone) and multi-modality (CIM, ICM+, etc.). The top performer was the ICM+ model, which integrates all three streams plus the expanded molecular data and learns interactions between them.
Sparano’s framing at the press briefing: ICM+ “not only weights how much each source contributes, but also captures interactions” — for example, how a particular gene signature changes the interpretation of a particular tissue pattern.
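To make the difference between "weighting" and "interaction" concrete, here is a toy sketch. The presentation summary does not describe the ICM+ architecture, so everything below is an assumption for illustration: a linear combination of per-modality risk scores plus a product term, with invented weights and a hypothetical function name `fused_risk`.

```python
from itertools import combinations


def fused_risk(scores, weights, interactions):
    """Combine per-modality risk scores with weights and pairwise interactions.

    scores, weights: dicts keyed by modality name ("C", "I", "M").
    interactions: dict keyed by an alphabetically sorted modality pair,
                  for example ("I", "M"). Missing pairs contribute nothing.
    All numbers used with this function here are invented.
    """
    main = sum(weights[m] * scores[m] for m in scores)
    cross = sum(
        interactions.get(pair, 0.0) * scores[pair[0]] * scores[pair[1]]
        for pair in combinations(sorted(scores), 2)
    )
    return main + cross


weights = {"C": 0.3, "I": 0.4, "M": 0.5}   # hypothetical modality weights
interactions = {("I", "M"): 0.8}           # hypothetical imaging x molecular term

# Same tissue pattern (I = 1.0), different molecular context (M = 0.0 vs 1.0).
patient_a = {"C": 0.2, "I": 1.0, "M": 0.0}
patient_b = {"C": 0.2, "I": 1.0, "M": 1.0}
print(fused_risk(patient_a, weights, interactions))
print(fused_risk(patient_b, weights, interactions))
```

With a weights-only model, the imaging score would add the same amount to both patients. The product term lets the same tissue pattern count for more when a particular molecular signal is also present, which is the kind of effect the quote describes.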
5. Results — Head-to-Head with Oncotype DX
Performance was measured by the C-index (concordance index): how well a model ranks individuals by their actual time to event (here, distant recurrence). A value of 1.0 means perfect ranking, 0.5 is no better than chance, and 0.7 or higher is generally considered clinically useful.
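For readers who want to see what the metric computes, the following is a minimal sketch of Harrell's C-index for right-censored data. It assumes no tied follow-up times and uses invented toy data; the study's exact estimator is not stated in the source, so this illustrates the general idea rather than the authors' code.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    times:  follow-up time per patient
    events: 1 if distant recurrence was observed, 0 if censored
    risks:  model risk score (higher means expected to recur sooner)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is anchored on a patient with an observed event
        for j in range(n):
            if times[i] < times[j]:      # i recurred while j was still event-free
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0    # the model ranked this pair correctly
                elif risks[i] == risks[j]:
                    concordant += 0.5    # tied risk scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy cohort: four patients, one censored at year 6.
print(concordance_index([2, 4, 6, 8], [1, 1, 0, 1], [0.9, 0.4, 0.5, 0.1]))  # 0.8
```

In the toy cohort, five pairs are comparable and the model orders four of them correctly, giving 0.8. A model that ranks pairs at random would land near 0.5.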
| Prediction target | ICM+ model | Oncotype DX | Difference |
|---|---|---|---|
| 15-year distant recurrence (overall) | 0.733 | 0.631 | +0.102 |
| Late recurrence (after 5 years) | 0.705 | 0.527 | +0.178 |
A 15-year C-index of 0.733 sits firmly in clinically useful territory. Oncotype DX’s 0.631 is not bad in itself, but ICM+ moved meaningfully past it.
The standout finding is late recurrence. Oncotype DX’s 0.527 is barely above chance (0.5). In other words, in this analysis Oncotype DX was close to uninformative for ranking recurrences beyond 5 years, whereas ICM+’s 0.705 is solidly useful.
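The arithmetic behind "barely above chance" is easy to check. The short sketch below uses only the C-index values reported in the table and measures each one against 0.5, the value expected from random ranking; the helper name `margin_over_chance` is hypothetical.

```python
# C-index values as reported in the table above.
reported = {
    "15-year distant recurrence": {"ICM+": 0.733, "Oncotype DX": 0.631},
    "late recurrence (after 5 years)": {"ICM+": 0.705, "Oncotype DX": 0.527},
}


def margin_over_chance(c_index):
    """Distance of a C-index from 0.5, the value expected from random ranking."""
    return round(c_index - 0.5, 3)


for target, scores in reported.items():
    gap = round(scores["ICM+"] - scores["Oncotype DX"], 3)
    print(target, "| gap between models:", gap)
    for model, c_index in scores.items():
        print("   ", model, "margin over chance:", margin_over_chance(c_index))
```

For late recurrence, the margins come out to 0.205 for ICM+ and 0.027 for Oncotype DX, which is why the late-recurrence row is the more striking of the two.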
HR+/HER2- late recurrence — when patients have started to feel “in the clear” — is the hardest pattern to manage clinically. It also drives the decision about whether to extend endocrine therapy from 5 to 10 years. Improving this prediction translates directly into patient benefit.
6. Why Imaging Mattered — Different Data Excel at Different Time Horizons
One of the most interesting findings is that different data streams shine for different time windows.
- Molecular (M): strong for early recurrence (within 5 years). It directly reflects tumor cell proliferation and activity.
- Imaging (I): strong for late recurrence (beyond 5 years). It captures the relationship between tumor and surrounding tissue (stroma, tumor microenvironment).
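One common way to score a model separately on early and late recurrence is to censor follow-up at 5 years for the early window and to restrict the cohort to patients still recurrence-free at 5 years for the late window (a landmark analysis). The source does not say how the study split the windows, so the sketch below is an assumption, and the cohort and risk scores are invented to mimic the pattern described above.

```python
def concordance_index(times, events, risks):
    """Harrell's C-index; pairs are anchored on a patient with an observed event."""
    concordant, comparable = 0.0, 0
    for t_i, e_i, r_i in zip(times, events, risks):
        if not e_i:
            continue
        for t_j, r_j in zip(times, risks):
            if t_i < t_j:
                comparable += 1
                concordant += 1.0 if r_i > r_j else 0.5 if r_i == r_j else 0.0
    return concordant / comparable


def censor_at(times, events, horizon_years=5.0):
    """Early window: stop follow-up at the horizon, later events become censored."""
    new_times = [min(t, horizon_years) for t in times]
    new_events = [e if t <= horizon_years else 0 for t, e in zip(times, events)]
    return new_times, new_events


def at_risk_after(times, landmark_years=5.0):
    """Late window: indices of patients still recurrence-free at the landmark."""
    return [i for i, t in enumerate(times) if t > landmark_years]


# Invented toy cohort: years of follow-up, event flags, and two risk scores.
times = [1, 3, 7, 9, 12, 15]
events = [1, 1, 1, 1, 0, 0]
molecular = [0.9, 0.8, 0.2, 0.3, 0.3, 0.2]
imaging = [0.5, 0.5, 0.9, 0.7, 0.2, 0.1]

early_t, early_e = censor_at(times, events)
late = at_risk_after(times)
late_t = [times[i] for i in late]
late_e = [events[i] for i in late]

for name, risks in (("molecular", molecular), ("imaging", imaging)):
    early_c = concordance_index(early_t, early_e, risks)
    late_c = concordance_index(late_t, late_e, [risks[i] for i in late])
    print(name, "early:", early_c, "late:", late_c)
```

The toy numbers are deliberately extreme so the contrast is visible; real cohorts show much smaller differences, and the choice of landmark time is itself a modeling decision.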
Savitri Krishnamurthy, breast surgical pathologist at MD Anderson Cancer Center, framed it in the AACR piece: “Molecular features of the invasive tumor alone are not sufficient to capture the influence of tumor stroma and the tumor microenvironment in predicting tumor biology.” She added that Oncotype DX “relies on alterations of proliferation-related genes” but “does not take into consideration tumor heterogeneity because it uses bulk genomics.”
To predict late recurrence, in other words, you need to read the relationship between tumor cells and the tissue surrounding them — immune infiltration, fibroblast state, vasculature, architectural patterns. Histopathology images carry that signal. That is the mechanistic interpretation of why ICM+ won so decisively for late recurrence.
7. Hurdles to Clinical Adoption — The Prospective Validation Gap
The results are striking, but several hurdles remain. Krishnamurthy: “These kinds of models, including those in Sparano’s study, are typically evaluated using retrospective data, while prospective clinical studies are lacking.”
Retrospective evaluation re-uses past TAILORx outcomes to ask “if AI had been used, how accurate would it have been?” That is powerful but carries embedded biases from prior treatment choices and patient selection. True clinical impact requires prospective validation.
A paradoxical second issue is model proliferation: “as more AI models are developed and show benefit, clinicians will have a more difficult time deciding which ones to implement.” This is a universal problem in medical AI. Cross-system comparison, standardization, and regulatory frameworks have not kept pace.
8. Data Scale and Quality — The Real Driver of AI Performance
What makes Sparano’s work possible is its grounding in a large, high-quality, prospective dataset. TAILORx followed 10,273 women for over a decade, and 4,462 of them have both tumor tissue and whole-transcriptome data linked to those outcomes. That combination is a globally rare resource.
Krishnamurthy emphasizes: “The availability of high-quality, annotated, large datasets opens the doors to building robust AI models that are bound to bring revolutionary changes in medicine.”
The flip side: AI performance is dominated less by “algorithmic genius” and more by data quality and scale. The future of clinical AI rests as much on long-term prospective cohorts, data standardization, and patient-consent frameworks as on advances in computation.
9. Treatment Decisions — How Far Should AI Reach?
Once multimodal AI improves recurrence prediction, it inevitably touches the chemotherapy add-on decision. Oncotype DX already plays this role, but a shift to ICM+-class models raises new questions.
- Explainability. When AI flags “high recurrence risk,” how do we explain to physicians and patients what drove that call? Bulk 21-gene scores can be explained gene by gene; deep multimodal models risk becoming black boxes.
- Decision sovereignty. The chemo decision is ultimately the patient’s. If an AI score is presented as a “strong recommendation,” patient autonomy can be eroded in practice.
- Regulation and reimbursement. FDA and PMDA approval, evidence of clinical utility for reimbursement, cost-effectiveness — all need to be in place.
- Health equity. Demographic skew in training data risks degraded AI performance for under-represented populations. Equity is non-negotiable here.
10. Global Adoption Pathways — with Notes on Japan and Asia
Globally, Oncotype DX is the established standard for guiding chemotherapy decisions in early-stage HR+/HER2- breast cancer. It is reimbursed across most developed markets — the U.S., UK, Germany, France, and Japan among them — with annual test volumes well into the hundreds of thousands worldwide. A successor like a multimodal AI model would need to clear several thresholds simultaneously.
- Prospective evidence of clinical utility. Retrospective re-analysis of TAILORx data is necessary but not sufficient. Regulators in the U.S., EU, and Asia will all require prospective trials demonstrating that decisions guided by the new model produce equal or better outcomes than current standards.
- Reimbursement and HTA frameworks. CMS, NICE, G-BA in Germany, and equivalent bodies in Asia each have their own cost-effectiveness thresholds. Multimodal AI’s higher infrastructure cost (whole-transcriptome sequencing, image storage, model deployment) means HTA economics will be a significant gating factor.
- Molecular infrastructure. Whole-transcriptome sequencing is not yet routine in most healthcare systems. Cost structure and laboratory capacity will need to evolve in parallel — a global supply-side challenge, not a single-country one.
- Workforce development. Pathologists and medical oncologists capable of integrating multimodal AI outputs into care decisions need to be trained and credentialed. Professional societies (ASCO, ESMO, JSCO and others) will need to update guidelines.
- Health equity across populations. Training data dominated by European or North American cohorts risks degraded performance in other populations. Validation in geographically and demographically diverse cohorts is non-negotiable.
Notes on Japan and Asia. Japan reimburses Oncotype DX nationally and has built up domestic infrastructure including JCOG (Japan Clinical Oncology Group) cohorts and tissue banks at major cancer centers — a foundation that could support the prospective validation work needed for a multimodal AI successor. Korea, Taiwan, and Singapore have similarly invested in clinical genomics platforms and could move quickly once an internationally validated model emerges. Whole-transcriptome sequencing remains outside routine reimbursed care across most of Asia, so any regional rollout will need to address infrastructure gaps in parallel with regulatory approvals.
Conclusion
- The SABCS 2025 multimodal AI model (ICM+) outperformed Oncotype DX for 15-year distant recurrence in HR+/HER2- breast cancer (C-index 0.733 vs 0.631).
- The standout result was late recurrence prediction (0.705 vs 0.527) — a domain Oncotype DX could not meaningfully address.
- Molecular data drove early-recurrence prediction; imaging drove late-recurrence prediction. The value of histopathology-derived microenvironment signals was reaffirmed.
- Clinical adoption faces multilayer hurdles: prospective validation, explainability, regulation, health equity.
- The deepest determinant of AI performance is data quality and scale, not algorithmic novelty. Building long-term cohort infrastructure is the next competitive frontier.
My Perspective & Outlook
Multimodal AI may move clinical decision support in breast cancer up a meaningful step globally. What I find most striking is that AI can indirectly read the tumor microenvironment via histopathology images, territory Oncotype DX could not reach. This aligns with the central oncology insight of the past decade: knowing the tumor cells alone does not predict patient fate.
For a model like ICM+ to translate from a SABCS presentation into routine practice anywhere in the world, the same set of pieces must come together: prospective validation in diverse cohorts, regulatory clearance, reimbursement frameworks, and an explainability layer that physicians and patients can actually use at the bedside. Equally critical is preserving patient autonomy when AI scores are presented as “strong recommendations”; this is medical ethics, not technology, and the answer cannot be outsourced to algorithm vendors.
Across all three volumes of this series, the pattern is consistent: AI is seeping into every stage of breast cancer care, from screening through treatment decisions. Whether we call this “the democratization of medicine” or “algorithmic dependency” depends on the choices each healthcare system now makes about data, about consent, and about who bears responsibility when prediction goes wrong.
For readers tracking adoption in Japan and Asia specifically, the JCOG-class long-term cohorts and tissue banks already in place could be a meaningful asset for prospective validation, and the region’s pace of clinical genomics adoption may well determine how quickly multimodal AI moves from exception to standard of care for HR+/HER2- patients in this part of the world.
Series Conclusion
Across the three volumes of “Reading Breast Cancer Diagnosis with AI,” we have followed the impact AI is having on breast cancer care through three operational stages: detection (MASAI), workflow (AITIC), and recurrence prediction (multimodal AI). The shared message is that AI works not as a “physician replacement” but as infrastructure that augments human judgment and lifts an organization’s overall reading capacity. The technology is maturing. The next frontier is operational design, regulation, data governance, and patient participation — the redesign of healthcare institutions themselves. I hope this series provides a useful starting point for that conversation.
Edited by the Morningglorysciences team.
