
Autonomous AI Agent Matches and Exceeds Physicians Across Simulated Electronic Health Record Cases
Key Takeaways:
- MIRA is an autonomous AI agent that diagnoses and plans treatment inside a simulated electronic health record, rather than acting as a narrow chat tool.
- It reached 88.9% diagnostic accuracy across 574 cases and 87.8% in a matched 311-case comparison, outperforming board-certified physicians (78.1%) and a mixed-seniority team (71.1%).
- Safety results were strong but preliminary, and the authors stress that MIRA is not a replacement for human clinicians.
A new kind of medical AI agent
A recent study published in the journal Nature introduced MIRA, an autonomous AI agent designed to operate within sandboxed electronic health record (EHR) environments. Rather than acting as a single-purpose assistant, MIRA uses a suite of digital tools to simulate the full arc of a clinical workflow. It can order tests, synthesise the results, and produce diagnoses and treatment plans. Throughout, it communicates through a chat interface with a patient AI agent whose answers are grounded in the history of present illness documented in retrospective notes from genuine cases.
The system runs on a Fast Healthcare Interoperability Resources (FHIR) based architecture, which executes the agent’s tool calls and records its medical outputs. The researchers note that the example data presented in the paper were shortened and slightly modified to comply with the privacy restrictions attached to the dataset.
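To make the idea of a FHIR-based tool call more concrete, the following is a minimal sketch of how a single lab order could be expressed as a FHIR R4 ServiceRequest resource. The helper name `build_lab_order` and the patient identifier are hypothetical, chosen only for illustration; the article does not describe MIRA's actual tool schema, so this should not be read as the authors' implementation.

```python
import json


def build_lab_order(patient_id: str, loinc_code: str, display: str) -> dict:
    """Return a FHIR R4 ServiceRequest (as a dict) for a single lab test.

    Hypothetical helper for illustration; not MIRA's actual tool interface.
    """
    return {
        "resourceType": "ServiceRequest",
        "status": "active",  # the order is live in the sandbox
        "intent": "order",   # a request that the test be performed
        "code": {
            "coding": [
                {
                    "system": "http://loinc.org",  # LOINC code system URI
                    "code": loinc_code,
                    "display": display,
                }
            ]
        },
        # Reference to the (made-up) patient the order belongs to.
        "subject": {"reference": f"Patient/{patient_id}"},
    }


# Example: order a haemoglobin measurement for a made-up patient ID.
order = build_lab_order("example-123", "718-7", "Hemoglobin [Mass/volume] in Blood")
print(json.dumps(order, indent=2))
```

Expressing each action as a structured resource, rather than as free text, is what lets an EHR-style sandbox both execute the request and keep an auditable record of it.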
Unlike earlier implementations, which were predominantly task-specific chat applications, MIRA was built to independently take in patient histories, order the relevant diagnostic tests, and then use the results to reach diagnoses and treatment plans within a controlled simulation. Across the 574 MIMIC-IV cases, MIRA achieved 88.9% diagnostic accuracy. In a matched 311-case physician comparison it reached 87.8% accuracy, significantly outperforming experienced human physicians under identical simulated conditions while demonstrating strong, though not perfect, safety and guideline performance.
Background: from passing exams to working a ward
Large language models (LLMs) have already proven highly capable at passing standardised medical examinations and answering complex clinical questions. Reviews of the field show, however, that translating this raw clinical knowledge into the operational workflow of a hospital has remained a major challenge.
This gap is attributed to the architectural design of traditional medical AI tools, which behave as narrow, task-specific search or text-generation utilities rather than as active partners in care. By contrast, true clinical decision-making is characterised as an intricate, multi-step process in which doctors repeatedly interview the people in their care, order blood tests or imaging, synthesise conflicting results, and update their hypotheses before arriving at a final treatment plan.
Nearly all of this clinical work takes place within EHR systems that rely on complex, standardised coding protocols. Until now, it had remained unproven whether an automated system could reliably handle this end-to-end clinical action space in a realistic, EHR-style environment without committing unacceptable errors.
About the study
The study set out to address this functional gap by developing MIRA, a novel AI tool designed to autonomously read medical records, identify knowledge gaps, and order diagnostic tests to fill them, before using the completed record to recommend clinical interventions.
The researchers then tested MIRA’s capabilities in a sandboxed, virtual EHR environment compliant with standard healthcare protocols, including HL7 FHIR. The sandboxed test was conducted on a curated benchmarking dataset of 574 real-world emergency department cases from the Medical Information Mart for Intensive Care (MIMIC-IV) database.
The cases spanned eight distinct diagnoses across surgery (such as appendicitis), internal medicine (such as pneumonia), and oncology (such as pancreatic cancer), which MIRA navigated using 11 specialised digital tools offering more than 85,000 operational choices. The agent was permitted to request physical examinations, order targeted laboratory values, look up medical histories, and generate medication orders within the simulated EHR, rather than in live patient care.
How MIRA was compared with clinicians
MIRA’s output was compared against two distinct groups of human physicians managing exactly the same cases under identical conditions. The first group was a cohort of four board-certified physicians. The second was a mixed-seniority team consisting of four residents and two board-certified doctors.
A separate, conventional text-based AI agent played the role of the patient for both MIRA and the human physician teams. This agent was instructed to answer questions from MIRA or its human counterparts solely on the basis of authentic clinical histories, while resisting adversarial attempts to trick it into prematurely leaking information. The authors noted, however, that simulated patient speech may be more structured than real emergency department conversations.
Study findings
The results revealed that MIRA performed at or above the level of experienced human doctors. It achieved 88.9% diagnostic accuracy across the full 574-case dataset and 87.8% accuracy in the matched 311-case physician comparison. By comparison, the board-certified physicians reached an average accuracy of 78.1% (p < 0.001), while the mixed-seniority medical cohort averaged 71.1% (p < 0.001).
MIRA excelled at identifying appendicitis and pancreatitis, and achieved 100% recall for laparoscopic appendectomies. For pancreatic cancer, its diagnostic performance was equivalent to that of the board-certified physicians, while pneumonia and urinary tract infections remained more challenging.
Accuracy without simply “ordering everything”
Notably, MIRA did not achieve its superior accuracy by simply “ordering everything”. While it was observed to request a broader, more comprehensive set of individual blood parameters than the human doctors, its overall test selection remained well below the historical baselines recorded in the dataset.
The findings further demonstrated that the model successfully avoided the systematic over-ordering of high-cost radiological imaging, matching or exceeding physicians on overall resource-alignment metrics.
Safety performance
The safety evaluations were similarly encouraging, though still preliminary. An independent, blinded medical review of 56 patient-level outputs, together with a separate assessment of 468 prescriptions written by MIRA, found zero high-severity drug–drug interactions, zero renal dosing incompatibilities, and zero medication-allergy mismatches. Route specification was the weakest prescription field, at 97% correctness.
When making critical hospital admission decisions for pneumonia and pulmonary embolism, MIRA achieved a perfect recall score of 1.00, indicating that it never missed a single person who required inpatient care. The pulmonary embolism analysis did, however, suggest a tendency towards over-admission, reflecting a cautious disposition strategy.
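A perfect recall and a tendency towards over-admission are compatible, because recall counts only missed admissions while precision captures unnecessary ones. The short sketch below illustrates this with made-up counts; the numbers are hypothetical and are not taken from the study, and the function names are chosen purely for illustration.

```python
def recall(true_pos: int, false_neg: int) -> float:
    """Share of patients who needed admission that were actually admitted."""
    return true_pos / (true_pos + false_neg)


def precision(true_pos: int, false_pos: int) -> float:
    """Share of admitted patients who actually needed admission."""
    return true_pos / (true_pos + false_pos)


# Hypothetical counts for illustration only (NOT data from the study):
# 20 patients needed admission and all were admitted (no misses),
# but 5 further patients were admitted who did not need it.
tp, fn, fp = 20, 0, 5

admission_recall = recall(tp, fn)        # 20 / (20 + 0)
admission_precision = precision(tp, fp)  # 20 / (20 + 5)

print(f"recall={admission_recall:.2f}, precision={admission_precision:.2f}")
```

In this toy example recall is 1.00 while precision is 0.80, which is the pattern a cautious disposition strategy would be expected to produce: no one who needs a bed is sent home, at the cost of some avoidable admissions.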
Conclusions
The study introduces an integrated EHR AI agent, MIRA, that successfully translates clinical intents into structured, safe, and accurate operations, with the potential to support physicians in their work. The authors are careful to caution, however, that MIRA and similar AI agents are not replacements for expert human staff.
The model did not make the correct choice for every treatment decision, with specific antibiotic selections among the shortfalls, which highlights the ongoing need for strict human supervision and patient-level safeguards. Future iterations of the model may improve by incorporating retrieval-based evidence support, stronger governance, and prospective real-world validation before any clinical deployment.