Newswise — Artificial intelligence is poised to transform medicine in myriad ways, including its promise to act as a trusted diagnostic aide to busy clinicians.

Over the past two years, proprietary AI models, also known as closed-source models, have excelled at solving hard-to-crack medical cases that require complex clinical reasoning. Notably, these closed-source AI models have outperformed open-source ones, so-called because their source code is publicly available and can be tweaked and modified by anyone.

Has open-source AI caught up?

The answer appears to be yes, at least when it comes to one such open-source AI model, according to the findings of a new NIH-funded study led by researchers at Harvard Medical School and done in collaboration with clinicians at Harvard-affiliated Beth Israel Deaconess Medical Center and Brigham and Women’s Hospital. 

The results show that a challenger open-source AI tool called Llama 3.1 405B performed on par with GPT-4, a leading proprietary closed-source model. In their analysis, the researchers compared the performance of the two models on 92 mystifying cases featured in The New England Journal of Medicine’s weekly rubric of diagnostically challenging clinical scenarios.

The findings suggest that open-source AI tools are becoming increasingly competitive and could offer a valuable alternative to proprietary models.

“To our knowledge, this is the first time an open-source AI model has matched the performance of GPT-4 on such challenging cases as assessed by physicians,” said senior author Arjun Manrai, assistant professor of biomedical informatics in the Blavatnik Institute at HMS. “It really is stunning that the Llama models caught up so quickly with the leading proprietary model. Patients, care providers, and hospitals stand to gain from this competition.”

The pros and cons of open-source and closed-source AI systems 

Open-source AI and closed-source AI differ in several important ways. First, open-source models can be downloaded and run on a hospital’s private computers, keeping patient data in-house. In contrast, closed-source models operate on external servers, requiring users to transmit private data externally. “The open-source model is likely to be more appealing to many chief information officers, hospital administrators, and physicians since there’s something fundamentally different about data leaving the hospital for another entity, even a trusted one,” said the study’s lead author, Thomas Buckley, a doctoral student in the HMS Department of Biomedical Informatics.

Second, medical and IT professionals can tweak open-source models to address unique clinical and research needs, while closed-source tools are generally more difficult to tailor. “This is key,” said Buckley. “You can use local data to fine-tune these models, either in basic ways or sophisticated ways, so that they’re adapted for the needs of your own physicians, researchers, and patients.”

Third, closed-source AI developers such as OpenAI and Google host their own models and provide traditional customer support, while open-source models place the responsibility for model setup and maintenance on the users. And at least so far, closed-source models have proven easier to integrate with electronic health records and hospital IT infrastructure.

Open-source AI versus closed-source AI: A scorecard for solving challenging clinical cases

Both open-source and closed-source AI algorithms are trained on immense datasets that include medical textbooks, peer-reviewed research, clinical-decision support tools, and anonymized patient data, such as case studies, test results, scans, and confirmed diagnoses. By scrutinizing these mountains of material at hyperspeed, the algorithms learn patterns. For example, what do cancerous and benign tumors look like on a pathology slide? What are the earliest telltale signs of heart failure? How do you distinguish between a normal and an inflamed colon on a CT scan? When presented with a new clinical scenario, AI models compare the incoming information to content they’ve assimilated during training and propose possible diagnoses.

In their analysis, the researchers tested Llama on 70 challenging clinical NEJM cases previously used to assess GPT-4’s performance and described in an earlier study led by Adam Rodman, HMS assistant professor of medicine at Beth Israel Deaconess and co-author on the new research. In the new study, the researchers added 22 new cases published after the end of Llama’s training period to guard against the chance that Llama may have inadvertently encountered some of the 70 published cases during its basic training.

The open-source model proved slightly better: Llama made a correct diagnosis in 70 percent of cases, compared with 64 percent for GPT-4. It also ranked the correct choice as its first suggestion 41 percent of the time, compared with 37 percent for GPT-4. For the subset of 22 newer cases, the open-source model scored even higher, making the right call 73 percent of the time and identifying the final diagnosis as its top suggestion 45 percent of the time.
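To make the scorecard concrete, the minimal sketch below shows how the two headline metrics might be computed: whether the confirmed diagnosis appears anywhere in a model’s ranked differential, and whether it is the model’s top suggestion. It is purely illustrative — the toy cases are invented, and the study’s actual grading was performed by physicians rather than by string matching.

```python
# Illustrative only: toy data, not from the study. The study's grading
# was done by physician assessment rather than exact string matching.

cases = [
    # Each entry pairs the confirmed final diagnosis with a model's
    # ranked list of suggested diagnoses (the "differential").
    {"final": "sarcoidosis", "differential": ["sarcoidosis", "lymphoma", "tuberculosis"]},
    {"final": "amyloidosis", "differential": ["multiple myeloma", "amyloidosis"]},
    {"final": "giant cell arteritis", "differential": ["migraine", "temporal arteritis"]},
]

n = len(cases)
anywhere = sum(c["final"] in c["differential"] for c in cases) / n
top_pick = sum(c["differential"][0] == c["final"] for c in cases) / n

print(f"Correct diagnosis anywhere in differential: {anywhere:.0%}")  # 67%
print(f"Correct diagnosis as top suggestion:        {top_pick:.0%}")  # 33%
```

Note that the third toy case counts as a miss under string matching even though “temporal arteritis” names the same disease, which illustrates why physician assessment, as used in the study, is a more meaningful standard than automated matching.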

“As a physician, I’ve seen much of the focus on powerful large language models center around proprietary models that we can’t run locally,” said Rodman. “Our study suggests that open-source models might be just as powerful, giving physicians and health systems much more control on how these technologies are used.”

Each year, some 795,000 patients in the United States die or suffer permanent disability due to diagnostic error, according to a 2023 report.

Beyond the immediate harm to patients, diagnostic errors and delays can place a serious financial burden on the health care system. Inaccurate or late diagnoses may lead to unnecessary tests, inappropriate treatment, and, in some cases, serious complications that become harder — and more expensive — to manage over time.

“Used wisely and incorporated responsibly in current health infrastructure, AI tools could be invaluable copilots for busy clinicians and serve as trusted diagnostic aides to enhance both the accuracy and speed of diagnosis,” Manrai said. “But it remains crucial that physicians help drive these efforts to make sure AI works for them.”

Authorship, funding, disclosures

Additional authors include Byron Crowe and Raja-Elie E. Abdulnour.

This project was supported by award K01HL138259 from the National Heart, Lung, and Blood Institute and by an award from Harvard Medical School.

Crowe reported receiving personal fees from Solera Health outside the submitted work. Rodman reported receiving grants from the Gordon and Betty Moore Foundation outside the submitted work.