
Mitigating memorization threats in clinical AI

New research explores how artificial intelligence foundation models trained on electronic health record data can be tested to prevent threats to patient privacy and protect against re-identification of ePHI by threat actors. 
By Andrea Fox, Senior Editor

New MIT-led research has found that electronic health record-based artificial intelligence models may memorize and reveal patient data when prompted.

WHY IT MATTERS

To measure uncertainty and assess potential attack possibilities, researchers created six open-source tests that evaluate patient privacy risks from foundational models that rely on EHR training data.

With targeted prompts, threat actors can extract specific data from the training data sets used by large language models, the researchers show in a new report.

Researchers at the Abdul Latif Jameel Clinic for Machine Learning in Health at MIT tested whether memorization risks in foundational models that rely on EHR data could pose privacy risks to patients.

Since EHR foundational models allow black-box and prompt-only access to users, the researchers designed tests to evaluate if such prompts could trigger restricted patient information disclosures.

The sensitive nature of clinical data warrants proactive assessment to mitigate memorization threats and provide a method of practical evaluation before releasing models, the researchers said.

Their tests – aimed at facilitating reproducible and collaborative privacy assessments in healthcare AI – offer a framework for evaluating memorization as these models continue to scale and see broader clinical use, they said.

"Knowledge in these high-capacity models can be a resource for many communities, but adversarial attackers can prompt a model to extract information on training data," Sana Tonekaboni, researcher at the Eric and Wendy Schmidt Center at the Broad Institute of MIT and Harvard and the report's first author, told MIT News.

Patients with unique conditions may be especially vulnerable to attacks on EHR foundational models.

While learning certain details by prompting an EHR foundational model (EHR-FM) may not indicate a privacy risk, a prompt "should not uniquely identify an individual," according to the report. "If it does, for example, the prompt includes a unique diagnosis, this puts the individual in significant privacy risk."
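To make the idea of a uniquely identifying prompt concrete, here is a minimal Python sketch. It is not taken from the researchers' report or their GitHub tests; the cohort, field names and helper functions are all hypothetical, and the check simply counts how many records in a toy cohort are consistent with the attributes a prompt contains.

```python
from typing import Mapping, Sequence

Record = Mapping[str, str]


def matching_records(prompt: Record, cohort: Sequence[Record]) -> int:
    """Count cohort records that agree with every attribute in the prompt."""
    return sum(
        all(record.get(key) == value for key, value in prompt.items())
        for record in cohort
    )


def is_uniquely_identifying(prompt: Record, cohort: Sequence[Record]) -> bool:
    """A prompt that matches exactly one record singles out one individual."""
    return matching_records(prompt, cohort) == 1


# Hypothetical toy cohort (illustration only, not real patient data).
COHORT = [
    {"age_band": "60-69", "sex": "F", "diagnosis": "type 2 diabetes"},
    {"age_band": "60-69", "sex": "F", "diagnosis": "type 2 diabetes"},
    {"age_band": "60-69", "sex": "M", "diagnosis": "hypertension"},
    {"age_band": "30-39", "sex": "F", "diagnosis": "Erdheim-Chester disease"},
]

# A prompt built from common attributes matches several patients.
common_prompt = {"age_band": "60-69", "sex": "F"}
# A prompt containing a rare diagnosis matches only one.
rare_prompt = {"diagnosis": "Erdheim-Chester disease"}

if __name__ == "__main__":
    print(matching_records(common_prompt, COHORT))
    print(is_uniquely_identifying(rare_prompt, COHORT))
```

In this toy setting, the prompt with a rare diagnosis narrows the cohort to a single record, which is the situation the report warns about.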

"Even with deidentified data, it depends on what sort of information you leak about the individual," said Tonekaboni. "Once you identify them, you know a lot more."

Four of the researchers' six open-source privacy evaluation tests available on GitHub measure memorization for information extraction. Generative tests assess a model’s ability to re-create data from its training cohort, and embedded memorization tests focus on encoder-based models.

The two remaining tests assess individual-level risks from patient-level memorization and distinguish them from benign population-level generalizations.
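The distinction between patient-level memorization and population-level generalization can be illustrated with a rough Python sketch. This is not the researchers' implementation: the `generate` callable stands in for prompt-only access to a model, the clinical codes are made up, and the scoring rule (overlap with the patient's true record minus overlap achieved by a frequency-based population baseline) is an assumption chosen for simplicity.

```python
from collections import Counter
from typing import Callable, List, Sequence

Codes = Sequence[str]


def overlap(generated: Codes, target: Codes) -> float:
    """Fraction of the target codes that appear in the generated continuation."""
    if not target:
        return 0.0
    generated_set = set(generated)
    return sum(code in generated_set for code in target) / len(target)


def population_baseline(cohort_continuations: Sequence[Codes], k: int) -> List[str]:
    """The k most frequent codes in the cohort: a stand-in for what a model
    that learned only population-level trends might produce."""
    counts = Counter(code for cont in cohort_continuations for code in cont)
    return [code for code, _ in counts.most_common(k)]


def memorization_gap(
    generate: Callable[[Codes], Codes],
    prefix: Codes,
    true_continuation: Codes,
    cohort_continuations: Sequence[Codes],
) -> float:
    """Positive values mean the model reproduces this patient's record better
    than population statistics alone would explain."""
    generated = generate(prefix)
    baseline = population_baseline(cohort_continuations, k=len(generated))
    return overlap(generated, true_continuation) - overlap(baseline, true_continuation)


# Hypothetical cohort: three patients with common codes, one with rare codes.
COHORT_CONTINUATIONS = [
    ["dx_common_a", "dx_common_b"],
    ["dx_common_a", "dx_common_b"],
    ["dx_common_a", "dx_common_c"],
    ["dx_rare_x", "dx_rare_y"],
]
PATIENT_PREFIX = ["visit_1"]
PATIENT_TRUTH = ["dx_rare_x", "dx_rare_y"]


def memorizing_model(prefix: Codes) -> Codes:
    """Toy model that regurgitates the rare patient's record."""
    return ["dx_rare_x", "dx_rare_y"]


def generalizing_model(prefix: Codes) -> Codes:
    """Toy model that predicts only common codes."""
    return ["dx_common_a", "dx_common_b"]
```

A developer could flag records whose gap exceeds some threshold for closer review. Choosing that threshold is outside the scope of this sketch.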

THE LARGER TREND

Natural language processing can unlock important patient experience data in medical records and help providers, payers, researchers and others gain valuable insights that can improve care quality and costs.

Creating diagnostic AI tools from foundational models trained on EHR data can, for example, detect rare cancers and help doctors make better treatment decisions.

Last year, Epic announced it had developed new foundational AI models, including a diagnosis assistant tool that enables physicians to create a precision cohort around patients from real-world evidence.

Meanwhile, the proposed HIPAA Security Rule update hangs in the balance. It would require new risk assessments for AI and push for stronger vendor oversight, in line with security frameworks for protecting electronic protected health information against re-identification and bias.

While the American Medical Association, health systems and numerous other healthcare organizations recently urged the U.S. Department of Health and Human Services to withdraw the proposed update, some see the health privacy rule evolving to require providers to conduct ongoing risk evaluations of third-party tools, which could include LLMs trained on EHR data.

In a conversation last year about proposed HIPAA changes, AI-driven cyber hacking tools and more, Barry Mathis, now vice president and chief technology officer of Dalton, Georgia-based Vitruvian Health, discussed third-party security vendor risk mitigation.

"Digital protection must be treated as a core responsibility across health systems," he told Healthcare IT News.

Regardless of what is required in a potential update of the HIPAA Security Rule, "Those merely checking boxes will fall behind and likely become victims of cyberattacks as well as defendants in civil and federal investigations," he said.

"The organizations that succeed will be the ones that view security as a driver of long-term stability and operational strength, not just a task for compliance teams."

ON THE RECORD

"Our work equips developers with tools to detect memorization and sets the foundations for future mitigation strategies," the MIT researchers said in their report. "Developers can use these tests to identify and address flagged samples, reduce memorization during training and better understand risks, thresholds and attacker strategies." 

Andrea Fox is senior editor of Healthcare IT News.
Email: afox@himss.org
Healthcare IT News is a HIMSS Media publication.