Two Surprising Lessons Learned Applying Healthcare LLMs

Large language models (LLMs) represent a leap in the ability to understand medical language and context. From passing the US medical licensing exam to summarizing clinical notes, major strides have been made in applying generative artificial intelligence (AI) in healthcare over the last year.

But much like we’ve seen with generative AI in other domains, healthcare LLMs suffer from a wide range of issues, including hallucinations, lack of robustness, privacy risks, and bias, which hinder many would-be use cases. Often, solutions that are good enough for other industries won’t cut it in healthcare, because the stakes are higher for clinical and biomedical tasks.

As we approach the new year, it’s time to take a look back at the trends impacting LLMs, what’s going right, and where there’s room for improvement. Based on shared experiences from leaders in the AI community coming together at this year’s NLP Summit, two unexpected insights about LLMs in healthcare emerged.

By exploring these areas through the lens of real practitioners in the field, we can start to build a set of best practices, understand where more attention is needed, and know what to build next.

  • LLMs are Often Just a User Interface in Healthcare

A large language model is a type of AI algorithm that uses deep learning techniques and massive data sets to understand, summarize, predict, and generate new content (TechTarget). Generative AI is closely related but not identical: the term refers to models created specifically to generate new content, of which text-generating LLMs are one prominent type.

That’s why it may come as a surprise that in healthcare, LLMs are usually used only as a natural-language user interface, rather than for their ability to memorize information and answer questions. For example, in use cases that involve asking about patients, clinical trials, or medications, people want the chatbot experience: asking a natural-language question and getting a quick, contextually relevant response. This is easier than writing SQL or emailing your data analysis team and waiting for an answer.
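
To make the pattern concrete, here is a minimal sketch in which the LLM’s only job is to translate a natural-language question into a read-only SQL query, and the actual answer is computed from the database. The OpenAI client, the gpt-4o-mini model name, and the patients table schema are illustrative assumptions, not a prescribed stack.

```python
# Minimal sketch: the LLM acts as a natural-language front end to a vetted
# database. It only writes the query; the answer comes from the data itself.
import sqlite3

from openai import OpenAI  # assumes the openai v1 client is installed

# Illustrative schema; replace with your own tables.
SCHEMA = "patients(patient_id TEXT, age INTEGER, primary_diagnosis TEXT)"


def question_to_sql(question: str) -> str:
    """Ask the LLM for a single SQLite SELECT statement answering the question."""
    client = OpenAI()
    prompt = (
        f"Table schema: {SCHEMA}\n"
        f"Write one SQLite SELECT statement, with no commentary, that answers: {question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()


def answer(question: str, db_path: str = "ehr.db") -> list:
    sql = question_to_sql(question)
    if not sql.lower().lstrip().startswith("select"):
        raise ValueError("Refusing to run non-SELECT SQL: " + sql)
    with sqlite3.connect(db_path) as conn:
        # The response shown to the user is computed from the database,
        # not from anything the model memorized during training.
        return conn.execute(sql).fetchall()


# Example: answer("How many patients over 65 have a diabetes diagnosis?")
```

Restricting execution to SELECT statements is one simple guardrail; a production system would add query validation, access controls, and auditing on top of it.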

However, in most cases, the buck stops there. LLM-powered chatbots may hold a great deal of medical knowledge, but where they get their information is crucial. For example, if a customer service representative is trying to find specific information about a patient’s insurance policy, only one document can be used to answer: the most recent version of that specific patient’s policy. Fine-tuning an LLM on all policies doesn’t work. The same goes for queries about patient information: only the most recent, vetted data about that specific patient can be used, not whatever the LLM may have memorized.
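
A minimal sketch of that constraint, using an illustrative in-memory policy store and the same OpenAI client as above: the code first selects the single most recent policy for the patient, then instructs the model to answer only from that document.

```python
# Minimal sketch: ground every answer in the one most recent, vetted policy
# document for this patient, rather than in anything the model has memorized.
from datetime import date

from openai import OpenAI

# Illustrative store; in practice this is your document system of record.
policies = [
    {"patient_id": "P001", "effective": date(2022, 1, 1), "text": "...2022 policy text..."},
    {"patient_id": "P001", "effective": date(2023, 7, 1), "text": "...2023 policy text..."},
]


def latest_policy(patient_id: str) -> dict:
    """Return the most recent effective policy for this specific patient."""
    candidates = [p for p in policies if p["patient_id"] == patient_id]
    return max(candidates, key=lambda p: p["effective"])


def answer_from_policy(patient_id: str, question: str) -> str:
    doc = latest_policy(patient_id)
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": "Answer only from the policy text provided. "
                           "If the answer is not in the text, say so.",
            },
            {"role": "user", "content": f"Policy:\n{doc['text']}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The key point is that document selection happens outside the model, against the system of record, so the answer can only come from the vetted source.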

There’s also the explainability factor. Users need to explain and discuss the answers to most queries. If you’re looking for all patients with diabetes, doctors don’t take the answers an LLM provides at face value. They consistently ask why: Why is this patient classified as diabetic? What were the exact criteria? Can I verify this, or change the definition? Every user experiment provides more evidence of this interaction pattern: medical questions are not “one-and-done,” but the start of a longer conversation.
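
One way to support that conversation is to keep the cohort definition explicit and editable, and to return the evidence behind each match alongside the answer. A minimal sketch, using an illustrative (not clinically endorsed) diabetes definition over structured data:

```python
# Minimal sketch: an explicit, editable cohort definition that returns the
# criteria each patient matched, so a clinician can ask "why" and verify it.
def diabetes_evidence(patient: dict) -> list:
    """Return the reasons this patient matches the (editable) definition."""
    reasons = []
    # Criterion 1: an ICD-10 E11.* (type 2 diabetes) code on record.
    if any(code.startswith("E11") for code in patient.get("icd10_codes", [])):
        reasons.append("ICD-10 code E11.* on record")
    # Criterion 2: most recent HbA1c at or above 6.5%.
    if patient.get("latest_hba1c", 0.0) >= 6.5:
        reasons.append(f"HbA1c {patient['latest_hba1c']}% >= 6.5%")
    return reasons


patients = [
    {"id": "P001", "icd10_codes": ["E11.9"], "latest_hba1c": 7.2},
    {"id": "P002", "icd10_codes": ["I10"], "latest_hba1c": 5.4},
]

for p in patients:
    reasons = diabetes_evidence(p)
    if reasons:
        # The answer ships with the criteria that fired, not just a yes/no.
        print(p["id"], "->", "; ".join(reasons))
```

Because the criteria live in plain code or configuration rather than in a model’s weights, a clinician can inspect them, challenge them, or change the definition and rerun the query.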

  • Current GPT Models are Poor at Information Extraction

Despite these shortcomings, healthcare LLMs are being used, and some perform well on certain tasks, such as question answering, text generation, and summarization. As it stands, though, they are not as good as current state-of-the-art healthcare language models at information extraction; there are far smaller, faster, and more accurate models for that.

Social determinants of health, for example, are the non-clinical factors that affect how long and how well you live, beyond your clinical indications and history: your social support network, employment status, income, and whether or not you experience food or housing insecurity. In a recent public benchmark on extracting social determinants of health from a set of clinical documents, GPT-4 made 3x as many mistakes as current state-of-the-art models.
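
Framed as a task, social determinants of health extraction is named entity recognition over clinical text. A minimal sketch using the Hugging Face transformers pipeline; the model identifier below is a placeholder, and you would substitute a clinical SDOH model you have validated:

```python
# Minimal sketch: SDOH extraction as token classification (NER).
from transformers import pipeline

sdoh_ner = pipeline(
    "token-classification",
    model="your-org/clinical-sdoh-ner",  # placeholder model id, not a real checkpoint
    aggregation_strategy="simple",       # merge word pieces into whole entity spans
)

note = (
    "Patient lives alone, recently lost his job, and reports "
    "skipping meals due to cost."
)

for entity in sdoh_ner(note):
    # Each aggregated entity carries a label, the matched text, and a score.
    print(entity["entity_group"], "|", entity["word"], "|", round(float(entity["score"]), 2))
```

Benchmarks like the one above score the entity spans and labels a model produces against human-annotated gold spans, which is where the 3x gap in mistakes shows up.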

De-identification is another important task. As the name implies, de-identification redacts or obfuscates personally identifiable information in clinical text so that the data can be analyzed and shared safely. In a public benchmark, ChatGPT made 5x more mistakes than John Snow Labs’ solution. That is the difference between a fully automated de-identification process (where >99% accuracy is achieved) and one that still requires human intervention: a human is still needed to review the de-identification output of healthcare LLMs, making them a far slower, less automated option.
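
To illustrate the shape of the task, and why accuracy matters so much, here is a minimal sketch of de-identification as span redaction. The regex patterns below only catch a few obvious formats; production systems rely on trained clinical NER models to find names, dates, record numbers, addresses, and other identifiers.

```python
# Minimal sketch: de-identification as redaction of identifying spans.
import re

# A few illustrative patterns; real PHI detection needs trained NER models.
PATTERNS = {
    "PHONE": r"\b\d{3}[-.]\d{3}[-.]\d{4}\b",
    "SSN": r"\b\d{3}-\d{2}-\d{4}\b",
    "DATE": r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",
}


def deidentify(text: str) -> str:
    """Replace each detected identifier with a placeholder tag."""
    for label, pattern in PATTERNS.items():
        text = re.sub(pattern, f"<{label}>", text)
    return text


note = "Seen on 03/14/2023. Callback 555-867-5309. SSN 123-45-6789 on file."
print(deidentify(note))
# Seen on <DATE>. Callback <PHONE>. SSN <SSN> on file.
```

Every identifier the system misses is one a human reviewer has to catch, which is why dropping from >99% accuracy to a 5x higher error rate turns an automated pipeline back into a manual review process.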

Price is another major concern when using GPT models for such tasks. A recent customer analysis comparing the above solution to DeID-GPT (which is based on GPT-4) found the GPT-based approach to be two orders of magnitude more expensive: roughly 15 notes de-identified per dollar, versus roughly 1,500 pages of notes per dollar. Given the volume of clinical text healthcare organizations have to analyze, this significantly reduces the economic viability of GPT-based de-identification solutions.
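
The “two orders of magnitude” figure follows directly from the quoted throughput numbers, treating both as per-document throughput for the sake of the ratio:

```python
# Back-of-the-envelope check of the cost gap quoted above.
gpt_based_docs_per_dollar = 15        # DeID-GPT figure from the analysis
specialized_docs_per_dollar = 1_500   # figure quoted for the specialized solution

print(specialized_docs_per_dollar / gpt_based_docs_per_dollar)  # 100.0, i.e. ~10^2
```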

The latest academic research echoes these benchmarks on using GPT models for medical information extraction. A recent paper from Cornell University found that “GPT models had extremely poor performance in named entity recognition compared to other tasks.” It also found that the healthcare-specific model PubMedBERT significantly outperformed both LLMs on NER, relation extraction, and multi-label classification tasks.

With all the hype and possibilities surrounding LLMs, it may come as a surprise that they do not (yet) do well on today’s most common industrial use cases for analyzing medical text. Despite the current roadblocks, a great deal of progress is being made to improve healthcare-specific LLMs so they can start living up to the hype. AI moves fast, so the outlook for LLMs in 2024 is an exciting one.