Physicians often query a patient’s electronic health record for information that helps them make treatment decisions, but the cumbersome nature of these records hampers the process. Research has shown that even when a doctor has been trained in the use of an electronic health record (EHR), finding an answer to a single question can take, on average, more than eight minutes.

The more time doctors spend navigating an often clunky EHR interface, the less time they have to interact with patients and provide treatment.

Researchers have started developing machine learning models that can streamline the process by automatically finding the information doctors need in an EHR. However, training effective models requires huge datasets of relevant medical questions, which are often difficult to find due to confidentiality restrictions. Existing models struggle to generate genuine questions – those that would be asked by a human doctor – and are often unable to successfully find correct answers.

To overcome this dearth of data, researchers at MIT teamed up with medical experts to study the questions doctors ask when reviewing EHRs. Then they built a publicly available dataset of more than 2,000 clinically relevant questions written by these medical experts.

When they used their dataset to train a machine learning model to generate clinical questions, they found that the model asked high-quality, authentic questions, as compared with real questions from medical experts, more than 60% of the time.

With this dataset, they plan to generate a large number of authentic medical questions and then use those questions to train a machine learning model that would help doctors more efficiently find the information they are looking for in a patient’s record.

“Two thousand questions might seem like a lot, but when you look at the machine learning models being trained these days, they contain so much data, maybe billions of data points. When you train machine learning models to work in healthcare settings, you have to get really creative because there’s such a lack of data,” says lead author Eric Lehman, a graduate student in the Computer Science and Artificial Intelligence Laboratory (CSAIL).

The senior author is Peter Szolovits, a professor in the Department of Electrical Engineering and Computer Science (EECS) who leads the Clinical Decision Making Group at CSAIL and is also a member of the MIT-IBM Watson AI Lab. The research paper, a collaboration among co-authors from MIT, the MIT-IBM Watson AI Lab, IBM Research, and the physicians and medical experts who helped create the questions and participated in the study, will be presented at the annual conference of the North American Chapter of the Association for Computational Linguistics.

“Realistic data is essential for training models that are relevant to the task, yet difficult to find or create,” says Szolovits. “The value of this work lies in the careful collection of questions asked by clinicians about patient cases, from which we are able to develop methods that use this data and general language models to ask further plausible questions.”

Data gap

The few large datasets of clinical questions that the researchers were able to find had a host of problems, Lehman says. Some were composed of medical questions posed by patients on web forums, a far cry from doctors’ questions. Other datasets contained questions produced from templates, so they were mostly identical in structure, making many of them unrealistic.

“Collecting high-quality data is really important for performing machine learning tasks, especially in a healthcare setting, and we’ve shown that it can be done,” says Lehman.

To build their dataset, the MIT researchers worked with practicing physicians and medical students in their final year of training. They gave these medical experts over 100 EHR discharge summaries and told them to read a summary and ask any questions they might have. The researchers did not impose any restrictions on question types or structures in an attempt to gather natural questions. They also asked medical experts to identify the “trigger text” in the EHR that led them to ask each question.

For example, a medical expert might read a note in the EHR stating that a patient’s past medical history is significant for prostate cancer and hypothyroidism. The trigger text “prostate cancer” could prompt the expert to ask questions such as “Date of diagnosis?” or “Have any interventions been made?”
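For readers curious how such an annotated example might look in machine-readable form, here is a hypothetical record pairing a trigger with an expert-written question. The field names and offsets are purely illustrative and do not reflect the dataset’s actual schema.

```python
# A hypothetical record pairing a span of EHR text with an expert's question.
# Field names and values are illustrative only, not the dataset's real format.
example_record = {
    "discharge_summary_id": "note_0042",   # hypothetical identifier
    "trigger_text": "prostate cancer",     # span that prompted the question
    "trigger_span": (118, 133),            # character offsets (assumed convention)
    "question": "Date of diagnosis?",      # question written by the medical expert
}

# Several questions can share the same trigger, for example:
followup = {**example_record, "question": "Have any interventions been made?"}

print(example_record["trigger_text"], "->", example_record["question"])
```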

They found that most of the questions were about the patient’s symptoms, treatments, or test results. While these results aren’t unexpected, quantifying the number of questions on each general topic will help them create an effective dataset to use in a real-life clinical setting, Lehman says.

Once they compiled their question dataset and accompanying trigger text, they used it to train machine learning models to ask new questions based on the trigger text.
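As a rough illustration of what such conditioned question generation could look like, the sketch below prompts an off-the-shelf sequence-to-sequence model with a note excerpt and its trigger text. The model name, prompt format, and decoding settings here are placeholder assumptions, not the team’s actual setup.

```python
# Minimal sketch: generate a clinical question conditioned on trigger text.
# Model choice, prompt format, and decoding settings are illustrative only.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "t5-small"  # placeholder; a model fine-tuned on the question dataset would be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

context = "Past medical history significant for prostate cancer and hypothyroidism."
trigger = "prostate cancer"

# Condition the model on both the surrounding note text and the trigger span.
prompt = f"generate question: context: {context} trigger: {trigger}"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=32, num_beams=4)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```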

Then, medical experts determined whether these questions were “good” using four criteria: understandability (does the question make sense to a doctor?), triviality (is the question too easily answerable from the trigger text?), medical relevance (does it make sense to ask this question based on the context?), and relevance to the trigger (is the trigger related to the question?).

Cause for concern

The researchers found that when the model was given a trigger text, it was able to generate a good question 63% of the time, whereas a human doctor would ask a good question 80% of the time.

They also trained models to retrieve answers to clinical questions, using the publicly available datasets they had found at the start of this project. Then they tested these trained models to see whether they could find answers to the “good” questions posed by the human medical experts.

The models were only able to retrieve about 25% of the answers to the questions posed by the doctors.

“This result is really worrying. What people thought were successful models were, in practice, just awful because the assessment questions they were testing on weren’t good to begin with,” says Lehman.
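As a rough illustration of the kind of check described above, the sketch below runs an off-the-shelf extractive question-answering model over a note excerpt and counts how often it recovers an expected answer. The model choice and matching rule are assumptions for illustration, not the study’s evaluation protocol.

```python
# Rough sketch: test whether a pretrained extractive QA model can recover
# answers to expert-written questions from a discharge summary excerpt.
# The model and the containment-based matching rule are illustrative assumptions.
from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

summary = "Past medical history significant for prostate cancer and hypothyroidism."
expert_questions = [
    {"question": "What is the patient's past medical history?",
     "expected": "prostate cancer and hypothyroidism"},
]

recovered = 0
for item in expert_questions:
    prediction = qa(question=item["question"], context=summary)
    # Count a hit only if the predicted span contains the expected answer text.
    if item["expected"].lower() in prediction["answer"].lower():
        recovered += 1

print(f"Recovered {recovered}/{len(expert_questions)} answers")
```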

The team is now applying this work to its original goal: to create a model that can automatically answer physician questions in an EHR. For the next step, they will use their dataset to train a machine learning model that can automatically generate thousands or millions of good clinical questions, which can then be used to train a new automatic question answering model.

While there is still a lot of work to do before this model becomes a reality, Lehman is encouraged by the strong initial results the team has demonstrated with this dataset.

This research was supported, in part, by the MIT-IBM Watson AI Lab. Additional co-authors include Leo Anthony Celi of the MIT Institute for Medical Engineering and Science; Preethi Raghavan and Jennifer J. Liang of the MIT-IBM Watson AI Lab; Dana Moukheiber of the University at Buffalo; Vladislav Lialin and Anna Rumshisky of the University of Massachusetts at Lowell; Katelyn Legaspi, Nicole Rose I. Alberto, Richard Raymund R. Ragasa, Corinna Victoria M. Puyat, Isabelle Rose I. Alberto, and Pia Gabrielle I. Alfonso of the University of the Philippines; Anne Janelle R. Sy and Patricia Therese S. Pile of the University of the East Ramon Magsaysay Memorial Medical Center; Marianne Taliño of the Ateneo de Manila University School of Medicine and Public Health; and Byron C. Wallace of Northeastern University.