Clinical scientists used machine learning (ML) models to explore anonymized electronic health record (EHR) data in the National COVID Cohort Collaborative (N3C), a national clinical database funded by the National Institutes of Health, to help discern the characteristics of people with long COVID and the factors that may help identify these patients using data from medical records.

The findings, published in The Lancet Digital Health, have the potential to improve clinical research on long COVID and inform a more standardized care regimen for the disease.

“Characterizing, diagnosing, treating and caring for patients with long COVID has proven to be a challenge because the list of characteristic symptoms keeps changing over time. We needed to gain a better understanding of the complexities of long COVID, and for that it made sense to leverage modern data analytics tools and a unique big data resource like N3C, where many features of long COVID are represented.”


Emily R. Pfaff, PhD, first author, assistant professor in the division of endocrinology and metabolism at UNC School of Medicine

Sponsored by the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health, the N3C data enclave currently includes information representing more than 13 million people at 72 sites nationwide, including nearly 5 million COVID-19 positive cases. The resource allows for rapid research on emerging questions regarding COVID-19 vaccines, therapies, risk factors and health outcomes.

This new research is part of the National Institutes of Health’s Researching COVID to Enhance Recovery (RECOVER) initiative, which has recruited thousands of participants nationwide to answer critical research questions about long COVID, including precisely who has it, its risk factors, and potential interventions and treatments.

Using N3C, researchers developed XGBoost machine learning (ML) models to understand patient characteristics and better identify potential long-COVID patients.

Researchers looked at demographics, health care utilization, diagnoses, and medications for 97,995 adult patients with COVID-19. They used these features, together with data on nearly 600 long COVID patients from three dedicated long COVID clinics, to train and test three ML models. Each model focused on identifying potential long COVID patients in one of three groups: all COVID-19 patients, patients hospitalized with COVID-19, and patients who had COVID-19 but were not hospitalized.

The models were found to be accurate in identifying potential long-COVID patients, achieving areas under the receiver operating characteristic curve (AUROC), a measure of accuracy used by machine learning researchers, of 0.91 (all patients), 0.90 (hospitalized), and 0.85 (not hospitalized). The patients flagged by the models can be interpreted as “patients requiring care in a long-COVID specialist clinic.” Applying the models to the larger N3C cohort can also serve the urgent goal of identifying long COVID patients for clinical trials.
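For readers unfamiliar with the metric, the area under the receiver operating characteristic curve can be read as the probability that a model scores a randomly chosen true case higher than a randomly chosen non-case: 0.5 is no better than chance and 1.0 is perfect ranking. The short Python sketch below illustrates that definition on made-up labels and scores. The function name `auroc` and the toy data are illustrative assumptions only; this is not the study's code or data.

```python
def auroc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the area under the ROC curve.

    labels: 1 for a true case, 0 for a non-case.
    scores: model output, where a higher value means "more likely a case".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # the case is ranked above the non-case
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Made-up example: 3 cases and 4 non-cases (hypothetical values).
toy_labels = [1, 1, 1, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
print(round(auroc(toy_labels, toy_scores), 3))
```

In this toy example, the model ranks the case above the non-case in 11 of the 12 possible case and non-case pairs, so the value is 11/12. The nested loop is fine for illustration, but real analyses on cohorts of this size would normally use a library implementation.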

The models also revealed many important characteristics that differentiate potential long-COVID patients from non-long-COVID patients. They focused on patients with a positive COVID diagnosis who were at least 90 days out from their acute infection. The characteristics most frequently identified among potential long-COVID patients include post-COVID respiratory symptoms and associated treatments, non-respiratory symptoms widely reported in the setting of long COVID (such as sleep disturbances, anxiety, malaise, chest pain, and constipation), pre-existing symptoms, risk factors for greater acute COVID severity (such as chronic lung disease, diabetes, and chronic kidney disease), and indicators of hospitalization, which suggest greater severity of acute COVID. The study also points out that it is plausible that long COVID ultimately has no single definition and can best be described as a set of related conditions with their own symptoms, trajectories, and treatments.

“These results are a testament to the powerful impact of real-world clinical data and the potential capabilities of N3C to help better understand and find solutions to important public health issues such as long COVID,” said Joni Rutter, PhD, acting director of NCATS.

Josh Fessel, MD, PhD, senior clinical advisor at NCATS and scientific program manager for RECOVER, added, “Once you are able to determine who has long COVID in a large database of people, you can start to ask questions about these people. Was there something different about these people before they developed long COVID? Did they have certain risk factors? Was there something about the way they were treated during acute COVID that might have increased or decreased their risk of long COVID?”

The study also addressed how electronic health record (EHR) data is biased towards patients who use health systems more. Pfaff says it’s critical to recognize which patients are least likely to be represented in the data: uninsured patients, patients with limited access to care or limited ability to pay for care, and patients seeking care in small practices or community hospitals with limited data exchange capabilities.

“Electronic health records (EHRs) only contain information for people going to the doctor,” said Pfaff, who is also co-director of the NC TraCS Informatics and Data Science (IDSci) program. “They also have more information about people who are going to the doctor a lot. So people who don’t have good access to care or people who aren’t going to the doctor, we just don’t have any information about them. So that’s a caveat that I offer with every EHR-based study that I do. We need to recognize who isn’t in the data set.”

The N3C team continues to refine its models as new real-world data emerges. Their longitudinal data for COVID-19 patients can provide a comprehensive basis for the development of ML models to identify potential long-COVID patients. As larger cohorts of long COVID patients are established, future work will include research to identify subtypes of long COVID, making the disease easier to study and treat.

“Depending on where the research leads, we may find that patients with different presentations of long COVID are different enough to warrant entirely different treatments,” Pfaff said. “It is therefore important for us to determine whether long COVID is a single disease or a constellation of related conditions that are also linked to having had acute COVID-19.”

With the help of this big data approach, efficient study recruitment becomes possible, which can deepen understanding of the complexities of long COVID. Beyond identifying cohorts for research studies, understanding and validating the relationships between long COVID and social determinants of health, demographics, comorbidities, and treatment implications will only improve the algorithms in these models as more evidence emerges.

“Research studies, especially clinical trials, are one of our best tools for understanding long COVID: its presentation, risk factors and potential treatments,” Pfaff said. “To have the best chance of success, studies need large and diverse groups of qualified participants, who are not easy to find. Using algorithms like the one we created on large clinical datasets can narrow large numbers of patients down to those who might qualify for a long COVID trial, potentially giving researchers a head start on recruitment, making trials more efficient and hopefully getting results faster.”

Source:

University of North Carolina School of Medicine

Journal reference:

Pfaff, E. R., et al. (2022) Identifying Who Has Long COVID in the United States: A Machine Learning Approach Using N3C Data. The Lancet Digital Health. doi.org/10.1016/S2589-7500(22)00048-6.