For two decades, researchers have used brain imaging technology to try to identify how a person’s brain structure and function connect to a range of mental health disorders, from anxiety and depression to suicidal tendencies.

But one new paper, published Wednesday in Nature, calls into question whether much of this research is actually yielding valid results. Many such studies, the paper’s authors found, tend to include fewer than two dozen participants, well below the number needed to generate reliable results.

“You need thousands of people,” said Scott Marek, a psychiatry researcher at Washington University School of Medicine in St. Louis and an author of the paper. He described the finding as a blow to typical studies that use imaging to try to better understand mental health.

Studies that use magnetic resonance imaging technology typically temper their conclusions with a caveat noting the small sample size. But recruiting participants is time-consuming and expensive, with scans costing $600 to $2,000 per hour, said Dr. Nico Dosenbach, a neurologist at Washington University School of Medicine and another author of the paper. The median number of subjects in mental health-related studies that use brain imaging is around 23, he added.

But the Nature paper demonstrates that data drawn from just two dozen subjects is generally insufficient to be reliable and may in fact produce “massively inflated” results, Dr. Dosenbach said.

For their analysis, the researchers looked at three of the largest studies using brain imaging technology to draw conclusions about brain structure and mental health. All three studies are ongoing: the Human Connectome Project, which has 1,200 participants; the Adolescent Brain Cognitive Development, or ABCD, study, with 12,000 participants; and the UK Biobank study, with 35,700 participants.

The authors of the Nature paper looked at subsets of data in these three studies to determine whether smaller slices were misleading or “reproducible,” meaning the results could be considered scientifically valid.

For example, the ABCD study examines, among other things, whether the thickness of the brain’s gray matter can be correlated with mental health and problem-solving ability. The authors of the Nature paper looked at small subsets within the large study and found that the subsets produced unreliable results compared with those obtained from the full data set.

On the other hand, the authors found that when results were generated from samples of several thousand subjects, they were similar to those from the full data set.
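
A rough sense of why small subsets mislead can be had from a simulation. The sketch below is not the authors’ analysis; the cohort size, the strength of the underlying effect and the use of a single simulated brain measure are assumptions made for illustration. It draws repeated subsets of 25 and of 3,000 simulated subjects from a “full study” and compares the correlations they produce with the full-sample value.

```python
# Minimal illustration (not the authors' pipeline): how a brain-behavior
# correlation estimated from small subsets can swing wildly around the
# value obtained from the full sample. All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)

N_FULL = 12_000      # roughly the size of the ABCD cohort
TRUE_R = 0.06        # an assumed small brain-behavior correlation

# Simulate a "full study": a cortical-thickness measure and a cognitive
# score that are weakly correlated at TRUE_R.
thickness = rng.standard_normal(N_FULL)
score = TRUE_R * thickness + np.sqrt(1 - TRUE_R**2) * rng.standard_normal(N_FULL)

full_r = np.corrcoef(thickness, score)[0, 1]
print(f"full-sample r ({N_FULL} subjects): {full_r:+.3f}")

# Re-estimate the correlation from many random subsets, first at the
# median study size cited above (about two dozen subjects), then at
# several thousand subjects.
for n in (25, 3_000):
    estimates = []
    for _ in range(5_000):
        idx = rng.choice(N_FULL, size=n, replace=False)
        estimates.append(np.corrcoef(thickness[idx], score[idx])[0, 1])
    estimates = np.array(estimates)
    print(f"subsets of {n:>5}: r ranges from {estimates.min():+.3f} "
          f"to {estimates.max():+.3f}")
# Typical output: 25-subject estimates swing from strongly negative to
# strongly positive, many of them far larger than the full-sample value,
# while the 3,000-subject estimates cluster tightly around it.
```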

The authors performed millions of calculations using different sample sizes and the hundreds of brain regions explored in the various major studies. Time and time again, the researchers found that subsets of data drawn from fewer than several thousand people did not produce results consistent with those from the full data set.
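
That kind of sweep can be sketched in the same hedged spirit: for a grid of sample sizes and many simulated brain regions, draw random subsets and count how often the subset correlations land near the full-sample values. The cohort size, number of regions, effect sizes and the agreement threshold below are all invented for illustration; the actual analysis used the real imaging and behavioral data from the three studies.

```python
# Hedged sketch of a sample-size sweep across many brain regions.
import numpy as np

rng = np.random.default_rng(1)
N_FULL = 35_700                      # roughly the UK Biobank cohort size
N_REGIONS = 300                      # illustrative number of brain regions
SAMPLE_SIZES = [25, 100, 500, 2_000, 10_000]
TOLERANCE = 0.05                     # "consistent" = within 0.05 of full r

# Simulate one behavioral score and many regional brain measures with
# small, varying true correlations (drawn between -0.1 and +0.1).
score = rng.standard_normal(N_FULL)
true_rs = rng.uniform(-0.1, 0.1, N_REGIONS)
regions = (true_rs[:, None] * score
           + np.sqrt(1 - true_rs[:, None] ** 2)
           * rng.standard_normal((N_REGIONS, N_FULL)))

def corr_with_score(data, idx):
    """Pearson correlation of each region with the score over subjects idx."""
    x = data[:, idx]
    y = score[idx]
    x = x - x.mean(axis=1, keepdims=True)
    y = y - y.mean()
    return (x @ y) / (np.sqrt((x**2).sum(axis=1)) * np.sqrt((y**2).sum()))

full_r = corr_with_score(regions, np.arange(N_FULL))

for n in SAMPLE_SIZES:
    hits = []
    for _ in range(200):             # 200 random subsets per sample size
        idx = rng.choice(N_FULL, size=n, replace=False)
        sub_r = corr_with_score(regions, idx)
        hits.append(np.abs(sub_r - full_r) < TOLERANCE)
    print(f"n={n:>6}: {np.mean(hits):5.1%} of region estimates "
          f"within {TOLERANCE} of the full-sample value")
# The agreement rate climbs with sample size and only approaches 100%
# once subsets reach into the thousands of participants.
```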

Dr. Marek said the paper’s findings “absolutely” apply beyond mental health. Other fields, like genomics and cancer research, have had their own reckonings with the limitations of small sample sizes and have tried to course-correct, he noted.

“My hunch is this is much more about population science than any of these areas,” he said.