Is AI Fuelling False Findings in Science?
Reference
Suchak T, Aliu AE, Harrison C, Zwiggelaar R, Geifman N, Spick M (2025) Explosion of formulaic research articles, including inappropriate study designs and false discoveries, based on the NHANES US national health database. PLOS Biology 23(5): e3003152.
Video Lay Summary

Lay Summary Author
Lea Nagelschmied
View at The Collaborative Library Website
Artificial Intelligence (AI) has grown rapidly in recent years, bringing both exciting opportunities and plenty of worries. To avoid a future of AI mistakes and misinformation, we need to understand what’s going wrong. Health science is especially at risk: if AI makes it easy to produce fast, low-quality studies with false results, doctors might end up using bad information, leading to wrong diagnoses or unsafe treatments.
But surely no one wants to create false health findings on purpose… right?
Paper mills exist because some scientists, under pressure to publish, want papers without doing the hard research; these groups sell them fake or rushed work, and AI lets them produce it even faster.
NHANES, a huge U.S. health and nutrition dataset well suited to machine learning and fast AI analysis, can improve science. But it also gives paper mills an easy way to misuse data and publish results that aren’t true.
So, is this really happening?
How can we tell if low-quality AI research is slipping into our scientific databases?
To explore this, Tulsi Suchak and her team searched PubMed and Scopus for studies using the NHANES AI-ready dataset. They focused on papers that claimed a strong link between one factor (like smoking) and a health problem (like depression) without checking other causes, the type most likely to misuse data.
They reviewed studies from 2014 to 2024, reanalyzed the data themselves, and compared the data the studies used with the full NHANES dataset. This showed whether the original authors had left out any data that didn’t support their conclusions.
Here is what the researchers found:
The researchers found 341 of these studies vulnerable to data misuse. From 2014 to 2021, only about four such papers came out each year. But things changed quickly: 33 appeared in 2022, 82 in 2023, and an incredible 190 in just the first nine months of 2024. This rise coincides with the period when AI tools became more powerful and easier to use. Interest in health-data research also grew, but not nearly enough to explain such a huge jump.
They also noticed a striking pattern: until 2020, only two of these papers came from China. But from 2021 to 2024, 292 out of 316 did. And since the team only looked at papers written in Roman letters, the real number is likely even higher. Why this shift happened is unclear—it could be due to local research pressures or science policies, but that’s difficult to confirm.
But we can check whether the results in these studies make sense. Most of them claimed a strong link between one factor and one health condition. To see how this should be handled properly, we need to go over a bit of statistics, but don’t worry, we’ll keep it simple.
Our big question is this: were these studies using the huge NHANES dataset to cherry-pick only the results that looked good?
Here’s an example. Imagine you look at a group of people who smoke and notice many of them are depressed. You might guess that smoking and depression are linked—but maybe you just got an unlucky sample, and non-smokers are just as depressed. It’s a bit like rolling a die and getting a one by chance.
If you then test lots of other factors like high blood pressure, diet, or exercise, it’s like rolling more and more dice. The more rolls you make, the more likely you are to get another “unlucky” one and think you’ve found a real link when you haven’t.
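The dice analogy can be put into numbers. A minimal sketch (ours, not the paper’s): at the usual p < 0.05 threshold, each test of a truly unrelated factor has a 5% chance of a fluke “discovery”, and a short calculation shows how fast the risk grows with the number of tests.

```python
# A minimal sketch (not from the paper): the chance of at least one
# false "discovery" when testing n truly unrelated factors, if each
# test has a 5% chance of a fluke at the usual p < 0.05 threshold.

def chance_of_fluke(n_tests, false_positive_rate=0.05):
    # Probability that at least one of n independent tests comes up
    # "significant" purely by chance.
    return 1 - (1 - false_positive_rate) ** n_tests

for n in [1, 5, 20, 100]:
    print(f"{n:3d} tests -> {chance_of_fluke(n):.0%} chance of at least one fluke")
```

With these numbers, twenty tests already give roughly a 64% chance of at least one fluke, and a hundred tests make one almost certain.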
That’s why scientists use a method called False Discovery Rate correction. It helps adjust for the number of “dice rolls” so they don’t mistake random chance for a true finding.
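The correction itself is simple enough to sketch. The snippet below (an illustration, not the paper’s actual code) simulates 100 factors with no real link to the outcome, counts how many look “significant” by luck alone, and then applies the standard Benjamini–Hochberg False Discovery Rate procedure:

```python
# Illustrative sketch, not the paper's code: test 100 factors that
# have NO real link to the outcome and see what survives correction.
import random

random.seed(42)

# Under the null hypothesis (no real link), a p-value is equally
# likely to land anywhere between 0 and 1.
p_values = [random.random() for _ in range(100)]

# Naive approach: call anything with p < 0.05 a "discovery".
naive_hits = [p for p in p_values if p < 0.05]

def benjamini_hochberg(ps, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank i
    whose p-value is at most alpha * i / m, and keep everything up to
    that rank. Returns the surviving p-values."""
    ranked = sorted(ps)
    m = len(ranked)
    cutoff = 0
    for i, p in enumerate(ranked, start=1):
        if p <= alpha * i / m:
            cutoff = i
    return ranked[:cutoff]

survivors = benjamini_hochberg(p_values)
print(f"Fluke 'discoveries' before correction: {len(naive_hits)}")
print(f"Still 'significant' after FDR correction: {len(survivors)}")
```

Because every factor here is pure noise, anything the naive threshold flags is a false discovery; the correction is designed to sweep most or all of them away.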
And this is how anyone could use this to cheat: If we look at depressed people in the NHANES dataset and test lots of factors (like smoking, blood pressure, BMI, or inflammation), it’s like rolling many dice. There’s a good chance we’ll find something that looks true just by luck. This is called data dredging.
If we then pretend that this one result was the only thing we planned to test, it makes the finding seem more trustworthy than it really is. This is known as HARKing.
The researchers combined the data from 28 single-factor studies on depression and treated them as one big study. This allowed them to correct for false discoveries, something the original papers hadn’t done. After this check, only 13 studies still held up.
They also found that one research group had submitted two separate papers at the same time, one on hardening of a major artery and one on problems with memory and thinking, both linked to the body shape index. Instead of doing one proper study that tested and corrected for multiple factors, they split it into two to boost the number of publications. This shows a focus on quantity over quality.
They also looked at studies linking health conditions to inflammation. Many of these papers only used a small part of the NHANES data, without saying why. Out of 14 studies, only four used all the data available to them. This suggests some researchers may have cherry-picked the numbers that gave them the results they wanted.
We can’t always tell which studies were deliberately misleading and which were just poorly done. But we can see that the number of these low-quality papers has risen sharply, likely because AI makes it easier and faster to produce studies that risk reporting false findings.
This study isn’t perfect, but it asks urgent questions and shows where future work needs to go.
This wasn’t a full review of all studies, and the authors admit they may have missed some because they only searched for certain keywords. We also can’t say for sure that the studies they examined came from paper mills. But the researchers were still able to point out major problems with how AI-ready databases are being used.
The paper perhaps makes it sound like AI is the main reason bad or low-quality health studies are being published. But it doesn’t actually show this. What the researchers really found was that people were using the NHANES dataset in the wrong ways, for example picking only certain parts of the data or using very simple methods that increase the chance of false findings. They may have misused AI to do this very quickly, and one indicator is that more low-quality papers appeared around the same time AI became popular. But that is only a hypothesis, not something this study actually proves.
What now?
The authors offer several recommendations in their paper. In general, any study that looks at only one factor for a disease we know has many causes should raise concern. Good first steps include rejecting overly simple papers, using trained statistical reviewers, and asking researchers to register their study plans before using the database.
Providers of AI-ready data, publishers, and scientists all need to work together to stop low-quality studies from flooding the system. In a fast-moving technological world, inaccurate health data isn’t just confusing—it can be dangerous.
Lay Summary License
This lay summary is distributed under the Creative Commons Attribution–NonCommercial 4.0 International license (CC BY-NC 4.0).