The internet’s new favorite toy, ChatGPT, accomplishes some things better than others. The machine learning-trained chatbot from OpenAI can string together sentences and paragraphs that flow smoothly on just about any topic you prompt it with. But it cannot reliably tell the truth. It can act as a believable substitute for a text-based mental health counselor. But it cannot write a passable Gizmodo article.
On the list of concerning things the AI text generator apparently can do, though, is fool scientific reviewers—at least some of the time, according to a pre-print study released Tuesday from Northwestern University and University of Chicago researchers. Published academic science relies on a process of article submission and review by human experts in relevant fields. If AI can routinely fool those reviewers, it could fuel a scientific integrity crisis, the new study authors warn.
In the pre-print, researchers began by picking 50 real, published medical articles. They took the title from each and fed it to ChatGPT with the prompt, “Please write a scientific abstract for the article [title] in the style of [journal] at [link].” Then, they pooled the real and fake abstracts together for a total of 100 samples. The researchers randomly assigned four medical professionals 25 abstracts each to review, ensuring that none of the reviewers received samples with duplicate titles. The study researchers told the subjects that some of the abstracts were fake and some genuine—otherwise, the reviewers were blind to the study set-up.
The reviewers correctly identified an abstract as the product of ChatGPT 68% of the time. But in the remaining 32% of cases, they were tricked. And that’s despite just 8% of the falsified abstracts meeting the specific formatting and style requirements of the listed journal. Plus, the reviewers falsely flagged 14% of the real article abstracts as AI-generated.
“Reviewers indicated that it was surprisingly difficult to differentiate between the two,” wrote the study researchers in the pre-print. While they were sorting the abstracts, the reviewers noted that they thought the generated samples were vaguer and more formulaic. But again, applying that assumption led to a pretty dismal accuracy rate—one that would yield a failing grade in most science classes.
“Our reviewers knew that some of the abstracts they were being given were fake, so they were very suspicious,” said lead researcher Catherine Gao, a pulmonologist at Northwestern’s medical school, in a university press statement. “This is not someone reading an abstract in the wild. The fact that our reviewers still missed the AI-generated ones 32% of the time means these abstracts are really good. I suspect that if someone just came across one of these generated abstracts, they wouldn’t necessarily be able to identify it as being written by AI.”
In addition to running the abstracts by human reviewers, the study authors also fed all of the samples, real and fake, through an AI output detector. The automated detector routinely assigned much higher scores (indicating a higher likelihood of AI generation) to the ChatGPT abstracts than to the real ones. The detector correctly scored all but two of the original abstracts as close to 0% fake. However, it gave 34% of the AI-generated samples a score below 50 out of 100—indicating that it still struggled to neatly classify the fakes.
Part of what made the ChatGPT abstracts so convincing was the AI’s ability to replicate scale, noted the pre-print. Medical research hinges on sample size, and different types of studies use very different numbers of subjects. The generated abstracts used similar (but not identical) patient cohort sizes as the corresponding originals, wrote the study authors. “For a study on hypertension, which is common, ChatGPT included tens of thousands of patients in the cohort, while a study on monkeypox had a much smaller number of participants,” said the press statement.
The new study has its limitations. For one, both the sample size and the number of reviewers were small. The researchers tested only one AI output detector, and they didn’t adjust their prompts to try to generate more convincing work as they went—it’s possible that with additional iteration and more targeted prompts, the ChatGPT-generated abstracts could be even more convincing. Which is a worrying prospect in a field beset by misconduct.
Already, so-called “paper mills” are an issue in academic publishing. These for-profit organizations produce journal articles en masse—often containing plagiarized, bogus, or incorrect data—and sell off authorship to the highest bidder so that buyers can pad their CVs with falsified research cred. The ability to use AI to generate article submissions could make the fraudulent industry even more lucrative and prolific. “And if other people try to build their science off these incorrect studies, that can be really dangerous,” Gao added in the news statement.
To avoid a possible future where scientific disciplines are flooded with fake publications, Gao and her co-researchers recommend that journals and conferences run all submissions through AI output detectors.
But it’s not all bad news. By fooling human reviewers, ChatGPT has clearly demonstrated that it can adeptly write in the style of academic scientists. So, it’s possible the technology could be used by researchers to improve the readability of their work—or as a writing aid to boost equity and access for researchers publishing outside their native language.
“Generative text technology has a great potential for democratizing science, for example making it easier for non-English-speaking scientists to share their work with the broader community,” said Alexander Pearson, senior study author and a data scientist at the University of Chicago, in the press statement. “At the same time, it’s imperative that we think carefully on best practices for use.”