Frustrated Microsoft Researcher Uses Goats in ‘Age of Empires II’ to Demo the Absurdity of LLMs

Would you view a GoatGPT response differently to ChatGPT? Even if they worked the exact same way?

Goats are comedy gold. They headbutt confused cats! They faint! They make hilarious noises! Really hilarious noises! They do unspeakable things to the sheriff! And sometimes they’re complete and utter… well, let’s just say that as an Australian, I feel compelled to shout out the immortal Kevin, whose epithet shall not be spoken in polite American company but whose YouTube career should be rejoiced in all its foul-mouthed glory.

The point is that if you wanted to, say, emphasize the inherent absurdity of claims that large language models are somehow sentient, then short of actually just getting hold of a parrot and dumping the entire internet into its little post-saurian pea brain, you could do worse than illustrating your argument with goats. And hey, what do you know? It appears that a researcher at Microsoft has done just that.

Perhaps prompted by the galaxy-brained singularity-themed nonsense emerging from his contemporaries at competing companies, a researcher named Adrian de Wynter decided earlier this year to demonstrate that claims for and against the sentience of LLMs require some manner of actually measuring the validity of such claims. In particular, as described in his paper “If LLMs Have Human-Like Attributes, Then So Does Age of Empires II,” de Wynter sat down to demonstrate that at present, we lack any reliable “widely-accepted experimental protocols or schools of thought” for evaluating claims of sentience.

As the title suggests, the paper argues that if LLMs have human-like attributes, then so does the 1999 real-time strategy classic Age of Empires II. But not just any old part of Age of Empires II, mind. No, it’s the goats. De Wynter used the AoE II scenario editor to turn the game’s goats into components of basic logic gates. (The details of how he did so are interesting, and involve the term “bit-goat”, which we resolve to use as often as possible going forward.)

As de Wynter’s paper explains, once you get several elementary logical operations—NAND, XNOR and AND—up and running, you have all you need to build what’s called a perceptron, which is one of the most basic forms of artificial intelligence. He builds a one-bit perceptron with his goat-based logic gates, and argues that this effectively constitutes a proof of concept for building a full-blown, virtual goat-based LLM.
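For the curious, here is a minimal Python sketch of that idea. It is not de Wynter’s actual goat circuitry: the function names are hypothetical, and the design (one-bit weights, XNOR standing in for multiplication, AND as the firing rule) is an assumption borrowed from how binarized neural networks are usually described. Everything is built from NAND alone, simply to show how little machinery is needed.

```python
# Hypothetical sketch: a tiny binary perceptron built only from NAND gates.
# This is an illustration, not the construction used in the paper.

def nand(a: int, b: int) -> int:
    """The one primitive gate; every other gate below is made from it."""
    return 1 - (a & b)

def and_gate(a: int, b: int) -> int:
    """AND is NAND followed by an inverter (a NAND fed the same bit twice)."""
    n = nand(a, b)
    return nand(n, n)

def xnor_gate(a: int, b: int) -> int:
    """XNOR: build XOR from four NANDs, then invert it."""
    n = nand(a, b)
    xor = nand(nand(a, n), nand(b, n))
    return nand(xor, xor)

def perceptron(inputs: tuple, weights: tuple) -> int:
    """Fire (return 1) only when every input bit agrees with its weight bit.

    XNOR plays the role of multiplication and AND plays the role of the
    threshold. Both choices are assumptions made for this sketch.
    """
    fired = 1
    for x, w in zip(inputs, weights):
        fired = and_gate(fired, xnor_gate(x, w))
    return fired

if __name__ == "__main__":
    weights = (1, 0)
    for bits in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(bits, perceptron(bits, weights))
```

With the weights set to `(1, 0)`, this toy perceptron fires only on the input `(1, 0)`. Swap each call to `nand` for a pen of goats and you have the general shape of the argument.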

Digital goats as an LLM?

This is all fun and games, but what’s the point that de Wynter is making here? There are actually two key points, and they’re both to do with how we go about evaluating an LLM’s anthropomorphic qualities. The first point is that, as demonstrated by the goats, “any sufficiently powerful substrate could implement an entity equivalent to an LLM.”

The term “substrate” is important here, and it basically refers to the “stuff” from which the LLM is built, be it a large codebase stored safely (well, allegedly) at a company like Anthropic or OpenAI, or a bunch of virtual goats in AoE II.

The second point, and arguably the more important one, is that “said implementation alters the representation of an LLM, and thus could affect its perceived properties.” Essentially, you could build the same LLM on different substrates, in the same way that you can run the same program on different operating systems.
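A toy illustration of “same behavior, different substrate” might look like the Python sketch below. It is an assumption-laden example rather than anything taken from the paper: the majority-vote function and both implementations are hypothetical stand-ins, with one written as ordinary arithmetic and the other wired entirely from NAND gates.

```python
# Hypothetical sketch: one function, two "substrates", identical input/output.
from itertools import product

def majority_arithmetic(a: int, b: int, c: int) -> int:
    """Majority vote of three bits, computed with ordinary arithmetic."""
    return 1 if a + b + c >= 2 else 0

def nand(a: int, b: int) -> int:
    """The only primitive allowed in the gate-based version."""
    return 1 - (a & b)

def majority_nand(a: int, b: int, c: int) -> int:
    """The same majority vote, wired entirely from NAND gates.

    majority = (a AND b) OR (a AND c) OR (b AND c), which is the same as
    NOT(NAND(a, b) AND NAND(a, c) AND NAND(b, c)).
    """
    ab, ac, bc = nand(a, b), nand(a, c), nand(b, c)
    both = nand(nand(ab, ac), nand(ab, ac))  # ab AND ac
    return nand(both, bc)

# Compare the two implementations on every possible input.
same_behavior = all(
    majority_arithmetic(*bits) == majority_nand(*bits)
    for bits in product((0, 1), repeat=3)
)

if __name__ == "__main__":
    print("identical on all inputs:", same_behavior)
```

From the outside, the two functions are indistinguishable: every input produces the same output. The only thing that differs is what you would see if you looked inside.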

However, in the case of an LLM, and specifically in the case of trying to evaluate that LLM’s anthropomorphic qualities, the nature of the substrate affects how the LLM is perceived. Crucially, this occurs regardless of the nature of the assumptions made about the LLM’s qualities: “assuming the existence or non-existence of generalised anthropomorphic attributes in order to test a hypothesis proving or disproving their existence is flawed.”

Asking if an LLM can be sentient is a baaah-d question

This is a subtle point, so it’s worth exploring in a little more detail. While the goats are a fun demonstration of how LLMs can be built, the real thrust of this paper is about the dangers of making assumptions—positive or negative—in experimental design, especially when it comes to a topic that’s both as slippery and as loaded as LLM sentience.

As Today in Tabs’ Rusty argued in an excellent essay a few months back, it’s almost impossible not to start ascribing human qualities to something that imitates human interaction as flawlessly as an LLM like ChatGPT—for all of human history, language has been the preserve of sentient beings (i.e. us), so when we encounter something that uses language, we tend to assume it’s intelligent and interact with it accordingly.

This assumption also permeates research into LLMs—and, crucially, so does the reaction against it. Starting from the position that an LLM lacks a given anthropomorphic quality is just as prejudicial to research as starting from the position that it possesses that quality—either way, as the paper notes after a long digression into questions of philosophy, “what counts as evidence for a conclusion depends on the assumptions made.”

The problem is, the entire nature of experimentation tends to involve starting with a hypothesis and then trying to either falsify or verify it. And while some questions about LLMs are objective, questions of anthropomorphism are largely subjective. The paper provides the following example: “[Take] an experiment attempting to falsify the effectiveness of an LLM’s ability to provide natural-language explanations on their own states. LLMs produce natural-language explanations, and this is an observable fact. Whether this constitutes understanding of an internal state is an anthropomorphic ascription.”

And here’s the kicker: the nature of that ascription can change dramatically with the substrate on which a given LLM is built. This brings us back to the bit-goats, because in theory, you could implement ChatGPT in AoE II. But would you perceive that implementation’s responses the same way you perceive ChatGPT’s responses when they’re conveyed to you in your browser, or via your talking smart speaker, or through some other interface?

No, says de Wynter. “If one can build an LLM within the game then [that LLM’s] perceived anthropomorphic attributes would be, to put it bluntly, less convincing.” This makes sense, because with the goat-based AoE II ChatGPT, you can see what’s happening: the answer to your question is being provided by a bunch of virtual bit-goats. “Asking an LLM a question and interpreting the natural-language response as [the LLM’s] own opinion is as valid as interpreting AoE II’s response to the same question by observing the goats.”

But the actual LLM itself hasn’t changed at all—all that’s changed is the manner of its implementation. So here’s the point: “This paper’s construction is meant to illustrate the illusion of anthropomorphic attributes in an LLM. If both an LLM and an AoE II-LLM present the same input/output behavior but do not present the same interface-related anthropomorphic attributes (e.g., latency or a textual interface), then we can note that a large part of these attributes are ascribed to them based on observer expectations.”

So next time you ask ChatGPT whether or not you should text your ex or take a particular cocktail of drugs, remember the bit-goats. Your answer is coming from a bunch of virtual Kevins running back and forth in pens.
